Do you know which databases Twitter stores its data in?

I searched around a bit and found a Quora answer along the same lines. The idea is to use many different kinds of databases, each for a different purpose.

Twitter's GitHub page is quite informative on this topic, if you don't mind digging around a bit:

  • MySQL: Twitter uses MySQL heavily for primary storage of Tweets and Users, and maintains a custom fork that they recently open-sourced: https://github.com/twitter/mysql. More information on the engineering blog: http://engineering.twitter.com/2...
  • FlockDB: This is Twitter's in-house graph database, which they use to store social graph information (following, etc). Ultimately it's built on MySQL, but it's still basically a proper database in its own right: https://github.com/twitter/flockdb
  • Memcached: Twitter uses a "heavily modified" fork of Memcached 1.4.4 which they call Twemcache. It's been in production for well over a year already (as of 7/2012): https://github.com/twitter/twemc... There's lots more info on the engineering blog: http://engineering.twitter.com/2...
  • Cassandra: Twitter has spent a lot of time experimenting with Cassandra. Ultimately this led to services like Snowflake for quickly generating unique identifiers: https://github.com/twitter/snowf... There's also a Ruby gem for Cassandra here: https://github.com/twitter/cassa...
  • Gizzard: This is an in-house Scala framework for building custom distributed databases (with arbitrary storage technology) that underlies a number of the other systems discussed here. GitHub page here: https://github.com/twitter/gizzard
  • Apache Lucene: This isn't on the GitHub page, but has been talked about publicly on the engineering blog. The search index is now powered by Lucene, through a system they call Earlybird. See http://engineering.twitter.com/2... and http://engineering.twitter.com/2... for more detail.
  • HBase and Hadoop: Twitter uses Hadoop and HBase heavily, although this also isn't clear from the GitHub page. Check out Kevin Weil's "Elephant Bird" project on GitHub (https://github.com/kevinweil/ele...), with more information on the blog: http://engineering.twitter.com/2...
  • Redis: Finally, there's some experimental timeline storage technology that was developed on Redis. It's unclear whether Redis is being used in production or not at this time, but see the Haplocheirus project: https://github.com/twitter/haplo...

There's plenty more, and surely lots of technology that has not been publicly disclosed, but this covers most of the publicly available information on the topic.
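
Since the list mentions Snowflake for quickly generating unique identifiers, here is a rough sketch of how that style of ID can be built. It is not Twitter's code. It assumes the commonly described 64-bit layout (41 bits of milliseconds since a custom epoch, 10 bits of worker id, 12 bits of per-millisecond sequence). The class name `SnowflakeLikeGenerator`, the helper `split_id`, the injectable `clock` parameter and the epoch constant are all made up for illustration.

```python
import threading
import time

# Assumed bit layout: timestamp | worker id | sequence.
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1
# Assumption: any fixed instant in the past works as the custom epoch.
CUSTOM_EPOCH_MS = 1288834974657


class SnowflakeLikeGenerator:
    """Hypothetical generator of roughly time-ordered 64-bit ids."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock          # returns current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Refuse to issue ids that could collide with earlier ones.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence counter.
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


def split_id(value):
    """Return (ms since custom epoch, worker id, sequence) for an id."""
    sequence = value & SEQUENCE_MASK
    worker_id = (value >> SEQUENCE_BITS) & MAX_WORKER_ID
    timestamp = value >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp, worker_id, sequence
```

Calling `SnowflakeLikeGenerator(worker_id=5).next_id()` repeatedly yields increasing integers without any coordination between workers, as long as each worker has a distinct id. That is the property that matters when a store like Cassandra has no built-in auto-increment column.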
