Sign up FAST! Login

Tumblr: 20 engineers, 1000 servers, ONE colo.

Stashed in: Tumblr!, DevOps, @marissamayer, Active Users

To save this post, select a stash from drop-down menu or type in a new one:


Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.

  • 500 million page views a day
  • 15B+ page views month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.

All media is stored on S3; 70% of pageviews are the dashboard:

  • Tumblr has a different usage pattern than other social networks.
    • With 50+ million posts a day, an average post goes to many hundreds of people. It’s not just one or two users that have millions of followers. The graph for Tumblr users has hundreds of followers. This is different than any other social network and is what makes Tumblr so challenging to scale.
    • #2 social network in terms of time spent by users. The content is engaging. It’s images and videos. The posts aren’t byte sized. They aren’t all long form, but they have the ability. People write in-depth content that’s worth reading so people stay for hours.
    • Users form a connection with other users so they will go hundreds of pages back into the dashboard to read content. Other social networks are just a stream that you sample.
    • Implication is that given the number of users, the average reach of the users, and the high posting activity of the users, there is a huge amount of updates to handle.
  • Tumblr runs in one colocation site. Designs are keeping geographical distribution in mind for the future.
  • Two components to Tumblr as a platform: public Tumblelogs and Dashboard
    • Public Tumblelog is what the public deals with in terms of a blog. Easy to cache as its not that dynamic.
    • Dashboard is similar to the Twitter timeline. Users follow real-time updates from all the users they follow.
      • Very different scaling characteristics than the blogs. Caching isn’t as useful because every request is different, especially with active followers.
      • Needs to be real-time and consistent. Should not show stale data. And it’s a lot of data to deal with. Posts are only about 50GB a day. Follower list updates are 2.7TB a day. Media is all stored on S3.
    • Most users leverage Tumblr as tool for consuming of content. Of the 500+ million page views a day, 70% of that is for the Dashboard.
    • Dashboard availability has been quite good. Tumblelog hasn’t been as good because they have a legacy infrastructure that has been hard to migrate away from. With a small team they had to pick and choose what they addressed for scaling issues.

Lots of great details in the article; lessons learned:

  • Automation everywhere.
  • MySQL (plus sharding) scales, apps don't.
  • Redis is amazing.
  • Scala apps perform fantastically.
  • Scrap projects when you aren’t sure if they will work.
  • Don’t hire people based on their survival through a useless technological gauntlet.  Hire them because they fit your team and can do the job.
  • Select a stack that will help you hire the people you need.
  • Build around the skills of your team.
  • Read papers and blog posts. Key design ideas like the cell architecture and selective materialization were taken from elsewhere.
  • Ask your peers. They talked to engineers from Facebook, Twitter, LinkedIn about their experiences and learned from them. You may not have access to this level, but reach out to somebody somewhere.
  • Wade, don’t jump into technologies. They took pains to learn HBase and Redis before putting them into production by using them in pilot projects or in roles where the damage would be limited.

There are only 30-50 million monthly dashboard users, not 300 million as originally reported:

Also, 2012 revenues were $5 million. Wow.

You May Also Like: