MySQL is Facebook-scale so why use NoSQL?
Adam Rifkin stashed this in DevOps
This article on Gigaom really stuck with me: http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/
It's reassuring to know that MySQL could scale to Facebook's needs. I'm wondering why a team would consider NoSQL for scaling.
Even if they have "Mountains of Metadata"...
At least part of the issue is that big databases are hard without really good DBAs and really good DBAs are very hard to hire. When you run into those issues, blaming the technology is easier than blaming the people wielding it.
If you switch to a NoSQL solution, you may or may not have gained anything, but you'll have gone in knowing that there are only 10 experts on your chosen NoSQL solution worldwide, so they are impossible to hire. NoSQL solutions bring their own problems, but by eliminating some of the expensive benefits of RDBMSs they stop you from making some scaling blunders.
Another perspective on this topic from Adam D'Angelo, who as ex-CTO of Facebook was probably instrumental in the decision to go with and/or continue betting on MySQL at Facebook during the early / hyper growth years -- http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use-MySQL-as-the-data-store-instead-of-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc/answer/Adam-DAngelo
Adam D'Angelo gives lots of specific reasons why start-ups should choose MySQL over NoSQL. His reason #3 resonated with me more than anything else: your start-up has many risks, so why take on the additional technology risk of using NoSQL if you don't have to?
There is a bit of a false dichotomy.
If you look at how Facebook (or really anyone else, for that matter) has scaled out MySQL, it looks more and more like a NoSQL solution (at least like the distributed key/value stores, not necessarily the graph DBs). What I mean is that as soon as you start scaling out with read slaves, denormalizing and/or sharding, you start to lose some of the 'relational' qualities of the RDBMS, and you put more and more logic in the application to manage the interaction with the data store.
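To make "more logic in the application" concrete, here is a minimal sketch of shard routing done in application code. The shard names and the modulo-on-a-hash scheme are my own assumptions for illustration, not how Facebook or anyone else actually does it.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to MySQL hosts.
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for(user_id, shards=SHARDS):
    """Pick a shard from a stable hash of the key.

    hashlib is used instead of Python's built-in hash() so the mapping is
    the same across processes and restarts.
    """
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def group_by_shard(user_ids):
    """Split one logical lookup into per-shard lookups.

    What would be a single SELECT ... WHERE id IN (...) on one database
    becomes several queries that the application has to issue and merge.
    """
    groups = {}
    for uid in user_ids:
        groups.setdefault(shard_for(uid), []).append(uid)
    return groups
```

Note that anything spanning shards (joins, transactions, uniqueness checks) now has to be handled by code like group_by_shard rather than by the database, which is the loss of 'relational' qualities described above.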
This article from last year about MySQL at Twitter makes the same arguments in more depth.
What you don't get with MySQL is some of the flexibility and operational advantages of some of the new databases.
I think that's fair.
My takeaway is that we can use NoSQL techniques in our use of MySQL to solve some scaling problems.
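One such technique is treating a SQL table as a plain key/value store with an opaque serialized value. A rough sketch of the idea follows. It uses Python's built-in sqlite3 purely as a stand-in so it runs anywhere; the table name kv and the helpers put/get are made up for the example, and on MySQL the upsert would be written differently (for example with REPLACE INTO).

```python
import json
import sqlite3

# sqlite3 is only a stand-in here; the schema idea (primary key + blob) is
# what carries over to a MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key, value):
    """Store any JSON-serializable value under a key (SQLite upsert syntax)."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
        (key, json.dumps(value)),
    )
    conn.commit()


def get(key):
    """Fetch by primary key only; no joins, no secondary lookups."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


put("user:1", {"name": "alice", "friends": [2, 3]})
```

The trade-off is the usual one: the database can no longer index or constrain what is inside the value, so that work moves into the application.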
Thanks for the gigaom link. Great read!
When you shard, you reduce the size of the datasets that you maintain consistency over, in exchange for the complexity of pulling the partition-selection logic up into the application. When you have read slaves, you are giving up some consistency for read throughput/availability.
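The read-slave trade-off can be shown with a toy model. This is not real MySQL replication, just an in-memory sketch with hypothetical Primary and Replica classes, where the replica applies the primary's write log only when asked to catch up.

```python
class Primary:
    """Accepts writes and records them in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads from its own copy, which may lag behind the primary."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries have been applied so far

    def catch_up(self, primary):
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

    def read(self, key):
        return self.data.get(key)


primary, replica = Primary(), Replica()
primary.write("status", "hello")
stale = replica.read("status")   # None: the write has not been replicated yet
replica.catch_up(primary)
fresh = replica.read("status")   # "hello" once the replica has caught up
```

The window between the write and catch_up is the consistency you give up; an application that must read its own writes has to route those reads to the primary.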
If you haven't read them already, Amazon's Dynamo paper and the Gilbert & Lynch paper on the CAP theorem are worth going through once.
At least for the Dynamo-inspired data stores, the advantages are the scaling characteristics, fault tolerance and lower operational overhead.
If Facebook were starting over in 2012, I don't think they would end up with the same architecture.
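Part of where those scaling characteristics come from is consistent hashing, one of the partitioning ideas described in the Dynamo paper. Below is a bare-bones sketch of a hash ring with virtual nodes. The Ring class, the node names and the choice of MD5 are assumptions for illustration, and it leaves out replication and failure handling entirely.

```python
import bisect
import hashlib


def _h(value):
    """Map a string to a point on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_h(f"{node}#{i}"), node))

    def node_for(self, key):
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect(self._points, (_h(key), ""))
        return self._points[idx % len(self._points)][1]


ring = Ring(["a", "b", "c"])
keys = [f"k{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is visible in moved: adding node "d" only relocates keys onto "d", whereas the simple hash-modulo-N approach reshuffles most keys whenever N changes.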
Thanks for the references, I'll have to hunt them down.
Agreed that Facebook would start with a different architecture; Adam D'Angelo implies that in his writeup of why Quora chose Python for its development.
I see these discussions pop up now and again, and I'm often startled that no one points out how you can use MySQL as a NoSQL datastore. @Andrew good to see someone thinking in that direction. But I'll add a bit more to the discussion with a quote from one of the largest social game platform providers in Japan:
We do not use NoSQL, either. Why? Because we could get much better performance from MySQL than from other NoSQL products. In our benchmarks, we could get 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
They decided to build a MySQL plug-in that bypasses things like parsing SQL statements, opening & locking tables, making SQL execution plans, unlocking & closing tables, etc...
..the overhead imposed by SQL parsing, locks and concurrency controls has nothing to do with the reading or writing of data from the underlying storage engine. In other words, if all you need is direct access to the index, then you can bypass the SQL layer altogether - that is exactly what HandlerSocket provides.
Here are some awesome Ruby examples...
What I really love about this is you can dynamically switch between a relational query, and a no-sql query at run-time depending upon what may work the best for any given use case. If you're a fan of Percona, and who isn't, they've installed the plug-in by default.
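For a feel of what "direct access to the index" looks like on the wire, here is a small sketch that only builds HandlerSocket request lines and does not talk to a server. The format (tab-separated tokens, newline-terminated, an open_index line starting with P, then a find line) is as I recall it from the plugin's protocol documentation, so treat it as an assumption and check it against the docs. The function names and the test/user table are made up, and the escaping rules for special characters are skipped.

```python
def open_index_request(index_id, db, table, index, columns):
    """Build an open_index line: P <indexid> <db> <table> <index> <columns>."""
    fields = ["P", str(index_id), db, table, index, ",".join(columns)]
    return "\t".join(fields) + "\n"


def find_request(index_id, op, values, limit=1, offset=0):
    """Build a find line: <indexid> <op> <vlen> <v1..vn> <limit> <offset>."""
    fields = [str(index_id), op, str(len(values)), *values, str(limit), str(offset)]
    return "\t".join(fields) + "\n"


# Open the PRIMARY index of test.user, then look up the row with key "42".
open_line = open_index_request(0, "test", "user", "PRIMARY", ["id", "name"])
find_line = find_request(0, "=", ["42"])
```

There is no SQL text to parse and no plan to build; the request names an index and a key, which is why the same table can be read this way for hot lookups and through regular SQL for everything else.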
What I like about the 'NoSQL movement' is the revisiting of assumptions about what it means to store and retrieve data resulting in more flexible and situationally appropriate options.
The key-value and document dbs have gotten more of the attention, but the graph dbs are super compelling for certain applications, particularly modeling connectivity and relationships.
I'm partial to Neo4j, but that's also probably because I met Emil in 2008.
Have you ever used Neo4j in a production environment? Curious how it handled under load (never played with it)...
Just to tinker. Graphs are a great model for a lot of data, but I haven't pounded on a graph db or had a real project using one. It's just a point of interest I'm watching.
I try to keep up with whatever innovation is happening, but I only want to make bets on things that approach the front ramp of the chasm with velocity, if you are familiar with that metaphor. Tech cycles expand and contract. There was a proliferation of NoSQL technical options, most of which have merits, but there will definitely be a contraction.
One issue with Neo4j, which isn't about the technology but definitely influences decision making: they made the advanced features AGPL, which seems like the worst license ever.
"really good DBAs are very hard to hire" this is even more true with NoSQL infrastructure
The more I build, the more I concur.
See also: incendiary article on why Postgres smokes MySQL and NoSQL but fails to address where we're going to find really good DBAs.
If postgres could support pluggable backend engines like mysql, there would be no mysql anymore.
I have spent many, many years learning many, many tricks and tweaks to get mysql to behave in the manner in which I want it to behave. You have to tweak and retweak settings at different points (many of them undocumented) in your scaling equation just to keep it from falling over, and that's presuming you've architected your data structure in a mysql-friendly way.
Forget about foreign keys, you'll kill your performance.
Forget about sprocs, they're just collections of sql statements and you can't do anything even remotely complex.
Never EVER permit a trigger action to fire another trigger, or risk data corruption.
Forget about getting any meaningful performance from views.
And depending upon your replication strategy, be very, VERY careful with any query that isn't fully deterministic.
Postgres, on the other hand, just works. And being a descendant of Ingres (as, I think, is SQL Server), it has decades of stable history behind it. As of the 8.0 series, it tends to be faster than MySQL on equivalent hardware, and it now natively supports replication.
What it doesn't have is the ability to plug in different storage engines, which would rock. However, the 9.0 series allows you to plug in remote data stores, similar to MySQL's CLUSTER engine. It also has the hstore data type, which negates the usefulness of MySQL's HandlerSocket (which is fast, but also a very, very nice way to corrupt your entire tablespace if any other process throws an fsync).
My understanding is FB leverages MySQL for user actions but relies on Cassandra (which was co-developed by Facebook) for the streams/feeds.
Additionally, most NoSQL solutions have a very concrete set of limitations that you must work within. If you do, you will receive consistent performance at 1 byte and 1 terabyte of persistence.
The scaling characteristics of an RDBMS are less concrete and may be unique to the scaling scenario.