Tuesday, December 15, 2009

Database replication lag

As explained in an earlier blog post, we recently started using MySQL master-slave replication on drupal.org in order to provide the scalability necessary to accommodate our growing demands. With one or more replicas of our database, we can instruct Drupal to distribute or load balance the SQL workload among different database servers.

MySQL's master-slave replication is an asynchronous replication model. Typically, all the mutator queries (like INSERT, UPDATE, DELETE) go to a single master, and the master propagates all updates to the slave servers without synchronization or communication. While the asynchronous nature has its advantages, it also means that the slaves might be (slightly) out of sync.
Consider the following pseudo-code:



$nid = node_save($data);   // mutator query: must be executed on the master
$node = node_load($nid);   // read-only query: can go to the master or any slave


Because node_save() executes a mutator query (an INSERT or UPDATE statement), it has to be executed on the master so the master can propagate the changes to the slaves. Because node_load() uses a read-only query, it can go to the master or any of the available slaves. Because of the lack of synchronization between master and slaves, there is one obvious caveat: when we execute node_load(), the slaves might not have been updated yet. In other words, unless we force node_load() to query the master, we risk not being able to present visitors with the data they just saved. In other cases, we risk introducing data inconsistencies due to race conditions.
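
To make this concrete, a simplified sketch of such read/write splitting might look as follows. The example_* helpers, host names and credentials below are made up for illustration; this is not Drupal's actual database layer.

// Route a query to the master if it mutates data, otherwise to a slave.
function example_db_query($sql) {
  $is_mutator = preg_match('/^\s*(INSERT|UPDATE|DELETE|REPLACE)\b/i', $sql);
  return mysql_query($sql, example_db_connect($is_mutator ? 'master' : 'slave'));
}

// Open (and reuse) a connection for the requested role; a slave is picked
// at random from the configured replicas.
function example_db_connect($role) {
  static $connections = array();
  $servers = array(
    'master' => array('db-master.example.com'),
    'slave'  => array('db-slave-1.example.com', 'db-slave-2.example.com'),
  );
  $host = $servers[$role][array_rand($servers[$role])];
  if (!isset($connections[$host])) {
    $connections[$host] = mysql_connect($host, 'drupal', 'secret');
  }
  return $connections[$host];
}

With a scheme like this, the node_save() above ends up on the master while the subsequent node_load() may hit a slave that has not replicated the write yet, which is exactly the race described.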

So what is the best way to fix this?
  1. Our current solution on drupal.org is to execute all queries on the master, except for those that we know can't introduce race conditions. In our running example, this means that we'd choose to execute all node_load()s on the master, even in the absence of a node_save(). This limits our scalability, so it is nothing but a temporary measure until we have a better solution in place.
  2. One way to fix this is to switch to a synchronous replication model. In such a model, all database changes will be synchronized across all servers to ensure that all replicas are in a consistent state. MySQL provides a synchronous replication model through the NDB cluster storage engine. Stability issues aside, MySQL's cluster technology works best when you avoid JOINs and sub-queries. Because Drupal is highly relational, we might have to rewrite large parts of our code base to get the most out of it.
  3. Replication and load balancing can be left to some of the available proxy layers, most notably Continuent's Sequoia and MySQL Proxy. Drupal connects to the proxy as if it were the actual database, and the proxy talks to the underlying databases. The proxy parses all the queries and propagates mutator queries to all the underlying databases to make sure they remain in a consistent state. Reads are only distributed among the servers that are up-to-date. This solution is transparent to Drupal, and should work with older versions of Drupal. The only downside is that it is not trivial to set up and, based on my testing, it requires quite a bit of memory.
  4. We could use database partitioning, and assign data to different shards in one way or another. This would reduce the replication lag to zero, but as we don't have that much data or database tables with millions of rows, I don't think partitioning will buy drupal.org much.
  5. Another solution is to rewrite large parts of Drupal so it is "replication lag"-aware. In its most naive form, the node_load() function in our running example would get a second parameter that specifies whether the query should be executed on the master or not. The call would then be changed to node_load($nid, TRUE) when preceded by a node_save().

    From some research, I already concluded that this is not commonly done, probably because such a solution still doesn't provide any guarantees across page requests.
  6. A notable exception is MediaWiki, the software behind Wikipedia, which has documented best practices for dealing with replication lag. Specifically, they recommend querying the master (using a very fast query) to see what version of the data has to be retrieved from the slave. If the specified version is not yet available on the slave due to replication lag, they simply wait for it to become available. In our running example, each node would get a version number, and node_load() would first retrieve the latest version number from the master and then use that version number to make sure it gets an up-to-date copy from the slave. If the right version isn't available yet, node_load() keeps trying until it becomes available. A rough sketch of what this could look like follows below.
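
To give an idea of what option 6 could look like in code, here is a rough, hypothetical sketch. The example_* wrappers, the retry limit and the use of the node's vid as version number are assumptions made for illustration; MediaWiki's actual implementation differs.

// Ask the master which revision (vid) the slave should have, then poll the
// slave until that revision has replicated.
function example_node_load_current($nid) {
  $expected_vid = example_db_result(example_db_query_master(
    "SELECT vid FROM {node} WHERE nid = %d", $nid));

  for ($attempt = 0; $attempt < 10; $attempt++) {
    $node = example_db_fetch_object(example_db_query_slave(
      "SELECT * FROM {node} WHERE nid = %d AND vid >= %d", $nid, $expected_vid));
    if ($node) {
      return $node;
    }
    usleep(100000); // give replication 100ms to catch up before retrying
  }

  // Still lagging after the retry budget is spent: fall back to the master.
  return example_db_fetch_object(example_db_query_master(
    "SELECT * FROM {node} WHERE nid = %d", $nid));
}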


Anonymous:

Another emerging SQL proxy which can be compared to Continuent's Sequoia is MySQL's own MySQL Proxy, which is slowly maturing. It offers a Lua interpreter which allows scripting the query flow, and will eventually offer all the features required to solve the challenge outlined above in a much more lightweight fashion than the Continuent solution.
August 24, 2007 - 12:52

Jakub Suchy:

Isn't Sequoia only for Java applications?
August 27, 2007 - 15:13

Philippe Jadin:

As for solution 6, if Drupal enforces new revisions on each node save, it should be possible to compare the version numbers. This would apply to nodes only though (what about taxonomy terms and other stuff?)
August 24, 2007 - 13:09

Lukas:

Behold! The silver bullet :)

Another option would be for your slaves to mostly read their data out of memcache. Then you could modify the master and memcache at the same time. Things would get replicated on the DB level, so that any of the slaves could rebuild the memcache entry if it were to disappear. At the same time, memcache is clustered, and therefore there would be no replication lag.
August 24, 2007 - 13:35
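
A minimal sketch of the write-through idea Lukas describes could look like this; the key scheme, the example_* helpers and the server name are made up, and the PECL Memcache extension is assumed.

// Keep memcache in sync with the master at write time, so reads rarely need
// to touch a (possibly lagging) slave at all.
$memcache = new Memcache();
$memcache->addServer('cache1.example.com', 11211);

function example_node_save_cached($memcache, $node) {
  example_node_save($node);                     // the write itself goes to the master
  $memcache->set('node:' . $node->nid, $node);  // update the cache in the same request
}

function example_node_load_cached($memcache, $nid) {
  $node = $memcache->get('node:' . $nid);
  if ($node === FALSE) {
    // Cache miss: rebuild the entry; reading from the master avoids a stale copy.
    $node = example_node_load_from_master($nid);
    $memcache->set('node:' . $nid, $node);
  }
  return $node;
}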

Anonymous:

A master-master setup with sticky sessions? Should work and be fairly simple to set up.
August 24, 2007 - 14:58

Luc:

We did some tests with Drupal 5 & master-master replication and it's not that simple. The master-master setup itself is no problem, but master-master breaks the sequences concept (db_next_id()) in Drupal and it also causes problems in the caching tables. We quickly gave up on it. We still use master-master for high availability, but point all our frontend servers to the same DB server. You still need to be able to handle all the load with one server that way, but it makes failover easier.
Master-master should become easier with Drupal 6, I'll try to do some tests with it once I have time.
August 24, 2007 - 22:07

webchick:

If we as a community/project were willing to shift our mental map and work on the various UI issues, revisioning everything (option 6) sounds like a win-win approach.

The Requiring node revisions thread from the devel list has some background discussion on this option. There are a lot of nay-sayers in that thread, and I counted myself among them until I read an excerpt from one of Derek's posts here:

> How often do we do a rollback on drupal.org? How often have you
> done a rollback of one of your blog posts? How often have you
> rolled back a change on your company website? If the answer is
> "never" or "once a year" this should be reflected in the UIs.

IMHO, those are the wrong questions. These are more appropriate:
- How often have you wondered who and when was the last person to modify something?
- How often do you wish you could have seen exactly what they changed?
- How often do you wonder how frequently something is being changed?
...

For myself, and probably a large majority of drupal.org users, the answer to these is "all the time".
August 24, 2007 - 17:06

FnH:

My knee-jerk reaction would be to keep the layers separate and ignorant of each other (that was my original reaction to the earlier post as well).

I'm guessing Drupal has an order of magnitude more reads than mutations. Therefore, my first try would be to see what a fully synchronous master-master setup would buy me, before introducing a lot of logic in the application layer to solve the database layer's scalability problem.

Both MySQL and PostgreSQL seem capable of providing master-master setups. Some numbers would be interesting too.
August 24, 2007 - 19:17

Dormando:

Memcached? When storing data into the master, update the cache. Read first from the cache. Hard to get sync issues. If it gets slow, add more memcached servers. Everyone's doing it. :)

Otherwise, if data doesn't exist on the slave (make a new id for comment edits? Dunno...), just query the master... If your boxes are relatively fast, you won't often end up in a situation where someone can post and then reload the page before replication gets to it.
August 25, 2007 - 05:46

Robert Douglass:

I like the idea of versioned data but I don't like the idea of blocking on synchronization. I'd feel better if the strategy were "Is the latest version of the data on the slave? If not, load from the master".

Versioned data also makes caching very efficient, especially memcache (as mentioned above).

Oh, and everything webchick said =) Not only would I be in favor of requiring revisions for node data, I would support making revisions tables for all of our first class objects.
August 25, 2007 - 08:14

Matt:

You might consider it a hack, but it has worked very well for us: if a write query is done, future reads for that DB are sent to the master. It's only for that page load.
August 26, 2007 - 03:00
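
What Matt describes can be implemented with little more than a per-request flag; a rough sketch, reusing the hypothetical example_db_connect() helper sketched earlier in the post:

// Once any write has happened during this page request, send all further
// queries (reads included) to the master for the rest of the request.
function example_db_query_sticky($sql) {
  static $seen_write = FALSE;
  if (preg_match('/^\s*(INSERT|UPDATE|DELETE|REPLACE)\b/i', $sql)) {
    $seen_write = TRUE;
  }
  return mysql_query($sql, example_db_connect($seen_write ? 'master' : 'slave'));
}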

Jeremy Cole:

Hi Dries,

The one true solution here, as suggested already, is to use memcache for scaling reads rather than replication. Invalidate the cache entry on node_save() and populate it on node_load() if not present. Given the already present caching infrastructure within Drupal, I would think it's quite easy to do.

Use dual master replication (master-master) for failover, writing to only one master at a time (thus removing any concerns about sequences, etc.). Typically this is done using IP takeover.

Regards,

Jeremy
August 29, 2007 - 18:17
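
In contrast with the write-through variant sketched earlier, Jeremy's suggestion invalidates the cache entry on write and repopulates it on read. A hypothetical sketch (all function names are made up):

// Invalidate on save; populate on load if the entry is missing.
function example_invalidate_on_save($memcache, $node) {
  example_db_write_node($node);             // the write goes to the single active master
  $memcache->delete('node:' . $node->nid);  // drop the cached copy
}

function example_load_with_cache($memcache, $nid) {
  $node = $memcache->get('node:' . $nid);
  if ($node === FALSE) {
    $node = example_db_read_node($nid);     // cache miss: hit the database
    $memcache->set('node:' . $nid, $node);  // repopulate for the next request
  }
  return $node;
}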

Jacques Mattheij:

Hi there, I'm pretty green when it comes to Drupal, so forgive me any ignorance.

I think it all depends on who is doing the 'looking'. If the majority of your users are not logged in, they could go to any one of the slaves and it would never matter what they see; if it's outdated, it won't be by much (and most of the time that content could come straight from the cache anyway, since anonymous users are all the same user and will see exactly the same content). A user who refreshes a page and is not himself/herself the cause of an update has no way of knowing whether the page really is the latest version or 1 or 2 seconds out of date; as long as the page is consistent, they'll be happy. (When stuff starts breaking, that's a different matter of course.)

As soon as someone logs in the situation changes dramatically, because these are the users that have update rights (unless of course you allow your anonymous users to update content, in which case a warning message that their update will show up shortly should be provided, which is fairly common on high-volume sites).

So, as long as the anonymous / logged-in user division is reasonable, you'll get a big payoff without any penalties. You'd have to put a figure on that ratio to be able to decide if the scenario is feasible.

Another possibility is to 'bind' the user to the machine that they land on: use a load distribution machine that passes the complete HTTP request through to the 'bound' node, and update both the bound node (which gives the user an immediate response and always shows consistent output) and the master; the master then takes care of replicating to the other nodes. Sequence numbers would still have to be assigned centrally.

Another issue is sequence numbers.

I think I can see why Drupal uses sequence numbers in a separate table (portability?), but that makes atomic inserts a lot harder and creates a headache during synchronization; possibly it would be better to document which DBMSs/table types support auto_increment and to handle the ones that don't in single-node configurations only.

Best regards,

Jacques Mattheij
December 13, 2007 - 21:12
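
The anonymous versus logged-in split Jacques describes could be expressed as a simple routing rule; a hypothetical sketch using Drupal's global $user object:

// Anonymous visitors (uid 0) can tolerate a second or two of replication lag,
// and are usually served from the page cache anyway; logged-in users, who do
// the updating, stay on the master.
function example_db_target_for_request() {
  global $user;
  return ($user->uid == 0) ? 'slave' : 'master';
}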
