Tuesday, May 8, 2007

Changing masters

MySQL replication is cool. Almost everyone, when asked what they like best about MySQL, will mention replication. There can be no doubt that many web companies, Yahoo included, owe much to MySQL replication. How else can you scale the reading capacity of your database so easily?

It's not all rosy, though; replication has its problems too. It's very simplistic (I like the term brittle: if any error happens when executing the SQL, it just stops), it can be inefficient in some places (think of an UPDATE that does a table scan to change one value, though 5.1 starts to address this with row-based replication), and it is single-threaded.

Single Threadedness

Let me stop here for a second. A lot of people spend good money on a nice master DB: multi-core CPUs, plenty of memory, good drives/RAID, etc. When it comes to their slaves, any old piece of junk will do. If you have low write volume on your master, this can work reasonably well. Your data is protected on your master, and if you lose a slave, no big deal. But once your writes go up, your bottleneck will become your slave(s) getting further and further behind during peak; I promise you.

Since the slave is single-threaded, it executes the writes that came into the master serially, writes that probably hit the master in parallel. So, you have hundreds or thousands of writes on your master, but your slave can only do half of that and keeps falling further behind because it lacks the disk throughput, CPU, and memory to keep up. Whoops!

The lesson?

If you need to scale reads and your write volume is growing fast, get the same hardware (or better) on your slaves as on your master.

Back on topic...

But I digress. That's one problem with replication. The one that I'm more interested in is what to do with your slave when your master goes down. Let us suppose you have a single master and two slaves. If your master crashes (and isn't coming back anytime soon), you can simply start writing to one of your slaves and life goes on. But wait, you were writing to your master, and reading from both of your slaves. You just went from three servers down to one!

Your other slave is sitting there doing nothing, but replication is broken, so how can you get it replicating from your new master? You need to find the precise binlog and position on your new master where your slave left off from the old master.

You have a few options:

  1. Start binary logging from scratch on your new master before you start writing to it, and start your slave replicating from file 01, position 0.

    Outcome: not too bad, but you better hope that both your new master and your slave were at exactly the same position on the old master for this to work right.

  2. Guess at the replication position.

    Outcome: missing data, or tons of replication errors with "duplicate keys". You probably want to guess conservatively, so you'll be slogging through the writes that already happened and hoping things stay consistent. The SQL_SLAVE_SKIP_COUNTER variable will become your friend, but I only recommend skipping one record at a time.

  3. Stop your new master and get a snapshot.

    Outcome: consistent data, but more downtime for your site. A workaround for this would be the various hot-backup tools that should let you take a consistent backup of your master without any downtime. Just make sure you snag the master's binlog and position at the instant of your backup.

  4. Scan through your new master's binlog and look for the exact position where your slave left off. mysqlbinlog and SHOW BINLOG EVENTS will be your friends. Start your slave from there.

    Outcome: Consistency, no master downtime, no recopying the data to the slave, but a lot of effort, especially if you do this manually.
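For reference, the moving parts behind options 2 and 4 look roughly like this. The host name, binlog file names, and positions below are made up for illustration:

```sql
-- On the new master: inspect events around a candidate position (option 4).
SHOW MASTER STATUS;
SHOW BINLOG EVENTS IN 'mysql-bin.000003' FROM 98 LIMIT 10;

-- On the remaining slave: point it at the new master once you've
-- found (or guessed) the spot.
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'new-master.example.com',
    MASTER_LOG_FILE = 'mysql-bin.000003',
    MASTER_LOG_POS = 98;
START SLAVE;

-- If you guessed low (option 2) and hit a duplicate-key error,
-- skip one event at a time and re-check before skipping more.
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
```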

My idea for this, which is a hack, is to create a "heartbeat" table on your master and regularly update an incrementing counter in it. This could be done via Events, or cron. It's important that the actual value of the field is in the INSERT or UPDATE statement, so that it is viewable when you scan the binary logs and see the statements.
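A heartbeat setup along those lines might look like this (the table and column names are made up):

```sql
-- Hypothetical heartbeat table; one row is all you need.
CREATE TABLE heartbeat (
    id      INT NOT NULL PRIMARY KEY,
    counter BIGINT NOT NULL
);
INSERT INTO heartbeat (id, counter) VALUES (1, 0);

-- Run something like this from cron (or a 5.1 Event) every few seconds,
-- with your script computing the next literal value. Note the literal:
-- "SET counter = counter + 1" would replicate fine, but then the binlog
-- would not show you the resulting value when you scan it.
UPDATE heartbeat SET counter = 42 WHERE id = 1;
```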

Once that is running and trickling down to all of your slaves via replication, make sure all of your slaves have 'log-bin' and 'log-slave-updates' enabled, so theoretically any of them can be promoted to master at any time. When you are trawling through the binary logs on the new master, you have a statement that you know will appear in there at regular intervals, you can see the counter's value at each binary log position, and you know what the counter's value is on your slave.
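The slave-side settings amount to a couple of lines in my.cnf (the server-id value here is just an example; it must be unique across your replication setup):

```
[mysqld]
server-id         = 3
log-bin           = mysql-bin
log-slave-updates
```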

With some clever script-hackery you can write a quick and dirty binary search (or the search algorithm of your choice) that will quickly find the exact binlog and position on your new master where the counter value last applied on your slave is located. Once you have that, you can slowly roll forward through the binary log until you find the precise spot that matches the binary log on your slave, and you have your position!
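The core of that script-hackery might look like the sketch below. It assumes you have already extracted the heartbeat events (binlog file, position, counter value) from the new master's binlogs, e.g. by grepping mysqlbinlog output; every name here is hypothetical, not a real tool.

```python
# Binary search over heartbeat events pulled from the new master's binlogs,
# looking for the last heartbeat the slave applied. Replication should
# restart just after that event.

def find_restart_position(heartbeat_events, slave_value):
    """heartbeat_events: list of (binlog_file, log_pos, counter) tuples,
    sorted in binlog order with strictly increasing counters.
    Returns the matching tuple, or None if the slave's counter value
    predates every heartbeat we have on the new master."""
    lo, hi = 0, len(heartbeat_events) - 1
    match = None
    while lo <= hi:
        mid = (lo + hi) // 2
        counter = heartbeat_events[mid][2]
        if counter <= slave_value:
            if counter == slave_value:
                match = heartbeat_events[mid]
            lo = mid + 1    # the slave is at or past this event
        else:
            hi = mid - 1    # this event never reached the slave
    return match

# Toy data standing in for parsed mysqlbinlog output:
events = [("bin.000003", 120, 41),
          ("bin.000003", 560, 42),
          ("bin.000004", 4, 43)]
print(find_restart_position(events, 42))  # ('bin.000003', 560, 42)
```

From the returned position you would then roll forward event by event, comparing against the slave's relay log, to pin down the exact restart point.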

This could be done better, without script-hackery, in the MySQL code itself. I'm not much of a C hacker, but there's talk of something similar coming soon. MySQL Cluster implements this same concept in 5.1, called 'epochs', to support failing over to different master or slave SQL nodes, but it's only for clusters at this point.

Next time your master crashes, get that back up first. But, don't forget about your read capacity in your slave pool, and how much your application is suffering without it.


Danil said...

Thanks! Great article!

vivpuri said...

MySQL gives another approach for MySQL failover: http://dev.mysql.com/doc/refman/5.1/en/replication-solutions-switch.html