Monday, November 5, 2007

Fairness in Storage Engine Benchmarking

Is it fair to compare Innodb to MyISAM with innodb_flush_log_at_trx_commit=1 and flush=OFF?

Granted different applications have different requirements. But, when you're trying to show a graph comparing the two Engines, isn't it reasonable to try to keep the same settings as close to equal as possible?

I suppose that begs the question of how similar innodb_flush_log_at_trx_commit is to flush, and if it's even possible to get a valid comparison.

Innodb tends to be slower (I'm on 4.1 currently, ancient history), but it seems to me that it's doing a whole lot more for the money. What do you think?

[UPDATE] Additionally, I'm running sequential benchmarks back to back, for example loading the table, then turning around and running a read test on the table. For innodb, this often means that a good chunk of the buffer pool is dirty and the read test has to slog through while the dirty buffers are being flushed, potentially changing performance.

Should sequential benchmarks have some built in time to allow the database to return to a stable state, or is it better just to plow straight through them all at once?

What is Innodb doing?

I've been running a series of sysbench tests on Innodb with MySQL 4.1 My tests include doing both reads and writes on the table.

I've noticed that after the test is complete, and while there are no processes in show processlist, I see a fairly large amount of buffer pool reads and writes, as well as a large number of Modified db pages. Now, I know that it could take a while for innodb to flush out the dirty buffers after a large amount of writes, but I'm seeing the Modified db pages actually getting larger for a while, and then gradually getting to the point where it finally settles down and returns to 0.

The only thing I can think of is Innodb is doing some sort of table optimization, which begs the question of whether I should rerun my tests with the optimized table or not.

Has anyone else seen this?

UPDATE: I believe I have figured out what's happening. My sysbench test is working on a 100M row table, and the last run I do is 10M inserts with these options:


I found out that this test deletes all of my rows before it runs, so I see around 50k deletes per second for quite a while, and then the test runs. After the test finishes, I suspect Innodb is cleaning up a lot of unused pages in the data file, hence all the read and write activity on the buffer pool.

Friday, October 12, 2007

Creating a Job queue in Innodb

This post is half-this is how you do it, and half-is there a better way to do it?

So you want to build a system that does jobs. You want the jobs to be able to be run in parallel for speed, but also for redundancy. This system needs to be coordinated so, for example, the same jobs aren't being done twice, the status of every job is easy to see, and multiple servers can run jobs by simply querying the central source.

How can we build code around Innodb to have MySQL be our central controlling scheme?

id int(10) not null auto_increment,

updated timestamp not null,
started timestamp not null,

state ENUM('NEW', 'WORKING', 'DONE', 'ERROR' ) default 'NEW',

) ENGINE=Innodb;

In this schema, our job table has a unique id, started and last updated time, and a STATE. In a real system, there would probably be some more meta data here about the nature of the job.

When new jobs need to be done, rows are inserted into the table (with started=NOW(), updated changes automatically)

insert into job_queue set started=NOW();

Now, in perl, we write our job controller. This program could be run on every server you want to run jobs:


# Don't leave zombies around, shamelessly stolen from 'man perlipc'
use POSIX ":sys_wait_h";
sub REAPER {
my $child;
while (($child = waitpid(-1,WNOHANG)) > 0) {
$Kid_Status{$child} = $?;

use DBD::mysql;

my $dbh = DBI->connect(
'test', 'test',
{ RaiseError => 1, AutoCommit => 0 }

while( 1 ) {
my $row;
eval {
my $sth = $dbh->prepare( "select id from job_queue where state='NEW' li
mit 1 for update" );
$row = $sth->fetchrow_hashref();

if( $@ or !$row ) {
# Couldn't lock or lock wait timeout
sleep 10;

# Got one, change state, commit, and fork a worker

$dbh->do( "update job_queue set state='WORKING' where id=" . $row->{id} );

# Fork a worker
if( fork ) {
# Parent, let the child have the old connection and reconnect to
# the db.
$dbh->{InactiveDestroy} = 1;
$dbh = DBI->connect(
'test', 'test',
{ RaiseError => 1, AutoCommit => 0 }
} else{
# Child
print "Locking\n";
$dbh->do( "select * from job_queue where id=" . $row->{id}
. " for update " );

print "Working\n";
#simulated work here
srand( time );
sleep rand 30;

$dbh->do( "update job_queue set STATE='DONE' where
id=" . $row->{id} );
print "Committing\n";


sleep 1;

Now, this isn't bad. But it has at least one thing I don't like about it. If a forked worker dies before the job finishes (and commits), there's no way for the job to be reassigned to another worker.

Ideally each worker would have a lock on his job row inside of a transaction, which it is doing now, but, instead of the job being in the 'WORKING' state first, I'd rather it was 'NEW' before the transaction started. If that were the case, if a worker died, it's unfinished transaction would be rolled back, and the job would be unlocked and NEW again.

However, because of the way SELECT ... FOR UPDATE works, Innodb will deadlock waiting for 'NEW' jobs to unlock from the workers (or wait until they are done). This is sub-optimal, since my parent process could be forking new jobs in the meantime. What would fix this is a SELECT ... FOR UPDATE that skipped locked rows without blocking.

If anyone knows a good way to achieve this, please let me know!

However,  assuming we have no better alternatives, we can create a reaper process like this:


use DBD::mysql;

my $dbh = DBI->connect(
'test', 'test',
{ RaiseError => 1, AutoCommit => 0 }

while( 1 ) {
my $row;
eval {
my $sth = $dbh->prepare( "select id from job_queue where state='WORKING' limit 1 for update" );
$row = $sth->fetchrow_hashref();

if( $@ or !$row ) {
# Couldn't lock or lock wait timeout
sleep 10;

# Got one, change state, commit, and fork a worker

$dbh->do( "update job_queue set state='NEW' where id=" . $row->{id} ); $dbh->commit();

sleep 1;

This will find the first WORKING row and try to lock it. Since normal jobs will be moved to DONE before they are committed, this should only ever find jobs that are in WORKING and unlocked, which means the worker died.

However this isn't perfect. Because SELECT ... FOR UPDATE will try to lock a row already locked, we have to wait until all jobs before our stalled job are complete, getting deadlocks along the way (be sure to set your innodb_lock_wait_timeout fairly low!. Even further, I've seen dead worker processes leave behind idle mysql connections that hold onto their row locks.

Is there any smoother way to do this? I'd love to hear other people's advice.

Thursday, October 4, 2007

How safe is setting read_only?

We use a lot of dual master setups for redundancy and recoverability. Typically, most apps can't write to both masters at the same time, so for consistency we keep only one of the masters read-write at any given instant. To switch over gracefully, turn the primary read-only, wait for replication to catch up, turn the secondary read-write. Simple, eh?

Well, I've just had a report that setting the primary read-only caused a transaction to be partially replicated to a slave. Wait, what? If I'm reading the email correctly, they switched masters in the middle of a transaction. The transaction commit correctly failed on the master with a read-only error, but somehow the other master (or a slave) got part of the transaction that failed and had to be skipped.

This was an innodb table in MySQL 5.0. My first thought was that XA was disabled (though we typically use 4.1 here, so I thought of innodb_safe_binlog). I verified that the XA support appeared to be enabled correctly. My other thoughts are maybe the binlog didn't get synced properly or a thousand other possibilities... I'll update if find out more.

Now, I'm not panicking too much, this could be a non-issue, bad report, what-have-you. However, it got me thinking. How safe *is* it to simply set read-only? Transactionally speaking, is this doing what I want, which is an atomic "deny writes from now on" change? Is this safe with binlogging?

My user claims that doing a flush tables with read lock before setting read-only is safer, but is it? Does setting read-only somehow muck with the binlog to prevent it from being resynced with the inno transactional state somehow?

I figured I'd post it out here and see what people say...

UPDATE: I found out the user isn't using normal transactions, but rather AUTOCOMMIT. I'm following up to try to figure out what the exact nature of the slave error was.

However, another related question is: Is there a race condition between applying a write statement and writing the binary log? Is it possible to do one without the other if there's a sudden server crash? With Innodb, I'd expect the XA transactions to handle this event, but what about plain old MyISAM?

Saturday, May 19, 2007

My first database

I grew up on a farm. Lest anyone automatically load up their preconceived notions of farmers, my Dad was (and is) very interested in computers and purchased the original IBM PC back in the early/mid 80's. He found a lot of practical use in the Lotus 1-2-3 spreadsheet program, and could fly through the keyboard shortcuts (I guess they weren't really shortcuts in those days, as there wasn't a mouse) fast enough to put him on the same level with any Unix geek today with Emacs or Vi.

these are fun to driveIn the 90's, I remember him buying our 486/66 with 16MB of RAM from Micron and describing it as a computer that would "peel out on the lawn." My dad is a tractor-guy and has always been obsessed with horsepower. He still likes to rev up his Steiger tractors and listen to the power. This type of personality translates well into the computer field.

Anyway, I had been interested in computers since he brought the first one home. Sometime around when he got the Micron, he paid a lot of money for this thing called a database. I'm not even sure which one it was, I think it might have been FoxPro 2. He wanted to use it to track information on our crop fields: what seed brand had been planted, when it had been worked, fertilizers used, statistics on the yield, stuff like that. He didn't really know how this new software worked, but he understood that it could do stuff like that better than a spreadsheet could. He was totally right.

Unfortunately for him, he was relying on me to make this work (think lazy teenager, I'm not saying all teenagers are lazy, but I was). He knew I grasped computer concepts much more quickly than he did, but I couldn't make heads or tails of it: at least for the 20 minutes I tried. The FoxPro box sat on the shelf, unused, and it probably is still sitting there.

I titled this post "My First Database", and I guess that story doesn't fit the requirement since I didn't really actually create a database. So here's this story:

A few years later, after my first two years of college, I went on a Co-op at Pratt & Whitney as a programmer. That was where I first learned Perl, among other things, and I can remember building a few web applications that were meant to help access these legally retained documents (think plane crash, faulty engine, lawsuit). Incredulously, I basically had to design and build these applications myself (just finished 2nd year of Computer Science). Perl CGI was pretty cutting edge in 1996, but at the time (and not even until after I graduated) I still didn't really have the foggiest notion of a database. Having no other guidance from the more experienced engineers or my manager, I based what was essentially the database on the unix filesystem and a set of storage conventions I coded into my application.

This was, in essence, my first database, but I had enough sense at this point to realize that:

  1. this didn't feel right
  2. there had to be a better way to manage the data
  3. this thing was going to break as soon as my co-op ended and no one would be able to fix it.
I never heard what happened.

That's right, my first database was flat files. One of my co-workers at Yahoo, Dave, likes to listen to me and Tim, the Oracle Architect, talk about databases and he (Dave) likes to joke that flat files are his database expertise. He would be so proud.

What does this have to do with anything? I think about how both of these experiences would have been different if it had been set today. How awesome is it that a) MySQL is freely available on the net (indeed, along with plenty of other good choices) and b) the tools to learn and understand these systems are freely available to anyone. There are so many more opportunities for guys who are interested enough, like my dad, to take these tools and build something without wasting a lot of money.

I'm not sure my Dad would get MySQL, he's more into trading stocks nowadays than learning new technologies (though he did just buy his first Mac), or if he would have "gotten" it then. He did have me, and realized I was capable of understanding it, I was just too busy playing X-wing.

I realize now that this post is a gushy open-source loving rant (of which I never bother to read myself anymore once I identify it) but I guess that's at least somewhat reflective of how I feel on the subject after careful analysis. I'm no hardcore zealot, but it is cool to look back just a short 10-15 years and see how much has changed.

Wednesday, May 16, 2007

MySQL is to Oracle...

I came up with this analogy today comparing MySQL and Oracle and it seemed to fit. Only real Unix geeks will get this, but here goes:

MySQL is to Oracle as Vi is to Emacs

Maybe this has been said before, but think about it.

Here's what they all have in common:

  • All are cool technologies and worthy of respect in their own right
  • All have zealots vigorously defend their technology to the death
  • Each set (Oracle/Emacs and MySQL/Vi) represents a different way of looking at and solving a problem

For the Oracle/Emacs world things work like this:

The technology is the be-all/end-all, make it do everything we want.

In many ways this is a huge advantage, things are tightly integrated, you don't need to worry so much about system conventions and portability, because new features can easily be added inside the existing framework. This continues until the technology becomes very Operating System-like.

The disadvantage (to my mind) seems to be bloat. What the technology was really made for becomes a little murky with all the extra features and complexity of setup and use.

I can't spend as much time explaining this camp here, because I'm not a part of it. I do know that Oracle installations take up multiple Gigabytes by default and Emacs has its own web browser and email client, both attributes that are common with Operating Systems.

On the MySQL/Vi side, things are more like this:

Keep this technology focused on its purpose, lean and mean

In this camp, there's far less bloat. It's very clear (most of the time) what the technology is supposed to do, and installing/configuring/using the technology is centered around the core purpose of that technology.

What's the benefit? In theory, a well made piece of technology that does its job in an efficient manner.

What's the disadvantage? "Is that all it can do??!" There are less features, harder integrate third-party components, and it's harder to add that stuff in (a pluggable what?!)

Of course, it's not as simple as all that, and there are plenty shades of gray. MySQL has a pluggable storage engine framework (I heard it's drag and drop), and Oracle still has a RDBMS somewhere in there (if I pick on MySQL, I can pick on Oracle).

Lest I'm misunderstood, I'm simply saying that these technologies seem to represent a particular way of looking at developing software over time.

It seems to be generally true that commercial vendors tend to gravitate towards the "be-all/end-all" way of looking at things, but I'm pretty sure no one made millions licensing Emacs.

The pragmatists amongst us (myself included) would probably say:

  1. Use what works for you.

    If you know it, and you have it, use it!

  2. Save the baby, throw out the bathwater

    At the end of the day, keep what you need, ditch the rest.

  3. Can't we all just get along?

    Prejudice works both ways, everyone thinks their technology is the best.

Some after thoughts
(in no particular order, and with no particular point)

  • It sure is hard to switch to an alternative technology after you have one under your belt. I've attempted to grasp Emacs a few times, just can't do it. I'd like to learn Python or Ruby, but it's sure hard when I can just use Perl.
  • I really try to be neutral on these holy wars, but hey, everyone believes in something.
  • *flashes the Vi gang-sign to Paul Tuckfield*

Thursday, May 10, 2007

Where cluster needs to be

I'm really excited about MySQL Cluster. It has a lot of potential to be a good competitor to Oracle RAC, and not just in a copy-cat kind of way.

The thing that makes MySQL different and a good balance, when you're fortunate enough to have the choice, is the "Share nothing" concept. In RAC (and I'm no expert), you use shared storage, usually a Netapp Filer. While filers are seriously cool, they are also seriously expensive, and, no matter how much you gussy it up, it's still a single point of failure.

I was glad I sat in on the Intro to cluster talk at the Users conference, it helped me understand what cluster has to offer, and what it still needs. These needs probably aren't any big secret to the Cluster developers, they openly talked about them and made them very clear. Others are my (admittedly limited) opinions based only on what I heard at the conference and not actually from using cluster mysql (yet).

  1. Dynamically adding and removing Storage nodes

    Doing a lot of queries on a lot of storage nodes is fine, but until I can dynamically add and remove nodes without cluster downtime, I can't easily recover from failures, I can't add capacity, and I can't get my data to re-balance itself over the nodes.

  2. I need to scale bigger than physical memory

    I know 5.1 has disk-based storage for non-indexed data, but I need to scale my indexes beyond RAM too. RAM is too expensive to scale this way forever. For what cluster should be useful for, like a terabyte or two of data, I'd probably have a hard time fitting all my indexes in physical memory.

  3. Replicating a cluster

    Talk about SPoFs all you want, but until you consider your colocation a point of failure, you're still vulnerable. The replication in 5.1 for cluster is good in some ways, but I can't help feeling like it's a little hacky. Passing the replication data back (upstream) into the SQL nodes from the data cluster just doesn't feel right somehow.

    I do like the epoch concept, I hope that makes it upstream into generic replication and isn't just NDB specific forever.

    From what I heard, even with single threaded replication, cluster is so fast that it can scale quite a ways anyway. With clever partitioning, multiple replication threads can be used, but it wouldn't be easy to implement in normal cases.

  4. So how do I backup this thing?

    So I know data nodes can be redundant, and I know I can replicate to another colo. But how can I simply get a cumulative data node cluster's data onto a single backup medium in a consistent state? Maybe I missed this at the conference, but is backing up your binary log enough? It's not how I would back up any other MySQL server.

    Netapps may be expensive, but taking a consistent snapshot of all of your data comes in pretty handy.

All of that aside, I'm still looking forward to sitting down and playing with NDB.

Tuesday, May 8, 2007

Changing masters

MySQL replication is cool. Almost everyone, when asked what they like best about MySQL, will mention replication. There can be no doubt that many web companies, Yahoo included, owes much to MySQL replication. How else can you scale the reading capacity of your database so easily?

It's not all rosy, replication has its problems too. It's very simplistic (I like the term brittle, if any errors happen when executing the SQL, it just stops), it can be inefficient in some places (think UPDATE that does a table scan to change one value, though 5.1 starts to address this with row-based replication), and it is single-threaded.

Single Threadedness

Let me stop here for a second. A lot of people spend good money on a nice Master DB, multi-core CPU's, plenty of memory, and good drives/RAID, etc. When it comes to their slaves, any old piece of junk will do. If you have low write volume on your master, this can work reasonably well. Your data is protected on your master, and if you lose a slave, no big deal. But once your writes go up, your bottleneck will become your slave(s) getting further and further behind during peak; I promise you.

Since the slave is single threaded, it executes the writes that came into the master serially, writes that probably hit the master in parallel. So, you have hundreds or thousands of writes on your master, but your slave can only do half of that and keeps getting further behind because it has no disk throughput, no CPU, and no memory. Whoops!

The lesson?

If you need to scale reads and your write volume is growing fast, get the same hardware (or better) on your slaves as on your master.

Back on topic...

But I digress. That's one problem with replication. The one that I'm more interested in is what to do with your slave when your master goes down. Let us suppose you have a single master and two slaves. If your master crashes (and isn't coming back anytime soon), you can simply start writing to one of your slaves and life goes on. But wait, you were writing to your master, and reading from both of your slaves. You just went from three servers down to one!

Your other slave is sitting there doing nothing, but replication is broken, so how can you get it replicating from your new master? You need to find the precise binlog and position on your new master where your slave left off of from the old master.

You have a few options:

  1. Start binary logging from scratch on your new master before you start writing to it, and start your slave replicating from file 01, position 0.

    Outcome: not too bad, but you better hope that both your new master and your slave were at exactly the same position on the old master for this to work right.

  2. Guess at the replication position.

    Outcome: missing data, or tons of replication errors with "duplicate keys". You probably want to guess conservatively, so you'll be sludging through the writes that already happened and hoping things stay consistent. The SQL_SLAVE_SKIP_COUNTER variable will become your friend, but I only recommend skipping one record at a time.

  3. Stop your new master and get a snapshot.

    Outcome: consistent data, but more downtime for your site. A workaround for this would be the various hotbackup tools that should let you take a consistent backup of your master without any downtime. Just make sure you snag the master's binlog and position of the instant of your backup.

  4. Scan through your new master's binlog and look for the exact position where you left off from on your slave. mysqlbinlog and SHOW BINLOG EVENTS will be your friend. Start your slave from there.

    Outcome: Consistency, no master downtime, no recopying the data to the slave, but a lot of effort, especially if you do this manually.

My idea for this, which is a hack, is to create a "heartbeat" table and consistently update it with an incremented field on your master. This could be done via Events, or cron. It's important that the actual value of the field is in the INSERT or UPDATE statement, so that it is viewable when you scan the binary logs and see the statements.

Once that is running and is trickling down to all of your slaves via replication, make sure all of your slaves are doing 'log-bin' and 'log-slave-updates', so theoretically any of them can be promoted to master at any time. When you are trying to troll through the binary logs on the new master, you have a statement that you know will be in there some place at regular intervals, the value of the increment at each binary log position, and you know what the value of the increment is on your slave.

With some clever script-hackery you can write a quick and dirty binary search (or the search algorithm of your choice) that will quickly find the exact binlog and position on your new master where the increment value that updated on your slave is located. Once you have that, you can slowly roll forward through the binary log until you find the precise spot that matches the binary log on your slave and you are have your position!

This could be done better, without script-hackery and in the MySQL code. I'm not much of a C hacker, but there's talk of something similar coming soon. MySQL cluster implements this same concept in 5.1 called 'epochs' to support failing over to different master or slave SQL nodes, but it's only for clusters at this point.

Next time your master crashes, get that back up first. But, don't forget about your read capacity in your slave pool, and how much your application is suffering without it.

Monday, May 7, 2007

Will this stick?

I've semi-started a lot of blogs, will this one last?

My name is Jay Janssen. Amongst other things I am a MySQL guy (tm) at Yahoo! This is a role that I more fell into, rather than one I specifically applied for. My background is Operations, somewhere between Coding/Developing and a System Admin/Colo Tech. I've done both, to some extent, and what's great about Yahoo is finding a lot of opportunity to set stuff up, see it run and fix it when it breaks.

MySQL, of course, never breaks. Unfortunately this only really true most often for the blogger, the guy on the hosted website or on an old P-III 1 Ghz running Debian Sarge off of his cable modem. These are the guys who don't know what hit them when comes around. In reality, the true test of any software is what happens when you beat the snot out of it. What's a better OS, Linux or BSD? What happens under extreme loads to each? Try it and find out.

The same holds true for MySQL. There's a lot of great (and free, as in beer) features, 5.1 is badass. But when there's a few million in advertising, brand reputation, and a lot of execs breathing down your neck, you need to think hard about stability. Once you slap a thousand page views a second on a website, and all sorts of User-generated, Web 2.0 features, you'll really find out where the rubber meets the road.

If you don't believe me, ask Google. There was a lot of head-scratching going on at the MySQL User's Conference when the Google presenters said they used 4.0.

So that's what I think about all day. Yahoo has a lot of sites and sub-sites, acquisitions and in-house builds, small sites and big sites. I get to see how they all work and try to make sense out of how we can best make use of MySQL at Yahoo.

This blog might end up with lots of posts of general types of MySQL nuggets that I can talk about, or it might not. I'm not that good at keeping blogs up.