Jay Janssen -- MySQL Guy

Monday, November 5, 2007

Fairness in Storage Engine Benchmarking

Is it fair to compare Innodb to MyISAM with innodb_flush_log_at_trx_commit=1 and flush=OFF?

Granted different applications have different requirements. But, when you're trying to show a graph comparing the two Engines, isn't it reasonable to try to keep the same settings as close to equal as possible?

I suppose that begs the question of how similar innodb_flush_log_at_trx_commit is to flush, and if it's even possible to get a valid comparison.

Innodb tends to be slower (I'm on 4.1 currently, ancient history), but it seems to me that it's doing a whole lot more for the money. What do you think?

[UPDATE] Additionally, I'm running sequential benchmarks back to back, for example loading the table, then turning around and running a read test on the table. For innodb, this often means that a good chunk of the buffer pool is dirty and the read test has to slog through while the dirty buffers are being flushed, potentially changing performance.

Should sequential benchmarks have some built in time to allow the database to return to a stable state, or is it better just to plow straight through them all at once?

What is Innodb doing?

I've been running a series of sysbench tests on Innodb with MySQL 4.1 My tests include doing both reads and writes on the table.

I've noticed that after the test is complete, and while there are no processes in show processlist, I see a fairly large amount of buffer pool reads and writes, as well as a large number of Modified db pages. Now, I know that it could take a while for innodb to flush out the dirty buffers after a large amount of writes, but I'm seeing the Modified db pages actually getting larger for a while, and then gradually getting to the point where it finally settles down and returns to 0.

The only thing I can think of is Innodb is doing some sort of table optimization, which begs the question of whether I should rerun my tests with the optimized table or not.

Has anyone else seen this?

UPDATE: I believe I have figured out what's happening. My sysbench test is working on a 100M row table, and the last run I do is 10M inserts with these options:

--oltp-test-mode=nontrx
--oltp-nontrx-mode=insert

I found out that this test deletes all of my rows before it runs, so I see around 50k deletes per second for quite a while, and then the test runs. After the test finishes, I suspect Innodb is cleaning up a lot of unused pages in the data file, hence all the read and write activity on the buffer pool.

Friday, October 12, 2007

Creating a Job queue in Innodb

This post is half-this is how you do it, and half-is there a better way to do it?

So you want to build a system that does jobs. You want the jobs to be able to be run in parallel for speed, but also for redundancy. This system needs to be coordinated so, for example, the same jobs aren't being done twice, the status of every job is easy to see, and multiple servers can run jobs by simply querying the central source.

How can we build code around Innodb to have MySQL be our central controlling scheme?


CREATE TABLE IF NOT EXISTS job_queue(
   id int(10) not null auto_increment,

   updated timestamp not null,
   started timestamp not null,

   state ENUM('NEW', 'WORKING', 'DONE', 'ERROR' ) default 'NEW',

   PRIMARY KEY ( id ),
   KEY( STATE )
) ENGINE=Innodb;

In this schema, our job table has a unique id, started and last updated time, and a STATE. In a real system, there would probably be some more meta data here about the nature of the job.

When new jobs need to be done, rows are inserted into the table (with started=NOW(), updated changes automatically)

insert into job_queue set started=NOW();

Now, in perl, we write our job controller. This program could be run on every server you want to run jobs:


#!/usr/bin/perl

# Don't leave zombies around, shamelessly stolen from 'man perlipc'
use POSIX ":sys_wait_h";
sub REAPER {
    my $child;
    while (($child = waitpid(-1,WNOHANG)) > 0) {
        $Kid_Status{$child} = $?;
    }
    $SIG{CHLD} = \&REAPER;
}
$SIG{CHLD} = \&REAPER;

use DBD::mysql;

my $dbh = DBI->connect(
'DBI:mysql:database=test;host=127.0.0.1;port=3306',
'test', 'test',
{ RaiseError => 1, AutoCommit => 0 }
);


while( 1 ) {
 my $row;
 eval {
     my $sth = $dbh->prepare( "select id from job_queue where state='NEW' li
mit 1 for update" );
     $sth->execute();
     $row = $sth->fetchrow_hashref();
     $sth->finish();
 };

 if( $@ or !$row ) {
     # Couldn't lock or lock wait timeout
     $dbh->commit();
     sleep 10;
     next;
 }

 # Got one, change state, commit, and fork a worker

 $dbh->do( "update job_queue set state='WORKING' where id=" . $row->{id} );
 $dbh->commit();

 # Fork a worker
 if( fork )  {
     # Parent, let the child have the old connection and reconnect to
     # the db.
     $dbh->{InactiveDestroy} = 1;
     $dbh = DBI->connect(
        'DBI:mysql:database=test;host=127.0.0.1;port=3306',
        'test', 'test',
        { RaiseError => 1, AutoCommit => 0 }
     );
 } else{
     # Child
     print "Locking\n";
     $dbh->do( "select * from job_queue where id=" . $row->{id}
         . " for update " );

     print "Working\n";
     #simulated work here
     srand( time );
     sleep rand 30;

     $dbh->do( "update job_queue set STATE='DONE' where
         id=" . $row->{id} );
     print "Committing\n";
     $dbh->commit();

     $dbh->disconnect();

     exit;
 }
 sleep 1;
}

Now, this isn't bad. But it has at least one thing I don't like about it. If a forked worker dies before the job finishes (and commits), there's no way for the job to be reassigned to another worker.

Ideally each worker would have a lock on his job row inside of a transaction, which it is doing now, but, instead of the job being in the 'WORKING' state first, I'd rather it was 'NEW' before the transaction started. If that were the case, if a worker died, it's unfinished transaction would be rolled back, and the job would be unlocked and NEW again.

However, because of the way SELECT ... FOR UPDATE works, Innodb will deadlock waiting for 'NEW' jobs to unlock from the workers (or wait until they are done). This is sub-optimal, since my parent process could be forking new jobs in the meantime. What would fix this is a SELECT ... FOR UPDATE that skipped locked rows without blocking.

If anyone knows a good way to achieve this, please let me know!

However, assuming we have no better alternatives, we can create a reaper process like this:


#!/usr/bin/perl

use DBD::mysql;

my $dbh = DBI->connect(
   'DBI:mysql:database=test;host=127.0.0.1;port=3306',
   'test', 'test',
   { RaiseError => 1, AutoCommit => 0 }
);


while( 1 ) {
    my $row;
    eval {
        my $sth = $dbh->prepare( "select id from job_queue where state='WORKING' limit 1 for update" );
        $sth->execute();
        $row = $sth->fetchrow_hashref();
        $sth->finish();
    };

    if( $@ or !$row ) {
        # Couldn't lock or lock wait timeout
        $dbh->commit();
        sleep 10;
        next;
    }

    # Got one, change state, commit, and fork a worker

    $dbh->do( "update job_queue set state='NEW' where id=" . $row->{id} );    $dbh->commit();

    sleep 1;
}

This will find the first WORKING row and try to lock it. Since normal jobs will be moved to DONE before they are committed, this should only ever find jobs that are in WORKING and unlocked, which means the worker died.

However this isn't perfect. Because SELECT ... FOR UPDATE will try to lock a row already locked, we have to wait until all jobs before our stalled job are complete, getting deadlocks along the way (be sure to set your innodb_lock_wait_timeout fairly low!. Even further, I've seen dead worker processes leave behind idle mysql connections that hold onto their row locks.

Is there any smoother way to do this? I'd love to hear other people's advice.

Thursday, October 4, 2007

How safe is setting read_only?

We use a lot of dual master setups for redundancy and recoverability. Typically, most apps can't write to both masters at the same time, so for consistency we keep only one of the masters read-write at any given instant. To switch over gracefully, turn the primary read-only, wait for replication to catch up, turn the secondary read-write. Simple, eh?

Well, I've just had a report that setting the primary read-only caused a transaction to be partially replicated to a slave. Wait, what? If I'm reading the email correctly, they switched masters in the middle of a transaction. The transaction commit correctly failed on the master with a read-only error, but somehow the other master (or a slave) got part of the transaction that failed and had to be skipped.

This was an innodb table in MySQL 5.0. My first thought was that XA was disabled (though we typically use 4.1 here, so I thought of innodb_safe_binlog). I verified that the XA support appeared to be enabled correctly. My other thoughts are maybe the binlog didn't get synced properly or a thousand other possibilities... I'll update if find out more.

Now, I'm not panicking too much, this could be a non-issue, bad report, what-have-you. However, it got me thinking. How safe *is* it to simply set read-only? Transactionally speaking, is this doing what I want, which is an atomic "deny writes from now on" change? Is this safe with binlogging?

My user claims that doing a flush tables with read lock before setting read-only is safer, but is it? Does setting read-only somehow muck with the binlog to prevent it from being resynced with the inno transactional state somehow?

I figured I'd post it out here and see what people say...

UPDATE: I found out the user isn't using normal transactions, but rather AUTOCOMMIT. I'm following up to try to figure out what the exact nature of the slave error was.

However, another related question is: Is there a race condition between applying a write statement and writing the binary log? Is it possible to do one without the other if there's a sudden server crash? With Innodb, I'd expect the XA transactions to handle this event, but what about plain old MyISAM?

Saturday, May 19, 2007

My first database

I grew up on a farm. Lest anyone automatically load up their preconceived notions of farmers, my Dad was (and is) very interested in computers and purchased the original IBM PC back in the early/mid 80's. He found a lot of practical use in the Lotus 1-2-3 spreadsheet program, and could fly through the keyboard shortcuts (I guess they weren't really shortcuts in those days, as there wasn't a mouse) fast enough to put him on the same level with any Unix geek today with Emacs or Vi.

In the 90's, I remember him buying our 486/66 with 16MB of RAM from Micron and describing it as a computer that would "peel out on the lawn." My dad is a tractor-guy and has always been obsessed with horsepower. He still likes to rev up his Steiger tractors and listen to the power. This type of personality translates well into the computer field.

Anyway, I had been interested in computers since he brought the first one home. Sometime around when he got the Micron, he paid a lot of money for this thing called a database. I'm not even sure which one it was, I think it might have been FoxPro 2. He wanted to use it to track information on our crop fields: what seed brand had been planted, when it had been worked, fertilizers used, statistics on the yield, stuff like that. He didn't really know how this new software worked, but he understood that it could do stuff like that better than a spreadsheet could. He was totally right.

Unfortunately for him, he was relying on me to make this work (think lazy teenager, I'm not saying all teenagers are lazy, but I was). He knew I grasped computer concepts much more quickly than he did, but I couldn't make heads or tails of it: at least for the 20 minutes I tried. The FoxPro box sat on the shelf, unused, and it probably is still sitting there.

I titled this post "My First Database", and I guess that story doesn't fit the requirement since I didn't really actually create a database. So here's this story:

A few years later, after my first two years of college, I went on a Co-op at Pratt & Whitney as a programmer. That was where I first learned Perl, among other things, and I can remember building a few web applications that were meant to help access these legally retained documents (think plane crash, faulty engine, lawsuit). Incredulously, I basically had to design and build these applications myself (just finished 2nd year of Computer Science). Perl CGI was pretty cutting edge in 1996, but at the time (and not even until after I graduated) I still didn't really have the foggiest notion of a database. Having no other guidance from the more experienced engineers or my manager, I based what was essentially the database on the unix filesystem and a set of storage conventions I coded into my application.

This was, in essence, my first database, but I had enough sense at this point to realize that:

this didn't feel right
there had to be a better way to manage the data
this thing was going to break as soon as my co-op ended and no one would be able to fix it.

I never heard what happened.

That's right, my first database was flat files. One of my co-workers at Yahoo, Dave, likes to listen to me and Tim, the Oracle Architect, talk about databases and he (Dave) likes to joke that flat files are his database expertise. He would be so proud.

What does this have to do with anything? I think about how both of these experiences would have been different if it had been set today. How awesome is it that a) MySQL is freely available on the net (indeed, along with plenty of other good choices) and b) the tools to learn and understand these systems are freely available to anyone. There are so many more opportunities for guys who are interested enough, like my dad, to take these tools and build something without wasting a lot of money.

I'm not sure my Dad would get MySQL, he's more into trading stocks nowadays than learning new technologies (though he did just buy his first Mac), or if he would have "gotten" it then. He did have me, and realized I was capable of understanding it, I was just too busy playing X-wing.

I realize now that this post is a gushy open-source loving rant (of which I never bother to read myself anymore once I identify it) but I guess that's at least somewhat reflective of how I feel on the subject after careful analysis. I'm no hardcore zealot, but it is cool to look back just a short 10-15 years and see how much has changed.

Wednesday, May 16, 2007

MySQL is to Oracle...

I came up with this analogy today comparing MySQL and Oracle and it seemed to fit. Only real Unix geeks will get this, but here goes:

MySQL is to Oracle as Vi is to Emacs

Maybe this has been said before, but think about it.

Here's what they all have in common:

All are cool technologies and worthy of respect in their own right
All have zealots vigorously defend their technology to the death
Each set (Oracle/Emacs and MySQL/Vi) represents a different way of looking at and solving a problem

For the Oracle/Emacs world things work like this:

The technology is the be-all/end-all, make it do everything we want.

In many ways this is a huge advantage, things are tightly integrated, you don't need to worry so much about system conventions and portability, because new features can easily be added inside the existing framework. This continues until the technology becomes very Operating System-like.

The disadvantage (to my mind) seems to be bloat. What the technology was really made for becomes a little murky with all the extra features and complexity of setup and use.

I can't spend as much time explaining this camp here, because I'm not a part of it. I do know that Oracle installations take up multiple Gigabytes by default and Emacs has its own web browser and email client, both attributes that are common with Operating Systems.

On the MySQL/Vi side, things are more like this:

Keep this technology focused on its purpose, lean and mean

In this camp, there's far less bloat. It's very clear (most of the time) what the technology is supposed to do, and installing/configuring/using the technology is centered around the core purpose of that technology.

What's the benefit? In theory, a well made piece of technology that does its job in an efficient manner.

What's the disadvantage? "Is that all it can do??!" There are less features, harder integrate third-party components, and it's harder to add that stuff in (a pluggable what?!)

Of course, it's not as simple as all that, and there are plenty shades of gray. MySQL has a pluggable storage engine framework (I heard it's drag and drop), and Oracle still has a RDBMS somewhere in there (if I pick on MySQL, I can pick on Oracle).

Lest I'm misunderstood, I'm simply saying that these technologies seem to represent a particular way of looking at developing software over time.

It seems to be generally true that commercial vendors tend to gravitate towards the "be-all/end-all" way of looking at things, but I'm pretty sure no one made millions licensing Emacs.

The pragmatists amongst us (myself included) would probably say:

Use what works for you.
If you know it, and you have it, use it!
Save the baby, throw out the bathwater
At the end of the day, keep what you need, ditch the rest.
Can't we all just get along?
Prejudice works both ways, everyone thinks their technology is the best.

Some after thoughts
(in no particular order, and with no particular point)

It sure is hard to switch to an alternative technology after you have one under your belt. I've attempted to grasp Emacs a few times, just can't do it. I'd like to learn Python or Ruby, but it's sure hard when I can just use Perl.
I really try to be neutral on these holy wars, but hey, everyone believes in something.
*flashes the Vi gang-sign to Paul Tuckfield*

Thursday, May 10, 2007

Where cluster needs to be

I'm really excited about MySQL Cluster. It has a lot of potential to be a good competitor to Oracle RAC, and not just in a copy-cat kind of way.

The thing that makes MySQL different and a good balance, when you're fortunate enough to have the choice, is the "Share nothing" concept. In RAC (and I'm no expert), you use shared storage, usually a Netapp Filer. While filers are seriously cool, they are also seriously expensive, and, no matter how much you gussy it up, it's still a single point of failure.

I was glad I sat in on the Intro to cluster talk at the Users conference, it helped me understand what cluster has to offer, and what it still needs. These needs probably aren't any big secret to the Cluster developers, they openly talked about them and made them very clear. Others are my (admittedly limited) opinions based only on what I heard at the conference and not actually from using cluster mysql (yet).

Dynamically adding and removing Storage nodes
Doing a lot of queries on a lot of storage nodes is fine, but until I can dynamically add and remove nodes without cluster downtime, I can't easily recover from failures, I can't add capacity, and I can't get my data to re-balance itself over the nodes.
I need to scale bigger than physical memory
I know 5.1 has disk-based storage for non-indexed data, but I need to scale my indexes beyond RAM too. RAM is too expensive to scale this way forever. For what cluster should be useful for, like a terabyte or two of data, I'd probably have a hard time fitting all my indexes in physical memory.
Replicating a cluster
Talk about SPoFs all you want, but until you consider your colocation a point of failure, you're still vulnerable. The replication in 5.1 for cluster is good in some ways, but I can't help feeling like it's a little hacky. Passing the replication data back (upstream) into the SQL nodes from the data cluster just doesn't feel right somehow.
I do like the epoch concept, I hope that makes it upstream into generic replication and isn't just NDB specific forever.
From what I heard, even with single threaded replication, cluster is so fast that it can scale quite a ways anyway. With clever partitioning, multiple replication threads can be used, but it wouldn't be easy to implement in normal cases.
So how do I backup this thing?
So I know data nodes can be redundant, and I know I can replicate to another colo. But how can I simply get a cumulative data node cluster's data onto a single backup medium in a consistent state? Maybe I missed this at the conference, but is backing up your binary log enough? It's not how I would back up any other MySQL server.
Netapps may be expensive, but taking a consistent snapshot of all of your data comes in pretty handy.

All of that aside, I'm still looking forward to sitting down and playing with NDB.