Jay Janssen -- MySQL Guy: May 2007

Saturday, May 19, 2007

My first database

I grew up on a farm. Lest anyone automatically load up their preconceived notions of farmers, my Dad was (and is) very interested in computers and purchased the original IBM PC back in the early/mid 80's. He found a lot of practical use in the Lotus 1-2-3 spreadsheet program, and could fly through the keyboard shortcuts (I guess they weren't really shortcuts in those days, as there wasn't a mouse) fast enough to put him on the same level with any Unix geek today with Emacs or Vi.

In the 90's, I remember him buying our 486/66 with 16MB of RAM from Micron and describing it as a computer that would "peel out on the lawn." My dad is a tractor-guy and has always been obsessed with horsepower. He still likes to rev up his Steiger tractors and listen to the power. This type of personality translates well into the computer field.

Anyway, I had been interested in computers since he brought the first one home. Sometime around when he got the Micron, he paid a lot of money for this thing called a database. I'm not even sure which one it was, I think it might have been FoxPro 2. He wanted to use it to track information on our crop fields: what seed brand had been planted, when it had been worked, fertilizers used, statistics on the yield, stuff like that. He didn't really know how this new software worked, but he understood that it could do stuff like that better than a spreadsheet could. He was totally right.

Unfortunately for him, he was relying on me to make this work (think lazy teenager, I'm not saying all teenagers are lazy, but I was). He knew I grasped computer concepts much more quickly than he did, but I couldn't make heads or tails of it: at least for the 20 minutes I tried. The FoxPro box sat on the shelf, unused, and it probably is still sitting there.

I titled this post "My First Database", and I guess that story doesn't fit the requirement since I didn't really actually create a database. So here's this story:

A few years later, after my first two years of college, I went on a Co-op at Pratt & Whitney as a programmer. That was where I first learned Perl, among other things, and I can remember building a few web applications that were meant to help access these legally retained documents (think plane crash, faulty engine, lawsuit). Incredulously, I basically had to design and build these applications myself (just finished 2nd year of Computer Science). Perl CGI was pretty cutting edge in 1996, but at the time (and not even until after I graduated) I still didn't really have the foggiest notion of a database. Having no other guidance from the more experienced engineers or my manager, I based what was essentially the database on the unix filesystem and a set of storage conventions I coded into my application.

This was, in essence, my first database, but I had enough sense at this point to realize that:

this didn't feel right
there had to be a better way to manage the data
this thing was going to break as soon as my co-op ended and no one would be able to fix it.

I never heard what happened.

That's right, my first database was flat files. One of my co-workers at Yahoo, Dave, likes to listen to me and Tim, the Oracle Architect, talk about databases and he (Dave) likes to joke that flat files are his database expertise. He would be so proud.

What does this have to do with anything? I think about how both of these experiences would have been different if it had been set today. How awesome is it that a) MySQL is freely available on the net (indeed, along with plenty of other good choices) and b) the tools to learn and understand these systems are freely available to anyone. There are so many more opportunities for guys who are interested enough, like my dad, to take these tools and build something without wasting a lot of money.

I'm not sure my Dad would get MySQL, he's more into trading stocks nowadays than learning new technologies (though he did just buy his first Mac), or if he would have "gotten" it then. He did have me, and realized I was capable of understanding it, I was just too busy playing X-wing.

I realize now that this post is a gushy open-source loving rant (of which I never bother to read myself anymore once I identify it) but I guess that's at least somewhat reflective of how I feel on the subject after careful analysis. I'm no hardcore zealot, but it is cool to look back just a short 10-15 years and see how much has changed.

Wednesday, May 16, 2007

MySQL is to Oracle...

I came up with this analogy today comparing MySQL and Oracle and it seemed to fit. Only real Unix geeks will get this, but here goes:

MySQL is to Oracle as Vi is to Emacs

Maybe this has been said before, but think about it.

Here's what they all have in common:

All are cool technologies and worthy of respect in their own right
All have zealots vigorously defend their technology to the death
Each set (Oracle/Emacs and MySQL/Vi) represents a different way of looking at and solving a problem

For the Oracle/Emacs world things work like this:

The technology is the be-all/end-all, make it do everything we want.

In many ways this is a huge advantage, things are tightly integrated, you don't need to worry so much about system conventions and portability, because new features can easily be added inside the existing framework. This continues until the technology becomes very Operating System-like.

The disadvantage (to my mind) seems to be bloat. What the technology was really made for becomes a little murky with all the extra features and complexity of setup and use.

I can't spend as much time explaining this camp here, because I'm not a part of it. I do know that Oracle installations take up multiple Gigabytes by default and Emacs has its own web browser and email client, both attributes that are common with Operating Systems.

On the MySQL/Vi side, things are more like this:

Keep this technology focused on its purpose, lean and mean

In this camp, there's far less bloat. It's very clear (most of the time) what the technology is supposed to do, and installing/configuring/using the technology is centered around the core purpose of that technology.

What's the benefit? In theory, a well made piece of technology that does its job in an efficient manner.

What's the disadvantage? "Is that all it can do??!" There are less features, harder integrate third-party components, and it's harder to add that stuff in (a pluggable what?!)

Of course, it's not as simple as all that, and there are plenty shades of gray. MySQL has a pluggable storage engine framework (I heard it's drag and drop), and Oracle still has a RDBMS somewhere in there (if I pick on MySQL, I can pick on Oracle).

Lest I'm misunderstood, I'm simply saying that these technologies seem to represent a particular way of looking at developing software over time.

It seems to be generally true that commercial vendors tend to gravitate towards the "be-all/end-all" way of looking at things, but I'm pretty sure no one made millions licensing Emacs.

The pragmatists amongst us (myself included) would probably say:

Use what works for you.
If you know it, and you have it, use it!
Save the baby, throw out the bathwater
At the end of the day, keep what you need, ditch the rest.
Can't we all just get along?
Prejudice works both ways, everyone thinks their technology is the best.

Some after thoughts
(in no particular order, and with no particular point)

It sure is hard to switch to an alternative technology after you have one under your belt. I've attempted to grasp Emacs a few times, just can't do it. I'd like to learn Python or Ruby, but it's sure hard when I can just use Perl.
I really try to be neutral on these holy wars, but hey, everyone believes in something.
*flashes the Vi gang-sign to Paul Tuckfield*

Thursday, May 10, 2007

Where cluster needs to be

I'm really excited about MySQL Cluster. It has a lot of potential to be a good competitor to Oracle RAC, and not just in a copy-cat kind of way.

The thing that makes MySQL different and a good balance, when you're fortunate enough to have the choice, is the "Share nothing" concept. In RAC (and I'm no expert), you use shared storage, usually a Netapp Filer. While filers are seriously cool, they are also seriously expensive, and, no matter how much you gussy it up, it's still a single point of failure.

I was glad I sat in on the Intro to cluster talk at the Users conference, it helped me understand what cluster has to offer, and what it still needs. These needs probably aren't any big secret to the Cluster developers, they openly talked about them and made them very clear. Others are my (admittedly limited) opinions based only on what I heard at the conference and not actually from using cluster mysql (yet).

Dynamically adding and removing Storage nodes
Doing a lot of queries on a lot of storage nodes is fine, but until I can dynamically add and remove nodes without cluster downtime, I can't easily recover from failures, I can't add capacity, and I can't get my data to re-balance itself over the nodes.
I need to scale bigger than physical memory
I know 5.1 has disk-based storage for non-indexed data, but I need to scale my indexes beyond RAM too. RAM is too expensive to scale this way forever. For what cluster should be useful for, like a terabyte or two of data, I'd probably have a hard time fitting all my indexes in physical memory.
Replicating a cluster
Talk about SPoFs all you want, but until you consider your colocation a point of failure, you're still vulnerable. The replication in 5.1 for cluster is good in some ways, but I can't help feeling like it's a little hacky. Passing the replication data back (upstream) into the SQL nodes from the data cluster just doesn't feel right somehow.
I do like the epoch concept, I hope that makes it upstream into generic replication and isn't just NDB specific forever.
From what I heard, even with single threaded replication, cluster is so fast that it can scale quite a ways anyway. With clever partitioning, multiple replication threads can be used, but it wouldn't be easy to implement in normal cases.
So how do I backup this thing?
So I know data nodes can be redundant, and I know I can replicate to another colo. But how can I simply get a cumulative data node cluster's data onto a single backup medium in a consistent state? Maybe I missed this at the conference, but is backing up your binary log enough? It's not how I would back up any other MySQL server.
Netapps may be expensive, but taking a consistent snapshot of all of your data comes in pretty handy.

All of that aside, I'm still looking forward to sitting down and playing with NDB.

Tuesday, May 8, 2007

Changing masters

MySQL replication is cool. Almost everyone, when asked what they like best about MySQL, will mention replication. There can be no doubt that many web companies, Yahoo included, owes much to MySQL replication. How else can you scale the reading capacity of your database so easily?

It's not all rosy, replication has its problems too. It's very simplistic (I like the term brittle, if any errors happen when executing the SQL, it just stops), it can be inefficient in some places (think UPDATE that does a table scan to change one value, though 5.1 starts to address this with row-based replication), and it is single-threaded.

Single Threadedness

Let me stop here for a second. A lot of people spend good money on a nice Master DB, multi-core CPU's, plenty of memory, and good drives/RAID, etc. When it comes to their slaves, any old piece of junk will do. If you have low write volume on your master, this can work reasonably well. Your data is protected on your master, and if you lose a slave, no big deal. But once your writes go up, your bottleneck will become your slave(s) getting further and further behind during peak; I promise you.

Since the slave is single threaded, it executes the writes that came into the master serially, writes that probably hit the master in parallel. So, you have hundreds or thousands of writes on your master, but your slave can only do half of that and keeps getting further behind because it has no disk throughput, no CPU, and no memory. Whoops!

The lesson?

If you need to scale reads and your write volume is growing fast, get the same hardware (or better) on your slaves as on your master.

Back on topic...

But I digress. That's one problem with replication. The one that I'm more interested in is what to do with your slave when your master goes down. Let us suppose you have a single master and two slaves. If your master crashes (and isn't coming back anytime soon), you can simply start writing to one of your slaves and life goes on. But wait, you were writing to your master, and reading from both of your slaves. You just went from three servers down to one!

Your other slave is sitting there doing nothing, but replication is broken, so how can you get it replicating from your new master? You need to find the precise binlog and position on your new master where your slave left off of from the old master.

You have a few options:

Start binary logging from scratch on your new master before you start writing to it, and start your slave replicating from file 01, position 0.
Outcome: not too bad, but you better hope that both your new master and your slave were at exactly the same position on the old master for this to work right.
Guess at the replication position.
Outcome: missing data, or tons of replication errors with "duplicate keys". You probably want to guess conservatively, so you'll be sludging through the writes that already happened and hoping things stay consistent. The SQL_SLAVE_SKIP_COUNTER variable will become your friend, but I only recommend skipping one record at a time.
Stop your new master and get a snapshot.
Outcome: consistent data, but more downtime for your site. A workaround for this would be the various hotbackup tools that should let you take a consistent backup of your master without any downtime. Just make sure you snag the master's binlog and position of the instant of your backup.
Scan through your new master's binlog and look for the exact position where you left off from on your slave. mysqlbinlog and SHOW BINLOG EVENTS will be your friend. Start your slave from there.
Outcome: Consistency, no master downtime, no recopying the data to the slave, but a lot of effort, especially if you do this manually.

My idea for this, which is a hack, is to create a "heartbeat" table and consistently update it with an incremented field on your master. This could be done via Events, or cron. It's important that the actual value of the field is in the INSERT or UPDATE statement, so that it is viewable when you scan the binary logs and see the statements.

Once that is running and is trickling down to all of your slaves via replication, make sure all of your slaves are doing 'log-bin' and 'log-slave-updates', so theoretically any of them can be promoted to master at any time. When you are trying to troll through the binary logs on the new master, you have a statement that you know will be in there some place at regular intervals, the value of the increment at each binary log position, and you know what the value of the increment is on your slave.

With some clever script-hackery you can write a quick and dirty binary search (or the search algorithm of your choice) that will quickly find the exact binlog and position on your new master where the increment value that updated on your slave is located. Once you have that, you can slowly roll forward through the binary log until you find the precise spot that matches the binary log on your slave and you are have your position!

This could be done better, without script-hackery and in the MySQL code. I'm not much of a C hacker, but there's talk of something similar coming soon. MySQL cluster implements this same concept in 5.1 called 'epochs' to support failing over to different master or slave SQL nodes, but it's only for clusters at this point.

Next time your master crashes, get that back up first. But, don't forget about your read capacity in your slave pool, and how much your application is suffering without it.

Monday, May 7, 2007

Will this stick?

I've semi-started a lot of blogs, will this one last?

My name is Jay Janssen. Amongst other things I am a MySQL guy (tm) at Yahoo! This is a role that I more fell into, rather than one I specifically applied for. My background is Operations, somewhere between Coding/Developing and a System Admin/Colo Tech. I've done both, to some extent, and what's great about Yahoo is finding a lot of opportunity to set stuff up, see it run and fix it when it breaks.

MySQL, of course, never breaks. Unfortunately this only really true most often for the blogger, the guy on the hosted website or on an old P-III 1 Ghz running Debian Sarge off of his cable modem. These are the guys who don't know what hit them when digg.com comes around. In reality, the true test of any software is what happens when you beat the snot out of it. What's a better OS, Linux or BSD? What happens under extreme loads to each? Try it and find out.

The same holds true for MySQL. There's a lot of great (and free, as in beer) features, 5.1 is badass. But when there's a few million in advertising, brand reputation, and a lot of execs breathing down your neck, you need to think hard about stability. Once you slap a thousand page views a second on a website, and all sorts of User-generated, Web 2.0 features, you'll really find out where the rubber meets the road.

If you don't believe me, ask Google. There was a lot of head-scratching going on at the MySQL User's Conference when the Google presenters said they used 4.0.

So that's what I think about all day. Yahoo has a lot of sites and sub-sites, acquisitions and in-house builds, small sites and big sites. I get to see how they all work and try to make sense out of how we can best make use of MySQL at Yahoo.

This blog might end up with lots of posts of general types of MySQL nuggets that I can talk about, or it might not. I'm not that good at keeping blogs up.

Jay Janssen -- MySQL Guy