I recently found myself wishing for an async library for MySQL. My goal is to be able to fire off queries to a group of federated servers in parallel and aggregate the results in my code.
With the standard client (DBD::mysql), I'd have to query the servers one at a time. If there are 10 servers and each query takes 0.5 seconds, my code would stall for 5 seconds. But by using an async library, I could fire off all the queries and fetch the results as they become available. The overall wait time should not be much more than 0.5 seconds.
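The timing argument above can be illustrated without a database at all. What follows is only a sketch under loud assumptions: fake_query is a hypothetical stand-in that sleeps for half a second, and the parallelism comes from background shell processes rather than from an async MySQL client.

```bash
#!/bin/bash
# Sketch only: fake_query is a hypothetical stand-in for one 0.5-second
# query. It never talks to MySQL.
fake_query() {
    sleep 0.5
    echo "rows from server $1"
}

results=$(mktemp)
start=$(date +%s)

# Fire off all ten "queries" at once, then wait for every one to finish.
for server in 1 2 3 4 5 6 7 8 9 10; do
    fake_query "$server" >> "$results" &
done
wait

elapsed=$(( $(date +%s) - start ))
count=$(grep -c . "$results")
echo "collected $count results in about $elapsed second(s)"
```

Run serially, the same ten calls would take roughly five seconds. Here the wall-clock time should stay close to that of the slowest single call.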
While I found little evidence of anyone doing this in practice, my search led me to the perl-mysql-async project on Google Code. It's a pure-Perl implementation of the MySQL 4.1 protocol and an asynchronous client that uses Event::Lib (and libevent) under the hood.
The code contains little in the way of documentation or examples, aside from the simple bundled test script. After a bit of mucking around with it, I managed to cobble together a working example. It looks like this:
Sure enough, that code runs in just a bit more time than the longest query it executes, rather than the sum of all the query times.
What still surprises me is that this code doesn't appear to get a lot of use (or at least discussion) in the real world. In the PHP world, the mysqlnd driver offers async queries.
So count this as my contribution to demonstrating that Perl can do async MySQL queries too.
In one of those "well, duh!" moments the other day, I came across a headline on Slashdot that said Unhappy People Watch More TV. Given that I mostly stopped watching TV quite some time ago and consider it to be one of the more rude devices in our culture, I clicked thru to read about how others have discovered what I'd already guessed was true...
A new study by sociologists at the University of Maryland concludes that unhappy people watch more TV, while people who describe themselves as 'very happy' spend more time reading and socializing. 'TV doesn't really seem to satisfy people over the long haul the way that social involvement or reading a newspaper does,' says researcher John P. Robinson. 'It's more passive and may provide escape--especially when the news is as depressing as the economy itself.
Imagine that... Stagnation and exposure to negative information lead to sadness. It goes on...
The data suggest to us that the TV habit may offer short-run pleasure at the expense of long-term malaise.' Unhappy people also liked their TV more: 'What viewers seem to be saying is that while TV in general is a waste of time and not particularly enjoyable, "the shows I saw tonight were pretty good."'
Another shock. TV provides only a short-term reward (kind of like a drug hit).
If this resonates with you a bit, or you suspect deep down that there's more going on with the influence of TV in our culture, I highly recommend reading Amusing Ourselves To Death by Neil Postman if you have not already.
It's too bad this stuff doesn't get taught in school--where, I'm told, teachers are using PowerPoint more and more.
I recently had a need to add some error checking to a bash script that runs multiple copies of a Perl script in parallel to better utilize a multi-core server. I wanted a way to run these four processes in the background and gather up their exit values. Then, if any of them failed, I'd prematurely exit the bash script and report the error.
After a bit of reading bash docs, I came across some built-ins that I hadn't previously used or even seen. First, I'll show you the code:
And here's the Perl script that I wrote in order to test the functioning of wait.sh. It accepts two arguments. The first is the number of seconds to sleep (to simulate the delay associated with doing work) and the second is the exit value it should use (any non-zero value indicates a failure).
Discussion
New to me was the use of let to do math on a variable so that I can count up the number of failures. Is there a better way? Bash has no ++ operator outside of arithmetic contexts such as let and (( )). Similarly, using jobs to get a list of pids to wait on proved to be a very useful idiom.
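For readers who want to see the idiom in one place, here is a minimal sketch of it. It is not the original wait.sh: worker is a hypothetical shell function standing in for the Perl test script (sleep for $1 seconds, exit with status $2), and the particular sleep times and exit values are made up for illustration.

```bash
#!/bin/bash
# Sketch only, not the original wait.sh. worker() is a hypothetical
# stand-in for the Perl script: sleep $1 seconds, then exit with status $2.
worker() {
    sleep "$1"
    return "$2"
}

FAIL=0

# Start four workers in the background. Two of them are told to fail.
worker 1 0 &
worker 1 1 &
worker 1 0 &
worker 1 3 &

# `jobs -p` lists the pids of the background jobs. `wait` returns each
# job's exit status, and `let` bumps the counter on any non-zero status.
for pid in $(jobs -p); do
    wait "$pid" || let "FAIL+=1"
done

if [ "$FAIL" -eq 0 ]; then
    echo "all workers succeeded"
else
    echo "$FAIL worker(s) failed"
fi
```

If a worker might finish almost instantly, recording $! right after each launch is a more defensive way to collect the pids, since it doesn't depend on the job still being listed when jobs runs.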
The code is straightforward and works for my purposes. But since 99% of my time is spent in Perl rather than bash, I wonder what I could have done differently and/or better. Feedback welcome.
And, if this is at all useful to you, feel free to take it and run...
Finally, I'm starting to really dig gist.github for showing off bits of code. It's good stuff.
A month or so ago, the long under-construction Opa! opened its doors on Lincoln Ave in downtown Willow Glen. Having wanted to try it for a while, we walked down on Friday night for dinner. And we were not disappointed.
The Good
The menu is straightforward and has a good variety of Greek food. We ordered the Keftedes (Greek Meatballs) as an appetizer. The dish consisted of two well prepared meatballs and an excellent sauce.
For the main courses, we selected a Beef Souvlaki Pita (hers) and Seafood Souvlaki (mine). Both came with the most excellent Opa! Fries. (Think: garlic fries with a twist.) The food came in a reasonable amount of time and our waitress was very friendly and helpful. It was very tasty and portions were not excessively large either.
Their drink menu contains a selection of beers and a good selection of Greek wines as well. The wine we sampled was quite good and is apparently available at Costco. Needless to say, we're going to have to verify that for ourselves. ;-)
The interior is well decorated. I especially like the large TV monitor that shows what songs are playing over the sound system.
Pricing was reasonable. Dinner for two with drinks, an appetizer, and dessert (Baklava!) was about $50. Not the sort of thing we do often, but definitely not out of line with other favorite eating establishments.
The Bad
Opa! is a small sit-down restaurant with tables for 2 and 4 (mostly) that also handles to-go orders. It's often very full and could definitely benefit from more space inside. As a result, the tables are fairly close together and the waitresses occasionally bump into customers. But space isn't easy to come by in Willow Glen's downtown.
I recently was looking to make compressed backups of some files that exist in a tree that's actually a set of hard links (rsnapshot or rsnap style) to a canonical set of files.
In other words, I have a data directory and a data.previous directory. I would like to make a backup of the stuff in data.previous, most of the files being unchanged from data. And I'd like to do this without using lots of disk space.
The funny thing is that gzip is weird about hard links. If you try to gzip a file whose link count is greater than one, it complains.
I was puzzled by this and started to wonder if it actually over-writes the original input file instead of simply unlinking it when it is done reading it and generating the compressed version.
So I did a little experiment.
First I create a file with two links to it.
/tmp/gz$ touch a
/tmp/gz$ ln a b
Then I check to ensure they have the same inode.
/tmp/gz$ ls -li a b
5152839 -rw-r--r-- 2 jzawodn jzawodn 0 2008-12-03 15:38 a
5152839 -rw-r--r-- 2 jzawodn jzawodn 0 2008-12-03 15:38 b
They do. So I compress one of them.
/tmp/gz$ gzip a
gzip: a has 1 other link -- unchanged
And witness the complaint. The gzip man page says I can force it with the "-f" argument, so I do.
/tmp/gz$ gzip -f a
And, as I'd expect, the new file doesn't replace the old file. It gets a new inode instead.
This leads me to believe that the gzip error/warning message is really trying to say something like:
gzip: a has 1 other link and compressing it will save no space
But I still don't see the danger. Why can't that simply be an informational message? After all, you still need enough space to store the original and compressed versions since the original (in the normal case) exists until it is done writing the compressed version anyway. (I checked the source code later.)
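The experiment is easy to repeat as a script. This is a sketch rather than a transcript of the session above: it works in a scratch directory created by mktemp, and inode_of is a small hypothetical helper that pulls the inode number out of ls -i.

```bash
#!/bin/bash
# Sketch: repeat the hard-link experiment in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo "some data" > a
ln a b                      # a and b now share one inode

# Hypothetical helper: print just the inode number of a file.
inode_of() {
    ls -i "$1" | awk '{print $1}'
}

before=$(inode_of b)
gzip -f a                   # -f forces compression despite the extra link
after=$(inode_of a.gz)

echo "b:    inode $before"
echo "a.gz: inode $after"
[ -e b ] && echo "b still exists with the original contents"
```

The two inode numbers should differ, and b should still hold the uncompressed data, which is why forcing gzip on a hard-linked file saves no space.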
It's Friday and this is the Internet, so I present to you Cats Eating Chicken, or "My Dumb Cat Video" (embedded below too).
The background is that we had a bit of leftover grilled chicken the other night and decided to bust it up and feed it to the cats. Amusingly, they all got together to partake of the feast, but a couple of them got curious about the camera too.
Both Timmy (white and grey) and Thunder (mostly grey) give the camera a sniff or two. My boys (Barnes and Noble) remained single-mindedly devoted to devouring the meat.
Interesting things are afoot in the MySQL world. You see, it used to be that the MySQL world consisted of about 20-40 employees of MySQL AB (this funny distributed Swedish company that built and supported the open source MySQL database server) and a tiny handful of MySQL mailing lists. Large databases were counted in gigabytes, not terabytes. A Pentium III was still a decent server. Replication was a new feature!
Hey, anyone remember the Gemini storage engine? :-)
How times have changed...
Nowadays MySQL is sort of a universe unto itself. There are multiple storage engines (though MyISAM and InnoDB are still the popular ones), version 5.1 is out (finally), and the whole company made it over 400 employees before it was gobbled up by Sun Microsystems (a smart move, IMHO, though history will judge that) a while back.
If I had to guess 5 years or so ago what would be interesting to me today about MySQL, I'd have been really, really wrong. The future rarely turns out like we think. Just ask Hillary Clinton.
Here's a little of what's rattling around in the MySQL part of my little brain these days...
Outside Support, Patches, and Forks
The single most interesting and surprising thing to me is both the number and necessity of third-party patches for enhancing various aspects of MySQL and InnoDB. Companies like Percona, Google, Proven Scaling, Prime Base Technologies, and Open Query are all doing so in one way or another.
On the one hand, it's excellent validation of the Open Source model. Thanks to reasonable licensing, companies other than Sun/MySQL are able to enhance and fix the software and give their changes back to the world.
Some organizations are providing just patches. Others, like Percona, are providing their own binaries--effectively forks of MySQL/InnoDB. Taking things a step further, the OurDelta project aims to aggregate these third party patches and provide source and binaries for various platforms. In essence, you can get a "better" MySQL than the one Sun/MySQL gives you today. For free.
Meanwhile, development on InnoDB continues. Oh, did I mention the part where they were bought by Oracle (yes, *that* Oracle) a while back? Crazy shit, I tell you. But it makes sense if you squint right.
Anyway, the vibe I'm getting is that folks are frustrated because there's not a lot of communication coming out of the InnoDB development team these days. I can't personally verify that. It's been years since I corresponded with Heikki Tuuri (the creator of InnoDB). So folks like Mark Callaghan of Google have been busy analyzing and patching it to scale better for their needs.
And we all benefit.
Drizzle
Taking things a step further yet, the Drizzle project is a re-making of MySQL started primarily by Brian Aker, who worked as MySQL's Director of Architecture for years. Brian is now at Sun and, along with a handful of others at Sun and elsewhere, is ripping out a lot of the stuff in a fork of MySQL that doesn't get used much, needlessly complicates the code, or is simply no longer needed.
In essence, they're taking a hard look at MySQL and asking what it really needs to provide for a lot of its uses today: Web and "cloud" stuff. He visited us at Craigslist a few months ago to talk about the project a bit and get our input and feedback. I believe it was that day I joined one of the mailing lists and started following what's going on. Heck, I even build Drizzle on an Atom-powered MSI Wind PC regularly.
It's great to see a re-think of MySQL going on... keeping the good, getting rid of the bad, and modularizing the stuff that people often want to do differently (authentication, for example).
It's even better to see the group that's hacking on it. They really have their heads on straight.
Unanswered Questions
Why is all this even necessary? Are the "enterprise" customers and their demands taking focus away from what used to be the core use and users of MySQL? Is Sun hard to work with?
It's clear that both the MySQL and InnoDB teams could be doing more to help. But having worked at a large company for long enough, I realize that things are rarely as simple as they should be.
Will this stuff get integrated back into mainline MySQL? Will Linux distributions like Ubuntu, Debian, and Red Hat pick up OurDelta builds? What about Drizzle?
Will Drizzle hit its target and be the sleek and lean database kernel that MySQL once could have been?
Hard to say.
It's hard to guess what the future holds and too easy to play armchair quarterback about the work of others. But these are questions worth wondering about a bit.
What's it all mean?
Nowadays MySQL has a much slower release cycle than it used to. It's still available in "commercial" and free ("community") releases. There's still a company behind it--a much larger one in fact. But one that also has a vested interest in showing how it works better on their storage appliances or 256 "core" computers and whatnot.
Clustering is still very niche. Transactions are not.
Meanwhile, all the cutting edge stuff (at least from the point of view of scaling) is happening outside Sun/MySQL and being integrated by OurDelta and even Drizzle. The OurDelta builds are gaining steam quickly and Drizzle is shaping up.
Heck, I'm hoping to get an OurDelta box or two on-line at work sometime soon. And I'd like to put a Drizzle node up too. I want to see how the InnoDB patches help and also play with the InnoDB plug-in (and its page compression).
The next few years are proving to be far more interesting than I might have expected from a project and technology that looked like it was on a track straight for Open Source maturity.
Here's the abstract (which I've promised to expand upon soon):
Millions of people search for things every day on craigslist: tickets, cars, garage sales, jobs, events, and so on.
This talk will look at the recent evolution of database and search architecture at Craigslist, including performance, caching, partitioning, and other tweaks. We'll pay special attention to the unique challenges of doing this for a large data set that has an especially high churn rate (new posts, edits, and deletes).
And we strive to do this using as little hardware and power as possible.
If you're coming to the conference, drop by and harass me. :-)
If you're not sure, check out the full schedule--there's a lot of good stuff packed into the conference already and a lot of talks are still not even posted.
I occasionally wish to know the IP address of my home Cable Modem or DSL connection but don't really care if it's available in DNS or not. It occurred to me that if I could programmatically detect the IP change, I'd be able to notify myself via Twitter.
At first, I wanted a simple web service that'd tell me my IP address--something like WhatIsMyIP.com but an API suitable for simple scripting.
That made it easy to write a simple bash shell script that can be run from cron every few minutes. It uses curl to hit that script and compares the result with the previous result (stored in ~/.last_ip). If they differ it updates the file and tells twitter, again using curl.
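The compare-and-update step described above might look something like the following. Treat it as a sketch under assumptions, not the script I actually run: IP_URL is a hypothetical endpoint that returns the caller's address as plain text, and notify is a placeholder that only prints a message where the real script would post to Twitter with curl.

```bash
#!/bin/bash
# Sketch only. IP_URL is a hypothetical endpoint that returns the caller's
# IP address as plain text. notify() is a placeholder for the Twitter update.
IP_URL="${IP_URL:-http://example.com/ip}"
LAST_IP_FILE="${LAST_IP_FILE:-$HOME/.last_ip}"

notify() {
    echo "IP changed to $1"
}

# Compare the given address with the stored one. On a change, update the
# file and send a notification. Empty input (a failed lookup) is ignored.
check_ip() {
    local current="$1" previous=""
    [ -f "$LAST_IP_FILE" ] && previous=$(cat "$LAST_IP_FILE")
    if [ -n "$current" ] && [ "$current" != "$previous" ]; then
        echo "$current" > "$LAST_IP_FILE"
        notify "$current"
    fi
}

# From cron, the check would be driven by a line such as:
#   check_ip "$(curl -s "$IP_URL")"
```

Keeping the lookup separate from the comparison means the logic can be exercised by hand with a made-up address before the script ever touches the network.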
Of course, I had to create that new twitter account and then follow it in my main account. But, hey, that wasn't so hard. Now I have a Web 2.0ish social dynamic DNS thingy that uses Twitter.
Over on the 37signals blog, DHH writes Mr. Moore gets to punt on sharding. His argument is basically that if you continually delay fixing your data storage and retrieval layer, Moore's Law will be there to save your ass--over and over again.
Bzzzt. Wrong answer.
Depending on future improvements to fix your own bad planning is a risky way to build an on-line service--especially one you expect to grow and charge money for.
It's easy to forget history in this industry (as Paul pointed out in the comments on that post). There was a point a few years ago when people still believed the clock speed of CPUs would be doubling roughly every 18 months for half the cost. Putting aside that Moore's Law is really about transistor density and not raw speed, we all ended up taking a funny little detour anyway.
Until recently, the sweet spot (in terms of cost and power use) was probably a dual CPU, dual core server with 16 or 32GB of RAM. But soon that'll be dual quads with 32 or 64GB of RAM. And then it'll be quad eight core CPUs with 128GB or whatever.
But notice that nowadays we're not all running 6.4GHz CPUs in our servers. Instead we're running multi-core CPUs at slower clock speeds. Those two are definitely not equivalent.
A funny thing happens as you add cores and CPUs. You begin to find that the underlying software doesn't always... get this... scale. That's right. Software designed in a primarily single or dual CPU world starts to show its age and performance limitations in a world where you have 8, 16, or 32 cores per server (and more if you're running one of those crazy Sun boxes).
You see, David is talking specifically about MySQL (and probably InnoDB), which is currently being patched by outside developers precisely because it has multi-core issues. Its locking is expensive and not granular enough to utilize all those cores. It's expensive in terms of memory use too. And there are assumptions built into the I/O subsystem that don't scale well in today's world of fast multi-disk RAID units, SSDs, and SANs. People are hitting these issues in the real world and it's definitely becoming a serious bottleneck.
Moore's Law is no silver bullet here. A fundamental change has occurred in the hardware platform and now we're all playing catch-up in one way or another.
I'll discuss this a bit in my upcoming MySQL Conference Talk too. The world is not nearly as clear or simple as DHH is suggesting. Perhaps they can get by with constantly postponing the work of sharding their database, but that doesn't mean you should follow their lead.