A lesson in never letting your filesystem fill up too much

June 26th, 2008 by atomic

A good rule of thumb is that a DBA should not let the size of his/her database grow to much more than 50% of the available disk space. I’m sure everyone has a story of being burned in some way by not following this advice.

Before promoting 5.1 to some beefier hardware, I’ve got a few large tables sitting on a small SATA-based machine. Noticing that a few were growing faster than I expected, i decided to pause my loading and convert another one of them to the archive storage engine to save some space.

I certainly saved space.

By trying to do too much at one time, I had some other temp files lying around on the same filesystem, and my conversion to the archive storage engine failed at about 80% through with my disk full.

mysql> alter table w_stats engine = Archive;
ERROR 1030 (HY000): Got error -1 from storage engine

mysql 5.1 has an unfortunate bug that I discovered, and reported, that causes the unfinished archive files to remain after the failed conversion, and not the original data. My disk space problems are no longer much of a problem :)

Now I know that this was basically my fault, and that I should have been more careful with my system, but I would have preferred maybe an embarassing error message, a slap on the wrist, not the loss of a 150GB table! For example:

mysql> alter table w_stats engine = Archive;
ERROR 1030 (HY000):  df -h is your friend

Well, it sounded funny in my head at least…

Posted in mysql | 1 Comment »

Testing MySQL sequential IO performance with different IO schedulers - Part 1

June 23rd, 2008 by atomic

Testing MySQL sequential IO performance with different IO schedulers - Part 1

I stumbled upon an interesting article discussing the different Linux I/O schedulers that are available in the 2.6 kernel, and also discussing some of the deficiencies of the 2.4 “Linus elevator” I/O scheduler.

The default scheduler as of Kernel 2.6.18 is the “Completely fair queuing” scheduler, or CFQ. The previous default was the anticipatory scheduler between 2.6.0 and 2.6.17. Also available is a buffed-up version of the 2.4 scheduler called the deadline scheduler, and a “noop” scheduler. This article has a great discussion on what these schedulers mean for databases at a higher level.

Although old, some documentation in the kernel states  that database workloads should typically use the deadline scheduler. This tuning guide for Oracle on Linux gives, without much explanation, the same advice .

I don’t have the time at the moment to do a full analysis (hence part 1), but I did some quick tests and already got some pretty surprising results. I’m hoping to see if anyone else has done some tests along these lines that can comment further.

The test

I have a 140GB table sitting on a test machine with a simple 320GB SATA drive.

Changing the scheduler in newer kernels is a simple tunable parameter. You can change it to the deadline scheduler for your sda device, for example, as follows:

echo deadline > /sys/block/sda/queue/scheduler

I ran a simple count(*) on one of the partitions, 13GB in size, to see if there was a measurable difference between the schedulers. Memory size on this system is 4GB.

These are the results I got on an otherwise idle machine. The CFQ and anticipatory results were very consistent, almost to the second each time, while the deadline and no-op schedulers fluctuated as much as 50s per test. Not sure at this point why this is the case.

Scheduler
Average runtime (s)
Average Throughput (MiB/s)
CFQ
303
43.04
Anticipatory
370
35.4
Deadline
492
26.5
No-op
471
27.6



Now, these results come with the following caveats:

- These are not extensive, robust benchmarks, they’re just a few tests I ran against my very specific workload. I encourage people to try the different schedulers on different workloads and hardware and share results!
- These are test results with a large, sequential I/O workload that is very different from the random I/O intensive tests that most people are interested in.
- These are test results against a single ATA drive. These results are probably irrelevant in a hardware RAID context

In a future blog post, I will try to dig a bit deeper into these results and try the tests against hardware and software RAID. What i think can be taken from this already, and the referenced articles, is that CFQ should probably be used if your kernel has direct knowledge of the physical layout of your disks, eg. through LVM. Otherwise, deadline or even noop is probably your best bet if you are using hardware RAID, putting the burden on the hardware to decide how best to execute reads and writes.

Posted in mysql, performance | 3 Comments »

Building a data warehouse on a budget with MySQL 5.1

June 16th, 2008 by atomic

If there is one thing that a DBA or data warehouse architect can count on, it is that data volumes will increase while budgets will decrease.

This is why MySQL 5.1 and its partitioning capabilities are so interesting. I’m going to demonstrate how you can build a small/medium-sized data warehouse or data mart (1-10 TB range) on a shoe-string budget.

the mission

I decided to convert a relatively large statistics table (750m rows, 140GB in size in about 10 partitions) on a test machine from MyISAM to the Archive storage engine. After a long conversion process, my data, on disk, ended up being about 21GB, for an impressive compression ratio of 6.7:1.

Prior to MySQL 5.1, one of the drawbacks to the archive storage engine was that you could not index it; however, with partition pruning, you can get yourself a “free” index on a large archive table by splitting it into date-based chunks, whether by day, week, month or year, depending on the volume of data you want to make accessible.

I decided to see how long it would take to run a simple aggregate query against a portion of one of the partitions I had created.

mysql> explain partitions select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-31′;

+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
| id | select_type | table        | partitions | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
|  1 | SIMPLE      | stats        | p_2007q2   | ALL  | NULL          | NULL | NULL    | NULL | 59558354 | Using where |
+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
1 row in set (0.00 sec)

On disk, the partition was about 1.7G:

-rw-rw—- 1 mysql mysql 1.7G Jun 16 17:16 stats#P#p_2007q2.ARZ

After issuing the query, I watched iostat. The file being accessed was completely absent from filesystem cache because i had previous run a long query against a large MyISAM table. This is what I saw, fairly consistently:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
24.96    0.00    0.75    0.88    0.00   73.42

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda              48.55     0.40 62.44  0.60     7.36     0.00   239.32     0.39    6.20   5.42  34.20


As expected with a highly-compressed table, the query speed is CPU-limited: the disk can go much faster but we are fully utilizing one of the four cores (notice 25% cpu utilization).

Back-of-the-envelope calculations suggest that, at 7.3mb/s read throughput from disk, and a 1.7G file to be read, that the query will complete in about:

1700mb / 7.3mb/s = 232s = 3m52s

mysql> select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-31′;
+———-+
| count(*) |
+———-+
|        0 |
+———-+
1 row in set (3 min 52.19 sec)

Dead on!

the possibilities

What is interesting is that from this we can deduce the effective throughput of an archive table based on how much data the disk is feeding the cpu.

If we attempted to run such a query on the original MyISAM partition (about 11.5GB), in order to complete the result in 232s, we would require a throughput of :

11500 / 232 = 49mb/s

50mb/s is not lightning fast, it’s about what you’d expect from a single SATA disk, but it’s still fairly good considering the amount of compression we’re getting.

But wait! You still have 3 more cores left on the machine. Now to confirm another hypothesis:

Testing two simultaneous queries, against the same partition:

select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-29′;
select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-30′;

Most of the data was in fs cache, so our iostat looked like this for most of the time:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
50.70    0.00    0.85    0.00    0.00   48.45

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.50  0.00  0.80     0.00     0.01    13.00     0.00    0.38   0.37   0.03
hda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Final runtime for both queries was 208s and 217s; suggesting that the queries were almost perfectly parallelized.

If we needed to run a beefy calculation against a year’s worth of data, for example, we could run four parallel queries (one per core per quarter) and aggregate the results for each query in a separate client. Now our “equivalent” 50mb/s throughput on one core has become 50mb/s * 4 = 200mb/s.

If you get a slightly faster disk, you could fully utilize an 8-core box for a throughput of 400mb/s.

And if your disk space is 750GB, factoring in the compression ratio, you are storing the equivalent of 5TB of non-compressed data with a potential read throughput of 200mb/s. Not too shabby for low-end hardware!

Posted in mysql, performance | 5 Comments »

MySQL Repair/Optimize Partition Errors

June 5th, 2008 by atomic

mysql 5.1 is nearing release, with the present release candidate 5.1.24.

The most important new feature, in my eyes, is the new partitioning capability. When I get some time, I will write up a more complete post on my experiences so far with 5.1 partitioning, but I am going to try to keep the turnover on posts a bit higher, and post smaller things on here more regularly.

Partitioning has the potential to make large tables in mysql manageable once again. This is music to the ears of anyone that has had the misfortune of having to learn, the hard way, about MyISAM’s often painfully slow “Repair by keycache” loading and repairing of large tables with unique keys. Add that to MyISAM’s propensity to table corruption, especially with large tables, and you have a ticking timebomb on many pre-5.1 servers out there. If you are lucky, you can repair a 10GB table in a few hours, something you don’t want to have to wait around for at 3am which is invariably when those big tables get screwed up and you have to repair them.

Of course, many people use application-level partitioning to deal with this problem, or used the inadequate MERGE storage engine to deal with this. But partitioning at the database level has the potential to make things just so much easier. If you partition wisely, you can turn a 10GB behemoth table into a series of 500mb chunks, each of which is (I’m hoping here) less likely to corrupt, and far more manageable by MySQL individually.

Beware of this bug though! It’s duplicated many times, and might catch you off guard if you plan to be an early adopter of 5.1. Essentially, you can’t really repair or optimize partitioned tables, at least not directly. The documentation still contains references to the ALTER TABLE … REPAIR PARTITION statements which don’t work at all, as of 5.1.24. For eg, with my partitioned InnoDB table tst_p:

mysql> optimize table tst_p;
+-------------+----------+----------+-----------------------------------------------------------+
| Table       | Op       | Msg_type | Msg_text                                                  |
+-------------+----------+----------+-----------------------------------------------------------+
| stats.tst_p | optimize | note     | The storage engine for the table doesn't support optimize |
+-------------+----------+----------+-----------------------------------------------------------+
1 row in set (0.00 sec)
mysql> alter table tst_p repair partition p_pre2007;
ERROR 1178 (42000): The storage engine for the table doesn't support repair partition

Luckily, there is a workaround. Using ALTER TABLE REORGANIZE PARTITION, you can effectively rebuild the partition which will optimize/repair it. I’ve yet to test this with a truly corrupted table, however. The downside of this is that you have to have the partition definition handy. Luckily it seems to be smart enough to stop you from making a mess — notice the error in the first one, and the second statement goes through correctly:

mysql> alter table tst_p reorganize partition p_pre2007 into (partition p_pre2007 values less than (to_days('2007-12-31')));
ERROR 1520 (HY000): Reorganize of range partitions cannot change total ranges except for last partition where it can extend the range
mysql>
mysql> alter table tst_p reorganize partition p_pre2007 into (partition p_pre2007 values less than (to_days('2007-01-01')));
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

As I get more time i’ll post more cool partitioning tricks.

Posted in mysql | No Comments »

Sun caught in a pincer with MySQL

June 2nd, 2008 by atomic

Over the years, the database world has been buzzing with the strategic threat posed to the established players by upstart open-source database systems. Oracle and IBM would no longer be able to gouge defenseless small and medium-sized businesses of non-trivial portions of their IT budgets for a mere database licence. Oracle, IBM and Microsoft, for their part, have tried their best to respond to this threat, but it is clear that they cannot simply squash open-source products, but rather evolve with the changing landscape.

the countered threat from Oracle

Oracle made some strategic purchases in the past few years to establish a foothold in the embedded and front-end database market by acquiring Sleepycat (maintainers of BerkeleyDB) and InnoBase (makers of InnoDB storage engine for MySQL). These two also happened to provide the only two transactional backends for MySQL, whlie InnoDB is the only one to be used widely in practice. While this was looked upon as a disaster for MySQL, it was really not a major issue for a couple of reasons:

  • The InnoDB source has been GPL’d, so even if Innobase were to completely abandon maintenance of the codebase (which it has not), the community can step in to fix bugs. For now, the status quo prevails
  • MySQL acquired Netfrastructure (sp?) and begun the process of porting the backend to the new Falcon transactional storage engine
  • MySQL also began the process of fixing up some of the severe defecits in the MyISAM storage engine and branded it Maria

So, from a strategic perspective, it looks like MySQL is taking the right steps to counter the threat from Oracle. An infusion of money from Sun will speed up development on many of these initiatives.

a general picture of the database landscape

Let’s take a step back for the moment and take an unscientific look at a few players in the current database market. This is not rigorous, exhaustive market research, it’s just my observations over the past 9-10 years or so:

Teradata: The data-warehousing champ, a reputation for high quality but also prohibitively expensive for many with large data volumes

DB2: Strong in the institutional market, mainframes and data warehousing; not much use in the web/internet world

Oracle: The jack of all trades. Expensive, but no CTO would ever be fired for picking Oracle for almost any purpose, whether its an OLTP system or data warehouse

PostgreSQL: The Betacam of open source databases. Highly-functional, stable and scalable. Over the years, unfortunately saddled with a somewhat-unjustified reputation for being slow and difficult to use (in comparison to MySQL) and a militant userbase that spends an inordinate amount of time bashing MySQL instead of evangelising its capabilities

MySQL: A simple, fast database with a reduced featureset that works well for web applications

My belief is that MySQL owes its popularity mostly due to the fact it is perceived to be very fast when using it to build simple apps. Over the years it became the ‘default’ web database, with most hosting providers using it as the backend in combination with PHP. Many are now providing PostgreSQL hosting, but this wasn’t always the case.

mysql vs. postgresql

While they are both OSS databases, MySQL and PostgreSQL are very different. PostgreSQL has its roots in academa and the defense industry, and a trip into the source code is like a trip down a memory from a computer science class. MySQL, on the other hand, feels and has been developed much more like a commercial product, with a focus on functionality and speed and less interest in elegance and standards compliance. These are sweeping generalisations of course.

My reason for bringing up this comparison is to make the following important assertion:

For a particular project, given the choice, DBAs choose PostgreSQL, developers choose MySQL.

DBAs like tablespaces. Good query optimizers. Tables that don’t randomly corrupt.

Developers like databases that reduce their need to think. A system that will let you send 100 queries over 100 separate connections without any apparent overhead compared to sending 100 queries over the same connection.

I come to this conclusion after nearly 2.5 years as a MySQL “DBA” working for companies of various sizes in the internet industry. MySQL, in many cases, has been reduced to a glorified flat file system, and many non-junior developers do not even understand the most basic SQL optimization.

It’s not hard to see why MySQL is far more popular than PostgreSQL, given that developers are more numerous and higher up the application stack.

Amazon SimpleDB and Google BigTable

The users of MySQL may prove to be a fickle bunch, however. Oracle never was never that much of a threat all along, and two unlikely competitors in the database space will change the rules of the game. Developers like the simplicity and avoiding-of-thinking they get with MySQL. Once a site gets to a certain scale, the database become a major bottleneck. Complex yet surprisingly-robust sharding architectures have been developed to deal with scaling MySQL beyond the capacity of one machine, but this is not for the faint of heart.

BigTable and SimpleDB look ready to take developers back to the simpler days when MySQL was a fast, reliable persistent store, allowing them to focus on their strengths. HBase, while still very alpha-ish, also holds great promise. Many people have criticised BigTable and SimpleDB for being, well, just a big table and a simple database. But that’s precisely what MySQL was, and it did it quite well for a long time. Developers in the internet age simply don’t care about the things DBAs and database developers of a previous era did. They want three things: performance, availability and more performance. Strict ACID compliance is simply not that important in an age when entire internet empires are built off of clicks worth as little as $0.07 and page views worth $0.00001 — but high availability is important.

the pincer

Sun has acquired MySQL at a time when the old guard of the database world is becoming more aggressive, and the new guard of software-as-a-service providers are swooping into the space to appeal to IT managers initially happy with the open source licensing, but not thrilled with the non-trivial total cost of ownership (read: paying a DBA who must continually wake up at 230 to repair your corrupted MyISAM tables)

prognosis

Well, let’s put it this way — as someone that abandoned the proprietary database world a few years ago to work solely with open-source technologies, I feel that I need to start learning more about HBase, BigTable and the like to survive in this marketplace. MySQL may not have seen this strategic threat coming, but they better start working, quickly, to make MySQL scale better and more easily, or it will begin to lose its place as the “default” database of new web applications.

Posted in Uncategorized, mysql | 8 Comments »