Building a data warehouse on a budget with MySQL 5.1
atomic If there is one thing that a DBA or data warehouse architect can count on, it is that data volumes will increase while budgets will decrease.
This is why MySQL 5.1 and its partitioning capabilities are so interesting. I’m going to demonstrate how you can build a small/medium-sized data warehouse or data mart (1-10 TB range) on a shoe-string budget.
the mission
I decided to convert a relatively large statistics table (750m rows, 140GB in size in about 10 partitions) on a test machine from MyISAM to the Archive storage engine. After a long conversion process, my data, on disk, ended up being about 21GB, for an impressive compression ratio of 6.7:1.
Prior to MySQL 5.1, one of the drawbacks to the archive storage engine was that you could not index it; however, with partition pruning, you can get yourself a “free” index on a large archive table by splitting it into date-based chunks, whether by day, week, month or year, depending on the volume of data you want to make accessible.
I decided to see how long it would take to run a simple aggregate query against a portion of one of the partitions I had created.
mysql> explain partitions select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-31′;
+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
| 1 | SIMPLE | stats | p_2007q2 | ALL | NULL | NULL | NULL | NULL | 59558354 | Using where |
+—-+————-+————–+————+——+—————+——+———+——+———-+————-+
1 row in set (0.00 sec)
On disk, the partition was about 1.7G:
-rw-rw—- 1 mysql mysql 1.7G Jun 16 17:16 stats#P#p_2007q2.ARZ
After issuing the query, I watched iostat. The file being accessed was completely absent from filesystem cache because i had previous run a long query against a large MyISAM table. This is what I saw, fairly consistently:
avg-cpu: %user %nice %system %iowait %steal %idle
24.96 0.00 0.75 0.88 0.00 73.42
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 48.55 0.40 62.44 0.60 7.36 0.00 239.32 0.39 6.20 5.42 34.20
As expected with a highly-compressed table, the query speed is CPU-limited: the disk can go much faster but we are fully utilizing one of the four cores (notice 25% cpu utilization).
Back-of-the-envelope calculations suggest that, at 7.3mb/s read throughput from disk, and a 1.7G file to be read, that the query will complete in about:
1700mb / 7.3mb/s = 232s = 3m52s
mysql> select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-31′;
+———-+
| count(*) |
+———-+
| 0 |
+———-+
1 row in set (3 min 52.19 sec)
Dead on!
the possibilities
What is interesting is that from this we can deduce the effective throughput of an archive table based on how much data the disk is feeding the cpu.
If we attempted to run such a query on the original MyISAM partition (about 11.5GB), in order to complete the result in 232s, we would require a throughput of :
11500 / 232 = 49mb/s
50mb/s is not lightning fast, it’s about what you’d expect from a single SATA disk, but it’s still fairly good considering the amount of compression we’re getting.
But wait! You still have 3 more cores left on the machine. Now to confirm another hypothesis:
Testing two simultaneous queries, against the same partition:
select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-29′;
select count(*) from stats where StatsTime between ‘2007-05-01′ and ‘2007-05-30′;
Most of the data was in fs cache, so our iostat looked like this for most of the time:
avg-cpu: %user %nice %system %iowait %steal %idle
50.70 0.00 0.85 0.00 0.00 48.45
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.50 0.00 0.80 0.00 0.01 13.00 0.00 0.38 0.37 0.03
hda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Final runtime for both queries was 208s and 217s; suggesting that the queries were almost perfectly parallelized.
If we needed to run a beefy calculation against a year’s worth of data, for example, we could run four parallel queries (one per core per quarter) and aggregate the results for each query in a separate client. Now our “equivalent” 50mb/s throughput on one core has become 50mb/s * 4 = 200mb/s.
If you get a slightly faster disk, you could fully utilize an 8-core box for a throughput of 400mb/s.
And if your disk space is 750GB, factoring in the compression ratio, you are storing the equivalent of 5TB of non-compressed data with a potential read throughput of 200mb/s. Not too shabby for low-end hardware!
Posted in mysql, performance |
June 18th, 2008 at 10:45 am
Hi!
great blog you have going here, very interesting subject.
Regarding this post, I was wondering, how does a good old MERGE table of compressed (myisampack -ed) MyISAM tables match up to a partitioned ARCHIVE table?
I have been meaning to figure this out myself, but it may interest you too, and it would be cool to compare it to the results you have now.
kind regards,
Roland
June 18th, 2008 at 1:12 pm
@Roland:
Hmm, I had completely forgotten about compressed myisam tables.
I believe compression on packed myisam tables is not as good because they use Huffman coding only and not zlib like the archive engine does. If CPU usage is not as high to read the data and the compression is still good, though, this might result in higher throughput.
When I get a chance I’ll do some tests on that same box using myisampack and post the results.
June 18th, 2008 at 1:27 pm
Atomic, myisampack also does data type optimization, like creating enum fields out of low cardinality string columns.
Anyway, it would be good to see the results next to each other.
Thanks in advance for looking into it
June 20th, 2008 at 8:37 am
[...] Tomic gives a howto on building a data warehouse on a budget with MySQL 5.1. He begins, “If there is one thing that a DBA or data warehouse architect can count on, it is [...]
June 21st, 2008 at 3:35 am
The other benefit of myisampack’ed tables is that you can have indexes as well.
The MySQL 5.1 documentation only mentions that ARCHIVE can have indexes on AUTOINCREMENT columns.
Nice project btw. Interested to see how it goes.
January 15th, 2009 at 10:42 pm
The Data Mapping engine in Data Transformation Server allows any-to-any transformations between different data formats. It includes complex data functions such as string, math, and conditional operations as well as DB and XML file look-up.