A case for Kettle for your next ETL or data warehouse project

May 10th, 2008 by atomic

I am, for the most part, a do-it-yourself type of person. I fix my own car if I can; I even have four healthy tomato plants growing in pots outside as we speak — the plants will take that little extra CO2 out of the air and give me great tasting tomatoes (soon… i hope!)

But I digress.

Whether to use an ETL tool such as Kettle (aka Penatho Data Integration) for a project involving large data transfers is a typical “build vs. buy” type of decision, one that is fairly well understood and I don’t wish to repeat it all here — putting together some Perl scripts to do the job, you typically get great performance, development speed and accessibility. This would need to be balanced against the benefits of ETL tools and their potential drawbacks (development speed, license costs and performance implications).

A few things have happened in the last couple of years that make some of the typical reasons why you’d build your own no longer valid.

cost

Kettle is an open-source product, so not only is it priced right, you have access to the source code, which can be surprisingly handy. The 2.x series had some nasty bugs, and there were many times we dug into the source code to diagnose problems at my previous company. Don’t worry though, the 3.0 series from my experience is vastly improved.

a new multi-core order

Entry-level desktops today typically come with dual-core CPUs, and four or more cores on server hardware
is now common. This trend will continue as chip makers shift towards adding cores to improve performance rather than simply increasing clock speed.

I’m willing to bet that your average Perl ETL script is not written in a multi-threaded fashion. Perhaps you can architect the process so that you can run several of them at once to take full advantage of the hardware, but this is essentially polluting your ETL logic with hardware dependencies.

With Kettle, leveraging all the cores on your ETL machine is almost free. Each transformation step instantiates a thread within the java virtual machine (JVM) which in turn will be run as a native thread. If you have more than a few steps in your transformation, and data is flowing through fairly evenly (no large bottlenecks), chances are that you are coming close to maximizing use of your hardware.

Surprisingly there are some still out there that have been underneath a rock since 1998 and think java is “slow”; all modern JVMs just-in-time compile your Java code natively and performance in some cases can be faster than C/C++. It is definitely much faster than any interpreted language like Perl.

In my experience, I’ve noticed Kettle can be much faster at sorting and grouping large volumes of data than MySQL itself, at least out of the box, without turning too many knobs on the MySQL end. Good software should do that. I will post some tests of this assertion when I get a chance.

visibility

Kettle allows me to see, if I want to, exactly how much data each step has processed and get row counts/second to measure throughput and performance. Now that I’ve been spoiled, I can never go back to issuing a query like the below:

SELECT uid, count(*)
FROM hundred_million_row_table GROUP BY uid;

The corresponding operation in Kettle:

Spoon UI in action

No more running iostat, top and hacking around to get a vague idea how long it might take MySQL (or whatever RDBMS) to run that beast of a query! Kettle gives you output at select intervals, so you can see how many rows the sort has processed (the most time consuming part of this process).

Posted in kettle, performance |

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.