On the need for an agile approach to data warehousing

January 27th, 2009 by atomic

I’d like to take a step back from technical issues to distill some of my thoughts on the challenges of data warehousing in the 21st century.

Having worked on a number of warehouse projects in different industries over the years, I’ve encountered many challenges, some failures, some successes. One thing is certain: all organizations that have a reasonable amount of data should be building a data warehouse if they don’t already have one. In 2009, given the economic atmosphere, no one wants to wait as long, or pay as much, as they did in 1999 to get one.

While this is a huge opportunity for open-source competitors like MySQL, it comes with big challenges for an organization that thinks it will get a $10MM warehouse (in 1999 dollars) for $300,000 (2009 dollars).

My contention is that in a web-connected, high-traffic and high-speed world, a monolithic approach with a rigid set of requirements, and a project team isolated from the business through layers of product and project managers is an almost guarantee of a disastrous investment of resources into a product that will not satisfy the needs of an ever-changing business. Most of my reasons for believing this follow directly from the arguments made in favour of agile development in software engineering. In a data warehousing effort, I believe some of these reasons are even more compelling, yet many do not see a warehouse project as a typical example of something that can be made “agile”.

For those unfamiliar with agile development, I’ll crudely summarize it as follows: a project that is lightly specified and built with a continuous loop of user/business feedback is more flexible and produces better results. This usually takes the form of implementing small, usable features called “stories” every few weeks. They can be thought of as a series of small prototypes. The strongest argument for developing in this way is that requirements and the business environment can change rapidly. Users may know what they want at a given point in time (and even that’s a somewhat optimistic assumption) but may change their minds and produce more innovative results if they are able and encouraged to make course modifications after seeing initial prototypes.

In undertaking a new, or improved warehousing effort in an organization, it is crucial that users have access to and are encouraged to use new data that are being produced. This makes users more comfortable with the results they work with, easing over time any psychological aversion they may have to trusting new spins on numbers. More importantly, developers can write specified test cases until they are blue in the face, but nothing can replace the eyeballs of a user that can immediately tell when numbers do not look right.

In a monolithic approach, a bug can creep into the processing pipeline and cause many GB or TB of data to be incorrectly processed until someone notices it, if no one is working with the data. If users frequently make use of new data, bugs can be caught quickly and the time spent fixing them far less.

A working example

While this sounds good in theory, an entire book could be written on how examples of this could work in practice. I’ll just use a contrived example that i’m sure many can relate to.

Your company has grown quickly in the last couple of years, and has always collected in log form tons of data that never get used. Other priorities prevented much from being done with it, but recently someone in management read an article in a CIO/CTO magazine about “data mining” and now you’ve been stuck with the task of buiding a warehouse with all of this data.

In a monolithic, 1999-style approach, the CTO would have Oracle or IBM come in with a team of 10 highly-compensated consultants. A few feasability studies, $10MM and 2 years later, you’d have a brand-spanking-new data warehouse full of data that a large majority of which the spindles will not touch even once.

Luckily, we are in more enlightened, frugal times, and many companies now undertake this effort themselves after recovering from Oracle sticker shock. Here’s a series of starting tips to the company in the situation above:

  • This is kind of a rule of thumb, but to begin with, forget about all historical data that is too old to impact payments or revenue. If your company reconciles payments and/or revenues, say, every 60 days, the degree of interest your analysts and management team have in data from 61 days ago compared to 59 days ago is probably substantial.
  • Listen to the CTO’s and management team’s vision for what they want to do with all this data. Listen to their thoughts and keep them in the back of your mind, but alsoremember that Rome wasn’t built in a day
  • Now that you’ve discarded the mental burden of having to come up with a way of scaling your computations on many years worth of data, get an off-the-shelf box with a couple of large hard drives, talk to your analysts about something interesting that can be done with the last 60 days worth of data with a maximum development time of 2-3 weeks. This will force the people that work with your data on a daily basis to decide what is most important to them. Be highly suspicous of those that say they need “everything”
  • After 3 weeks, you should have a self-maintaining system, processing whatever it processes on its own and without your intervention. Show it to the anlaysts, and get their feedback. If all goes well, they will be excited to use the new data and it will very quickly becomea critical component of their jobs . Remember another rule of thumb: once users have been given data, and actively use it, you take it away from them at your peril!
  • If your first “user story” was a success, most likely you will have to fight with the analysts over what data gets included next. Again, do not reprocess back data no matter how much they say it “would be useful” — stick to that 60-day window.
  • Rinse. Repeat.
  • After a few iterations, you should be in the enviable position of having a hacked together system running on an off-the-shelf box. And this is fine, because far more important things are solid — the user’s confidence in the data they are getting, the management team’s confidence that you are giving the business the data they need, and your confidence that what you are building is being used.
  • Now sit the managers and analysts down for “the talk” in which you describe to them that you need the next 3 weeks to prevent the system from breaking. This may involve getting proper hardware, or simply re-architecting the system. They will hate you for it just like a child will hate you for taking their steady supply of candy away, but remember that you must be a responsible parent.
  • Rinse. Repeat.
  • Receive accolades.

I hope that no one takes my tips too seriously. Certainly referring to your users as “children” will not win you many allies. I also certainly do not advise a completely chaotic development model with zero planning. But I think those of you that have been involved in multi-year data warehouse projects with vast project teams doing gap analyses, feasibility studies and book-length project plans will agree that working on something collaboratively over a few short iterations can’t be all bad. You may even give the users something they want and not break the bank while you’re at it.

Posted in agile development, data warehousing |

7 Responses

  1. Roland Bouman Says:

    Hi!

    Great post! As it happens I’m pretty much in this situation right now. So far, the target audience likes the approach, but there certainly is some entrenchment going on among the IT staff…

  2. Jeff Says:

    Alternatively, you could get an instance of HDFS running, copy your logfiles into HDFS, and start running MapReduce against them that day. When you decide to provide SQL access to end users, you can layer Hive on top of HDFS for some of the data.

    A big advantage of Hadoop that I’ve seen is that you don’t have to write schemas to get started, and you never have to worry about what data to keep and what data to throw away. Just stuff it all into HDFS and add structure as needed with Hive.

  3. atomic Says:

    @Roland: Ah yes, that’s a common problem. In the end you need someone up above that believes in this approach or else heels get dug in… hope it works out!

    @Jeff: That’s definitely a good approach and I suppose is consistent with what I’m suggesting — i’m not really advising any particular technology, or way of processing data, just a way of interacting with the business side and getting data into their hands. If you already have a HDFS cluster, great, but some people may start with literally nothing of the sort. In the end, the end-users’ skill level is important, but i’m operating under the assumption that analysts will do direct SQL queries, _maybe_ some extra-SQL scripting/processing, while managers strictly use canned reports and GUIs. How relevant these assumptions are to your situation of course make a big difference. I do know from first-hand experience though that Oracle consultants go into involuntary spasms upon hearing “Hadoop” :-D

  4. Log Buffer #133: A Carnival of the Vanities for DBAs Says:

    [...] those who have an interest in database warehousing, comes this interesting post from Alex Tomic on agile development with data warehousing. Bet you never though you would hear those two phrases [...]

  5. SS Says:

    great set of posts.

    We are just starting out on a 1T-5 T DW for a web startup. I am curious- do you think MySQL is upto it (w/o resorting to the new fangled INFOBRIGHT) etc. ?

    Any pointers on published /web resources that even vaguely describe multi T MySQL DWes ?

  6. Alex Says:

    @SS: What I believe is that with data warehousing (and probably anything IT-related), as you spend more money on some solution, really what you are paying for is maybe two things — the convenience of someone building something for you, and much more importantly, a support policy (i.e. the ability to divert blame to a vendor if a project falls on its face)

    So, if you’re working for Big Corporate(tm) and are in a position of relative power, a “community” edition of MySQL may not be appropriate, not because its not good enough, but because you’re dependent on the goodwill of the community, and your own team, to help you if you run into trouble.

    However, if you’re working for a non-politicized web start up where just getting stuff done cheaply is important, you’d be amazed how far you can stretch mysql.

    Where you are along that spectrum between saving cold hard cash and buying merely a security policy dictates what product you go with (assuming you have unlimited cash or unlimited talent/willingness to go it alone). While I haven’t actually tried it, based on the specs alone, InfoBright does seem to be a nice middle ground between a budget-busting Oracle RAC DW and a homegrown mysql setup.

    You will find some info on blogs here and there that are mysql-specific, but i think that mysql as a DW solution doesn’t seem to be extremely popular just yet, or, most smaller companies build basic MySQL solutions themselves until they have enough money and scale to go with an established vendor. Take a look at Ralph Kimball’s warehousing book, it’s like the bible of DWs.

  7. Alexwebmaster Says:

    Hello webmaster
    I would like to share with you a link to your site
    write me here preonrelt@mail.ru

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.