Don’t make a Pollocks of your data!

I love data. I spend lots of time looking at it. I spend lots of time telling people that I’ve been looking at it, and what interesting things I’ve found. People tend to look at me sympathetically at that point and get back on with doing real work.

Something I’ve not spent much time doing of late is looking at the actual data. What I usually look at is prettified abstractions, aggregations and representations of the ‘data’ served out through random number generators like Omniture.

In the process of rebuilding that hallowed monolith to SME culture that is BT Tradespace, I’ve been having to think about real data again, or more specifically, how it’s structured in databases.

To most people, ‘data’ is the amorphous soup of information that underpins the interweb, and which barely manifests itself other than as the latest Stephen Fry tweet.

But the data out back comes in various forms and flavours. And how you store and access it is vitally important when constructing a web application.

I was intrigued when I read this article, with its sensational headline claim and its straightforward explanation of the emerging new world of online data. So much so, I sent the link to our Chief Technical Architect, Richard.

His response caused a small stumble in my generally rampaging ego. What the article suggests is that, for the sake of ease and extensibility, you could forsake the relational database altogether and store the data however it damn well suits you. BT Tradespace (both V1 and the nascent V2) uses Endeca at its core which, as Richard irritatingly pointed out, forsakes the relational database for the sake of ease and extensibility! Now, Richard is not at all smarmy, but for the sake of dramatic effect you should think of him so.

The subsequent conversation down the pub developed this theme. Richard suggested that, for your website, you could store your data however you wish, across 20 databases if you so felt. After all, our service orientated application is quasi-anarchic in structure, so having a highly structured database supporting it seems a little strange anyway, right? This elicited a response from me along the lines of “get your dirty mangling hands off my data!” What actually happened is that I launched into a lengthy diatribe about future proofing your data. Here is a shortened version:

You see, we do store our data across, like, 20 databases on Tradespace V1, and it causes us a right royal collective headache. When we built V1, no-one really thought that much about what we’d need to do with the data further down the line. So they just threw the data architecture together as suited the application developers (even if in many cases this defied common sense). The data is all over the place, and most of the logic for accessing it lives in the dark recesses of the application itself. So querying said data is extremely difficult and costly.

So when Richard suggested that we could (theoretically) do something similar, my data sensibilities went into spasm. I need to be able to use our data – not just to relay blog entries to the web page, but to understand all the wonderful things that happen on our website and work out how to do stuff better. If you just spread it around all unevenly like butter on a Tesco sandwich, doing so may not just be hard, it may be impossible.

The point is this: if for the sake of convenience, performance, extensibility etc. you want to go all Jackson Pollock on your data, then fine, just make sure you’re also keeping it somewhere else for other, more structured uses.

This is, of course, all moot, and I was preaching to the converted with Richard. We’d already figured this one out for V2. We just ferry the data off to the data warehouse as it emerges, leaving the application/operational database to do as it will. For that database we need to think about scale and performance, flexibility and extensibility. If that means relinquishing structure, then so be it. I’ve got all my data in a very structured form elsewhere.
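For the code-minded, here is roughly what that idea looks like as a minimal Python sketch. To be clear, this is not how Tradespace V2 actually does it: the in-memory dict standing in for the operational store, the SQLite database standing in for the data warehouse, the `blog_entries` table and every function name are assumptions made up purely for illustration. The point is only that each write lands twice: once in whatever shape the application fancies, and once in a fixed, queryable shape.

```python
import json
import sqlite3


def open_stores():
    """Create a loose operational store and a structured warehouse (both hypothetical)."""
    operational = {}  # schemaless: entry id -> JSON blob, shaped however the app likes
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
    warehouse.execute(
        "CREATE TABLE blog_entries ("
        " entry_id TEXT PRIMARY KEY,"
        " author TEXT NOT NULL,"
        " posted_on TEXT NOT NULL,"
        " title TEXT NOT NULL)"
    )
    return operational, warehouse


def save_entry(operational, warehouse, entry_id, doc):
    """Write the document as-is for the application, then ferry the
    fields we will want to analyse later into the structured table."""
    operational[entry_id] = json.dumps(doc)
    warehouse.execute(
        "INSERT OR REPLACE INTO blog_entries VALUES (?, ?, ?, ?)",
        (entry_id, doc["author"], doc["posted_on"], doc["title"]),
    )
    warehouse.commit()


def entries_per_author(warehouse):
    """The kind of question that is painful to ask of the operational store."""
    rows = warehouse.execute(
        "SELECT author, COUNT(*) FROM blog_entries GROUP BY author ORDER BY author"
    )
    return dict(rows.fetchall())
```

The application can stuff anything it likes into `doc` (extra fields, nested junk, whatever), and the operational copy keeps all of it. The warehouse copy only takes the agreed columns, which is exactly what makes it boringly easy to query later.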

All this seems so obvious, yet we broke this rule on V1, and I’ve seen it broken (usually by overly creative application developers) in countless other places. This is horses for courses. It’s unlikely that you only have one use for your data, and although storing it in several places seems wasteful, it’s generally essential to maintain its integrity for its multiple uses. Again, blindingly obvious, but here’s the rub: doing this retroactively is often spectacularly difficult and expensive.

Respect your data from day one, as I can guarantee you’ll need it later.

This entry was posted by Alex Loveless on 1 Mar 2009 at 22:04 and is filed under V2.
