Data, data and more data

A wise man has told me that blogging my experiences regularly will improve my writing and help to organize my thoughts more effectively. With this in mind I’m endeavoring to update this blog more often in the hope that it works. Typically I find it quite difficult to write and some of the recent conversations I’ve had with colleagues would indicate I’m not alone especially when writing something more detailed than an email. Anyway, time to bang on some more about the Indicators project which is occupying a lot of my mental space of late.

The Indicators project is gathering data from two primary sources which are enterprise databases that hold an enormous amount of data. For example one of them has approx 200 million row while the other about 500 thousand rows. Databases this size can be very slow to query so we decided to rationalize the data to just the terms and student cohorts we are interested in. This will reduce the total data set size to around 15 million rows which should give us a performance boost when executing large queries. Another obstacle we face is that we only have quite restricted views into the data we are analyzing and as a result we can’t create temporary tables to store processed data near to the source and this led to the current plan which is to migrate the applicable rows off their current hosts an onto a host of our own devices where we can manipulate the data sets to suit our needs.

Initially we used MySQL on some antiquated hardware but found there were several limitations with this approach

  • The query times were ok however the updates were particularly slow.
  • We had great difficulty in migrating the data from enterprise system onto MySQL simply due to the size of the data.
  • The hardware is quite dated and crashed several times at critical moments.

The plan now is to use MS SQL Server 2008 on an updated PC and so far the results are more heartening however being a typical MS product there are some annoyances with using this approach especially when most of the team members are using Macs which adds more difficulty to process. Our early observations are good in that it appears relatively simple to migrate the data onto our system so we can start the difficult part which is trying to make sense of the extracted data.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s