Working with some legacy data sets can feel like data archeology: you're out in the wilds, digging through files and looking for the pieces that tell you something. Sometimes, it literally is data archeology!
Most of the time, the actual files are machine-readable spreadsheets. They may not all be disasters, but they sure can be a hassle. Cleaning them up takes a lot of time and effort, and it has to happen before any analysis or visualization can be built from the data set.
With a lot of data, not a single thought was given during compilation and reporting to the question 'how will this data be processed?' — beyond the spreadsheet software it was compiled in. This has led to many …um… 'irregularities' in how the data is structured. Mostly this affects 'historic' data that predates recent data collection methods, and it's very common in legacy systems.
As a data scientist, it's easy to criticize these irregularities, but there are often very good reasons why a spreadsheet is structured a certain way. Most often, that software was the only tool available, and the users did the most they could with it. They've come up with many clever ways to get the job done.
We just have to get on with the job of extracting the data and transforming it into something we can use. Here's an open-source tool I've been working on to help with my own work.
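To make that "extract and transform" step concrete, here's a minimal sketch of the kind of cleanup these spreadsheets typically need, using pandas. The data here is hypothetical — a made-up export with decorative title rows and merged-cell blanks, two of the most common 'irregularities' — and the column names are invented for illustration.

```python
import io
import pandas as pd

# A hypothetical messy export: two decorative title rows,
# and merged cells in "Region" that export as blanks.
raw = """Quarterly Sales Report,,
Compiled by hand - do not edit,,
Region,Quarter,Sales
North,Q1,100
,Q2,120
South,Q1,90
,Q2,95
"""

# Skip the two title rows so the real header row is used.
df = pd.read_csv(io.StringIO(raw), skiprows=2)

# Merged cells come through as blanks; forward-fill restores them.
df["Region"] = df["Region"].ffill()

# Coerce the numeric column, turning any stray text into NaN.
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

print(df)
```

After these three small steps, the table is rectangular and typed, and the usual analysis or visualization tools can take it from there.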
What are some of the worst data disasters you’ve had to deal with? Add your comments to this article, or share your stories in a very brief survey.
Check out the next article in this series: Dealing with Data Disasters: Simple Fixes.