After the third iteration of the clients’ data, the validation and data cleaning started getting out of control. If I did not do something more efficient, I was going to run out of time or royally screw up.
It was early November in Michigan. I had just changed my car over to snow tires, and it was the first day the temperature dropped into the low 30s °F. As I gazed outside, I thought about some of the analytics lessons from the Supply Chain project I was working on. One of the biggest was the need to manage, clean and prep the data in a systematic and repeatable fashion. Not only did the data need a lot of manipulation on this project, but the scope of what we had to cover was fluid. Below is the process I used to stay on top of the data and avoid getting ‘punched in the face’ by multiple data extracts.
1. Communicate – Create a data request and keep a data registry to track what you have received
While this may seem like pretty basic project management, it is really important to communicate clearly what you need and what you have received. Your colleagues will forget what you require and whether more follow-up work is needed.
I started the project with a data request template outlining the data I wanted, together with an example table of what the information would ideally look like. In every project I have done, what you end up receiving is usually vastly different from what you originally asked for. This is OK. Every company operates a little differently and holds information at different levels of granularity. As long as you know the themes of what you want and the time frame, you can usually find enough to make your analysis work as you iterate through your data request.
The next step is to put together a data register. I sent this register out every two days during the data collection phase. It had the data I needed in one column, a colour-coded status column (complete / in progress / not received), a date received column and a comments column (see below). A simple template like this helps you communicate with the people you are working with, and also helps you manage upwards by showing where you are running into obstacles.
Chart 1: Data Register
2. Document control – Carefully file all data received in a directory
Again, this may seem simple, but it is very easy to use the wrong version of a piece of data. Years ago I worked on a construction site and could not understand why the contractor kept pouring the foundations of a power station with 5 fewer reinforcing bars than the drawings called for. I argued with him each time and made him add the 5 missing bars, but he kept repeating the mistake. When I asked to see his drawings, it turned out he was working from a preliminary document issued 3 months earlier. It is the same with analytics: document control is crucial when you have lots of data sources with many versions.
One of the processes that helps me is to create a directory for all documents received. Whenever I get an attachment from an email or a data extract on a USB stick, I copy the file and move it into a ‘Received’ directory. I then create a new folder with the structure ‘Date (File Name)(Who from)’. For example:
2016_11_01 (Product Master)(Mr A)
2016_11_03 (Site Geocodes)(Miss B)
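This filing step is easy to script. Here is a minimal Python sketch of the convention above; the helper name `file_received` and its arguments are my own illustration, not part of the original process:

```python
import shutil
from datetime import date
from pathlib import Path

def file_received(src, sender, received_dir="Received"):
    """Copy an incoming file into a dated folder named
    'YYYY_MM_DD (File Name)(Who from)' under the Received directory."""
    src = Path(src)
    stamp = date.today().strftime("%Y_%m_%d")
    folder = Path(received_dir) / f"{stamp} ({src.stem})({sender})"
    folder.mkdir(parents=True, exist_ok=True)
    dest = folder / src.name
    shutil.copy2(src, dest)  # copy2 preserves the file's timestamps
    return dest
```

Because the original file is copied, not moved, the attachment you received stays untouched as an audit trail.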
Furthermore, I date and time stamp every document I work on so I know which version is the latest. I learnt this from a former US Army officer who was very meticulous about document control: he used the standard format ‘File Name – Date Time’. With this technique your directory automatically sorts to the latest file, so the versions are clear. For example:
Product Master – 20161101 1400.xls
Baseline Presentation – 20161103 0900.ppt
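The naming convention can be sketched in a few lines of Python; the helper `stamped_name` is my own illustration of the format:

```python
from datetime import datetime
from pathlib import Path

def stamped_name(base_name, when=None):
    """Append a 'YYYYMMDD HHMM' stamp so files sort chronologically."""
    when = when or datetime.now()
    path = Path(base_name)
    return f"{path.stem} - {when.strftime('%Y%m%d %H%M')}{path.suffix}"

names = [
    stamped_name("Product Master.xls", datetime(2016, 11, 1, 14, 0)),
    stamped_name("Product Master.xls", datetime(2016, 11, 3, 9, 0)),
]
# Because the stamp sorts lexicographically, a plain sort puts the
# latest version last: sorted(names)[-1] is the 2016-11-03 version.
```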
3. High level checks – Use totals and row counts to check the new data received is correct
The first thing I do when I get a new extract is a very quick check on whether it contains everything I think should be in there. For smaller data sets Excel is often the best option: create one or two pivot tables. This simple check will save you many hours if it identifies obvious errors in the data straight away. Some simple checks are:
- Totals – do the total values look correct e.g. is annual volume what you expect?
- Row counts – are there duplicates in your data that are increasing your row count?
- Categories – are all the categories you expect in the data?
- Dates – are the date ranges correct?
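The four checks above can also be sketched in plain Python; the rows and column names here are invented purely for illustration:

```python
from datetime import date

# Hypothetical extract: each dict is one sales record.
rows = [
    {"site": "A", "date": date(2016, 1, 15), "volume": 120},
    {"site": "B", "date": date(2016, 2, 3),  "volume": 80},
    {"site": "B", "date": date(2016, 2, 3),  "volume": 80},  # duplicate
]

# Totals - does annual volume look right?
total_volume = sum(r["volume"] for r in rows)

# Row counts - are duplicates inflating the count?
row_count = len(rows)
distinct_count = len({tuple(sorted(r.items())) for r in rows})

# Categories - are all the sites you expect actually present?
expected_sites = {"A", "B"}
missing_sites = expected_sites - {r["site"] for r in rows}

# Dates - is the date range what you asked for?
first, last = min(r["date"] for r in rows), max(r["date"] for r in rows)
```

If `row_count` and `distinct_count` disagree, you know straight away the extract contains duplicates worth querying with the data owner.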
4. Automation – Use scripts to quickly repeat data cleaning
Automating the data cleaning is really important when there could be updates to your data. While there is a little upfront work to write the script, once in place, the script allows you to very quickly do lots of updates in a short period of time.
Some of the cleaning at this stage could cover:
- Duplicates – remove duplicate entries
- Outliers – remove outliers
- Null / Zero values – re-code null or zero values to averages or agreed upon assumptions
- Format – change formats e.g. dates
- Calculated fields – add columns to calculate useful pieces of information e.g. Gross Profit, Inventory Turns etc
- Enrich – combine tables so all the information you need for your analytics sits in one table e.g. Product attributes, Customer ZIP codes etc
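A few of these cleaning steps can be sketched in plain Python; the table, column names and recoding rule below are illustrative assumptions, not the project's actual data:

```python
from statistics import mean

# Hypothetical raw sales rows after import.
raw = [
    {"sku": "X1", "month": 1, "revenue": 100.0, "cost": 60.0},
    {"sku": "X1", "month": 1, "revenue": 100.0, "cost": 60.0},  # duplicate
    {"sku": "X2", "month": 2, "revenue": None,  "cost": 40.0},  # null value
]

# Duplicates - keep the first occurrence of each identical row
seen, rows = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(r))

# Null values - re-code nulls to the average of the known values
known = [r["revenue"] for r in rows if r["revenue"] is not None]
avg_revenue = mean(known)
for r in rows:
    if r["revenue"] is None:
        r["revenue"] = avg_revenue

# Calculated fields - add a Gross Profit column
for r in rows:
    r["gross_profit"] = r["revenue"] - r["cost"]
```

Whatever tool you use, the point is the same: each cleaning rule is written down once, so re-running it on the next data extract costs nothing.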
To automate the data cleaning on this project I used SQL Server. I wrote a short script using a series of views to iterate through all the data cleaning. The views allowed me to chain the cleaning steps without creating lots of working tables in the database. Once I was happy with where the views had reached, I would write a final table to use for my analytics. SQL Server is useful because the Express edition is free, and there are just a couple of tricks to get this to work.
4.1 Import all raw data tables
I first imported all the raw data tables and labelled them in the format ‘A_RAW0X_Name’ in the SQL database, e.g. A_RAW01_Sales or A_RAW02_SiteMaster. The prefix ‘A’ sorts all the raw data together and distinguishes it from ‘B’, the cleaned tables, while the ‘RAW0X’ number sub-orders the raw tables.
4.2 Create views
Views are a terrific way to clean the data without having to write lots of intermediate tables to your database. I like to break the data cleaning into many small sub-steps, making lots of intermediate views that build on one another. For instance, if I want to create a view from the raw sales table ‘A_RAW01_Sales’ that changes the month number into a text description, I might use the following SQL code:
IF OBJECT_ID(N'v_A_Raw01_Sales02_Month', 'V') IS NOT NULL
    DROP VIEW v_A_Raw01_Sales02_Month
GO
CREATE VIEW v_A_Raw01_Sales02_Month AS
SELECT *,
    CASE [MonthNumber]
        WHEN 1  THEN 'JAN'
        WHEN 2  THEN 'FEB'
        WHEN 3  THEN 'MAR'
        WHEN 4  THEN 'APR'
        WHEN 5  THEN 'MAY'
        WHEN 6  THEN 'JUN'
        WHEN 7  THEN 'JUL'
        WHEN 8  THEN 'AUG'
        WHEN 9  THEN 'SEP'
        WHEN 10 THEN 'OCT'
        WHEN 11 THEN 'NOV'
        WHEN 12 THEN 'DEC'
    END AS [MonthName]
FROM A_RAW01_Sales
GO
You will notice that the script first checks whether the view already exists (‘IF OBJECT_ID…’); if it does, the view is dropped and recreated. You will also see the term ‘GO’ in the script. This is a batch terminator that tells SQL Server where one batch of code ends – see more at SQL GO Command. This is necessary when you write a script with multiple views.
In the next step of the data cleaning process you may then want to create a second view that builds on the previous one. For example, to keep only years greater than 2015 you might write code such as:
IF OBJECT_ID(N'v_A_Raw01_Sales03_Daterange', 'V') IS NOT NULL
    DROP VIEW v_A_Raw01_Sales03_Daterange
GO
CREATE VIEW v_A_Raw01_Sales03_Daterange AS
SELECT *
FROM v_A_Raw01_Sales02_Month
WHERE [Year] > 2015
GO
The beauty of this process is that you can do one view at a time to build up the final clean table for your analytics. Here we create a view ‘v_A_Raw01_Sales03_Daterange’ based on the previous view ‘v_A_Raw01_Sales02_Month’. Once you are ready for the final table, you can use a SELECT * INTO command to write the ‘cleaned’ table. For instance, if I wanted to write the last view ‘v_A_Raw01_Sales03_Daterange’ as the final table, I would use code such as:
IF OBJECT_ID('B_Clean01_Sales', 'U') IS NOT NULL
    DROP TABLE B_Clean01_Sales
GO
SELECT *
INTO B_Clean01_Sales
FROM v_A_Raw01_Sales03_Daterange
GO
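If you want to experiment with this chained-view workflow without installing SQL Server, it can be mimicked end-to-end with Python's built-in sqlite3 module. Note that SQLite uses CREATE TABLE … AS SELECT in place of T-SQL's SELECT * INTO, and the sample data below is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A_RAW01_Sales (MonthNumber INT, Year INT, Volume INT);
INSERT INTO A_RAW01_Sales VALUES (1, 2015, 10), (2, 2016, 20), (3, 2016, 30);

-- View 1: add a month name (abbreviated to three months here)
CREATE VIEW v_A_Raw01_Sales02_Month AS
SELECT *, CASE MonthNumber WHEN 1 THEN 'JAN' WHEN 2 THEN 'FEB'
                           WHEN 3 THEN 'MAR' END AS MonthName
FROM A_RAW01_Sales;

-- View 2: builds on view 1 to restrict the date range
CREATE VIEW v_A_Raw01_Sales03_Daterange AS
SELECT * FROM v_A_Raw01_Sales02_Month WHERE Year > 2015;

-- Materialise the final cleaned table from the last view
CREATE TABLE B_Clean01_Sales AS
SELECT * FROM v_A_Raw01_Sales03_Daterange;
""")

rows = con.execute(
    "SELECT MonthName, Volume FROM B_Clean01_Sales ORDER BY Volume"
).fetchall()
# rows -> [('FEB', 20), ('MAR', 30)]
```

The 2015 row is filtered out by the second view, and only the final table is physically written, which is exactly the appeal of the approach.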
5. Documentation – Document the final files and assumptions used in your data cleaning
Once the entire data collection has been finalised, it is helpful to document the final data versions. It is also helpful to document all the major assumptions used to clean your data. Why bother doing this? You will forget the details once you move on from this phase and you will save yourself time when you get follow up questions later. This documentation is also helpful when you have to hand the project over to someone else.
That wraps up this blog on taking control of your data. Hopefully the above gives you some ideas to help you become more effective at managing your data on your analytics projects.