Anyone who analyzes data knows that often there is more time spent transforming data into a usable format rather than actually using the data.
This is especially true when operating in a context that includes multiple independent systems that create and format data in a proprietary manner or when working with applications that simply don’t do a very good job generating data in a usable format.
This latter condition being distressingly common.
Therefore cleaning source data to make it usable is both an extremely important process and also extremely common.
And quite often very time-consuming and rife with frustration and repetition.
There are excellent data cleaning tools such as Python and pandas or Tableau, but Excel is also a power-house tool for cleaning data, and it is quite often easier and quicker to clean data directly in a source spreadsheet.
So what constitutes the process of cleaning data?
Basically the cleaning process is about getting the data into a form that is most usable to an analytics process.
This could include some or all of the following:
- Remove spaces
- Rename headers
- Reshape dates and times
- Make field data atomic
- Remove empty fields
- Make naming conventions consistent
Plus many others depending on need and desire;
Every process will have its own context and requirements, but if you are doing any of those things listed above then you are cleaning data.
What follows here are my top tips for cleaning up your data using Excel.
Most have been grindingly learned through the arduous process of cleaning up way too much horribly formatted data. Hopefully these tips can save you a little of the frustration that I have felt over the years.
For convenience and in an effort to more closely follow the data cleaning process, these tips are arranged in a roughly chronological order starting from the beginning of the data clean process to the end.
Some of these tips are Excel specific, but others are more about process rather than tooling and could be used with a wide range of applications.
Always remember the goal of cleaning data is to transform the data into a format that is conducive to analysis not presentation. This is a key element of cleaned data. Data cleaning is the first step to creating a presentation or a visualization, not the presentation or visualization itself.
Before you start
Know where you are going with the data
Is your data going to the web?
To a .csv file?
To the CFO in a presentation?
To a project stakeholder?
To an infographic?
What do you need from the data?
What can be discarded?
What needs to be highlighted?
These are all questions to ask as you begin the process of cleaning data.
To be certain, you may not have all the answers at an early stage, but the more understanding you have of where the data is going and what the data usage context will be, the easier the data cleaning process itself will be.
As always knowing where you will end up usually makes the process of getting there much easier.
Study the data
Perhaps this is obvious, but I’ve seen people just start tearing data apart without even taking a small amount of time to really look at it.
How much data is there?
Is it numbers?
Are some fields dependent on other fields?
How many rows?
How many columns?
Although this may seem like analysis, it is really a pre-analysis. You’re not trying to find the narrative in the data, you’re just trying to figure out what data is actually here and how it is structured. This is not really a part of the analysis process, since you are not yet at the point where you can ask specific questions of the data, but it is very much a key part of the data cleaning process.
Make a copy of the raw data
This is a pretty obvious one, but it’s easy to forget in the haste to get started.
Make a copy of the data or at least know that you can regenerate the same starting data again if you have to do so.
There are a couple important reasons for this.
- If something goes horribly wrong while cleaning the data, you can at least go back to your origin point and start again.
- It is always important to be able to check the cleaned data against the original data to make sure you haven’t messed up some part of the data cleaning process and changed the data incorrectly.
Once you begin
Create a new tab for each significant step
This tip is a little more Excel-specific than some of the others, but I just find it a really useful way to keep my process organized.
As you move through each part of the data cleaning process, copy all the data from one completed step to a new tab. Then begin your next step on the data in the new tab.
This way you have a sort of timeline of your process, and you can return to previous steps without having to go all the way back to the beginning.
If you want to try a different approach or something didn’t work out as intended you can simply click into a previous tab and return to a known good point of the process and began again from there.
Delete everything you don’t need
Get rid of as much extraneous data as you can right at the beginning. If you know you don’t need it, just delete it. Simplify the data that you need to work with as much as possible.
Make two sets of data if you have two very different needs
If you are are going to be doing very different things with the same base data consider setting up two different subsets of the data.
For example, if one aspect of your data analysis is concerned with change over time, then delete everything that doesn’t impact that analysis.
If you are also going to be analyzing frequency of occurrence in a separate process then create a dataset that just keeps the essential data you will need for that task.
Again the less extra data in the set, the easier it is to clean.
One size rarely fits all needs.
Break large chunks of data into smaller chunks
A common problem with data is that the fields you have to work with are not as atomic as they need to be.
Too many different types of data in one cell is one of the biggest issues in data cleaning, and splitting those big chunks of data into smaller chunks across multiple cells is one of the most time consuming yet important aspects of data cleaning.
When confronted with a big chunk of data stored in a single cell, look at that data as a whole first and try to split the single big chunk into two smaller chunks.
Then split the two chunks into four and so forth until you the cell data is as atomic as you need.
This approach usually works much better than trying to split data from one big chunk into multiple smaller chunks all in one go.
Then fix the small chunks
Once you have the big chunks split into small chunks there should be less complexity to the data, and the smaller chunks will be easier to clean. If necessary repeat the process of slowly reducing the size of each cells data chunk.
Prefer splitting existing data over creating new data
This particular tip does vary somewhat based on context, but generally it’s easier to split apart existing data to get the atomic granularity you need rather than copying data into a new cell.
Don’t create more data.
You already have enough.
Split what you have into the proper chunks, one chunk per cell.
When data is copied you end up with multiple pieces of the same data in slightly different forms.
If you simply split the data you will always have just one set of data.
Use find and replace
I always prefer to use the simplest tools possible to get the job done and tools don’t get any simpler in Excel then find and replace.
The most important aspect of using find and replace effectively in Excel is to limit the scope of the find and replace operation by selecting the data area first.
The selection creates a range in which the find and replace will operate.
In many cases this is best done on a column by column basis.
For example, find all spaces in a column and replace them with nothing.
Use text to columns
Excel has several different ways to move data chunks from one column to another. a variety of formulas can pick out specific data and the flash fill feature can be helpful, but I find the best and most straightforward way to use the text to columns tool.
Text to columns often requires some clean up work to isolate some kind of delimiter to split on, but once you have that delimiter text to columns is accurate and efficient.
Copy/paste as values only to clean up formulas
Getting rid of formulas is of high importance when cleaning data.
This is because cell data that is created by formulas usually depends on other cell data to create that calculation. If the source cell gets removed or changed then the formula in the dependent cell will break.
Really this tip could either come at the beginning or the end of the list.
Which is what working iteratively is all about: a process that is repeated multiple times.
You won’t get the cleaning right or perfect the first time no matter how thoroughly you plan and consider. A big part of the cleaning data process is the learning that goes along with it; you will learn about the data as you try to clean it.
It may take several iterations of the cleaning process before you really have the things figured out.
This is completely normal.
Experience will always be a guide to help you anticipate potential issues and plan ahead but every dataset can and usually will present problems that you never anticipated.
As I said, these tips are Excel-based to some degree, but many programs have tools that perform similar functions. In many cases your tools may be chosen for you by your organizational process and standards. These tips can help guide your process regardless.
It is important to remember that the tool itself is less significant than the knowledge of the person who uses the tool.
The key part is that you develop your personal approach to data cleaning and think about and consider that process on an ongoing basis.
The best processes are not closed to change.
The best processes are open to change.
Use your tools within that context, and you will be much better at what you do.