A Gartner study found that bad data costs businesses between $9.7 million and $14.2 million annually. This is where data cleansing enters the picture.
Data cleansing, also known as data cleaning or data scrubbing, is the process of identifying issues in data sets and then taking the necessary steps to address them. The ultimate objective is to ensure the data sets are of acceptable quality.
Quality data should meet the following criteria:
- Validity — The degree to which your data conforms to the defined business rules and validations
- Accuracy — Closeness of data to the true values
- Completeness — The extent to which all required data is available
- Consistency — How consistent the data appears within data sets or across different data sets
- Uniformity — The degree to which the data is specified using the same units of measure
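To make a few of these criteria concrete, here's a minimal sketch of how they can be turned into checks with pandas; the table, the column names, and the rules are illustrative assumptions, not a standard:

```python
import pandas as pd

# Hypothetical customer records used to illustrate the quality checks;
# the column names and rules below are assumptions for illustration.
df = pd.DataFrame({
    "email": ["ana@example.com", "bob[at]example.com"],
    "age": [34, -2],
    "country": ["MY", None],
})

# Validity: does the email conform to a defined business rule?
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Accuracy: is the age plausibly close to a true value?
plausible_age = df["age"].between(0, 120)

# Completeness: is every required field present?
complete = df[["email", "age", "country"]].notna().all(axis=1)

print(pd.DataFrame({"valid_email": valid_email,
                    "plausible_age": plausible_age,
                    "complete": complete}))
```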
Methodology
To achieve thorough data cleansing that hits the quality marks outlined above, here are some main areas that you can pay particular attention to.
1. Remove duplicates
When you have data coming from different sources, there are bound to be duplicate records. Duplicate records can be misleading and take up precious database space. Therefore, it’s essential to deduplicate these disparate records and merge them based on identifying key fields. Done right, you’ll achieve the Golden Record state: a single, accurate, and complete source of truth.
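Here's a minimal deduplication sketch in pandas; the customers table, the email key, and the column names are illustrative assumptions, not a prescribed schema:

```python
import pandas as pd

# Hypothetical customer records pulled from two sources;
# column names are assumptions for illustration.
customers = pd.DataFrame({
    "email": ["ana@example.com", "ANA@example.com", "bob@example.com"],
    "name":  ["Ana Lim", "Ana Lim", "Bob Tan"],
    "phone": [None, "555-0101", "555-0202"],
})

# Normalize the identifying key so near-duplicates match.
customers["email"] = customers["email"].str.strip().str.lower()

# Merge duplicates into a single "golden record": for each email,
# keep the first non-null value seen in every other column.
golden = customers.groupby("email", as_index=False).first()
print(golden)
```

Grouping on the normalized key and keeping the first non-null value per column is one simple merging strategy; real pipelines often need fuzzy matching when no clean key exists.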
2. Remove irrelevant records
Having lots and lots of data may seem like a big plus nowadays. But there’s no point bragging about it if most of that data is irrelevant and contributes nothing to the day-to-day running of your operations.
For example, if you only want to analyze data on the millennial workforce, you can safely remove records of individuals born before 1981. By removing this ‘noise’, you can quickly zoom in on the data sets of interest, increasing work efficiency. Removing irrelevant data also helps streamline the data and ensures consistency and validity across the board.
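A sketch of that filter in pandas, assuming a hypothetical workforce table with a birth_year column:

```python
import pandas as pd

# Hypothetical workforce data; birth_year is an assumed column name.
workforce = pd.DataFrame({
    "name": ["Carol", "Dan", "Eve"],
    "birth_year": [1975, 1988, 1995],
})

# Keep only individuals born in or after 1981 (the millennial cutoff
# used in the example above) and drop the rest as noise.
millennials = workforce[workforce["birth_year"] >= 1981]
print(millennials)
```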
3. Convert data types
Data comes in various formats and types. Given this complexity, one may be tempted to leave the data as is and skip proper type conversion. But consider a scenario where your date field is accidentally left in text form. What happens when there’s a need to calculate a delivery date based on the original date entered? You’ll be in a pickle!
So, don’t skip this step, especially when you have numbers amongst your data. The same goes for data sets that carry units of measure (UoM). For example, data recorded in various currencies may need to be converted to one standard currency, such as USD.
This way, you ensure data uniformity and validity, essential prerequisites for impactful data-related operations and insightful analysis.
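Here's a minimal pandas sketch of both conversions; the orders table, the seven-day delivery window, and the exchange rates are illustrative assumptions:

```python
import pandas as pd

# Hypothetical orders table; every value arrives as text.
orders = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-17"],
    "amount": ["100.00", "250.00"],
    "currency": ["EUR", "USD"],
})

# Convert the text dates into real datetimes so date arithmetic works.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["delivery_date"] = orders["order_date"] + pd.Timedelta(days=7)

# Convert amounts to numbers, then normalize to USD using
# made-up exchange rates (real rates would come from a data feed).
rates_to_usd = {"EUR": 1.09, "USD": 1.0}
orders["amount_usd"] = (
    pd.to_numeric(orders["amount"]) * orders["currency"].map(rates_to_usd)
)
print(orders)
```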
4. Correct structural errors
Undeniably, fixing data errors comes with the territory. This includes typos, incorrect capitalization, and unstandardized naming conventions. Oftentimes, running a simple spell-check tool helps. Other times, it’s handling familiar situations like “N/A” and “Not Applicable” entries, which should mean the same thing. This allows you to achieve data accuracy and consistency.
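A small pandas sketch of these fixes, using made-up data; the canonical “NA” token is an assumption and could be whatever standard your team agrees on:

```python
import pandas as pd

# Hypothetical records with inconsistent capitalization and spellings.
df = pd.DataFrame({
    "country": ["malaysia", "MALAYSIA", "Malaysia "],
    "status":  ["N/A", "Not Applicable", "Active"],
})

# Standardize capitalization and strip stray whitespace.
df["country"] = df["country"].str.strip().str.title()

# Map the different "not applicable" spellings to one canonical token.
df["status"] = df["status"].replace({"N/A": "NA", "Not Applicable": "NA"})
print(df)
```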
5. Handle missing values
Handling missing values may well be the most difficult task here. While the obvious approach is simply to remove the affected records, there could be a bigger issue at hand. Why are these values missing in the first place? If it’s customer data, is it because your customer service team didn’t collect the full customer info, and hence isn’t doing its job effectively?
Investigate before deciding what to do with the missing values. Then, you can take one of these options:
- Drop the records altogether
- Impute missing values based on prior investigation and analysis
- Input ‘NULL’ or ‘0’ into the field, acknowledging the empty value
This is how you ensure completeness of your data.
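All three options in a minimal pandas sketch, using a hypothetical customer table with gaps in the age column:

```python
import pandas as pd

# Hypothetical customer data with missing ages.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, 41, None],
})

# Option 1: drop the records altogether.
dropped = df.dropna(subset=["age"])

# Option 2: impute, here with the median as one illustrative choice
# (the right statistic depends on what your investigation revealed).
imputed = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 3: keep the rows but explicitly acknowledge the empty value.
flagged = df.assign(age_missing=df["age"].isna())

print(dropped, imputed, flagged, sep="\n\n")
```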
Wrapping It Up
The advent of AI and machine learning (ML) technology is upping the stakes even more. More companies are dabbling with ML models to interpret data from various perspectives, derive actionable insights, and ultimately inform their strategic plans. To produce the desired results, the ML models need to be trained with reliable and high-quality data sets. Hence, it’s even more imperative for data cleansing to be done thoroughly with tried-and-true methodology.
Written by: Shigim Yusof