The 4 Main Steps
Data preprocessing has 4 main steps.
- Data Quality Assessment
- Data Cleaning
- Data Transformation
- Data Reduction
1. Data Quality Assessment
Before jumping into coding, evaluating the overall data quality is essential. Here are several problems to look out for.
- Mismatched Data Types
- Mixed Data Values
- Data Outliers
- Missing Data
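As a rough illustration, the four checks above could be run on a small table like this. This is a minimal sketch using only the Python standard library; the `rows` data, the column names, and the `assess` helper are made up for this example, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
from statistics import quantiles

# Hypothetical toy records, invented for illustration only.
rows = [
    {"age": 34, "gender": "F"},
    {"age": "41", "gender": "female"},  # age stored as text, gender spelled differently
    {"age": None, "gender": "M"},       # missing age
    {"age": 29, "gender": "M"},
    {"age": 31, "gender": "F"},
    {"age": 38, "gender": "M"},
    {"age": 420, "gender": "F"},        # implausible age
]

def assess(rows):
    """Return a small report covering the four kinds of problems."""
    report = {}
    # Missing data: count None values per column.
    report["missing"] = {col: sum(r[col] is None for r in rows) for col in rows[0]}
    # Mismatched data types: which Python types appear in the age column?
    report["age_types"] = sorted(
        {type(r["age"]).__name__ for r in rows if r["age"] is not None}
    )
    # Mixed data values: distinct spellings used for the same category.
    report["gender_values"] = sorted({r["gender"] for r in rows})
    # Outliers: numeric ages outside 1.5 * IQR of the quartiles.
    ages = [r["age"] for r in rows if isinstance(r["age"], (int, float))]
    q1, _, q3 = quantiles(ages, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    report["outliers"] = [a for a in ages if a < q1 - spread or a > q3 + spread]
    return report

report = assess(rows)
print(report)
```

Nothing is modified at this stage; the report only tells you what needs fixing.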
2. Data Cleaning
Now that you have examined the data and understood its issues, the next step is to clean it by fixing the problems you found in the previous step.
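A cleaning pass might look like the sketch below. It is only one possible approach, written with the Python standard library. The `rows` data, the `GENDER_MAP` table, and the `clean` helper are assumptions made for this example, as are the choices to treat ages outside 0 to 120 as missing and to fill missing ages with the median.

```python
from statistics import median

# Hypothetical toy records, invented for illustration only.
rows = [
    {"age": 34, "gender": "F"},
    {"age": "41", "gender": "female"},
    {"age": None, "gender": "M"},
    {"age": 29, "gender": "M"},
    {"age": 420, "gender": "F"},
]

# Assumed mapping from observed spellings to one canonical label.
GENDER_MAP = {"F": "F", "female": "F", "M": "M", "male": "M"}

def clean(rows, min_age=0, max_age=120):
    cleaned = []
    for r in rows:
        age = r["age"]
        # Mismatched types: cast numeric strings to int, discard anything else.
        if isinstance(age, str):
            age = int(age) if age.strip().isdigit() else None
        # Outliers: treat implausible ages as missing.
        if age is not None and not (min_age <= age <= max_age):
            age = None
        # Mixed values: map every spelling to its canonical label.
        gender = GENDER_MAP.get(r["gender"], r["gender"])
        cleaned.append({"age": age, "gender": gender})
    # Missing data: fill the gaps with the median of the known ages.
    fill = median(r["age"] for r in cleaned if r["age"] is not None)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned

cleaned = clean(rows)
print(cleaned)
```

Whether to impute, drop, or flag a bad value depends on the dataset; median imputation is used here only because it is simple and robust to the remaining extremes.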
3. Data Transformation
Cleaning the data only gets you to the starting line. The next step is to transform the data into formats suitable for analysis and other downstream phases.
Here are some examples.
- Aggregation
- Normalization
- Feature Selection
- Discretization
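Three of these transformations can be sketched in a few lines of standard-library Python. The `sales` data, the function names, and the age bin edges below are hypothetical and chosen only for illustration. Min-max scaling is one of several ways to normalize.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical (region, amount) records, invented for illustration only.
sales = [("north", 120.0), ("south", 80.0), ("north", 200.0), ("south", 40.0)]

def aggregate(records):
    """Aggregation: sum the amounts per group key."""
    totals = defaultdict(float)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)

def min_max(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, edges, labels):
    """Discretization: map a number to the label of the bin it falls in.

    `edges` are sorted bin boundaries; `labels` has one more entry than `edges`.
    """
    return labels[bisect_right(edges, value)]

totals = aggregate(sales)
scaled = min_max([10, 20, 30])
groups = [discretize(a, [18, 65], ["minor", "adult", "senior"]) for a in (17, 18, 70)]
print(totals, scaled, groups)
```

Each function takes raw values and returns a new structure, so the cleaned data stays untouched and the steps can be applied in any order.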
4. Data Reduction
The more data you have, the harder it is to analyze. Data reduction not only makes the analysis easier but also cuts down on data storage.
Here are some examples.
- Attribute Selection
- Numerosity Reduction
- Dimensionality Reduction
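The first two ideas can be sketched with the Python standard library. The example below drops attributes that carry no information (zero variance) and reduces the number of rows by random sampling. The `table` data and both helper names are assumptions made for this example. Dimensionality reduction techniques such as PCA are left out because they usually rely on a numerical library.

```python
import random
from statistics import pvariance

# Hypothetical column-oriented table, invented for illustration only.
table = {
    "height_cm": [170.0, 165.0, 180.0, 175.0],
    "weight_kg": [68.0, 59.0, 82.0, 77.0],
    "sensor_ok": [1, 1, 1, 1],  # constant column: carries no information
}

def drop_low_variance(columns, threshold=0.0):
    """Attribute selection: keep columns whose variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }

def sample_rows(rows, k, seed=0):
    """Numerosity reduction: keep k rows chosen at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, k)

reduced = drop_low_variance(table)
subset = sample_rows(list(range(100)), 10)
print(sorted(reduced), subset)
```

A variance threshold is a crude filter; in practice attribute selection often also considers how strongly each column relates to the target.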