82. Data Cleaning Methods

Here are some methods for handling missing values when cleaning data in a pandas DataFrame:

Delete the rows that have a missing value in a given column
```
df.dropna(subset=["COLUMN_NAME"])
```
Delete the whole attribute (column) that contains missing values
```
df.drop("COLUMN_NAME",axis=1)
```
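
As a minimal sketch of the two options above, the following uses a made-up toy DataFrame (the column names `rooms` and `price` are hypothetical, not from the original post). It also illustrates that both calls return a new DataFrame and leave the original untouched, so the result has to be assigned to a variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0],
    "price": [100.0, 200.0, 300.0],
})

# Option 1: drop the rows where "rooms" is missing.
rows_dropped = toy.dropna(subset=["rooms"])

# Option 2: drop the "rooms" column entirely.
column_dropped = toy.drop("rooms", axis=1)

print(rows_dropped.shape)    # (2, 2)
print(column_dropped.shape)  # (3, 1)
print(toy.shape)             # (3, 2), the original is unchanged
```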

Replace missing values with another value (such as the median)

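```
median = df["COLUMN_NAME"].median()
# Assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions and may not update df.
df["COLUMN_NAME"] = df["COLUMN_NAME"].fillna(median)
```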

When replacing missing values, it is important to save the median value you have just computed, so that you can apply the same value again and recreate the same output later on.
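
One way to read "save the median" is to compute it once and reuse that same number on any data processed later. The sketch below assumes two hypothetical DataFrames, `train` and `test`, which do not appear in the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical data sets, used only for illustration.
train = pd.DataFrame({"rooms": [2.0, np.nan, 4.0, 6.0]})
test = pd.DataFrame({"rooms": [np.nan, 3.0]})

# Compute the median once (NaN values are skipped) and keep it.
saved_median = train["rooms"].median()

# Reuse the saved value everywhere, instead of recomputing it per data set.
train["rooms"] = train["rooms"].fillna(saved_median)
test["rooms"] = test["rooms"].fillna(saved_median)

print(saved_median)           # 4.0
print(test["rooms"].tolist()) # [4.0, 3.0]
```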

Scikit-Learn provides a useful class to take care of that: SimpleImputer.

```
from sklearn.impute import SimpleImputer

# Create an instance, specifying that missing values are replaced with the median
imputer = SimpleImputer(strategy="median")

# Fit the instance to calculate the median of each attribute
# (the median strategy only works on numeric data, so df must contain numeric columns only)
imputer.fit(df)

# The median of each attribute is now stored in the statistics_ instance variable
imputer.statistics_

# Finally, replace the missing values using the fitted imputer
X = imputer.transform(df)
```
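
By default, `transform` returns a NumPy array rather than a DataFrame. The following sketch shows one possible way to keep only the numeric columns before fitting and to wrap the result back into a DataFrame. The toy data and the names `toy`, `numeric`, and `cleaned` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 6.0],
    "price": [100.0, 200.0, np.nan],
    "city": ["A", "B", "C"],
})

# Keep only numeric columns, since the median is undefined for text.
numeric = toy.select_dtypes(include="number")

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(numeric)  # NumPy array with missing values filled

# Wrap the array back into a DataFrame with the original labels.
cleaned = pd.DataFrame(X, columns=numeric.columns, index=numeric.index)

print(imputer.statistics_)  # [  4. 150.]
print(cleaned)
```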
