Here are some ways to handle missing values in a pandas DataFrame:
- Drop the rows that have a missing value in a given column
df.dropna(subset=["COLUMN_NAME"])
- Drop the whole attribute (column) that contains missing values
df.drop("COLUMN_NAME", axis=1)
- Replace the missing values with another value (such as the median)
median = df["COLUMN_NAME"].median()
df["COLUMN_NAME"] = df["COLUMN_NAME"].fillna(median)
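The three options above can be compared side by side on a small example. This is only a sketch: the toy DataFrame and its column names ("rooms", "price") are made up for illustration. Each call returns a new DataFrame, so the original df is left untouched.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical and only serve the example.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, np.nan, 200.0],
})

# Option 1: drop the rows where "rooms" is missing.
rows_dropped = df.dropna(subset=["rooms"])

# Option 2: drop the "rooms" column entirely.
column_dropped = df.drop("rooms", axis=1)

# Option 3: fill the missing "rooms" values with the column median.
# median() skips NaN values by default.
median = df["rooms"].median()
filled = df.assign(rooms=df["rooms"].fillna(median))

print(rows_dropped.shape)
print(list(column_dropped.columns))
print(filled["rooms"].tolist())
```

Assigning the result to a new name, rather than modifying df in place, makes it easy to try each option and compare the outcomes.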
When you replace missing values, it is important to save the median you have just computed, because you will need it to apply the same replacement again and recreate the same output later on.
Scikit-Learn provides a useful class to take care of that.
from sklearn.impute import SimpleImputer
# Create an instance, specifying that missing values should be replaced with the median
imputer = SimpleImputer(strategy="median")
# Fit the instance to compute the medians (the median strategy requires all columns in df to be numeric)
imputer.fit(df)
# Now the median of each attribute is stored in the statistics_ attribute
imputer.statistics_
# Finally, replace the missing values using the fitted imputer
X = imputer.transform(df)
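The benefit of keeping the fitted imputer is that the stored medians can be reused on data it has not seen. The sketch below assumes two small, all-numeric DataFrames (train and new_data, with made-up column names). By default transform returns a NumPy array, so the result is wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "train" is used to learn the medians,
# "new_data" arrives later and has its own missing values.
train = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, np.nan, 200.0],
})
new_data = pd.DataFrame({
    "rooms": [np.nan, 2.0],
    "price": [120.0, np.nan],
})

# Learn one median per column from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)
print(imputer.statistics_)

# Reuse the stored medians on the new data and restore the
# column names and index that the NumPy output does not carry.
new_filled = pd.DataFrame(
    imputer.transform(new_data),
    columns=new_data.columns,
    index=new_data.index,
)
print(new_filled)
```

The missing values in new_data are filled with the medians learned from train, not with medians recomputed from new_data, which is what keeps the output reproducible.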