82. Data Cleaning Methods

Here are some methods for handling missing values in a pandas DataFrame:

  1. Delete the rows where the value is missing
    df = df.dropna(subset=["COLUMN_NAME"])
    
  2. Delete the whole attribute (column) that contains missing values
    df = df.drop("COLUMN_NAME", axis=1)
    
  3. Replace the missing values with another value (such as the median)
    median = df["COLUMN_NAME"].median()
    df["COLUMN_NAME"] = df["COLUMN_NAME"].fillna(median)
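To make the three options concrete, here is a minimal sketch on a toy DataFrame. The column names "rooms" and "price" and the values are hypothetical, chosen only for illustration. Note that dropna and drop return new DataFrames and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "rooms" has one missing value.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, 200.0, 175.0],
})

# Option 1: drop the rows where "rooms" is missing (returns a new DataFrame).
rows_dropped = df.dropna(subset=["rooms"])

# Option 2: drop the "rooms" column entirely (returns a new DataFrame).
column_dropped = df.drop("rooms", axis=1)

# Option 3: fill the missing "rooms" values with the column median.
median = df["rooms"].median()  # median of 3, 4, 5 (NaN is skipped)
filled = df.assign(rooms=df["rooms"].fillna(median))

print(rows_dropped.shape)            # (3, 2)
print(list(column_dropped.columns))  # ['price']
print(filled["rooms"].tolist())      # [3.0, 4.0, 5.0, 4.0]
```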
    

When replacing missing values, it is important to save the median value you've just computed. You'll want to be able to recreate the same output later on, for example when the same replacement has to be applied to a test set or to new data.
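As a rough sketch of why the saved value matters, the snippet below computes the median once on a training set and reuses it on a test set. The frames `train` and `test` and the column "rooms" are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test sets sharing the column "rooms".
train = pd.DataFrame({"rooms": [2.0, np.nan, 6.0, 4.0]})
test = pd.DataFrame({"rooms": [np.nan, 10.0]})

# Compute the median once, on the training data only, and keep it.
train_median = train["rooms"].median()  # median of 2, 4, 6

# Reuse the saved value everywhere instead of recomputing it per dataset.
train_filled = train["rooms"].fillna(train_median)
test_filled = test["rooms"].fillna(train_median)

print(train_median)           # 4.0
print(test_filled.tolist())   # [4.0, 10.0]
```

Recomputing the median on the test set would fill its gap with a different value (10.0 here), so the two datasets would no longer be processed consistently.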

Scikit-Learn provides a useful class to take care of that.

from sklearn.impute import SimpleImputer

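# Create an instance, specifying that missing values should be replaced with the median
imputer = SimpleImputer(strategy="median")

# Fit the instance to compute the median of each attribute
# (the median only exists for numeric attributes, so df should contain only numeric columns)
imputer.fit(df)

# The median of each attribute is now stored in the statistics_ instance variable
imputer.statistics_

# Finally, replace the missing values using the fitted imputer
# (by default, transform returns a NumPy array rather than a DataFrame)
X = imputer.transform(df)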
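
The steps above can be tried end to end on a small example. This is only a sketch: the columns "rooms" and "age" and their values are made up, and the data is assumed to be numeric only. It also shows one way to put the transformed array back into a DataFrame and to reuse the fitted imputer on new rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric-only data with one missing value per column.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(df)

# One learned median per column, in column order.
print(imputer.statistics_.tolist())  # [4.0, 20.0]

# transform() returns a NumPy array by default; rebuild a DataFrame around it.
X = imputer.transform(df)
df_filled = pd.DataFrame(X, columns=df.columns, index=df.index)
print(df_filled["rooms"].tolist())   # [3.0, 4.0, 5.0, 4.0]

# The fitted imputer fills new data with the same stored medians.
new_rows = pd.DataFrame({"rooms": [np.nan], "age": [np.nan]})
new_filled = imputer.transform(new_rows)
print(new_filled.tolist())           # [[4.0, 20.0]]
```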