Python Data Cleaning: Essential Code Guide

2 years ago

Emily Johnson

Data cleaning is a preprocessing step that includes operations such as removing duplicate values, handling missing values, and dealing with outliers. Below is sample pandas code for some common data cleaning operations, where df is an existing DataFrame:

Remove duplicates:

# Drop rows that exactly repeat an earlier row, keeping the first occurrence
df = df.drop_duplicates()
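
As a minimal sketch of how this behaves, the example below counts duplicates before dropping them and also shows the subset parameter, which limits the comparison to chosen columns. The sample data and column names are hypothetical and not part of the original example.

```python
import pandas as pd

# Hypothetical sample data: the second and third rows are exact duplicates.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "city": ["Oslo", "Rome", "Rome", "Rome"],
})

# Count fully duplicated rows before dropping them.
n_duplicates = int(df.duplicated().sum())

# Drop exact duplicates, keep the first occurrence, and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

# Treat rows as duplicates when only the "city" column matches.
one_per_city = df.drop_duplicates(subset=["city"], keep="first")

print(n_duplicates)
print(deduped)
print(one_per_city)
```

Checking the count first is a cheap way to confirm that the rows being removed are really the ones you expect.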

Missing value handling:

Delete rows with missing values:

df = df.dropna()
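
Dropping every row that has any missing value can discard a lot of data. The sketch below, on hypothetical sample data, first counts missing values per column and then shows two narrower options: restricting the check to one column with subset, and dropping only rows that are entirely empty with how="all".

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, np.nan],
    "score": [88.0, 92.0, np.nan, np.nan],
})

# Count missing values per column before deciding what to drop.
missing_per_column = df.isna().sum()

# Default behaviour: drop a row if any column is missing.
complete_rows = df.dropna()

# Drop a row only when "age" is missing.
age_known = df.dropna(subset=["age"])

# Drop a row only when every column is missing.
not_empty = df.dropna(how="all")

print(missing_per_column)
print(len(complete_rows), len(age_known), len(not_empty))
```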

Fill in missing values with a specified value:

# value is a placeholder: a scalar such as 0, or a dict mapping column names to fill values
df = df.fillna(value)
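
A single fill value rarely suits every column. As a hedged sketch, the example below passes a dict to fillna so that each column gets its own replacement; using the median for the numeric column and the label "unknown" for the text column are assumptions chosen for illustration, as are the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one numeric and one text column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "category": ["a", None, "b"],
})

# Choose a fill value per column: the median for numbers, a label for text.
fill_values = {
    "price": df["price"].median(),
    "category": "unknown",
}

# fillna returns a new DataFrame; the original df is left unchanged.
filled = df.fillna(fill_values)

print(filled)
```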

Fill in missing values by interpolation:

# Linear interpolation by default; intended for numeric columns
df = df.interpolate()
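
To make the effect concrete, here is a small sketch on a hypothetical numeric series with a two-value gap. It shows the default linear interpolation and the limit parameter, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with a gap of two missing values.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Default linear interpolation fills the gap with evenly spaced values.
linear = s.interpolate()

# limit=1 fills at most one consecutive missing value and leaves the rest.
limited = s.interpolate(limit=1)

print(linear.tolist())
print(limited.tolist())
```

Capping the fill length can be useful when long gaps should stay visible as missing rather than be replaced with estimated values.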

Outlier handling:

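Remove outliers that lie more than three standard deviations from the mean:

# Keep rows within three standard deviations of the mean (uses pandas' own .abs(), so NumPy is not required)
df = df[(df['column'] - df['column'].mean()).abs() <= 3 * df['column'].std()]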
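
The same rule can be written with explicit z-scores, which makes it easy to inspect what was removed. This is a minimal sketch on hypothetical data (twenty values near 10 plus one extreme value); the threshold of 3 follows the example above.

```python
import pandas as pd

# Hypothetical data: twenty values close to 10 and one extreme value.
values = [9, 10, 11, 10, 12] * 4 + [100]
df = pd.DataFrame({"column": values})

col = df["column"]

# z-score: distance from the mean, measured in standard deviations.
z_scores = (col - col.mean()) / col.std()

# Keep rows whose absolute z-score is at most 3.
mask = z_scores.abs() <= 3
cleaned = df[mask]
removed = df[~mask]

print(len(cleaned))
print(removed)
```

Keeping the removed rows in a separate frame lets you review them before discarding them for good.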

Remove outliers using the box plot (interquartile range) rule:

q1 = df['column'].quantile(0.25)
q3 = df['column'].quantile(0.75)
iqr = q3 - q1
df = df[(df['column'] >= q1 - 1.5 * iqr) & (df['column'] <= q3 + 1.5 * iqr)]
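
If the same filter is needed for several columns, the bounds can be wrapped in a small helper. The function name iqr_bounds and its k parameter are hypothetical, introduced only for this sketch, and the sample data is made up.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the box plot rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical data: nine ordinary values and one extreme value.
df = pd.DataFrame({"column": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

lower, upper = iqr_bounds(df["column"])

# between() is inclusive on both ends by default, matching >= and <= above.
cleaned = df[df["column"].between(lower, upper)]

print(lower, upper)
print(cleaned["column"].tolist())
```

On this ten-row sample the three-standard-deviation rule would keep the value 100, because the extreme value itself inflates the standard deviation, while the quartile-based rule removes it.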

The code above is only a starting point. Adjust and extend each operation to suit your data, for example by choosing which columns to check and which outlier thresholds to apply.
