Data Cleaning Techniques Using Python
In today's data-driven world, data is crucial because it drives decision making in most businesses around the world. Government and private organizations alike rely on data to extract useful insights and make valuable predictions. The quality, reliability, and integrity of the data play a major role in decision making, deriving insights, and building reliable predictions. Data cleaning is the process used to ensure that quality. Let us look at some data cleaning techniques in Python and how they lay the foundation for better analysis and predictions.
1. Removing Duplicates
In any dataset, we should ensure there are no duplicate records. Duplicate values can distort analytical outcomes and lead to biased insights, so it is necessary to remove them. Let's see how we can identify duplicate rows and remove them from a DataFrame.
import pandas as pd
# Sample Data
data = {
'Name': ['Rohit', 'Jaishwal', 'Rohit', 'Gill', 'Rohit']
}
df = pd.DataFrame(data)
# Identify duplicates
print("Duplicates:\n", df[df.duplicated()])
# Remove duplicates
df_cleaned = df.drop_duplicates()
print("\nData without duplicates:\n", df_cleaned)
2. Handling Missing Values
Like duplicate values, missing values also affect outcomes and predictions. They can be handled in several ways depending on the use case. Some popular methods are:
- Removing Missing Values
- Filling Missing Values with 0, mean, median, or mode
- Interpolating Missing Values
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {
'Name': ['Rohit', 'Jaishwal', 'Gill', 'Kohli', np.nan],
'Age': [38, np.nan, 22, 25, 36],
'City': ['Mumbai', 'Rajasthan', np.nan, 'Gujarat', 'Bengaluru'],
'Score': [85, np.nan, 78, 92, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDropped rows with missing values:\n", df_dropped_rows)
# Fill missing values with the mean in the 'Age' column
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("\nFilled missing 'Age' with mean:\n", df)
# Interpolate missing values in the 'Score' column
df['Score'] = df['Score'].interpolate()
print("\nInterpolated missing 'Score' values:\n", df)
3. Dealing with Outliers
Outliers are data points that differ significantly from the majority of the data. They can skew analysis results and degrade prediction quality. Outliers can be detected using statistical methods (IQR, Z-score) or visualization. Once detected, we can remove them, replace them, or cap them at threshold values.
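The snippet below assumes a DataFrame named data with a numeric 'values' column and that lower_bound and upper_bound have already been computed from the IQR. A minimal setup sketch with made-up numbers might look like this:
import pandas as pd
import numpy as np
# Made-up sample with one obvious outlier
data = pd.DataFrame({'values': [10, 12, 11, 13, 12, 95, 11, 10]})
# IQR-based thresholds: 1.5 * IQR beyond the first and third quartiles
Q1 = data['values'].quantile(0.25)
Q3 = data['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR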
# Remove outliers using IQR thresholds
filtered_data = data[(data['values'] >= lower_bound) & (data['values'] <= upper_bound)]
print(filtered_data)
# Log transformation to reduce the effect of outliers
data['log_values'] = np.log(data['values'] + 1)  # Adding 1 to avoid log(0)
# Cap values at lower and upper bounds
data['capped_values'] = np.where(data['values'] > upper_bound, upper_bound,
                                 np.where(data['values'] < lower_bound, lower_bound, data['values']))
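The section also mentions Z-score detection. Here is a minimal sketch using the same made-up 'values' sample as above; the threshold of 2 is an illustrative choice for such a small sample (3 is a common default):
import pandas as pd
import numpy as np
data = pd.DataFrame({'values': [10, 12, 11, 13, 12, 95, 11, 10]})
# Z-score: number of standard deviations each point lies from the mean
z_scores = (data['values'] - data['values'].mean()) / data['values'].std()
# Flag points whose absolute Z-score exceeds the threshold
outliers = data[np.abs(z_scores) > 2]
print("Outliers by Z-score:\n", outliers)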
4. Handling Categorical Data and Converting Data Types
Handling categorical data and converting data types are essential cleaning steps. You can use label encoding or one-hot encoding to transform categorical data into a numerical format when needed. Likewise, columns can be converted to more appropriate data types for better compatibility with analysis.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Label Encoding
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
label_encoder = LabelEncoder()
data['Color_encoded'] = label_encoder.fit_transform(data['Color'])
print(data)
# One-hot encoding
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
data_one_hot = pd.get_dummies(data, columns=['Color'])
print(data_one_hot)
# Convert column 'A' to numeric
data = pd.DataFrame({'A': ['1', '2', '3.5', '4'], 'B': ['5', '6.1', 'invalid', '7']})
data['A'] = pd.to_numeric(data['A'], errors='coerce') # 'coerce' converts invalid parsing to NaN
print(data)
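Beyond pd.to_numeric, columns are often converted with astype or pd.to_datetime. A minimal sketch with made-up columns (not part of the original example):
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C'],
                   'Joined': ['2024-01-05', '2024-02-10', 'not a date', '2024-03-15']})
# Store repeated labels with the memory-efficient 'category' dtype
df['Category'] = df['Category'].astype('category')
# Parse dates; errors='coerce' turns unparseable strings into NaT
df['Joined'] = pd.to_datetime(df['Joined'], errors='coerce')
print(df.dtypes)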
Conclusion
The above techniques create a robust data foundation, improving the quality of your dataset and enabling more accurate, meaningful results in downstream analysis. Applying them consistently not only makes your data easier to work with but also strengthens the insights and decisions derived from it. Everyone working with data should be familiar with these data cleaning techniques.