Data Cleaning Techniques Using Python
In today's data-driven world, data is crucial because it drives decision making in most businesses around the world. Government and private organizations alike rely on data to extract useful insights and make valuable predictions. The quality, reliability, and integrity of the data play a major role in decision making, deriving insights, and building reliable predictions. Data cleaning is the process used to ensure that quality. Let us look at some data cleaning techniques in Python and how they lay the foundation for better analysis and predictions.
1. Removing Duplicates
In any dataset, we should ensure there are no duplicate records. Duplicate values can distort analytical outcomes and lead to biased insights, so it is necessary to remove them. Let's see how we can identify duplicate rows and remove them from a DataFrame.
import pandas as pd
# Sample Data
data = {
'Name': ['Rohit', 'Jaishwal', 'Rohit', 'Gill', 'Rohit']
}
df = pd.DataFrame(data)
# Identify duplicates
print("Duplicates:\n", df[df.duplicated()])
# Remove duplicates
df_cleaned = df.drop_duplicates()
print("\nData without duplicates:\n", df_cleaned)
2. Handling Missing Values
Like duplicate values, missing values also affect outcomes and predictions. They can be handled in several ways depending on the use case. Some popular methods are:
- Removing Missing Values
- Filling Missing Values with 0, mean, median, or mode
- Interpolating Missing Values
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {
'Name': ['Rohit', 'Jaishwal', 'Gill', 'Kohli', np.nan],
'Age': [38, np.nan, 22, 25, 36],
'City': ['Mumbai', 'Rajasthan', np.nan, 'Gujarat', 'Bengaluru'],
'Score': [85, np.nan, 78, 92, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDropped rows with missing values:\n", df_dropped_rows)
# Fill missing values with the mean in the 'Age' column
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("\nFilled missing 'Age' with mean:\n", df)
# Interpolate missing values in the 'Score' column
df['Score'] = df['Score'].interpolate()
print("\nInterpolated missing 'Score' values:\n", df)
3. Dealing with Outliers
Outliers are data points that differ significantly from the majority of the data. They can skew analysis results and degrade prediction quality. Outliers can be detected using statistical methods (IQR, Z-score) or visualization. Once detected, we can remove them, replace them, or cap them at threshold values.
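The snippet below assumes a DataFrame named data with a numeric 'values' column and that lower_bound and upper_bound have already been computed from the IQR. A minimal setup sketch with made-up numbers might look like this:
import pandas as pd
import numpy as np
# Made-up sample with one obvious outlier
data = pd.DataFrame({'values': [10, 12, 11, 13, 12, 95, 11, 10]})
# IQR-based thresholds: 1.5 * IQR beyond the first and third quartiles
Q1 = data['values'].quantile(0.25)
Q3 = data['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR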
# Remove outliers using IQR thresholds
filtered_data = data[(data['values'] >= lower_bound) & (data['values'] <= upper_bound)]
print(filtered_data)
# Log transformation to reduce the effect of outliers
data['log_values'] = np.log(data['values'] + 1)  # Adding 1 to avoid log(0)
# Cap values at lower and upper bounds
data['capped_values'] = np.where(data['values'] > upper_bound, upper_bound,
                                 np.where(data['values'] < lower_bound, lower_bound, data['values']))
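The section also mentions Z-score detection. Here is a minimal sketch using the same made-up 'values' sample as above; the threshold of 2 is an illustrative choice for such a small sample (3 is a common default):
import pandas as pd
import numpy as np
data = pd.DataFrame({'values': [10, 12, 11, 13, 12, 95, 11, 10]})
# Z-score: number of standard deviations each point lies from the mean
z_scores = (data['values'] - data['values'].mean()) / data['values'].std()
# Flag points whose absolute Z-score exceeds the threshold
outliers = data[np.abs(z_scores) > 2]
print("Outliers by Z-score:\n", outliers)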
4. Handling Categorical Data and Converting Data Types
Handling categorical data and converting data types are essential cleaning steps. You can use label encoding or one-hot encoding to transform categorical data into a numerical format when needed. Likewise, columns can be converted to more appropriate data types for better compatibility with analysis.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Label Encoding
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
label_encoder = LabelEncoder()
data['Color_encoded'] = label_encoder.fit_transform(data['Color'])
print(data)
# One-hot encoding
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
data_one_hot = pd.get_dummies(data, columns=['Color'])
print(data_one_hot)
# Convert column 'A' to numeric
data = pd.DataFrame({'A': ['1', '2', '3.5', '4'], 'B': ['5', '6.1', 'invalid', '7']})
data['A'] = pd.to_numeric(data['A'], errors='coerce') # 'coerce' converts invalid parsing to NaN
print(data)
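Beyond pd.to_numeric, columns are often converted with astype or pd.to_datetime. A minimal sketch with made-up columns (not part of the original example):
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C'],
                   'Joined': ['2024-01-05', '2024-02-10', 'not a date', '2024-03-15']})
# Store repeated labels with the memory-efficient 'category' dtype
df['Category'] = df['Category'].astype('category')
# Parse dates; errors='coerce' turns unparseable strings into NaT
df['Joined'] = pd.to_datetime(df['Joined'], errors='coerce')
print(df.dtypes)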
Conclusion
The above techniques create a robust data foundation, improving the quality of your dataset and enabling more accurate, meaningful results in downstream analysis. Applying them consistently not only makes your data easier to work with but also strengthens the insights and decisions derived from it. Everyone working with data should be familiar with these data cleaning techniques.