Handling Missing Values in a Real-World Dataset with Pandas

Handling missing values is an essential part of data preprocessing, as it ensures the integrity and accuracy of your data analysis. In this blog post, we will use the Titanic dataset, a well-known dataset in the data science community, to demonstrate how to handle missing values using Python’s Pandas library.

About the Titanic Dataset

The Titanic dataset provides information about the passengers who were on board the Titanic when it sank in 1912. It includes various features such as age, sex, class, and whether the passenger survived. This dataset is commonly used for practicing data cleaning, data analysis, and machine learning techniques.

Importing Necessary Libraries

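First, let’s import the necessary libraries and load the Titanic dataset. You can download the dataset from Kaggle or use the Seaborn library, which has a built-in function to load it (the file is fetched over the network the first time you call it). This post uses the Seaborn version; the Kaggle CSV has different column names, such as Age and Cabin, so adjust the code accordingly if you use it.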

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')
print(titanic.head())

Exploring Missing Values

Checking for Missing Values

To identify missing values in the dataset, use the isnull() method along with sum() to get a count of missing values in each column.

print(titanic.isnull().sum())

This will give you an output like this:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
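
Raw counts are easier to judge as a share of all rows. The sketch below computes the percentage of missing values per column. To stay self-contained it uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the Titanic data); the same expression should work on `titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-like columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": ["C", None, None, None],
})

# isna() is an alias of isnull(); the mean of a boolean column
# is the fraction of True values, i.e. the fraction of missing entries
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns with the most gaps at the top, which helps when deciding what to drop and what to fill.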

Visualizing Missing Values

Visualizing missing values can help you understand their distribution across the dataset. You can use the seaborn library to create a heatmap of missing values.

sns.heatmap(titanic.isnull(), cbar=False, cmap='viridis')
plt.show()

Handling Missing Values

Dropping Missing Values

You can drop rows or columns with missing values using the dropna() method.

Drop rows with any missing values:

titanic_dropped_rows = titanic.dropna()
print(titanic_dropped_rows.shape)

Drop columns with any missing values:

titanic_dropped_columns = titanic.dropna(axis=1)
print(titanic_dropped_columns.shape)
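
Dropping on any missing value can discard a lot of data. As a hedged sketch, `dropna()` also accepts `subset` (only look at certain columns) and `thresh` (keep rows or columns with at least that many non-missing values). The `toy` DataFrame below is hypothetical and only illustrates the parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck": [None, "C", None, None],
})

# Drop a row only when 'age' is missing; gaps in other columns are ignored
rows_with_age = toy.dropna(subset=["age"])

# Keep only the columns that have at least 3 non-missing values
dense_columns = toy.dropna(axis=1, thresh=3)

print(rows_with_age.shape)
print(list(dense_columns.columns))
```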

Filling Missing Values

You can fill missing values with specific values or use statistical methods.

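Fill with a specific value. Pass a dictionary to target individual columns; in the Seaborn version, ‘class’ and ‘deck’ are categorical, so a blanket titanic.fillna(0) can fail with a TypeError because 0 is not one of their categories:

titanic_filled_value = titanic.fillna({'age': 0})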
print(titanic_filled_value.head())

Fill with the mean value:

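# Assign the result back: inplace=True on a selected column warns in
# pandas 2.x and no longer updates the DataFrame under copy-on-write
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())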
print(titanic.head())

Fill with the mode value:

titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])
print(titanic.head())
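
The mean is pulled toward extreme values, so the median is sometimes a safer fill for skewed numeric columns. This is an alternative to the mean fill above, not something the steps in this post rely on. The `ages` Series is hypothetical and chosen so that one outlier makes the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (80) and one missing value
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0])

# Both statistics skip NaN by default
filled_mean = ages.fillna(ages.mean())      # fills with 36.5
filled_median = ages.fillna(ages.median())  # fills with 23.0

print(filled_mean.tolist())
print(filled_median.tolist())
```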

Forward and Backward Fill

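Forward fill carries the last valid value forward into the gaps that follow it; backward fill pulls the next valid value back into the gaps that precede it. Both assume that row order is meaningful, as in a time series. Titanic rows have no such order, so the examples here only illustrate the mechanics.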

Forward fill:

# fillna(method='ffill') is deprecated since pandas 2.1; use ffill() instead
titanic_ffill = titanic.ffill()
print(titanic_ffill.head())

Backward fill:

# fillna(method='bfill') is deprecated since pandas 2.1; use bfill() instead
titanic_bfill = titanic.bfill()
print(titanic_bfill.head())
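
The direction of each fill is easiest to see on a tiny example. The sketch below uses a hypothetical Series `s`; note that a gap at the very end has no later value to borrow from, so backward fill leaves it missing (and forward fill would do the same for a gap at the very start).

```python
import numpy as np
import pandas as pd

# Hypothetical ordered values with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # each gap takes the last valid value before it
backward = s.bfill()  # each gap takes the next valid value after it

print(forward.tolist())
print(backward.tolist())
```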

Interpolation

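Interpolation estimates missing values from the existing data points around them. Pandas provides several interpolation methods; the default is linear. Interpolation applies to numeric data, so the example restricts it to the numeric columns. If ‘age’ has already been filled in place earlier, reload the dataset first, otherwise there is nothing left to interpolate.

numeric_cols = titanic.select_dtypes(include='number').columns
titanic_interpolated = titanic.copy()
titanic_interpolated[numeric_cols] = titanic[numeric_cols].interpolate()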
print(titanic_interpolated.head())
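
As a minimal sketch of what linear interpolation does, the hypothetical Series below has two gaps between 10 and 40. With the default method, pandas treats the rows as equally spaced and fills the gaps along a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical values with two consecutive gaps
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Default method is 'linear': gaps are filled along a straight line
# between the surrounding valid values
interpolated = s.interpolate()

print(interpolated.tolist())
```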

Handling Specific Columns

Let’s handle missing values in the ‘age’ and ‘deck’ columns specifically:

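Age: We can fill missing values with the mean age.

titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

Deck: Most values in this column are missing (688 of 891 rows), so we can drop the column entirely.

titanic.drop(columns=['deck'], inplace=True)
# Or fill with a placeholder; 'deck' is categorical in the Seaborn
# version, so the placeholder must be added as a category first
# titanic['deck'] = titanic['deck'].cat.add_categories('Unknown').fillna('Unknown')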
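
When a column is stored as a categorical, as ‘deck’ is when loaded through Seaborn, a placeholder can only be filled in after it has been registered as a category. The sketch below shows this on a hypothetical three-element Series rather than the real column.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
deck = pd.Series(pd.Categorical(["C", None, "E"], categories=list("ABCDEFG")))

# Register the placeholder as a valid category, then fill the gaps with it
deck_filled = deck.cat.add_categories("Unknown").fillna("Unknown")

print(deck_filled.tolist())
```

Keeping the column with an explicit "Unknown" level preserves the fact that the value was missing, which dropping the column would lose.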

Handling the ‘embarked’ Column

The ‘embarked’ column has only two missing values. We can fill these with the most frequent value (mode).

titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])
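
The column-specific steps can be bundled and checked in one place. The sketch below is one possible arrangement, not a definitive pipeline: `clean_missing` is a hypothetical helper name, and the `toy` DataFrame is made up so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-fill 'age', mode-fill 'embarked', drop 'deck'."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())
    cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode()[0])
    return cleaned.drop(columns=["deck"])

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embarked": ["S", None, "S"],
    "deck": [None, "C", None],
})

cleaned = clean_missing(toy)

# Total number of missing cells left after cleaning
remaining = int(cleaned.isna().sum().sum())
print(remaining)
```

Working on a copy keeps the raw data available, so a different strategy can be tried later without reloading the dataset.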

Conclusion

Handling missing values is a crucial step in data preprocessing. Using the Titanic dataset, we have demonstrated various techniques to identify, analyze, and handle missing values with Pandas. Whether you choose to drop, fill, or interpolate missing values, the key is to understand the nature of your data and the impact of missing values on your analysis.

By mastering these techniques, you can ensure the integrity and accuracy of your data analysis, leading to more reliable insights and better decision-making.

Happy data cleaning with Pandas!