Detecting missing values is only the first step; the real challenge is deciding what to do with them. Missing data can distort averages, bias models, or simply break your analysis pipeline. Pandas gives us two main tools for this:
- dropna(), which removes missing values, and
- fillna(), which replaces them.
The choice depends on context: sometimes dropping is safe; other times filling preserves important patterns. As a data scientist, you must balance accuracy with completeness.
🔹 The Sample Data
Let’s start with a small dataset containing missing values:
import pandas as pd
import numpy as np
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
👉 Output:
Original DataFrame:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   NaN   Paris   88.0
2  Charlie  30.0    None    NaN
3    David   NaN  Berlin   76.0
4      Eva  22.0  Madrid    NaN
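Before choosing between dropping and filling, it can help to measure how much is actually missing. The following is a minimal sketch on the same sample data; the variable names (missing_counts, missing_share, complete_rows) are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this block runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Number of missing cells in each column.
missing_counts = df.isna().sum()

# Fraction of each column that is missing (mean of a True/False mask).
missing_share = df.isna().mean()

# Number of rows that have no missing value at all.
complete_rows = int(df.notna().all(axis=1).sum())

print(missing_counts)
print(missing_share)
print("Complete rows:", complete_rows)
```

A column that is mostly empty is usually a candidate for dropping, while a column with a few gaps is often worth filling.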
🔹 Method 1: Removing Missing Values with dropna()
The most straightforward way to handle missing data is to remove it.
By default, dropna() removes every row that contains at least one missing value; with axis=1 it removes columns instead.
# Drop rows with any missing values
print(df.dropna())
👉 Output:
    Name   Age    City  Score
0  Alice  25.0  London   95.0
Only Alice’s row remains because every other row had at least one missing value.
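Checking every column is often stricter than the analysis needs. As a minimal sketch, assuming Age is the only column you truly require (an assumption made purely for illustration), the subset parameter limits the check to the columns you name, and how="all" drops a row only when it is completely empty:

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this block runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Only look at Age when deciding which rows to drop (illustrative choice).
has_age = df.dropna(subset=["Age"])
print(has_age)

# how="all" drops a row only if every value in it is missing.
not_empty = df.dropna(how="all")
print(len(not_empty))
```

With subset=["Age"], the rows for Alice, Charlie, and Eva are kept even though two of them still have gaps in other columns.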
Remove Columns with Missing Data
# Drop entire columns with any missing values
print(df.dropna(axis=1))
👉 Output:
      Name
0    Alice
1      Bob
2  Charlie
3    David
4      Eva
Here, only Name survives because every other column had at least one missing value.
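Dropping a column because of a single gap can throw away a feature that is mostly complete. One possible middle ground, sketched below, combines axis=1 with thresh so that a column survives if it has enough non-missing values; the cutoff of 4 is an arbitrary value chosen for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this block runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Keep only columns with at least 4 non-missing values (illustrative cutoff).
mostly_complete = df.dropna(axis=1, thresh=4)
kept_columns = list(mostly_complete.columns)
print(kept_columns)
```

Here City is kept because it has only one gap, while Age and Score (two gaps each) are dropped.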
Keeping Rows with Enough Data
Sometimes we don’t want to drop everything, only rows that are too incomplete. Use thresh:
# Keep rows with at least 3 non-missing values
print(df.dropna(thresh=3))
👉 Output:
    Name   Age    City  Score
0  Alice  25.0  London   95.0
1    Bob   NaN   Paris   88.0
3  David   NaN  Berlin   76.0
4    Eva  22.0  Madrid    NaN
Rows with fewer than 3 valid entries are dropped. Here that is only Charlie’s row, which has just two (Name and Age).
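To see what thresh is doing, you can count the non-missing cells per row yourself and filter on that count. This is a small sketch of the equivalence, not a description of how pandas implements it internally; non_missing and manual are names invented for the example.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this block runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Count the non-missing cells in each row: 4, 3, 2, 3, 3.
non_missing = df.notna().sum(axis=1)
print(non_missing)

# Keep rows whose count reaches the threshold by hand...
manual = df[non_missing >= 3]

# ...and compare with the built-in parameter.
via_thresh = df.dropna(thresh=3)
print(manual.equals(via_thresh))
```

Printing the per-row counts first is a quick way to choose a sensible threshold before dropping anything.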
🔹 Method 2: Filling Missing Values with fillna()
Instead of discarding data, we can fill missing values with meaningful replacements. This preserves the dataset size, which helps when a downstream machine learning model cannot accept missing inputs.
Filling with a Constant Value
# Fill missing values with 0
print(df.fillna(0))
👉 Output:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   0.0   Paris   88.0
2  Charlie  30.0       0    0.0
3    David   0.0  Berlin   76.0
4      Eva  22.0  Madrid    0.0
Column-Specific Filling
We can replace missing values differently in each column using a dictionary:
# Fill Age with 0, City with 'Unknown', and Score with the column mean
filled_df = df.fillna({"Age": 0, "City": "Unknown", "Score": df["Score"].mean()})
print(filled_df)
👉 Output (Score filled with mean ≈ 86.33):
      Name   Age     City      Score
0    Alice  25.0   London  95.000000
1      Bob   0.0    Paris  88.000000
2  Charlie  30.0  Unknown  86.333333
3    David   0.0   Berlin  76.000000
4      Eva  22.0   Madrid  86.333333
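When there are many numeric columns, writing the dictionary by hand gets tedious. As a hedged sketch, you could compute one statistic per numeric column and pass the resulting Series to fillna(); the median is used here only as an example of a statistic that is less sensitive to outliers than the mean, not as a recommendation for your data.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this block runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Pick out the numeric columns (Age and Score in this dataset).
numeric_cols = df.select_dtypes(include="number").columns

# One median per numeric column; missing values are skipped when computing it.
medians = df[numeric_cols].median()

# A Series passed to fillna() is matched to columns by name,
# so the text columns are left untouched.
filled = df.fillna(medians)
print(filled)
```

City still contains its missing value afterwards, so text columns need their own rule.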
Forward/Backward Fill
# Forward fill (carry the previous value down)
# fillna(method="ffill") is deprecated since pandas 2.1; use ffill() instead
print(df.ffill())
# Backward fill (pull the next value up)
print(df.bfill())
👉 Forward Fill Output:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob  25.0   Paris   88.0
2  Charlie  30.0   Paris   88.0
3    David  30.0  Berlin   76.0
4      Eva  22.0  Madrid   76.0
👉 Backward Fill Output:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob  30.0   Paris   88.0
2  Charlie  30.0  Berlin   76.0
3    David  22.0  Berlin   76.0
4      Eva  22.0  Madrid    NaN
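Notice that backward fill left Eva’s Score as NaN because there is no later value to copy; forward fill has the same problem with gaps at the start. The sketch below uses a hypothetical series called readings (not part of the sample data) to show two common adjustments: capping how far a value is carried with limit, and chaining both directions to cover gaps at either end.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a run of three gaps.
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1 carries each known value forward by at most one position,
# so long runs of missing data are not silently papered over.
capped = readings.ffill(limit=1)
print(capped)

# ffill() cannot reach the leading gap; a following bfill() covers it.
both = readings.ffill().bfill()
print(both)
```

Whether carrying values across long gaps is acceptable depends on the data; it tends to make more sense for ordered data such as time series than for unordered rows like the people in this example.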
⚠️ Common Mistakes and Best Practices
- Dropping Too Aggressively: Don’t just call dropna() without checking first; you may lose valuable data.
- Wrong Fill Strategy: Using 0 for missing ages or incomes can distort results. Think carefully about replacements.
- Mixing Data Types: Filling text columns with numbers (or vice versa) leaves a column with mixed types, as fillna(0) did to City above, and that can cause type errors later on.
- Mean Imputation Overuse: Filling all numeric missing values with the mean can hide important variance.
✅ Rule of thumb: Understand your data before choosing a cleaning strategy.
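The warning about mean imputation is easy to check on the Score values from this post. The sketch below compares the sample standard deviation before and after filling with the mean; it is an illustration on five values, and the variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# The Score column from the sample data.
scores = pd.Series([95, 88, np.nan, 76, np.nan], name="Score")

# Replace the two gaps with the mean of the three known values.
imputed = scores.fillna(scores.mean())

# std() skips missing values, so this uses the three known scores only.
std_before = scores.std()

# After imputation there are five values, two of them exactly at the mean.
std_after = imputed.std()

print(round(std_before, 2), round(std_after, 2))
```

The mean stays the same, but the spread shrinks (from about 9.6 to about 6.8 here) because the filled values sit exactly at the center of the distribution.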