Wednesday, August 27, 2025

Handling missing values (dropna(), fillna())

Detecting missing values is only the first step — the real challenge lies in deciding what to do with them. Missing data can distort averages, bias models, or simply break your analysis pipeline. Pandas gives us two powerful tools for this: dropna() (removing missing values) and fillna() (replacing them).

The choice depends on context: sometimes dropping is safe; other times filling preserves important patterns. As a data scientist, you must balance accuracy with completeness.


🔹 The Sample Data

Let’s start with a small dataset containing missing values:


import pandas as pd
import numpy as np

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
    

👉 Output:


Original DataFrame:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   NaN   Paris   88.0
2  Charlie  30.0    None    NaN
3    David   NaN  Berlin   76.0
4      Eva  22.0  Madrid    NaN
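
Before choosing between dropping and filling, it usually helps to count how much is actually missing. The snippet below is a minimal sketch on the same sample data; the variable names (missing_per_column, missing_per_row, missing_share) are illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isna().mean()

print(missing_per_column)
print(missing_share)
```

On this dataset, Age and Score are each missing in 2 of 5 rows, City in 1, and Name in none.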
    

🔹 Method 1: Removing Missing Values with dropna()

The most straightforward way to handle missing data is to remove it. By default, dropna() removes every row that contains at least one missing value; with axis=1 it removes columns instead.


# Drop rows with any missing values
print(df.dropna())
    

👉 Output:


    Name   Age    City  Score
0  Alice  25.0  London   95.0
    

Only Alice’s row remains because every other row had at least one missing value.
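
Since a plain dropna() can discard most of a small dataset, it can be worth measuring the loss first. This is a rough sketch on the same sample data; rows_before, rows_after, and share_lost are names made up for the example.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Compare row counts before and after dropping incomplete rows
rows_before = len(df)
rows_after = len(df.dropna())
share_lost = 1 - rows_after / rows_before

print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows ({share_lost:.0%})")
```

Here 4 of the 5 rows would be removed, which is a strong hint that a gentler strategy is needed.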

Remove Columns with Missing Data


# Drop entire columns with any missing values
print(df.dropna(axis=1))
    

👉 Output:


      Name
0    Alice
1      Bob
2  Charlie
3    David
4      Eva
    

Here, only Name survives because every other column had at least one missing value.

Keeping Rows with Enough Data

Sometimes we don’t want to drop everything, only rows that are too incomplete. Use thresh:


# Keep rows with at least 3 non-missing values
print(df.dropna(thresh=3))
    

👉 Output:


      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   NaN   Paris   88.0
3    David   NaN  Berlin   76.0
4      Eva  22.0  Madrid    NaN
    

Only Charlie’s row is dropped: it has just two valid entries (Name and Age), while every other row has at least three.
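
dropna() also accepts subset and how, which let you drop rows based only on the columns that matter for your analysis. The sketch below assumes, purely for illustration, that Score is the column we cannot do without; has_score and not_empty are hypothetical names.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Drop a row only when Score is missing; gaps in other columns are tolerated
has_score = df.dropna(subset=["Score"])

# Drop a row only when BOTH Age and Score are missing
not_empty = df.dropna(subset=["Age", "Score"], how="all")

print(has_score)
print(not_empty)
```

With this data, has_score keeps Alice, Bob, and David, while not_empty keeps all five rows because no row is missing both Age and Score.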


🔹 Method 2: Filling Missing Values with fillna()

Instead of discarding data, we can fill missing values with meaningful replacements. This keeps the dataset at its original size, which is often useful in machine learning.

Filling with a Constant Value


# Fill missing values with 0
print(df.fillna(0))
    

👉 Output:


      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   0.0   Paris   88.0
2  Charlie  30.0       0    0.0
3    David   0.0  Berlin   76.0
4      Eva  22.0  Madrid    0.0
    

Column-Specific Filling

We can replace missing values differently in each column using a dictionary:


# Fill Age with 0, City with 'Unknown', and Score with the column mean
filled_df = df.fillna({"Age": 0, "City": "Unknown", "Score": df["Score"].mean()})
print(filled_df)
    

👉 Output (Score filled with mean ≈ 86.33):


      Name   Age     City       Score
0    Alice  25.0   London   95.000000
1      Bob   0.0    Paris   88.000000
2  Charlie  30.0  Unknown   86.333333
3    David   0.0   Berlin   76.000000
4      Eva  22.0   Madrid   86.333333
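
The mean is only one option. As a hedged alternative, the sketch below fills the numeric columns with their medians, which are less sensitive to outliers, and keeps 'Unknown' for City. Passing a Series to fillna() fills each column whose name matches the Series index; medians and filled are illustrative names.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Median of each numeric column (missing values are skipped by default)
medians = df[["Age", "Score"]].median()

# Fill Age and Score with their medians, then City with a placeholder
filled = df.fillna(medians).fillna({"City": "Unknown"})

print(filled)
```

On this data the median Age is 25 and the median Score is 88, so those are the values that replace the gaps.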
    

Forward/Backward Fill


# Forward fill (use previous value)
# Note: fillna(method=...) is deprecated since pandas 2.1; use ffill()/bfill() instead
print(df.ffill())

# Backward fill (use next value)
print(df.bfill())
    

👉 Forward Fill Output:


      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob  25.0   Paris   88.0
2  Charlie  30.0   Paris   88.0
3    David  30.0  Berlin   76.0
4      Eva  22.0  Madrid   76.0
    

👉 Backward Fill Output:


      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob  30.0   Paris   88.0
2  Charlie  30.0  Berlin   76.0
3    David  22.0  Berlin   76.0
4      Eva  22.0  Madrid   NaN
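
Notice that backward fill leaves Eva’s Score as NaN, because there is no later row to copy from. The sketch below shows two common adjustments, offered as an illustration rather than a recommendation: chaining bfill() with ffill() to close leftover gaps, and using limit to cap how many consecutive gaps get filled. The readings Series is hypothetical data made up for the example.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# bfill() alone leaves the last Score missing; a follow-up ffill() closes that gap
filled_both = df.bfill().ffill()

# limit=1 fills at most one consecutive missing value after each valid one
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])  # hypothetical data
limited = readings.ffill(limit=1)

print(filled_both)
print(limited)
```

Forward and backward fill copy values from neighbouring rows, so they make the most sense when row order is meaningful, as in time series.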
    

⚠️ Common Mistakes and Best Practices

  • Dropping Too Aggressively: Don’t just call dropna() without checking — you may lose valuable data.
  • Wrong Fill Strategy: Using 0 for missing ages or incomes can distort results. Think carefully about replacements.
  • Mixing Data Types: Filling a text column with a number (or a numeric column with text) leaves that column with mixed types, which can cause errors or surprising results in later operations.
  • Mean Imputation Overuse: Filling all numeric missing values with the mean can hide important variance.

✅ Rule of thumb: Understand your data before choosing a cleaning strategy.
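
One way to soften the risks above is to record which values were imputed before filling them, so later analysis can still tell real values from replacements. This is a minimal sketch under that assumption; the Age_was_missing column name is hypothetical, and the median is just one possible fill value.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 30, None, 22],
    "City": ["London", "Paris", None, "Berlin", "Madrid"],
    "Score": [95, 88, np.nan, 76, None],
})

# Work on a copy so the original data stays untouched
flagged = df.copy()

# Remember which ages were missing before any filling happens
flagged["Age_was_missing"] = flagged["Age"].isna()

# Fill the gaps with the median of the observed ages
flagged["Age"] = flagged["Age"].fillna(flagged["Age"].median())

print(flagged[["Name", "Age", "Age_was_missing"]])
```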


