Wednesday, August 27, 2025

🐍 Detecting Missing Values in Datasets with Pandas

In the messy reality of data science, missing values are an inevitability. Whether it's from a faulty sensor, a human input error, or a skipped field, incomplete data can throw off your entire analysis and lead to unreliable machine learning models. The first critical step in wrangling this data is to find out exactly where those missing pieces are.

Thankfully, the Pandas library provides two indispensable methods, isna() and notna(), that act like a digital metal detector for your data, helping you quickly pinpoint and understand missing values.


🔹 What are isna() and notna()?

Think of these two methods as boolean gatekeepers for your DataFrame. They inspect every cell and return a new DataFrame of the same shape, with the same index and column labels, filled with True or False values. They work the same way on a single Series (one column).

  • isna(): Returns True for every cell that contains a missing value (such as NaN, None, pd.NA, or NaT for datetimes), and False for everything else. It's the "is it missing?" function.
  • notna(): The direct opposite. It returns True for every cell that contains a valid value (i.e., not missing) and False for missing values. It's the "is it present?" function.
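Both checks are also available as top-level functions, pd.isna() and pd.notna(), which accept single values as well as Series. Here is a minimal sketch on a few hand-picked values (the sample values are illustrative only):

```python
import numpy as np
import pandas as pd

# Values that pandas treats as missing
for value in [np.nan, None, pd.NaT, pd.NA]:
    print(repr(value), "->", pd.isna(value))   # True each time

# Values that look empty but are NOT missing
for value in ["", "NaN", 0, False]:
    print(repr(value), "->", pd.isna(value))   # False each time

# The same check on a Series returns one boolean per element
ages = pd.Series([25, np.nan, 30, None])
print(ages.isna().tolist())    # [False, True, False, True]
print(ages.notna().tolist())   # [True, False, True, False]
```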

Let's see this in action with a simple DataFrame. We'll use numpy to represent some missing values as np.nan.


import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],               # Missing values
    "City": ["London", "Paris", None, "Berlin"], # Missing value
    "Score": [95, 88, np.nan, 76]                # Missing value
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")

# Use isna() to detect missing values
print("DataFrame after df.isna():")
print(df.isna())
print("\n" + "="*30 + "\n")

# Use notna() to detect present values
print("DataFrame after df.notna():")
print(df.notna())
    

👉 Output:


Original DataFrame:
      Name   Age    City  Score
0    Alice  25.0  London   95.0
1      Bob   NaN   Paris   88.0
2  Charlie  30.0    None    NaN
3    David   NaN  Berlin   76.0

==============================

DataFrame after df.isna():
    Name    Age   City  Score
0  False  False  False  False
1  False   True  False  False
2  False  False   True   True
3  False   True  False  False

==============================

DataFrame after df.notna():
   Name    Age   City  Score
0  True   True   True   True
1  True  False   True   True
2  True   True  False  False
3  True  False   True   True
    

🔹 Beyond Simple Detection

While seeing a sea of True and False is helpful, the real power of isna() and notna() comes from combining them with other Pandas methods for data analysis and cleaning. Here are a few powerful techniques.

Counting Missing Values

By chaining isna() with .sum(), we can get a quick summary of missing values per column. This works because True counts as 1 and False as 0 when summed.


# Count the number of missing values per column
missing_counts = df.isna().sum()
print("Total missing values per column:")
print(missing_counts)
    

👉 Output:


Total missing values per column:
Name     0
Age      2
City     1
Score    1
dtype: int64
    

To get percentages instead of raw counts:


# Percentage of missing values per column
missing_percentages = (df.isna().sum() / len(df)) * 100
print("Percentage of missing values per column:")
print(missing_percentages)
    

👉 Output:


Percentage of missing values per column:
Name      0.0
Age      50.0
City     25.0
Score    25.0
dtype: float64
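
The same idea extends in a few directions. As a small sketch (the DataFrame is rebuilt here so the snippet runs on its own, and the variable names are only illustrative): .mean() on the boolean mask gives the missing fraction directly, axis=1 counts per row instead of per column, and a double .sum() gives one grand total.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],
    "City": ["London", "Paris", None, "Berlin"],
    "Score": [95, 88, np.nan, 76],
})

# Mean of a boolean mask = fraction of True values, so this is the percentage missing
pct_missing = df.isna().mean() * 100
print(pct_missing)

# axis=1 sums across columns: number of missing cells in each row
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row)

# Summing twice collapses everything into one total
total_missing = int(df.isna().sum().sum())
print("Total missing cells:", total_missing)
```

The per-row count is handy for spotting records that are mostly empty, which a per-column summary cannot show.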
    

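Filtering Rows with Missing Data

Because isna() and notna() return a boolean Series when called on a single column, the result can be used directly as a mask to select rows.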


# Rows where 'Age' is missing
missing_age_rows = df[df["Age"].isna()]
print("Rows where 'Age' is missing:")
print(missing_age_rows)

print("\n" + "="*30 + "\n")

# Rows where 'Score' is not missing
complete_score_rows = df[df["Score"].notna()]
print("Rows where 'Score' is not missing:")
print(complete_score_rows)
    

👉 Output:


Rows where 'Age' is missing:
    Name  Age    City  Score
1    Bob  NaN   Paris   88.0
3  David  NaN  Berlin   76.0

==============================

Rows where 'Score' is not missing:
    Name   Age    City  Score
0  Alice  25.0  London   95.0
1    Bob   NaN   Paris   88.0
3  David   NaN  Berlin   76.0
    

This is crucial for creating clean subsets of data or preparing datasets for machine learning.
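
Masks can also be built across all columns at once. The sketch below, which rebuilds the sample DataFrame so it runs standalone, combines isna() with .any(axis=1) to find rows with at least one gap, and notna() with .all(axis=1) to keep only fully populated rows. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],
    "City": ["London", "Paris", None, "Berlin"],
    "Score": [95, 88, np.nan, 76],
})

# Rows with at least one missing value in any column
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)

# Rows where every column is present
complete_rows = df[df.notna().all(axis=1)]
print(complete_rows)

# Masks combine with & (and) and | (or): Age missing but Score present
age_missing_score_present = df[df["Age"].isna() & df["Score"].notna()]
print(age_missing_score_present)
```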


🔹 Visualizing Missing Data

Sometimes a chart reveals patterns better than tables. Plotting isna().sum() gives a quick visual summary of completeness.


import matplotlib.pyplot as plt

# Create a bar chart of missing values
df.isna().sum().plot(kind="bar", title="Missing Values per Column")
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.show()
    

🖼️ Expected bars: Name: 0, Age: 2, City: 1, Score: 1. This helps you prioritize cleaning efforts.
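
A bar chart shows how much is missing but not where. As a rough sketch, the boolean mask itself can be drawn as a grid with matplotlib's imshow(), one cell per DataFrame value. The colormap, labels, and title are arbitrary choices, and the DataFrame is rebuilt so the snippet runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],
    "City": ["London", "Paris", None, "Berlin"],
    "Score": [95, 88, np.nan, 76],
})

# 1 where a value is missing, 0 where it is present
mask = df.isna().to_numpy().astype(int)

fig, ax = plt.subplots()
# "gray_r" maps 0 to white and 1 to black, so missing cells appear dark
ax.imshow(mask, cmap="gray_r", aspect="auto", interpolation="nearest")
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(list(df.columns))
ax.set_yticks(range(len(df)))
ax.set_yticklabels(list(df.index))
ax.set_xlabel("Columns")
ax.set_ylabel("Row index")
ax.set_title("Missing Values by Cell (dark = missing)")
plt.show()
```

On larger datasets, dark bands that line up across several columns suggest the values tend to go missing together.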


⚠️ Common Pitfalls and Best Practices

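  • The "" and "NaN" Trap: Empty strings ("") and the string "NaN" are valid text values, not missing data, so isna() returns False for them. Convert them to np.nan first, for example with replace().
  • isna() vs. isnull(): isnull() is an alias of isna(), so both return identical results. Prefer isna() for consistency with notna(), dropna(), and fillna().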
  • Don’t Rush to Drop: Always check how much data you'd lose before dropping rows. Sometimes filling is a better choice.
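
The first and last pitfalls can be sketched together. The example below uses a small made-up DataFrame (not the one from earlier) in which "" and "NaN" stand in for missing text. It converts those placeholders with replace(), then checks how many rows dropna() would remove before committing to it. The placeholder list is an assumption and should be adapted to your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with text placeholders instead of real missing values
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "", "David"],
    "City": ["London", "NaN", "Paris", "Berlin"],
    "Score": [95.0, 88.0, np.nan, 76.0],
})

# Only the genuine np.nan in "Score" is detected at this point
print("Missing before cleaning:", int(raw.isna().sum().sum()))

# Treat the placeholder strings as missing (assumed list; extend as needed)
placeholders = ["", "NaN"]
cleaned = raw.replace(placeholders, np.nan)
print("Missing after cleaning:", int(cleaned.isna().sum().sum()))

# Check the cost of dropping rows before actually doing it
rows_before = len(cleaned)
rows_after = len(cleaned.dropna())
lost_pct = (rows_before - rows_after) / rows_before * 100
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows ({lost_pct:.0f}%)")
```

If the loss is large, filling values with fillna() or dropping only on selected columns via dropna(subset=[...]) may preserve more of the data.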

