In the messy reality of data science, missing values are an inevitability. Whether it's from a faulty sensor, a human input error, or a skipped field, incomplete data can throw off your entire analysis and lead to unreliable machine learning models. The first critical step in wrangling this data is to find out exactly where those missing pieces are.
Thankfully, the Pandas library provides two indispensable methods, isna() and notna(),
that act like a digital metal detector for your data, helping you quickly pinpoint and understand missing values.
🔹 What are isna() and notna()?
Think of these two methods as boolean gatekeepers for your DataFrame. They inspect every single cell and
return a new DataFrame of the same shape, filled with True or False values.
- isna(): Returns True for every cell that contains a missing value (such as NaN, None, or NaT for dates), and False for everything else. It's the "is it missing?" function.
- notna(): The direct opposite. It returns True for every cell that contains a valid value (i.e., not missing) and False for missing values. It's the "is it present?" function.
Let's see this in action with a simple DataFrame. We'll use numpy to represent some missing values as np.nan.
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, np.nan, 30, None], # Missing values
"City": ["London", "Paris", None, "Berlin"], # Missing value
"Score": [95, 88, np.nan, 76] # Missing value
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")
# Use isna() to detect missing values
print("DataFrame after df.isna():")
print(df.isna())
print("\n" + "="*30 + "\n")
# Use notna() to detect present values
print("DataFrame after df.notna():")
print(df.notna())
👉 Output:
Original DataFrame:
Name Age City Score
0 Alice 25.0 London 95.0
1 Bob NaN Paris 88.0
2 Charlie 30.0 None NaN
3 David NaN Berlin 76.0
==============================
DataFrame after df.isna():
Name Age City Score
0 False False False False
1 False True False False
2 False False True True
3 False True False False
==============================
DataFrame after df.notna():
Name Age City Score
0 True True True True
1 True False True True
2 True True False False
3 True False True True
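The sample DataFrame has no date column, so here is a small sketch of the NaT case mentioned in the definitions. The dates Series is a made-up example, not part of the data above. It also checks that notna() is simply the element-wise opposite of isna().

```python
import numpy as np
import pandas as pd

# Hypothetical date column: one real timestamp and one missing value (NaT)
dates = pd.Series(pd.to_datetime(["2024-01-15", None]))

print(dates.isna().tolist())   # [False, True]
print(dates.notna().tolist())  # [True, False]

# The top-level pd.isna() also accepts single values
print(pd.isna(np.nan), pd.isna(None), pd.isna(pd.NaT))  # True True True

# notna() is the element-wise negation of isna()
print(dates.notna().equals(~dates.isna()))  # True
```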
🔹 Beyond Simple Detection
While seeing a sea of True and False is helpful,
the real power of isna() and notna() comes from combining them with other Pandas methods
for data analysis and cleaning. Here are a few powerful techniques.
Counting Missing Values
By chaining isna() with .sum(), we can get a quick summary of missing values per column.
# Count the number of missing values per column
missing_counts = df.isna().sum()
print("Total missing values per column:")
print(missing_counts)
👉 Output:
Total missing values per column:
Name 0
Age 2
City 1
Score 1
dtype: int64
To get percentages instead of raw counts:
# Percentage of missing values per column
missing_percentages = (df.isna().sum() / len(df)) * 100
print("Percentage of missing values per column:")
print(missing_percentages)
👉 Output:
Percentage of missing values per column:
Name 0.0
Age 50.0
City 25.0
Score 25.0
dtype: float64
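A minimal sketch of a few related summaries, using the same sample data (rebuilt here so the snippet runs on its own). Because True counts as 1, taking mean() of the boolean mask gives the missing fraction directly. The names total_missing and rows_with_missing are just illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],
    "City": ["London", "Paris", None, "Berlin"],
    "Score": [95, 88, np.nan, 76],
})

# The mean of a boolean mask is the fraction of True values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isna().sum().sum())
print(total_missing)  # 4

# Number of rows that have at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)  # 3
```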
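Filtering Rows with Missing Data
The boolean Series returned by isna() or notna() on a single column can be used as a mask to select rows.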
# Rows where 'Age' is missing
missing_age_rows = df[df["Age"].isna()]
print("Rows where 'Age' is missing:")
print(missing_age_rows)
print("\n" + "="*30 + "\n")
# Rows where 'Score' is not missing
complete_score_rows = df[df["Score"].notna()]
print("Rows where 'Score' is not missing:")
print(complete_score_rows)
👉 Output:
Rows where 'Age' is missing:
Name Age City Score
1 Bob NaN Paris 88.0
3 David NaN Berlin 76.0
==============================
Rows where 'Score' is not missing:
Name Age City Score
0 Alice 25.0 London 95.0
1 Bob NaN Paris 88.0
3 David NaN Berlin 76.0
This kind of filtering is useful for creating clean subsets of data or preparing datasets for machine learning.
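The same idea extends across columns. The sketch below, again on the sample data, selects rows with any missing value, rows with none, and rows matching a combined condition. Variable names such as incomplete_rows are illustrative choices.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, None],
    "City": ["London", "Paris", None, "Berlin"],
    "Score": [95, 88, np.nan, 76],
})

# Rows with at least one missing value in any column
incomplete_rows = df[df.isna().any(axis=1)]
print(incomplete_rows["Name"].tolist())  # ['Bob', 'Charlie', 'David']

# Rows where every column is present
complete_rows = df[df.notna().all(axis=1)]
print(complete_rows["Name"].tolist())  # ['Alice']

# Combine masks with & (and) or | (or): Age missing but Score present
age_missing_score_present = df[df["Age"].isna() & df["Score"].notna()]
print(age_missing_score_present["Name"].tolist())  # ['Bob', 'David']
```

Selecting fully complete rows this way gives the same result as df.dropna() with its default arguments.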
🔹 Visualizing Missing Data
Sometimes a chart reveals patterns better than tables. Plotting isna().sum()
gives a quick visual summary of completeness.
import matplotlib.pyplot as plt
# Create a bar chart of missing values
df.isna().sum().plot(kind="bar", title="Missing Values per Column")
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.show()
🖼️ Expected bars: Name: 0, Age: 2, City: 1, Score: 1.
This helps you prioritize cleaning efforts.
⚠️ Common Pitfalls and Best Practices
- The "" and "NaN" Trap: Empty strings ("") and the string "NaN" are valid text values, not missing data. Convert them to np.nan first.
- isna() vs. isnull(): Both do the same job, but isna() is preferred for modern Pandas consistency.
- Don’t Rush to Drop: Always check how much data you'd lose before dropping rows. Sometimes filling is a better choice.
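The pitfalls above can be sketched in a few lines. The raw DataFrame here is a hypothetical example with text placeholders, and using replace() to convert them is one reasonable approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data where missing cities were recorded as text placeholders
raw = pd.DataFrame({
    "City": ["London", "", "NaN", "Berlin"],
    "Score": [95.0, 88.0, np.nan, 76.0],
})

# The placeholders are ordinary strings, so isna() does not flag them
print(raw["City"].isna().sum())  # 0

# Convert the placeholders to real missing values first
cleaned = raw.assign(City=raw["City"].replace(["", "NaN"], np.nan))
print(cleaned["City"].isna().sum())  # 2

# isnull() returns the same mask as isna()
print(cleaned.isnull().equals(cleaned.isna()))  # True

# Check how many rows dropna() would remove before actually dropping them
rows_lost = len(cleaned) - len(cleaned.dropna())
print(f"dropna() would remove {rows_lost} of {len(cleaned)} rows")
```

In this made-up example, dropping would discard half of the rows, which is the kind of loss worth knowing about before choosing between dropping and filling.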