Mastering Pandas — Part 3: Data Cleaning, Merging & Joining


Real Data Is Never Clean

In Part 2, you learned how to group and slice a DataFrame with precision. But in real projects, the data you receive rarely arrives ready to analyze. Column names have spaces. Ages are negative. Phone numbers mix letters with digits. Duplicate rows sneak in. And your data often lives across multiple files that need to be combined before you can do anything useful.

This article covers the two skills that bridge raw data and real analysis:

  • Data Cleaning — detecting and fixing everything wrong with your data before it corrupts your results
  • Merging & Joining — combining separate DataFrames into a single, unified dataset

These topics flow naturally into each other. You clean each source first, then you combine them. That's the order every data analyst follows in practice — and it's the order we'll follow here.

We'll use this sample dataset throughout the cleaning section:

import pandas as pd
import numpy as np

data = {
    "Name":             ["Alice", "bob", "Carol", "Alice", "  dave  ", "Eve"],
    "Age":              [25, -5, 200, 25, 30, None],
    "Phone":            ["055-123-4567", None, "55123x", "055-123-4567", "055-987-0000", "055abc"],
    "Email":            ["alice@example.com", "bob@example.com", None, "alice@example.com", "dave@example.com", "not-an-email"],  # illustrative addresses (originals scrubbed by the blog platform)
    "Paying_Customer":  ["Y", "N", "Yes", "Y", "N", "y"],
    "Score":            [85, 110, 72, 85, -10, 90],
}
df = pd.DataFrame(data)

Step 1 — Inspect Before You Touch Anything

The most important rule in data cleaning: look before you act. Jumping straight into fixes without understanding the full picture leads to mistakes you won't catch until much later.

These are the inspection tools you should run on every new dataset:

df.shape          # (rows, columns) — how big is this?
df.info()         # column names, data types, non-null counts
df.head()         # first 5 rows
df.tail()         # last 5 rows
df.describe()     # min, max, mean, std — spot impossible values here

For missing values specifically:

df.isna().sum()           # count NaN per column
df.isnull().sum()         # same as isna() — both work
df[df["Phone"].notna()]   # rows that actually have a phone number

For duplicates and inconsistent categories:

df.duplicated()                         # True/False per row
df[df.duplicated()]                     # show only the duplicate rows
df["Paying_Customer"].value_counts()    # reveals: Y, N, Yes, y — all meaning the same thing

Always run describe() on numeric columns — the min and max alone will tell you immediately if something is wrong. A minimum age of -5 or a score of 110 out of 100 is a red flag you can't miss.
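A quick sanity pass on the sample data makes this concrete. The sketch below rebuilds just the numeric columns from the DataFrame above and checks min/max and missing counts:

```python
import pandas as pd

df = pd.DataFrame({
    "Age":   [25, -5, 200, 25, 30, None],
    "Score": [85, 110, 72, 85, -10, 90],
})

# min/max immediately expose impossible values
print(df["Age"].min())            # -5.0  -> negative age
print(df["Age"].max())            # 200.0 -> impossibly old
print(df["Score"].max())          # 110   -> over the 100-point scale
print(int(df["Age"].isna().sum()))  # 1   -> one missing Age
```

Two lines of output are enough to know this column needs Step 7's invalid-value treatment before any averaging.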


Step 2 — Fix Column Names First

Messy column names make everything harder. Clean them before anything else, so every subsequent operation uses predictable, consistent names.

Rename specific columns:

df = df.rename(columns={"Old Name": "New_Name", "phone number": "Phone"})

Clean all column names at once — the standard one-liner:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

This strips surrounding spaces, converts to lowercase, and replaces spaces with underscores in a single chain. After this, "First Name" becomes "first_name", " Age " becomes "age", and so on.

Before:              After:
"First Name"    →    "first_name"
"  Age "        →    "age"
"Phone Number"  →    "phone_number"
"ZIP Code"      →    "zip_code"
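The one-liner can be verified on an empty frame with exactly the messy names from the table above:

```python
import pandas as pd

# An empty DataFrame is enough to exercise the column-name chain
df = pd.DataFrame(columns=["First Name", "  Age ", "Phone Number", "ZIP Code"])

# strip edges, lowercase, replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))  # ['first_name', 'age', 'phone_number', 'zip_code']
```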

Step 3 — Handle Missing Values

Missing values appear as NaN in Pandas. Before deciding what to do with them, you need to understand why they're missing — because the right fix depends on the cause.

Fill missing values with a fixed value:

df["Phone"] = df["Phone"].fillna("000-000-0000")      # placeholder for missing phone
df["Age"]   = df["Age"].fillna(0)                     # fill numeric with 0
df["Name"]  = df["Name"].fillna("Unknown")            # fill text with label
df["Score"] = df["Score"].fillna(df["Score"].mean())  # fill with column average

Note the assignment back: fillna returns a new Series and leaves df untouched unless you store the result.

Fill missing numeric values by estimating from neighbors:

df["Sales"] = df["Sales"].interpolate()

interpolate() fills each NaN by linear interpolation, estimating a point on the straight line between the known values on either side. This is ideal for time series or sensor data where values change gradually; it would be misleading for unordered or categorical data.

Before:     After interpolate():
100         100
NaN    →    150   ← estimated midpoint
200         200
NaN    →    225   ← estimated midpoint
250         250
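The before/after table above corresponds directly to this two-line sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([100, np.nan, 200, np.nan, 250])

# Linear interpolation fills each gap from its neighbors
print(s.interpolate().tolist())  # [100.0, 150.0, 200.0, 225.0, 250.0]
```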

Remove rows with missing values:

df.dropna()                            # drop any row with at least one NaN
df.dropna(subset=["Phone"])            # drop only if Phone is NaN
df.dropna(thresh=3)                    # drop rows that have fewer than 3 non-NaN values

inplace rule: use either inplace=True or variable assignment — never both. They do the same thing, and combining them produces a bug where df becomes None.

df.dropna(subset=["Phone"], inplace=True)   # ✅ modifies df directly
df = df.dropna(subset=["Phone"])            # ✅ returns new df
df = df.dropna(subset=["Phone"], inplace=True)  # ❌ df becomes None
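The failure mode is easy to reproduce: inplace methods return None, so assigning their result throws away the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Phone": ["055-123-4567", None]})

# inplace=True mutates df and returns nothing
result = df.dropna(subset=["Phone"], inplace=True)
print(result)    # None — this is the value you'd have overwritten df with
print(len(df))   # 1  — df itself was modified correctly
```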

Step 4 — Remove Duplicates

Duplicate rows silently inflate counts, averages, and totals. Always check for them.

df.drop_duplicates()                         # remove fully identical rows
df.drop_duplicates(subset=["Name"])          # remove rows with duplicate Name only
df.drop_duplicates(subset=["Name", "Email"]) # remove rows where both columns match
df.drop_duplicates(keep="last")              # keep the last occurrence instead of the first

Before:                              After drop_duplicates():
Name   Email                         Name   Email
Alice  alice@example.com             Alice  alice@example.com
Bob    bob@example.com          →    Bob    bob@example.com
Alice  alice@example.com ← dup       Carol  carol@example.com
Carol  carol@example.com
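The difference between full-row and subset deduplication shows up as soon as a "duplicate" person has a different email. A small sketch (hypothetical rows, not the article's sample df):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Alice", "Alice", "Carol"],
    "Email": ["a@example.com", "b@example.com", "a@example.com",
              "x@example.com", "c@example.com"],
})

# Full-row: only the exact Alice/a@example.com repeat is dropped
print(len(df.drop_duplicates()))                 # 4

# Subset: every repeated Name collapses, regardless of Email
print(len(df.drop_duplicates(subset=["Name"])))  # 3
```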

Step 5 — Fix Data Types

Pandas sometimes reads numeric columns as strings, or dates as plain text. Operations on wrong types either crash or silently produce wrong results.

Change column type directly:

df["Zip_Code"] = df["Zip_Code"].astype(str)     # treat as text, not a number
df["Age"] = df["Age"].astype(int)               # convert to integer
df["Price"] = df["Price"].astype(float)         # convert to decimal

Handle messy numbers that contain symbols or text:

df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

errors="coerce" converts anything that can't be parsed into NaN instead of crashing. This is the safe way to clean columns like "$1,200" or "N/A" that can't be directly cast with astype().

Before:           After pd.to_numeric(..., errors="coerce"):
"1200"    →       1200.0
"$850"    →       NaN     ← couldn't parse
"700.5"   →       700.5
"N/A"     →       NaN     ← couldn't parse
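A sketch of the coercion behavior from the table above, plus a common follow-up: strip currency symbols first if you want values like "$850" to survive instead of becoming NaN.

```python
import pandas as pd

prices = pd.Series(["1200", "$850", "700.5", "N/A"])

coerced = pd.to_numeric(prices, errors="coerce")
print(coerced.tolist())    # [1200.0, nan, 700.5, nan]

# Removing $ and , first recovers the parseable values
recovered = pd.to_numeric(prices.str.replace(r"[$,]", "", regex=True),
                          errors="coerce")
print(recovered.tolist())  # [1200.0, 850.0, 700.5, nan]
```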

Convert to datetime:

df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")  # specify format if needed

Once a column is proper datetime, you unlock sorting by date, filtering by year/month, calculating time differences, and much more.
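A small sketch of what proper datetimes unlock, using hypothetical day-first dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["05/03/2024", "28/02/2024"]),
                       format="%d/%m/%Y")

print(dates.dt.year.tolist())            # [2024, 2024]
print(dates.dt.month.tolist())           # [3, 2]
print((dates.max() - dates.min()).days)  # 6 — time arithmetic for free
```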


Step 6 — Clean Strings

String columns are almost always the messiest part of a real dataset. Pandas provides a full .str accessor with everything you need.

Strip unwanted characters from edges:

df["Name"] = df["Name"].str.strip()           # remove spaces from both sides
df["Name"] = df["Name"].str.strip(" ./-")     # remove spaces, dots, slashes, dashes
df["Name"] = df["Name"].str.lstrip()          # left side only
df["Name"] = df["Name"].str.rstrip()          # right side only

Replace characters anywhere in the string:

df["Phone"] = df["Phone"].str.replace(r"[a-zA-Z]", "", regex=True)   # remove all letters
df["Phone"] = df["Phone"].str.replace(r"\D", "", regex=True)          # keep digits only
df["Name"]  = df["Name"].str.replace(r"[./]", "", regex=True)         # remove dots and slashes

Standardize text case:

df["Name"] = df["Name"].str.lower()     # all lowercase → "alice"
df["Name"] = df["Name"].str.upper()     # all uppercase → "ALICE"
df["Name"] = df["Name"].str.title()     # title case    → "Alice"

Check text content:

df[df["Email"].str.contains("@", na=False)]  # filter valid-looking emails (na=False skips NaN)
df[df["Phone"].str.len() != 12]             # find rows where phone length is wrong
df[df["Name"].str.startswith("A")]          # names starting with A
df[df["Name"].str.endswith("e")]            # names ending with e
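One gotcha worth calling out: when a string column contains NaN, str.contains() returns NaN for those rows, and filtering with a mask containing NaN raises an error in recent pandas versions. Passing na=False treats missing values as non-matches:

```python
import pandas as pd

emails = pd.DataFrame({"Email": ["alice@example.com", None, "not-an-email"]})

# na=False turns the NaN row's result into False instead of NaN
valid = emails[emails["Email"].str.contains("@", na=False)]
print(valid["Email"].tolist())  # ['alice@example.com']
```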

Extract a pattern using regex:

df["Zip"] = df["Address"].str.extract(r"(\d{5})")        # pull 5-digit zip code
df["Area_Code"] = df["Phone"].str.extract(r"^(\d{3})")   # pull first 3 digits

Split one column into multiple columns:

df[["Street", "State", "Zip"]] = (
    df["Address"]
    .str.split(",", n=2, expand=True)
)

n=2 limits the split to 2 times, producing at most 3 parts no matter how many commas the string contains. expand=True spreads those parts into separate columns. Note that split keeps the space that follows each comma, so run .str.strip() on the new columns afterwards.

Before:                              After:
"123 Main St, Texas, 75001"   →   Street="123 Main St"  State="Texas"  Zip="75001"
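End to end, including the strip step the raw split leaves behind:

```python
import pandas as pd

df = pd.DataFrame({"Address": ["123 Main St, Texas, 75001"]})

parts = df["Address"].str.split(",", n=2, expand=True)
# Each part still carries the space after the comma — strip column by column
df[["Street", "State", "Zip"]] = parts.apply(lambda col: col.str.strip())

print(df.loc[0, "Street"])  # 123 Main St
print(df.loc[0, "State"])   # Texas
print(df.loc[0, "Zip"])     # 75001
```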

Step 7 — Replace and Standardize Values

Raw data often has the same value written multiple ways: "Y", "Yes", "yes", "y" — all meaning the same thing. Standardize them.

Replace whole cell values using a dictionary:

df["Status"] = df["Status"].replace({"Y": "Yes", "N": "No", "y": "Yes"})

# Apply to multiple columns at once
df[["Paying_Customer", "Do_Not_Contact"]] = (
    df[["Paying_Customer", "Do_Not_Contact"]]
    .replace({"Y": "Yes", "N": "No", "y": "Yes", "n": "No"})
)

Overwrite invalid values using loc[]:

df.loc[df["Age"] < 0,   "Age"] = None     # negative age → invalid
df.loc[df["Age"] > 120, "Age"] = None     # age over 120 → invalid
df.loc[df["Price"] < 0, "Price"] = None   # negative price → invalid

Keep valid values, replace everything else with NaN:

df["Salary"] = df["Salary"].where(df["Salary"] > 0)
# Rows where Salary <= 0 become NaN; the rest are untouched
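Putting Step 7 together on a small frame with the same kinds of problems as the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Paying_Customer": ["Y", "N", "Yes", "y"],
    "Age":             [25.0, -5.0, 200.0, 30.0],
})

# Standardize the category spellings
df["Paying_Customer"] = df["Paying_Customer"].replace(
    {"Y": "Yes", "y": "Yes", "N": "No"})

# Knock out impossible ages in one pass
df.loc[(df["Age"] < 0) | (df["Age"] > 120), "Age"] = np.nan

print(df["Paying_Customer"].tolist())  # ['Yes', 'No', 'Yes', 'Yes']
print(df["Age"].tolist())              # [25.0, nan, nan, 30.0]
```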

Now Combine — Merging & Joining DataFrames

Once each source is clean, the next challenge is combining them. In the real world, data almost never lives in a single file. You might have orders in one table, customers in another, and products in a third. Merging is how you bring those pieces together.

Pandas provides three main tools for this:

Tool        What it does
merge()     Joins two DataFrames by matching on a shared column
concat()    Stacks DataFrames vertically or side by side
join()      Joins on the index instead of a column

We'll use these two DataFrames for all the examples:

customers = pd.DataFrame({
    "ID":        [1, 2, 3, 4],
    "Name":      ["Alice", "Bob", "Carol", "Dave"],
    "Country":   ["USA", "UK", "India", "USA"],
})

orders = pd.DataFrame({
    "ID":        [1, 2, 3, 5],
    "Product":   ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price":     [1200, 800, 600, 400],
})

Notice that ID=4 (Dave) exists in customers but not in orders, and ID=5 exists in orders but not in customers. This is exactly the kind of mismatch you encounter in practice — and why the how= parameter matters so much.


1️⃣ merge() — The Main Tool

merge() works like a SQL JOIN. It aligns two DataFrames by finding rows where a shared column has the same value.

Basic syntax:

customers.merge(orders, how="inner", on="ID")

The how= Parameter — All Four Types

# INNER → only rows that match in both DataFrames (default)
customers.merge(orders, how="inner", on="ID")
# Result: IDs 1, 2, 3 only — Dave (4) and Monitor (5) are excluded

# LEFT → all rows from the left + matched rows from the right
customers.merge(orders, how="left", on="ID")
# Result: all 4 customers — Dave gets NaN for Product and Price

# RIGHT → all rows from the right + matched rows from the left
customers.merge(orders, how="right", on="ID")
# Result: all 4 orders — Monitor (ID=5) gets NaN for Name and Country

# OUTER → all rows from both, NaN where no match
customers.merge(orders, how="outer", on="ID")
# Result: all 5 unique IDs — gaps filled with NaN on both sides

how=     Keeps from Left   Keeps from Right   NaN Possible
inner    Matched only      Matched only       No
left     All rows          Matched only       Right side
right    Matched only      All rows           Left side
outer    All rows          All rows           Both sides
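Run against the customers/orders frames defined earlier, the row counts alone tell the story:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4],
                          "Name": ["Alice", "Bob", "Carol", "Dave"]})
orders = pd.DataFrame({"ID": [1, 2, 3, 5],
                       "Product": ["Laptop", "Phone", "Tablet", "Monitor"]})

print(len(customers.merge(orders, how="inner", on="ID")))  # 3 — IDs 1,2,3
print(len(customers.merge(orders, how="left",  on="ID")))  # 4 — Dave kept, NaN Product
print(len(customers.merge(orders, how="outer", on="ID")))  # 5 — all unique IDs
```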

CROSS merge — every row with every row:

customers.merge(orders, how="cross")
# Produces 4 × 4 = 16 rows — every customer paired with every order
# No on= parameter — it will raise an error if you include it

The on= Parameter — What to Match On

# Match on a single column
customers.merge(orders, how="inner", on="ID")

# Match on multiple columns — all of them must match simultaneously
# (each listed column must exist in both DataFrames; the sample orders
#  frame has no Country, so treat this as an illustrative pattern)
customers.merge(orders, how="inner", on=["ID", "Country"])

# Auto-detect — Pandas finds all shared column names automatically
customers.merge(orders)

Be careful with auto-detect. If both DataFrames happen to share a column that you don't intend to join on (like a generic "Name" column), Pandas will include it in the match condition and produce unexpected results. Explicit is safer.


Handling Shared Column Names — _x and _y

If both DataFrames have a column with the same name, but you only join on ID, Pandas keeps both copies and renames them automatically:

# Both DataFrames have a "Name" column, but you only merge on "ID"
customers.merge(orders, how="inner", on="ID")
# Result: Name_x (from customers), Name_y (from orders)

Three ways to fix this:

# Option 1 — merge on all shared columns so no duplicates appear
customers.merge(orders, how="inner", on=["ID", "Name"])

# Option 2 — use custom suffixes instead of _x and _y
customers.merge(orders, how="inner", on="ID", suffixes=("_customer", "_order"))

# Option 3 — drop the shared column from one DataFrame before merging
orders_clean = orders.drop(columns=["Name"])
customers.merge(orders_clean, how="inner", on="ID")

Other Useful Parameters

# left_on / right_on — when the key column has different names in each
# DataFrame (illustrative names, not columns in the sample frames above)
customers.merge(orders, how="inner", left_on="CustomerID", right_on="OrderID")

# indicator — adds a column showing where each row came from
customers.merge(orders, how="outer", on="ID", indicator=True)
# _merge column values: "left_only", "right_only", "both"

The indicator=True parameter is especially useful for auditing: after a merge, you can immediately see which rows matched, which came only from the left, and which came only from the right.
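A quick audit on the sample frames, checking only the shared key:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4]})
orders    = pd.DataFrame({"ID": [1, 2, 3, 5]})

audit = customers.merge(orders, how="outer", on="ID", indicator=True)
counts = audit["_merge"].value_counts()

print(counts["both"])        # 3 — IDs 1, 2, 3 matched
print(counts["left_only"])   # 1 — Dave (ID=4) had no order
print(counts["right_only"])  # 1 — Monitor (ID=5) had no customer
```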


2️⃣ concat() — Stacking DataFrames

While merge() combines DataFrames horizontally by matching values, concat() simply stacks them — either rows on top of rows, or columns side by side.

pd.concat([df1, df2])        # stack vertically (default)
pd.concat([df1, df2], axis=1)  # stack horizontally

axis= Parameter

# axis=0 → stack rows vertically (append one DataFrame below another)
monthly_data = pd.concat([january, february, march], axis=0)

# axis=1 → stack columns side by side
combined = pd.concat([names_df, scores_df], axis=1)

join= Parameter

# outer (default) → keep all columns, NaN where a column doesn't exist in one source
pd.concat([df1, df2], join="outer")

# inner → keep only columns that exist in all DataFrames
pd.concat([df1, df2], join="inner")

ignore_index= Parameter

# Without ignore_index — original indices are preserved, which can repeat
pd.concat([df1, df2])
# Index: 0, 1, 2, 0, 1, 2  ← 0, 1, 2 repeats

# With ignore_index=True — fresh sequential index
pd.concat([df1, df2], ignore_index=True)
# Index: 0, 1, 2, 3, 4, 5  ← clean

Always use ignore_index=True after stacking unless you specifically need the original indices.
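The index behavior is easy to demonstrate with two small hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"]})
df2 = pd.DataFrame({"Month": ["Apr", "May", "Jun"]})

# Default: original indices survive and collide
stacked = pd.concat([df1, df2])
print(list(stacked.index))  # [0, 1, 2, 0, 1, 2]

# ignore_index=True: fresh sequential index
clean = pd.concat([df1, df2], ignore_index=True)
print(list(clean.index))    # [0, 1, 2, 3, 4, 5]
```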

keys= Parameter

# Label each source so you can identify where each row came from
combined = pd.concat([df1, df2], keys=["source_a", "source_b"])

This creates a MultiIndex where the first level is your label and the second is the original row index — useful when you need to trace data back to its source after combining.

verify_integrity= Parameter

# Raises an error if any duplicate index values are found after concatenation
pd.concat([df1, df2], verify_integrity=True)

merge() vs concat() — When to Use Which

                            merge()                    concat()
Direction                   Horizontal (by matching)   Vertical or horizontal (by stacking)
Needs a shared key column   Yes                        No
Best for                    Joining related tables     Combining same-structure data
Equivalent to               SQL JOIN                   SQL UNION

A simple rule: if you're combining columns from different tables, use merge(). If you're stacking more rows of the same structure, use concat().


3️⃣ join() — Index-Based Joining

join() works like merge(), but it joins on the index instead of a regular column. You need to set the index first.

customers_indexed = customers.set_index("ID")
orders_indexed    = orders.set_index("ID")

customers_indexed.join(orders_indexed, how="inner")

The how= options are the same as merge(): inner, left, right, outer.
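Using the customers/orders frames from earlier, an index join produces the same rows as an inner merge on ID:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4],
                          "Name": ["Alice", "Bob", "Carol", "Dave"]}).set_index("ID")
orders = pd.DataFrame({"ID": [1, 2, 3, 5],
                       "Product": ["Laptop", "Phone", "Tablet", "Monitor"]}).set_index("ID")

joined = customers.join(orders, how="inner")
print(len(joined))           # 3 — IDs 1, 2, 3
print(list(joined.columns))  # ['Name', 'Product'] — ID lives in the index
```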

Why merge() Is Almost Always Better

                              merge()     join()
Join on any column            ✅ Yes      ❌ No — index only
Join on index                 ✅ Yes      ✅ Yes
Requires set_index() first    ❌ No       ✅ Yes
Simpler syntax                ✅ Yes      ⚠️ Extra steps

> Use join() only when your data is already indexed — for example, after a groupby() that produced a meaningful index. Otherwise, merge() is cleaner and more flexible.

This is Part 3 of the Pandas for Data Science series. Next up: Part 4 — Data Visualization with Matplotlib & Seaborn
