Data Cleaning and Preprocessing
1. Handling missing data
Pandas handles missing data in DataFrames effectively: isnull() and notnull() detect missing values, while fillna() replaces them and dropna() drops them.
import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [None, 5, 6, 7],
        'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)

# Detecting missing values
print("Missing Values:")
print(df.isnull())

# Filling missing values with 0
filled_df = df.fillna(value=0)
print("\nDataFrame after filling missing values:")
print(filled_df)

# Dropping rows that contain missing values
clean_df = df.dropna()
print("\nDataFrame after dropping missing values:")
print(clean_df)
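Filling with 0 is only one strategy; a brief sketch of two other common fills, reusing the same sample data:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [None, 5, 6, 7],
                   'C': [8, 9, 10, 11]})
print(df.fillna(df.mean()))   # fill each gap with that column's mean
print(df.ffill())             # propagate the last valid value forward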
2. Data type conversion
Data type conversion is a frequent step in Pandas workflows, guaranteeing consistency and interoperability across data types. You can convert the data types of a DataFrame's columns using the astype() method.
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': ['25.5', '30.3', '22.7'],                      # Age as strings
        'Date': ['2022-01-01', '2022-02-01', '2022-03-01']}   # Dates as strings
df = pd.DataFrame(data)

# Convert 'Age' column to float
df['Age'] = df['Age'].astype(float)

# Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Use caution with huge datasets: even when processed in pieces, complicated calculations or data manipulation can still result in high memory usage. One mitigation is converting columns to smaller dtypes, as sketched below.
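A minimal sketch of dtype downcasting (the DataFrame and column names are illustrative):
import pandas as pd

df = pd.DataFrame({'Count': range(1000), 'Label': ['A', 'B'] * 500})
print(df.memory_usage(deep=True).sum())  # total bytes before conversion

# int64 -> int32, object -> category
df['Count'] = df['Count'].astype('int32')
df['Label'] = df['Label'].astype('category')
print(df.memory_usage(deep=True).sum())  # noticeably smaller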
3. Removing duplicates
Removing duplicates from a Pandas DataFrame is a common procedure during data cleaning and preprocessing. Duplicates can be removed using the drop_duplicates() method.
import pandas as pd
data = {'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
        'Age': [28, 35, 28, 45, 35]}
df = pd.DataFrame(data)
# Remove duplicate rows
unique_df = df.drop_duplicates()
print("DataFrame after removing duplicate rows:")
print(unique_df)

4. Data normalization and scaling
Scaling and normalization standardize feature ranges, which helps many machine learning algorithms. Normalization (min-max scaling) rescales features to the range 0-1, whereas standardization transforms features to have mean 0 and standard deviation 1.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample DataFrame
data = {'Feature1': [10, 20, 30, 40, 50],
        'Feature2': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Min-Max scaling
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(df[['Feature1', 'Feature2']])
# Standardization
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(df[['Feature1', 'Feature2']])
print("Normalized Data:")
print(normalized_data)
print("\nStandardized Data:")
print(standardized_data)
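For reference, the same two transformations can be written with plain Pandas arithmetic; this sketch reproduces the results above (note that sklearn's StandardScaler uses the population standard deviation, hence ddof=0):
import pandas as pd

df = pd.DataFrame({'Feature1': [10, 20, 30, 40, 50],
                   'Feature2': [1, 2, 3, 4, 5]})
normalized = (df - df.min()) / (df.max() - df.min())   # min-max: (x - min) / (max - min)
standardized = (df - df.mean()) / df.std(ddof=0)       # z-score: (x - mean) / std
print(normalized)
print(standardized)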
Data Visualization with Pandas
With Pandas, you can quickly gain insight into your data by creating simple plots directly from DataFrames. Pandas uses Matplotlib, a robust charting library, as its plotting backend.
1. Basic plotting
Pandas offers easy-to-use methods for basic plotting, including line plots, bar plots, histograms, and scatter plots. You can call these plotting methods directly on a DataFrame or Series.
import pandas as pd
import matplotlib.pyplot as plt
# Creating a DataFrame
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
        'Value': [10, 15, 20, 25],
        'Category': ['A', 'B', 'C', 'D'],
        'Count': [5, 3, 4, 2],
        'Feature1': [1, 2, 3, 4],
        'Feature2': [5, 6, 7, 8]}
df = pd.DataFrame(data)
# Line plot
df.plot(x='Date', y='Value', kind='line')
plt.title('Line Plot')
# Bar plot
df.plot(x='Category', y='Count', kind='bar')
plt.title('Bar Plot')
# Show plots
plt.show()
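Scatter plots and histograms, the other kinds mentioned above, follow the same pattern; a short sketch with similar sample data:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Value': [10, 15, 20, 25],
                   'Feature1': [1, 2, 3, 4],
                   'Feature2': [5, 6, 7, 8]})

# Scatter plot of one feature against another
df.plot(x='Feature1', y='Feature2', kind='scatter')
plt.title('Scatter Plot')

# Histogram of a single column
df['Value'].plot(kind='hist', bins=4)
plt.title('Histogram')
plt.show()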
2. Customizing plots
Pandas plotting functions support a wide range of options for altering the look of plots, such as colors, labels, titles, grid lines, and more. By adjusting these settings, you can tailor a visualization to your requirements.
import pandas as pd
import matplotlib.pyplot as plt

# Creating a DataFrame
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
        'Value': [10, 20, 15, 25]}
df = pd.DataFrame(data)

# Changing colors
df.plot(color=['blue'])

# Adding labels and title
df.plot(x='Date', y='Value', xlabel='Date',
        ylabel='Value', title='Value over Time')
plt.show()
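Grid lines and figure sizing, also mentioned above, are ordinary keyword arguments; a minimal sketch:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Date': pd.to_datetime(['2022-01-01', '2022-01-02',
                                           '2022-01-03', '2022-01-04']),
                   'Value': [10, 20, 15, 25]})

# Grid lines, figure size, line style, and a title in one call
ax = df.plot(x='Date', y='Value', grid=True, figsize=(8, 4),
             style='--o', legend=True, title='Customized Plot')
ax.set_ylabel('Value')
plt.show()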
3. Subplots and multiple axes
Pandas can place multiple subplots and axes inside a single figure, letting you compare several datasets or different aspects of your data side by side.
import pandas as pd
import matplotlib.pyplot as plt

# Creating a DataFrame
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
        'Value': [10, 20, 15, 25],
        'Count': [5, 3, 4, 2]}
df = pd.DataFrame(data)

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Subplots: one panel per column
df.plot(subplots=True, figsize=(10, 8))

# Multiple axes: 'Count' on a secondary y-axis of the same plot
ax1 = df.plot(x='Date', y='Value', color='blue')
ax2 = df.plot(x='Date', y='Count', color='red', secondary_y=True, ax=ax1)
plt.show()
Time Series Analysis with Pandas
Using Pandas for time series analysis involves handling and examining data that is indexed by time.
1. Handling time series data
You can work with time series data in Pandas by converting strings or numeric representations of dates into datetime objects. The pd.to_datetime() function performs this conversion, and the result can serve as a DatetimeIndex.
import pandas as pd
data = {'Date': ['2024-03-13', '2024-03-14', '2024-03-15'],
        'Value': [10, 20, 30]}
df = pd.DataFrame(data)
# Parsing dates
df['Date'] = pd.to_datetime(df['Date'])
# Setting datetime index
df.set_index('Date', inplace=True)
print(df)
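Numeric date representations, such as Unix timestamps, convert the same way via to_datetime()'s unit argument; a small sketch (timestamps chosen to match the dates above):
import pandas as pd

# Seconds since the epoch for 2024-03-13, 2024-03-14, 2024-03-15 (UTC)
epochs = pd.Series([1710288000, 1710374400, 1710460800])
dates = pd.to_datetime(epochs, unit='s')
print(dates)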

2. Resampling and frequency conversion
The resample() method in Pandas lets you aggregate time-indexed data at different intervals: down-sampling converts data to a lower frequency, while up-sampling converts it to a higher one.
import pandas as pd

data = {'Date': pd.date_range(start='2024-01-01', end='2024-03-15'),
        'Value': range(1, 76)}
df = pd.DataFrame(data)

# Setting 'Date' column as datetime index
df.set_index('Date', inplace=True)

# Downsampling from daily to monthly
monthly_data = df.resample('M').mean()

# Upsampling from monthly to daily
daily_data = monthly_data.resample('D').ffill()  # Forward fill missing values
print("Monthly Data:")
print(monthly_data.head())
print("\nDaily Data:")
print(daily_data.head())

Tip: to handle data in smaller chunks, avoid memory errors, and increase efficiency on huge datasets, pass the chunksize parameter to pandas' read_csv() or read_sql() functions, as sketched below.
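A minimal sketch, assuming a large file named large_data.csv with a numeric Value column (both names are hypothetical):
import pandas as pd

total = 0
for chunk in pd.read_csv('large_data.csv', chunksize=100_000):
    total += chunk['Value'].sum()   # aggregate per chunk instead of loading all rows
print(total)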
3. Time shifting and lagging
Pandas provides methods like shift() and diff() for shifting or lagging time series data. Shifting is helpful for calculating changes or differences over time.
import pandas as pd
data = {'Date': pd.date_range(start='2024-01-01', end='2024-01-05'),
        'Value': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)
# Shifting data by one period
df['Shifted'] = df['Value'].shift(1)
# Calculating differences between consecutive elements
df['Diff'] = df['Value'].diff()
print(df)

Advanced Pandas Techniques
Beyond the fundamental functions, advanced Pandas techniques make data manipulation and analysis more efficient and expressive.
1. Handling categorical data
Pandas manages categorical data effectively with pd.Categorical(), which represents a column with a fixed set of distinct values, and pd.get_dummies(), which converts a categorical variable into dummy/indicator variables.
import pandas as pd
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)
# Convert 'Category' column to categorical
df['Category'] = pd.Categorical(df['Category'], categories=['A', 'B', 'C'])
# Convert categorical variable into dummy/indicator variables
dummy_df = pd.get_dummies(df['Category'], prefix='Category')
print("Original DataFrame:")
print(df)
print("\nDummy DataFrame:")
print(dummy_df)
2. Working with text data
Pandas has several string methods (such as str.lower() and str.contains()) for working with text data in DataFrame columns. Regular expressions enable more advanced text processing.
import pandas as pd
data = {'Text': ['This is a Text with Keyword 42', 'Another text without the keyword']}
df = pd.DataFrame(data)
# Convert 'Text' column to lowercase
df['Text'] = df['Text'].str.lower()
# Check if text contains a specific substring
contains_substring = df['Text'].str.contains('keyword')
# Extract text using regular expressions
extracted_text = df['Text'].str.extract(r'(\d+)')
print(df)
print(contains_substring)
print(extracted_text)
3. Pivot tables and cross-tabulations
Pivot tables, created with Pandas' pivot_table() function, aggregate data based on one or more key variables. Similarly, pd.crosstab() computes frequency tables (cross-tabulations) between categorical factors.
import pandas as pd

data = {'Date': pd.date_range(start='2024-01-01', periods=6),
        'Category': ['A', 'B', 'A', 'C', 'B', 'C'],
        'Value': [10, 20, 15, 25, 30, 35]}
df = pd.DataFrame(data)

# Create a pivot table
pivot_table = pd.pivot_table(df, values='Value', index='Date',
                             columns='Category', aggfunc='sum')
print("Pivot Table:")
print(pivot_table)
# Sample DataFrame for cross-tabulation
data_cross_tab = {'Category1': ['A', 'B', 'A', 'C', 'B', 'C'],
                  'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y']}
df_cross_tab = pd.DataFrame(data_cross_tab)
# Create a cross-tabulation
cross_tab = pd.crosstab(df_cross_tab['Category1'], df_cross_tab['Category2'])
print("\nCross-Tabulation:")
print(cross_tab)
4. Method chaining
Method chaining in Pandas improves code readability and workflow efficiency by combining several operations into a single expression.
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'C'],
        'Value': [10, 20, 15, 25, 30, 35]}
df = pd.DataFrame(data)

# Group by 'Category', calculate mean of 'Value', reset index, and rename column
result = (df.groupby('Category')
            .agg({'Value': 'mean'})
            .reset_index()
            .rename(columns={'Value': 'Mean_Value'}))
print(result)
Q: What is the significance of handling missing data?
A: Handling missing data is crucial for accurate analysis and modeling. Pandas offers methods like isnull(), fillna(), and dropna() for dealing with missing values effectively.