Pandas in Python: A Comprehensive Guide

Pandas is a powerful Python library for analysis and data manipulation. It offers user-friendly, high-performance data structures and tools for handling structured data. Pandas' versatility and extensive feature set make it a popular choice for data research and analysis tasks.

Series and DataFrame are the two primary data structures in Pandas.

  1. A Series is a one-dimensional, array-like object made up of an array of data and an associated array of labels called the index.

  2. A DataFrame is a two-dimensional labelled data structure whose columns can hold different data types. It can be compared to a SQL table or a spreadsheet.

The Pandas Series adds labelled axes and missing data handling to NumPy arrays. Data Frames are similar to dictionaries in that they handle diverse data types and use column names as keys. They provide effective row and column operations for manipulating structured data in Python.
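
As a minimal sketch of the two ideas above (the variable names and values here are made up for illustration and are not part of any dataset), a Series can be built on top of a NumPy array with labels and a missing value, and a DataFrame can be queried by column name much like a dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: a Series wraps a NumPy array and adds an index of labels
readings = pd.Series(np.array([21.5, 19.0, np.nan]), index=["mon", "tue", "wed"])
print(readings["tue"])         # look up a value by its label
print(readings.isna().sum())   # count the missing entries

# Hypothetical table: a DataFrame behaves like a dictionary of columns
items = pd.DataFrame({"item": ["pen", "book"], "price": [1.2, 8.5]})
print(items["item"])           # select a column by its key
print(list(items.keys()))      # column names, as with a dict
```
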
In this article we will look at the Pandas Python data analysis library. We will go over its main characteristics, including performance optimization, handling of missing data, and intuitive data structures. Let's learn about Pandas' capabilities and improve your knowledge of data analysis.

Installing Pandas

Pip or Anaconda can be used to install Pandas.

1. Installing with pip

pip install pandas

2. Installing with Anaconda

Pandas comes pre-installed with the Anaconda distribution. If it is missing from an environment, you can install it with conda:

conda install pandas

Once installed, run the following Python code to check the Pandas version:

import pandas as pd
print(pd.__version__)
Tip: Use the robust indexing features of Pandas to effectively access and alter data.

Pandas Series

A Pandas Series provides a flexible and effective way to handle one-dimensional labelled data. A Series can store different data types, such as strings, floats, integers, and even custom objects.

1. Creating Series Objects

You can use a variety of data structures, including Series and Data Frames, when constructing data objects in Python using Pandas. For instance, to create a Series object, you can use:

A. pd.Series()

It creates a labelled, one-dimensional array that can hold several kinds of data. It accepts a list, a NumPy array, or a dictionary as input, which makes it flexible and easy to use for data processing and analysis.

import pandas as pd
import numpy as np
# Creating a Pandas Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# Creating a Pandas Series from a NumPy array
arr = np.array([10, 20, 30, 40, 50])
s_np = pd.Series(arr)
# Creating a Pandas Series from a dictionary
data = {'a': 0, 'b': 1, 'c': 2}
s_dict = pd.Series(data)
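
The constructor also accepts explicit labels. The following is a small sketch, with illustrative names and values that are not from the article, showing the `index` and `name` arguments of `pd.Series()` and a Series that holds strings:

```python
import pandas as pd

# Explicit labels replace the default 0..n-1 integer index (values are hypothetical)
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"], name="price")
print(prices.index.tolist())
print(prices.dtype)  # float64

# A Series can hold strings too; the .str accessor applies string methods element-wise
names = pd.Series(["Ann", "Bob"])
print(names.str.upper().tolist())
```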

2. Accessing and manipulating Series data

There are several ways to access and work with Pandas Series, including slicing, indexing, and simple arithmetic operations.

import pandas as pd
s = pd.Series([1, 3, 5])
# Accessing elements of a Series using index
print(s[0])  # Output: 1
# Slicing a Series by position
print(s[1:3])  # Output: a Series holding 3 and 5 (index labels 1 and 2)
s_dict = pd.Series({'a': 2, 'b': 4, 'c': 6})
# Basic arithmetic operations on Series
s_add = s + s_dict  # The two indices share no labels, so every value in the result is NaN
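
When a Series has non-integer labels, it can help to be explicit about whether you mean a label or a position. This is a minimal sketch with a hypothetical `scores` Series, using `.loc` for labels, `.iloc` for positions, and a Boolean mask:

```python
import pandas as pd

# Hypothetical data with string labels
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(scores.loc["b"])       # label-based access
print(scores.iloc[0])        # position-based access
print(scores.loc["a":"b"])   # label slices include the end label
print(scores[scores > 15])   # Boolean mask keeps the matching elements
doubled = scores * 2         # arithmetic is applied element-wise
print(doubled)
```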

3. Methods for data alignment and missing data handling

When executing operations, Pandas automatically aligns data according to the index. Methods like 'isnull()', 'fillna()', and 'dropna()' can be used to deal with missing data.

import numpy as np
import pandas as pd
s = pd.Series([1, 3, np.nan, 5])
# Checking for missing values
print(s.isnull())
# Filling missing values with a specified value
s_filled = s.fillna(0)
# Dropping missing values
s_dropped = s.dropna()
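
Index alignment itself can be seen with two Series whose labels only partly overlap. This is a sketch under the assumption of made-up data; `left` and `right` are illustrative names:

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label; labels present on only one side give NaN
total = left + right
print(total)

# add() with fill_value treats the missing side as 0 instead
total_filled = left.add(right, fill_value=0)
print(total_filled)
```

Only the labels 'b' and 'c' exist in both Series, so only those positions get a numeric sum in the first result.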

Pandas Data Frames

Pandas Data Frame is a two-dimensional labelled data structure that is widely used for data manipulation and analysis in Python. It can be thought of as a table with rows and columns, where each column represents a different variable and each row represents a different observation.

1. Creating DataFrames

The 'pd.DataFrame()' constructor creates Data Frames from in-memory data, such as a dictionary or a list of lists. Data stored in CSV, Excel, or SQL sources is loaded with reader functions such as 'pd.read_csv()', 'pd.read_excel()', and 'pd.read_sql()'.

import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 35, 21, 45]}
df = pd.DataFrame(data)
print(df)
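
For comparison, here is a short sketch of the list-of-lists form, plus a list of dictionaries. The variable names `rows` and `records` are illustrative only:

```python
import pandas as pd

# Each inner list is one row; column names are passed separately
rows = [["John", 28], ["Anna", 35], ["Peter", 21]]
df_rows = pd.DataFrame(rows, columns=["Name", "Age"])
print(df_rows)

# A list of dictionaries also works; the keys become the column names
records = [{"Name": "Linda", "Age": 45}, {"Name": "Tom", "Age": 33}]
df_records = pd.DataFrame(records)
print(df_records)
```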

2. Understanding Data Frame structure

Data Frames are made up of rows, indexes, and columns. Rows hold the actual data, whereas columns represent variables and indices provide row labels.

import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
# Accessing column names
print(df.columns)
# Accessing index
print(df.index)
# Accessing rows
print(df.iloc[0])  # Accessing the first row
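
A few more attributes are useful when inspecting structure. This sketch assumes a small made-up frame called `df_struct`:

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
df_struct = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

print(df_struct.shape)     # (number of rows, number of columns)
print(df_struct.dtypes)    # one data type per column
print(df_struct.head(2))   # the first two rows

# Selecting a single column returns a Series that shares the row index
col = df_struct["A"]
print(type(col).__name__)
```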

3. Indexing and Selecting data

Many indexing techniques, such as label-based, position-based, and Boolean indexing, can be used to retrieve data in Data Frames.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Label-based indexing
print(df.loc[0, 'Name'])  # Accessing the 'Name' column of the first row
# Position-based indexing
print(df.iloc[0, 1])  # Accessing the element in the first row and second column
# Boolean indexing
print(df[df['Age'] > 30])  # Selecting rows where Age is greater than 30
Note: Keep the data type consistent within each column to prevent unexpected behaviour when performing operations. Verify and convert data types as needed to guarantee accuracy and compatibility.
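
One way to verify and convert types is sketched here, assuming a made-up frame `raw` in which numbers arrived as text. It uses `pd.to_numeric()` and `astype()`:

```python
import pandas as pd

# Hypothetical input where the Age column was read as strings
raw = pd.DataFrame({"Age": ["25", "30", "n/a"], "Score": [1, 2, 3]})
print(raw.dtypes)  # Age is not numeric yet

# errors="coerce" turns values that cannot be parsed into NaN
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

# astype() converts a whole column to the requested type
raw["Score"] = raw["Score"].astype(float)
print(raw.dtypes)
```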

Data Manipulation with Pandas

Pandas can be used to manipulate data in a variety of ways, including grouping and aggregating data, filtering and sorting data, joining and merging Data Frames, and more.

1. Adding and Removing columns and rows

You can add rows with 'pd.concat()' and add columns by assignment. You can remove rows or columns with 'df.drop()', and 'df.dropna()' removes those that contain missing values.

import pandas as pd
# Creating the initial DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 35, 21, 45]}
df = pd.DataFrame(data)
# Adding a new row using pd.concat()
new_row = pd.DataFrame([['Tom', 33]], columns=['Name', 'Age'])
df = pd.concat([df, new_row], ignore_index=True)
print("DataFrame after adding a new row:")
print(df)
# Removing the first row using df.drop(), then renumbering the index
df = df.drop(index=0).reset_index(drop=True)
print("\nDataFrame after removing the first row:")
print(df)
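
Columns can be handled in a similar way. This is a minimal sketch with an illustrative frame `people` and a hypothetical derived column `AgeNextYear`:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 35]})

# Add a column by assignment; the expression is evaluated for every row
people["AgeNextYear"] = people["Age"] + 1

# drop() returns a new DataFrame and leaves the original unchanged
without_age = people.drop(columns=["Age"])
without_first = people.drop(index=0)

print(people)
print(without_age)
print(without_first)
```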

2. Filtering and sorting data

Boolean indexing and 'df.sort_values()' can be used to filter and sort Data Frames. Note that 'df.filter()' selects rows or columns by label, not by value.

import pandas as pd 
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Filtering rows based on conditions
filtered_df = df[df['Age'] > 30]
# Sorting by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print("Filtered DataFrame:")
print(filtered_df)
print("\nSorted DataFrame:")
print(sorted_df)
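
Filters often need more than one condition, and sorts more than one key. This is a sketch with a made-up `staff` frame; the `Team` column is hypothetical:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                      "Age": [25, 30, 35, 30],
                      "Team": ["x", "y", "x", "x"]})

# Combine conditions with & (and) or | (or); each condition needs parentheses
selected = staff[(staff["Age"] >= 30) & (staff["Team"] == "x")]
print(selected)

# Sort by Age descending, then by Name ascending to break ties
ordered = staff.sort_values(by=["Age", "Name"], ascending=[False, True])
print(ordered)
```
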
Caution: Sorting, merging, and grouping can be computationally expensive on large datasets. Choose the appropriate Pandas methods to keep performance acceptable.

3. Combining and Merging Data Frames

Data Frames can be joined together or merged using 'pd.concat()' and 'pd.merge()'.

import pandas as pd
import numpy as np
# Define DataFrame 1
data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 22]}
df1 = pd.DataFrame(data1)
# Define DataFrame 2
data2 = {'Name': ['David', 'Emily'],
         'Age': [28, 35]}
df2 = pd.DataFrame(data2)
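# Concatenating DataFrames horizontally (axis=1); rows are aligned on the index,
# so df2's columns are filled with NaN in the third row
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print(df_concat_horizontal)
# Merging DataFrames based on multiple columns (an inner join by default);
# df1 and df2 share no (Name, Age) pairs, so the result here is empty
merged_df_multiple = pd.merge(df1, df2, on=['Name', 'Age'])
print(merged_df_multiple)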
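
A merge is easier to follow when the two frames share a key column but carry different information. This is a sketch with made-up `people` and `cities` frames, which also stacks rows vertically with `pd.concat()`:

```python
import pandas as pd

# Hypothetical frames that share the key column "Name"
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22]})
cities = pd.DataFrame({"Name": ["Alice", "Bob", "David"],
                       "City": ["Paris", "Lima", "Oslo"]})

# Inner join (the default): keep only names present in both frames
inner = pd.merge(people, cities, on="Name")
print(inner)

# Left join: keep every row of people; a missing City becomes NaN
left = pd.merge(people, cities, on="Name", how="left")
print(left)

# Vertical concatenation appends rows; ignore_index renumbers them
more_people = pd.DataFrame({"Name": ["David", "Emily"], "Age": [28, 35]})
stacked = pd.concat([people, more_people], ignore_index=True)
print(stacked)
```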

4. Grouping and Aggregating data

Use 'df.groupby()' to split the data into groups, then apply aggregation functions such as count(), mean(), and sum() to each group.

import pandas as pd 
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Price': [10, 20, 15, 25, 12],
        'Quantity': [5, 3, 4, 2, 6]}
df = pd.DataFrame(data)
# Grouping and aggregating data
aggregated_df = df.groupby('Category').agg({'Price': 'sum', 'Quantity': 'mean'})
print("Aggregated Data:")
print(aggregated_df)
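
If you want to control the names of the result columns, named aggregation is one option. This sketch reuses the same values in a frame called `sales`; the output names `total_price`, `mean_quantity`, and `n_rows` are illustrative choices:

```python
import pandas as pd

sales = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                      "Price": [10, 20, 15, 25, 12],
                      "Quantity": [5, 3, 4, 2, 6]})

# Each keyword is an output column: (source column, aggregation function)
summary = sales.groupby("Category").agg(
    total_price=("Price", "sum"),
    mean_quantity=("Quantity", "mean"),
    n_rows=("Price", "count"),
)
print(summary)
```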

FAQ
Q: How do I perform data manipulation?
A: Pandas provides functions for selecting, filtering, merging, grouping, and aggregating data.
Q: What is the difference between loc and iloc?
A: loc is label-based indexing, used for selecting data by labels, while iloc is integer-location based indexing, used for selecting data by integer indices.
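The difference is easiest to see when the index labels are not 0, 1, 2. A small sketch, assuming a made-up frame `df_idx` with labels 10, 20, and 30:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df_idx = pd.DataFrame({"Age": [25, 30, 35]}, index=[10, 20, 30])

print(df_idx.loc[10, "Age"])    # row with label 10
print(df_idx.iloc[0, 0])        # row at position 0, the same row
print(len(df_idx.loc[10:20]))   # label slices include the end label
print(len(df_idx.iloc[0:1]))    # position slices exclude the end position
```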

To continue with Part 2 of this article, see:

Mastering Data Analysis and Manipulation with Pandas: A Comprehensive Guide

To continue with Part 3 of this article, see:

Advancing with Pandas: Beyond the Basic
