Pandas is a powerful Python library for analysis and data manipulation. It offers user-friendly, high-performance data structures and tools for handling structured data. Pandas' versatility and extensive feature set make it a popular choice for data research and analysis tasks.
Series and DataFrame are the two primary data structures in Pandas.
- A Series is a one-dimensional, array-like object made up of an array of data and an associated array of labels called the index.
- A DataFrame is a two-dimensional, labelled data structure whose columns can hold different data types. It can be compared to a SQL table or a spreadsheet.
The Pandas Series adds labelled axes and missing data handling to NumPy arrays. Data Frames are similar to dictionaries in that they handle diverse data types and use column names as keys. They provide effective row and column operations for manipulating structured data in Python.
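As a minimal sketch of the two ideas above, the snippet below builds a labelled Series with one missing value and then reads a DataFrame like a dictionary of columns. The fruit names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series attaches labels (the index) to values; np.nan marks a missing entry.
prices = pd.Series([10.0, np.nan, 30.0], index=["apple", "pear", "plum"])
print(prices["plum"])         # label-based lookup -> 30.0
print(prices.isna().sum())    # number of missing entries -> 1

# A DataFrame can be read like a dictionary whose keys are column names.
stock = pd.DataFrame({"fruit": ["apple", "pear"], "qty": [3, 5]})
print("qty" in stock)         # membership test checks column names -> True
print(stock["qty"].tolist())  # -> [3, 5]
```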
In this article, we will look at the Pandas Python data analysis library. We will go over its main characteristics, including intuitive data structures, handling of missing data, and performance considerations. Let's explore what Pandas can do and build up your knowledge of data analysis.
Installing Pandas
Pip or Anaconda can be used to install Pandas.
1. Installing with pip
pip install pandas
2. Installing with Anaconda
Pandas comes pre-installed in the distribution if you use Anaconda. You may use conda to update it:
conda install pandas
Once installed, run the following command to see the Pandas version:
import pandas as pd
print(pd.__version__)
Use the robust indexing features of Pandas to effectively access and alter data.
Pandas Series
A Pandas Series provides a flexible and effective way to handle one-dimensional labelled data. A Series can store different data types, such as strings, floats, integers, and even custom objects.
1. Creating Series Objects
A Series can be built from several kinds of input, such as a list, a NumPy array, or a dictionary. To create a Series object, you can use:
A. pd.Series()
It creates a labelled, one-dimensional array whose index can be used to look up values, and which can hold several kinds of data.
import pandas as pd
import numpy as np
# Creating a Pandas Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# Creating a Pandas Series from a NumPy array
arr = np.array([10, 20, 30, 40, 50])
s_np = pd.Series(arr)
# Creating a Pandas Series from a dictionary
data = {'a': 0, 'b': 1, 'c': 2}
s_dict = pd.Series(data)
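The constructor also accepts an explicit index and a name. The short sketch below uses made-up temperature readings to show how the labels end up on the Series.

```python
import pandas as pd

# Explicit labels replace the default 0, 1, 2, ... index.
temps = pd.Series([21.5, 19.0, 23.1], index=["Mon", "Tue", "Wed"], name="temp_c")
print(temps["Tue"])           # -> 19.0
print(temps.index.tolist())   # -> ['Mon', 'Tue', 'Wed']
print(temps.dtype)            # -> float64

# A dictionary's keys become the index automatically.
counts = pd.Series({"a": 0, "b": 1, "c": 2})
print(counts.index.tolist())  # -> ['a', 'b', 'c']
```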
2. Accessing and manipulating Series data
There are several ways to access and work with Pandas Series, including slicing, indexing, and simple arithmetic operations.
import pandas as pd
s = pd.Series([1, 3, 5])
# Accessing elements of a Series using index
print(s[0]) # Output: 1
# Slicing a Series
print(s[1:3]) # Output: a Series holding 3 and 5 (index labels 1 and 2)
s_dict = pd.Series({'a': 2, 'b': 4, 'c': 6})
# Basic arithmetic operations on Series
s_add = s + s_dict # The indices (0-2 and 'a'-'c') share no labels, so every result is NaN
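Arithmetic between two Series matches values by index label, not by position. Here is a small sketch with partly overlapping labels (the values are arbitrary); it also shows the add() method with fill_value, which treats a label missing from one side as 0.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the result.
total = a + b
print(total)                 # 'a' and 'd' are NaN, 'b' is 12.0, 'c' is 23.0

# add() with fill_value substitutes 0 for the missing side before adding.
filled = a.add(b, fill_value=0)
print(filled.tolist())       # -> [1.0, 12.0, 23.0, 30.0]
```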
3. Methods for data alignment and missing data handling
When executing operations, Pandas automatically aligns data according to the index. Methods like 'isnull()', 'fillna()', and 'dropna()' can be used to deal with missing data.
import numpy as np
import pandas as pd
s = pd.Series([1, 3, np.nan, 5])
# Checking for missing values
print(s.isnull())
# Filling missing values with a specified value
s_filled = s.fillna(0)
# Dropping missing values
s_dropped = s.dropna()
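Beyond filling with a constant, a missing value can be replaced with a statistic or with the previous valid value. The sketch below compares the two on arbitrary sample numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, np.nan, 9])

print(s.isna().sum())             # count of missing values -> 1
mean_filled = s.fillna(s.mean())  # mean() skips NaN, so the fill value is 5.0
forward_filled = s.ffill()        # repeat the previous valid value (4.0)
print(mean_filled.tolist())       # -> [2.0, 4.0, 5.0, 9.0]
print(forward_filled.tolist())    # -> [2.0, 4.0, 4.0, 9.0]
print(s.dropna().index.tolist())  # original labels are kept -> [0, 1, 3]
```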
Pandas Data Frames
Pandas Data Frame is a two-dimensional labelled data structure that is widely used for data manipulation and analysis in Python. It can be thought of as a table with rows and columns, where each column represents a different variable and each row represents a different observation.
1. Creating DataFrames
The 'pd.DataFrame()' constructor creates a Data Frame from in-memory data such as a dictionary or a list of lists. Data stored in CSV, Excel, or SQL sources is usually loaded with reader functions such as 'pd.read_csv()', 'pd.read_excel()', and 'pd.read_sql()' instead.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 35, 21, 45]}
df = pd.DataFrame(data)
print(df)
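A Data Frame can also be built from a list of rows or a list of dictionaries, and CSV data goes through pd.read_csv(). In this sketch, io.StringIO stands in for a file on disk, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# From a list of rows: column names are supplied separately.
rows = [["John", 28], ["Anna", 35]]
df_rows = pd.DataFrame(rows, columns=["Name", "Age"])

# From a list of dictionaries: keys become column names.
records = [{"Name": "Peter", "Age": 21}, {"Name": "Linda", "Age": 45}]
df_records = pd.DataFrame(records)

# From CSV text: io.StringIO plays the role of an open file.
csv_text = "Name,Age\nJohn,28\nAnna,35\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_rows)
print(df_records)
print(df_csv)
```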
2. Understanding Data Frame structure
Data Frames are made up of rows, columns, and an index. Each row holds one observation, each column represents a variable, and the index provides the row labels.
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
# Accessing column names
print(df.columns)
# Accessing index
print(df.index)
# Accessing rows
print(df.iloc[0]) # Accessing the first row
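A few more attributes help when inspecting the structure. The sketch below reuses the same small Data Frame and prints its shape, labels, and per-column data types.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

print(df.shape)              # (rows, columns) -> (3, 2)
print(list(df.columns))      # -> ['A', 'B']
print(list(df.index))        # default index -> [0, 1, 2]
print(df.dtypes)             # one data type per column
print(df.iloc[0].to_dict())  # the first row as a dictionary keyed by column name
```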
3. Indexing and Selecting data
Many indexing techniques, such as label-based, position-based, and Boolean indexing, can be used to retrieve data in Data Frames.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Label-based indexing
print(df.loc[0, 'Name']) # Accessing the 'Name' column of the first row
# Position-based indexing
print(df.iloc[0, 1]) # Accessing the element in the first row and second column
# Boolean indexing
print(df[df['Age'] > 30]) # Selecting rows where Age is greater than 30
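Boolean conditions can be combined, and a mask can be passed to loc together with a column label. Here is a short sketch using the same sample data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df["Age"] > 25) & (df["Name"] != "Charlie")
print(df[mask]["Name"].tolist())                  # -> ['Bob']

# loc accepts a Boolean mask for rows plus a column label.
names_30_plus = df.loc[df["Age"] >= 30, "Name"]
print(names_30_plus.tolist())                     # -> ['Bob', 'Charlie']

# isin() tests membership in a list of values.
selected = df[df["Name"].isin(["Alice", "Bob"])]
print(selected["Age"].tolist())                   # -> [25, 30]
```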
Keep data types consistent within each column to prevent unexpected behaviour when performing operations. Check each column's type with 'df.dtypes' and convert as needed to guarantee accuracy and compatibility.
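One way to check and convert types is sketched below. The input is hypothetical: ages that arrived as text, including one entry that cannot be parsed.

```python
import pandas as pd

# Hypothetical input: ages stored as strings, with one unparseable entry.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": ["25", "30", "n/a"]})
print(df.dtypes)

# errors="coerce" turns values that cannot be parsed into NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
print(df["Age"].dtype)         # -> float64
print(df["Age"].isna().sum())  # -> 1

# astype() converts directly when every value is already valid.
ids = pd.Series(["1", "2", "3"]).astype(int)
print(ids.sum())               # -> 6
```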
Data Manipulation with Pandas
Pandas can be used to manipulate data in a variety of ways, including grouping and aggregating data, filtering and sorting data, joining and merging Data Frames, and more.
1. Adding and Removing columns and rows
You can add rows with 'pd.concat()' and remove rows or columns with 'df.drop()', among other ways. 'df.dropna()' removes rows or columns that contain missing values.
import pandas as pd
# Creating the initial DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 35, 21, 45]}
df = pd.DataFrame(data)
# Adding a new row using pd.concat()
new_row = pd.DataFrame([['Tom', 33]], columns=['Name', 'Age'])
df = pd.concat([df, new_row], ignore_index=True)
print("DataFrame after adding a new row:")
print(df)
# Removing the first row using df.drop(), then renumbering the index
df = df.drop(index=0).reset_index(drop=True)
print("\nDataFrame after removing the first row:")
print(df)
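Columns can be added by assignment and removed with drop(). The following sketch assumes a made-up 'City' column purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Anna", "Peter"], "Age": [28, 35, 21]})

# Adding columns: assign a list, or an expression computed from other columns.
df["City"] = ["Oslo", "Rome", "Lima"]   # hypothetical values
df["Over30"] = df["Age"] > 30

# Removing a column: drop() returns a new DataFrame by default.
df = df.drop(columns=["City"])
print(list(df.columns))                 # -> ['Name', 'Age', 'Over30']

# Removing rows by index label, then renumbering the index.
df = df.drop(index=[0]).reset_index(drop=True)
print(df["Name"].tolist())              # -> ['Anna', 'Peter']
```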
2. Filtering and sorting data
Boolean indexing and 'df.sort_values()' can be used to filter and sort Data Frames. Note that 'df.filter()' selects rows or columns by their labels, not by their values.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Filtering rows based on conditions
filtered_df = df[df['Age'] > 30]
# Sorting by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print("Filtered DataFrame:")
print(filtered_df)
print("\nSorted DataFrame:")
print(sorted_df)
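To make the distinction concrete, the sketch below contrasts filter(), which works on labels, with query(), which filters rows by value, and then chains a filter with a sort.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                   "Age": [25, 30, 35, 40]})

# filter() works on labels: here it keeps columns whose name contains "ge".
age_only = df.filter(like="ge")
print(list(age_only.columns))           # -> ['Age']

# query() filters rows by value using an expression string.
over_30 = df.query("Age > 30")
print(over_30["Name"].tolist())         # -> ['Charlie', 'David']

# Chain a Boolean filter and a sort.
oldest_first = df[df["Age"] >= 30].sort_values(by="Age", ascending=False)
print(oldest_first["Name"].tolist())    # -> ['David', 'Charlie', 'Bob']
```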
Sorting, merging, and grouping can be computationally costly on large datasets. Choose the Pandas methods that fit the task to keep performance acceptable.
3. Combining and Merging Data Frames
Data Frames can be joined together or merged using 'pd.concat()' and 'pd.merge()'.
import pandas as pd
# Define DataFrame 1
data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
df1 = pd.DataFrame(data1)
# Define DataFrame 2
data2 = {'Name': ['David', 'Emily'],
'Age': [28, 35]}
df2 = pd.DataFrame(data2)
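# Concatenating DataFrames horizontally (rows are matched on the index;
# df2 has fewer rows, so its missing positions are filled with NaN)
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print(df_concat_horizontal)
# Merging DataFrames based on multiple columns (an inner join by default;
# df1 and df2 share no Name/Age pairs, so the result here is empty)
merged_df_multiple = pd.merge(df1, df2, on=['Name', 'Age'])
print(merged_df_multiple)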
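Merging is easier to see when the two tables share a key column. The sketch below assumes a hypothetical second table of cities keyed by 'Name', compares an inner and a left join, and then stacks rows vertically with concat().

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22]})
# Hypothetical second table sharing the 'Name' key.
cities = pd.DataFrame({"Name": ["Alice", "Bob", "David"],
                       "City": ["Paris", "Berlin", "Madrid"]})

inner = pd.merge(people, cities, on="Name")             # only names found in both
left = pd.merge(people, cities, on="Name", how="left")  # every row of people
print(inner["Name"].tolist())       # -> ['Alice', 'Bob']
print(left["City"].isna().sum())    # Charlie has no city -> 1

# Vertical concatenation stacks rows; ignore_index renumbers them.
more_people = pd.DataFrame({"Name": ["David", "Emily"], "Age": [28, 35]})
stacked = pd.concat([people, more_people], ignore_index=True)
print(stacked.shape)                # -> (5, 2)
```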
4. Grouping and Aggregating data
'df.groupby()', together with aggregation functions such as 'count()', 'mean()', and 'sum()', can be used to group and aggregate data.
import pandas as pd
# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
'Price': [10, 20, 15, 25, 12],
'Quantity': [5, 3, 4, 2, 6]}
df = pd.DataFrame(data)
# Grouping and aggregating data
aggregated_df = df.groupby('Category').agg({'Price': 'sum', 'Quantity': 'mean'})
print("Aggregated Data:")
print(aggregated_df)
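The same grouping can be written with named aggregation, which lets you choose the output column names. This sketch reuses the data above; the result names (total_price, avg_quantity, n_items) are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "A", "B", "A"],
                   "Price": [10, 20, 15, 25, 12],
                   "Quantity": [5, 3, 4, 2, 6]})

# Each keyword is an output column: (source column, aggregation function).
summary = df.groupby("Category").agg(
    total_price=("Price", "sum"),
    avg_quantity=("Quantity", "mean"),
    n_items=("Price", "count"),
)
print(summary)
print(summary.loc["A", "total_price"])   # 10 + 15 + 12 -> 37
print(summary.loc["B", "avg_quantity"])  # (3 + 2) / 2 -> 2.5
```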
Q: How do I perform data manipulation?
A: Pandas provides functions for selecting, filtering, merging, grouping, and aggregating data.
Q: What is the difference between loc and iloc?
A: loc is label-based indexing, used for selecting data by labels, while iloc is integer-location based indexing, used for selecting data by integer indices.
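A small sketch of the difference, using a Data Frame with string labels so that labels and positions cannot be confused:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35]}, index=["a", "b", "c"])

print(df.loc["b", "Age"])     # by label -> 30
print(df.iloc[1, 0])          # by position -> 30

# Slices differ: loc includes the end label, iloc excludes the end position.
print(len(df.loc["a":"b"]))   # -> 2
print(len(df.iloc[0:1]))      # -> 1
```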
For Part 2 of this article, see:
Mastering Data Analysis and Manipulation with Pandas: A Comprehensive Guide
For Part 3 of this article, see:
Advancing with Pandas: Beyond the Basic