The main aim of this project is to predict in-hospital mortality, i.e. whether a patient will survive or not, with the help of different machine learning models and the PyCaret library.
Motivation
The predictors of in-hospital mortality for heart failure patients admitted to intensive care units (ICUs) remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted heart failure patients.
We will build a machine learning model using both conventional and automated machine learning (AutoML) techniques to predict whether a patient will live or die, based on certain observed parameters. The dataset contains a number of attributes that will help us predict the outcome.
What is PyCaret?
PyCaret is an open-source, low-code machine learning library in Python that automates the entire process of model training, evaluation, and deployment. It is designed to be simple and efficient, providing an easy-to-use interface for performing various machine learning tasks with minimal coding. Here are some key features and components of PyCaret:
Ease of Use: PyCaret simplifies the machine learning workflow, allowing users to build models with just a few lines of code. This makes it accessible for both beginners and experienced data scientists.
Low-Code: With PyCaret, users can perform complex machine learning tasks with minimal coding, reducing the time and effort required for model development.
End-to-End Machine Learning: PyCaret covers the entire machine learning lifecycle, including data preprocessing, model training, hyperparameter tuning, model evaluation, and deployment.
Multiple Modules: PyCaret provides various modules tailored for different types of machine learning tasks (a minimal usage sketch follows this list):
- Classification: For binary and multi-class classification problems.
- Regression: For regression tasks.
- Clustering: For unsupervised clustering tasks.
- Anomaly Detection: For identifying outliers in data.
- NLP: For natural language processing tasks.
- Time Series: For time series forecasting (recently added).
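To make this concrete, here is a minimal sketch of the classification module (an illustration, not part of the original notebook); it assumes a DataFrame df with a binary outcome column and may need small adjustments depending on your PyCaret version.
from pycaret.classification import setup, compare_models, predict_model

# Initialise the experiment: PyCaret infers column types and applies
# default preprocessing (imputation, encoding, train/test split)
exp = setup(data=df, target='outcome', session_id=42)

# Train and cross-validate the available classifiers, returning the best one
best_model = compare_models()

# Score the best model on the hold-out set created by setup()
predict_model(best_model)
A single setup() call followed by compare_models() is typically all it takes to benchmark a dozen or more classifiers, which is what we will rely on in the AutoML part of this project.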
Timeline of the project
Data Analysis: Finding out different relations in the data.
Feature Engineering: Processing the data before feeding it to the model.
Model building using machine learning: Applying conventional machine learning algorithms.
Model building using auto machine learning: Using the PyCaret library.
Code implementation
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
The code below mounts Google Drive in a Google Colab notebook, which lets you access files stored in your Drive directly from the Colab environment.
from google.colab import drive
drive.mount('/content/drive')
The code below reads a CSV file stored in Google Drive into a pandas DataFrame.
df = pd.read_csv("/content/drive/MyDrive/mortality.csv")
The code below displays the first few rows of the DataFrame.
df.head()
Data Analysis
The code below generates descriptive statistics of the DataFrame.
df.describe()
This code is used to identify and count the number of missing values in each column of a pandas DataFrame.
df.isnull().sum()
The code below returns the dimensions of the DataFrame.
df.shape
The output is (1177, 51), i.e. 1,177 rows and 51 columns.
The code below is used to get a concise summary of a pandas DataFrame.
df.info()
Handling NaN Values
For float variables
The code below imports the SimpleImputer class from sklearn.impute and creates an instance that will replace missing values (NaN) with the column mean.
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan, strategy='mean')
The code below selects columns of a specific data type (float64) from a pandas DataFrame and stores the column names in a variable.
float_col = df.select_dtypes(include='float64').columns
The code below fits the imputer on the float columns and replaces their missing values with the corresponding column means. Note that fit_transform (rather than transform alone) is needed here, since the imputer must first learn each column's mean.
df[float_col] = si.fit_transform(df[float_col])
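As a quick sanity check (an addition, not part of the original code), we can confirm that no missing values remain in the imputed columns:
df[float_col].isnull().sum().sum()  # expected to be 0 after imputation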
The code below separates the features (x) and the target variable (y) from a pandas DataFrame (df).
x = df.drop(columns='outcome')
y = df[['outcome']]
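A quick shape check (added here for clarity) confirms the split: x should hold the 50 feature columns and y the single target column.
x.shape, y.shape  # expected: (1177, 50) and (1177, 1)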
For the dependent variable
# Create a SimpleImputer object named 'SI'
# It replaces missing values (NaN) with the most frequent value in the column
SI = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
y = SI.fit_transform(y)
# Wrap the imputed array in a new DataFrame
# The column name is set to 'outcome' and the data type to 'int64'
y = pd.DataFrame(y, columns=['outcome'], dtype='int64')
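Before plotting, it can help to inspect the class balance directly (an extra step, not in the original write-up):
y['outcome'].value_counts()  # counts per class; the larger class corresponds to 'Alive' in the pie chart below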
# Create a new DataFrame 'df_final' by copying the features DataFrame 'x'
df_final = x.copy()
# Add an 'outcome' column to 'df_final', filled with the values from the target variable 'y'
df_final['outcome'] = y
The code below counts the missing values remaining in each column of df_final, confirming that the imputation worked.
df_final.isnull().sum()
Visualising our Dependent variable
# Create a figure and axis object with the specified size and DPI
fig, ax = plt.subplots(figsize=(8, 5), dpi=100)

# Create a pie chart from the counts of 'outcome' values:
# 'autopct' formats the percentages displayed on the chart,
# 'shadow' adds a shadow effect to the pie chart,
# 'startangle' rotates the start of the pie chart by 90 degrees,
# 'explode' offsets each wedge by a fraction of the radius,
# 'labels' sets custom labels for the wedges
patches, texts, autotexts = ax.pie(df_final['outcome'].value_counts(), autopct='%1.1f%%', shadow=True,
                                   startangle=90, explode=(0.1, 0), labels=['Alive', 'Death'])

# Customize the font size, color, and weight of the autopct text
plt.setp(autotexts, size=12, color='black', weight='bold')

# Set the 'Death' autopct text to white for better visibility
autotexts[1].set_color('white')

# Add a title to the pie chart
plt.title('Outcome Distribution', fontsize=14)

# Display the pie chart
plt.show()
The code below uses Plotly Express to draw an interactive histogram of BMI coloured by outcome, with a marginal box plot (note that plotly.express must be imported first).
import plotly.express as px

fig = px.histogram(df, x="BMI", color="outcome", marginal="box", hover_data=df.columns)
fig.show()
The same plot, but for SP O2:
fig = px.histogram(df, x="SP O2", color="outcome", marginal="box", hover_data=df.columns)
fig.show()
The same plot, but for heart rate:
fig = px.histogram(df, x="heart rate", color="outcome", marginal="box", hover_data=df.columns)
fig.show()
The code below creates a count plot of the outcomes ('Alive' and 'Death') by gender (the 'gendera' column) in df_final, using Seaborn, and annotates each bar with its count.
plt.figure(figsize=(12, 8))
plot = sns.countplot(x='gendera', hue='outcome', data=df_final)
plt.xlabel('Gender', fontsize=14, weight='bold')
plt.ylabel('Count', fontsize=14, weight='bold')
plt.xticks(np.arange(2), ['Male', 'Female'], rotation='vertical', weight='bold')

# Annotate each bar with its height (the count)
for i in plot.patches:
    plot.annotate(format(i.get_height()),
                  (i.get_x() + i.get_width() / 2, i.get_height()),
                  ha='center', va='center', size=10,
                  xytext=(0, 8), textcoords='offset points')
plt.show()
Correlation
The code below defines a list named col with the column names we want to analyse: the binary/categorical attributes together with the target variable outcome.
col = ['group', 'gendera', 'hypertensive','atrialfibrillation', 'CHD with no MI', 'diabetes', 'deficiencyanemias',
'depression', 'Hyperlipemia', 'Renal failure', 'COPD', 'outcome']
This line of code calculates the correlation matrix for the columns specified in the list col from the DataFrame df_final.
corr = df_final[col].corr()
This code below generates a heatmap to visualize the correlation matrix calculated earlier (corr) using Seaborn.
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='YlOrBr')
plt.show()
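As a small follow-up (an addition, not in the original notebook), we can rank these columns by the absolute strength of their correlation with the target:
# Correlation of each selected column with 'outcome', strongest first
corr['outcome'].drop('outcome').abs().sort_values(ascending=False)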
In part two of this project, we will continue with plots of the continuous variables, data pre-processing, model development, and finally automated machine learning with PyCaret.