A machine learning project on In-Hospital mortality prediction using machine learning and PyCaret (Part one)

The main aim of this project is to predict in-hospital mortality, i.e. whether a patient will survive or not, using different machine learning models and the PyCaret library.

Motivation

The predictors of in-hospital mortality for heart failure patients admitted to intensive care units (ICU) remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted heart failure patients.

We will build a model using both conventional machine learning and automated machine learning (AutoML) techniques to predict, from a set of observed parameters, whether a patient will survive or die. The data set has a range of attributes that help in predicting this outcome.

What is PyCaret

PyCaret is an open-source, low-code machine learning library in Python that automates the entire process of model training, evaluation, and deployment. It is designed to be simple and efficient, providing an easy-to-use interface for performing various machine learning tasks with minimal coding. Here are some key features and components of PyCaret:

  • Ease of Use: PyCaret simplifies the machine learning workflow, allowing users to build models with just a few lines of code. This makes it accessible for both beginners and experienced data scientists.

  • Low-Code: With PyCaret, users can perform complex machine learning tasks with minimal coding, reducing the time and effort required for model development.

  • End-to-End Machine Learning: PyCaret covers the entire machine learning lifecycle, including data preprocessing, model training, hyperparameter tuning, model evaluation, and deployment.

  • Multiple Modules: PyCaret provides modules tailored to different types of machine learning tasks:

      • Classification: For binary and multi-class classification problems.
      • Regression: For regression tasks.
      • Clustering: For unsupervised clustering tasks.
      • Anomaly Detection: For identifying outliers in data.
      • NLP: For natural language processing tasks.
      • Time Series: For time series forecasting (recently added).

Timeline of the project

  • Data Analysis: Exploring the data and the relationships between variables.

  • Feature Engineering: Processing the data before feeding it to the model.

  • Model building using machine learning: Training conventional machine learning algorithms.

  • Model building using auto machine learning: Using the PyCaret library.

Code implementation

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The code below mounts Google Drive in a Google Colab notebook, so that files stored in Drive can be accessed directly from the Colab environment.

from google.colab import drive
drive.mount('/content/drive')

The code below reads a CSV file stored in Google Drive into a pandas DataFrame.

df = pd.read_csv("/content/drive/MyDrive/mortality.csv")

The code below displays the first few rows of the DataFrame.

df.head()

Data Analysis

The code below generates descriptive statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of the DataFrame.

df.describe()

This code is used to identify and count the number of missing values in each column of a pandas DataFrame.

df.isnull().sum()
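As a rough illustration of what the missing-value count reports, the sketch below runs the same call on a small toy DataFrame and adds the share of missing rows per column. The toy values are made up for illustration and are not taken from the mortality data set.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the mortality data set (values are invented).
toy = pd.DataFrame({
    "BMI": [22.5, np.nan, 30.1, 27.0],
    "heart rate": [80.0, 95.0, np.nan, np.nan],
    "outcome": [0.0, 1.0, 0.0, np.nan],
})

missing_count = toy.isnull().sum()          # number of NaNs per column
missing_share = toy.isnull().mean() * 100   # same information as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the number of missing values makes it easier to see which columns need the most attention before imputation.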

This code below is used to get the dimensions of a pandas DataFrame.

df.shape

This returns (1177, 51), i.e. 1177 rows and 51 columns.

The code below is used to get a concise summary of a pandas DataFrame.

df.info()

Handling NaN Values

For float variables

The code snippet imports the SimpleImputer class from sklearn.impute and creates an instance of it to handle missing values in a dataset.

from sklearn.impute import SimpleImputer

si = SimpleImputer(missing_values=np.nan, strategy='mean')

The code below selects columns of a specific data type (float64) from a pandas DataFrame and stores the column names in a variable.

float_col = df.select_dtypes(include='float64').columns

The code below uses the SimpleImputer (si) to fill the missing values in the float columns (float_col) of the DataFrame (df) with each column's mean. An imputer has to be fitted before it can transform data, so fit_transform is used to learn the column means and apply them in one step.

df[float_col] = si.fit_transform(df[float_col])
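To make the mean-imputation step concrete, here is a minimal sketch on a toy DataFrame. The column values are invented, and the integer "ID" column is only there to show that non-float columns are left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: two float columns with gaps and one integer column (values are invented).
toy = pd.DataFrame({
    "BMI": [20.0, np.nan, 30.0],
    "heart rate": [80.0, 100.0, np.nan],
    "ID": [1, 2, 3],
})

# Select only the float64 columns, as in the article.
float_cols = toy.select_dtypes(include="float64").columns

# Fit the imputer (learn each column mean) and fill the gaps in one call.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[float_cols] = imputer.fit_transform(toy[float_cols])

print(toy)
```

Calling transform on an imputer that has never been fitted raises a NotFittedError in scikit-learn, which is why the fit step cannot be skipped.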

The code below separates the features (x) and the target variable (y) from a pandas DataFrame (df).

x = df.drop(columns='outcome')

y = df[['outcome']]

For the dependent variable

# Create a SimpleImputer object named 'SI'
# It replaces missing values (NaN) with the most frequent value along each column
SI = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform returns a NumPy array, so keep the result instead of discarding it
y_imputed = SI.fit_transform(y)

# Create a new DataFrame 'y' from the imputed array
# The column name is set to 'outcome' and the data type to 'int64'
y = pd.DataFrame(y_imputed, columns=['outcome'], index=x.index).astype('int64')

# Create a new DataFrame 'df_final' by copying the features DataFrame 'x'
df_final = x.copy()

# Add a new column 'outcome' to 'df_final' and fill it with the values from the target variable 'y'
df_final['outcome'] = y['outcome']
The code below counts the missing values in each column of df_final, to confirm that no NaN values remain after imputation.

df_final.isnull().sum()
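The target-variable steps above can be sketched end to end on toy data. This is an illustrative example under assumed values, not the article's actual data; the names toy, mode_imputer and toy_final are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing target value (values are invented).
toy = pd.DataFrame({
    "BMI": [22.0, 31.0, 27.5, 24.0],
    "outcome": [0.0, np.nan, 0.0, 1.0],
})

x = toy.drop(columns="outcome")   # features
y = toy[["outcome"]]              # target, kept 2-D for the imputer

# Replace the missing target value with the most frequent class.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
y_filled = mode_imputer.fit_transform(y)   # NumPy array of shape (4, 1)

# Rebuild the target as an integer column aligned with the feature index.
y = pd.DataFrame(y_filled, columns=["outcome"], index=x.index).astype("int64")

# Reassemble features and target into one frame.
toy_final = x.copy()
toy_final["outcome"] = y["outcome"]

print(toy_final)
print(toy_final.isnull().sum())
```

Passing index=x.index when rebuilding the target keeps the rows aligned if the original frame does not use a default 0..n-1 index.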

Visualising our Dependent variable

# Create a figure and axis object with specified size and DPI
fig, ax = plt.subplots(figsize=(8, 5), dpi=100)

# Create a pie chart with counts of 'outcome' values
# 'autopct' formats the percentages displayed on the chart
# 'shadow' adds shadow effect to the pie chart
# 'startangle' rotates the start of the pie chart to 90 degrees
# 'explode' specifies the fraction of the radius with which to offset each wedge
# 'labels' specifies custom labels for each wedge
patches, texts, autotexts = ax.pie(df_final['outcome'].value_counts(), autopct='%1.1f%%', shadow=True,
                                  startangle=90, explode=(0.1, 0), labels=['Alive', 'Death'])

# Customize the font size, color, and weight of the autopct text
plt.setp(autotexts, size=12, color='black', weight='bold')

# Set the color of the 'Death' autopct text to white for better visibility
autotexts[1].set_color('white')

# Add a title to the pie chart
plt.title('Outcome Distribution', fontsize=14)

# Display the pie chart
plt.show()
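One detail worth checking in the pie chart: value_counts() orders classes by frequency, so the hard-coded labels ['Alive', 'Death'] are only correct when survivors are the majority class. The sketch below derives the labels from the counted values instead. It assumes the coding 0 = alive and 1 = death, which is not stated in the article, and uses invented toy values.

```python
import pandas as pd

# Toy outcome column (invented values); 0 = alive, 1 = death is an assumed coding.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="outcome")

# value_counts() sorts by frequency, most common class first.
counts = outcome.value_counts()

# Map each counted class to its label so the order always matches the wedges.
label_map = {0: "Alive", 1: "Death"}
labels = [label_map[value] for value in counts.index]

# The percentages that autopct would print on the chart.
shares = (counts / counts.sum() * 100).round(1)

print(labels)
print(shares)
```

The resulting counts and labels can then be passed to ax.pie(counts, labels=labels, ...) in place of the fixed label list.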

# Plotly Express is needed for the interactive histograms below
import plotly.express as px

# Histogram of BMI, colored by outcome, with a box plot in the margin
fig = px.histogram(df, x="BMI", color="outcome", marginal="box", hover_data=df.columns)
fig.show()

# Same plot for SP O2
fig = px.histogram(df, x="SP O2", color="outcome", marginal="box", hover_data=df.columns)
fig.show()

# Same plot for heart rate
fig = px.histogram(df, x="heart rate", color="outcome", marginal="box", hover_data=df.columns)
fig.show()
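The box plots in the histogram margins summarize each variable per outcome group. As a hedged numeric counterpart, the sketch below computes a few of the same summary statistics with a groupby on toy data; the values are invented and only the BMI and outcome column names come from the article.

```python
import pandas as pd

# Toy data (invented values) with the article's column names.
toy = pd.DataFrame({
    "BMI": [22.0, 24.0, 35.0, 31.0, 28.0, 26.0],
    "outcome": [0, 0, 1, 1, 0, 1],
})

# Per-outcome summary of BMI: group size, median and range.
summary = toy.groupby("outcome")["BMI"].agg(["count", "median", "min", "max"])

print(summary)
```

A table like this is useful alongside the plots when the two outcome groups differ a lot in size, since the histogram heights alone can be misleading.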

This code below creates a count plot to visualize the distribution of outcomes ('Alive' and 'Death') based on gender ('gendera') in the DataFrame df_final. It utilizes Seaborn for plotting.

plt.figure(figsize=(12,8))
# Pass column names with data= so the call works with current seaborn versions
plot = sns.countplot(x='gendera', hue='outcome', data=df_final)
plt.xlabel('Gender', fontsize=14, weight='bold')
plt.ylabel('Count', fontsize=14, weight='bold')
plt.xticks(np.arange(2), ['Male', 'Female'], rotation='vertical', weight='bold')

for i in plot.patches:
  plot.annotate(format(i.get_height()),
                (i.get_x() + i.get_width()/2,
                 i.get_height()), ha='center', va='center',
                size=10, xytext=(0,8),
                textcoords='offset points')

plt.show()
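The numbers annotated on the bars of the count plot can also be obtained as a table. The sketch below uses pd.crosstab on toy data; the values are invented, and only the gendera and outcome column names come from the article.

```python
import pandas as pd

# Toy data (invented values) with the article's column names.
toy = pd.DataFrame({
    "gendera": [1, 1, 2, 2, 2, 1, 2],
    "outcome": [0, 1, 0, 0, 1, 0, 0],
})

# Rows are gender codes, columns are outcome codes, cells are patient counts.
counts = pd.crosstab(toy["gendera"], toy["outcome"])

# Same table as row-wise proportions, i.e. the outcome rate within each gender.
rates = pd.crosstab(toy["gendera"], toy["outcome"], normalize="index")

print(counts)
print(rates.round(2))
```

The normalized table makes the groups comparable even when one gender has many more patients than the other.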

Correlation

This code below defines a list named col containing the column names from a DataFrame. This list represents the features or attributes we are interested in analyzing.

col = ['group', 'gendera', 'hypertensive','atrialfibrillation', 'CHD with no MI', 'diabetes', 'deficiencyanemias',
   'depression', 'Hyperlipemia', 'Renal failure', 'COPD', 'outcome']

This line of code calculates the correlation matrix for the columns specified in the list col from the DataFrame df_final.

corr = df_final[col].corr()

This code below generates a heatmap to visualize the correlation matrix calculated earlier (corr) using Seaborn.

plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap='YlOrBr')
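To read the heatmap with the target in mind, it can help to pull out just the outcome column of the correlation matrix and rank the features by absolute correlation. The sketch below does this on toy binary data; the values are invented and the variable names outcome_corr and ranked are hypothetical.

```python
import pandas as pd

# Toy binary features (invented values) using a few of the article's column names.
toy = pd.DataFrame({
    "diabetes": [0, 1, 0, 1, 1, 0],
    "COPD":     [0, 0, 1, 0, 1, 1],
    "outcome":  [0, 1, 0, 1, 1, 0],
})

# Pearson correlation matrix, as computed by DataFrame.corr() by default.
corr = toy.corr()

# Correlation of every feature with the target, excluding the target itself.
outcome_corr = corr["outcome"].drop("outcome")

# Rank features by the strength (absolute value) of their correlation.
ranked = outcome_corr.abs().sort_values(ascending=False)

print(outcome_corr)
print(ranked)
```

Correlation between binary columns is a coarse measure of association, so a ranking like this is best treated as a first look rather than a feature-selection rule.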

In part two of this project, we will continue with plotting the continuous variables, data pre-processing, developing the models, and then using auto machine learning (PyCaret).
