This is the continuation of the project on predicting hospital mortality with machine learning and PyCaret.
The link to part one can be found below:
https://coderlegion.com/307/machine-learning-project-hospital-mortality-prediction-using-machine-learning-pycaret
The main aim of this project is to predict mortality, i.e. whether a patient will survive their hospital stay or not, with the help of different machine learning models and the PyCaret library.
Challenges and solution
The predictors of in-hospital mortality for heart failure patients admitted to intensive care units (ICU) remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted heart failure patients.
We will build a machine learning model, using both conventional and automated machine learning techniques, to predict whether a patient will live or die based on a set of observed parameters. The dataset used has different attributes that will help us predict the outcome.
What is PyCaret
PyCaret is an open-source, low-code machine learning library in Python that automates the entire process of model training, evaluation, and deployment. It is designed to be simple and efficient, providing an easy-to-use interface for performing various machine learning tasks with minimal coding.
Code Implementation
- Distribution of continuous variables.
This code below creates a Kernel Density Estimate (KDE) plot for the 'age' column in the DataFrame df_final using Matplotlib.
import matplotlib.pyplot as plt  # if not already imported in part one

plt.figure(figsize=(9,4))
df_final['age'].plot(kind='kde')
This code below generates a Kernel Density Estimate (KDE) plot for the 'EF' (Ejection Fraction) column in the DataFrame df_final using Matplotlib.
plt.figure(figsize=(10,5))
df_final['EF'].plot(kind='kde')
This code below generates a Kernel Density Estimate (KDE) plot for the 'RBC' (Red Blood Cell count) column in the DataFrame df_final using Matplotlib.
plt.figure(figsize=(10,5))
df_final['RBC'].plot(kind='kde')
This code below generates a Kernel Density Estimate (KDE) plot for the 'Creatinine' column in the DataFrame df_final.
plt.figure(figsize=(10,5))
df_final['Creatinine'].plot(kind='kde')
This code below generates a Kernel Density Estimate (KDE) plot for the 'Blood calcium' column in the DataFrame df_final using Matplotlib.
plt.figure(figsize=(10,5))
df_final['Blood calcium'].plot(kind='kde')
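The same pattern repeats for each continuous variable. As a more compact alternative (not part of the original notebook), the same KDE plots can be generated in a single loop over the columns of interest:
# Columns inspected above; adjust the list to match your DataFrame
continuous_cols = ['age', 'EF', 'RBC', 'Creatinine', 'Blood calcium']
for col in continuous_cols:
    plt.figure(figsize=(10, 5))
    df_final[col].plot(kind='kde')
    plt.title(f'KDE of {col}')
    plt.show()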
The df_final.head() method displays the first five rows of the DataFrame df_final. This is useful for quickly inspecting the data and verifying its structure, column names, and initial values.
df_final.head()
This code below separates the features and the target variable from the DataFrame df_final.
x = df_final.drop(columns='outcome')
y = df_final[['outcome']]
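Since this is a binary classification problem, it also helps to check how balanced the outcome classes are before modelling. This quick check is an addition, not part of the original notebook:
# Count how many patients belong to each outcome class (survived vs. died)
y['outcome'].value_counts()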
This line of code below imports the StandardScaler class from the sklearn.preprocessing module. StandardScaler standardizes features by removing the mean and scaling to unit variance, which is an important preprocessing step in many machine learning workflows.
from sklearn.preprocessing import StandardScaler
import pandas as pd  # if not already imported in part one

scale = StandardScaler()
scaled = scale.fit_transform(x)
final_x = pd.DataFrame(scaled, columns=x.columns)
final_x.head()
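To confirm that the standardization behaved as expected, each scaled column should now have a mean close to 0 and a standard deviation close to 1. This sanity check is an addition to the original notebook:
# Mean should be ~0 and standard deviation ~1 for every scaled feature
final_x.describe().loc[['mean', 'std']].round(2)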
Similarly, y.head() displays the first five rows of the target variable.
y.head()
This code below imports the train_test_split function from the sklearn.model_selection module and then uses it to split the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=123)
This line of code below prints the shapes of the training and testing feature sets, x_train and x_test. This is useful for verifying that the data has been split correctly and understanding the dimensions of the training and testing sets.
print(x_train.shape, x_test.shape)
This code removes the 'ID' column from both the training and testing feature sets, x_train and x_test, respectively.
x_train.drop(columns = 'ID', inplace=True)
x_test.drop(columns='ID', inplace=True)
The resulting shapes of x_train and x_test are (823, 50) and (354, 50), respectively.
x_train.head()
- Model development using machine learning
We will use the XGBoost classifier model.
This line of code below imports the XGBClassifier class along with the plot_tree and plot_importance utilities from the XGBoost library.
from xgboost import XGBClassifier, plot_tree, plot_importance
xgb = XGBClassifier(random_state=42)
xgb.fit(x_train, y_train)
pred = xgb.predict(x_test)
pred
This line of code imports several useful metrics and tools from the sklearn.metrics module, which are commonly used for evaluating classification models.
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
cf = confusion_matrix(y_test, pred)
cf
print(classification_report(y_test, pred))
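The accuracy_score function imported above is not used in the snippet; as a small addition, it gives a single summary figure that should match the accuracy row of the classification report:
# Overall fraction of correct predictions on the test set
print('Accuracy:', accuracy_score(y_test, pred))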
This line of code below places the true labels and the predicted labels side by side (column-wise concatenation) using NumPy.
import numpy as np  # if not already imported in part one
combine = np.concatenate((y_test.values.reshape(len(y_test),1), pred.reshape(len(pred),1)),1)
This line of code below converts the combined NumPy array combine into a Pandas DataFrame named combine_result, with two columns labeled 'y_test' and 'y_pred', respectively.
combine_result = pd.DataFrame(combine, columns=['y_test', 'y_pred'])
combine_result
Plotting the ROC and precision-recall curves
Note that plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2; the equivalent display classes are used below.
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
# Plot the ROC curve for the XGBoost classifier
RocCurveDisplay.from_estimator(xgb, x_test, y_test)
# Plot the diagonal line representing a random classifier
plt.plot([0,1], [0,1], color='magenta', ls='-')
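Since PrecisionRecallDisplay is imported as well, the precision-recall curve can be plotted the same way (a small addition to the original notebook):
# Plot the precision-recall curve for the XGBoost classifier
PrecisionRecallDisplay.from_estimator(xgb, x_test, y_test)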
Using automated machine learning (PyCaret)
!pip install pycaret
from pycaret.classification import *
This code below reads a CSV file named "mortality.csv" located at the path "/content/drive/MyDrive/mortality.csv" into a Pandas DataFrame named df.
df= pd.read_csv("/content/drive/MyDrive/mortality.csv")
df.head()
This code below utilizes the setup function from the pycaret library to set up a machine learning experiment with the DataFrame df as the data and 'outcome' as the target variable.
model = setup(data= df, target= 'outcome')
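setup also accepts optional arguments; for example, session_id fixes the random seed so the experiment is reproducible, and train_size controls the train/test split. The values below are illustrative and not from the original notebook:
# Reproducible experiment with a 70/30 train/test split inside PyCaret
model = setup(data=df, target='outcome', session_id=123, train_size=0.7)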
This line of code calls the compare_models() function from the PyCaret library, which trains and cross-validates all available classification models and ranks them in a leaderboard.
compare_models()
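By default, compare_models also returns the single best-performing model, so it can be captured in a variable for later reuse (a small addition to the original notebook):
# Keep a reference to the best-performing model from the leaderboard
best_model = compare_models()
print(best_model)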
This line of code creates a Ridge Classifier using the create_model() function from the PyCaret library. In pycaret.classification, create_model('ridge') builds a Ridge Classifier: a linear model that applies L2 regularization to penalize large coefficients and reduce overfitting.
ridge= create_model('ridge')
The predict_model function from the PyCaret library generates predictions using a trained model (ridge in this case) on new or unseen data (x_test).
pred = predict_model(ridge,data= x_test)
pred
Lastly, the Ridge Classifier achieved the highest accuracy in this project, and the PyCaret library reached that result with very little code. These models can also be deployed, which makes the workflow production ready, because PyCaret packages the preprocessing steps and the trained model into a single pipeline.
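As a final, illustrative step (not part of the original notebook), PyCaret's save_model and load_model functions can persist the trained pipeline to disk and reload it later for deployment; the file name below is hypothetical.
# Save the trained pipeline (preprocessing + model) to disk
save_model(ridge, 'ridge_mortality_pipeline')
# Reload it later, e.g. in a production service, and predict on new data
loaded_model = load_model('ridge_mortality_pipeline')
new_predictions = predict_model(loaded_model, data=x_test)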