Continuation of a data science project on heart attack risk predictor with eval machine learning (part two)


In continuation of the data science project on heart attack risk predictor with eval machine learning, we will dive further into different models, starting with random forest, and then fine-tune the best model using EvalML.

What is eval machine learning?
EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions. Combined with Featuretools and Compose, EvalML can be used to create end-to-end supervised machine learning solutions.

The link to part one of this project is attached below:
https://coderlegion.com/270/a-data-science-project-on-heart-attack-risk-predictor-with-eval-machine-learning-part-one

Random Forest

Importing the random forest classifier, fitting the model, and making predictions

from sklearn.ensemble import RandomForestClassifier
rf= RandomForestClassifier()
rf.fit(x_train,encoded_y)
ypred3 = rf.predict(x_test)

Computing the confusion matrix and accuracy score

rf_conf_matrix = confusion_matrix(encoded_ytest,ypred3 )
rf_acc_score = accuracy_score(encoded_ytest, ypred3)
rf_conf_matrix

Printing the accuracy score

print(rf_acc_score*100,"%")

Random Forest also gives us an accuracy of around 79%.

K Nearest Neighbour

  • We have to select the value of k that gives us the maximum accuracy.

Now let us write a loop to compute the error rate for each value of k.

from sklearn.neighbors import KNeighborsClassifier

Computing the error rate for each k

error_rate= []
for i in range(1,40):
     knn= KNeighborsClassifier(n_neighbors=i)
     knn.fit(x_train,encoded_y)
     pred= knn.predict(x_test)
     error_rate.append(np.mean(pred != encoded_ytest))

Plotting to check the correct value of K

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
              markerfacecolor='red', markersize=10)

plt.xlabel('K Value')
plt.ylabel('Error rate')
plt.title('To check the correct value of k')
plt.show()

As we can see from the graph, we should select K = 12 as it gives the lowest error rate.

Fitting K nearest neighbors with k = 12

knn= KNeighborsClassifier(n_neighbors=12)
knn.fit(x_train,encoded_y)
ypred4= knn.predict(x_test)

Confusion matrix

knn_conf_matrix = confusion_matrix(encoded_ytest,ypred4 )
knn_acc_score = accuracy_score(encoded_ytest, ypred4)

knn_conf_matrix

Printing accuracy score

print(knn_acc_score*100,"%")

As we see KNN gives us an accuracy of around 85% which is good.

Support Vector Machine (SVM)

Importing SVM, fitting the model, and computing the confusion matrix and accuracy score

from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train, encoded_y)

ypred5= svm.predict(x_test)
svm_conf_matrix = confusion_matrix(encoded_ytest,ypred5)
svm_acc_score = accuracy_score(encoded_ytest, ypred5)

svm_conf_matrix

Print SVM accuracy score

print(svm_acc_score*100,"%")

We get an accuracy of around 80% with SVM.

Let us see our model accuracies in table form.

model_acc = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'K Nearest Neighbor', 'SVM'],
    'Accuracy': [lr_acc_score*100, tree_acc_score*100, rf_acc_score*100, knn_acc_score*100, svm_acc_score*100]
})

model_acc = model_acc.sort_values(by=['Accuracy'],ascending=False)
model_acc
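
If you prefer a visual comparison, here is a small optional sketch (assuming matplotlib is already imported as plt, as in the earlier plots) that turns the model_acc table above into a bar chart:

# Optional: visualize the accuracy comparison from the model_acc table
plt.figure(figsize=(8, 5))
plt.bar(model_acc['Model'], model_acc['Accuracy'], color='teal')
plt.ylabel('Accuracy (%)')
plt.title('Model accuracy comparison')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()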

Let us use one more technique known as AdaBoost. This is a boosting technique that combines multiple models to obtain better accuracy.

Adaboost Classifier

Let us first use some arbitrary parameters to train the model without hyperparameter tuning.

from sklearn.ensemble import AdaBoostClassifier
# Note: in newer scikit-learn versions the 'base_estimator' parameter is named 'estimator'
adab = AdaBoostClassifier(base_estimator=svm, n_estimators=100, algorithm='SAMME',
                          learning_rate=0.01, random_state=0)
adab.fit(x_train,encoded_y)

ypred6=adab.predict(x_test)

adab_conf_matrix = confusion_matrix(encoded_ytest,ypred6)

adab_acc_score = accuracy_score(encoded_ytest, ypred6)

Confusion Matrix

adab_conf_matrix

Print accuracy score

print(adab_acc_score*100,"%")

Training set score

adab.score(x_train,encoded_y)

Test set score

adab.score(x_test,encoded_ytest)

As we can see, our model has performed very poorly, with just 50% accuracy.

We will use Grid Search CV for hyperparameter tuning.

Grid Search CV

Let us try Grid Search CV on our top three performing algorithms for hyperparameter tuning.

from sklearn.model_selection import GridSearchCV

model_acc

Logistic Regression

Implementing grid search with logistic regression

param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'penalty': ['none', 'l1', 'l2', 'elasticnet'],
    'C': [100, 10, 1.0, 0.1, 0.01]
}

grid1= GridSearchCV(LogisticRegression(),param_grid)

grid1.fit(x_train,encoded_y)

grid1.best_params_

Let us apply these parameters in our model.

logreg1= LogisticRegression(C=0.01,penalty='l2',solver='liblinear')
logreg1.fit(x_train,encoded_y)

logreg_pred= logreg1.predict(x_test)

logreg_pred_conf_matrix = confusion_matrix(encoded_ytest,logreg_pred)
logreg_pred_acc_score = accuracy_score(encoded_ytest, logreg_pred)

logreg_pred_conf_matrix

Printing the accuracy score

print(logreg_pred_acc_score*100,"%")

We got an accuracy of 81%.

Implementing grid search with KNN

n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']

grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)

from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid_search = GridSearchCV(estimator=knn, param_grid=grid, n_jobs=-1, cv=cv,
                           scoring='accuracy', error_score=0)

grid_search.fit(x_train,encoded_y)

grid_search.best_params_

Let us apply these parameters.

knn= KNeighborsClassifier(n_neighbors=12,metric='manhattan',weights='distance')
knn.fit(x_train,encoded_y)
knn_pred= knn.predict(x_test)

knn_pred_conf_matrix = confusion_matrix(encoded_ytest,knn_pred)
knn_pred_acc_score = accuracy_score(encoded_ytest, knn_pred)

knn_pred_conf_matrix

Printing accuracy score

print(knn_pred_acc_score*100,"%")

We have an Accuracy of 82.5%.

For Support Vector Machine (SVM)

kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']

grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=svm, param_grid=grid, n_jobs=-1, cv=cv,
                           scoring='accuracy', error_score=0)

grid_search.fit(x_train,encoded_y)

grid_search.best_params_

Let us apply these parameters.

from sklearn.svm import SVC

svc= SVC(C= 0.1, gamma= 'scale',kernel= 'sigmoid')

svc.fit(x_train,encoded_y)

svm_pred= svc.predict(x_test)

svm_pred_conf_matrix = confusion_matrix(encoded_ytest,svm_pred)
svm_pred_acc_score = accuracy_score(encoded_ytest, svm_pred)

svm_pred_conf_matrix

Printing score

print(svm_pred_acc_score*100,"%")

Accuracy is 81%.

Final Verdict

After comparing all the models, the best performing model is logistic regression with no hyperparameter tuning.

logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)
Y_pred1 = logreg.predict(x_test)

lr_conf_matrix = confusion_matrix(encoded_ytest, Y_pred1)
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)
lr_conf_matrix

Printing logistic accuracy score

print(lr_acc_score*100,"%")

Let us build a proper confusion matrix for our model

# Confusion Matrix of  Model enlarged
options = ["Disease", 'No Disease']

fig, ax = plt.subplots()
im = ax.imshow(lr_conf_matrix, cmap= 'Set3', interpolation='nearest')

# We want to show all ticks...
ax.set_xticks(np.arange(len(options)))
ax.set_yticks(np.arange(len(options)))
# ... and label them with the respective list entries
ax.set_xticklabels(options)
ax.set_yticklabels(options)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
       rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(options)):
    for j in range(len(options)):
        text = ax.text(j, i, lr_conf_matrix[i, j],
                       ha="center", va="center", color="black")

ax.set_title("Confusion Matrix of Logistic Regression Model")
fig.tight_layout()
plt.xlabel('Model Prediction')
plt.ylabel('Actual Result')
plt.show()
print("ACCURACY of our model is ",lr_acc_score*100,"%")

Accuracy of our model is 85.71428571428571 %.

We have successfully built a model that predicts whether or not a person is at risk of heart disease, with about 85.7% accuracy.
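
As a quick illustration, here is a minimal sketch (assuming logreg and x_test are the fitted logistic regression model and test features from above) of how the model could be used to assess a single record:

sample = x_test[:1]                              # the first record of the test set, kept two-dimensional
risk_class = logreg.predict(sample)[0]           # predicted class for this record
risk_proba = logreg.predict_proba(sample)[0][1]  # probability of the positive (at-risk) class
print("Predicted class:", risk_class, "| Risk probability:", round(risk_proba, 3))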

Now using AutoML

As defined earlier, EvalML automates a large part of the machine learning process, and with it we can easily evaluate which machine learning pipeline works best for the given set of data.

Installing EvalML

!pip install evalml

Let us load our dataset.

df= pd.read_csv("/content/drive/MyDrive/heart.csv")

df.head()

Let us split our dataset into the dependent variable (i.e. our target) and the independent variables.

x= df.iloc[:,:-1]
x

Selecting and encoding the target variable into an array

y= df.iloc[:,-1:]
y= lbl.fit_transform(y)
y

Importing the EvalML library and splitting the data into train and test sets

import evalml

X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(x, y, problem_type='binary')

Note: There are different problem types in EvalML. We have a binary classification problem here, which is why we are passing 'binary' as the input.

evalml.problem_types.ProblemTypes.all_problem_types

Running the AutoML search to select the best algorithm

from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary') 
automl.search()

As we can see from the output when the code is run, the AutoML search returns the best-fit algorithm, which here is an Extra Trees classifier with an imputer. We can also compare the rest of the models.

automl.rankings

automl.best_pipeline

best_pipeline=automl.best_pipeline

We can get a detailed description of our best selected model.

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

Best pipeline score

best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])

Now, if we want to build our model for a specific objective, we can do that as well.

automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
                                     problem_type='binary',
                                     objective='auc',
                                     additional_objectives=['f1', 'precision'],
                                     max_batches=1,
                                     optimize_thresholds=True)

automl_auc.search()

Creating a variable for best pipeline.

best_pipeline_auc = automl_auc.best_pipeline

Get the score on holdout data
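
Below is a minimal sketch of how this holdout score could be obtained, mirroring the earlier best_pipeline.score(...) call and assuming best_pipeline_auc, X_test, and y_test are as defined above:

# Score the AUC-optimized pipeline on the holdout (test) data
best_pipeline_auc.score(X_test, y_test, objectives=["auc"])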

We got an AUC score of about 88.5% on the holdout data, which is the highest of all our models.

To save the model

best_pipeline.save("model.pkl")

Loading our Model

final_model=automl.load('model.pkl')

final_model.predict_proba(X_test)

Output of the final model's probability predictions on the test data

In conclusion, the AUC-optimized pipeline gave us the best result, with an AUC score of about 88.5%. The model was also saved as a .pkl file; you can specify a directory on your local computer when saving the model.
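
For example, a purely illustrative way to save the pipeline into a specific folder (the folder name below is hypothetical) could look like this:

import os

model_dir = "saved_models"                      # hypothetical folder; change to your own directory
os.makedirs(model_dir, exist_ok=True)           # create the folder if it does not exist
best_pipeline.save(os.path.join(model_dir, "model.pkl"))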

This is the final conclusion of a data science project on heart attack risk predictor with Eval Machine Learning.
