In continuation of our data science project on a heart attack risk predictor with EvalML, we will dive further into different models, starting with random forest, and then fine-tune the best model using EvalML.
What is EvalML?
EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions. Combined with Featuretools and Compose, EvalML can be used to create end-to-end supervised machine learning solutions.
The link to part one of this project is below:
https://coderlegion.com/270/a-data-science-project-on-heart-attack-risk-predictor-with-eval-machine-learning-part-one
Random Forest
Importing the random forest classifier, fitting it on the training data, and predicting on the test set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, encoded_y)
ypred3 = rf.predict(x_test)
Computing the confusion matrix and accuracy score
rf_conf_matrix = confusion_matrix(encoded_ytest, ypred3)
rf_acc_score = accuracy_score(encoded_ytest, ypred3)
rf_conf_matrix
Printing the accuracy score
print(rf_acc_score*100,"%")
Random Forest also gives us an accuracy of around 79%.
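Accuracy alone can hide per-class behavior. As an optional quick check (a small sketch reusing the variables above), we can also print the precision and recall for each class:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the random forest predictions
print(classification_report(encoded_ytest, ypred3))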
K Nearest Neighbor (KNN)
We have to select the value of k that gives the maximum accuracy. Let us write a loop to compute the error rate for each candidate k.
from sklearn.neighbors import KNeighborsClassifier
Computing the error rate for k from 1 to 39
error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, encoded_y)
    pred = knn.predict(x_test)
    error_rate.append(np.mean(pred != encoded_ytest))
Plotting to check the correct value of K
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.xlabel('K Value')
plt.ylabel('Error rate')
plt.title('To check the correct value of k')
plt.show()
As we see from the graph, we should select k = 12, as it gives the lowest error rate.
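Instead of eyeballing the plot, we can also read off the best k programmatically (a small sketch using the error_rate list computed above):
import numpy as np

# k values start at 1, so shift the index of the smallest error by 1
best_k = int(np.argmin(error_rate)) + 1
print("Best k:", best_k, "with error rate", min(error_rate))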
Fitting KNN with k = 12
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(x_train, encoded_y)
ypred4 = knn.predict(x_test)
Confusion matrix
knn_conf_matrix = confusion_matrix(encoded_ytest, ypred4)
knn_acc_score = accuracy_score(encoded_ytest, ypred4)
knn_conf_matrix
Printing accuracy score
print(knn_acc_score*100,"%")
As we see, KNN gives us an accuracy of around 85%, which is good.
Support Vector Machine (SVM)
Importing SVM, fitting the model, and computing the confusion matrix and accuracy score
from sklearn.svm import SVC

# Note: the original imported the svm module and then shadowed it with the
# fitted model; importing SVC directly avoids that.
svm = SVC()
svm.fit(x_train, encoded_y)
ypred5 = svm.predict(x_test)
svm_conf_matrix = confusion_matrix(encoded_ytest,ypred5)
svm_acc_score = accuracy_score(encoded_ytest, ypred5)
svm_conf_matrix
Print SVM accuracy score
print(svm_acc_score*100,"%")
We get an accuracy of 80% with SVM.
Let us see our model accuracies in table form.
model_acc = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest',
              'K Nearest Neighbor', 'SVM'],
    'Accuracy': [lr_acc_score*100, tree_acc_score*100, rf_acc_score*100,
                 knn_acc_score*100, svm_acc_score*100]
})
model_acc = model_acc.sort_values(by=['Accuracy'],ascending=False)
model_acc
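A quick bar chart makes the comparison easier to read (an optional sketch, reusing the model_acc DataFrame and matplotlib from above):
plt.figure(figsize=(8, 4))
plt.bar(model_acc['Model'], model_acc['Accuracy'], color='teal')
plt.ylabel('Accuracy (%)')
plt.title('Model accuracy comparison')
plt.xticks(rotation=30, ha='right')
plt.show()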
Let us use one more technique known as AdaBoost. This is a boosting technique that combines multiple models for better accuracy.
AdaBoost Classifier
Let us first train the model with some arbitrary parameters, without hyperparameter tuning.
from sklearn.ensemble import AdaBoostClassifier

# Note: in scikit-learn 1.2+ the base_estimator argument is named estimator
adab = AdaBoostClassifier(base_estimator=svm, n_estimators=100, algorithm='SAMME',
                          learning_rate=0.01, random_state=0)
adab.fit(x_train,encoded_y)
ypred6=adab.predict(x_test)
adab_conf_matrix = confusion_matrix(encoded_ytest,ypred6)
adab_acc_score = accuracy_score(encoded_ytest, ypred6)
Confusion Matrix
adab_conf_matrix
Print accuracy score
print(adab_acc_score*100,"%")
Score on the training set
adab.score(x_train,encoded_y)
Score on the test set
adab.score(x_test,encoded_ytest)
As we see, our model has performed very poorly, with just 50% accuracy.
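The weak result most likely comes from the chosen base estimator and the very small learning rate rather than from boosting itself. As an optional experiment (a sketch, not part of the original walkthrough), AdaBoost with its default decision-stump base usually fares better:
# AdaBoost with its default base estimator (a depth-1 decision tree)
adab_default = AdaBoostClassifier(n_estimators=100, random_state=0)
adab_default.fit(x_train, encoded_y)
print(accuracy_score(encoded_ytest, adab_default.predict(x_test))*100, "%")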
We will use Grid Search CV for hyperparameter tuning.
Grid Search CV
Let us try Grid Search CV on our top three performing algorithms for hyperparameter tuning.
from sklearn.model_selection import GridSearchCV
model_acc
Logistic Regression
Implementing grid search with logistic regression
from sklearn.linear_model import LogisticRegression

param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'penalty': ['none', 'l1', 'l2', 'elasticnet'],
    'C': [100, 10, 1.0, 0.1, 0.01]
}
# Not every solver supports every penalty; depending on your scikit-learn
# version, invalid combinations are skipped with a warning and scored as NaN
grid1 = GridSearchCV(LogisticRegression(), param_grid)
grid1.fit(x_train,encoded_y)
grid1.best_params_
Let us apply these parameters in our model.
logreg1 = LogisticRegression(C=0.01, penalty='l2', solver='liblinear')
logreg1.fit(x_train,encoded_y)
logreg_pred= logreg1.predict(x_test)
logreg_pred_conf_matrix = confusion_matrix(encoded_ytest,logreg_pred)
logreg_pred_acc_score = accuracy_score(encoded_ytest, logreg_pred)
logreg_pred_conf_matrix
Printing the accuracy score
print(logreg_pred_acc_score*100,"%")
We got an accuracy of 81%.
Implementing grid search with KNN
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=knn, param_grid=grid, n_jobs=-1, cv=cv,
scoring='accuracy',error_score=0)
grid_search.fit(x_train,encoded_y)
grid_search.best_params_
Let us apply these parameters (note that we keep k = 12 from the elbow plot; the grid above searched odd values of k only).
knn = KNeighborsClassifier(n_neighbors=12, metric='manhattan', weights='distance')
knn.fit(x_train,encoded_y)
knn_pred= knn.predict(x_test)
knn_pred_conf_matrix = confusion_matrix(encoded_ytest,knn_pred)
knn_pred_acc_score = accuracy_score(encoded_ytest, knn_pred)
knn_pred_conf_matrix
Printing accuracy score
print(knn_pred_acc_score*100,"%")
We get an accuracy of 82.5%.
Implementing grid search with SVM
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=svm, param_grid=grid, n_jobs=-1, cv=cv,
scoring='accuracy',error_score=0)
grid_search.fit(x_train,encoded_y)
grid_search.best_params_
Let us apply these parameters.
from sklearn.svm import SVC
svc = SVC(C=0.1, gamma='scale', kernel='sigmoid')
svc.fit(x_train,encoded_y)
svm_pred= svc.predict(x_test)
svm_pred_conf_matrix = confusion_matrix(encoded_ytest,svm_pred)
svm_pred_acc_score = accuracy_score(encoded_ytest, svm_pred)
svm_pred_conf_matrix
Printing score
print(svm_pred_acc_score*100,"%")
Accuracy is 81%.
Final Verdict
After comparing all the models, the best-performing model is logistic regression with no hyperparameter tuning.
logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)
Y_pred1 = logreg.predict(x_test)
lr_conf_matrix = confusion_matrix(encoded_ytest, Y_pred1)
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)
lr_conf_matrix
Printing logistic accuracy score
print(lr_acc_score*100,"%")
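As a sanity check that this result is not an artifact of one particular train/test split, we can cross-validate the plain logistic regression (a sketch, assuming x_train and encoded_y from above):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the training data
cv_scores = cross_val_score(LogisticRegression(), x_train, encoded_y, cv=5,
                            scoring='accuracy')
print("Mean CV accuracy:", cv_scores.mean()*100, "%")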
Let us plot a properly labeled confusion matrix for our model
# Confusion Matrix of Model enlarged
options = ["Disease", "No Disease"]
fig, ax = plt.subplots()
im = ax.imshow(lr_conf_matrix, cmap= 'Set3', interpolation='nearest')
# We want to show all ticks...
ax.set_xticks(np.arange(len(options)))
ax.set_yticks(np.arange(len(options)))
# ... and label them with the respective list entries
ax.set_xticklabels(options)
ax.set_yticklabels(options)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(options)):
    for j in range(len(options)):
        text = ax.text(j, i, lr_conf_matrix[i, j],
                       ha="center", va="center", color="black")
ax.set_title("Confusion Matrix of Logistic Regression Model")
fig.tight_layout()
plt.xlabel('Model Prediction')
plt.ylabel('Actual Result')
plt.show()
print("ACCURACY of our model is ",lr_acc_score*100,"%")
The accuracy of our model is 85.71%.
We have successfully built a model that predicts whether a person is at risk of heart disease, with 85.7% accuracy.
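To use the model for an individual, we pass a single (already preprocessed) feature row. A minimal sketch using the first patient in the test set:
# Predict the risk class and probabilities for one patient
sample = x_test.iloc[[0]]  # double brackets keep the row as a DataFrame
print("Predicted class:", logreg.predict(sample)[0])
print("Class probabilities:", logreg.predict_proba(sample)[0])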
Now using AutoML with EvalML
As explained earlier, EvalML automates a large part of the machine learning process, and we can easily evaluate which machine learning pipeline works best for a given dataset.
Installing EvalML
!pip install evalml
Let us load our dataset.
df= pd.read_csv("/content/drive/MyDrive/heart.csv")
df.head()
Let us split our dataset into the dependent (target) variable and the independent variables.
x= df.iloc[:,:-1]
x
Label-encoding the target into an array (lbl is the LabelEncoder fitted earlier)
y= df.iloc[:,-1:]
y= lbl.fit_transform(y)
y
Importing the EvalML library and splitting the data
import evalml

X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(x, y,
                                                                   problem_type='binary')
Note: There are different problem types in EvalML; since we have a binary classification problem here, we pass 'binary' as the input. The full list of problem types can be displayed:
evalml.problem_types.ProblemTypes.all_problem_types
Running AutoML to select the best algorithm
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')
automl.search()
As we see from the output when running the code, the AutoML search gives us the best-fitting algorithm, which is an Extra Trees Classifier with an imputer. We can also compare the rest of the models.
automl.rankings
automl.best_pipeline
best_pipeline = automl.best_pipeline
We can get a detailed description of the best selected model:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
Scoring the best pipeline on the test data
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])
Now, if we want to build our model for a specific objective, we can do that too.
automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
problem_type='binary',
objective='auc',
additional_objectives=['f1', 'precision'],
max_batches=1,
optimize_thresholds=True)
automl_auc.search()
Creating a variable for the best pipeline.
best_pipeline_auc = automl_auc.best_pipeline
Getting the score on the holdout data
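The scoring call itself is omitted here; following the same pattern as best_pipeline.score above, it would presumably be:
# Score the AUC-optimized pipeline on the holdout data
best_pipeline_auc.score(X_test, y_test, objectives=["auc"])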
We get an AUC score of around 88.5%, the highest of all our models.
To save the model
best_pipeline.save("model.pkl")
Loading our Model
final_model = automl.load('model.pkl')
final_model.predict_proba(X_test)
This displays the predicted probabilities of the final model for the test data.
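If we want hard class labels instead of probabilities, the pipeline also exposes predict (a usage sketch in the same session):
# Class predictions from the reloaded pipeline
final_model.predict(X_test)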
In conclusion, we got an AUC score of 88.5%, the highest of all our models. The model was also saved as a .pkl file; you can specify the target directory on your local machine when saving it.
This concludes our data science project on a heart attack risk predictor with EvalML.