A data science project on heart attack risk predictor with Eval Machine Learning (Part one)

The aim of this project is to predict whether or not a person is at risk of a heart attack, using automated machine learning techniques.

What is a heart attack?

A heart attack occurs when one or more of the coronary arteries become blocked. Over time, a buildup of fatty deposits, including cholesterol, forms substances called plaques, which can narrow the arteries. This condition, called coronary artery disease, causes most heart attacks.

Understanding the problem statement.
We are given a dataset containing various attributes that are crucial for detecting heart disease. Using this data, we build a model with automated machine learning techniques.

What is Auto ML?
Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems.

In this project we will first use several machine learning algorithms, and then see how an AutoML technique, Eval ML, can make the work simpler.

Steps taken.

1) Data Analysis:

  • Collecting Data: Gather a comprehensive dataset from reliable sources
    such as healthcare institutions or public datasets like the UCI Heart
    Disease dataset.
  • Exploratory Data Analysis (EDA): Understand the data by using
    statistical summaries and visualizations.
  • Summary Statistics: Compute mean, median, standard deviation, and
    other descriptive statistics.
  • Visualization: Create plots such as histograms, box plots, and
    scatter plots to visualize distributions, correlations, and potential
    outliers.
  • Correlation Analysis: Use heatmaps or correlation matrices to
    identify relationships between features.

2) Feature Engineering:

  • Handling Missing Values: Address any missing data by methods such as
    imputation (mean, median, or mode) or by removing rows/columns with
    missing values.
  • Creating New Features: Derive new features from existing ones that
    might be more informative for the prediction task. For example,
    combining 'age' and 'cholesterol' to create a new risk score.
  • Encoding Categorical Variables: Convert categorical data into
    numerical format using techniques like one-hot encoding or label
    encoding.
  • Feature Selection: Identify and retain the most relevant features
    that contribute to predicting heart disease. Use methods like
    Recursive Feature Elimination (RFE), feature importance from
    tree-based models, or statistical tests.
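
The imputation and encoding steps above can be sketched as follows. This is only a minimal illustration on made-up data: the `toy` frame and its `chest_pain` column are hypothetical and are not part of the heart dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values and the 'chest_pain' column are made up
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 204.0, 250.0],
    "chest_pain": ["typical", "atypical", "typical", "asymptomatic"],
})

# Handling missing values: fill the gap with the column median
toy["chol"] = toy["chol"].fillna(toy["chol"].median())

# Encoding categorical variables: one indicator column per category
encoded = pd.get_dummies(toy, columns=["chest_pain"])
print(encoded)
```

Median imputation is used here because it is less sensitive to outliers than the mean; either choice may be reasonable depending on the data.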

3) Standardization:

  • Scaling Features: Normalize or standardize features to ensure that
    they contribute equally to the model.
  • Normalization: Scale features to a range (typically 0 to 1).
  • Standardization: Transform features to have a mean of 0 and a
    standard deviation of 1.
  • Methods: Use tools like StandardScaler or MinMaxScaler from
    libraries such as scikit-learn to perform this step.
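
To make the difference between the two scaling methods concrete, here is a small sketch of the underlying arithmetic in NumPy. The blood pressure readings are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical resting blood pressure readings (mm Hg), for illustration only
values = np.array([120.0, 130.0, 140.0, 150.0, 160.0])

# Normalization (min-max scaling): map the values onto the range [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized.mean(), standardized.std())
```

`values.std()` computes the population standard deviation, which is also what scikit-learn's StandardScaler uses.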

4) Model building:

  • Choosing Algorithms: Select appropriate machine learning algorithms
    for the task. Common choices for heart disease prediction include:
    Logistic Regression, Decision Trees, Random Forests, Support Vector
    Machines (SVM) and Neural Networks.
  • Training the Model: Split the dataset into training and testing sets,
    typically using an 80/20 or 70/30 ratio, then train the selected
    models on the training set.
  • Hyperparameter Tuning: Optimize model performance by tuning
    hyperparameters using techniques such as grid search or random search
    with cross-validation.
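
As a rough sketch of the splitting and tuning steps above, the example below runs a grid search with 5-fold cross-validation. It assumes synthetic data from `make_classification` as a stand-in for a real dataset, and the grid of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the heart dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 80/20 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try each value of the regularization strength C with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on the test set
```

The test set is only touched once, at the end, so the tuning itself relies purely on the cross-validation folds of the training data.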

5) Prediction:

  • Making Predictions: Use the trained model to make predictions on the
    test set.
  • Evaluation: Assess the model’s performance using metrics such as:
      • Accuracy: Proportion of correctly predicted instances.
      • Precision and Recall: Precision is the share of predicted
        positives that are truly positive; recall (sensitivity) is the
        share of actual positives that the model finds.
      • F1 Score: Harmonic mean of precision and recall.
      • ROC-AUC: Area under the receiver operating characteristic curve.
  • Refinement: Based on evaluation results, further refine the model by
    adjusting features, tuning hyperparameters, or trying different algorithms.
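
The evaluation metrics listed above can be worked out by hand from the four counts of a binary confusion matrix. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # predicted positives that are correct
recall = tp / (tp + fn)                     # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

ROC-AUC is not shown here because it cannot be derived from the four counts alone; it needs the predicted scores or probabilities for each instance.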

Code Implementation

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


Let us import our Data Set

from google.colab import drive
drive.mount('/content/drive/')

df = pd.read_csv("/content/drive/MyDrive/heart.csv")
# Drop the columns 'oldpeak', 'slp' and 'thall', which are not used in this analysis
df = df.drop(['oldpeak', 'slp', 'thall'], axis=1)
df.head()

Data Analysis
Understanding our Dataset:

  • age : age of the patient
  • sex : sex of the patient
  • exng : exercise induced angina (1 = yes; 0 = no)
  • caa : number of major vessels (0-3)
  • cp : chest pain type
      • Value 0: typical angina
      • Value 1: atypical angina
      • Value 2: non-anginal pain
      • Value 3: asymptomatic
  • trtbps : resting blood pressure (in mm Hg)
  • chol : cholesterol in mg/dl fetched via BMI sensor
  • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg : resting electrocardiographic results
      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST
        elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by
        Estes' criteria
  • thalachh : maximum heart rate achieved
  • output : 0 = less chance of heart attack, 1 = more chance of heart
    attack

df.shape
df.isnull().sum()

There are no null values in our Data Set

df.corr()

sns.heatmap(df.corr())

Our variables are not highly correlated with each other
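
A heatmap can be hard to read precisely, so one way to back up this observation with a number is to look for the strongest pairwise correlation. The helper below, `strongest_correlation`, is a hypothetical function written for this sketch (it is not part of pandas or seaborn), and the toy data is made up.

```python
import numpy as np
import pandas as pd

def strongest_correlation(frame):
    """Return the column pair with the largest absolute Pearson correlation."""
    corr = frame.corr().abs().to_numpy()
    # Indices of the upper triangle; k=1 skips the diagonal of 1.0 values
    rows, cols = np.triu_indices_from(corr, k=1)
    best = np.argmax(corr[rows, cols])
    pair = (frame.columns[rows[best]], frame.columns[cols[best]])
    return pair, corr[rows[best], cols[best]]

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
pair, value = strongest_correlation(toy)
print(pair, round(value, 3))
```

Calling `strongest_correlation(df)` on the heart dataset would return the most strongly related pair of columns, which gives a concrete figure to compare against whatever threshold you consider "highly correlated".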

We will now do univariate and bivariate analysis of our features

plt.figure(figsize=(20, 10))
plt.title("Age of Patients")
plt.xlabel("Age")
sns.countplot(x='age',data=df)

The majority of the patients are in the 51-67 age group

plt.figure(figsize=(20, 10))
plt.title("Sex of Patients, 0=Female and 1=Male")
sns.countplot(x='sex',data=df)

The count plot above shows the number of patients of each sex

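# Map each chest pain code to its label before counting, so that every label
# stays attached to the right category regardless of how the counts are sorted
cp_labels = {0: 'Typical Angina', 1: 'Atypical Angina', 2: 'Non-anginal', 3: 'Asymptomatic'}
cp_data = df['cp'].map(cp_labels).value_counts().rename_axis('index').reset_index(name='cp')
cp_data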

plt.figure(figsize=(20, 10))
plt.title("Chest Pain of Patients")

sns.barplot(x=cp_data['index'],y= cp_data['cp'])

The bar plot above shows how the chest pain categories are distributed

# Map each ECG code to its label before counting, as we did for chest pain
ecg_labels = {0: 'normal',
              1: 'having ST-T wave abnormality',
              2: "showing probable or definite left ventricular hypertrophy by Estes' criteria"}
ecg_data = df['restecg'].map(ecg_labels).value_counts().rename_axis('index').reset_index(name='restecg')
ecg_data

Bar plot for the ECG data

plt.figure(figsize=(20, 10))
plt.title("ECG data of Patients")

sns.barplot(x=ecg_data['index'],y= ecg_data['restecg'])

sns.pairplot(hue='output',data=df)

Let us look at our continuous variables

plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['trtbps'], kde=True, color='magenta')
plt.xlabel("Resting Blood Pressure (mmHg)")
plt.subplot(1,2,2)
sns.histplot(df['thalachh'], kde=True, color='teal')
plt.xlabel("Maximum Heart Rate Achieved (bpm)")

Plotting cholesterol

plt.figure(figsize=(10,10))
sns.histplot(df['chol'], kde=True, color='red')
plt.xlabel("Cholesterol")

We have done the analysis of the data; now let's have a look at our data

df.head()

Let us do Standardisation

from sklearn.preprocessing import StandardScaler

# Scale only the feature columns; the 0/1 target column 'output' is left unchanged
feature_cols = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
   'exng', 'caa']
scale = StandardScaler()
df[feature_cols] = scale.fit_transform(df[feature_cols])
df.head()

We can now feed this data into our ML models

We will use the following models for our predictions :

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • K Nearest Neighbour
  • SVM

Then we will use the ensembling techniques

Let us split our data

# Features: every column except the last
x = df.iloc[:, :-1]
x

# Target: the last column ('output')
y = df.iloc[:, -1:]
y

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
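
A possible refinement, which is not what this walkthrough does: the scaler above is fit on the whole dataset before the split, so the test rows influence the scaling statistics. The sketch below shows one common alternative, a scikit-learn pipeline that learns the scaling from the training rows only. It assumes synthetic data from `make_classification` in place of the heart dataset, and `stratify=y` is an optional extra that keeps the class balance similar in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=101)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation when predicting on the test data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
```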

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

# Encode the target labels as integer classes (0 and 1)
lbl = LabelEncoder()
encoded_y = lbl.fit_transform(y_train.values.ravel())

# Fit the model on the training set
logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)

# Reuse the encoder fitted on the training labels for the test labels
encoded_ytest = lbl.transform(y_test.values.ravel())

# Predict on the test set
Y_pred1 = logreg.predict(x_test)
Y_pred1

# Compare the predictions with the true test labels
lr_conf_matrix = confusion_matrix(encoded_ytest, Y_pred1)
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)

lr_conf_matrix
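
If the layout of the matrix is unfamiliar: scikit-learn puts the true classes in the rows and the predicted classes in the columns. A small sketch with hand-made labels (hypothetical, not taken from the heart dataset) shows how to unpack the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = (int(v) for v in matrix.ravel())

print(matrix)
print(tn, fp, fn, tp)
```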

Printing the logistic regression accuracy as a percentage

print(lr_acc_score*100,"%")

85.71428571428571 %

As we can see, the Logistic Regression model has an accuracy of about 85.7%.

Decision Tree

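Importing the decision tree classifier, fitting it on the training set and predicting on the test set

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(x_train, encoded_y)
ypred2 = tree.predict(x_test)
# Reuse the encoder fitted on the training labels instead of refitting it
encoded_ytest = lbl.transform(y_test.values.ravel())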

For confusion matrix and finding the accuracy score

tree_conf_matrix = confusion_matrix(encoded_ytest,ypred2 )
tree_acc_score = accuracy_score(encoded_ytest, ypred2)
tree_conf_matrix

printing accuracy score and multiplying by 100

print(tree_acc_score*100,"%")

As we can see, our Decision Tree model does not perform as well: it gives a score of only about 70%.

In part two of this project, we will continue with the random forest model and the other models, get their accuracy scores and find the best model.
