The aim of this project is to predict whether a person is at risk of a heart attack, using automated machine learning techniques.
What is a heart attack?
A heart attack occurs when one or more of the coronary arteries become blocked. Over time, a buildup of fatty deposits, including cholesterol, forms substances called plaques, which can narrow the arteries. This condition, called coronary artery disease, causes most heart attacks.
Understanding the problem statement.
The dataset contains various attributes that are crucial for heart disease detection. Using this data, a model is built with automated machine learning (AutoML) techniques.
What is Auto ML?
Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems.
In this project we will first use different machine learning algorithms, then we will see how an AutoML library, EvalML, can make the work simpler.
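To make the idea concrete, the sketch below automates one small piece of the workflow, model selection, using only scikit-learn. It is not EvalML and not the code used in this project; the synthetic data from make_classification and the names candidates, scores and best_name are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a tabular binary-classification dataset (not heart.csv)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidate pool; a real AutoML tool searches a much larger space
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep the best one
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

An AutoML library extends the same loop to preprocessing steps and hyperparameters, which is what makes the manual work in the rest of this notebook simpler.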
Steps taken.
1) Data Analysis:
- Collecting Data: Gather a comprehensive dataset from reliable sources
such as healthcare institutions or public datasets like the UCI Heart
Disease dataset.
- Exploratory Data Analysis (EDA): Understand the data by using
statistical summaries and visualizations.
- Summary Statistics: Compute mean, median, standard deviation, and
other descriptive statistics.
- Visualization: Create plots such as histograms, box plots, and
scatter plots to visualize distributions, correlations, and potential
outliers.
- Correlation Analysis: Use heatmaps or correlation matrices to
identify relationships between features.
2) Feature Engineering:
- Handling Missing Values: Address any missing data by methods such as
imputation (mean, median, or mode) or by removing rows/columns with
missing values.
- Creating New Features: Derive new features from existing ones that
might be more informative for the prediction task. For example,
combining 'age' and 'cholesterol' to create a new risk score.
- Encoding Categorical Variables: Convert categorical data into
numerical format using techniques like one-hot encoding or label
encoding.
- Feature Selection: Identify and retain the most relevant features
that contribute to predicting heart disease. Use methods like
Recursive Feature Elimination (RFE), feature importance from
tree-based models, or statistical tests.
3) Standardization:
- Scaling Features: Normalize or standardize features to ensure that
they contribute equally to the model.
- Normalization: Scale features to a range (typically 0 to 1).
- Standardization: Transform features to have a mean of 0 and a
standard deviation of 1.
- Methods: Use tools like StandardScaler or MinMaxScaler from libraries
such as scikit-learn to perform this step.
4) Model building:
- Choosing Algorithms: Select appropriate machine learning algorithms
for the task. Common choices for heart disease prediction include:
Logistic Regression, Decision Trees, Random Forests, Support Vector
Machines (SVM) and Neural Networks.
- Training the Model: Split the dataset into training and testing sets,
typically using an 80/20 or 70/30 ratio.
- Train the selected models on the training dataset.
- Hyperparameter Tuning: Optimize model performance by tuning
hyperparameters using techniques such as grid search or random search
with cross-validation.
5) Prediction:
- Making Predictions: Use the trained model to make predictions on the
test set.
- Evaluation: Assess the model’s performance using metrics such as:
- Accuracy: Proportion of correctly predicted instances.
- Precision and Recall: Precision is the share of predicted positives
that are truly positive; recall (sensitivity) is the share of actual
positives that the model finds.
- F1 Score: Harmonic mean of precision and recall.
- ROC-AUC: Area under the receiver operating characteristic curve.
- Refinement: Based on evaluation results, further refine the model by
adjusting features, tuning hyperparameters, or trying different algorithms.
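The evaluation metrics listed in step 5 can be checked on a small hand-made example. The labels and probabilities below are invented for illustration and are not results from this project; they only show which scikit-learn function computes which metric.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hand-made labels for illustration: 1 = at risk, 0 = not at risk
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# Predicted probability of class 1, and the class obtained with a 0.5 threshold
y_prob = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_prob)          # uses probabilities, not classes

print(accuracy, precision, recall, f1, auc)
```

Note that ROC-AUC is computed from the predicted probabilities, while the other four metrics use the thresholded class predictions.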
Code Implementation
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Let us import our Data Set
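# Mount Google Drive (Colab only) and load the dataset
from google.colab import drive
drive.mount('/content/drive/')
df = pd.read_csv("/content/drive/MyDrive/heart.csv")
# Drop three columns that are not used in this analysis
df = df.drop(['oldpeak', 'slp', 'thall'], axis=1)
df.head()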
Data Analysis
Understanding our Dataset:
There are no null values in our Data Set
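The notebook does not show the check behind this statement. A common way to verify it is isnull().sum(); the toy frame below is hypothetical (its column names merely mirror heart.csv) and deliberately contains one missing value so that the output is not all zeros.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; one cholesterol value is missing on purpose
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, np.nan, 236],
    "output": [1, 1, 1, 0],
})

# Missing values per column; all zeros would mean no column has nulls
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

Running the same two calls on df should give zero for every column if the statement above holds.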
df.corr()
sns.heatmap(df.corr())
Our variables are not highly correlated to each other
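A heatmap is easy to misread, so it can help to list the strongest pairs numerically. The sketch below does this on a synthetic frame (columns a, b, c are invented, with c built to correlate with a); the same steps could be applied to df.corr().

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame for illustration; c is constructed from a
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["c"] = toy["a"] * 0.9 + rng.normal(scale=0.1, size=200)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Flatten to one row per pair and sort from strongest to weakest
pairs = upper.unstack().dropna().sort_values(ascending=False)
print(pairs)
```

If the top entries for the real data stay well below 1, that supports the reading of the heatmap given above.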
We will do univariate and bivariate analysis on our features
plt.figure(figsize=(20, 10))
plt.title("Age of Patients")
plt.xlabel("Age")
sns.countplot(x='age',data=df)
Most patients are in the 51-67 age group
plt.figure(figsize=(20, 10))
plt.title("Sex of Patients,0=Female and 1=Male")
sns.countplot(x='sex',data=df)
The count plot above shows the number of patients of each sex
# Label the chest pain codes by value (not by row position), then count patients per category
# The code-to-label mapping follows the labels used in this notebook
cp_labels = {0: 'Typical Angina', 1: 'Atypical Angina', 2: 'Non-anginal', 3: 'Asymptomatic'}
cp_data = df['cp'].map(cp_labels).value_counts().rename_axis('cp').reset_index(name='count')
cp_data
plt.figure(figsize=(20, 10))
plt.title("Chest Pain of Patients")
sns.barplot(x='cp', y='count', data=cp_data)
We have seen how the Chest Pain Category is distributed
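Because the chest pain codes are categories rather than quantities, one-hot encoding them, as mentioned under feature engineering, is an option for linear models. The notebook itself feeds the raw codes to the models, so the following is only a sketch on a hypothetical toy frame.

```python
import pandas as pd

# Toy frame for illustration; cp holds category codes 0-3 as in heart.csv
toy = pd.DataFrame({"age": [63, 37, 41, 56], "cp": [3, 2, 1, 0]})

# One indicator column per chest pain code; the original cp column is replaced
encoded = pd.get_dummies(toy, columns=["cp"], prefix="cp", dtype=int)
print(encoded)
```

Each row has exactly one of the cp_0 to cp_3 columns set to 1, so no artificial ordering between the categories is implied.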
# Label the resting ECG codes by value (not by row position), then count patients per category
ecg_labels = {0: 'normal', 1: 'having ST-T wave abnormality', 2: 'showing probable or definite left ventricular hypertrophy by Estes'}
ecg_data = df['restecg'].map(ecg_labels).value_counts().rename_axis('restecg').reset_index(name='count')
ecg_data
Bar plot for the ECG data
plt.figure(figsize=(20, 10))
plt.title("ECG data of Patients")
sns.barplot(x='restecg', y='count', data=ecg_data)
sns.pairplot(hue='output',data=df)
Let us look at our continuous variables
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
# histplot replaces the deprecated distplot; stat='density' keeps the density scale
sns.histplot(df['trtbps'], kde=True, stat='density', color='magenta')
plt.xlabel("Resting Blood Pressure (mmHg)")
plt.subplot(1,2,2)
sns.histplot(df['thalachh'], kde=True, stat='density', color='teal')
plt.xlabel("Maximum Heart Rate Achieved (bpm)")
Plotting for cholesterol
plt.figure(figsize=(10,10))
sns.histplot(df['chol'], kde=True, stat='density', color='red')
plt.xlabel("Cholesterol")
We have finished analysing the data; now let's have a look at our data
df.head()
Let us do Standardisation
from sklearn.preprocessing import StandardScaler
# Scale only the feature columns; the target 'output' must stay as class labels
feature_cols = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
'exng', 'caa']
scale = StandardScaler()
df[feature_cols] = scale.fit_transform(df[feature_cols])
df.head()
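For reference, this is what standardization and normalization do to a single column. The three cholesterol values are invented so that the arithmetic can be followed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column with mean 240
chol = np.array([[200.0], [240.0], [280.0]])

# Standardization: (x - mean) / std, using the population standard deviation
standardized = StandardScaler().fit_transform(chol)
manual = (chol - chol.mean()) / chol.std()

# Normalization: (x - min) / (max - min), giving values between 0 and 1
normalized = MinMaxScaler().fit_transform(chol)

print(standardized.ravel())
print(normalized.ravel())
```

The standardized column has mean 0 and standard deviation 1, while the normalized column runs from 0 to 1 with the middle value at 0.5.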
We can insert this data into our ML Models
We will use the following models for our predictions :
- Logistic Regression
- Decision Tree
- Random Forest
- K Nearest Neighbour
- SVM
Then we will use ensembling techniques
Let us split our data
x= df.iloc[:,:-1]
x
Also, take the last column as the target:
y= df.iloc[:,-1:]
y
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
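Because the scaler in this notebook is fitted on the full dataset before the split, statistics from the test rows influence the training features. One common alternative, sketched here on synthetic data rather than heart.csv, is to put the scaler inside a Pipeline so that it is fitted on the training split only. The stratify=y argument is an added assumption that keeps the class ratio similar in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not heart.csv)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Same 70/30 ratio and seed as in the notebook, plus stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# The pipeline fits the scaler on X_train only, then reuses it for X_test
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(round(test_accuracy, 3))
```

With this arrangement the test set stays unseen until scoring, so the reported accuracy is a fairer estimate of performance on new patients.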
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# Encode the target as integer class labels; fit the encoder on the training labels only
lbl = LabelEncoder()
encoded_y = lbl.fit_transform(np.ravel(y_train))
logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)
# Reuse the fitted encoder on the test labels, then predict and score
encoded_ytest = lbl.transform(np.ravel(y_test))
Y_pred1 = logreg.predict(x_test)
lr_conf_matrix = confusion_matrix(encoded_ytest, Y_pred1)
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)
lr_conf_matrix
Printing the logistic regression accuracy
print(lr_acc_score*100,"%")
85.71428571428571 %
The Logistic Regression model reaches about 85.7% accuracy on the test set.
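Accuracy alone hides which kind of mistake the model makes, and for a screening task missed at-risk patients usually matter most. The sketch below shows how to read a scikit-learn confusion matrix; the labels are invented for illustration and are not this notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 1 = at risk, 0 = not at risk
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # share of at-risk patients that were caught
specificity = tn / (tn + fp)   # share of not-at-risk patients correctly cleared

print(cm)
print(accuracy, sensitivity, specificity)
```

The same unpacking can be applied to lr_conf_matrix to see how the errors split between false alarms and missed cases.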
Decision Tree
Importing the decision tree classifier and assigning it to a variable
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, encoded_y)
ypred2 = tree.predict(x_test)
# Reuse the encoder fitted earlier instead of refitting it on the test labels
encoded_ytest = lbl.transform(np.ravel(y_test))
Computing the confusion matrix and the accuracy score
tree_conf_matrix = confusion_matrix(encoded_ytest,ypred2 )
tree_acc_score = accuracy_score(encoded_ytest, ypred2)
tree_conf_matrix
Printing the accuracy score as a percentage
print(tree_acc_score*100,"%")
As we can see, our Decision Tree model does not perform well: it scores only about 70%.
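An unconstrained decision tree tends to memorise the training set, which is one plausible reason for the lower score. A possible next step, sketched here on synthetic data, is to tune max_depth and min_samples_leaf with grid search and cross-validation, as described under hyperparameter tuning. The grid values are illustrative assumptions and have not been tuned for heart.csv.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and target (not heart.csv)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Hypothetical search grid: shallower trees and larger leaves limit overfitting
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validation on the training split picks the best combination
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

tuned_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(tuned_accuracy)
```

Fixing random_state on the tree also makes the score reproducible between runs, which the default DecisionTreeClassifier() does not guarantee.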
In part two of this project, we will continue with the random forest and the other models, compare their accuracy scores, and find the best model.