Drinking water potability prediction using machine learning and H2O AutoML (Part one)


Why is safe drinking water important?

Access to safe drinking water is essential to health, a basic human right, and a component of effective policy for health protection. This is important as a health and development issue at a national, regional, and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

What is water potability?

Water potability refers to water that is safe to drink, i.e., potable water. This should not be confused with portability, the ability to be easily carried or moved, which is not relevant in this context.

What is the H2O AutoML library?

H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and many more.

The drinking_water_potability.csv file used for this project contains water quality metrics for 3,276 different water bodies.

Model Development

We will use the following models, alongside the H2O AutoML library, in this project:

  • Logistic Regression
  • SVM (Support Vector Machine)
  • Random Forest

Timeline of the project:

1) Importing Libraries and Data Set:

  • Start by setting up the development environment, such as Google Colab or Jupyter Notebook.
  • Import essential libraries for data manipulation (e.g., Pandas, NumPy), data visualization (e.g., Matplotlib, Seaborn), and machine learning (e.g., Scikit-learn, H2O).
  • Data Acquisition: Obtain the dataset related to water potability. This might involve downloading from a public repository or collecting from various sources.
  • Loading Data: Load the dataset into a Pandas DataFrame for initial examination and analysis.

2) Data Analysis and Preprocessing:

  • Exploratory Data Analysis (EDA): Conduct a thorough EDA to understand the dataset. Generate summary statistics, visualize distributions, and identify patterns or anomalies using tools like Matplotlib and Seaborn.
  • Handling Missing Values: Identify and handle any missing values in the dataset through imputation or removal, ensuring the dataset is complete for model training.
  • Data Cleaning: Remove or correct any inconsistencies or errors in the data. Normalize or standardize the data if necessary.

3) Feature Engineering:

  • Feature Selection: Identify the most relevant features that influence water potability. This might involve statistical tests, correlation analysis, or domain knowledge.
  • Feature Creation: Create new features that might enhance the model’s performance, such as combining existing features or creating interaction terms.
  • Feature Transformation: Transform features to make them suitable for machine learning algorithms. This can include normalization, scaling, or encoding categorical variables.

4) Model Building using machine learning:

  • Train-Test Split: Split the dataset into training and testing sets to evaluate the performance of the models.
  • Model Selection: Choose a variety of machine learning algorithms to compare, such as Logistic Regression, Random Forest, and Support Vector Machines (SVM).
  • Training Models: Train each model on the training set, tuning hyperparameters using techniques like Grid Search or Random Search to find the optimal settings (a minimal sketch follows this list).
  • Model Evaluation: Evaluate the models on the testing set using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Visualize the performance using confusion matrices and ROC curves.
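
Although the code in this article trains each model with its default settings, a minimal sketch of Grid Search tuning with scikit-learn's GridSearchCV might look like the following. The parameter grid here is purely illustrative, not values tuned for this dataset, and the sketch assumes the X_train/y_train split created later in the article.

# A hedged sketch of Grid Search hyperparameter tuning (illustrative grid,
# not values tuned for this dataset)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],   # number of trees to try
    'max_depth': [None, 10, 20],  # tree depth limits to try
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid,
    cv=5,          # 5-fold cross-validation
    scoring='f1',  # F1 is a reasonable choice for an imbalanced target
    n_jobs=-1,
)
grid.fit(X_train, y_train)  # assumes the train/test split defined later
print(grid.best_params_, grid.best_score_)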

5) Model building and prediction using H2O AutoML:

  • Setting Up H2O: Install and configure H2O.ai’s AutoML tool, ensuring it’s integrated with your development environment.
  • H2O Data Preparation: Convert the dataset into H2O frames suitable for the AutoML process.
  • Running H2O AutoML: Initiate the H2O AutoML process to automatically train and tune a variety of machine learning models. This process includes data preprocessing, model training, hyperparameter tuning, and model selection (see the sketch after this list).
  • Model Comparison: Compare the performance of the models generated by H2O AutoML with the manually built models. Evaluate them using the same metrics for a fair comparison.
  • Performance Review: Conduct a detailed performance review of both the manually built models and the H2O AutoML models. Highlight the strengths and weaknesses of each approach.
  • Model Selection: Select the best-performing model based on predefined criteria, such as highest accuracy or best generalization performance.
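
As a preview of part two, here is a minimal sketch of the H2O AutoML workflow described above. It assumes df is the loaded potability DataFrame; the model count and seed are illustrative choices, not the article's final settings.

# A minimal H2O AutoML sketch (covered in detail in part two)
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # start a local H2O cluster

hf = h2o.H2OFrame(df)                           # convert pandas -> H2O frame
hf['Potability'] = hf['Potability'].asfactor()  # mark the target as categorical
train, test = hf.split_frame(ratios=[0.7], seed=1)

aml = H2OAutoML(max_models=10, seed=1)  # train and tune up to 10 models
aml.train(y='Potability', training_frame=train)

print(aml.leaderboard)            # compare the trained models
preds = aml.leader.predict(test)  # predict with the best model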

Code Implementation

Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
%matplotlib inline

Loading the Data Set

from google.colab import drive
drive.mount('/content/drive')

Read data

df = pd.read_csv("/content/drive/MyDrive/drinking_water_potability.csv")
df.head()

Data frame shape

df.shape

Data info

df.info()

Data uniqueness

# Count the unique values in each column (a DataFrame has no .unique() method; .nunique() is used instead)
df.nunique()

Data Analysis

# This code creates a count plot to visualize the distribution of potable and non-potable water samples in the dataset
sns.countplot(data=df, x='Potability')

# This code calculates and displays the total number of missing (null) values in each column of the dataframe
df.isnull().sum()

Handling Null Values

# Columns that contain missing values
null = ['ph', 'Sulfate', 'Trihalomethanes']
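
As a side note, the column-by-column mean imputation performed below could equally be driven by this list in a single loop; a brief, equivalent sketch:

# Impute each listed column with its mean in one pass
# (equivalent to the individual steps shown below)
for col in null:
    df[col] = df[col].fillna(df[col].mean())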

Distribution plot

# This code creates a histogram with a KDE overlay to visualize the distribution of pH values (histplot replaces seaborn's deprecated distplot)
sns.histplot(df.ph, kde=True)

# This code replaces all missing (NaN) values in the 'ph' column with the mean pH value of the dataset
df['ph'] = df['ph'].fillna(df['ph'].mean())

# This code creates a histogram with a KDE overlay to visualize the distribution of Sulfate values
sns.histplot(df.Sulfate, kde=True)

# This code replaces all missing (NaN) values in the 'Sulfate' column with the mean Sulfate value of the dataset
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].mean())

# This code creates a histogram with a KDE overlay to visualize the distribution of Trihalomethanes values
sns.histplot(df.Trihalomethanes, kde=True)

# This code replaces all missing (NaN) values in the 'Trihalomethanes' column with the mean Trihalomethanes value of the dataset
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].mean())

# This code calculates and displays the total number of missing (NaN) values in each column of the dataframe
df.isnull().sum()

# This code creates a pairplot to visualize pairwise relationships between features in the dataset, with different colors indicating the 'Potability' classes
sns.pairplot(data=df, hue='Potability')

# This code iterates over each column in the dataframe and creates a boxplot for each one.
for column in df.columns:
    plt.figure()
    df.boxplot([column])

# This code computes and displays the correlation matrix for the dataframe, showing the pairwise correlation coefficients between features
df.corr()

This code generates a heatmap visualization of the correlation matrix for the DataFrame df, using the seaborn library. The heatmap displays the pairwise correlations between all numerical features in the DataFrame. The annot=True parameter adds numerical annotations to each cell of the heatmap, indicating the correlation coefficients.

plt.figure(figsize=(20,10))
sns.heatmap(df.corr(),annot=True)

Feature Engineering

# Importing the ExtraTreesClassifier from scikit-learn's ensemble module
from sklearn.ensemble import ExtraTreesClassifier

# Extracting features (x) and target variable (y) from DataFrame df
# Dropping the 'Potability' column to create the feature matrix x
x = df.drop(['Potability'], axis=1)

# Assigning the 'Potability' column as the target variable y
y = df.Potability

# Instantiating an ExtraTreesClassifier object
Ext = ExtraTreesClassifier()

# Training the classifier using the feature matrix x and target variable y
Ext.fit(x, y)

The code below prints the feature importances computed by the trained ExtraTreesClassifier model:

print(Ext.feature_importances_)

# Creating a Pandas Series to store feature importances with feature names as index
feature = pd.Series(Ext.feature_importances_, index=x.columns)

# Selecting the 10 largest feature importances and plotting them as a horizontal bar chart
feature.nlargest(10).plot(kind='barh')

df.head()

Let us standardize our data

# Importing the StandardScaler class from scikit-learn's preprocessing module
from sklearn.preprocessing import StandardScaler

# Instantiating a StandardScaler object
scale = StandardScaler()

# Using the StandardScaler object 'scale' to standardize the feature matrix 'x'
scaled = scale.fit_transform(x)

# Creating a DataFrame 'scaled_df' from the standardized features with original column names
scaled_df = pd.DataFrame(scaled, columns=x.columns)

# Displaying the first few rows of the DataFrame 'scaled_df'
scaled_df.head()

Our data is ready for model building

# Splitting the standardized features 'scaled_df' and target variable 'y' into training and testing sets
# The testing set size is set to 30% of the total data, and random_state is set for reproducibility
X_train, X_test, y_train, y_test = train_test_split(scaled_df, y, test_size=0.3, random_state=0)

Model Development

# Importing machine learning models and evaluation metrics from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

Logistic Regression

# Training a logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predicting labels for training and testing datasets
y_train_hat = lr.predict(X_train)
y_test_hat = lr.predict(X_test)

# Evaluating the model's performance on the testing dataset
print('Test performance')
print('-------------------------------------------------------')
print(classification_report(y_test, y_test_hat))

print('Roc_auc score')
print('-------------------------------------------------------')
print(roc_auc_score(y_test, y_test_hat))
print('')

print('Confusion matrix')
print('-------------------------------------------------------')
print(confusion_matrix(y_test, y_test_hat))
print('')

print('accuracy score')
print('-------------------------------------------------------')
print("test data accuracy score:", accuracy_score(y_test, y_test_hat) * 100)
print("train data accuracy score:", accuracy_score(y_train, y_train_hat) * 100)

Support Vector Machines

# Training a Support Vector Machine (SVM) classifier
svm = SVC()
svm.fit(X_train, y_train)

# Predicting labels for training and testing datasets
y_train_hat = svm.predict(X_train)
y_test_hat = svm.predict(X_test)

# Evaluating the model's performance on the testing dataset
print('Test performance')
print('-------------------------------------------------------')
print(classification_report(y_test, y_test_hat))

print('Roc_auc score')
print('-------------------------------------------------------')
print(roc_auc_score(y_test, y_test_hat))
print('')

print('Confusion matrix')
print('-------------------------------------------------------')
print(confusion_matrix(y_test, y_test_hat))
print('')

print('accuracy score')
print('-------------------------------------------------------')
print("test data accuracy score:", accuracy_score(y_test, y_test_hat) * 100)
print("train data accuracy score:", accuracy_score(y_train, y_train_hat) * 100)

Random Forest

# Training a Random Forest classifier
rf = RandomForestClassifier(n_jobs=-1, random_state=123)
rf.fit(X_train, y_train)

# Predicting labels for training and testing datasets
y_train_hat = rf.predict(X_train)
y_test_hat = rf.predict(X_test)

# Evaluating the model's performance on the testing dataset
print('Test performance')
print('-------------------------------------------------------')
print(classification_report(y_test, y_test_hat))

print('Roc_auc score')
print('-------------------------------------------------------')
print(roc_auc_score(y_test, y_test_hat))
print('')

print('Confusion matrix')
print('-------------------------------------------------------')
print(confusion_matrix(y_test, y_test_hat))
print('')

print('accuracy score')
print('-------------------------------------------------------')
print("test data accuracy score:", accuracy_score(y_test, y_test_hat) * 100)
print("train data accuracy score:", accuracy_score(y_train, y_train_hat) * 100)

In part two of this article, we will continue with the H2O AutoML library.
