Water drinking potability prediction using machine learning and H2O Auto Machine Learning (Part 2)

This post continues the data science project on drinking water potability prediction using machine learning and H2O AutoML.

What is water potability?

Water potability refers to whether water is safe to drink. It should not be confused with portability, the ability to be easily carried or moved, which is not relevant in this context.

What is the H2O AutoML library?

H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and many more. Its AutoML module automates the training and tuning of many such models within a user-specified time or model-count limit.

The drinking_water_potability.csv file used for this project contains water quality metrics for 3,276 different water bodies.

Part one of this project is linked below:
https://coderlegion.com/300/water-drinking-potability-prediction-using-machine-learning-auto-machine-learning-part

Code Implementation

Using auto machine learning

H2O Auto ML

Installing H2O Auto ML

# Installing the requests library, which is used for making HTTP requests in Python
!pip install requests

# Installing the tabulate library, which is used for creating simple ASCII tables
!pip install tabulate

# Installing the colorama library (version 0.3.8 or higher), which is used for cross-platform colored terminal text
!pip install "colorama>=0.3.8"

# Installing the future library, which provides compatibility between Python 2 and Python 3
!pip install future

Using pip to install the H2O library, which is a scalable and distributed machine learning platform

!pip install h2o

Importing the h2o Python module and H2OAutoML class

import h2o
from h2o.automl import H2OAutoML
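# h2o.init() connects to a running local H2O instance, or starts a new one if none is found
# max_mem_size='16G' sets the maximum memory for a newly started instance to 16 GB
h2o.init(max_mem_size='16G')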

Loading data and displaying the data

df = h2o.import_file("/content/drive/MyDrive/drinking_water_potability.csv")
df.head()

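H2O AutoML handles common preprocessing steps automatically (such as imputing missing values and encoding categorical columns where an algorithm needs it), so here the data is only split into training and test sets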

# Splitting the DataFrame df into two separate DataFrames: df_train and df_test
# Roughly 80% of the rows go into df_train and the rest into df_test (the split is probabilistic, not exact)
df_train, df_test = df.split_frame(ratios=[.8])
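H2O's documentation describes split_frame as a probabilistic split, so the 80/20 ratio holds only in expectation. The sketch below is not H2O's implementation; it is a plain-Python illustration (the helper name probabilistic_split is hypothetical) of why the resulting sizes are close to, but usually not exactly, 80% and 20%.

```python
import random


def probabilistic_split(rows, train_ratio=0.8, seed=None):
    """Send each row to train with probability train_ratio (illustration only)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is assigned independently, so the final sizes vary slightly
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


rows = list(range(3276))  # same row count as the dataset described in this post
train, test = probabilistic_split(rows, train_ratio=0.8, seed=10)
print(len(train), len(test))  # near an 80/20 division of 3276 rows, rarely exact
```

In H2O itself, split_frame also accepts a seed argument, for example df.split_frame(ratios=[.8], seed=10), which makes the split repeatable between runs.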

Defining the target and feature columns

# Defining the dependent variable (target) as 'Potability'
y = "Potability"  # dependent variable

# Defining the independent variables (features) as all columns in the DataFrame
x = df.columns  # Independent variable

# Removing the dependent variable 'Potability' from the list of independent variables
x.remove(y)
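H2O chooses between regression and classification from the type of the target column: a numeric target is treated as regression, a categorical (factor) target as classification. Assuming Potability is parsed as numeric 0/1 (this can be checked with df.types and is an assumption here, not something the post states), a minimal sketch of converting it to a categorical column looks like this. The helper name as_classification_target is hypothetical; asfactor() is the H2OFrame method that performs the conversion.

```python
def as_classification_target(frame, target="Potability"):
    """Convert the target column of an H2OFrame to a categorical column.

    The frame is modified in place and also returned for convenience.
    """
    frame[target] = frame[target].asfactor()
    return frame


# Usage with the frames from this post (requires a running H2O cluster):
# df_train = as_classification_target(df_train, y)
# df_test = as_classification_target(df_test, y)
```

With a categorical target, predict() returns a label column together with per-class probability columns (p0 and p1) instead of a single numeric column, and the label column is based on a threshold chosen by H2O, which is not necessarily 0.5.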

Defining the model

# Instantiating an H2OAutoML object with specified parameters
# max_runtime_secs=300 sets the maximum runtime to 300 seconds
# max_models=10 sets the maximum number of models to train to 10
# seed=10 sets the seed for random number generation to 10 for reproducibility
# verbosity="info" sets the level of verbosity for logging information during the AutoML run
# nfolds=2 sets the number of cross-validation folds to 2
aml = H2OAutoML(max_runtime_secs=300, max_models=10, seed=10, verbosity="info", nfolds=2)

Fitting the model

# Training an H2O AutoML model using the specified features and target variable
# x represents the list of feature column names
# y represents the target column name
# training_frame is the H2OFrame containing the training data
aml.train(x=x, y=y, training_frame=df_train)

Seeing the Leaderboard

# Retrieving the leaderboard from the H2O AutoML run and storing it in the variable 'lb'
lb = aml.leaderboard

lb

Getting all the model ids

# Extracting the model IDs from the leaderboard of an H2O AutoML run
# Converting the 'model_id' column of the leaderboard to a DataFrame, then to a list
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:, 0])

model_ids

# Generating the model performance metrics on the test dataset using the leader model from H2O AutoML
aml.leader.model_performance(df_test)

Getting the model details for the top-ranked stacked ensemble model

# Retrieving the first model on the leaderboard whose ID contains "StackedEnsemble"
# (the leaderboard is sorted best-first, so this is the highest-ranked stacked ensemble)
h2o.get_model([mid for mid in model_ids if "StackedEnsemble" in mid][0])

# Storing the same stacked ensemble model in the variable 'output'
output = h2o.get_model([mid for mid in model_ids if "StackedEnsemble" in mid][0])

# Displaying the parameters of the retrieved model
output.params
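The list comprehension followed by [0] raises an IndexError if the run produced no stacked ensemble, which can happen when the time budget runs out early. A small sketch of a safer lookup is shown below; the helper name first_model_id_containing and the example IDs are hypothetical and are not output from this project.

```python
def first_model_id_containing(model_ids, keyword):
    """Return the first model ID containing keyword, or None if there is none."""
    return next((mid for mid in model_ids if keyword in mid), None)


# Hypothetical IDs in the style of an AutoML leaderboard (not real output)
example_ids = [
    "StackedEnsemble_AllModels_1_AutoML_example",
    "GBM_1_AutoML_example",
    "StackedEnsemble_BestOfFamily_1_AutoML_example",
]

print(first_model_id_containing(example_ids, "StackedEnsemble"))
print(first_model_id_containing(example_ids, "XGBoost"))  # None: no such model
```

The returned ID can then be passed to h2o.get_model only when it is not None.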

Retrieving the leader model from an H2O AutoML process

aml.leader

# Generating predictions on the test dataset using the leader model from an H2O AutoML process
y_pred = aml.leader.predict(df_test)

y_pred

If the probability is greater than 0.5, the prediction is a 1; otherwise it is a 0

In the context of predicting water potability using machine learning models, we used a probability threshold to determine the final classification. Specifically, if the predicted probability of water being potable is greater than 0.5, the water sample is classified as potable (denoted as 1). Conversely, if the predicted probability is 0.5 or less, the sample is classified as non-potable (denoted as 0). This threshold ensures a clear decision boundary for classification based on the model's output probabilities.
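The rule described above can be sketched in a few lines of plain Python. The function name classify and the sample probabilities are made up for illustration and are not output from the model.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 for each probability strictly greater than threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up probabilities for illustration (not model output)
sample_probs = [0.12, 0.50, 0.51, 0.87]
labels = classify(sample_probs)
print(labels)  # [0, 0, 1, 1]
```

Note that a probability of exactly 0.5 maps to 0, matching the "0.5 or less" wording. To apply this to the H2O predictions, the prediction column would first need to be pulled into Python, for example with as_data_frame() as was done earlier for the model IDs.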

To conclude this project, note that 0.5 is the standard default threshold in binary classification because it implicitly assumes that false positives and false negatives are equally costly.
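The equal-cost assumption behind the 0.5 threshold can be made concrete by counting the two kinds of error at different thresholds. This is a minimal sketch with made-up labels and probabilities; confusion_counts is a hypothetical helper, and none of the numbers are results from this project.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    tp = fp = tn = fn = 0
    for actual, p in zip(y_true, probabilities):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Made-up labels and probabilities for illustration (not project results)
y_true = [1, 0, 1, 0, 1, 0]
probs = [0.9, 0.6, 0.55, 0.3, 0.4, 0.2]

for t in (0.5, 0.7):
    print(t, confusion_counts(y_true, probs, threshold=t))
```

In this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but adds a false negative. For drinking water, a false positive means unsafe water labeled as potable, so a threshold above 0.5 may be worth considering if that error is judged more costly.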
