This is part two of the data science project on predicting drinking water potability using machine learning and H2O AutoML.
What is water potability?
Water potability refers to whether water is safe to drink. It should not be confused with portability, the ability to be easily carried or moved, which is not relevant in this context.
What is the H2O AutoML library?
H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and many more.
The drinking_water_potability.csv file used for this project contains water quality metrics for 3,276 different water bodies.
The link to part one of this project is below:
https://coderlegion.com/300/water-drinking-potability-prediction-using-machine-learning-auto-machine-learning-part
Code Implementation
Using auto machine learning
H2O Auto ML
Installing H2O Auto ML
# Installing the requests library, which is used for making HTTP requests in Python
!pip install requests
# Installing the tabulate library, which is used for creating simple ASCII tables
!pip install tabulate
# Installing the colorama library (version 0.3.8 or higher), which is used for cross-platform colored terminal text
!pip install "colorama>=0.3.8"
# Installing the future library, which provides compatibility between Python 2 and Python 3
!pip install future
Using pip to install the H2O library, which is a scalable and distributed machine learning platform
!pip install h2o
Importing the h2o Python module and H2OAutoML class
import h2o
from h2o.automl import H2OAutoML
h2o.init(max_mem_size='16G') ## h2o.init() connects to a running H2O instance if one exists; otherwise it starts a new local one (here with up to 16 GB of memory)
Loading data and displaying the data
df = h2o.import_file("/content/drive/MyDrive/drinking_water_potability.csv")
df.head()
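H2O AutoML handles common preprocessing steps automatically, such as imputing missing values, encoding categorical columns, and standardizing features for the algorithms that need it, so the only manual step here is splitting the data into training and test sets.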
# Splitting the DataFrame df into two separate DataFrames: df_train and df_test
# Roughly 80% of the rows will go into df_train and the remaining rows into df_test
df_train, df_test = df.split_frame(ratios=[.8])
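A caveat on the split: `split_frame` assigns rows to each part probabilistically, so the 80/20 ratio is approximate, and the split can differ between runs unless a seed is passed (for example `df.split_frame(ratios=[.8], seed=10)`). The plain-Python sketch below is not H2O's implementation; it only illustrates why a per-row random assignment gives sizes close to, but usually not exactly, the requested ratio. The function name and the stand-in rows are assumptions made for illustration.

```python
import random

def probabilistic_split(rows, ratio=0.8, seed=10):
    """Send each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test

rows = list(range(3276))  # stand-in for the 3,276 water samples
train, test = probabilistic_split(rows)
print(len(train), len(test))  # close to an 80/20 split, usually not exact
```

Calling the function again with the same seed returns the same split, which is the main reason to set one when results need to be compared across runs.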
Defining the target and feature columns
# Defining the dependent variable (target) as 'Potability'
y = "Potability" # dependent variable
# Defining the independent variables (features) as all columns in the DataFrame
x = df.columns # Independent variable
# Removing the dependent variable 'Potability' from the list of independent variables
x.remove(y)
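One thing worth checking at this point: the Potability column holds 0/1 values, so H2O will normally parse it as numeric, and AutoML then trains regression models rather than classifiers. If classification is the goal, the target can be converted to a categorical column before training. The sketch below uses the H2OFrame method `asfactor()`; the helper name `as_classification_target` is hypothetical, and it is written against any object that supports H2OFrame-style column access.

```python
def as_classification_target(frame, target="Potability"):
    """Convert the target column to categorical (an H2O 'enum') in place.

    `frame` is expected to be an H2OFrame; asfactor() is the H2OFrame
    method that turns a numeric column into a factor.
    """
    frame[target] = frame[target].asfactor()
    return frame

# Hypothetical usage, before calling aml.train(...):
# df_train = as_classification_target(df_train, y)
# df_test = as_classification_target(df_test, y)
```

With a categorical target, the leaderboard is ranked by classification metrics and the prediction frame contains class probabilities alongside the predicted label.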
Defining the model
# Instantiating an H2OAutoML object with specified parameters
# max_runtime_secs=300 sets the maximum runtime to 300 seconds
# max_models=10 sets the maximum number of models to train to 10
# seed=10 sets the seed for random number generation to 10 for reproducibility
# verbosity="info" sets the level of verbosity for logging information during the AutoML run
# nfolds=2 sets the number of cross-validation folds to 2
aml = H2OAutoML(max_runtime_secs=300, max_models=10, seed=10, verbosity="info", nfolds=2)
Fitting the model
# Training an H2O AutoML model using the specified features and target variable
# x represents the list of feature column names
# y represents the target column name
# training_frame is the H2OFrame containing the training data
aml.train(x=x, y=y, training_frame=df_train)
Seeing the Leaderboard
# Retrieving the leaderboard from the H2O AutoML run and storing it in the variable 'lb'
lb = aml.leaderboard
lb
Getting all the model ids
# Extracting the model IDs from the leaderboard of an H2O AutoML run
# Converting the 'model_id' column of the leaderboard to a DataFrame, then to a list
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:, 0])
model_ids
# Generating the model performance metrics on the test dataset using the leader model from H2O AutoML
aml.leader.model_performance(df_test)
Getting the model details for best performing model
# Retrieving the model with "StackedEnsemble" in its ID from a list of model IDs using H2O's get_model function
h2o.get_model([mid for mid in model_ids if "StackedEnsemble" in mid][0])
# Retrieving the model identified by its ID, which contains "StackedEnsemble" in its name
output = h2o.get_model([mid for mid in model_ids if "StackedEnsemble" in mid][0])
# Printing the parameters of the retrieved model
output.params
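The list comprehension above raises an `IndexError` when no model ID contains "StackedEnsemble", for instance if a run produces no ensemble. A slightly more defensive lookup could look like the sketch below. The helper name `first_model_id` is hypothetical, and the sample IDs are made up in the style of a leaderboard purely for illustration.

```python
def first_model_id(model_ids, keyword="StackedEnsemble"):
    """Return the first ID containing `keyword`, or None if there is none."""
    return next((mid for mid in model_ids if keyword in mid), None)

# Made-up IDs for illustration only, not output from a real run
sample_ids = ["StackedEnsemble_AllModels_1", "GBM_1", "XGBoost_2"]
ensemble_id = first_model_id(sample_ids)
missing_id = first_model_id(sample_ids, keyword="DeepLearning")
print(ensemble_id, missing_id)  # StackedEnsemble_AllModels_1 None
```

In the notebook, the result can be checked before it is passed on, for example calling `h2o.get_model(ensemble_id)` only when `ensemble_id` is not `None`.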
Retrieving the leader model from an H2O AutoML process
aml.leader
# Generating predictions on the test dataset using the leader model from an H2O AutoML process
y_pred = aml.leader.predict(df_test)
y_pred
If the probability is greater than 0.5, the prediction is 1; otherwise it is 0
In the context of predicting water potability using machine learning models, we used a probability threshold to determine the final classification. Specifically, if the predicted probability of water being potable is greater than 0.5, the water sample is classified as potable (denoted as 1). Conversely, if the predicted probability is 0.5 or less, the sample is classified as non-potable (denoted as 0). This threshold ensures a clear decision boundary for classification based on the model's output probabilities.
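The rule above can be sketched in a few lines of plain Python. This assumes the predicted probabilities of class 1 have already been extracted from the prediction frame into a list (for a categorical target, H2O's prediction frame typically carries them in a `p1` column). The function name `classify` and the sample values are illustrative, not taken from the project's results.

```python
def classify(probabilities, threshold=0.5):
    """Label a sample 1 (potable) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

sample_probs = [0.91, 0.50, 0.12, 0.67]  # made-up values for illustration
print(classify(sample_probs))  # [1, 0, 0, 1]
```

Note that a probability of exactly 0.5 maps to 0, matching the rule as stated. Keeping the threshold as a parameter makes it easy to try a stricter cut-off if wrongly labelling unsafe water as potable is considered the costlier mistake.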
A threshold of 0.5 is the standard choice in binary classification problems because it assumes equal cost for false positives and false negatives. This brings us to the conclusion of this project.