Are you curious about how machine learning actually works in real life? Do you want to build your first machine learning model from scratch, without using any drag-and-drop tools or black-box software?
If yes — you’re in the right place!
In this guide, we’ll walk through a real ML workflow: taking a CSV file, cleaning it, training a model, and making predictions — all using Python, Pandas, and Scikit-learn.
What is an ML Pipeline?
An ML pipeline is a series of steps that take raw data and turn it into valuable predictions. It helps you streamline the process and avoid repeating work. A basic ML pipeline usually includes:
- Loading data
- Data cleaning and preprocessing
- Feature selection
- Splitting into training and testing sets
- Model training
- Model evaluation
- Making predictions
We’ll cover all of this, with practical examples.
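As a quick preview, here is a minimal sketch of how stages like these can be chained with scikit-learn's Pipeline class. The scaler step and the step names are illustrative choices, not requirements:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),       # preprocessing stage
    ("clf", RandomForestClassifier()),  # model stage
])
# pipe.fit(X_train, y_train) would run every stage in order, and
# pipe.predict(X_test) would apply the same preprocessing before predicting.
In this guide we will write the steps out individually instead, so you can see what each one does.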
Step 1: Load Your Dataset
We’ll use the Iris dataset, a beginner-friendly dataset containing measurements of flowers and their species. You can also replace it with any .csv file later.
import pandas as pd
url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(url)  # Load the dataset
print(df.head())  # Preview the data
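To use your own data later, the same call works with a local file; the filename below is just a placeholder:
df = pd.read_csv("your_data.csv")  # hypothetical file name: swap in your own CSV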
Step 2: Data Cleaning & Preprocessing
Let’s make sure there are no missing values and convert the text labels (species) into numeric values so our ML model can understand them.
print(df.isnull().sum())  # Check for missing values
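The Iris dataset has no missing values, but your own CSV might. Here is a minimal sketch of two common fixes; which one is appropriate depends on your data:
df = df.fillna(df.mean(numeric_only=True))  # fill numeric gaps with each column's mean
df = df.dropna()  # or drop any rows that still have missing values
Now let's encode the labels: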
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])  # Encode species column (categorical to numeric)
Step 3: Feature Selection and Splitting
We now separate the features (X) from the target (y) and split them into training and testing sets.
from sklearn.model_selection import train_test_split
X = df.drop('species', axis=1)  # Features
y = df['species']  # Target

# Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
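If your classes are unevenly sized, it is often worth stratifying the split so each class keeps the same proportion in the train and test sets. A variant of the call above:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # preserve class proportions
)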
Step 4: Model Training
We’ll use a Random Forest Classifier, a popular algorithm that performs well with little tuning.
Random Forest builds multiple decision trees and combines their results to make more accurate and stable predictions. Each tree is trained on a random subset of the rows and features; the forest then takes a majority vote (for classification) or the average (for regression) across all trees.
✅ It’s fast, handles large datasets well, and is less prone to overfitting than a single decision tree.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)  # Initialize the model
model.fit(X_train, y_train)  # Train the model
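To see the majority-vote idea from the explanation above in action, you can ask each fitted tree in the forest for its own prediction. This sketch uses estimators_, scikit-learn's list of the individual trees; note that a tree's raw output is an index into model.classes_, which in our case matches the encoded species labels (0, 1, 2):
from collections import Counter

first_sample = X_test.iloc[[0]].values  # one test row as a plain array
tree_votes = [int(tree.predict(first_sample)[0]) for tree in model.estimators_]
print("Votes per class:", Counter(tree_votes))  # the forest returns the majority class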
Step 5: Model Evaluation
After training, let’s test the model’s performance on unseen data.
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred)) - Accuracy
print(classification_report(y_test, y_pred, target_names=le.classes_)) - classification report
✅ The model performs with perfect accuracy on this small dataset — but this may not always be the case with real-world data.
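Since a single train/test split can be optimistic on a dataset this small, a quick cross-validation gives a more robust estimate. A minimal sketch using cross_val_score:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)  # 5-fold CV
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")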
Step 6: Make Predictions
You can now use your trained model to predict the species of new flowers based on measurements.
# Input format: [sepal_length, sepal_width, petal_length, petal_width]
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)  # match the training feature names
prediction = model.predict(sample)
predicted_class = le.inverse_transform(prediction)  # Map the numeric label back to a species name
print("Predicted species:", predicted_class[0])
Output:
Predicted species: setosa
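Because each tree contributes a vote, the forest can also report how confident it is in each class. A short sketch using predict_proba:
proba = model.predict_proba(sample)[0]  # one probability per class
for name, p in zip(le.classes_, proba):
    print(f"{name}: {p:.2f}")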
Tools Used in This Pipeline
- Python: Programming language
- Pandas: For data loading and manipulation
- Scikit-learn: For machine learning models and evaluation
- LabelEncoder: To convert categorical labels to numeric values
- RandomForestClassifier: The ML model
Key Concepts Covered
- Loading .csv data into Pandas
- Checking and cleaning data
- Splitting dataset into train/test
- Training a model using RandomForestClassifier
- Evaluating performance with accuracy and F1-score
- Making predictions using new inputs