Linear Regression with and without Scikit learn Library


Exploring Linear Regression: Theory & Implementation

Introduction

Linear Regression is one of the fundamental techniques in Machine Learning and Forecasting. It establishes a relationship between an independent variable (X) and a dependent variable (Y) by fitting a straight line to the data. The equation of this line is:

\[ Y = mX + c \]

where m is the slope and c is the intercept.

In this article, we will implement Linear Regression in two ways:

  • Using Scikit-Learn: A quick and efficient approach using the LinearRegression model.
  • Without Scikit-Learn: Manually computing the parameters to understand the underlying mathematics.

Understanding Linear Regression

The Mathematics Behind Linear Regression

Linear Regression uses the Least Squares Method: it chooses m and c so that the sum of squared differences between the actual and predicted values is as small as possible. For a single independent variable, the formulas for m and c are:

Slope (m):

\[ m = \frac{ n \sum XY - \sum X \sum Y }{ n \sum X^2 - \left( \sum X \right)^2 } \]

Intercept (c):

\[ c = \frac{ \sum Y - m \sum X }{ n } \]

Where:
- X = Independent variable (Years of Experience)
- Y = Dependent variable (Salary)
- n = Number of data points
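To make the two formulas concrete, here is a minimal worked sketch on a small, made-up dataset (the five data points are hypothetical and chosen so the arithmetic is easy to follow). It plugs the sums into the slope and intercept formulas and cross-checks the result against NumPy's `np.polyfit`.

```python
import numpy as np

# Hypothetical toy data: years of experience (X) and salary in thousands (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([40.0, 45.0, 50.0, 55.0, 60.0])

n = len(X)
sum_x = X.sum()          # 15
sum_y = Y.sum()          # 250
sum_xy = (X * Y).sum()   # 800
sum_x2 = (X ** 2).sum()  # 55

# Slope and intercept from the least-squares formulas.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n

# Cross-check against NumPy's degree-1 polynomial fit.
m_ref, c_ref = np.polyfit(X, Y, deg=1)

print(f"m = {m:.2f}, c = {c:.2f}")  # m = 5.00, c = 35.00
```

Here the slope is (5 * 800 - 15 * 250) / (5 * 55 - 15^2) = 250 / 50 = 5 and the intercept is (250 - 5 * 15) / 5 = 35, so each extra year adds 5 (thousand) to the predicted salary.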

Implementing Linear Regression

Using Scikit-Learn

The easiest way to apply Linear Regression is through Scikit-Learn. Let's walk through the implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load dataset
df = pd.read_csv("Salary_dataset.csv")
df = df.drop('Unnamed: 0', axis=1)

# Prepare data
X = df[["YearsExperience"]]  # Independent variable
y = df["Salary"]  # Dependent variable

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Get parameters
m = model.coef_[0]  # Slope
c = model.intercept_  # Intercept
print(f"Equation of the Line: Salary = {m:.2f} * Experience + {c:.2f}")

# Predictions
y_pred = model.predict(X_test)

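# Predict salary for 10 years of experience
# (a DataFrame with the training column name avoids scikit-learn's feature-name warning)
experience = pd.DataFrame({"YearsExperience": [10]})
predicted_salary = model.predict(experience)
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary[0]:,.2f}")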

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")

# Plot results
plt.scatter(X_train, y_train, color='blue', label="Training Data")
plt.scatter(X_test, y_test, color='red', label="Testing Data")
plt.plot(X, model.predict(X), color='green', linewidth=2, label="Regression Line")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Salary Prediction using Linear Regression")
plt.legend()
plt.show()

Manually Implementing Linear Regression

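For a deeper understanding, we manually calculate the slope and intercept using the least-squares formulas. This version fits the line on the full dataset rather than on an 80% training split, so its slope and intercept will differ slightly from the Scikit-Learn output: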

import numpy as np
import pandas as pd

# Load dataset
data = pd.read_csv("Salary_dataset.csv")
data = data.drop('Unnamed: 0', axis=1)

# Prepare data
y = data['Salary']
X = data['YearsExperience']

# Compute parameters manually
n = len(X)
sum_X = X.sum()
sum_y = y.sum()
sum_Xy = (X * y).sum()
sum_X_squared = (X ** 2).sum()

# Compute Slope (m) and Intercept (c)
m = (n * sum_Xy - sum_X * sum_y) / (n * sum_X_squared - sum_X ** 2)
c = (sum_y - m * sum_X) / n

# Display the equation
print(f"Equation: Salary = {m:.2f} * Experience + {c:.2f}")

# Predict salary for 10 years of experience
experience = 10
predicted_salary = m * experience + c
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary:,.2f}")

# Save predictions
data['Predicted_Salary'] = m * data['YearsExperience'] + c
data.to_csv("salary_predictions.csv", index=False)
print("Predictions saved to salary_predictions.csv")
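As a sanity check, the manual formulas and Scikit-Learn should agree when both are fit on exactly the same data. The sketch below assumes synthetic data (generated here, not taken from Salary_dataset.csv) and a hypothetical helper named `fit_line`; it is an illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_line(x, y):
    """Return (slope, intercept) from the closed-form least-squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    m = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    c = (y.sum() - m * x.sum()) / n
    return m, c


# Synthetic, made-up data: a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=30)
y = 9000 * x + 25000 + rng.normal(0, 5000, size=30)

m_manual, c_manual = fit_line(x, y)

# Scikit-Learn expects a 2-D feature array, hence the reshape.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("slopes match:    ", np.isclose(m_manual, model.coef_[0]))
print("intercepts match:", np.isclose(c_manual, model.intercept_))
```

Both approaches solve the same least-squares problem, so any difference should be floating-point noise only.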

Common Questions

1. When should I use Linear Regression?

Use Linear Regression when the dependent variable is continuous and its relationship with the independent variable is approximately linear.
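One rough way to gauge this before fitting is the Pearson correlation coefficient, alongside a scatter plot. The sketch below uses two made-up series to show why the coefficient alone is not enough: a perfectly predictable but curved relationship can still have a correlation near zero, and a straight line would fit it poorly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

y_linear = 2.0 * x + 1.0   # exactly linear in x
y_curved = (x - 3.5) ** 2  # strongly related to x, but not linearly

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the x-y entry.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear data: r = {r_linear:.3f}")
print(f"curved data: |r| = {abs(r_curved):.3f}")
```

A value of r close to 1 or -1 suggests a straight line is a reasonable starting point; a value near 0 means a line will explain little, even if X and Y are clearly related in some other way.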

2. What are the assumptions of Linear Regression?

  • The relationship between X and Y is linear.
  • No multicollinearity (relevant only when there are several independent variables: they should not be highly correlated with each other).
  • Homoscedasticity (constant variance of errors).
  • Residuals follow a normal distribution.
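These assumptions are usually checked by looking at the residuals (actual minus predicted values). The following is an informal sketch on synthetic data, not a formal statistical test: it fits a line with `np.polyfit`, computes the residuals, and compares their spread in the lower and upper half of the X range as a crude homoscedasticity check. The data and the "spread ratio" heuristic are assumptions made for illustration.

```python
import numpy as np

# Synthetic data: a straight line plus noise with constant variance.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 40)  # already sorted in increasing order
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# Fit the line and compute residuals.
m, c = np.polyfit(x, y, deg=1)
residuals = y - (m * x + c)

# With an intercept in the model, least-squares residuals average to ~0.
print(f"mean residual: {residuals.mean():.2e}")

# Crude homoscedasticity check: residual spread for small x vs. large x.
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()
print(f"spread ratio (upper half / lower half): {spread_ratio:.2f}")
```

A ratio far from 1 hints that the error variance changes with X. In practice, a residuals-versus-fitted plot and a Q-Q plot of the residuals give a fuller picture than any single number.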

3. How do I evaluate my model's performance?

You can use:
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score (indicates how well the model explains the variation in Y)
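The three metrics can be computed side by side with `sklearn.metrics`. The sketch below uses four made-up actual and predicted values so the numbers can be verified by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values (errors: -2, +2, -3, +3).
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_hat = np.array([52.0, 58.0, 73.0, 77.0])

mae = mean_absolute_error(y_true, y_hat)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # sqrt of mean squared error
r2 = r2_score(y_true, y_hat)                        # 1 - SS_res / SS_tot

print(f"MAE  = {mae:.2f}")   # MAE  = 2.50
print(f"RMSE = {rmse:.2f}")  # RMSE = 2.55
print(f"R^2  = {r2:.3f}")    # R^2  = 0.948
```

MAE and RMSE are in the same units as Y, and RMSE penalizes large errors more heavily because the errors are squared before averaging. R² is unitless: here SS_res = 26 and SS_tot = 500, so the model explains about 94.8% of the variation in these four values.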

Conclusion

In this article, we explored Linear Regression in depth by implementing it with and without Scikit-Learn. By manually computing the parameters, we gained insight into the mathematics behind the algorithm. Additionally, we evaluated model performance using MAE and RMSE, and visualized the results.

Understanding Linear Regression is crucial as it is widely used in forecasting, economics, and predictive modeling. By mastering both the theoretical and practical aspects, you can build robust models for real-world applications!




Tags: Python, Machine Learning, Linear Regression, Scikit-Learn, Forecasting
