Linear Regression with and without Scikit learn Library

Exploring Linear Regression: Theory & Implementation

Introduction

Linear Regression is one of the fundamental techniques in Machine Learning and Forecasting. It establishes a relationship between an independent variable (X) and a dependent variable (Y) by fitting a straight line to the data. The equation of this line is:

\[ Y = mX + c \]

where m is the slope and c is the intercept.

In this article, we will implement Linear Regression in two ways:

  • Using Scikit-Learn: A quick and efficient approach using the LinearRegression model.
  • Without Scikit-Learn: Manually computing the parameters to understand the underlying mathematics.

Understanding Linear Regression

The Mathematics Behind Linear Regression

Linear Regression chooses m and c so that the sum of squared differences between predicted and actual values is as small as possible (the Least Squares Method). For a single independent variable, this has a closed-form solution:

Slope (m):

\[ m = \frac{ n \sum(XY) - \sum X \sum Y }{ n \sum X^2 - (\sum X)^2 } \]

Intercept (c):

\[ c = \frac{\sum Y - m \sum X}{n} \]

Where:

  • X = Independent variable (Years of Experience)
  • Y = Dependent variable (Salary)
  • n = Number of data points
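
To make the formulas concrete, here is a small hand-checkable sketch. The five data points are made up for illustration (they are not from the salary dataset) and lie exactly on Y = 2X + 1, so the formulas should recover m = 2 and c = 1.

```python
# Toy data (hypothetical, not from Salary_dataset.csv): points on Y = 2X + 1
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]

n = len(X)
sum_X = sum(X)                               # 15
sum_Y = sum(Y)                               # 35
sum_XY = sum(x * y for x, y in zip(X, Y))    # 3 + 10 + 21 + 36 + 55 = 125
sum_X2 = sum(x * x for x in X)               # 1 + 4 + 9 + 16 + 25 = 55

# Slope and intercept from the least-squares formulas above
m = (n * sum_XY - sum_X * sum_Y) / (n * sum_X2 - sum_X ** 2)
c = (sum_Y - m * sum_X) / n

print(f"m = {m}, c = {c}")  # m = 2.0, c = 1.0
```

Working the slope by hand gives (5 × 125 − 15 × 35) / (5 × 55 − 15²) = 100 / 50 = 2, and the intercept is (35 − 2 × 15) / 5 = 1.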

Implementing Linear Regression

Using Scikit-Learn

The easiest way to apply Linear Regression is through Scikit-Learn. Let's walk through the implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load dataset
df = pd.read_csv("Salary_dataset.csv")
df = df.drop('Unnamed: 0', axis=1)

# Prepare data
X = df[["YearsExperience"]]  # Independent variable
y = df["Salary"]  # Dependent variable

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Get parameters
m = model.coef_[0]  # Slope
c = model.intercept_  # Intercept
print(f"Equation of the Line: Salary = {m:.2f} * Experience + {c:.2f}")

# Predictions
y_pred = model.predict(X_test)

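# Predict salary for 10 years of experience
# (a DataFrame with the training column name avoids scikit-learn's feature-name warning)
experience = pd.DataFrame({"YearsExperience": [10]})
predicted_salary = model.predict(experience)
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary[0]:,.2f}")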

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")

# Plot results
plt.scatter(X_train, y_train, color='blue', label="Training Data")
plt.scatter(X_test, y_test, color='red', label="Testing Data")
plt.plot(X, model.predict(X), color='green', linewidth=2, label="Regression Line")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Salary Prediction using Linear Regression")
plt.legend()
plt.show()

Manually Implementing Linear Regression

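For a deeper understanding, we manually calculate the slope and intercept. Note that this version fits on the full dataset rather than on the 80% training split, so its coefficients will differ slightly from the Scikit-Learn result above: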

import numpy as np
import pandas as pd

# Load dataset
data = pd.read_csv("Salary_dataset.csv")
data = data.drop('Unnamed: 0', axis=1)

# Prepare data
y = data['Salary']
X = data['YearsExperience']

# Compute parameters manually
n = len(X)
sum_X = X.sum()
sum_y = y.sum()
sum_Xy = (X * y).sum()
sum_X_squared = (X ** 2).sum()

# Compute Slope (m) and Intercept (c)
m = (n * sum_Xy - sum_X * sum_y) / (n * sum_X_squared - sum_X ** 2)
c = (sum_y - m * sum_X) / n

# Display the equation
print(f"Equation: Salary = {m:.2f} * Experience + {c:.2f}")

# Predict salary for 10 years of experience
experience = 10
predicted_salary = m * experience + c
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary:,.2f}")

# Save predictions
data['Predicted_Salary'] = m * data['YearsExperience'] + c
data.to_csv("salary_predictions.csv", index=False)
print("Predictions saved to salary_predictions.csv")
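
One way to gain confidence in the hand-written formulas is to compare them with an independent least-squares routine. The sketch below uses NumPy's np.polyfit with degree 1 on synthetic data. The generated values and the helper name fit_line are assumptions made for illustration; they are not part of the original dataset or code.

```python
import numpy as np

def fit_line(X, y):
    """Return (m, c) from the closed-form least-squares formulas (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    m = (n * (X * y).sum() - X.sum() * y.sum()) / (n * (X ** 2).sum() - X.sum() ** 2)
    c = (y.sum() - m * X.sum()) / n
    return m, c

# Synthetic, made-up data: a noisy line with slope 9000 and intercept 25000
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=50)
y = 9000 * X + 25000 + rng.normal(0, 3000, size=50)

m, c = fit_line(X, y)
m_ref, c_ref = np.polyfit(X, y, deg=1)  # for deg=1, returns [slope, intercept]

print(f"manual:  m = {m:.2f}, c = {c:.2f}")
print(f"polyfit: m = {m_ref:.2f}, c = {c_ref:.2f}")
print("match:", np.allclose([m, c], [m_ref, c_ref]))
```

Both approaches solve the same least-squares problem, so the two pairs of coefficients should agree up to floating-point rounding.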

Common Questions

1. When should I use Linear Regression?

Linear Regression is useful when there is a linear relationship between the dependent and independent variables.
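
A quick, informal way to check for a linear relationship is the Pearson correlation coefficient. The sketch below uses made-up values (not taken from the salary dataset) and NumPy's np.corrcoef; a value close to 1 or -1 suggests a strong linear association, though a scatter plot is still worth inspecting because the coefficient alone can hide curvature or outliers.

```python
import numpy as np

# Made-up values for illustration (not taken from Salary_dataset.csv)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
salary = np.array([40000.0, 48000.0, 61000.0, 69000.0, 80000.0, 91000.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(years, salary)[0, 1]
print(f"Pearson r = {r:.3f}")
```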

2. What are the assumptions of Linear Regression?

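  • The relationship between X and Y is linear.
  • Errors are independent of one another.
  • Homoscedasticity (constant variance of errors).
  • Residuals follow a normal distribution.
  • No multicollinearity (when there are several independent variables, they should not be highly correlated with each other; this does not arise with a single X).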
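
Some of these assumptions can be checked informally from the residuals. The following is a rough sketch on synthetic data (generated here for illustration, with constant-variance noise by construction); it compares the residual spread for low and high values of X. In practice a residual plot is the more common diagnostic, and the split-in-half comparison below is only an assumption-laden shortcut.

```python
import numpy as np

# Synthetic data with constant-variance noise (made up for illustration)
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(1, 10, size=200))
y = 9000 * X + 25000 + rng.normal(0, 3000, size=200)

# Fit a straight line and compute residuals (actual minus predicted)
m, c = np.polyfit(X, y, deg=1)
residuals = y - (m * X + c)

# With an intercept in the model, least-squares residuals average to about zero
print(f"mean residual: {residuals.mean():.6f}")

# Rough homoscedasticity check: residual spread for the lower vs. upper half of X
half = len(X) // 2
low_std = residuals[:half].std()
high_std = residuals[half:].std()
print(f"residual std (low X):  {low_std:.1f}")
print(f"residual std (high X): {high_std:.1f}")
print(f"ratio: {high_std / low_std:.2f}")  # a ratio near 1 suggests similar variance
```

A ratio far from 1, or a funnel shape in a residual plot, would hint that the constant-variance assumption does not hold.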

3. How do I evaluate my model's performance?

You can use:

  • Mean Absolute Error (MAE)
  • Root Mean Squared Error (RMSE)
  • R² Score (indicates how well the model explains the variation in Y)
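
As a sketch of what each metric measures, the example below computes all three directly with NumPy on small made-up arrays (these are not results from the salary model). Scikit-Learn's sklearn.metrics module provides mean_absolute_error, mean_squared_error, and r2_score, which should give the same values.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([40000.0, 50000.0, 60000.0, 70000.0])
y_pred = np.array([42000.0, 49000.0, 61000.0, 68000.0])

errors = y_true - y_pred

mae = np.abs(errors).mean()                      # mean of absolute errors
rmse = np.sqrt((errors ** 2).mean())             # square root of mean squared error
ss_res = (errors ** 2).sum()                     # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.4f}")
```

MAE and RMSE are in the same units as Y (here, salary), while R² is unitless, with 1 indicating a perfect fit.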

Conclusion

In this article, we explored Linear Regression in depth by implementing it with and without Scikit-Learn. By manually computing the parameters, we gained insight into the mathematics behind the algorithm. Additionally, we evaluated model performance using MAE and RMSE, and visualized the results.

Understanding Linear Regression is crucial as it is widely used in forecasting, economics, and predictive modeling. By mastering both the theoretical and practical aspects, you can build robust models for real-world applications!


Tags: Python, Machine Learning, Linear Regression, Scikit-Learn, Forecasting
