Exploring Linear Regression: Theory & Implementation
Introduction
Linear Regression is one of the fundamental techniques in Machine Learning and Forecasting. It models the relationship between an independent variable (X) and a dependent variable (Y) by fitting a straight line to the data. The equation of this line is:
[ Y = mX + c ]
where m is the slope and c is the intercept.
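For example, with a slope of m = 2 and an intercept of c = 1, an input of X = 3 gives Y = 2(3) + 1 = 7: each unit increase in X raises Y by the slope.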
In this article, we will implement Linear Regression in two ways:
- Using Scikit-Learn: A quick and efficient approach using the LinearRegression model.
- Without Scikit-Learn: Manually computing the parameters to understand the underlying mathematics.
Understanding Linear Regression
The Mathematics Behind Linear Regression
Linear Regression aims to minimize the sum of squared differences between predicted and actual values, known as the Least Squares Method. The resulting formulas for m and c are:
Slope (m):
[ m = \frac{ n \sum(XY) - \sum X \sum Y }{ n \sum X^2 - (\sum X)^2 } ]
Intercept (c):
[ c = \frac{\sum Y - m \sum X}{n} ]
Where:
- X = Independent variable (Years of Experience)
- Y = Dependent variable (Salary)
- n = Number of data points
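To see these formulas in action on a tiny made-up dataset (not the salary data): take X = (1, 2, 3) and Y = (2, 4, 6), so n = 3, ΣX = 6, ΣY = 12, Σ(XY) = 28, ΣX² = 14. Then:
[ m = \frac{3(28) - (6)(12)}{3(14) - 6^2} = \frac{12}{6} = 2, \quad c = \frac{12 - 2(6)}{3} = 0 ]
which recovers Y = 2X exactly, as expected for perfectly linear data.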
Implementing Linear Regression
Using Scikit-Learn
The easiest way to apply Linear Regression is through Scikit-Learn. Let's walk through the implementation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Load dataset
df = pd.read_csv("Salary_dataset.csv")
df = df.drop('Unnamed: 0', axis=1)
# Prepare data
X = df[["YearsExperience"]] # Independent variable
y = df["Salary"] # Dependent variable
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Get parameters
m = model.coef_[0] # Slope
c = model.intercept_ # Intercept
print(f"Equation of the Line: Salary = {m:.2f} * Experience + {c:.2f}")
# Predictions
y_pred = model.predict(X_test)
# Predict salary for 10 years of experience (a DataFrame keeps the
# feature name consistent with training and avoids a warning)
experience = pd.DataFrame({"YearsExperience": [10]})
predicted_salary = model.predict(experience)
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary[0]:,.2f}")
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
# Plot results
plt.scatter(X_train, y_train, color='blue', label="Training Data")
plt.scatter(X_test, y_test, color='red', label="Testing Data")
plt.plot(X, model.predict(X), color='green', linewidth=2, label="Regression Line")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Salary Prediction using Linear Regression")
plt.legend()
plt.show()
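The FAQ below also mentions the R² score. Scikit-Learn exposes it directly on any fitted regressor, so a quick check using the same model and test split as above would be:
# R² on the held-out test set: 1.0 is a perfect fit,
# 0 means no better than predicting the mean of y
r2 = model.score(X_test, y_test)
print(f"R² Score: {r2:.3f}")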
Manually Implementing Linear Regression
For a deeper understanding, we manually calculate the slope and intercept:
import numpy as np
import pandas as pd
# Load dataset
data = pd.read_csv("Salary_dataset.csv")
data = data.drop('Unnamed: 0', axis=1)
# Prepare data
y = data['Salary']
X = data['YearsExperience']
# Compute parameters manually
n = len(X)
sum_X = X.sum()
sum_y = y.sum()
sum_Xy = (X * y).sum()
sum_X_squared = (X ** 2).sum()
# Compute Slope (m) and Intercept (c)
m = (n * sum_Xy - sum_X * sum_y) / (n * sum_X_squared - sum_X ** 2)
c = (sum_y - m * sum_X) / n
# Display the equation
print(f"Equation: Salary = {m:.2f} * Experience + {c:.2f}")
# Predict salary for 10 years of experience
experience = 10
predicted_salary = m * experience + c
print(f"Predicted Salary for 10 Years Experience: ${predicted_salary:,.2f}")
# Save predictions
data['Predicted_Salary'] = m * data['YearsExperience'] + c
data.to_csv("salary_predictions.csv", index=False)
print("Predictions saved to salary_predictions.csv")
Common Questions
1. When should I use Linear Regression?
Linear Regression is useful when the relationship between the independent and dependent variables is approximately linear, and when you want a fast, interpretable baseline model. A scatter plot or a correlation check, as in the snippet below, is a quick way to test this.
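A minimal sketch of that correlation check, assuming the salary DataFrame df from the implementation above:
# Pearson correlation: values close to +1 or -1 suggest a strong linear relationship
corr = df["YearsExperience"].corr(df["Salary"])
print(f"Pearson correlation: {corr:.3f}")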
2. What are the assumptions of Linear Regression?
- The relationship between X and Y is linear.
- No multicollinearity: independent variables should not be highly correlated (this only matters with multiple predictors; our example uses just one).
- Homoscedasticity: the variance of the errors is constant.
- Residuals follow a normal distribution (this and homoscedasticity can be checked visually with a residual plot, sketched below).
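A minimal residual-plot sketch, assuming the fitted model and test split from the Scikit-Learn section; residuals scattered evenly around zero, with no funnel or curve, support the last two assumptions:
# Residuals vs. predictions: look for an even band around zero
y_pred_test = model.predict(X_test)
residuals = y_test - y_pred_test
plt.scatter(y_pred_test, residuals, color='purple')
plt.axhline(0, color='black', linewidth=1)
plt.xlabel("Predicted Salary")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.show()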
3. How do I evaluate my model's performance?
You can use:
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score (indicates how well the model explains the variation in Y)
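For reference, these metrics are defined as:
[ MAE = \frac{1}{n} \sum |Y_i - \hat{Y}_i| \qquad RMSE = \sqrt{ \frac{1}{n} \sum (Y_i - \hat{Y}_i)^2 } \qquad R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2} ]
RMSE penalizes large errors more heavily than MAE because errors are squared before averaging, while R² compares the model's squared error against simply predicting the mean of Y.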
Conclusion
In this article, we explored Linear Regression in depth by implementing it with and without Scikit-Learn. By manually computing the parameters, we gained insight into the mathematics behind the algorithm. Additionally, we evaluated model performance using MAE and RMSE, and visualized the results.
Understanding Linear Regression is crucial as it is widely used in forecasting, economics, and predictive modeling. By mastering both the theoretical and practical aspects, you can build robust models for real-world applications!
References:
- Scikit-Learn Documentation
- Understanding Linear Regression
Tags: Python, Machine Learning, Linear Regression, Scikit-Learn, Forecasting