Introduction and Types of Linear Regression

Linear regression is a fundamental statistical and machine learning technique used for predicting a continuous outcome variable based on one or more predictor variables. It establishes a relationship between the dependent variable (output) and the independent variable(s) (features) by fitting a linear equation to the observed data. There are two main types of linear regression: Simple Linear Regression and Multiple Linear Regression.

Simple Linear Regression

Simple Linear Regression involves predicting a dependent variable based on a single independent variable. The relationship between the two variables is assumed to be linear, which means it can be represented by a straight line. The equation for simple linear regression is given by:

y = mx + b

Where:

  • y is the dependent variable.
  • x is the independent variable.
  • m is the slope of the line.
  • b is the y-intercept.

Intuition of Simple Linear Regression

The intuition behind simple linear regression is to find the best-fitting line that minimizes the sum of the squared differences between the observed and predicted values. This line represents the relationship between the variables, and the slope m indicates the rate of change in the dependent variable for a unit change in the independent variable.
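
As a quick illustration of this objective, the snippet below (using made-up numbers purely for demonstration) computes the sum of squared errors for two candidate lines; the line closer to the data yields a much smaller cost:

# Illustrating the least-squares objective on tiny made-up data
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def sum_squared_error(m, b):
    residuals = y - (m * x + b)   # observed minus predicted
    return np.sum(residuals ** 2)

print(sum_squared_error(2.0, 0.0))  # good fit -> small error (0.07)
print(sum_squared_error(0.5, 1.0))  # poor fit -> large error (43.67)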

Code Example

Let’s demonstrate simple linear regression with a Python code example.

Datasets: Simple Linear Regression – Placement data (kaggle.com)


# Importing necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Reading the dataset
df = pd.read_csv('placement.csv')

# Displaying the first few rows of the dataset
print(df.head())

# Checking the shape of the dataset
print(df.shape)  # Output: (200, 2)

# Scatter plot of CGPA vs. Package
plt.scatter(df['cgpa'], df['package'])
plt.xlabel('CGPA')
plt.ylabel('Package (in LPA)')
plt.title('Scatter Plot of CGPA vs. Package')
plt.show()

# Selecting features and target variable
X = df.iloc[:, 0:1]
y = df.iloc[:, -1]

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Creating and training the Linear Regression model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Scatter plot with the regression line
plt.scatter(df['cgpa'], df['package'])
plt.plot(X_train, lr.predict(X_train), color='red')
plt.xlabel('CGPA')
plt.ylabel('Package (in LPA)')
plt.title('Linear Regression: CGPA vs. Package')
plt.show()

# Evaluating the model using metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = lr.predict(X_test)

# Displaying the true values for comparison
print(y_test.values)

# Calculating Mean Absolute Error (MAE)
print("MAE:", mean_absolute_error(y_test, y_pred))

# Calculating Mean Squared Error (MSE)
print("MSE:", mean_squared_error(y_test, y_pred))

# Calculating Root Mean Squared Error (RMSE)
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculating R2 Score
print("R2 Score:", r2_score(y_test, y_pred))

# Adjusted R2 Score calculation
# n = number of test samples, k = number of predictors
n, k = X_test.shape  # (40, 1) for this split
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print("Adjusted R2 Score:", adjusted_r2)

How to Find m and b?

The values of m and b in the simple linear regression equation (y = mx + b) are determined during the training process. The goal is to find the values that minimize the sum of squared differences between the observed and predicted values. This is typically done using the method of least squares.
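
For simple linear regression, the least-squares problem has a well-known closed-form solution, obtained by setting the derivatives of the squared-error cost with respect to m and b to zero:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b = ȳ − m·x̄

where x̄ and ȳ denote the means of the independent and dependent variables. This is exactly the computation carried out by the from-scratch implementation below.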

Simple Linear Regression Model Code from Scratch

While scikit-learn provides an easy way to implement linear regression, understanding how it works under the hood is valuable. Here’s a simplified version of simple linear regression code implemented from scratch in Python.

# Implementation of simple linear regression from scratch
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.m = None  # slope
        self.b = None  # y-intercept

    def fit(self, X, y):
        # Closed-form least-squares estimates (X and y are 1-D arrays)
        X_mean, y_mean = np.mean(X), np.mean(y)
        numerator = np.sum((X - X_mean) * (y - y_mean))
        denominator = np.sum((X - X_mean) ** 2)
        self.m = numerator / denominator
        self.b = y_mean - self.m * X_mean

    def predict(self, X):
        return self.m * X + self.b
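
As a quick sanity check, the class can be fitted on the same training split used earlier and compared against scikit-learn's estimates. This is a hypothetical usage sketch; note that fit expects 1-D arrays, so the single feature column is flattened with ravel:

# Hypothetical usage, reusing X_train/X_test/y_train from the earlier split
slr = SimpleLinearRegression()
slr.fit(X_train.values.ravel(), y_train.values)

print(slr.m, slr.b)  # should match lr.coef_[0] and lr.intercept_
print(slr.predict(X_test.values.ravel())[:5])  # first five predictions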

Regression Metrics

The metrics computed above each summarize model quality in a different way. MAE is the average absolute difference between observed and predicted values, expressed in the units of the target. MSE averages the squared differences, penalizing large errors more heavily, and RMSE is its square root, which returns the scale to the target's units. The R2 score (coefficient of determination) measures the proportion of variance in the dependent variable explained by the model, while adjusted R2 corrects for the number of predictors so the score does not rise merely because more features were added.
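
To connect these definitions to the scikit-learn calls used earlier, here is a minimal sketch that computes the same metrics directly with NumPy, reusing y_test and y_pred from the evaluation step above:

# Computing the regression metrics by hand with NumPy
errors = y_test.values - y_pred

mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

ss_res = np.sum(errors ** 2)                                    # residual sum of squares
ss_tot = np.sum((y_test.values - np.mean(y_test.values)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot        # R2 Score

print(mae, mse, rmse, r2)       # should match the sklearn values above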

Conclusion

Simple linear regression is a powerful tool for understanding and predicting relationships between two variables. By fitting a straight line to the data, we can make predictions and quantify the accuracy of the model using regression metrics. Understanding the intuition, implementing the model from scratch, and interpreting the metrics are all crucial steps toward mastering simple linear regression.
