The interquartile range (IQR) method is another statistical method that can be used to identify and handle outliers in a dataset. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset. It is a measure of the spread of the data and is less sensitive to outliers than the range or standard deviation.

To identify and handle outliers using the IQR method, you can follow these steps:

  1. Calculate Q1 and Q3 for the dataset.
  2. Calculate the IQR by subtracting Q1 from Q3.
  3. Identify any data points that fall outside the following range: Q1 - 1.5 * IQR <= x <= Q3 + 1.5 * IQR
  4. These data points are considered outliers. You can remove them from the dataset (trimming), cap them at the nearest limit (capping), or replace them with the mean or median of the data.
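
The steps above can be sketched on a tiny, made-up sample before applying them to a real dataset. The `marks` array and its values are hypothetical and chosen only so the arithmetic is easy to follow; `np.percentile` uses linear interpolation by default.

```python
import numpy as np

# Hypothetical sample: exam marks with one suspiciously large value
marks = np.array([12, 15, 17, 18, 20, 21, 23, 25, 28, 95])

# Steps 1 and 2: quartiles and the interquartile range
q1, q3 = np.percentile(marks, [25, 75])
iqr = q3 - q1

# Step 3: the fences that define the "normal" range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 4: anything outside the fences is treated as an outlier
outliers = marks[(marks < lower_fence) | (marks > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

On this sample only the value 95 falls outside the fences.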

Note that the 1.5 * IQR rule does not assume any particular distribution, which is why it is a common choice for skewed data. It is still affected by the shape of the data, though: on heavily skewed or long-tailed data it can flag many legitimate values in the tail as outliers. The Z-score method is best suited to data that is approximately normally distributed, while the median absolute deviation (MAD) method is another robust alternative built on the median.
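
For comparison, here is a minimal sketch of the Z-score and MAD methods on a small, hypothetical sample. The cutoffs of 3 (Z-score) and 3.5 (modified Z-score with the 0.6745 scaling constant) are common conventions assumed here, not values taken from this article.

```python
import numpy as np

# Hypothetical sample with one extreme value
marks = np.array([12, 15, 17, 18, 20, 21, 23, 25, 28, 95], dtype=float)

# Z-score: distance from the mean in units of standard deviation
z_scores = (marks - marks.mean()) / marks.std()
z_outliers = marks[np.abs(z_scores) > 3]

# Modified Z-score: distance from the median in units of MAD
median = np.median(marks)
mad = np.median(np.abs(marks - median))
modified_z = 0.6745 * (marks - median) / mad
mad_outliers = marks[np.abs(modified_z) > 3.5]

print("Z-score outliers:", z_outliers)
print("MAD outliers:", mad_outliers)
```

In this sample the extreme value inflates the standard deviation enough that its own Z-score stays just under 3, so the Z-score method misses it, while the median-based MAD method flags it.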

IQR (interquartile range) = Q3 - Q1

Code

Dataset: https://www.kaggle.com/datasets/mayurdalvi/simple-linear-regression-placement-data
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the dataset from a CSV file
df = pd.read_csv('placement.csv')

# Displaying the first few rows of the dataset
df.head()

The code imports the required libraries and loads a dataset named ‘placement.csv’. It then displays the first few rows of the dataset.

# Plotting distribution graphs for 'cgpa' and 'placement_exam_marks'
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.histplot(df['cgpa'], kde=True)
plt.subplot(1,2,2)
sns.histplot(df['placement_exam_marks'], kde=True)
plt.show()

This section creates a side-by-side comparison of the distributions of ‘cgpa’ and ‘placement_exam_marks’ using Seaborn’s histplot with a density curve overlaid. It helps visualize the spread and shape of the data.

# Descriptive statistics for 'placement_exam_marks'
df['placement_exam_marks'].describe()

This code provides descriptive statistics (mean, standard deviation, min, max, etc.) for the ‘placement_exam_marks’ column.
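
The output of describe() already contains the two quartiles the IQR method needs, under the labels 25% and 75%. A minimal sketch on a hypothetical stand-in Series (the values are made up, not taken from placement.csv):

```python
import pandas as pd

# Hypothetical stand-in for df['placement_exam_marks']
marks = pd.Series([12, 15, 17, 18, 20, 21, 23, 25, 28, 95],
                  name="placement_exam_marks")

summary = marks.describe()

# The '25%' and '75%' rows are Q1 and Q3
q1 = summary["25%"]
q3 = summary["75%"]
print("Q1:", q1, "Q3:", q3, "IQR:", q3 - q1)
```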

# Boxplot for 'placement_exam_marks'
sns.boxplot(x=df['placement_exam_marks'])

The boxplot is used to visualize the distribution of ‘placement_exam_marks’ and identify potential outliers.

# Calculating the Interquartile Range (IQR) for 'placement_exam_marks'
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25

# Calculating upper and lower limits for identifying outliers
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

These lines compute the interquartile range (IQR) and define upper and lower limits for identifying outliers using the IQR method.

# Finding and displaying outliers above the upper limit
df[df['placement_exam_marks'] > upper_limit]

This code identifies and displays rows where ‘placement_exam_marks’ values are above the upper limit.

# Finding and displaying outliers below the lower limit
df[df['placement_exam_marks'] < lower_limit]

Similarly, this code identifies and displays rows where ‘placement_exam_marks’ values are below the lower limit.
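
Both checks can also be combined into a single boolean mask, which makes it easy to count the outliers before deciding how to handle them. The sketch below is one possible way to do this; the helper name `iqr_limits` and the `demo` DataFrame are hypothetical and used only for illustration.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower_limit, upper_limit) using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical stand-in for the placement dataset
demo = pd.DataFrame(
    {"placement_exam_marks": [12, 15, 17, 18, 20, 21, 23, 25, 28, 95]}
)

low, high = iqr_limits(demo["placement_exam_marks"])

# True for rows outside [low, high]; between() is inclusive on both ends
is_outlier = ~demo["placement_exam_marks"].between(low, high)

print("Number of outliers:", int(is_outlier.sum()))
print(demo[is_outlier])
```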

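# Trimming the dataset to remove outliers on both sides
new_df = df[
    (df['placement_exam_marks'] >= lower_limit)
    & (df['placement_exam_marks'] <= upper_limit)
]

This section creates a new DataFrame, new_df, that keeps only the rows whose ‘placement_exam_marks’ values lie within the lower and upper limits, matching the range defined in step 3.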

# Comparing distributions before and after trimming
# Top row: original data; bottom row: trimmed data
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(df['placement_exam_marks'], kde=True)
plt.subplot(2,2,2)
sns.boxplot(x=df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.histplot(new_df['placement_exam_marks'], kde=True)
plt.subplot(2,2,4)
sns.boxplot(x=new_df['placement_exam_marks'])
plt.show()

This code compares the distributions of ‘placement_exam_marks’ before and after trimming, helping visualize the impact of removing outliers.

# Capping extreme values to upper and lower limits
new_df_cap = df.copy()
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)

Here, extreme values are capped at the upper and lower limits to manage the impact of outliers.
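
As an alternative to the nested np.where, pandas offers Series.clip, which caps values at both limits in one call. The sketch below uses a hypothetical Series with made-up values, one extreme on each side, and checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one extreme value on each side
marks = pd.Series([-40, 12, 15, 17, 18, 20, 21, 23, 25, 28, 95], dtype=float)

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Capping with clip: values below/above the limits are set to the limits
capped = marks.clip(lower=lower_limit, upper=upper_limit)

# The nested np.where version, for comparison
nested = np.where(
    marks > upper_limit,
    upper_limit,
    np.where(marks < lower_limit, lower_limit, marks),
)

print(capped.tolist())
print("Same result:", bool((capped.to_numpy() == nested).all()))
```

Unlike trimming, capping keeps every row, so the number of observations does not change.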

# Comparing distributions before and after capping
# Top row: original data; bottom row: capped data
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(df['placement_exam_marks'], kde=True)
plt.subplot(2,2,2)
sns.boxplot(x=df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.histplot(new_df_cap['placement_exam_marks'], kde=True)
plt.subplot(2,2,4)
sns.boxplot(x=new_df_cap['placement_exam_marks'])
plt.show()

This code compares the distributions of ‘placement_exam_marks’ before and after capping, providing insights into the changes made to extreme values.
