The Z-score, also known as the standard score or z-value, is a statistical measure of how far a data point lies from the mean of a dataset, expressed in standard deviations. The formula for the Z-score of a data point, x, in a dataset with mean, μ, and standard deviation, σ, is:

Z = (x − μ) / σ

Here’s what each component represents:

  • x: the individual data point.
  • μ: the mean of the dataset.
  • σ: the standard deviation of the dataset.

The Z-score helps in understanding how far a particular data point is from the mean in terms of standard deviations. A positive Z-score indicates that the data point is above the mean, while a negative Z-score indicates that the data point is below the mean.

In the context of outlier detection, Z-scores are often used to identify data points that deviate significantly from the mean. Values with Z-scores beyond a certain threshold (commonly ±3) are considered potential outliers. This statistical method provides a standardized way to compare and analyze data points across different distributions.

Any z-score greater than +3 or less than −3 is typically treated as an outlier; this is equivalent to flagging points that lie more than three standard deviations from the mean.
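As a quick sketch of the ±3 rule, here is a small NumPy example on a made-up sample (50 values around 10 plus one extreme value of 100; the data is an assumption for illustration only):

```python
import numpy as np

# Synthetic sample: 50 roughly normal values around 10, plus one extreme value
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10, scale=1, size=50), 100.0)

mean = data.mean()
std = data.std()

# Z-score of every point: distance from the mean in standard deviations
z_scores = (data - mean) / std

# Points with |z| > 3 are flagged as potential outliers
outlier_mask = np.abs(z_scores) > 3
print(data[outlier_mask])
```

Only the extreme value ends up flagged here; the ordinary points all sit well within three standard deviations of the mean.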

To remove outliers using the Z-score method, you can follow these steps:

  1. Calculate the Z-score for each data point in the dataset.
  2. Identify the data points with a Z-score greater than the threshold.
  3. Remove these data points from the dataset.
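The three steps above can be sketched in pandas; the dataframe and the column name `score` are made up here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: 200 normal-ish scores plus one extreme value
rng = np.random.default_rng(42)
df = pd.DataFrame({'score': np.append(rng.normal(60, 10, 200), 250.0)})

# Step 1: calculate the z-score for each data point
df['z'] = (df['score'] - df['score'].mean()) / df['score'].std()

# Step 2: identify data points beyond the ±3 threshold
outliers = df[df['z'].abs() > 3]

# Step 3: remove them, dropping the helper column afterwards
df_trimmed = df[df['z'].abs() <= 3].drop(columns='z')
```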

You can also use the Z-score method to identify the outliers in a dataset and then impute missing values in place of the outliers using the mean or median of the data.
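One way to sketch that imputation idea, on a small made-up series, is to flag outliers by z-score and replace them with the median of the remaining values:

```python
import pandas as pd

# Hypothetical series with one extreme value at the end
s = pd.Series([20.0] * 10 + [22.0] * 10 + [200.0])

z = (s - s.mean()) / s.std()

# Median of the non-outlier values, used as the replacement
median = s[z.abs() <= 3].median()

# Replace flagged outliers with that median
s_imputed = s.mask(z.abs() > 3, median)
```

Using the median of the non-outlier values (rather than the overall mean) avoids letting the outlier itself distort the replacement value.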

It’s important to note that the Z-score method is sensitive to the distribution of the data. If the data is not normally distributed, the Z-score method may not be the most appropriate method for identifying outliers. In such cases, it may be more appropriate to use alternative methods such as the interquartile range (IQR) method or the median absolute deviation (MAD) method.

If the data is normally distributed, or close to it, use the Z-score method.

If the data is skewed, use IQR-based filtering.
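For the skewed case, IQR-based filtering can be sketched as follows; the data here is a made-up right-skewed sample (exponential values plus one extreme point):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data with one extreme value
rng = np.random.default_rng(7)
marks = pd.Series(np.append(rng.exponential(scale=20, size=300), 500.0))

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only values inside the fences
filtered = marks[(marks >= lower) & (marks <= upper)]
```

Because the fences are built from quartiles rather than the mean and standard deviation, they are far less distorted by the extreme value itself.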

Dataset: https://www.kaggle.com/datasets/mayurdalvi/simple-linear-regression-placement-data
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the dataset
df = pd.read_csv('placement.csv')

# Displaying the shape and a random sample of 5 rows
print(df.shape)
print(df.sample(5))

# Plotting distribution graphs for 'cgpa' and 'placement_exam_marks'
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['cgpa'], kde=True)  # distplot is deprecated in recent seaborn
plt.subplot(1, 2, 2)
sns.histplot(df['placement_exam_marks'], kde=True)
plt.show()

# Checking skewness for 'placement_exam_marks' and 'cgpa'
print(df['placement_exam_marks'].skew())
print(df['cgpa'].skew())

# Using Z-score to identify outliers for 'cgpa'
mean_cgpa = df['cgpa'].mean()
std_cgpa = df['cgpa'].std()

print("Mean value of cgpa:", mean_cgpa)
print("Std value of cgpa:", std_cgpa)

# Calculating boundary values
upper_limit_cgpa = mean_cgpa + 3 * std_cgpa
lower_limit_cgpa = mean_cgpa - 3 * std_cgpa

print("Highest allowed:", upper_limit_cgpa)
print("Lowest allowed:", lower_limit_cgpa)

# Finding and trimming outliers for 'cgpa'
outliers_cgpa = df[(df['cgpa'] > upper_limit_cgpa) | (df['cgpa'] < lower_limit_cgpa)]
trimmed_df_cgpa = df[(df['cgpa'] < upper_limit_cgpa) & (df['cgpa'] > lower_limit_cgpa)]

# Alternative approach using Z-score
# Calculating Z-score for 'cgpa'
df['cgpa_zscore'] = (df['cgpa'] - mean_cgpa) / std_cgpa

# Trimming outliers using Z-score
trimmed_df_cgpa_zscore = df[(df['cgpa_zscore'] < 3) & (df['cgpa_zscore'] > -3)]

# Capping upper and lower limits
upper_limit = mean_cgpa + 3 * std_cgpa
lower_limit = mean_cgpa - 3 * std_cgpa

# Using numpy's where() to cap values outside the limits
df['cgpa'] = np.where(
    df['cgpa'] > upper_limit,
    upper_limit,
    np.where(
        df['cgpa'] < lower_limit,
        lower_limit,
        df['cgpa']
    )
)

# Displaying the final shape of the dataframe
print(df.shape)

This code identifies and handles outliers in the ‘cgpa’ column of a placement dataset, using the Z-score method for detection, trimming to remove flagged rows, and capping to clamp extreme values to the 3-sigma limits, alongside distribution plots and descriptive statistics.
