Outliers in a dataset are data points that significantly differ from the majority of the data, and they can adversely impact the performance of a model.

For instance:

Consider ages: 30, 45, 50, 80, 3000. Here, 3000 is an outlier.

The mean age is calculated as (30 + 45 + 50 + 80 + 3000) / 5 = 641, which is an unrealistic representation of the central tendency due to the outlier.
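As a quick check of the arithmetic above, the following minimal sketch compares the mean with the median for the same five ages. It assumes pandas is available, and the variable names are illustrative only.

```python
import pandas as pd

# The five ages from the example, including the outlier 3000
ages = pd.Series([30, 45, 50, 80, 3000])

mean_age = ages.mean()      # pulled far upward by the single outlier
median_age = ages.median()  # depends only on rank order, so it stays put

print(mean_age)    # 641.0
print(median_age)  # 50.0
```

The median stays at 50 because it depends on the order of the values rather than their magnitude, which is why it is often preferred as a summary when outliers are present.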

This outlier can have a detrimental effect on the model’s performance.

Outliers are not always noise, however. In tasks like email classification, the anomalous data points may be exactly what the model needs to detect, so removing them would discard the signal.

Methods for Detecting Outliers:

  1. Normal Distribution: Data within the range (mean – 3 * standard deviation) to (mean + 3 * standard deviation) is considered normal. Outliers are those falling outside this range.
  2. Skewed Distribution: Using the Interquartile Range (IQR), where:
  • Minimum: Q1 – 1.5 * IQR
  • Maximum: Q3 + 1.5 * IQR
  Values below the minimum or above the maximum are treated as outliers.
  3. Other Distributions: Using percentiles.

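How to Treat Outliers?

1. Trimming:

Trimming removes the outlier rows from the dataset. It is fast, but if the dataset contains many outliers, the data that remains can become too thin.

2. Capping:

Capping sets a lower and an upper boundary and replaces any value beyond a boundary with that boundary value.

If max = 80 and the outliers are 90 and 85, both become 80.

If min = 5 and the outliers are 3, 2, and 0, all become 5.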
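The two treatments can be sketched side by side as follows. This is a minimal illustration, assuming pandas; the boundaries 5 and 80 and the outlier values come from the capping example, while the in-range values (20, 40, 60) are made up for the demonstration.

```python
import pandas as pd

# Outliers 0, 2, 3 (below 5) and 85, 90 (above 80), plus some in-range values
values = pd.Series([0, 2, 3, 5, 20, 40, 60, 80, 85, 90])
lower, upper = 5, 80  # assumed boundaries, taken from the example

# Trimming: drop every row that falls outside the boundaries
trimmed = values[(values >= lower) & (values <= upper)]

# Capping: keep every row, but pull out-of-range values back to the boundary
capped = values.clip(lower=lower, upper=upper)

print(trimmed.tolist())  # [5, 20, 40, 60, 80]
print(capped.tolist())   # [5, 5, 5, 5, 20, 40, 60, 80, 80, 80]
```

Trimming shrinks the dataset from ten rows to five, whereas capping keeps all ten rows and only changes the extreme values.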

Techniques for Outlier Detection and Removal

1. Z-Score Treatment

Z-score treatment is built on standardization (also called z-score normalization), which transforms numerical data so that it has a mean of 0 and a standard deviation of 1. This is done by subtracting the mean from each value and dividing the result by the standard deviation. A value whose z-score is greater than 3 or less than –3 lies outside the range of mean ± 3 standard deviations and is treated as an outlier.
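A minimal sketch of z-score filtering is shown below, assuming pandas and a cutoff of 3 standard deviations. The sample data is invented for illustration: twenty plausible ages plus the value 3000. A larger sample is used on purpose, because in a very small sample (such as five values) no single point can reach a z-score of 3, however extreme it is.

```python
import pandas as pd

# Hypothetical sample: ages 20..39 plus one obviously wrong value
ages = pd.Series(list(range(20, 40)) + [3000])

# Standardize: subtract the mean, divide by the standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default)
z_scores = (ages - ages.mean()) / ages.std()

threshold = 3  # assumed cutoff
outliers = ages[z_scores.abs() > threshold]
cleaned = ages[z_scores.abs() <= threshold]

print(outliers.tolist())  # [3000]
print(len(cleaned))       # 20
```

Note that the outlier itself inflates both the mean and the standard deviation, so this approach works best when the data is roughly normal and outliers are rare.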

2. IQR-based Filtering

IQR-based filtering, also known as interquartile range (IQR) filtering, is a method for identifying and removing outliers in a dataset. It is based on the idea that the middle 50% of the values lie between the first quartile (Q1) and the third quartile (Q3), and that values lying far outside this range are suspect.

To perform IQR filtering, you first need to calculate the IQR of the data by subtracting the first quartile from the third quartile. Then, you can identify and remove the outliers by applying the following criteria:

  • Values that are less than Q1 – 1.5 * IQR are considered lower outliers.
  • Values that are greater than Q3 + 1.5 * IQR are considered upper outliers.
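The criteria above can be sketched as follows, reusing the five ages from the start of the article. This assumes pandas and its default linear interpolation for quantiles; other quantile definitions can give slightly different quartiles on small samples.

```python
import pandas as pd

ages = pd.Series([30, 45, 50, 80, 3000])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)  # 45.0
q3 = ages.quantile(0.75)  # 80.0
iqr = q3 - q1             # 35.0

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr  # -7.5
upper_fence = q3 + 1.5 * iqr  # 132.5

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
filtered = ages[(ages >= lower_fence) & (ages <= upper_fence)]

print(outliers.tolist())  # [3000]
print(filtered.tolist())  # [30, 45, 50, 80]
```

Because quartiles are based on rank, the fences are barely affected by the value 3000, so the method still works on this very small sample.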

3. Percentile

A percentile is a measure of the relative standing of a value in a dataset. It represents the value below which a certain percentage of the values in the dataset fall.

For example, if the 50th percentile of a dataset is 10, that means that 50% of the values in the dataset are less than or equal to 10. The 50th percentile is also known as the median.

Percentiles can be useful for understanding the distribution of values in a dataset and for identifying values that are unusually high or low. They can also be used to perform winsorization, which is a method for handling outliers in a dataset.

To calculate percentiles in Python, you can use the quantile method from the pandas library. It is available on both Series and DataFrame objects, takes the quantile as a fraction between 0 and 1 (for example, 0.5 for the 50th percentile), and returns the value at that quantile.

Here is an example of how you might use the quantile function to calculate the 50th percentile (median) of a Pandas Series:

import pandas as pd
# Load the data
df = pd.read_csv('data.csv')
# Calculate the 50th percentile (median)
median = df['value'].quantile(0.5)

In this example, the quantile function will return the median of the ‘value’ column in the DataFrame.

It is important to note that percentiles are sensitive to the size of the dataset. On small datasets, extreme percentiles in particular (such as the 1st or 99th) are estimated from very few values and may not accurately reflect the distribution. In addition, the interpretation of percentiles may depend on the context of the analysis.

4. Winsorization (Percentile-based Capping)

Winsorization is a percentile-based form of capping for handling outliers in a dataset. Unlike trimming, which removes extreme values, it replaces them with less extreme values in order to reduce the influence of outliers on the statistical properties of the data.

There are two main types of winsorization: single-sided winsorization and double-sided winsorization. Single-sided winsorization works on one tail only: values above an upper percentile (or, alternatively, below a lower percentile) are replaced with the value at that percentile. Double-sided winsorization works on both tails: the lowest values are replaced with the value at a lower percentile, and the highest values with the value at an upper percentile.
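Both variants can be sketched with the pandas quantile and clip methods. This is one possible approach under stated assumptions, not the only way to winsorize: the 5th and 95th percentiles are arbitrary example cutoffs, and the data (the integers 0 to 100) is invented for illustration.

```python
import pandas as pd

# Hypothetical data: the integers 0..100
values = pd.Series(range(0, 101))

# Percentile boundaries (assumed cutoffs: 5th and 95th percentiles)
p05 = values.quantile(0.05)
p95 = values.quantile(0.95)

# Double-sided: cap both tails at their percentile values
two_sided = values.clip(lower=p05, upper=p95)

# Single-sided: cap only the upper tail, leave the lower tail untouched
one_sided = values.clip(upper=p95)

print(two_sided.min(), two_sided.max())
print(one_sided.min(), one_sided.max())
```

In both cases the number of rows is unchanged; only the values beyond the chosen percentiles are replaced.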
