1. Delete the observations: If the dataset has a large number of observations and every class to be predicted is sufficiently represented in the training data, try deleting the observations with missing values. Doing so should not significantly change the data you feed to your model.

2. Replace missing values with the most frequent value: For categorical variables, you can impute missing values with the mode. Just make sure the class distribution is not highly skewed, because mode imputation makes the most frequent category even more dominant.

3. Develop a model to predict missing values: Train a classifier that treats the column with missing values as the dependent variable and the other features of the dataset as predictors, then impute the missing entries with the trained classifier's predictions.

Here’s a simple step-by-step guide for handling missing data:

1. Split Data:

  • Separate the data into two parts. One part contains the rows with known values, including the original output column. The other part consists of rows with missing values.

2. Cross-Validation:

  • Take the part with known values and divide it into subsets for cross-validation. This helps in selecting the best model.

3. Model Training:

  • Train various models using the cross-validated subsets. Evaluate the models using metrics to find the most accurate one. You can even explore different model options using techniques like grid search or randomized search.

4. Prediction:

  • Use the selected model to predict the missing values in the part of the data with unknown values.

Note:

  • This approach does not favor a single fixed value such as the mode: each missing entry is predicted from the other features of its row. Selecting the model through cross-validation gives you the most accurate of the candidate models you tried.
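The four steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the article's dataset: the column names `f1`, `f2`, and `category`, the two candidate models, and their parameter grids are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): "category" depends on f1, and 20 values are blanked out.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
df["category"] = np.where(df["f1"] > 0, "A", "B")
missing_idx = rng.choice(n, size=20, replace=False)
df.loc[missing_idx, "category"] = np.nan

# 1. Split the data into rows with known and unknown values.
features = ["f1", "f2"]
known = df[df["category"].notna()]
unknown = df[df["category"].isna()]

# 2-3. Cross-validate each candidate with a grid search and keep the best one.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
]
best_model, best_score = None, -np.inf
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(known[features], known["category"])
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_

# 4. Predict the missing values with the selected model.
df_imputed = df.copy()
df_imputed.loc[unknown.index, "category"] = best_model.predict(unknown[features])
```

`GridSearchCV` refits the best parameter setting on all known rows by default, so `best_model` can be used for prediction directly.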

4. Delete the variable: If a variable has an exceptionally large share of missing values, consider excluding it from further modelling. First make sure it is not important for predicting the target variable, i.e., its correlation with the target variable is very low or the information it carries is redundant.

For example, to implement this strategy we drop the entire column that contains missing values: for a given dataset, we drop Feature-1 completely and use only the remaining features to predict the target variable.

5. Apply unsupervised machine learning techniques: In this approach, we use unsupervised techniques such as K-Means or hierarchical clustering. The idea is to skip the columns that have missing values, take all the other columns except the target column, and create as many clusters as there are independent features (after dropping the columns with missing values). Finally, find the cluster into which each row with a missing value falls.

For example, to implement this strategy we drop the Feature-1 column and cluster the data on Feature-2 and Feature-3. After the clusters are formed, we observe which cluster each record with a missing value falls into, and the final dataset is ready for further analysis.
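A rough sketch of the clustering idea, assuming a toy DataFrame with the Feature-1, Feature-2, and Feature-3 columns from the example. The text stops at finding the cluster; filling each gap with the most common known Feature-1 value in that cluster is one possible follow-up and is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (hypothetical values): Feature-1 has two missing entries.
df = pd.DataFrame({
    "Feature-1": ["x", "x", np.nan, "y", "y", np.nan],
    "Feature-2": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    "Feature-3": [0.1, 0.2, 0.15, 5.0, 5.2, 4.9],
})

# Cluster on the complete columns only; two features, so two clusters.
complete_cols = ["Feature-2", "Feature-3"]
kmeans = KMeans(n_clusters=len(complete_cols), n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[complete_cols])

# Most common known Feature-1 value inside each cluster (mode() ignores NaN).
cluster_mode = df.groupby("cluster")["Feature-1"].agg(lambda s: s.mode()[0])

# Fill each missing entry with the mode of the cluster its row falls into.
df["Feature-1"] = df["Feature-1"].fillna(df["cluster"].map(cluster_mode))
```

If a cluster contains no known Feature-1 value at all, `s.mode()` is empty and the lambda raises an `IndexError`, so real data would need a fallback for that case.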

Implementation in Python

Import necessary dependencies.

Load and Read the Dataset.

Find the number of missing values per column.
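As a small sketch of this step, assuming a toy DataFrame in place of the article's dataset (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) standing in for the loaded dataset.
df = pd.DataFrame({
    "Feature-1": ["a", np.nan, "b", np.nan],
    "Feature-2": [1.0, 2.0, np.nan, 4.0],
    "target": [0, 1, 0, 1],
})

# isnull() marks each missing cell as True; sum() counts them column by column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```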

Apply Strategy 1 (Delete the missing observations).
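One way this step might look, assuming a toy DataFrame with a categorical column `color` and a `target` column (both names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value in the categorical column "color".
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red", "blue", "red"],
    "size": [1, 2, 3, 4, 5, 6],
    "target": [0, 1, 0, 1, 0, 1],
})

# Drop every row that has a missing value in "color".
df_clean = df.dropna(subset=["color"])

# Check how many rows were lost and whether every class is still represented.
rows_dropped = len(df) - len(df_clean)
class_counts = df_clean["target"].value_counts()
```

Comparing `class_counts` before and after the drop is a quick way to check that no class has become under-represented.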

Apply Strategy 2 (Replace missing values with the most frequent value).
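A minimal sketch of this step with plain pandas, assuming a toy categorical Series named `color` (hypothetical data):

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical) with two missing entries.
s = pd.Series(["red", "blue", np.nan, "red", np.nan, "green"], name="color")

# Share of each category among the known values; a strongly dominant
# category is a warning sign that mode imputation will add more skew.
shares = s.value_counts(normalize=True)

# mode() returns a Series because ties are possible, so take the first entry.
most_frequent = s.mode()[0]
s_filled = s.fillna(most_frequent)
```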

Apply Strategy 3 (Delete the variable that has missing values; listed as approach 4 above).

Apply Strategy 4 (Develop a model to predict missing values; listed as approach 3 above).
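Dropping the variable could be sketched like this. The toy data and the 50% cut-off are assumptions for illustration; the article does not specify a threshold.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): Feature-1 is missing in 4 of 6 rows.
df = pd.DataFrame({
    "Feature-1": ["a", np.nan, np.nan, np.nan, "b", np.nan],
    "Feature-2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Feature-3": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    "target": [0, 1, 0, 1, 0, 1],
})

threshold = 0.5  # hypothetical cut-off for "too many missing values"

# Fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Drop every column whose missing fraction exceeds the threshold.
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
df_reduced = df.drop(columns=cols_to_drop)
```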

For this strategy, we first encode the independent categorical columns with a one-hot encoder and the dependent categorical column with a label encoder.

– Read and Load the Encoded Dataset.

– Make missing records as our Testing data.

– Make non-missing records as our Training data.

Separate Dependent and Independent variables.

Fit our Logistic Regression model.

Predict the class for missing records.
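The steps of this strategy might be sketched as below. The toy DataFrame, with an independent column `city` and a dependent column `grade`, is hypothetical and stands in for the encoded dataset, which is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (hypothetical): "grade" is the column with missing values.
df = pd.DataFrame({
    "city": ["north", "north", "south", "south", "north", "south", "north", "south"],
    "grade": ["high", "high", "low", "low", "high", "low", np.nan, np.nan],
})

# Non-missing records form the training data; missing records form the testing data.
train = df[df["grade"].notna()]
test = df[df["grade"].isna()]

# Independent categorical column -> one-hot vectors.
onehot = OneHotEncoder(handle_unknown="ignore")
X_train = onehot.fit_transform(train[["city"]])
X_test = onehot.transform(test[["city"]])

# Dependent categorical column -> integer labels.
label_enc = LabelEncoder()
y_train = label_enc.fit_transform(train["grade"])

# Fit the logistic regression model on the known records.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the class for the missing records and write it back as text labels.
predicted = label_enc.inverse_transform(model.predict(X_test))
df.loc[test.index, "grade"] = predicted
```

Fitting both encoders on the training rows only keeps the testing rows from influencing the encoding.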

This completes our implementation part!

Handling Missing Categorical Values with a ‘Missing’ Category

Here, ‘Missing’ is a new categorical value that we fill in wherever data is missing.

Datasets: Regression Technique EDA (kaggle.com)

# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading data and selecting important columns
df = pd.read_csv('train.csv', usecols=['GarageQual', 'FireplaceQu', 'SalePrice'])

# Displaying the first 5 rows
df.head()

# Calculating the percentage of missing values
df.isnull().mean() * 100

# Plotting a bar chart for columns with missing data
df['GarageQual'].value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('GarageQual')
plt.ylabel('Number of houses')

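# Filling missing data with the word 'Missing'
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
df['GarageQual'] = df['GarageQual'].fillna('Missing')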

# Plotting the bar chart again, sorted by count in descending order
df['GarageQual'].value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('GarageQual')
plt.ylabel('Number of houses')

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['SalePrice']), df['SalePrice'], test_size=0.2)

# Using the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# Fitting the imputer on the training set, then transforming both sets
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

Explanation:

  1. Import Libraries: Import necessary libraries, including pandas, numpy, and matplotlib.
  2. Read Data: Read the ‘train.csv’ file, focusing on specific columns (‘GarageQual’, ‘FireplaceQu’, ‘SalePrice’).
  3. Display Data: Show the first 5 rows of the DataFrame.
  4. Calculate Missing Percentage: Calculate the percentage of missing values in the dataset.
  5. Plot Bar Chart: Create a bar chart for the ‘GarageQual’ column showing the distribution of houses.
  6. Fill Missing Data: Replace missing values in ‘GarageQual’ with the word ‘Missing’.
  7. Plot Bar Chart Again: Plot the bar chart again to visualize the changes.
  8. Split Data: Split the data into training and testing sets.
  9. Use SimpleImputer: Utilize the SimpleImputer class from sklearn to handle missing values with a constant fill value (‘Missing’).
  10. Transform Data: Transform the training and testing sets using the imputer.

Handling Missing Values by Frequent Imputer

We can also handle missing categorical values with a most-frequent (mode) imputer.

# Handling Missing Values by Frequent Imputer (Mode)

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read data and select important columns
df = pd.read_csv('train.csv', usecols=['GarageQual', 'FireplaceQu', 'SalePrice'])
df.head()

# Check the percentage of null values
df.isnull().mean() * 100

# Plot value counts of GarageQual
df['GarageQual'].value_counts().plot(kind='bar')

# Calculate the most frequent value (mode)
df['GarageQual'].mode()

# Using sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['SalePrice']), df['SalePrice'], test_size=0.2)

# Using the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')

# Fitting the imputer on the training set, then transforming both sets
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

imputer.statistics_

Explanation:

  1. Import Libraries: Import necessary libraries, including pandas, numpy, and matplotlib.
  2. Read Data: Read the ‘train.csv’ file, focusing on specific columns (‘GarageQual’, ‘FireplaceQu’, ‘SalePrice’).
  3. Display Data: Show the first 5 rows of the DataFrame.
  4. Calculate Missing Percentage: Calculate the percentage of missing values in the dataset.
  5. Plot Value Counts: Create a bar chart for the ‘GarageQual’ column showing the distribution of houses.
  6. Calculate Most Frequent Value (Mode): Determine the most frequent value (mode) for ‘GarageQual’.
  7. Split Data: Split the data into training and testing sets.
  8. Use SimpleImputer: Utilize the SimpleImputer class from sklearn to handle missing values by replacing them with the most frequent value.
  9. Transform Data: Transform the training and testing sets using the imputer.
  10. Display Most Frequent Value: Show the most frequent value calculated by the imputer.
