Random Sample Imputation

Random sample imputation works for both numerical and categorical missing values. scikit-learn does not support it directly, so we implement it with pandas.

Advantages:

  • Preserves the variance of the variable.
  • Well suited for linear models, since it does not distort the distribution, regardless of the percentage of missing values.

Disadvantage:

  • Memory heavy for deployment: the original training set must be stored so that values can be drawn from it to replace NAs in incoming observations.
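Since the method simply draws from observed values, the same recipe works for a categorical variable. A minimal toy sketch (the labels are illustrative, not taken from the dataset used below):

# Toy sketch: random sample imputation for a categorical variable
import pandas as pd

s = pd.Series(['S', 'C', None, 'Q', None, 'S'])
missing = s.isnull()
# Draw replacement labels at random from the observed categories
s[missing] = s.dropna().sample(missing.sum(), random_state=2).values
print(s)
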
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# Read CSV file
df = pd.read_csv('train.csv', usecols=['Age', 'Fare', 'Survived'])

# Calculate percentage of missing values
missing_percentage = df.isnull().mean() * 100
print(missing_percentage)

# Train-test split
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Create imputed columns
X_train['Age_imputed'] = X_train['Age']
X_test['Age_imputed'] = X_test['Age']

# Impute missing values with random samples drawn from the observed
# training values (use .loc to avoid chained assignment; fix
# random_state so the draw is reproducible)
X_train.loc[X_train['Age_imputed'].isnull(), 'Age_imputed'] = X_train['Age'].dropna().sample(X_train['Age'].isnull().sum(), random_state=2).values
X_test.loc[X_test['Age_imputed'].isnull(), 'Age_imputed'] = X_train['Age'].dropna().sample(X_test['Age'].isnull().sum(), random_state=2).values

# Plot original vs imputed variable distributions
sns.kdeplot(X_train['Age'], label='Original')
sns.kdeplot(X_train['Age_imputed'], label='Imputed')
plt.legend()
plt.show()

# Compare variances
print('Original variable variance: ', X_train['Age'].var())
print('Variance after random imputation: ', X_train['Age_imputed'].var())

# Compare covariances
print(X_train[['Fare', 'Age', 'Age_imputed']].cov())

# Boxplot to visualize outliers
X_train[['Age', 'Age_imputed']].boxplot()
plt.show()

Random Sample Imputation

  1. Import Libraries: Import NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn's train_test_split.
  2. Read CSV File: Load your dataset, and in this case, we’re using a CSV file with columns ‘Age’, ‘Fare’, and ‘Survived’.
  3. Calculate Missing Percentage: Find the percentage of missing values in each column.
  4. Train-Test Split: Split the data into training and testing sets.
  5. Create Imputed Columns: Create new columns for imputed values, initially copying the original ‘Age’ column.
  6. Impute Missing Values with Random Samples: For rows with missing ‘Age’ values, replace them with random samples from the non-missing ‘Age’ values.
  7. Plot Original vs. Imputed Variable Distributions: Visualize the distribution of the ‘Age’ variable before and after imputation using Seaborn.
  8. Compare Variances: Compare the variance of the original ‘Age’ variable with the imputed ‘Age’ variable.
  9. Compare Covariances: Explore the covariance between ‘Fare’, ‘Age’, and the imputed ‘Age’ variables.
  10. Boxplot for Outliers: Visualize outliers in the ‘Age’ and imputed ‘Age’ variables using a boxplot.

Missing Indicator for Handling Missing Values

First, we build a model to see whether the distinction between missing and non-missing values itself carries predictive signal. The missing indicator adds a boolean feature that is True for a missing value and False for a non-missing value.
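As a minimal sketch of the idea in isolation, sklearn's MissingIndicator can be applied to a small toy array to show the True/False output directly:

import numpy as np
from sklearn.impute import MissingIndicator

X_toy = np.array([[1.0, np.nan],
                  [np.nan, 3.0],
                  [4.0, 5.0]])
mi = MissingIndicator(features='all')  # flag every feature, not only those with NAs
print(mi.fit_transform(X_toy))
# [[False  True]
#  [ True False]
#  [False False]]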

# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression

# Read CSV file
df = pd.read_csv('train.csv', usecols=['Age', 'Fare', 'Survived'])

# Train-test split
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Create a simple imputer
si = SimpleImputer()
X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

# Train logistic regression on the data without the missing indicator
clf = LogisticRegression()
clf.fit(X_train_trf2, y_train)
y_pred = clf.predict(X_test_trf2)

# Calculate accuracy without missing indicator
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

# Add missing indicator to the imputer
si = SimpleImputer(add_indicator=True)
X_train_trf3 = si.fit_transform(X_train)
X_test_trf3 = si.transform(X_test)

# Train logistic regression on the data with the missing indicator
clf.fit(X_train_trf3, y_train)
y_pred = clf.predict(X_test_trf3)

# Calculate accuracy with missing indicator
print(accuracy_score(y_test, y_pred))
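
With add_indicator=True, the transformer appends one boolean column per feature that contained missing values during fit (here only Age), so the classifier can learn from missingness itself. A quick sanity check using the arrays created above:

# The indicator adds one column per feature that had missing values
print(X_train_trf2.shape)  # imputed only: (n_rows, 2) -> Age, Fare
print(X_train_trf3.shape)  # with indicator: (n_rows, 3) -> Age, Fare, Age-was-missing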

Missing Indicator for Handling Missing Values

  1. Import Libraries: Import necessary libraries for data handling and modeling.
  2. Read CSV File: Load your dataset, similar to the first section.
  3. Train-Test Split: Split the data into training and testing sets.
  4. Create Simple Imputer: Create a SimpleImputer to handle missing values.
  5. Train Logistic Regression Without Missing Indicator: Train a logistic regression model on the data without a missing indicator and evaluate its accuracy.
  6. Add Missing Indicator to Imputer: Modify the imputer to add a missing indicator.
  7. Train Logistic Regression With Missing Indicator: Train a logistic regression model on the data with the missing indicator and evaluate its accuracy.

Automatically Select Imputer

GridSearchCV is part of sklearn's model_selection module. It loops through a predefined grid of hyperparameters, fitting your estimator (model) on the training set for each combination, so that at the end you can select the best parameters from the grid. The fitted search object can then be used for prediction like any other estimator.
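
Before wiring GridSearchCV into the full imputation pipeline below, here is a minimal self-contained sketch on synthetic data (make_classification is used purely for illustration):

# Minimal GridSearchCV sketch on synthetic data
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1.0, 10]}, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))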


# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Read CSV file
df = pd.read_csv('train.csv')

# Drop unnecessary columns
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

# Train-test split
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Define numerical and categorical features
numerical_features = ['Age', 'Fare']
categorical_features = ['Embarked', 'Sex']

# Create preprocessor using ColumnTransformer; each branch is a
# Pipeline so its imputer can be tuned by name in the grid below
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Create a pipeline with preprocessor and logistic regression classifier
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Define parameter grid for hyperparameter tuning
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
    'classifier__C': [0.1, 1.0, 10, 100]
}
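
# Note: the double-underscore names address nested parameters:
# 'preprocessor__num__imputer__strategy' points at the 'preprocessor'
# step, its transformer named 'num', that pipeline's 'imputer' step,
# and the imputer's 'strategy' parameter. All valid names can be
# listed with clf.get_params().keys().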

# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)

# Print the best parameters and relevant cross-validation results
print("Best params:")
print(grid_search.best_params_)
print(f"Internal CV score: {grid_search.best_score_:.3f}")

# Display relevant columns of the cross-validation results
# (one row per parameter combination tried by the grid search)
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['mean_test_score', 'std_test_score', 'params']]
      .sort_values('mean_test_score', ascending=False)
      .head())

Automatically Select Imputer

  1. Import Libraries: Import necessary libraries for data handling, modeling, and hyperparameter tuning.
  2. Read CSV File: Load your dataset, and in this case, drop unnecessary columns.
  3. Train-Test Split: Split the data into training and testing sets.
  4. Define Numerical and Categorical Features: Specify the numerical and categorical features in your dataset.
  5. Create Preprocessor Using ColumnTransformer: Set up a ColumnTransformer whose numerical branch imputes with the median and scales the features, and whose categorical branch imputes with the most frequent value and one-hot encodes the result.
  6. Create a Pipeline: Set up a pipeline that includes the preprocessor and a logistic regression classifier.
  7. Define Parameter Grid for Hyperparameter Tuning: Specify a parameter grid for hyperparameter tuning.
  8. Perform Grid Search for Hyperparameter Tuning: Use GridSearchCV to find the best hyperparameters for the pipeline.
  9. Print Best Parameters and Cross-Validation Results: Display the best parameters and relevant cross-validation results.
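
As a follow-up, the fitted grid_search object can evaluate the winning pipeline on the held-out test set directly, since GridSearchCV refits the best estimator on the full training set by default:

# Score the best pipeline on the held-out test set
print(f"Test accuracy: {grid_search.score(X_test, y_test):.3f}")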

These code sections cover various techniques for handling missing data, from random sample imputation to using missing indicators and automatically selecting the imputer.
