KNN Imputer is an imputation method for handling missing values in a dataset. It stands for “k-nearest neighbors” imputer.

KNN Imputer

The KNN Imputer method works by replacing each missing value in a feature with the average (mean) of that feature across the k nearest neighboring samples. The number of nearest neighbors (k) is a hyperparameter that can be specified by the user.

To use the KNN Imputer method, you only need scikit-learn installed: the KNNImputer class lives in the sklearn.impute module, and you can use it directly to perform the imputation.

Here is an example of how you might use the KNN Imputer method to impute missing values in a Pandas DataFrame:

from sklearn.impute import KNNImputer
import pandas as pd

# Load the data
df = pd.read_csv('data.csv')

# Initialize the imputer
imputer = KNNImputer(n_neighbors=5)

# Impute the missing values
df_imputed = imputer.fit_transform(df)

In this example, the KNN Imputer replaces each missing value in the DataFrame with the mean of the corresponding feature across the 5 nearest neighboring rows.

It is important to note that the KNN Imputer method can be computationally expensive for large datasets, as it requires calculating the distance between all pairs of observations. In addition, the quality of the imputations may depend on the choice of the k hyperparameter.

Imputation for completing missing values using k-Nearest Neighbors.

Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
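As a minimal, self-contained sketch of this behavior (the small array below is illustrative data, not taken from the dataset used later):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two cells are missing
X = [[1, 2, np.nan],
     [3, 4, 3],
     [np.nan, 6, 5],
     [8, 8, 7]]

# Each missing cell is filled with the mean of that feature
# across the 2 nearest rows (by nan_euclidean distance)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Row 0's missing value becomes (3 + 5) / 2 = 4.0
# Row 2's missing value becomes (3 + 8) / 2 = 5.5
```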

nan_euclidean distance:

It is used when a plain Euclidean distance cannot be computed because some coordinates are missing. The distance is calculated over only the coordinates present in both samples, scaled up to compensate for the missing ones:

dist(x, y) = sqrt(weight * squared distance over present coordinates), where weight = (total # of coordinates) / (# of present coordinates)
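A short worked check of this formula (the two vectors are illustrative; scikit-learn exposes the same computation as sklearn.metrics.pairwise.nan_euclidean_distances):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

x = [[3.0, np.nan, 5.0]]
y = [[1.0, 0.0, 0.0]]

# Coordinates 0 and 2 are present in both vectors, coordinate 1 is not,
# so weight = 3 total coordinates / 2 present coordinates = 1.5
manual = np.sqrt(1.5 * ((3 - 1) ** 2 + (5 - 0) ** 2))
lib = nan_euclidean_distances(x, y)[0, 0]
print(manual, lib)  # both are sqrt(1.5 * 29) ≈ 6.595
```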

# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Reading the CSV file and selecting relevant columns
df = pd.read_csv('train.csv')[['Age', 'Pclass', 'Fare', 'Survived']]

# Displaying the first few rows of the dataset
df.head()

# Calculating the percentage of missing values in each column
df.isnull().mean() * 100

# Splitting the data into features (X) and target variable (y)
X = df.drop(columns=['Survived'])
y = df['Survived']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Filling missing values using KNNImputer with k=3 and weights='distance'
knn = KNNImputer(n_neighbors=3, weights='distance')
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)

# Training a logistic regression model on the KNN imputed data
lr = LogisticRegression()
lr.fit(X_train_trf, y_train)
y_pred = lr.predict(X_test_trf)

# Calculating and displaying the accuracy score for KNN imputed data
accuracy_knn_imputer = accuracy_score(y_test, y_pred)
print("Accuracy with KNN Imputer:", accuracy_knn_imputer)

# Filling missing values using SimpleImputer with mean strategy
si = SimpleImputer()
X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

# Training a logistic regression model on the mean imputed data
lr.fit(X_train_trf2, y_train)
y_pred2 = lr.predict(X_test_trf2)

# Calculating and displaying the accuracy score for mean imputed data
accuracy_simple_imputer = accuracy_score(y_test, y_pred2)
print("Accuracy with Simple Imputer (mean):", accuracy_simple_imputer)

# Comparing the accuracy of KNN Imputer and Simple Imputer
if accuracy_knn_imputer > accuracy_simple_imputer:
    print("Accuracy of KNN Imputer is greater than the accuracy of Simple Imputer.")
else:
    print("Accuracy of Simple Imputer is greater than or equal to the accuracy of KNN Imputer.")

Explanation:

  • The code reads a dataset and selects specific columns.
  • It calculates the percentage of missing values in each column.
  • The data is split into features (X) and the target variable (y), followed by the creation of training and testing sets.
  • Missing values are imputed using KNNImputer with k=3 and weights='distance'.
  • A logistic regression model is trained on the KNN imputed data, and predictions are made on the test set.
  • The accuracy score is calculated for KNN imputed data.
  • The same process is repeated using SimpleImputer with the mean strategy, and the accuracy score is calculated.
  • The final section compares the accuracy of KNN Imputer and Simple Imputer, providing a conclusion based on the comparison.

Multiple Imputation by Chained Equations (MICE) / Iterative Imputer

Multivariate Imputation using Iterative Imputer

  • This strategy estimates each feature from all the others, imputing missing values by modeling each feature as a function of the other features in a round-robin fashion.

Assumptions about missingness:

  • MCAR (missing completely at random)
  • MAR (missing at random)
  • MNAR (missing not at random)

Advantage: accurate imputation

Disadvantage: slow and memory-intensive process

If your dataset has missing values and you want to fill them in a way that considers relationships between different features, this strategy is a more advanced approach. It assumes the missingness in your data follows one of the mechanisms above: completely at random, at random, or not at random. The advantage is that it provides accurate imputation, capturing complex relationships. However, the disadvantage is that it can be slow and may require more memory than simpler methods.


Example code:

# Assuming you have a DataFrame 'data' whose feature columns contain missing values

from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer
import pandas as pd

# Separate the features from the target column
features = data.drop(columns=['target_column'])
target = data['target_column']

# Initialize IterativeImputer
imputer = IterativeImputer(random_state=0)

# Fit and transform the features
imputed_features = imputer.fit_transform(features)

# Put the imputed values back into the original DataFrame
data[features.columns] = imputed_features

# Now 'data' contains imputed values for its feature columns

Explanation for Beginners:

  • Multivariate imputation is a technique that estimates missing values considering relationships between different features.
  • The advantage is accurate imputation, but it comes at the cost of being a slower and more memory-intensive process.
  • The example code demonstrates how to use the IterativeImputer from scikit-learn for this purpose, assuming a DataFrame ‘data’ with missing values.
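Since the snippet above assumes an existing DataFrame, here is a fully self-contained sketch on a tiny made-up array in which the second column is roughly twice the first, so the imputer can exploit that relationship:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer

# Toy data: the fully observed rows follow y ≈ 2x
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature is modeled from the others in a round-robin fashion
imputer = IterativeImputer(random_state=0, max_iter=10)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Observed values pass through unchanged; only the two NaN cells are filled in.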
Dataset: https://www.kaggle.com/datasets/emelgizemay/50-startups
# Multivariate Imputation using Iterative Imputer (Manual Iteration)

# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Reading the dataset; this manual demo uses only the three spend columns
df = np.round(pd.read_csv('50_Startups.csv')[['R&D Spend', 'Administration', 'Marketing Spend']] / 10000)
np.random.seed(9)
df = df.sample(5)

# Making some cells artificially empty
df.iloc[1, 0] = np.nan  # 2nd row, 1st column
df.iloc[3, 1] = np.nan  # 4th row, 2nd column
df.iloc[-1, -1] = np.nan  # last row, last column

# 0th Iteration - Impute all missing values with the mean of respective columns
df0 = pd.DataFrame()
df0['R&D Spend'] = df['R&D Spend'].fillna(df['R&D Spend'].mean())
df0['Administration'] = df['Administration'].fillna(df['Administration'].mean())
df0['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())

# Display the result of the 0th iteration
df0

# Iterative Imputation Process
df1 = df0.copy()

# Impute the missing value in column 1
X = df1.iloc[[0, 2, 3, 4], 1:3]
y = df1.iloc[[0, 2, 3, 4], 0]
lr = LinearRegression()
lr.fit(X, y)
df1.iloc[1, 0] = lr.predict(df1.iloc[1, 1:].values.reshape(1, 2))

# Impute the missing value in column 2
X = df1.iloc[[0, 1, 2, 4], [0, 2]]
y = df1.iloc[[0, 1, 2, 4], 1]
lr.fit(X, y)
df1.iloc[3, 1] = lr.predict(df1.iloc[3, [0, 2]].values.reshape(1, 2))

# Impute the missing value in column 3
X = df1.iloc[0:4, 0:2]
y = df1.iloc[0:4, -1]
lr.fit(X, y)
df1.iloc[4, -1] = lr.predict(df1.iloc[4, 0:2].values.reshape(1, 2))

# Difference between 0th and 1st iterations
df1 - df0

# Repeat the iterative process until the difference becomes zero
df2 = df1.copy()
X = df2.iloc[[0, 2, 3, 4], 1:3]
y = df2.iloc[[0, 2, 3, 4], 0]
lr.fit(X, y)
df2.iloc[1, 0] = lr.predict(df2.iloc[1, 1:].values.reshape(1, 2))
X = df2.iloc[[0, 1, 2, 4], [0, 2]]
y = df2.iloc[[0, 1, 2, 4], 1]
lr.fit(X, y)
df2.iloc[3, 1] = lr.predict(df2.iloc[3, [0, 2]].values.reshape(1, 2))
X = df2.iloc[0:4, 0:2]
y = df2.iloc[0:4, -1]
lr.fit(X, y)
df2.iloc[4, -1] = lr.predict(df2.iloc[4, 0:2].values.reshape(1, 2))

# Difference between 1st and 2nd iterations
df2 - df1

# Repeat the process until the difference becomes zero
df3 = df2.copy()
X = df3.iloc[[0, 2, 3, 4], 1:3]
y = df3.iloc[[0, 2, 3, 4], 0]
lr.fit(X, y)
df3.iloc[1, 0] = lr.predict(df3.iloc[1, 1:].values.reshape(1, 2))
X = df3.iloc[[0, 1, 2, 4], [0, 2]]
y = df3.iloc[[0, 1, 2, 4], 1]
lr.fit(X, y)
df3.iloc[3, 1] = lr.predict(df3.iloc[3, [0, 2]].values.reshape(1, 2))
X = df3.iloc[0:4, 0:2]
y = df3.iloc[0:4, -1]
lr.fit(X, y)
df3.iloc[4, -1] = lr.predict(df3.iloc[4, 0:2].values.reshape(1, 2))

# Difference between 2nd and 3rd iterations
df3 - df2

# Continue the process until the difference becomes zero
# ... Repeat the above steps until the difference between iterations becomes zero ...

Explanation:

  • The code performs multivariate imputation using an iterative process.
  • It starts by filling missing values with column means (0th Iteration).
  • Subsequent iterations involve building a regression model on non-missing values and predicting the missing ones.
  • The process continues until the imputed values converge, and the difference between iterations becomes zero.
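The manual steps above can be folded into a single convergence loop. This is a sketch on synthetic stand-in data (the column names mirror the example, but the values are randomly generated, and the iteration cap of 100 is an arbitrary safeguard):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the sampled 50_Startups rows
rng = np.random.default_rng(9)
df = pd.DataFrame(rng.uniform(1, 20, size=(5, 3)),
                  columns=['R&D Spend', 'Administration', 'Marketing Spend'])
df.iloc[1, 0] = np.nan
df.iloc[3, 1] = np.nan
df.iloc[4, 2] = np.nan

missing = df.isna()
imputed = df.fillna(df.mean())  # 0th iteration: column means

for _ in range(100):
    previous = imputed.copy()
    for col in df.columns:
        rows = missing[col]
        if not rows.any():
            continue
        others = [c for c in df.columns if c != col]
        lr = LinearRegression()
        # Fit on rows where this column was observed, using the current
        # imputed values of the other columns, then predict the missing cells
        lr.fit(imputed.loc[~rows, others], imputed.loc[~rows, col])
        imputed.loc[rows, col] = lr.predict(imputed.loc[rows, others])
    if np.allclose(imputed, previous):  # difference between iterations ~ zero
        break

print(imputed)
```

Observed cells are never overwritten; only the cells recorded in the missing-value mask are re-estimated on each pass.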
