Imputing Numerical Data (Mean/Median)

This post demonstrates how to handle missing numerical data using the SimpleImputer class from scikit-learn. Today, our focus will be on mean/median imputation.

Advantages:

  • Simplicity

Disadvantages:

  • Alters the shape of the distribution
  • Sensitivity to outliers
  • Affects covariance and correlation

When to Use:

  • Missing data is completely at random (MCAR)
  • The proportion of missing data is less than 5%
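The distortion listed under the disadvantages is easy to demonstrate on a toy series (the numbers below are hypothetical, for illustration only): because every filled value sits exactly at the mean, mean imputation can only shrink the variance.

```python
import numpy as np
import pandas as pd

# Toy series with two missing values (hypothetical data)
s = pd.Series([20, 25, 30, np.nan, 40, np.nan, 50], dtype="float64")

# Mean imputation: every NaN becomes the mean of the observed values
imputed = s.fillna(s.mean())

# The filled values contribute zero squared deviation from the mean,
# so the variance after imputation is always <= the original variance.
print(s.var())        # variance of observed values only
print(imputed.var())  # smaller: distribution has been squeezed toward the mean
assert imputed.var() < s.var()
```

The same squeeze is what flattens the KDE curves and weakens the covariances shown later in this post.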

Let’s break down the code step by step:

Dataset: imputing numerical data (kaggle.com)

Importing Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

These lines import the required libraries for data manipulation, numerical operations, plotting, and scikit-learn functionalities.

Loading and Exploring the Dataset

df = pd.read_csv('../input/imputing-numerical-data/titanic_toy.csv')
df.head()
df.info()
df.isnull().mean()

The code reads the Titanic toy dataset and previews it with the head(), info(), and isnull().mean() functions; the last call reports the fraction of missing values in each column.

Handling Missing Values Manually (Before Using Scikit-Learn)

# Split into features/target and train/test sets
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Calculating mean and median on the training set
mean_age = X_train['Age'].mean()
median_age = X_train['Age'].median()
mean_fare = X_train['Fare'].mean()
median_fare = X_train['Fare'].median()

# Filling missing values
X_train['Age_median'] = X_train['Age'].fillna(median_age)
X_train['Age_mean'] = X_train['Age'].fillna(mean_age)
X_train['Fare_median'] = X_train['Fare'].fillna(median_fare)
X_train['Fare_mean'] = X_train['Fare'].fillna(mean_fare)

This section calculates the mean and median of the ‘Age’ and ‘Fare’ variables on the training set and manually fills the missing values. Computing these statistics on the training data only prevents information from the test set leaking into the imputation.

Comparing Variance Before and After Imputation

# Comparing variance before and after filling in mean/median
print('Original Age variable variance: ', X_train['Age'].var())
print('Age Variance after median imputation: ', X_train['Age_median'].var())
print('Age Variance after mean imputation: ', X_train['Age_mean'].var())
print('Original Fare variable variance: ', X_train['Fare'].var())
print('Fare Variance after median imputation: ', X_train['Fare_median'].var())
print('Fare Variance after mean imputation: ', X_train['Fare_mean'].var())

This code compares the variance of the ‘Age’ and ‘Fare’ variables before and after mean/median imputation. Because every missing entry is replaced by a single central value, the variance after imputation is lower than the original.

Visualizing the Impact of Imputation on Distribution

This code segment creates a KDE (Kernel Density Estimation) plot to visualize the distribution of the ‘Age’ variable alongside its median- and mean-imputed versions, plus a boxplot to highlight potential outliers. Let’s break down the code step by step:

fig = plt.figure()
ax = fig.add_subplot(111)

# Original variable distribution
X_train['Age'].plot(kind='kde', ax=ax)

# Variable imputed with the median
X_train['Age_median'].plot(kind='kde', ax=ax, color='red')

# Variable imputed with the mean
X_train['Age_mean'].plot(kind='kde', ax=ax, color='green')

# Adding legends
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

X_train.cov()

# Outlier of data
X_train[['Age', 'Age_median', 'Age_mean']].boxplot()

Explanation:

  1. Creating Figure and Subplot:
  • fig = plt.figure() initializes a new figure.
  • ax = fig.add_subplot(111) creates a subplot within the figure (1 row, 1 column, and the 1st subplot).

2. Plotting Original and Imputed Distributions:

  • X_train['Age'].plot(kind='kde', ax=ax) plots the KDE for the original ‘Age’ variable.
  • X_train['Age_median'].plot(kind='kde', ax=ax, color='red') and X_train['Age_mean'].plot(kind='kde', ax=ax, color='green') plot KDEs for ‘Age’ imputed with the median and mean, respectively.

3. Adding Legends:

  • lines, labels = ax.get_legend_handles_labels() retrieves legend handles and labels.
  • ax.legend(lines, labels, loc='best') adds a legend to the plot.

4. Calculating Covariance Matrix:

  • X_train.cov() calculates the covariance matrix of all numerical columns in X_train, which lets you compare the covariances involving the original ‘Age’ column with those of its imputed versions.

5. Boxplot for Outliers:

  • X_train[['Age', 'Age_median', 'Age_mean']].boxplot() creates a boxplot to visualize the distribution and identify potential outliers for the original ‘Age’ variable and its imputed versions.

This code snippet provides a comprehensive visual representation of the distribution of the ‘Age’ variable before and after imputation using KDE plots and highlights potential outliers through a boxplot. The legend aids in distinguishing between the different distributions.

Using Scikit-Learn for Imputation

The provided code segment demonstrates how to use scikit-learn for imputing missing values in a dataset. Let’s break down the code step by step:

# Using scikit-learn for imputation
imputer1 = SimpleImputer(strategy='median')
imputer2 = SimpleImputer(strategy='mean')

trf = ColumnTransformer([
    ('imputer1', imputer1, ['Age']),
    ('imputer2', imputer2, ['Fare'])
], remainder='passthrough')

trf.fit(X_train)
trf.named_transformers_['imputer1'].statistics_
trf.named_transformers_['imputer2'].statistics_
X_train = trf.transform(X_train)
X_test = trf.transform(X_test)

Explanation:

  1. Imputer Initialization:
  • imputer1 and imputer2 are instances of the SimpleImputer class from scikit-learn. They are configured with strategies ‘median’ and ‘mean’, respectively.

2. Column Transformer:

  • ColumnTransformer is used to apply different imputation strategies to specific columns.
  • 'imputer1' is applied to the ‘Age’ column, and 'imputer2' is applied to the ‘Fare’ column.
  • The remainder='passthrough' parameter ensures that columns not specified will be passed through without any transformation.

3. Fitting the Transformer:

  • trf.fit(X_train) fits the transformer on the training data (X_train).

4. Accessing Imputation Statistics:

  • trf.named_transformers_['imputer1'].statistics_ and trf.named_transformers_['imputer2'].statistics_ provide access to the statistics (in this case, the median and mean values) computed during the fitting process.

5. Transforming the Data:

  • X_train = trf.transform(X_train) and X_test = trf.transform(X_test) apply the imputation to the training and test sets, respectively. Note that ColumnTransformer returns NumPy arrays, so X_train and X_test are no longer DataFrames after this step.

In summary, this code efficiently uses scikit-learn’s SimpleImputer along with ColumnTransformer to impute missing values separately for the ‘Age’ and ‘Fare’ columns, based on the specified strategies (median and mean, respectively). The fitted transformer is then used to transform the original datasets, completing the imputation process.
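One gotcha worth demonstrating on a self-contained example (the three-column frame below is hypothetical, standing in for X_train): ColumnTransformer emits the *transformed* columns first and the passthrough columns last, so you must account for the new column order when rebuilding a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train: two columns with NaNs, one complete
X = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.83, np.nan],
    "Family": [1, 0, 2],
})

trf = ColumnTransformer([
    ("imputer1", SimpleImputer(strategy="median"), ["Age"]),
    ("imputer2", SimpleImputer(strategy="mean"), ["Fare"]),
], remainder="passthrough")

out = trf.fit_transform(X)  # NumPy array, no NaNs left

# Transformed columns come first, passthrough columns last:
# here the order happens to match (Age, Fare, Family), but it is
# determined by the transformer list, not the original frame.
X_imputed = pd.DataFrame(out, columns=["Age", "Fare", "Family"], index=X.index)
assert not np.isnan(out).any()
```

Rebuilding the DataFrame this way keeps the downstream code readable after the array-returning transform step.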

This comprehensive code demonstrates various aspects of handling missing numerical data, including manual imputation, comparison of variances, visualization of distribution impact, and the use of scikit-learn for efficient imputation.

Arbitrary Value Imputation:

In this technique, every occurrence of a missing value (NA) in a variable is replaced with a single arbitrary value chosen to lie well outside the variable’s natural range. Common choices for numerical variables are 0, 999, -999 (or other combinations of 9s), large sentinels such as 99999999 or -9999999, or -1 (if the distribution is positive); for categorical variables, labels like “Missing” or “Not defined” are used.

Arbitrary value imputation is typically applied when the data are not missing at random (MNAR): the arbitrary value effectively groups the missing observations and flags them to the model.

The method is suitable for both numerical and categorical variables.


Imputing Numerical Data using Arbitrary Value (Without scikit-learn)

# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read data
df = pd.read_csv('titanic_toy.csv')

# Show the first five elements
df.head()

# Fraction of missing values in each column
df.isnull().mean()

# Split the data into training and testing sets
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Arbitrary Value Imputation without scikit-learn
# Define arbitrary values
arbitrary_value_age = 99
arbitrary_value_fare = 999

# Impute missing values in the 'Age' column
X_train['Age'] = X_train['Age'].fillna(arbitrary_value_age)
X_test['Age'] = X_test['Age'].fillna(arbitrary_value_age)

# Impute missing values in the 'Fare' column
X_train['Fare'] = X_train['Fare'].fillna(arbitrary_value_fare)
X_test['Fare'] = X_test['Fare'].fillna(arbitrary_value_fare)

X_train

Explanation:

  1. Import Libraries: Import pandas and numpy for data manipulation and handling.
  2. Read Data: Load the dataset (‘titanic_toy.csv’) into a pandas DataFrame and display the first five elements to understand the data.
  3. Calculate Missing Values: Check the percentage of missing values in the dataset.
  4. Split Data: Divide the dataset into features (X) and the target variable (y). Split it into training and testing sets using the previously imported train_test_split function.
  5. Arbitrary Value Imputation without scikit-learn:
  • Define arbitrary values (arbitrary_value_age and arbitrary_value_fare).
  • Impute missing values in the ‘Age’ column for both training and testing sets with the defined arbitrary value for age.
  • Impute missing values in the ‘Fare’ column for both training and testing sets with the defined arbitrary value for fare.

This code demonstrates how to perform arbitrary value imputation for missing numerical data without relying on the scikit-learn library. It manually replaces missing values with predefined constants for specified columns.


Imputing Numerical Data using Arbitrary Value (with scikit-learn)

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import classes from sklearn
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Read data
df = pd.read_csv('titanic_toy.csv')

# Show the first five elements
df.head()

# Fraction of missing values in each column
df.isnull().mean()

# Split the data into training and testing sets
X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Using sklearn for Arbitrary Value Imputation (an easier method than using pandas)
imputer1 = SimpleImputer(strategy='constant', fill_value=99)
imputer2 = SimpleImputer(strategy='constant', fill_value=999)

# Using Column Transformer
trf = ColumnTransformer([
    ('imputer1', imputer1, ['Age']),
    ('imputer2', imputer2, ['Fare'])
], remainder='passthrough')

# Fit the transformer on the training data
trf.fit(X_train)

# Inspect the fill values learned by each imputer
trf.named_transformers_['imputer1'].statistics_
trf.named_transformers_['imputer2'].statistics_

# Transform X_train and X_test
X_train = trf.transform(X_train)
X_test = trf.transform(X_test)

X_train

Explanation:

  1. Import Libraries: Import the necessary libraries, including pandas for data manipulation, numpy for numerical operations, and matplotlib.pyplot for plotting.
  2. Read Data: Load the dataset (‘titanic_toy.csv’) into a pandas DataFrame and display the first five elements to get an overview of the data.
  3. Calculate Missing Values: Check the percentage of missing values in the dataset.
  4. Split Data: Divide the dataset into features (X) and the target variable (y) and further split it into training and testing sets using the train_test_split function from sklearn.
  5. Arbitrary Value Imputation: Utilize the SimpleImputer class from sklearn to impute missing values with arbitrary constants (99 and 999) for the ‘Age’ and ‘Fare’ columns, respectively.
  6. Column Transformer: Use the ColumnTransformer to apply different imputation strategies to specific columns while keeping the rest unchanged (‘passthrough’).
  7. Fit and Transform: Fit the transformer on the training data and generate fill values. Transform both the training and testing sets accordingly.

The code demonstrates a streamlined approach to imputing numerical missing values using arbitrary values, showcasing the simplicity and efficiency of sklearn’s imputation methods.
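One refinement worth knowing, not used in the code above: SimpleImputer accepts an add_indicator parameter that appends a binary “was missing” column, preserving the missingness signal explicitly alongside the arbitrary constant. A minimal sketch on a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the 'Age' column, with two missing values
X = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})

# add_indicator=True appends a 0/1 missing-indicator column after
# the imputed features, so the model can still see which rows were NA.
imp = SimpleImputer(strategy='constant', fill_value=99, add_indicator=True)
out = imp.fit_transform(X)

# out has two columns: the imputed 'Age' and the missing indicator;
# rows that were NaN get [99, 1], observed rows keep their value and 0.
assert out.shape == (4, 2)
assert out[1, 0] == 99 and out[1, 1] == 1
```

This pairs naturally with arbitrary value imputation for MNAR data, where the fact that a value was missing is itself informative.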
