1. Log Transformer:

  • Purpose: Used for right-skewed data, where most values are concentrated on the left side and a long tail extends to the right.
  • Transformation: Applies the logarithm to each value, which compresses large values and makes the distribution more symmetric. It requires strictly positive inputs.

2. Reciprocal Transformer:

  • Purpose: Handles certain strongly skewed distributions; it is only defined for nonzero values.
  • Transformation: Replaces each value x with 1/x. For positive values this reverses their order, so the largest values become the smallest.

3. Power Transformer:

  • Purpose: Like the log transformer, it aims to bring the distribution closer to normal.
  • Transformation: Raises each value to a chosen power (for example, a square root or a square), which changes the shape of the distribution.
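
The three transformations above can be sketched with plain NumPy. This is a minimal illustration on synthetic log-normal data (an assumption made here so the example is self-contained; it is not the dataset used later), and the exponent 0.5 is just one possible choice of power.

```python
import numpy as np
from scipy.stats import skew

# Synthetic, strictly positive, right-skewed sample (assumption for illustration)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

x_log = np.log(x)          # log transform: compresses the long right tail
x_recip = 1.0 / x          # reciprocal transform: reverses the order of positive values
x_sqrt = np.power(x, 0.5)  # power transform with exponent 0.5 (square root)

# Sample skewness before and after; values closer to 0 mean a more symmetric shape
print("raw:", skew(x))
print("log:", skew(x_log))
print("sqrt:", skew(x_sqrt))
```

On data like this, the log usually removes most of the skew, while the square root only reduces it. Which transformation helps most depends on the data, so it is worth checking the result rather than assuming it.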

Scikit-learn provides three main transformers for these purposes:

1. Function Transformer:

  • Usage: General-purpose transformer (FunctionTransformer) that applies a user-specified function to the data.

2. Power Transformer:

  • Usage: Designed for power transformations (PowerTransformer); it applies a Box-Cox or Yeo-Johnson transform and estimates the power parameter from the data.

3. Quantile Transformer:

  • Usage: Maps the data to a uniform or normal distribution based on its quantiles (QuantileTransformer).
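
As a rough sketch, the three scikit-learn transformers can be used side by side as shown below. The data is synthetic, and the parameter choices (method="yeo-johnson", n_quantiles=200, output_distribution="normal") are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer,
    PowerTransformer,
    QuantileTransformer,
)

# Synthetic right-skewed feature matrix with a single column (assumption)
rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))

# FunctionTransformer: wraps any function; inverse_func enables inverse_transform
ft = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
X_ft = ft.fit_transform(X)

# PowerTransformer: learns the power parameter (lambda) from the data
pt = PowerTransformer(method="yeo-johnson")
X_pt = pt.fit_transform(X)
print("Estimated lambda:", pt.lambdas_)

# QuantileTransformer: maps values to a normal distribution via their quantiles
qt = QuantileTransformer(n_quantiles=200, output_distribution="normal", random_state=0)
X_qt = qt.fit_transform(X)
```

Unlike FunctionTransformer with a fixed function, PowerTransformer and QuantileTransformer learn parameters during fit, so they should be fitted on the training data only.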

To check if data follows a normal distribution:

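  • sns.distplot:
    • Usage: Plotting the distribution with seaborn helps assess its shape visually. distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) produces a similar plot.
  • Skewness close to 0 (pandas .skew()):
    • Usage: A skewness of 0 indicates a perfectly symmetrical distribution. Symmetry is necessary for normality but does not guarantee it.
  • Q-Q plot (scipy.stats.probplot):
    • Usage: Compares the quantiles of the data against those of a theoretical normal distribution; points lying close to a straight line suggest the data are approximately normal.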
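
The numeric checks can be sketched without any plotting. The example below uses two synthetic samples (an assumption for illustration) and reads the correlation coefficient that scipy.stats.probplot returns for the fitted Q-Q line; a value close to 1 means the points lie near a straight line.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Two synthetic samples: one normal, one right-skewed (assumption for illustration)
rng = np.random.default_rng(0)
s_normal = pd.Series(rng.normal(size=2000))
s_skewed = pd.Series(rng.exponential(size=2000))

# Skewness: near 0 for symmetric data, clearly positive for right-skewed data
print("skew (normal):", s_normal.skew())
print("skew (skewed):", s_skewed.skew())

# probplot returns ((theoretical, ordered), (slope, intercept, r)) when no plot is requested
(_, _), (_, _, r_normal) = stats.probplot(s_normal, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(s_skewed, dist="norm")
print("Q-Q correlation (normal):", r_normal)
print("Q-Q correlation (skewed):", r_skewed)
```

Passing plot=plt to probplot draws the same Q-Q plot instead of only returning the numbers.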

For specific scenarios:

  • Square Transformation (x²):
    • Usage: Applied to left-skewed data.
  • np.log:
    • Usage: Applies the natural logarithm; it is only defined for strictly positive values.
  • np.log1p:
    • Usage: Computes log(1 + x), which is helpful when the data contain zeros, because log(0) is undefined.
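
A small sketch of these cases follows. The fare-like values and the Beta-distributed sample are made up for illustration and are not taken from the Titanic dataset.

```python
import numpy as np
from scipy.stats import skew

# Illustrative values including a zero (assumption, not real data)
fares = np.array([0.0, 7.25, 71.28, 512.33])

with np.errstate(divide="ignore"):
    plain = np.log(fares)    # first entry becomes -inf because log(0) is undefined
shifted = np.log1p(fares)    # log(1 + x) maps 0 to 0 and stays finite

# np.expm1 inverts np.log1p, so the original values can be recovered
recovered = np.expm1(shifted)

# Left-skewed synthetic sample: squaring stretches the upper end and reduces the skew
rng = np.random.default_rng(0)
left = rng.beta(5.0, 1.0, size=5_000)
squared = np.square(left)
print("skew before squaring:", skew(left))
print("skew after squaring:", skew(squared))
```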

Understanding and applying these function transformers is valuable for preparing data for machine learning models, ensuring they operate effectively on a variety of data distributions.

Function Transformer VS Column Transformer

The FunctionTransformer and ColumnTransformer are both tools provided by scikit-learn to perform specific transformations on input data. Let’s explore their differences:

FunctionTransformer:

1. Purpose:

  • Function Transformation: It wraps a user-specified function as a scikit-learn transformer. The function receives the whole input array, so an element-wise function such as np.log1p ends up transforming every value passed to it.

2. Usage:

  • Single Transformation: It is suitable for scenarios where a single transformation function is applied to the entire dataset or a subset of features.

3. Example:

  • If you want to apply a logarithmic transformation to a specific column in your dataset, FunctionTransformer is a straightforward way to achieve this.

4. Code Example:

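   from sklearn.preprocessing import FunctionTransformer
   import numpy as np

   # Placeholder feature matrix; replace with your own non-negative data
   X = np.array([[0.0, 7.25], [22.0, 71.28]])

   # Wrap np.log1p in a transformer
   trf = FunctionTransformer(func=np.log1p)

   # Apply the transformation to the data
   X_transformed = trf.fit_transform(X)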

ColumnTransformer:

1. Purpose:

  • Feature-Specific Transformation: It is designed for scenarios where different transformations need to be applied to different subsets of features (columns) in the dataset.

2. Usage:

  • Multiple Transformations: It is useful when you have a dataset with diverse features that require different preprocessing steps.

3. Example:

  • If you have both numerical and categorical features and you want to apply different transformations to each type, ColumnTransformer is a convenient choice.

4. Code Example:

   from sklearn.compose import ColumnTransformer
   from sklearn.preprocessing import StandardScaler, OneHotEncoder

   # Define the transformations for numerical and categorical features
   transformers = [
       ('num', StandardScaler(), ['numerical_feature']),
       ('cat', OneHotEncoder(), ['categorical_feature'])
   ]

   # Create the ColumnTransformer
   col_transformer = ColumnTransformer(transformers=transformers)

   # Apply the transformations (X is assumed to be a DataFrame containing both columns)
   X_transformed = col_transformer.fit_transform(X)

Summary:

  • Use FunctionTransformer when:
    • You have a specific transformation function to be applied uniformly to the entire dataset or a subset of features.
  • Use ColumnTransformer when:
    • You need to apply different transformations to different subsets of features in your dataset.

In short, FunctionTransformer applies one function consistently to the features it receives, while ColumnTransformer routes different subsets of features to different transformers.
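
The two tools can also be combined. Here is a minimal sketch, assuming a toy DataFrame with Age and Fare columns (made-up values), in which a FunctionTransformer is applied to Fare only and Age passes through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy data with made-up values (assumption for illustration)
X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 0.0, 53.1],
})

# Log-transform only 'Fare'; remainder='passthrough' keeps the other columns as they are
ct = ColumnTransformer(
    transformers=[("log_fare", FunctionTransformer(func=np.log1p), ["Fare"])],
    remainder="passthrough",
)

# The result is a NumPy array: transformed columns first, then the passthrough columns
X_out = ct.fit_transform(X)
print(X_out)
```

Note that the output column order differs from the input: Fare comes first here because transformed columns are placed before the remainder.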

FUNCTION TRANSFORMER CODE
Below is the code for FunctionTransformer, with comments explaining each step:


# Importing necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

# Reading the Titanic dataset and selecting relevant columns
df = pd.read_csv('../input/data-science-day1-titanic/DSB_Day1_Titanic_train.csv', usecols=['Age', 'Fare', 'Survived'])

# Displaying the first few rows of the dataset
df.head()

# Checking for missing values in the dataset
df.isnull().sum()

# Filling missing Age values with the mean age
# (plain assignment avoids the chained-assignment pitfalls of inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Selecting features (X) and target variable (y) by name rather than by position
X = df[['Age', 'Fare']]
y = df['Survived']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Plotting the distribution and QQ plot for the 'Age' feature
plt.figure(figsize=(14, 4))
plt.subplot(121)
sns.histplot(X_train['Age'], kde=True)  # replaces the deprecated sns.distplot
plt.title('Age PDF')
plt.subplot(122)
stats.probplot(X_train['Age'], dist="norm", plot=plt)
plt.title('Age QQ Plot')
plt.show()

# Creating Logistic Regression and Decision Tree models
clf = LogisticRegression()
clf2 = DecisionTreeClassifier()

# Training the models
clf.fit(X_train, y_train)
clf2.fit(X_train, y_train)

# Making predictions
y_pred = clf.predict(X_test)
y_pred1 = clf2.predict(X_test)

# Evaluating model accuracy
print("Accuracy LR:", accuracy_score(y_test, y_pred))
print("Accuracy DT:", accuracy_score(y_test, y_pred1))

# Applying log transformer to address right-skewed data
trf = FunctionTransformer(func=np.log1p)
X_train_transformed = trf.fit_transform(X_train)
X_test_transformed = trf.transform(X_test)

# Training models on transformed data
clf.fit(X_train_transformed, y_train)
clf2.fit(X_train_transformed, y_train)

# Making predictions on transformed data
y_pred = clf.predict(X_test_transformed)
y_pred1 = clf2.predict(X_test_transformed)

# Evaluating accuracy on transformed data
print("Accuracy LR (Transformed):", accuracy_score(y_test, y_pred))
print("Accuracy DT (Transformed):", accuracy_score(y_test, y_pred1))

# Applying log transformer to the entire dataset
X_transformed = trf.fit_transform(X)

# Training models on the transformed dataset
clf.fit(X_transformed, y)
clf2.fit(X_transformed, y)

# Cross-checking model accuracy using cross-validation
print("LR (Cross-Validated):", np.mean(cross_val_score(clf, X_transformed, y, scoring='accuracy', cv=10)))
print("DT (Cross-Validated):", np.mean(cross_val_score(clf2, X_transformed, y, scoring='accuracy', cv=10)))

# Plotting QQ plots for 'Fare' before and after log transformation
plt.figure(figsize=(14, 4))
plt.subplot(121)
stats.probplot(X_train['Fare'], dist="norm", plot=plt)
plt.title('Fare Before Log')
plt.subplot(122)
stats.probplot(X_train_transformed['Fare'], dist="norm", plot=plt)
plt.title('Fare After Log')
plt.show()

# Training models on transformed 'Fare' data
X_train_transformed2 = trf.fit_transform(X_train[['Fare']])
X_test_transformed2 = trf.transform(X_test[['Fare']])
clf.fit(X_train_transformed2, y_train)
clf2.fit(X_train_transformed2, y_train)

# Making predictions on transformed 'Fare' data
y_pred = clf.predict(X_test_transformed2)
y_pred2 = clf2.predict(X_test_transformed2)

# Evaluating accuracy on transformed 'Fare' data
print("Accuracy LR (Transformed Fare):", accuracy_score(y_test, y_pred))
print("Accuracy DT (Transformed Fare):", accuracy_score(y_test, y_pred2))

This code covers reading and preprocessing the Titanic dataset, creating models, evaluating accuracy, applying a log transformer, and visualizing the impact of the transformation on the data.
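
As an alternative to transforming the data by hand before cross-validation, the transformer and the model can be chained in a Pipeline, so the transformation is applied inside each fold. This is only a sketch on synthetic data standing in for the Age and Fare columns (an assumption, since the CSV file is not available here), so the score it prints says nothing about the Titanic results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-in for two right-skewed features and a binary target (assumption)
rng = np.random.default_rng(0)
X = rng.lognormal(size=(300, 2))
y = (np.log(X[:, 1]) + rng.normal(scale=0.5, size=300) > 0).astype(int)

# The log1p step runs inside each cross-validation fold, before the classifier
pipe = make_pipeline(FunctionTransformer(func=np.log1p), LogisticRegression())

scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=10)
print("LR in pipeline (Cross-Validated):", scores.mean())
```

For a stateless function like np.log1p the result matches transforming up front, but the pipeline form stays safe if the step is later swapped for a transformer that learns parameters from the data.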
