Encoding categorical data means converting categorical variables into numerical representations so that they can be used with machine learning algorithms, most of which require numerical input. Categorical data represents categories and lacks the inherent numerical meaning of numerical data. Several encoding methods exist, and the right choice depends on the nature of the data.

Data can be broadly categorized into two types: Numerical and Categorical. Categorical data, in turn, can be divided into two subtypes: Nominal and Ordinal. Each subtype requires a specific encoding method to be effectively used in machine learning models.

  1. Numerical Data:
    Numerical data consists of quantitative values that can be measured or counted. Examples include age, income, and temperature.
  2. Categorical Data:
    Categorical data represents categories and cannot be measured in the same way as numerical data. It can be further classified into two types:
      i. Nominal Data (One-Hot Encoding):
        Nominal data consists of categories with no inherent order or ranking. One way to encode nominal data is One-Hot Encoding, which creates one binary column per category.
      ii. Ordinal Data (Ordinal Encoding):
        Ordinal data has a clear order or ranking among its categories. Ordinal Encoding suits such data because it preserves the ordinal relationship between categories.
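The difference between the two encodings can be sketched on a small toy example. The DataFrame, its column names ('city', 'review') and its values are assumptions made up for illustration; they are not taken from the dataset used later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: 'city' is nominal, 'review' is ordinal
toy = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune'],
    'review': ['Good', 'Poor', 'Average', 'Good'],
})

# One-Hot Encoding: one binary column per category, no order implied
ohe = OneHotEncoder()
city_encoded = ohe.fit_transform(toy[['city']]).toarray()

# Ordinal Encoding: integer codes follow the order given in 'categories'
oe_toy = OrdinalEncoder(categories=[['Poor', 'Average', 'Good']])
review_encoded = oe_toy.fit_transform(toy[['review']])

print(city_encoded)    # one row per sample, one column per city
print(review_encoded)  # Poor -> 0, Average -> 1, Good -> 2
```

One-Hot Encoding avoids implying that one city is "greater" than another, while Ordinal Encoding keeps the ranking Poor < Average < Good in the integer codes.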

Column Transformer:
ColumnTransformer is a class in scikit-learn (in the sklearn.compose module) that applies different transformers to different columns or subsets of columns, for example one encoder for nominal columns and another for ordinal columns. Each transformer works on its own columns independently, and the results are concatenated into a single feature space.

It’s worth noting that the implementation of Column Transformer will be explored later in this study.
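Ahead of that fuller treatment, here is a minimal sketch of the idea. The DataFrame, the column names ('age', 'gender', 'review') and the transformer labels ('nominal', 'ordinal') are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: column names and values are illustrative only
demo = pd.DataFrame({
    'age': [25, 40, 31],
    'gender': ['Female', 'Male', 'Female'],
    'review': ['Good', 'Poor', 'Average'],
})

ct = ColumnTransformer(
    transformers=[
        # (name, transformer, columns it applies to)
        ('nominal', OneHotEncoder(), ['gender']),
        ('ordinal', OrdinalEncoder(categories=[['Poor', 'Average', 'Good']]), ['review']),
    ],
    remainder='passthrough',  # keep untouched columns such as 'age'
    sparse_threshold=0,       # always return a dense array
)

# Output columns: one-hot 'gender' columns, then 'review' codes, then 'age'
demo_encoded = ct.fit_transform(demo)
print(demo_encoded)
```

Setting remainder='passthrough' is a design choice for this sketch; the default, remainder='drop', would discard the 'age' column.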

# Importing necessary libraries
import numpy as np
import pandas as pd

# Reading data from 'customer.csv' and storing it in the variable 'df'
df = pd.read_csv('customer.csv')

# Displaying a random sample of 5 elements
df.sample(5)

# Slicing, using all rows and columns from index 2 to the end
df = df.iloc[:, 2:]
df.head()

# Using OrdinalEncoder class from scikit-learn
from sklearn.preprocessing import OrdinalEncoder

# Creating an OrdinalEncoder; each inner list gives one column's categories,
# in column order, from lowest to highest rank
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1), df['purchased'], test_size=0.2, random_state=13)

# Fitting the OrdinalEncoder to the training data
oe.fit(X_train)

# Transforming X_train and X_test using the encoder fitted on the training data
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
X_train

# Displaying the categories used by OrdinalEncoder
oe.categories_

# Using LabelEncoder for output (y_train and y_test)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Fitting the LabelEncoder to the training output (y_train)
le.fit(y_train)

# Displaying the classes used by LabelEncoder
le.classes_

# Transforming the training and testing output using LabelEncoder
y_train = le.transform(y_train)
y_test = le.transform(y_test)

Explanation:

  1. The code reads the data from ‘customer.csv’ into a Pandas DataFrame and keeps only the columns from index 2 onward.
  2. It demonstrates the use of OrdinalEncoder for ordinal categorical data with specified categories.
  3. The data is then split into training and testing sets.
  4. OrdinalEncoder is fitted to the training data and applied to transform the training set.
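  5. LabelEncoder is used for encoding the output (target) variable, as LabelEncoder is meant for single-dimensional labels. It is fitted on the training labels only and then applied to both y_train and y_test.
  6. The categories learned by OrdinalEncoder (categories_) and the classes learned by LabelEncoder (classes_) are displayed for reference.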
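
Because ‘customer.csv’ is not included here, the steps above can be sketched end to end on a small stand-in DataFrame. The rows below are invented, and the feature column names 'review' and 'education' are assumptions inferred from the category lists; only 'purchased' appears in the original code. The sketch also shows inverse_transform, which maps the integer codes back to the original labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Stand-in for 'customer.csv'; 'review' and 'education' are assumed column names
sample = pd.DataFrame({
    'review': ['Poor', 'Good', 'Average', 'Good', 'Poor',
               'Average', 'Good', 'Poor', 'Average', 'Good'],
    'education': ['School', 'UG', 'PG', 'PG', 'UG',
                  'School', 'UG', 'PG', 'School', 'UG'],
    'purchased': ['No', 'Yes', 'No', 'Yes', 'No',
                  'Yes', 'Yes', 'No', 'No', 'Yes'],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    sample.drop('purchased', axis=1), sample['purchased'],
    test_size=0.2, random_state=13)

# Fit on the training features only, then reuse the fitted encoder on the test set
enc = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
X_tr_enc = enc.fit_transform(X_tr)
X_te_enc = enc.transform(X_te)

# Same pattern for the target: fit on training labels, transform both splits
lab = LabelEncoder()
y_tr_enc = lab.fit_transform(y_tr)
y_te_enc = lab.transform(y_te)

# inverse_transform recovers the original string labels from the integer codes
y_tr_back = lab.inverse_transform(y_tr_enc)
print(lab.classes_)
```

Fitting the encoders on the training split alone keeps information from the test split out of the preprocessing step.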
