What is One-Hot Encoding?

Imagine you have a list of different types of cars, and you want to tell a computer about them so it can learn which cars people like. But computers speak a different language—they like numbers! So, we need to turn the names of the cars into numbers.

1. Using Pandas:

With Pandas, it’s like telling the computer, “Hey, make a list for each type of car and put a 1 if the car belongs to that type, and 0 if it doesn’t.”

# Suppose we have a DataFrame 'df' with a column 'fuel' that says how the car is powered. 
# We can use Pandas like this: pd.get_dummies(df, columns=['fuel'])

So, if a car runs on gas, it gets a 1 in the ‘gas’ column, and 0 in the ‘electric’ and ‘hybrid’ columns.
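Here is a minimal sketch of that idea with made-up data (the article's real dataset is 'cars.csv'; the toy values below are just for illustration):

```python
import pandas as pd

# Hypothetical toy data with three fuel types
df = pd.DataFrame({'fuel': ['gas', 'electric', 'hybrid', 'gas']})

# One column per fuel type; a car gets a truthy value only in its own column
encoded = pd.get_dummies(df, columns=['fuel'])
print(list(encoded.columns))  # ['fuel_electric', 'fuel_gas', 'fuel_hybrid']
```

Note that recent pandas versions return boolean dummy columns (True/False) rather than integer 0/1, but the meaning is the same.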

2. K-1 One-Hot Encoding:

Sometimes, having too many columns can confuse the computer. So, we might say, “Okay, let’s make a list for each type of car, but use one column fewer. If a car is not gas or hybrid, it must be electric!” (Pandas drops the alphabetically first category, so in this example the ‘electric’ column is the one that disappears.)

# Doing K-1 One-Hot Encoding in Pandas:
pd.get_dummies(df, columns=['fuel'], drop_first=True)

Now, if a car has 0 in every remaining column, it must belong to the dropped category. The computer gets exactly the same information from one column fewer, which helps it avoid redundant (collinear) inputs.
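The K-1 trick can be sketched with the same hypothetical toy data as before:

```python
import pandas as pd

df = pd.DataFrame({'fuel': ['gas', 'electric', 'hybrid', 'gas']})

# drop_first=True removes the alphabetically first dummy ('fuel_electric'),
# so a row that is all zeros means the car is electric
encoded = pd.get_dummies(df, columns=['fuel'], drop_first=True)
print(list(encoded.columns))  # ['fuel_gas', 'fuel_hybrid']
```

Row 1 (the electric car) has no truthy value in either remaining column, yet the computer can still tell it apart from the gas and hybrid cars.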

3. Using Scikit-Learn:

Here, we use a special tool called Scikit-Learn to help us turn the car types into numbers.

# Suppose we want to teach the computer about the 'fuel' and 'owner' of the cars.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Creating the OneHotEncoder ('sparse' was renamed to 'sparse_output' in scikit-learn 1.2)
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)
# Transforming the data
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])

Because drop='first' removes one column per feature, the computer now knows that a row of all zeros means the first (dropped) category; anything else must be the second or third. Fitting the encoder once on the training data also guarantees the test data gets the same columns in the same order.

4. Top Categories:

Now, imagine we have many types of cars, but some are rare. We might say, “Let’s group the rare ones together and call them ‘uncommon.'”

# Suppose we want to group brands with fewer than 100 cars as 'uncommon.'
counts = df['brand'].value_counts()
threshold = 100
repl = counts[counts <= threshold].index
pd.get_dummies(df['brand'].replace(repl, 'uncommon'))

Now, the computer can focus on the popular cars and just know that the rare ones are ‘uncommon.’
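Here is the same grouping idea as a runnable sketch; the brand names are hypothetical and the article's threshold of 100 is scaled down to fit the toy data:

```python
import pandas as pd

# Made-up brands: two common ones and one rare one
df = pd.DataFrame({'brand': ['Maruti'] * 3 + ['Hyundai'] * 3 + ['Ambassador']})

counts = df['brand'].value_counts()
threshold = 2
# Brands appearing <= 2 times get lumped into 'uncommon'
repl = counts[counts <= threshold].index
encoded = pd.get_dummies(df['brand'].replace(repl, 'uncommon'))
print(sorted(encoded.columns))  # ['Hyundai', 'Maruti', 'uncommon']
```

The rare brand never gets its own column; it is absorbed into the shared 'uncommon' column, keeping the encoded table small.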

So, One-Hot Encoding is like teaching the computer about different things and making it easier for the computer to understand our information using numbers.

----------------------------------------------------------------------

The Same Concepts as Above, Explained a Different Way

One-Hot Encoding is a technique used to convert categorical variables into a format suitable for machine learning algorithms. This process is particularly useful for nominal categorical variables, where the categories have no inherent order or ranking. Here’s an explanation and code examples for performing One-Hot Encoding using different approaches:

1. One-Hot Encoding using Pandas:

# Importing necessary libraries
import numpy as np
import pandas as pd
# Reading data from 'cars.csv' into a DataFrame
df = pd.read_csv('cars.csv')
# Displaying a random sample of 8 rows
df.sample(8)
# Checking the counts of unique values in the 'owner' column
df['owner'].value_counts()
# Performing One-Hot Encoding using Pandas
pd.get_dummies(df, columns=['fuel', 'owner'])

2. K-1 One-Hot Encoding:

# Performing K-1 One-Hot Encoding and dropping the first column to avoid multicollinearity
pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)

3. One-Hot Encoding using Scikit-Learn:

# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2, random_state=2)
# Creating a OneHotEncoder object ('sparse' was renamed to 'sparse_output' in scikit-learn 1.2)
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)
# Transforming the categorical columns in the training set
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])
# Transforming the categorical columns in the testing set
X_test_new = ohe.transform(X_test[['fuel', 'owner']])
# Displaying the shape of the transformed training set
X_train_new.shape
# Combining the transformed features with the remaining non-categorical columns
np.hstack((X_train[['brand', 'km_driven']].values, X_train_new))

4. One-Hot Encoding with Top Categories:

# Counting the occurrences of each brand
counts = df['brand'].value_counts()
# Getting the number of unique brands
df['brand'].nunique()
# Setting a threshold for considering a brand as top or uncommon
threshold = 100
# Identifying brands with occurrences less than or equal to the threshold and replacing them with 'uncommon'
repl = counts[counts <= threshold].index
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Explanation:

  1. The code reads data from a CSV file (‘cars.csv’) into a Pandas DataFrame.
  2. Different One-Hot Encoding methods are demonstrated using Pandas and Scikit-Learn.
  3. K-1 One-Hot Encoding is shown, which drops one column to address multicollinearity.
  4. The final example shows One-Hot Encoding with a focus on top categories, where brands with occurrences below a certain threshold are replaced with a common category (‘uncommon’).
