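Sometimes a dataset contains mixed variables: columns whose values combine letters and numbers. The mix can occur within a single value (for example the cabin ‘C85’) or across rows, where some rows hold numbers and others hold text labels.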
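
As a quick illustration, the sketch below shows one way to check whether a column mixes numbers and text. The sample values and variable names are made up for this example; they are assumptions, not taken from the Titanic data.

```python
import pandas as pd

# Made-up values for illustration only
across_rows = pd.Series([3, 1, 'A', 2, 'B'])      # numbers in some rows, labels in others
within_value = pd.Series(['C85', 'E46', 'B57'])   # letters and digits inside every value

# Share of entries that parse as numbers; a value strictly between 0 and 1
# suggests that numbers and labels are mixed across rows
share_numeric = pd.to_numeric(across_rows, errors='coerce').notna().mean()

# Flag values that contain both a letter and a digit
has_both = within_value.str.contains(r'[A-Za-z]') & within_value.str.contains(r'\d')

print(share_numeric)
print(has_both.all())
```

A share of 0 or 1 would mean the column is purely text or purely numeric, so no splitting is needed.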

Let’s learn how to handle mixed variables in the Titanic dataset (https://www.kaggle.com/competitions/titanic) by extracting the numerical and categorical parts from mixed columns such as ‘Cabin’ and ‘Ticket’:

# Import Libraries
import numpy as np
import pandas as pd

# Read the Titanic dataset
df = pd.read_csv('titanic.csv')

# Display the first few rows of the DataFrame
df.head()

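# Check unique values in the 'number' column
# (Note: the standard Kaggle Titanic file has no 'number' column; this step
# assumes your data has a column of that name mixing numbers and text labels)
df['number'].unique()

# Plot the count of passengers with each 'number' (plot.bar() returns a Matplotlib Axes)
ax = df['number'].value_counts().plot.bar()
ax.set_title('Passengers travelling with')

# Extract numerical part from the 'number' column; non-numeric entries become NaN
df['number_numerical'] = pd.to_numeric(df['number'], errors='coerce', downcast='integer')

# Extract categorical part: keep the original value where numeric conversion failed
df['number_categorical'] = np.where(df['number_numerical'].isnull(), df['number'], np.nan)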

# Display the updated DataFrame
df.head()

# Check unique values in the 'Cabin' column
df['Cabin'].unique()

# Check unique values in the 'Ticket' column
df['Ticket'].unique()

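# Extract the first run of digits from the 'Cabin' column
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
df['cabin_num'] = df['Cabin'].str.extract(r'(\d+)', expand=False)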

# Extract the first letter from the 'Cabin' column
df['cabin_cat'] = df['Cabin'].str[0]

# Plot the count of each category in the 'cabin_cat' column
df['cabin_cat'].value_counts().plot(kind='bar')

# Extract the last part of the 'Ticket' as a number
df['ticket_num'] = df['Ticket'].apply(lambda s: s.split()[-1])
df['ticket_num'] = pd.to_numeric(df['ticket_num'], errors='coerce', downcast='integer')

# Extract the first part of the 'Ticket' as a category
df['ticket_cat'] = df['Ticket'].apply(lambda s: s.split()[0])
df['ticket_cat'] = np.where(df['ticket_cat'].str.isdigit(), np.nan, df['ticket_cat'])

# Display the first 20 rows of the updated DataFrame
df.head(20)

# Check unique values in the 'ticket_cat' column
df['ticket_cat'].unique()

In this code:

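  • Values in the ‘number’ column that parse as numbers go to number_numerical; the remaining values go to number_categorical.
  • The numerical part of the ‘Ticket’ column is extracted by splitting each value on whitespace, taking the last token, and converting it with pd.to_numeric; tokens that are not numbers become NaN.
  • The categorical part of ‘Ticket’ is the first token of the split; when that token consists only of digits (the ticket has no prefix), it is replaced with NaN.
  • The ‘Cabin’ column is split into a numerical part, extracted with the regular expression (\d+), which captures the first run of digits, and a categorical part, which is the first letter of the value.
  • The original ‘Ticket’ and ‘Cabin’ columns are kept; the new columns (ticket_num, ticket_cat, cabin_num, cabin_cat) are added alongside them.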
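
To try the Cabin and Ticket steps without downloading the data, here is a minimal sketch that applies the same logic as the code above to a small hand-made DataFrame. The sample rows and the name toy are assumptions for illustration, and the sketch uses the vectorized .str accessor instead of apply, which should give the same result for these values.

```python
import numpy as np
import pandas as pd

# Hand-made sample rows in the style of the Titanic columns (illustrative only)
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46', 'B57 B59'],
    'Ticket': ['A/5 21171', '113803', 'PC 17599', 'LINE'],
})

# Cabin: first run of digits as the number, first letter as the category
toy['cabin_num'] = pd.to_numeric(toy['Cabin'].str.extract(r'(\d+)', expand=False))
toy['cabin_cat'] = toy['Cabin'].str[0]

# Ticket: last whitespace-separated token as the number (NaN if not numeric)
tokens = toy['Ticket'].str.split()
toy['ticket_num'] = pd.to_numeric(tokens.str[-1], errors='coerce')

# Ticket: first token as the category, unless it consists only of digits
first = tokens.str[0]
toy['ticket_cat'] = first.where(~first.str.isdigit())

print(toy)
```

Note that a cabin entry listing several cabins, such as ‘B57 B59’, keeps only the first number and letter, and a ticket with no number, such as ‘LINE’, ends up with a missing ticket_num.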
