Imagine you have a bunch of friends, and they all have their own favorite ways of measuring things. One friend likes to use inches, another uses feet, and one even likes to use meters. Now, if you want to compare their heights, it gets a bit tricky because they’re all using different scales.

Standardization is like bringing everyone to the same measuring party! It’s a cool trick where we make sure everyone uses the same measuring unit, like everyone deciding to use inches. So, no matter how your friends measured things before, after standardization, everyone is speaking the same height language.

In machine learning, this is super helpful when we want our models to compare different things. We want them to look at the same “rulers” for every measurement, just like making sure everyone uses inches for height. This way, our models can understand and compare things better, and that’s what we call standardization! It’s like getting everyone on the same page, so we can all play together nicely.

Standardization in data science is a preprocessing technique used to rescale and transform the features or variables in a dataset so that they all have a similar scale. The goal is to make the data comparable and easier to work with, especially when using algorithms that are sensitive to the scale of the input features.

In simpler terms, standardization ensures that all the variables in your dataset are on a level playing field. It involves adjusting the values of each feature so that it has a mean (average) of 0 and a standard deviation of 1. This rescaling is also known as z-score normalization, and it lets every feature contribute on a comparable scale when a machine learning model learns from the data.

The formula for standardization of a data point (Si) in a feature is given by:

Si′=(Si−mean)/standard deviation

  • Si′ is the standardized value of the data point.
  • Si is the original value of the data point.
  • “mean” represents the average (mean) value of all data points in the feature.
  • “standard deviation” is a measure of how spread out the values are in the feature.
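
To see the formula at work, here is a minimal sketch with NumPy. The five heights are made-up values for illustration only, and the sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Made-up heights in centimetres, purely for illustration
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean = heights.mean()  # average of the feature: 170.0
std = heights.std()    # population standard deviation (ddof=0), about 14.14

# Si' = (Si - mean) / standard deviation, applied to every data point at once
standardized = (heights - mean) / std

# Values are roughly -1.414, -0.707, 0, 0.707, 1.414
print(np.round(standardized, 3))

# After standardization the mean is ~0 and the standard deviation is ~1
print(standardized.mean(), standardized.std())
```

The shortest person lands about 1.4 standard deviations below the average and the tallest about 1.4 above it, no matter which unit the heights were measured in.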

By standardizing the data, you remove the influence of the original measurement units, so a feature with large raw values (say, salary in dollars) can’t drown out one with small raw values (say, age in years). It’s particularly useful for algorithms that rely on distances or variances, like K-means clustering, K-nearest neighbors (KNN), and principal component analysis (PCA), and for models trained with gradient descent, such as artificial neural networks.
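
Here is a small sketch of why distance-based algorithms care about scale. The three people and the helper `age_share_of_distance` are hypothetical, invented for this illustration. The sketch measures how much of the squared Euclidean distance between two people comes from age, before and after standardizing.

```python
import numpy as np

# Hypothetical people: columns are [age in years, salary in dollars]
people = np.array([
    [25.0, 50000.0],  # person A
    [60.0, 50500.0],  # person B: 35 years older, salary almost the same
    [26.0, 58000.0],  # person C
])


def age_share_of_distance(data, i, j):
    """Fraction of the squared Euclidean distance between rows i and j that comes from age."""
    squared_diff = (data[i] - data[j]) ** 2
    return squared_diff[0] / squared_diff.sum()


# Standardize each column with its own mean and standard deviation
standardized = (people - people.mean(axis=0)) / people.std(axis=0)

share_before = age_share_of_distance(people, 0, 1)
share_after = age_share_of_distance(standardized, 0, 1)

print(f"Age share of the A-B distance before scaling: {share_before:.1%}")
print(f"Age share of the A-B distance after scaling:  {share_after:.1%}")
```

Before scaling, the 35-year age gap between A and B makes up less than 1% of the squared distance because the 500-dollar salary gap swamps it. After scaling, the age gap carries most of the distance, which matches the intuition that A and B differ mainly in age.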

In summary, standardization is a crucial step in preparing data for analysis and modeling, ensuring that features are consistently scaled for more effective and accurate machine learning outcomes.

Now let’s dive into the code and see “Standardization” in action, which is like making sure all the numbers in our data speak the same language and play fair.

# Importing cool tools
import numpy as np  # For numbers magic
import pandas as pd  # For data fun
import matplotlib.pyplot as plt  # For picture drawing
import seaborn as sns  # For colorful charts

# Reading our data
df = pd.read_csv('../input/standardizationzscore-normalization/Social_Network_Ads.csv')

# Slicing our data to keep only the exciting parts: drop the first two columns
# (Age, EstimatedSalary and Purchased are the ones we use below)
df = df.iloc[:, 2:]
df.sample(5)

Train-Test Split: Dividing our Data into Teams

# Splitting our data into training and test teams
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('Purchased', axis=1),
                                                    df['Purchased'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape

StandardScaler: The Magic Wand for Standardization

# Bringing in the magic wand
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Teaching the magic wand the tricks from our training team only,
# so nothing about the test team leaks into the scaling
scaler.fit(X_train)

# Applying the magic to both our training and test teams
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Turning our data back into something we can read (a DataFrame)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Checking out the effects with some summary tables
# (print both, since a notebook cell only displays its last expression)
print(np.round(X_train.describe(), 1))
print(np.round(X_train_scaled.describe(), 1))

# Drawing the effects
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

ax1.scatter(X_train['Age'], X_train['EstimatedSalary'])
ax1.set_title("Before Scaling")

ax2.scatter(X_train_scaled['Age'], X_train_scaled['EstimatedSalary'], color='red')
ax2.set_title("After Scaling")

plt.show()
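
If you are curious what the magic wand actually learned, the sketch below peeks inside a fitted StandardScaler. It uses synthetic numbers as a stand-in for the Age and EstimatedSalary columns, and every name ending in `_demo` is an assumption made for this illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age / EstimatedSalary columns (not the real dataset)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(18, 60, size=200),           # "age"
    rng.uniform(15_000, 150_000, size=200),  # "salary"
])
X_train_demo, X_test_demo = X[:140], X[140:]

scaler_demo = StandardScaler().fit(X_train_demo)

# The fitted scaler stores one mean and one standard deviation per column
print("Learned means:", scaler_demo.mean_)
print("Learned standard deviations:", scaler_demo.scale_)

# transform() applies (x - mean) / standard deviation using those stored values
manual = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_
same_as_manual = np.allclose(manual, scaler_demo.transform(X_test_demo))
print("transform matches the manual formula:", same_as_manual)

# Training columns come out with mean ~0; test columns are usually close to 0 but not exactly 0
train_means = scaler_demo.transform(X_train_demo).mean(axis=0)
test_means = scaler_demo.transform(X_test_demo).mean(axis=0)
print("Train means after scaling:", np.round(train_means, 3))
print("Test means after scaling:", np.round(test_means, 3))
```

The test team is scaled with the training team's mean and standard deviation, which is why its columns do not land on exactly 0. That is expected: the model should only ever see statistics learned from the training data.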

Why Scaling Matters: The Big Showdown

# Time for a showdown! Who's better: with scaling or without?
from sklearn.linear_model import LogisticRegression

# Let's see how our champion without scaling performs
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Now, let's see how our champion with scaling performs
lr_scaled = LogisticRegression()
lr_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = lr_scaled.predict(X_test_scaled)

# Checking the scores
from sklearn.metrics import accuracy_score

print("Actual (No Scaling):", accuracy_score(y_test, y_pred))
print("Scaled:", accuracy_score(y_test, y_pred_scaled))

# Another showdown, but this time with a tree!
from sklearn.tree import DecisionTreeClassifier

# Fixing random_state keeps the showdown fair and repeatable
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

dt_scaled = DecisionTreeClassifier(random_state=0)
dt_scaled.fit(X_train_scaled, y_train)
y_pred_dt_scaled = dt_scaled.predict(X_test_scaled)

# Checking the scores again
print("Actual (No Scaling):", accuracy_score(y_test, y_pred_dt))
print("Scaled:", accuracy_score(y_test, y_pred_dt_scaled))
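
Why do trees tend to shrug at scaling? A decision tree only asks whether a value falls above or below a cut point, and standardization keeps values in the same order, so the tree tends to carve up the data the same way. The sketch below checks this on synthetic data. The columns, the labelling rule, and the `_demo` names are all assumptions for illustration, not the Social_Network_Ads data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the real dataset: columns are "age" and "salary"
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(18, 60, size=400),
    rng.uniform(15_000, 150_000, size=400),
])
# Made-up labelling rule so the tree has a pattern to learn
y = ((X[:, 0] > 42) | (X[:, 1] > 95_000)).astype(int)

X_train_demo, X_test_demo = X[:300], X[300:]
y_train_demo, y_test_demo = y[:300], y[300:]

scaler_demo = StandardScaler().fit(X_train_demo)

# Same settings and random_state for both trees, so scaling is the only difference
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(X_train_demo, y_train_demo)

tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_scaled.fit(scaler_demo.transform(X_train_demo), y_train_demo)

pred_raw = tree_raw.predict(X_test_demo)
pred_scaled = tree_scaled.predict(scaler_demo.transform(X_test_demo))

# Share of test rows where the two trees give the same answer
agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of test predictions that agree: {agreement:.2%}")
```

On data like this the two trees should agree on nearly every test row, so any gap between the tree scores in a showdown is more likely down to chance than to scaling.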

So, there you have it, coding pals! Standardization is like making sure our data numbers are on the same team, playing fair and square. It helps scale-sensitive models such as logistic regression, KNN, and neural networks make better predictions, while decision trees usually don’t mind either way. Keep coding cool!
