Univariate analysis is a statistical method that examines the distribution, characteristics, and properties of a single variable at a time. In simpler terms, it describes the patterns and features of individual variables without considering their relationships with other variables.

Here’s a more detailed explanation for beginners:

1. Single Variable Focus:

  • Univariate analysis deals with one variable at a time. It allows you to explore and understand the characteristics of a specific variable in isolation.

2. Types of Variables:

  • Variables can be broadly categorized into two types: categorical and numerical.
    • Categorical variables are those that represent categories or labels (e.g., gender, color).
    • Numerical variables are those that represent measurable quantities (e.g., age, height).
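As a rough illustration of this distinction, pandas can split a DataFrame's columns by how they are stored. The small DataFrame below is invented for this sketch (it is not the Titanic data used later), and treating every non-numeric column as categorical is a simplifying assumption: a column stored as numbers, such as a 0/1 flag, can still be categorical in meaning.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "color": ["red", "blue", "red", "green"],
    "age": [22, 38, 26, 35],
    "height": [175.0, 162.5, 168.0, 180.2],
})

# Columns stored as numbers are treated as numerical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (text labels here) is treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```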

3. Categorical Univariate Analysis:

  • For categorical variables, univariate analysis often involves:
    • Counting the frequency of each category using count plots or bar charts.
    • Visualizing the proportions of different categories using pie charts.
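The counts and proportions behind a bar chart or pie chart can also be inspected as plain numbers. The sketch below uses a made-up series of port codes (an assumption for illustration, not the real dataset) and the pandas `value_counts` method.

```python
import pandas as pd

# Hypothetical embarkation ports, invented for this sketch
ports = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

# Absolute frequency of each category (what a count plot draws)
counts = ports.value_counts()
# Share of each category, summing to 1 (what a pie chart draws)
proportions = ports.value_counts(normalize=True)

print(counts)
print(proportions.round(2))
```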

4. Numerical Univariate Analysis:

  • For numerical variables, univariate analysis often involves:
    • Creating histograms to visualize the distribution of values.
    • Using summary statistics such as mean, median, minimum, maximum, and standard deviation to describe the central tendency and spread of the data.
    • Exploring boxplots to identify outliers and understand the spread of data.
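The summary statistics listed above can be computed directly on a pandas Series. This is a minimal sketch on made-up ages (assumed values, not taken from the Titanic data).

```python
import pandas as pd

# Hypothetical ages, invented for this sketch
ages = pd.Series([22, 38, 26, 35, 54, 2, 27], name="Age")

summary = {
    "mean": ages.mean(),      # central tendency, sensitive to extreme values
    "median": ages.median(),  # central tendency, robust to extreme values
    "min": ages.min(),
    "max": ages.max(),
    "std": ages.std(),        # sample standard deviation: spread around the mean
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Calling `ages.describe()` returns most of these values in a single step.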

5. Purpose of Univariate Analysis:

  • Univariate analysis is useful for gaining insights into the characteristics and patterns of individual variables.
  • It helps in identifying outliers, understanding the range of values, and detecting any potential issues with the data.
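One possible way to check a variable for issues is to count missing values and flag outliers numerically. The sketch below uses the 1.5 × IQR rule, a common convention that also underlies the whiskers of a standard boxplot; the rule and the sample ages are assumptions for illustration, not a prescribed method.

```python
import pandas as pd

# Hypothetical ages with one missing value and one extreme value
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, None, 95], name="Age")

# Number of missing entries
missing = ages.isna().sum()

# Interquartile range (quantile() ignores missing values)
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print("Missing values:", missing)
print("Range:", ages.min(), "to", ages.max())
print("Outliers:", outliers.tolist())
```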

6. Example:

  • Suppose you have a dataset with information about the ages of individuals. Univariate analysis of the ‘Age’ variable would involve creating a histogram to visualize the age distribution, calculating the average age (mean), and exploring any extreme values using a boxplot.
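The histogram step of this example can be sketched without plotting by binning the ages and counting how many fall in each bin. The ages and bin edges below are hypothetical choices made for this sketch.

```python
import pandas as pd

# Hypothetical ages, invented for this sketch
ages = pd.Series([4, 19, 22, 26, 31, 35, 47, 58, 63], name="Age")

# Assumed bin edges; each bin is open on the left and closed on the right
bins = [0, 20, 40, 60, 80]
binned = pd.cut(ages, bins=bins)

# Count of ages per bin, ordered by bin (the numbers a histogram would draw)
hist_counts = binned.value_counts().sort_index()

print(hist_counts)
print("Average age:", round(ages.mean(), 2))
```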

7. Limitations:

  • While univariate analysis provides valuable insights into individual variables, it may not capture complex relationships between variables. For a more comprehensive understanding of the data, multivariate analysis, which involves the simultaneous analysis of multiple variables, is often necessary.

In summary, univariate analysis is a fundamental step in the exploratory data analysis (EDA) process. It helps beginners grasp the characteristics and patterns of individual variables, laying the groundwork for more advanced analyses.

#import libraries

import pandas as pd

import seaborn as sns

#read the data

df = pd.read_csv('../input/data-science-day1-titanic/DSB_Day1_Titanic_train.csv')

#show the first 5 rows of the data

df.head()

1. Categorical Data

# For categorical data, count plots and pie charts are the most common choices.

a. Countplot

sns.countplot(x=df['Embarked'])

or

df['Survived'].value_counts().plot(kind='bar')

b. PieChart

df['Sex'].value_counts().plot(kind='pie', autopct='%.2f')
-----------------------------------------------------------------------------------------------------------------------------------------

2. Numerical Data

For numerical data, the following plots and functions are commonly used.

a. Histogram

# matplotlib is used for plotting; dropna() skips any missing ages before binning

import matplotlib.pyplot as plt

plt.hist(df['Age'].dropna(), bins=5)

b. Distplot

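# A histogram with a KDE (kernel density estimate) curve. sns.distplot is deprecated since seaborn 0.11, so histplot with kde=True is used instead.

sns.histplot(x=df['Age'], kde=True)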

c. Boxplot

# A boxplot is especially useful for spotting outliers.

sns.boxplot(x=df['Age'])

# to find the minimum age present in the data

df['Age'].min()

# to find the maximum age present in the data

df['Age'].max()

# to find the average age present in the data

df['Age'].mean()

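# Skewness measures how asymmetric the distribution is (a value near 0 means roughly symmetric). To see how much the data deviates from the mean, use df['Age'].std() instead.

df['Age'].skew()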
