Window Functions in SQL are a type of analytical Function that perform calculations across a set of rows that are related to the current row, called a “window”. A window function calculates a value for each row in the results set based on a subset of the rows that are defined by a window specification.

The window specification is defined using the OVER() clause in SQL, which specifies the partitioning and ordering of the rows that the window function will operate on. The partitioning divides the rows into groups based on specific columns within each group.

GROUP BY returns group where Window Function returns rows by rows.

-- 1. AGGREGATE FUNCTION WITH OVER()

-- Explanation: Calculates the average marks for each branch using a window function.
SELECT branch, student_id, marks, AVG(marks) OVER (PARTITION BY branch) AS avg_marks_per_branch
FROM marks;
-- 2. RANK/DENSE RANK/ROW NUMBER (it does not need OVER because of the db engine)

-- Explanation: Assigns a unique rank to each student based on their marks in descending order.
SELECT student_id, marks, RANK() OVER (ORDER BY marks DESC) AS rank
FROM marks;
-- 3. FIRST VALUE/LAST VALUE/Nth VALUE

-- Explanation: Retrieves the first and last marks for each branch using window functions.
SELECT branch, student_id, marks,
       FIRST_VALUE(marks) OVER (PARTITION BY branch ORDER BY marks) AS first_mark,
       LAST_VALUE(marks) OVER (PARTITION BY branch ORDER BY marks) AS last_mark
FROM marks;

In the provided examples:

  • PARTITION BY: Divides the result set into partitions based on the specified column (e.g., branch).
  • ORDER BY: Defines the order in which the window function processes the rows.
  • OVER(): Specifies the window specification for the window function.

Make sure to adapt the examples based on your actual database schema and requirements.

-- 4. FRAMES

-- Explanation: Calculates the average marks for each student based on a custom window frame.
SELECT student_id, marks,
       AVG(marks) OVER (ORDER BY marks ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS avg_marks_custom_frame
FROM marks;

A frame tells how the window function will calculate. A frame in a window function is a subset of rows within the partition that determines the scope of the window function calculation. The frame uses a combination of two clauses in the window function: ROWS & BETWEEN. The ROWS clause specifies how many rows should be included frame relative to the current row. For Example: ROWS 3 PRECEDING means that the frame includes the current row and the three rows that precede in the partition.

Window functions like AVG()RANK()FIRST_VALUE()LAST_VALUE() operate on the defined window.

The BETWEEN clause specifies the boundaries of the frame.

Examples :

1. ROWS BETWEEN UNBOUND PRECEDING(1st Row) AND CURRENT ROW(Default in window function) – means that the frame includes all rows from the beginning of the partition and includes the current row.

Explanation:

  • The frame includes all rows from the beginning of the partition up to and including the current row.
  • For each row, the window function considers all the rows from the start of the partition to the current row.

Example:
Consider a table with student marks partitioned by branch:

RowBranchMarksAvg Marks (Frame)
1A8080
2A8582.5
3A9285.6667

For the first row, the average marks are calculated based on all rows in the partition (all rows for branch A).

2. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: the frame includes the current row and the row immediately before and after it.

Explanation:

  • The frame includes the current row and the row immediately before and after it.
  • For each row, the window function considers the current row, the one preceding it, and the one following it.

Example:


Continuing with the student marks table:

RowBranchMarksAvg Marks (Frame)
1A8082.5
2A8587.6667
3A9288.6667

For the second row, the average marks are calculated based on the marks of row 1, row 2 (current), and row 3.

3. ROWS BETWEEN UNBOUND PRECEDING AND UNBOUND FOLLOWING: the frame includes all rows in the partition.

Explanation:

  • The frame includes all rows in the partition.
  • For each row, the window function considers all rows in the partition, regardless of their position.

Example:
Continuing with the student marks table:

RowBranchMarksAvg Marks (Frame)
1A8086.75
2A8586.75
3A9286.75

For any row, the average marks are calculated based on all rows in the partition (all rows for branch A).

4. ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING: the frame includes the current row and the three rows before it and the two rows after it

Explanation:

  • The frame includes the current row and the three rows before it and the two rows after it.
  • For each row, the window function considers the current row, the three preceding rows, and the two following rows.

Example:
Continuing with the student marks table:

RowBranchMarksAvg Marks (Frame)
1A8086
2A8587
3A9288

For the second row, the average marks are calculated based on the marks of row 1, row 2 (current), and rows 3 and 4.

WINDOW FUNCTION divides to be GROUPS and GROUPS divides to be FRAMES

Lead and Lag

-- Lead returns the value of a given expression for the next row within the result set.
SELECT column, LEAD(column) OVER (ORDER BY some_column) AS lead_value
FROM your_table;

-- Lag returns the value of a given expression for the previous row within the result set.
SELECT column, LAG(column) OVER (ORDER BY some_column) AS lag_value
FROM your_table;

Ranking

-- Rank assigns a unique rank to each distinct row within a result set.
SELECT column, RANK() OVER (ORDER BY some_column) AS rank_value
FROM your_table;

Cumulative Sum, Cumulative Average & Running Average

Cumulative sum

Cumulative sum is another type of calculation that can be performed using window functions. A cumulative sum calculates the sum of a set of values up to a given point in time, and includes all previous values in the calculation.

Cumulative average

Cumulative average is another type of average that can be calculated using window functions. A cumulative average calculates the average of a set of values up to a given point in time, and includes all previous values in the calculation.

 Running Average

A running average in a window function is a way to calculate the average of a set of values, but it’s done continuously as you move through a sequence of data points. Imagine you have a list of numbers, and you want to know the average of the numbers within a certain group or window.

-- Cumulative Sum calculates the sum of a column up to the current row.
SELECT column, SUM(column) OVER (ORDER BY some_column) AS cumulative_sum
FROM your_table;

-- Cumulative Average calculates the average of a column up to the current row.
SELECT column, AVG(column) OVER (ORDER BY some_column) AS cumulative_avg
FROM your_table;

-- Running Average calculates the average of a column for the current row and all preceding rows.
SELECT column, AVG(column) OVER (ORDER BY some_column ROWS UNBOUNDED PRECEDING) AS running_avg
FROM your_table;

Percent of Total

Percent of total refers to the percentage or proportion of a specific value in relation to the total value. It is a commonly used metric to represent the relative importance or contribution of a particular value within a larger group or population.

-- Percent of Total calculates the percentage of each row relative to the total sum.
SELECT column, column / SUM(column) OVER () * 100 AS percent_of_total
FROM your_table;

Percent Change

Percent change is a way of expressing the difference between two values as a percentage of the original value. It is often used to measure how much a value has increased or decreased over a given period of time, or to compare two different values.

Percent Change = ((new_value-old_value)/old_value)*100

-- Percent Change calculates the percentage change from the previous row to the current row.
SELECT column, (column - LAG(column) OVER (ORDER BY some_column)) / LAG(column) OVER (ORDER BY some_column) * 100 AS percent_change
FROM your_table;

Percentiles and Quantiles

A Quantile is a measure of the distribution of a dataset that divides the data into any number of equally sized intervals. For example, a dataset could be divided into deciles (ten equal parts), quartiles (four equal parts), percentiles (100 equal parts), or any other number of intervals.

Each quantile represents a value below which a certain percentage of the data falls. For example, the 25th percentile (also known as the first quartile, or Q1) represents the value below which 25% of the data falls. The 50th percentile (also known as the median) represents the value below which 50% of the data falls, and so on.

-- Percentiles divide the data into 100 equal parts; Quantiles divide the data into 'n' equal parts.
SELECT column, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column) AS median
FROM your_table;

-- For quantiles, replace PERCENTILE_CONT with PERCENTILE_DISC.
SELECT column, PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY column) AS first_quartile
FROM your_table;

Segmentation

Segmentation using NTILE is a technique in SQL for dividing a dataset into equal-sized groups based on some criteria or conditions, and then performing calculations or analysis on each group separately using window functions.

-- Segmentation categorizes rows into segments based on specified conditions.
SELECT column, CASE WHEN condition1 THEN 'Segment1' WHEN condition2 THEN 'Segment2' ELSE 'Other' END AS segment
FROM your_table;

Cumulative Distribution

The cumulative distribution function is used to describe the probability distribution of random variables. It can be used to describe the probability for a discrete, continuous or mixed variable. It is obtained by summing up the probability density function and getting the cumulative probability for a random variable

“What percentage of the rows in the data set have a value less than or equal to the current row?”

Percentage=(Total Number of Rows in DatasetNumber of Rows with Value≤Current Row’s Value)/(Total Number of Rows in Dataset)​×100%

-- Cumulative Distribution calculates the cumulative percentage of values up to the current row.
SELECT column, CUME_DIST() OVER (ORDER BY column) AS cumulative_distribution
FROM your_table;

Partition by Multiple Columns

-- Partition By allows you to reset the window function based on multiple columns.
SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1, column2 ORDER BY some_column) AS sum_partitioned
FROM your_table;

Connect SQL with Database using Python

import pymysql  # You may need to install the pymysql package

# Establish a connection to the database
connection = pymysql.connect(
    host='your_host',
    user='your_user',
    password='your_password',
    database='your_database'
)

# Create a cursor object to execute SQL queries
cursor = connection.cursor()

# Execute SQL queries
cursor.execute('SELECT * FROM your_table')

# Fetch and print results
results = cursor.fetchall()
for row in results:
    print(row)

# Close the cursor and connection
cursor.close()
connection.close()

Make sure to replace placeholder values such as ‘your_host’, ‘your_user’, ‘your_password’, ‘your_database’, and ‘your_table’ with your actual database details. Also, install the necessary Python packages if not already installed.

One Reply to “Window Functions in SQL”

Leave a Reply

Your email address will not be published. Required fields are marked *