MySQL 8 vs. MariaDB: Comparison of Window Functions and CTEs
Every MySQL database programmer should learn and apply the newly added MariaDB and MySQL window functions and common table expressions (CTEs) in their daily work. Both CTEs and window functions enable easy solutions to many query challenges that in prior releases have been difficult and sometimes impossible to surmount. Mastering these features opens the door to query solutions that are more robust, that execute faster, and that are easier to maintain over time than prior solutions using older techniques.
In our last article, we compared user roles in MySQL 8 vs. MariaDB. Today, we will compare window functions and common table expressions in both databases.
Window Functions
While all database administrators are familiar with aggregate functions like COUNT()
, SUM()
, and AVG()
, far fewer people make use of window functions in their queries. Unlike aggregate functions, which operate on an entire table, window functions operate on a set of rows and return a single aggregated value for each row.
The main advantage of using window functions over regular aggregate functions is that window functions do not cause rows to become grouped into a single output row. Instead, the rows retain their separate identities and an aggregated value is added to each row.
Window Functions in MariaDB
Windowing functions were added to the ANSI/ISO Standard SQL:2003 and then extended in ANSI/ISO Standard SQL:2008. DB2, Oracle, Sybase, PostgreSQL, and other products have had full implementations for years. Other vendors added support for window functions later on. Case in point, Microsoft did not add Window Functions to SQL Server until SQL 2012.
After numerous wishes and feature requests for window functions over the years, they were finally introduced in MariaDB 10.2.0 to great fanfare! Now, MariaDB includes window functions such as ROW_NUMBER
, RANK
, DENSE_RANK
, PERCENT_RANK
, CUME_DIST
, NTILE
, COUNT
, SUM
, AVG
, BIT_OR
, BIT_AND
, and BIT_XOR
.
The Syntax
Window function queries are characterized by the the OVER
keyword, following which the set of rows used for the calculation is specified. By default, the set of rows used for the calculation (the “window”) is the entire dataset, which can be ordered with the ORDER BY
clause. The PARTITION BY
clause is then employed to reduce the window to a particular group within the dataset.
Here’s an example.
Given the following table of student test scores:
+------------+---------+--------+
| name | test | score |
+------------+---------+--------+
| Steve | SQL | 75 |
+------------+---------+--------+
| Robert | SQL | 43 |
+------------+---------+--------+
| Tracy | SQL | 56 |
+------------+---------+--------+
| Tatiana | SQL | 87 |
+------------+---------+--------+
| Steve | Tuning | 73 |
+------------+---------+--------+
| Robert | Tuning | 31 |
+------------+---------+--------+
| Tracy | Tuning | 88 |
+------------+---------+--------+
| Tatiana | Tuning | 83 |
+------------+---------+--------+
The following two queries return the average test scores partitioned by test and by name, respectively — in other words, aggregated by test and by name:
SELECT name, test, score, AVG(score) OVER (PARTITION BY test)
AS average_by_test FROM test_scores;
+----------+--------+-------+-----------------+
| name | test | score | average_by_test |
+----------+--------+-------+-----------------+
| Steve | SQL | 75 | 65.2500 |
| Steve | Tuning | 73 | 68.7500 |
| Robert | SQL | 43 | 65.2500 |
| Robert | Tuning | 31 | 68.7500 |
| Tracy | SQL | 56 | 65.2500 |
| Tracy | Tuning | 88 | 68.7500 |
| Tatiana | SQL | 87 | 65.2500 |
| Tatiana | Tuning | 83 | 68.7500 |
+----------+--------+-------+-----------------+
SELECT name, test, score, AVG(score) OVER (PARTITION BY name)
AS average_by_name FROM student;
+---------+--------+-------+-----------------+
| name | test | score | average_by_name |
+---------+--------+-------+-----------------+
| Steve | SQL | 75 | 74.0000 |
| Steve | Tuning | 73 | 74.0000 |
| Robert | SQL | 43 | 37.0000 |
| Robert | Tuning | 31 | 37.0000 |
| Tracy | SQL | 56 | 72.0000 |
| Tracy | Tuning | 88 | 72.0000 |
| Tatiana | SQL | 87 | 85.0000 |
| Tatiana | Tuning | 83 | 85.0000 |
+---------+--------+-------+-----------------+
In both cases, note that the original scores are still available to each row.
Window Functions in MySQL 8
MySQL has been even later in adopting the window functions standard, with it being part of version 8.0 that is expected to be released later this year.
MySQL employs the same ANSI/ISO Standard as other DBMSes whereby window function queries are characterized by the OVER
keyword and the PARTITION BY
clause is employed to reduce the window to a specific group within the result set.
The currently supported functions include:
Name | Description |
CUME_DIST() |
Cumulative distribution value. |
DENSE_RANK() |
Rank of current row within its partition without gaps. |
FIRST_VALUE() |
Value of argument from the first row of the window frame |
LAG() |
Value of argument from row lagging current row within the partition. |
LAST_VALUE() |
Value of argument from the last row of the window frame. |
LEAD() |
Value of argument from row leading current row within the partition |
NTH_VALUE() |
Value of argument from Nth row of the window frame. |
NTILE() |
Bucket number of the current row within its partition. |
PERCENT_RANK() |
Percentage rank value. |
RANK() |
Rank of current row within its partition with gaps. |
ROW_NUMBER() |
Number of current row within its partition. |
As an example, we’ll explore the CUME_DIST()
function.
It returns the cumulative distribution of a value within a group of values; that is, the percentage of partition values less than or equal to the value in the current row. This represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition. Return values range from 0 to 1.
This function is usually used with ORDER BY
to sort partition rows into the desired order. Without ORDER BY
, all rows are peers having a value of 1.
The following query shows, for the set of values in the val
column, the CUME_DIST()
value for each row, as well as the percentage rank value returned by the similar PERCENT_RANK()
function. For reference, the query also displays row numbers using ROW_NUMBER()
:
SELECT
val,
ROW_NUMBER() OVER w AS 'row_number',
CUME_DIST() OVER w AS 'cume_dist',
PERCENT_RANK() OVER w AS 'percent_rank'
FROM numbers
WINDOW w AS (ORDER BY val);
+------+------------+--------------------+--------------+
| val | row_number | cume_dist | percent_rank |
+------+------------+--------------------+--------------+
| 1 | 1 | 0.2222222222222222 | 0 |
| 1 | 2 | 0.2222222222222222 | 0 |
| 2 | 3 | 0.3333333333333333 | 0.25 |
| 3 | 4 | 0.6666666666666666 | 0.375 |
| 3 | 5 | 0.6666666666666666 | 0.375 |
| 3 | 6 | 0.6666666666666666 | 0.375 |
| 4 | 7 | 0.8888888888888888 | 0.75 |
| 4 | 8 | 0.8888888888888888 | 0.75 |
| 5 | 9 | 1 | 1 |
+------+------------+--------------------+--------------+
The OVER
clause is permitted for many aggregate functions, including:
AVG()
COUNT()
MAX()
MIN()
STDDEV_POP()
,STDDEV()
,STD()
STDDEV_SAMP()
SUM()
VAR_POP()
,VARIANCE()
VAR_SAMP()
These can be used as window or non-window functions, depending on whether the OVER
clause is present or absent.
MySQL also supports non-aggregate functions that are used only as window functions. For these, the OVER
clause is mandatory:
CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()
As an example of a non-aggregate window function, this query uses ROW_NUMBER()
, which produces the row number of each row within its partition. In this case, rows are numbered per country. By default, partition rows are unordered and row numbering is indeterminate. To sort partition rows, include an ORDER BY
clause within the window definition. The query uses unordered and ordered partitions (the row_num1
and row_num2
columns) to illustrate the difference that omitting and including ORDER BY
makes:
SELECT
year, country, product, profit,
ROW_NUMBER() OVER(PARTITION BY country) AS row_num1,
ROW_NUMBER() OVER(PARTITION BY country ORDER BY year, product) AS row_num2
FROM sales;
+------+---------+------------+--------+----------+----------+
| year | country | product | profit | row_num1 | row_num2 |
+------+---------+------------+--------+----------+----------+
| 2000 | Finland | Computer | 1500 | 2 | 1 |
| 2000 | Finland | Phone | 100 | 1 | 2 |
| 2001 | Finland | Phone | 10 | 3 | 3 |
| 2000 | India | Calculator | 75 | 2 | 1 |
| 2000 | India | Calculator | 75 | 3 | 2 |
| 2000 | India | Computer | 1200 | 1 | 3 |
| 2000 | USA | Calculator | 75 | 5 | 1 |
| 2000 | USA | Computer | 1500 | 4 | 2 |
| 2001 | USA | Calculator | 50 | 2 | 3 |
| 2001 | USA | Computer | 1500 | 3 | 4 |
| 2001 | USA | Computer | 1200 | 7 | 5 |
| 2001 | USA | TV | 150 | 1 | 6 |
| 2001 | USA | TV | 100 | 6 | 7 |
+------+---------+------------+--------+----------+----------+
Common Table Expressions (CTEs)
A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT
, INSERT
, UPDATE
, DELETE
, or CREATE VIEW
statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. However, unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.
A CTE can be used to:
- Create a recursive query.
- Substitute for a view when the general use of a view is not required; that is, you do not have to store the definition in metadata.
- Enable grouping by a column that is derived from a scalar subselect or a function that is either not deterministic or has external access.
- Reference the resulting table multiple times in the same statement.
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries. The query can be divided into separate, simple, logical building blocks. These simple blocks can then be used to build more complex, interim CTEs until the final result set is generated.
Recursive CTEs
Recursive common table expressions (CTEs) were an implementation of standard SQL:1999 for hierarchical queries. The first implementations of recursive CTEs began appearing in 2007. The recursive CTEs from the standard were relatively close to the existing implementation in IBM DB2 version 2. Recursive CTEs were eventually supported by Microsoft SQL Server (since SQL Server 2008 R2), Firebird 2.1, PostgreSQL 8.4+, SQLite 3.8.3+, Oracle 11g Release 2, and IBM Informix version 11.50+.
Without common table expressions or a connected by clause, it is still possible to achieve hierarchical queries with user-defined recursive functions, but these tend to result in very complex SQL.
CTEs in MariaDB
In MariaDB, a non-recursive CTE is basically considered to be a query-local VIEW
whose syntax is more readable than nested FROM (SELECT …)
. A CTE can refer to another and it can be referenced from multiple places.
Thus, CTEs are similar to derived tables. For example...
SQL with a derived table:
SELECT * FROM
( SELECT * FROM employees
WHERE dept = 'Engineering' ) AS engineers
WHERE
...
SQL with CTE:
WITH engineers AS
( SELECT * FROM employees
WHERE dept = 'Engineering' )
SELECT * FROM engineers
WHERE ...
SQL is generally poor at recursion. One of the advantages of CTEs is that they permit a query to reference itself, hence, recursive SQL. A recursive CTE will repeatedly execute subsets of the data until it obtains the complete result set. This makes it particularly useful for handling hierarchical or tree-structured data.
With recursive CTEs, you can achieve things that would be very difficult to do with standard SQL and at a faster execution speed. They can help solve many types of business problems and even simplify some complex SQL/application logic down to a simple recursive call to the database.
Some example uses for recursive CTE are to find gaps in data, create organization charts, and create test data.
WITH RECURSIVE
signifies a recursive CTE. It is given a name, followed by a body (the main query), as follows:
Below is a recursive CTE that counts from 1 to 50.
WITH cte
AS (SELECT 1 AS n -- anchor member
UNION ALL
SELECT n + 1 -- recursive member
FROM cte
WHERE n < 50 -- terminator
)
SELECT n
FROM cte;
The above statement prints a number series from 1 to 49.
CTEs in MySQL
MySQL 8.0 adds CTEs via the standard WITH
keyword, in much the same way it is implemented in competing products.
To specify common table expressions, use a WITH
clause that has one or more comma-separated subclauses. Each subclause provides a subquery that produces a result set and associates a name with the subquery. The following example defines CTEs named cte1
and cte2
in the WITH
clause, and refers to them in the top-level SELECT
that follows the WITH
clause:
WITH
cte1 AS (SELECT a, b FROM table1),
cte2 AS (SELECT c, d FROM table2)
SELECT b, d FROM cte1 JOIN cte2
WHERE cte1.a = cte2.c;
Immediately preceding SELECT
for statements that include a SELECT
statement:
INSERT … WITH … SELECT …
REPLACE … WITH … SELECT …
CREATE TABLE … WITH … SELECT …
CREATE VIEW … WITH … SELECT …
DECLARE CURSOR … WITH … SELECT …
EXPLAIN … WITH … SELECT …
A recursive common table expression is one having a subquery that refers to its own name. For example:
WITH RECURSIVE cte (n) AS
(
SELECT 1
UNION ALL
SELECT n + 1 FROM cte WHERE n < 5
)
SELECT * FROM cte;
+------+
| n |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+------+
Conclusion
Window functions and common table expressions (CTEs) have been a mainstay of many popular database products for some time now. With the release of MySQL version 8 and MariaDB 10.2.0, both vendors have caught up with competing DBMSes such as SQL Server and Oracle.