Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

November 28, 2023

Amazon Redshift, the most widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature: the multidimensional data layout sort key.

Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, a new type of sort key that sorts a table’s data by filter predicates instead of physical columns of the table. Multidimensional data layout sort keys will significantly improve the performance of table scans, especially when your query workload contains repetitive scan filters.

Amazon Redshift already provides the capability of automatic table optimization (ATO), which automatically optimizes the design of tables by applying sort and distribution keys without the need for administrator intervention. In this post, we introduce multidimensional data layout sort keys as an additional capability offered by ATO and fortified by Amazon Redshift’s sort key advisor algorithm.

Multidimensional data layout sort keys

When you define a table with the AUTO sort key, Amazon Redshift ATO will analyze your query history and automatically select either a single-column sort key or a multidimensional data layout sort key for your table, based on which option is better for your workload. When multidimensional data layout is selected, Amazon Redshift will construct a multidimensional sort function that co-locates rows that are typically accessed by the same queries. The sort function is subsequently used during query runs to skip data blocks and even skip scanning the individual predicate columns.

Consider the following user query, which is a dominant query pattern in the user’s workload:

SELECT season, sum(metric2) AS "__measure__0"
FROM titles
WHERE lower(subregion) like '%United States%'
GROUP BY 1
ORDER BY 1;

Amazon Redshift stores data for each column in 1 MB disk blocks and stores the minimum and maximum values in each block as part of the table’s metadata. If a query uses a range-restricted predicate, Amazon Redshift can use the minimum and maximum values to rapidly skip over large numbers of blocks during table scans. However, this query’s filter on the subregion column can’t be used to determine which blocks to skip based on minimum and maximum values, and as a result, Amazon Redshift scans all rows from the titles table:

SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789;

When the user’s query was run with titles using a single-column sort key on subregion, the result of the preceding query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 2164081640 | 
(1 rows)

This shows that the table scan read 2,164,081,640 rows.
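To make the block-skipping limitation concrete, here is a minimal Python sketch. It is a toy model, not Amazon Redshift's implementation: the `Block` class, the two helper functions, and the sample subregion values are all hypothetical. It only illustrates why per-block minimum and maximum values help a range filter but cannot rule out any block for a substring filter.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Toy stand-in for one column block plus its min/max metadata."""
    values: list[str]

    @property
    def min_value(self) -> str:
        return min(self.values)

    @property
    def max_value(self) -> str:
        return max(self.values)


def blocks_for_range(blocks, low, high):
    """Keep only blocks whose [min, max] interval overlaps [low, high]."""
    return [b for b in blocks if b.max_value >= low and b.min_value <= high]


def blocks_for_substring(blocks, needle):
    """A '%needle%' filter cannot be decided from min/max alone,
    so this toy model has to scan every block."""
    return list(blocks)


# Hypothetical subregion values, sorted and split into blocks of two rows.
blocks = [
    Block(["Brazil", "Canada"]),
    Block(["Eastern United States", "France"]),
    Block(["Japan", "Mexico"]),
    Block(["Western United States", "Zambia"]),
]

range_scan = blocks_for_range(blocks, "A", "D")
like_scan = blocks_for_substring(blocks, "United States")
print(len(range_scan), "of", len(blocks), "blocks scanned for the range filter")
print(len(like_scan), "of", len(blocks), "blocks scanned for the substring filter")
```

In this sample the matching rows sit in two non-adjacent blocks even though the column is sorted, which is why sorting on the column value alone does not help this kind of filter.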

To improve scans on the titles table, Amazon Redshift might automatically decide to use a multidimensional data layout sort key. All rows that satisfy the lower(subregion) like '%United States%' predicate would be co-located in a dedicated region of the table, and therefore Amazon Redshift will only scan data blocks that satisfy the predicate.

When the user’s query is run with titles using a multidimensional data layout sort key that includes lower(subregion) like '%United States%' as a predicate, the result of the sys_query_detail query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 152324046  | multi-dimensional
(1 rows)

This shows that the table scan read 152,324,046 rows, which is only 7% of the original, and that it used the multidimensional data layout sort key.
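The 7% figure follows directly from the two input_rows values reported by sys_query_detail. A short check, with variable names chosen here only for illustration:

```python
# input_rows reported for the scan with the single-column sort key
rows_single_column = 2_164_081_640
# input_rows reported for the scan with the multidimensional sort key
rows_multidimensional = 152_324_046

fraction_read = rows_multidimensional / rows_single_column
print(f"{fraction_read:.1%} of the original rows were read")
print(f"{1 - fraction_read:.1%} fewer rows were scanned")
```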

Note that this example uses a single query to showcase the multidimensional data layout feature, but Amazon Redshift will consider all the queries running against the table and can create multiple regions to satisfy the most commonly run predicates.

Let’s take another example, with more complex predicates and multiple queries this time.

Imagine having a table objects (price int, accessible int, demand int) with four rows, as shown in the following example.

#id	price	accessible	demand
1	4	3	3
2	2	23	6
3	5	4	5
4	1	1	2

Your dominant workload consists of two queries:

70% of queries follow this pattern:

select * from objects where price > 3 and accessible < demand

20% of queries follow this pattern:

select avg(price) from objects where accessible < demand

With traditional sorting techniques, you might choose to sort the table on the price column, so that the evaluation of price > 3 benefits from the sort. The objects table, after sorting on the single price column, will look like the following.

#id	price	accessible	demand	region
4	1	1	2	Region #1 (price <= 3)
2	2	23	6	Region #1 (price <= 3)
1	4	3	3	Region #2 (price > 3)
3	5	4	5	Region #2 (price > 3)

By using this traditional sort, we can immediately exclude the top two rows, with ID 4 and ID 2, because they don’t satisfy price > 3.

On the other hand, with a multidimensional data layout sort key, the table will be sorted based on a combination of the two commonly occurring predicates in the user’s workload, which are price > 3 and accessible < demand. As a result, the table’s rows are sorted into four regions.

#id	price	accessible	demand	region
4	1	1	2	Region #1 (price <= 3 and accessible < demand)
2	2	23	6	Region #2 (price <= 3 and accessible >= demand)
3	5	4	5	Region #3 (price > 3 and accessible < demand)
1	4	3	3	Region #4 (price > 3 and accessible >= demand)

This concept is even more powerful when applied to entire blocks instead of single rows, when applied to complex predicates that use operators not suitable for traditional sorting techniques (such as like), and when applied to more than two predicates.
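The four-row example can be reproduced with a small Python sketch. This is an illustration under stated assumptions, not the sort function Amazon Redshift actually constructs: the `region` and `regions_to_scan` helpers are hypothetical, and the numbering of the four groups simply follows the listing above (price predicate as the major key, rows satisfying accessible < demand first within each half).

```python
# Rows from the example table: (id, price, accessible, demand).
rows = [(1, 4, 3, 3), (2, 2, 23, 6), (3, 5, 4, 5), (4, 1, 1, 2)]


def region(row):
    """Map a row to a group number 1-4 using the two workload predicates."""
    _, price, accessible, demand = row
    price_gt_3 = price > 3
    avail_lt_demand = accessible < demand
    return 1 + 2 * int(price_gt_3) + int(not avail_lt_demand)


# Python's sort is stable, so rows keep their input order within a group.
layout = sorted(rows, key=region)


def regions_to_scan(price_gt_3=None, avail_lt_demand=None):
    """Groups that can contain matches; None means the query
    does not filter on that predicate."""
    wanted = []
    for r in (1, 2, 3, 4):
        r_price = (r - 1) // 2 == 1   # groups 3 and 4 hold price > 3
        r_avail = (r - 1) % 2 == 0    # groups 1 and 3 hold accessible < demand
        if price_gt_3 is not None and r_price != price_gt_3:
            continue
        if avail_lt_demand is not None and r_avail != avail_lt_demand:
            continue
        wanted.append(r)
    return wanted


print("row order:", [r[0] for r in layout])
print("70% pattern scans groups:", regions_to_scan(True, True))
print("20% pattern scans groups:", regions_to_scan(avail_lt_demand=True))
```

Under this layout the first query pattern only needs one of the four groups and the second needs two, whereas sorting on price alone gives the second pattern no way to skip anything.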

System tables

The following Amazon Redshift system tables show whether multidimensional data layouts are used on your tables and queries:

To determine if a particular table is using a multidimensional data layout sort key, you can check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
To determine if a particular query uses multidimensional data layout to accelerate table scans, you can check step_attribute in the sys_query_detail view. The value will be equal to multi-dimensional if the table’s multidimensional data layout sort key was used during the scan.
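As a rough sketch, the two checks can be wrapped in small Python helpers that interpret values already fetched from svv_table_info and sys_query_detail. The helper names are hypothetical, and how you connect to the warehouse and fetch the rows is deliberately left out; only the two marker strings come from this post.

```python
# Marker values quoted in the text above.
MDDL_SORTKEY = "AUTO(SORTKEY(padb_internal_mddl_key_col))"
MDDL_STEP_ATTRIBUTE = "multi-dimensional"


def table_uses_mddl(sortkey1):
    """True if the sortkey1 value from svv_table_info indicates
    a multidimensional data layout sort key."""
    return sortkey1 == MDDL_SORTKEY


def scans_using_mddl(query_detail_rows):
    """Return table names whose scan reported the multi-dimensional attribute.

    Each row is assumed to be (table_name, input_rows, step_attribute),
    matching the columns selected from sys_query_detail earlier in the post.
    """
    return [
        table_name
        for table_name, _input_rows, step_attribute in query_detail_rows
        if (step_attribute or "").strip() == MDDL_STEP_ATTRIBUTE
    ]


# Rows shaped like the two result sets shown earlier.
before = [("titles", 2164081640, "")]
after = [("titles", 152324046, "multi-dimensional")]

print(scans_using_mddl(before))
print(scans_using_mddl(after))
print(table_uses_mddl("AUTO(SORTKEY(padb_internal_mddl_key_col))"))
```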

Performance benchmarks

We performed internal benchmark testing for multiple workloads with repetitive scan filters and observed that introducing multidimensional data layout sort keys produced the following results:

A 74% total runtime reduction compared to having no sort key.
A 40% total runtime reduction compared to having the best single-column sort key on each table.
An 80% reduction in total rows read from tables compared to having no sort key.
A 47% reduction in total rows read from tables compared to having the best single-column sort key on each table.

Feature comparison

With the introduction of multidimensional data layout sort keys, your tables can now be sorted by expressions based on the commonly occurring filter predicates in your workload. The following table provides a feature comparison for Amazon Redshift against two competitors.

Feature	Amazon Redshift	Competitor A	Competitor B
Support for sorting on columns	Yes	Yes	Yes
Support for sorting by expression	Yes	Yes	No
Automatic column selection for sorting	Yes	No	Yes
Automatic expression selection for sorting	Yes	No	No
Automatic selection between column sorting and expression sorting	Yes	No	No
Automatic use of sorting properties for expressions during scans	Yes	No	No

Considerations

Keep in mind the following when using a multidimensional data layout:

Multidimensional data layout is enabled when you set your table as SORTKEY AUTO.
Amazon Redshift Advisor will automatically choose either a single-column sort key or multidimensional data layout for the table by analyzing your historical workload.
Amazon Redshift ATO adjusts the multidimensional data layout sorting results based on the manner in which ongoing queries interact with the workload.
Amazon Redshift ATO maintains multidimensional data layout sort keys the same way as it currently does for existing sort keys. Refer to Working with automatic table optimization for more details on ATO.
Multidimensional data layout sort keys will work with both provisioned clusters and serverless workgroups.
Multidimensional data layout sort keys will work with your existing data as long as the AUTO SORTKEY is enabled on your table and a workload with repetitive scan filters is detected. The table will be reorganized based on the results of the multidimensional sort function.
To disable multidimensional data layout sort keys for a table, use ALTER TABLE: ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature on the table.
Multidimensional data layout sort keys are preserved when restoring or migrating your provisioned cluster to a serverless cluster or vice versa.

Conclusion

In this post, we showed that multidimensional data layout sort keys can significantly improve query runtime performance for workloads where dominant queries have repetitive scan filters.

To create a preview cluster from the Amazon Redshift console, navigate to the Clusters page and choose Create preview cluster. You can create a cluster in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions and test your workloads.

We would love to hear your feedback on this new feature and look forward to your comments on this post.

About the authors

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Jialin Ding is an Applied Scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.