Monday, October 23, 2023

Use the new SQL commands MERGE and QUALIFY to implement and validate change data capture in Amazon Redshift


Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Amazon Redshift has added many features to enhance analytical processing, like ROLLUP, CUBE, and GROUPING SETS, which were demonstrated in the post Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS. Amazon Redshift has recently added many SQL commands and expressions. In this post, we discuss two new SQL features, the MERGE command and QUALIFY clause, which simplify data ingestion and data filtering.

One familiar task in most downstream applications is change data capture (CDC) and applying it to target tables. This task requires examining the source data to determine whether it is an update or an insert to existing target data. Without the MERGE command, you needed to test the new dataset against the existing dataset using a business key. When that didn't match, you inserted new rows in the existing dataset; otherwise, you updated existing dataset rows with new dataset values.

The MERGE command conditionally merges rows from a source table into a target table. Traditionally, this could only be achieved by using multiple insert, update, or delete statements separately. When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The merge operation reduces this risk by ensuring that all operations are performed together in a single transaction.
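The match-or-insert decision that MERGE encapsulates can be sketched outside of SQL. The following minimal Python example (an illustration only, not Redshift code; the business key and column names are hypothetical) shows the same update-when-matched, insert-when-not-matched logic over in-memory rows:

```python
def merge(target: dict, source: dict) -> dict:
    """Apply source rows to target keyed by a business key: update on match, insert otherwise."""
    result = dict(target)
    for key, row in source.items():
        if key in result:
            # WHEN MATCHED THEN UPDATE: overlay the source columns onto the target row
            result[key] = {**result[key], **row}
        else:
            # WHEN NOT MATCHED THEN INSERT: add the new row as-is
            result[key] = row
    return result

# hypothetical rows keyed by c_customer_sk
target = {1: {"email": "a@example.com"}, 2: {"email": "b@example.com"}}
source = {2: {"email": "new@example.com"}, 3: {"email": "c@example.com"}}
merged = merge(target, source)
print(merged[2]["email"])  # customer 2 was matched and updated
print(sorted(merged))      # customer 3 was inserted alongside 1 and 2
```

In SQL, the point is that both branches run atomically in one statement; the single pass over `source` above mirrors that single-statement behavior.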

The QUALIFY clause filters the results of a previously computed window function according to user-specified search conditions. You can use the clause to apply filtering conditions to the result of a window function without using a subquery. This is similar to the HAVING clause, which applies a condition to further filter rows from a WHERE clause. The difference between QUALIFY and HAVING is that filtered results from the QUALIFY clause can be based on the result of running window functions on the data. You can use both the QUALIFY and HAVING clauses in one query.
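What QUALIFY evaluates, conceptually, is a window calculation followed by a filter on its result. This small Python sketch (an illustration only; the field names are hypothetical) emulates `qualify rank() over (order by insert_ts desc) = 1`:

```python
def qualify_rank_1(rows, ts_key="insert_ts"):
    """Keep only rows tied for the most recent timestamp, i.e. rank() = 1 when ordered descending."""
    latest = max(row[ts_key] for row in rows)
    return [row for row in rows if row[ts_key] == latest]

# hypothetical rows; ISO-format strings compare correctly as dates
rows = [
    {"id": 1, "insert_ts": "2023-10-01"},
    {"id": 2, "insert_ts": "2023-10-23"},
    {"id": 3, "insert_ts": "2023-10-23"},
]
print([r["id"] for r in qualify_rank_1(rows)])  # [2, 3]
```

Note that rank() keeps ties, so both rows sharing the latest timestamp survive; without QUALIFY, the SQL equivalent needs a subquery because a WHERE clause cannot reference a window function directly.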

In this post, we demonstrate how to use the MERGE command to implement CDC and how to use QUALIFY to simplify validation of those changes.

Solution overview

In this use case, we have a data warehouse, in which we have a customer dimension table that always needs to get the latest data from the source system. This data must also reflect the initial creation time and last update time for auditing and tracking purposes.

A simple way to solve this is to fully override the customer dimension every day; however, that won't achieve the update tracking, which is an audit mandate, and it might not be feasible to do for bigger tables.

You can load sample data from Amazon S3 by following the instructions here. Using the existing customer table under sample_data_dev.tpcds, we create a customer dimension table and a source table that will contain both updates for existing customers and inserts for new customers. We use the MERGE command to merge the source table data with the target table (customer dimension). We also show how to use the QUALIFY clause to simplify validating the changes in the target table.

To follow along with the steps in this post, we recommend downloading the accompanying notebook, which contains all the scripts to run for this post. To learn about authoring and running notebooks, refer to Authoring and running notebooks.

Prerequisites

You should have the following prerequisites:

Create and populate the dimension table

We use the existing customer table under sample_data_dev.tpcds to create a customer_dimension table. Complete the following steps:

  1. Create a table using a few selected fields, including the business key, and add a couple of maintenance fields for insert and update timestamps:
     -- create the customer dimension table
     DROP TABLE IF EXISTS customer_dim CASCADE;
     CREATE TABLE customer_dim ( 
     customer_dim_id     bigint GENERATED BY DEFAULT AS IDENTITY(1, 1), 
     c_customer_sk integer NOT NULL ENCODE az64 distkey,
     c_first_name character(20) ENCODE lzo,
     c_last_name character(30) ENCODE lzo,
     c_current_addr_sk integer ENCODE az64,
     c_birth_country character varying(20) ENCODE lzo,
     c_email_address character(50) ENCODE lzo,
     record_insert_ts    timestamp WITHOUT time ZONE DEFAULT current_timestamp,
     record_upd_ts       timestamp WITHOUT time ZONE DEFAULT NULL
     )
     SORTKEY (c_customer_sk);

  2. Populate the dimension table:
     -- populate dimension 
     insert into customer_dim 
            (c_customer_sk, c_first_name, c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
     select c_customer_sk, c_first_name, c_last_name, c_current_addr_sk, c_birth_country, c_email_address
     from "sample_data_dev"."tpcds"."customer";

  3. Validate the row count and the contents of the table:
     -- check customer count and look at sample data
     select count(1) from customer_dim; 
     select * from customer_dim limit 10;

Simulate customer table changes

Use the following code to simulate changes made to the table:

-- create a source table with some updates and some inserts
-- Update - email has changed for 100 customers 
drop table if exists src_customer;
create table src_customer distkey(c_customer_sk) as 
select c_customer_sk , c_first_name , c_last_name, c_current_addr_sk, c_birth_country, 'x'+c_email_address as c_email_address, getdate() as effective_dt
from   customer_dim 
where  c_email_address is not null
limit 100;


-- also let's add three completely new customers
insert into src_customer values 
(15000001, 'Customer#15','000001', 10001 ,'USA'    , 'Customer#15000001@gmail.com', getdate() ),
(15000002, 'Customer#15','000002', 10002 ,'MEXICO' , 'Customer#15000002@gmail.com', getdate() ),
(15000003, 'Customer#15','000003', 10003 ,'CANADA' , 'Customer#15000003@gmail.com', getdate() );

-- check source count
select count(1) from src_customer;

Merge the source table into the target table

Now you have a source table with some changes that you need to merge with the customer dimension table.

Before the MERGE command, this kind of task needed two separate UPDATE and INSERT commands to implement:

-- merge changes to dim customer
BEGIN TRANSACTION;
-- update existing records
UPDATE customer_dim
SET    c_first_name      = src.c_first_name      ,
       c_last_name       = src.c_last_name       , 
       c_current_addr_sk = src.c_current_addr_sk , 
       c_birth_country   = src.c_birth_country   , 
       c_email_address   = src.c_email_address   ,
       record_upd_ts     = current_timestamp
from   src_customer AS src
where  customer_dim.c_customer_sk = src.c_customer_sk;
-- insert new records
INSERT INTO customer_dim (c_customer_sk, c_first_name, c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
select src.c_customer_sk, src.c_first_name, src.c_last_name, src.c_current_addr_sk, src.c_birth_country, src.c_email_address 
from   src_customer AS src
where  src.c_customer_sk NOT IN (select c_customer_sk from customer_dim);
-- end merge operation
COMMIT TRANSACTION;

 

The MERGE command uses a more straightforward syntax, in which we use the key comparison result to decide whether to perform an update DML operation (when matched) or an insert DML operation (when not matched):

MERGE INTO customer_dim USING src_customer AS src ON customer_dim.c_customer_sk = src.c_customer_sk
WHEN MATCHED THEN UPDATE 
SET c_first_name      = src.c_first_name      ,
    c_last_name       = src.c_last_name       , 
    c_current_addr_sk = src.c_current_addr_sk , 
    c_birth_country   = src.c_birth_country   , 
    c_email_address   = src.c_email_address   ,
    record_upd_ts     = current_timestamp
WHEN NOT MATCHED THEN INSERT (c_customer_sk, c_first_name,c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
                      VALUES (src.c_customer_sk, src.c_first_name,src.c_last_name, src.c_current_addr_sk, src.c_birth_country, src.c_email_address );

Validate the data changes in the target table

Now we need to validate that the data has made it correctly to the target table. We can first check the updated data using the update timestamp. Because this was our first update, we can examine all rows where the update timestamp is not null:

-- check the changes
-- to get updates
select * 
from customer_dim
where record_upd_ts is not null

Use QUALIFY to simplify validation of the data changes

We need to examine the data inserted into this table most recently. One way to do that is to rank the data by its insert timestamp and get the rows with the first rank. This requires using the window function rank() and also requires a subquery to get the results.

Before the availability of QUALIFY, we needed to build that using a subquery like the following:

select customer_dim_id, c_customer_sk, c_first_name, c_last_name, c_current_addr_sk, c_birth_country, c_email_address, record_insert_ts, record_upd_ts 
from 
( select rank() OVER (ORDER BY DATE_TRUNC('second', record_insert_ts) desc) AS rnk, 
         customer_dim_id, c_customer_sk, c_first_name, c_last_name, c_current_addr_sk, c_birth_country, c_email_address, record_insert_ts, record_upd_ts 
  from customer_dim
  where record_upd_ts is null)
where rnk = 1;

The QUALIFY clause eliminates the need for the subquery, as in the following code snippet:

-- to get the newly inserted rows we can make use of the QUALIFY feature
select * 
from customer_dim
where record_upd_ts is null
qualify rank() OVER (ORDER BY DATE_TRUNC('second', record_insert_ts) desc) = 1 

Validate all data changes

We can union the results of both queries to get all the insert and update changes:

-- to get all changes
select *
from (
select 'Updates' as operations, cd.* 
from   customer_dim as cd
where  cd.record_upd_ts is not null
union 
select 'Inserts' as operations, cd.* 
from customer_dim cd
where cd.record_upd_ts is null
qualify rank() OVER (ORDER BY DATE_TRUNC('second', cd.record_insert_ts) desc) = 1 
) order by 1

Clean up

To clean up the resources used in the post, delete the Redshift provisioned cluster or Redshift Serverless workgroup and namespace you created for this post (this will also drop all the objects created).

If you used an existing Redshift provisioned cluster or Redshift Serverless workgroup and namespace, use the following code to drop these objects:

DROP TABLE IF EXISTS customer_dim CASCADE;
DROP TABLE IF EXISTS src_customer CASCADE;

Conclusion

When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The MERGE operation reduces this risk by ensuring that all operations are performed together in a single transaction. For Amazon Redshift customers who are migrating from other data warehouse systems or who regularly need to ingest fast-changing data into their Redshift warehouse, the MERGE command is a straightforward way to conditionally insert, update, and delete data in target tables based on existing and new source data.

In many analytic queries that use window functions, you may need to use those window functions in your WHERE clause as well. However, this is not permitted; to do so, you have to build a subquery that contains the required window function and then use its results in the parent query's WHERE clause. Using the QUALIFY clause eliminates the need for a subquery and therefore simplifies the SQL statement, making it easier to write and read.

We encourage you to start using these new features and give us your feedback. For more details, refer to MERGE and QUALIFY clause.


About the authors

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She has experience in product vision and strategy for industry-leading data products and platforms. She is highly skilled in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.


