
Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter


dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. dbt lets data engineers quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, continuous integration and continuous delivery (CI/CD), and documentation.
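For example, the following is a minimal sketch of a dbt model (the table and column names are hypothetical); the model body is only a select statement, and dbt renders the surrounding DDL for the configured materialization:

-- models/customer_order_counts.sql
-- dbt wraps this select in the DDL needed to materialize it
-- (for example, CREATE TABLE ... AS SELECT), so no hand-written
-- CREATE TABLE or INSERT statements are required.
SELECT customer_id,
       COUNT(*) AS order_count
FROM raw_orders
GROUP BY customer_id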

dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who want to keep their data transform logic separate from storage and engine. We have seen strong customer demand to expand its scope to cloud-based data lakes, because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.

In 2022, AWS published a dbt adapter called dbt-glue, the open source, battle-tested dbt AWS Glue adapter that allows data engineers to use dbt for cloud-based data lakes along with data warehouses and databases, paying for just the compute they need. The dbt-glue adapter democratized access to data lakes for dbt users, and enabled many users to effortlessly run their transformation workloads on the cloud with the serverless data integration capability of AWS Glue. Since the launch of the adapter, AWS has continued investing in dbt-glue to cover more requirements.

Today, we’re pleased to announce that the dbt-glue adapter is now a trusted adapter based on our strategic collaboration with dbt Labs. Trusted adapters are adapters that are not maintained by dbt Labs, but that dbt Labs is comfortable recommending to users for use in production.

The key capabilities of the dbt-glue adapter are as follows:

  • Runs SQL as Spark SQL on AWS Glue interactive sessions
  • Manages table definitions in the AWS Glue Data Catalog
  • Supports open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg
  • Supports AWS Lake Formation permissions for fine-grained access control

In addition to these capabilities, the dbt-glue adapter is designed to optimize resource utilization with several techniques on top of AWS Glue interactive sessions.

This post demonstrates how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue with the dbt-glue adapter.

Common use cases

One common use case for dbt-glue is a central analytics team at a large corporation that is responsible for monitoring operational efficiency. They ingest application logs into raw Parquet tables in an Amazon Simple Storage Service (Amazon S3) data lake. Additionally, they extract organized data from operational systems capturing the company’s organizational structure and the costs of various operational components, which they store in the raw zone using Iceberg tables to preserve the original schema, facilitating easy access to the data. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI). The gold model joins the technical logs with billing data and organizes the metrics per business unit. The gold model uses Iceberg’s ability to support the data warehouse-style modeling needed for performant BI analytics in a data lake. The combination of Iceberg and dbt-glue allows the team to efficiently build a data model that’s ready to be consumed.

Another common use case is when an analytics team in a company that has an S3 data lake creates a new data product in order to enrich its existing data from its data lake with medical data. Let’s say that this company is located in Europe and the data product must comply with the GDPR. For this, the company uses Iceberg to meet needs such as the right to be forgotten and the deletion of data. The company uses dbt to model its data product on its existing data lake due to its compatibility with AWS Glue and Iceberg and the simplicity that the dbt-glue adapter brings to the use of this storage format. A row-level delete like the one sketched after this paragraph is what makes such GDPR deletions practical on a data lake.
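To make the GDPR scenario concrete, Iceberg tables support row-level DELETE statements in Spark SQL. The following is a minimal sketch; the catalog, table, and column names are hypothetical:

-- Delete all records for a data subject who invoked the right to be forgotten.
-- glue_catalog is an Iceberg catalog configured on the Spark session;
-- medical_db.patients and patient_id are illustrative names.
DELETE FROM glue_catalog.medical_db.patients
WHERE patient_id = '12345';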

How dbt and dbt-glue work

The following are key dbt features:

  • Project – A dbt project enforces a top-level structure on the staging, models, permissions, and adapters. A project can be checked into a GitHub repo for version control.
  • SQL – dbt relies on SQL select statements for defining data transformation logic. Instead of raw SQL, dbt offers templatized SQL (using Jinja) that allows code modularity. Instead of having to copy/paste SQL in multiple places, data engineers can define modular transforms and call those from other places within the project. Having a modular pipeline helps data engineers collaborate on the same project. (A minimal sketch follows this list.)
  • Models – dbt models are primarily written as a SELECT statement and saved as a .sql file. Data engineers define dbt models for their data representations. To learn more, refer to About dbt models.
  • Materializations – Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt: table, view, incremental, ephemeral, and materialized view. To learn more, refer to Materializations and Incremental models.
  • Data lineage – dbt tracks data lineage, allowing you to understand the origin of data and how it flows through different transformations. dbt also supports impact analysis, which helps identify the downstream effects of changes.
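To make the modularity point concrete, the following is a minimal sketch of a templatized model. The model names are hypothetical, but ref is the standard dbt function for calling one model from another, and dbt derives the lineage graph from such references:

-- models/daily_revenue.sql
-- {{ ref('stg_orders') }} resolves to the table or view built by the
-- stg_orders model, so the upstream transform is reused rather than
-- copy/pasted into this file.
SELECT order_date,
       SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date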

The high-level data flow is as follows:

  1. Data engineers ingest data from data sources into raw tables and define table definitions for the raw tables.
  2. Data engineers write dbt models with templatized SQL.
  3. The dbt adapter converts the dbt models to SQL statements compatible with the data warehouse.
  4. The data warehouse runs the SQL statements to create intermediate tables or final tables, views, or materialized views.

The following diagram illustrates the architecture.

dbt-glue works with the following steps:

  1. The dbt-glue adapter converts dbt models to SQL statements compatible with Spark SQL.
  2. AWS Glue interactive sessions run the SQL statements to create intermediate tables or final tables, views, or materialized views.
  3. dbt-glue supports csv, parquet, hudi, delta, and iceberg as file formats.
  4. With the dbt-glue adapter, table and incremental are commonly used as materializations at the destination. There are three strategies for incremental materialization: the merge strategy requires hudi, delta, or iceberg, while the other two strategies, append and insert_overwrite, work with csv, parquet, hudi, delta, or iceberg (see the sketch after this list).
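As a minimal sketch of such a configuration, an incremental model using the append strategy on Parquet might look like the following (the model name and columns are hypothetical; the walkthrough later in this post shows a complete merge-on-Iceberg configuration):

-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='parquet'
) }}
SELECT event_id, event_type, event_time
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- On incremental runs, only append rows newer than what the target already holds.
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}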

The following diagram illustrates this architecture.

Example use case

In this post, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources. The raw Parquet table records in this dataset stores the trip records.

The objective is to create the following three tables, which contain metrics based on the raw table:

  • silver_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_cost_metrics – Metrics per cost based on the silver metrics table

The final goal is to create two well-designed gold tables that store already aggregated results in Iceberg format for ad hoc queries through Amazon Athena.

Prerequisites

The instructions require the following prerequisites:

  • An AWS Identity and Access Management (IAM) role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and an S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function (created as an AWS CloudFormation custom resource) that updates all the partitions in the AWS Glue table

With these prerequisites, we simulate the scenario that data engineers have already ingested data from data sources into raw tables and defined table definitions for the raw tables.

For ease of use, we prepared a CloudFormation template. This template deploys all the required infrastructure. To create these resources, choose Launch Stack in the us-east-1 Region, and follow the instructions:

Install dbt, the dbt CLI, and the dbt adapter

The dbt CLI is a command line interface for running dbt projects. It’s free to use and available as an open source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Create a dbt project

Complete the following steps to create a dbt project:

  1. Run the dbt init command to create and initialize a new empty dbt project (a sketch of the interactive session follows this list).
  2. For the project name, enter dbt_glue_demo.
  3. For the database, choose glue.
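dbt init prompts for these values interactively. A minimal sketch of such a session, assuming dbt-glue is already installed (the exact prompt wording varies by dbt version):

$ dbt init
# When prompted, enter dbt_glue_demo as the project name,
# then pick glue from the list of available databases.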

Now the empty project has been created. The directory structure is shown as follows:

$ cd dbt_glue_demo
$ tree .
.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

Create a source

The next step is to create a source table definition. We add models/source_tables.yml with the following contents:

version: 2

sources:
  - name: data_source
    schema: nyctaxi

    tables:
      - name: records

This source definition corresponds to the AWS Glue table nyctaxi.records, which we created in the CloudFormation stack.

Create models

In this step, we create a dbt model that represents the average values for trip duration, passenger count, trip distance, and total amount of charges. Complete the following steps:

  1. Create the models/silver/ directory.
  2. Create the file models/silver/silver_avg_metrics.sql with the following contents:
    WITH source_avg as (
        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration
        , avg(passenger_count) as avg_passenger_count
        , avg(trip_distance) as avg_trip_distance
        , avg(total_amount) as avg_total_amount
        , year
        , month
        , type
        FROM {{ source('data_source', 'records') }}
        WHERE year = "2016"
        AND dropoff_datetime is not null
        GROUP BY year, month, type
    )
    SELECT *
    FROM source_avg

  3. Create the file models/silver/schema.yml with the following contents:
    version: 2
    
    models:
      - name: silver_avg_metrics
        description: This table has basic metrics based on NYC Taxi Open Data for the year 2016
    
        columns:
          - name: avg_duration
            description: The average duration of a NYC Taxi trip
    
          - name: avg_passenger_count
            description: The average number of passengers per NYC Taxi trip
    
          - name: avg_trip_distance
            description: The average NYC Taxi trip distance
    
          - name: avg_total_amount
            description: The average total amount of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi

  4. Create the models/gold/ directory.
  5. Create the file models/gold/gold_cost_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format="iceberg",
        iceberg_expire_snapshots="False",
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
    , (avg_total_amount/avg_duration) as avg_cost_per_minute
    , year
    , month
    , type
    FROM {{ ref('silver_avg_metrics') }}

  6. Create the file models/gold/gold_passengers_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format="iceberg",
        iceberg_expire_snapshots="False",
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_passenger_count) as avg_cost_per_passenger
    , (avg_duration/avg_passenger_count) as avg_duration_per_passenger
    , (avg_trip_distance/avg_passenger_count) as avg_trip_distance_per_passenger
    , year
    , month
    , type
    FROM {{ ref('silver_avg_metrics') }}

  7. Create the file models/gold/schema.yml with the following contents:
    version: 2
    
    models:
      - name: gold_cost_metrics
        description: This table has metrics per cost based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_distance
            description: The average cost per distance of a NYC Taxi trip
    
          - name: avg_cost_per_minute
            description: The average cost per minute of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi
    
      - name: gold_passengers_metrics
        description: This table has metrics per passenger based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_passenger
            description: The average cost per passenger for a NYC Taxi trip
    
          - name: avg_duration_per_passenger
            description: The average duration per passenger for a NYC Taxi trip
    
          - name: avg_trip_distance_per_passenger
            description: The average NYC Taxi trip distance per passenger
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi

  8. Remove the models/example/ folder, because it’s just an example created by the dbt init command (see the command below).
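For example, from the project root:

$ rm -rf models/example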

Configure the dbt project

dbt_project.yml is a key configuration file for dbt projects. It contains the following code:

models:
  dbt_glue_demo:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

We configure dbt_project.yml to replace the preceding code with the following:

models:
  dbt_glue_demo:
    silver:
      +materialized: table

This is because we want to materialize the models under silver as Parquet tables; the gold models define their materialization in their own config blocks.

Configure a dbt profile

A dbt profile is a configuration that specifies how to connect to a particular database. The profiles are defined in the profiles.yml file within a dbt project.

Complete the following steps to configure a dbt profile:

  1. Create the profiles directory.
  2. Create the file profiles/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"

  3. Create the profiles/iceberg/ directory.
  4. Create the file profiles/iceberg/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          datalake_formats: "iceberg"
          conf: --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse="{{ env_var('DBT_S3_LOCATION') }}warehouse/" --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

The last two lines are added to set Iceberg configurations on AWS Glue interactive sessions.

Run the dbt project

Now it’s time to run the dbt project. Complete the following steps:

  1. To run the project, make sure your current directory is the project folder, dbt_glue_demo.
  2. The project requires you to set environment variables in order to run on your AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profiles
    ...
    05:34:22 Connection test: [OK connection ok]
    05:34:22 All checks passed!

If you see any failures, check whether you provided the correct IAM role ARN and S3 location in Step 2.

  4. Run the models with the following code:
    $ dbt run -m silver --profiles-dir profiles
    $ dbt run -m gold --profiles-dir profiles/iceberg/

Now the tables are successfully created in the AWS Glue Data Catalog, and the data is materialized in the Amazon S3 location.

You can verify these tables by opening the AWS Glue console, choosing Databases in the navigation pane, and opening dbt_glue_demo_nyc_metrics.
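You can also list the created tables from the command line, assuming the AWS CLI is configured for the same account and Region:

$ aws glue get-tables --database-name dbt_glue_demo_nyc_metrics \
    --query "TableList[].Name" --output text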

Query materialized tables through Athena

Let’s query the target tables using Athena to verify the materialized tables. Complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
        , cm.avg_cost_per_distance
        , pm.avg_cost_per_passenger
        , cm.year
        , cm.month
        , cm.type
    FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
    LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
        ON cm.type = pm.type
        AND cm.year = pm.year
        AND cm.month = pm.month
    WHERE cm.type = 'yellow'
        AND cm.year = '2016'
        AND cm.month = '6'

The following screenshot shows the results of this query.

Review dbt documentation

Complete the following steps to review your documentation:

  1. Generate the documentation for the project:
    $ dbt docs generate --profiles-dir profiles/iceberg
    11:41:51  Running with dbt=1.7.1
    11:41:51  Registered adapter: glue=1.7.1
    11:41:51  Unable to do partial parsing because profile has changed
    11:41:52  Found 3 models, 1 source, 0 exposures, 0 metrics, 478 macros, 0 groups, 0 semantic models
    11:41:52  
    11:41:53  Concurrency: 1 threads (target='dev')
    11:41:53  
    11:41:53  Building catalog
    11:43:32  Catalog written to /Users/username/Documents/workspace/dbt_glue_demo/target/catalog.json

  2. Run the following command to open the documentation in your browser:
    $ dbt docs serve --profiles-dir profiles/iceberg

  3. In the navigation pane, choose gold_cost_metrics under dbt_glue_demo/models/gold.

You can see the detailed view of the model gold_cost_metrics, as shown in the following screenshot.

  4. To see the lineage graph, choose the circle icon at the bottom right.

Clean up

To clean up your environment, complete the following steps:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_glue_demo_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Conclusion

This post demonstrated how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue with the dbt-glue adapter. You learned the end-to-end operations and data flow for data engineers to build and manage a data stack using dbt and the dbt-glue adapter. To report issues or request a feature enhancement, feel free to open an issue on GitHub.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team at Amazon Web Services. He works based in Tokyo, Japan, and is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Benjamin Menuet is a Senior Data Architect on the AWS Professional Services team at Amazon Web Services. He helps customers develop data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some iconic races like the UTMB.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Kinshuk Pahare is a Principal Product Manager on the AWS Glue team at Amazon Web Services.

Jason Ganz is the manager of the Developer Experience (DX) team at dbt Labs.


