Manage data transformations with dbt in Amazon Redshift


Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

Together with price-performance, customers want to manage data transformations (SQL SELECT statements written by data engineers, data analysts, and data scientists) in Amazon Redshift with features including modular programming and data lineage documentation.

dbt (data build tool) is a framework that supports these features and more to manage data transformations in Amazon Redshift. There are two interfaces for dbt:

  • dbt CLI – Available as an open-source project
  • dbt Cloud – A hosted service with added features including an IDE, job scheduling, and more

In this post, we demonstrate some features in dbt that help you manage data transformations in Amazon Redshift. We also provide the dbt CLI and Amazon Redshift workshop to get started using these features.

Manage common logic

dbt enables you to write SQL in a modular fashion. This improves maintainability and productivity because common logic can be consolidated (you maintain a single instance of the logic) and referenced (you build on existing logic instead of starting from scratch).

The following figure is an example showing how dbt consolidates common logic. In this example, two models rely on the same subquery. Instead of replicating the subquery, dbt allows you to create a model for the subquery and reference it later.

Figure 1: Manage common subquery in dbt
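
To make the idea of consolidating and referencing concrete, the following is a minimal sketch in Python, not dbt itself. It assumes a hypothetical users table and three hypothetical models, uses an in-memory SQLite database as a stand-in for Amazon Redshift, and imitates a ref-style placeholder with a simple regular expression. dbt's own compilation and materialization are more involved than this.

```python
import re
import sqlite3

# Hypothetical models: 'active_users' holds the shared subquery once;
# the two downstream models reference it instead of repeating it.
MODELS = {
    "active_users": "SELECT id, country FROM users WHERE active = 1",
    "active_user_count": "SELECT COUNT(*) AS n FROM {{ ref('active_users') }}",
    "active_users_by_country": (
        "SELECT country, COUNT(*) AS n FROM {{ ref('active_users') }} "
        "GROUP BY country"
    ),
}

# Matches placeholders of the form {{ ref('model_name') }}.
REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def compile_model(sql):
    """Replace each ref placeholder with the name of the referenced relation."""
    return REF.sub(lambda match: match.group(1), sql)


def build(conn):
    """Create every model as a view; dict order puts the shared model first."""
    for name, sql in MODELS.items():
        conn.execute(f"CREATE VIEW {name} AS {compile_model(sql)}")


# SQLite stands in for the warehouse so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "SG", 1), (2, "AU", 1), (3, "SG", 0), (4, "SG", 1)],
)
build(conn)

count = conn.execute("SELECT n FROM active_user_count").fetchone()[0]
by_country = conn.execute(
    "SELECT country, n FROM active_users_by_country"
).fetchall()
```

Because both downstream models point at active_users, a change to the filter is made in one place and picked up by both.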

The concept of referencing isn't limited to logic related to subqueries. You can also use referencing for logic related to fields.

The following is an example showing how dbt consolidates common logic related to fields. In this example, a model applies the same case statement on two fields. Instead of replicating the case statement for each field, dbt allows you to create a macro containing the case statement and reference it later.

Figure 2: Manage common case statement in dbt
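
The same idea can be sketched outside dbt. The following Python example is an analogy only: a plain function plays the role of the macro, and the table, column names, and labels are hypothetical. In dbt, the macro would be written in Jinja rather than Python.

```python
import sqlite3


def flag_to_label(column):
    """Hypothetical 'macro': return one CASE expression for the given column."""
    return (
        f"CASE WHEN {column} = 1 THEN 'Yes' "
        f"WHEN {column} = 0 THEN 'No' ELSE 'Unknown' END"
    )


# The CASE logic is written once and applied to two different fields.
query = (
    "SELECT id, "
    f"{flag_to_label('is_subscribed')} AS subscribed, "
    f"{flag_to_label('is_verified')} AS verified "
    "FROM accounts ORDER BY id"
)

# SQLite stands in for the warehouse so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER, is_subscribed INTEGER, is_verified INTEGER)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 0, None)],
)
rows = conn.execute(query).fetchall()
```

If the labels ever change, only flag_to_label needs to be edited, and every field that uses it follows.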

How is a model in dbt subsequently created in Amazon Redshift? dbt provides you with the command dbt run, which materializes models as views or tables in your targeted Amazon Redshift cluster. You can try this out in the dbt CLI and Amazon Redshift workshop.

Manage common data mappings

Although you can use macros to manage data mappings (for example, mapping "1" to "One" and "2" to "Two"), an alternative is to maintain data mappings in files and manage the files in dbt.

The following is an example of how dbt manages common data mappings. In this example, a model applies one-to-one data mappings on a field. Instead of creating a macro for the one-to-one data mappings, dbt allows you to create a seed for the mappings in the form of a CSV file and then reference it later.

Figure 3: Manage common data mapping in dbt

You can create or update a seed with a two-step process. After you create or update a CSV seed file, run the command dbt seed to create the CSV seed as a table in your targeted Amazon Redshift cluster before referencing it.
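
The two steps (load the CSV as a table, then reference it) can be sketched in Python as follows. This is an illustration of the pattern rather than of dbt's implementation: the orders table, the quantity_labels seed name, and the load_seed helper are hypothetical, and an in-memory SQLite database stands in for Amazon Redshift.

```python
import csv
import io
import sqlite3

# Hypothetical seed file content, using the mapping from the example above.
SEED_CSV = "code,label\n1,One\n2,Two\n"


def load_seed(conn, table, csv_text):
    """Step 1: create a table from CSV text (every column is loaded as TEXT)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f"{name} TEXT" for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, quantity_code TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10, "1"), (11, "2"), (12, "1")],
)

load_seed(conn, "quantity_labels", SEED_CSV)

# Step 2: reference the seed table from a query, as a model would.
mapped = conn.execute(
    "SELECT o.order_id, m.label FROM orders o "
    "JOIN quantity_labels m ON m.code = o.quantity_code "
    "ORDER BY o.order_id"
).fetchall()
```

Keeping the mapping in a CSV file means that adding a new code is a one-line change to the file, followed by reloading the seed.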

Manage data lineage documentation

After you have created models and seeds in dbt, and used dbt's referencing capability, dbt provides you with a way to generate documentation for your data transformations.

You can run the command dbt docs generate followed by dbt docs serve to launch a locally hosted website containing documentation on your dbt project. When you choose a model on the locally hosted website, information about the model is displayed, including the columns in the final view or table, the dependencies needed to create the model, and the SQL that is compiled to create the view or table. The following screenshot shows an example of this documentation.

Figure 4: Documentation generated by dbt

You can also visualize dependencies, which makes the documentation easier to navigate during impact analysis. In the following example graph, we can see that the model rpt_tech_all_users is built referencing the model base_public_users, which in turn references the table users in the public schema.

Figure 5: Data lineage visualization generated by dbt
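
A lineage graph like this one can be derived from the references alone. The following Python sketch is a simplified illustration, not how dbt builds its graph: the model names come from the example above, but the SQL bodies are invented, and the two regular expressions are assumptions that only handle this simple case. It requires Python 3.9 or later for graphlib.

```python
import re
from graphlib import TopologicalSorter

# Model names are from the example graph; the SQL bodies are hypothetical.
MODELS = {
    "base_public_users": "SELECT * FROM public.users",
    "rpt_tech_all_users": "SELECT * FROM {{ ref('base_public_users') }}",
}

# Matches {{ ref('model_name') }} placeholders.
REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")
# Matches a schema-qualified table directly after FROM, such as public.users.
TABLE = re.compile(r"FROM\s+(\w+\.\w+)")


def parents(sql):
    """Return the models and tables that a piece of SQL depends on."""
    return set(REF.findall(sql)) | set(TABLE.findall(sql))


# Map each model to its direct upstream dependencies.
lineage = {name: parents(sql) for name, sql in MODELS.items()}

# A valid build order lists every dependency before the models that use it.
build_order = list(TopologicalSorter(lineage).static_order())


def downstream(node):
    """Impact analysis: everything that depends on node, directly or not."""
    affected = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for child, deps in lineage.items():
            if current in deps and child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


impacted = downstream("public.users")
```

Here, downstream("public.users") answers the impact analysis question of which models are affected by a change to the underlying table.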

Conclusion

This post covered how you can use dbt to manage data transformations in Amazon Redshift. As you explore dbt, you will come across other features like hooks, which you can use to manage administrative tasks, for example, continuous granting of privileges.

For a hands-on experience with dbt CLI and Amazon Redshift, we have a workshop with step-by-step instructions to help you create your first dbt project and explore the features mentioned in this post: models, macros, seeds, and hooks. Visit dbt CLI and Amazon Redshift to get started.

If you have any questions or suggestions, leave your feedback in the comments section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.


About the authors

Randy Chng is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services. He works with customers to accelerate their Amazon Redshift journey by delivering proofs of concept on key business problems.

Sean Beath is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services. He delivers proofs of concept with customers on Amazon Redshift, helping customers drive analytics value on AWS.
