From Raw to Ready: Transforming Data with the Power of dbt

Learn how dbt revolutionizes the data transformation process by bringing software engineering best practices like testing, documentation, and version control to analytics.

By Marco Porracin
Tags: modern data stack, dbt, data transformation, analytics engineering, sql

In our previous posts, we laid our infrastructure foundation with Terraform and established a scalable data ingestion process with Meltano. The raw ingredients are now in our data warehouse, but raw data is rarely useful on its own. It needs to be cleaned, structured, and enriched with business logic to provide real value. This is where we tackle the "T" in ELT: Transformation.

The "Why": The dbt Revolution

As we discussed, the move from ETL to ELT became the obvious approach with the rise of powerful cloud data warehouses. It's more efficient and traceable to load raw data first and then perform all transformations directly within the warehouse. The question then became: how do we manage this transformation layer effectively?

For years, transformations were often a tangled web of complex SQL scripts or brittle Python code. Then came dbt (data build tool), which fundamentally changed the game by bringing modern software engineering best practices to the world of analytics.

On top of that, a transformation layer written in SQL is generally easier to maintain than one written in pure Python. This shift reshaped the data engineer role itself and opened the gates for what is now known as the "Analytics Engineer," a role that lives at the intersection of data engineering and business analysis. What we love most, though, is that SQL is simple. So simple that even business owners can dive into the logic, understand it, challenge it, or even propose new models. dbt democratizes data transformation.

The "What": Our Blueprint for dbt Modeling

A dbt project is more than a collection of scripts; it's a structured, documented, and tested representation of your business logic. We follow a multi-layered modeling approach that turns raw, messy data into clean, reliable, analytics-ready datasets.

1. The Foundation: Staging Layer (stg_)

The journey begins at the staging layer. Once raw data is loaded into the warehouse, the very first step is to create corresponding staging models. We build these as incremental models, so each run only processes new or changed records rather than the full history.

The purpose of this layer is not to apply complex business logic, but to perform critical, foundational cleanup and enforce consistency. This includes:

  • Casting data types: Ensuring a timestamp is a timestamp, a number is a number, etc.
  • Enforcing naming conventions: This is a crucial step. We standardize all column names so that from this point forward, the data language is consistent across the entire warehouse. For example:
    • id becomes <object>_id (e.g., transaction_id, card_id).
    • Timestamps use an _at suffix (e.g., created_at).
    • Dates use an _at_dt suffix (e.g., created_at_dt).
    • Booleans are prefixed with has_ or is_ (e.g., is_active).

This layer ensures that anyone querying data downstream doesn't have to guess if the user ID is called id, user_id, or customer_id.
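
To make this concrete, here is a minimal sketch of what a staging model might look like, assuming a hypothetical stg_raw_transactions model that mirrors the raw table loaded by Meltano (the column names and the incremental filter are illustrative, not our exact schema):

```sql
-- models/staging/stg_transactions.sql (illustrative)
{{ config(materialized='incremental', unique_key='transaction_id') }}

select
    cast(id as varchar)            as transaction_id,  -- <object>_id convention
    cast(card as varchar)          as card_id,
    cast(status as varchar)        as status,
    cast(amount as numeric(18, 2)) as amount,
    cast(created as timestamp)     as created_at,      -- _at suffix for timestamps
    cast(created as date)          as created_at_dt,   -- _at_dt suffix for dates
    cast(active as boolean)        as is_active        -- is_ prefix for booleans
from {{ ref('stg_raw_transactions') }}

{% if is_incremental() %}
-- on incremental runs, only process rows newer than what is already in the table
where cast(created as timestamp) > (select max(created_at) from {{ this }})
{% endif %}
```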

2. The Brains: Intermediate Layer (int_)

With a clean and consistent foundation, we can start applying business logic in the intermediate layer. This is where the core logic of your business is encoded.

  • How is revenue calculated?
  • How do we define an "active user"?
  • Which transactions are considered successful?

These definitions are built here as intermediate models. The guiding philosophy for this layer is DRY (Don't Repeat Yourself). If you find yourself writing the same CASE statement or JOIN logic in multiple places, it's a sign that it should be extracted into its own intermediate model. This ensures that when a business definition changes, you only need to update it in one place, and the change propagates everywhere.
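
As an illustration, suppose the definition of a "successful transaction" lives in exactly one intermediate model; downstream marts and reports then reuse it instead of re-implementing the CASE statement (the statuses and names below are invented):

```sql
-- models/intermediate/int_transactions_classified.sql (illustrative)
select
    transaction_id,
    card_id,
    amount,
    created_at,
    -- single source of truth for what "successful" means;
    -- change it here and every downstream model picks up the new definition
    case
        when status in ('settled', 'captured') then true
        else false
    end as is_successful
from {{ ref('stg_transactions') }}
```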

3. The Storefront: Analytics Layer (dim_, fct_, and Marts)

With all the business logic defined, we build our analytics layer. This is where the data is shaped for its final purpose: consumption by analysts and business intelligence tools. This layer itself has a couple of levels.

  • First, we build our foundational Kimball models:
    • Dimension (dim_) models: represent the core entities of the business, such as dim_users, dim_cards, or dim_products.
    • Fact (fct_) models: represent events or transactions, like fct_transactions or fct_daily_balances.
  • Marts: On top of the dimension and fact models, we build our final marts. These are often wide, denormalized tables composed of many joins between various dim_ and fct_ models. They are designed to be "big and complete boards" that are quick and easy to use for visualization tools and complex analysis, because all the heavy joining is already done. These models often don't have a prefix and might live in their own dedicated database schema for clarity (a sketch of a simple mart follows this list).
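
A simple mart might then denormalize a fact against its dimensions, roughly like this (the model and column names are illustrative):

```sql
-- models/marts/card_transactions.sql (illustrative)
select
    f.transaction_id,
    f.amount,
    f.created_at_dt,
    f.is_successful,
    u.user_id,
    u.country,
    c.card_id,
    c.card_type
from {{ ref('fct_transactions') }} as f
left join {{ ref('dim_users') }} as u on f.user_id = u.user_id
left join {{ ref('dim_cards') }} as c on f.card_id = c.card_id
```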

The "How": Our dbt Workflow in Action

Theory is great, but let's look at some of the practical ways we leverage dbt's power.

A Note on Sources: Everything is a Model

A common dbt pattern is to declare raw tables in a sources.yml file. We take a slightly different approach: we treat everything as a model from the very beginning. Every raw table ingested by Meltano has a corresponding stg_raw_*.sql model. This allows us to apply the full power of dbt (like tests and documentation) to the absolute source of truth in our warehouse.
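
In practice, such a raw-facing model can be as thin as a select over the table Meltano landed in the warehouse; the schema and table names below are placeholders:

```sql
-- models/staging/raw/stg_raw_transactions.sql (illustrative)
-- a thin wrapper over the raw table, so tests and documentation can attach to it
select *
from raw.payments.transactions
```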

Writing DRY Code with Macros

Macros in dbt are like functions in a programming language. They allow you to write reusable pieces of SQL, which is incredibly powerful for keeping your project clean and maintainable. We use them extensively:

  • PII Masking: A common requirement is to restrict access to Personally Identifiable Information (PII). We've built macros that can dynamically mask a column (e.g., show a hashed value) unless the user querying the data has a specific role in the warehouse.
  • Maintenance & Cleanup: We have macros that help us identify and clean up unused models or old sandbox environments, which is great for keeping the project tidy and managing warehouse costs.
  • Complex Mappers: In large fintechs, understanding the dozens of transaction types and subtypes can be complex. We build mapper macros that contain this logic, so if we need to classify a transaction in multiple models, we can call the macro instead of duplicating the logic everywhere (see the sketch after this list).
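
A mapper macro of this kind might look roughly like the following; the transaction types and categories are invented for illustration:

```sql
-- macros/map_transaction_category.sql (illustrative)
{% macro map_transaction_category(type_column) %}
    case
        when {{ type_column }} in ('card_purchase', 'card_refund') then 'card'
        when {{ type_column }} in ('wire_in', 'wire_out') then 'wire'
        when {{ type_column }} in ('fee', 'interest') then 'internal'
        else 'other'
    end
{% endmacro %}
```

Any model can then put {{ map_transaction_category('transaction_type') }} in its select list, so the classification logic lives in exactly one place.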

Ensuring Data Quality: "Fail Fast, Shift Left"

dbt's built-in testing is one of its most powerful features. Our philosophy on testing is simple: fail fast and shift the test as far left as possible.

This means we apply tests at the earliest possible stage, often on the stg_raw models themselves. If bad data is coming from the source, we want to know immediately, not after it has already propagated through the entire warehouse and corrupted our dashboards. This "shift left" approach to data testing saves countless hours of debugging and ensures that trust in the data remains high.
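
In dbt terms, that usually means declaring generic tests right next to the raw-facing models; the model and column names here are placeholders:

```yaml
# models/staging/raw/_stg_raw__models.yml (illustrative)
version: 2

models:
  - name: stg_raw_transactions
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'settled', 'captured', 'failed']
```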

Development and CI/CD

Our developers work in isolated sandbox environments, developing against production data without affecting it. When their work is ready, they open a Pull Request to the main branch of our monorepo. This automatically kicks off a CI/CD pipeline that builds and tests only the models that have changed, ensuring that no bad code makes it into production.
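
One common way to get "only the changed models" behaviour is dbt's state-based selection, comparing the PR's code against the artifacts from the last production run; the paths and flags below sketch one possible setup rather than our exact pipeline:

```bash
# illustrative CI step: build and test modified models and their descendants,
# deferring references to unbuilt upstream models to the production environment
dbt build --select state:modified+ --defer --state ./prod-artifacts
```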

Up Next

We now have a robust, tested, and well-documented transformation pipeline. We can reliably turn raw data into valuable insights. But how do we run these ingestion and transformation jobs on a schedule and handle dependencies between them? That's where orchestration comes in.

Next up: "The Conductor of the Data Orchestra: End-to-End Pipeline Orchestration with Airflow."
