
Blueprint for a Solid Foundation: Managing Your Data Stack with Terraform

Learn why Infrastructure as Code (IaC) is crucial for a modern data stack and how we use Terraform to manage our platform consistently and reliably.

By Marco Porracin
Model: Gemini 2.5 Pro Preview 06-05 (On Cursor)
Tags: modern data stack, terraform, iac, infrastructure, data engineering, data-warehouse

A Note on Culture Before We Begin: A modern data stack will not solve all your company's problems. In fact, it won't even solve all your data problems. We believe powerful technology must go hand-in-hand with a strong data culture. That's why we work with our clients on all fronts, helping them build not just the right stack, but also the mindset and practices to make data truly transformative. We'll likely talk more about this in a future post.

With the "why" of our series established, it's time to get our hands dirty. Every robust house needs a solid foundation, and in the world of data platforms, that foundation is infrastructure: how it's defined, provisioned, and maintained. This is where we begin.

The benefits of Infrastructure as Code (IaC) are likely no surprise to anyone in a modern tech organization. Consistency, repeatability, and version control have revolutionized how we manage servers and services. However, we often see a critical component left out of this paradigm: the data warehouse itself.

That's why we bring the same IaC principles our DevOps colleagues have enjoyed for years directly into the data engineering workflow. Managing our data warehouse's roles, schemas, and grants as code is crucial. It provides a clear, auditable trail of how everything fits together and ensures our security model is as robust and version-controlled as our application code.

Our Tool of Choice: A Practical Approach

We use Terraform to manage our infrastructure. Why? Honestly, because we like it, we understand it, and it gets the job done.

While there are other excellent tools like Pulumi or native solutions like AWS CloudFormation, we believe the specific tool is less important than the principle. The best tool is the one your team understands and can use effectively.

A word of caution: don't overdo it. It's easy to get caught up in complex frameworks like Terragrunt or Terraspace, but for most projects, they are unnecessary overhead. Start simple and add complexity only when you have a clear and present need for it.

Structuring for Success: The Monorepo Blueprint

To understand how we structure our Terraform code, you first need to understand our repository structure. We are firm believers in the monorepo approach for a data stack, which provides a single source of truth for the entire platform. We'll save the full debate on the pros and cons of monorepos for a future post.

The key takeaway is that this structure serves as a reusable skeleton. It allows us to think of each core component (ingestion, transformation, orchestration) as a swappable "Lego block." This modularity is fundamental to our blueprint, as it enables us to adapt the stack to specific client needs without reinventing the wheel.

Our monorepos are typically organized into four main directories, representing the core components of the stack:

  • extraction/: Houses the project for our chosen data ingestion tool.
  • transformation/ (or dwh/): Contains the project for our data transformation framework.
  • orchestration/ (or scheduler/): Holds the project for our pipeline orchestrator.
  • infrastructure/: The Terraform project that defines and manages the entire stack.

This structure is mirrored within our infrastructure directory:

.
├── extraction/
├── transformation/
├── orchestration/
└── infrastructure/
    ├── common/
    ├── extraction/
    ├── transformation/
    ├── orchestration/
    └── data_warehouse/

This parallel structure makes it incredibly intuitive to find the infrastructure code related to a specific component. For example, infrastructure/extraction/ contains the Terraform definitions for the resources our chosen ingestion tool needs to run, such as its AWS ECS task definition, container repository (ECR), and IAM roles.
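
To make that concrete, here is a minimal sketch of what infrastructure/extraction/ might contain. It assumes AWS with ECS on Fargate, and every name in it (repository, role, task family) is a placeholder rather than our actual configuration.

# infrastructure/extraction/main.tf -- illustrative sketch only

# Container repository for the ingestion tool's Docker image
resource "aws_ecr_repository" "extraction" {
  name = "data-stack/extraction" # placeholder name
}

# IAM role the ECS task assumes at runtime
resource "aws_iam_role" "extraction_task" {
  name = "extraction-task-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# ECS task definition that runs the ingestion container on Fargate
resource "aws_ecs_task_definition" "extraction" {
  family                   = "extraction"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.extraction_task.arn
  container_definitions = jsonencode([{
    name  = "extraction"
    image = "${aws_ecr_repository.extraction.repository_url}:latest"
  }])
}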

The two special directories are:

  • common/: For shared resources used by multiple components, like S3 buckets, the metadata database for our orchestrator and ingestion tool, and secrets management configurations (a short sketch follows this list).
  • data_warehouse/: Dedicated to managing all objects within our data warehouse.
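
As a rough illustration of that common/ directory, the shared resources might look something like this. Again, this is a hedged sketch assuming AWS; every identifier is a placeholder.

# infrastructure/common/main.tf -- illustrative sketch only

variable "metadata_db_password" {
  type      = string
  sensitive = true
}

# Shared S3 bucket used as a landing/staging area by several components
resource "aws_s3_bucket" "data_lake" {
  bucket = "example-company-data-lake" # placeholder name
}

# Metadata database shared by the orchestrator and the ingestion tool
resource "aws_db_instance" "metadata" {
  identifier          = "stack-metadata"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "metadata_admin"
  password            = var.metadata_db_password # injected, never hardcoded
  skip_final_snapshot = true
}

# Secret that other components read shared credentials from
resource "aws_secretsmanager_secret" "warehouse_credentials" {
  name = "data-stack/warehouse-credentials" # placeholder name
}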

Managing the Data Warehouse with Terraform: A Practical Example

This is where the power of IaC in a data context truly shines. By managing the data warehouse as code, we gain several key advantages:

  • Declarative Management: The entire data warehouse environment (schemas, engines, roles, users, and permissions) is defined in code. This enables version control, peer review, and automated, repeatable deployments.
  • Modular Structure: We organize Terraform files by resource type (e.g., databases.tf, roles.tf, warehouses.tf, grants.tf), making the codebase clean, scalable, and easy to maintain.
  • Robust Security Model: We implement a strong Role-Based Access Control (RBAC) model directly in code, ensuring access is managed systematically and securely.
  • Secure Credential Handling: We use key-pair authentication and integration with services like AWS Secrets Manager, eliminating hardcoded passwords and protecting our environment.

Here is a simple, anonymized code snippet that demonstrates this pattern using the Snowflake provider. The same declarative principles apply whether you are managing Snowflake, BigQuery, or another platform. You would simply use that platform's specific Terraform provider and resources.

# main.tf

# Configure the Snowflake provider
provider "snowflake" {
  # The account, username, and private_key_path would typically be
  # passed in as variables or configured through environment variables
  # for better security and flexibility.
  account          = var.snowflake_account
  username         = var.snowflake_username
  private_key_path = var.snowflake_private_key_path
}

# Define a virtual warehouse for data processing
resource "snowflake_warehouse" "analyst_warehouse" {
  name           = "ANALYST_WH"
  warehouse_size = "X-SMALL"
  auto_suspend   = 60 # Suspend after 60 seconds of inactivity
  auto_resume    = true
  comment        = "Warehouse for the data analytics team."
}

# Define a role for data analysts
resource "snowflake_role" "analyst_role" {
  name    = "ANALYST"
  comment = "Role for data analysts with read-only access."
}

# Grant the new role permissions to use the warehouse
resource "snowflake_grant_privileges_to_role" "analyst_wh_grant" {
  privileges = ["USAGE"]
  role_name  = snowflake_role.analyst_role.name
  on_account_object {
    object_type = "WAREHOUSE"
    object_name = snowflake_warehouse.analyst_warehouse.name
  }
}

This declarative approach is far more reliable and auditable than a series of manual clicks in a UI.
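
One more sketch, tied to the credential-handling point above: instead of pointing the provider at a key file on disk, the private key can be read from AWS Secrets Manager at plan time. This assumes the key is stored as a plain string in a secret (the secret name below is a placeholder) and a provider version that accepts the key contents via private_key.

# providers.tf -- illustrative sketch only

# Read the key-pair private key from AWS Secrets Manager
# (assumes the AWS provider is configured elsewhere in the project)
data "aws_secretsmanager_secret_version" "snowflake_key" {
  secret_id = "snowflake/terraform-private-key" # placeholder secret name
}

# Pass the key contents to the Snowflake provider instead of a file path
provider "snowflake" {
  account     = var.snowflake_account
  username    = var.snowflake_username
  private_key = data.aws_secretsmanager_secret_version.snowflake_key.secret_string
}

Either way, no credentials land in the repository. Keep in mind that Terraform state can still contain sensitive values, so the state backend needs to be protected as well.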

A Note on CI/CD

While not strictly part of the infrastructure/ folder, CI/CD pipelines are deeply related. We adapt to our clients' preferred tools, whether it's GitHub Actions, Jenkins, or GitLab CI.

The key is to implement conditional logic within the pipeline. When a change is pushed to the main branch, the pipeline inspects which directories have changed. A change that touches only the extraction/ directory should trigger only a build and push of the data ingestion tool's Docker image; it doesn't need to run transformation tests or redeploy the orchestrator. This keeps our pipelines efficient and fast.

Here’s what this pipeline graph looks like in a GitHub Actions workflow from one of our actual client repositories. It might look complex with all the arrows, but that’s the conditional logic in action, ensuring only the relevant jobs run based on the specific files that were changed.

[Figure: GitHub Actions workflow graph]

Up Next

With a solid, automated, and version-controlled foundation in place, we're ready to start populating our data warehouse. In the next chapter, we'll dive deep into the "E" and "L" of ELT: Data Ingestion.
