The Conductor of the Data Orchestra: End-to-End Pipeline Orchestration with Airflow
We have built our foundation, ingested our data, and transformed it into valuable insights. But these components are like individual musicians in an orchestra; without a conductor, the result is noise, not a symphony. This final technical post in our series is about the conductor: the orchestrator that tells each component when to play its part, ensuring the entire data pipeline runs in perfect harmony.
The "Why": A Pragmatic Approach to Orchestration
Our choice of orchestrator might surprise you. We use Apache Airflow, but not because we think it's a perfect tool. In fact, many data engineers have a love-hate relationship with it. However, a core part of our philosophy is to build platforms that serve our clients for the long term. That means putting our personal preferences and egos aside.
The reality is that Airflow is the most widely used orchestration tool in the world. The community is massive, documentation is abundant, and finding engineers who know how to use it is far easier than for any other tool. When we hand over the keys to a client, we need to be confident that their team, whatever its future seniority or structure, can successfully maintain and extend the stack. Choosing the industry standard is the pragmatic and responsible choice.
Our core philosophy is to keep our Airflow footprint small, simple, and decoupled. We use Airflow as a pure orchestrator, not a data processing framework. This has several key advantages:
- Separation of Concerns: Tools like Meltano and dbt manage their own state. They don't rely on Airflow's metadata, which means Airflow doesn't become a single point of failure for pipeline logic.
- Future-Proofing: By keeping the logic within the tools themselves, our clients can easily migrate to a different orchestrator in the future if they choose.
- Resilience: The orchestrator is no longer a hyper-critical part of the infrastructure. If the Airflow cluster has a problem, we can simply drop it and redeploy it. It's that easy.
The "How": Our Airflow Blueprint in Action
With this philosophy in mind, let's look at how we actually implement Airflow to conduct our Meltano and dbt jobs.
Infrastructure: Self-Hosted and Containerized
We typically deploy Airflow on a client's own cloud infrastructure, using services like AWS ECS for the core services and Fargate/EC2 tasks for running the individual jobs. We manage this entire setup with Terraform, naturally.
We tend to stay away from managed services like Cloud Composer or MWAA: because we don't run heavy logic inside Airflow itself, the managed offerings are often overkill and far more expensive. This self-hosted approach also simplifies upgrades, as we aren't tightly coupled to the managed service's release cycle.
Every change to our `dags` folder in our monorepo triggers a CI/CD pipeline that builds and deploys a new image or, more often, simply copies the new and changed DAG files to the running Airflow instance.
Orchestrating Meltano for Ingestion
We build a simple, reusable component that can execute any Meltano job. Using this, we create one ingestion DAG for each data source. The DAG is incredibly simple; it just runs the Meltano task with the correct parameters for that source.
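To make that concrete, here is a minimal sketch of what such a per-source ingestion DAG could look like, assuming a plain BashOperator invocation of the Meltano CLI. The tap and target names, paths, and DAG ID are placeholders; in practice the reusable component launches the job as a containerized task.

```python
# Hypothetical per-source ingestion DAG that shells out to Meltano.
# The tap/target names and paths are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ingest_mysql",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # simplified; data-aware scheduling is covered later in the post
    catchup=False,
) as dag:
    run_meltano = BashOperator(
        task_id="meltano_run_mysql",
        # Meltano keeps its own state in its metadata database,
        # so Airflow only needs to trigger the job.
        bash_command="cd /opt/meltano && meltano run tap-mysql target-snowflake",
    )
```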
As a best practice, we also create a simple "Meltano State" DAG. This is a helper utility that allows us to read or overwrite the Meltano state saved in our metadata database directly from the Airflow UI. This is invaluable when we need to do a backfill of a specific source. It saves us the trouble of manually connecting to the database and changing records, which is always a risky operation.
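A minimal sketch of such a helper, assuming Meltano's `meltano state` CLI subcommands and a state ID passed in via the DAG run configuration (the DAG ID, path, and default state ID are placeholders):

```python
# Hypothetical helper DAG for inspecting Meltano state from the Airflow UI.
# Trigger it with a config like {"state_id": "..."} to print the stored state
# to the task logs; overwriting a state would use "meltano state set" instead.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="meltano_state_helper",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # manually triggered only
    catchup=False,
    params={"state_id": "dev:tap-mysql-to-target-snowflake"},
) as dag:
    get_state = BashOperator(
        task_id="get_state",
        bash_command="cd /opt/meltano && meltano state get {{ params.state_id }}",
    )
```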
Orchestrating dbt for Transformation
Similarly to Meltano, we have a reusable component for executing `dbt` commands. The key to our dbt orchestration strategy is tags. We use dbt tags to identify, group, and execute related models, which gives us immense flexibility. We generally follow one of two patterns depending on the project's maturity:
- The Monolith DAG (for new projects): When a data model is new or small, the simplest approach is a single dbt DAG. It contains a sequence of tasks that run the entire dbt warehouse in the correct order. You might start with a single `dbt build` command and add more granular steps later.
- The Multi-DAG (for mature projects): For larger, more complex data warehouses, we create multiple dbt DAGs, one for each major tag (which usually represents a business domain like `finance` or `marketing`). This allows for much more granular control and efficiency.
Neither approach is strictly better; it's about choosing the right pattern for the current needs and evolving it over time.
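As a rough sketch of the multi-DAG, tag-based pattern (the project path, DAG ID, and BashOperator call are illustrative; our reusable component actually runs dbt as a containerized task), a per-domain DAG can be as small as this:

```python
# Hypothetical "dbt_finance" DAG: builds only the models tagged "finance".
# dbt's selector syntax ("--select tag:finance") does the grouping; Airflow
# only decides when the group runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_finance",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # wired up to upstream datasets in the next section
    catchup=False,
) as dag:
    dbt_build_finance = BashOperator(
        task_id="dbt_build_finance",
        bash_command="cd /opt/dbt && dbt build --select tag:finance",
    )
```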
Tying it all Together: Data-Aware Scheduling
As the number of DAGs grows, trying to schedule them with simple cron expressions becomes a painful and brittle exercise. We always prefer data-aware scheduling, where a DAG runs only when its upstream data dependencies have been met.
Our common pattern is to have a single "dummy" DAG that runs on a schedule (e.g., every day after midnight). Its only job is to produce a Dataset (Airflow's data-aware scheduling primitive, introduced in Airflow 2.4) that signals the start of a new cycle. All of our Meltano ingestion DAGs are configured to trigger when this dataset is produced.
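A minimal sketch of that trigger DAG, assuming Airflow 2.4+ Datasets (the dataset URI, DAG ID, and cron schedule are placeholders):

```python
# Hypothetical "cycle start" DAG: runs nightly and emits a Dataset that every
# ingestion DAG listens for. It does no real work beyond publishing the dataset.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

NEW_CYCLE = Dataset("pipeline://daily_cycle_start")

with DAG(
    dag_id="daily_cycle_trigger",
    start_date=datetime(2024, 1, 1),
    schedule="0 1 * * *",  # every day shortly after midnight
    catchup=False,
) as dag:
    start_cycle = BashOperator(
        task_id="start_cycle",
        bash_command="echo 'starting a new pipeline cycle'",
        outlets=[NEW_CYCLE],  # marks the dataset as updated when the task succeeds
    )
```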
As each Meltano DAG completes, it produces its own dataset (e.g., `raw_mysql_transactions`). This is where the power of the multi-DAG dbt pattern comes in. The `dbt_finance` DAG, which depends on the `transactions` data, can start running the moment the ingestion is finished, without having to wait for the `dbt_marketing` models whose data might still be ingesting. This creates a highly efficient, event-driven system where the entire warehouse is refreshed as quickly as possible.
Up Next
We've now assembled all the individual components of our blueprint: a solid foundation, scalable ingestion, robust transformation, and reliable orchestration. In our final post, we'll zoom out and look at the full picture, tracing the flow of data from source to insight and summarizing our philosophy for building a truly modern data stack.
Next up: "The Blueprint in Action: An End-to-End Look at Our Modern Data Stack."