Taming the Data Sources: A Scalable Extraction Strategy with Meltano
With our version-controlled infrastructure in place, we now have a solid foundation. The next logical step is to get data flowing into our warehouse. This article focuses on the "E" (Extract) and "L" (Load) in the modern ELT paradigm.
The "Why": The Strategic Shift from ETL to ELT
Years ago, the standard approach was ETL (Extract, Transform, Load). Data was extracted, subjected to a series of complex transformations in flight, and only then loaded into the data warehouse. We, like many in the industry, have shifted to ELT (Extract, Load, Transform). There are two primary reasons for this evolution.
First, the game changed. Cloud data warehouses like BigQuery and Snowflake became immensely powerful, capable of processing massive volumes of data with minimal configuration. The need to pre-process data on complex and costly compute clusters (like AWS EMR) diminished. It became more efficient to load raw data directly and leverage the warehouse's power for transformations.
Second, and perhaps more importantly, is traceability. When you perform 15 transformations on data before it even lands in your warehouse, debugging becomes a nightmare. If an unexpected output appears, retracing the data's journey through a black box of transformation code is incredibly difficult. With ELT, you load the raw, untouched data first. Every transformation step from that point on is typically done in SQL within the warehouse, creating a clear, auditable trail that is far easier to debug and understand. This separation of concerns is a cornerstone of modern data engineering.
The "What": Our Tool of Choice - Meltano & the Singer Standard
For our extraction and loading "Lego block," we choose Meltano, an open-source DataOps platform built around the Singer standard.
Our preference for open source is a core part of our philosophy. In a world with increasing layers of abstraction that make our work "easier" but less transparent, open source solutions empower you. They give you the option to open the door, look under the hood, understand what's happening, and even fix it yourself if needed.
The Meltano/Singer architecture excels at moving data from Point A to Point B. It streams data from a source (a "tap") to a destination (a "target") without writing to disk, maintains a "state" to remember its progress, and is configured primarily in YAML, which makes it accessible even to non-experts. We believe it's a tool that can serve small to large companies for years to come.
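To make that concrete: underneath Meltano, a Singer pipeline is conceptually just two command-line programs joined by a pipe, with Meltano generating the config, catalog, and state files around them. A rough sketch (file names are illustrative):

# A tap reads from the source and writes JSON messages to stdout;
# a target reads them from stdin and writes to the destination.
# Meltano wires this up for you; file names here are illustrative.
tap-mysql --config tap_config.json --catalog catalog.json --state state.json \
  | target-bigquery --config target_config.json \
  > new_state.json   # the target echoes back the state it has safely flushed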
A Dose of Realism: Open Source vs. SaaS
While we are strong advocates for this technology, it's important to be pragmatic. The power of "opening the door" sometimes means you have to open it to fix bugs or add missing features. We've done this many times, contributing to various taps and targets.
This means that, depending on the seniority of your team or the size of your company, it can be a smart business decision to start by paying a SaaS provider to handle data ingestion for you. The trade-off is control vs. convenience, and the right answer depends entirely on your context.
The "How": Ingestion in Practice
Talk is cheap; let's look at some code. With Meltano, you can define an entire extraction pipeline in a simple meltano.yml file.
plugins:
  extractors:
    - name: tap-mysql
      variant: transferwise
      pip_url: pipelinewise-tap-mysql
    - name: tap-mysql-db01-gestion
      inherit_from: tap-mysql
      config:
        database: svc-gestion-prd
      select:
        - "svc-gestion-prd-business_quote.*"
      metadata:
        "svc-gestion-prd-business_quote":
          replication-method: INCREMENTAL
          replication-key: created_at
          key_properties: ["id"]
  loaders:
    - name: target-bigquery
      variant: z3z1ma
      pip_url: git+https://github.com/z3z1ma/target-bigquery.git
      config:
        dataset: raw
        denormalized: true
        upsert: true
With this YAML, a couple of environment variables for credentials, and a simple command (meltano el tap-mysql-db01-gestion target-bigquery), you are moving data. This file defines a base MySQL tap, inherits from it to connect to a specific production database, selects a table, and tells Meltano to replicate it incrementally. It's configuration as code for data ingestion.
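The credentials themselves stay out of the YAML. Meltano resolves plugin settings from environment variables named after the plugin, so a run typically looks something like this (the exact setting names depend on the tap and target variants and are shown here for illustration only):

# Illustrative only: setting names depend on the tap/target variants you use.
# Meltano maps a plugin setting to an env var named <PLUGIN_NAME>_<SETTING_NAME>,
# uppercased, with dashes turned into underscores.
export TAP_MYSQL_DB01_GESTION_HOST="db01.internal"
export TAP_MYSQL_DB01_GESTION_USER="replication_user"
export TAP_MYSQL_DB01_GESTION_PASSWORD="********"
export TARGET_BIGQUERY_PROJECT="my-gcp-project"              # assumed setting name
export TARGET_BIGQUERY_CREDENTIALS_PATH="/secrets/bq.json"   # assumed setting name

meltano el tap-mysql-db01-gestion target-bigquery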
Best Practices: Containerization and Wrappers
Of course, we adopt a couple of best practices. First, we containerize the Meltano project in a Docker image. This eliminates "it works on my machine" issues and makes our extraction jobs portable, ready to be plugged into any orchestrator.
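As a rough sketch of what that looks like in practice (the image name is hypothetical, and we assume the entrypoint forwards its arguments to the meltano CLI, as the wrapper script below does):

# Build the Meltano project image (image name is hypothetical)
docker build -t meltano-ingest:latest .

# Run a one-off extraction; credentials are passed in as environment variables
docker run --rm \
  -e TAP_MYSQL_DB01_GESTION_PASSWORD \
  -e BQ_CREDENTIALS_B64 \
  meltano-ingest:latest el tap-mysql-db01-gestion target-bigquery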
Containerization also allows for a neat trick: using a wrapper script. We often use a simple .sh script as the container's entrypoint. This script can handle pre-flight logic before executing the meltano el command. A common use case is managing credentials. For example, a BigQuery target requires a path to a JSON credentials file. We can inject the contents of that JSON as a base64-encoded environment variable, and the wrapper script will decode it, write it to a file inside the container, and set the correct file path variable for the target to use.
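A minimal sketch of such an entrypoint, using hypothetical variable names (BQ_CREDENTIALS_B64 for the injected key, and a target setting assumed to be called credentials_path):

#!/usr/bin/env bash
set -euo pipefail

# If a base64-encoded service account key was injected, write it to a file
# and point the target's (assumed) credentials_path setting at it.
if [ -n "${BQ_CREDENTIALS_B64:-}" ]; then
  echo "${BQ_CREDENTIALS_B64}" | base64 -d > /tmp/bigquery-credentials.json
  export TARGET_BIGQUERY_CREDENTIALS_PATH=/tmp/bigquery-credentials.json
fi

# Forward all arguments to the meltano CLI,
# e.g. "el tap-mysql-db01-gestion target-bigquery"
exec meltano "$@"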
State Management Explained
So, how does Meltano know not to re-sync the entire database every time? Through state management.
The process is simple and elegant. As the tap streams data, it periodically emits a special STATE message (a JSON blob that acts as a bookmark). Meltano captures this state message and saves it. The next time the job runs, Meltano passes the last saved state back to the tap. The tap then knows exactly where to resume, for example: "select all records where created_at is greater than the timestamp in this state message."
Meltano can store this state in various backends, like an S3 bucket. We, however, prefer to use a database backend, often reusing the same metadata database that our orchestrator already requires. It's a pragmatic choice that leverages existing infrastructure.
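For illustration, pointing Meltano's system database at an existing Postgres instance and inspecting the stored state can look like this (the connection string, state ID format, and bookmark shape are indicative rather than exact):

# Reuse an existing Postgres metadata database as Meltano's system database,
# which is where run state is stored (connection string is illustrative).
export MELTANO_DATABASE_URI="postgresql://meltano:******@metadata-db:5432/meltano"

# List known state IDs and inspect one of them
meltano state list
meltano state get dev:tap-mysql-db01-gestion-to-target-bigquery
# Roughly: {"singer_state": {"bookmarks": {"svc-gestion-prd-business_quote":
#   {"replication_key": "created_at", "replication_key_value": "2024-05-01T09:30:00"}}}}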
Up Next
We've established a solid, repeatable foundation and now have a scalable, observable process for extracting data from our sources and loading it into our data warehouse. The raw ingredients are in the kitchen. In our next post, we'll tackle the "T" in ELT: From Raw to Ready: Transforming Data with the Power of dbt.