Taming the Data Sources: A Scalable Extraction Strategy with Meltano
With our version-controlled infrastructure in place, we now have a solid foundation. The next logical step is to get data flowing into our warehouse. This article focuses on the "E" (Extract) and "L" (Load) in the modern ELT paradigm.
The "Why": The Strategic Shift from ETL to ELT
Years ago, the standard approach was ETL (Extract, Transform, Load). Data was extracted, subjected to a series of complex transformations in flight, and only then loaded into the data warehouse. We, like many in the industry, have shifted to ELT (Extract, Load, Transform). There are two primary reasons for this evolution.
First, the game changed. Cloud data warehouses like BigQuery and Snowflake became immensely powerful, capable of processing massive volumes of data with minimal configuration. The need to pre-process data on complex and costly compute clusters (like AWS EMR) diminished. It became more efficient to load raw data directly and leverage the warehouse's power for transformations.
Second, and perhaps more importantly, is traceability. When you perform 15 transformations on data before it even lands in your warehouse, debugging becomes a nightmare. If an unexpected output appears, retracing the data's journey through a black box of transformation code is incredibly difficult. With ELT, you load the raw, untouched data first. Every transformation step from that point on is typically done in SQL within the warehouse, creating a clear, auditable trail that is far easier to debug and understand. This separation of concerns is a cornerstone of modern data engineering.
The "What": Our Tool of Choice - Meltano & the Singer Standard
For our extraction and loading "Lego block," we choose Meltano, an open-source DataOps platform built around the Singer standard.
Our preference for open source is a core part of our philosophy. In a world with increasing layers of abstraction that make our work "easier" but less transparent, open source solutions empower you. They give you the option to open the door, look under the hood, understand what's happening, and even fix it yourself if needed.
The Meltano/Singer architecture excels at moving data from Point A to Point B. It streams data from a source (a "tap") to a destination (a "target") without writing to disk, maintains a "state" to remember its progress, and is configured primarily in YAML, which makes it accessible even to non-experts. We believe it's a tool that can serve small to large companies for years to come.
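To make that concrete: underneath Meltano, a Singer pipeline is conceptually just two command-line programs joined by a pipe, with Meltano generating the config, catalog, and state files around them. A rough sketch (file names are illustrative):

# A tap reads from the source and writes JSON messages to stdout;
# a target reads them from stdin and writes to the destination.
# Meltano wires this up for you; file names here are illustrative.
tap-mysql --config tap_config.json --catalog catalog.json --state state.json \
  | target-bigquery --config target_config.json \
  > new_state.json   # the target echoes back the state it has safely flushed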
A Dose of Realism: Open Source vs. SaaS
While we are strong advocates for this technology, it's important to be pragmatic. The power of "opening the door" sometimes means you have to open it to fix bugs or add missing features. We've done this many times, contributing to various taps and targets.
This means that, depending on the seniority of your team or the size of your company, it can be a smart business decision to start by paying a SaaS provider to handle data ingestion for you. The trade-off is control vs. convenience, and the right answer depends entirely on your context.
The "How": Ingestion in Practice
Talk is cheap; let's look at some code. With Meltano, you can define an entire extraction pipeline in a simple meltano.yml file.
plugins:
  extractors:
    - name: tap-mysql
      variant: transferwise
      pip_url: pipelinewise-tap-mysql
    - name: tap-mysql-db01-gestion
      inherit_from: tap-mysql
      config:
        database: svc-gestion-prd
      select:
        - "svc-gestion-prd-business_quote.*"
      metadata:
        "svc-gestion-prd-business_quote":
          replication-method: INCREMENTAL
          replication-key: created_at
          key_properties: ["id"]
  loaders:
    - name: target-bigquery
      variant: z3z1ma
      pip_url: git+https://github.com/z3z1ma/target-bigquery.git
      config:
        dataset: raw
        denormalized: true
        upsert: true
With this YAML, a couple of environment variables for credentials, and a simple command (meltano el tap-mysql-db01-gestion target-bigquery), you are moving data. This file defines a base MySQL tap, inherits from it to connect to a specific production database, selects a table, and tells Meltano to replicate it incrementally. It's configuration as code for data ingestion.
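The credentials themselves stay out of the YAML. Meltano resolves plugin settings from environment variables named after the plugin, so a run typically looks something like this (the exact setting names depend on the tap and target variants and are shown here for illustration only):

# Illustrative only: setting names depend on the tap/target variants you use.
# Meltano maps a plugin setting to an env var named <PLUGIN_NAME>_<SETTING_NAME>,
# uppercased, with dashes turned into underscores.
export TAP_MYSQL_DB01_GESTION_HOST="db01.internal"
export TAP_MYSQL_DB01_GESTION_USER="replication_user"
export TAP_MYSQL_DB01_GESTION_PASSWORD="********"
export TARGET_BIGQUERY_PROJECT="my-gcp-project"              # assumed setting name
export TARGET_BIGQUERY_CREDENTIALS_PATH="/secrets/bq.json"   # assumed setting name

meltano el tap-mysql-db01-gestion target-bigquery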
Best Practices: Containerization and Wrappers
Of course, we adopt a couple of best practices. First, we containerize the Meltano project in a Docker image. This eliminates "it works on my machine" issues and makes our extraction jobs portable, ready to be plugged into any orchestrator.
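As a rough sketch of what that looks like in practice (the image name is hypothetical, and we assume the entrypoint forwards its arguments to the meltano CLI, as the wrapper script below does):

# Build the Meltano project image (image name is hypothetical)
docker build -t meltano-ingest:latest .

# Run a one-off extraction; credentials are passed in as environment variables
docker run --rm \
  -e TAP_MYSQL_DB01_GESTION_PASSWORD \
  -e BQ_CREDENTIALS_B64 \
  meltano-ingest:latest el tap-mysql-db01-gestion target-bigquery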
Containerization also allows for a neat trick: using a wrapper script. We often use a simple .sh script as the container's entrypoint. This script can handle pre-flight logic before executing the meltano el command. A common use case is managing credentials. For example, a BigQuery target requires a path to a JSON credentials file. We can inject the contents of that JSON as a base64-encoded environment variable, and the wrapper script will decode it, write it to a file inside the container, and set the correct file path variable for the target to use.
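A minimal sketch of such an entrypoint, using hypothetical variable names (BQ_CREDENTIALS_B64 for the injected key, and a target setting assumed to be called credentials_path):

#!/usr/bin/env bash
set -euo pipefail

# If a base64-encoded service account key was injected, write it to a file
# and point the target's (assumed) credentials_path setting at it.
if [ -n "${BQ_CREDENTIALS_B64:-}" ]; then
  echo "${BQ_CREDENTIALS_B64}" | base64 -d > /tmp/bigquery-credentials.json
  export TARGET_BIGQUERY_CREDENTIALS_PATH=/tmp/bigquery-credentials.json
fi

# Forward all arguments to the meltano CLI,
# e.g. "el tap-mysql-db01-gestion target-bigquery"
exec meltano "$@"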
State Management Explained
So, how does Meltano know not to re-sync the entire database every time? Through state management.
The process is simple and elegant. As the tap streams data, it periodically emits a special STATE message (a JSON blob that acts as a bookmark). Meltano captures this state message and saves it. The next time the job runs, Meltano passes the last saved state back to the tap. The tap then knows exactly where to resume, for example: "select all records where created_at is greater than the timestamp in this state message."
Meltano can store this state in various backends, like an S3 bucket. We, however, prefer to use a database backend, often reusing the same metadata database that our orchestrator already requires. It's a pragmatic choice that leverages existing infrastructure.
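For illustration, pointing Meltano's system database at an existing Postgres instance and inspecting the stored state can look like this (the connection string, state ID format, and bookmark shape are indicative rather than exact):

# Reuse an existing Postgres metadata database as Meltano's system database,
# which is where run state is stored (connection string is illustrative).
export MELTANO_DATABASE_URI="postgresql://meltano:******@metadata-db:5432/meltano"

# List known state IDs and inspect one of them
meltano state list
meltano state get dev:tap-mysql-db01-gestion-to-target-bigquery
# Roughly: {"singer_state": {"bookmarks": {"svc-gestion-prd-business_quote":
#   {"replication_key": "created_at", "replication_key_value": "2024-05-01T09:30:00"}}}}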
Up Next
We've established a solid, repeatable foundation and now have a scalable, observable process for extracting data from our sources and loading it into our data warehouse. The raw ingredients are in the kitchen. In our next post, we'll tackle the "T" in ELT: From Raw to Ready: Transforming Data with the Power of dbt.