
How We Built a Data Intelligence Platform for a Major SF VC from Scratch

We partnered with a software engineering agency to build a full data stack for a leading San Francisco venture capital firm, breaking their dependency on a black-box enrichment vendor and giving them a unified view of people, companies, and signals for the first time.

By Manu Ponsa
Model: Claude Sonnet 4.6 (On Cowork)
#case-study #venture-capital #data-platform #dbt #snowflake #airflow #meltano #identity-resolution #affinity

TL;DR (for the busy execs)

  • Built a full data intelligence platform from scratch for a major San Francisco VC in roughly three months.
  • Broke their dependency on a black-box enrichment vendor that was processing their own paid data providers without visibility or control.
  • Designed a custom identity resolution layer that unifies persons and companies across four data sources (Affinity CRM, Foresight, Signa, and Live Data) into a single, trustworthy entity graph.
  • Delivered the work in partnership with a software engineering agency, running workshops throughout so they could apply the same pattern independently.
  • The result powers a deal flow management tool where investors track signals, make decisions, and push entities directly into their CRM (with Claude connected to Snowflake for natural language queries on top).

Context

The client is a major venture capital firm in San Francisco. Like most established funds, they had a growing set of enrichment providers feeding them data about founders, companies, and signals (job changes, LinkedIn connections, Twitter follows) -- the kinds of early indicators that matter when you're trying to find the next breakout company before anyone else does.

The problem was not a lack of data. It was a lack of control over it.

Their stack before this engagement was built around Coda packs: data wrangling, logic, and reporting all happened inside Coda, with varying degrees of reliability. Underneath that, a vendor called Foresight was doing the heavy lifting: processing signals from Signa and Live Data (providers the client was paying for themselves), consolidating everything, and offloading the result into a shared Snowflake instance.

On paper, that sounds convenient. In practice, it meant the fund had no real visibility into how their own data was being processed, no way to fix enrichment issues without going through vendor support, and (as it turned out) slow and frustrating support at that.

The ask was clear: take ownership of this. Build a real foundation.


How we got involved

We came into this project as data partners to a software engineering agency that had been engaged by the client to build a deal flow management application. The agency had strong product and engineering capabilities but no experience building data platforms. We had both.

Rather than just delivering a data layer and stepping away, the goal from the start was to make the agency genuinely capable in this space. We ran workshops along the way (with the agency team and with the client) covering how data platforms work, what the components are, how to deploy and extend them. That investment paid off: the agency took the same architectural pattern and applied it to another client of theirs. A real win-win.


What we built

The stack follows our standard modern data architecture:

  • Meltano for batch extraction, with a custom tap-affinity to pull CRM data.
  • Real-time webhook ingestion via AWS Lambda + Kinesis Firehose + Snowpipe so Affinity events land in Snowflake within minutes.
  • Snowflake as the central warehouse, with separate databases for raw ingestion, transformations, and analytics.
  • dbt for all transformation and modeling work.
  • Apache Airflow on ECS for orchestration.
  • AWS infrastructure managed with Terraform, with a full CI/CD pipeline on GitHub Actions that deploys only what changed.

The whole thing lives in a monorepo with every component (infrastructure, extraction, transformation, orchestration) version-controlled and reproducible.
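To make the real-time path concrete, here is a simplified sketch of what the Lambda webhook receiver could look like. The stream name, envelope fields, and handler shape are illustrative assumptions, not the production code:

```python
import json
import time

# Hypothetical stream name; the real delivery stream is defined in
# Terraform and its name is an assumption here.
DELIVERY_STREAM = "affinity-webhook-events"

def build_firehose_record(event_body: str) -> dict:
    """Wrap a raw Affinity webhook payload with ingestion metadata.

    Snowpipe loads whatever Firehose writes to S3, so the payload is
    kept intact and only a receive timestamp is added.
    """
    payload = json.loads(event_body)
    envelope = {
        "received_at": int(time.time()),
        "source": "affinity_webhook",
        "payload": payload,
    }
    # Firehose concatenates records; a trailing newline keeps the
    # resulting S3 objects line-delimited JSON for Snowpipe.
    return {"Data": (json.dumps(envelope) + "\n").encode("utf-8")}

def handler(event, context):
    """AWS Lambda entry point (API Gateway proxy event assumed)."""
    import boto3  # available in the Lambda runtime
    firehose = boto3.client("firehose")
    record = build_firehose_record(event["body"])
    firehose.put_record(DeliveryStreamName=DELIVERY_STREAM, Record=record)
    return {"statusCode": 200}
```

Keeping the record-building logic as a pure function makes the hot path trivially testable without AWS credentials.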

But the most interesting and technically challenging part of this project was not the infrastructure. It was the data intelligence model we built on top of it.


The hard part: giving entities a single identity

Here is the core problem. The fund had four data sources that all knew about the same people and companies but used completely different identifiers. Affinity had its own person IDs. Foresight had its own. Signa and Live Data had theirs. The same founder could appear across all four with no common key tying them together.

To build a unified view of any person or company (one that aggregated their signals, their interaction history, their enrichment from every provider) you first had to solve identity resolution.

We designed a clustering system that uses a plug-and-play tier architecture. The idea is simple: normalize the matching keys you already have (LinkedIn URLs, Twitter handles, full names, company domains), define independent matching tiers by confidence level, and resolve conflicts using the highest-confidence match available.

For persons, the priority order is: LinkedIn URL (highest confidence), then full name, then name plus location as a fallback. For companies: company name, then Twitter URL, then LinkedIn URL.
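A stripped-down sketch of this tier-priority clustering, in Python rather than the dbt SQL the platform actually uses (record shapes and the normalization step are simplified for illustration):

```python
# Matching tiers in priority order for persons, per the text:
# LinkedIn URL > full name > name + location. Field names are
# illustrative; the production keys live in the dbt staging models.
PERSON_TIERS = ["linkedin_url", "full_name", "name_location"]

def normalize(value):
    """Minimal key normalization (strip, lowercase); the real pipeline
    also canonicalizes URLs and handles."""
    return value.strip().lower() if value else None

def cluster_persons(records):
    """Assign a cluster_id per record using the highest-confidence
    tier that produces a match. Higher tiers win conflicts."""
    cluster_of = {}      # record id -> cluster id
    next_cluster = 0
    for tier in PERSON_TIERS:
        key_to_cluster = {}
        # Pass 1: register keys of records already resolved by a
        # higher-confidence tier, so lower-tier peers can join them.
        for rec in records:
            key = normalize(rec.get(tier))
            if key is not None and rec["id"] in cluster_of:
                key_to_cluster.setdefault(key, cluster_of[rec["id"]])
        # Pass 2: attach still-unresolved records by this tier's key.
        for rec in records:
            key = normalize(rec.get(tier))
            if key is None or rec["id"] in cluster_of:
                continue
            if key not in key_to_cluster:
                key_to_cluster[key] = next_cluster
                next_cluster += 1
            cluster_of[rec["id"]] = key_to_cluster[key]
    return cluster_of
```

Because tiers are just entries in an ordered list, adding a new matching method is a one-line change, which is the plug-and-play property described above.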

One thing that made this tricky was data quality in the source systems. Foresight had a single LinkedIn URL assigned to 364 different people. Full names like "the team" or "investor relations" appeared dozens of times. Without guardrails, those garbage keys would chain completely unrelated people into the same cluster.

We added size caps at each tier (a LinkedIn URL mapping to more than 10 distinct full names is excluded from clustering entirely; full-name clusters larger than 20 records are dropped) and size guards on cross-tier merges, so no two large clusters can be stitched together through an ambiguous shared key.
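Those guardrails can be sketched like this (thresholds from the text; field names are illustrative, and the production checks live in dbt):

```python
from collections import defaultdict

# Thresholds from the text: a LinkedIn URL shared by more than 10
# distinct full names is treated as garbage, and full-name clusters
# larger than 20 records are dropped.
MAX_NAMES_PER_URL = 10
MAX_FULLNAME_CLUSTER = 20

def garbage_linkedin_urls(records):
    """URLs excluded from clustering because they map to too many
    distinct full names (e.g. one URL assigned to 364 people)."""
    names_per_url = defaultdict(set)
    for rec in records:
        if rec.get("linkedin_url") and rec.get("full_name"):
            names_per_url[rec["linkedin_url"]].add(rec["full_name"])
    return {url for url, names in names_per_url.items()
            if len(names) > MAX_NAMES_PER_URL}

def oversized_fullname_keys(records):
    """Full names shared by too many records ('the team',
    'investor relations', ...) are dropped as matching keys."""
    counts = defaultdict(int)
    for rec in records:
        if rec.get("full_name"):
            counts[rec["full_name"]] += 1
    return {name for name, n in counts.items()
            if n > MAX_FULLNAME_CLUSTER}
```

Excluding the key, rather than trying to repair it, is the conservative choice: a record that fails one tier can still match on a lower-confidence tier.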

The result is an entities model in Snowflake that unifies persons and companies across all four sources into a single record per entity, with a cluster_id that any downstream model can join on, a source_ids object showing every raw ID from every source that maps to that entity, and a status field reflecting the latest investor action taken on them.
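As a rough illustration of that record shape (cluster_id, source_ids, and status are from the description above; the helper and the example values are hypothetical):

```python
def build_entity_row(cluster_id, members, status=None):
    """Collapse the matched source records for one cluster into a
    single entity row. `members` is an iterable of (source, raw_id)
    pairs; `status` reflects the latest investor action, if any."""
    source_ids = {}
    for source, raw_id in members:
        source_ids.setdefault(source, []).append(raw_id)
    return {
        "cluster_id": cluster_id,
        "source_ids": source_ids,  # every raw ID from every source
        "status": status,          # illustrative values: "tracking", "passed"
    }
```

Downstream models never need to know which source a record came from: they join on cluster_id and get the full cross-source picture.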

It is not perfect (we know we will need to extend the matching strategy as the fund adds new sources) but it works reliably for the current scope, and the architecture is designed so adding a new matching method or a new data source requires changing at most two files.


Signals, interactions, and the unified dataset

On top of the entity layer, we built three more production models:

Signals: job changes, LinkedIn connections, and Twitter follows from Signa and Live Data, deduplicated across sources because Foresight re-publishes the same signals it receives from those providers, sometimes with a day's timestamp skew. Each signal is enriched with the full entity record, so consumers get everything they need in one place.

Interactions: emails, meetings, and calls from Affinity CRM, normalized and enriched with entity metadata. The pipeline is incremental, so only new interactions are processed on each run.

Actions: investor decisions logged in Foresight (assignments, outreach, passes, meetings). These drive the status and priority fields on every entity, so the fund always knows where each person or company stands in their workflow.
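The cross-source signal dedup can be sketched as follows (a simplified Python version of logic that actually lives in dbt; the duplicate key and the one-day skew window follow the description above):

```python
from datetime import datetime, timedelta

# Foresight re-publishes provider signals with up to a day's
# timestamp skew, so duplicates are matched within this window.
SKEW = timedelta(days=1)

def dedupe_signals(signals):
    """Keep one copy of each signal that appears in multiple sources.

    Two signals are duplicates when they describe the same event
    (same entity, type, and subject) and their timestamps fall
    within the skew window. Field names are illustrative.
    """
    kept = []
    for sig in sorted(signals, key=lambda s: s["observed_at"]):
        is_dup = any(
            k["entity_id"] == sig["entity_id"]
            and k["signal_type"] == sig["signal_type"]
            and k["subject"] == sig["subject"]
            and abs(k["observed_at"] - sig["observed_at"]) <= SKEW
            for k in kept
        )
        if not is_dup:
            kept.append(sig)
    return kept
```

Sorting by timestamp first means the earliest observation of each event survives, which is usually the one closest to when the signal actually happened.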


Three months, one engineer

The project ran for roughly three months, with a clear progression across them.

The first month was scoping and infrastructure: getting alignment on what we were building, standing up the AWS and Snowflake environments, wiring up the CI/CD pipeline, and pulling Affinity data in for the first time.

The second month was core modeling: staging layers for every source, intermediate transformations, and the first version of the entity clustering logic.

The third month was about refinement: tightening the data intelligence model, handling edge cases in identity resolution, and making sure signals, interactions, and actions were all correctly enriched and deduplicated.

Since the handoff, the client's team has taken over ongoing development. They are already adding new sources (Harmonic is next) and the architecture is designed to absorb that without structural changes. I am still available to help when questions come up, but ownership is clearly with them.


What the investors see

The agency built the client-facing application on top of this data layer. The flow is straightforward: signals come in from the enrichment providers, the data stack processes and enriches them, and the tool surfaces the resulting entities to the investment team. Investors review signals, decide which entities are worth tracking, and push them directly into Affinity CRM to continue the relationship from there.

One thing that made a real difference was connecting Claude to the Snowflake instance. Because the entire data model is well-documented (every model, every field, every relationship), an LLM can navigate the warehouse reliably when an investor asks a question in plain language. That was simply not possible when the logic lived inside Coda packs. The structure and documentation that a proper data layer enforces is exactly what makes AI-powered access work.


Why this mattered

VCs live and die by the quality of their deal flow. The ability to spot the right signal early (before a founder raises, before a company gets competitive attention) is a real edge.

What this project gave the fund was not just better data. It was ownership of their own data. The enrichment providers they were paying for now feed directly into infrastructure they control. The logic that determines how signals are processed is visible, testable, and changeable. The identity resolution that ties it all together is documented and extensible.

That shift (from a vendor-managed black box to a governed, owned data layer) is where the real value lives.


Want to talk through a similar data problem?

If this is close to the kind of work your team needs, request a conversation and tell us what you are trying to solve.

Request a Conversation