Data Engineering After AI

Feb 24

Moving Data Was Never the Point. Meaning It Is.

4 Comments

Enjoyed this blog and love the direction. But I still see the ownership problem — someone still needs to validate the Context Store, and it's the same domain experts who won't own their data today.

The most valuable business context lives in people's heads, not in schemas or data profiles. In the SAP world, a field like PRCTR behaves differently across company codes. Certain document types in ACDOCA need to be excluded for specific reporting scenarios. That knowledge is tacit — it came from years of working with the system, not from profiling the data. How do you capture that in a contract or a schema?

This is going to be very challenging, and I'd love to see how AI evolves to solve it.
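One possible way to start on the question above is to write each tacit rule down as a small, reviewable record with a named validator, and to apply it only once someone has signed off. The sketch below is an illustration, not a description of the article's Context Store. The `ExclusionRule` type, the `margin_report` scenario, the document-type codes, and the validator name are all hypothetical, and the column names `RBUKRS` (company code) and `BLART` (document type) are assumed to match your ACDOCA extract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExclusionRule:
    """One piece of tacit knowledge, written down so it can be reviewed."""
    scenario: str                  # reporting scenario the rule applies to
    company_code: str              # company code the rule is scoped to
    excluded_doc_types: frozenset  # document types to drop for this scenario
    rationale: str                 # why the rule exists, in the expert's words
    validated_by: Optional[str] = None  # stays None until a domain expert signs off


def filter_rows(rows, rules, scenario):
    """Drop rows matched by a validated rule; unvalidated rules are ignored."""
    active = [r for r in rules if r.scenario == scenario and r.validated_by]
    kept = []
    for row in rows:
        excluded = any(
            row["RBUKRS"] == r.company_code and row["BLART"] in r.excluded_doc_types
            for r in active
        )
        if not excluded:
            kept.append(row)
    return kept


# All values below are placeholders for illustration only.
RULES = [
    ExclusionRule("margin_report", "1000", frozenset({"ZX"}),
                  "placeholder rationale", validated_by="finance_lead"),
    ExclusionRule("margin_report", "2000", frozenset({"ZY"}),
                  "placeholder rationale"),  # not yet validated, so not applied
]

ROWS = [
    {"RBUKRS": "1000", "BLART": "ZX", "PRCTR": "P100"},
    {"RBUKRS": "1000", "BLART": "SA", "PRCTR": "P100"},
    {"RBUKRS": "2000", "BLART": "ZY", "PRCTR": "P200"},
]

KEPT = filter_rows(ROWS, RULES, "margin_report")
```

The `validated_by` field is where the ownership problem shows up: a rule nobody has signed off on is recorded but never enforced, so the gap is visible rather than silent.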

John Y Miller

Feb 24 (Edited)

I like where you are taking the Data Engineering discussion! Organizations need to recognize the value of managing context, and that it's new to many of those running projects that require data pipelines. Getting specifications right will also be important as we employ coding agents as first-class engineers. And as for spec-driven development... like everything else in this new era of engineering, it's getting redesigned as well!

Also, see https://ai.bythebay.io/talks/spec-driven-development

Sonal Goyal

Mar 18

Ananth, this is one of the clearest articulations I've seen of what's actually changing in the discipline. The framing of meaning over movement is exactly right, and the early/late binding distinction based on accountability rather than domain maturity is genuinely novel — I'll be using that mental model.

I want to push on the Link step, because I think it's the hardest one and the article treats it as almost naturally following from Contextualize. In practice, linkage is where context architectures break silently. You can have impeccable contracts, a well-governed Context Store, and a beautifully maintained lineage graph — and still have your CRM customer and your product user be two different entities with no connection between them. The semantic meaning you've so carefully built in one system doesn't travel to the other because there's no trusted identity bridge.

What makes this especially tricky in the AI era is exactly what you identify with the Context Store: agents need stable, grounded context to act reliably. But if the entity underneath the context is unresolved — if "Acme Corp" in your billing system and "ACME Corporation" in your support tool haven't been linked — no amount of semantic enrichment fixes the downstream inference. You get systematically wrong answers delivered with high confidence.

The accountability boundary model actually applies here too. Within a controlled environment you can enforce a canonical entity ID. Across external or legacy systems, you have to discover identity the same way you discover semantic context — probabilistically, with human validation on the uncertain cases. The Contextualize pipeline you describe and a robust entity resolution pipeline are structurally siblings. I'd argue Link deserves its own dedicated pipeline treatment, not just a mention as the third leg of the acronym.

This is what we've been building toward at Zingg — and this piece makes the architectural stakes clearer than almost anything else I've read this year.
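The "probabilistic, with human validation on the uncertain cases" idea in this comment can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not Zingg's API or any production matcher. The suffix list, the function names, and the two thresholds (`auto_link`, `review`) are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "limited", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal-form suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0  # nothing left to compare, so do not claim a match
    return SequenceMatcher(None, na, nb).ratio()


def classify(a: str, b: str, auto_link: float = 0.9, review: float = 0.6) -> str:
    """Return 'link', 'review' (send to a human), or 'distinct'."""
    score = similarity(a, b)
    if score >= auto_link:
        return "link"
    if score >= review:
        return "review"
    return "distinct"


print(classify("Acme Corp", "ACME Corporation"))       # link
print(classify("Acme Holdings", "Acme Holding Group"))  # review
print(classify("Acme Corp", "Globex Inc"))             # distinct
```

The middle band is the point of the design: pairs that score between the two thresholds are routed to a person instead of being linked or rejected automatically. A real pipeline would compare more attributes than the name and would learn its thresholds from labeled pairs.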

I. G.

Apr 16 (Edited)

Very well written.

I recently had discussions with my colleagues that relate to this.

The core question was: “Why do we need to see an agent’s thought process and plan before we let it execute?”

Some agentic frameworks now simply proceed without that review step, but our answer mostly came down to responsibility, ownership, and especially validation of quality.

The discussion went on.

The main argument was along the lines of: “If the agent can get the ‘right’ context, there will be no need to validate it.”

And I think this notion is worth thinking about.

If a model can get (and hold) the right context, where the right context is:

- the relevant information about the data itself

- the relevant information about the business needs

- relevant information about the landscape the business operates in, and therefore the opportunities

- no irrelevant information (noise)

Then there is no reason it cannot eclipse the same frontier that is outlined in the article.

It is sort of a utopia, this concept of the “ideal context”, but in principle it is a valid argument in my mind.

And I think it is worth thinking over, since the AI companies are likely to explore this challenge and master it over time.

We probably have quite a long time ahead in which we are needed to bring this context to the machine, which, as I read it, is what you outline as the next frontier.

The question of ownership and responsibility still lingers.

Thanks for a well-written article.

Data Engineering Weekly
