18 Comments
Phil Watt

Please don’t ever say “xyz” is dead. It makes you look like an idiot. I mean, really, really stupid and/or ignorant. Or a bot.

/unsubscribe

Ananth Packkildurai

What part of the article do you disagree with?

Alice Guo

Seriously, with this type of clickbait, AI-generated title?

Ananth Packkildurai

Did you read through the article? What part do you disagree with?

SANTOSH Kadam

Good Article.

May many more such insights come through.

chiau hung

good article!!! a great read! thank you!

Kawaljeet Matta

Great, insightful article. Can you recommend any books on this topic and how things are changing? Thank you.

Peter Andrew Nolan

Hello Ananth,

"The medallion architecture — Bronze to Silver to Gold — is an assembly line designed for human inspection at each station."

Medallion Architecture is not an architecture. It's a marketing slogan.

No one who knows how to build a data warehouse takes "Medallion Architecture" seriously. And we don't build anything that looks like "Medallion Architecture".

We build the following.

1. Landing Area.

2. Staging Area.

3. Archival Layer.

4. Base Layer.

5. Multi-Level Summary Layer.

Since 2002 the archival layer and base layer have largely collapsed into one. They still exist, but in 2002 I invented the last piece of the puzzle needed to archive data in a dimensional model on a daily basis without bankrupting the company on processing power.

Bill Inmon brought out his Data Warehouse 2.0 book in 2008 to deal with textual-style data, and he has since brought out his tool called Textual ETL.

There are far too many articles like yours being written as "advertising" when men like you would do far better to actually use my software and learn how to do data warehousing properly.

Ok?

Peter Andrew Nolan

Hello Ananth.

No. ETL is not dead. Not in the slightest. AI cannot build ETL.

AI can copy data from one place to another. But then again, so can I using my ETL software.

Using my ETL software I can create an extraction process and staging area for a 50,000-field large operational system in a week. So how does AI help me do that? If it can do it in a day, so what?

Taking data from large operational systems and putting it into a dimensional data warehouse requires that a data warehouse architect look into what needs to be done and build what needs to be done.

Currently, we are at rates of 12K-15K fields mapped in a 220-hour work month for my paid-for ETL and 6K-8K per work month for my free version.

Mapping 6K-8K fields from a large operational system across to a data warehouse per 220-hour work month makes that task essentially free when compared to all the other tasks.

I have heard people talking about "data pipelines" for some time now. The demos I see and the discussions I hear are all about moving similar, and even the SAME data more than once.

People are very, very, VERY secretive about these "data pipelines" they are building and I presume the reason is that they are very bad. I have seen demos of dbt and similar products and they are a joke.

BTW I have invented the future of data warehousing and the future is "Mega Models". I have written posts about it.

But no. AI is not doing anything useful in the ETL area. The most useful thing I have found for AI to do is voice-overs for my posts in a woman's voice, because men will listen to a woman's voice longer than they will listen to a man's voice.

You can get my ETL software, for free, from the freebies link in my profile. Those who have used it have saved a lot of money, even if they went into production with Informatica or DataStage.

Ok?

If the men who call themselves "data engineers" and are building star schema data warehouses want to improve their standing with the business side of the house, they can save their companies a whole lot of money by using my free ETL and free data models.

Any man who wants to run his own consulting company building star schema data warehouses can win more business through greater pricing flexibility by using my free software.

Ok?

Brent Brewington

Lol, you should go watch a demo of Snowflake’s Cortex Code or Databricks’ Genie Code. I think there’s a lot you don’t know you don’t know about what AI can do, now with Agent Skills in the mix

Abdul

But is it enough to render Data/Analytics Engineers useless? There is more to building pipelines than just generating Spark and SQL code (which I find LLM agentic models are not as good at).

Peter Andrew Nolan

Hi Brent,

just to let you know the history: I wrote my first ETL tool in 1995 in COBOL. I used it to win a number of big deals. With that ETL software we could bid fixed-price data warehouses and know we would be on time and under budget. The price was A$300K in consulting fees in Australia and USD 300K in consulting fees in Asia, the extra being for travel and accommodation.

In 2002 I was talking with Ralph Kimball about how bad the ETL tools were, in particular Informatica, and how I wished we had better ETL tools.

Ralph suggested I write my own, and that sounded like a very good idea because I knew that if I could write the ETL tool I wanted, I could sell it for very good money.

In my new ETL tool we maintained the rate of mapping 1,000 fields per work month that I had established in 1997, but the C++ version was a BIG improvement on the COBOL version.

In 2004-5 I sold three copies to the richest man in Australia for EUR20K each. Then in 2007 I had my breakthrough sale: a copy with an unlimited-use license for EUR30K to a consulting company in Germany, www.key-work.de. With a mapping rate of 1,000 fields per month they could then undercut the big consulting firms, and my software helped make them successful.

Alas, my wife Jennifer saw that EUR30K check and believed her divorce lawyer, who told her she would get 50% of my software revenue for life, and she attacked me in the Irish divorce courts despite the fact that we are both Australian and had only lived in Ireland for six years.

In 2009 I was doing Carphone Warehouse in the UK and we were on Netezza. The Netezza machine was fast enough that we could update my software to generate SQL as the ETL and not do the actual ETL in C++ programs. This was a huge breakthrough, as it meant the data did not come out of the database to be manipulated. All the manipulation happened inside the database. Netezza saw how this benefitted them, and we were selling our telco data models and ETL in partnership with Netezza when IBM bought Netezza and kicked us out in favour of their own data models and DataStage.

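To make "ETL as SQL" concrete, here is a rough sketch with placeholder table and column names (not the real ones): the transform is a set-based statement the database runs itself, so nothing is pulled out into an external program.

```sql
-- Illustrative sketch only (placeholder schema, table and column names):
-- "ETL as SQL" means the load logic is a set-based statement the database
-- executes itself, so the data never leaves the database to be manipulated.
INSERT INTO base.customer_dim (customer_key, customer_name, valid_from)
SELECT s.customer_id,
       s.customer_name,
       CURRENT_TIMESTAMP
FROM   staging.customer AS s
LEFT JOIN base.customer_dim AS d
       ON d.customer_key = s.customer_id
WHERE  d.customer_key IS NULL;   -- load only customers not already in the dimension
```
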
Mapping at 1,000 fields per 220-hour work month was faster than anyone else could go, so we didn't bother trying to find out how to go faster. In 2017 I had a reason to figure out how to go faster and we got to 6K-8K fields mapped per 220-hour work month. That is the version available for free.

Two years ago I had reason to investigate whether I could make it go faster again, and I am now at peaks of 1,000 fields mapped in a 15-hour work day... so comfortably 500 fields mapped in an 8-hour work day.

This rate makes possible what I am doing now, which is creating "Mega Models". You can read about them on my Substack. Fundamentally, we are now able to co-host many customers' copies of one source system in one large data warehouse in the cloud, just on SQL Server Standard Edition. This means there is only one suite of ETL from the staging area to the data warehouse.

This means that the cost of hosting any single copy of a "Large Operational System" in a data warehouse is FAR lower than it can be any other way. SQL Server SE is very cheap, and you can now have 22 cores on SE. So it will run quite a large data warehouse and can ingest many customers' data into the one data warehouse data model.

Also, we have even discovered how to have no data co-habitation in a single table if that is what the customer is willing to pay for.

There are two options. One is to have a single target table in each case, with views over the top that carry the "customer number" and provide security via views. The other is to have a table for each customer's separate data, with an "ETL View" over the top that the ETL can write to, while each customer's views only draw data from their own table. No data co-habitation.

Of course that is more expensive to support because there are N times more tables where N is the number of customers who want their data in their own tables. But if the customer decides they want that data separation and are prepared to pay the extra support costs then we would do that.

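As a rough sketch of the two options, with placeholder object and column names rather than anything from the real models:

```sql
-- Rough illustration only; names are placeholders, not from any real model.
-- (In SQL Server, run each CREATE in its own batch.)

-- Option 1: one shared target table; each customer gets a view that filters on
-- the customer number, so security is provided via views.
CREATE TABLE dw.sales_fact (
    customer_number INT            NOT NULL,  -- tenant key
    order_id        BIGINT         NOT NULL,
    order_date      DATE           NOT NULL,
    amount          DECIMAL(18, 2) NOT NULL
);

CREATE VIEW dw.sales_fact_customer_001 AS
SELECT order_id, order_date, amount
FROM dw.sales_fact
WHERE customer_number = 1;   -- this customer's view only exposes their own rows

-- Option 2 (not shown in full): a separate table per customer, e.g.
-- dw.sales_fact_c001, dw.sales_fact_c002, ..., with an "ETL view" over the set
-- that the load process writes to, so no two customers' rows share a table.
```
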
That is the future of data warehousing, Brent: Mega Models in the cloud, where many customers' data lives in one database and it is all supported by ETL as SQL. And 95%+ of that ETL is generated from the incoming data models.

It's going to be very cool. And my proposal to the smartest men in data warehousing is to pick the ERP / Large Operational System they know best and hire me as an advisor for building their Mega Model. I will only be taking on one client per Mega Model, and there are two large operational systems already taken.

In this whole mix, I don't see anything that AI can help me with. If you think there is, I would gladly watch any demo you pointed me to.

Ok?

Peter Andrew Nolan

Hi Brent, I have actually played around with both Snowflake and Databricks to test my ETL software on. It works on Snowflake, but I could not get it to work on Databricks.

In any case it works very well on all the standard databases and it even runs in reasonable time on DB2 BLU with columnar tables.

IBM put in a patch for DB2 specifically at the request of one of my customers. Inserts and updates sped up 10X, and now a columnar table only takes about 10X as long to insert or update as a row-based table.

There is no change in the SQL ETL for the columnar table. This means anyone who has the scale that needs a columnar database like DB2 can get their ETL tool from my link for free.

The reason AI is not interesting to me for writing SQL ETL is that my software uses a spreadsheet as the source code, and 95-percent-plus of the entries in the spreadsheet are generated anyway. I have had days where I have mapped 1,000 fields, something that used to take 220 hours to do. When I say "day" I mean 15 hours.

So yes, I know what these tools can do, and they are a long way behind what I have been doing for more than 20 years now.

OK?

Dami Adeoye

What I love about the title is that it compels you to read the article. However, what I love about the article is that it helps you realize that there are certain areas where ETL will become obsolete, both in the present and in the future. The first time I read the article, I was not only intrigued by the title but also by the fact that I had recently learned about MCPs and had started using them. As someone who works in the field of data platform development, I could relate to the points you made and could see it being a reality in my world.

Mayank

Largely agree with your views.

Had similar observations some time back.

https://substack.com/@mayankm/note/c-187457704?r=1fbv4

BTW, love the analogy 'Dead like landline, not like Latin' :)

Ruh

awesome article man. especially the warehouse analogy. it drove everything home

Chris Lettieri

This framing resonates. The warehouse-for-humans analogy is exactly right, and "context architecture" is the right job title for what survives.

I've been converging on something similar from the agent systems / foundation model side. The way I've come to think about it: context engineering is index design.

Agents need a navigational metadata layer (what I'm calling context blocks) that sits above raw data and answers one question before the agent touches anything: "should I look here?"

It maps almost directly to Apache Iceberg's metadata architecture, just adapted for unstructured data where metadata has to be generated rather than computed.

Data blocks (raw content), context blocks (routing signals), a context graph (for traversal), and context snapshots (for time travel and drift detection). The agent reads the map first, then retrieves.

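As a rough sketch of the routing step, assuming the context blocks were materialized as a simple metadata table (all names here are placeholders, not from the post):

```sql
-- Rough sketch only (placeholder names): context blocks as a metadata table the
-- agent queries before touching any raw data, answering "should I look here?"
CREATE TABLE context_blocks (
    block_id    BIGINT        NOT NULL PRIMARY KEY,
    data_uri    VARCHAR(400)  NOT NULL,  -- pointer to the underlying data block
    summary     VARCHAR(2000) NULL,      -- generated routing signal
    topics      VARCHAR(400)  NULL,      -- coarse tags used for traversal
    snapshot_ts TIMESTAMP     NOT NULL   -- supports time travel / drift detection
);

-- "Read the map first": narrow to candidate data blocks, then retrieve only those.
SELECT data_uri
FROM   context_blocks
WHERE  topics LIKE '%billing%'
ORDER BY snapshot_ts DESC;
```
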
I wrote up the full architecture here if it's useful: https://bitsofchris.com/p/context-engineering-is-index-design