4 Comments
Chris Kornaros

I think you’re just describing the natural evolution of the Data Engineering role. The field has dealt with unstructured data for a while now; what’s newer is using LLMs to parse and document it. To me, that’s just the use of a new tool, in the same way Airflow is more recent than cron.

In my experience, there are a lot of data engineering roles that want or require experience with unstructured data or NoSQL. Even if you aren’t working with video on the scale of Netflix, dealing with PDFs is fundamentally similar (though clearly not the same).

Ananth Packkildurai

The skill set differs slightly from traditional data engineering (SQL-centric frameworks like dbt, etc.). Unstructured data processing requires skills such as concurrent programming and chunking techniques, which are uncommon in SQL-centric data pipelines.

Chris Kornaros

I mean, traditional data engineering is incredibly broad; you seem to just be describing ETL for structured data. Data engineering is a broad term that has always included ways to handle unstructured data (even if it wasn’t as common or easy as it is now).

It feels a bit disingenuous to slap AI on the job title and then post that this is all brand new. For example, multithreading/concurrency and chunking are both common techniques in data engineering for structured data. They are in no way unique to unstructured data, which is my entire point: you’re just describing how data engineers can use a new tool or implement it in their workflow. You haven’t made the case for this being a completely different role, other than slapping AI on the job title.

Arvind Patil

Very engaging post!
