Data Engineering Weekly #110

The Weekly Data Engineering Newsletter

Dec 05, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Before we start this week, I’m sorry to disappoint you all: Zero-ETL is nothing but someone else dumping the data into your S3 bucket instead of you doing it yourself. You still have to clean it up. Have fun!!!!

Jack Pullikottil: Reinventing Data Models: Keystone for Modern Data Platforms

What is the role of data models in modern data platforms, and how have they changed in recent years? The author explains why data models are still important for managing the structure, content, and relationships of data assets, but also need to stay agile to keep up with business velocity. The article highlights the challenges of maintaining data models in a world where SQL data warehouses are no longer the primary data platform. The author also discusses the need for richer metadata to support complex data lineage and evolving privacy requirements.

https://medium.com/@moving-the-needle/reinventing-data-models-keystone-for-modern-data-platforms-132d8283acbc

Barr Moses: What’s Next for Data Engineering in 2023? 7 Predictions

We are navigating a challenging economy, which puts a lot of focus on optimization. Given the market conditions, what will be the leading trends in data for 2023? Where will companies spend their $$$?

The author gives seven predictions. My take on them:

The prediction on cost optimization is spot on, but #1 (cost optimization) and #2 (specialization) conflict. Cost optimization favors generalized roles over specialized ones, so it will be interesting to see how this turns out.
I agree with #3 (the central data platform team remains) and #6 (the difference between data warehouses and data lakes blurs); it will be amazing if #4 (> 51% of ML applications in production) comes true.
On #5, I have a vested interest in data contracts with Schemata, so hell yeah.
On #7, I'm a bit pessimistic, given the massive fragmentation in data infrastructure today with the modern data stack.

https://towardsdatascience.com/whats-next-for-data-engineering-in-2023-7-predictions-b57e3c1bf2d3

Joseph Monti: It’s Time We Treat Our Data Like an API

Application development has come a long way in standardizing the interoperability of services: from COM/DCOM, CORBA, and WSDL to REST APIs and RPC frameworks such as gRPC. That journey created developer tooling, and an economy around it, with companies like Swagger and Postman. The author makes the case for treating data the same way. We started Schemata on a similar mission, so a big yes. It's time we treat our data like an API.

https://joemonti.org/its-time-we-treat-our-data-like-an-api-2a5723b3830b

Sean Byrnes: You Have Too Many Metrics

Metrics are a valuable tool for simplifying complex business information and understanding how the business is doing. Should we create more metrics to understand the business better? The author explains why choosing a small number of high-quality metrics reduces unnecessary noise and improves decision-making.

https://breakingpoint.substack.com/p/you-have-too-many-metrics

Chouaieb Nemri: My favorite AI / ML / Analytics AWS re:Invent 2022 announcements

It's AWS re:Invent time, and AWS shipped tons of product updates across its AI/ML and analytics tools. The author ranks the favorites from the announcements. For me, the announcement that Athena supports Spark brings more curiosity than excitement. I like the idea of "serverless Spark applications". What is your favorite announcement? Please comment.

https://c-nemri.medium.com/my-favorite-ai-ml-analytics-aws-re-invent-2022-announcements-b5744c68d5f8

Meta: Enabling static analysis of SQL queries at Meta

What if your entire pipeline is a collection of CTEs (Common Table Expressions) that occasionally persist data? What if those CTEs could run in parallel, execute speculatively, and be reused or rewritten for optimal usage?

I recently shared the thought and am excited to see Meta’s blog on static analysis of SQL queries. Though it is not exactly what I described, the possibility of a static analyzer on SQL is exciting.

There is a need for a SQL orchestration engine that is “Pipeline aware” and brings optimization and type safety to data engineering. Let’s call it “dbt next” 😉

https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/

Netflix: Ready-to-go sample data pipelines with Dataflow

Developing a test environment is one of the hardest parts of data engineering. Netflix writes about Dataflow and how it supports generating sample workflows with mocked data to boost developer productivity.

https://netflixtechblog.com/ready-to-go-sample-data-pipelines-with-dataflow-17440a9e141d

Grafana: TraceQL: a first-of-its-kind query language to accelerate trace analysis in Tempo 2.0

Trace analytics is picking up momentum in observability as a way to better understand the causes of system failures. There are many similarities between funnel analytics and trace analytics. Is a trace a more appropriate data structure for funnel analysis than dimensional modeling? It is something to explore further, and I'm delighted to see the release of TraceQL from Grafana.

https://grafana.com/blog/2022/11/30/traceql-a-first-of-its-kind-query-language-to-accelerate-trace-analysis-in-tempo-2.0/

Shopify: Using Server-Sent Events to Simplify Real-time Streaming at Scale

Server-sent events work well as a communication model when the application is designed around a precomputed, predetermined delivery model. Shopify writes about the system design behind its live visualization of Black Friday shopping.

https://shopifyengineering.myshopify.com/blogs/engineering/server-sent-events-data-streaming
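To make the "precomputed and predetermined delivery" idea concrete, here is a minimal sketch of how a server could frame messages in the text/event-stream wire format that server-sent events use. This is not Shopify's implementation; the function names (format_sse, stream_precomputed), the "snapshot" event name, and the payload shape are all hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        # Lets a reconnecting client say which message it saw last.
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps produces a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


def stream_precomputed(snapshots: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE frame per precomputed snapshot (hypothetical payloads)."""
    for i, snapshot in enumerate(snapshots):
        yield format_sse(snapshot, event="snapshot", event_id=str(i))


if __name__ == "__main__":
    for frame in stream_precomputed([{"orders": 10}, {"orders": 12}]):
        print(frame, end="")
```

The server only pushes and the client only listens, which is why the model fits cases where the payload is computed ahead of time and every client receives the same thing.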

Expedia: Unify Data Lakes Across Multi-Regions in the Cloud

The Expedia data platform team writes about unifying data lakes across multiple regions using AWS Lake Formation and Glue, which enables a federated data lake spanning multiple geographic regions in the cloud. The new solution lets teams access data without replicating it, improving scalability and reducing data latency.

https://medium.com/expedia-group-tech/unify-data-lakes-across-multi-regions-in-the-cloud-61119db325f9

Thoughtworks: Effective machine learning - Shifting quality left

Thoughtworks writes about best practices for implementing effective machine learning, and one of the key aspects is shifting data quality left via contracts!!!

💯 Shifting left, bringing consumers closer to the source via data contracts, is the key to an effective data pipeline.

https://www.thoughtworks.com/insights/blog/machine-learning-and-ai/effective-ml-part-II
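As a rough illustration of what "shifting quality left via contracts" can mean in practice, the sketch below checks a record against a declared contract at the producer, before it ever reaches a pipeline. It is not the Thoughtworks approach or the Schemata API; the contract format, the ORDER_CONTRACT fields, and the violations helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ContractField:
    name: str
    py_type: type
    required: bool = True


# Hypothetical contract agreed between the producer and its consumers.
ORDER_CONTRACT = [
    ContractField("order_id", str),
    ContractField("amount", float),
    ContractField("coupon", str, required=False),
]


def violations(record: Dict[str, Any],
               contract: List[ContractField]) -> List[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            # Absent or null: only a problem when the field is required.
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.py_type):
            problems.append(
                f"{field.name}: expected {field.py_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    print(violations({"order_id": "a1", "amount": 9.5}, ORDER_CONTRACT))
    print(violations({"order_id": 7}, ORDER_CONTRACT))
```

Running a check like this in the producer's CI or at publish time is what moves the failure to the source, instead of leaving downstream consumers to discover it.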

Helpshift: Generating Chatbot performance insights using Spark SQL at Helpshift

A primary function of the data team is to build a feedback loop on product performance to improve efficiency and measure business impact. The Helpshift data team writes an exciting blog about how it runs the performance analysis of its chatbot product with Spark.

https://medium.com/helpshift-engineering/generating-chatbot-performance-insights-using-spark-sql-at-helpshift-6cf15e905604

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.