Data Engineering Weekly

Data Engineering Weekly #110

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Dec 05, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Before we start this week, I’m sorry to disappoint you all: Zero-ETL is nothing but someone else dumping the data into your S3 bucket instead of you doing it yourself. You still have to clean it up. Have fun!!!!

Jack Pullikottil: Reinventing Data Models: Keystone for Modern Data Platforms

What is the role of data models in modern data platforms, and how have they changed in recent years? The author explains why data models remain important for managing the structure, content, and relationships of data assets, while also needing to stay agile enough to support business velocity. The article highlights the challenges of maintaining data models in a world where SQL data warehouses are no longer the primary data platform, and discusses the need for richer metadata to support complex data lineage and evolving privacy requirements.

https://medium.com/@moving-the-needle/reinventing-data-models-keystone-for-modern-data-platforms-132d8283acbc


Barr Moses: What’s Next for Data Engineering in 2023? 7 Predictions

We are navigating a challenging economy, which puts a lot of focus on optimization. Given the market conditions, what will be the leading trend in data for 2023? Where will companies spend their $$$?

The author gives seven predictions. My take:

  • The prediction on cost optimization is spot on, but #1 (cost optimization) and #2 (specialization) conflict. Cost optimization favors generalization over specialization, so it will be interesting to see how it turns out.

  • I agree with #3 (the central data platform team remains) and #6 (the difference between data warehouse and data lake blurs); it will be amazing if #4 (> 51% of ML applications in production) comes true.

  • On #5, I have a vested interest in Data Contracts with Schemata, so hell yeah.

  • On #7, I'm a bit pessimistic about it, given the massive fragmentation in the data infrastructure today with the modern data stack.

https://towardsdatascience.com/whats-next-for-data-engineering-in-2023-7-predictions-b57e3c1bf2d3


Joseph Monti: It’s Time We Treat Our Data Like an API

Application development has come a long way in standardizing the interoperability of services: from COM/DCOM, CORBA, and WSDL to REST APIs and RPC frameworks such as gRPC. That journey created developer tooling and an economy around it, with companies like Swagger and Postman. The author makes the case for treating data like an API. We started Schemata on a similar mission, so a big yes. It's time we treat our data like an API.

https://joemonti.org/its-time-we-treat-our-data-like-an-api-2a5723b3830b


Sean Byrnes: You Have Too Many Metrics

Metrics are a valuable tool for simplifying complex business information and understanding how the business is doing. Should we create more metrics to understand the business? The author explains why choosing a small number of high-quality metrics reduces unnecessary noise and improves decision-making.

httpJetBlue'singpoint.substack.com/p/you-have-too-many-metrics


Sponsored: Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform.

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Chouaieb Nemri: My favorite AI / ML / Analytics AWS re:Invent 2022 announcements

It's AWS re:Invent time, and AWS shipped tons of product updates across its AI/ML and analytics tools. The author ranks their favorite announcements. For me, the Athena support for Spark announcement prompts more curiosity than excitement, though I like the idea of "serverless Spark applications". What is your favorite announcement? Please comment.

https://c-nemri.medium.com/my-favorite-ai-ml-analytics-aws-re-invent-2022-announcements-b5744c68d5f8


Meta: Enabling static analysis of SQL queries at Meta

What if your entire pipeline is a collection of CTEs (Common Table Expressions) that occasionally persist data? What if CTEs could run in parallel with speculative execution, and be reused or rewritten for optimal usage?

I recently shared the thought and am excited to see Meta’s blog on static analysis of SQL queries. Though it is not exactly what I described, the possibility of a static analyzer on SQL is exciting.

There is a need for a SQL orchestration engine that is “Pipeline aware” and brings optimization and type safety to data engineering. Let’s call it “dbt next” 😉
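To make the "pipeline aware" idea a little more concrete, here is a minimal sketch, not Meta's analyzer and not how dbt works. It assumes the CTE dependencies have already been extracted from the SQL (the hard part, which a static analyzer would do), and only shows how an engine could group CTEs into stages that are safe to run in parallel. The CTE names and the `parallel_stages` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each CTE name maps to the set of CTEs it reads from.
cte_deps = {
    "raw_orders": set(),
    "raw_users": set(),
    "clean_orders": {"raw_orders"},
    "active_users": {"raw_users"},
    "orders_by_user": {"clean_orders", "active_users"},
}


def parallel_stages(deps):
    """Group CTEs into stages; all CTEs within one stage can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular references
    stages = []
    while sorter.is_active():
        # Everything whose upstream CTEs are already done is ready now.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages(cte_deps)
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

The two raw CTEs land in the first stage, the two cleaning steps in the second, and the join in the last. A real engine would layer type checking, reuse, and rewrites on top of the same graph.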

https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/


Sponsored: [Live Webinar] How JetBlue Builds Trust in Data and Improves Model Accuracy

The data team at JetBlue Airways, a leading carrier in the United States, is responsible for powering insights for the entire organization’s operational and customer service activities. Learn how JetBlue’s data engineering and data science teams leverage Monte Carlo and Snowflake together to accelerate data analysis and drive business value.

Data Engineering Weekly readers can save their seats by clicking the link.


Netflix: Ready-to-go sample data pipelines with Dataflow

Developing a test environment is one of the hardest parts of data engineering. Netflix writes about Dataflow and how it generates sample workflows with mocked data to boost developer productivity.

https://netflixtechblog.com/ready-to-go-sample-data-pipelines-with-dataflow-17440a9e141d


TraceQL: a first-of-its-kind query language to accelerate trace analysis in Tempo 2.0

Trace analytics is picking up momentum in observability as a way to better understand the causes of system failures. There is a lot of similarity between funnel analytics and trace analytics. Is a trace a more appropriate data structure for funnel analysis than dimensional modeling? It is something to explore further, and I'm delighted to see the release of TraceQL from Grafana.

https://grafana.com/blog/2022/11/30/traceql-a-first-of-its-kind-query-language-to-accelerate-trace-analysis-in-tempo-2.0/


Sponsored: Webinar - How InfluxData eliminated data silos in weeks with RudderStack

Join RudderStack and InfluxDB’s Director of Analytics, Mona Sami, on Wednesday, December 7th, to learn how the InfluxDB team used RudderStack to establish their data warehouse as a single source of truth.

https://www.rudderstack.com/events/how-influxdata-eliminated-data-silos-in-weeks-with-rudderstack/


Shopify: Using Server-Sent Events to Simplify Real-time Streaming at Scale

Server-Sent Events suit us well as a communication model when the application is designed around a precomputed, predetermined delivery model. Shopify writes about the system design behind its Black Friday live shopping visualization.
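For readers new to Server-Sent Events, the wire format is plain text: optional `id:` and `event:` fields, one `data:` field per payload line, and a blank line to end the message. The sketch below is not Shopify's code. The `format_sse` helper, the `sales_update` event name, and the payload values are all made up for illustration.

```python
import json


def format_sse(data, event=None, event_id=None):
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


# Hypothetical precomputed payload pushed to every connected client.
payload = json.dumps({"orders_per_minute": 1200})
message = format_sse(payload, event="sales_update", event_id=1)
print(message, end="")
```

Because the server only pushes and the payload is the same for everyone, the message can be computed once and written to every open connection, which is why this model fits a precomputed delivery design.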

https://shopifyengineering.myshopify.com/blogs/engineering/server-sent-events-data-streaming


Expedia: Unify Data Lakes Across Multi-Regions in the Cloud

The Expedia data platform team writes about unifying data lakes across multiple regions using AWS Lake Formation and Glue, which enables a federated data lake spanning multiple geographic regions in the cloud. The new solution lets teams access the data without replication, improving scalability and reducing data latency.

https://medium.com/expedia-group-tech/unify-data-lakes-across-multi-regions-in-the-cloud-61119db325f9


Thoughtworks: Effective machine learning - Shifting quality left

Thoughtworks writes about best practices for implementing effective machine learning, and one of the key aspects is shifting data quality left via contracts!!!

💯 Shift Left, bringing consumers close to the source via Data Contract, is the key to an effective data pipeline.
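As a rough illustration of what shifting left with a contract can look like, here is a minimal sketch of a producer-side check. It is not the Schemata API and not the approach from the Thoughtworks article. The `Field` class, the `ORDER_CONTRACT` definition, and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: type
    required: bool = True


# Hypothetical contract for an "order created" event, agreed with consumers.
ORDER_CONTRACT = [
    Field("order_id", str),
    Field("amount_cents", int),
    Field("coupon_code", str, required=False),
]


def violations(record, contract):
    """Return human-readable contract violations; an empty list means valid."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.type):
            problems.append(
                f"{field.name}: expected {field.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


good = {"order_id": "o-1", "amount_cents": 1999}
bad = {"order_id": "o-2", "amount_cents": "19.99"}

print(violations(good, ORDER_CONTRACT))
print(violations(bad, ORDER_CONTRACT))
```

Running a check like this in the producer's build or at publish time means a bad record fails close to the source, instead of surfacing days later in a downstream model.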

https://www.thoughtworks.com/insights/blog/machine-learning-and-ai/effective-ml-part-II


Helpshift: Generating Chatbot performance insights using Spark SQL at Helpshift

A primary function of the data team is to build a feedback loop on product performance to improve efficiency and measure business impact. The Helpshift data team writes an exciting blog about how it runs performance analysis of its chatbot product with Spark.

https://medium.com/helpshift-engineering/generating-chatbot-performance-insights-using-spark-sql-at-helpshift-6cf15e905604


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

