Data Engineering Weekly #110
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Before we start this week, I’m sorry to disappoint you all: Zero-ETL is nothing but someone else dumping the data into your S3 bucket instead of you doing it yourself. You still have to clean it up. Have fun!!!!
Jack Pullikottil: Reinventing Data Models: Keystone for Modern Data Platforms
What is the role of data models in modern data platforms, and how have they changed in recent years? The author explains why data models are still important for managing the structure, content, and relationships of data assets, but must also stay agile enough to keep up with business velocity. The article highlights the challenges of maintaining data models in a world where SQL data warehouses are no longer the primary data platform. The author also discusses the need for richer metadata to support complex data lineage and evolving privacy requirements.
Barr Moses: What’s Next for Data Engineering in 2023? 7 Predictions
We are navigating a challenging economy, which puts a sharp focus on optimization. Given the market conditions, what will the leading data trends be in 2023? Where will companies spend their $$$?
The author gives seven predictions. My take on them:
The cost optimization prediction is spot on, but #1 (cost optimization) & #2 (specialization) conflict. Cost optimization favors generalized roles over specialized ones, so it will be interesting to see how it turns out.
I agree with #3 (the central data platform team remains) and #6 (the difference between data warehouses and data lakes blurs); it will be amazing if #4 (more than 51% of ML applications in production) comes true.
On #5, I have a vested interest in data contracts with Schemata, so hell yeah.
On #7, I'm a bit pessimistic about it, given the massive fragmentation in the data infrastructure today with the modern data stack.
Joseph Monti: It’s Time We Treat Our Data Like an API
Application development has come a long way in standardizing the interoperability of services, from COM/DCOM, CORBA, and WSDL to REST APIs and RPC frameworks such as gRPC. That journey created developer tooling and an economy around it, with companies like Swagger & Postman. The author makes the case for treating data like an API. We started Schemata on a similar mission, so a big yes. It's time we treat our data like an API.
Sean Byrnes: You Have Too Many Metrics
Metrics are a valuable tool for simplifying complex business information and understanding how the business is doing. Should we create more metrics to understand the business better? The author explains why choosing a small number of high-quality metrics reduces unnecessary noise and improves decision-making.
Sponsored: Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Chouaieb Nemri: My favorite AI / ML / Analytics AWS re:Invent 2022 announcements
It's AWS re:Invent time, and AWS shipped tons of product updates across its AI/ML & Analytics tools. The author ranks their favorite announcements. For me, the announcement that Athena supports Spark brings more curiosity than excitement. I like the idea of "serverless Spark applications". What is your favorite announcement? Please comment.
Meta: Enabling static analysis of SQL queries at Meta
What if your entire pipeline is a collection of CTEs (Common Table Expressions) that occasionally persist data? What if CTEs could run in parallel, execute speculatively, and be reused or rewritten for optimal usage?
I recently shared the thought and am excited to see Meta’s blog on static analysis of SQL queries. Though it is not exactly what I described, the possibility of a static analyzer on SQL is exciting.
There is a need for a SQL orchestration engine that is “Pipeline aware” and brings optimization and type safety to data engineering. Let’s call it “dbt next” 😉
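To make the "pipeline as a collection of CTEs" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and function names (events, cleaned, per_user, build_demo_db, run_pipeline) are hypothetical and chosen only for illustration; this is not Meta's analyzer or any existing orchestration engine. Each CTE acts as a named pipeline stage, and the dependencies between stages are visible in the SQL text itself, which is the kind of structure a static analyzer could inspect.

```python
import sqlite3

# Hypothetical two-stage pipeline: each CTE is one named stage.
PIPELINE_SQL = """
WITH cleaned AS (              -- stage 1: drop invalid rows
    SELECT user_id, amount
    FROM events
    WHERE amount > 0
),
per_user AS (                  -- stage 2: aggregate, depends on stage 1
    SELECT user_id, SUM(amount) AS total
    FROM cleaned
    GROUP BY user_id
)
SELECT user_id, total
FROM per_user
ORDER BY user_id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5), (2, -3.0)],
    )
    return conn


def run_pipeline(conn: sqlite3.Connection) -> list:
    """Run every stage of the pipeline as a single SQL statement."""
    return conn.execute(PIPELINE_SQL).fetchall()


if __name__ == "__main__":
    print(run_pipeline(build_demo_db()))
```

Because the whole pipeline is one statement, an engine is free to decide which stages to materialize and which to inline; the sketch leaves that decision to SQLite.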
Sponsored: [Live Webinar] How JetBlue Builds Trust in Data and Improves Model Accuracy
The data team at JetBlue Airways, a leading carrier in the United States, is responsible for powering insights for the entire organization’s operational and customer service activities. Learn how JetBlue’s data engineering and data science teams leverage Monte Carlo and Snowflake together to accelerate data analysis and drive business value.
Data Engineering Weekly readers can save their seats by clicking the link.
Netflix: Ready-to-go sample data pipelines with Dataflow
Developing a test environment is one of the hardest parts of data engineering. Netflix writes about Dataflow and how it supports generating sample workflows with mocked data to boost developer productivity.
TraceQL: a first-of-its-kind query language to accelerate trace analysis in Tempo 2.0
Trace analytics is picking up momentum in observability as a way to better understand the causes of system failures. There are many similarities between funnel analytics and trace analytics. Is a trace a more appropriate data structure for funnel analysis than dimensional modeling? It is something to explore further, and I'm delighted to see the release of TraceQL from Grafana.
Sponsored: Webinar - How InfluxData eliminated data silos in weeks with RudderStack
Join RudderStack and InfluxDB’s Director of Analytics, Mona Sami, on Wednesday, December 7th, to learn how the InfluxDB team used RudderStack to establish their data warehouse as a single source of truth.
Shopify: Using Server-Sent Events to Simplify Real-time Streaming at Scale
Server-Sent Events (SSE) work well as a communication model when an application is designed around a precomputed & predetermined delivery model. Shopify writes about the system design of its Black Friday live shopping visualization.
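For readers new to Server-Sent Events, here is a minimal sketch of the wire format: the server holds open a text/event-stream response and writes blocks of "field: value" lines, each block terminated by a blank line. The function name format_sse and the sample payload are assumptions made for illustration; this is not Shopify's implementation.

```python
import json
from typing import Optional


def format_sse(data: dict,
               event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream format."""
    lines = []
    if event_id is not None:
        # Lets a reconnecting client resume via the Last-Event-ID header.
        lines.append(f"id: {event_id}")
    if event is not None:
        # Named event type the client can subscribe to.
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several data: lines.
    for chunk in json.dumps(data).splitlines():
        lines.append(f"data: {chunk}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(format_sse({"orders": 3}, event="sales", event_id="1"), end="")
```

The server only ever pushes over a plain HTTP response, so there is no client-to-server channel to manage, which is why the model fits data that is computed ahead of time and delivered on a fixed schedule.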
Expedia: Unify Data Lakes Across Multi-Regions in the Cloud
The Expedia data platform team writes about unifying data lakes across multiple regions using AWS Lake Formation and Glue, which enables a federated data lake spanning multiple geographic regions in the cloud. This new solution allows teams to access the data without data replication, improving scalability and reducing data latency.
Thoughtworks: Effective machine learning - Shifting quality left
Thoughtworks writes about best practices for implementing effective machine learning, and one of the key aspects is shifting data quality left via contracts!!!
💯 Shift Left, bringing consumers close to the source via Data Contract, is the key to an effective data pipeline.
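As a rough illustration of what shifting quality left via a contract can look like, here is a minimal sketch in which the producer checks each record against an agreed schema before publishing it. The contract fields and the violations helper are hypothetical; this is not the Schemata API or the approach described in the Thoughtworks article.

```python
from typing import Any, Dict, List

# Hypothetical contract agreed between producer and consumers:
# field name -> expected Python type.
ORDER_CONTRACT: Dict[str, type] = {
    "order_id": int,
    "amount": float,
    "currency": str,
}


def violations(record: Dict[str, Any],
               contract: Dict[str, type] = ORDER_CONTRACT) -> List[str]:
    """Return the list of contract violations for one record."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected.__name__}, got {actual}"
            )
    return problems


if __name__ == "__main__":
    # A producer would run this check before publishing the record.
    print(violations({"order_id": "1", "amount": 9.5}))
```

Running the check at the source means a bad record is rejected where it can still be fixed, instead of being discovered by a consumer several hops downstream.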
Helpshift: Generating Chatbot performance insights using Spark SQL at Helpshift
A primary function of the data team is to build a feedback loop on product performance to improve efficiency and measure business impact. The Helpshift data team writes an exciting blog about how it analyzes the performance of its chatbot product with Spark SQL.
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.