Data Engineering Weekly #101
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Shopify: What is a Full Stack Data Scientist?
I found this description interesting; it shows how far the industry has come, from defining who a data scientist is to defining a full stack data scientist. :-)
Pedram Navid: Deep Dive: What The Heck Is the Metrics Layer
What is the metrics layer? The blog establishes the need for the metrics layer and walks through its development timeline. The author then compares the current state of three metrics layer products: Looker, dbt, and LightDash.
Teej: Understanding the Snowflake Query Optimizer
Does Snowflake run every query as a full table scan? Maybe not. The author explains how Snowflake handles partition pruning, query rewriting, predicate pushdown, column pruning, and join optimization.
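To make the partition pruning idea concrete, here is a minimal sketch of the general technique: each partition keeps min/max metadata for a column, and a range filter skips any partition whose range cannot match. This is a toy model, not Snowflake's actual implementation; the `Partition` class, the `prune` and `scan` helpers, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    # Hypothetical stand-in for a storage partition with min/max metadata.
    name: str
    min_value: int
    max_value: int
    rows: List[int]


def prune(partitions: List[Partition], low: int, high: int) -> List[Partition]:
    """Keep only partitions whose [min, max] range overlaps [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


def scan(partitions: List[Partition], low: int, high: int) -> Tuple[List[str], List[int]]:
    """Read rows only from partitions that survive pruning."""
    survivors = prune(partitions, low, high)
    matches = [v for p in survivors for v in p.rows if low <= v <= high]
    return [p.name for p in survivors], matches


partitions = [
    Partition("p1", 1, 10, [1, 5, 10]),
    Partition("p2", 11, 20, [11, 15, 20]),
    Partition("p3", 21, 30, [21, 25, 30]),
]

# Filter: value BETWEEN 12 AND 22. Partition p1 is skipped without reading its rows.
scanned, rows = scan(partitions, 12, 22)
print(scanned, rows)
```

The point of the sketch is that the metadata check is cheap compared with reading rows, so a selective filter on a well-clustered column touches only a fraction of the table.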
Petrica Leuca: dbt, snowflake, and time traveling.
The more a tool gets used, the more its shortcomings surface, and those shortcomings are the perfect place to improve the system. The author highlights some design constraints of dbt on Snowflake and Postgres.
Sponsored: Firebolt - The Creator of Airflow About His Recipe for Smart Data-Driven Companies
This time on The Data Engineering Show, Maxime Beauchemin, the guy behind Airflow, Superset, and Preset, tells the bros about his recipe for smart data-driven companies. Choosing the right system and services is key for a successful start and can help you avoid the chaos of having too many tools spread across multiple teams.
Instacart: How Instacart Uses Embeddings to Improve Search Relevance
Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Instacart writes about how it uses embeddings in search relevancy.
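As a rough illustration of how embeddings can feed search relevance, the sketch below ranks products by cosine similarity between a query vector and product vectors. It is not Instacart's system; the three-dimensional vectors, the product names, and the `cosine` and `rank` helpers are made up for the example, and real embeddings would come from a trained model.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: List[float], products: Dict[str, List[float]]) -> List[str]:
    """Return product names ordered from most to least similar to the query."""
    return sorted(products, key=lambda name: cosine(query_vec, products[name]), reverse=True)


# Hypothetical toy embeddings; the values are invented for illustration only.
PRODUCT_EMBEDDINGS = {
    "whole milk": [0.9, 0.1, 0.0],
    "oat milk": [0.8, 0.3, 0.1],
    "dish soap": [0.0, 0.1, 0.9],
}

ranked = rank([1.0, 0.2, 0.0], PRODUCT_EMBEDDINGS)
print(ranked)
```

Because the query and the products live in the same dense vector space, items that share no exact keywords with the query can still rank highly when their vectors point in a similar direction.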
LinkedIn: Real-time analytics on network flow data with Apache Pinot
OLAP engines are a perfect fit for operational analytics, yet they remain underutilized there. LinkedIn writes an excellent case study of Real-time Operational Metrics Analysis (ROMA) using Apache Pinot.
Sponsored: RudderStack - Better Customer Data Integration Management For Growing Teams
In this piece, Ben Rogojan outlines your options for solving data integration challenges as your company grows: building a scalable framework or architecting a stack with the right tools. Check it out for some practical advice on which approach to take.
Twitter: Data Quality Automation at Twitter
Twitter writes about its Data Quality Platform, built on top of Great Expectations. The blog narrates how the system integrates with the Airflow orchestration process.
Back Market Tech: Our understanding of Data Lineage.
Like the Customer 360 view for retailers, end-to-end lineage is the holy grail for data engineers. Can we define data lineage in one view? The blog narrates three types of lineage views from the Back Market team's perspective.
Sponsored: Soda - Podcast: Data Mesh in Practice
Max Schultze, Data Engineering Manager at Zalando, and Prof. Dr. Arif Wider, Professor of Software Engineering at HTW Berlin, share their experience to bring forward the practical side of data mesh from an engineer's perspective, and answer challenging questions which tackle some of the common misconceptions of putting data mesh into practice.
Glance: Computing Live stream viewers count in real-time at a High Scale !!
The Glance team writes about the infrastructure design for computing live stream viewer counts. The discussion of the push vs. pull design is an exciting read.
Samar Deen: Trust in AI versus trustworthy AI: Why is it important?
What is trust in AI vs. trustworthy AI? Why is it essential to understand the difference and incorporate it from the beginning of your AI journey? The author narrates the critical difference by reviewing basic statistical applications that help data scientists build trust in their models.
Part 1: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-1-of-3-af28195b7612
Part 2: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-2-of-3-77e5edebe898
Part 3: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-3-of-3-c9aef3cb2478
Carlin Eng: A Sequel to SQL? An introduction to Malloy
Looker's Malloy is an exciting project to watch. The blog gives an excellent overview of the project. As the author writes, "Could Malloy finally be the language that replaces SQL? It's a nigh impossible task, but I am very much hoping that it succeeds." Let's keep watching this space.
Kenny Ning: How to Fix Your LookML Project Structure
A well-defined Looker project can significantly reduce data pipeline complexity. The author gives some practical design solutions for structuring a Looker project.
Bartosz Konieczny: Table formats - Delta Lake
How do the table formats of lakehouse systems like Delta Lake work? The author takes a sneak peek at the writing and reading parts of Delta Lake.
Part 1: https://www.waitingforcode.com/delta-lake/acid-file-formats-writing-delta-lake/read
Part 2: https://www.waitingforcode.com/delta-lake/table-formats-reading-delta-lake/read
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.