Data Engineering Weekly #101

The Weekly Data Engineering Newsletter

Sep 19, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Shopify: What is a Full Stack Data Scientist?

I found this description interesting; it shows how far the industry has come, from defining who a data scientist is to defining a full stack data scientist. :-)

https://shopifyengineering.myshopify.com/blogs/engineering/what-is-a-full-stack-data-scientist

Pedram Navid: Deep Dive: What The Heck Is the Metrics Layer

What is the metrics layer? The blog establishes the need for the metrics layer and walks through its development timeline. The author then compares the current state of three metrics layer products: Looker, dbt, and Lightdash.

https://pedram.substack.com/p/what-is-the-metrics-layer

Teej: Understanding the Snowflake Query Optimizer

Does Snowflake run every query as a full table scan? Maybe not. The author explains how Snowflake handles partition pruning, query rewriting, predicate pushdown, column pruning, and join optimization.

https://teej.ghost.io/understanding-the-snowflake-query-optimizer/
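
To make the partition pruning idea concrete, here is a minimal sketch in Python. It assumes each partition keeps min/max metadata for a column, so a range filter can skip partitions that cannot contain matching rows. The `Partition` class and `prune_partitions` function are hypothetical names for illustration and do not reflect Snowflake's internals.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    min_value: int  # smallest value of the filtered column in this partition
    max_value: int  # largest value of the filtered column in this partition


def prune_partitions(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Hypothetical metadata for three partitions of one table.
partitions = [
    Partition("p1", 1, 100),
    Partition("p2", 101, 200),
    Partition("p3", 201, 300),
]

# A filter such as "WHERE id BETWEEN 150 AND 220" only needs p2 and p3.
survivors = prune_partitions(partitions, 150, 220)
print([p.name for p in survivors])
```

The point of the sketch is that the decision uses metadata only; no rows in the skipped partition are read.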

Petrica Leuca: dbt, snowflake, and time traveling.

The more a tool gets used, the more its shortcomings surface, and those shortcomings are a perfect place to start improving the system. The author highlights some design constraints of dbt on Snowflake and Postgres.

https://blog.devgenius.io/dbt-snowflake-and-time-traveling-4253fb703f44

Instacart: How Instacart Uses Embeddings to Improve Search Relevance

Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Instacart writes about how it uses embeddings to improve search relevance.

https://tech.instacart.com/how-instacart-uses-embeddings-to-improve-search-relevance-e569839c3c36
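
As a rough illustration of embedding-based relevance, the sketch below ranks products by cosine similarity between a query vector and product vectors. The three-dimensional vectors and product names are made up for the example; this is not Instacart's model, and real embeddings come from a trained model with far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_products(query_vec, product_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in product_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings for a query ("milk") and three products.
query = [0.9, 0.1, 0.0]
products = {
    "whole milk": [0.8, 0.2, 0.1],
    "dish soap": [0.0, 0.1, 0.9],
    "oat milk": [0.7, 0.3, 0.0],
}

ranking = rank_products(query, products)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```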

LinkedIn: Real-time analytics on network flow data with Apache Pinot

OLAP engines are a perfect fit for operational analytics, yet they remain underutilized there. LinkedIn writes an excellent case study of Real-time Operational Metrics Analysis (ROMA) using Apache Pinot.

https://engineering.linkedin.com/blog/2022/real-time-analytics-on-network-flow-data-with-apache-pinot

Twitter: Data Quality Automation at Twitter

Twitter writes about its Data Quality Platform built on top of Great Expectations. The blog describes how the system integrates with the Airflow orchestration process.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter

Back Market Tech: Our understanding of Data Lineage.

Like the Customer 360 view for retailers, end-to-end lineage is the holy grail for data engineers. Can we define data lineage in one view? The blog describes three types of lineage views from the Back Market team's perspective.

Dataset lineage
Job lineage
Run lineage

https://engineering.backmarket.com/our-understanding-of-data-lineage-c72f5718abd0
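
For a feel of what a dataset-level view answers, here is a minimal sketch that walks a dependency map to find everything a dataset is built from. It covers only the dataset view, under my own assumption that lineage is stored as a mapping from each dataset to its direct parents; the dataset names and the `upstream_datasets` function are hypothetical and not taken from the Back Market post.

```python
def upstream_datasets(lineage, dataset):
    """Return every dataset that `dataset` transitively depends on."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        # Datasets with no entry in the map are treated as sources.
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage: each dataset maps to its direct parent datasets.
lineage = {
    "daily_revenue": ["orders_clean", "payments_clean"],
    "orders_clean": ["orders_raw"],
    "payments_clean": ["payments_raw"],
}

print(sorted(upstream_datasets(lineage, "daily_revenue")))
```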

Glance: Computing Live stream viewers count in real-time at a High Scale !!

The Glance team writes about the infrastructure design for computing live stream viewer counts. The push vs. pull design discussion is an exciting read.

https://engg.glance.com/computing-live-stream-viewers-count-in-real-time-at-high-scale-ef813bc1b9cb

Samar Deen: Trust in AI versus trustworthy AI: Why is it important?

What is trust in AI vs. trustworthy AI? Why is it essential to understand the difference and incorporate it from the beginning of your AI journey? The author explains the critical difference by reviewing basic statistical applications that help data scientists build trust in their models.

Part 1: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-1-of-3-af28195b7612

Part 2: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-2-of-3-77e5edebe898

Part 3: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-3-of-3-c9aef3cb2478

Carlin Eng: A Sequel to SQL? An introduction to Malloy

Looker's Malloy is an exciting project to watch, and the blog gives an excellent overview of it. As the author puts it, "Could Malloy finally be the language that replaces SQL? It's a nigh impossible task, but I am very much hoping that it succeeds." Let's keep watching this space.

https://carlineng.com/?postid=malloy-intro#blog

Kenny Ning: How to Fix Your LookML Project Structure

A well-defined Looker project can significantly reduce data pipeline complexity. The author offers some practical design solutions for structuring a Looker project.

https://www.spectacles.dev/post/fix-your-lookml-project-structure

Bartosz Konieczny: Table formats - Delta Lake

How do the table formats of lakehouse systems like Delta Lake work? The author takes a look at the write and read paths of Delta Lake.

Part 1: https://www.waitingforcode.com/delta-lake/acid-file-formats-writing-delta-lake/read

Part 2: https://www.waitingforcode.com/delta-lake/table-formats-reading-delta-lake/read

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.