Data Engineering Weekly

Data Engineering Weekly #101

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Sep 19, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Shopify: What is a Full Stack Data Scientist?

I found this description interesting: it shows how far the industry has come, from defining who a data scientist is to defining a full stack data scientist. :-)

https://shopifyengineering.myshopify.com/blogs/engineering/what-is-a-full-stack-data-scientist


Pedram Navid: Deep Dive: What The Heck Is the Metrics Layer

What is the metrics layer? The blog establishes the need for a metrics layer and walks through its development timeline. The author then compares the current state of three metrics layer products: Looker, dbt, and Lightdash.

https://pedram.substack.com/p/what-is-the-metrics-layer


Teej: Understanding the Snowflake Query Optimizer

Does Snowflake run every query as a full table scan? Maybe not. The author explains how Snowflake does partition pruning, query rewriting, predicate pushdown, column pruning, and join optimization.

https://teej.ghost.io/understanding-the-snowflake-query-optimizer/
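To make partition pruning concrete, here is a minimal sketch of the general idea: keep min/max metadata per partition and skip any partition whose range cannot match the filter. The `Partition` class, the `prune_partitions` helper, and the sample ranges are hypothetical illustrations, not Snowflake's internals.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    # Hypothetical per-partition metadata: the min and max of one column.
    name: str
    min_value: int
    max_value: int


def prune_partitions(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Made-up partitions covering three non-overlapping ranges of order IDs.
partitions = [
    Partition("p0", 1, 100),
    Partition("p1", 101, 200),
    Partition("p2", 201, 300),
]

# A filter such as `WHERE order_id BETWEEN 150 AND 180` only needs p1.
to_scan = prune_partitions(partitions, 150, 180)
print([p.name for p in to_scan])
```

The check relies only on metadata, so partitions that cannot contain matching rows are never read.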


Petrica Leuca: dbt, snowflake, and time traveling.

The more a tool gets used, the more its shortcomings surface, and those shortcomings are the perfect place to improve the system. The author highlights some design constraints of dbt on Snowflake and Postgres.

https://blog.devgenius.io/dbt-snowflake-and-time-traveling-4253fb703f44


Sponsored: Firebolt - The Creator of Airflow About His Recipe for Smart Data-Driven Companies

This time on The Data Engineering Show, Maxime Beauchemin, the guy behind Airflow, Superset, and Preset, tells the bros about his recipe for smart data-driven companies. Choosing the right systems and services is key to a successful start and can help you avoid the chaos of having too many tools spread across multiple teams.

https://www.firebolt.io/blog/how-preset-built-a-data-driven-organization-from-the-ground-up-podcast


Instacart: How Instacart Uses Embeddings to Improve Search Relevance

Embeddings make it easier to do machine learning on large inputs, such as sparse vectors representing words. Instacart writes about how it uses embeddings to improve search relevance.

https://tech.instacart.com/how-instacart-uses-embeddings-to-improve-search-relevance-e569839c3c36
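As a rough illustration of how embeddings can drive search ranking, the sketch below scores products by cosine similarity between a query vector and product vectors. The three-dimensional vectors, product names, and `rank_products` helper are made up for this example and are not Instacart's model or pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_products(query_vec, product_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in product_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real system would learn these from data.
query = [0.9, 0.1, 0.0]
products = {
    "whole milk": [0.8, 0.2, 0.1],
    "oat milk": [0.7, 0.3, 0.2],
    "dish soap": [0.0, 0.1, 0.9],
}

ranking = rank_products(query, products)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the query and products share one vector space, relevance becomes a distance computation rather than an exact keyword match.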


LinkedIn: Real-time analytics on network flow data with Apache Pinot

OLAP engines are a perfect fit for operational analytics, yet they remain underutilized there. LinkedIn writes an excellent case study of Real-time Operational Metrics Analysis (ROMA) using Apache Pinot.

https://engineering.linkedin.com/blog/2022/real-time-analytics-on-network-flow-data-with-apache-pinot


Sponsored: RudderStack - Better Customer Data Integration Management For Growing Teams

In this piece, Ben Rogojan outlines your options for solving data integration challenges as your company grows: building a scalable framework or architecting a stack with the right tools. Check it out for some practical advice on which approach to take.

https://www.rudderstack.com/blog/better-customer-data-integration-management-for-growing-teams


Twitter: Data Quality Automation at Twitter

Twitter writes about its Data Quality Platform, built on top of Great Expectations. The blog narrates how the system integrates with the Airflow orchestration process.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter


Back Market Tech: Our understanding of Data Lineage.

Like the Customer 360 view for retailers, end-to-end lineage is the holy grail for data engineers. Can we define data lineage in one view? The blog narrates three types of lineage views from the Back Market team's perspective.

  1. Dataset lineage

  2. Job lineage

  3. Run lineage

https://engineering.backmarket.com/our-understanding-of-data-lineage-c72f5718abd0


Sponsored: Soda - Podcast: Data Mesh in Practice

Max Schultze, Data Engineering Manager at Zalando, and Prof. Dr. Arif Wider, Professor of Software Engineering at HTW Berlin, share their experience to bring forward the practical side of data mesh from an engineer's perspective, and answer challenging questions which tackle some of the common misconceptions of putting data mesh into practice.

https://directory.libsyn.com/episode/index/id/24095136


Glance: Computing Live stream viewers count in real-time at a High Scale !!

The Glance team writes about the infrastructure design for computing live stream viewer counts. The push vs. pull design discussion is an exciting read.

https://engg.glance.com/computing-live-stream-viewers-count-in-real-time-at-high-scale-ef813bc1b9cb


Samar Deen: Trust in AI versus trustworthy AI: Why is it important?

What is trust in AI vs. trustworthy AI? Why is it essential to understand the difference and incorporate it from the beginning of your AI journey? The author narrates the critical difference by reviewing basic statistical applications that help data scientists build trust in their models.

Part 1: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-1-of-3-af28195b7612

Part 2: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-2-of-3-77e5edebe898

Part 3: https://medium.com/data-science-at-microsoft/trust-in-ai-versus-trustworthy-ai-why-is-it-important-part-3-of-3-c9aef3cb2478


Carlin Eng: A Sequel to SQL? An introduction to Malloy

Looker's Malloy is an exciting project to watch. The blog gives an excellent overview of the project. As the author writes, "Could Malloy finally be the language that replaces SQL? It's a nigh impossible task, but I am very much hoping that it succeeds." Let's keep watching this space.

https://carlineng.com/?postid=malloy-intro#blog


Kenny Ning: How to Fix Your LookML Project Structure

A well-defined Looker project can significantly reduce data pipeline complexity. The author gives some practical design solutions for structuring a Looker project.

https://www.spectacles.dev/post/fix-your-lookml-project-structure


Bartosz Konieczny: Table formats - Delta Lake

How do the table formats of lakehouse systems like Delta Lake work? The author takes a sneak peek at the writing and reading parts of Delta Lake.

Part 1: https://www.waitingforcode.com/delta-lake/acid-file-formats-writing-delta-lake/read

Part 2: https://www.waitingforcode.com/delta-lake/table-formats-reading-delta-lake/read


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

© 2023 Ananth Packkildurai