Data Engineering Weekly #98

The Weekly Data Engineering Newsletter

Aug 29, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Wei Jian: Lessons learned after 1 year with dbt

An excellent reflection on a year of running dbt. Growing from 0 to 700 dbt models is impressive, and it aligns with the strategy the team adopted: messy to start, clean afterward.

There are two approaches to handle the demand: build dbt models carefully while meeting all best practices (more proper), or build them fast and messy and worry about tech debt later (as long as the data is correct).

https://medium.com/@imweijian/lessons-learned-after-1-year-with-dbt-a7f0ccf85b12

ManyChat: Data Modeling Today - launching cost-effective analytics for ManyChat

“Proper data modeling will take less time and money to add new features or new capabilities to the analytics. Your data will increase over time, even if your business is stable. But proper data modeling can help you avoid proportional growth of costs.”

ManyChat, meanwhile, writes about its experience using data modeling as a catalyst for growth.

My take on both approaches:

Balancing growth against clean, reusable datasets while keeping costs in check will be the next hot data problem to solve. Whether a team chooses top-down data modeling or bottom-up tech-debt handling, data modeling plays a significant role in the health of a company's data.

https://medium.com/manychat-team/data-modeling-today-launching-cost-effective-analytics-for-manychat-764d305f287b

John Foley: The Final Days of the Legacy Data Warehouse

For the first time, I heard of a company moving from Snowflake to Teradata, citing unpredictable costs. It surprised me, since I assumed legacy data warehouses were reaching their final days, as the author points out.

https://clouddb.substack.com/p/the-final-days-of-the-legacy-data

The Twitter thread sparked an exciting conversation, in which Bobby Neelon shared some practical tips for optimizing Snowflake.

Dr. Russell S. Pierce@RussellSPierce

@BNeelon @matsonj @ananthdurai Do they have a 'this query will cost > {threshold}, are you sure' as a feature via an ODBC driver? I'm new to Snowflake cost controls.

5:01 PM · Aug 27, 2022

Wise Engineering: 3 Main Elements in Your Snowflake Bill

Continuing the Snowflake cost debate, Wise Engineering writes a timely article on the three main elements of a Snowflake bill and how to understand and optimize them for cost efficiency.

Someone should launch a Udemy course: Understanding Snowflake Billing.

https://medium.com/wise-engineering/3-main-elements-in-your-snowflake-bill-45331ab7b224

Walmart Global Tech: Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi

Throwing everything into a data lake is a long-gone architecture approach, as modern lakehouses bring data modeling back into the mainstream. Walmart writes an excellent article on handling SCD-2 (Slowly Changing Dimension) with Apache Hudi.

https://medium.com/walmartglobaltech/implementation-of-scd-2-slowly-changing-dimension-with-apache-hudi-465e0eb94a5
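
For readers new to SCD-2, here is a minimal sketch of the idea in plain Python. It does not use Apache Hudi or reproduce Walmart's implementation; the `DimRow` class, its fields, and the `apply_scd2` helper are hypothetical names chosen for illustration. The point it shows: when a tracked attribute changes, the current row is closed with an end date and a new current row is appended, so history is preserved rather than overwritten.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical layout)."""
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[DimRow], key: str, value: str, as_of: date) -> List[DimRow]:
    """Return a new history with the incoming record applied SCD-2 style."""
    # Find the open (current) version for this key, if any.
    current = next(
        (row for row in history if row.key == key and row.valid_to is None),
        None,
    )
    # Nothing changed: keep the history as it is.
    if current is not None and current.value == value:
        return list(history)

    updated = []
    for row in history:
        if row is current:
            # Close the old version instead of overwriting it.
            updated.append(replace(row, valid_to=as_of))
        else:
            updated.append(row)
    # Append the new current version.
    updated.append(DimRow(key=key, value=value, valid_from=as_of))
    return updated
```

Calling `apply_scd2` twice for the same key with different values leaves two rows: the first closed at the date of the change, the second open. In a lakehouse table format the same close-and-append step would typically be expressed as an upsert, which is the part the Walmart article covers.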

Rina Diane Caballar: The Rise of SQL - It’s become the second programming language everyone needs to know

SQL everywhere has become the norm, from transactional databases to data warehouses to streaming databases. Several NoSQL systems have adopted a SQL-like syntax as an interface. The article reflects the same sentiment, explaining why SQL has become the second programming language everyone needs to know.

https://spectrum.ieee.org/the-rise-of-sql

Paul Ramsey: Rise of the Anti-Join

SQL provides multiple ways to express set operations between two or more entities. The author runs an experiment comparing the various ways to frame a query for "find me all the things in set A that are not in set B."

https://www.crunchydata.com/blog/rise-of-the-anti-join
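
To make the anti-join idea concrete, here is a small sketch that runs four common SQL formulations against an in-memory SQLite database and returns what each one produces. It is an illustration only: the article's experiment is the place to look for performance comparisons, and the table names `a` and `b`, the `id` column, and the `anti_join_results` helper are assumptions made for this example.

```python
import sqlite3


def anti_join_results(a_values, b_values):
    """Run four anti-join formulations and return each result as a list."""
    conn = sqlite3.connect(":memory:")
    # NOT NULL matters: a NULL in b.id would make NOT IN return no rows.
    conn.execute("CREATE TABLE a (id INTEGER NOT NULL)")
    conn.execute("CREATE TABLE b (id INTEGER NOT NULL)")
    conn.executemany("INSERT INTO a VALUES (?)", [(v,) for v in a_values])
    conn.executemany("INSERT INTO b VALUES (?)", [(v,) for v in b_values])

    queries = {
        "not_in": (
            "SELECT id FROM a WHERE id NOT IN (SELECT id FROM b) ORDER BY id"
        ),
        "not_exists": (
            "SELECT a.id FROM a "
            "WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id) "
            "ORDER BY a.id"
        ),
        "left_join": (
            "SELECT a.id FROM a LEFT JOIN b ON b.id = a.id "
            "WHERE b.id IS NULL ORDER BY a.id"
        ),
        # EXCEPT also removes duplicate rows, unlike the other three forms.
        "except": "SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id",
    }
    results = {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in queries.items()
    }
    conn.close()
    return results
```

With distinct, non-null ids, for example `anti_join_results([1, 2, 3, 4], [2, 4, 5])`, all four formulations return the same rows. They can diverge once NULLs or duplicates enter the picture, which is part of why the choice of formulation deserves attention.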

Chip Huyen: Streaming-First Infrastructure for Real-Time Machine Learning

Variability and unpredictability with late-arriving features make real-time machine learning pipelines challenging. The author covers strategies for building streaming-first infrastructure for real-time machine learning.

https://www.infoq.com/articles/streaming-first-real-time-ml/

Uber: Uber Freight Carrier Metrics with Near-Real-Time Analytics

Uber writes an exciting blog about building Uber Freight's carrier scoreboard analytical system. The blog narrates the evolution from querying a MySQL database to a real-time streaming system built on Apache Pinot and Flink.

https://www.uber.com/blog/uber-freight-carrier-metrics-with-near-real-time-analytics/

Etsy: Towards Machine Learning Observability at Etsy

Etsy writes an exciting article about its observability approach for the Machine Learning Platform. The blog narrates the motivation to build a centralized ML observability platform, challenges, and a high-level design.

https://www.etsy.com/codeascraft/towards-machine-learning-observability-at-etsy

Instacart: How Instacart Uses Machine Learning-Driven Autocomplete to Help People Fill Their Carts

Stories about adopting machine learning to power product features and enrich the user experience are always a delight to read. Instacart shares one such experience, describing how it uses ML to improve autocomplete.

https://tech.instacart.com/how-instacart-uses-machine-learning-driven-autocomplete-to-help-people-fill-their-carts-9bc56d22bafb

Cory Maklin: Data Governance Checklist

As a data engineer, you can't escape data governance. What is data governance? The author explains it through a checklist that covers the process, the roles, and the people involved.

https://medium.com/@corymaklin/data-governance-checklist-152a3a691002

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
