Data Engineering Weekly

Data Engineering Weekly #98

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Aug 29, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Wei Jian: Lessons learned after 1 year with dbt

An excellent reflection on running dbt for a year. Growing from 0 to 700 dbt models is impressive, and the growth aligns with the strategy the team adopted: messy to start, clean afterward.

There are two approaches to handle the demand: build dbt models carefully while meeting all best practices (more proper), or build them fast and messy and worry about tech debt later (as long as the data is correct).

https://medium.com/@imweijian/lessons-learned-after-1-year-with-dbt-a7f0ccf85b12


ManyChat: Data Modeling Today - launching cost-effective analytics for ManyChat

“Proper data modeling will take less time and money to add new features or new capabilities to the analytics. Your data will increase over time, even if your business is stable. But proper data modeling can help you avoid proportional growth of costs.”

ManyChat, meanwhile, writes about its experience using data modeling as a catalyst for growth.

My take on both approaches:

Balancing growth against clean, reusable datasets and cost concerns will be the next hot data problem to solve. Whether you choose top-down data modeling or bottom-up tech-debt handling, data modeling plays a significant role in the health of a company's data.

https://medium.com/manychat-team/data-modeling-today-launching-cost-effective-analytics-for-manychat-764d305f287b


John Foley: The Final Days of the Legacy Data Warehouse

For the first time, I heard of a company moving from Snowflake to Teradata, citing unpredictable costs. It surprised me, since I assumed legacy data warehouses were reaching their final days, as the author points out.

https://clouddb.substack.com/p/the-final-days-of-the-legacy-data

The Twitter thread sparked an interesting conversation, in which Bobby Neelon shared some practical tips for optimizing Snowflake.

Dr. Russell S. Pierce (@RussellSPierce), 5:01 PM, Aug 27, 2022:
“@BNeelon @matsonj @ananthdurai Do they have a 'this query will cost > {threshold}, are you sure' as a feature via an ODBC driver? I'm new to Snowflake cost controls.”

Wise Engineering: 3 Main Elements in Your Snowflake Bill

Continuing the Snowflake cost debate, Wise Engineering writes a timely article on the three main elements of your Snowflake bill, and how to understand and optimize each for cost efficiency.

Someone should launch a Udemy course: Understanding Snowflake Billing.

https://medium.com/wise-engineering/3-main-elements-in-your-snowflake-bill-45331ab7b224


Sponsored: Firebolt is a proud sponsor of Data Engineering Weekly.

Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.

https://www.firebolt.io/


Walmart Global Tech: Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi

Throwing everything into a data lake is a long-gone architectural approach, as modern lakehouses bring data modeling back into the mainstream. Walmart writes an excellent article on handling SCD-2 (Slowly Changing Dimension) with Apache Hudi.

https://medium.com/walmartglobaltech/implementation-of-scd-2-slowly-changing-dimension-with-apache-hudi-465e0eb94a5
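For readers new to the pattern, here is a minimal sketch of SCD-2 bookkeeping in plain Python. It is not Walmart's implementation and does not use Apache Hudi; the `DimRow` fields and the `apply_scd2` helper are hypothetical names chosen for illustration. The idea it shows is the core of SCD-2: when an attribute changes, close the current row and append a new current row instead of overwriting.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    # Hypothetical dimension row layout, for illustration only.
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date]  # None means the row is still open-ended
    is_current: bool


def apply_scd2(history: List[DimRow], key: str, new_value: str, as_of: date) -> List[DimRow]:
    """Return a new history with one incoming (key, value) change applied."""
    updated: List[DimRow] = []
    needs_new_row = True
    for row in history:
        if row.key == key and row.is_current:
            if row.value == new_value:
                # Nothing changed: keep the current row and add nothing.
                needs_new_row = False
                updated.append(row)
            else:
                # Value changed: close the current row as of the change date.
                updated.append(replace(row, valid_to=as_of, is_current=False))
        else:
            updated.append(row)
    if needs_new_row:
        # Open a new current row for a new key or a changed value.
        updated.append(DimRow(key, new_value, as_of, None, True))
    return updated


history: List[DimRow] = []
history = apply_scd2(history, "cust-1", "Austin", date(2022, 1, 1))
history = apply_scd2(history, "cust-1", "Dallas", date(2022, 6, 1))
history = apply_scd2(history, "cust-1", "Dallas", date(2022, 7, 1))  # unchanged value
```

After these three calls, `history` holds two rows for `cust-1`: a closed "Austin" row and an open "Dallas" row. A lakehouse table format would typically express the same close-and-append step as an upsert over the dimension table.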


Rina Diane Caballar: The Rise of SQL - It’s become the second programming language everyone needs to know

SQL everywhere has become the norm, from transactional databases to data warehouses to streaming databases. Several NoSQL systems have formally adopted SQL-like syntax as an interface. The article reflects the same sentiment, explaining why SQL has become the second programming language everyone needs to know.

https://spectrum.ieee.org/the-rise-of-sql


Sponsored: Insights and Musings from Holden Karau, Open Source Engineer at Netflix, in this Data Dream Team podcast episode

If you’re a data engineer and you want to become a better data engineer, or you’re switching from software engineering to data engineering...I think a great thing that you can do is code reviews and open-source to the tools that you’re going to be using. You’re going to understand how they’re built. You’re going to see how the different pieces interact. And of course, you can do this by writing code in the tools that you’re using as well.

https://sodapodcast.libsyn.com/ep-011-meet-holden-karau-author-and-open-source-engineer-at-netflix


Paul Ramsey: Rise of the Anti-Join

SQL provides multiple ways to express set operations between two or more entities. The author runs an experiment comparing various ways to frame the query "Find me all the things in set A that are not in set B."

https://www.crunchydata.com/blog/rise-of-the-anti-join
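To make the anti-join question concrete, here is a small sketch that runs four common formulations against an in-memory SQLite database through Python's `sqlite3` module. The tables `a` and `b` and their contents are made up for illustration, and SQLite stands in for PostgreSQL here, so this shows only that the formulations agree on the result, not how they compare on performance.

```python
import sqlite3

# Toy data (an assumption for this sketch): A = {1, 2, 3, 4}, B = {2, 4, 5}.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER NOT NULL);
    CREATE TABLE b (id INTEGER NOT NULL);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (2), (4), (5);
    """
)

# Four ways to ask for "everything in A that is not in B".
ANTI_JOIN_QUERIES = {
    "not_exists": """
        SELECT a.id FROM a
        WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
        ORDER BY a.id
    """,
    "left_join_is_null": """
        SELECT a.id FROM a
        LEFT JOIN b ON b.id = a.id
        WHERE b.id IS NULL
        ORDER BY a.id
    """,
    # EXCEPT also removes duplicate rows from the result.
    "except": "SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id",
    # NOT IN returns no rows at all if b.id contains a NULL,
    # which is why the columns are declared NOT NULL above.
    "not_in": "SELECT id FROM a WHERE id NOT IN (SELECT id FROM b) ORDER BY id",
}

results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in ANTI_JOIN_QUERIES.items()
}

for name, ids in results.items():
    print(name, ids)
```

On this data, every formulation returns the ids 1 and 3. The differences between them show up in NULL handling, duplicate handling, and the query plans a given database chooses.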


Chip Huyen: Streaming-First Infrastructure for Real-Time Machine Learning

Variability and unpredictability with late-arriving features make the real-time machine learning pipeline challenging. The author writes about various infrastructure strategies to build streaming-first infrastructure for real-time machine learning.

https://www.infoq.com/articles/streaming-first-real-time-ml/


Sponsored: Why Business Applications Create Data Integration Debt

This article from Ben Rogojan explores the challenges of data integration in a world where more teams need access to more data for more complex use cases, and it outlines the pitfalls of attacking data integration without a thoughtful strategy.

https://www.rudderstack.com/blog/why-business-applications-create-data-integration-debt


Uber: Uber Freight Carrier Metrics with Near-Real-Time Analytics

Uber writes an exciting blog post about building Uber Freight's carrier scoreboard analytical system. The post narrates the iteration from querying a MySQL database to establishing a real-time streaming system using Apache Pinot and Flink.

https://www.uber.com/blog/uber-freight-carrier-metrics-with-near-real-time-analytics/


Etsy: Towards Machine Learning Observability at Etsy

Etsy writes an exciting article about its observability approach for the Machine Learning Platform. The blog narrates the motivation to build a centralized ML observability platform, challenges, and a high-level design.

https://www.etsy.com/codeascraft/towards-machine-learning-observability-at-etsy


Instacart: How Instacart Uses Machine Learning-Driven Autocomplete to Help People Fill Their Carts

Stories about adopting machine learning to power product features that enrich the user experience are always a delight to read. Instacart shares one such experience, describing how it uses ML to improve autocomplete.

https://tech.instacart.com/how-instacart-uses-machine-learning-driven-autocomplete-to-help-people-fill-their-carts-9bc56d22bafb


Cory Maklin: Data Governance Checklist

As a data engineer, you can't escape data governance. What is data governance? The author explains it with a checklist that walks through the process, the roles, and the people involved.

https://medium.com/@corymaklin/data-governance-checklist-152a3a691002


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
