Data Engineering Weekly #93

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Jul 18, 2022

Sponsored: Rudderstack - DROP the Modern Data Stack

It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.


Joseph Gonzalez: How Machine Learning Became Useful - Reflecting on a Decade of Research

Everyone talked about ML, but what they needed was data

An excellent article capturing the author's observations on industrial-scale ML engineering. The author stresses that access to data and compute, exposed through simple abstractions, is the key to making ML useful. It is also a good read on the next significant challenges for ML.

https://medium.com/@profjoeyg/how-machine-learning-became-useful-5732c3419c81


Airbnb: How Airbnb Safeguards Changes in Production - Evolution of Airbnb’s experimentation platform

Data engineering practices & A/B testing have broader applications in reliability engineering. While working on the Slack observability team, I realized that a simple Apdex score computed over HTTP response codes is a significant feedback loop for improving system reliability. Airbnb writes an excellent blog narrating how it uses A/B testing & data science to safeguard changes in production.
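To make the idea concrete, here is a minimal sketch of an Apdex-style score over HTTP responses. It is not Slack's or Airbnb's implementation; the latency threshold and the status-code buckets are assumptions for illustration (fast 2xx as satisfied, slow successes as tolerating, 5xx as frustrated).

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # server-side latency

def apdex(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Apdex = (satisfied + tolerating / 2) / total."""
    satisfied = tolerating = 0
    for r in requests:
        if r.status >= 500:
            continue  # frustrated: server errors count as zero
        if r.latency_ms <= threshold_ms:
            satisfied += 1
        elif r.latency_ms <= 4 * threshold_ms:
            tolerating += 1
        # anything slower than 4x the threshold is also frustrated
    return (satisfied + tolerating / 2) / len(requests) if requests else 1.0

# Track the score before and after a change as a rollout guardrail.
reqs = [Request(200, 120), Request(200, 900), Request(503, 40), Request(200, 300)]
print(f"Apdex: {apdex(reqs):.2f}")  # 0.62 for this sample
```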

https://medium.com/airbnb-engineering/how-airbnb-safeguards-changes-in-production-9fc9024f3446


Criteo: Scheduling Data Pipelines at Criteo - Introducing Criteo’s BigDataFlow project

Criteo writes about its homegrown scheduler, Cuttle, and BigDataflow, a self-service web application built as an abstraction on top of Cuttle. The design centers on a simpler task abstraction: by inferring the DAG of tasks statically, the platform can implement everything a workflow management system needs, such as task execution, data retention & backfilling.
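As a rough illustration of static DAG inference (a hypothetical sketch, not Cuttle's or BigDataflow's API): if each task declares the datasets it reads and writes, the scheduler can derive the dependency edges itself instead of asking users to wire them by hand.

```python
# Hypothetical sketch: derive a DAG from declared inputs and outputs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    reads: frozenset[str]
    writes: frozenset[str]

def infer_dag(tasks: list[Task]) -> dict[str, set[str]]:
    """Return {task: set of upstream tasks} by matching reads to writes."""
    producers = {ds: t.name for t in tasks for ds in t.writes}
    return {
        t.name: {producers[ds] for ds in t.reads if ds in producers}
        for t in tasks
    }

tasks = [
    Task("ingest_clicks", frozenset(), frozenset({"clicks_raw"})),
    Task("clean_clicks", frozenset({"clicks_raw"}), frozenset({"clicks_clean"})),
    Task("daily_report", frozenset({"clicks_clean"}), frozenset({"report"})),
]
print(infer_dag(tasks))
# {'ingest_clicks': set(), 'clean_clicks': {'ingest_clicks'}, 'daily_report': {'clean_clicks'}}
```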

Part 1: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55

Part 2: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-2-8b0da38ff3a4


Riskified Technology: Spark Streaming as a Service

Riskified writes about its self-serve streaming infrastructure built on Spark Streaming. The first job of a platform is to minimize the manual, labor-intensive tasks that create unavoidable bureaucracy, and it's admirable that the author calls this out explicitly in the blog.

https://medium.com/riskified-technology/spark-streaming-as-a-service-53420d8a857b


Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.

Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end users love.

https://www.firebolt.io/


Mikkel Dengsøe: Data teams are getting larger, faster - On the relationship between data team size and complexity

Ananth Packkildurai (@ananthdurai), Feb 10, 2022: "As the size of the organization grows, data maturity shrinks. The complexity of the data outgrows its usability. Has anyone seen this pattern? Curious to hear data folks' thoughts on it."

A few months back, I observed that data maturity shrinks as an organization grows. The author establishes the same case and shares a few techniques for handling the complexity.

https://mikkeldengsoe.substack.com/p/data-team-size


LinkedIn: Measuring marketing incremental impacts beyond last click attribution

A marketing campaign can reach a consumer via multiple channels, so last-click conversion measurement can misrepresent the influence of any single channel. LinkedIn writes about measuring incremental impacts with a Bayesian Structural Time Series (BSTS) model to estimate the causal effect of an intervention.
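The core idea: fit a structural time series model on the pre-campaign period (optionally with a control series as a covariate), forecast the counterfactual for the post period, and treat the gap between actuals and the forecast as the incremental impact. Below is a minimal sketch on simulated data, using statsmodels' UnobservedComponents as a stand-in for BSTS; it is not LinkedIn's implementation, and the series and lift are made up.

```python
import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

# Simulated daily conversions: a control market (x) and a test market (y) that
# gets a +15 lift once the hypothetical campaign launches on day 90.
rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(0, 1, 120))
y = 0.8 * x + rng.normal(0, 2, 120)
y[90:] += 15

pre, post = slice(0, 90), slice(90, 120)

# Fit a structural time series model on the pre-period, with the control as a covariate.
model = UnobservedComponents(y[pre], level="local linear trend", exog=x[pre].reshape(-1, 1))
fit = model.fit(disp=False)

# Forecast the counterfactual: what y would have looked like without the campaign.
counterfactual = fit.get_forecast(steps=30, exog=x[post].reshape(-1, 1)).predicted_mean
incremental = y[post] - counterfactual
print(f"Estimated incremental conversions: {incremental.sum():.1f}")
```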

https://engineering.linkedin.com/blog/2022/measuring-marketing-incremental-impacts


Shippeo: Debezium to Snowflake - Lessons learned building data replication in production

Shippeo writes about its CDC infrastructure using Debezium, Kafka & Snowflake. The blog includes excellent insights on data format choice, Postgres reliability, and observability metrics.
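For readers new to the setup, a Debezium Postgres source connector is typically registered through the Kafka Connect REST API with a small JSON config. Below is a minimal, hypothetical sketch: the host names, credentials, table list, and property values are placeholders rather than Shippeo's configuration, and the exact property set depends on the Debezium version.

```python
# Hypothetical registration of a Debezium Postgres connector via Kafka Connect's REST API.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "shipments",
        "database.server.name": "cdc",            # topic prefix in Debezium 1.x
        "table.include.list": "public.orders",
        "snapshot.mode": "initial",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json()["name"], "registered")
```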

https://medium.com/shippeo-tech-blog/debezium-to-snowflake-lessons-learned-building-data-replication-in-production-a5430a9fe85b


Sponsored: Rudderstack - The Data Maturity Journey - Webinar July 27th at 10:30 AM PT / 1:30 ET

Join RudderStack live with the Seattle Data Guy, Ben Rogojan, and Max Werner, Owner at Obsessive Analytics Consulting, to learn about the four stages of The Data Maturity Journey. You'll come away with practical architectures you can use to drive better decision making at every stage of your company's growth.

https://www.rudderstack.com/video-library/the-data-maturity-journey


GovTech Singapore: Towards a comparable metric for AI model interpretability

Data Engineering Weekly frequently features articles about the social impact of AI. XAI (explainable AI) plays a vital role in bringing transparency to AI systems. This two-part blog from GovTech Singapore explains the applications of XAI and gives an overview of XAI methods.
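As a taste of the model-agnostic end of that method spectrum (an illustration, not an example from the blog itself): permutation importance scores a feature by how much the model's test performance drops when that feature's values are shuffled.

```python
# Illustrative only: permutation importance, one of the simpler model-agnostic
# interpretability methods. The dataset and model choice here are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the held-out score degrades.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top5 = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```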

Part 1: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-1-d55d4bae8a58

Part 2: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-2-423e4fc2b232


Policygenius: How we implemented a Tableau governance strategy

Policygenius writes about implementing the data governance policy for Tableau. The blog highlights the classic dashboard management problem and explains how a decentralized data-driven approach helped them to scale the data governance strategy.

https://medium.com/policygenius-stories/how-we-implemented-a-tableau-governance-strategy-59c055727433


Guang X: Lessons learned from Azure Data Factory

I hear significantly less about Azure services for data engineering. The blog is an excellent first look at the lessons learned from Azure Data Factory.

https://medium.com/@guangx/lessons-learned-from-azure-data-factory-4778eca0fc25


Jellysmack Labs: How Jellysmack Pushed Data Science Jobs Orchestration to a Production-ready Level

Jellysmack Labs writes about its Airflow usage for running production-ready data science jobs. The focus on DAG quality and uniformity in declaring DAGs & tasks is an exciting read.
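One common way to enforce that kind of uniformity (a hypothetical sketch, not Jellysmack's code) is a small DAG factory, so every team declares jobs with the same defaults, naming convention, and retry policy.

```python
# Hypothetical illustration of uniform DAG declaration via a factory; the defaults,
# naming convention, and task callables are made up for this sketch.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

DEFAULT_ARGS = {
    "owner": "data-science",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

def make_ds_dag(job_name: str, schedule: str, steps: dict) -> DAG:
    """Build a DAG where every task follows the same declaration pattern."""
    dag = DAG(
        dag_id=f"ds__{job_name}",
        default_args=DEFAULT_ARGS,
        schedule_interval=schedule,
        start_date=datetime(2022, 7, 1),
        catchup=False,
        tags=["data-science"],
    )
    previous = None
    for step_name, callable_ in steps.items():
        task = PythonOperator(task_id=step_name, python_callable=callable_, dag=dag)
        if previous is not None:
            previous >> task  # linear chain; branching jobs would declare edges here
        previous = task
    return dag

dag = make_ds_dag(
    "churn_model",
    "0 6 * * *",
    {"extract": lambda: print("extract"), "train": lambda: print("train")},
)
```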

https://medium.com/jellysmacklabs/how-jellysmack-pushed-data-science-jobs-orchestration-to-a-production-ready-level-e92dc4786413


Walmart Global Tech: DataBathing — A Framework for Transferring the Query to Spark Code

Staying with the theme of job uniformity, Walmart writes about DataBathing, a framework that transpiles SQL into Spark DataFrame calculation-flow code. The blog claims performance improvements of 10 to 80% with DataBathing and mentions that a follow-up post will explain why such a transpiler is required. I'm curious to read the follow-up.
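To illustrate the kind of translation such a transpiler performs (this is hand-written PySpark for a hypothetical registered sales table, not DataBathing's generated output): a SQL query and its equivalent DataFrame calculation flow.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-to-dataframe-illustration").getOrCreate()

# SELECT country, COUNT(*) AS orders
# FROM sales
# WHERE amount > 100
# GROUP BY country
# ORDER BY orders DESC
df = (
    spark.table("sales")
    .filter(F.col("amount") > 100)
    .groupBy("country")
    .agg(F.count("*").alias("orders"))
    .orderBy(F.col("orders").desc())
)
df.show()
```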

https://medium.com/walmartglobaltech/databathing-a-framework-for-transferring-the-query-to-spark-code-484957a7e049


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
