Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Ternary Data: Data Contracts & Domain Ownership w/ Ananth Packkildurai
Last week I sat down with Joe Reis & Matt Housley to talk about data contracts and domain ownership. I talked about Schemata, the first open-source “data contract” framework. I believe data contracts & data sharing are the next big wave in data engineering. You can find more details about Schemata at schemata.app
GoCardless: Data Contracts at GoCardless — 6 Months On
Staying on data contracts: where Schemata.app addresses the semantic layer of data contracts, GoCardless writes about data transportation using the Outbox Pattern.
https://medium.com/gocardless-tech/data-contracts-at-gocardless-6-months-on-bbf24a37206e
Debezium has an excellent blog post on implementing CDC with the Outbox pattern.
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
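For readers new to the Outbox pattern, here is a minimal sketch of the idea: the service writes its business row and an event row to an outbox table in the same database transaction, and a separate relay forwards the outbox rows downstream. Everything named here (the orders and outbox tables, their columns, place_order, relay_once) is hypothetical and chosen for illustration. It uses SQLite and a polling relay as a stand-in, whereas a CDC tool such as Debezium captures the outbox rows from the database log instead of polling.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical tables: one business table and one outbox table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, item TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id TEXT NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def place_order(conn, item):
    """Write the order and its event in a single transaction."""
    order_id = str(uuid.uuid4())
    payload = json.dumps({"id": order_id, "item": item})
    with conn:  # commits on success, rolls back both inserts on error
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderCreated", payload),
        )
    return order_id


def relay_once(conn, publish):
    """Polling stand-in for CDC: forward unpublished events in insert order."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            # Mark as published only after publish() returned without raising.
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the event is marked as published only after the publish call returns, a crash between the two steps re-sends the event on the next pass, so consumers of a relay like this should tolerate duplicates.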
Shopify: Lessons Learned From Running Apache Airflow at Scale
Shopify writes about lessons learned from running Airflow at scale. Shopify runs 10,000 DAGs with an average of 400 concurrent tasks at any given point and 150,000 DAG runs per day! The manifest file approach is exciting and reminds me of Slack's "SlackDAG," an implementation of the Airflow DAG that helped solve ownership problems and enforce best practices as part of the CI/CD process.
https://shopifyengineering.myshopify.com/blogs/engineering/lessons-learned-apache-airflow-scale
Sponsored: Firebolt - Embedded Analytics vs. Data Apps
Data Apps is still a loosely defined term, and there’s a lot of debate and confusion about what it really means and how it differs from traditional dashboarding and embedded analytics. Boaz Farkash shares his point of view on the subject.
https://www.firebolt.io/blog/embedded-analytics-vs-data-apps
Vimeo: The evolution of event data collection at Vimeo, part 1: the Fatal Attraction era
Defining well-regulated event fields at the source significantly reduces the burden on the ETL system. Vimeo writes about its journey of building a scalable event tracking library. The blog narrates the pros & cons of “attribute-based” tracking vs. “user-action-based” tracking.
Expedia: Software Architectural Patterns in Data Engineering
Data engineering has come a long way from drag-and-drop tools to robust frameworks for programmatically authoring data pipelines, which opens the door to adopting software engineering best practices. Expedia maps data engineering tools and frameworks to software architectural patterns.
Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook
Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our complete guide.
Download the modern data leader’s playbook
DoorDash: Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash
An efficient experimentation framework is vital for safe and fast product iteration. DoorDash writes about Dash-AB, a centralized library for statistical analysis.
Netflix: A Survey of Causal Inference Applications at Netflix
Staying with the importance of a culture of experimentation, Netflix held an internal Causal Inference and Experimentation Summit. I thought this was amazing and could apply to other parts of critical infrastructure. Netflix shares a sneak peek of the event in this blog post with a few selected talks.
https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f
Grubhub: Forecasting Grubhub Order Volume At Scale
Real-time demand forecasting in a supply chain system is always challenging. Grubhub writes about its demand forecasting data infrastructure and design principles.
https://bytes.grubhub.com/forecasting-grubhub-order-volume-at-scale-a966c2f901d2
Sponsored: RudderStack - Fireside Chat: The Future of Analytics on the Modern Data Stack
Join RudderStack for a fireside chat on the future of analytics with Hex co-founder, Barry McCardel, and Transform co-founder, Nick Handel. They'll talk about bridging the gap between data and business functions, discuss the current state of analytics, and examine the next challenges to be tackled in data analytics.
https://www.rudderstack.com/video-library/future-of-analytics-on-the-modern-data-stack
MoMoTechnologies: MLOps at MoMo: Feature Store
The feature store has become an essential part of the data infrastructure. MoMo writes about the need for a feature store and its evaluation of open-source feature stores. The blog narrates how MoMo developed its ML workflow, ingestion, and data quality management.
https://tech.info-momo.com/mlops-at-momo-feature-store-e38e59da272e
Erick Reyes: This is how I onboarded more than 10 Data Engineers and got excellent reviews and feedback
A well-thought-through onboarding process boosts developer productivity and establishes an inclusive engineering culture. The author shares thoughts on approaches to onboarding new data engineers. Please comment on what data engineering onboarding looks like on your team.
Ifihanagbara Olusheye: How to Use PyScript – A Python Frontend Framework
PyScript generated a lot of excitement in the data community. I found this to be an excellent tutorial on how to use PyScript.
https://www.freecodecamp.org/news/pyscript-python-front-end-framework/
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.