Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Benn Stancil: Fine, let's talk about data contracts
Data contracts are the most discussed topic in the data world lately. Benn agrees that disagreement is a problem but disagrees that we need an agreement to solve it.
From Benn’s article:
Data contracts make exactly that trade. They replace a brittle technical system with a negotiating table. And the more that contracts depend on one another, the more people will want to be involved. I don’t know if that kills innovation, but it’s at least an annoying set of conversations that most people don’t want to have.
I'm afraid I have to disagree with this assessment. Change Management is all we do in software engineering.
Code review, PRD, RFC, sprint planning: everything is a negotiation in software engineering. Does it kill innovation? No. In fact, it accelerates industrial-scale innovation. So why the special treatment for data engineers?
https://benn.substack.com/p/data-contracts
This raises a question: "Hey, Ananth. I’m a data engineering leader. When should I focus on data contracts?" I made a Magic Quadrant for you.
Ping me on LinkedIn. Curious to know your thoughts: https://www.linkedin.com/in/ananthdurai/
Chad Sanderson: The Production-Grade Data Pipeline
Chad talks about what it takes to build a production-grade data pipeline. The article focuses on:
Collaborative design
Contracts
Expectations
Monitoring
Change Management
https://dataproducts.substack.com/p/the-production-grade-data-pipeline
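To make the contract and expectations ideas a little more concrete, here is a minimal sketch of what a contract check could look like. It is not taken from Chad's article: the `orders` event, the `FieldContract` class, and the `validate` helper are all hypothetical names invented for illustration, and a real setup would more likely rely on a schema registry or a dedicated validation tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldContract:
    """One field in a hypothetical data contract."""
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical contract for an illustrative "orders" event (assumption, not from the article).
ORDERS_CONTRACT = [
    FieldContract("order_id", str),
    FieldContract("amount", float),
    FieldContract("coupon_code", str, nullable=True),
]


def validate(record: dict[str, Any], contract: list[FieldContract]) -> list[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    violations = []
    for field in contract:
        if field.name not in record:
            violations.append(f"missing field: {field.name}")
            continue
        value = record[field.name]
        if value is None:
            if not field.nullable:
                violations.append(f"null not allowed: {field.name}")
        elif not isinstance(value, field.dtype):
            violations.append(
                f"wrong type for {field.name}: expected {field.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    return violations
```

A producer could run a check like this in CI or at publish time, and the same violation list could feed the monitoring and change management steps: a non-empty result is a signal to talk to the consumer before shipping the change.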
Lauren Balik: How Fivetran + dbt actually fail
Is ELT far more heavily rent-seeking than ETL? Did we shift the data transformation too far to the right? The author discusses Fivetran and dbt as examples of the ELT model.
https://medium.com/@laurengreerbalik/how-fivetran-dbt-actually-fail-3a20083b2506
Ben Rogojan: Onboarding For Data Teams
The onboarding process is easily the best time to learn about organizational culture. An effective onboarding process demonstrates strong, empathetic, and inclusive engineering practices. The author writes about the experience of onboarding onto data teams.
https://medium.com/coriers/onboarding-for-data-teams-100e041a012c
Sponsored: Firebolt - Cloud Data Warehouse Costs: Look Before You Leap
Have you ever totally overrun your monthly budget for an analytics environment overnight? Here are a few thoughts on how we prepare ourselves for what lies ahead in the public cloud and in the economy. In this post, we look at factors to consider when building a data warehouse. Our goal is to point out the potholes you are most likely to hit from a cost perspective and what you can do to avoid them.
https://www.firebolt.io/blog/cloud-data-warehouse-costs-look-before-you-leap
Intuit: How to Drive Grassroots AI Innovation? Tap into a Diversity of Ideas
Bottom-up innovation is the best way to fuel and iterate on a company's growth. Intuit writes about six steps to drive grassroots innovation. Seeing 74% of innovation paper submissions come from IC (Individual Contributor) engineers is impressive.
Murat Demirbas: SQLite: Past, Present, and Future
SQLite is reaching the browser. I can’t wait to try analytics on the edge.
The author discusses the SQLite architecture, transaction guarantees in SQLite, and what is ahead of SQLite in the near future.
https://muratbuffalo.blogspot.com/2022/09/sqlite-past-present-and-future.html
Sponsored: Soda - Podcast: Data Mesh in Practice
Max Schultze, Data Engineering Manager at Zalando, and Prof. Dr. Arif Wider, Professor of Software Engineering at HTW Berlin, share their experience in bringing forward the practical side of data mesh from an engineer's perspective and answer challenging questions that tackle some of the common misconceptions of putting data mesh into practice.
https://directory.libsyn.com/episode/index/id/24095136
Robin Moffatt: Data Engineering in 2022: Storage and Access
Looking back at history and comparing it with the current state is always good. It is an exciting time for data engineering, with significant investment and progress in storing and querying data. The author compares the days of Hadoop/HDFS to the Lakehouse architecture and the progress made in data infrastructure.
https://rmoff.net/2022/09/14/data-engineering-in-2022-storage-and-access/
Dr. Vijay Srinivas Agneeswaran: Efficient transformers: Survey of recent work
Transformers have become the standard for NLP tasks such as machine translation, text summarization, and question answering. Based on a survey of efficient transformers, the author categorizes them along these dimensions:
Computational complexity
Spectral complexity
Robustness
Privacy
Approximation
Model compression
Sponsored: RudderStack - Better Customer Data Integration Management For Growing Teams
In this piece, Ben Rogojan outlines your options for solving data integration challenges as your company grows: building a scalable framework or architecting a stack with the right tools. Check it out for some practical advice on which approach to take.
https://www.rudderstack.com/blog/better-customer-data-integration-management-for-growing-teams
Slack: Recommend API - Unified end-to-end machine learning infrastructure to generate recommendations
Slack writes about its unified end-to-end machine learning infrastructure to generate recommendations. The article highlights some product experiences where machine learning provides a rich experience. The article is a classic example of how to use ML to drive product features and growth.
https://slack.engineering/recommend-api/
Netflix: Machine Learning for Fraud Detection in Streaming Services
Netflix went from treating account sharing as okay in 2016 to cracking down on it in 2022. I believe the article is the first glimpse of how fraud detection works behind the scenes.
Picnic: MLOps Principles to build Picnic’s Data Science Platform
Every piece of infrastructure is driven by an organization's basic principles for achieving its business goals. Picnic writes about its principles for building an internal data science platform.
https://blog.picnic.nl/mlops-principles-to-build-picnics-data-science-platform-851cbe2e8045
Marc Kelechava: Monitoring machine learning systems at Faire
Faire writes about its real-time ranking feature, challenges in monitoring real-time ranking model evaluation metrics in near real-time, and anomaly detection on critical metrics. The blog is an excellent read on how to build a reactive system to improve operational efficiency.
https://craft.faire.com/monitoring-machine-learning-systems-at-faire-6d5f8337e9e7
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.