Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of the Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Benn Stancil: The Modern Data Experience - How a revolution comes together. Or doesn’t
Benn Stancil writes another exciting blog highlighting the missing focus on the modern data experience. Modern data engineering tools each focus on a specific part of data engineering, which leaves data practitioners with an isolated and inconsistent experience. An integrated data platform experience that connects modern and older data tools would greatly accelerate a data-driven culture.
https://benn.substack.com/p/the-modern-data-experience
Microsoft Research Lab: Discovering Related Data At Scale
There are several advantages to adopting a decentralized schema-on-read data lake approach. However, it can lead to inconsistent schema naming. A "server" column can be named "Machine", "Host", or "instance" in other tables. Finding column relationships is a complex task, historically solved by sampling the data or matching unique values. Microsoft Research writes an exciting paper that uses SQL query logs to find the relationships.
Paper: https://www.microsoft.com/en-us/research/publication/discovering-related-data-at-scale/
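To make the query-log idea concrete, here is a minimal sketch of one way it could work: count how often two differently named columns appear together in an equality join predicate. This is not the paper's algorithm. The regexes, the function name `related_columns`, and the sample queries are all assumptions made for illustration, and the sketch assumes every table in the log is given an explicit alias.

```python
import re
from collections import Counter

# Equality predicates of the form alias.column = alias.column
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")
# "FROM table alias" / "JOIN table [AS] alias", used to resolve aliases
TABLE_ALIAS = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)(?:\s+AS)?\s+(\w+)", re.IGNORECASE)


def related_columns(query_log):
    """Count how often each pair of (table, column) is joined in the log."""
    counts = Counter()
    for query in query_log:
        # Map each alias back to its table name for this query only
        aliases = {alias.lower(): table.lower()
                   for table, alias in TABLE_ALIAS.findall(query)}
        for lq, lcol, rq, rcol in JOIN_PREDICATE.findall(query):
            left = (aliases.get(lq.lower(), lq.lower()), lcol.lower())
            right = (aliases.get(rq.lower(), rq.lower()), rcol.lower())
            if left[0] != right[0]:  # ignore predicates within one table
                # Sort so (a, b) and (b, a) count as the same pair
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical query log: the same concept is named server, host, machine
queries = [
    "SELECT * FROM servers s JOIN alerts a ON s.server = a.host",
    "SELECT a.id FROM alerts a JOIN servers s ON a.host = s.server",
    "SELECT * FROM servers s JOIN deploys d ON s.server = d.machine",
]

counts = related_columns(queries)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs that are joined often, such as servers.server and alerts.host in this toy log, are candidates for being the same concept under different names. A real system would need a proper SQL parser and far more signal than raw join counts.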
Jordan Volz: Five Predictions for the Future of the Modern Data Stack
The emerging cloud-native data platforms, collectively known as the "modern data stack," lower the entry barriers to data analytics. The author walks through the developments in the modern data stack and the bright side of "Modern Data Stack V2", focusing on AI, Data Sharing, Data Governance, Streaming & Application Serving.
InfoQ: AI, ML, and Data Engineering InfoQ Trends Report - August 2021
InfoQ released its 2021 AI/ML/Data Engineering trends report, presented as a CHASM model. The top highlights are deep learning frameworks moving from Innovators to Early Adopters and AutoML picking up momentum. I've not come across any business process automation with digital assistance, so finding the digital assistance frameworks at the Early Adopters stage is a bit of a surprise.
https://www.infoq.com/articles/ai-ml-data-engineering-trends-2021/
Trifacta: Summer of SQL - Why It’s Back
We can associate the growth of the modern data stack with SQL reclaiming the throne of data engineering. The blog is an excellent overview of why SQL is back now and why it is a perfect tool for data engineering.
https://www.trifacta.com/blog/sql-for-elt-and-cloud-data-engineering/
Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue
Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.
https://rudderstack.com/blog/churn-prediction-with-bigqueryml
Slack: Data Lineage at Slack
Slack writes about its data lineage journey, highlighting the lineage ingestion and consumption parts of it. The notification service built on the lineage data is an excellent reminder that the potential of lineage increases exponentially when we start integrating it into the data practitioner's workflow.
https://slack.engineering/data-lineage-at-slack/
Gusto: What is Growth Engineering?
I am an application developer. Why should I care about data engineering?
I was asked this question at one of the data engineering talks. I thought a bit and responded, without much conviction, that
Every engineer is a data engineer/practitioner.
The blog from Gusto is an exciting read on growth engineering practices with the AARRR metrics framework, and I still stand by my statement :-).
https://engineering.gusto.com/what-is-growth-engineering/
Sisu: Why aren't cloud analytics platforms just UDFs?
UDFs bring uniformity and consistency to the data pipeline's business logic; however, few cloud providers support them, and there are no standards for defining a UDF. LinkedIn attempted to solve this problem with Transport: Towards Logical Independence Using Translatable Portable UDFs.
The author raises an excellent question about the role of UDFs in the modern data platform and the importance of UDF standardization.
https://sisudata.com/blog/cloud-analytics-platforms
Nubank: Scaling data analytics with software engineering best practices
Nubank writes about its process of scaling data analytics with software engineering practices. The blog is an exciting reminder to focus on structured dataset creation, collaboration & knowledge sharing, and the lifecycle management of datasets.
https://building.nubank.com.br/scaling-data-analytics-with-software-engineering-best-practices/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.