Data Engineering Weekly #116
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
It Depends: Andrew Brust on What Will Happen in 2023 in Data, Analytics, and AI
I’m a regular listener of the It Depends podcast, and I find that Sanjeev always draws insightful views out of his guest speakers. I've read and watched a few 2023 data predictions, but found this conversation more practical, with fewer buzzwords and more of the data practitioner's reality. The conversation emphasizes the importance of governance and even points to a new role, "Legal Engineering."
The conversation around data observability points out the growing gap between observing data [i.e., finding the issues] and actually fixing data quality. Data observability solutions need to pay more attention to the remediation workflow so they don’t end up as disjointed as Data Catalogs.
Meta: Tulip - Modernizing Meta’s data platform
Meta writes about the adoption story of Tulip, the data transport and serialization protocol for its data platform. The highlight for me is the debugging tool “loggertail.” The pattern of writing analytics data to a null sink (/dev/null) by default, while being able to dynamically subscribe to and consume it from the command line, is a very powerful one.
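To make the pattern concrete, here is a minimal sketch of the idea as I understand it from the blog: events flow to a null sink unless a debugger dynamically subscribes to the category. The names (NullSink, LogRouter) and structure are my own illustration, not Meta's actual loggertail implementation.

```python
# Hypothetical sketch of the "loggertail" pattern: analytics events are
# discarded (like writing to /dev/null) unless a consumer, such as a CLI
# tail, dynamically subscribes to the logging category at runtime.

class NullSink:
    """Discards events, analogous to writing to /dev/null."""
    def write(self, event):
        pass

class LogRouter:
    def __init__(self):
        self._subscribers = {}  # category -> list of callbacks
        self._null = NullSink()

    def subscribe(self, category, callback):
        """Dynamically attach a consumer (e.g., a CLI tail) to a category."""
        self._subscribers.setdefault(category, []).append(callback)

    def log(self, category, event):
        subs = self._subscribers.get(category)
        if not subs:
            self._null.write(event)  # nobody listening: event is discarded
        else:
            for cb in subs:
                cb(event)  # delivered live to every attached consumer

router = LogRouter()
router.log("search_events", {"query": "lamp"})  # discarded, no subscriber

seen = []
router.subscribe("search_events", seen.append)  # a "tail" attaches
router.log("search_events", {"query": "sofa"})  # now delivered live
```

The appeal is that production pays almost nothing when nobody is debugging, yet an engineer can attach to the live stream without redeploying anything.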
LucianoSphere: Build custom-informed GPT-3-based chatbots for your website with very simple code
ChatGPT is the talk of the town; I found this an exciting article on building custom GPT-3-based chatbots informed by your own content. Please try it out and let us know what you build!!!
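The core idea behind such "custom-informed" bots is to stuff your own content into the prompt so the model answers from it. Here is a minimal, hedged sketch; the documents and prompt wording are illustrative assumptions, and the actual API call (which needs a paid key) is shown commented out rather than as the article's exact code.

```python
# Sketch: build a context-stuffed prompt so a completion-style model
# answers from your own website content. Documents below are made up.

def build_prompt(context_docs, question):
    """Assemble a prompt that grounds the model in the given documents."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Our store ships orders within 2 business days.",
    "Returns are accepted for 30 days after delivery.",
]
prompt = build_prompt(docs, "How fast do you ship?")

# With the OpenAI API you would then send the prompt, roughly:
# import openai
# completion = openai.Completion.create(model="text-davinci-003",
#                                       prompt=prompt, max_tokens=100)
```

The interesting engineering problem, once your content outgrows a single prompt, is selecting which snippets to include per question.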
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Luke Lin: Incorporate data into your next product
The adoption of data products is moving beyond empowering business operations to building customer-facing applications. The author writes, from a product manager's perspective, about what to consider when building data-driven, customer-facing products.
Jeff Chou: Is Databricks’s autoscaling cost efficient?
Autoscaling is both a blessing and a curse. Scaling up and down is a highly preferable architecture for data pipelines to absorb workload spikes at peak hours. The author presents an interesting benchmark comparing Databricks autoscaling against fixed-size compute.
In our analysis, we saw that a fixed cluster could outperform an autoscaled cluster in both runtime and costs for the 3 workloads we looked at by 37%, 28%, and 65%.
As with any benchmark, the results vary depending on your workload and configuration, but overall the findings highlight that autoscaling infrastructure requires more maturity.
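A toy cost model helps show why autoscaling isn't automatically cheaper: while the cluster is under-provisioned, work queues up, stretching the runtime you pay for. All the numbers below are hypothetical assumptions for illustration, not Databricks pricing or the article's benchmark data.

```python
# Toy back-of-the-envelope comparison of fixed vs. autoscaled cluster cost.
# Numbers are hypothetical; cost = nodes * hours * node-hour rate.

NODE_HOUR_COST = 1.0  # $/node/hour, assumed flat rate

def fixed_cluster_cost(nodes, runtime_hours):
    return nodes * runtime_hours * NODE_HOUR_COST

def autoscaled_cost(phases):
    """phases: list of (nodes, hours) pairs, including warm-up time."""
    return sum(n * h * NODE_HOUR_COST for n, h in phases)

# Fixed: 8 nodes finish the job in 1.0 hour -> $8.00, 1.0h runtime.
fixed = fixed_cluster_cost(8, 1.0)

# Autoscaled: starts at 2 nodes, spends 0.5h under-provisioned before
# scaling to 8 nodes for another 0.9h to clear the backlog -> $8.20, 1.4h.
auto = autoscaled_cost([(2, 0.5), (8, 0.9)])
```

In this sketch the autoscaled run is both slower and slightly more expensive, which mirrors the direction (though not the magnitude) of the article's findings: the savings from idle-time scale-down have to beat the cost of scale-up latency.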
Sponsored: 10 Things to Consider Before Choosing a Data Observability Platform
Ready to stop fighting bad data and explore end-to-end coverage with Data Observability? Learn the 10 most important things to consider when choosing a data observability platform. Get the new platform guide, and take the next step in your journey to data trust.
Coinbase: SOON (Spark cOntinuOus iNgestion) for near real-time data at Coinbase - Part 1
Coinbase writes about SOON, its unified data ingestion framework!! The blog highlights the complexity of ingesting from various sources with a mix of CDC and non-CDC events. TIL about Databricks Change Data Feed, and yes, Coinbase uses both Snowflake and Databricks. :-)
Miro: Writing data product pipelines with Airflow
The Data Contract concept aims to establish guarantees and expectations around a production-ready pipeline. While defining those expectations and constraints, developer productivity plays a critical role. Miro writes about building data product pipelines with Airflow, taking a practical approach to implementing data contracts in a batch pipeline.
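At its simplest, a batch data contract is a check that runs before a dataset is published. Here is a minimal sketch of such a check; the contract format and field names are illustrative assumptions, not Miro's implementation.

```python
# Sketch of a batch data contract check, the kind of validation one might
# run as an Airflow task before publishing a data product downstream.
# The contract schema below is a made-up example.

CONTRACT = {
    "required_fields": {"user_id": int, "event_type": str},
    "allowed_event_types": {"signup", "login", "purchase"},
}

def validate_batch(rows, contract=CONTRACT):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        for field, ftype in contract["required_fields"].items():
            if field not in row:
                violations.append(f"row {i}: missing {field}")
            elif not isinstance(row[field], ftype):
                violations.append(f"row {i}: {field} has wrong type")
        if row.get("event_type") not in contract["allowed_event_types"]:
            violations.append(f"row {i}: unknown event_type")
    return violations

good = [{"user_id": 1, "event_type": "signup"}]
bad = [{"user_id": "oops", "event_type": "deleted"}]
```

In an Airflow DAG, you could wrap `validate_batch` in a PythonOperator that raises on a non-empty violation list, so the pipeline fails fast instead of shipping contract-breaking data.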
Sponsored: Webinar: How Khatabook Migrated to RudderStack from Segment
As Khatabook rapidly scaled users in 2022, Segment’s inferior support and unreasonable pricing became increasingly painful. Join this session with Khatabook’s engineering team to find out why they chose to move to RudderStack and get an overview of the migration process.
Etsy: Improving Support for Deep Learning in Etsy's ML Platform
Etsy writes about the changing landscape of its ML platform, from classic gradient-boosted trees to deep learning techniques. The blog walks through a search-ranking example and shows how the team improved its observability and early-feedback tooling.
Dropbox: Accelerating our A/B experiments with machine learning
The A/B experiment pipeline is often the last one to finish in a data infrastructure; the window of confidence (the acceptable data sample) is always a question when making business decisions from A/B test learnings. Dropbox writes about the practical complexity of long-running experiments and how machine learning can support earlier, more intuitive decisions, using its Expected Revenue model as an example.
Flat Pack Tech: Online Analytics Pipeline Re-engineering
Mehdio: 10 Lessons Learned In 10 Years Of Data
Mehdio shares 10 lessons from a decade-long data engineering career. The blog drops some very insightful reflections on past hype: “the cloud will take data engineers’ jobs,” the aspiration to become data scientists, the rush to put notebooks into production, the modern data stack mess [oh, my favorite one], and Rust!!!
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.