Data Engineering Weekly #59

Weekly Data Engineering Newsletter

Oct 11, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies and processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

Abhi Sivasailam: Data Mesh at Flexport - Driving Buy-in and Social/Org Challenges

One of the better walkthroughs of a data mesh implementation, shared by Flexport. Some of the key takeaways from the talk: divide the transactional and analytical layers logically, focus on modeling rather than technology, and treat domain interoperability as a first-order concern.

Financial Times: So You Want To Be A… Business Analyst

A great read full of practical recommendations if you are thinking of switching your career to business analysis. My favorite suggestion:

Examine websites or apps you regularly use, especially those with a customer journey or flow (e.g., buying something). Assess what works well or what doesn’t.

https://medium.com/ft-product-technology/so-you-want-to-be-a-business-analyst-fc28596411f5

Mark Grover: 3 Steps for a Successful Data Migration

We can measure the effectiveness of a team by the number of clean migration projects it has executed. I say "clean migration" because, as the author says:

In most migrations, getting 90% done isn't good enough, even getting 100% done is not good enough; you have to kill something old for a migration to be successful.

https://towardsdatascience.com/3-steps-for-a-successful-data-migration-9de8e7f1671c

Jupyter: Looking at notebooks from a new perspective

The blog covers two Jupyter rendering engines, nbconvert and Voilà, and highlights the Déjàvu utility, which sets new default values for several options to mimic Voilà's behavior of hiding input cells and prompt numbers.

https://blog.jupyter.org/looking-at-notebooks-from-a-new-perspective-bfd06797f188

PayPal: PayPal Introduces Dione, an Open-Source Spark Indexing Library

PayPal writes about the implementation of Dione, an open-source indexing library that creates a "shadow" table with the selected key values. Its Avro B-Tree implementation further optimizes the indexing to support single-row fetch tasks.

https://medium.com/paypal-tech/paypal-introduces-dione-an-open-source-spark-indexing-library-783e12800585
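
To make the "shadow table" idea concrete, here is a minimal sketch in plain Python. It is not Dione's API and does not use Spark or Avro. It assumes, for illustration only, that an index maps each key to the byte offset and length of its row in a data file, so a single row can be fetched with one seek instead of a full scan. The names `write_records`, `fetch`, and `demo` are hypothetical.

```python
import json
import os
import tempfile


def write_records(path, records, key_field="id"):
    """Write records as JSON lines and build a toy index: key -> (offset, length)."""
    index = {}
    with open(path, "wb") as f:
        for rec in records:
            line = (json.dumps(rec) + "\n").encode("utf-8")
            # Remember where this row starts and how many bytes it occupies.
            index[rec[key_field]] = (f.tell(), len(line))
            f.write(line)
    return index


def fetch(path, index, key):
    """Fetch a single row by key with one seek, without scanning the file."""
    offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))


def demo():
    """Build the data file and index in a temp directory, then fetch one row."""
    records = [{"id": i, "payload": f"row-{i}"} for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.jsonl")
        index = write_records(path, records)
        return fetch(path, index, 742)
```

Calling `demo()` returns the record whose `id` is 742. The in-memory dict here stands in for the index; how Dione actually lays out and stores its index is described in the linked post.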

LinkedIn: Our approach to building transparent and explainable AI systems

LinkedIn writes about its approach to building explainable AI systems. The blog defines transparency in AI as:

AI system behavior and its related components are understandable, explainable, and interpretable.

The blog describes the implementation of a homegrown system called Intellige, a user-facing model explainer for narrative explanations.

Paper: Intellige: A User-Facing Model Explainer for Narrative Explanations

https://engineering.linkedin.com/blog/2021/transparent-and-explainable-AI-systems

Netflix: Interpreting A/B test results - false positives and statistical significance

Netflix publishes the third part of its A/B testing series, which so far has covered what an A/B test is and how Netflix decision-makers use A/B testing. This part covers interpreting A/B test results, with a focus on false positives and statistical significance.

https://netflixtechblog.com/interpreting-a-b-test-results-false-positives-and-statistical-significance-c1522d0db27a

Previous Parts:

Part 1: Decision Making at Netflix

Part 2: What is an A/B test
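
To illustrate the false-positive idea from the Netflix post, here is a small simulation sketch. It is not taken from the post. It assumes a pooled two-proportion z-test and runs many "A/A" experiments in which both groups share the same true conversion rate, so every "significant" result is a false positive. The function names and default parameters are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_tests=1000, n_users=500, rate=0.1, alpha=0.05, seed=7):
    """Share of A/A tests (no real effect) that still come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n_users))
        conv_b = sum(rng.random() < rate for _ in range(n_users))
        if two_proportion_p_value(conv_a, n_users, conv_b, n_users) < alpha:
            hits += 1
    return hits / n_tests
```

With a significance threshold of `alpha=0.05`, `false_positive_rate()` should land near 5%: even when there is no real difference, roughly one test in twenty looks like a win by chance.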

Zalando: Space-efficient machine learning feature stores using probabilistic data structures - a benchmark

Zalando writes an exciting blog post discussing the use of probabilistic data structures for feature storage. In the benchmark, traditional key-value storage uses 15 GB of data, whereas the probabilistic data structure uses 470 MB for 1.762 billion data points.

https://engineering.zalando.com/posts/2021/10/space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures.html
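
For readers new to probabilistic data structures, here is a minimal Bloom filter sketch. A Bloom filter is used here as a common example of the family; the exact structure and encoding benchmarked in the Zalando post may differ. The class name, the sizing parameters, and the `"user:feature"` key format in the usage note are illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Set-membership structure: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas for the bit array and number of hash functions.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

One way to use it for features is to add strings such as `"user42:likes_sneakers"` and later test `"user42:likes_sneakers" in bloom`. The keys themselves are never stored, only bits, which is where the space saving comes from. The price is an occasional false positive, which a model consuming the features has to tolerate.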

Databricks: Pandas API on Upcoming Apache Spark™ 3.2

Databricks announces that the pandas API will be part of the Apache Spark™ 3.2 release, with Koalas merged into PySpark. The blog follows the successive iterations of the pandas API, such as 90% API compatibility coverage, more type hints, performance improvements, and stabilization.

https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
