Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
Abhi Sivasailam: Data Mesh at Flexport - Driving Buy-in and Social/Org Challenges
One of the better walkthroughs of a data mesh implementation, shared by Flexport. Key takeaways from the talk: logically separate the transactional and analytical layers, focus on modeling rather than technology, and treat domain interoperability as a first-order concern.
Financial Times: So You Want To Be A… Business Analyst
A great read with practical recommendations if you are thinking of switching your career to business analysis. My favorite suggestion:
Examine websites or apps you regularly use, especially those with a customer journey or flow (e.g., buying something). Assess what works well or what doesn’t.
https://medium.com/ft-product-technology/so-you-want-to-be-a-business-analyst-fc28596411f5
Mark Grover: 3 Steps for a Successful Data Migration
We can measure the effectiveness of a team by the number of clean migration projects it has executed. I say "clean migration" because, as the author says:
In most migrations, getting 90% done isn't good enough, even getting 100% done is not good enough; you have to kill something old for a migration to be successful.
https://towardsdatascience.com/3-steps-for-a-successful-data-migration-9de8e7f1671c
Jupyter: Looking at notebooks from a new perspective
The blog covers two Jupyter rendering engines, nbconvert and Voilà, and highlights the Déjàvu utility, which sets new default values for several options to mimic Voilà's behavior of hiding input cells and prompt numbers.
https://blog.jupyter.org/looking-at-notebooks-from-a-new-perspective-bfd06797f188
PayPal: PayPal Introduces Dione, an Open-Source Spark Indexing Library
PayPal writes about the implementation of Dione, an open-source indexing library that creates a "shadow" table with the selected key values. An Avro B-Tree implementation further optimizes the index to support single-row fetch tasks.
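To make the indexing idea concrete, here is a minimal sketch in plain Python. It is not Dione's API and does not use Spark or Avro; the JSON-lines file, the `build_index` and `fetch_row` helpers, and the `id` key field are all hypothetical stand-ins. The index plays the role of the "shadow" table: it stores only each key and where its row lives, so a single row can be fetched with one seek instead of a full scan.

```python
import json
import os
import tempfile


def build_index(path, key_field):
    """Scan the data file once and record the byte offset of each row, by key."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            index[json.loads(line)[key_field]] = offset
    return index


def fetch_row(path, index, key):
    """Fetch a single row by seeking straight to its recorded offset."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())


def demo():
    """Write a tiny JSON-lines data file, index it by 'id', and fetch one row."""
    rows = [{"id": i, "value": f"row-{i}"} for i in range(5)]
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
        for row in rows:
            tmp.write(json.dumps(row) + "\n")
        path = tmp.name
    try:
        index = build_index(path, "id")
        return fetch_row(path, index, 3)
    finally:
        os.remove(path)
```

Calling `demo()` builds the index and returns the row whose `id` is 3 without reading the other rows at fetch time.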
Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue
Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.
https://rudderstack.com/blog/churn-prediction-with-bigqueryml
LinkedIn: Our approach to building transparent and explainable AI systems
LinkedIn writes about its approach to building explainable AI systems. The blog defines transparency in AI as:
AI system behavior and its related components are understandable, explainable, and interpretable.
The blog describes the implementation of a homegrown system called Intellige, a user-facing model explainer for narrative explanations.
Paper:
Intellige: A User-Facing Model Explainer for Narrative Explanations
https://engineering.linkedin.com/blog/2021/transparent-and-explainable-AI-systems
Netflix: Interpreting A/B test results - false positives and statistical significance
Netflix publishes the third part of its A/B testing series, which explains what an A/B test is and how Netflix decision-makers use A/B testing. This part covers interpreting A/B test results, focusing on false positives and statistical significance.
Previous Parts:
Part 1: Decision Making at Netflix
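To illustrate what a false positive means here, the sketch below simulates A/A tests, where both groups share the same true conversion rate, and counts how often a standard two-proportion z-test still reports significance. This is not Netflix's methodology; the function names, sample sizes, conversion rate, and the 0.05 threshold are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_tests=2000, n_users=1000, rate=0.1, alpha=0.05, seed=0):
    """Share of simulated A/A tests that are (wrongly) declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same conversion rate: no real effect.
        conv_a = sum(rng.random() < rate for _ in range(n_users))
        conv_b = sum(rng.random() < rate for _ in range(n_users))
        if two_proportion_p_value(conv_a, n_users, conv_b, n_users) < alpha:
            false_positives += 1
    return false_positives / n_tests
```

Because there is no real difference between the groups, every "significant" result in this simulation is a false positive, and the returned share should land close to the chosen `alpha`.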
Zalando: Space-efficient machine learning feature stores using probabilistic data structures - a benchmark
Zalando writes an exciting blog post discussing the use of probabilistic data structures for feature storage. In the benchmark, a traditional key-value store uses 15 GB of data, whereas the probabilistic data structure uses 470 MB for 1.762 billion data points.
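For readers new to probabilistic data structures, here is a minimal Bloom filter sketch. The summary above does not say which structure Zalando benchmarks, so the Bloom filter is only an assumed, common example; the `BloomFilter` class, its parameters, and the `user:item` key format are hypothetical. It shows the core trade: a fixed, small bit array answers membership queries with no false negatives but a small chance of false positives.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array answering 'possibly present' or 'definitely absent'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Store 1,000 hypothetical "user:item" feature keys in 16,000 bits (2,000 bytes).
features = BloomFilter(num_bits=16_000, num_hashes=7)
for user_id in range(1000):
    features.add(f"user-{user_id}:item-42")
```

The memory cost is set by `num_bits`, not by the size of the keys, which is where the space savings come from; the price is that a lookup for a key that was never added can occasionally return true.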
Databricks: Pandas API on Upcoming Apache Spark™ 3.2
Databricks announces that the pandas API will be part of the Apache Spark™ 3.2 release, merging Koalas into PySpark. The blog walks through the successive iterations of the pandas API, including 90% API compatibility coverage, more type hints, performance improvements, and stabilization.
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.