Data Engineering Weekly #78

Weekly Data Engineering Newsletter

Mar 14, 2022

Data Council - Austin 2022

Data Council has published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount with the promo code DataWeekly20.

https://www.datacouncil.ai/austin

George Ho: Data Collection is Hard. You Should Try It

Data analytics and data science are the last-mile solutions that amplify the value of data. As the author points out, collecting data is a unique opportunity to learn many of the staple technologies in data.

https://www.georgeho.org/data-collection-is-hard/

DeepMind: Predicting the past with Ithaca

If you're a history buff, this might excite you. DeepMind writes about Ithaca, the first deep neural network that can restore the missing text of damaged inscriptions, identify their original location, and help establish the date they were created.

https://deepmind.com/blog/article/Predicting-the-past-with-Ithaca

Nature Paper: Restoring and attributing ancient texts using deep neural networks

ithaca GitHub: https://github.com/deepmind/ithaca

Observable: Wordle, 15 Million Tweets Later

Wordle became my favorite family bonding activity, and I was thrilled to see the data analytics. As expected, four is the most common guess count, with frequency decreasing rapidly on either side. There is surely a fantastic product design principle hidden in that insight.

https://observablehq.com/@rlesser/wordle-twitter-exploration

Salesforce: Einstein Evaluation Store — Beyond Metrics for ML/AI Quality

Metrics always require (human) interpretation to be actionable. Tests are immediately actionable by automated processes.

Salesforce writes about why a metric-centric strategy won’t scale and about the paradigm shift to test-centric ML with the Evaluation Store.
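
To make the metric vs. test distinction concrete, here is a minimal sketch. It is not the Evaluation Store API; the names `accuracy_check`, `CheckResult`, and `can_promote`, as well as the thresholds, are hypothetical. The idea is only that a metric becomes a test once it is paired with an explicit pass/fail threshold, at which point an automated process can act on it.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """A metric: a number that still needs a human to judge whether it is good."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


@dataclass(frozen=True)
class CheckResult:
    """A test outcome: the metric value plus the threshold it was judged against."""
    name: str
    value: float
    threshold: float
    passed: bool


def accuracy_check(
    y_true: Sequence[int], y_pred: Sequence[int], threshold: float
) -> CheckResult:
    """A test: the same metric, with a hypothetical threshold that makes it pass/fail."""
    value = accuracy(y_true, y_pred)
    return CheckResult("accuracy", value, threshold, value >= threshold)


def can_promote(results: Iterable[CheckResult]) -> bool:
    """An automated gate (e.g. in CI) can act on tests without human interpretation."""
    return all(r.passed for r in results)


# Illustrative labels and predictions, not real data.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 0]
result = accuracy_check(labels, predictions, threshold=0.75)
print(result, can_promote([result]))
```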

https://engineering.salesforce.com/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality-4ec2f5504421

Uber: One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™

Historically, access control has been thought of as a separate layer on top of data storage. Table formats such as Iceberg, Delta Lake, and Hudi open up the possibility of implementing fine-grained access control that coexists with the storage layer. It's exciting to read about Uber's implementation of access control, encryption, and retention on top of Apache Parquet.

https://eng.uber.com/one-stone-three-birds-finer-grained-encryption-apache-parquet/

Pinterest: Addressing Python Dependency Confusion at Pinterest

A Dependency Confusion attack or supply chain substitution attack occurs when a software installer script is tricked into pulling a malicious code file from a public repository instead of the intended file of the same name from an internal repository.

How real is the threat? The following article describes how such an attack was carried out in practice.

Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies

Pinterest writes about how it approaches dependency confusion.
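
As a rough illustration of the risk (not Pinterest's approach), the sketch below flags internal package names that also exist on a public index, since those are the names an installer could resolve to the wrong source. The function names and the example package lists are hypothetical; the name normalization follows the PEP 503 rule, under which `acme_utils`, `Acme.Utils`, and `acme-utils` are all the same project name.

```python
import re
from typing import Iterable, List


def normalize(name: str) -> str:
    """Normalize a project name as PEP 503 describes: collapse runs of
    '-', '_' and '.' into a single '-' and lowercase the result."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_confusable(
    internal_names: Iterable[str], public_names: Iterable[str]
) -> List[str]:
    """Return normalized internal names that collide with a public name.

    `public_names` is passed in explicitly so the sketch runs offline; in
    practice it would come from querying the public index (an assumption).
    """
    public = {normalize(n) for n in public_names}
    internal = {normalize(n) for n in internal_names}
    return sorted(internal & public)


# Hypothetical package names for illustration only.
internal_packages = ["acme_utils", "Acme.Billing", "acme-ml"]
public_packages = ["acme-utils", "requests", "numpy"]
print(find_confusable(internal_packages, public_packages))
```

Any name this reports is a candidate for renaming, for reserving on the public index, or for pinning the installer to the internal index only.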

https://medium.com/pinterest-engineering/addressing-python-dependency-confusion-at-pinterest-e0a0609c8e9

Picnic: 7 Antifragile Principles for a Successful Data Warehouse

Antifragile principles offer a new perspective on system design, on top of resiliency and feedback loops. Picnic writes an exciting article about its antifragile principles for the data warehouse, and it may be a good starting point for a conversation in the data community about chaos engineering and antifragility. The seven principles are:

Sticking to simple rules
Avoiding naive interventions that do more harm than good in the long term
Built-in redundancy and layers (no single point of failure)
Ensuring that everyone has a stake
Experimenting and tinkering: taking lots of small risks
Keeping our options open
Not reinventing the wheel: looking for habits and rules that have been around for a long time

https://blog.picnic.nl/7-antifragile-principles-for-a-successful-data-warehouse-574b655f0bc6

Benchling: A Look at the Evolution of Benchling’s Search Architecture

Benchling writes an interesting article about the evolution of its search infrastructure, from Elasticsearch to Postgres and back to Elasticsearch. The article's theme is how hard it is to maintain a consistent view across two disjoint systems.

https://benchling.engineering/a-look-at-the-evolution-of-benchlings-search-architecture-c4d5327452c

Richard Startin: RangeBitmap - How range indexes work in Apache Pinot

One of the features I like most about Apache Pinot is the ability to pick and choose an indexing strategy for each column. The article is an excellent in-depth explanation of RangeBitmap and how range indexes in Apache Pinot work.
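
For readers new to bitmap-based range indexes, here is a simplified sketch of one well-known technique, a bit-sliced index, which answers range predicates by combining one bitmap per bit position. This is an illustration under my own assumptions, not Pinot's RangeBitmap implementation; the class and method names are hypothetical, and Python integers stand in for compressed bitmaps.

```python
from typing import List, Sequence


class BitSlicedIndex:
    """Toy bit-sliced index over non-negative integers.

    slices[b] is a bitmap (a Python int) whose bit r is set when the value
    in row r has bit b set.
    """

    def __init__(self, values: Sequence[int]) -> None:
        if any(v < 0 for v in values):
            raise ValueError("only non-negative integers are supported")
        self.num_rows = len(values)
        self.all_rows = (1 << self.num_rows) - 1
        self.num_bits = max(values, default=0).bit_length()
        self.slices = [0] * self.num_bits
        for row, value in enumerate(values):
            for bit in range(self.num_bits):
                if (value >> bit) & 1:
                    self.slices[bit] |= 1 << row

    def _less_or_equal(self, bound: int) -> int:
        """Bitmap of rows whose value is <= bound."""
        if bound < 0:
            return 0
        if bound.bit_length() > self.num_bits:
            return self.all_rows
        less, equal = 0, self.all_rows
        # Walk from the most significant bit down, tracking rows that are
        # already strictly smaller and rows still equal on the bits seen so far.
        for bit in reversed(range(self.num_bits)):
            if (bound >> bit) & 1:
                less |= equal & ~self.slices[bit]
                equal &= self.slices[bit]
            else:
                equal &= ~self.slices[bit]
        return less | equal

    def between(self, low: int, high: int) -> List[int]:
        """Row ids whose value v satisfies low <= v <= high."""
        matches = self._less_or_equal(high) & ~self._less_or_equal(low - 1)
        return [row for row in range(self.num_rows) if (matches >> row) & 1]


# Illustrative column values, not real data.
index = BitSlicedIndex([5, 12, 7, 0, 9, 3])
print(index.between(3, 9))
```

The query cost depends on the number of bit slices rather than on the number of distinct values, which is what makes this family of indexes attractive for high-cardinality numeric columns.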

https://richardstartin.github.io/posts/range-bitmap-index

Sarah Krasnik: No Code is the Future

The developer community is divided between low-code/no-code and code-only approaches. In the AEW (Analytical Engineering Weekly), Tristan made a significant point.

If we say that the only appropriate way to participate in certain activities is to do so via writing code, then we are inherently excluding the majority of humanity.

My thoughts on this:

We treat programming as elite work, but every job requires some level of automation. A farmer could automate their job with a no-code solution to predict crop growth and quality. Because we keep the barrier to entry for such systems high, the farmer has to rely on a middleman that we call a marketplace platform. Democratizing technology by removing the barrier to entry in programming is vital to maintaining balance in society. The advantage of the code-only approach is testability and repeatability; version control and code review are one way to achieve them.

https://sarahsnewsletter.substack.com/p/no-code-is-the-future

Talabat: Perspectives - Talabat’s Data Aggregation Framework

Talabat writes about Perspectives, its internal metric aggregation framework. Configuration as code for metrics generation is becoming popular (e.g., the dbt metrics layer). It will be an interesting case study to see at what point a declarative DSL like SQL becomes enough of a barrier to entry to push momentum toward configuration as code.
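
To show what configuration as code for metrics can look like, here is a minimal sketch in which a metric is declared as data and the SQL is generated from it. This is neither Talabat's Perspectives nor the dbt metrics format; `MetricConfig`, its fields, and the example table and column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Aggregations this toy renderer is willing to emit.
ALLOWED_AGGREGATIONS = {"sum", "count", "avg", "min", "max"}


@dataclass(frozen=True)
class MetricConfig:
    """A metric declared as configuration rather than hand-written SQL."""
    name: str
    table: str
    aggregation: str
    column: str
    dimensions: List[str] = field(default_factory=list)


def render_sql(metric: MetricConfig) -> str:
    """Render a metric definition into a SQL aggregation query.

    Identifiers are interpolated as-is, so this sketch assumes trusted
    configuration; a real framework would validate or quote them.
    """
    if metric.aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {metric.aggregation}")
    measure = f"{metric.aggregation.upper()}({metric.column}) AS {metric.name}"
    select_parts = list(metric.dimensions) + [measure]
    sql = f"SELECT {', '.join(select_parts)} FROM {metric.table}"
    if metric.dimensions:
        sql += f" GROUP BY {', '.join(metric.dimensions)}"
    return sql


# Hypothetical metric definition for illustration only.
total_order_value = MetricConfig(
    name="total_order_value",
    table="orders",
    aggregation="sum",
    column="order_value",
    dimensions=["country", "order_date"],
)
print(render_sql(total_order_value))
```

The appeal is that the person defining the metric only fills in a small, reviewable structure, while the query logic lives in one place.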

https://medium.com/talabat-tech/perspectivestalabats-data-aggregation-framework-c8fb3ba6d08

Shopify: 7 Tips For Optimizing Apache Flink Applications

Operating a streaming platform is always challenging, and Shopify shares some excellent tips for optimizing Apache Flink applications.

https://shopifyengineering.myshopify.com/blogs/engineering/optimizing-apache-flink-applications-tips

Red Hat: Which is better: A single Kafka cluster to rule them all, or many?

Red Hat writes an exciting article comparing a single Kafka cluster to rule them all with running multiple Kafka clusters. Multi-tenant vs. multi-instance is an exciting system design debate, and I always favor the multi-instance model in cloud environments.

https://developers.redhat.com/articles/2022/03/10/which-better-single-kafka-cluster-rule-them-all-or-many

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
