Data Engineering Weekly

Data Engineering Weekly #78

Weekly Data Engineering Newsletter

Ananth Packkildurai
Mar 14, 2022

Data Council - Austin 2022

Data Council has published the Austin 2022 schedule at the link below. Data Engineering Weekly readers can get a 20% discount with the promo code DataWeekly20.

https://www.datacouncil.ai/austin


George Ho: Data Collection is Hard. You Should Try It

Data analytics and data science are the last-mile solutions that amplify the value of data. As the author points out, collecting data is a unique opportunity to learn many of the staple technologies in data.

https://www.georgeho.org/data-collection-is-hard/


DeepMind: Predicting the past with Ithaca

If you're a history buff, this might excite you. DeepMind writes about Ithaca, the first deep neural network that can restore the missing text of damaged inscriptions, identify their original location, and help establish the date they were created.

https://deepmind.com/blog/article/Predicting-the-past-with-Ithaca

Nature Paper: Restoring and attributing ancient texts using deep neural networks

Ithaca GitHub: https://github.com/deepmind/ithaca


Observable: Wordle, 15 Million Tweets Later

Wordle became my favorite family bonding activity, and I was thrilled to see the data analytics. As expected, four is the most common guess count, with frequency decreasing rapidly on either side. There is some fantastic product design principle hidden in that insight for sure.

https://observablehq.com/@rlesser/wordle-twitter-exploration


Salesforce: Einstein Evaluation Store — Beyond Metrics for ML/AI Quality

Metrics always require (human) interpretation to be actionable. Tests are immediately actionable by automated processes.

Salesforce writes about why a metric-centric strategy won't scale and about the paradigm shift to test-centric ML with its Evaluation Store.
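To make the metric-versus-test distinction concrete, here is a minimal sketch. It is not the Evaluation Store's API; the `EvalTest` class, the `gate` function, the metric names, and the thresholds are all hypothetical. A metric is just a number someone has to interpret, while a test pairs that number with an explicit expectation that a pipeline can act on.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class EvalTest:
    """A metric paired with an explicit minimum (hypothetical structure)."""
    name: str
    metric: str
    minimum: float

    def passes(self, metrics: Mapping[str, float]) -> bool:
        # The test turns a raw number into a pass/fail decision.
        return metrics[self.metric] >= self.minimum


def gate(metrics: Mapping[str, float], tests: Sequence[EvalTest]) -> List[str]:
    """Return the names of failed tests; an empty list means 'safe to promote'."""
    return [t.name for t in tests if not t.passes(metrics)]


# Made-up evaluation results for a candidate model.
metrics = {"accuracy": 0.91, "recall_minority_class": 0.62}

tests = [
    EvalTest("overall accuracy", "accuracy", 0.90),
    EvalTest("minority recall", "recall_minority_class", 0.70),
]

failed = gate(metrics, tests)
print(failed)  # ['minority recall']
```

The overall accuracy looks healthy on a dashboard, yet the gate blocks the model without anyone having to read a chart, which is the kind of automation the quote above points at.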

https://engineering.salesforce.com/einstein-evaluation-store-beyond-metrics-for-ml-ai-quality-4ec2f5504421


Uber: One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™

Historically, access control has been treated as a separate layer on top of data storage. Table formats such as Iceberg, Delta Lake, and Hudi open up the possibility of implementing fine-grained access control that co-exists with the storage. It's exciting to read about Uber's implementation of access control, encryption, and retention on top of Apache Parquet.

https://eng.uber.com/one-stone-three-birds-finer-grained-encryption-apache-parquet/


Pinterest: Addressing Python Dependency Confusion at Pinterest

A Dependency Confusion attack or supply chain substitution attack occurs when a software installer script is tricked into pulling a malicious code file from a public repository instead of the intended file of the same name from an internal repository.

How real is the threat? The following article describes how such an attack was carried out.

Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies

Pinterest writes about how it approaches dependency confusion.
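The substitution described above can be sketched with a toy resolver. This is an illustration only, not Pinterest's mitigation and not any real installer's code: the package names, versions, index names, and both functions are made up. The naive resolver treats every index as equally trustworthy and picks the highest version, so a public package that squats on an internal name wins. The second resolver assumes a known list of internal names and only ever resolves those from the internal index.

```python
from typing import Dict, List, Set, Tuple

Index = Dict[str, List[str]]  # package name -> available versions


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def naive_resolve(name: str, indexes: Dict[str, Index]) -> Tuple[str, str]:
    """Pick the highest version found on ANY index, ignoring where it came from."""
    candidates = [
        (parse_version(version), index_name)
        for index_name, packages in indexes.items()
        for version in packages.get(name, [])
    ]
    version, source = max(candidates)
    return source, ".".join(str(part) for part in version)


def pinned_resolve(
    name: str, indexes: Dict[str, Index], internal_names: Set[str]
) -> Tuple[str, str]:
    """Resolve known internal names from the internal index only."""
    if name in internal_names:
        indexes = {"internal": indexes["internal"]}
    return naive_resolve(name, indexes)


# Hypothetical indexes: an attacker has published 'acme-billing' 99.0.0 publicly.
indexes = {
    "internal": {"acme-billing": ["1.2.0", "1.3.0"]},
    "public": {"acme-billing": ["99.0.0"], "requests": ["2.27.1"]},
}

print(naive_resolve("acme-billing", indexes))                     # ('public', '99.0.0')
print(pinned_resolve("acme-billing", indexes, {"acme-billing"}))  # ('internal', '1.3.0')
```

The attacker never touches the internal repository; publishing a higher version number under the same name is enough to win the naive comparison.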

https://medium.com/pinterest-engineering/addressing-python-dependency-confusion-at-pinterest-e0a0609c8e9


Sponsored: Making Data Engineering Easier: Operational Analytics With Event Streaming and Reverse ETL

When it comes to Reverse ETL, business use cases typically get all the attention. Here, RudderStack focuses on how reverse ETL makes data engineering easier. They drive the point home with an example from their own data engineering team that involved using the Google Click ID (gclid) to get enriched conversions into Google Ads.

https://www.rudderstack.com/blog/making-data-engineering-easier-operational-analytics-with-event-streaming-and-reverse-etl


Picnic: 7 Antifragile Principles for a Successful Data Warehouse

Antifragile principles bring a new perspective to system design, on top of resiliency and feedback loops. Picnic writes an exciting article about the antifragile principles behind its data warehouse, and it may be a good starting point for a conversation in the data community about chaos engineering and antifragility. The seven principles are:

  1. Sticking to simple rules

  2. Avoiding naive interventions that do more harm than good in the long term

  3. Built-in redundancy and layers (no single point of failure)

  4. Ensuring that everyone has a stake

  5. Experimenting and tinkering — taking lots of small risks.

  6. Keeping our options open

  7. Not reinventing the wheel — looking for habits and rules that have been around for a long time.

https://blog.picnic.nl/7-antifragile-principles-for-a-successful-data-warehouse-574b655f0bc6


Benchling: A Look at the Evolution of Benchling’s Search Architecture

Benchling writes an interesting article about the journey of its search infrastructure from ElasticSearch to Postgres and back to ElasticSearch. The article's theme is how hard it is to maintain a consistent view across two disjoint infrastructures.

https://benchling.engineering/a-look-at-the-evolution-of-benchlings-search-architecture-c4d5327452c


Richard Startin: RangeBitmap - How range indexes work in Apache Pinot

One of the features I like about Apache Pinot is the ability to pick and choose an indexing strategy for each column. The article is an excellent in-depth explanation of RangeBitmap and how range indexes in Apache Pinot work.

https://richardstartin.github.io/posts/range-bitmap-index


Sarah Krasnik: No Code is the Future

The developer community is divided between the low-code/no-code and code-only approaches. In the AEW (Analytical Engineering Weekly), Tristan made a significant point.

If we say that the only appropriate way to participate in certain activities is to do so via writing code, then we are inherently excluding the majority of humanity.

My thought on this,

We treat programming as elite work, but every job requires some level of automation. A farmer could automate their job with a no-code solution to predict crop growth and quality. Because we keep the barrier to entry to such systems high, the farmer has to rely on a middleman that we call a marketplace platform. Democratizing technology by removing the barrier to entry in programming is vital to maintaining balance in society. The advantages of the code-only approach are testability and repeatability; version control and code review are one way to achieve them.

https://sarahsnewsletter.substack.com/p/no-code-is-the-future


Sponsored: Rudderstack - The Data Stack Show Live: Is Reverse ETL Just Another Data Pipeline?

You’ve heard about Reverse ETL. Here’s your chance to learn all about the tooling from the folks who are creating it. Join Hosts Eric and Kostas for a live recording of The Data Stack Show on March 9th to get insights from experts at Census, Hightouch, and Workato.

https://datastackshow.com/livestream-registration-reverse-etl/


Talabat: Perspectives- Talabat’s Data Aggregation Framework

Talabat writes about Perspectives, its internal metric aggregation framework. Configuration as code for metrics generation is becoming popular (e.g., the dbt metrics layer). It will be a curious case study to see at what point a declarative DSL like SQL becomes enough of a barrier to entry that it drives momentum toward configuration as code.
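For readers who haven't seen the pattern, here is a rough sketch of what configuration as code for metrics can look like. It is not how Perspectives or the dbt metrics layer works; the `METRICS` schema, the `render_sql` function, and the table and column names are hypothetical. The metric author fills in a small declarative spec, and the framework owns the SQL generation. The sketch assumes every metric has at least one dimension.

```python
# Hypothetical metric definitions: the only thing a metric author edits.
METRICS = {
    "completed_orders": {
        "table": "orders",
        "aggregation": "count",
        "column": "order_id",
        "filters": ["status = 'completed'"],
        "dimensions": ["country", "order_date"],
    },
}

# Whitelist of supported aggregations and their SQL functions.
AGGREGATIONS = {"count": "COUNT", "sum": "SUM", "avg": "AVG"}


def render_sql(name: str, metrics: dict = METRICS) -> str:
    """Compile one metric definition into a single aggregate query."""
    spec = metrics[name]
    func = AGGREGATIONS[spec["aggregation"]]  # KeyError on unsupported aggregations
    dims = ", ".join(spec["dimensions"])
    sql = f"SELECT {dims}, {func}({spec['column']}) AS {name} FROM {spec['table']}"
    if spec.get("filters"):
        sql += " WHERE " + " AND ".join(spec["filters"])
    sql += f" GROUP BY {dims}"
    return sql


print(render_sql("completed_orders"))
```

The trade-off is visible even at this size: the config is easier to review and validate than free-form SQL, but anything the schema doesn't anticipate (joins, window functions) needs a framework change.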

https://medium.com/talabat-tech/perspectivestalabats-data-aggregation-framework-c8fb3ba6d08


Shopify: 7 Tips For Optimizing Apache Flink Applications

Operating a streaming platform is always challenging, and Shopify shares some excellent tips for optimizing Apache Flink applications.

https://shopifyengineering.myshopify.com/blogs/engineering/optimizing-apache-flink-applications-tips


Red Hat: Which is better: A single Kafka cluster to rule them all, or many?

Red Hat writes an exciting article comparing one Kafka cluster to rule them all with running multiple Kafka clusters. Multi-tenant vs. multi-instance is an exciting system design debate, and I'm always in favor of the multi-instance model in a cloud environment.

https://developers.redhat.com/articles/2022/03/10/which-better-single-kafka-cluster-rule-them-all-or-many


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
