Data Engineering Weekly #78
Weekly Data Engineering Newsletter
Data Council - Austin 2022
Data Council published the Austin 2022 schedule here. Data Engineering Weekly readers can get a 20% discount using promo code
George Ho: Data Collection is Hard. You Should Try It
Data analytics and data science are the last-mile solutions that amplify the value of data. As the author points out, collecting data is a unique opportunity to learn many staple technologies in data.
DeepMind: Predicting the past with Ithaca
If you're a history buff, this might excite you. DeepMind writes about Ithaca, the first deep neural network that can restore the missing text of damaged inscriptions, identify their original location, and help establish the date they were created.
Restoring and attributing ancient texts using deep neural networks
Observable: Wordle, 15 Million Tweets Later
Wordle became my favorite family bonding activity, and I was thrilled to see the data analytics. As expected, four is the most common guess count, with frequency decreasing rapidly on either side. There is surely a fantastic product design principle hidden in that insight.
Salesforce: Einstein Evaluation Store — Beyond Metrics for ML/AI Quality
Metrics always require (human) interpretation to be actionable. Tests are immediately actionable by automated processes.
Salesforce writes about why a metric-centric strategy won’t scale and about the paradigm shift to test-centric ML with the Evaluation Store.
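To make the distinction concrete, here is a minimal Python sketch. It is not Salesforce's implementation: the function names, the sample labels, and the 0.9 threshold are all hypothetical. A metric returns a number that someone still has to interpret, while a test pairs that metric with an explicit threshold so an automated process can act on the pass/fail result.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """A metric: returns a number that still needs human interpretation."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def accuracy_test(
    y_true: Sequence[int], y_pred: Sequence[int], threshold: float = 0.9
) -> bool:
    """A test: the metric plus a threshold (hypothetical value).

    The boolean result can gate a deployment without a human in the loop.
    """
    return accuracy(y_true, y_pred) >= threshold


# Hypothetical evaluation data: one of the ten predictions is wrong.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predictions = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

score = accuracy(labels, predictions)
passed = accuracy_test(labels, predictions)
print(f"accuracy={score:.2f} passed={passed}")
```

The point of the sketch is only that the threshold lives next to the metric, so the decision is encoded once instead of being re-made by whoever reads the dashboard.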
Uber: One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™
Historically, access control has been thought of as a separate layer on top of data storage. Table formats such as Iceberg, Delta Lake, and Hudi open up the possibility of implementing fine-grained access control that co-exists with the storage. It's exciting to read about Uber's implementation of access control, encryption, and retention on top of Apache Parquet.
Pinterest: Addressing Python Dependency Confusion at Pinterest
A dependency confusion attack, or supply chain substitution attack, occurs when a software installer script is tricked into pulling a malicious code file from a public repository instead of the intended file of the same name from an internal repository.
How real is the threat? Here is an article that describes how the attack was carried out in practice.
Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies
Pinterest writes about how it approaches dependency confusion.
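The mechanics of the attack can be sketched with a toy resolver in Python. This is a simplified model, not pip's actual resolver and not Pinterest's mitigation: the package names, versions, and both functions are hypothetical. It assumes a resolver that merges candidates from an internal and a public index and picks the highest version, which is the situation the attack exploits.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical index contents. "acme-utils" is an internal package name
# that an attacker has also published publicly with a very high version.
INTERNAL_INDEX: Dict[str, List[str]] = {"acme-utils": ["1.2.0", "1.3.0"]}
PUBLIC_INDEX: Dict[str, List[str]] = {
    "acme-utils": ["99.0.0"],
    "requests": ["2.27.1"],
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_merged(name: str) -> Tuple[str, str]:
    """Vulnerable behavior: pool both indexes and take the highest version."""
    candidates = [("internal", v) for v in INTERNAL_INDEX.get(name, [])]
    candidates += [("public", v) for v in PUBLIC_INDEX.get(name, [])]
    if not candidates:
        raise LookupError(f"no candidates for {name}")
    return max(candidates, key=lambda c: parse_version(c[1]))


def resolve_internal_first(name: str, internal_names: Set[str]) -> Tuple[str, str]:
    """One possible mitigation: internal names never consult the public index."""
    index_name, index = (
        ("internal", INTERNAL_INDEX) if name in internal_names else ("public", PUBLIC_INDEX)
    )
    versions = index.get(name, [])
    if not versions:
        raise LookupError(f"no candidates for {name} in {index_name} index")
    return index_name, max(versions, key=parse_version)


print(resolve_merged("acme-utils"))
print(resolve_internal_first("acme-utils", {"acme-utils"}))
```

With the merged lookup the attacker's public release wins purely because its version number is higher; routing known internal names to the internal index alone removes that choice.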
Sponsored: Making Data Engineering Easier: Operational Analytics With Event Streaming and Reverse ETL
When it comes to Reverse ETL, business use cases typically get all the attention. Here, RudderStack focuses on how reverse ETL makes data engineering easier. They drive the point home with an example from their own data engineering team that involved using the Google Click ID (gclid) to get enriched conversions into Google Ads.
Picnic: 7 Antifragile Principles for a Successful Data Warehouse
Antifragile principles bring a new perspective to system design on top of resiliency and feedback loops. The article may be a good starting point for a conversation in the data community about chaos engineering and antifragility. Picnic writes an exciting article about its antifragile principles for the data warehouse. The seven principles are:
Sticking to simple rules
Avoiding naive interventions that do more harm than good in the long term
Built-in redundancy and layers (no single point of failure)
Ensuring that everyone has a stake
Experimenting and tinkering — taking lots of small risks.
Keeping our options open
Not reinventing the wheel — looking for habits and rules that have been around for a long time.
Benchling: A Look at the Evolution of Benchling’s Search Architecture
Benchling writes an interesting article about the evolution of its search infrastructure, from Elasticsearch to Postgres and back to Elasticsearch. The article's theme is that maintaining a consistent view across two disjoint systems is painful.
Richard Startin: RangeBitmap - How range indexes work in Apache Pinot
One of the features I like most about Apache Pinot is the ability to pick and choose an indexing strategy for each column. The article is an excellent in-depth explanation of RangeBitmap and how the range indexes in Apache Pinot work.
Sarah Krasnik: No Code is the Future
The developer community is divided between the low-code/no-code and code-only approaches. In the AEW (Analytical Engineering Weekly), Tristan made a significant point.
If we say that the only appropriate way to participate in certain activities is to do so via writing code, then we are inherently excluding the majority of humanity.
My thoughts on this:
We treat programming as elite work, but every job requires some level of automation. A farmer could automate part of their job with a no-code solution that predicts crop growth and quality. Because we keep the barrier to entry for such systems high, the farmer has to rely on a middleman that we call a marketplace platform. Democratizing technology by removing the barrier to entry in programming is vital to maintaining balance in society. The advantages of the code-only approach are testability and repeatability; version control and code review are one way to achieve them.
Sponsored: Rudderstack - The Data Stack Show Live: Is Reverse ETL Just Another Data Pipeline?
You’ve heard about Reverse ETL. Here’s your chance to learn all about the tooling from the folks who are creating it. Join Hosts Eric and Kostas for a live recording of The Data Stack Show on March 9th to get insights from experts at Census, Hightouch, and Workato.
Talabat: Perspectives- Talabat’s Data Aggregation Framework
Talabat writes about Perspectives, its internal metric aggregation framework. Configuration as code for metrics generation is becoming popular (e.g., the dbt metrics layer). It will be a curious case study to see when a declarative DSL like SQL becomes enough of a barrier to entry to push momentum toward configuration as code.
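As a rough illustration of the configuration-as-code idea, the Python sketch below renders SQL from a metric definition. It follows neither Talabat's Perspectives format nor the dbt metrics syntax: the config keys, the table and column names, and the `render_metric_sql` helper are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical metric definition; in practice this would live in a
# version-controlled config file rather than inline.
metric: Dict[str, Any] = {
    "name": "total_order_value",
    "table": "orders",
    "aggregation": "sum",
    "column": "order_value",
    "dimensions": ["country", "order_date"],
}

ALLOWED_AGGREGATIONS = {"sum", "avg", "count", "min", "max"}


def render_metric_sql(config: Dict[str, Any]) -> str:
    """Render a simple aggregation query from a metric config."""
    aggregation = config["aggregation"].lower()
    if aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation}")

    dimensions = ", ".join(config.get("dimensions", []))
    select_prefix = f"{dimensions}, " if dimensions else ""
    sql = (
        f"SELECT {select_prefix}{aggregation.upper()}({config['column']}) "
        f"AS {config['name']} FROM {config['table']}"
    )
    # Only group when the metric is sliced by at least one dimension.
    if dimensions:
        sql += f" GROUP BY {dimensions}"
    return sql


generated_sql = render_metric_sql(metric)
print(generated_sql)
```

The metric author only edits the config; the SQL is generated, which is what lowers the barrier to entry while keeping the definitions reviewable in version control.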
Shopify: 7 Tips For Optimizing Apache Flink Applications
Operating a streaming platform is always challenging, and Shopify shares some excellent tips for optimizing Apache Flink applications.
Redhat: Which is better: A single Kafka cluster to rule them all, or many?
Red Hat writes an exciting article comparing a single Kafka cluster to rule them all with running multiple Kafka clusters. Multi-tenant vs. multi-instance is an exciting system design debate, and I always favor the multi-instance model in cloud environments.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.