Data Engineering Weekly #133

The Weekly Data Engineering Newsletter

Jun 05, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

David Heinemeier Hansson: The luxury of working without metrics

“The metrics you choose as a measure of success inherently shape the essence of the company you aspire to build” - A Data Practitioner.

Sometimes, the metrics you don’t choose have the same impact. Do you need data and metrics at all? Should metrics define how you operate your business? An interesting take on operating without metrics.

https://world.hey.com/dhh/the-luxury-of-working-without-metrics-02e5dbac

lakeFS: How to Implement Write-Audit-Publish (WAP)

I wrote extensively about the WAP pattern in my latest article, An Engineering Guide to Data Quality - A Data Contract Perspective. I’m excited to see a complete guide on implementing the WAP pattern in Iceberg, Hudi, and, of course, lakeFS.
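For readers new to the pattern, here is a minimal in-memory sketch of the three WAP steps. It is not taken from the linked guide: the table names, the `audit` rules, and the dict-based "catalog" are all hypothetical stand-ins for what a table format's branches or snapshots would provide.

```python
# Minimal in-memory sketch of Write-Audit-Publish (WAP).
# Assumption: a plain dict stands in for the table catalog; real systems
# would stage data on a branch or snapshot instead.

def audit(rows):
    """Return a list of problems found in the staged rows (hypothetical rules)."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            problems.append(f"row {i}: missing order_id")
        amount = row.get("amount")
        if amount is None or amount < 0:
            problems.append(f"row {i}: missing or negative amount")
    return problems


def write_audit_publish(tables, name, new_rows):
    """Stage new rows, audit them, and publish only if the audit passes."""
    staging = f"{name}__staging"

    # Write: land the data where consumers cannot see it yet.
    tables[staging] = list(new_rows)

    # Audit: run quality checks against the staged data only.
    problems = audit(tables[staging])
    if problems:
        # Leave the staged rows in place for inspection; production is untouched.
        return False, problems

    # Publish: expose the audited rows to consumers.
    tables[name] = tables.get(name, []) + tables.pop(staging)
    return True, []
```

The point of the design is that consumers only ever read `tables[name]`, so a failed audit never leaks bad rows into production.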

https://lakefs.io/blog/how-to-implement-write-audit-publish/

Uber: Spark Analysers: Catching Anti-Patterns In Spark Apps

One of the challenges in commoditizing data processing engines like Spark is that they require expert users to understand and operate them. Uber writes about automating the detection of Spark anti-patterns to educate users and help reduce their processing costs.

https://www.uber.com/en-US/blog/spark-analysers-catching-anti-patterns-in-spark-apps/

Policy Genius: Data Warehouse Testing Strategies for Better Data Quality

Data Testing and Data Observability are widely discussed topics in Data Engineering Weekly. However, both techniques run their checks only after the transformation task is completed. Can we test SQL business logic during the development phase itself? Perhaps unit test the pipeline?

The author writes an exciting article about adopting unit testing in the data pipeline by producing sample tables during development. We will see more tools around the unit test framework for the data pipeline soon. I don’t think testing data quality on all the PRs against the production database is a cost-effective solution. We can do better than that, tbh.
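As a rough illustration of the idea (not the article's own framework), the sketch below runs a piece of SQL business logic against a tiny sample table in an in-memory SQLite database. The `orders` table, the query, and the helper names are hypothetical, and SQLite's dialect will differ from your warehouse's, so treat it as a development-time check rather than a replacement for warehouse tests.

```python
import sqlite3

# Hypothetical business logic under test: revenue per customer,
# counting completed orders only.
REVENUE_SQL = """
SELECT customer_id, SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY customer_id
"""


def run_on_sample(sql, sample_rows):
    """Load sample rows into an in-memory table and run the query on them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", sample_rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def test_revenue_ignores_cancelled_orders():
    sample = [
        (1, 10.0, "completed"),
        (1, 5.0, "cancelled"),   # must not count toward revenue
        (2, 7.5, "completed"),
    ]
    assert run_on_sample(REVENUE_SQL, sample) == [(1, 10.0), (2, 7.5)]
```

A test like this runs in milliseconds on every PR and never touches the production database.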

https://medium.com/policygenius-stories/data-warehouse-testing-strategies-for-better-data-quality-d5514f6a0dc9

Kaushik Muniandi: Text-Based Search - From Elastic Search to Vector Search

Over the last month or so, I have been experimenting with vector search using embeddings. I found this article interesting: it compares search efficiency among ElasticSearch, Azure Search, and custom vector search tables in DeltaLake and Apache Hudi.

https://www.linkedin.com/pulse/text-based-search-from-elastic-vector-kaushik-muniandi/

Jatin Solanki: Vector Database - Concepts and examples

Staying with vector search, a new class of vector databases is emerging to improve semantic search experiences. The author writes an excellent introduction to vector databases and their applications.
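At its core, the lookup a vector database performs is a nearest-neighbor search over embeddings. The sketch below shows the brute-force version with cosine similarity; the document ids and three-dimensional vectors are made up for illustration, and real systems use learned embeddings plus approximate indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k vectors most similar to the query (full scan)."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
index = {
    "doc_spark": [0.9, 0.1, 0.0],
    "doc_kafka": [0.1, 0.9, 0.0],
    "doc_privacy": [0.0, 0.1, 0.9],
}

nearest = top_k([0.8, 0.2, 0.0], index, k=2)
```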

https://blog.devgenius.io/vector-database-concepts-and-examples-f73d7e683d3e

KOHO: Handling Schema Evolution in the Data Pipelines at KOHO

Schema management at the data ingestion service and the DLQ (Dead Letter Queue) pattern are emerging as the standard architecture pattern in event processing. KOHO writes about its architecture to handle DLQ and schema management.
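A minimal sketch of the DLQ pattern at the ingestion layer, assuming JSON messages and a hand-written type map as the schema. This is not KOHO's implementation: `EXPECTED_SCHEMA`, `validate`, and `route` are hypothetical names, and a production pipeline would typically rely on a schema registry and a real queue.

```python
import json

# Hypothetical expected schema: field name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}


def validate(record, schema):
    """Return a list of schema violations for one parsed record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return errors


def route(raw_messages, schema):
    """Split messages into valid records and dead letters with their errors."""
    valid, dead_letters = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            dead_letters.append({"raw": raw, "errors": [f"invalid JSON: {exc.msg}"]})
            continue
        errors = validate(record, schema)
        if errors:
            # Keep the original payload so it can be replayed after a fix.
            dead_letters.append({"raw": raw, "errors": errors})
        else:
            valid.append(record)
    return valid, dead_letters
```

Keeping the raw payload next to the error list is what makes the dead letter queue useful: once the schema or the producer is fixed, the same messages can be replayed.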

https://koho.dev/handling-schema-evolution-in-the-data-pipelines-at-koho-314472111477

LlamaIndex: Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation

A query engine can leverage the expressivity of SQL over structured data and join it with unstructured context from a vector database - The future of data processing.

Much real-world data, from medical images to astro monitoring, is unstructured. The future we wish to live in is one where vector search over that data is combined with SQL to access structured information.
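To make the combination concrete, here is a small sketch in which SQL narrows the candidates using structured columns and a similarity ranking orders them using embeddings. It does not use LlamaIndex: the hand-written query stands in for what a text-to-SQL step would generate, and the `products` table and two-dimensional vectors are invented for illustration.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical embeddings of each product's free-text description, keyed by id.
DESCRIPTION_VECTORS = {1: [0.9, 0.1], 2: [0.2, 0.8], 3: [0.85, 0.2]}


def build_demo_db():
    """Create an in-memory table holding the structured side of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "trail shoe", 120.0), (2, "rain jacket", 80.0), (3, "running shoe", 60.0)],
    )
    return conn


def search(conn, max_price, query_vec, k=1):
    """Filter with SQL on structured columns, then rank by semantic similarity."""
    rows = conn.execute(
        "SELECT id, name FROM products WHERE price <= ?", (max_price,)
    ).fetchall()
    ranked = sorted(
        rows, key=lambda row: cosine(query_vec, DESCRIPTION_VECTORS[row[0]]), reverse=True
    )
    return [name for _, name in ranked[:k]]
```

With a query vector close to the "shoe" descriptions and a price cap of 100, the most similar product overall (the trail shoe) is excluded by the SQL filter, and the running shoe is returned instead.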

https://medium.com/llamaindex-blog/combining-text-to-sql-with-semantic-search-for-retrieval-augmented-generation-c60af30ec3b

Grab: PII masking for privacy-grade machine learning

Grab writes about the shift-left approach, where the data producer tags PII fields at the source, and the tags are carried to the downstream processing systems so they can act on them.
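A minimal sketch of producer-side tagging with downstream masking. It is not Grab's implementation: the schema layout, field names, and `mask_record` helper are hypothetical, and the unsalted hash is only a placeholder for whatever masking or tokenization scheme your privacy requirements call for.

```python
import hashlib

# Hypothetical producer-owned schema: each field carries a PII tag.
USER_EVENT_SCHEMA = {
    "user_id": {"pii": False},
    "email": {"pii": True},
    "phone": {"pii": True},
    "country": {"pii": False},
}


def mask_value(value):
    """Replace a value with a deterministic token (placeholder masking scheme)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return f"masked_{digest[:12]}"


def mask_record(record, schema):
    """Mask every field tagged as PII; untagged fields are masked too (fail closed)."""
    masked = {}
    for field, value in record.items():
        is_pii = schema.get(field, {"pii": True})["pii"]
        masked[field] = mask_value(value) if is_pii else value
    return masked
```

Because the tags travel with the schema, a downstream consumer can apply `mask_record` without knowing anything about the producer's data beyond the contract.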

I’m thrilled to see this approach reflecting the vision of Schemata with its Protobuf definition. The Schemata Open Contract definition has comprehensive coverage of PII & Data Classification types here:

https://github.com/ananthdurai/schemata/blob/main/src/opencontract/v1/org/schemata/protobuf/schemata.proto#L87

https://engineering.grab.com/pii-masking

Katharine Jarmul: Privacy Enhancing Technologies - An Introduction for Technologists

PII masking is one of many privacy-enhancing techniques. As a data practitioner, it is essential to understand privacy-enhancing technologies (PETs). The author writes a comprehensive guide to PETs.

https://martinfowler.com/articles/intro-pet.html

Gabe Araujo: Introducing PandasAI: The Generative AI Python Library

I’ve not tried PandasAI, but the promise of it looks exciting.

data = pd.read_csv('dataset.csv')

data_cleaned = pdai.impute_missing_values(data)

If data cleaning is as simple as calling a function, working with data will be a delight.

https://levelup.gitconnected.com/introducing-pandasai-the-generative-ai-python-library-568a971af014

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
