Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editor’s Note: DewCon.ai - October 12, Bengaluru, India Update
Hey folks! 📣 Exciting news! We've finalized the agenda for the conference and will launch it in the middle of this week. 🎤 And guess what? We've given our conference website a fresh look. 🌐
Tickets are selling fast; use the code DATAHERO for a special discount. 🎟️ Oh, and if your company's thinking of bulk booking, drop an email to ananth@dataengineeringweekly.com to get some awesome discounts. 📩
Looking forward to seeing you all! 👋🙂
Wes McKinney: The Road to Composable Data Systems - Thoughts on the Last 15 Years and the Future
On LinkedIn, you may frequently hear, "We did this 30 years ago," or "Don't reinvent the wheel." The reality is that software is an abstraction layer that interacts with hardware components like the CPU, network I/O, and memory. Any advancement in this hardware layer significantly influences software architecture, so we often see that "what goes around, comes around."
The author summarizes the last 15 years of data computation infrastructure and the emerging trend of composable data systems.
https://wesmckinney.com/blog/looking-back-15-years/
Meta: Scheduling Jupyter Notebooks at Meta
As a data engineer, I love working with notebooks, but the challenge always comes when running them through standard pipelining practices. Orchestration engines like Airflow and Dagster provide sophisticated pipeline-authoring tooling, which is often hard to integrate with notebooks. I often want to click a "Schedule this Notebook" button and have the Airflow code generated, scheduled, and committed to GitHub automatically. Drawing on a similar experience, Meta writes about its automation process for simplifying notebook scheduling.
https://engineering.fb.com/2023/08/29/security/scheduling-jupyter-notebooks-meta/
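Meta's internal tooling isn't open source, but you can approximate the "schedule this notebook" workflow with papermill inside an Airflow task. A minimal sketch follows; the DAG id, notebook paths, and parameters are hypothetical, not Meta's implementation.

```python
# Hypothetical sketch: schedule a parameterized notebook with Airflow + papermill.
# DAG id, paths, and parameters are made up for illustration.
from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_notebook(ds: str, **_):
    # Execute the notebook headlessly, injecting the run date as a parameter
    # and writing an executed copy for debugging and auditing.
    pm.execute_notebook(
        "/notebooks/daily_metrics.ipynb",
        f"/notebooks/runs/daily_metrics_{ds}.ipynb",
        parameters={"run_date": ds},
    )


with DAG(
    dag_id="daily_metrics_notebook",
    start_date=datetime(2023, 9, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_notebook", python_callable=run_notebook)
```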
Kyle Weller: Delta, Hudi, Iceberg — A Benchmark Compilation
The lakehouse architecture brings the best of the database and the data lake into the data infrastructure. Which one should you choose? Which one runs faster? Following a benchmark comparing Delta and Iceberg, Apache Hudi added its own benchmark to the same repo. It's exciting to see this benchmarking happen in a public GitHub repository, which brings more openness to the ecosystem.
I don't put much faith in benchmarks, since the answer is always "It depends on your workload," and the author rightly mentions this in the article.
One key thing to remember when running TPC-DS benchmarks comparing Hudi, Delta, and Iceberg is that, by default, Delta and Iceberg are optimized for append-only workloads, while Hudi is optimized for mutable workloads.
https://medium.com/@kywe665/delta-hudi-iceberg-a-benchmark-compilation-a5630c69cffc
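As a concrete illustration of that default, here is a hedged PySpark sketch of a Hudi write configured for a mutable (upsert) workload; the table name, key fields, and paths are hypothetical, and the option values should be checked against the Hudi docs for your version.

```python
# Hypothetical sketch of a Hudi upsert write in PySpark; table name, key/precombine
# fields, and paths are made up. Hudi's defaults favor mutable workloads like this
# upsert, while Delta and Iceberg writers default to append-style writes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/staging/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.operation": "upsert",            # mutate existing rows
    "hoodie.datasource.write.recordkey.field": "order_id",    # record key
    "hoodie.datasource.write.precombine.field": "updated_at"  # keep the latest version
}

(orders.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/orders/"))
```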
Sponsored: You're invited to IMPACT - The Data Observability Summit | November 8, 2023
Interested in learning how some of the best teams achieve data & AI reliability at scale? Learn from today's top data leaders and architects at The Data Observability Summit on how to build more trustworthy and reliable data & AI products with the latest technologies, processes, and strategies shaping our industry (yes, LLMs will be on the table).
Weibinzen: GraphAr - A Standard Data File Format for Graph Data Storage and Retrieval
Last week, I came across some interesting file formats built on top of existing columnar formats like Parquet. GraphAr is a standardized file format for graph data, independent of any computation or storage system, and it provides a set of interfaces for generating, accessing, and transforming these formatted files. It piqued my interest in an outstanding question of mine: can we model the data warehouse as a graph rather than a relational model?
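I haven't tried GraphAr's own libraries, so the sketch below only illustrates the underlying idea the format standardizes, vertex and edge tables persisted as columnar files, using plain pyarrow; the schema and file names are hypothetical.

```python
# Hypothetical sketch of the idea GraphAr standardizes: storing a graph as columnar
# vertex and edge tables. This uses plain pyarrow, not GraphAr's API.
import pyarrow as pa
import pyarrow.parquet as pq

# Vertex table: one row per node, with properties as columns.
vertices = pa.table({
    "id": [0, 1, 2],
    "label": ["user", "user", "order"],
    "name": ["alice", "bob", "order-42"],
})

# Edge table: one row per directed edge (an adjacency list in columnar form).
edges = pa.table({
    "src": [0, 1],
    "dst": [2, 2],
    "type": ["placed", "placed"],
})

pq.write_table(vertices, "vertices.parquet")
pq.write_table(edges, "edges.parquet")
```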
Cloud Native Geo: Performance Explorations of GeoParquet (and DuckDB)
GeoParquet is the second data format I came across this week. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet. The author explores the performance of GeoParquet with DuckDB. Combining WebAssembly, DuckDB, and the GeoParquet format enables exciting interactive web applications, like visualizing 1 million building footprints with 7 million total coordinates for the U.S. state of Utah and calculating their areas on the fly.
https://cloudnativegeo.org/blog/2023/08/performance-explorations-of-geoparquet-and-duckdb/
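Here is a hedged sketch of the kind of query the post describes, computing footprint areas straight from a GeoParquet file with DuckDB's spatial extension. The file and column names are hypothetical, and the geometry column is assumed to be WKB-encoded as the GeoParquet spec describes.

```python
# Hypothetical sketch: compute building footprint areas from a GeoParquet file with
# DuckDB's spatial extension. File and column names are made up; GeoParquet stores
# geometries as WKB, so we decode with ST_GeomFromWKB before calling ST_Area.
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

areas = con.execute("""
    SELECT id,
           ST_Area(ST_GeomFromWKB(geometry)) AS area
    FROM read_parquet('utah_buildings.parquet')
    ORDER BY area DESC
    LIMIT 10
""").fetchdf()

print(areas)
```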
Sponsored: How To Create a Customer 360
“Without stitching each unique identifier to the user, systems and teams will wrongly assume that these three transactions are from three distinct users. Worse yet, you are unable to calculate — or inaccurately calculate — important computed traits like user_total_revenue because of the fragmented user identities.”
Customer 360 is, at the end of the day, a data problem. Here, the team at RudderStack details how to solve identity resolution in the warehouse and build user features to create activation-ready customer profiles.
https://www.rudderstack.com/blog/how-to-create-a-customer-360/
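The article walks through RudderStack's own approach; as a generic illustration of the identity-stitching problem it describes, here is a hedged sketch that resolves every identifier observed together on an event to one canonical user via union-find. The identifiers are made up, and real identity graphs in a warehouse usually need an iterative, connected-components style pass over event tables rather than an in-memory structure.

```python
# Hypothetical sketch of identity stitching as connected components (union-find):
# identifiers observed together on the same event get merged into one canonical user.
class IdentityGraph:
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def link(self, *identifiers):
        # All identifiers seen on the same event belong to the same user.
        roots = [self._find(i) for i in identifiers]
        for r in roots[1:]:
            self.parent[r] = roots[0]

    def canonical(self, identifier):
        return self._find(identifier)


graph = IdentityGraph()
graph.link("anon:123", "email:ann@example.com")
graph.link("anon:456", "email:ann@example.com")   # same person, different device
graph.link("device:ios-789", "anon:456")

# All three transactions resolve to one user, so user_total_revenue aggregates correctly.
assert graph.canonical("anon:123") == graph.canonical("device:ios-789")
```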
Areca Data: The definitive guide to debugging dbt
Every system will fail in production with unknown errors. The author writes about a set of common dbt errors and how to debug them.
Related Note: If you're like me and get a kick out of system failures in production environments, you should definitely read the article "Operating Effectively in High Surprise Mode." It's a great resource for anyone who wants to learn how to deal with unexpected events calmly and collectedly.
https://www.arecadata.com/the-definitive-guide-for-debugging-dbt/
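When a run does fail, one quick debugging habit (not from the article, just a hedged sketch) is to pull the error messages out of dbt's run_results.json artifact. The path assumes the default target directory, and the artifact schema may vary between dbt versions.

```python
# Hypothetical sketch: list failed dbt nodes and their error messages from the
# run_results.json artifact written to the default target/ directory.
import json
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())

for node in results["results"]:
    if node["status"] in ("error", "fail"):
        print(f'{node["unique_id"]}: {node["message"]}')
```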
Adevinta: How we matured Fisher, our A/B testing Package
Experimentation is at the core of many online business operations, and we've seen many companies trying to democratize the experimentation platform. Along similar lines, Adevinta writes about Fisher, a Python package that enables data scientists to run straightforward hypothesis tests and produce comprehensive reports with very few lines of code.
https://medium.com/adevinta-tech-blog/how-we-matured-fisher-our-a-b-testing-package-110c99b993fc
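Fisher's API isn't shown here, so the sketch below only illustrates the kind of test such a package typically wraps: a two-sample proportion z-test on conversion counts using statsmodels. The numbers are made up.

```python
# Hypothetical sketch of the kind of test an A/B package like Fisher wraps:
# a two-sample proportion z-test on conversion counts. Numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [410, 465]      # converted users in control / treatment
exposures = [10_000, 10_000]  # users exposed to each variant

stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the variants' conversion rates differ.")
else:
    print("No significant difference detected.")
```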
Sponsored: Great Data Debate–The State of Data Mesh
Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?
Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.
Watch the Recording of the Great Data Debate →
Databricks: Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection for Large Language Models
The adoption of LLMs in enterprises is slowly increasing, and so are the techniques and infrastructure around fine-tuning them. Databricks writes about LoRA, an improved fine-tuning method where, instead of fine-tuning all the weights of the pre-trained large language model, two smaller matrices that approximate the update to the larger weight matrix are fine-tuned.
https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
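A minimal PyTorch sketch of the idea: freeze the pretrained weight W and train only a low-rank update B @ A scaled by alpha / r. The layer shapes and hyperparameters are illustrative, not Databricks' implementation.

```python
# Minimal LoRA sketch in PyTorch: the pretrained layer is frozen and only the
# low-rank factors A and B are trained; the effective weight is W + (alpha/r) * B @ A.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # Frozen path plus the trainable low-rank update.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # ~65K instead of ~16.8M
```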
Oana Olteanu: Fine-Tuning LLMs - learnings from the DeepLearning SF Meetup
In another exciting read on fine-tuning LLMs, the author shares lessons learned from a recent SF deep learning meetup. The author also highlights how LoRA (Low-Rank Adaptation of Large Language Models) can reach the same accuracy as end-to-end fine-tuning while being much cheaper to train, since you update only a small set of parameters.
AWS: Develop a Cost-Aware Culture for Optimized and Efficient SaaS Growth
The Instacart S-1 filing triggered some interesting conversations across many forums.
My take on this is that I admire what Instacart engineering has built on top of Snowflake. Anyone who has worked at a high-growth startup like Instacart can relate to the growth/cost-cutting cycles known as the pendulum effect. Organizations swing their operating principles toward growth at any cost one year and then swing the pendulum toward cost optimization, hoping the churn results in optimal resource usage efficiency.
However, the pendulum effect is not an excuse for inefficient system design. Engineers must be aware of the cost profile of their solutions; I learned that system design lesson during an incident mitigation process.
AWS writes an excellent article on the same topic, discussing how to develop a cost-aware engineering culture.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.