Data Engineering Weekly #193

The Weekly Data Engineering Newsletter

Oct 14, 2024

Wix: The Emerging Economy of LLMs

The LLM economy, also known as the token economy, is emerging as agent-driven workflow tools spread across the industry. The author explains why tokens are the new currency in the LLM economy.

I recently took a survey from a productivity tool I pay for, asking whether I would be willing to pay double the subscription cost to use the LLM feature. I was like, hell no. From a consumer perspective, I want to pay the same but expect a much better experience. I’m still on the fence about the LLM economy, but I’m optimistic that the tools will become LLM-driven.

https://medium.com/wix-engineering/the-emerging-economy-of-llms-883f2ab13067

Apple: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Can LLMs develop mathematical reasoning capabilities? The paper from Apple evaluated the current leading LLMs and says no, they can’t. The key reasons are:

LLMs rely on probabilistic pattern matching; instead of understanding the underlying mathematical concepts, they might simply replicate patterns observed in their training data.
Small changes in input tokens can significantly alter model outputs, revealing token bias and fragility.
LLM performance deteriorates with increased complexity.
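
To make the fragility point concrete, here is a minimal sketch of template-based question variants, in the spirit of changing only names and numbers in a word problem. The template, names, and value ranges below are hypothetical and are not taken from the paper's benchmark; the idea is only that every variant needs the same reasoning, so a model that truly reasons should score the same on all of them.

```python
import random

# Hypothetical template and name list, for illustration only.
TEMPLATE = (
    "{name} has {x} apples and buys {y} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Mei", "Omar"]


def make_variant(rng):
    """Fill the template with a random name and numbers; return (question, answer)."""
    name = rng.choice(NAMES)
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y  # ground truth follows directly from the template


def make_variants(n, seed=0):
    """Generate n variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```

Scoring a model on many such variants, rather than on one fixed wording, is one way to see whether accuracy shifts when only surface tokens change.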

https://machinelearning.apple.com/research/gsm-symbolic

Joe Reis: Field Notes, Early Fall 2024 Edition

Joe Reis provides a great overview of what is happening in the data industry. LLMs are entering the PoC phase; people are still confused but are upskilling. A key highlight for me is 👇🏼

Data’s still a mess. Most data initiatives fail. Data teams are seen as a cost center and not getting the support they deserve. Same as it ever was.

Uber: Genie - Uber’s Gen AI On-Call Copilot

Internal support is a disruptive but essential part of building a successful platform. Uber writes about Genie, a Gen AI on-call copilot. Genie eases the on-call burden by providing quick and accurate answers to questions, retrieving relevant information from internal knowledge bases, and reducing the need for constant interaction with on-call engineers.

https://www.uber.com/blog/genie-ubers-gen-ai-on-call-copilot/

Grab: Leveraging RAG-powered LLMs for Analytical Tasks

Grab writes about Data-Arks, an internal platform that houses frequently used SQL queries and Python functions. Data-Arks serves as a vital component in integrating Large Language Models (LLMs) into the analytics workflow, streamlining processes like generating regular metric reports and conducting fraud investigations.

https://engineering.grab.com/transforming-the-analytics-landscape-with-RAG-powered-LLM

Jack Vanlightly: Table format comparisons - Change queries and CDC

Incremental data processing is vital for an efficient and cost-effective data infrastructure, and it depends on change queries that return what changed in a table between two points in time. The author categorizes change queries into four types: append-only, upsert, min-delta (CDC), and full-delta (CDC). The article explores how each table format handles these queries, analyzing their strengths and limitations.
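
As a rough illustration of how the four query types differ, here is a minimal in-memory sketch. It reflects my reading of the categories, not the article's definitions or any table format's API; the change-log shape and the function names are hypothetical.

```python
from typing import Optional

# Hypothetical change log entry: (operation, key, value), in commit order.
Change = tuple[str, str, Optional[int]]


def append_only(log: list[Change]) -> list[tuple[str, Optional[int]]]:
    """Return only newly inserted rows; updates and deletes are not reported."""
    return [(key, value) for op, key, value in log if op == "insert"]


def upsert(log: list[Change]) -> dict[str, Optional[int]]:
    """Return the latest value per key for inserts and updates; deletes are not reported."""
    latest = {}
    for op, key, value in log:
        if op in ("insert", "update"):
            latest[key] = value
    return latest


def min_delta(before: dict[str, int], log: list[Change]) -> dict[str, tuple]:
    """Return the net (old, new) change per key, dropping intermediate steps."""
    after = dict(before)
    for op, key, value in log:
        if op == "delete":
            after.pop(key, None)
        else:
            after[key] = value
    return {
        key: (before.get(key), after.get(key))
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }


def full_delta(before: dict[str, int], log: list[Change]) -> list[tuple]:
    """Return every change with its before and after image, in order."""
    state = dict(before)
    changes = []
    for op, key, value in log:
        old = state.get(key)
        if op == "delete":
            state.pop(key, None)
            changes.append((op, key, old, None))
        else:
            state[key] = value
            changes.append((op, key, old, value))
    return changes
```

In this sketch, a row that is inserted and then deleted within the window disappears from the min-delta result but still shows up in the full-delta result, which is the kind of trade-off the comparison is about.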

https://jack-vanlightly.com/blog/2024/9/19/table-format-comparisons-change-queries-and-cdc

Lak Lakshmanan: What goes into bronze, silver, and gold layers of a medallion data architecture?

If I understand correctly, the gist of the article is where you position the common data model and metrics that can be used across the organization. I think these layers are a guiding principle rather than a strict framework. The common data models are considered the “core” domain, which is itself a kind of data mart. The article is a good reminder to focus on the “sharable core domain” in data modeling, regardless of whether or not you extend the medallion architecture.

https://lakshmanok.medium.com/what-goes-into-bronze-silver-and-gold-layers-of-a-medallion-data-architecture-4b6fdfb405fc

Expedia: Enhancing Data Reliability With An SLO Platform

Expedia Group Technology designed a new SLO platform to enhance data reliability, leveraging Kafka for event streaming, PostgreSQL for data storage, and APIs for querying. The platform efficiently ingests and enriches data from multiple sources with internal metadata, providing near real-time access and seamless integration with DataDog for proactive monitoring and real-time alerting.
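
For readers new to data SLOs, here is a minimal sketch of what an SLO check over ingested events might compute. It assumes a freshness-style objective (delivery lag within a threshold) and hypothetical function names; it is not Expedia's implementation, and the blog's actual SLO definitions may differ.

```python
from datetime import datetime, timedelta


def freshness_attainment(deliveries, threshold):
    """Fraction of deliveries whose lag (arrived - produced) is within threshold.

    deliveries: list of (produced_at, arrived_at) datetime pairs.
    Returns None when there is nothing to measure.
    """
    if not deliveries:
        return None
    on_time = sum(
        1 for produced, arrived in deliveries if arrived - produced <= threshold
    )
    return on_time / len(deliveries)


def slo_met(deliveries, threshold, target):
    """True when the measured attainment reaches the target ratio."""
    attainment = freshness_attainment(deliveries, threshold)
    return attainment is not None and attainment >= target


if __name__ == "__main__":
    start = datetime(2024, 10, 14)
    sample = [
        (start, start + timedelta(minutes=5)),
        (start, start + timedelta(minutes=20)),
    ]
    print(slo_met(sample, timedelta(minutes=10), target=0.9))
```

A monitoring system could alert when `slo_met` turns false for a dataset, which is the proactive use the summary describes.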

https://medium.com/expedia-group-tech/enhancing-data-reliability-with-an-slo-platform-de00249756f6

GumGum: Boosting Batch Scoring Efficiency with BigQuery ML and ONNX

GumGum’s data engineering team optimized batch scoring by integrating BigQuery ML with ONNX, streamlining a previously complex workflow. Moving scoring directly into BigQuery eliminated the need for external Python-based containers, reducing both time and costs. This solution leverages Scikit-Learn models in ONNX format, allowing efficient, SQL-based batch scoring directly in BigQuery, significantly improving scoring performance on large datasets.

https://medium.com/gumgum-tech/boosting-batch-scoring-efficiency-with-bigquery-ml-and-onnx-85a114265c35

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.