Data Engineering Weekly #118

The Weekly Data Engineering Newsletter

Feb 13, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: Launching “Behind the Tech” Series & Data Council 2023

Data Engineering Weekly strives to bring our readers the best curation every week. As a Data Engineer, I always want to know what goes on behind the scenes of a product and how we can use system design to solve a business problem efficiently. We are opening the Behind the Scene series for startup founders to write in-depth technical articles covering system design and efficient product usage.

To begin, we are partnering with Pathway.com to launch a three-part series about unlocking stream processing, in which Pathway talks about applying linear regression & classification in real time. Stay tuned for more.

If you wish to write for the Behind the Scene series, write to us at ananth@dataengineeringweekly.com. You can also submit a founder's story or article suggestions.

Data Council - Austin 2023 Discount Code

Data Council - Austin 2023 is approaching, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.

Link to Register: https://www.datacouncil.ai/austin

Promo Code: DataWeekly20

MotherDuck: Big Data is Dead

All large data sets are generated over time. Time is almost always an axis in a data set. New orders come in every day. New taxi rides. New logging records. New games are being played. But compute needs will likely not change much over time; most analysis is done over recent data.

There is a lot of truth in this statement. Processing historical data is a rare event; 99% of the computing happens over the last 24 hours of data. Big Data may well be dead, but we can’t deny that this is a result of collective advancement in data processing techniques.

https://motherduck.com/blog/big-data-is-dead/

Dropbox: Balancing quality and coverage with our data validation framework

Data Testing should be part of the data creation lifecycle; it is not a standalone process. I believe the current data testing platforms can’t support the complex nature of data testing.

The Dropbox data team highlights the same problem. It describes how an extended Airflow operator that adopts the Write-Audit-Publish pattern with SQL helps to standardize the data testing strategy.

https://dropbox.tech/infrastructure/balancing-quality-and-coverage-with-our-data-validation-framework
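For readers new to the Write-Audit-Publish pattern mentioned above, here is a minimal sketch of the idea. It is not Dropbox's implementation and does not use Airflow; it uses an in-memory SQLite database, and the table names (`orders_staging`, `orders`), the `AUDITS` list, and the `write_audit_publish` helper are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical audits: each SQL query returns the number of violating rows.
AUDITS = [
    ("null_order_id", "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL"),
    ("negative_amount", "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"),
]


def write_audit_publish(conn, rows, audits=AUDITS):
    """Stage a batch, run SQL audits on it, and publish only if every audit passes.

    Returns the list of failed audit names (empty when the batch was published).
    """
    cur = conn.cursor()

    # Write: land the batch in a staging table that consumers never read.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: collect the name of every check that found violating rows.
    failures = []
    for name, sql in audits:
        violations = cur.execute(sql).fetchone()[0]
        if violations > 0:
            failures.append(name)

    if failures:
        # Keep the staged batch for debugging, but do not publish it.
        conn.commit()
        return failures

    # Publish: copy the audited batch into the table consumers query.
    cur.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    cur.execute("INSERT INTO orders SELECT order_id, amount FROM orders_staging")
    conn.commit()
    return []


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(write_audit_publish(conn, [(1, 10.0), (2, 5.5)]))  # clean batch
    print(write_audit_publish(conn, [(3, -1.0)]))            # batch with a bad row
```

The point of the pattern is that bad data never reaches the published table: a failed audit stops the batch at the staging step, where it can still be inspected.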

Mixpanel: Tracking events at millisecond granularity

My first reaction on reading “Historically, Mixpanel used to track events at second-level granularity” was: wait, what? No system is perfect. I admire the Mixpanel team for discussing the complexity of changing the timestamp granularity and the system design behind it.

https://engineering.mixpanel.com/tracking-events-at-milli-second-granularity-7d1fc7f29e31

Shopify: The Complex Data Models Behind Shopify's Tax Insights Feature

The blog comes at the right time, as the data community frequently talks about the lost art of Data Modeling. Shopify shares its experience designing the Tax Insights feature, the business complexity involved, and the lessons learned.

https://shopifyengineering.myshopify.com/blogs/engineering/complex-data-models-behind-shopify-tax-insights

Picnic: Deploying Data Pipelines using the Saga pattern

An interesting take on the pipeline orchestration engine as a Saga pattern implementation. Picnic writes about how it automates pipeline deployment. The blog definitely piqued my curiosity to think more about the idea.

https://blog.picnic.nl/deploying-data-pipelines-using-the-saga-pattern-ffc1cbe29cee
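As a quick refresher on the Saga pattern itself, here is a minimal sketch, assuming each deployment step is paired with a compensating action that undoes it. It is not Picnic's implementation; the `run_saga` helper and the step names in the usage example are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation). All names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed steps in reverse order.

    Returns a log of what happened, which makes the control flow easy to inspect.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []

    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate only the steps that finished, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            break
        else:
            log.append(f"done:{name}")
            completed.append((name, compensate))

    return log


if __name__ == "__main__":
    def noop() -> None:
        pass

    def fail() -> None:
        raise RuntimeError("deployment step failed")

    # Hypothetical deployment steps; the third one fails.
    print(run_saga([
        ("create_tables", noop, noop),
        ("deploy_job", noop, noop),
        ("register_schedule", fail, noop),
    ]))
```

A real orchestration engine would also need to handle compensations that fail themselves, which this sketch deliberately leaves out.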

Atlassian: Data Processing Agreements (DPAs) 101: What app developers need to know

Atlassian continues to write about the importance of data privacy laws and what developers need to know about the regulatory requirements. A must-read for data engineering professionals.

https://blog.developer.atlassian.com/data-processing-agreements-dpas-developer-info/

Etsy: Adding Zonal Resiliency to Etsy’s Kafka Cluster

Cross-zone deployment comes with its own cost and latency penalties in Kafka infrastructure. Etsy writes about resiliency engineering for its Kafka infrastructure, adding zonal resilience in Google Cloud.

Part 1: https://www.etsy.com/codeascraft/adding-zonal-resiliency-to-etsys-kafka-cluster-part-1

Part 2: https://www.etsy.com/codeascraft/leveraging-zonal-resiliency-to-improve-updates-for-etsys-kafka-cluster-part-2

Paulo Salem: Building GPT-3 applications — beyond the prompt

I started using ChatGPT to assist with my day-to-day coding. It is a huge productivity booster, and I don’t think I can go back to working without it. I’m surprised by how quickly it became a habit, and I found this article to be a pretty exciting tutorial on building GPT-3 applications.

https://medium.com/data-science-at-microsoft/building-gpt-3-applications-beyond-the-prompt-504140835560

Twitter: The data platform cluster operator service for Hadoop cluster management

Speaking of “Big Data is Dead,” Twitter writes about streamlining its Hadoop cluster operations. Twitter has previously written about its move to Google BigQuery; interestingly, Hadoop is still not replaceable internally.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2023/the-data-platform-cluster-operator-service-for-hadoop-cluster-management

Bruce Momjian: Will Postgres live forever? – Postgres Innovation: Full-Text Search

Any modern database should support storing and processing semi-structured data & free-text search. Expecting a well-defined upfront schema is practically impossible given the variety of data sources we deal with. I found the blog very informative; it talks about the advancements in PostgreSQL to support full-text search.

https://willpostgresliveforever.com/postgres-innovation-full-text-search/

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
