Data Engineering Weekly #118

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Feb 13

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: Launching “Behind the Tech” Series & Data Council 2023

Data Engineering Weekly strives to bring our readers the best curation every week. As a data engineer, I always want to know what goes on behind the scenes of a product and how system design can be used to solve a business problem efficiently. We are opening the Behind the Tech series for startup founders to write in-depth technical articles covering system design and efficient product usage.

To begin, we are partnering with Pathway.com to launch a three-part series on unlocking stream processing, where Pathway talks about applying linear regression and classification in real time. Stay tuned for more.

If you wish to write for the Behind the Tech series, write to us at ananth@dataengineeringweekly.com. You can also submit founder stories and article suggestions.

Data Council - Austin 2023 Discount Code

Data Council - Austin 2023 is nearing, and I’m super excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.

Link to Register: https://www.datacouncil.ai/austin

Promo Code: DataWeekly20


MotherDuck: Big Data is Dead

All large data sets are generated over time. Time is almost always an axis in a data set. New orders come in every day. New taxi rides. New logging records. New games are being played. But compute needs will likely not change much over time; most analysis is done over recent data.

There is a lot of truth in this statement. Reprocessing historical data is a rare event; 99% of the computing happens over the last 24 hours of data. It's true that Big Data is dead, but we can't deny that its death is the result of collective advances in data processing techniques.
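To make the point concrete, here is a minimal sketch in Python using DuckDB (the engine behind MotherDuck): a single node comfortably serves the recent-data queries that dominate real workloads. The Parquet path and schema are hypothetical, and reading from S3 requires DuckDB's httpfs extension.

    # Sketch of "most analysis touches only recent data" on one node.
    import duckdb

    con = duckdb.connect()  # in-memory database

    # Only the most recent day of an append-only event log is scanned;
    # filtering on the timestamp keeps the query cheap no matter how
    # large the historical archive grows.
    rows = con.execute(
        """
        SELECT COUNT(*) AS orders, SUM(amount) AS revenue
        FROM read_parquet('s3://my-bucket/orders/*.parquet')
        WHERE created_at >= now() - INTERVAL 1 DAY
        """
    ).fetchall()
    print(rows)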

https://motherduck.com/blog/big-data-is-dead/


Dropbox: Balancing quality and coverage with our data validation framework

Data testing should be part of the data creation lifecycle, not a standalone process. I believe the current data testing platforms can't support the complex nature of data testing.

The Dropbox data team highlights the same problem. It describes how an extended Airflow operator that adopts the Write-Audit-Publish pattern with SQL helps standardize the data testing strategy.
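For readers new to Write-Audit-Publish, here is a minimal, hypothetical sketch of the idea in Python (not Dropbox's operator): write new data to a staging table, audit it with SQL checks, and publish only if every check passes. Table names, checks, and the run_sql helper are all illustrative.

    # Hypothetical Write-Audit-Publish flow.
    def write_audit_publish(run_sql, batch_date: str) -> None:
        # 1. WRITE: build the new partition in a staging table.
        run_sql(f"""
            CREATE OR REPLACE TABLE orders_staging AS
            SELECT * FROM raw_orders WHERE order_date = '{batch_date}'
        """)

        # 2. AUDIT: run SQL assertions against the staged data.
        audits = {
            "no_null_ids": "SELECT COUNT(*) FROM orders_staging WHERE id IS NULL",
            "no_negative_amounts": "SELECT COUNT(*) FROM orders_staging WHERE amount < 0",
        }
        for name, query in audits.items():
            (violations,) = run_sql(query).fetchone()
            if violations:
                raise ValueError(f"audit {name!r} failed: {violations} bad rows")

        # 3. PUBLISH: swap the audited data into the live table.
        run_sql(f"""
            INSERT OVERWRITE orders PARTITION (order_date = '{batch_date}')
            SELECT * FROM orders_staging
        """)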

https://dropbox.tech/infrastructure/balancing-quality-and-coverage-with-our-data-validation-framework


Mixpanel: Tracking events at millisecond granularity

My first reaction while reading “Historically, Mixpanel used to track events at second-level granularity” was: wait, what? No system is perfect, and I admire the Mixpanel team for discussing the complexity of changing the timestamp granularity and the system design behind it.
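To see why granularity matters: at one-second resolution, events fired within the same second get identical timestamps, so their relative order is lost. A tiny illustrative Python snippet, with made-up epoch values:

    # Two events fired 300 ms apart within the same wall-clock second.
    e1 = 1676300000.123  # event A, epoch seconds with fractional part
    e2 = 1676300000.423  # event B, 300 ms later

    # Second-level granularity: both truncate to the same timestamp,
    # so the order of A and B can no longer be recovered.
    assert int(e1) == int(e2) == 1676300000

    # Millisecond granularity preserves the ordering.
    assert int(e1 * 1000) < int(e2 * 1000)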

https://engineering.mixpanel.com/tracking-events-at-milli-second-granularity-7d1fc7f29e31


Sponsored: Fireside Chat: The Future of CDPs

Join this live session with BARK CTO Nari Sitaraman and RudderStack Founder Soumyadeb Mitra on 2/15 at 9 AM PT to make sense of the CDP evolution and get practical advice on how to drive competitive advantage as a data leader in 2023.

https://www.rudderstack.com/events/fireside-chat-the-future-of-cdps/


Shopify: The Complex Data Models Behind Shopify's Tax Insights Feature

The blog comes at the right time, as the data community frequently talks about the lost art of data modeling. Shopify shares its experience designing the tax insights feature, the business complexity involved, and the lessons learned.

https://shopifyengineering.myshopify.com/blogs/engineering/complex-data-models-behind-shopify-tax-insights


Picnic: Deploying Data Pipelines using the Saga pattern

An interesting take on a pipeline orchestration engine as a Saga pattern implementation. Picnic writes about how it automates pipeline deployment, and the blog definitely piqued my curiosity.
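For context, a saga splits a multi-step process into local steps, each paired with a compensating action that undoes it if a later step fails. A minimal, hypothetical Python sketch applied to pipeline deployment (not Picnic's implementation; the step actions are placeholders):

    # Each step has a compensating action; on failure, completed steps
    # are undone in reverse order.
    def run_saga(steps):
        # steps: list of (action, compensation) callable pairs.
        done = []
        try:
            for action, compensation in steps:
                action()
                done.append(compensation)
        except Exception:
            for compensation in reversed(done):
                compensation()  # roll back in reverse order
            raise

    deployment = [
        (lambda: print("create schema"),  lambda: print("drop schema")),
        (lambda: print("register DAG"),   lambda: print("unregister DAG")),
        (lambda: print("switch traffic"), lambda: print("revert traffic")),
    ]
    run_saga(deployment)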

https://blog.picnic.nl/deploying-data-pipelines-using-the-saga-pattern-ffc1cbe29cee


Sponsored: [New] Winning Strategies—2023 Modern Data Leader’s Playbook 🏅

Don't fumble your data strategy in 2023. Learn how data managers, directors, and other leaders set their teams up for success. See how to drive organizational impact at scale, touching on the technologies, processes, and cultural requirements necessary to succeed in this role.

Get The Playbook


Atlassian: Data Processing Agreements (DPAs) 101: What app developers need to know

Atlassian continues to write about the importance of data privacy laws and what developers need to know about the regulatory requirements. A must-read for data engineering professionals.

https://blog.developer.atlassian.com/data-processing-agreements-dpas-developer-info/


Etsy: Adding Zonal Resiliency to Etsy’s Kafka Cluster

Cross-zone replication comes with cost and latency penalties in Kafka infrastructure. Etsy writes about resiliency engineering for its Kafka infrastructure, adding zonal resilience on Google Cloud.
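As background: when each broker advertises its zone through Kafka's broker.rack setting, the rack-aware replica assignment spreads a topic's replicas across zones. A minimal sketch with the confluent-kafka Python client (broker address, topic name, and sizing are hypothetical):

    # Create a topic whose replicas Kafka spreads across zones,
    # assuming each broker sets broker.rack to its GCP zone.
    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "kafka:9092"})

    topic = NewTopic(
        "orders",
        num_partitions=6,
        replication_factor=3,                 # one replica per zone
        config={"min.insync.replicas": "2"},  # tolerate a full-zone outage
    )

    # create_topics returns {topic_name: future}; result() raises on failure.
    for name, future in admin.create_topics([topic]).items():
        future.result()
        print(f"created {name}")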

Part 1: https://www.etsy.com/codeascraft/adding-zonal-resiliency-to-etsys-kafka-cluster-part-1

Part 2: https://www.etsy.com/codeascraft/leveraging-zonal-resiliency-to-improve-updates-for-etsys-kafka-cluster-part-2


Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!

Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.

  • Streaming plus batch unified in a single platform

  • Stateful processing at scale - joins, aggregations, upserts

  • Orchestration auto-generated from the data and SQL

  • Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift

Try now and get 30 Days Free


Paulo Salem: Building GPT-3 applications — beyond the prompt

I started using ChatGPT as an assistant for my day-to-day coding; it is a huge productivity booster, and I don't think I can go back to working without it. I'm surprised by how quickly it became a habit, and I found this article to be a pretty exciting tutorial on building GPT-3 applications.
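As a taste of what the tutorial builds on, here is a minimal sketch of calling GPT-3 programmatically with the openai Python library's pre-1.0 Completion API (current at the time of writing); the prompt and parameters are illustrative:

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Explain the Write-Audit-Publish pattern in one sentence.",
        max_tokens=64,
        temperature=0.2,  # low temperature for more deterministic output
    )
    print(response.choices[0].text.strip())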

https://medium.com/data-science-at-microsoft/building-gpt-3-applications-beyond-the-prompt-504140835560


Twitter: The data platform cluster operator service for Hadoop cluster management

Speaking of “Big Data is Dead,” Twitter writes about streamlining its Hadoop cluster operations. Twitter has previously written about its move to Google BigQuery; interestingly, Hadoop is still not replaceable internally.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2023/the-data-platform-cluster-operator-service-for-hadoop-cluster-management


Bruce Momjian: Will Postgres live forever? – Postgres Innovation: Full-Text Search

Any modern database should support storing and processing semi-structured data and free-text search. Expecting a well-defined upfront schema is practically impossible with the variety of data sources we deal with. I found the blog very informative; it talks about the advancements in PostgreSQL that support full-text search.
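For a flavor of what PostgreSQL full-text search looks like, here is a minimal sketch from Python using psycopg2; the articles table and connection string are hypothetical. to_tsvector normalizes text into lexemes, @@ matches it against a tsquery, and ts_rank orders results by relevance.

    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    query = "postgres & search"  # tsquery syntax: & means AND
    cur.execute(
        """
        SELECT title,
               ts_rank(to_tsvector('english', body),
                       to_tsquery('english', %s)) AS rank
        FROM articles
        WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 5
        """,
        (query, query),
    )
    for title, rank in cur.fetchall():
        print(f"{rank:.3f}  {title}")

In production you would back this with a GIN index on to_tsvector('english', body) so matches don't require a sequential scan.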

https://willpostgresliveforever.com/postgres-innovation-full-text-search/


All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
