Data Engineering Weekly #66

Weekly Data Engineering Newsletter

Dec 13, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

There is a lot of exciting stuff to catch up on in Data Engineering Weekly this week, especially since we took last week off.

dbt labs: Coalesce - the analytics engineering conference 2021

Coalesce, the analytics engineering conference, was a delight to watch this week. dbt as a metrics layer is an exciting evolution and the key highlight of the conference. If you missed the conference, you can watch all the replays here.

https://coalesce.getdbt.com/replays/

James Le: What I Learned From the Open Source Data Stack Conference 2021

The Open Source Data Stack Conference is another exciting data conference, focused on the modern data stack in the open-source world. The author writes an excellent summary of the conference.

https://jameskle.com/writes/open-source-data-stack-2021

You can catch up on all the replays of the Open Source Data Stack Conference 2021 here.

https://www.opensourcedatastack.com/stage/events

AWS: Top Announcements of AWS re:Invent 2021

AWS published the top announcements from the AWS re:Invent 2021 conference. The post is worth scanning for the announcements that matter from a data engineering perspective.

https://aws.amazon.com/blogs/aws/top-announcements-of-aws-reinvent-2021/

LinkedIn: Evolving LinkedIn’s analytics tech stack - Lessons from a large-scale data platform migration

LinkedIn shares the story of moving its analytics stack from Teradata data warehouse systems to open-source big data technologies. The analytics stack includes 1400+ datasets, 900+ data flows, and 2100+ users. The migration strategy, which also improves the data model along the way, is an exciting read.

https://engineering.linkedin.com/blog/2021/evolving-linkedin-s-analytics-tech-stack

Tableau: Top Data books of 2021

Though the title says top data books, the shortlisted books focus on data visualization or the Tableau platform. Nonetheless, data visualization books are always a great read.

https://www.tableau.com/about/blog/2021/12/andy-cotgreave-top-data-books-2021

Erik Bernhardsson: Storm in the stratosphere - how the cloud will be reshuffled

Erik Bernhardsson writes an exciting prediction on cloud vendor trends and builds his case on the success of Snowflake over Redshift. It is undoubtedly true in the analytical world, where AWS solutions always package and sell open-source tools but never go further to simplify the developer workflow. A couple of interesting predictions to highlight:

Kubernetes will be some weird thing people loved for five years, just like Hadoop was from 2009-2013, but the world will move on.
YAML will be something old jaded developers bring up after a few drinks. You know it's time to wrap up at the party at that point.

https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html

Shreya Shankar: The Modern ML Monitoring Mess: Rethinking Streaming Evaluation

A streaming sliding window with a finite interval is the go-to strategy for monitoring ML metrics. The threshold, window size, and alerts are still defined manually for each metric. The author explains why this procedure for evaluating ML on streams of data is broken, highlighting representation differences, varying sample sizes, and delayed feedback within the sliding window.
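To make the setup concrete, here is a minimal sketch of the kind of sliding-window monitor the post critiques. It is not taken from the article: the class name, the count-based (rather than time-based) window, and the window size and threshold values are all assumptions chosen for illustration.

```python
from collections import deque


class SlidingWindowAccuracy:
    """Hypothetical monitor: accuracy over the last `window_size` labeled predictions."""

    def __init__(self, window_size, threshold):
        # deque with maxlen drops the oldest outcome automatically
        self.window = deque(maxlen=window_size)
        # hand-picked alert threshold, set manually per metric
        self.threshold = threshold

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def update(self, prediction, label):
        """Record one outcome; return True if an alert should fire."""
        self.window.append(prediction == label)
        if len(self.window) < self.window.maxlen:
            return False  # window not full yet, so no alert
        return self.accuracy() < self.threshold


# Illustrative stream of (prediction, label) pairs; values are made up.
monitor = SlidingWindowAccuracy(window_size=4, threshold=0.75)
stream = [(1, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 1)]
alerts = [monitor.update(p, y) for p, y in stream]
```

Even in this toy version, both the window size and the threshold are picked by hand, and the monitor silently assumes that the label arrives together with the prediction, which is where the delayed-feedback concern comes in.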

https://www.shreya-shankar.com/rethinking-ml-monitoring-1/

Microsoft: SynapseML - A simple, multilingual, and massively parallel machine learning library

Microsoft announces the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. With SynapseML, developers can build scalable and intelligent systems for solving challenges in domains such as Anomaly Detection, Computer Vision, Deep Learning, Text analytics, etc.

https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/

Data@Monzo: Mapping our data journey with column lineage

Monzo writes about its journey to bring in column-level lineage to track and understand the scope of changes across the data warehouse and automatically detect unused columns. TIL about ZetaSQL, which helps parse and analyze BigQuery SQL, and I am looking forward to playing around with it.

https://medium.com/data-monzo/mapping-our-data-journey-with-column-lineage-56209c00606d

PayPal: Building Data Quality into the Enterprise Data Lake

PayPal writes about its Rule Execution Framework, a centralized rule configuration system for managing data quality rules and rulesets. The adoption of SQL for writing complex data validation rules and the workflow that lets domain owners define the data quality rules are both exciting.

https://medium.com/paypal-tech/building-data-quality-into-the-enterprise-data-lake-9dec305c3757

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
