Data Engineering Weekly #33

Weekly Data Engineering Newsletter

Welcome to the 33rd edition of the data engineering newsletter. This week's release is a new set of articles that focus on Michael Stonebraker’s Top 10 Big Data Blunders, Stanford University’s AI Index Report 2021, Maxime’s The Future of Business Intelligence is Open Source, Mehdi’s data engineering skills report, the Apache Airflow survey 2020, DataMinded’s things to consider for Argo Workflow, Spotify’s new experimentation strategy, LightUp’s hidden data outages, Confluent’s real-time analytics with Kafka & Pinot, Pinterest’s Flink deployment framework, AWS’s new features on Hudi, and Trino’s new window function enrichments.

Michael Stonebraker: Top 10 Big Data Blunders

Some of the recent articles and conversations around data modeling remind me of Michael Stonebraker's talk about the top 10 big data blunders. It is an excellent talk to watch or re-watch.

Stanford University: The AI Index Report - Measuring trends in Artificial Intelligence

Stanford University published the AI Index Report for 2021, focusing on AI development in the USA. It’s an exciting read, and the top 9 takeaways are:

  1. “Drugs, Cancer, Molecular, Drug Discovery” received the greatest amount of private AI investment in 2020, with more than USD 13.8 billion, 4.5 times higher than 2019.

  2. In 2019, 65% of graduating North American PhDs in AI went into industry, up from 44.4% in 2010.

  3. AI systems can now compose text, audio, and images to a sufficiently high standard.

  4. The diversity challenge - In 2019, 45% of new U.S. resident AI Ph.D. graduates were white—by comparison, 2.4% were African American, and 3.2% were Hispanic.

  5. China overtakes the US in AI journal citations.

  6. The majority of the US AI Ph.D. grads are from abroad—and they’re staying in the US.

  7. Surveillance technologies are fast, cheap, and increasingly ubiquitous.

  8. AI ethics lacks benchmarks, and consensus remains a challenge.

  9. AI gained attention in Congress: the 116th Congress is the most AI-focused congressional session in history. The number of mentions of AI in the Congressional Record is more than triple that of the 115th Congress.

Maxime Beauchemin: The Future of Business Intelligence is Open Source

The open-source database and data processing ecosystem revolutionized software development. The author raises an interesting question: why are BI platforms still mostly closed source?

Mehdi Ouazza: What are the most requested technical skills in the data job market? Insights from 35k+ data jobs ads

It is an insightful hack to understand the skills in demand in data engineering. SQL and Python are the top skills to develop if you're into data science or data engineering. The author's take on Python over Scala for data engineering resonates well with the Spark ecosystem's current development.

Apache Airflow: Airflow survey 2020

Apache Airflow published the 2020 Airflow survey results. Some of the exciting trends to highlight:

  1. 13.79% adoption among the general developer community outside of data engineers.

  2. 85% of people using Airflow are likely or very likely to recommend Airflow.

  3. The Airflow Local executor is more popular than the Kubernetes executor.

  4. Slack and GitHub are the go-to places for technical questions, 2x more popular than Stack Overflow!

DataMinded: What to consider before choosing Argo Workflow?

Argo Workflows is an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. The blog narrates the basic workflow using Argo and the pros and cons of Argo Workflow from the data engineering perspective.

Spotify: Spotify’s New Experimentation Coordination Strategy

Spotify wrote about its new experimentation coordination strategy and how it migrated its experimentation platform to use Bucket Reuse for all experiments. The narration on handling exclusive and nonexclusive experiments and the concept of paths is exciting to read.
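To make the exclusive vs. nonexclusive distinction concrete, here is a minimal Python sketch of bucket-based coordination. It is not Spotify's implementation: the hashing scheme, the bucket count, and the names `bucket_for` and `BucketCoordinator` are all assumptions made for illustration, and the concept of paths from the post is not modeled.

```python
import hashlib
import random

NUM_BUCKETS = 1000  # hypothetical bucket count, chosen only for the example


def bucket_for(user_id: str, salt: str = "global-salt", num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically hash a user into one of num_buckets buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


class BucketCoordinator:
    """Toy coordinator (hypothetical): exclusive experiments reserve disjoint
    bucket sets, nonexclusive experiments sample buckets freely."""

    def __init__(self, num_buckets: int = NUM_BUCKETS, seed: int = 0) -> None:
        self.num_buckets = num_buckets
        self.rng = random.Random(seed)
        self.reserved = {}  # exclusive experiment name -> set of bucket ids

    def start_exclusive(self, name: str, size: int) -> set:
        # Only buckets not held by another running exclusive experiment are eligible.
        taken = set().union(*self.reserved.values())
        free = [b for b in range(self.num_buckets) if b not in taken]
        if size > len(free):
            raise ValueError("not enough free buckets for an exclusive experiment")
        chosen = set(self.rng.sample(free, size))
        self.reserved[name] = chosen
        return chosen

    def start_nonexclusive(self, size: int) -> set:
        # Nonexclusive experiments may overlap with anything, so sample from all buckets.
        return set(self.rng.sample(range(self.num_buckets), size))

    def stop(self, name: str) -> None:
        # Stopping an exclusive experiment releases its buckets for reuse.
        self.reserved.pop(name, None)


coordinator = BucketCoordinator(num_buckets=100, seed=42)
checkout_test = coordinator.start_exclusive("checkout-test", 40)
search_test = coordinator.start_exclusive("search-test", 40)
print(checkout_test.isdisjoint(search_test))  # True: exclusive experiments never share buckets
```

The point of the sketch is the reuse step: a user's bucket never changes, and once `stop` releases an experiment's buckets, a later experiment can claim them again instead of requiring a fresh randomization of users.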

LightUp: Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems

LightUp writes an exciting two-part blog on a new category of data outage termed "The Hidden Data Outages." The blog narrates case studies where hidden data outages caused significant business loss and calls for a dedicated data monitoring platform.

If you consider the hidden data outages behind every company's financial earnings call, our entire economy essentially depends on untested SQL!

Confluent: Under the Hood of Real-Time Analytics with Apache Kafka and Pinot

Apache Pinot is a distributed analytics data store rapidly becoming the go-to solution for building real-time analytical applications at scale. The blog narrates how the real-time ingestion from Kafka to Apache Pinot works and the internal implementation of mutable vs. immutable segments, query processing & memory management.

Pinterest: Pinterest Flink Deployment Framework

Pinterest writes about its Flink deployment framework and its integration with the CI/CD pipeline. The blog narrates some of the best practices, such as job deduplication, state preservation before deploying a new version, and a focus on reversibility.

AWS: New features from Apache Hudi available in Amazon EMR

AWS highlighted the new Apache Hudi features available in Amazon EMR. The ability to convert existing Parquet files to the Hudi format and seamless integration with AWS Database Migration Service are some of the standout features. Redshift Spectrum’s ability to query Apache Hudi datasets is an exciting trend to watch.

Trino: Introducing new window features

The SQL window function is a vital feature for analytics queries. Trino writes about its new window function improvements: full support for the RANGE frame type, support for the GROUPS frame type, and the addition of the WINDOW clause for defining named windows.
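The difference between the frame types is easier to see on a small example. The Python sketch below emulates a running `SUM` over a frame of "1 PRECEDING to CURRENT ROW" under ROWS, RANGE, and GROUPS semantics as defined in standard SQL. The helper names and the sample data are made up for illustration; this is not Trino code, and it assumes the rows are already sorted by the ordering key.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (sort_key, value); rows are assumed sorted by sort_key


def rows_frame_sum(rows: List[Row], n: int) -> List[int]:
    """ROWS n PRECEDING: the frame is the current row plus the n physical rows before it."""
    return [sum(v for _, v in rows[max(0, i - n): i + 1]) for i in range(len(rows))]


def range_frame_sum(rows: List[Row], n: int) -> List[int]:
    """RANGE n PRECEDING: the frame is every row whose key lies in [key - n, key]."""
    return [sum(v for k, v in rows if key - n <= k <= key) for key, _ in rows]


def groups_frame_sum(rows: List[Row], n: int) -> List[int]:
    """GROUPS n PRECEDING: the frame is the current peer group plus the n peer groups before it."""
    distinct_keys = sorted({k for k, _ in rows})
    group_of = {k: g for g, k in enumerate(distinct_keys)}
    return [
        sum(v for k, v in rows if group_of[key] - n <= group_of[k] <= group_of[key])
        for key, _ in rows
    ]


data = [(1, 10), (1, 20), (2, 30), (4, 40), (5, 50)]
print(rows_frame_sum(data, 1))    # [10, 30, 50, 70, 90]
print(range_frame_sum(data, 1))   # [30, 30, 60, 40, 90]
print(groups_frame_sum(data, 1))  # [30, 30, 60, 70, 90]
```

The row with key 4 shows the distinction: RANGE finds no row with key 3, so its frame holds only that row (40), while GROUPS reaches back to the previous distinct key, 2, and returns 70. Rows that share key 1 are peers, so RANGE and GROUPS give both of them the same result, whereas ROWS treats them as separate physical rows.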

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.