Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
From DEW #124, we selected the following articles:
dbt: State of Analytics Engineering
dbt publishes the State of Analytics [data???🤔] Engineering report. If you follow Data Engineering Weekly, you know we actively talk about data contracts & how data is a collaboration problem, not just an ETL problem. The survey validates this: two of the top 5 concerns are data ownership & collaboration between data producers & consumers. Here are the top 5 key learnings from the report.
46% of respondents plan to invest more in data quality and observability this year, the most popular area for future investment.
Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
Data and analytics engineers are most likely to believe they have clear goals and are most likely to agree their work is valued.
71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”
https://www.getdbt.com/state-of-analytics-engineering-2023/
Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of an industrial revolution in computing.
Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT and computing to date look like the spinning jenny that was the start of the industrial revolution.
🤺🤺🤺 May the best LLM win!! 🤺🤺🤺
LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam
One of the curses of adopting Lambda Architecture is the need to write the same business logic twice, once in the streaming pipeline and once in the batch pipeline. Spark attempts to solve this with a unified RDD model for streaming and batch; Flink introduces the Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, which provides a unified pipeline abstraction that can run on a target data processing runtime such as Samza, Spark & Flink.
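To make the "write the business logic once" idea concrete, here is a minimal sketch in plain Python. It is not Apache Beam's API and not LinkedIn's code; the function names (`count_page_views`, `run_batch`, `run_streaming`) and the event shape are hypothetical and only illustrate one transform being shared by a bounded (batch) driver and an unbounded (streaming) driver.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def count_page_views(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Shared business logic: emit the running view count for each page."""
    counts: Counter = Counter()
    for event in events:
        if event.get("type") != "view":
            continue  # ignore anything that is not a page view
        counts[event["page"]] += 1
        yield event["page"], counts[event["page"]]


def run_batch(events: Iterable[dict]) -> Dict[str, int]:
    """Batch driver: consume a bounded input and keep only the final counts."""
    # Later (page, count) pairs overwrite earlier ones, leaving the totals.
    return dict(count_page_views(events))


def run_streaming(source: Iterator[dict]) -> Iterator[Tuple[str, int]]:
    """Streaming driver: lazily emit an updated count as each event arrives."""
    return count_page_views(source)


if __name__ == "__main__":
    events = [
        {"type": "view", "page": "home"},
        {"type": "click", "page": "home"},
        {"type": "view", "page": "home"},
    ]
    print(run_batch(events))
    for update in run_streaming(iter(events)):
        print(update)
```

Only the drivers differ; the counting logic lives in one place, which is the property a unified abstraction aims to provide across runtimes.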
Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices
Wix writes about managing schemas for 2000 (😬) microservices by standardizing schema structure with protobuf and the Kafka schema registry. Some exciting reads include patterns like an internal Wix Docs approach & the integration of documentation publishing into the CI/CD pipelines.
DEW #124: State of Analytics Engineering, ChatGPT, LLM & the Future of Data Consulting, Unified Streaming & Batch Pipeline, and Kafka Schema Management