Experience Enterprise-Grade Apache Airflow
Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.
Ozge Demirci, Jonas Hannane & Xinrong Zhu: Who Is AI Replacing? The Impact of Generative AI on Online Freelancing Platforms
The economic impact of Gen AI is the subject of wide speculation, and we see few signs of impact. The paper highlights the substantial impact of generative AI on reducing demand for certain freelance jobs while increasing the complexity and pay of the remaining jobs, leading to greater competition and shifts in required skills. The key highlights of the paper:
1. Decrease in Job Posts: The introduction of ChatGPT led to a 21% decrease in job posts for automation-prone jobs (such as writing and coding) within eight months compared to jobs requiring manual-intensive skills. Image-generating AI technologies resulted in a 17% decrease in job posts related to image creation.
2. Increased Competition: The reduction in job posts increased competition among freelancers. The remaining automation-prone jobs were more complex and offered higher pay.
3. Job Complexity and Pay: Despite the decrease in job posts, the complexity and pay for the remaining automation-prone jobs increased.
4. Specific Job Clusters Affected:
Writing jobs saw the most significant decrease in demand (30.37%).
Software, app, and web development jobs decreased by 20.62%.
Engineering jobs saw a 10.42% decline.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4602944
Leopold Aschenbrenner: Situational Awareness - The Decade Ahead
Tracing the advancements from GPT-2 to GPT-4, the paper argues that AGI (Artificial General Intelligence) by 2027 is plausible. The paper highlights several challenges, including the need for massive industrial mobilization to support the growing demand for GPUs, data centers, and power infrastructure.
Controlling AI systems that are much smarter than humans is an unsolved technical problem, and failure could lead to catastrophic outcomes. What do you all think? Do you think human society can handle human-level intelligent machines?
https://situational-awareness.ai/
Ben Lorica: Why Your Generative AI Projects Are Failing
Yes, I added this article as a logical sequel to the previous two articles 😂 Though the promise of LLMs is amazing, enterprises struggle to integrate them seamlessly without disrupting existing workflows. Looming regulatory requirements, data quality, governance issues & model accuracy keep failing enterprises.
https://gradientflow.substack.com/p/why-your-generative-ai-projects-are
Sponsored: Try Fully Managed Apache Airflow for FREE
Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).
Astasia Myers & Eric Flaningam: The rise of AI data infrastructure
The article discusses the emergence of AI data infrastructure as a critical area for innovation. The authors emphasize the increasing need for high-quality data for training and inference, focusing on unstructured data pipelines, retrieval-augmented generation (RAG), data curation, and AI memory. It is a good reminder to the data industry that we need to solve the fundamentals of data engineering to utilize AI better.
https://www.felicis.com/insight/ai-data-infrastructure
Chris Riccomini: Data Lakehouse Catalog Reality Check
Databricks and Snowflake are talking a big game. So far, they've given us empty Github repositories and rewrites.
I don’t think anyone can better describe the catalog war than this.
Market pressure leads vendors to market something as what it is not and to announce things that are not ready yet. In all fairness, if it is a competition to open-source things, we can take it any day.
https://materializedview.io/p/data-lakehouse-catalog-reality-check
Pedram Navid: The Rise of the Data Platform Engineer
Data Engineers, however, kept writing ETL pipelines. Sure, you could pay Fivetran to sync your Salesforce data, and maybe Stripe had a native Snowflake connector, but there was no escaping the long tail of data needs.
The blog is a good summary of the ever-changing and confusing role. The question essentially is, are we so back to building YAML frameworks?
https://databased.pedramnavid.com/p/the-rise-of-the-data-platform-engineer
Booking.com: Meta-experiments: Improving experimentation through experimentation
Can we experiment on the experimentation process? By implementing "meta-experiments," the team tested new features like low-power alerts, significantly boosting the quality of their A/B tests. This clever dogfooding enhanced their platform and gave the team a taste of their own medicine, fostering empathy for their users and uncovering pain points they hadn't experienced firsthand.
https://booking.ai/meta-experiments-improving-experimentation-through-experimentation-6bdee314c512
Instacart: Bandits for Marketing Optimization
Instacart discusses its adaptive experimentation system for optimizing paid marketing budgets. The system uses a two-step process:
It models performance curves using inverse-propensity-weighted regression to ensure valid causal inference.
It employs Thompson Sampling to balance exploration and exploitation when choosing marketing actions.
By continuously updating its estimates and intelligently introducing random perturbations, this approach has significantly improved Instacart's marketing efficiency compared to traditional methods.
https://tech.instacart.com/bandits-for-marketing-optimization-f5a63b9bfaa7
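To make the second step concrete, here is a minimal Thompson Sampling sketch. It is not Instacart's implementation: it swaps the performance-curve model for a much simpler Beta-Bernoulli bandit, and the arm conversion rates, round count, and function names are all hypothetical values chosen for illustration.

```python
import random


def thompson_pick(successes, failures, rng):
    """Draw one plausible conversion rate per arm from its Beta posterior
    and return the index of the arm with the highest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, rounds=5000, seed=7):
    """Run the bandit against simulated arms and return pulls per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_pick(successes, failures, rng)
        # Simulated feedback; a real system would observe actual outcomes.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical conversion rates for three marketing actions.
pulls = simulate([0.05, 0.10, 0.30])
print(pulls)
```

Arms with uncertain estimates still get sampled occasionally (exploration), while traffic concentrates on the arm whose posterior looks best (exploitation).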
Lazaro Hurtado: Evaluating RAG capabilities of Small Language Models
In this article, the author evaluates Small Language Models (SLMs) for use in Retrieval Augmented Generation (RAG) systems, comparing their performance to larger models using the Needle-In-A-Haystack benchmark. Some fine-tuned SLMs, particularly Gemma 2B and Llama2 7B, perform well in tasks similar to those in RAG applications, suggesting the potential for more resource-efficient and environmentally friendly alternatives to Large Language Models. However, the author notes that further research is needed to fully assess SLMs' capabilities in more complex scenarios typical of RAG systems.
Geico: Searchable field-level encrypted customer PII with k-anonymity
Field-level encryption is a data protection measure that encrypts individual sensitive fields within records, keeping data encrypted throughout its lifecycle and narrowing the data protection focus to key management.
To enable searching of encrypted data, GEICO uses k-anonymization, which involves storing truncated hash digests alongside encrypted values and allows for secure searches without knowing the encryption key. The approach balances security and performance, requiring careful tuning of the hash truncation length to manage the trade-off between protection against dictionary attacks and the number of false positives in search results.
https://www.geico.com/techblog/searchable-field-level-encrypted-customer-pii/
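The truncated-digest idea can be sketched as follows. This is an illustration under stated assumptions, not GEICO's code: the keyed HMAC-SHA256, the 4-character prefix, the index key, and all function names are hypothetical choices. The index stores only short tokens, so a lookup returns a bucket of candidates that must then be decrypted and compared exactly.

```python
import hashlib
import hmac

INDEX_KEY = b"hypothetical-index-key"  # assumption: separate from the encryption key
PREFIX_LEN = 4  # hex chars kept; shorter means more collisions (more anonymity, more false positives)


def search_token(value: str, prefix_len: int = PREFIX_LEN) -> str:
    """Return a truncated keyed digest of the normalized value."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(INDEX_KEY, normalized, hashlib.sha256).hexdigest()
    return digest[:prefix_len]


def build_index(records):
    """Map record id -> token. In a real system the token would be stored
    next to the encrypted field; plaintext is used here only to build it."""
    return {rid: search_token(value) for rid, value in records.items()}


def candidates(index, query: str):
    """Return ids whose token matches; may include false positives."""
    token = search_token(query)
    return [rid for rid, tok in index.items() if tok == token]


# Hypothetical sample data.
records = {1: "alice@example.com", 2: "bob@example.com", 3: "carol@example.com"}
index = build_index(records)
matches = candidates(index, " Alice@Example.com ")
print(matches)
```

Because many values share the same short token, the caller still needs the encryption key to confirm which candidate is the real match.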
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.