Yesterday's AI News Digest
2025-12-15
The AI industry seems to be entering a “show me, don’t tell me” phase. While December’s usual slowdown has mercifully spared us another frenzy of acquisition announcements, we’re seeing something arguably more interesting: a collective obsession with proving these systems actually work. Every major foundation model provider is now rushing to release its own agentic coding tools (Mistral and Google both made moves here), even as the Linux Foundation scrambles to bring some organizational sanity to the chaos. And the real story might be hiding in the benchmarks: from neuroscience data analysis to statistical reliability improvements, there’s a quiet but determined effort to figure out how we actually evaluate whether LLMs are any good at the complex tasks we keep throwing at them.
📰 General News
(Google) Scholar Labs: An AI-Powered Scholar Search
Google just launched Scholar Labs, an experimental AI search tool that tackles complex research questions by breaking them down into component topics and relationships. Instead of simple keyword matching, it analyzes your question from multiple angles, searches across scholarly papers, and explains how each result addresses your specific query. The feature supports follow-up questions for deeper exploration and is rolling out gradually to logged-in users in English, with a waitlist for those without access.
OpenAI built an AI coding agent and uses it to improve the agent itself
OpenAI now uses its AI coding agent Codex to build and improve Codex itself, with the company’s product lead saying “the vast majority of Codex is built by Codex.” The tool monitors its own training runs, processes user feedback to decide what to build next, and gets assigned tasks through the same project management systems as human engineers. In one striking example, four engineers used Codex to build the Sora Android app from scratch in just 18 days.
Gemini Live API Now GA on Vertex AI
Google’s Gemini Live API is now generally available on Vertex AI, letting enterprises build real-time voice and video AI agents that can be interrupted mid-sentence, understand tone and emotion, and analyze visual content during conversations. Early adopters are seeing serious results: United Wholesale Mortgage generated over 14,000 loans using their AI assistant Mia, while 11Sight boosted call resolution rates from 40% to 60% in nine months. The API runs on Gemini 2.5 Flash Native Audio, designed for low-latency multimodal interactions at enterprise scale.
BBVA embeds AI into banking workflows using ChatGPT Enterprise
Spanish banking giant BBVA is deploying ChatGPT Enterprise to 11,000 employees across all units, marking one of the largest AI rollouts in finance. After a 3,300-person pilot saved workers nearly three hours weekly on routine tasks, the bank is now embedding OpenAI’s tools into core operations like risk analysis and software development. BBVA already launched ‘Blue,’ an AI assistant for customers, and plans to let clients interact with the bank directly through ChatGPT with enterprise-grade security controls.
Microsoft’s Copilot usage analysis exposes the 2am philosophy question trend
Microsoft analyzed 37.5 million Copilot conversations and found people ask AI about religion and philosophy during early morning hours, with queries peaking around 2-3am. The data reveals surprisingly human patterns: health questions dominate mobile use at all times, programming conversations climb Monday through Friday while gaming queries surge on weekends, and relationship advice requests spike on Valentine’s Day. The shift from pure information searches to personal advice-seeking shows AI assistants are becoming digital confidants for life’s bigger questions.
Cursor Launches an AI Coding Tool for Designers
Cursor, the AI coding startup valued at $30 billion, just launched Visual Editor—a tool that lets designers build and modify web interfaces using natural language commands. Unlike typical vibe-coding apps that produce generic purple-gradient websites, Visual Editor offers professional-grade controls that map directly to CSS, letting designers tweak everything from corner radii to letter spacing. The move puts Cursor in direct competition with design giants like Figma and Adobe, while helping it fend off pressure from OpenAI and Anthropic in the AI coding space.
As AI Grows More Complex, Model Builders Rely on NVIDIA
OpenAI’s new GPT-5.2 model trained entirely on NVIDIA infrastructure, continuing a trend where most leading AI models now rely on the chipmaker’s platforms. NVIDIA’s GB300 systems deliver 4x faster training than previous generation Hopper chips, helping explain why companies from OpenAI to Runway to Cohere are building on Blackwell architecture. The performance advantage extends beyond language models to video generation, protein folding, and medical imaging. NVIDIA was the only company to submit results across all seven categories in the latest MLPerf industry benchmarks.
Mistral AI surfs vibe-coding tailwinds with new coding models
French AI startup Mistral just dropped Devstral 2, its new coding model, alongside Mistral Vibe, a command-line tool that lets developers automate code through natural language. The company is chasing Anthropic and coding-focused competitors with context-aware features that remember past interactions. Devstral 2 packs 123 billion parameters and needs serious hardware (four H100 GPUs), but there’s also Devstral Small at 24 billion parameters for local deployment. Both models are currently free via API, with paid pricing starting at $0.40/$2.00 per million tokens for the larger version.
Linux Foundation Announces the Formation of the Agentic AI Foundation
The Linux Foundation just launched the Agentic AI Foundation with backing from AI’s biggest players: Anthropic, OpenAI, Block, AWS, Google, and Microsoft. Three major projects anchor it: Anthropic’s Model Context Protocol (already adopted by 10,000+ servers and integrated into Claude, ChatGPT, and VS Code), Block’s goose agent framework, and OpenAI’s AGENTS.md standard (used in 60,000+ open source projects). The goal is creating neutral, open governance for the autonomous AI agents that will coordinate complex tasks across systems.
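For readers who haven’t seen one: AGENTS.md is simply a Markdown file at a repository’s root that gives coding agents project-specific instructions (setup, testing, style) in whatever sections the maintainers choose. The snippet below is an illustrative example I’ve sketched, not drawn from any real project.

```markdown
# AGENTS.md

## Setup
- Install dependencies with `npm install`.

## Testing
- Run `npm test` before committing; all tests must pass.

## Code style
- TypeScript strict mode; avoid `any`.
- Follow the existing formatting; do not reformat unrelated files.
```

Because it’s plain Markdown, the same file is readable by humans and by any agent that supports the convention.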
Slack CEO Denise Dresser to join OpenAI as chief revenue officer
OpenAI just poached Slack CEO Denise Dresser to become its new chief revenue officer, tasked with steering the company’s enterprise strategy and customer success. After 14+ years at Salesforce (Slack’s parent company), Dresser joins OpenAI at a critical moment as the company struggles with profitability despite massive growth. She’ll work under Fidji Simo, who herself jumped from Instacart to OpenAI earlier this year. Slack’s chief product officer Rob Seaman steps in as interim CEO.
Boom Supersonic raises $300M to build natural gas turbines for Crusoe data centers
Boom Supersonic, the company building supersonic passenger jets, just pivoted into power generation. The startup raised $300M to sell stationary versions of its jet turbines to data centers, landing a $1.25B deal with Crusoe for 29 turbines delivering 1.21 gigawatts by 2027. CEO Blake Scholl calls it their “Starlink moment” – profits will fund the company’s Overture supersonic aircraft development. The turbines share 80% of parts with Boom’s airborne engines, letting them cross-subsidize the expensive work of bringing back supersonic commercial flight.
Claude Code is coming to Slack, and that’s a bigger deal than it sounds
Anthropic is bringing Claude Code to Slack, letting developers kick off full coding sessions by tagging @Claude in chat threads. The beta goes beyond simple code snippets: Claude can now analyze bug reports or feature requests from Slack messages, identify the right repository, and post progress updates before opening pull requests. It’s part of a bigger trend where AI coding tools are moving out of traditional development environments and into collaboration platforms where teams already spend their time. The race is on to become the dominant AI assistant embedded in workplace tools, with Cursor and GitHub Copilot making similar moves.
Instacart pilots agentic commerce by embedding in ChatGPT
Instacart just became the first company to let you complete an entire grocery order inside ChatGPT—from meal planning to checkout—without ever leaving the chat. The integration uses OpenAI’s new Agentic Commerce Protocol and processes payments directly through Stripe. Instacart helped develop this capability by serving as an early testing partner for OpenAI’s Operator research preview, using its database of 1.8 billion products across 100,000 stores to train the AI on real-world inventory constraints. The company is betting that consumers will increasingly start shopping from AI platforms rather than traditional apps.
A first look at Google’s Project Aura glasses built with Xreal
Google’s Project Aura glasses, built with Xreal and launching in 2026, look like chunky sunglasses but pack a 70-degree field of view for running Android apps. The real story: every Android XR app works across devices without modification, solving the app shortage that’s plagued Vision Pro and Meta Ray-Bans. Even better, they’ll support iOS through Google’s apps like Maps and YouTube Music. The glasses include bright recording indicators and clear on/off switches to avoid Google Glass’s creepy reputation.
💰 Big Money Deals
Disney wants to drag you into the slop
Disney is paying OpenAI $1 billion to let users create AI-generated videos of Marvel, Pixar, and Star Wars characters through Sora, with plans to feature the content on Disney Plus. The deal turns subscribers into unpaid content creators while Disney avoids paying actual artists. Past Disney AI experiments went predictably wrong, like when Fortnite players made their AI Darth Vader spew hateful speech. The partnership gives OpenAI much-needed cash and Disney a pipeline of low-quality content it doesn’t have to produce itself.
Oboe raises $16 million from a16z for its AI-powered course-generation platform
Oboe, the AI-powered learning platform from Anchor’s co-founders, just raised $16 million from a16z three months after launch. The app generates personalized courses on any topic, complete with chapters, quizzes, and AI-generated podcasts that adapt their tone to the material. The startup is betting big on STEM education and ditching course generation limits in favor of a freemium model with $15-$40 monthly tiers for deeper access. With former Spotify execs at the helm and a16z impressed by the speed of content generation, Oboe wants to reach billions of learners worldwide.
Fal nabs $140M in fresh funding led by Sequoia, tripling valuation to $4.5B
Fal, the startup powering AI image, video, and audio models for developers, just raised $140 million at a $4.5 billion valuation—tripling its worth since July. The Series D was led by Sequoia with backing from Kleiner Perkins and Nvidia. Founded in 2021, Fal provides infrastructure for companies like Adobe, Shopify, and Canva, and has already crossed $200 million in revenue. This marks the company’s third fundraise this year, with the total deal including secondary sales reaching around $250 million.
Accenture and Anthropic partner to boost enterprise AI integration
Accenture and Anthropic are launching a dedicated business group to help enterprises actually deploy AI at scale. The partnership centers on Claude Code, Anthropic’s coding assistant that now claims over half the AI coding market. Accenture will train 30,000 of its own developers on the tool and build industry-specific solutions for regulated sectors like finance and healthcare. The focus is solving the hard parts: justifying inference costs, measuring real productivity gains, and navigating compliance requirements that typically stall AI projects in large organizations.
SoftBank and Nvidia reportedly in talks to fund Skild AI at $14B, nearly tripling its value
SoftBank and Nvidia are reportedly leading a $1+ billion investment in Skild AI at a $14 billion valuation, nearly tripling the robotics startup’s worth from $4.7 billion just seven months ago. The three-year-old company builds robot-agnostic foundation models rather than physical hardware, developing software ‘brains’ that can work across different robot types. The deal reflects surging investor appetite for AI robotics, with competitors like Physical Intelligence raising $600 million at $5.6 billion and Figure securing funding at a $39 billion valuation.
Tiger Global plans cautious venture future with a new $2.2B fund
Tiger Global is raising a $2.2 billion fund after learning some expensive lessons. The firm that backed 315 startups in 2021 alone and helped inflate the venture bubble is now promising a more cautious approach. Their latest fund is up 33% thanks to bets on OpenAI, Waymo, and Databricks, but their pitch letter admits AI valuations are elevated and often unsupported by fundamentals. Translation: they think we’re in another bubble and don’t want to repeat their mistakes.
In AI Play, IBM Acquires Data Streaming Provider Confluent
IBM is acquiring Confluent, a major data streaming platform built on Apache Kafka, in a deal that signals Big Blue’s push to strengthen its AI infrastructure capabilities. Confluent specializes in real-time data streaming, which has become critical for companies building AI applications that need to process and analyze data as it flows. The acquisition gives IBM a powerful tool for helping enterprise clients manage the massive data pipelines required for modern AI systems.
Meta Acquires Wearable AI Startup Limitless
Meta has acquired Limitless, a startup that built an AI-powered wearable pendant designed to record conversations and meetings. The deal brings Limitless’s team and technology into Meta’s Reality Labs division, which handles the company’s VR headsets and smart glasses. Limitless had raised $18 million and launched its $99 pendant earlier this year, positioning it as a personal AI assistant that captures and transcribes real-world interactions. The acquisition signals Meta’s continued push into AI-enhanced wearables beyond its Ray-Ban smart glasses partnership.
Google, Sony Innovation Fund, and Okta back Resemble AI’s push into deepfake detection
Resemble AI just raised $13 million from Google, Sony Innovation Fund, and Okta to fight deepfakes that cost victims $1.56 billion in fraud losses this year. The company’s new DETECT-3B Omni model claims 98% accuracy detecting fake audio, video, images, and text across 38 languages. With analysts predicting generative AI could enable $40 billion in US fraud losses by 2027, Resemble expects deepfake verification to become mandatory for official government communications and predicts companies without detection tools will face higher cyber insurance premiums.
🔬 Technical
A developer’s guide to Gemini Live API in Vertex AI
Google launched the Gemini Live API on Vertex AI, replacing the clunky speech-to-text-to-LLM-to-speech pipeline with a single WebSocket connection that processes native audio in real time. The API reads emotional tone from voice, knows when to interrupt (and when not to), and handles audio, text, and video simultaneously. Google released vanilla JavaScript and React starter templates, plus three production demos including a business advisor that listens to meetings and chimes in with relevant insights. Partner integrations with Daily, Twilio, and LiveKit let developers skip the networking complexity entirely.
Enabling small language models to solve complex reasoning tasks
MIT researchers built DisCIPL, a system where a large language model acts as a planner, dividing complex tasks among smaller models working in parallel. The approach matches OpenAI’s o1 reasoning system in accuracy on constrained tasks like itinerary planning and structured writing, while cutting costs by 80% and using 40% less compute. The trick: using Python code instead of text for reasoning, and running dozens of tiny Llama models simultaneously for pennies compared to premium reasoning models.
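The planner/worker split described above can be sketched with stdlib asyncio. The `plan` and `small_model` functions below are stand-ins I’ve made up for the real LLM calls; this shows the fan-out/gather shape of the approach, not DisCIPL’s actual implementation (which, per the paper, emits Python code as the plan).

```python
import asyncio

def plan(task: str) -> list[str]:
    # Stand-in for the large "planner" model: decompose the task
    # into independent subtasks (DisCIPL emits Python code here).
    return [f"{task}: part {i}" for i in range(3)]

async def small_model(subtask: str) -> str:
    # Stand-in for one small worker model (e.g. a tiny Llama).
    await asyncio.sleep(0)  # simulate an async inference call
    return f"solved({subtask})"

async def solve(task: str) -> list[str]:
    # Fan the subtasks out to workers running in parallel,
    # then gather their partial answers for the planner to merge.
    subtasks = plan(task)
    return await asyncio.gather(*(small_model(s) for s in subtasks))

results = asyncio.run(solve("itinerary"))
```

The cost savings come from the fan-out step: each `small_model` call is cheap, and they run concurrently rather than one big model reasoning serially.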
NeuroDiscoveryBench: Benchmarking AI for neuroscience data analysis
The Allen Institute for AI released NeuroDiscoveryBench, the first benchmark testing how well AI systems can analyze real neuroscience data. The dataset contains 70 questions requiring actual data analysis—not just factoid retrieval—drawn from three major brain research publications. Early results show AI agents like DataVoyager can answer 35% of questions correctly, while models without data access score only 6-8%, proving they can’t simply memorize answers. The benchmark reveals AI is making progress on scientific data analysis but still struggles with complex data preprocessing tasks.
New method improves the reliability of statistical estimations
MIT researchers discovered that standard methods for generating confidence intervals in spatial data analysis are often completely wrong, sometimes claiming 95% confidence when they’ve actually failed to capture the true relationship. The team developed a new technique that assumes data vary smoothly across space rather than assuming source and target data are similar. In tests with real data, their method was the only one that consistently produced reliable confidence intervals, which could help scientists in environmental science, economics, and epidemiology know when to trust their experimental results.
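The failure MIT describes is a gap between *nominal* coverage (the claimed 95%) and *actual* coverage. The generic way to check any interval procedure is a Monte Carlo coverage test, sketched below with a textbook normal-mean interval; this illustrates coverage validation in general, not the MIT method itself.

```python
import random
import statistics

def mean_ci(sample, z=1.96):
    # Textbook ~95% interval for the mean of an i.i.d. sample.
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se

def coverage(n_trials=2000, n=50, true_mean=0.0, seed=7):
    # Monte Carlo check: how often does the claimed 95% interval
    # actually contain the true mean? A calibrated method should
    # land near 0.95; the spatial methods MIT studied did not.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        lo, hi = mean_ci(sample)
        hits += lo <= true_mean <= hi
    return hits / n_trials

rate = coverage()
```

The subtlety in the spatial setting is that you can’t resample i.i.d. draws like this loop does, which is exactly why the standard procedures end up overconfident.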
How we built a multi-agent system for superior business forecasting
Google Cloud and App Orchid built a multi-agent forecasting system that combines two specialized AI agents: one that understands a company’s historical data and another that predicts the future using Google’s TimesFM and Population Dynamics Foundation Model. The agents communicate via Google’s new Agent-to-Agent (A2A) Protocol, which lets AI agents from different organizations work together seamlessly. Users interact with a single orchestrator agent while the specialized agents collaborate behind the scenes to deliver accurate demand forecasts and resource predictions.
How NVIDIA H100 GPUs on CoreWeave’s AI Cloud Platform Delivered a Record-Breaking Graph500 Run
NVIDIA and CoreWeave just crushed the Graph500 benchmark, hitting 410 trillion traversed edges per second with 8,192 H100 GPUs. That’s more than double the competition’s performance while using 9x fewer nodes. The breakthrough: NVIDIA built a GPU-only system that bypasses CPUs entirely for graph processing, using custom software that lets hundreds of thousands of GPU threads send active messages simultaneously instead of just hundreds on CPUs. This could finally bring GPU acceleration to massive sparse workloads in weather forecasting, fluid dynamics, and cybersecurity that have been stuck on CPUs for decades.
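For context on the metric: Graph500 scores breadth-first search in traversed edges per second (TEPS). A minimal level-synchronous BFS that counts traversed edges, the numerator of that metric, looks like this; it’s a CPU toy for illustration and bears no resemblance to the message-passing GPU implementation described above.

```python
from collections import deque

def bfs_traversed_edges(adj, source):
    # Level-synchronous BFS over an adjacency list. Returns the
    # parent map and the number of edges examined, which is the
    # quantity the Graph500 TEPS (traversed edges/second) metric
    # divides by wall-clock time.
    parent = {source: source}
    frontier = deque([source])
    edges = 0
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            edges += 1
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    return parent, edges

# Tiny example graph (undirected, stored as directed pairs).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
parent, edges = bfs_traversed_edges(adj, 0)
```

The hard part at Graph500 scale is that the frontier is irregular and sparse, which is why this workload historically resisted GPU acceleration.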
Validating LLM-as-a-Judge Systems under Rating Indeterminacy
Carnegie Mellon researchers are tackling a fundamental problem with using LLMs as judges: rating indeterminacy. When evaluating AI outputs, there’s often no single “correct” score, yet current validation methods assume one exists. The team developed new frameworks to validate LLM judges even when ground truth is inherently fuzzy, addressing a critical gap as these systems increasingly replace human evaluators in AI development pipelines.
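One way to see the problem: if several ratings are defensible for an item, validating a judge against a single gold label penalizes answers that are actually fine. The sketch below scores a judge against a *set* of admissible ratings per item; it illustrates the idea of indeterminacy-aware validation, not the CMU framework itself.

```python
def judge_agreement(judge_ratings, admissible_sets):
    # When several ratings are defensible ("rating indeterminacy"),
    # score the judge against the admissible set per item rather
    # than a single gold label.
    hits = sum(r in ok for r, ok in zip(judge_ratings, admissible_sets))
    return hits / len(judge_ratings)

# Item 3 has two defensible ratings; a single-label scheme would
# have to pick one arbitrarily and penalize the other.
ratings = [3, 4, 4]
admissible = [{3}, {5}, {3, 4}]
score = judge_agreement(ratings, admissible)
```

Under a single-label scheme with gold `[3, 5, 3]`, the same judge would score 1/3 instead of 2/3, and the difference is pure labeling convention, not judge quality.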
AlphaEvolve on Google Cloud: AI for agentic discovery and optimization
Google Cloud is releasing AlphaEvolve, a Gemini-powered coding agent that automatically discovers and optimizes algorithms through an evolutionary process. It works by having AI models mutate code, testing the results, and iterating on what performs best. Google already used it internally to recover 0.7% of global data center compute, speed up Gemini training by 1%, and accelerate TPU design. Now available in private preview, it’s aimed at industries tackling complex optimization problems in biotech, logistics, finance, and energy.
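The mutate-test-iterate loop behind this kind of evolutionary search can be shown in miniature. Below, a number stands in for the program being evolved and a toy objective stands in for “run the candidate and score it”; this is a generic hill-climbing sketch, not AlphaEvolve’s actual machinery.

```python
import random

def fitness(x):
    # Toy objective standing in for "execute the mutated program
    # and measure its performance" (AlphaEvolve mutates code; here
    # we mutate a single number, with the optimum at x = 3).
    return -(x - 3.0) ** 2

def evolve(generations=200, pop_size=8, seed=1):
    rng = random.Random(seed)
    best = rng.uniform(-10, 10)
    for _ in range(generations):
        # Mutate the current best candidate, evaluate each variant,
        # and keep whichever scores highest (mutate -> test -> iterate).
        variants = [best + rng.gauss(0, 0.5) for _ in range(pop_size)]
        best = max(variants + [best], key=fitness)
    return best

best = evolve()
```

The leverage in the real system comes from the evaluator: as long as a candidate’s quality can be measured automatically (compute recovered, training time, etc.), the loop can grind toward better solutions without human review of each variant.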
GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI
Microsoft Research released GigaTIME, an AI model that converts cheap $5-10 pathology slides into detailed virtual images worth thousands of dollars. Published in Cell, the model analyzed 14,256 cancer patients across 51 hospitals, generating 300,000 virtual images that revealed 1,234 new links between tumor proteins and patient outcomes. The breakthrough makes population-scale cancer research possible without expensive lab equipment, and Microsoft made the model publicly available.
Closing Thoughts
This week underscored a fascinating shift in our AI ecosystem: as every major foundation model releases its own agentic coding assistant, the conversation is pivoting from raw capabilities to rigorous evaluation frameworks—a maturation the Linux Foundation is now attempting to orchestrate across the industry. December’s relative quiet on the M&A front might feel like a breather, but let’s be honest: everyone’s too busy debugging their new AI coding agents to negotiate term sheets. The real story isn’t the pause in dealmaking; it’s that we’re finally asking the right questions about how to measure what these systems actually do versus what they claim to do.
See you next week, where I’ll presumably be writing this newsletter with the help of three different agentic coders, each insisting their approach is superior. YAI 👋
Disclaimer: I use AI to help aggregate and process the news. I do my best to cross-check facts and sources (BTW: sources are available on-demand, or you could just google it :) ), but misinformation may still slip through. Always do your own research and apply critical thinking—with anything you consume these days, AI-generated or otherwise.



Hey, great read as always. Your observation about the 'show me, don't tell me' phase completely resonates with me. Rigorous evaluation and demonstrating real-world utility are really paramount for the field's genuine progress and trustworthiness. It's exactly the direction we need.
Excellent curation on the evaluation frameworks shift. The timing of Linux Foundation's Agentic AI Foundation is particularly interesting because it comes right when everyone's releasing their coding agents and nobody has a clue how to actually benchmark them properly. I've noticed the same tension in enterprise deployments, where teams aren't sure whether they're measuring productivity gains or just task completion rates. The NeuroDiscoveryBench piece showing the 35% vs. 6-8% accuracy difference reveals the gap between data analysis and memorization.