ThursdAI - The top AI news from the past week | ポッドキャスト on Audible

エピソード

📆 ThursdAI - Jul 10 - Grok 4 and 4 Heavy, SmolLM3, Liquid LFM2, Reka Flash & Vision, Perplexity Comet Browser, Devstral 1.1 & More AI News

2025/07/11

Hey everyone, Alex hereDon't you just love "new top LLM" drop weeks? I sure do! This week, we had a watch party for Grok-4, with over 20K tuning in to watch together, as the folks at XAI unveiled their newest and best model around. Two models in fact, Grok-4 and Grok-4 Heavy. We also had a very big open source week, we had the pleasure to chat with the creators of 3 open source models on the show, first with Elie from HuggingFace who just released SmoLM3, then with our friend Maxime Labonne who together with Liquid released a beautiful series of tiny on device models. Finally we had a chat with folks from Reka AI, and as they were on stage, someone in their org published a new open source Reka Flash model 👏 Talk about Breaking News right on the show! It was a very fun week and a great episode, so grab your favorite beverage and let me update you on everything that's going on in AI (as always, show notes at the end of the article) Open Source LLMsAs always, even on big weeks like this, we open the show with Open Source models first and this week, the western world caught up to the Chinese open source models we saw last week! HuggingFace SmolLM3 - SOTA fully open 3B with dual reasoning and long-context (𝕏, HF)We had Eli Bakouch from Hugging Face on the show and you could feel the pride radiating through the webcam. SmolLM 3 isn’t just “another tiny model”; it’s an 11-trillion-token monster masquerading inside a 3-billion-parameter body. It reasons, it follows instructions, and it does both “think step-by-step” and “give me the answer straight” on demand. Hugging Face open-sourced every checkpoint, every dataset recipe, every graph in W&B – so if you ever wanted a fully reproducible, multi-lingual pocket assistant that fits on a single GPU, this is it.They achieved the long context (128 K today, 256 K in internal tests) with a NoPE + YaRN recipe and salvaged the performance drop by literally merging two fine-tunes at 2 a.m. the night before release. Science by duct-tape, but it works: SmolLM 3 edges out Llama-3.2-3B, challenges Qwen-3, and stays within arm’s reach of Gemma-3-4B – all while loading faster than you can say “model soup.” 🤯Liquid AI’s LFM2: Blazing-Fast Models for the Edge (𝕏, Hugging Face)We started the show and I immediately got to hit the #BREAKINGNEWS button, as Liquid AI dropped LFM2, a new series of tiny (350M-1.2B) models focused on Edge devices.We then had the pleasure to host our friend Maxime Labonne, head of Post Training at Liquid AI, to come and tell us all about this incredible effort! Maxime, a legend in the model merging community, explained that LFM2 was designed from the ground up for efficiency. They’re not just scaled-down big models; they feature a novel hybrid architecture with convolution and attention layers specifically optimized for running on CPUs and devices like the Samsung Galaxy S24.Maxime pointed out that Out of the box, they won't replace ChatGPT, but when you fine-tune them for a specific task like translation, they can match models 60 times their size. This is a game-changer for creating powerful, specialized agents that run locally. Definitely a great release and on ThursdAI of all days! Mistrals updated Devstral 1.1 Smashes Coding Benchmarks (𝕏, HF)Mistral didn't want to be left behind on this Open Source bonanza week, and also, today, dropped an update to their excellent coding model Devstral. With 2 versions, an open weights Small and API-only Medium model, they have claimed an amazing 61.6% score on Swe Bench and the open source Small gets a SOTA 53%, the highest among the open source models! 10 points higher than the excellent DeepSwe we covered just last week!The thing to watch here is the incredible price performance, with this model beating Gemini 2.5 Pro and Claude 3.7 Sonnet while being 8x cheaper to run! DevStral small comes to us with an Apache 2.0 license, which we always welcome from the great folks at Mistral! Big Companies LLMs and APIsThere's only 1 winner this week, it seems that other foundational labs were very quiet to see what XAI is going to release. XAI releases Grok-4 and Grok-4 heavy - the world leading reasoning model (𝕏, Try It) Wow, what a show! Space uncle Elon together with the XAI crew, came fashionably late to their own stream, and unveiled the youngest but smartest brother of the Grok family, Grok 4 plus a multiple agents swarm they call Grok Heavy. We had a watch party with over 25K viewers across all streams who joined and watched together, this, fairly historic event! Why historic? Well, for one, they have scaled RL (Reinforcement Learning) for this model significantly more than any other lab did so far, which resulted in an incredible reasoner, able to solve HLE (Humanity's Last Exam) benchmark at an unprecedented 50% (while using tools) The other very much unprecedented result, is on the ArcAGI benchmark, specifically V2, which is designed to be very easy for humans and ...
続きを読む一部表示

1 時間 50 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - Jul 3 - ERNIE 4.5, Hunyuan A13B, MAI-DxO outperforms doctors, RL beats SWE bench, Zuck MSL hiring spree & more AI news

2025/07/03

Hey everyone, Alex here 👋Welcome back to another mind-blowing week on ThursdAI! We’re diving into the first show of the second half of 2025, and let me tell you, AI is not slowing down. This week, we’ve got a massive wave of open-source models from Chinese giants like Baidu and Tencent that are shaking up the game, Meta’s jaw-dropping hiring spree with Zuck assembling an AI dream team, and Microsoft’s medical AI outperforming doctors on the toughest cases. Plus, a real-time AI game engine that had me geeking out on stream. Buckle up, folks, because we’ve got a lot to unpack!ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.We had incredible guests like Michael Luo from Agentica, dropping knowledge on RL coding agents, and Ivan Burazin from Daytona, revealing the infrastructure powering the agent era. We had an incredible episode this week, with over 8,000 views for the live show (as always, Links and Show notes in the end, and the YT live video is here for your convienience if you'd prefer watching) Open Source AI & LLMs: The Chinese Powerhouse WaveMan, if there’s one takeaway from this week, it’s that Chinese companies are absolutely dominating the open-source LLM scene. Let’s break down the heavy hitters that dropped this week and why they’ve got everyone talking.Baidu’s ERNIE 4.5: A Suite of 10 Models to Rule Them AllBaidu, a giant in the Chinese tech space, just flipped the script by open-sourcing their ERNIE 4.5 series. We’re talking 10 distinct models ranging from a whopping 424 billion parameters down to a tiny 0.3 billion. With an Apache 2.0 license, 128K context window, and multimodal capabilities handling image, video, and text input, this is a massive drop. Their biggest Mixture-of-Experts (MoE) model, with 47B active parameters, even outshines OpenAI’s o1 on visual knowledge tasks like DocVQA, scoring 93% compared to o1’s 81%! What’s wild to me is Baidu’s shift. They’ve been running ERNIE in production for years—think chatbots and more across their ecosystem—but they weren’t always open-source fans. Now, they’re not just joining the party, they’re hosting it. If you’re into tinkering, this is your playground—check it out on Hugging Face (HF) or dive into their technical paper (Paper).Tencent’s Hunyuan-A13B-Instruct: WizardLM Team Strikes AgainNext up, Tencent dropped Hunyuan-A13B-Instruct, and oh boy, does it have a backstory. This 80B parameter MoE model (13B active at inference) comes from the legendary WizardLM team, poached from Microsoft after a messy saga where their killer models got yanked from the internet over “safety concerns.” I remember the frustration—we were all hyped, then bam, gone. Now, under Tencent’s wing, they’ve cooked up a model with a 256K context window, hybrid fast-and-slow reasoning modes, and benchmarks that rival DeepSeek R1 and OpenAI o1 on agentic tasks. It scores an impressive 87% on AIME 2024, though it dips to 76% on 2025, hinting at some overfitting quirks. Though for a 12B active parameters model this all is still VERY impressive.Here’s the catch—the license. It excludes commercial use in the EU, UK, and South Korea, and bans usage if you’ve got over 100M active users. So, not as open as we’d like, but for its size, it’s a beast that fits on a single machine, making it a practical choice for many. They’ve also released two datasets, ArtifactsBench and C3-Bench, for code and agent evaluation. I’m not sold on the name—Hunyuan doesn’t roll off the tongue for Western markets—but the WizardLM pedigree means it’s worth a look. Try it out on Hugging Face (HF) or test it directly (Try It).Huawei’s Pangu Pro MoE: Sidestepping Sanctions with Ascend NPUsHuawei entered the fray with Pangu Pro MoE, a 72B parameter model with 16B active per token, and here’s what got me hyped—it’s trained entirely on their own Ascend NPUs, not Nvidia or AMD hardware. This is a bold move to bypass US sanctions, using 4,000 of these chips to preprocess 13 trillion tokens. The result? Up to 1,528 tokens per second per card with speculative decoding, outpacing dense models in speed and cost-efficiency. Performance-wise, it’s close to DeepSeek and Qwen, making it a contender for those outside the Nvidia ecosystem.I’m intrigued by the geopolitical angle here. Huawei’s proving you don’t need Western tech to build frontier models, and while we don’t know who’s got access to these Ascend NPUs, it’s likely a game-changer for Chinese firms. Licensing isn’t as permissive as MIT or Apache, but it’s still open-weight. Peek at it on Hugging Face (HF) for more details.DeepSWE-Preview: RL Coding Agent Hits 59% on SWE-BenchSwitching gears, I was blown away chatting with Michael Luo from Agentica about DeepSWE-Preview, an open-source coding agent trained with ...
続きを読む一部表示

1 時間 36 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📅 ThursdAI - Jun 26 - Gemini CLI, Flux Kontext Dev, Search Live, Anthropic destroys books, Zucks superintelligent team & more AI news

2025/06/26

Hey folks, Alex here, writing from... a undisclosed tropical paradise location 🏝️ I'm on vacation, but the AI news doesn't stop of course, and neither does ThursdAI. So huge shoutout to Wolfram Ravenwlf for running the show this week, Nisten, LDJ and Yam who joined. So... no long blogpost with analysis this week, but I'll def. recommend tuning in to the show that the folks ran, they had a few guests on, and even got some breaking news (new Flux Kontext that's open source) Of course many of you are readers and are here for the links, so I'm including the raw TL;DR + speaker notes as prepared by the folks for the show! P.S - our (rescheduled) hackathon is coming up in San Francisco, on July 12-13 called WeaveHacks, if you're interested at a chance to win a RoboDog, welcome to join us and give it a try. Register HEREOk, that's it for this week, please enjoy the show and see you next week! ThursdAI - June 26th, 2025 - TL;DR* Hosts and Guests* WolframRvnwlf - Host (@WolframRvnwlf)* Co-Hosts - @yampeleg, @nisten, @ldjconfirmed* Guest - Jason Kneen (@jasonkneen) - Discussing MCPs, coding tools, and agents* Guest - Hrishioa (@hrishioa) - Discussing agentic coding and spec-driven development* Open Source LLMs* Mistral Small 3.2 released with improved instruction following, reduced repetition & better function calling (X)* Unsloth AI releases dynamic GGUFs with fixed chat templates (X)* Kimi-VL-A3B-Thinking-2506 multimodal model updated for better video reasoning and higher resolution (Blog)* Chinese Academy of Science releases Stream-Omni, a new Any-to-Any model for unified multimodal input (HF, Paper)* Prime Intellect launches SYNTHETIC-2, an open reasoning dataset and synthetic data generation platform (X)* Big CO LLMs + APIs* Google* Gemini CLI, a new open-source AI agent, brings Gemini 2.5 Pro to your terminal (Blog, GitHub)* Google reduces free tier API limits for previous generation Gemini Flash models (X)* Search Live with voice conversation is now rolling out in AI Mode in the US (Blog, X)* Gemini API is now faster for video and PDF processing with improved caching (Docs)* Anthropic* Claude introduces an "artifacts" space for building, hosting, and sharing AI-powered apps (X)* Federal judge rules Anthropic's use of books for training Claude qualifies as fair use (X)* xAI* Elon Musk announces the successful launch of Tesla's Robotaxi (X)* Microsoft* Introduces Mu, a new language model powering the agent in Windows Settings (Blog)* Meta* Report: Meta pursued acquiring Ilya Sutskever's SSI, now hires co-founders Nat Friedman and Daniel Gross (X)* OpenAI* OpenAI removes mentions of its acquisition of Jony Ive's startup 'io' amid a trademark dispute (X)* OpenAI announces the release of DeepResearch in API + Webhook support (X)* This weeks Buzz* Alex is on vacation; WolframRvnwlf is attending AI Tinkerers Munich on July 25 (Event)* Join W&B Hackathon happening in 2 weeks in San Francisco - grand prize is a RoboDog! (Register for Free)* Vision & Video* MeiGen-MultiTalk code and checkpoints for multi-person talking head generation are released (GitHub, HF)* Google releases VideoPrism for generating adaptable video embeddings for various tasks (HF, Paper, GitHub)* Voice & Audio* ElevenLabs launches 11.ai, a voice-first personal assistant with MCP support (Sign Up, X)* Google Magenta releases Magenta RealTime, an open weights model for real-time music generation (Colab, Blog)* ElevenLabs launches a mobile app for iOS and Android for on-the-go voice generation (X)* AI Art & Diffusion & 3D* Google rolls out Imagen 4 and Imagen 4 Ultra in the Gemini API and Google AI Studio (Blog)* OmniGen 2 open weights model for enhanced image generation and editing is released (Project Page, Demo, Paper)* Tools* OpenMemory Chrome Extension provides shared memory across ChatGPT, Claude, Gemini and more (X)* LM Studio adds MCP support to connect local LLMs with your favorite servers (Blog)* Cursor is now available as a Slack integration (Dashboard)* All Hands AI releases the OpenHands CLI, a model-agnostic, open-source coding agent (Blog, Docs)* Warp 2.0 launches as an Agentic Development Environment with multi-threading (X)* Studies and Others* The /r/LocalLLaMA subreddit is back online after a brief moderation issue (Reddit, News)* Andrej Karpathy's talk "Software 3.0" discusses the future of programming in the age of AI (YouTube, Summary)Thank you, see you next week! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
続きを読む一部表示

1 時間 40 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - June 19 - MiniMax M1 beats R1, OpenAI records your meetings, Gemini in GA, W&B uses Coreweave GPUs & more AI news

2025/06/20

Hey all, Alex here 👋This week, while not the busiest week in releases (we can't get a SOTA LLM every week now can we), was full of interesting open source releases, and feature updates such as the chatGPT meetings recorder (which we live tested on the show, the limit is 2 hours!)It was also a day after our annual W&B conference called FullyConnected, and so I had a few goodies to share with you, like answering the main question, when will W&B have some use of those GPUs from CoreWeave, the answer is... now! (We launched a brand new preview of an inference service with open source models)And finally, we had a great chat with Pankaj Gupta, co-founder and CEO of Yupp, a new service that lets users chat with the top AIs for free, while turning their votes into leaderboards for everyone else to understand which Gen AI model is best for which task/topic. It was a great conversation, and he even shared an invite code with all of us (I'll attach to the TL;DR and show notes, let's dive in!)00:00 Introduction and Welcome01:04 Show Overview and Audience Interaction01:49 Special Guest Announcement and Experiment03:05 Wolfram's Background and Upcoming Hosting04:42 TLDR: This Week's Highlights15:38 Open Source AI Releases32:34 Big Companies and APIs32:45 Google's Gemini Updates42:25 OpenAI's Latest Features54:30 Exciting Updates from Weights & Biases56:42 Introduction to Weights & Biases Inference Service57:41 Exploring the New Inference Playground58:44 User Questions and Model Recommendations59:44 Deep Dive into Model Evaluations01:05:55 Announcing Online Evaluations via Weave01:09:05 Introducing Pankaj Gupta from YUP.AI01:10:23 YUP.AI: A New Platform for Model Evaluations01:13:05 Discussion on Crowdsourced Evaluations01:27:11 New Developments in Video Models01:36:23 OpenAI's New Transcription Service01:39:48 Show Wrap-Up and Future PlansHere's the TL;DR and show notes linksThursdAI - June 19th, 2025 - TL;DR* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed* Guest - @pankaj - co-founder of Yupp.ai* Open Source LLMs* Moonshot AI open-sourced Kimi-Dev-72B (Github, HF)* MiniMax-M1 456B (45B Active) - reasoning model (Paper, HF, Try It, Github)* Big CO LLMs + APIs* Google drops Gemini 2.5 Pro/Flash GA, 2.5 Flash-Lite in Preview ( Blog, Tech report, Tweet)* Google launches Search Live: Talk, listen and explore in real time with AI Mode (Blog)* OpenAI adds MCP support to Deep Research in chatGPT (X, Docs)* OpenAI launches their meetings recorder in mac App (docs)* Zuck update: Considering bringing Nat Friedman and Daniel Gross to Meta (information)* This weeks Buzz* NEW! W&B Inference provides a unified interface to access and run top open-source AI models (inference, docs)* NEW! W&B Weave Online Evaluations delivers real-time production insights and continuous evaluation for AI agents across any cloud. (X)* The new platform offers "metal-to-token" observability, linking hardware performance directly to application-level metrics.* Vision & Video* ByteDance new video model beats VEO3 - Seedance.1.0 mini (Site, FAL)* MiniMax Hailuo 02 - 1080p native, SOTA instruction following (X, FAL)* Midjourney video is also here - great visuals (X)* Voice & Audio* Kyutai launches open-source, high-throughput streaming Speech-To-Text models for real-time applications (X, website)* Studies and Others* LLMs Flunk Real-World Coding Contests, Exposing a Major Skill Gap (Arxiv)* MIT Study: ChatGPT Use Causes Sharp Cognitive Decline (Arxiv)* Andrej Karpathy's "Software 3.0": The Dawn of English as a Programming Language (youtube, deck)* Tools* Yupp launches with 500+ AI models, a new leaderboard, and a user-powered feedback economy - use thursdai link* to get 50% extra credits* BrowserBase announces director.ai - an agent to run things on the web* Universal system prompt for reduction of hallucination (from Reddit)*Disclosure: while this isn't a paid promotion, I do think that yupp has a great value, I do get a bit more credits on their platform if you click my link and so do you. You can go to yupp.ai and register with no affiliation if you wish. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
続きを読む一部表示

1 時間 42 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - June 12 - Meta’s $15B ScaleAI Power Play, OpenAI’s o3-pro & 90% Price Drop!

2025/06/13

Hey folks, this is Alex, finally back home! This week was full of crazy AI news, both model related but also shifts in the AI landscape and big companies, with Zuck going all in on scale & execu-hiring Alex Wang for a crazy $14B dollars. OpenAI meanwhile, maybe received a new shipment of GPUs? Otherwise, it’s hard to explain how they have dropped the o3 price by 80%, while also shipping o3-pro (in chat and API). Apple was also featured in today’s episode, but more so for the lack of AI news, completely delaying the “very personalized private Siri powered by Apple Intelligence” during WWDC25 this week. We had 2 guests on the show this week, Stefania Druga and Eric Provencher (who builds RepoPrompt). Stefania helped me cover the AI Engineer conference we all went to last week, and shared some cool Science CoPilot stuff she’s working on, while Eric is the GOTO guy for O3-pro helped us understand what this model is great for! As always, TL;DR and show notes at the bottom, video for those who prefer watching is attached below, let’s dive in! Big Companies LLMs & APIsLet’s start with big companies, because the landscape has shifted, new top reasoner models dropped and some huge companies didn’t deliver this week! Zuck goes all in on SuperIntelligence - Meta’s $14B stake in ScaleAI and Alex WangThis may be the most consequential piece of AI news today. Fresh from the dissapointing results of LLama 4, reports of top researchers leaving the Llama team, many have decided to exclude Meta from the AI race. We have a saying at ThursdAI, don’t bet against Zuck! Zuck decided to spend a lot of money (nearly 20% of their reported $65B investment in AI infrastructure) to get a 49% stake in Scale AI and bring Alex Wang it’s (now former) CEO to lead the new Superintelligence team at Meta. For folks who are not familiar with Scale, it’s a massive company in providing human annotated data services to all the big AI labs, Google, OpenAI, Microsoft, Anthropic.. all of them really. Alex Wang, is the youngest self made billionaire because of it, and now Zuck not only has access to all their expertise, but also to a very impressive AI persona, who could help revive the excitement about Meta’s AI efforts, help recruit the best researchers, and lead the way inside Meta. Wang is also an outspoken China hawk who spends as much time in congressional hearings as in Slack, so the geopolitics here are … spicy. Meta just stapled itself to the biggest annotation funnel on Earth, hired away Google’s Jack Rae (who was on the pod just last week, shipping for Google!) for brainy model alignment, and started waving seven-to-nine-figure comp packages at every researcher with “Transformer” in their citation list. Whatever disappointment you felt over Llama-4’s muted debut, Zuck clearly felt it too—and responded like a founder who still controls every voting share. OpenAI’s Game-Changer: o3 Price Slash & o3-pro launches to top the intelligence leaderboards!Meanwhile OpenAI dropping not one, but two mind-blowing updates. First, they’ve slashed the price of o3—their premium reasoning model—by a staggering 80%. We’re talking from $40/$10 per million tokens down to just $8/$2. That’s right, folks, it’s now in the same league as Claude Sonnet cost-wise, making top-tier intelligence dirt cheap. I remember when a price drop of 80% after a year got us excited; now it’s 80% in just four months with zero quality loss. They’ve confirmed it’s the full o3 model—no distillation or quantization here. How are they pulling this off? I’m guessing someone got a shipment of shiny new H200s from Jensen!And just when you thought it couldn’t get better, OpenAI rolled out o3-pro, their highest intelligence offering yet. Available for pro and team accounts, and via API (87% cheaper than o1-pro, by the way), this model—or consortium of models—is a beast. It’s topping charts on Artificial Analysis, barely edging out Gemini 2.5 as the new king. Benchmarks are insane: 93% on AIME 2024 (state-of-the-art territory), 84% on GPQA Diamond, and nearing a 3000 ELO score on competition coding. Human preference tests show 64-66% of folks prefer o3-pro for clarity and comprehensiveness across tasks like scientific analysis and personal writing.I’ve been playing with it myself, and the way o3-pro handles long context and tough problems is unreal. As my friend Eric Provencher (creator of RepoPrompt) shared on the show, it’s surgical—perfect for big refactors and bug diagnosis in coding. It’s got all the tools o3 has—web search, image analysis, memory personalization—and you can run it in background mode via API for async tasks. Sure, it’s slower due to deep reasoning (no streaming thought tokens), but the consistency and depth? Worth it. Oh, and funny story—I was prepping a talk for Hamel Hussain’s evals course, with a slide saying “don’t use large reasoning models if budget’s tight.” The day...
続きを読む一部表示

1 時間 33 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - Jun 5, 2025 - Live from AI Engineer with Swyx, new Gemini 2.5 with Logan K and Jack Rae, Self Replicating agents with Morph Labs

2025/06/06

Hey folks, this is Alex, coming to you LIVE from the AI Engineer Worlds Fair! What an incredible episode this week, we recorded live from floor 30th at the Marriott in SF, while Yam was doing live correspondence from the floor of the AI Engineer event, all while Swyx, the cohost of Latent Space podcast, and the creator of AI Engineer (both the conference and the concept itself) joined us for the whole stream - here’s the edited version, please take a look. We've had around 6500 people tune in, and at some point we got 2 surprise guests, straight from the keynote stage, Logan Kilpatrick (PM for AI Studio and lead cheerleader for Gemini) and Jack Rae (principal scientist working on reasoning) joined us for a great chat about Gemini! Mind was absolutely blown! They have just launched the new Gemini 2.5 Pro and I though it would only be fitting to let their new model cover this podcast this week (so below is fully AI generated ... non slop I hope). The show notes and TL;DR is as always in the end. Okay, enough preamble… let's dive into the madness!🤯 Google Day at AI Engineer: New Gemini 2.5 Pro and a Look Inside the Machine's MindFor the first year of this podcast, a recurring theme was us asking, "Where's Google?" Well, it's safe to say that question has been answered with a firehose of innovation. We were lucky enough to be joined by Google DeepMind's Logan Kilpatrick and Jack Rae, the tech lead for "thinking" within Gemini, literally moments after they left the main stage.Surprise! A New Gemini 2.5 Pro Drops LiveLogan kicked things off with a bang, officially announcing a brand new, updated Gemini 2.5 Pro model right there during his keynote. He called it "hopefully the final update to 2.5 Pro," and it comes with a bunch of performance increases, closing the gap on feedback from previous versions and hitting SOTA on benchmarks like Aider.It's clear that the organizational shift to bring the research and product teams together under the DeepMind umbrella is paying massive dividends. Logan pointed out that Google has seen a 50x increase in AI inference over the past year. The flywheel is spinning, and it's spinning fast.How Gemini "Thinks"Then things got even more interesting. Jack Rae gave us an incredible deep dive into what "thinking" actually means for a language model. This was one of the most insightful parts of the conference for me.For years, the bottleneck for LLMs has been test-time compute. Models were trained to respond immediately, applying a fixed amount of computation to go from a prompt to an answer, no matter how hard the question. The only way to get a "smarter" response was to use a bigger model.Jack explained that "Thinking" shatters this limitation. Mechanically, Gemini now has a "thinking stage" where it can generate its own internal text—hypothesizing, testing, correcting, and reasoning—before committing to a final answer. It's an iterative loop of computation that the model can dynamically control, using more compute for harder problems. It learns how to think using reinforcement learning, getting a simple "correct" or "incorrect" signal and backpropagating that to shape its reasoning strategies.We're already seeing the results of this. Jack showed a clear trend: as models get better at reasoning, they're also using more test-time compute. This paradigm also gives developers a "thinking budget" slider in the API for Gemini 2.5 Flash and Pro, allowing a continuous trade-off between cost and performance.The future of this is even wilder. They're working on DeepThink, a high-budget mode for extremely hard problems that uses much deeper, parallel chains of thought. On the tough USA Math Olympiad, where the SOTA was negligible in January, 2.5 Pro reached the 50th percentile of human participants. DeepThink pushes that to the 65th percentile.Jack’s ultimate vision is inspired by the mathematician Ramanujan, who derived incredible theorems from a single textbook by just thinking deeply. The goal is for models to do the same—contemplate a small set of knowledge so deeply that they can push the frontiers of human understanding. Absolutely mind-bending stuff.🤖 MorphLabs and the Audacious Quest for Verified SuperintelligenceJust when I thought my mind couldn't be bent any further, we were joined by Jesse Han, the founder and CEO of MorphLabs. Fresh off his keynote, he laid out one of the most ambitious visions I've heard: building the infrastructure for the Singularity and developing "verified superintelligence."The big news was that Christian Szegedy is joining MorphLabs as Chief Scientist. For those who don't know, Christian is a legend—he invented batch norm and adversarial examples, co-founded XAI, and led code reasoning for Grok. That's a serious hire.Jesse’s talk was framed around a fascinating question: "What does it mean to have empathy for the machine?" He argues that as AI develops personhood, we need to think about what it wants. And what it wants, according ...
続きを読む一部表示

1 時間 44 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news

2025/05/29

Hey everyone, Alex here 👋Welcome back to another absolutely wild week in AI! I'm coming to you live from the Fontainebleau Hotel in Vegas at the Imagine AI conference, and wow, what a perfect setting to discuss how AI is literally reimagining our world. After last week's absolute explosion of releases (Claude Opus 4, Google I/O madness, OpenAI Codex and Jony colab), this week gave us a chance to breathe... sort of. Because even in a "quiet" week, we still got a new DeepSeek model that's pushing boundaries, and the entire internet discovered that we might all just be prompts. Yeah, it's been that kind of week!Before we dive in, quick shoutout to everyone who joined us live - we had some technical hiccups with the Twitter Spaces audio (sorry about that!), but the YouTube stream was fire. And speaking of fire, we had two incredible guests join us: Charlie Holtz from Chorus (the multi-model chat app that's changing how we interact with AI) and Linus Eckenstam, who's been traveling the AI conference circuit and bringing us insights from the frontlines of the generative AI revolution.Open Source AI & LLMs: DeepSeek Whales & Mind-Bending PapersDeepSeek dropped R1-0528 out of nowhere, an update to their reasoning beast with some serious jumps in performance. We’re talking AIME at 91 (beating previous scores by a mile), LiveCodeBench at 73, and SWE verified at 57.6. It’s edging closer to heavyweights like o3, and folks on X are already calling it “clearer thinking.” There was hype it might’ve been R2, but the impact didn’t quite crash the stock exchange like past releases. Still, it’s likely among the best open-weight models out there.So what's new? Early reports and some of my own poking around suggest this model "thinks clearer now." Nisten mentioned that while previous DeepSeek models sometimes liked to "vibe around" and explore the latent space before settling on an answer, this one feels a bit more direct.And here’s the kicker—they also released an 8B distilled version based on Qwen3, runnable on your laptop. Yam called it potentially the best 8B model to date, and you can try it on Ollama right now. No need for a monster rig! The Mind-Bending "Learning to Reason Without External Rewards" PaperOkay, this paper result broke my brain, and apparently everyone else's too. This paper shows that models can improve through reinforcement learning with its own intuition of whether or not it's correct. 😮It's like the placebo effect for AI! The researchers trained models without telling them what was good or bad, but rather, utilized a new framework called Intuitor, where the reward was based on how the "self certainty". The thing that took my whole timeline by storm is, it works! GRPO (Group Policy Optimization) - the framework that DeepSeek gave to the world with R1 is based on external rewards (human optimize) and Intuitor seems to be mathcing or even exceeding some of GRPO results when Qwen2.5 3B was used to finetune. Incredible incredible stuffBig Companies LLMs & APIsClaude Opus 4: A Week Later – The Dev Darling?Claude Opus 4, whose launch we celebrated live on the show, has had a week to make its mark. Charlie Holtz, who's building Chorus (more on that amazing app in a bit!), shared that while it's sometimes "astrology" to judge the vibes of a new model, Opus 4 feels like a step change, especially in coding. He mentioned that Claude Code, powered by Opus 4 (and Sonnet 4 for implementation), is now tackling GitHub issues that were too complex just weeks ago. He even had a coworker who "vibe coded three websites in a weekend" with it – that's a tangible productivity boost!Linus Eckenstam highlighted how Lovable.dev saw their syntax error rates plummet by nearly 50% after integrating Claude 4. That’s quantifiable proof of improvement! It's clear Anthropic is leaning heavily into the developer/coding space. Claude Opus is now #1 on the LMArena WebDev arena, further cementing its reputation.I had my own magical moment with Opus 4 this week. I was working on an MCP observability talk for the AI Engineer conference and trying to integrate Weave (our observability and evals framework at Weights & Biases) into a project. Using Windsurf's Cascade agent (which now lets you bring your own Opus 4 key, by the way – good move, Windsurf!), Opus 4 not only tried to implement Weave into my agent but, when it got stuck, it figured out it had access to the Weights & Biases support bot via our MCP tool. It then formulated a question to the support bot (which is also AI-powered!), got an answer, and used that to fix the implementation. It then went back and checked if the Weave trace appeared in the dashboard! Agents talking to agents to solve a problem, all while I just watched – my jaw was on the floor. Absolutely mind-blowing.Quick Hits: Voice Updates from OpenAI & AnthropicOpenAI’s Advanced Voice Mode finally sings—yes, I’ve been waiting for this! It can belt out tunes like Mariah Carey, ...
続きを読む一部表示

1 時間 28 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く
📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week

2025/05/23

Hey folks, Alex here, welcome back to ThursdAI! And folks, after the last week was the calm before the storm, "The storm came, y'all" – that's an understatement. This wasn't just a storm; it was an AI hurricane, a category 5 of announcements that left us all reeling (in the best way possible!). From being on the ground at Google I/O to live-watching Anthropic drop Claude 4 during our show, it's been an absolute whirlwind.This week was so packed, it felt like AI Christmas, with tech giants and open-source heroes alike showering us with gifts. We saw OpenAI play their classic pre-and-post-Google I/O chess game, Microsoft make some serious open-source moves, Google unleash an avalanche of updates, and Anthropic crash the party with Claude 4 Opus and Sonnet live stream in the middle of ThursdAI!So buckle up, because we're about to try and unpack this glorious chaos. As always, we're here to help you collectively know, learn, and stay up to date, so you don't have to. Let's dive in! (TL;DR and links in the end) Open Source LLMs Kicking Things OffEven with the titans battling, the open-source community dropped some serious heat this week. It wasn't the main headline grabber, but the releases were significant!Gemma 3n: Tiny But Mighty MatryoshkaFirst up, Google's Gemma 3n. This isn't just another small model; it's a "Nano-plus" preview, a 4-billion parameter MatFormer (Matryoshka Transformer – how cool is that name?) model designed for mobile-first multimodal applications. The really slick part? It has a nested 2-billion parameter sub-model that can run entirely on phones or Chromebooks.Yam was particularly excited about this one, pointing out the innovative "model inside another model" design. The idea is you can use half the model, not depth-wise, but throughout the layers, for a smaller footprint without sacrificing too much. It accepts interleaved text, image, audio, and video, supports ASR and speech translation, and even ships with RAG and function-calling libraries for edge apps. With a 128K token window and responsible AI features baked in, Gemma 3n is looking like a powerful tool for on-device AI. Google claims it beats prior 4B mobile models on MMLU-Lite and MMMU-Mini. It's an early preview in Google AI Studio, but it definitely flies on mobile devices.Mistral & AllHands Unleash Devstral 24BThen we got a collaboration from Mistral and AllHands: Devstral, a 24-billion parameter, state-of-the-art open model focused on code. We've been waiting for Mistral to drop some open-source goodness, and this one didn't disappoint.Nisten was super hyped, noting it beats o3-Mini on SWE-bench verified – a tough benchmark! He called it "the first proper vibe coder that you can run on a 3090," which is a big deal for coders who want local power and privacy. This is a fantastic development for the open-source coding community.The Pre-I/O Tremors: OpenAI & Microsoft Set the StageAs we predicted, OpenAI couldn't resist dropping some news right before Google I/O.OpenAI's Codex Returns as an AgentOpenAI launched Codex – yes, that Codex, but reborn as an asynchronous coding agent. This isn't just a CLI tool anymore; it connects to GitHub, does pull requests, fixes bugs, and navigates your codebase. It's powered by a new coding model fine-tuned for large codebases and was SOTA on SWE Agent when it dropped. Funnily, the model is also called Codex, this time, Codex-1. And this gives us a perfect opportunity to talk about the emerging categories I'm seeing among Code Generator agents and tools:* IDE-based (Cursor, Windsurf): Live pair programming in your editor* Vibe coding (Lovable, Bolt, v0): "Build me a UI" style tools for non-coders* CLI tools (Claude Code, Codex-cli): Terminal-based assistants* Async agents (Claude Code, Jules, Codex, GitHub Copilot agent, Devin): Work on your repos while you sleep, open pull requests for you to review, asyncCodex (this new one) falls into category number 4, and with today's release, Cursor seems to also strive to get to category number 4 with background processing. Microsoft BUILD: Open Source Copilot and Copilot Agent ModeThen came Microsoft Build, their huge developer conference, with a flurry of announcements.The biggest one for me? GitHub Copilot's front-end code is now open source! The VS Code editor part was already open, but the Copilot integration itself wasn't. This is a massive move, likely a direct answer to the insane valuations of VS Code clones like Cursor. Now, you can theoretically clone GitHub Copilot with VS Code and swing for the fences.GitHub Copilot also launched as an asynchronous coding assistant, very similar in function to OpenAI's Codex, allowing it to be assigned tasks and create/update PRs. This puts Copilot right into category 4 of code assistants, and with the native Github Integration, they may actually have a leg up in this race!And if that wasn't enough, Microsoft is adding MCP (Model Context Protocol) support directly into the Windows OS. The ...
続きを読む一部表示

1 時間 28 分

カートのアイテムが多すぎます

ご購入は五十タイトルがカートに入っている場合のみです。

カートに追加できませんでした。

しばらく経ってから再度お試しください。

ウィッシュリストに追加できませんでした。

しばらく経ってから再度お試しください。

ほしい物リストの削除に失敗しました。

しばらく経ってから再度お試しください。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

無料で聴く

特集

カテゴリー別

エピソード

📆 ThursdAI - Jul 10 - Grok 4 and 4 Heavy, SmolLM3, Liquid LFM2, Reka Flash & Vision, Perplexity Comet Browser, Devstral 1.1 & More AI News

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - Jul 3 - ERNIE 4.5, Hunyuan A13B, MAI-DxO outperforms doctors, RL beats SWE bench, Zuck MSL hiring spree & more AI news

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📅 ThursdAI - Jun 26 - Gemini CLI, Flux Kontext Dev, Search Live, Anthropic destroys books, Zucks superintelligent team & more AI news

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - June 19 - MiniMax M1 beats R1, OpenAI records your meetings, Gemini in GA, W&B uses Coreweave GPUs & more AI news

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - June 12 - Meta’s $15B ScaleAI Power Play, OpenAI’s o3-pro & 90% Price Drop!

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - Jun 5, 2025 - Live from AI Engineer with Swyx, new Gemini 2.5 with Logan K and Jack Rae, Self Replicating agents with Morph Labs

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました

📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week

カートのアイテムが多すぎます

カートに追加できませんでした。

ウィッシュリストに追加できませんでした。

ほしい物リストの削除に失敗しました。

ポッドキャストのフォローに失敗しました

ポッドキャストのフォロー解除に失敗しました