ThursdAI - The top AI news from the past week

Author: From Weights & Biases. Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI in the past week.
  • Summary

  • Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces to discuss everything major and important that happened in the world of AI in the past week. Topics include LLMs, open source, new capabilities, OpenAI and its competitors, new LLM models, AI art and diffusion, and much more.

    sub.thursdai.news
    Alex Volkov

Episodes
  • 💯 ThursdAI - 100th episode 🎉 - Meta Llama 4, Google tons of updates, ChatGPT memory, WandB MCP manifesto & more AI news
    2025/04/10
Hey folks, Alex here, celebrating an absolutely crazy (to me) milestone: episode #100 of ThursdAI 👏 That's 100 episodes in a year and a half (I started publishing much later than I started going live, and the first episode was embarrassing), 100 episodes that documented INCREDIBLE AI progress. As we mention on the show today, we used to be excited by context windows jumping from 4K to 16K! I want to extend a huge thank you to every one of you who subscribes, listens to the show on podcasts, joins the live recording (we regularly get over 1K live viewers across platforms), or shares with friends, and the highest thank you to the paid supporters! 🫶 Sharing the AI news progress with you energizes me to keep going, despite the absolute avalanche of news every week.

And what a perfect way to celebrate the 100th episode: a week in which Meta dropped Llama 4, sending the open-source world into a frenzy (and a bit of chaos), Google unleashed a firehose of announcements at Google Next, and the agent ecosystem got a massive boost with MCP and A2A developments. We also had fantastic guests join us: Michael Lou diving deep into the impressive DeepCoder-14B, and Liad Yosef & Ido Salomon sharing their wild ride creating the viral GitMCP tool.

I really loved today's show, and I encourage those of you who only read to give this a watch/listen, and those of you who only listen to enjoy the recorded version (though it's longer and less edited!). Now let's dive in, there's a LOT to talk about (TL;DR and show notes, as always, at the end of the newsletter).

Open Source AI & LLMs: Llama 4 Takes Center Stage (Amidst Some Drama)

Meta drops Llama 4 - Scout 109B/17BA & Maverick 400B/17BA (Blog, HF, Try It)

This was by far the biggest news of the week, and it dropped... on a Saturday? (I was on the mountain ⛷️! What are you doing, Zuck?) Meta dropped the long-awaited Llama 4 models, huge ones this time:

* Llama 4 Scout: 17B active parameters out of ~109B total (16 experts).
* Llama 4 Maverick: 17B active parameters out of a whopping ~400B total (128 experts).
* Unreleased: Behemoth, a 288B-active, 2-trillion-total-parameter chonker!
* Both base and instruct-finetuned models were released.

These new models are all multimodal, multilingual MoE (mixture of experts) architectures, trained in FP8 on significantly more tokens (around 30 trillion!) with interleaved attention (iRoPE) and a refined SFT > RL > DPO post-training pipeline. (A quick sketch of what "active parameters" means in practice follows this episode's summary.)

The biggest highlight is the stated context windows: 10M tokens for Scout and 1M for Maverick, which is insane (and honestly, I haven't yet seen a provider that is even remotely able to support anything of this length, nor do I have the tokens to verify it).

The messy release - Big Oof from Big Zuck

Not only did Meta release on a Saturday, messing up people's weekends; they also announced a high LM Arena score, but the model they provided to LMArena was... not the model they released!? This prompted LMArena to release a dataset of 2,000 chats, and truly, some examples are quite damning and show just how unreliable LMArena can be as a vibe eval. Additionally, over the following days, folks noticed discrepancies between the eval scores Meta published and independent attempts to reproduce them, including our own Wolfram, who noticed that a HIGHLY quantized (read: reduced precision) version of Scout performed better on his laptop than on the Together API inference endpoint!?

We've chatted on the show about how this may be due to some vLLM issues, and speculated about other potential reasons. Worth noting the official response from Ahmad Al-Dahle, head of Llama at Meta, who cited stability issues between providers and absolutely denied any training on benchmarks.

Too big for its own good (and us?)

One of the main criticisms the OSS community had about these releases is that for many of us, the reason for celebrating open-source AI is the ability to run models without a network, privately, on our own devices. Llama 3 was released in 8B-70B distilled versions, and that was incredible for us local AI enthusiasts! These models, despite having "only" 17B active params, are huge and way too big to run on most local hardware. So the question is: if we're getting a model that HAS to run on a service, why not use Gemini 2.5, which is MUCH better and faster and cheaper than Llama? Why didn't Meta release those smaller sizes? Was it due to an inability to beat Qwen/DeepSeek convincingly? 🤔

My Take

Despite the absolutely chaotic rollout, this is still a monumental effort from Meta. They spent millions on compute and salaries to give this to the community. Yes, no papers yet, the LM Arena thing was weird, and the inference wasn't ready. But Meta is standing up for Western open source in a big way. We have to celebrate the core contribution while demanding better rollout practices next time. As Wolfram rightly said, the real test will be the fine-tunes and distillations the community ...
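Since the "active vs. total parameters" distinction above is what makes these models both cheap per token and heavy on memory, here is a minimal, illustrative sketch of the arithmetic in Python. The expert/shared split used below is a hypothetical round-number guess for illustration, not Llama 4's actual configuration:

```python
def moe_active_params(expert_params_total: float, n_experts: int,
                      experts_per_token: int, shared_params: float) -> float:
    """Per-token 'active' parameters in a mixture-of-experts model:
    the always-on shared weights plus only the routed experts."""
    per_expert = expert_params_total / n_experts
    return shared_params + experts_per_token * per_expert

# Hypothetical split for a ~109B-total model with 16 experts:
# ~98B of expert weights and ~11B of shared (attention etc.) weights.
active = moe_active_params(expert_params_total=98e9, n_experts=16,
                           experts_per_token=1, shared_params=11e9)
print(f"{active / 1e9:.1f}B active of ~109B total")  # -> 17.1B active
```

The takeaway: per-token compute scales with the ~17B active parameters, but all ~109B weights must still sit in memory, which is exactly why these releases strain local hardware despite the modest active count.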
    1 hr 32 min
  • ThursdAI - Apr 3rd - OpenAI Goes Open?! Gemini Crushes Math, AI Actors Go Hollywood & MCP, Now with Observability?
    2025/04/03
Woo! Welcome back to ThursdAI, show number 99! Can you believe it? We are one show away from hitting the big 100, which is just wild to me. And speaking of milestones, we just crossed 100,000 downloads on Substack alone! [Insert celebratory sound effect here 🎉] Honestly, knowing so many of you tune in every week genuinely fills me with joy, but also a real commitment to keep bringing you the high-signal, zero-fluff AI news you count on. Thank you for being part of this amazing community! 🙏

And what a week it's been! I started out busy at work, playing with the native image generation in ChatGPT like everyone else (all 130 million of us!), and then I looked at my notes for today... an absolute mountain of updates. Seriously, one of those weeks where open source just exploded, big companies dropped major news, and the vision/video space is producing stuff that's crossing the uncanny valley.

We've got OpenAI teasing a big open-source release (yes, OpenAI might actually be open again!), Gemini 2.5 showing superhuman math skills, Amazon stepping into the agent ring, truly mind-blowing AI character generation from Meta, and a personal update on making the Model Context Protocol (MCP) observable. Plus, we had some fantastic guests join us live!

So buckle up, grab your coffee (or whatever gets you through the AI whirlwind), because we have a lot to cover. Let's dive in! (As always, show notes and links at the end.)

OpenAI Makes Waves: Open Source Tease, Tough Evals & Billions Raised

It feels like OpenAI was determined to dominate the headlines this week, hitting us from multiple angles.

First, the potentially massive news: OpenAI is planning to release a new open-source model in the "coming months"! Kevin Weil tweeted that they're working on a "highly capable open language model" and are actively seeking developer feedback through dedicated sessions (sign up here if interested) to "get this right." Word on the street is that this could be a powerful reasoning model. Sam Altman also cheekily added they won't slap on a Llama-style <700M-user license limit. Seeing OpenAI potentially re-embrace its "Open" roots with a potentially SOTA model is huge. We'll be watching like hawks!

Second, they dropped PaperBench, a brutal new benchmark evaluating an AI's ability to replicate ICML 2024 research papers from scratch (read the paper, write code, run experiments, match results, no peeking at the original code!). It's incredibly detailed (>8,300 tasks) and even includes meta-evaluation for the LLM judge they built (the Nano-Eval framework was also open sourced). The kicker? Claude 3.5 Sonnet (New) came out on top with just a 21.0% replication score (human PhDs got 41.4%). Props to OpenAI for releasing an eval where they don't even win; that's what real benchmarking integrity looks like. You can find the code on GitHub and read the full paper here.

Third, the casual 40 billion dollar funding round led by SoftBank, valuing the company at 300 billion. Yes, billion with a B. More than Coke, more than Disney. The blog post was hilariously short for such a massive number. They also mentioned 500 million weekly ChatGPT users and an insane onboarding rate (1M users/hr!) thanks to native image generation, with especially huge growth in India. The scale is just mind-boggling.

Oh, and for fun, try the new grumpy, EMO "Monday" voice in advanced voice mode. It's surprisingly entertaining.

Open Source Powerhouses: Nomic & OpenHands Deliver SOTA

Beyond the OpenAI buzz, the open-source community delivered some absolute gems, and we had guests from two key projects join us!

Nomic Embed Multimodal: SOTA Embeddings for Visual Docs

Our friends at Nomic AI are back with a killer release! We had Zach Nussbaum on the show discussing Nomic Embed Multimodal. These are new 3B & 7B parameter embedding models (available on Hugging Face) built on Alibaba's excellent Qwen2.5-VL. They achieved SOTA on visual document retrieval by cleverly embedding interleaved text-image sequences, perfect for PDFs and complex webpages.

Zach highlighted that they chose the Qwen base because high-performing open VLMs under 3B params are still scarce, making it a solid foundation. Importantly, the 7B model comes with an Apache 2.0 license, and they've open sourced weights, code, and data. They offer both a powerful multi-vector version (ColNomic) and a faster single-vector one (see the sketch after this episode's summary for the difference). Huge congrats to Nomic!

OpenHands LM 32B & Agent: Accessible SOTA Coding

Remember OpenDevin? It evolved into OpenHands, and the team just dropped their own OpenHands LM 32B! We chatted with co-founder Xingyao "Elle" Wang about this impressive Qwen 2.5 finetune (MIT licensed, on Hugging Face).

It hits a remarkable 37.2% on SWE-Bench Verified (a coding benchmark measuring real-world repo tasks), competing with much larger models. Elle stressed they didn't just chase code-completion scores; they focused on tuning for agentic capabilities (tool use, planning, self-correction) using trajectories from their ...
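To make the single-vector vs. multi-vector distinction concrete, here is a minimal sketch of the two scoring styles. This is generic ColBERT-style late interaction (MaxSim), not Nomic's actual implementation, and the toy embeddings are made up:

```python
from math import sqrt

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def single_vector_score(query_vec, doc_vec):
    # One pooled embedding per query and per document: a single
    # comparison, which is what makes single-vector retrieval fast.
    return cos(query_vec, doc_vec)

def multi_vector_score(query_vecs, doc_vecs):
    # Late interaction (MaxSim): every query token keeps its own vector,
    # matches its best document token, and the maxima are summed.
    # More precise, but more storage and compute per document.
    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 3-dimensional "embeddings", for illustration only.
query_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_tokens = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 1.0]]
print(single_vector_score([0.5, 0.5, 0.0], [0.4, 0.4, 0.2]))
print(multi_vector_score(query_tokens, doc_tokens))
```

The design trade-off is exactly the one the episode describes: the single-vector model compares one pair of vectors per document, while the multi-vector (ColNomic-style) model pays per-token cost for finer-grained matching.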
    1 hr 38 min
  • 📆 ThursdAI - Mar 27 - Gemini 2.5 Takes #1, OpenAI Goes Ghibli, DeepSeek V3 Roars, Qwen Omni, Wandb MCP & more AI news
    2025/03/27
Hey everyone, Alex here 👋 Welcome back to ThursdAI! And folks, what an absolutely insane week it's been in the world of AI. Seriously, as I mentioned on the show, we don't often get weeks this packed with game-changing releases.

We saw Google emphatically reclaim the #1 LLM spot with Gemini 2.5 Pro (and OpenAI try really hard to hit back with a new ChatGPT), DeepSeek dropped a monster 685B-parameter open-source model, Qwen launched a tiny but mighty 7B Omni model that handles voice and video like a champ, and OpenAI finally gave us native image generation in GPT-4o, immediately unleashing a tidal wave of Ghibli-fication across the internet. It was intense, with the big players seemingly trying to one-up each other constantly. Remember when Sam Altman dropped Advanced Voice Mode right when Google was about to show Astra? This week was that, on steroids.

We had a fantastic show trying to unpack it all, joined by the brilliant Tulsee Doshi from the Google Gemini team, my Weights & Biases colleague Morgan McQuire talking MCP tools, and the MLX king himself, Prince Canuma. Plus, my awesome co-hosts Wolfram, Nisten, and Yam were there to add their insights. (Watch the LIVE recap, or keep reading and listen to the audio pod.)

So, grab your beverage of choice, buckle up, and let's try to make sense of this AI whirlwind! (TL;DR and show notes at the bottom 👇)

Big CO LLMs + APIs

🔥 Google Reclaims #1 with Gemini 2.5 Pro (Thinking!)

Okay, let's start with the big news. Google came out swinging this week, dropping Gemini 2.5 Pro and, based on the benchmarks and our initial impressions, taking back the crown for the best all-around LLM currently available. (Check out the X announcement, the official blog post, and seriously, go try it yourself at ai.dev.)

We were super lucky to have Tulsee Doshi, who leads the product team for Gemini modeling efforts at Google, join us on the show to give us the inside scoop. Gemini 2.5 Pro Experimental isn't just an incremental update; it's topping benchmarks in complex reasoning, science, math, and coding. As Tulsee explained, this isn't just about tweaking one thing; it's a combination of a significantly enhanced base model and improved post-training techniques, including integrating "thinking" capabilities (like chain-of-thought) right into the core models.

That's why they dropped "thinking" from the official name: it's not a separate mode anymore, it's becoming fundamental to how Gemini operates. Tulsee mentioned their goal is for the mainline models to be thinking models, leveraging inference-time compute when needed to get the best answer. This is a huge step towards more capable and reliable AI.

The performance gains are staggering across the board. We saw massive jumps on benchmarks like AIME (up nearly 20 points!) and GPQA. But it's not just about the numbers. As Tulsee highlighted, Gemini 2.5 is proving to be incredibly well-rounded, excelling not only on academic benchmarks but also on human-preference evaluations like LM Arena (where style control is key). The "vibes" are great, as Wolfram put it. My own testing on reasoning tasks confirms this: the latency is surprisingly low for such a powerful model (around 13 seconds on my hard reasoning questions, compared to 45+ for others), and the accuracy is the highest I've seen yet, at 66% on that specific challenging set.

It also inherits the strengths of previous Gemini models: native multimodality and that massive long context window (up to 1M tokens!). Tulsee emphasized how crucial long context is, allowing the model to reason over entire code repos, large sets of financial documents, or research papers. The performance on long-context tasks, like the needle-in-a-haystack test shown on Live Bench (see the sketch after this summary), is truly impressive, maintaining high accuracy even at 120k+ tokens, where other models often falter significantly.

Nisten mentioned on the show that while it's better than GPT-4o, it might not completely replace Sonnet 3.5 for him yet, especially for certain coding or medical tasks under 128k context. Still, the consensus is clear: Gemini 2.5 Pro is the absolute best model right now across categories. Go play with it!

ARC-AGI 2 Benchmark Revealed (X, Interactive Blog)

Also on the benchmark front, the challenging ARC-AGI 2 benchmark was revealed. It is designed to test tasks that are easy for humans but hard for LLMs. The initial results are sobering: base LLMs score 0% accuracy, and even current "thinking" models only reach about 4%. It highlights just how far we still have to go in developing truly robust AI reasoning, giving us another hill to climb.

GPT-4o got another update (as I'm writing these words!), tied for #1 on LMArena, beating 4.5

How much does Sam want to win over Google? So much that he's letting it ALL out. Just now, we saw an update from LMArena and Sam about a NEW GPT-4o (2025-03-26) which jumps OVER GPT-4.5 (like... what?) and lands at number 2 on the LM Arena, jumping over 30 points. Tied #1 in Coding, Hard ...
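For readers unfamiliar with needle-in-a-haystack evals like the one mentioned above, here is a toy sketch of how such a test is typically constructed. This is a generic illustration (made-up filler text, crude word-based length budgeting), not Live Bench's actual harness:

```python
import random

def build_niah_case(filler_sentences, needle, depth_pct, target_words):
    """Bury one unique fact (the 'needle') at a chosen relative depth
    inside long filler text, then ask the model to retrieve it."""
    haystack = []
    while sum(len(s.split()) for s in haystack) < target_words:
        haystack.append(random.choice(filler_sentences))
    insert_at = int(len(haystack) * depth_pct / 100)
    haystack.insert(insert_at, needle)
    prompt = " ".join(haystack) + "\n\nWhat is the magic number mentioned above?"
    return prompt, "42417"  # pass if the model's answer contains this

filler = ["The sky was clear over the valley.",
          "Markets drifted sideways for most of the day."]
prompt, expected = build_niah_case(filler, "The magic number is 42417.",
                                   depth_pct=75, target_words=90_000)
```

Running this at many depths and context lengths, and checking whether the model's answer contains the expected string, produces the familiar accuracy-vs-depth heatmaps these long-context claims are judged by.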
    1 hr 24 min
