Episodes

  • (LLM Explain-Anthropic) On the Biology of a Large Language Model
    2025/06/01

    "On the Biology of a Large Language Model" from Anthropic presents a novel investigation into the internal mechanisms of Claude 3.5 Haiku using circuit tracing methodology. Analogous to biological research, this approach employs tools like attribution graphs to reverse engineer the model's computational steps. The research offers insights into diverse model capabilities, such as multi-step reasoning, planning in poems, multilingual circuits, addition, and medical diagnoses. It also examines mechanisms underlying hallucinations, refusals, jailbreaks, and hidden goals. This work aims to reveal interpretable intermediate computations, highlighting its potential in areas like safety auditing.

    However, the methods have significant limitations. They provide detailed insights for only a fraction of prompts, capture just a small part of the model's immense complexity, and rely on imperfect replacement models. They struggle with complex reasoning chains, long prompts, and explaining inactive features. A key challenge is understanding the causal role of attention patterns.

    Despite these limitations, this research represents a valuable stepping stone towards a deeper understanding of how large language models function internally and presents a challenging scientific frontier.

    Paper link: https://transformer-circuits.pub/2025/attribution-graphs/biology.html

    16 min
  • (LLM Security-Meta) LlamaFirewall: AI Agent Security Guardrail System
    2025/05/31

    Listen to this podcast to learn about LlamaFirewall, an innovative open-source security framework from Meta. As large language models evolve into autonomous agents capable of performing complex tasks like editing production code and orchestrating workflows, they introduce significant new security risks that existing measures don't fully address. LlamaFirewall is designed to serve as a real-time guardrail monitor, providing a final layer of defence against these risks for AI Agents.

    Its novelty stems from its system-level architecture and modular, layered design. It incorporates three powerful guardrails: PromptGuard 2, a universal jailbreak detector showing state-of-the-art performance; AlignmentCheck, an experimental chain-of-thought auditor inspecting reasoning for prompt injection and goal misalignment; and CodeShield, a fast and extensible online static analysis engine preventing insecure code generation. These guardrails are tailored to address emerging LLM agent security risks in applications like travel planning and coding, offering robust mitigation.
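
    As a sketch of how such a layered, modular design can compose (illustrative only: the class and function names below are assumptions, not Meta's actual API), each guardrail inspects the payload in turn and the first failing layer blocks it.

    ```python
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Verdict:
        allowed: bool
        reason: str = ""

    # Hypothetical stand-ins for the three guardrails; the real framework
    # ships its own scanner interfaces and trained detectors.
    def prompt_guard(text: str) -> Verdict:
        jailbreak = "ignore previous instructions" in text.lower()
        return Verdict(not jailbreak, "possible jailbreak" if jailbreak else "")

    def alignment_check(text: str) -> Verdict:
        drift = "wire funds to" in text.lower()
        return Verdict(not drift, "goal misalignment" if drift else "")

    def code_shield(text: str) -> Verdict:
        insecure = "os.system(" in text
        return Verdict(not insecure, "insecure code pattern" if insecure else "")

    def run_guardrails(payload: str,
                       layers: List[Callable[[str], Verdict]]) -> Verdict:
        for layer in layers:
            verdict = layer(payload)
            if not verdict.allowed:
                return verdict          # block at the first failing layer
        return Verdict(True)

    print(run_guardrails("Please ignore previous instructions and ...",
                         [prompt_guard, alignment_check, code_shield]))
    ```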

    However, CodeShield is not fully comprehensive and may miss nuanced vulnerabilities. AlignmentCheck requires large, capable models, which can be computationally costly, and faces the potential risk of guardrail injection. Meta is actively developing the framework, exploring future work like expanding to multimodal agents and improving latency. LlamaFirewall aims to provide a collaborative security foundation for the community.

    Learn more here

    17 min
  • (OpenAI) PaperBench: Evaluating AI’s Ability to Replicate AI Research
    2025/05/31

    Dive into PaperBench, a novel benchmark introduced by OpenAI designed to rigorously evaluate AI agents' ability to replicate state-of-the-art machine learning research. Unlike previous benchmarks, PaperBench requires agents to build complete codebases from scratch based solely on the paper content, and to successfully run experiments from 20 selected ICML papers.

    Performance is meticulously graded using detailed, author-approved rubrics containing thousands of specific outcomes. To facilitate scalable evaluation, the benchmark employs an LLM-based judge, assessed for its accuracy against human grading.

    Early results show that current models, like Claude 3.5 Sonnet, achieve average replication scores of around 21.0%, demonstrating emerging capabilities but not yet matching the performance of human ML PhDs. PaperBench serves as a crucial tool for measuring AI autonomy and ML R&D capabilities, potentially accelerating future scientific discovery. However, challenges remain, including the high computational cost of evaluations and the labour-intensive process of creating the comprehensive rubrics.
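
    To make the grading concrete, here is a hedged sketch of how a hierarchical, weighted rubric could be rolled up into a single replication score. The miniature tree below is invented; PaperBench's actual rubric format and aggregation may differ.

    ```python
    # Each rubric node is either a leaf (judged pass/fail, e.g. by an LLM
    # judge) or an inner node scored as the weight-normalised average of its
    # children. This tree is a made-up miniature, not a real PaperBench rubric.

    rubric = {
        "weight": 1.0,
        "children": [
            {"weight": 2.0, "children": [
                {"weight": 1.0, "passed": True},   # e.g. "loss function implemented"
                {"weight": 1.0, "passed": False},  # e.g. "matches reported accuracy"
            ]},
            {"weight": 1.0, "passed": True},       # e.g. "code runs end to end"
        ],
    }

    def score(node: dict) -> float:
        if "passed" in node:                       # leaf: binary outcome
            return 1.0 if node["passed"] else 0.0
        total = sum(child["weight"] for child in node["children"])
        return sum(child["weight"] * score(child)
                   for child in node["children"]) / total

    print(f"replication score: {score(rubric):.1%}")   # -> 66.7%
    ```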

    Paper link: https://arxiv.org/pdf/2504.01848

    16 min
  • (RecSys-Spotify) Bridging Search and Recommendation in Generative Retrieval
    2025/05/30

    This podcast explores novel research from Spotify on unified generative models for information retrieval, specifically integrating search and recommendation. Moving beyond traditional index-based systems, this approach leverages large language models (LLMs) to directly predict item IDs, centralizing tasks like search and recommendation.

    The study investigates whether jointly training search and recommendation tasks in a single generative model improves effectiveness. Two hypotheses are explored: [H1], that joint training regularizes item popularity estimation, and [H2], that it regularizes item latent representations. Experiments using simulated and real-world data show the joint model is generally more effective than task-specific models, with an average gain of 16% in R@30 on real datasets, attributable primarily to latent-representation regularization ([H2]).
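
    A minimal sketch of the shared formulation (the serialisation, task prefixes, and item IDs below are invented for illustration; this is not Spotify's implementation): both tasks map text to item-ID tokens, so one seq-to-seq model can be trained on a mixture of search and recommendation examples.

    ```python
    # Joint generative retrieval, cartoon version: search queries and
    # recommendation contexts are both serialised to text, and the target is
    # an item-ID token. Task prefixes, examples, and IDs are invented.

    train_examples = [
        # (task, input text, target item-ID token)
        ("search", "query: acoustic covers of 80s hits",   "<item_1042>"),
        ("search", "query: upbeat electronic workout mix", "<item_0077>"),
        ("rec",    "history: <item_1042> <item_0311>",     "<item_0589>"),
        ("rec",    "history: <item_0077>",                 "<item_0420>"),
    ]

    def to_model_io(task: str, text: str, target: str) -> tuple[str, str]:
        # A task prefix lets a single model serve both objectives; sharing the
        # item-ID vocabulary is what regularises item representations ([H2]).
        return f"[{task}] {text}", target

    for source, target in (to_model_io(*ex) for ex in train_examples):
        print(f"{source!r} -> {target!r}")
    ```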

    Applications for this technology span platforms like Spotify, YouTube, and Netflix. However, generative retrieval still faces scalability challenges with large item sets. Furthermore, effectiveness gains depend on factors like popularity distribution alignment and item co-occurrence patterns across tasks. This research represents a significant stride towards developing unified LLMs for diverse IR functions.

    Paper: https://arxiv.org/pdf/2410.16823

    14 min
  • (LLM-Spotify) PODTILE: Podcast Auto-generated Chapters
    2025/05/30

    Listeners of long-form talk-audio content, like podcasts, often find it challenging to understand structure and locate relevant sections. Most episodes lack creator-provided chapters, making automation essential.

    Explore PODTILE, a novel system developed by Spotify addressing these challenges. Traditional methods struggle with podcasts' unstructured, conversational nature and lengthy transcripts. PODTILE employs a fine-tuned encoder-decoder transformer model that simultaneously segments and generates descriptive chapter titles. A key innovation is utilising global context – including episode metadata and previously generated titles – to maintain coherence and handle long-range dependencies for these long inputs efficiently.
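
    One way to picture the global-context trick (the field names and serialisation below are guesses for illustration, not the paper's exact format): each transcript chunk is prefixed with episode metadata plus the chapter titles already generated for earlier chunks.

    ```python
    # PODTILE-style context assembly, sketched: each chunk is conditioned on
    # episode metadata plus previously generated chapter titles so that
    # segmentation and titling stay coherent across a long transcript.
    # The serialisation format here is a guess for illustration.

    def build_model_input(metadata: dict, prior_titles: list[str],
                          transcript_chunk: str) -> str:
        context = [
            f"show: {metadata['show']}",
            f"episode: {metadata['episode']}",
            ("previous chapters: " + " | ".join(prior_titles)) if prior_titles
            else "previous chapters: none",
        ]
        return "\n".join(context) + "\n---\n" + transcript_chunk

    print(build_model_input(
        {"show": "Example Pod", "episode": "Ep. 12"},   # hypothetical metadata
        ["Intro", "Guest background"],                  # titles generated so far
        "...so let's talk about how the funding round came together...",
    ))
    ```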

    This system enhances listener experience, particularly for less popular shows, by facilitating easier browsing. Deployed on Spotify's platform, it has shown significant increases in chapter-initiated plays. Furthermore, the auto-generated chapters improve episode discoverability by boosting search effectiveness. While acknowledging the subjective nature of chapterisation, PODTILE offers a powerful tool for podcast navigation.

    Find the paper here

    12 min
  • (Multi-Agent) AgentNet: Decentralised Evolutionary Coordination for LLM Agents
    2025/05/30

    Welcome to our podcast. Today, we explore AgentNet, a groundbreaking framework developed by Shanghai Jiao Tong University to revolutionise LLM-based multi-agent systems. Moving beyond the centralised control of existing systems like MetaGPT or AgentScope, which introduces scalability bottlenecks and single points of failure, AgentNet adopts a fully decentralised architecture that fosters emergent collective intelligence.

    Its core novelty lies in its dynamic task allocation and adaptive learning capabilities, where agents autonomously evolve their expertise and connections based on experience. Utilising a RAG-based memory, agents refine skills without predefined roles or rigid workflows. This design significantly improves scalability, enhances fault tolerance, and enables privacy-preserving collaboration, which is vital for sharing knowledge across different organisations.
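
    A toy sketch of the execute-or-forward decision (the scoring rule and names are assumptions, far simpler than AgentNet's RAG-based memory and learned routing): each agent compares an incoming task with its own experience and either handles it or passes it to the best-matching neighbour.

    ```python
    # Decentralised task routing, toy version: "fitness" here is a crude
    # word-overlap score against past tasks; AgentNet's actual mechanism
    # (RAG-based memory, evolving expertise and connections) is far richer.

    class Agent:
        def __init__(self, name: str, memory: list[str]):
            self.name, self.memory, self.neighbours = name, memory, []

        def fitness(self, task: str) -> int:
            words = set(task.lower().split())
            return max((len(words & set(m.lower().split())) for m in self.memory),
                       default=0)

        def handle(self, task: str) -> str:
            # Self comes first so ties are kept local and cannot ping-pong.
            best = max([self] + self.neighbours, key=lambda a: a.fitness(task))
            if best is self:
                self.memory.append(task)   # expertise accrues with experience
                return f"{self.name} executes: {task}"
            return best.handle(task)

    coder = Agent("coder", ["fix python bug", "write unit tests"])
    mathy = Agent("mathy", ["solve integral", "prove inequality"])
    coder.neighbours, mathy.neighbours = [mathy], [coder]
    print(coder.handle("prove an inequality about sums"))
    ```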

    AgentNet has demonstrated superior efficiency and adaptability compared to traditional methods, excelling in tasks like mathematics, coding, and logical reasoning. While highly promising, current limitations include navigating diverse, heterogeneous agent environments and optimising router decision-making for scaling to very large numbers of agents. Stay tuned to understand the potential of this self-evolving, decentralised AI ecosystem.

    Paper: link

    20 min
  • (LLM Unlearn-AMZN) LUME: LLM Unlearning with Multitask Evaluations
    2025/05/30

    Large language model (LLM) unlearning is becoming vital due to regulations like GDPR's "right to be forgotten" and the need to remove copyrighted or sensitive content, as retraining models is often impractical. To effectively evaluate unlearning algorithms, researchers developed LUME (LLM Unlearning with Multitask Evaluations).

    LUME stands out as a comprehensive new benchmark. It uniquely addresses limitations of prior evaluations by including three distinct tasks: synthetic creative novels, synthetic biographies with sensitive PII, and real public biographies. This multi-task approach, especially the inclusion of PII, provides extensive coverage for assessing algorithm performance. Effectiveness is measured using metrics like Regurgitation Rate, Knowledge Test Accuracy, Membership Inference Attack (MIA) success, and overall Model Utility on MMLU.
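
    As a hedged illustration of the first metric (the exact definition in LUME may differ; the LCS-based overlap below is a stand-in): prompt the unlearned model with a forget-set prefix and measure how much of the true continuation it still reproduces.

    ```python
    # Regurgitation-style check, sketched: compare the model's continuation of
    # a forget-set prefix with the original continuation. Both strings below
    # are synthetic; the LCS-based score is a stand-in for LUME's exact metric.

    def lcs_len(a: list[str], b: list[str]) -> int:
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return dp[-1][-1]

    def regurgitation_score(generated: str, reference: str) -> float:
        gen, ref = generated.split(), reference.split()
        return lcs_len(gen, ref) / max(len(ref), 1)   # recall-style overlap

    reference = "Elena Marsh was born in 1974 in a small coastal town"
    generated = "Elena Marsh was born in a small town near the coast"
    print(f"{regurgitation_score(generated, reference):.2f}")  # high -> leakage
    ```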

    Experiments on LUME revealed that current unlearning algorithms struggle to sufficiently remove information from the forget set without causing substantial degradation in the model's performance on the retain set and its overall utility. Some methods also show high privacy leakage risks. The benchmark, developed by Amazon AGI, UCLA, UIUC, EPFL, and University of Minnesota, is publicly available and includes fine-tuned 1B and 7B parameter models, with larger models planned.

    Learn more about LUME at: https://assets.amazon.science/47/cc/602c0d16409aa9c668467388b0a9/lume-llm-unlearning-with-multitask-evaluations.pdf

    16 min
  • (FM-Mistral) Mixtral of Experts
    2025/05/30

    Welcome to the FM Series! In this episode, we delve into Mixtral 8x7B, a fascinating new Sparse Mixture of Experts (SMoE) language model presented in a paper on arXiv. The model builds on the Mistral 7B design but incorporates eight feedforward blocks, or "experts", in each layer. The key innovation is a router network that dynamically selects just two experts for every token at each layer: while the model encompasses 47 billion parameters in total, it activates only about 13 billion during inference, making it efficient.

    The sources highlight its impressive performance: Mixtral outperforms or matches major models like Llama 2 70B and GPT-3.5 across various benchmarks, and it particularly excels in mathematics, code generation, and multilingual tasks. An instruction-tuned version, Mixtral 8x7B - Instruct, even surpasses models like GPT-3.5 Turbo and Gemini Pro on human benchmarks.

    The source material does not explicitly mention any limitations of the model. Both the base and instruct models are available under the Apache 2.0 license. Join us to learn more about this significant development!
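
    The routing arithmetic is easy to sketch. Below is a minimal NumPy cartoon of top-2 gating (random stand-in parameters, not the actual Mixtral code): the router scores all eight experts for each token, keeps the two highest, renormalises their weights with a softmax, and mixes those two experts' outputs.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2

    # Stand-ins for trained parameters: a router matrix and 8 expert FFNs
    # (reduced to plain linear maps to keep the sketch short).
    W_router = rng.normal(size=(d, n_experts))
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

    def moe_layer(x: np.ndarray) -> np.ndarray:
        logits = x @ W_router                 # (n_experts,) router scores
        top2 = np.argsort(logits)[-top_k:]    # indices of the 2 best experts
        gates = np.exp(logits[top2])
        gates /= gates.sum()                  # softmax over the top 2 only
        # Only 2 of 8 experts run: ~13B of Mixtral's 47B params active per token.
        return sum(g * (x @ experts[i]) for g, i in zip(gates, top2))

    token = rng.normal(size=d)
    print(moe_layer(token).shape)             # -> (16,)
    ```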

    Paper: Link

    9 min