

- 2025.03.31 | Reducing token usage, improving domain efficiency.
This episode covers the following 15 papers:
[00:22] 💡 AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
[01:01] 🤖 Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
[01:41] 🤔 Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
[02:19] 💡 A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
[02:58] 🖼 ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
[03:44] 🧠 OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
[04:25] 🔄 ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
[04:59] 🎬 Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
[05:37] 🧪 PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
[06:24] 🗣 Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
[07:03] 🎬 Segment Any Motion in Videos
[07:42] 🖼 Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
[08:28] 🖼 Your ViT is Secretly an Image Segmentation Model
[09:04] 🤔 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
[09:48] 💡 A Refined Analysis of Massive Activations in LLMs
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- [Weekend Special] The hottest AI papers of the 4th week of March | Sparse autoencoders interpret LLM reasoning features; innovations in multimodal models.
This episode covers the following 5 papers:
[00:37] TOP1 (🔥109) | 🧠 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
[02:42] TOP2 (🔥92) | 🤖 Qwen2.5-Omni Technical Report
[05:10] TOP3 (🔥83) | 🎬 Video-T1: Test-Time Scaling for Video Generation
[07:36] TOP4 (🔥70) | 🧮 When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
[10:00] TOP5 (🔥68) | 🎬 Long-Context Autoregressive Video Modeling with Next-Frame Prediction
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.28 | Improved video reasoning, optimized GUI action prediction
This episode covers the following 15 papers:
[00:22] 🧠 Video-R1: Reinforcing Video Reasoning in MLLMs
[01:02] 📱 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
[01:41] 🤯 Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
[02:25] 🎬 VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
[03:05] 🖼 LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
[03:38] 🤖 Large Language Model Agent: A Survey on Methodology, Applications and Challenges
[04:23] 🧠 ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
[05:01] 🖼 Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
[05:48] 🤖 Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
[06:27] 💡 ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
[07:12] 🚀 Optimal Stepsize for Diffusion Sampling
[07:46] 🤔 Exploring the Evolution of Physics Cognition in Video Generation: A Survey
[08:24] 🎤 FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
[09:01] 🗣 ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
[09:40] 🧠 ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.27 | Dita excels as a cross-modal policy; Qwen2.5-Omni delivers real-time multimodal responses.
This episode covers the following 15 papers:
[00:26] 🤖 Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
[01:07] 🤖 Qwen2.5-Omni Technical Report
[01:46] 🧩 LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
[02:35] 🎬 Wan: Open and Advanced Large-Scale Video Generative Models
[03:24] 💡 Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
[04:04] 🔍 Open Deep Search: Democratizing Search with Open-source Reasoning Agents
[04:44] 🖼 GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
[05:24] 📊 BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
[06:01] 🤖 Gemini Robotics: Bringing AI into the Physical World
[06:39] 🧠 MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
[07:22] 🚀 AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
[07:54] 🖼 ViLBench: A Suite for Vision-Language Process Reward Modeling
[08:33] 💾 LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
[09:12] 🚗 ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
[09:55] 🖼 Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.26 | Improved video prediction performance; strong results from multimodal pre-training.
This episode covers the following 15 papers:
[00:22] 🎬 Long-Context Autoregressive Video Modeling with Next-Frame Prediction
[01:01] 🖼 CoMP: Continual Multimodal Pre-training for Vision Foundation Models
[01:42] 🎬 Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
[02:28] 📈 Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
[03:14] 🖼 Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
[03:54] 🖼 Scaling Vision Pre-Training to 4K Resolution
[04:33] 🤔 Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
[05:15] 🖼 CoLLM: A Large Language Model for Composed Image Retrieval
[05:53] 🤖 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
[06:35] 🖼 Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
[07:13] 🔍 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
[07:54] 🛡 LookAhead Tuning: Safer Language Models via Partial Answer Previews
[08:38] 💡 Frequency Dynamic Convolution for Dense Image Prediction
[09:18] 🖼 LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
[09:51] 🧬 Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.25 | Sparse autoencoders interpret reasoning features in LLMs; innovations in interactive video
This episode covers the following 15 papers:
[00:24] 🧠 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
[01:03] 🎮 Position: Interactive Generative Video as Next-Generation Game Engine
[01:47] 🎬 Video-T1: Test-Time Scaling for Video Generation
[02:35] 🌐 Aether: Geometric-Aware Unified World Modeling
[03:11] 🧠 SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
[03:51] 🎬 OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models
[04:31] 🤖 Judge Anything: MLLM as a Judge Across Any Modality
[05:16] 💡 LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
[05:57] 🖼 Equivariant Image Modeling
[06:37] 🚀 Training-free Diffusion Acceleration with Bottleneck Sampling
[07:11] ✨ CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models
[07:59] 🤔 Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
[08:39] 🚄 FFN Fusion: Rethinking Sequential Computation in Large Language Models
[09:20] 🛡 Defeating Prompt Injections by Design
[10:00] 🤝 AgentRxiv: Towards Collaborative Autonomous Research
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.24 | Multi-agent collaboration boosts performance; Socratic dialogue optimizes prompts.
This episode covers the following 15 papers:
[00:22] 🧠 MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
[01:09] 🤖 MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
[01:55] 🤖 RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
[02:38] 🧮 When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
[03:21] 🌉 Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
[03:55] 🧠 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
[04:37] ✍ Modifying Large Language Model Post-Training for Diverse Creative Writing
[05:21] 🧮 MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
[06:05] 🎬 Enabling Versatile Controls for Video Diffusion Models
[06:48] 🎬 ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
[07:27] 🖼 Single Image Iterative Subject-driven Generation and Editing
[08:12] 🎨 When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
[08:56] ⚖ From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
[09:37] 🚀 FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
[10:13] 🗣 TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- [Weekend Special] The hottest AI papers of the 3rd week of March | Sequence modeling innovations, video rendering breakthroughs
This episode covers the following 5 papers:
[00:37] TOP1 (🔥118) | 🦢 RWKV-7 "Goose" with Expressive Dynamic State Evolution
[02:36] TOP2 (🔥115) | 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[05:18] TOP3 (🔥89) | 🤖 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
[07:45] TOP4 (🔥84) | 🎥 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
[10:28] TOP5 (🔥79) | 🎨 PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.21 | Distillation improves super-resolution efficiency; optimized reasoning reduces computational cost.
This episode covers the following 15 papers:
[00:23] 🖼 One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
[01:01] 🤔 Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
[01:38] 🚀 Unleashing Vecset Diffusion Model for Fast Shape Generation
[02:18] 🤖 Survey on Evaluation of LLM-based Agents
[02:56] 🎨 DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
[03:33] 🤖 Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
[04:14] 🖼 Scale-wise Distillation of Diffusion Models
[04:54] 🗜 Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
[05:36] 🧮 MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion
[06:17] 🖼 InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
[06:56] 🎮 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
[07:41] 🧠 CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
[08:26] 🖼 Ultra-Resolution Adaptation with Ease
[09:04] 🎨 Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
[09:48] 🎬 MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.20 | Adaptive foresight sampling optimizes inference; reinforcement learning improves 3D mesh quality
This episode covers the following 15 papers:
[00:23] 🔍 φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
[01:08] 🎨 DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
[01:51] 🌷 TULIP: Towards Unified Language-Image Pretraining
[02:26] 🤖 Cube: A Roblox View of 3D Intelligence
[03:06] 📱 Efficient Personalization of Quantized Diffusion Model without Backpropagation
[03:48] 🎬 Temporal Regularization Makes Your Video Generator Stronger
[04:21] 🤖 STEVE: A Step Verification Pipeline for Computer-use Agent Training
[04:59] 🖼 LEGION: Learning to Ground and Explain for Synthetic Image Detection
[05:41] 🎶 MusicInfuser: Making Video Diffusion Listen and Dance
[06:24] 👋 ViSpeak: Visual Instruction Feedback in Streaming Videos
[07:03] 🧠 GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
[07:46] 👁 Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
[08:32] 🗣 Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
[09:09] 🤖 ELTEX: A Framework for Domain-Driven Synthetic Data Generation
[09:52] 🧪 CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.19 | Advantages of dynamic sequence modeling; challenges in video generation and understanding
This episode covers the following 15 papers:
[00:21] 🦢 RWKV-7 "Goose" with Expressive Dynamic State Evolution
[00:55] 🤯 Impossible Videos
[01:38] 🎨 Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
[02:17] 🤖 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
[02:58] 🧠 DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
[03:39] 🖼 CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
[04:25] 🤖 Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
[05:13] 🧠 Frac-Connections: Fractional Extension of Hyper-Connections
[05:52] 🌍 Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
[06:30] 🧐 MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
[07:13] 🤖 Aligning Multimodal LLM with Human Preference: A Survey
[07:51] ⏱ Measuring AI Ability to Complete Long Tasks
[08:38] 🎭 Concat-ID: Towards Universal Identity-Preserving Video Synthesis
[09:13] 🖼 FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis
[09:50] 🤔 Temporal Consistency for LLM Reasoning Process Error Identification
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.18 | New methods for video generation, a new framework for humanoid robots
This episode covers the following 15 papers:
[00:21] 🎥 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
[01:10] 🤖 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
[01:49] 🖼 DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
[02:38] 🖼 Edit Transfer: Learning Image Editing via Vision In-Context Relations
[03:12] 🖼 Personalize Anything for Free with Diffusion Transformer
[03:53] 🎬 WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
[04:30] 🎨 BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
[05:14] 🛡 reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
[05:54] 🔬 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
[06:31] 🧠 Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
[07:09] 🤖 Free-form language-based robotic reasoning and grasping
[07:45] 🧠 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
[08:35] 🤔 V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
[09:18] 🎬 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
[09:51] 🖼 Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.17 | New camera trajectory generation; sparsity improves image quality
This episode covers the following 15 papers:
[00:25] 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[01:11] 💡 PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
[01:50] 🤖 Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning
[02:38] 📊 Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
[03:25] 🤖 API Agents vs. GUI Agents: Divergence and Convergence
[03:57] 🛡 Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks
[04:47] 🎬 Large-scale Pre-training for Grounded Video Caption Generation
[05:31] 🌉 FlowTok: Flowing Seamlessly Across Text and Image Tokens
[06:08] ⚕ TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
[06:47] 🤔 Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
[07:27] 📸 VGGT: Visual Geometry Grounded Transformer
[08:14] 🦜 Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
[08:52] 🖼 Neighboring Autoregressive Modeling for Efficient Visual Generation
[09:26] 🔬 ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
[10:02] 🖼 ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- [Weekend Special] The hottest AI papers of the 2nd week of March | Sparse autoencoders improve text detection; automated ICD coding boosts medical efficiency.
This episode covers the following 5 papers:
[00:44] TOP1 (🔥208) | 🤖 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
[03:15] TOP2 (🔥122) | 🇷 RuCCoD: Towards Automated ICD Coding in Russian
[05:35] TOP3 (🔥104) | 🌐 Unified Reward Model for Multimodal Understanding and Generation
[07:58] TOP4 (🔥89) | 🌏 Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
[10:21] TOP5 (🔥73) | 🧠 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递
- 2025.03.14 | CoSTA* optimizes multi-turn editing efficiency; the Silent Branding Attack exposes the fragility of diffusion models.
This episode covers the following 15 papers:
[00:25] 🖼 CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
[01:03] 🎭 Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
[01:45] 🌍 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
[02:30] 🗺 Charting and Navigating Hugging Face's Model Atlas
[03:14] 🧠 GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
[03:48] 🎨 CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
[04:29] 🧠 Transformers without Normalization
[05:06] 🌐 GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
[05:50] 🤖 New Trends for Modern Machine Translation with Large Reasoning Models
[06:32] 📝 Shifting Long-Context LLMs Research from Input to Output
[07:09] 🌐 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
[07:54] 🧠 DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
[08:35] 🐱 Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark
[09:20] 🎥 Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
[10:01] 🎥 Long Context Tuning for Video Generation
【Follow us】You can also find us on the following platforms for more content beyond the podcast — 小红书: AI速递