![HuggingFace 每日AI论文速递](https://image.xyzcdn.net/FoqTpAyCk31T4qhoA-PuH8wDyDNx.png@small)
- 【Weekend Special】Hottest AI papers of December, week 2 | Scaling strategies boost model performance; a multimodal system optimizes long-term interaction.
The 5 papers in this episode:
[00:43] TOP1 (🔥95) | 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[03:01] TOP2 (🔥65) | 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[05:09] TOP3 (🔥64) | 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[07:29] TOP4 (🔥61) | 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[09:46] TOP5 (🔥53) | 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
【Follow Us】You can also find us on the following platform for more content beyond the podcast. Xiaohongshu: AI速递
- 2024.12.13 Daily AI Papers | A multimodal system improves long-term interaction; phi-4 shines on STEM Q&A.
The 23 papers in this episode:
[00:23] 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[01:03] 🧠 Phi-4 Technical Report
[01:43] 🧠 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
[02:27] 🌐 Multimodal Latent Language Modeling with Next-Token Diffusion
[03:10] 🌐 EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
[03:57] 🌐 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
[04:43] 🌟 Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
[05:24] 📱 SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
[06:02] 🔬 PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
[06:49] 📊 Learned Compression for Compressed Learning
[07:32] 🎙 Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
[08:20] 📊 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
[09:08] 👀 Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
[10:02] 🧠 JuStRank: Benchmarking LLM Judges for System Ranking
[10:43] 🧠 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
[11:34] 📚 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
[12:16] 🔗 Word Sense Linking: Disambiguating Outside the Sandbox
[12:58] 🌐 FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
[13:42] 🎥 DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
[14:26] 🖼 LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
[15:21] 🧭 SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[16:05] 🌟 Arbitrary-steps Image Super-resolution via Diffusion Inversion
[16:46] 📚 Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
- 2024.12.12 Daily AI Papers | Breakthroughs in multi-view video generation; better models for complex scenes
The 14 papers in this episode:
[00:23] 🎥 SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
[01:07] 🌐 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
[01:51] 🌐 POINTS1.5: Building a Vision-Language Model towards Real World Applications
[02:28] 🎨 Learning Flow Fields in Attention for Controllable Person Image Generation
[03:11] 🎥 StyleMaster: Stylize Your Video with Artistic Generation and Translation
[04:00] 🔍 Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
[04:46] 🎥 StreamChat: Chatting with Streaming Video
[05:28] 🧠 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
[06:12] 🏃 Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
[07:01] 🧠 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
[07:40] 🖼 FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
[08:17] 🎨 StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
[09:03] 🌍 MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
[09:50] 🚀 Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
- 2024.12.11 Daily AI Papers | Improved evaluation of code models; breakthroughs in video generation
The 23 papers in this episode:
[00:25] 🧑 Evaluating and Aligning CodeLLMs on Human Preference
[01:19] 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[01:59] 🎨 DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
[02:39] 🔒 Hidden in the Noise: Two-Stage Robust Watermarking for Images
[03:19] 🎥 UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
[04:04] 📄 OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
[04:50] 🎨 FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
[05:32] 🎥 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
[06:09] 🧠 Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
[06:55] 🧠 Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
[07:41] 🎥 Video Motion Transfer with Diffusion Transformers
[08:23] 🚀 EMOv2: Pushing 5M Vision Model Frontier
[09:02] 🛡 Granite Guardian
[09:44] 🌟 ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
[10:30] 🎥 ObjCtrl-2.5D: Training-free Object Control with Camera Poses
[11:21] 🚀 LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
[12:12] 📱 MoViE: Mobile Diffusion for Video Editing
[12:46] 🧬 Chimera: Improving Generalist Model with Domain-Specific Experts
[13:28] 🌐 Fully Open Source Moxin-7B Technical Report
[14:09] 📱 Mobile Video Diffusion
[14:45] 🤖 Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
[15:24] 🤖 Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
[16:15] 🔒 A New Federated Learning Framework Against Gradient Inversion Attacks
- 2024.12.10 Daily AI Papers | Identifying errors in mathematical reasoning; evaluating memory in RL agents.
The 9 papers in this episode:
[00:23] 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
[01:13] 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[01:58] 🧠 Training Large Language Models to Reason in a Continuous Latent Space
[02:38] 🌐 Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
[03:22] 🎥 Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
[04:09] 🎥 You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
[04:53] 🌍 Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space
[05:31] 🌐 Robust Multi-bit Text Watermark with LLM-based Paraphrasers
[06:15] 🤖 CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
- 2024.12.09 Daily AI Papers | Boosting multimodal model performance; improving text-to-video generation quality.
The 11 papers in this episode:
[00:27] 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[00:58] 🎥 LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
[01:41] 🧠 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
[02:24] 🤖 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
[03:26] 🤖 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
[04:10] 🚀 APOLLO: SGD-like Memory, AdamW-level Performance
[04:49] ⚡ SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
[05:26] 🎥 GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
[06:07] ⏱ Mind the Time: Temporally-Controlled Multi-Event Video Generation
[06:42] 🏠 2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction
[07:20] 🗣 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
- 【Weekend Special】Hottest AI papers of December, week 1 | SNOOPI makes text-to-image models more efficient; PaliGemma 2 improves vision-language model transfer
The 5 papers in this episode:
[00:40] TOP1 (🔥102) | 🚀 SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
[02:39] TOP2 (🔥100) | 🔄 PaliGemma 2: A Family of Versatile VLMs for Transfer
[04:40] TOP3 (🔥64) | 🔍 VisionZip: Longer is Better but Not Necessary in Vision Language Models
[06:14] TOP4 (🔥60) | 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
[08:19] TOP5 (🔥54) | 🎥 VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
- 2024.12.06 Daily AI Papers | Visual token compression boosts efficiency; code-based monitoring makes robots more reliable.
The 23 papers in this episode:
[00:23] 🔍 VisionZip: Longer is Better but Not Necessary in Vision Language Models
[01:03] 🤖 Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
[01:43] 🖥 Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
[02:27] 🔊 A Noise is Worth Diffusion Guidance
[03:04] 📊 Evaluating Language Models as Synthetic Data Generators
[03:48] 🌐 Structured 3D Latents for Scalable and Versatile 3D Generation
[04:26] 🌐 MV-Adapter: Multi-view Consistent Image Generation Made Easy
[05:05] 🖼 Negative Token Merging: Image-based Adversarial Feature Guidance
[05:41] 🌐 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
[06:18] 📈 Densing Law of LLMs
[06:59] 🌌 Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[07:37] ⚽ Towards Universal Soccer Video Understanding
[08:15] 🎨 HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
[08:53] 👗 AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
[09:35] 🌍 Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
[10:11] 🌐 Personalized Multimodal Large Language Models: A Survey
[10:55] ⚡ ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
[11:36] 🧠 MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities
[12:14] 🧠 Discriminative Fine-tuning of LVLMs
[12:48] 🧠 Monet: Mixture of Monosemantic Experts for Transformers
[13:24] 🌊 OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
[13:59] 🧠 KV Shifting Attention Enhances Language Modeling
[14:40] 🌍 Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
- 2024.12.05 Daily AI Papers | Improving text-to-image diffusion models; generating immersive 360-degree video.
The 15 papers in this episode:
[00:24] 🚀 SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
[01:06] 🎥 Imagine360: Immersive 360 Video Generation from Perspective Anchor
[01:40] 🚗 Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
[02:13] 🔄 PaliGemma 2: A Family of Versatile VLMs for Transfer
[02:52] 🌊 TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
[03:31] 🌐 VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
[04:05] 🌐 NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
[04:49] 🎥 Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
[05:34] 🔍 CleanDIFT: Diffusion Features without Noise
[06:11] 🎨 MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
[06:53] 🎥 One Shot, One Talk: Whole-body Talking Avatar from a Single Image
[07:33] 📹 Mimir: Improving Video Diffusion Models for Precise Text Understanding
[08:07] 🎨 NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
[08:47] 🧩 Weighted-Reward Preference Optimization for Implicit Model Fusion
[09:37] 🔍 Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
- 2024.12.04 Daily AI Papers | A multi-shot video generation framework improves narrative coherence; critical-token identification strengthens LLM reasoning.
The 15 papers in this episode:
[00:24] 🎥 VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
[01:04] 🧠 Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
[01:45] 🔄 Free Process Rewards without Process Labels
[02:30] 🎧 AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
[03:04] 🤖 MALT: Improving Reasoning with Multi-Agent LLM Training
[03:45] 🎥 OmniCreator: Self-Supervised Unified Generation with Universal Editing
[04:23] 🌴 Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
[05:08] 📚 OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
[05:51] 📊 Scaling Image Tokenizers with Grouped Spherical Quantization
[06:27] 🌐 LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
[07:09] ⚙ A dynamic parallel method for performance optimization on hybrid CPUs
[08:00] 🌐 MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
[08:46] 🎥 Motion Prompting: Controlling Video Generation with Motion Trajectories
[09:27] 🎥 VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
[10:01] 🤖 Generating a Low-code Complete Workflow via Task Decomposition and RAG
- 2024.12.03 Daily AI Papers | X-Prompt improves image generation; GATE OpenING benchmarks interleaved image-text generation.
The 24 papers in this episode:
[00:23] 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
[00:58] 📊 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[01:32] 🖼 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
[02:09] 🎥 Open-Sora Plan: Open-Source Large Video Generation Model
[02:55] 🎥 TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
[03:37] 🤖 o1-Coder: an o1 Replication for Coding
[04:12] 🤖 SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
[04:49] 🎥 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
[05:38] 🔍 TinyFusion: Diffusion Transformers Learned Shallow
[06:19] 🔍 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
[06:52] 🎙 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
[07:32] 🚀 Efficient Track Anything
[08:15] 🌊 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
[08:50] 🎥 Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
[09:33] 📹 WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
[10:11] 🔍 VLSBench: Unveiling Visual Leakage in Multimodal Safety
[10:51] 🧠 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
[11:41] 🎮 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
[12:14] 🗣 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
[12:51] 🌍 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
[13:28] 🎨 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge
[14:02] 📈 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
[14:41] 🌐 World-consistent Video Diffusion with Explicit 3D Modeling
[15:22] 🔊 Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
- 【Month-End Special】Hottest AI papers of November | OpenCoder rivals proprietary models; unpacking SDXL Turbo makes image models more interpretable.
The 10 papers in this episode:
[00:41] TOP1 (🔥109) | 🔓 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
[02:35] TOP2 (🔥75) | 🔍 Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
[04:35] TOP3 (🔥72) | 🖼 ROICtrl: Boosting Instance Control for Visual Generation
[06:38] TOP4 (🔥69) | 🎥 ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
[08:21] TOP5 (🔥68) | 🌐 LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
[10:13] TOP6 (🔥67) | 🌍 Generative World Explorer
[12:39] TOP7 (🔥64) | 📄 HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
[14:52] TOP8 (🔥63) | ⚡ BitNet a4.8: 4-bit Activations for 1-bit LLMs
[16:41] TOP9 (🔥62) | 🖼 Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
[18:16] TOP10 (🔥61) | 🧠 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
- 2024.12.02 Daily AI Papers | HiAR-ICL improves performance on complex tasks; domain adaptation strengthens multimodal models.
The 14 papers in this episode:
[00:25] 🧠 Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
[01:06] 🌐 On Domain-Specific Post-Training for Multimodal Large Language Models
[01:39] 🎥 Video Depth without Video Models
[02:10] 🧩 Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
[02:58] ⏱ Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
[03:39] 🎥 Trajectory Attention for Fine-grained Video Motion Control
[04:26] 🌐 FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
[05:07] 🌊 DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
[05:52] 📐 AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
[06:30] 🎥 Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
[07:07] 📹 AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
[07:52] 📰 LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
[08:38] 🎥 Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
[09:09] 🔄 Reverse Thinking Makes LLMs Stronger Reasoners
- 【Weekend Special】Hottest AI papers of November, week 5 | Better instance control for visual generation; stronger interaction for GUI visual agents.
The 5 papers in this episode:
[00:40] TOP1 (🔥71) | 🖼 ROICtrl: Boosting Instance Control for Visual Generation
[02:30] TOP2 (🔥59) | 🖥 ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[05:02] TOP3 (🔥54) | 🚀 TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
[07:13] TOP4 (🔥40) | 🌐 Material Anything: Generating Materials for Any 3D Object via Diffusion
[09:12] TOP5 (🔥39) | 🌐 OminiControl: Minimal and Universal Control for Diffusion Transformer
- 2024.11.29 Daily AI Papers | Vision-language models improve; image generation gets automated
The 6 papers in this episode:
[00:26] 🧠 Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
[01:04] 🤖 ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
[01:43] 👕 TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
[02:24] 🎥 Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
[03:15] 🤖 Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
[03:49] 📄 LongKey: Keyphrase Extraction for Long Documents