👍 71
06/05 08:00
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlyi
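Summary: Finds that an LLM's unembedding matrix can serve as a "feature lens" for text embeddings, revealing why LLMs perform poorly as embedding models. Based on an analysis of this matrix, the paper proposes an improved method that raises text embedding quality.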
👍 43
06/04 08:00
Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing
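Summary: Proposes the SoCRATES evaluation framework for reliable automatic evaluation of LLM mediators across multiple domains and under socio-cognitive variation. It simulates real-time mediation trajectories and accounts for the dynamics of emotion and intent, addressing shortcomings of existing evaluations.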
👍 42
06/03 08:00
Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic be
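Summary: Introduces GENEB, a large-scale diagnostic benchmark that addresses the difficulty of comparing genomic models. It unifies fragmented benchmarks and evaluation protocols, enables fair comparison, and reveals how models actually perform across tasks.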
👍 39
06/05 08:00
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneere
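Summary: Proposes MMAE, the first massive multitask benchmark for general-purpose instruction-based audio editing. It covers a variety of editing tasks, advances standardized evaluation of intelligent audio editing, and provides a unified platform for model comparison.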
👍 24
06/05 08:00
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understandin
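Summary: Proposes the SWE-Explore benchmark for fine-grained evaluation of coding agents' repository exploration abilities (such as understanding structure and locating relevant code), addressing the limits of holistic binary evaluation in benchmarks such as SWE-bench.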
👍 24
06/05 08:00
Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexi
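Summary: Proposes the AnchorWorld framework for embodied egocentric world simulation, supporting customizable control based on viewpoint evolution. It strengthens interaction integrity and offers flexible controllability, moving interactive world modeling toward practical use.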
👍 22
06/04 08:00
Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only poli
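Summary: A position paper arguing that the bottleneck for generalist robot intelligence is not only policy scaling (e.g., VLA models) but also involves world models, cognitive structure, and more. It argues that robots need a more comprehensive framework that goes beyond VLAs and world models.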
👍 19
06/05 08:00
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiabl
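Summary: Analyzes the update trajectory of on-policy distillation (OPD) in parameter space and compares it with SFT and RL. It reveals OPD's geometric properties and explains why OPD improves LLM reasoning.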
👍 19
06/04 08:00
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-erro
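Summary: Proposes the ToolMaze benchmark for evaluating dynamic path discovery and error recovery in LLM tool-integrated reasoning. It distinguishes systematic replanning from blind trial and error, filling a gap in evaluation under real-world tool failures.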
👍 18
06/04 08:00
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated r
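Summary: Proposes the SubtleMemory benchmark, which evaluates how well long-term AI agents discriminate fine-grained relations among memories. It tests agents under memory conflicts and context changes, advancing precise memory management for persistent assistants.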
👍 14
06/04 08:00
Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent vie
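Summary: Models the LLM inference process with causal graphs, combined with counterfactual chains, to provide interpretability. Unlike work that uses LLMs to recover external causal graphs, this approach directly explains the LLM's internal decision mechanism.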
👍 13
05/29 08:00
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports o
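Summary: By comparing Likert self-reports with behavioral observation across eight open-source LLMs, finds that human psychometric questionnaires mischaracterize LLM behavior. It cautions against relying on questionnaires to assess LLM personality and values.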
👍 13
06/04 08:00
We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a s
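Summary: Proposes the UnpredictaBench benchmark, which evaluates an LLM's ability to capture the randomness of true distributions. Targeting the collapse of model output distributions, it tests distributional fidelity in simulation scenarios.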
👍 13
06/05 08:00
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and
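Summary: Pushes MLLMs from short clips toward understanding long, multimodal, knowledge-intensive video scenarios. It proposes an "observation-memory-reasoning" paradigm to handle challenges such as sparse evidence, long-range dependencies, and multimodal alignment.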
👍 12
06/04 08:00
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through sp
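Summary: Finds that in image-to-video diffusion models, 2-step generation is more physically consistent than 50-step generation. The cause is that the visual refinement stage may erase motion priors; the paper proposes locking in the motion prior to improve physical plausibility.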
👍 12
06/04 08:00
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason fr
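Summary: Proposes using a world simulator to strengthen VLMs' visual spatial reasoning, enabling them to imagine unobserved scenes and maintain cross-view consistency. It combines world models with chain-of-thought to improve autonomous spatial reasoning.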
👍 9
05/26 08:00
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The ha
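Summary: Proposes the SIA framework, which achieves AI self-improvement through "harness" and "weight update". It combines two separate lines of research so that an AI can automatically adjust its own model and agent behavior, reducing human intervention.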
👍 5
05/28 08:00
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a
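Summary: Proposes ResearchClawBench, which evaluates AI agents' end-to-end autonomous scientific research capability across 40 tasks in 10 domains. Each task is based on a real paper and tests the full research workflow.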
👍 5
05/22 08:00
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether these numerical outputs are genuinely grounded in spatia
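Summary: Revisits VLMs' spatial numerical understanding and finds that their outputs look plausible but are not necessarily grounded in spatial perception. It proposes the SPACENUM evaluation method to separate genuine understanding from surface fitting.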
👍 4
06/04 08:00
Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring criti
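Summary: Proposes Almieyar-Oryx-BloomBench, a bilingual multimodal benchmark grounded in cognitive science, to systematically diagnose VLMs' true reasoning ability. It covers multi-dimensional tasks and advances the evaluation of human-like multimodal intelligence.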