HFEPX Archive Slice

HFEPX Monthly Archive: 2025-05

Updated from the current HFEPX corpus (Apr 12, 2026). 134 papers are grouped in this archive page.

Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: domain experts. Common annotation unit: trajectory. Most frequent quality control: calibration. Most frequently cited benchmark: AdvBench. Most common metric: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from May 30, 2025.

Papers: 134 · Last published: May 30, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 134 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

8.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

23.3%

Papers with reported metric mentions in extraction output.

0 papers in the loaded sample report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
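
The percentages on this page appear to use two different denominators: the analysis blocks above are computed from the loaded sample of 60 papers, while figures such as 24.6% and 2.2% are consistent with the full set of 134 papers. A minimal Python sketch, assuming simple rounding, converts a reported percentage back to an approximate paper count. The helper name implied_count is hypothetical and not part of HFEPX.

```python
def implied_count(percent: float, total: int) -> int:
    """Approximate number of papers behind a reported percentage."""
    return round(percent / 100 * total)

SAMPLE = 60   # loaded sample used by the analysis blocks
FULL = 134    # all papers in this archive slice

# Percentages reported against the loaded sample.
sample_counts = {
    "benchmark anchors": implied_count(8.3, SAMPLE),
    "metric anchors": implied_count(23.3, SAMPLE),
}

# Percentages that appear to be reported against the full slice (assumption).
full_counts = {
    "automatic metrics": implied_count(24.6, FULL),
    "quality controls": implied_count(2.2, FULL),
}

print(sample_counts)
print(full_counts)
```

Checking which denominator a figure uses before comparing it with another slice avoids mixing sample-level and corpus-level coverage.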

Why This Time Slice Matters

18.7% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 24.6% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

The most common quality-control signal is rater calibration (2.2% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
May 28, 2025 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Jailbreak success rate

Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
May 23, 2025 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
May 28, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Cost

Incentivizing Strong Reasoning from Weak Supervision
May 26, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Cost

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
May 28, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy, Perplexity

Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
May 28, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments	May 28, 2025	Automatic Metrics	RTC-Bench	Jailbreak success rate	Not reported
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods	May 23, 2025	Automatic Metrics	TruthfulQA	Accuracy	Not reported
Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation	May 28, 2025	Automatic Metrics	Not reported	Cost	Not reported
Incentivizing Strong Reasoning from Weak Supervision	May 26, 2025	Automatic Metrics	Not reported	Cost	Not reported
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation	May 28, 2025	Automatic Metrics	Not reported	Accuracy, Perplexity	Not reported
Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds	May 28, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning	May 27, 2025	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction	May 26, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Inference-time Alignment in Continuous Space	May 26, 2025	Not reported	AdvBench	Not reported	Not reported
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases	May 23, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
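
The advice to prioritize papers with both benchmark and metric anchors can be applied directly to the Protocol Matrix. The sketch below is one possible way to do that in Python; the shortened paper labels and the helper has_both_anchors are illustrative assumptions, and the benchmark and metric values are transcribed from the matrix rows.

```python
NOT_REPORTED = "Not reported"

# (paper label, benchmarks, metrics) transcribed from the Protocol Matrix.
# Paper labels are shortened here for readability (assumption).
rows = [
    ("RedTeamCUA", "RTC-Bench", "Jailbreak success rate"),
    ("Model Immunization", "TruthfulQA", "Accuracy"),
    ("Reviewing Scientific Papers", NOT_REPORTED, "Cost"),
    ("Incentivizing Strong Reasoning", NOT_REPORTED, "Cost"),
    ("Bayesian Attention Mechanism", NOT_REPORTED, "Accuracy, Perplexity"),
    ("Flying Pigs, FaR and Beyond", NOT_REPORTED, "Accuracy"),
    ("RefTool", NOT_REPORTED, "Accuracy, Cost"),
    ("VLM-3R", NOT_REPORTED, "Accuracy"),
    ("Inference-time Alignment", "AdvBench", NOT_REPORTED),
    ("BiomedSQL", NOT_REPORTED, "Accuracy"),
]

def has_both_anchors(row):
    """True when a row names at least one benchmark and one metric."""
    _, benchmarks, metrics = row
    return benchmarks != NOT_REPORTED and metrics != NOT_REPORTED

anchored = [row[0] for row in rows if has_both_anchors(row)]
print(anchored)
```

Only two of the ten rows pass this filter, which fits the page's caution that this slice is best treated as an early signal.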

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (18.7% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (2.2% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (7.5% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (19.4% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (8.2% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (7.5% vs 35% target).
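
To see which checklist gaps are largest, the observed and target percentages can be ranked by shortfall. The following Python sketch uses the absolute difference in percentage points, which is an assumption; the page does not say how it weighs the gaps, and the function name shortfall is hypothetical.

```python
# (label, observed %, target %) taken from the checklist.
gaps = [
    ("explicit human feedback", 18.7, 45.0),
    ("quality controls", 2.2, 30.0),
    ("benchmarks/datasets", 7.5, 35.0),
    ("evaluation metrics", 19.4, 35.0),
    ("rater population", 8.2, 35.0),
    ("annotation unit", 7.5, 35.0),
]

def shortfall(observed: float, target: float) -> float:
    """Percentage points by which observed coverage misses the target."""
    return round(target - observed, 1)

# Largest shortfall first.
ranked = sorted(gaps, key=lambda g: shortfall(g[1], g[2]), reverse=True)

for label, observed, target in ranked:
    points = shortfall(observed, target)
    print(f"{label:24s} {observed:5.1f}% vs {target:4.1f}% target, short by {points:.1f} points")
```

Under this measure, quality-control reporting is the widest gap and metric naming the narrowest.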

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.2% coverage).
Annotation unit is under-specified (7.5% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AdvBench vs ALFWorld) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
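
For the inter-annotator agreement check, Cohen's kappa is one commonly used statistic for two raters. The page does not prescribe a specific measure, so this choice is an assumption. A minimal pure-Python sketch, with hypothetical toy labels:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pairwise-preference labels from two annotators.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for chance, so it is more informative than percent agreement when one label dominates.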

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: AdvBench
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (33)
LLM-as-Judge (3)
Simulation Env (3)

Top Metrics

Accuracy (18)
Cost (6)
Recall (3)
Jailbreak success rate (2)

Top Benchmarks

AdvBench (1)
ALFWorld (1)
DROP (1)
HotpotQA (1)

Quality Controls

Calibration (3)

Papers In This Archive Slice

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Yuwen Tan, Yuan Qing, Boqing Gong · May 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi · May 30, 2025 · Citations: 0
Online Fair Division with Additional Information
Tzeh Yuan Neoh, Jannik Peters, Nicholas Teh · May 30, 2025 · Citations: 0

We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be allocated irrevocably.
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025 · Citations: 0

To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving
Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang · May 29, 2025 · Citations: 0

We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows.
Probing Association Biases in LLM Moderation Over-Sensitivity
Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi · May 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Formula-R1: Incentivizing LLM Reasoning over Complex Tables with Numerical Computation via Formula-Driven Reinforcement Learning
Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou · May 29, 2025 · Citations: 0

Long Horizon

We demonstrate the effectiveness of Formula Tuning through extensive experiments on seven table reasoning benchmarks.
AJF: Adaptive Jailbreak Framework Based on the Comprehension Ability of Black-Box Large Language Models
Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao · May 29, 2025 · Citations: 0

Red Team

Building on this insight, we propose an Adaptive Jailbreak Framework (AJF) based on the comprehension ability of black-box large language models.
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025 · Citations: 0

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0

Pairwise Preference

Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
StressTest: Can YOUR Speech LM Handle the Stress?
Iddo Yosha, Gallil Maimon, Yossi Adi · May 28, 2025 · Citations: 0

Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs.
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi · May 28, 2025 · Citations: 0
Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
Anish R Joishy, Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Manas Gaur, Krishnaprasad Thirunarayan · May 28, 2025 · Citations: 0

Evaluation of 11 LLMs across six diverse reasoning datasets reveals a consistent failure: model accuracy plummets by an average of 14% in counterfactual scenarios compared to knowledge-aligned ones.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan · May 28, 2025 · Citations: 0

Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines.
Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Tianmai M. Zhang, Neil F. Abernethy · May 28, 2025 · Citations: 0

Expert Verification

However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Red Team · Web Browsing

Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
VeriTrail: Closed-Domain Hallucination Detection with Traceability
Dasha Metropolitansky, Jonathan Larson · May 27, 2025 · Citations: 0
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang · May 27, 2025 · Citations: 0
How Does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective
Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu · May 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025 · Citations: 0

Tool Use

Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable…
Augmenting Research Ideation with Data: An Empirical Investigation in Social Science
Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang · May 27, 2025 · Citations: 0
RPM: Reasoning-Level Personalization for Black-Box Large Language Models
Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee · May 27, 2025 · Citations: 0

Pairwise Preference

While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences.
Generalizable Heuristic Generation Through LLMs with Meta-Optimization
Yiding Shi, Jianan Zhou, Wen Song, Jieyi Bi, Yaoxin Wu · May 27, 2025 · Citations: 0
Tracing and Reversing Edits in LLMs
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer · May 27, 2025 · Citations: 0
Do LLMs Understand Collaborative Signals? Diagnosis and Repair
Shahrooz Pouryousef, Ali Montazeralghaem · May 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Wideband RF Radiance Field Modeling Using Frequency-embedded 3D Gaussian Splatting
Zechen Li, Lanqing Yang, Yiheng Bian, Hao Pan, Yongjian Fu · May 27, 2025 · Citations: 0
PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025 · Citations: 0

Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025 · Citations: 0

Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen · May 26, 2025 · Citations: 0

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence.
Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko · May 26, 2025 · Citations: 0
Token Distillation: Attention-aware Input Embeddings For New Tokens
Konstantin Dobler, Desmond Elliott, Gerard de Melo · May 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ERC-SVD: Error-Controlled SVD for Large Language Model Compression
Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang · May 26, 2025 · Citations: 0
Inference-time Alignment in Continuous Space
Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao · May 26, 2025 · Citations: 0

Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility.
Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0

Demonstrations

Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang · May 26, 2025 · Citations: 0

Critique Edit

To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision.
Types of Relations: Defining Analogies with Category Theory
Claire Ott, Frank Jäkel · May 26, 2025 · Citations: 0

In order to behave intelligently both humans and machines have to represent their knowledge adequately for how it is used.
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel · May 26, 2025 · Citations: 0

Pairwise Preference

We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap.
Graceful Forgetting in Generative Language Models
Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue · May 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu · May 26, 2025 · Citations: 0
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments
Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake · May 25, 2025 · Citations: 0
LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen · May 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Do LLMs have a Gender (Entropy) Bias?
Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta · May 24, 2025 · Citations: 0

We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across…
Disentangling Knowledge Representations for Large Language Model Editing
Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren · May 24, 2025 · Citations: 0

To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song · May 24, 2025 · Citations: 0

To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities.
Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling
Jarrod Ragsdale, Rajendra Boppana · May 23, 2025 · Citations: 0
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak · May 23, 2025 · Citations: 0

Long Horizon

We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0

Demonstrations

We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha · May 23, 2025 · Citations: 0

Pairwise Preference

Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods.
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang · May 23, 2025 · Citations: 0
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025 · Citations: 0

In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning…
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0

On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO.
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0

Red Team

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang · May 22, 2025 · Citations: 0
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks
Jianing Geng, Biao Yi, Zekun Fei, Ruiqi He, Lihai Nie · May 22, 2025 · Citations: 0
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng · May 22, 2025 · Citations: 0

The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
Dynamic Token Reweighting for Robust Vision-Language Models
Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma · May 22, 2025 · Citations: 0

Red Team

Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails.
