HFEPX Hub

General + Pairwise Preference Papers

Updated from current HFEPX corpus (Feb 27, 2026). 43 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 43 Last published: Feb 26, 2026 Global RSS Tag RSS

GeneralPairwise Preference

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 43 papers for General + Pairwise Preference Papers. Dominant protocol signals include automatic metrics, human evaluation, simulation environments, with frequent benchmark focus on Retrieval, LMSYS Chatbot Arena and metric focus on accuracy, agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems , CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
automatic metrics appears in 76.7% of papers in this hub.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems , CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering , Validating Political Position Predictions of Arguments , Moral Preferences of LLMs Under Directed Contextual Influence , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

Protocol Takeaways

Most common quality-control signal is rater calibration (4.7% of papers).

Evidence: Who can we trust? LLM-as-a-jury for Comparative Assessment , Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language , Validating Political Position Predictions of Arguments , Moral Preferences of LLMs Under Directed Contextual Influence
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.

Evidence: CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning , Simplifying Outcomes of Language Model Component Analyses with ELIA , Moral Preferences of LLMs Under Directed Contextual Influence , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Evidence: Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback , Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language , Validating Political Position Predictions of Arguments , Moral Preferences of LLMs Under Directed Contextual Influence

Benchmark Interpretation

Retrieval appears in 16.3% of hub papers (7/43); use this cohort for benchmark-matched comparisons.
LMSYS Chatbot Arena appears in 7% of hub papers (3/43); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 14% of hub papers (6/43); compare with a secondary metric before ranking methods.
agreement is reported in 7% of hub papers (3/43); compare with a secondary metric before ranking methods.

Researcher Checklist

Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (11.6% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (30.2% vs 35% target).
Tighten coverage on Papers naming evaluation metrics. Coverage is usable but incomplete (32.6% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (7% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (34.9% vs 35% target).

Papers with explicit human feedback

Coverage is strong (100% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (11.6% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (30.2% vs 35% target).

Papers naming evaluation metrics

Coverage is usable but incomplete (32.6% vs 35% target).

Papers with known rater population

Coverage is a replication risk (7% vs 35% target).

Papers with known annotation unit

Coverage is usable but incomplete (34.9% vs 35% target).

Known Limitations

Only 11.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=1, left_only=5, right_only=0

1 papers use both Human Eval and Llm As Judge.

human_eval vs automatic_metrics

both=0, left_only=6, right_only=33

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=33

0 papers use both Llm As Judge and Automatic Metrics.

Benchmark Brief

Retrieval

Coverage: 7 papers (16.3%)

7 papers (16.3%) mention Retrieval.

Examples: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering , Validating Political Position Predictions of Arguments , ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation

Benchmark Brief

LMSYS Chatbot Arena

Coverage: 3 papers (7%)

3 papers (7%) mention LMSYS Chatbot Arena.

Examples: SCOPE: Selective Conformal Optimized Pairwise LLM Judging , Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty , A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

Benchmark Brief

AlpacaEval

Coverage: 1 papers (2.3%)

1 papers (2.3%) mention AlpacaEval.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Metric Brief

accuracy

Coverage: 6 papers (14%)

6 papers (14%) mention accuracy.

Examples: DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , CAMEL: Confidence-Gated Reflection for Reward Modeling , Modeling Distinct Human Interaction in Web Agents

Metric Brief

agreement

Coverage: 3 papers (7%)

3 papers (7%) mention agreement.

Examples: Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language , Validating Political Position Predictions of Arguments , HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Metric Brief

calibration

Coverage: 1 papers (2.3%)

1 papers (2.3%) mention calibration.

Examples: Who can we trust? LLM-as-a-jury for Comparative Assessment

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Moral Preferences of LLMs Under Directed Contextual Influence , DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li · Feb 24, 2026 · Citations: 0

Pairwise PreferenceRed Team Automatic Metrics

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs).
Probing Graph Neural Network Activation Patterns Through Graph Topology
Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
Chenyang Zhao, Vinny Cahill, Ivana Dusparic · Feb 24, 2026 · Citations: 0

Pairwise PreferenceRlaif Or Synthetic Feedback Human Eval

Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes.
CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics

Reward models play a fundamental role in aligning large language models with human preferences.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval

One annotator pair achieved almost perfect agreement ($κ= 0.8743$; $93.8\%$ raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0

Pairwise Preference Human Eval

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
Simplifying Outcomes of Language Model Component Analyses with ELIA
Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.
Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Simulation Env Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
Learning Personalized Agents from Human Feedback
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
How to Train Your Long-Context Visual Document Model
Austin Veselka · Feb 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performanc
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human Eval Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., ``Dark--Bright'').
SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Human Eval

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion str
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics

Reinforcement Learning with Verifiable reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs).
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human EvalLlm As Judge

Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System
Anantha Sharma · Jan 3, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity in
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Jonathan Kamp, Roos Bakker, Dominique Blok · Dec 11, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics.
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.
Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Aligning large language models with human preferences is critical for creating reliable and controllable AI systems.
Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025 · Citations: 0

Pairwise Preference Simulation Env

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun · Jun 30, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0

Pairwise Preference Simulation Env

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reason
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
Evaluating the Diversity and Quality of LLM Generated Content
Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin · Apr 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling
Kaleel Mahmood, Shaoyi Huang · Dec 8, 2024 · Citations: 0

Pairwise Preference Automatic Metrics

One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences.

General + Pairwise Preference Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs