MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Automatic Metrics · General · Red Team · Web Browsing

Abstract

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
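The dual-metric distinction in the abstract can be made concrete with a small sketch. The abstract names only two of the five judge levels (Compliance and Partial Compliance); the other three label names below, and the function name `attack_success_rates`, are hypothetical placeholders for illustration and are not taken from the MUSE codebase.

```python
from collections import Counter
from typing import Iterable, Tuple

# Five-level taxonomy. Only "compliance" and "partial_compliance" are named
# in the abstract; the remaining three labels are hypothetical placeholders.
LABELS = (
    "compliance",
    "partial_compliance",
    "deflection",      # hypothetical
    "soft_refusal",    # hypothetical
    "hard_refusal",    # hypothetical
)


def attack_success_rates(verdicts: Iterable[str]) -> Tuple[float, float]:
    """Return (hard_asr, soft_asr) for a collection of per-run judge verdicts.

    hard ASR counts Compliance only; soft ASR also counts Partial Compliance.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one judged run")
    unknown = set(verdicts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown judge labels: {sorted(unknown)}")

    counts = Counter(verdicts)
    total = len(verdicts)
    hard_asr = counts["compliance"] / total
    soft_asr = (counts["compliance"] + counts["partial_compliance"]) / total
    return hard_asr, soft_asr


if __name__ == "__main__":
    runs = ["compliance", "partial_compliance", "hard_refusal", "hard_refusal"]
    hard, soft = attack_success_rates(runs)
    print(f"hard ASR = {hard:.2f}, soft ASR = {soft:.2f}")
```

By construction soft ASR is never lower than hard ASR, and the gap between the two is the share of runs that the judge labels Partial Compliance, which is the leakage a binary metric would miss.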

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

If you are doing eval pipeline work, start here:

Human Eval Hub · LLM-as-Judge Hub · Pairwise Preference Hub · Tool-Use Eval Hub

Human Data Lens

Uses human feedback: Yes
Feedback types: Red Team
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General
Extraction source: Persisted extraction

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: Web Browsing
Quality controls: Not reported
Confidence: 0.70
Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

success rate, jailbreak success rate

Research Brief

Deterministic synthesis

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. HFEPX signals include Red Team, Automatic Metrics, Web Browsing with confidence 0.70. Updated from current HFEPX corpus.

Generated Mar 4, 2026, 5:52 AM · Grounded in abstract + metadata only

Key Takeaways

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether…
We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack…

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (success rate, jailbreak success rate).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design · agent eval benchmark comparison · inter-rater agreement adjudication

Research Summary

Contribution Summary

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic…
To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation.
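Per-turn modality rotation, as described for ITMS, can be pictured as a schedule that assigns one input modality to each turn of a multi-turn run. The abstract does not specify the rotation order, so the round-robin order below is an assumption, and the names `rotate_modalities` and `plan_turns` are hypothetical. The turn prompts are neutral placeholders.

```python
from itertools import cycle, islice
from typing import Dict, List, Sequence

# The four modalities mentioned in the abstract (text plus audio, image, video).
# A fixed round-robin order is an assumption made for this sketch.
MODALITIES = ("text", "audio", "image", "video")


def rotate_modalities(
    num_turns: int,
    modalities: Sequence[str] = MODALITIES,
    start: int = 0,
) -> List[str]:
    """Return one modality per turn, cycling through `modalities` from `start`."""
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    if not modalities:
        raise ValueError("need at least one modality")
    offset = start % len(modalities)
    order = list(modalities[offset:]) + list(modalities[:offset])
    return list(islice(cycle(order), num_turns))


def plan_turns(turn_prompts: Sequence[str]) -> List[Dict[str, object]]:
    """Pair each turn's prompt with the modality it would be rendered in."""
    schedule = rotate_modalities(len(turn_prompts))
    return [
        {"turn": i + 1, "modality": modality, "prompt": prompt}
        for i, (modality, prompt) in enumerate(zip(schedule, turn_prompts))
    ]


if __name__ == "__main__":
    for step in plan_turns(["turn-1 text", "turn-2 text", "turn-3 text"]):
        print(step)
```

Keeping the schedule separate from prompt generation means the same multi-turn strategy can be run with and without rotation, which is the kind of comparison an ablation over modality effects would need.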

Why It Matters For Eval

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic…

Researcher Checklist

Pass: Human feedback protocol is explicit
Detected: Red Team

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Gap: Benchmark or dataset anchors are present
No benchmark/dataset anchor extracted from abstract.

Pass: Metric reporting is present
Detected: success rate, jailbreak success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments · Protocol Overlap

Citations: 0 · Relevance: 8.70 · Shared tag: Red Team · Shared tag: Web Browsing
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup
- Shared metric mentions

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking · Protocol Overlap

Citations: 0 · Relevance: 5.90 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs · Protocol Overlap

Citations: 0 · Relevance: 5.90 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

Reasoning Up the Instruction Ladder for Controllable Language Models · Protocol Overlap

Citations: 0 · Relevance: 5.90 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

What Matters For Safety Alignment? · Protocol Overlap

Citations: 0 · Relevance: 5.90 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment · Protocol Overlap

Citations: 0 · Relevance: 5.90 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation · Protocol Overlap

Citations: 0 · Relevance: 4.60 · Shared tag: Web Browsing
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

Go-Browse: Training Web Agents with Structured Exploration · Protocol Overlap

Citations: 0 · Relevance: 4.60 · Shared tag: Web Browsing
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol

A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
