Each key protocol field shows its extraction state, confidence band, and data source, so you can decide whether to trust it directly or validate it against the full text.
Human Feedback Types
strong Pairwise Preference
Confidence: High Source: Persisted extraction evidenced
Directly usable for protocol triage.
Evidence snippet: We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95.
Evaluation Modes
strong Automatic Metrics
Confidence: High Source: Persisted extraction evidenced
Includes extracted eval setup.
Quality Controls
missing Not reported
Confidence: Low Source: Persisted extraction missing
No explicit quality controls found.
Benchmarks / Datasets
strong SemEval
Confidence: High Source: Persisted extraction evidenced
Useful for quick benchmark comparison.
Reported Metrics
strong Accuracy
Confidence: High Source: Persisted extraction evidenced
Useful for evaluation criteria comparison.
Rater Population
missing Unknown
Confidence: Low Source: Persisted extraction missing
Rater source not explicitly reported.
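The trust-or-validate decision described at the top of this section can be sketched as a small filter over the fields listed above. This is a minimal illustration, not the page's actual logic. The `ProtocolField` class, the `needs_full_text_check` function, and the rule itself (trust a field directly only when its state is "strong" and its confidence is "High") are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolField:
    """Hypothetical record mirroring one field block on this card."""
    name: str
    state: str       # extraction state as shown: "strong" or "missing"
    value: str
    confidence: str  # confidence band as shown: "High" or "Low"
    source: str


def needs_full_text_check(field: ProtocolField) -> bool:
    # Assumed rule: only strong, high-confidence fields are trusted directly.
    return not (field.state == "strong" and field.confidence == "High")


# Values copied from the fields listed above.
fields = [
    ProtocolField("Human Feedback Types", "strong", "Pairwise Preference",
                  "High", "Persisted extraction evidenced"),
    ProtocolField("Evaluation Modes", "strong", "Automatic Metrics",
                  "High", "Persisted extraction evidenced"),
    ProtocolField("Quality Controls", "missing", "Not reported",
                  "Low", "Persisted extraction missing"),
    ProtocolField("Benchmarks / Datasets", "strong", "SemEval",
                  "High", "Persisted extraction evidenced"),
    ProtocolField("Reported Metrics", "strong", "Accuracy",
                  "High", "Persisted extraction evidenced"),
    ProtocolField("Rater Population", "missing", "Unknown",
                  "Low", "Persisted extraction missing"),
]

# Names of the fields that should be checked against the full text.
to_validate = [f.name for f in fields if needs_full_text_check(f)]
print(to_validate)  # ['Quality Controls', 'Rater Population']
```

Under this assumed rule, only Quality Controls and Rater Population are flagged for validation against the full text.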