MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
2024-06-01
Full analysis loading… Code implementations, benchmark data, and reproduction guides are being assembled. Please check back shortly.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.