SOTAVerified

Benchmarking

Papers

Showing 1421-1430 of 5548 papers

Title | Status | Hype
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
Autonomous Reinforcement Learning: Formalism and Benchmarking | Code | 1
CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified