SOTAVerified

Benchmarking

Papers

Showing 651–675 of 5548 papers

Title | Status | Hype
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Code | 1
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Code | 1
Restore Anything Model via Efficient Degradation Adaptation | Code | 1
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
Separable Operator Networks | Code | 1
When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Code | 1
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Code | 1
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
Training on the Test Task Confounds Evaluation and Emergence | Code | 1
OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Replication in Visual Diffusion Models: A Survey and Outlook | Code | 1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Code | 1
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Code | 1
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Occlusion-Aware Seamless Segmentation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified