SOTAVerified

Benchmarking

Papers

Showing 1051–1100 of 5548 papers

Title | Status | Hype
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
3D Common Corruptions and Data Augmentation | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1
Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks | Code | 1
From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
2.5D Visual Relationship Detection | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
Page 22 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified