SOTAVerified

Benchmarking

Papers

Showing 926-950 of 5548 papers

Title | Status | Hype
Benchmarking deep inverse models over time, and the neural-adjoint method | Code | 1
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification | Code | 1
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Benchmarking Deep Learning Interpretability in Time Series Predictions | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking Deep Models for Salient Object Detection | Code | 1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
High-Dimensional Inference in Bayesian Networks | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
Page 38 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified