SOTAVerified

Benchmarking

Papers

Showing 1026–1050 of 5548 papers

Title | Status | Hype
Beyond Benchmarks: On The False Promise of AI Regulation |  | 0
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Code | 7
Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study |  | 0
Benchmarking global optimization techniques for unmanned aerial vehicle path planning |  | 0
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Code | 3
Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems |  | 0
The Karp Dataset |  | 0
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Code | 2
Enhancing Biomedical Relation Extraction with Directionality | Code | 1
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning |  | 0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale |  | 0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain |  | 0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF |  | 0
Leveraging LLMs to Create a Haptic Devices' Recommendation System |  | 0
Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities |  | 0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization |  | 0
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) |  | 0
Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes |  | 0
Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning |  | 0
Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems | Code | 0
Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing |  | 0
Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model |  | 0
Benchmarking Large Language Models via Random Variables |  | 0
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Code | 1
Page 42 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified