SOTAVerified

Benchmarking

Papers

Showing 1001-1050 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0 |
| SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | | 0 |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Code | 1 |
| Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Code | 0 |
| True Online TD-Replan(lambda) Achieving Planning through Replaying | | 0 |
| Evolving Hard Maximum Cut Instances for Quantum Approximate Optimization Algorithms | | 0 |
| Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0 |
| Unraveling the Capabilities of Language Models in News Summarization | Code | 0 |
| The iToBoS dataset: skin region images extracted from 3D total body photographs for lesion detection | Code | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0 |
| Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research | | 0 |
| SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Code | 2 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Code | 1 |
| Molecular-driven Foundation Model for Oncologic Pathology | Code | 4 |
| Benchmarking Quantum Convolutional Neural Networks for Signal Classification in Simulated Gamma-Ray Burst Detection | | 0 |
| Making Sense of Data in the Wild: Data Analysis Automation at Scale | | 0 |
| PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding | | 0 |
| A Benchmarking Environment for Worker Flexibility in Flexible Job Shop Scheduling Problems | | 0 |
| Transfer of Knowledge through Reverse Annealing: A Preliminary Analysis of the Benefits and What to Share | | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0 |
| Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation | | 0 |
| Benchmarking Quantum Reinforcement Learning | Code | 0 |
| GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0 |
| CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry | | 0 |
| Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets? | | 0 |
| Beyond Benchmarks: On The False Promise of AI Regulation | | 0 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Code | 7 |
| Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study | | 0 |
| Benchmarking global optimization techniques for unmanned aerial vehicle path planning | | 0 |
| MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Code | 3 |
| Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems | | 0 |
| The Karp Dataset | | 0 |
| Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Code | 2 |
| Enhancing Biomedical Relation Extraction with Directionality | Code | 1 |
| AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | | 0 |
| DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale | | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | | 0 |
| RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | | 0 |
| Leveraging LLMs to Create a Haptic Devices' Recommendation System | | 0 |
| Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities | | 0 |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0 |
| CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization | | 0 |
| Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) | | 0 |
| Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes | | 0 |
| Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning | | 0 |
| Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems | Code | 0 |
| Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing | | 0 |
| Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model | | 0 |
| Benchmarking Large Language Models via Random Variables | | 0 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |