SOTAVerified

Benchmarking

Papers

Showing 1351–1400 of 5548 papers

Title | Status | Hype
HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images | - | 0
Perspective on recent developments and challenges in regulatory and systems genomics | - | 0
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet | - | 0
Benchmarking Large Language Models with Integer Sequence Generation Tasks | - | 0
Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking | - | 0
Beemo: Benchmark of Expert-edited Machine-generated Outputs | Code | 0
SPINEX_ Symbolic Regression: Similarity-based Symbolic Regression with Explainable Neighbors Exploration | - | 0
TDDBench: A Benchmark for Training data detection | - | 0
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | - | 0
On the Loss of Context-awareness in General Instruction Fine-tuning | Code | 0
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Code | 3
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | Code | 2
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1
Imagining and building wise machines: The centrality of AI metacognition | - | 0
Benchmarking XAI Explanations with Human-Aligned Evaluations | - | 0
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Code | 1
TableGPT2: A Large Multimodal Model with Tabular Data Integration | Code | 4
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving | Code | 1
SinaTools: Open Source Toolkit for Arabic Natural Language Processing | - | 0
FEET: A Framework for Evaluating Embedding Techniques | Code | 0
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models | - | 0
Artificial Intelligence for Microbiology and Microbiome Research | - | 0
A Review of Reinforcement Learning in Financial Applications | - | 0
Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations | - | 0
Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model | - | 0
MIRFLEX: Music Information Retrieval Feature Library for Extraction | Code | 1
Benchmarking Bias in Large Language Models during Role-Playing | - | 0
Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing | Code | 0
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Code | 1
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | Code | 2
IdeaBench: Benchmarking Large Language Models for Research Idea Generation | Code | 0
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Code | 1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Code | 1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1
Benchmark Data Repositories for Better Benchmarking | - | 0
XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM | Code | 3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Code | 3
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1
CALE: Continuous Arcade Learning Environment | Code | 7
Low-Density 3D Point Cloud Classification | - | 0
Survey of Cultural Awareness in Language Models: Text and Beyond | Code | 1
NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation | Code | 0
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | - | 0
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes | - | 0
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Evaluating Cultural and Social Awareness of LLM Web Agents | - | 0
Page 28 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified