SOTAVerified

Benchmarking

Papers

Showing 2551–2600 of 5548 papers

Title | Status | Hype
The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods | | 0
WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking | | 0
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation | Code | 0
A survey of probabilistic generative frameworks for molecular simulations | Code | 0
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset | Code | 0
HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere | | 0
A Survey on Vision Autoregressive Model | | 0
Evaluating the Generation of Spatial Relations in Text and Image Generative Models | | 0
BuckTales : A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes | | 0
Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation | Code | 0
Benchmarking LLMs' Judgments with No Gold Standard | Code | 0
MolMiner: Towards Controllable, 3D-Aware, Fragment-Based Molecular Design | | 0
Low Dynamic Range for RIS-aided Bistatic Integrated Sensing and Communication | | 0
Benchmarking Distributional Alignment of Large Language Models | Code | 0
Benchmarking 3D multi-coil NC-PDNet MRI reconstruction | | 0
FactLens: Benchmarking Fine-Grained Fact Verification | | 0
A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust, Reliable, and Safe Learning Techniques for Real-world Robotics | | 0
Open-set object detection: towards unified problem formulation and benchmarking | | 0
Deep Learning Models for UAV-Assisted Bridge Inspection: A YOLO Benchmark Analysis | | 0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | | 0
Perspective on recent developments and challenges in regulatory and systems genomics | | 0
HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images | | 0
Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries | | 0
Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet | | 0
Benchmarking Large Language Models with Integer Sequence Generation Tasks | | 0
Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale | | 0
Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking | | 0
Beemo: Benchmark of Expert-edited Machine-generated Outputs | Code | 0
SPINEX_ Symbolic Regression: Similarity-based Symbolic Regression with Explainable Neighbors Exploration | | 0
On the Loss of Context-awareness in General Instruction Fine-tuning | Code | 0
TDDBench: A Benchmark for Training data detection | | 0
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | | 0
Imagining and building wise machines: The centrality of AI metacognition | | 0
Benchmarking XAI Explanations with Human-Aligned Evaluations | | 0
SinaTools: Open Source Toolkit for Arabic Natural Language Processing | | 0
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models | | 0
FEET: A Framework for Evaluating Embedding Techniques | Code | 0
Artificial Intelligence for Microbiology and Microbiome Research | | 0
Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations | | 0
Benchmarking Bias in Large Language Models during Role-Playing | | 0
Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing | Code | 0
Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model | | 0
A Review of Reinforcement Learning in Financial Applications | | 0
IdeaBench: Benchmarking Large Language Models for Research Idea Generation | Code | 0
Benchmark Data Repositories for Better Benchmarking | | 0
NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation | Code | 0
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | | 0
Evaluating Cultural and Social Awareness of LLM Web Agents | | 0
Low-Density 3D Point Cloud Classification | | 0
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified