SOTAVerified

Benchmarking

Papers

Showing 526–550 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0 |
| ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | | 0 |
| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Code | 1 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Code | 2 |
| Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | | 0 |
| The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis | | 0 |
| Token Sequence Compression for Efficient Multimodal Computing | | 0 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1 |
| Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies | | 0 |
| QuantBench: Benchmarking AI Methods for Quantitative Investment | | 0 |
| From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0 |
| MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark | Code | 0 |
| LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Code | 1 |
| Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations | | 0 |
| Benchmarking machine learning models for predicting aerofoil performance | | 0 |
| Fluorescence Reference Target Quantitative Analysis Library | Code | 0 |
| CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents | | 0 |
| Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | | 0 |
| A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |