SOTAVerified

Benchmarking

Papers

Showing 501-550 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking | | 0 |
| PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach | | 0 |
| Overview and practical recommendations on using Shapley Values for identifying predictive biomarkers via CATE modeling | | 0 |
| EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models | Code | 0 |
| Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging | | 0 |
| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | Code | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | | 0 |
| EnronQA: Towards Personalized RAG over Private Documents | | 0 |
| InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method | | 0 |
| Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook | Code | 2 |
| AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring | | 0 |
| MINERVA: Evaluating Complex Video Reasoning | Code | 2 |
| GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1 |
| Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework | | 0 |
| From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising | | 0 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | | 0 |
| Galvatron: An Automatic Distributed System for Efficient Foundation Model Training | | 0 |
| Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking | | 0 |
| The Leaderboard Illusion | | 0 |
| OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Code | 1 |
| Hydra: Marker-Free RGB-D Hand-Eye Calibration | | 0 |
| TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks | Code | 1 |
| On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | | 0 |
| SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | | 0 |
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0 |
| ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | | 0 |
| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Code | 1 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Code | 2 |
| Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | | 0 |
| The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis | | 0 |
| Token Sequence Compression for Efficient Multimodal Computing | | 0 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1 |
| Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies | | 0 |
| QuantBench: Benchmarking AI Methods for Quantitative Investment | | 0 |
| From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0 |
| MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark | Code | 0 |
| LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Code | 1 |
| Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations | | 0 |
| Benchmarking machine learning models for predicting aerofoil performance | | 0 |
| Fluorescence Reference Target Quantitative Analysis Library | Code | 0 |
| CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents | | 0 |
| Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | | 0 |
| A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs | | 0 |
Page 11 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |