SOTAVerified

Benchmarking

Papers

Showing 1901-1950 of 5548 papers

Title | Status | Hype
Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | Code | 0
EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models | Code | 0
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | - | 0
EnronQA: Towards Personalized RAG over Private Documents | - | 0
InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method | - | 0
AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring | - | 0
Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework | - | 0
From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising | - | 0
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training | - | 0
Sadeed: Advancing Arabic Diacritization Through Small Language Model | - | 0
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0
SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | - | 0
LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | - | 0
Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking | - | 0
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | - | 0
Hydra: Marker-Free RGB-D Hand-Eye Calibration | - | 0
The Leaderboard Illusion | - | 0
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | - | 0
BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | - | 0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | - | 0
ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | - | 0
Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | - | 0
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | - | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis | - | 0
QuantBench: Benchmarking AI Methods for Quantitative Investment | - | 0
Token Sequence Compression for Efficient Multimodal Computing | - | 0
Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies | - | 0
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0
MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark | Code | 0
Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations | - | 0
Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room | - | 0
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents | - | 0
Fluorescence Reference Target Quantitative Analysis Library | Code | 0
A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs | - | 0
Benchmarking machine learning models for predicting aerofoil performance | - | 0
Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | - | 0
Establishing Reliability Metrics for Reward Models in Large Language Models | - | 0
Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture | - | 0
Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | - | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays | - | 0
A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents | - | 0
Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation | - | 0
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | - | 0
AI Idea Bench 2025: AI Research Idea Generation Benchmark | - | 0
LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers | - | 0
Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering | - | 0
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | - | 0
Page 39 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified