SOTAVerified

Benchmarking

Papers

Showing 1901–1925 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models | Code | 0 |
| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | Code | 0 |
| EnronQA: Towards Personalized RAG over Private Documents | | 0 |
| InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method | | 0 |
| AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring | | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | | 0 |
| From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising | | 0 |
| Galvatron: An Automatic Distributed System for Efficient Foundation Model Training | | 0 |
| Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework | | 0 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | | 0 |
| The Leaderboard Illusion | | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | | 0 |
| SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | | 0 |
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0 |
| On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | | 0 |
| Hydra: Marker-Free RGB-D Hand-Eye Calibration | | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0 |
| Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking | | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | | 0 |
| ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | | 0 |
| Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | | 0 |
| The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |