SOTAVerified

Benchmarking

Papers

Showing 801-850 of 5548 papers

Title | Status | Hype
Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders | - | 0
Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation | Code | 0
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Code | 1
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | - | 0
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | - | 0
Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | - | 0
Benchmarking Reasoning Robustness in Large Language Models | - | 0
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets | - | 0
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Code | 0
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Code | 0
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Code | 0
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | - | 0
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions | Code | 0
InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference | - | 0
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | - | 0
Eventprop training for efficient neuromorphic applications | - | 0
Towards Universal Learning-based Model for Cardiac Image Reconstruction: Summary of the CMRxRecon2024 Challenge | - | 0
UnPuzzle: A Unified Framework for Pathology Image Analysis | Code | 1
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Code | 0
AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks | Code | 0
Technical report of a DMD-based Characterization Method for Vision Sensors | - | 0
A2Perf: Real-World Autonomous Agents Benchmark | - | 0
Optimizing open-domain question answering with graph-based retrieval augmented generation | - | 0
Evaluation of Architectural Synthesis Using Generative AI | - | 0
One ruler to measure them all: Benchmarking multilingual long-context language models | Code | 1
MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Code | 0
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Code | 1
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | - | 0
From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Code | 1
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | - | 0
Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization | - | 0
Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Code | 1
FunBench: Benchmarking Fundus Reading Skills of MLLMs | - | 0
MAPS: Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Infrastructure | - | 0
Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks | - | 0
A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information | - | 0
LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation | Code | 1
NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing | Code | 0
ProBench: Benchmarking Large Language Models in Competitive Programming | - | 0
Large Language Model-Based Benchmarking Experiment Settings for Evolutionary Multi-Objective Optimization | - | 0
PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice | - | 0
Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series | - | 0
Protein Structure Tokenization: Benchmarking and New Recipe | Code | 1
MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | - | 0
EgoNormia: Benchmarking Physical Social Norm Understanding | Code | 1
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Code | 3
ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments | - | 0
Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches | Code | 0
LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified