SOTAVerified

Benchmarking

Papers

Showing 1501-1550 of 5548 papers

Title | Status | Hype
When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning | Code | 1
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | - | 0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | - | 0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | - | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | - | 0
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Code | 0
Identifying Money Laundering Subgraphs on the Blockchain | Code | 0
Benchmarking Agentic Workflow Generation | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
Audio Explanation Synthesis with Generative Foundation Models | Code | 0
Advocating Character Error Rate for Multilingual ASR Evaluation | - | 0
Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning | Code | 0
Towards Generalisable Time Series Understanding Across Domains | Code | 1
Analysis of different disparity estimation techniques on aerial stereo image datasets | - | 0
InAttention: Linear Context Scaling for Transformers | - | 0
OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB | - | 0
TuringQ: Benchmarking AI Comprehension in Theory of Computation | Code | 0
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding | - | 0
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Code | 2
M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes | - | 0
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Code | 3
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective | Code | 1
QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers | Code | 0
FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2
Active Evaluation Acquisition for Efficient LLM Benchmarking | - | 0
Manual Verbalizer Enrichment for Few-Shot Text Classification | - | 0
Benchmarking of a new data splitting method on volcanic eruption data | - | 0
Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems | - | 0
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild | Code | 1
Rule-based Data Selection for Large Language Models | - | 0
Precise Model Benchmarking with Only a Few Observations | - | 0
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Code | 2
Named Clinical Entity Recognition Benchmark | Code | 0
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models | Code | 0
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | - | 0
dattri: A Library for Efficient Data Attribution | Code | 2
Adjusting Pretrained Backbones for Performativity | Code | 0
Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | - | 0
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms | - | 0
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning | Code | 1
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | - | 0
TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Code | 0
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension | - | 0
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities | - | 0
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | - | 0
Benchmarking the Fidelity and Utility of Synthetic Relational Data | - | 0
PersoBench: Benchmarking Personalized Response Generation in Large Language Models | Code | 0
Ward: Provable RAG Dataset Inference via LLM Watermarks | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified