SOTAVerified

Benchmarking

Papers

Showing 726-750 of 5548 papers

Title | Status | Hype
QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs | Code | 1
A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough? |  | 0
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models | Code | 4
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph |  | 0
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination | Code | 1
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs |  | 0
Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI |  | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding |  | 0
Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks |  | 0
Language-based Image Colorization: A Benchmark and Beyond | Code | 0
Kolmogorov-Arnold Network for Transistor Compact Modeling |  | 0
Benchmarking Large Language Models for Handwritten Text Recognition |  | 0
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Code | 2
SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes |  | 0
ImputeGAP: A Comprehensive Library for Time Series Imputation |  | 0
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack |  | 0
COPA: Comparing the Incomparable to Explore the Pareto Front |  | 0
ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models |  | 0
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System | Code | 1
Benchmarking Failures in Tool-Augmented Language Models | Code | 0
HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard |  | 0
Stable Virtual Camera: Generative View Synthesis with Diffusion Models |  | 0
Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis | Code | 0
Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering |  | 0
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models | Code | 0
Page 30 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified