SOTAVerified

Benchmarking

Papers

Showing 1851-1875 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior | | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0 |
| ExEBench: Benchmarking Foundation Models on Extreme Earth Events | Code | 0 |
| Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030 | | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0 |
| The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong | | 0 |
| From raw affiliations to organization identifiers | Code | 0 |
| Benchmarking Retrieval-Augmented Generation for Chemistry | | 0 |
| Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems | | 0 |
| PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints | | 0 |
| Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs | | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | | 0 |
| Optimizing Recommendations using Fine-Tuned LLMs | | 0 |
| From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0 |
| Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA | | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0 |
| Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy | | 0 |
| Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents | | 0 |
| QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation | | 0 |
| clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations | | 0 |
| A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge | Code | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0 |
| Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters | | 0 |
| Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization | | 0 |
| Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective | Code | 0 |
Page 75 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |