SOTAVerified

Benchmarking

Papers

Showing 976-1000 of 5548 papers

Title | Status | Hype
Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials | Code | 1
PaReprop: Fast Parallelized Reversible Backpropagation | Code | 1
KoLA: Carefully Benchmarking World Knowledge of Large Language Models | Code | 1
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics | Code | 1
Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling | Code | 1
TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain | Code | 1
Spatially Resolved Gene Expression Prediction from H&E Histology Images via Bi-modal Contrastive Learning | Code | 1
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models | Code | 1
Multilingual Conceptual Coverage in Text-to-Image Models | Code | 1
Improving and Benchmarking Offline Reinforcement Learning Algorithms | Code | 1
End-to-end Knowledge Retrieval with Multi-modal Queries | Code | 1
Accurate and Efficient Structural Ensemble Generation of Macrocyclic Peptides using Internal Coordinate Diffusion | Code | 1
IDToolkit: A Toolkit for Benchmarking and Developing Inverse Design Algorithms in Nanophotonics | Code | 1
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models | Code | 1
Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1
Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks | Code | 1
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | Code | 1
ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment | Code | 1
Exploring Large Language Models for Classical Philology | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified