SOTAVerified

Benchmarking

Papers

Showing 2451–2500 of 5548 papers

Title | Status | Hype
Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Code | 1
The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition | - | 0
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Code | 2
FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking | Code | 0
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Code | 0
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | - | 0
Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Code | 1
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies | - | 0
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Code | 0
Benchmarking Data Science Agents | Code | 1
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | - | 0
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Benchmarking LLMs on the Semantic Overlap Summarization Task | - | 0
Partial Rankings of Optimizers | Code | 0
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | - | 0
Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems | - | 0
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Code | 0
PST-Bench: Tracing and Benchmarking the Source of Publications | Code | 1
E(3)-equivariant models cannot learn chirality: Field-based molecular generation | - | 0
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | - | 0
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | - | 0
Benchmarking Observational Studies with Experimental Data under Right-Censoring | - | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Code | 1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Code | 0
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Code | 0
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models | - | 0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | - | 0
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | - | 0
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Benchmarking Retrieval-Augmented Generation for Medicine | Code | 4
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Synthetic location trajectory generation using categorical diffusion models | Code | 0
FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation | - | 0
Event-Based Motion Magnification | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies | Code | 0
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Code | 2
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2
VATr++: Choose Your Words Wisely for Handwritten Text Generation | - | 0
Learning Disentangled Audio Representations through Controlled Synthesis | - | 0
Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data | - | 0
Large-scale Benchmarking of Metaphor-based Optimization Heuristics | - | 0
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Code | 0
Page 50 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified