SOTAVerified

Benchmarking

Papers

Showing 3301–3350 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering | | 0 |
| Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground | | 0 |
| Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks | Code | 0 |
| Classification of the Fashion-MNIST Dataset on a Quantum Computer | | 0 |
| Model Lakes | | 0 |
| a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification | Code | 0 |
| A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds | | 0 |
| SINDy vs Hard Nonlinearities and Hidden Dynamics: a Benchmarking Study | | 0 |
| Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms | | 0 |
| Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking | | 0 |
| Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | | 0 |
| Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance | | 0 |
| The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition | | 0 |
| FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking | Code | 0 |
| Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Code | 0 |
| Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies | | 0 |
| The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Code | 0 |
| A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | | 0 |
| The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | | 0 |
| Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems | | 0 |
| Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | | 0 |
| Benchmarking LLMs on the Semantic Overlap Summarization Task | | 0 |
| Partial Rankings of Optimizers | Code | 0 |
| HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Code | 0 |
| E(3)-equivariant models cannot learn chirality: Field-based molecular generation | | 0 |
| Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | | 0 |
| Benchmarking Observational Studies with Experimental Data under Right-Censoring | | 0 |
| Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Code | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | | 0 |
| CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models | | 0 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Code | 0 |
| KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | | 0 |
| Synthetic location trajectory generation using categorical diffusion models | Code | 0 |
| FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation | | 0 |
| AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies | Code | 0 |
| Learning Disentangled Audio Representations through Controlled Synthesis | | 0 |
| VATr++: Choose Your Words Wisely for Handwritten Text Generation | | 0 |
| The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Code | 0 |
| Recommendations for Baselines and Benchmarking Approximate Gaussian Processes | | 0 |
| Multi-Fidelity Methods for Optimization: A Survey | | 0 |
| Large-scale Benchmarking of Metaphor-based Optimization Heuristics | | 0 |
| SAWEC: Sensing-Assisted Wireless Edge Computing | Code | 0 |
| Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data | | 0 |
| From Variability to Stability: Advancing RecSys Benchmarking Practices | Code | 0 |
| Evaluation of simulation methods for tumor subclonal reconstruction | | 0 |
| Design and Realization of a Benchmarking Testbed for Evaluating Autonomous Platooning Algorithms | | 0 |
| Benchmarking multi-component signal processing methods in the time-frequency plane | Code | 0 |
| Privacy-Preserving Language Model Inference with Instance Obfuscation | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |