Benchmarking

Papers


Showing 3301–3350 of 5548 papers

Title	Date	Tasks	Status
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering	Mar 5, 2024	Benchmarking, Code Generation	Unverified
Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground	Mar 4, 2024	Benchmarking	Unverified
Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks	Mar 4, 2024	Benchmarking	Code Available
Classification of the Fashion-MNIST Dataset on a Quantum Computer	Mar 4, 2024	Benchmarking, Quantum Machine Learning	Unverified
Model Lakes	Mar 4, 2024	Benchmarking, Management	Unverified
a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification	Mar 3, 2024	Benchmarking, Speaker Verification	Code Available
A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds	Mar 2, 2024	Benchmarking, Position	Unverified
SINDy vs Hard Nonlinearities and Hidden Dynamics: a Benchmarking Study	Mar 1, 2024	Benchmarking	Unverified
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms	Mar 1, 2024	Benchmarking, Stochastic Optimization	Unverified
Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking	Mar 1, 2024	Benchmarking, Imitation Learning	Unverified
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models	Mar 1, 2024	Benchmarking, Mathematical Reasoning	Unverified
Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance	Mar 1, 2024	Benchmarking, Stance Detection	Unverified
The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition	Feb 29, 2024	Action Unit Detection, Arousal Estimation	Unverified
FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking	Feb 28, 2024	Benchmarking, Inductive Learning	Code Available
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models	Feb 28, 2024	Benchmarking, Hallucination	Code Available
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies	Feb 27, 2024	Benchmarking, Systematic Generalization	Unverified
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns	Feb 27, 2024	Benchmarking, Binary Classification	Code Available
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images	Feb 27, 2024	Benchmarking, Defect Detection	Unverified
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection	Feb 27, 2024	Benchmarking	Unverified
Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems	Feb 26, 2024	Benchmarking, Evolutionary Algorithms	Unverified
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset	Feb 26, 2024	Benchmarking, Cross-Lingual Transfer	Unverified
Benchmarking LLMs on the Semantic Overlap Summarization Task	Feb 26, 2024	Benchmarking, Document Summarization	Unverified
Partial Rankings of Optimizers	Feb 26, 2024	Benchmarking	Code Available
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs	Feb 25, 2024	Benchmarking, Chatbot	Code Available
E(3)-equivariant models cannot learn chirality: Field-based molecular generation	Feb 24, 2024	Benchmarking, Graph Neural Network	Unverified
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs	Feb 24, 2024	Benchmarking, Knowledge Graphs	Unverified
Benchmarking Observational Studies with Experimental Data under Right-Censoring	Feb 23, 2024	Benchmarking	Unverified
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving	Feb 23, 2024	Benchmarking, Decision Making	Unverified
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data	Feb 22, 2024	Benchmarking	Code Available
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models	Feb 21, 2024	Benchmarking, Form	Code Available
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models	Feb 21, 2024	Benchmarking, Image to text	Unverified
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models	Feb 21, 2024	Benchmarking	Unverified
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms	Feb 21, 2024	Benchmarking, Hate Speech Detection	Code Available
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers	Feb 20, 2024	Benchmarking	Unverified
Synthetic location trajectory generation using categorical diffusion models	Feb 19, 2024	Benchmarking, Decision Making	Code Available
FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation	Feb 19, 2024	Benchmarking, Chatbot	Unverified
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies	Feb 19, 2024	Benchmarking	Code Available
Learning Disentangled Audio Representations through Controlled Synthesis	Feb 16, 2024	Benchmarking, Disentanglement	Unverified
VATr++: Choose Your Words Wisely for Handwritten Text Generation	Feb 16, 2024	Benchmarking, Text Generation	Unverified
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse	Feb 15, 2024	Benchmarking, Model Editing	Code Available
Recommendations for Baselines and Benchmarking Approximate Gaussian Processes	Feb 15, 2024	Benchmarking, Gaussian Processes	Unverified
Multi-Fidelity Methods for Optimization: A Survey	Feb 15, 2024	Benchmarking, Computational Efficiency	Unverified
Large-scale Benchmarking of Metaphor-based Optimization Heuristics	Feb 15, 2024	Benchmarking, Experimental Design	Unverified
SAWEC: Sensing-Assisted Wireless Edge Computing	Feb 15, 2024	Benchmarking, Edge-computing	Code Available
Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data	Feb 15, 2024	Benchmarking, Federated Learning	Unverified
From Variability to Stability: Advancing RecSys Benchmarking Practices	Feb 15, 2024	Benchmarking, Collaborative Filtering	Code Available
Evaluation of simulation methods for tumor subclonal reconstruction	Feb 14, 2024	Benchmarking	Unverified
Design and Realization of a Benchmarking Testbed for Evaluating Autonomous Platooning Algorithms	Feb 14, 2024	Autonomous Driving, Benchmarking	Unverified
Benchmarking multi-component signal processing methods in the time-frequency plane	Feb 13, 2024	Benchmarking, Denoising	Code Available
Privacy-Preserving Language Model Inference with Instance Obfuscation	Feb 13, 2024	Benchmarking, Language Modeling	Unverified



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified