Benchmarking

Papers

Showing 2501–2525 of 5548 papers

Title	Date	Tasks	Status	Hype
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse	Feb 15, 2024	Benchmarking, Model Editing	Code Available	0
Large-scale Benchmarking of Metaphor-based Optimization Heuristics	Feb 15, 2024	Benchmarking, Experimental Design	Unverified	0
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator	Feb 15, 2024	Benchmarking, Diagnostic	Code Available	2
Multi-Fidelity Methods for Optimization: A Survey	Feb 15, 2024	Benchmarking, Computational Efficiency	Unverified	0
Recommendations for Baselines and Benchmarking Approximate Gaussian Processes	Feb 15, 2024	Benchmarking, Gaussian Processes	Unverified	0
Evaluation of simulation methods for tumor subclonal reconstruction	Feb 14, 2024	Benchmarking	Unverified	0
Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking	Feb 14, 2024	Benchmarking, Language Modelling	Code Available	1
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models	Feb 14, 2024	Benchmarking, Diversity	Code Available	2
Design and Realization of a Benchmarking Testbed for Evaluating Autonomous Platooning Algorithms	Feb 14, 2024	Autonomous Driving, Benchmarking	Unverified	0
Benchmarking multi-component signal processing methods in the time-frequency plane	Feb 13, 2024	Benchmarking, Denoising	Code Available	0
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents	Feb 13, 2024	Benchmarking, Model Selection	Code Available	2
Privacy-Preserving Language Model Inference with Instance Obfuscation	Feb 13, 2024	Benchmarking, Language Modeling	Unverified	0
BdSLW60: A Word-Level Bangla Sign Language Dataset	Feb 13, 2024	Benchmarking, Gesture Recognition	Code Available	0
EvoGPT-f: An Evolutionary GPT Framework for Benchmarking Formal Math Languages	Feb 12, 2024	Automated Theorem Proving, Benchmarking	Unverified	0
Customizable Perturbation Synthesis for Robust SLAM Benchmarking	Feb 12, 2024	Benchmarking, Simultaneous Localization and Mapping	Code Available	2
Impact of spatial transformations on landscape features of CEC2022 basic benchmark problems	Feb 12, 2024	Benchmarking	Unverified	0
Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT	Feb 12, 2024	Benchmarking, Chunking	Unverified	0
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension	Feb 12, 2024	2k, Automatic Speech Recognition	Code Available	2
Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A Benchmarking Study	Feb 11, 2024	Anomaly Detection, Benchmarking	Code Available	0
Explainable Global Wildfire Prediction Models using Graph Neural Networks	Feb 11, 2024	Benchmarking, Community Detection	Code Available	1
ProtIR: Iterative Refinement between Retrievers and Predictors for Protein Function Annotation	Feb 10, 2024	Benchmarking, Language Modeling	Unverified	0
Estimating the Effect of Crosstalk Error on Circuit Fidelity Using Noisy Intermediate-Scale Quantum Devices	Feb 10, 2024	Benchmarking	Unverified	0
Improving 2D-3D Dense Correspondences with Diffusion Models for 6D Object Pose Estimation	Feb 9, 2024	6D Pose Estimation using RGB, Benchmarking	Unverified	0
Retrieve, Merge, Predict: Augmenting Tables with Data Lakes	Feb 9, 2024	AutoML, Benchmarking	Code Available	1
LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education	Feb 9, 2024	Benchmarking, Chatbot	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified