SOTAVerified

Benchmarking

Papers

Showing 2101–2150 of 5548 papers

Title | Status | Hype
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Code | 1
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications | | 0
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | | 0
VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric | Code | 0
Behavior Structformer: Learning Players Representations with Structured Tokenization | | 0
GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models | | 0
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Code | 3
Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity | Code | 0
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Scenarios and Approaches for Situated Natural Language Explanations | | 0
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation | | 0
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Code | 3
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking | | 0
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | | 0
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | | 0
Statistical Multicriteria Benchmarking via the GSD-Front | | 0
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Code | 4
Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Code | 0
Time Sensitive Knowledge Editing through Efficient Finetuning | | 0
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | | 0
MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices | | 0
BEADs: Bias Evaluation Across Domains | | 0
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising | Code | 1
Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation | | 0
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | | 0
CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1
Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance | Code | 0
ACCORD: Closing the Commonsense Measurability Gap | Code | 0
Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check | | 0
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset | Code | 0
Analyzing the Feature Extractor Networks for Face Image Synthesis | Code | 0
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Code | 0
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Code | 1
Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs | | 0
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | | 0
ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | | 0
LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | | 0
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Code | 1
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Code | 2
Scaffold Splits Overestimate Virtual Screening Performance | | 0
WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Code | 1
On the project risk baseline: integrating aleatory uncertainty into project scheduling | | 0
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1
SECURE: Benchmarking Large Language Models for Cybersecurity | Code | 1
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | | 0
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
CoSy: Evaluating Textual Explanations of Neurons | | 0
Page 43 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | – | Unverified
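The table above pairs a claimed metric with a verified (reproduced) value and a status; since no verified value is listed here, the claim is marked Unverified. A minimal sketch of that comparison logic, assuming a simple absolute-tolerance rule and status names of our own choosing (this is not SOTAVerified's actual pipeline):

```python
# Hypothetical sketch: derive a claim's status from a claimed metric and
# an optional reproduced ("verified") value. The tolerance and the
# "Disputed" label are assumptions for illustration only.
from typing import Optional


def verification_status(claimed: float, verified: Optional[float],
                        tol: float = 0.01) -> str:
    """Return a status string for a single benchmark claim."""
    if verified is None:
        return "Unverified"   # no reproduction run recorded yet
    if abs(claimed - verified) <= tol:
        return "Verified"     # reproduction matches the claim within tol
    return "Disputed"         # reproduction disagrees with the claim


# Example: GPT-4 Turbo's ACC claim of 0.56 with no reproduction yet
print(verification_status(0.56, None))  # → Unverified
```

A tolerance band rather than exact equality is the natural choice here, since re-running a benchmark rarely reproduces a floating-point metric bit-for-bit.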