Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1026–1050 of 5548 papers

Title	Date	Tasks	Status	Hype
Beyond Benchmarks: On The False Promise of AI Regulation	Jan 26, 2025	Benchmarking	—Unverified	0
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning	Jan 25, 2025	BenchmarkingEvolutionary Algorithms	CodeCode Available	7
Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study	Jan 25, 2025	Benchmarking	—Unverified	0
Benchmarking global optimization techniques for unmanned aerial vehicle path planning	Jan 24, 2025	Benchmarkingglobal-optimization	—Unverified	0
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	CodeCode Available	3
Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems	Jan 24, 2025	BenchmarkingDiversity	—Unverified	0
The Karp Dataset	Jan 24, 2025	BenchmarkingMathematical Reasoning	—Unverified	0
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video	Jan 24, 2025	3D ReconstructionBenchmarking	CodeCode Available	2
Enhancing Biomedical Relation Extraction with Directionality	Jan 23, 2025	BenchmarkingDocument-level Relation Extraction	CodeCode Available	1
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning	Jan 23, 2025	Benchmarkingimage-classification	—Unverified	0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale	Jan 23, 2025	Benchmarking	—Unverified	0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain	Jan 23, 2025	BenchmarkingDomain Adaptation	—Unverified	0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF	Jan 22, 2025	BenchmarkingHallucination	—Unverified	0
Leveraging LLMs to Create a Haptic Devices' Recommendation System	Jan 22, 2025	Benchmarking	—Unverified	0
Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities	Jan 22, 2025	BenchmarkingReferring Expression	—Unverified	0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning	Jan 22, 2025	Benchmarking	CodeCode Available	0
CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization	Jan 22, 2025	Benchmarkingregression	—Unverified	0
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)	Jan 21, 2025	Benchmarking	—Unverified	0
Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes	Jan 21, 2025	Benchmarking	—Unverified	0
Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning	Jan 21, 2025	BenchmarkingContinual Learning	—Unverified	0
Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems	Jan 21, 2025	Autonomous VehiclesBenchmarking	CodeCode Available	0
Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing	Jan 20, 2025	BenchmarkingEvolutionary Algorithms	—Unverified	0
Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model	Jan 20, 2025	Benchmarking	—Unverified	0
Benchmarking Large Language Models via Random Variables	Jan 20, 2025	BenchmarkingMathematical Reasoning	—Unverified	0
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models	Jan 19, 2025	BenchmarkingQuestion Answering	CodeCode Available	1

Show:10 25 50

← PrevPage 42 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified