Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 576–600 of 5548 papers

Title	Date	Tasks	Status	Hype
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification	Nov 11, 2024	BenchmarkingImage Segmentation	CodeCode Available	1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset	Nov 5, 2024	BenchmarkingLanguage Modeling	CodeCode Available	1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks	Nov 4, 2024	Action GenerationBenchmarking	CodeCode Available	1
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation	Nov 4, 2024	BenchmarkingGraph Generation	CodeCode Available	1
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving	Nov 3, 2024	Autonomous DrivingBenchmarking	CodeCode Available	1
MIRFLEX: Music Information Retrieval Feature Library for Extraction	Nov 1, 2024	BenchmarkingInformation Retrieval	CodeCode Available	1
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models	Nov 1, 2024	BenchmarkingMixture-of-Experts	CodeCode Available	1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery	Oct 31, 2024	BenchmarkingCloud Removal	CodeCode Available	1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking	Oct 31, 2024	BenchmarkingImputation	CodeCode Available	1
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios	Oct 31, 2024	BenchmarkingLLM-generated Text Detection	CodeCode Available	1
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction	Oct 31, 2024	BenchmarkingPrediction	CodeCode Available	1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography	Oct 31, 2024	BenchmarkingElectromyography (EMG)	CodeCode Available	1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems	Oct 30, 2024	BenchmarkingManagement	CodeCode Available	1
Survey of Cultural Awareness in Language Models: Text and Beyond	Oct 30, 2024	Benchmarking	CodeCode Available	1
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment	Oct 28, 2024	BenchmarkingLanguage Modeling	CodeCode Available	1
SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance	Oct 27, 2024	BenchmarkingCode Generation	CodeCode Available	1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	Oct 25, 2024	BenchmarkingDiversity	CodeCode Available	1
Benchmarking Multi-Scene Fire and Smoke Detection	Oct 22, 2024	Benchmarking	CodeCode Available	1
Comprehensive benchmarking of large language models for RNA secondary structure prediction	Oct 21, 2024	Benchmarking	CodeCode Available	1
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems	Oct 18, 2024	BenchmarkingQuestion Answering	CodeCode Available	1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments	Oct 18, 2024	Autonomous NavigationBenchmarking	CodeCode Available	1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all	Oct 17, 2024	AllBenchmarking	CodeCode Available	1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	Oct 16, 2024	BenchmarkingFairness	CodeCode Available	1
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation	Oct 15, 2024	BenchmarkingInteractive Segmentation	CodeCode Available	1
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	Oct 14, 2024	2kBenchmarking	CodeCode Available	1

Show:10 25 50

← PrevPage 24 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified