Benchmarking

Papers

Showing 651–700 of 5548 papers

Title	Date	Tasks	Status	Hype
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning	Jul 22, 2024	Benchmarking, Hallucination	Code Available	1
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding	Jul 20, 2024	Benchmarking, Heuristic Search	Code Available	1
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations	Jul 19, 2024	Benchmarking, Fairness	Code Available	1
Restore Anything Model via Efficient Degradation Adaptation	Jul 18, 2024	5-Degradation Blind All-in-One Image Restoration, Benchmarking	Code Available	1
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities	Jul 16, 2024	Benchmarking, Domain Adaptation	Code Available	1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models	Jul 16, 2024	Benchmarking, Code Generation	Code Available	1
When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark	Jul 15, 2024	Benchmarking, Graph Learning	Code Available	1
Separable Operator Networks	Jul 15, 2024	Benchmarking, GPU	Code Available	1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin	Jul 15, 2024	Benchmarking	Code Available	1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling	Jul 13, 2024	Benchmarking, Math	Code Available	1
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos	Jul 12, 2024	Benchmarking, Pupil Dilation	Code Available	1
Benchmarking Language Model Creativity: A Case Study on Code Generation	Jul 12, 2024	Benchmarking, Code Generation	Code Available	1
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines	Jul 11, 2024	Benchmarking, Prediction	Code Available	1
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation	Jul 11, 2024	Benchmarking	Code Available	1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective	Jul 10, 2024	Benchmarking, Diagnostic	Code Available	1
Training on the Test Task Confounds Evaluation and Emergence	Jul 10, 2024	Benchmarking, Language Modelling	Code Available	1
OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning	Jul 8, 2024	Benchmarking, class-incremental learning	Code Available	1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates	Jul 8, 2024	Benchmarking, knowledge editing	Code Available	1
Replication in Visual Diffusion Models: A Survey and Outlook	Jul 7, 2024	Benchmarking, Survey	Code Available	1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters	Jul 5, 2024	Benchmarking, valid	Code Available	1
Benchmark on Drug Target Interaction Modeling from a Structure Perspective	Jul 4, 2024	Benchmarking, Drug Discovery	Code Available	1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking	Jul 3, 2024	Benchmarking, Object	Code Available	1
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models	Jul 3, 2024	Benchmarking	Code Available	1
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset	Jul 3, 2024	Benchmarking, Diversity	Code Available	1
Occlusion-Aware Seamless Segmentation	Jul 2, 2024	Benchmarking, Domain Adaptation	Code Available	1
FineSurE: Fine-grained Summarization Evaluation using LLMs	Jul 1, 2024	Benchmarking, Hallucination	Code Available	1
AI Agents That Matter	Jul 1, 2024	Benchmarking	Code Available	1
Overcoming Common Flaws in the Evaluation of Selective Classification Systems	Jul 1, 2024	Benchmarking, Classification	Code Available	1
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents	Jul 1, 2024	Benchmarking	Code Available	1
GraphArena: Benchmarking Large Language Models on Graph Computational Problems	Jun 29, 2024	Benchmarking, Hallucination	Code Available	1
iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities	Jun 27, 2024	Benchmarking	Code Available	1
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection	Jun 25, 2024	Benchmarking, Prompt Learning	Code Available	1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)	Jun 25, 2024	Benchmarking, Experimental Design	Code Available	1
MatText: Do Language Models Need More than Text & Scale for Materials Modeling?	Jun 25, 2024	Benchmarking	Code Available	1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models	Jun 24, 2024	Benchmarking, Data Augmentation	Code Available	1
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design	Jun 24, 2024	Benchmarking, Drug Design	Code Available	1
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track	Jun 24, 2024	Benchmarking, RAG	Code Available	1
A Closer Look at Mortality Risk Prediction from Electrocardiograms	Jun 24, 2024	Benchmarking, Prediction	Code Available	1
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models	Jun 21, 2024	Benchmarking	Code Available	1
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data	Jun 20, 2024	Benchmarking, Kolmogorov-Arnold Networks	Code Available	1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	Benchmarking, Classification	Code Available	1
BeHonest: Benchmarking Honesty in Large Language Models	Jun 19, 2024	Benchmarking, Misinformation	Code Available	1
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking	Jun 17, 2024	Benchmarking, Demand Forecasting	Code Available	1
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models	Jun 17, 2024	Benchmarking, Fact Checking	Code Available	1
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking	Jun 15, 2024	Benchmarking, GPU	Code Available	1
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs	Jun 14, 2024	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency	Jun 14, 2024	Benchmarking	Code Available	1
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data	Jun 14, 2024	Benchmarking, Decision Making	Code Available	1
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution	Jun 13, 2024	Benchmarking, Image Super-Resolution	Code Available	1
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models	Jun 13, 2024	Benchmarking	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified