Benchmarking

Papers

Showing 1851–1900 of 5548 papers

Title	Date	Tasks	Status	Hype
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?	Jul 17, 2024	Benchmarking, Sarcasm Detection	Unverified	0
Benchmarking Robust Self-Supervised Learning Across Diverse Downstream Tasks	Jul 17, 2024	Adversarial Robustness, Benchmarking	Code Available	0
Temporal receptive field in dynamic graph learning: A comprehensive analysis	Jul 17, 2024	Benchmarking, Dynamic Link Prediction	Code Available	0
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models	Jul 17, 2024	Benchmarking, Red Teaming	Code Available	2
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models	Jul 17, 2024	Benchmarking, Language Modelling	Unverified	0
Feature interpretability in BCIs: exploring the role of network lateralization	Jul 16, 2024	Benchmarking, EEG	Code Available	0
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection	Jul 16, 2024	Benchmarking, Loop Closure Detection	Code Available	2
Benchmarking the Attribution Quality of Vision Models	Jul 16, 2024	Benchmarking, Explainable Models	Code Available	0
A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification	Jul 16, 2024	Benchmarking, Few-Shot Learning	Unverified	0
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities	Jul 16, 2024	Benchmarking, Domain Adaptation	Code Available	1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models	Jul 16, 2024	Benchmarking, Code Generation	Code Available	1
REMM: Rotation-Equivariant Framework for End-to-End Multimodal Image Matching	Jul 16, 2024	Benchmarking	Code Available	0
On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction	Jul 15, 2024	Active Learning, Benchmarking	Unverified	0
Separable Operator Networks	Jul 15, 2024	Benchmarking, GPU	Code Available	1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin	Jul 15, 2024	Benchmarking	Code Available	1
AstroMLab 1: Who Wins Astronomy Jeopardy!?	Jul 15, 2024	Astronomy, Benchmarking	Unverified	0
ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation	Jul 15, 2024	Benchmarking	Unverified	0
Benchmarking Vision Language Models for Cultural Understanding	Jul 15, 2024	Benchmarking, Question Answering	Unverified	0
When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark	Jul 15, 2024	Benchmarking, Graph Learning	Code Available	1
Experimental Benchmarking of Energy-saving Sub-Optimal Sliding Mode Control	Jul 14, 2024	Benchmarking	Unverified	0
Automated detection of gibbon calls from passive acoustic monitoring data using convolutional neural networks in the "torch for R" ecosystem	Jul 13, 2024	Benchmarking, Deep Learning	Unverified	0
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling	Jul 13, 2024	Benchmarking, Math	Code Available	1
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs	Jul 13, 2024	Benchmarking, Question Answering	Unverified	0
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos	Jul 12, 2024	Benchmarking, Pupil Dilation	Code Available	1
Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment	Jul 12, 2024	Benchmarking, Decision Making	Code Available	0
Benchmarking Language Model Creativity: A Case Study on Code Generation	Jul 12, 2024	Benchmarking, Code Generation	Code Available	1
A Comprehensive Survey on Retrieval Methods in Recommender Systems	Jul 11, 2024	Benchmarking, Recommendation Systems	Unverified	0
Evaluating Nuanced Bias in Large Language Model Free Response Answers	Jul 11, 2024	Benchmarking, Language Modeling	Unverified	0
WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving	Jul 11, 2024	Autonomous Driving, Benchmarking	Code Available	2
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation	Jul 11, 2024	Benchmarking	Code Available	1
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines	Jul 11, 2024	Benchmarking, Prediction	Code Available	1
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models	Jul 10, 2024	Benchmarking	Unverified	0
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective	Jul 10, 2024	Benchmarking, Diagnostic	Code Available	1
How Aligned are Different Alignment Metrics?	Jul 10, 2024	Benchmarking	Unverified	0
InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior	Jul 10, 2024	Benchmarking, Decoder	Code Available	2
Training on the Test Task Confounds Evaluation and Emergence	Jul 10, 2024	Benchmarking, Language Modelling	Code Available	1
Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation	Jul 9, 2024	Benchmarking, Domain Adaptation	Code Available	3
SPINEX-Clustering: Similarity-based Predictions with Explainable Neighbors Exploration for Clustering Problems	Jul 9, 2024	Benchmarking, Clustering	Unverified	0
Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability	Jul 9, 2024	Benchmarking, Decoder	Unverified	0
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance	Jul 9, 2024	Benchmarking, Conditional Image Generation	Code Available	2
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction	Jul 9, 2024	Benchmarking	Code Available	0
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates	Jul 8, 2024	Benchmarking, Knowledge Editing	Code Available	1
Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments	Jul 8, 2024	Benchmarking, Decision Making	Code Available	0
OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning	Jul 8, 2024	Benchmarking, Class-Incremental Learning	Code Available	1
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation	Jul 8, 2024	Benchmarking, Graph Embedding	Unverified	0
TARGO: Benchmarking Target-driven Object Grasping under Occlusions	Jul 8, 2024	Benchmarking, Object	Unverified	0
A Benchmark for Multi-speaker Anonymization	Jul 8, 2024	Benchmarking, Disentanglement	Unverified	0
MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition	Jul 8, 2024	Benchmarking, Deep Learning	Unverified	0
Replication in Visual Diffusion Models: A Survey and Outlook	Jul 7, 2024	Benchmarking, Survey	Code Available	1
Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs	Jul 6, 2024	Benchmarking, Dataset Generation	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified