SOTAVerified

Benchmarking

Papers

Showing 1851-1900 of 5548 papers

Title | Status | Hype
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models? | - | 0
Benchmarking Robust Self-Supervised Learning Across Diverse Downstream Tasks | Code | 0
Temporal receptive field in dynamic graph learning: A comprehensive analysis | Code | 0
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models | - | 0
Feature interpretability in BCIs: exploring the role of network lateralization | Code | 0
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Code | 2
Benchmarking the Attribution Quality of Vision Models | Code | 0
A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification | - | 0
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
REMM: Rotation-Equivariant Framework for End-to-End Multimodal Image Matching | Code | 0
On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction | - | 0
Separable Operator Networks | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
AstroMLab 1: Who Wins Astronomy Jeopardy!? | - | 0
ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation | - | 0
Benchmarking Vision Language Models for Cultural Understanding | - | 0
When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark | Code | 1
Experimental Benchmarking of Energy-saving Sub-Optimal Sliding Mode Control | - | 0
Automated detection of gibbon calls from passive acoustic monitoring data using convolutional neural networks in the "torch for R" ecosystem | - | 0
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs | - | 0
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Code | 1
Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment | Code | 0
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
A Comprehensive Survey on Retrieval Methods in Recommender Systems | - | 0
Evaluating Nuanced Bias in Large Language Model Free Response Answers | - | 0
WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving | Code | 2
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Code | 1
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Code | 1
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models | - | 0
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
How Aligned are Different Alignment Metrics? | - | 0
InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Code | 2
Training on the Test Task Confounds Evaluation and Emergence | Code | 1
Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation | Code | 3
SPINEX-Clustering: Similarity-based Predictions with Explainable Neighbors Exploration for Clustering Problems | - | 0
Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability | - | 0
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction | Code | 0
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments | Code | 0
OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Code | 1
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation | - | 0
TARGO: Benchmarking Target-driven Object Grasping under Occlusions | - | 0
A Benchmark for Multi-speaker Anonymization | - | 0
MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | - | 0
Replication in Visual Diffusion Models: A Survey and Outlook | Code | 1
Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified