SOTAVerified

Benchmarking

Papers

Showing 1876–1900 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1 |
| A Comprehensive Survey on Retrieval Methods in Recommender Systems | | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | | 0 |
| WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving | Code | 2 |
| Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Code | 1 |
| PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Code | 1 |
| Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models | | 0 |
| Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1 |
| How Aligned are Different Alignment Metrics? | | 0 |
| InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Code | 2 |
| Training on the Test Task Confounds Evaluation and Emergence | Code | 1 |
| Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation | Code | 3 |
| SPINEX-Clustering: Similarity-based Predictions with Explainable Neighbors Exploration for Clustering Problems | | 0 |
| Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability | | 0 |
| HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Code | 2 |
| HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction | Code | 0 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1 |
| Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments | Code | 0 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Code | 1 |
| GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation | | 0 |
| TARGO: Benchmarking Target-driven Object Grasping under Occlusions | | 0 |
| A Benchmark for Multi-speaker Anonymization | | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | | 0 |
| Replication in Visual Diffusion Models: A Survey and Outlook | Code | 1 |
| Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |