SOTAVerified

Benchmarking

Papers

Showing 2401–2425 of 5548 papers

Title | Status | Hype
GRATIS: GeneRAting TIme Series with diverse and controllable characteristics | Code | 0
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
DQI: Measuring Data Quality in NLP | Code | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
A General Benchmarking Framework for Text Generation | Code | 0
Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0
Geological Inference from Textual Data using Word Embeddings | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Code | 0
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Code | 0
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | Code | 0
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks | Code | 0
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified