SOTAVerified

Benchmarking

Papers

Showing 5101-5150 of 5548 papers

Title | Status | Hype
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks | Code | 0
Benchmarking LLM-based Relevance Judgment Methods | Code | 0
Toward 3D Object Reconstruction from Stereo Images | Code | 0
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0
Skelite: Compact Neural Networks for Efficient Iterative Skeletonization | Code | 0
Divergent Creativity in Humans and Large Language Models | Code | 0
A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Code | 0
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0
User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks | Code | 0
Towards a Benchmark for Large Language Models for Business Process Management Tasks | Code | 0
Weighting-Based Treatment Effect Estimation via Distribution Learning | Code | 0
Slot Filling for Extracting Reskilling and Upskilling Options from the Web | Code | 0
On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective | Code | 0
Distributional Depth-Based Estimation of Object Articulation Models | Code | 0
Benchmarking Linguistic Diversity of Large Language Models | Code | 0
On Recurrent Neural Networks for Sequence-based Processing in Communications | Code | 0
Benchmarking Learning Efficiency in Deep Reservoir Computing | Code | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Code | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Diversity Over Size: On the Effect of Sample and Topic Sizes for Topic-Dependent Argument Mining Datasets | Code | 0
On the Evaluation Consistency of Attribution-based Explanations | Code | 0
On the Evaluation of Conditional GANs | Code | 0
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
On the Fragility of Active Learners for Text Classification | Code | 0
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
On the Loss of Context-awareness in General Instruction Fine-tuning | Code | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
SNaC: Coherence Error Detection for Narrative Summarization | Code | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
Using Motif Transitions for Temporal Graph Generation | Code | 0
Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Code | 0
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
Word Embeddings for the Construction Domain | Code | 0
What Actions are Needed for Understanding Human Actions in Videos? | Code | 0
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Code | 0
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers | Code | 0
On the Use of ArXiv as a Dataset | Code | 0
On the use of automatically generated synthetic image datasets for benchmarking face recognition | Code | 0
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Code | 0
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | Code | 0
On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified