SOTAVerified

Benchmarking

Papers

Showing 3276–3300 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0 |
| BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer | | 0 |
| LAraBench: Benchmarking Arabic AI with Large Language Models | | 0 |
| Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0 |
| ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment | Code | 1 |
| Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces | Code | 1 |
| R2H: Building Multimodal Navigation Helpers that Respond to Help Requests | | 0 |
| When the Music Stops: Tip-of-the-Tongue Retrieval for Music | Code | 0 |
| Benchmarking Machine Translation with Cultural Awareness | Code | 0 |
| Robust Model-Based Optimization for Challenging Fitness Landscapes | Code | 0 |
| Exploring Large Language Models for Classical Philology | Code | 1 |
| Multilingual Large Language Models Are Not (Yet) Code-Switchers | | 0 |
| How Fragile is Relation Extraction under Entity Replacements? | Code | 0 |
| Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method | Code | 1 |
| A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | Code | 0 |
| Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate | | 0 |
| Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks | Code | 1 |
| Value-at-Risk-Based Portfolio Insurance: Performance Evaluation and Benchmarking Against CPPI in a Markov-Modulated Regime-Switching Market | | 0 |
| Patterns of Convergence and Bound Constraint Violation in Differential Evolution on SBOX-COST Benchmarking Suite | | 0 |
| Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models | Code | 2 |
| Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | Code | 0 |
| TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks | | 0 |
| Ahead-of-Time P-Tuning | | 0 |
| Benchmarking Deep Learning Frameworks for Automated Diagnosis of Ocular Toxoplasmosis: A Comprehensive Approach to Classification and Segmentation | | 0 |
| Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |