SOTAVerified

Benchmarking

Papers

Showing 891-900 of 5548 papers

Title | Status | Hype
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Code | 2
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | - | 0
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | - | 0
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Code | 0
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | - | 0
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | - | 0
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | - | 0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | - | 0
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
PredictaBoard: Benchmarking LLM Score Predictability | Code | 0
Page 90 of 555

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified