SOTAVerified

Benchmarking

Papers

Showing 1251-1275 of 5548 papers

Title | Status | Hype
Automatic Detection of Generated Text is Easiest when Humans are Fooled | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Benchmarking the Generation of Fact Checking Explanations | Code | 1
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1
How to Train Neural Field Representations: A Comprehensive Study and Benchmark | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
AIPerf: Automated machine learning as an AI-HPC benchmark | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
Benchmarking of DL Libraries and Models on Mobile Devices | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1
HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive Media | Code | 1
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Code | 1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
A framework for benchmarking clustering algorithms | Code | 1
"How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
Page 51 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified