SOTAVerified

Benchmarking

Papers

Showing 2651–2700 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension | | 0 |
| AI PERSONA: Towards Life-long Personalization of LLMs | | 0 |
| Foundations for learning from noisy quantum experiments | | 0 |
| Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | | 0 |
| Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension | | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0 |
| Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization | | 0 |
| FRED: The Florence RGB-Event Drone Dataset | | 0 |
| Benchmarking projective simulation in navigation problems | | 0 |
| Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification | | 0 |
| Benchmarking Single-Image Reflection Removal Algorithms | | 0 |
| A Survey on LLM-based News Recommender Systems | | 0 |
| How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study | | 0 |
| Human Body Shape Classification Based on a Single Image | | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0 |
| Benchmarking SMT Performance for Farsi Using the TEP++ Corpus | | 0 |
| From Code to Play: Benchmarking Program Search for Games Using Large Language Models | | 0 |
| From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks | | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0 |
| Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms | | 0 |
| FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections | | 0 |
| A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | | 0 |
| From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future | | 0 |
| How Good is a Video Summary? A New Benchmarking Dataset and Evaluation Framework Towards Realistic Video Summarization | | 0 |
| Benchmarking Spiking Neural Network Learning Methods with Varying Locality | | 0 |
| Fairness Index Measures to Evaluate Bias in Biometric Recognition | | 0 |
| Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making | | 0 |
| Fairness-Aware Graph Neural Networks: A Survey | | 0 |
| From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution | | 0 |
| Benchmarking State-of-the-Art Deep Learning Software Tools | | 0 |
| From Sound Representation to Model Robustness | | 0 |
| FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | | 0 |
| Benchmarking state-of-the-art gradient boosting algorithms for classification | | 0 |
| Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images | | 0 |
| FSD-10: A Dataset for Competitive Sports Content Analysis | | 0 |
| FAIRification of MLC data | | 0 |
| A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking | | 0 |
| How Good Is Neural Combinatorial Optimization? A Systematic Evaluation on the Traveling Salesman Problem | | 0 |
| Full-scale modal testing of a Hawk T1A aircraft for benchmarking vibration-based methods | | 0 |
| Full-stack evaluation of Machine Learning inference workloads for RISC-V systems | | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | | 0 |
| Functional Code Building Genetic Programming | | 0 |
| Efficient Pauli channel estimation with logarithmic quantum memory | | 0 |
| A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System | | 0 |
| FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | | 0 |
| Fuzzy Knowledge Distillation from High-Order TSK to Low-Order TSK | | 0 |
| A Survey of Spanish Clinical Language Models | | 0 |
| AI Matrix - Synthetic Benchmarks for DNN | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |