SOTAVerified

Benchmarking

Papers

Showing 2401-2450 of 5548 papers

Title | Status | Hype
SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages | Code | 0
Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptation of Object Detectors | Code | 0
Recurrent Drafter for Fast Speculative Decoding in Large Language Models | Code | 3
Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows | Code | 0
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models | Code | 9
IndicSTR12: A Dataset for Indic Scene Text Recognition | - | 0
An Approach to Evaluate Modeling Adequacy for Small-Signal Stability Analysis of IBR-related SSOs in Multimachine Systems | - | 0
A tutorial on multi-view autoencoders using the multi-view-AE library | - | 0
Better than classical? The subtle art of benchmarking quantum machine learning models | Code | 7
(N,K)-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model | - | 0
Class Imbalance in Object Detection: An Experimental Diagnosis and Study of Mitigation Strategies | Code | 0
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology | Code | 1
A Holistic Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation | - | 0
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
Multi-GPU-Enabled Hybrid Quantum-Classical Workflow in Quantum-HPC Middleware: Applications in Quantum Simulations | Code | 0
Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms | Code | 0
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents | Code | 1
Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial Hypervolume | - | 0
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations | Code | 1
NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems | - | 0
Benchmarking News Recommendation in the Era of Green AI | - | 0
Improvements & Evaluations on the MLCommons CloudMask Benchmark | Code | 0
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0
Three Revisits to Node-Level Graph Anomaly Detection: Outliers, Message Passing and Hyperbolic Neural Networks | Code | 0
Comparison Performance of Spectrogram and Scalogram as Input of Acoustic Recognition Task | Code | 0
A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video | - | 0
BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | - | 0
Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem | Code | 0
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Code | 2
Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation | - | 0
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering | - | 0
Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground | - | 0
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Code | 2
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Code | 2
Classification of the Fashion-MNIST Dataset on a Quantum Computer | - | 0
Model Lakes | - | 0
Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks | Code | 0
a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification | Code | 0
A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds | - | 0
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
SINDy vs Hard Nonlinearities and Hidden Dynamics: a Benchmarking Study | - | 0
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms | - | 0
Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance | - | 0
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | - | 0
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs | Code | 1
Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified