SOTAVerified

Benchmarking

Papers

Showing 4751-4800 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0 |
| GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0 |
| Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Code | 0 |
| Safe Trajectory Generation for Complex Urban Environments Using Spatio-temporal Semantic Corridor | Code | 0 |
| Natural Image Noise Dataset | Code | 0 |
| Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms | Code | 0 |
| SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration | Code | 0 |
| Geological Inference from Textual Data using Word Embeddings | Code | 0 |
| Flexible Generation of Preference Data for Recommendation Analysis | Code | 0 |
| AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels | Code | 0 |
| MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Code | 0 |
| The LOCATA Challenge: Acoustic Source Localization and Tracking | Code | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| A Meta-Analysis of the Anomaly Detection Problem | Code | 0 |
| On the Measure of Intelligence | Code | 0 |
| Generalization and Regularization in DQN | Code | 0 |
| Automatic Resolution of Domain Name Disputes | Code | 0 |
| Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI | Code | 0 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0 |
| MineRL: A Large-Scale Dataset of Minecraft Demonstrations | Code | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0 |
| Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Code | 0 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0 |
| MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding | Code | 0 |
| Mirage: Model-Agnostic Graph Distillation for Graph Classification | Code | 0 |
| Benchmarking Subset Selection from Large Candidate Solution Sets in Evolutionary Multi-objective Optimization | Code | 0 |
| Sanity Simulations for Saliency Methods | Code | 0 |
| From Variability to Stability: Advancing RecSys Benchmarking Practices | Code | 0 |
| ALTIS: Modernizing GPGPU Benchmarking | Code | 0 |
| From raw affiliations to organization identifiers | Code | 0 |
| Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | Code | 0 |
| 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | Code | 0 |
| MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning | Code | 0 |
| From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0 |
| The Multiple Subnetwork Hypothesis: Enabling Multidomain Learning by Isolating Task-Specific Subnetworks in Feedforward Neural Networks | Code | 0 |
| MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0 |
| SATBench: Benchmarking the speed-accuracy tradeoff in object recognition by humans and dynamic neural networks | Code | 0 |
| MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios | Code | 0 |
| From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning | Code | 0 |
| SAWEC: Sensing-Assisted Wireless Edge Computing | Code | 0 |
| From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0 |
| Vote'n'Rank: Revision of Benchmarking with Social Choice Theory | Code | 0 |
| AlphaZip: Neural Network-Enhanced Lossless Text Compression | Code | 0 |
| ML-Net: multi-label classification of biomedical texts with deep neural networks | Code | 0 |
| From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in Histopathology | Code | 0 |
| mlOSP: Towards a Unified Implementation of Regression Monte Carlo Algorithms | Code | 0 |
| From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation | Code | 0 |
| MLPerf Inference Benchmark | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |