SOTAVerified

Benchmarking

Papers

Showing 1051–1100 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Code | 1 |
| Benchmarking saliency methods for chest X-ray interpretation | Code | 1 |
| 3D Common Corruptions and Data Augmentation | Code | 1 |
| GenISP: Neural ISP for Low-Light Machine Cognition | Code | 1 |
| AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1 |
| Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Code | 1 |
| Geoclidean: Few-Shot Generalization in Euclidean Geometry | Code | 1 |
| Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1 |
| Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks | Code | 1 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Code | 1 |
| Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1 |
| Benchmarking Robustness to Adversarial Image Obfuscations | Code | 1 |
| GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles | Code | 1 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1 |
| Benchmarking Robustness of Machine Reading Comprehension Models | Code | 1 |
| 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding | Code | 1 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Code | 1 |
| Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method | Code | 1 |
| GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning | Code | 1 |
| German's Next Language Model | Code | 1 |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions | Code | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | Code | 1 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Code | 1 |
| A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Code | 1 |
| Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Code | 1 |
| 2.5D Visual Relationship Detection | Code | 1 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Code | 1 |
| Generating a Doppelganger Graph: Resembling but Distinct | Code | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Code | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1 |
| GEMv2: Multilingual NLG Benchmarking in a Single Line of Code | Code | 1 |
| GAMA: a General Automated Machine learning Assistant | Code | 1 |
| GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection | Code | 1 |
| G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Code | 1 |
| Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1 |
| GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection | Code | 1 |
| GCondenser: Benchmarking Graph Condensation | Code | 1 |
| Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1 |
| FTNet: Feature Transverse Network for Thermal Image Semantic Segmentation | Code | 1 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1 |
| Benchmarking Large Multimodal Models against Common Corruptions | Code | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |