SOTAVerified

Benchmarking

Papers

Showing 1101–1150 of 5548 papers

Title | Status | Hype
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Code | 1
Benchmarking saliency methods for chest X-ray interpretation | Code | 1
Global Wheat Head Detection (GWHD) dataset: a large and diverse dataset of high resolution RGB labelled images to develop and benchmark wheat head detection methods | Code | 1
Benchmarking Spatial Relationships in Text-to-Image Generation | Code | 1
GraphGallery: A Platform for Fast Benchmarking and Easy Development of Graph Neural Networks Based Intelligent Software | Code | 1
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Code | 1
A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1
Benchmarking Simulation-Based Inference | Code | 1
Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
Grad DFT: a software library for machine learning enhanced density functional theory | Code | 1
Benchmarking Robustness of 3D Object Detection to Common Corruptions | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
Benchmarking the Generation of Fact Checking Explanations | Code | 1
Geoclidean: Few-Shot Generalization in Euclidean Geometry | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Benchmarking Local Robustness of High-Accuracy Binary Neural Networks for Enhanced Traffic Sign Recognition | Code | 1
Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains | Code | 1
Benchmarking Low-Shot Robustness to Natural Distribution Shifts | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Code | 1
Benchmarking Robustness of Machine Reading Comprehension Models | Code | 1
Benchmarking machine learning models on multi-centre eICU critical care dataset | Code | 1
German's Next Language Model | Code | 1
GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Code | 1
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Code | 1
Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified