SOTAVerified

Benchmarking

Papers

Showing 1851–1900 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior | | 0 |
| Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities | | 0 |
| ExEBench: Benchmarking Foundation Models on Extreme Earth Events | Code | 0 |
| Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030 | | 0 |
| The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong | | 0 |
| PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints | | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0 |
| From raw affiliations to organization identifiers | Code | 0 |
| Benchmarking Retrieval-Augmented Generation for Chemistry | | 0 |
| Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems | | 0 |
| Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs | | 0 |
| Optimizing Recommendations using Fine-Tuned LLMs | | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | | 0 |
| From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0 |
| Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA | | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0 |
| Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy | | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0 |
| clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations | | 0 |
| A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge | Code | 0 |
| Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization | | 0 |
| Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective | Code | 0 |
| Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters | | 0 |
| Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents | | 0 |
| Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection | | 0 |
| QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation | | 0 |
| Advancing and Benchmarking Personalized Tool Invocation for LLMs | Code | 0 |
| False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | Code | 0 |
| Alpha Excel Benchmark | | 0 |
| Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers | Code | 0 |
| Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0 |
| Call for Action: towards the next generation of symbolic regression benchmark | | 0 |
| Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | Code | 0 |
| Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | Code | 0 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0 |
| Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning | | 0 |
| NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities | Code | 0 |
| Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking | | 0 |
| Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Code | 0 |
| Meta-Black-Box-Optimization through Offline Q-function Learning | Code | 0 |
| Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking | Code | 0 |
| NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks | Code | 0 |
| Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing | | 0 |
| CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture | | 0 |
| BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models | | 0 |
| PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach | | 0 |
| Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey | | 0 |
| Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking | | 0 |
| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | Code | 0 |
| Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |