SOTAVerified

Benchmarking

Papers

Showing 451–500 of 5548 papers

Title | Status | Hype
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong |  | 0
Benchmarking Retrieval-Augmented Generation for Chemistry |  | 0
Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems |  | 0
Optimizing Recommendations using Fine-Tuned LLMs |  | 0
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration |  | 0
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Code | 1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels | Code | 1
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA |  | 0
Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy |  | 0
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization | Code | 3
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information |  | 0
Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization |  | 0
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation |  | 0
Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters |  | 0
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations |  | 0
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models | Code | 1
Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective | Code | 0
scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction | Code | 1
A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge | Code | 0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents |  | 0
Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection |  | 0
Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers | Code | 0
False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | Code | 0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
RGB-Event Fusion with Self-Attention for Collision Prediction | Code | 1
Advancing and Benchmarking Personalized Tool Invocation for LLMs | Code | 0
Benchmarking LLMs' Swarm intelligence | Code | 1
Alpha Excel Benchmark |  | 0
Call for Action: towards the next generation of symbolic regression benchmark |  | 0
Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | Code | 0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | Code | 0
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking |  | 0
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities | Code | 0
Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning |  | 0
NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks | Code | 0
Meta-Black-Box-Optimization through Offline Q-function Learning | Code | 0
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Code | 0
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | Code | 1
Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking | Code | 0
Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing |  | 0
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture |  | 0
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking |  | 0
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified