SOTAVerified

Benchmarking

Papers

Showing 2001–2050 of 5548 papers

Title | Status | Hype
Cross-functional transferability in universal machine learning interatomic potentials | - | 0
Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search | - | 0
Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression | Code | 0
Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces | - | 0
Generative Adversarial Networks with Limited Data: A Survey and Benchmarking | - | 0
A Solid-State Nanopore Signal Generator for Training Machine Learning Models | - | 0
Towards Visual Text Grounding of Multimodal Large Language Model | - | 0
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Code | 0
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Code | 0
Leveraging State Space Models in Long Range Genomics | - | 0
Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0
Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems | Code | 0
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | - | 0
Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency | - | 0
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | - | 0
Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | - | 0
Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | - | 0
Evaluating AI Recruitment Sourcing Tools by Human Preference | Code | 0
Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge | - | 0
Accelerating IoV Intrusion Detection: Benchmarking GPU-Accelerated vs CPU-Based ML Libraries | - | 0
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | - | 0
Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools | - | 0
FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking | - | 0
Proof of Humanity: A Multi-Layer Network Framework for Certifying Human-Originated Content in an AI-Dominated Internet | - | 0
Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | - | 0
Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions | - | 0
Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms | - | 0
Benchmarking Federated Machine Unlearning methods for Tabular Data | - | 0
TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Code | 0
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | - | 0
LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions | Code | 0
Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench | Code | 0
Scaling Up Resonate-and-Fire Networks for Fast Deep Learning | Code | 0
On Benchmarking Code LLMs for Android Malware Analysis | - | 0
Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models | - | 0
Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers | - | 0
Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | - | 0
Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models | - | 0
Simple Feedfoward Neural Networks are Almost All You Need for Time Series Forecasting | - | 0
RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations | - | 0
Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains | Code | 0
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | - | 0
MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | - | 0
Generalization Bias in Large Language Model Summarization of Scientific Research | - | 0
An Advanced Ensemble Deep Learning Framework for Stock Price Prediction Using VAE, Transformer, and LSTM Model | - | 0
LIM: Large Interpolator Model for Dynamic Reconstruction | - | 0
Benchmarking Ultra-Low-Power μNPUs | - | 0
Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery | - | 0
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified