SOTAVerified

Benchmarking

Papers

Showing 551-600 of 5548 papers

Title | Status | Hype
Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room | - | 0
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Code | 2
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture | - | 0
Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | - | 0
Establishing Reliability Metrics for Reward Models in Large Language Models | - | 0
IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays | - | 0
A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents | - | 0
Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation | - | 0
Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering | - | 0
LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers | - | 0
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2
AI Idea Bench 2025: AI Research Idea Generation Benchmark | - | 0
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | - | 0
Integrated Super-resolution Sensing and Symbiotic Communication with 3D Sparse MIMO for Low-Altitude UAV Swarm | - | 0
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | - | 0
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models | - | 0
Benchmarking LLM-based Relevance Judgment Methods | Code | 0
Benchmarking Multi-National Value Alignment for Large Language Models | - | 0
Enhancing Explainability and Reliable Decision-Making in Particle Swarm Optimization through Communication Topologies | - | 0
Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients | - | 0
ALT: A Python Package for Lightweight Feature Representation in Time Series Classification | - | 0
Featuremetric benchmarking: Quantum computer benchmarks based on circuit features | - | 0
pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild | - | 0
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions | - | 0
Benchmarking Mutual Information-based Loss Functions in Federated Learning | - | 0
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios | - | 0
Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | - | 0
Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments | Code | 0
Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study | Code | 0
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites | Code | 3
Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation | - | 0
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR | - | 0
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Code | 2
Mamba-Based Ensemble learning for White Blood Cell Classification | Code | 0
Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | - | 0
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives | - | 0
E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking | - | 0
FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare | Code | 0
Benchmarking Vision Language Models on German Factual Data | - | 0
BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs | - | 0
BoTTA: Benchmarking on-device Test Time Adaptation | - | 0
Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | - | 0
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts | - | 0
LMFormer: Lane based Motion Prediction Transformer | - | 0
Benchmarking 3D Human Pose Estimation Models Under Occlusions | - | 0
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | - | 0
TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming | Code | 1
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models | - | 0
Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified