SOTAVerified

Benchmarking

Papers

Showing 551–575 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room | | 0 |
| WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Code | 2 |
| Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0 |
| Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture | | 0 |
| Establishing Reliability Metrics for Reward Models in Large Language Models | | 0 |
| Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | | 0 |
| IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays | | 0 |
| A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents | | 0 |
| Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation | | 0 |
| LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers | | 0 |
| Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2 |
| Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering | | 0 |
| AI Idea Bench 2025: AI Research Idea Generation Benchmark | | 0 |
| CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | | 0 |
| Integrated Super-resolution Sensing and Symbiotic Communication with 3D Sparse MIMO for Low-Altitude UAV Swarm | | 0 |
| OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | | 0 |
| THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models | | 0 |
| Enhancing Explainability and Reliable Decision-Making in Particle Swarm Optimization through Communication Topologies | | 0 |
| Benchmarking LLM-based Relevance Judgment Methods | Code | 0 |
| Benchmarking Multi-National Value Alignment for Large Language Models | | 0 |
| ALT: A Python Package for Lightweight Feature Representation in Time Series Classification | | 0 |
| Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients | | 0 |
| Featuremetric benchmarking: Quantum computer benchmarks based on circuit features | | 0 |
| pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild | | 0 |
| Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |