SOTAVerified

Benchmarking

Papers

Showing 2751–2775 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| AlphaZip: Neural Network-Enhanced Lossless Text Compression | Code | 0 |
| Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images | Code | 0 |
| Building a continuous benchmarking ecosystem in bioinformatics | | 0 |
| Benchmarking Edge AI Platforms for High-Performance ML Inference | | 0 |
| Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Code | 0 |
| The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests | | 0 |
| Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra | | 0 |
| Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0 |
| Margin-bounded Confidence Scores for Out-of-Distribution Detection | Code | 0 |
| @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology | | 0 |
| Present and Future Generalization of Synthetic Image Detectors | Code | 0 |
| Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators | Code | 0 |
| An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions | | 0 |
| CONGRA: Benchmarking Automatic Conflict Resolution | Code | 0 |
| Efficient and Effective Model Extraction | Code | 0 |
| Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection | | 0 |
| Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time | | 0 |
| STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions | Code | 0 |
| CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data | | 0 |
| Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks | | 0 |
| Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation | | 0 |
| MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines | | 0 |
| Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards | Code | 0 |
| ASR Benchmarking: Need for a More Representative Conversational Dataset | Code | 0 |
| Efficacy of Synthetic Data as a Benchmark | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |