SOTAVerified

Benchmarking

Papers

Showing 1601-1650 of 5548 papers

Title | Status | Hype
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Code | 1
Benchmarking Domain Generalization Algorithms in Computational Pathology | Code | 0
Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices |  | 0
SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking |  | 0
Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling |  | 0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0
HLB: Benchmarking LLMs' Humanlikeness in Language Use |  | 0
Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data | Code | 0
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework | Code | 0
Small Language Models: Survey, Measurements, and Insights | Code | 2
Building a continuous benchmarking ecosystem in bioinformatics |  | 0
Benchmarking Edge AI Platforms for High-Performance ML Inference |  | 0
Boosting Healthcare LLMs Through Retrieved Context | Code | 1
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images | Code | 0
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Code | 0
AlphaZip: Neural Network-Enhanced Lossless Text Compression | Code | 0
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | Code | 1
Margin-bounded Confidence Scores for Out-of-Distribution Detection | Code | 0
Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra |  | 0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests |  | 0
A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Code | 2
Efficient and Effective Model Extraction | Code | 0
CONGRA: Benchmarking Automatic Conflict Resolution | Code | 0
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology |  | 0
Present and Future Generalization of Synthetic Image Detectors | Code | 0
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators | Code | 0
An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions |  | 0
Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection |  | 0
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time |  | 0
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models | Code | 1
Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks |  | 0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data |  | 0
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions | Code | 0
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards | Code | 0
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation |  | 0
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines |  | 0
Efficacy of Synthetic Data as a Benchmark |  | 0
PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models | Code | 0
Hard-Label Cryptanalytic Extraction of Neural Network Models | Code | 0
ASR Benchmarking: Need for a More Representative Conversational Dataset | Code | 0
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Code | 2
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration | Code | 0
WER We Stand: Benchmarking Urdu ASR Models |  | 0
Improve Machine Learning carbon footprint using Parquet dataset format and Mixed Precision training for regression models -- Part II | Code | 0
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models | Code | 0
The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection | Code | 0
Quantum Kernel Learning for Small Dataset Modeling in Semiconductor Fabrication: Application to Ohmic Contact |  | 0
MetaFormer and CNN Hybrid Model for Polyp Image Segmentation | Code | 1
Page 33 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified