SOTAVerified

Benchmarking

Papers

Showing 2751-2800 of 5548 papers

Title | Status | Hype
AlphaZip: Neural Network-Enhanced Lossless Text Compression | Code | 0
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images | Code | 0
Building a continuous benchmarking ecosystem in bioinformatics | | 0
Benchmarking Edge AI Platforms for High-Performance ML Inference | | 0
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Code | 0
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests | | 0
Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra | | 0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0
Margin-bounded Confidence Scores for Out-of-Distribution Detection | Code | 0
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology | | 0
Present and Future Generalization of Synthetic Image Detectors | Code | 0
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators | Code | 0
An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions | | 0
CONGRA: Benchmarking Automatic Conflict Resolution | Code | 0
Efficient and Effective Model Extraction | Code | 0
Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection | | 0
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time | | 0
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions | Code | 0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data | | 0
Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks | | 0
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation | | 0
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines | | 0
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards | Code | 0
ASR Benchmarking: Need for a More Representative Conversational Dataset | Code | 0
Efficacy of Synthetic Data as a Benchmark | | 0
Hard-Label Cryptanalytic Extraction of Neural Network Models | Code | 0
PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models | Code | 0
Improve Machine Learning carbon footprint using Parquet dataset format and Mixed Precision training for regression models -- Part II | Code | 0
WER We Stand: Benchmarking Urdu ASR Models | | 0
The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection | Code | 0
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models | Code | 0
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration | Code | 0
Quantum Kernel Learning for Small Dataset Modeling in Semiconductor Fabrication: Application to Ohmic Contact | | 0
Benchmarking VLMs' Reasoning About Persuasive Atypical Images | | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data | | 0
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study | | 0
Text-To-Speech Synthesis In The Wild | | 0
Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering | | 0
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal | | 0
The JPEG Pleno Learning-based Point Cloud Coding Standard: Serving Man and Machine | | 0
Linear energy storage and flexibility model with ramp rate, ramping, deadline and capacity constraints | Code | 0
Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots | | 0
Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG | | 0
Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification | | 0
Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning | | 0
Improve Machine Learning carbon footprint using Nvidia GPU and Mixed Precision training for classification models -- Part I | Code | 0
Benchmarking 2D Egocentric Hand Pose Datasets | | 0
Understanding Foundation Models: Are We Back in 1924? | | 0
Unsupervised Novelty Detection Methods Benchmarking with Wavelet Decomposition | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified