SOTAVerified

Benchmarking

Papers

Showing 3851–3900 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0 |
| Benchmarking Machine Translation with Cultural Awareness | Code | 0 |
| Multilingual Large Language Models Are Not (Yet) Code-Switchers | | 0 |
| Robust Model-Based Optimization for Challenging Fitness Landscapes | Code | 0 |
| Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate | | 0 |
| How Fragile is Relation Extraction under Entity Replacements? | Code | 0 |
| A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | Code | 0 |
| Value-at-Risk-Based Portfolio Insurance: Performance Evaluation and Benchmarking Against CPPI in a Markov-Modulated Regime-Switching Market | | 0 |
| Patterns of Convergence and Bound Constraint Violation in Differential Evolution on SBOX-COST Benchmarking Suite | | 0 |
| TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks | | 0 |
| Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | Code | 0 |
| Ahead-of-Time P-Tuning | | 0 |
| Benchmarking Deep Learning Frameworks for Automated Diagnosis of Ocular Toxoplasmosis: A Comprehensive Approach to Classification and Segmentation | | 0 |
| Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | | 0 |
| Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models | | 0 |
| Smiling Women Pitching Down: Auditing Representational and Presentational Gender Biases in Image Generative AI | | 0 |
| Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks | | 0 |
| Restoring Images Captured in Arbitrary Hybrid Adverse Weather Conditions in One Go | | 0 |
| DLUE: Benchmarking Document Language Understanding | | 0 |
| OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking | | 0 |
| Predictive Models from Quantum Computer Benchmarks | | 0 |
| Benchmarking the human brain against computational architectures | | 0 |
| A Strong Sustainability Paradigm Based Analytical Hierarchy Process (SSP-AHP) Method to Evaluate Sustainable Healthcare Systems | | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | | 0 |
| Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection | | 0 |
| Comparing Foundation Models using Data Kernels | | 0 |
| Towards Segment Anything Model (SAM) for Medical Image Segmentation: A Survey | Code | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | | 0 |
| Semantic Segmentation using Vision Transformers: A survey | | 0 |
| Can LLMs Capture Human Preferences? | | 0 |
| Analyzing Hong Kong's Legal Judgments from a Computational Linguistics point-of-view | | 0 |
| A Simulation-Augmented Benchmarking Framework for Automatic RSO Streak Detection in Single-Frame Space Images | | 0 |
| Benchmarking Automated Machine Learning Methods for Price Forecasting Applications | | 0 |
| ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task | | 0 |
| On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective | Code | 0 |
| Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency | | 0 |
| CIMLA: Interpretable AI for inference of differential causal networks | | 0 |
| Unsupervised Synthetic Image Refinement via Contrastive Learning and Consistent Semantic-Structural Constraints | | 0 |
| Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology | Code | 0 |
| A Framework for Benchmarking Real-Time Embedded Object Detection | | 0 |
| Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification | | 0 |
| Learning a quantum computer's capability | | 0 |
| Towards a Benchmark for Scientific Understanding in Humans and Machines | | 0 |
| Depth Functions for Partial Orders with a Descriptive Analysis of Machine Learning Algorithms | Code | 0 |
| The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages | Code | 0 |
| UDTIRI: An Online Open-Source Intelligent Road Inspection Benchmark Suite | | 0 |
| Computational and Exploratory Landscape Analysis of the GKLS Generator | | 0 |
| OOD-CV-v2: An extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images | | 0 |
| Towards Computational Performance Engineering for Unsupervised Concept Drift Detection -- Complexities, Benchmarking, Performance Analysis | Code | 0 |
| Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |