SOTAVerified

Benchmarking

Papers

Showing 3551-3600 of 5548 papers

Title | Status | Hype
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | | 0
MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | | 0
Benchmarking Large Language Model Capabilities for Conditional Generation | | 0
Benchmarking Language Models for Cyberbullying Identification and Classification from Social-media Texts | | 0
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | | 0
MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0
Benchmarking Lane-changing Decision-making for Deep Reinforcement Learning | | 0
MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning | | 0
Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation | | 0
MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | | 0
Towards Explainable Network Intrusion Detection using Large Language Models | | 0
Benchmarking KAZE and MCM for Multiclass Classification | | 0
What cleaves? Is proteasomal cleavage prediction reaching a ceiling? | | 0
Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data | | 0
Benchmarking Joint Face Spoofing and Forgery Detection with Visual and Physiological Cues | | 0
Metaethical Perspectives on 'Benchmarking' AI Ethics | | 0
Towards Fair Machine Learning Software: Understanding and Addressing Model Bias Through Counterfactual Thinking | | 0
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | | 0
A deep convolutional neural network model for rapid prediction of fluvial flood inundation | | 0
Meta learning to classify intent and slot labels with noisy few shot examples | | 0
Benchmarking Invertible Architectures on Inverse Problems | | 0
Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models | | 0
Metastatic Cancer Outcome Prediction with Injective Multiple Instance Pooling | | 0
Benchmarking in Optimization: Best Practice and Open Issues | | 0
Towards Graph Foundation Models: A Study on the Generalization of Positional and Structural Encodings | | 0
Methods and open-source toolkit for analyzing and visualizing challenge results | | 0
Methods and Trends in Detecting Generated Images: A Comprehensive Review | | 0
Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | | 0
Bench-Marking Information Extraction in Semi-Structured Historical Handwritten Records | | 0
Benchmarking Inference Performance of Deep Learning Models on Analog Devices | | 0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0
MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | | 0
Benchmarking Individual Tree Mapping with Sub-meter Imagery | | 0
Microtask crowdsourcing for disease mention annotation in PubMed abstracts | | 0
Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP) | | 0
Benchmarking Image Transformers for Prostate Cancer Detection from Ultrasound Data | | 0
Benchmarking Image Sensors Under Adverse Weather Conditions for Autonomous Driving | | 0
MileBench: Benchmarking MLLMs in Long Context | | 0
Addressing the Real-world Class Imbalance Problem in Dermatology | | 0
MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | | 0
Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs | | 0
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | | 0
Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | | 0
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | | 0
Mind the Retrosynthesis Gap: Bridging the divide between Single-step and Multi-step Retrosynthesis Prediction | | 0
What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs | | 0
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | | 0
Benchmarking Human Face Similarity Using Identical Twins | | 0
Towards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10,000 GPU Hours | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified