SOTAVerified

Benchmarking

Papers

Showing 5501-5548 of 5548 papers

Title | Status | Hype
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding | | 0
Few-Shot Defect Segmentation Leveraging Abundant Normal Training Samples Through Normal Background Regularization and Crop-and-Paste Operation | | 0
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring | | 0
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift | | 0
AI PERSONA: Towards Life-long Personalization of LLMs | | 0
Fiber Bundle Morphisms as a Framework for Modeling Many-to-Many Maps | | 0
E(3)-equivariant models cannot learn chirality: Field-based molecular generation | | 0
CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | | 0
Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark | | 0
Finance Language Model Evaluation (FLaME) | | 0
CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | | 0
Financial Numeric Extreme Labelling: A Dataset and Benchmarking for XBRL Tagging | | 0
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada | | 0
TEP-GNN: Accurate Execution Time Prediction of Functional Tests using Graph Neural Networks | | 0
Fine-Grained Classification of Pedestrians in Video: Benchmark and State of the Art | | 0
Terabyte-scale supervised 3D training and benchmarking dataset of the mouse kidney | | 0
Term-Class-Max-Support (TCMS): A Simple Text Document Categorization Approach Using Term-Class Relevance Measure | | 0
Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering | | 0
FineText: Text Classification via Attention-based Language Model Fine-tuning | | 0
Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs | | 0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets | | 0
FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets | | 0
Building benchmarking frameworks for supporting replicability and reproducibility: spatial and textual analysis as an example | | 0
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | | 0
FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking | | 0
AI Matrix - Synthetic Benchmarks for DNN | | 0
FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures | | 0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | | 0
FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization | | 0
Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems | | 0
Tetrad: Actively Secure 4PC for Secure Training and Inference | | 0
FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning | | 0
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | | 0
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | | 0
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models | | 0
FlowMind: Automatic Workflow Generation with LLMs | | 0
AI Idea Bench 2025: AI Research Idea Generation Benchmark | | 0
Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text | | 0
A Benchmark for Out of Distribution Detection in Point Cloud 3D Semantic Segmentation | | 0
Fluorescent Neuronal Cells v2: Multi-Task, Multi-Format Annotations for Deep Learning in Microscopy | | 0
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | | 0
Building a continuous benchmarking ecosystem in bioinformatics | | 0
Enhancing Architecture Frameworks by Including Modern Stakeholders and their Views/Viewpoints | | 0
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer | | 0
BuckTales : A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes | | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified