SOTAVerified

Benchmarking

Papers

Showing 2501–2525 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| DispaRisk: Auditing Fairness Through Usable Information | Code | 0 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0 |
| Flexible Generation of Preference Data for Recommendation Analysis | Code | 0 |
| Geological Inference from Textual Data using Word Embeddings | Code | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| Generalization and Regularization in DQN | Code | 0 |
| Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Code | 0 |
| A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0 |
| Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware | Code | 0 |
| Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Code | 0 |
| Dialogue Quality and Emotion Annotations for Customer Support Conversations | Code | 0 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0 |
| From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0 |
| Benchmarking pre-trained text embedding models in aligning built asset information | Code | 0 |
| From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0 |
| From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning | Code | 0 |
| From raw affiliations to organization identifiers | Code | 0 |
| Benchmarking Intersectional Biases in NLP | Code | 0 |
| DFEE: Interactive DataFlow Execution and Evaluation Kit | Code | 0 |
| A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Code | 0 |
| From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in Histopathology | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |