SOTAVerified

Benchmarking

Papers

Showing 2526 to 2550 of 5548 papers

Title | Status | Hype
FALCON: Feature-Label Constrained Graph Net Collapse for Memory Efficient GNNs | Code | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Code | 0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0
Dialogue Quality and Emotion Annotations for Customer Support Conversations | Code | 0
Benchmarking Intersectional Biases in NLP | Code | 0
DFEE: Interactive DataFlow Execution and Evaluation Kit | Code | 0
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations | Code | 0
From raw affiliations to organization identifiers | Code | 0
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0
From Variability to Stability: Advancing RecSys Benchmarking Practices | Code | 0
From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in Histopathology | Code | 0
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation | Code | 0
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0
FR-MRInet: A Deep Convolutional Encoder-Decoder for Brain Tumor Segmentation with Relu-RGB and Sliding-window | Code | 0
From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning | Code | 0
Arabic Speech Recognition by End-to-End, Modular Systems and Human | Code | 0
Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0
Recognizing Object Affordances to Support Scene Reasoning for Manipulation Tasks | Code | 0
Detecting critical treatment effect bias in small subgroups | Code | 0
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | Code | 0
Page 102 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified