SOTAVerified

Benchmarking

Papers

Showing 2501–2550 of 5548 papers

Title | Status | Hype
Graph-theoretical approach to robust 3D normal extraction of LiDAR data | Code | 0
Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0
Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Code | 0
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0
Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0
Are Large Language Models Good at Utility Judgments? | Code | 0
DispaRisk: Auditing Fairness Through Usable Information | Code | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0
Geological Inference from Textual Data using Word Embeddings | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Code | 0
A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Generalization and Regularization in DQN | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish | Code | 0
FALCON: Feature-Label Constrained Graph Net Collapse for Memory Efficient GNNs | Code | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Code | 0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0
Dialogue Quality and Emotion Annotations for Customer Support Conversations | Code | 0
Benchmarking Intersectional Biases in NLP | Code | 0
DFEE: Interactive DataFlow Execution and Evaluation Kit | Code | 0
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations | Code | 0
From raw affiliations to organization identifiers | Code | 0
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Code | 0
From Variability to Stability: Advancing RecSys Benchmarking Practices | Code | 0
From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in Histopathology | Code | 0
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation | Code | 0
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | Code | 0
FR-MRInet: A Deep Convolutional Encoder-Decoder for Brain Tumor Segmentation with Relu-RGB and Sliding-window | Code | 0
From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning | Code | 0
Arabic Speech Recognition by End-to-End, Modular Systems and Human | Code | 0
Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0
Recognizing Object Affordances to Support Scene Reasoning for Manipulation Tasks | Code | 0
Detecting critical treatment effect bias in small subgroups | Code | 0
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | Code | 0
Page 51 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified