SOTAVerified

Benchmarking

Papers

Showing 2151-2200 of 5548 papers

Title | Status | Hype
Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data | | 0
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification | | 0
Benchmarking and Improving Detail Image Caption | Code | 2
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | Code | 1
Quantitative Certification of Bias in Large Language Models | Code | 1
Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion | | 0
Risk-Neutral Generative Networks | | 0
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime | Code | 1
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | Code | 1
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | Code | 3
A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis | | 0
Benchmarking General-Purpose In-Context Learning | | 0
GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases | | 0
BOLD: Boolean Logic Deep Learning | | 0
Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum" | | 0
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification | | 0
NuwaTS: a Foundation Model Mending Every Incomplete Time Series | | 0
Benchmarking Hierarchical Image Pyramid Transformer for the classification of colon biopsies and polyps in histopathology images | | 0
Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study | | 0
MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model | | 0
Full-stack evaluation of Machine Learning inference workloads for RISC-V systems | | 0
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks | | 0
Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling | Code | 1
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models | Code | 2
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models | | 0
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | Code | 4
GCondenser: Benchmarking Graph Condensation | Code | 1
A Gap in Time: The Challenge of Processing Heterogeneous IoT Data in Digitalized Buildings | | 0
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models | | 0
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models | | 0
EXACT: Towards a platform for empirically benchmarking Machine Learning model explanation methods | | 0
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning | Code | 2
DispaRisk: Auditing Fairness Through Usable Information | Code | 0
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Code | 2
EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models | | 0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0
SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge | | 0
BraTS-Path Challenge: Assessing Heterogeneous Histopathologic Brain Tumor Sub-regions | | 0
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3
A Robust Autoencoder Ensemble-Based Approach for Anomaly Detection in Text | | 0
Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions | Code | 0
An Integrated Framework for Multi-Granular Explanation of Video Summarization | Code | 0
DocuMint: Docstring Generation for Python using Small Language Models | Code | 1
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1
SpeechVerse: A Large-scale Generalizable Audio Language Model | | 0
UCCIX: Irish-eXcellence Large Language Model | | 0
Divergent Creativity in Humans and Large Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified