SOTAVerified

Benchmarking

Papers

Showing 1676-1700 of 5548 papers

Title | Status | Hype
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | | 0
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | | 0
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | | 0
PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow | | 0
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese | Code | 0
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | | 0
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3 | | 0
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | | 0
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data | Code | 0
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective | Code | 0
TabularQGAN: A Quantum Generative Model for Tabular Data | | 0
Jailbreak Distillation: Renewable Safety Benchmarking | | 0
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis | Code | 0
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | Code | 0
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | | 0
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning | Code | 0
Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter | Code | 0
VideoMarkBench: Benchmarking Robustness of Video Watermarking | Code | 0
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | | 0
Gauss-Ramanujan Functions: Constructions, Properties, and Applications in Communications and Signal Processing | | 0
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes | | 0
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs | Code | 0
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | | 0
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified