SOTAVerified

Benchmarking

Papers

Showing 201-250 of 5548 papers

Title | Status | Hype
SORCE: Small Object Retrieval in Complex Environments | Code | 0
GenSpace: Benchmarking Spatially-Aware Image Generation | - | 0
Segmenting France Across Four Centuries | Code | 0
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | - | 0
Benchmarking Foundation Models for Zero-Shot Biometric Tasks | - | 0
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | Code | 0
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | - | 0
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | - | 0
Automated Structured Radiology Report Generation | - | 0
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | - | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | - | 0
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | - | 0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | - | 0
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | - | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | - | 0
VERINA: Benchmarking Verifiable Code Generation | Code | 2
LLM Performance for Code Generation on Noisy Tasks | Code | 0
Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | Code | 1
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective | Code | 0
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | Code | 0
Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking | Code | 1
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | - | 0
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis | Code | 0
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | - | 0
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking | Code | 1
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3 | - | 0
PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow | - | 0
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese | Code | 0
Jailbreak Distillation: Renewable Safety Benchmarking | - | 0
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data | Code | 0
TabularQGAN: A Quantum Generative Model for Tabular Data | - | 0
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | - | 0
SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem | Code | 1
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | Code | 0
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes | - | 0
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | Code | 1
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs | Code | 0
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | Code | 2
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | - | 0
FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Code | 1
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | - | 0
Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter | Code | 0
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning | Code | 0
Page 5 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified