SOTAVerified

Benchmarking

Papers

Showing 601-650 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | | 0 |
| LEMUR Neural Network Dataset: Towards Seamless AutoML | Code | 1 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | | 0 |
| TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning | | 0 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Code | 1 |
| SortBench: Benchmarking LLMs based on their ability to sort lists | | 0 |
| TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Code | 2 |
| Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories | Code | 0 |
| Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms | Code | 0 |
| NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | Code | 0 |
| SydneyScapes: Image Segmentation for Australian Environments | | 0 |
| Geological Inference from Textual Data using Word Embeddings | Code | 0 |
| Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI | | 0 |
| Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs | | 0 |
| Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Code | 0 |
| Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program | Code | 0 |
| TabKAN: Advancing Tabular Data Analysis using Kolmogorov-Arnold Network | | 0 |
| Evolutionary Generation of Random Surreal Numbers for Benchmarking | Code | 1 |
| A Roadmap for Improving Data Reliability and Sharing in Crosslinking Mass Spectrometry | | 0 |
| RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | | 0 |
| Can Carbon-Aware Electric Load Shifting Reduce Emissions? An Equilibrium-Based Analysis | | 0 |
| Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset | | 0 |
| V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Code | 1 |
| An Empirical Study of GPT-4o Image Generation Capabilities | Code | 1 |
| Towards Visual Text Grounding of Multimodal Large Language Model | | 0 |
| SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Code | 0 |
| Leveraging State Space Models in Long Range Genomics | | 0 |
| Generative Adversarial Networks with Limited Data: A Survey and Benchmarking | | 0 |
| Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces | | 0 |
| Cross-functional transferability in universal machine learning interatomic potentials | | 0 |
| A Solid-State Nanopore Signal Generator for Training Machine Learning Models | | 0 |
| Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search | | 0 |
| Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression | Code | 0 |
| Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Code | 0 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1 |
| A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0 |
| Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | | 0 |
| Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems | Code | 0 |
| Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0 |
| Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency | | 0 |
| Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0 |
| Evaluating AI Recruitment Sourcing Tools by Human Preference | Code | 0 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1 |
| Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge | | 0 |
| Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms | | 0 |
| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | | 0 |
Page 13 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |