SOTAVerified

Benchmarking

Papers

Showing 1751–1800 of 5548 papers

Title | Status | Hype
BLADE: Benchmarking Language Model Agents for Data-Driven Science | Code | 1
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving | - | 0
Benchmarking quantum machine learning kernel training for classification tasks | Code | 0
PADetBench: Towards Benchmarking Physical Attacks against Object Detection | Code | 1
Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors | - | 0
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition | Code | 1
SustainDC: Benchmarking for Sustainable Data Center Control | Code | 2
TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases | Code | 1
XCompress: LLM assisted Python-based text compression toolkit | Code | 0
Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset | Code | 1
A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation | - | 0
A Meta-Engine Framework for Interleaved Task and Motion Planning using Topological Refinements | - | 0
Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration | - | 0
UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios | Code | 1
Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy | Code | 0
The impact of internal variability on benchmarking deep learning climate emulators | Code | 1
h4rm3l: A language for Composable Jailbreak Attack Synthesis | - | 0
SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios | - | 0
FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data | - | 0
Towards Explainable Network Intrusion Detection using Large Language Models | - | 0
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Code | 1
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models | Code | 1
Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future Directions | - | 0
Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal | - | 0
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents | Code | 1
Segment Anything in Medical Images and Videos: Benchmark and Deployment | Code | 7
Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline | - | 0
MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities | - | 0
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future | - | 0
LMEMs for post-hoc analysis of HPO Benchmarking | Code | 0
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | - | 0
SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems | - | 0
Visual-Inertial SLAM for Unstructured Outdoor Environments: Benchmarking the Benefits and Computational Costs of Loop Closing | Code | 0
Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data | Code | 0
Deep Reinforcement Learning for Dynamic Order Picking in Warehouse Operations | - | 0
IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model | - | 0
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework | Code | 3
PINNs for Medical Image Analysis: A Survey | - | 0
IN-Sight: Interactive Navigation through Sight | - | 0
High-Quality, ROS Compatible Video Encoding and Decoding for High-Definition Datasets | Code | 0
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Code | 0
KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making | - | 0
Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication | - | 0
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | - | 0
GNUMAP: A Parameter-Free Approach to Unsupervised Dimensionality Reduction via Graph Neural Networks | - | 0
Benchmarking Histopathology Foundation Models for Ovarian Cancer Bevacizumab Treatment Response Prediction from Whole Slide Images | - | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | - | 0
Anomalous State Sequence Modeling to Enhance Safety in Reinforcement Learning | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified