SOTAVerified

Benchmarking

Papers

Showing 2851-2900 of 5548 papers

Title | Status | Hype
No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA | - | 0
Data Augmentation for Continual RL via Adversarial Gradient Episodic Memory | - | 0
Open Llama2 Model for the Lithuanian Language | - | 0
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection | - | 0
S3Simulator: A benchmarking Side Scan Sonar Simulator dataset for Underwater Image Analysis | Code | 0
Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures | - | 0
Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification | - | 0
WCEbleedGen: A wireless capsule endoscopy dataset and its benchmarking for automatic bleeding classification, detection, and segmentation | Code | 0
MultiMed: Massively Multimodal and Multitask Medical Understanding | - | 0
Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis | - | 0
WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain | - | 0
Advances in Preference-based Reinforcement Learning: A Review | - | 0
SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins | Code | 0
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands | - | 0
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning | - | 0
UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library | - | 0
ISLES'24: Improving final infarct prediction in ischemic stroke using multimodal imaging and clinical data | - | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving | - | 0
Benchmarking quantum machine learning kernel training for classification tasks | Code | 0
Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors | - | 0
XCompress: LLM assisted Python-based text compression toolkit | Code | 0
A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation | - | 0
A Meta-Engine Framework for Interleaved Task and Motion Planning using Topological Refinements | - | 0
Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration | - | 0
Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy | Code | 0
h4rm3l: A language for Composable Jailbreak Attack Synthesis | - | 0
FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data | - | 0
SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios | - | 0
Towards Explainable Network Intrusion Detection using Large Language Models | - | 0
Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal | - | 0
Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future Directions | - | 0
Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline | - | 0
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future | - | 0
LMEMs for post-hoc analysis of HPO Benchmarking | Code | 0
MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities | - | 0
SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems | - | 0
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | - | 0
Deep Reinforcement Learning for Dynamic Order Picking in Warehouse Operations | - | 0
Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data | Code | 0
Visual-Inertial SLAM for Unstructured Outdoor Environments: Benchmarking the Benefits and Computational Costs of Loop Closing | Code | 0
IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model | - | 0
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
PINNs for Medical Image Analysis: A Survey | - | 0
IN-Sight: Interactive Navigation through Sight | - | 0
High-Quality, ROS Compatible Video Encoding and Decoding for High-Definition Datasets | Code | 0
KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making | - | 0
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Code | 0
Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication | - | 0
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified