SOTAVerified

Benchmarking

Papers

Showing 2801–2850 of 5548 papers

Title | Status | Hype
GPTs and Language Barrier: A Cross-Lingual Legal QA Examination |  | 0
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models |  | 0
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems |  | 0
Variational Laplace for Bayesian neural networks |  | 0
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities |  | 0
Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking |  | 0
Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings |  | 0
Beyond Benchmarks: On The False Promise of AI Regulation |  | 0
Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms |  | 0
Graph-based Deep-Tree Recursive Neural Network (DTRNN) for Text Classification |  | 0
Graph-based Prediction and Planning Policy Network (GP3Net) for scalable self-driving in dynamic environments using Deep Reinforcement Learning |  | 0
Graph clustering with Boltzmann machines |  | 0
A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection |  | 0
GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets |  | 0
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models |  | 0
Label Efficient Regularization and Propagation for Graph Node Classification |  | 0
Graph Joint Attention Networks |  | 0
A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds |  | 0
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra |  | 0
Better Practices for Domain Adaptation |  | 0
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation |  | 0
Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers |  | 0
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures |  | 0
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal |  | 0
Best Practices in Pool-based Active Learning for Image Classification |  | 0
Abasy Atlas v2.2: The most comprehensive and up-to-date inventory of meta-curated, historical, bacterial regulatory networks, their completeness and system-level characterization |  | 0
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach |  | 0
BERT-GT: Cross-sentence n-ary relation extraction with BERT and Graph Transformer |  | 0
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices |  | 0
Grid Search Hyperparameter Benchmarking of BERT, ALBERT, and LongFormer on DuoRC |  | 0
BERT-based Chinese Text Classification for Emergency Domain with a Novel Loss Function |  | 0
AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI |  | 0
Benefits and Challenges of Dynamic Modelling of Cascading Failures in Power Systems |  | 0
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models |  | 0
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents |  | 0
BenchMARL: Benchmarking Multi-Agent Reinforcement Learning |  | 0
gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs |  | 0
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation |  | 0
Benchmarks as Microscopes: A Call for Model Metrology |  | 0
The Curious Case of Integrator Reach Sets, Part I: Basic Theory |  | 0
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation |  | 0
Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks |  | 0
Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge |  | 0
Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance |  | 0
VoiceWukong: Benchmarking Deepfake Voice Detection |  | 0
h4rm3l: A language for Composable Jailbreak Attack Synthesis |  | 0
Benchmarking zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text |  | 0
Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure |  | 0
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems |  | 0
HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images |  | 0
Page 57 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified