SOTAVerified

Benchmarking

Papers

Showing 5301-5350 of 5548 papers

Title | Status | Hype
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Code | 0
CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset | Code | 0
Cryo-RALib -- a modular library for accelerating alignment in cryo-EM | Code | 0
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition | Code | 0
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions | Code | 0
Cross-Lingual Text Classification of Transliterated Hindi and Malayalam | Code | 0
Benchmarking Flexible Electric Loads Scheduling Algorithms under Market Price Uncertainty | Code | 0
Yum-me: A Personalized Nutrient-based Meal Recommender System | Code | 0
Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation | Code | 0
Cross-lingual sentiment classification in low-resource Bengali language | Code | 0
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Code | 0
STREETS: A Novel Camera Network Dataset for Traffic Flow | Code | 0
Benchmarking Feature-based Algorithm Selection Systems for Black-box Numerical Optimization | Code | 0
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Code | 0
Benchmarking Failures in Tool-Augmented Language Models | Code | 0
CRNN: A Joint Neural Network for Redundancy Detection | Code | 0
Critical review of conformational B-cell epitope prediction methods | Code | 0
PICO Element Detection in Medical Text via Long Short-Term Memory Neural Networks | Code | 0
Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks | Code | 0
CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching | Code | 0
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Code | 0
An Optical Control Environment for Benchmarking Reinforcement Learning Algorithms | Code | 0
STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking | Code | 0
An open unified deep graph learning framework for discovering drug leads | Code | 0
PixelBrax: Learning Continuous Control from Pixels End-to-End on the GPU | Code | 0
PixelHop: A Successive Subspace Learning (SSL) Method for Object Classification | Code | 0
pke: an open source python-based keyphrase extraction toolkit | Code | 0
Benchmarking Educational Program Repair | Code | 0
A Benchmarking Study of Vision-based Robotic Grasping Algorithms | Code | 0
CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization | Code | 0
CREPO: An Open Repository to Benchmark Credal Network Algorithms | Code | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI | Code | 0
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models | Code | 0
ConvGeN: Convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets | Code | 0
PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison | Code | 0
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework | Code | 0
pmuBAGE: The Benchmarking Assortment of Generated PMU Data for Power System Events -- Part I: Overview and Results | Code | 0
pmuBAGE: The Benchmarking Assortment of Generated PMU Data for Power System Events | Code | 0
Continuous Optimization Benchmarks by Simulation | Code | 0
Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study | Code | 0
Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Code | 0
Structured Prediction Problem Archive | Code | 0
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Code | 0
Benchmarking down-scaled (not so large) pre-trained language models | Code | 0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics | Code | 0
ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation) | Code | 0
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Code | 0
Content-Aware Differential Privacy with Conditional Invertible Neural Networks | Code | 0
Population-wise Labeling of Sulcal Graphs using Multi-graph Matching | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified