SOTAVerified

Benchmarking

Papers

Showing 851-900 of 5548 papers

Title | Status | Hype
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime | Code | 1
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models | Code | 1
Flames: Benchmarking Value Alignment of LLMs in Chinese | Code | 1
MultiIoT: Benchmarking Machine Learning for the Internet of Things | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs | Code | 1
The PetShop Dataset -- Finding Causes of Performance Issues across Microservices | Code | 1
The voraus-AD Dataset for Anomaly Detection in Robot Applications | Code | 1
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Code | 1
JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds | Code | 1
Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Code | 1
NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Code | 1
FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation | Code | 1
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
In Search of Lost Online Test-time Adaptation: A Survey | Code | 1
Re-evaluating Retrosynthesis Algorithms with Syntheseus | Code | 1
MLFMF: Data Sets for Machine Learning for Mathematical Formalization | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1
Fast hyperboloid decision tree algorithms | Code | 1
OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Code | 1
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Code | 1
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Code | 1
Object-aware Inversion and Reassembly for Image Editing | Code | 1
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
Welfare Diplomacy: Benchmarking Language Model Cooperation | Code | 1
pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Code | 1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1
Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Code | 1
GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Code | 1
MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Code | 1
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Code | 1
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1
PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Code | 1
Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Code | 1
GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Code | 1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1
PGDQN: Preference-Guided Deep Q-Network | Code | 1
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified