SOTAVerified

Benchmarking

Papers

Showing 851-875 of 5548 papers

Title | Status | Hype
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Code | 1
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime | Code | 1
Flames: Benchmarking Value Alignment of LLMs in Chinese | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
MultiIoT: Benchmarking Machine Learning for the Internet of Things | Code | 1
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs | Code | 1
The voraus-AD Dataset for Anomaly Detection in Robot Applications | Code | 1
The PetShop Dataset -- Finding Causes of Performance Issues across Microservices | Code | 1
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Code | 1
Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Code | 1
JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds | Code | 1
FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation | Code | 1
NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Code | 1
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
In Search of Lost Online Test-time Adaptation: A Survey | Code | 1
Re-evaluating Retrosynthesis Algorithms with Syntheseus | Code | 1
MLFMF: Data Sets for Machine Learning for Mathematical Formalization | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1
Page 35 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified