SOTAVerified

Benchmarking

Papers

Showing 476-500 of 5548 papers

Title | Status | Hype
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
Benchmarking Differential Privacy and Federated Learning for BERT Models | Code | 1
Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Shape Reconstruction | Code | 1
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
A Platform for the Biomedical Application of Large Language Models | Code | 1
Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Code | 1
Benchmarking Detection Transfer Learning with Vision Transformers | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified