SOTAVerified

Benchmarking

Papers

Showing 451-500 of 5548 papers

Title | Status | Hype
An Image Dataset for Benchmarking Recommender Systems with Raw Pixels | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Code | 1
AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
AnomalyHop: An SSL-based Image Anomaly Localization Method | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
An Open-source Benchmark of Deep Learning Models for Audio-visual Apparent and Self-reported Personality Recognition | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
Benchmarking Econometric and Machine Learning Methodologies in Nowcasting | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
Benchmarking Differential Privacy and Federated Learning for BERT Models | Code | 1
Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Shape Reconstruction | Code | 1
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
A Platform for the Biomedical Application of Large Language Models | Code | 1
Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Code | 1
Benchmarking Detection Transfer Learning with Vision Transformers | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified