SOTAVerified

Benchmarking

Papers

Showing 301–350 of 5548 papers

Title | Status | Hype
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Code | 2
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Code | 2
WAVES: Benchmarking the Robustness of Image Watermarks | Code | 2
Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Code | 2
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Code | 2
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Code | 2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2
AlignBench: Benchmarking Chinese Alignment of Large Language Models | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
Exponentially Faster Language Modelling | Code | 2
What's In My Big Data? | Code | 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
Formalizing and Benchmarking Prompt Injection Attacks and Defenses | Code | 2
Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Code | 2
ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Code | 2
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Code | 2
LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
A Content-Driven Micro-Video Recommendation Dataset at Scale | Code | 2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2
VerilogEval: Evaluating Large Language Models for Verilog Code Generation | Code | 2
PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips | Code | 2
Benchmarking Large Language Models in Retrieval-Augmented Generation | Code | 2
Orientation-Independent Chinese Text Recognition in Scene Images | Code | 2
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations | Code | 2
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents | Code | 2
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Code | 2
Foundational Models Defining a New Era in Vision: A Survey and Outlook | Code | 2
Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation of rPPG | Code | 2
Benchmarking Potential Based Rewards for Learning Humanoid Locomotion | Code | 2
EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2
A Dynamic Points Removal Benchmark in Point Cloud Maps | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Code | 2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2
OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems | Code | 2
PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Code | 2
LibAUC: A Deep Learning Library for X-Risk Optimization | Code | 2
The Brain Tumor Segmentation (BraTS) Challenge 2023: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs) | Code | 2
Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models | Code | 2
RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning | Code | 2
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | Code | 2
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Code | 2
Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified