SOTAVerified

Benchmarking

Papers

Showing 451-475 of 5548 papers

Title | Status | Hype
An Exploration of Embodied Visual Exploration | Code | 1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Code | 1
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Code | 1
AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph | Code | 1
Benchmarking Econometric and Machine Learning Methodologies in Nowcasting | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Benchmarking Detection Transfer Learning with Vision Transformers | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
AnomalyHop: An SSL-based Image Anomaly Localization Method | Code | 1
Benchmarking Deep Models for Salient Object Detection | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Benchmarking Deep Learning Interpretability in Time Series Predictions | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified