Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 3101–3150 of 5548 papers

Title	Date	Tasks	Status
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking	Jun 6, 2024	6D Pose Estimation using RGBBenchmarking	—Unverified
Time Sensitive Knowledge Editing through Efficient Finetuning	Jun 6, 2024	Benchmarkingknowledge editing	—Unverified
Statistical Multicriteria Benchmarking via the GSD-Front	Jun 6, 2024	Benchmarking	—Unverified
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection	Jun 5, 2024	Anomaly DetectionBenchmarking	—Unverified
Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation	Jun 5, 2024	BenchmarkingImage Segmentation	—Unverified
Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs	Jun 4, 2024	BenchmarkingFairness	—Unverified
Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check	Jun 4, 2024	BenchmarkingRepresentation Learning	—Unverified
Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance	Jun 4, 2024	BenchmarkingDrug Discovery	CodeCode Available
Analyzing the Feature Extractor Networks for Face Image Synthesis	Jun 4, 2024	BenchmarkingImage Generation	CodeCode Available
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset	Jun 4, 2024	Benchmarking	CodeCode Available
ACCORD: Closing the Commonsense Measurability Gap	Jun 4, 2024	BenchmarkingCommon Sense Reasoning	CodeCode Available
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability	Jun 4, 2024	BenchmarkingLanguage Modeling	CodeCode Available
LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions	Jun 3, 2024	Autonomous DrivingBenchmarking	—Unverified
ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection	Jun 3, 2024	Action RecognitionBenchmarking	—Unverified
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models	Jun 3, 2024	BenchmarkingCode Completion	—Unverified
Scaffold Splits Overestimate Virtual Screening Performance	Jun 2, 2024	BenchmarkingClustering	—Unverified
WebSuite: Systematically Evaluating Why Web Agents Fail	Jun 1, 2024	BenchmarkingDiagnostic	CodeCode Available
On the project risk baseline: integrating aleatory uncertainty into project scheduling	May 31, 2024	BenchmarkingScheduling	—Unverified
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images	May 30, 2024	AllBenchmarking	—Unverified
CoSy: Evaluating Textual Explanations of Neurons	May 30, 2024	Benchmarking	—Unverified
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification	May 29, 2024	Benchmarking	—Unverified
Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data	May 29, 2024	BenchmarkingSpecificity	—Unverified
Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion	May 28, 2024	BenchmarkingEmotion Recognition	—Unverified
Risk-Neutral Generative Networks	May 28, 2024	Benchmarking	—Unverified
A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis	May 27, 2024	Benchmarking	—Unverified
Benchmarking General-Purpose In-Context Learning	May 27, 2024	BenchmarkingDecision Making	—Unverified
GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases	May 25, 2024	BenchmarkingHallucination	—Unverified
BOLD: Boolean Logic Deep Learning	May 25, 2024	BenchmarkingDeep Learning	—Unverified
NuwaTS: a Foundation Model Mending Every Incomplete Time Series	May 24, 2024	BenchmarkingContrastive Learning	—Unverified
MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model	May 24, 2024	BenchmarkingDemand Forecasting	—Unverified
Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum"	May 24, 2024	BenchmarkingClassification	—Unverified
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks	May 24, 2024	BenchmarkingDecoder	—Unverified
Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study	May 24, 2024	BenchmarkingVulnerability Detection	—Unverified
Full-stack evaluation of Machine Learning inference workloads for RISC-V systems	May 24, 2024	BenchmarkingDeep Learning	—Unverified
Benchmarking Hierarchical Image Pyramid Transformer for the classification of colon biopsies and polyps in histopathology images	May 24, 2024	BenchmarkingClassification	—Unverified
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification	May 24, 2024	BenchmarkingData Augmentation	—Unverified
A Gap in Time: The Challenge of Processing Heterogeneous IoT Data in Digitalized Buildings	May 23, 2024	BenchmarkingData Integration	—Unverified
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models	May 23, 2024	Autonomous DrivingBenchmarking	—Unverified
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models	May 22, 2024	BenchmarkingHallucination	—Unverified
EXACT: Towards a platform for empirically benchmarking Machine Learning model explanation methods	May 20, 2024	BenchmarkingExplainable artificial intelligence	—Unverified
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models	May 20, 2024	BenchmarkingDiversity	—Unverified
DispaRisk: Auditing Fairness Through Usable Information	May 20, 2024	BenchmarkingBias Detection	CodeCode Available
EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models	May 18, 2024	BenchmarkingSpecificity	—Unverified
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT	May 17, 2024	BenchmarkingMultiple-choice	—Unverified
SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge	May 17, 2024	BenchmarkingSocial Media Popularity Prediction	—Unverified
BraTS-Path Challenge: Assessing Heterogeneous Histopathologic Brain Tumor Sub-regions	May 17, 2024	BenchmarkingPrognosis	—Unverified
An Integrated Framework for Multi-Granular Explanation of Video Summarization	May 16, 2024	BenchmarkingPanoptic Segmentation	CodeCode Available
Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions	May 16, 2024	BenchmarkingReinforcement Learning (RL)	CodeCode Available
A Robust Autoencoder Ensemble-Based Approach for Anomaly Detection in Text	May 16, 2024	Anomaly DetectionBenchmarking	—Unverified
SpeechVerse: A Large-scale Generalizable Audio Language Model	May 14, 2024	Automatic Speech RecognitionBenchmarking	—Unverified

Show:10 25 50

← PrevPage 63 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified