| QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Jun 9, 2024 | Benchmarking, Question Generation | Code Available | 1 |
| Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications | Jun 8, 2024 | Benchmarking, Mamba | Unverified | 0 |
| 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | Jun 8, 2024 | Benchmarking, Instance Segmentation | Unverified | 0 |
| VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric | Jun 7, 2024 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Behavior Structformer: Learning Players Representations with Structured Tokenization | Jun 7, 2024 | Benchmarking | Unverified | 0 |
| GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models | Jun 7, 2024 | Benchmarking, Denoising | Unverified | 0 |
| WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Jun 7, 2024 | Benchmarking, Chatbot | Code Available | 3 |
| Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity | Jun 7, 2024 | Benchmarking, EEG | Code Available | 0 |
| CLoG: Benchmarking Continual Learning of Image Generation Models | Jun 7, 2024 | Benchmarking, Continual Learning | Code Available | 1 |
| Scenarios and Approaches for Situated Natural Language Explanations | Jun 7, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation | Jun 7, 2024 | Benchmarking | Unverified | 0 |
| Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Jun 7, 2024 | Benchmarking, Decoder | Code Available | 3 |
| Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking | Jun 6, 2024 | 6D Pose Estimation using RGB, Benchmarking | Unverified | 0 |
| Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | Jun 6, 2024 | Benchmarking, Drug Discovery | Unverified | 0 |
| Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | Jun 6, 2024 | Articles, Benchmarking | Unverified | 0 |
| Statistical Multicriteria Benchmarking via the GSD-Front | Jun 6, 2024 | Benchmarking | Unverified | 0 |
| Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Jun 6, 2024 | Autonomous Driving, Bench2Drive | Code Available | 4 |
| Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Jun 6, 2024 | Benchmarking, Recommendation Systems | Code Available | 0 |
| Time Sensitive Knowledge Editing through Efficient Finetuning | Jun 6, 2024 | Benchmarking, knowledge editing | Unverified | 0 |
| NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | Jun 6, 2024 | Benchmarking, Scheduling | Unverified | 0 |
| MLVU: Benchmarking Multi-task Long Video Understanding | Jun 6, 2024 | Benchmarking, Video Understanding | Code Available | 3 |
| Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices | Jun 6, 2024 | Benchmarking, RAG | Unverified | 0 |
| BEADs: Bias Evaluation Across Domains | Jun 6, 2024 | Benchmarking, Bias Detection | Unverified | 0 |
| TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising | Jun 5, 2024 | Benchmarking, Denoising | Code Available | 1 |
| Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation | Jun 5, 2024 | Benchmarking, Image Segmentation | Unverified | 0 |
| CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Jun 5, 2024 | Benchmarking, energy management | Code Available | 1 |
| A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | Jun 5, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |
| CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Jun 5, 2024 | Benchmarking | Code Available | 1 |
| Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance | Jun 4, 2024 | Benchmarking, Drug Discovery | Code Available | 0 |
| ACCORD: Closing the Commonsense Measurability Gap | Jun 4, 2024 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check | Jun 4, 2024 | Benchmarking, Representation Learning | Unverified | 0 |
| MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset | Jun 4, 2024 | Benchmarking | Code Available | 0 |
| Analyzing the Feature Extractor Networks for Face Image Synthesis | Jun 4, 2024 | Benchmarking, Image Generation | Code Available | 0 |
| TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Jun 4, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Jun 4, 2024 | Benchmarking, Clustering | Code Available | 1 |
| Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs | Jun 4, 2024 | Benchmarking, Fairness | Unverified | 0 |
| R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | Jun 3, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | Jun 3, 2024 | Action Recognition, Benchmarking | Unverified | 0 |
| LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | Jun 3, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |
| animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Jun 3, 2024 | Audio Classification, Benchmarking | Code Available | 1 |
| TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Jun 3, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Scaffold Splits Overestimate Virtual Screening Performance | Jun 2, 2024 | Benchmarking, Clustering | Unverified | 0 |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Jun 1, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Jun 1, 2024 | Benchmarking | Code Available | 1 |
| On the project risk baseline: integrating aleatory uncertainty into project scheduling | May 31, 2024 | Benchmarking, Scheduling | Unverified | 0 |
| LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | May 30, 2024 | Benchmarking | Code Available | 1 |
| SECURE: Benchmarking Large Language Models for Cybersecurity | May 30, 2024 | Benchmarking | Code Available | 1 |
| Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | May 30, 2024 | All, Benchmarking | Unverified | 0 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| CoSy: Evaluating Textual Explanations of Neurons | May 30, 2024 | Benchmarking | Unverified | 0 |