| BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts | Dec 3, 2024 | Age And Gender Classification, Age and Gender Estimation | Code Available | 0 |
| VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning | Dec 3, 2024 | Benchmarking, Visual Reasoning | Unverified | 0 |
| Benchmarking symbolic regression constant optimization schemes | Dec 3, 2024 | Benchmarking, regression | Unverified | 0 |
| Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data | Dec 3, 2024 | Benchmarking | Unverified | 0 |
| OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations | Dec 3, 2024 | Benchmarking, Face Recognition | Unverified | 0 |
| Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods | Dec 3, 2024 | Benchmarking | Code Available | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | Dec 2, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| AI Benchmarks and Datasets for LLM Evaluation | Dec 2, 2024 | Benchmarking, Distributed Computing | Unverified | 0 |
| Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024) | Dec 2, 2024 | Benchmarking, High-Level Synthesis | Code Available | 0 |
| Understanding the World's Museums through Vision-Language Reasoning | Dec 2, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences | Nov 30, 2024 | Benchmarking, Classification | Code Available | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering | Nov 29, 2024 | Benchmarking, Object | Unverified | 0 |
| HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos | Nov 28, 2024 | Benchmarking, Object Tracking | Unverified | 0 |
| Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks | Nov 28, 2024 | Benchmarking, Natural Language Inference | Unverified | 0 |
| λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | Nov 28, 2024 | Benchmarking, Diversity | Unverified | 0 |
| Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems | Nov 27, 2024 | AutoML, Benchmarking | Unverified | 0 |
| Benchmarking Agility and Reconfigurability in Satellite Systems for Tropical Cyclone Monitoring | Nov 27, 2024 | Benchmarking, Earth Observation | Unverified | 0 |
| Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches | Nov 26, 2024 | Benchmarking | Unverified | 0 |
| Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals | Nov 26, 2024 | Benchmarking, Retrieval | Unverified | 0 |
| Abnormality-Driven Representation Learning for Radiology Imaging | Nov 25, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Performance Benchmarking of Psychomotor Skills Using Wearable Devices: An Application in Sport | Nov 25, 2024 | Benchmarking | Unverified | 0 |
| A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation | Nov 25, 2024 | Active Learning, Bayesian Inference | Unverified | 0 |
| Benchmarking Active Learning for NILM | Nov 24, 2024 | Active Learning, Benchmarking | Unverified | 0 |
| ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | Nov 23, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Reassessing Layer Pruning in LLMs: New Insights and Methods | Nov 23, 2024 | Benchmarking, GPU | Code Available | 0 |
| AdamZ: An Enhanced Optimisation Method for Neural Network Training | Nov 22, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking the Robustness of Optical Flow Estimation to Corruptions | Nov 22, 2024 | Autonomous Driving, Benchmarking | Code Available | 0 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | Benchmarking, Caption Generation | Unverified | 0 |
| Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels | Nov 21, 2024 | Benchmarking, Machine Translation | Code Available | 0 |
| PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Nov 21, 2024 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling | Nov 21, 2024 | Articles, Benchmarking | Code Available | 0 |
| Benchmarking a wide range of optimisers for solving the Fermi-Hubbard model using the variational quantum eigensolver | Nov 20, 2024 | Benchmarking | Unverified | 0 |
| BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation | Nov 20, 2024 | Benchmarking, Point Cloud Segmentation | Unverified | 0 |
| Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking | Nov 20, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | Nov 20, 2024 | Benchmarking, NetHack | Unverified | 0 |
| Delta-Influence: Unlearning Poisons via Influence Functions | Nov 20, 2024 | Attribute, Benchmarking | Code Available | 0 |
| Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis | Nov 19, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Positional Encodings for GNNs and Graph Transformers | Nov 19, 2024 | Benchmarking | Code Available | 0 |
| The Moral Mind(s) of Large Language Models | Nov 19, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Nov 18, 2024 | Benchmarking, Multimodal Large Language Model | Code Available | 0 |
| Benchmarking pre-trained text embedding models in aligning built asset information | Nov 18, 2024 | Asset Management, Benchmarking | Code Available | 0 |
| Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies | Nov 17, 2024 | Benchmarking | Unverified | 0 |
| FastDraft: How to Train Your Draft | Nov 17, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| Reinforcing Competitive Multi-Agents for Playing So Long Sucker | Nov 17, 2024 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML | Nov 17, 2024 | Benchmarking, Fairness | Unverified | 0 |
| Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Nov 16, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | Nov 15, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering | Nov 15, 2024 | Benchmarking, Clustering | Unverified | 0 |
| Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT | Nov 15, 2024 | Benchmarking | Unverified | 0 |