| BLADE: Benchmarking Language Model Agents for Data-Driven Science | Aug 19, 2024 | Benchmarking, Decision Making | Code Available | 1 |
| Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving | Aug 19, 2024 | Benchmarking, Machine Translation | Unverified | 0 |
| Benchmarking quantum machine learning kernel training for classification tasks | Aug 17, 2024 | Benchmarking, Quantum Machine Learning | Code Available | 0 |
| PADetBench: Towards Benchmarking Physical Attacks against Object Detection | Aug 17, 2024 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors | Aug 15, 2024 | Benchmarking, Management | Unverified | 0 |
| SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition | Aug 14, 2024 | Automatic Speech Recognition, Benchmarking | Code Available | 1 |
| SustainDC: Benchmarking for Sustainable Data Center Control | Aug 14, 2024 | Benchmarking, Management | Code Available | 2 |
| TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases | Aug 14, 2024 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| XCompress: LLM assisted Python-based text compression toolkit | Aug 12, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset | Aug 12, 2024 | Benchmarking | Code Available | 1 |
| A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation | Aug 11, 2024 | Benchmarking, image-classification | Unverified | 0 |
| A Meta-Engine Framework for Interleaved Task and Motion Planning using Topological Refinements | Aug 11, 2024 | Benchmarking, Motion Planning | Unverified | 0 |
| Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration | Aug 9, 2024 | Benchmarking, Video Compression | Unverified | 0 |
| UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios | Aug 9, 2024 | Benchmarking, Human Detection | Code Available | 1 |
| Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy | Aug 9, 2024 | Benchmarking, Medical Image Analysis | Code Available | 0 |
| The impact of internal variability on benchmarking deep learning climate emulators | Aug 9, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | Benchmarking, Program Synthesis | Unverified | 0 |
| SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios | Aug 8, 2024 | Active Learning, Benchmarking | Unverified | 0 |
| FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data | Aug 8, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Towards Explainable Network Intrusion Detection using Large Language Models | Aug 8, 2024 | Benchmarking, Intrusion Detection | Unverified | 0 |
| Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Aug 7, 2024 | Benchmarking, Language Identification | Code Available | 1 |
| WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models | Aug 7, 2024 | AI and Safety, Benchmarking | Code Available | 1 |
| Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future Directions | Aug 7, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal | Aug 7, 2024 | Benchmarking, Hard Attention | Unverified | 0 |
| OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents | Aug 6, 2024 | Benchmarking, Retrieval-augmented Generation | Code Available | 1 |
| Segment Anything in Medical Images and Videos: Benchmark and Deployment | Aug 6, 2024 | Benchmarking, Segmentation | Code Available | 7 |
| Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline | Aug 6, 2024 | Benchmarking | Unverified | 0 |
| MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities | Aug 5, 2024 | Benchmarking, Graph Generation | Unverified | 0 |
| From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future | Aug 5, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| LMEMs for post-hoc analysis of HPO Benchmarking | Aug 5, 2024 | Benchmarking, Hyperparameter Optimization | Code Available | 0 |
| User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | Aug 4, 2024 | Action Anticipation, Benchmarking | Unverified | 0 |
| SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems | Aug 4, 2024 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Visual-Inertial SLAM for Unstructured Outdoor Environments: Benchmarking the Benefits and Computational Costs of Loop Closing | Aug 3, 2024 | Autonomous Navigation, Benchmarking | Code Available | 0 |
| Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data | Aug 3, 2024 | Benchmarking, Knowledge Graphs | Code Available | 0 |
| Deep Reinforcement Learning for Dynamic Order Picking in Warehouse Operations | Aug 3, 2024 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model | Aug 2, 2024 | Benchmarking, Feature Engineering | Unverified | 0 |
| Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Aug 2, 2024 | Adversarial Attack, Adversarial Purification | Code Available | 1 |
| Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Aug 2, 2024 | Benchmarking, multimodal interaction | Code Available | 0 |
| RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework | Aug 2, 2024 | Benchmarking, Dataset Generation | Code Available | 3 |
| PINNs for Medical Image Analysis: A Survey | Aug 2, 2024 | Anatomy, Benchmarking | Unverified | 0 |
| IN-Sight: Interactive Navigation through Sight | Aug 1, 2024 | Benchmarking, Navigate | Unverified | 0 |
| High-Quality, ROS Compatible Video Encoding and Decoding for High-Definition Datasets | Aug 1, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 0 |
| Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Jul 31, 2024 | Benchmarking, Large Language Model | Code Available | 0 |
| KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making | Jul 31, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Efficient Channel Estimation for Millimeter Wave and Terahertz Systems Enabled by Integrated Super-resolution Sensing and Communication | Jul 30, 2024 | Benchmarking, Super-Resolution | Unverified | 0 |
| TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | Jul 30, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| GNUMAP: A Parameter-Free Approach to Unsupervised Dimensionality Reduction via Graph Neural Networks | Jul 30, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Benchmarking Histopathology Foundation Models for Ovarian Cancer Bevacizumab Treatment Response Prediction from Whole Slide Images | Jul 30, 2024 | Benchmarking, Multiple Instance Learning | Unverified | 0 |
| Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | Jul 29, 2024 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| Anomalous State Sequence Modeling to Enhance Safety in Reinforcement Learning | Jul 29, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |