| Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning | Oct 4, 2024 | Benchmarking, Uncertainty Quantification | Unverified | 0 |
| Towards a Benchmark for Large Language Models for Business Process Management Tasks | Oct 4, 2024 | Benchmarking, Management | Code Available | 0 |
| EBES: Easy Benchmarking for Event Sequences | Oct 4, 2024 | Benchmarking | Code Available | 1 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Oct 4, 2024 | Benchmarking | Code Available | 2 |
| Repurposing Foundation Model for Generalizable Medical Time Series Classification | Oct 3, 2024 | Benchmarking, Diagnostic | Unverified | 0 |
| LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services | Oct 3, 2024 | Benchmarking, GPU | Code Available | 1 |
| DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Oct 3, 2024 | Benchmarking, Imitation Learning | Code Available | 1 |
| Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | Oct 3, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| MANTRA: The Manifold Triangulations Assemblage | Oct 3, 2024 | Benchmarking | Code Available | 0 |
| Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Oct 3, 2024 | Autonomous Driving, Backdoor Attack | Code Available | 3 |
| IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models | Oct 3, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning | Oct 2, 2024 | Benchmarking, Denoising | Unverified | 0 |
| CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | Oct 2, 2024 | Benchmarking, Long Form Question Answering | Unverified | 0 |
| Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description | Oct 2, 2024 | Benchmarking, Facial expression generation | Unverified | 0 |
| MONICA: Benchmarking on Long-tailed Medical Image Classification | Oct 2, 2024 | Benchmarking, Classification | Code Available | 1 |
| OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models | Oct 2, 2024 | Benchmarking | Code Available | 3 |
| StringLLM: Understanding the String Processing Capability of Large Language Models | Oct 2, 2024 | Benchmarking | Code Available | 1 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | Oct 2, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| Deep Unlearn: Benchmarking Machine Unlearning | Oct 2, 2024 | Benchmarking, Machine Unlearning | Unverified | 0 |
| ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving | Oct 2, 2024 | Benchmarking, Document Summarization | Unverified | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | Unverified | 0 |
| shapiq: Shapley Interactions for Machine Learning | Oct 2, 2024 | Benchmarking, Data Valuation | Code Available | 4 |
| Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | Oct 1, 2024 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | Unverified | 0 |
| CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset | Oct 1, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis | Sep 30, 2024 | Benchmarking, Intrusion Detection | Code Available | 1 |
| ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning | Sep 30, 2024 | Benchmarking, Disparity Estimation | Code Available | 0 |
| Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration | Sep 30, 2024 | Benchmarking, Intent Detection | Unverified | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Match Stereo Videos via Bidirectional Alignment | Sep 30, 2024 | Benchmarking, Stereo Matching | Unverified | 0 |
| Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models | Sep 30, 2024 | Benchmarking, Continual Learning | Code Available | 2 |
| GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks | Sep 29, 2024 | Benchmarking | Unverified | 0 |
| Tracking Everything in Robotic-Assisted Surgery | Sep 29, 2024 | Benchmarking | Unverified | 0 |
| A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Sep 29, 2024 | Benchmarking, graph construction | Code Available | 2 |
| AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy | Sep 29, 2024 | Astronomy, Benchmarking | Unverified | 0 |
| Constrained Reinforcement Learning for Safe Heat Pump Control | Sep 29, 2024 | Benchmarking, reinforcement-learning | Code Available | 0 |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | Sep 28, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes | Sep 27, 2024 | Benchmarking, Dataset Generation | Unverified | 0 |
| bnRep: A repository of Bayesian networks from the academic literature | Sep 27, 2024 | Benchmarking | Unverified | 0 |
| CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting | Sep 27, 2024 | Articles, Benchmarking | Unverified | 0 |
| MCUBench: A Benchmark of Tiny Object Detectors on MCUs | Sep 27, 2024 | Benchmarking, Model Selection | Unverified | 0 |
| Data Analysis in the Era of Generative AI | Sep 27, 2024 | Benchmarking | Unverified | 0 |
| Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study | Sep 27, 2024 | Benchmarking, tabular-regression | Code Available | 0 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Sep 27, 2024 | AutoML, Benchmarking | Code Available | 1 |
| The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark | Sep 26, 2024 | Anomaly Detection, Benchmarking | Code Available | 3 |
| Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs | Sep 26, 2024 | Benchmarking, Conformal Prediction | Code Available | 0 |
| MALPOLON: A Framework for Deep Species Distribution Modeling | Sep 26, 2024 | Benchmarking, GPU | Code Available | 1 |
| Omnibenchmark (alpha) for continuous and open benchmarking in bioinformatics | Sep 25, 2024 | Benchmarking | Unverified | 0 |
| Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning | Sep 25, 2024 | Benchmarking, Formal Logic | Unverified | 0 |