| When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning | Oct 11, 2024 | Attribute, Benchmarking | Code Available | 1 |
| Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | Oct 11, 2024 | Benchmarking, Sentence | Unverified | 0 |
| Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | Oct 11, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | Oct 11, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | Oct 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Identifying Money Laundering Subgraphs on the Blockchain | Oct 10, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 |
| Audio Explanation Synthesis with Generative Foundation Models | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Advocating Character Error Rate for Multilingual ASR Evaluation | Oct 9, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning | Oct 9, 2024 | Benchmarking, Fairness | Code Available | 0 |
| Towards Generalisable Time Series Understanding Across Domains | Oct 9, 2024 | Benchmarking, Time Series | Code Available | 1 |
| Analysis of different disparity estimation techniques on aerial stereo image datasets | Oct 9, 2024 | Benchmarking, Depth Estimation | Unverified | 0 |
| InAttention: Linear Context Scaling for Transformers | Oct 9, 2024 | Benchmarking, Decoder | Unverified | 0 |
| OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB | Oct 9, 2024 | Benchmarking, Diversity | Unverified | 0 |
| TuringQ: Benchmarking AI Comprehension in Theory of Computation | Oct 9, 2024 | Benchmarking | Code Available | 0 |
| HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding | Oct 9, 2024 | Benchmarking, Instruction Following | Unverified | 0 |
| Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Oct 9, 2024 | Benchmarking | Code Available | 2 |
| M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes | Oct 9, 2024 | Benchmarking, Motion Generation | Unverified | 0 |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Oct 9, 2024 | Benchmarking, Decision Making | Code Available | 3 |
| Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective | Oct 8, 2024 | Attribute, Benchmarking | Code Available | 1 |
| QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers | Oct 8, 2024 | Benchmarking | Code Available | 0 |
| FedGraph: A Research Library and Benchmark for Federated Graph Learning | Oct 8, 2024 | Benchmarking, Federated Learning | Code Available | 2 |
| Active Evaluation Acquisition for Efficient LLM Benchmarking | Oct 8, 2024 | Benchmarking | Unverified | 0 |
| Manual Verbalizer Enrichment for Few-Shot Text Classification | Oct 8, 2024 | Benchmarking, Classification | Unverified | 0 |
| Benchmarking of a new data splitting method on volcanic eruption data | Oct 8, 2024 | Benchmarking | Unverified | 0 |
| Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems | Oct 7, 2024 | Benchmarking, Machine Translation | Unverified | 0 |
| Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild | Oct 7, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| Rule-based Data Selection for Large Language Models | Oct 7, 2024 | Benchmarking, Math | Unverified | 0 |
| Precise Model Benchmarking with Only a Few Observations | Oct 7, 2024 | Benchmarking, model | Unverified | 0 |
| MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Oct 7, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 |
| Named Clinical Entity Recognition Benchmark | Oct 7, 2024 | Benchmarking, Decoder | Code Available | 0 |
| TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models | Oct 7, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Oct 6, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | Oct 6, 2024 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| dattri: A Library for Efficient Data Attribution | Oct 6, 2024 | Benchmarking | Code Available | 2 |
| Adjusting Pretrained Backbones for Performativity | Oct 6, 2024 | Benchmarking, Deep Learning | Code Available | 0 |
| Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | Oct 5, 2024 | Benchmarking, Chart Understanding | Unverified | 0 |
| PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms | Oct 5, 2024 | Benchmarking, GPU | Unverified | 0 |
| Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning | Oct 5, 2024 | Benchmarking, Drug Design | Code Available | 1 |
| Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | Oct 5, 2024 | Benchmarking | Unverified | 0 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Oct 5, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension | Oct 4, 2024 | Benchmarking, Computational chemistry | Unverified | 0 |
| ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities | Oct 4, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | Oct 4, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Benchmarking the Fidelity and Utility of Synthetic Relational Data | Oct 4, 2024 | Benchmarking, Feature Importance | Unverified | 0 |
| PersoBench: Benchmarking Personalized Response Generation in Large Language Models | Oct 4, 2024 | Benchmarking, Dialogue Generation | Code Available | 0 |
| Ward: Provable RAG Dataset Inference via LLM Watermarks | Oct 4, 2024 | Benchmarking, RAG | Unverified | 0 |