| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Jul 1, 2024 | Benchmarking, Classification | Code Available | 1 |
| Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Jun 27, 2024 | Benchmarking | Code Available | 1 |
| Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Jun 25, 2024 | Benchmarking, Prompt Learning | Code Available | 1 |
| SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Jun 25, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Jun 25, 2024 | Benchmarking | Code Available | 1 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Jun 24, 2024 | Benchmarking, Data Augmentation | Code Available | 1 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Jun 24, 2024 | Benchmarking, Drug Design | Code Available | 1 |
| Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Jun 24, 2024 | Benchmarking, RAG | Code Available | 1 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Jun 24, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Jun 21, 2024 | Benchmarking | Code Available | 1 |
| A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Jun 20, 2024 | Benchmarking, Kolmogorov-Arnold Networks | Code Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| BeHonest: Benchmarking Honesty in Large Language Models | Jun 19, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Jun 17, 2024 | Benchmarking, Demand Forecasting | Code Available | 1 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking, Fact Checking | Code Available | 1 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | Benchmarking, GPU | Code Available | 1 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Jun 14, 2024 | Benchmarking | Code Available | 1 |
| LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Jun 14, 2024 | Benchmarking, Decision Making | Code Available | 1 |
| SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Jun 13, 2024 | Benchmarking, Image Super-Resolution | Code Available | 1 |
| SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Jun 13, 2024 | Benchmarking | Code Available | 1 |