| Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Jun 17, 2024 | Benchmarking | CodeCode Available | 2 |
| The Liouville Generator for Producing Integrable Expressions | Jun 17, 2024 | Benchmarking | —Unverified | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | BenchmarkingDataset Generation | CodeCode Available | 0 |
| RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Jun 17, 2024 | BenchmarkingGeneral Knowledge | CodeCode Available | 0 |
| Standardizing Structural Causal Models | Jun 17, 2024 | BenchmarkingCausal Inference | CodeCode Available | 0 |
| Benchmarking of LLM Detection: Comparing Two Competing Approaches | Jun 17, 2024 | Benchmarking | —Unverified | 0 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | BenchmarkingFact Checking | CodeCode Available | 1 |
| Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | Jun 16, 2024 | BenchmarkingObject Recognition | —Unverified | 0 |
| RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Jun 16, 2024 | Benchmarking | CodeCode Available | 0 |
| NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | Jun 16, 2024 | Benchmarkingde novo peptide sequencing | —Unverified | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action UnderstandingBenchmarking | —Unverified | 0 |
| Evaluating the Performance of Large Language Models via Debates | Jun 16, 2024 | Benchmarking | —Unverified | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | BenchmarkingMath | —Unverified | 0 |
| RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Jun 16, 2024 | Adversarial AttackBenchmarking | CodeCode Available | 2 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | BenchmarkingSpatial Reasoning | —Unverified | 0 |
| GANmut: Generating and Modifying Facial Expressions | Jun 16, 2024 | BenchmarkingDiversity | —Unverified | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Jun 16, 2024 | BenchmarkingInstance Segmentation | CodeCode Available | 0 |
| Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models | Jun 15, 2024 | BenchmarkingData Augmentation | CodeCode Available | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | BenchmarkingHumanEval | —Unverified | 0 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | BenchmarkingGPU | CodeCode Available | 1 |
| Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework | Jun 14, 2024 | Benchmarking | —Unverified | 0 |
| SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading | Jun 14, 2024 | BenchmarkingMathematical Proofs | CodeCode Available | 0 |
| ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures | Jun 14, 2024 | Answer GenerationBenchmarking | CodeCode Available | 0 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly DetectionBenchmarking | CodeCode Available | 1 |
| Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | Jun 14, 2024 | BenchmarkingGeneral Knowledge | —Unverified | 0 |