| Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | Jun 19, 2024 | Benchmarking, Machine Reading Comprehension | Unverified | 0 |
| M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Jun 19, 2024 | Benchmarking, Spatio-Temporal Forecasting | Code Available | 0 |
| Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | Jun 18, 2024 | Benchmarking | Unverified | 0 |
| Exploring and Benchmarking the Planning Capabilities of Large Language Models | Jun 18, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | Jun 18, 2024 | Articles, Benchmarking | Unverified | 0 |
| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Jun 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Jun 18, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| The Liouville Generator for Producing Integrable Expressions | Jun 17, 2024 | Benchmarking | Unverified | 0 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Jun 17, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | Jun 17, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | Jun 17, 2024 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Benchmarking of LLM Detection: Comparing Two Competing Approaches | Jun 17, 2024 | Benchmarking | Unverified | 0 |
| Standardizing Structural Causal Models | Jun 17, 2024 | Benchmarking, Causal Inference | Code Available | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Jun 17, 2024 | All, Benchmarking | Code Available | 0 |
| A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | Jun 17, 2024 | Benchmarking, Survey | Unverified | 0 |
| RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Jun 17, 2024 | Benchmarking, General Knowledge | Code Available | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |
| Evaluating the Performance of Large Language Models via Debates | Jun 16, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | Jun 16, 2024 | Benchmarking, Object Recognition | Unverified | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Jun 16, 2024 | Benchmarking, Instance Segmentation | Code Available | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Jun 16, 2024 | Benchmarking | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |