| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Jun 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Jun 18, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| The Liouville Generator for Producing Integrable Expressions | Jun 17, 2024 | Benchmarking | —Unverified | 0 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Jun 17, 2024 | Benchmarking, counterfactual | —Unverified | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | Jun 17, 2024 | Benchmarking, Contrastive Learning | —Unverified | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | Jun 17, 2024 | Autonomous Vehicles, Benchmarking | —Unverified | 0 |
| Benchmarking of LLM Detection: Comparing Two Competing Approaches | Jun 17, 2024 | Benchmarking | —Unverified | 0 |
| Standardizing Structural Causal Models | Jun 17, 2024 | Benchmarking, Causal Inference | Code Available | 0 |