| BLESS: Benchmarking Large Language Models on Sentence Simplification | Oct 24, 2023 | Benchmarking, Diversity | Code Available | 0 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | Oct 23, 2023 | Benchmarking, Instruction Following | Unverified | 0 |
| DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Oct 23, 2023 | Benchmarking, Image Generation | Code Available | 0 |
| XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Oct 23, 2023 | Benchmarking, Time Series | Code Available | 0 |
| A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | Oct 22, 2023 | 3D Reconstruction, Anatomy | Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 |
| Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Oct 20, 2023 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Oct 20, 2023 | Activity Prediction, Benchmarking | Code Available | 0 |