| Multicalibration for Confidence Scoring in LLMs | Apr 6, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics | Apr 6, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| SDFR: Synthetic Data for Face Recognition Competition | Apr 6, 2024 | Benchmarking, Face Recognition | Unverified | 0 |
| Enhancing Video Summarization with Context Awareness | Apr 6, 2024 | Benchmarking, Informativeness | Code Available | 0 |
| GNNBENCH: Fair and Productive Benchmarking for Single-GPU GNN System | Apr 5, 2024 | Benchmarking, GPU | Unverified | 0 |
| Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) | Apr 5, 2024 | Benchmarking | Code Available | 0 |
| Dynamic Risk Assessment Methodology with an LDM-based System for Parking Scenarios | Apr 5, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Apr 5, 2024 | Attribute, Benchmarking | Code Available | 0 |
| Benchmarking ChatGPT on Algorithmic Reasoning | Apr 4, 2024 | Benchmarking | Code Available | 0 |
| Schroedinger's Threshold: When the AUC doesn't predict Accuracy | Apr 4, 2024 | Benchmarking | Code Available | 0 |