| FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | Oct 25, 2024 | Benchmarking, Fairness | Unverified | 0 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Oct 24, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| Conditional diffusions for amortized neural posterior estimation | Oct 24, 2024 | Bayesian Inference, Benchmarking | Code Available | 0 |
| Benchmarking Graph Learning for Drug-Drug Interaction Prediction | Oct 24, 2024 | Benchmarking, Graph Learning | Unverified | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | Oct 24, 2024 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances | Oct 24, 2024 | Benchmarking, Image to Video Generation | Code Available | 3 |
| Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Oct 24, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Oct 23, 2024 | Articles, Benchmarking | Code Available | 0 |
| Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | Oct 23, 2024 | Benchmarking | Unverified | 0 |