| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Nov 29, 2023 | Benchmarking | Code Available | 1 |
| UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Nov 26, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| Benchmarking Robustness of Text-Image Composed Retrieval | Nov 24, 2023 | Attribute, Benchmarking | Code Available | 1 |
| IMGTB: A Framework for Machine-Generated Text Detection Benchmarking | Nov 21, 2023 | Benchmarking, Text Detection | Code Available | 1 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Nov 21, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Towards a more inductive world for drug repurposing approaches | Nov 21, 2023 | Benchmarking, Prediction | Code Available | 1 |
| LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector | Nov 20, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Nov 20, 2023 | Benchmarking, image-classification | Code Available | 1 |
| TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Nov 16, 2023 | Benchmarking, Event Extraction | Code Available | 1 |
| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 |