| RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | Sep 23, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models | Sep 20, 2024 | Benchmarking, Image Captioning | Code Available | 1 |
| MetaFormer and CNN Hybrid Model for Polyp Image Segmentation | Sep 16, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| ODAQ: Open Dataset of Audio Quality - Benchmark on GitHub | Sep 13, 2024 | Audio Quality Assessment, Benchmarking | Code Available | 1 |
| Insights from Benchmarking Frontier Language Models on Web App Code Generation | Sep 8, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| RTLRewriter: Methodologies for Large Models aided RTL Code Optimization | Sep 4, 2024 | Benchmarking | Code Available | 1 |
| LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs | Sep 3, 2024 | 16k, Benchmarking | Code Available | 1 |
| Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Sep 2, 2024 | Action Detection, Benchmarking | Code Available | 1 |
| How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Aug 29, 2024 | Benchmarking, General Knowledge | Code Available | 1 |
| STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models | Aug 29, 2024 | Benchmarking, Image Generation | Code Available | 1 |