| Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark | Apr 21, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators | Apr 21, 2025 | Code Generation, Instruction Following | Code Available | 0 |
| Kuwain 1.5B: An Arabic SLM via Language Injection | Apr 21, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models | Apr 21, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Automated Duplicate Bug Report Detection in Large Open Bug Repositories | Apr 21, 2025 | Large Language Model | Unverified | 0 |
| Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval | Apr 20, 2025 | Large Language Model, Retrieval | Unverified | 0 |
| ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Causal Disentanglement for Robust Long-tail Medical Image Generation | Apr 20, 2025 | Counterfactual, Disentanglement | Unverified | 0 |
| PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Bottom-Up Synthesis of Knowledge-Grounded Task-Oriented Dialogues with Iteratively Self-Refined Prompts | Apr 19, 2025 | Conversational Question Answering, Language Modeling | Unverified | 0 |