| RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs | Jun 27, 2024 | Diversity, Negation | Code Available | 1 |
| Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA | May 30, 2024 | Diagnostic, Medical Diagnosis | Code Available | 1 |
| Towards Safer Large Language Models through Machine Unlearning | Feb 15, 2024 | Machine Unlearning, Negation | Code Available | 1 |
| Approximate Attributions for Off-the-Shelf Siamese Transformers | Feb 5, 2024 | Negation, Sentence | Code Available | 1 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Jan 25, 2024 | Information Retrieval, Multiple-choice | Code Available | 1 |
| Expressive Sign Equivariant Networks for Spectral Geometric Learning | Dec 4, 2023 | Link Prediction, Negation | Code Available | 1 |
| Regularization by Texts for Latent Diffusion Inverse Solvers | Nov 27, 2023 | Negation | Code Available | 1 |
| Instant3D: Instant Text-to-3D Generation | Nov 14, 2023 | 3D Generation, Negation | Code Available | 1 |
| This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models | Oct 24, 2023 | Descriptive, Negation | Code Available | 1 |
| Ask Again, Then Fail: Large Language Models' Vacillations in Judgment | Oct 3, 2023 | Negation | Code Available | 1 |