| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | May 27, 2022 | 16k, 4k | Code Available | 6 |
| BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining | Oct 19, 2022 | Document Classification, Language Modelling | Code Available | 4 |
| DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | Jun 17, 2024 | Document Classification, Visual Grounding | Code Available | 3 |
| Pre-Training with Whole Word Masking for Chinese BERT | Jun 19, 2019 | Document Classification, General Classification | Code Available | 3 |
| Visually Guided Generative Text-Layout Pre-training for Document Intelligence | Mar 25, 2024 | Document Classification, document understanding | Code Available | 2 |
| LinkBERT: Pretraining Language Models with Document Links | Mar 29, 2022 | Document Classification, Language Modeling | Code Available | 2 |
| One Configuration to Rule Them All? Towards Hyperparameter Transfer in Topic Models using Multi-Objective Bayesian Optimization | Feb 15, 2022 | All, Bayesian Optimization | Code Available | 2 |
| HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning | Dec 5, 2024 | Contrastive Learning, Document Classification | Code Available | 1 |
| Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes | Oct 8, 2024 | Articles, Classification | Code Available | 1 |
| SuperGLEBer: German Language Understanding Evaluation Benchmark | Jun 20, 2024 | Document Classification, Natural Language Understanding | Code Available | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Feb 27, 2024 | Document Classification, Language Modeling | Code Available | 1 |
| Prompted Contextual Vectors for Spear-Phishing Detection | Feb 13, 2024 | Document Classification | Code Available | 1 |
| ANLS* -- A Universal Document Processing Metric for Generative Large Language Models | Feb 6, 2024 | Document Classification | Code Available | 1 |
| L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages | Jan 4, 2024 | Articles, Classification | Code Available | 1 |
| GeoGalactica: A Scientific Large Language Model in Geoscience | Dec 31, 2023 | Document Classification, General Knowledge | Code Available | 1 |
| ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models | Nov 15, 2023 | Document Classification, Question Answering | Code Available | 1 |
| Taken by Surprise: Contrast effect for Similarity Scores | Aug 18, 2023 | Classification, Document Classification | Code Available | 1 |
| Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training | Jun 12, 2023 | Document Classification, Retrieval | Code Available | 1 |
| Benchmarking large language models for biomedical natural language processing applications and recommendations | May 10, 2023 | Benchmarking, Document Classification | Code Available | 1 |
| HiPool: Modeling Long Documents Using Graph Neural Networks | May 5, 2023 | Document Classification, Sentence | Code Available | 1 |
| Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding | Apr 9, 2023 | Document Classification, named-entity-recognition | Code Available | 1 |
| Bioformer: an efficient transformer language model for biomedical text mining | Feb 3, 2023 | Articles, Document Classification | Code Available | 1 |
| A Comparative Study of Pretrained Language Models for Long Clinical Text | Jan 27, 2023 | Clinical Knowledge, Document Classification | Code Available | 1 |
| Multimodal Side-Tuning for Document Classification | Jan 16, 2023 | Classification, Document Classification | Code Available | 1 |
| Tsetlin Machine Embedding: Representing Words Using Logical Expressions | Jan 2, 2023 | Document Classification, Machine Translation | Code Available | 1 |
| Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics | Oct 12, 2022 | Document Classification, Retrieval | Code Available | 1 |
| ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths | Jun 12, 2022 | Chunking, Document Classification | Code Available | 1 |
| LDRNet: Enabling Real-time Document Localization on Mobile Devices | Jun 5, 2022 | Document Classification | Code Available | 1 |
| Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem | May 4, 2022 | Document Classification, Traveling Salesman Problem | Code Available | 1 |
| Revisiting Transformer-based Models for Long Document Classification | Apr 14, 2022 | Classification, Document Classification | Code Available | 1 |
| Specialized Document Embeddings for Aspect-based Similarity of Research Papers | Mar 28, 2022 | Document Classification, Recommendation Systems | Code Available | 1 |
| DocXClassifier: High Performance Explainable Deep Network for Document Image Classification | Mar 17, 2022 | Classification, Data Augmentation | Code Available | 1 |
| Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings | Feb 14, 2022 | Citation Prediction, Contrastive Learning | Code Available | 1 |
| Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences | Jan 27, 2022 | Clinical Knowledge, Document Classification | Code Available | 1 |
| Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification | Dec 13, 2021 | Classification, Document Classification | Code Available | 1 |
| MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer | Nov 1, 2021 | Cross-Lingual Transfer, Document Classification | Code Available | 1 |
| Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution | Sep 10, 2021 | Document Classification, Multi-Label Text Classification | Code Available | 1 |
| MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer | Sep 2, 2021 | Cross-Lingual Transfer, Document Classification | Code Available | 1 |
| PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors | Aug 2, 2021 | Document Classification, Specificity | Code Available | 1 |
| Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Jul 9, 2021 | Benchmarking, Document Classification | Code Available | 1 |
| TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration | Jun 24, 2021 | Active Learning, Document Classification | Code Available | 1 |
| A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data | Jun 12, 2021 | Classification, Document Classification | Code Available | 1 |
| SciFive: a text-to-text transformer model for biomedical literature | May 28, 2021 | Document Classification, Drug–drug Interaction Extraction | Code Available | 1 |
| Three-level Hierarchical Transformer Networks for Long-sequence and Multiple Clinical Documents Classification | Apr 17, 2021 | Document Classification, General Classification | Code Available | 1 |
| Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media | Apr 16, 2021 | Document Classification, Domain Adaptation | Code Available | 1 |
| Multilingual and cross-lingual document classification: A meta-learning approach | Jan 27, 2021 | Cross-Lingual Document Classification, Document Classification | Code Available | 1 |
| Can a Fruit Fly Learn Word Embeddings? | Jan 18, 2021 | Document Classification, Word Embeddings | Code Available | 1 |
| BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla | Jan 1, 2021 | Document Classification, Language Modeling | Code Available | 1 |
| Hierarchical Metadata-Aware Document Categorization under Weak Supervision | Oct 26, 2020 | Data Augmentation, Document Classification | Code Available | 1 |
| German's Next Language Model | Oct 21, 2020 | Benchmarking, Document Classification | Code Available | 1 |