Effectiveness of Cross-linguistic Extraction of Genetic Information using Generative Large Language Models
Milindi Kodikara, Karin Verspoor
Code: github.com/Milindi-Kodikara/RMIT-READ-BioMed
Abstract
This paper presents the RMIT University system (RMIT-READ-BioMed) developed for the GenoVarDis shared task at IberLEF 2024, which focuses on Named Entity Recognition (NER) of genes, genetic variants, and associated diseases in Spanish-language scientific literature. Our approach explores a general-purpose generative Large Language Model (LLM), GPT-3.5, for NER. We examine the impact of providing English-language instructions with the Spanish-language target text (cross-linguistic setting), compared with a within-language setting in which the instruction language matches the language of the text. We further experiment with a range of instruction strategies, including zero-shot and few-shot prompting, under these two settings. The best results were obtained with English-language instructions under the few-shot learning paradigm, yielding an F1-score of 0.5. While this approach does not match the top results achieved for the shared task, our experiments provide insight into the limitations of simple prompting of LLMs in languages other than English.
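To make the two settings concrete, the following is a minimal sketch of how zero-shot and few-shot prompts might be assembled with either English or Spanish instructions. It is an illustration under stated assumptions, not the prompts used in the paper: the instruction wording, the `Example` class, the `build_prompt` function, the output format, and the sample sentences are all hypothetical, and no model call is made.

```python
from dataclasses import dataclass

# Hypothetical instruction templates (NOT the prompts from the paper).
INSTRUCTIONS = {
    "en": "Extract all genes, genetic variants, and diseases mentioned "
          "in the text below. List them separated by semicolons.",
    "es": "Extrae todos los genes, variantes genéticas y enfermedades "
          "mencionados en el siguiente texto. Sepáralos con punto y coma.",
}
# Field labels follow the instruction language.
LABELS = {"en": ("Text", "Entities"), "es": ("Texto", "Entidades")}


@dataclass
class Example:
    """A solved demonstration used for few-shot prompting (illustrative)."""
    text: str
    entities: list[str]


def build_prompt(target_text, instruction_lang="en", examples=()):
    """Assemble a prompt for Spanish target text.

    instruction_lang="en" gives the cross-linguistic setting,
    instruction_lang="es" gives the within-language setting.
    An empty `examples` yields a zero-shot prompt; otherwise few-shot.
    """
    text_label, ent_label = LABELS[instruction_lang]
    parts = [INSTRUCTIONS[instruction_lang]]
    for ex in examples:
        parts.append(
            f"{text_label}: {ex.text}\n{ent_label}: {'; '.join(ex.entities)}"
        )
    # The target text comes last, with the answer slot left open.
    parts.append(f"{text_label}: {target_text}\n{ent_label}:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Invented demonstration and target sentences, for illustration only.
    demo = Example(
        "Las variantes en BRCA1 aumentan el riesgo de cáncer de mama.",
        ["BRCA1", "cáncer de mama"],
    )
    target = "La mutación c.35delG en el gen GJB2 se asocia con sordera."
    print(build_prompt(target, "en", [demo]))  # cross-linguistic, few-shot
    print("-" * 40)
    print(build_prompt(target, "es"))          # within-language, zero-shot
```

The resulting string would be sent to the model as the user message; how the response is parsed back into entity spans is a separate step not covered by this sketch.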