Reliability and Robustness of Transformers for Automated Short-Answer Grading
Anonymous
Abstract
Short-Answer Grading (SAG) is an application of NLP in education where student answers to open questions are graded. This task places high demands both on the reliability (accuracy and fairness) of label predictions and on model robustness against strategic, "adversarial" input. Neural approaches are powerful tools for many problems in NLP, and transfer learning for Transformer-based models specifically promises to support data-poor tasks such as this. We analyse the performance of a Transformer-based SOTA model, zooming in on class- and item-type-specific behavior in order to gauge reliability; we use adversarial testing to analyze the model's robustness to strategic answers. We find a strong dependence on the specifics of training and test data, and recommend that model performance be verified for each individual use case.