ABSTRACT
Automating Information Extraction (IE) from handwritten documents makes the process more convenient and less time-consuming and labor-intensive. In this work, we propose an end-to-end encoder-decoder model that combines the advantages of transformers and Graph Convolutional Networks (GCN) to jointly perform Handwritten Text Recognition (HTR) and Named Entity Recognition (NER). The proposed architecture consists of two main parts: a Sparse Graph Transformer Encoder (SGTE), which captures efficient representations of the input text images while controlling information propagation through the model, followed by a Cross GCN-based decoder that combines the outputs of the last SGTE layer and the Multi-Head Attention (MHA) block to reinforce the alignment of visual features with characters and Named Entity (NE) tags. The proposed model shows promising results, achieving state-of-the-art performance on the ICDAR 2017 Information Extraction competition using the Esposalles database and on the IAM dataset.