Fine-grained Named Entity Recognition in Legal Documents

This paper describes an approach to Named Entity Recognition (NER) in German-language documents from the legal domain. For this purpose, a dataset consisting of German court decisions was developed. The source texts were manually annotated with the following 19 semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. Overall, the dataset consists of approximately 67,000 sentences and contains around 54,000 annotated entities. The 19 fine-grained classes were automatically generalised to seven coarse-grained semantic classes: people, location, organization, legal norm, case-by-case regulation, court decision, and legal literature. Thus, the dataset includes two variants of the annotation, coarse-grained and fine-grained. For the task of NER, Conditional Random Fields (CRFs) and bidirectional Long Short-Term Memory networks (BiLSTMs) were applied to the dataset as state-of-the-art models. Three different models were developed for each of these two model families and tested with the coarse- and fine-grained annotations. The BiLSTM models achieve the best performance, with an F1 score of 95.46% for the fine-grained classes and 95.95% for the coarse-grained classes. By contrast, the CRF models reach a maximum of 93.23% for the fine-grained classes and 93.22% for the coarse-grained classes. The work presented in this paper was carried out under the umbrella of the European project LYNX, which develops a semantic platform that will enable the development of various document processing and analysis applications for the legal domain.
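
The automatic generalisation from fine-grained to coarse-grained classes can be illustrated with a simple lookup table. The following is a minimal sketch, not the implementation used for the dataset. The label identifiers, the BIO tag prefixes, and the function name `to_coarse` are hypothetical, and the grouping of fine-grained classes under each coarse-grained class is an assumption inferred from the order in which the classes are listed above.

```python
# Assumed grouping of the 19 fine-grained classes into 7 coarse-grained
# classes, inferred from the listing order. Label names are hypothetical.
FINE_TO_COARSE = {
    "person": "people",
    "judge": "people",
    "lawyer": "people",
    "country": "location",
    "city": "location",
    "street": "location",
    "landscape": "location",
    "organization": "organization",
    "company": "organization",
    "institution": "organization",
    "court": "organization",
    "brand": "organization",
    "law": "legal_norm",
    "ordinance": "legal_norm",
    "european_legal_norm": "legal_norm",
    "regulation": "case_by_case_regulation",
    "contract": "case_by_case_regulation",
    "court_decision": "court_decision",
    "legal_literature": "legal_literature",
}


def to_coarse(tag: str) -> str:
    """Map one fine-grained tag to its coarse-grained counterpart.

    Accepts a bare label ("judge"), a BIO-prefixed label ("B-judge"),
    or the outside tag "O", which is returned unchanged.
    """
    if tag == "O":
        return tag
    prefix, sep, label = tag.partition("-")
    if sep and prefix in {"B", "I"}:
        return f"{prefix}-{FINE_TO_COARSE[label]}"
    return FINE_TO_COARSE[tag]


# Usage: convert a fine-grained tag sequence for one sentence.
fine_tags = ["O", "B-judge", "I-judge", "O", "B-court", "O", "B-law"]
coarse_tags = [to_coarse(t) for t in fine_tags]
```

Because the mapping is many-to-one, the coarse-grained annotation can always be derived from the fine-grained one, but not the reverse.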