From Legacy to Linked Data: Rebuilding a STEM information service

Industry

With over 16 million records dating back to 1898, the IET’s Inspec is the world’s leading abstracting and indexing database for research in engineering, physics and computer science. The Inspec production system has historically been based on a traditional relational database system. With the advent of the semantic web, Inspec decided to switch to a semantic graph database in order to better handle the relationships within increasingly large and complex data. This graph currently includes Inspec content and subjects, and we aim to expand its scope to organisations, authors and locations. Converting the traditional data into linked data has opened up many new ways to access and discover academic resources and to display trends in physics, engineering and computing research. In this talk we show how using domain models, ontologies and semantic enrichment improves data discoverability, navigation and visualisation.

Every year, almost 1 million research papers, conference papers, videos, books and other publication types are added to the Inspec database. Each of these records contains not only information provided by its publisher, such as the authors of a paper and their affiliations, but also data enriched by our indexing service. This data includes controlled terms from a thesaurus, classification codes, free text, and numerical, chemical and astronomical annotations. This makes scientific articles more discoverable by researchers, who are able to find high-quality, relevant content.
For years, all this data was stored in a legacy relational database, where it was accessed through various Web 2.0 domains offering simple text querying. This process, while practical, was not intuitive and did not take full advantage of the extensive annotations applied to each paper or of the links between these annotations. One of the benefits of having such a richly detailed, highly structured corpus of data is that both large-scale trends and micro-trends can be determined, such as how new scientific concepts have grown over time and in specific locations, and how they have branched off into sub-concepts. This enables a far deeper understanding of the trends in scientific research, along with the key opinion leaders and institutions focusing on those areas.
Over the last few years, Inspec has gradually moved towards semantics, integrating our traditional relational database into an RDF paradigm. We show the evolution from taxonomies to domain models that embed all the different entities and the relationships between them, the development of ontologies based on these models, and their population with reference data, effectively creating a large-scale semantic network of agents, content and concepts. This is an iterative, agile process in which the domain model is designed according to the questions we want the data to answer.
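As a rough illustration of what moving a relational record into an RDF-style model can look like, the sketch below turns one flat record into subject-predicate-object triples and then answers a simple question by following a link. It is a minimal sketch under stated assumptions, not the Inspec pipeline: the `example.org` namespace, the record fields, the sample rows and the helper names are all hypothetical, and only the Dublin Core and RDF predicate URIs are real vocabulary terms.

```python
# Hypothetical namespace and record layout, for illustration only.
BASE = "http://example.org/inspec/"
# Real vocabulary URIs (Dublin Core terms and rdf:type).
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def slug(label):
    """Make a simple URI-friendly identifier from a label."""
    return label.lower().replace(" ", "-")


def row_to_triples(row):
    """Turn one flat record (a dict) into (subject, predicate, object) triples."""
    record = BASE + "record/" + row["id"]
    triples = [
        (record, RDF_TYPE, BASE + "ontology/Record"),
        (record, DCT + "title", row["title"]),
    ]
    # Each controlled term becomes a link to a concept node, not a string.
    for term in row["controlled_terms"]:
        triples.append((record, DCT + "subject", BASE + "concept/" + slug(term)))
    # Each author becomes a link to an agent node.
    for author in row["authors"]:
        triples.append((record, DCT + "creator", BASE + "agent/" + slug(author)))
    return triples


def records_about(triples, concept_uri):
    """Follow dcterms:subject links backwards to find records on a concept."""
    return sorted(
        s for s, p, o in triples if p == DCT + "subject" and o == concept_uri
    )


# Made-up sample rows standing in for relational records.
rows = [
    {"id": "r1", "title": "Paper A", "authors": ["A Author"],
     "controlled_terms": ["graph theory", "semantic web"]},
    {"id": "r2", "title": "Paper B", "authors": ["B Author"],
     "controlled_terms": ["semantic web"]},
]

graph = [t for row in rows for t in row_to_triples(row)]
semantic_web = BASE + "concept/semantic-web"
matches = records_about(graph, semantic_web)
```

Because both sample records point at the same concept node, `matches` contains both record URIs. In practice a triple store and a query language such as SPARQL would replace the list comprehension used here.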
We also discuss the process of migrating Inspec content to a semantic database and the problems it raises, such as data deduplication and disambiguation. The migration allows us to enhance our content with information on author affiliations, event locations and research project funding, whilst guaranteeing data quality.
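To give a flavour of the deduplication problem, the sketch below groups spelling variants of an affiliation name under a crude normalised key. It is a deliberately simple illustration and not the method used at Inspec: the function names and sample strings are hypothetical, and real disambiguation has to handle abbreviations, translations and genuinely ambiguous names, which exact key matching cannot.

```python
import re
from collections import defaultdict


def normalise(name):
    """Build a crude matching key: lower-case, drop punctuation, collapse spaces."""
    key = re.sub(r"[^\w\s]", " ", name.casefold())
    return " ".join(key.split())


def group_variants(names):
    """Group raw strings that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)


# Made-up affiliation strings, for illustration only.
affiliations = [
    "University of Example",
    "university of example.",
    "Univ. of Example",
    "Example Institute",
]

groups = group_variants(affiliations)
```

The first two strings collapse into one group, while "Univ. of Example" stays separate. That gap is exactly where reference data and richer matching rules are needed.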
We show how this allows us to improve not only our own production processes, by making semantic enrichment possible, but also the services we provide, greatly expanding the functionality available to users searching our database through smart graph navigation and data visualisation tools.

Speakers: