Named entity recognition and disambiguation using an iterative graph processing system

Industry

ReportLinker.com is an award-winning market research solution that finds, filters and organizes the latest industry data from both reputable private publishers and trusted public organizations. Launched in 2007, ReportLinker provides access to more than 3 million public reports with data on 350 industries and 3,000 sub-industries around the world.

Our natural language processing platform makes it possible to automatically analyse millions of unstructured documents per day (PDF, DOC, PPT, …), producing new structured information:
- relevance of the document
- automatic summary
- relevant themes in the report (chapters, tables, figures)
- relevant industries, countries, topics

We use a lexical data platform (LDP, including 3.5 million hierarchical vocabulary entries) to detect important concepts and categories. We index each document and theme with its metadata to make it easily searchable using faceted navigation.
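As a rough illustration, concept detection against such a controlled vocabulary can be thought of as a dictionary lookup that returns the hierarchical category of every known term found in the text. The sketch below is only that: the vocabulary entries and function are invented for the example, and the real LDP is far larger and richer.

    # Minimal sketch of dictionary-based concept detection against a hierarchical
    # controlled vocabulary (the entries and structure here are hypothetical).
    VOCABULARY = {
        "solar panel": ["Energy", "Renewable Energy", "Solar"],
        "lithium-ion battery": ["Energy", "Energy Storage", "Batteries"],
    }

    def detect_categories(text):
        """Return the hierarchical category path of every known concept found in the text."""
        lowered = text.lower()
        return [path for term, path in VOCABULARY.items() if term in lowered]

    print(detect_categories("Demand in the solar panel market grew strongly."))
    # -> [['Energy', 'Renewable Energy', 'Solar']]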
This combination of the LDP and NLP is very powerful for classifying an unstructured document into a controlled hierarchy of categories. However, it quickly reaches its limits when we need to find and analyse non-controlled vocabulary, which is the case for concepts like company names, product names and influential people.

All entities playing a role in the world economy are important information points for an analyst. The most important is the “Company” dimension: how can we find, normalize and classify the company names mentioned in our reports?

We started with a very simple approach: we bought a database with several million company names and used it as a controlled vocabulary to find companies. This approach did not work at all. We had forgotten an important rule about named entities: the same surface form very often designates several different concepts, so it is impossible to disambiguate them without additional data to contextualize the company.

In our new approach, we decided to create our own database of company names with additional context for disambiguation. The key idea was to use our own content (110 million documents) to discover and identify company names, people and products. Using text mining technology and an inference engine, we exploit the proximity between concepts in the documents to build a relational graph, where each node is an identified concept and each edge a relation between two nodes.
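To make the idea concrete, the sketch below (hypothetical and heavily simplified) shows how repeated co-occurrences could be accumulated into such a graph. Note that two companies sharing a name become two distinct nodes, because their neighbourhoods differ.

    # Minimal sketch (hypothetical structures): a relational graph where each node
    # is an identified concept and each edge a weighted relation accumulated from
    # how often two concepts appear close together in documents.
    from collections import defaultdict

    nodes = {}                  # node_id -> {"name": ..., "type": ...}
    edges = defaultdict(float)  # (node_id_a, node_id_b) -> accumulated proximity weight

    def add_cooccurrence(a_id, a_name, a_type, b_id, b_name, b_type, weight=1.0):
        nodes[a_id] = {"name": a_name, "type": a_type}
        nodes[b_id] = {"name": b_name, "type": b_type}
        edges[tuple(sorted((a_id, b_id)))] += weight

    # Two companies with the same name end up as two different nodes,
    # because they are related to different people, products and industries.
    add_cooccurrence("c1", "Apex", "company", "p1", "Jane Doe", "person")
    add_cooccurrence("c2", "Apex", "company", "i1", "Plastics", "industry")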

This workflow can be divided into 3 steps:

Step 1 - For each document analysed, we extract several “hypotheses” (unverified facts) using text mining rules. We mainly have 3 types of hypotheses:
- Identification of a concept (the probability that it is a company, person, product, …)
- Relation between 2 concepts (contextual proximity between 2 concepts in the document)
- Proximity between a concept and an industry/country (contextual proximity with another dimension in the document)
All the hypotheses are stored in a NoSQL database (MongoDB). We stored 400 million hypotheses for our first prototype.
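As an illustration, the hypothesis records could look like the documents below. The schema, field names and values are hypothetical, and the snippet assumes a local MongoDB instance reachable through pymongo.

    # Minimal sketch: storing the three kinds of hypotheses produced by the
    # text mining rules in MongoDB (hypothetical schema and values).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    hypotheses = client["entity_discovery"]["hypotheses"]

    hypotheses.insert_many([
        # 1) Identification of a concept, with a type probability
        {"kind": "identification", "surface": "Acme Corp",
         "type": "company", "probability": 0.87, "doc_id": "doc-42"},
        # 2) Relation between 2 concepts found close together in the document
        {"kind": "relation", "source": "Acme Corp", "target": "Jane Doe",
         "proximity": 0.92, "doc_id": "doc-42"},
        # 3) Proximity between a concept and another dimension (industry/country)
        {"kind": "dimension_proximity", "concept": "Acme Corp",
         "dimension": "industry", "value": "Aerospace", "proximity": 0.78,
         "doc_id": "doc-42"},
    ])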

Step 2 - The inference engine loads all the hypotheses and uses an iterative graph processing system (Giraph/Hadoop) to decide which hypotheses can be consolidated into verified facts. Hypotheses with a low similarity score are discarded as noise. Verified facts are consolidated as a node or an edge in the final graph. This graph can contain several nodes with the same type and the same name (such as two companies with the same name), but their relations in the graph will be very different: they do not share the same context, so they are not the same company. From the 400 million hypotheses, the system consolidated 430,000 verified nodes (companies, people, products) and 55 million relations.
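The actual consolidation runs as an iterative Giraph/Hadoop job; the sketch below only illustrates the underlying idea on a single machine, with an invented support threshold: evidence for each candidate relation is aggregated across documents, and poorly supported hypotheses are dropped as noise.

    # Minimal single-machine sketch of the consolidation idea (not the Giraph job):
    # keep a relation as a verified edge only if enough independent documents
    # support it; everything below the threshold is treated as noise.
    from collections import defaultdict

    MIN_DOCS = 3  # hypothetical support threshold

    def consolidate(relation_hypotheses):
        """relation_hypotheses: dicts with "source", "target", "proximity", "doc_id"."""
        support = defaultdict(set)
        weight = defaultdict(float)
        for h in relation_hypotheses:
            key = (h["source"], h["target"])
            support[key].add(h["doc_id"])
            weight[key] += h["proximity"]
        return {key: {"docs": len(docs), "weight": weight[key]}
                for key, docs in support.items() if len(docs) >= MIN_DOCS}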

Final Step - We now have a new base of verified companies, their aliases and their context. We use this base as an extended LDP so that our NLP can extract companies as normalized metadata. All the company names are loaded in-memory by the NLP, but the context data (people/products/aliases linked to each company) that needs to be checked is too large to fit in the memory of a single server. We therefore use a distributed in-memory database, Redis, to load and verify thousands of concepts per second. If a company name is found in a document together with part of its context, the normalized name of the company is added as metadata.
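A rough sketch of that final lookup is shown below, using redis-py with an invented key layout and threshold: a detected name is only confirmed as a company if enough of its known context (aliases, people, products) also appears in the document.

    # Minimal sketch (hypothetical key layout; assumes a reachable Redis instance).
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def verify_company(candidate_name, document_terms):
        """Return the normalized company name if enough of its context matches, else None."""
        for company_id in r.smembers("name:" + candidate_name.lower()):
            context = r.smembers("context:" + company_id)   # people/products/aliases
            if len(context & set(document_terms)) >= 2:     # hypothetical threshold
                return r.get("normalized:" + company_id)
        return None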


Using Big Data analytics, we found a very effective approach to discover, disambiguate and normalise complex concepts. This solution works because we succeeded in resolving 3 main issues:
- Data volume
- Pattern detection to discover hypotheses
- Optimized algorithms for the inference engine

Speakers: