"Patent documents are a valuable source of technological and legal information, with a high impact on innovation-driven business and on research. With the availability of customized patent databases and the emergence of millions of patent applications each year, the potential of data mining and machine learning techniques to extract added value from such a huge amount of data has become very high. However, patent documents have special characteristics, such as significantly different lexicons and different types of structured and unstructured data. In addition, recall-based patent searches require the analysis to be conducted on a large number of patents while still providing very specific and precise results.
These facts pose special challenges for our data mining solutions and imply a high demand for powerful computing resources that can handle data-intensive workloads in a time-efficient manner.
In this talk, the concept and the realization of our “Generic Scalable Service Runner” will be presented. Based on a generic approach, it provides a simple interface that allows text annotators and analytics services to run in a massively parallel manner by processing documents stored in a distributed database on a Hadoop cluster. Different data mining use cases have been deployed and tested on our Hadoop cluster, so that each tool could run hundreds of times faster than a single process. Following this procedure is also much more productive and straightforward than building a separate scalable solution for each individual application.
The generic service abstracts away the implementation details of Hadoop jobs and eliminates the need to deal with the cluster configuration. In this sense, it provides a solution that could be compatible with any Hadoop environment, including on-premises, cloud-based, or even hybrid environments.
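To make the idea of a single generic interface more concrete, the following is a minimal sketch only, not the actual implementation of the Generic Scalable Service Runner. It assumes that a service is a plain function from one document to one annotated document, and it uses a local thread pool as a stand-in for the Hadoop cluster. The names `run_service`, `count_tokens`, `Document`, and `Service` are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# Hypothetical types: a document is a dict of fields, and a service
# maps one document to one annotated document.
Document = Dict[str, object]
Service = Callable[[Document], Document]


def run_service(service: Service, documents: Iterable[Document],
                workers: int = 4) -> List[Document]:
    """Apply `service` to every document in parallel, preserving input order.

    A local thread pool stands in for the cluster here; the caller only
    supplies the service and the documents, never the execution details.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(service, documents))


def count_tokens(doc: Document) -> Document:
    """Toy annotator: adds a whitespace-based token count to the document."""
    text = str(doc.get("text", ""))
    return {**doc, "token_count": len(text.split())}


# Illustrative documents, not real patent data.
docs = [
    {"id": "p1", "text": "a method for data mining"},
    {"id": "p2", "text": "scalable patent search"},
]
annotated = run_service(count_tokens, docs)
```

The point of the sketch is the separation of concerns: the annotator knows nothing about parallelism, so in principle the same function could be handed to a different execution backend without modification.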
From another perspective, this work is only a first step toward building a generic framework that enables the seamless integration of several consecutive natural language processing and machine learning steps in one pipeline. By exploiting appropriate semantic content representations, such a framework could be designed with unified and standardized formats at the data level, along with optimizations and abstractions at the level of the execution engine.
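As a rough illustration of what chaining consecutive steps over a unified data format might look like, here is a small sketch under stated assumptions. It assumes that every step reads and returns the same record structure (a plain dict), so steps can be composed freely. The names `make_pipeline`, `tokenize`, `flag_claim_terms`, `Record`, and `Step` are hypothetical, and the sketch does not describe the envisioned framework itself.

```python
from functools import reduce
from typing import Callable, Dict, List

# Hypothetical unified format: every step consumes and produces a Record.
Record = Dict[str, object]
Step = Callable[[Record], Record]


def make_pipeline(steps: List[Step]) -> Step:
    """Compose steps into one function that applies them in list order."""
    def pipeline(record: Record) -> Record:
        return reduce(lambda rec, step: step(rec), steps, record)
    return pipeline


def tokenize(record: Record) -> Record:
    """Toy NLP step: lowercase the text and split it on whitespace."""
    return {**record, "tokens": str(record["text"]).lower().split()}


def flag_claim_terms(record: Record) -> Record:
    """Toy downstream step: relies on the `tokens` field added by `tokenize`."""
    tokens = record["tokens"]
    return {**record, "has_claim_term": any(t in {"claim", "claims"} for t in tokens)}


pipeline = make_pipeline([tokenize, flag_claim_terms])
# Illustrative input, not real patent data.
result = pipeline({"id": "p1", "text": "The claims describe a sensor"})
```

Because each step only adds fields to the shared record, later steps can depend on the output of earlier ones without any step-specific glue code.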
"