The data may not contain the answer

August 01, 2018 by Stefan Summesberger

Keynote Speaker Allan Hanbury is Professor for Data Intelligence at the TU Wien, Austria, and Faculty Member of the Complexity Science Hub. He is the initiator of the Austrian ICT Lighthouse Project Data Market Austria, which is creating a Data-Services Ecosystem in Austria. In this interview we talk about data knowledge and its implementation in joint projects between academia and industry.

Since the beginning of the year you have been in charge of “Data Intelligence” in a newly founded professorship at TU Wien. The word “intelligence” creates the impression that this is about the exploration and discovery of hidden knowledge within a horizon-less data lake. Could you elaborate on the quality of the findings that data scientists are able to distill from unstructured data?

The quality of the results depends on the application. For some applications it is possible to get very detailed results, whereas for others only general trends are possible. As an example of the former, I can present some work that we did in the medical domain. We took all available publications about results of clinical trials and extracted the disease (e.g. “influenza”) and intervention (e.g. “aspirin”) from the publication titles, as well as the positive or negative sentiment of the conclusion section as an indication of whether the intervention works for the disease or not. Our commercial partner put a graphical interface on top of this, giving comprehensive visualisations of how well interventions for a disease work. The tool is useful at the individual disease-intervention granularity, as it rapidly summarises information for a disease that would take a person a few hours or days to put together. Even though it makes errors in the extraction, it is completely transparent, as users can drill down to see the publications grouped behind each result.
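The grouping and drill-down described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual pipeline: the records are invented, the extraction step itself is skipped, and the names `summarise` and `positive_share` are hypothetical.

```python
from collections import defaultdict

# Invented example records, standing in for the output of the extraction step:
# (publication id, disease, intervention, sentiment of the conclusion section)
records = [
    ("pub-1", "influenza", "aspirin", "positive"),
    ("pub-2", "influenza", "aspirin", "negative"),
    ("pub-3", "influenza", "aspirin", "positive"),
    ("pub-4", "influenza", "intervention-B", "positive"),
]


def summarise(records):
    """Group publication ids by (disease, intervention) and by sentiment."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for pub_id, disease, intervention, sentiment in records:
        summary[(disease, intervention)][sentiment].append(pub_id)
    return summary


def positive_share(entry):
    """Fraction of grouped publications with a positive conclusion."""
    total = len(entry["positive"]) + len(entry["negative"])
    return len(entry["positive"]) / total if total else None


summary = summarise(records)
for (disease, intervention), entry in sorted(summary.items()):
    # The publication ids are kept, so a user can drill down to the sources
    # and spot extraction errors behind any aggregated figure.
    print(disease, intervention, positive_share(entry), entry)
```

Keeping the publication ids alongside the counts is what makes the summary transparent: every aggregated number can be traced back to the documents it was built from.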

In work that we have done in the financial domain, we predicted how financial indicators evolve based on sentiment analysis of sections of bank or company annual reports. There we found that the predictions made for individual companies or banks were rather noisy. Reliable results were only obtained by averaging over tens to hundreds of organisations to give an overview of a whole sector – still useful information for some applications, but at a coarser granularity.
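Why averaging helps can be shown with a small simulation. It is a sketch under a strong assumption that is not taken from the interview: each organisation's prediction error is modelled as independent Gaussian noise around a true value of zero. The function name and parameters are hypothetical.

```python
import random
import statistics


def sector_average_error(n_orgs, noise_sd=1.0, trials=2000, seed=0):
    """Spread (standard deviation) of the sector-level average prediction
    when each of n_orgs predictions carries independent noise."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        # Assumption: true value is 0, noise is independent per organisation.
        predictions = [rng.gauss(0.0, noise_sd) for _ in range(n_orgs)]
        averages.append(statistics.fmean(predictions))
    return statistics.pstdev(averages)


for n in (1, 10, 100):
    print(n, round(sector_average_error(n), 3))
```

Under this assumption the spread shrinks roughly with the square root of the number of organisations. Errors of organisations in the same sector are likely to be correlated in practice, in which case the gain from averaging would be smaller.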

What is the potential of “Data Intelligence” and what are its limitations?

The limitations of data intelligence were already well summarised by John Tukey in 1986: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

This remains valid, as Data Intelligence faces various challenges. The first lies in the data itself: Are there inherent biases due to the way the data was collected? Is the data complete, or is a crucial part missing because of a conscious or unconscious decision to exclude data, an inability to access it (e.g. due to privacy regulations), or unawareness that it exists? Then there are the techniques that are applied to the data. Finding all correlations in the data can be done, but not all correlations are useful – some great examples of spurious correlations are on http://tylervigen.com. Data generally comes from specific areas of application or domains where large bodies of knowledge already exist. The challenge is to extract information that is useful for domain experts in those areas: it should not simply confirm existing background knowledge, and any new insights should be well justified.

The potential of Data Intelligence is to overcome these limitations for a large number of application areas, bringing together domain experts and data scientists to produce new and useful insights. However, simply solving problems in individual application areas is not sufficient – it is also necessary to generalise by creating a well-grounded methodology for applying Data Science to a wide range of application areas. This will place the current, more heuristic approaches on a theoretical foundation, improving the efficiency and justifiability of Data Science approaches.
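The point about spurious correlations can be made concrete with a short experiment. The sketch below generates series of pure random noise, which are unrelated by construction, and searches all pairs for the strongest correlation. The series count, series length and function names are arbitrary choices made for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_series=50, length=10, seed=0):
    """Largest absolute correlation among pairs of unrelated noise series."""
    rng = random.Random(seed)
    series = [[rng.gauss(0.0, 1.0) for _ in range(length)]
              for _ in range(n_series)]
    return max(abs(pearson(a, b))
               for a, b in itertools.combinations(series, 2))


print(strongest_chance_correlation())
```

With 50 short series there are 1,225 pairs to compare, so some pair will correlate strongly by chance alone. A correlation found by exhaustive search therefore needs justification from domain knowledge before it counts as an insight.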

The ideal data scientist combines the skills of a data engineer with an understanding of the specific domain he or she works in. Where can we find experts with these multidisciplinary skills?

Put simply, there are two sources: either take domain experts and train them in data engineering and data science skills, or train data engineers and data scientists with the skills to adapt to new domains. We are following both paths at the TU Wien. We have a new Data Science master's course starting in October 2018, in which we educate computer science and mathematics graduates to be adaptable to new domains, and will have the students do their master's thesis work in a wide range of domains. The TU Wien is also running a Data Science Innovation course, in which domain experts from industry are introduced to data science and machine learning techniques.

You are doing basic research on the one hand and working in close practical cooperation with corporates like Deutsche Telekom AG or T-Mobile Austria on the other. Please give us an idea of your common research agenda.

We are working on increasing the synergy between corporates and academia. Academics working in Data Science are often at a disadvantage due to their lack of access to the huge amounts of data that their colleagues in industry have. Corporates, in turn, may be focussed on specific problems and not see the “big picture.” By collaborating with a number of corporates, we are able to provide independent input for solving their challenges, which often give rise to interesting research questions themselves. At the same time, through working on a wide variety of challenges, we collect the information and knowledge necessary to carry out the generalisation and theoretical framework development mentioned before.

Your talk at SEMANTiCS 2018 is about search tasks in the domains “Intellectual Property” and “Medicine”. While the pharmaceutical sector has a highly structured knowledge base, the law sector is full of “blurry” data kept in unstructured documents. What is the common baseline here? Is semantics the missing link?

Search tasks are a good example of what I have discussed before. Even though the basic indexing and search algorithms are always the same, how they are adapted and applied depends on the application area. Even when the same collection of documents is being searched, the requirements can be different. In the Intellectual Property domain, legal professionals need search tools that give repeatable results for cases in which they have to demonstrate due diligence in their search approach, while start-ups need search tools that will rapidly give them feedback on whether their idea is viable without having to learn a formal query language. In the medical domain, doctors and patients search for information and have specific requirements. However, the assumption that patients need information in a language that is easier to comprehend is not always correct – patients with chronic conditions and their families often have the ability to read technically detailed documents about the condition.

Search is too complex to be able to say that a single technique is the missing link. We have used semantics in some applications, but not in others. What would be useful is a theoretically well-justified methodology for understanding under which conditions semantics is useful and under which it is not.

Discuss data intelligence as well as lexical and statistical semantics in professional search with Allan at SEMANTiCS 2018. Register now!

About SEMANTiCS

The annual SEMANTiCS conference is the meeting place for professionals who make semantic computing work, understand its benefits and know its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers and researchers from organisations ranging from NPOs, universities and public administrations to the largest companies in the world. http://www.semantics.cc