Google releases dataset to train more sophisticated question-answering systems

You’ve likely used an AI question-answering system in the not-too-distant past, and probably without realizing it. Think chatbots: You ask a question about something and get a response in return.

Some are more sophisticated than others, of course, and unfortunately for researchers, figuring out just how capable one system is relative to another is easier said than done. That’s because there’s a dearth of high-quality datasets for question answering, that is, corpora of queries asked by people seeking information, paired with the corresponding answers. But that changes today.

In a paper (“Natural Questions: a Benchmark for Question Answering Research”) and an accompanying blog post, Google revealed Natural Questions (NQ), a new, large-scale dataset for training and evaluating open-domain question-answering systems. Tom Kwiatkowski and Michael Collins, research scientists at Google AI Language, claim it’s the first to “replicate the end-to-end process” in which people find answers to questions.

“Given a question expressed in natural language (‘Why is the sky blue?’), a QA system should be able to read the web (such as this Wikipedia page) and return the correct answer, even if the answer is somewhat complicated and long,” they wrote. “Assembling a high-quality dataset for question answering requires a large source of real questions and significant human effort in finding correct answers.”

Natural Questions, which consists of over 300,000 naturally occurring queries paired with human-annotated answers from Wikipedia pages, is designed both to train question-answering systems and to evaluate them. It was created with anonymized Google Search queries, the answers to which human annotators found by reading through entire Wikipedia pages and searching for two types of responses: long answers that “cover[ed] all of the information required to infer the answer[s],” and short answers that “answer[ed] the question[s] succinctly.”
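
To make the shape of such a record concrete, here is a minimal Python sketch of how one question and its annotation might be represented. The class name `NQExample`, its field names, and the placeholder strings are assumptions made for illustration; they are not the official schema of the released dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NQExample:
    """Hypothetical container for one annotated question (not the official schema)."""
    question: str                # the anonymized search query
    wikipedia_title: str         # the page the annotator read
    long_answer: Optional[str]   # passage covering what is needed to infer the answer, or None
    short_answers: List[str]     # succinct answers; empty if the annotator marked none

    def has_answer(self) -> bool:
        # Assumption: a page with no suitable passage is stored with long_answer=None.
        return self.long_answer is not None


# Placeholder values only; these are not real dataset entries.
example = NQExample(
    question="why is the sky blue",
    wikipedia_title="(hypothetical page title)",
    long_answer="(a paragraph from the page, selected by an annotator)",
    short_answers=["(a short span from that paragraph)"],
)
print(example.has_answer())
```

Keeping the long and short answers as separate fields mirrors the two response types described above, and lets a record carry a long answer without any short one.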

The quality of the annotations has been measured at 90 percent accuracy, according to Kwiatkowski and Collins.

To coincide with the release of the dataset, Google launched a challenge that seeks to spur development of a question-answering system capable of “comprehend[ing] an entire Wikipedia article that may or may not contain the answer to the question.” Such a system, Kwiatkowski and Collins contend, would have to be able to decide whether any part of the Wikipedia page contained the information needed to infer the answer, requiring a “deeper level” of language understanding than most systems demonstrate.
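
The decision such a system faces (return a passage, or conclude that the page does not contain the answer) can be illustrated with a toy Python sketch. It scores passages by simple word overlap and applies a cutoff; both the scoring rule and the `threshold` value are assumptions chosen for illustration, and they are far cruder than the language understanding the challenge calls for. The function names are hypothetical.

```python
import re
from typing import List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question: str, passage: str) -> float:
    """Fraction of the question's words that also appear in the passage."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(passage)) / len(q)


def select_long_answer(question: str, passages: List[str],
                       threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring passage, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for passage in passages:
        score = overlap_score(question, passage)
        if score > best_score:
            best, best_score = passage, score
    # Returning None models the case where the page does not contain the answer.
    return best if best_score >= threshold else None


passages = ["The cat sat on the mat.", "Paris is the capital of France."]
print(select_long_answer("what is the capital of france", passages))
print(select_long_answer("who won the world cup", passages))
```

The first query shares most of its words with the second passage, so that passage is returned; the second query shares almost none with either, so the function returns `None`.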

“It is our hope that the release of NQ, and the associated challenge, will help spur the development of more effective and robust QA systems,” they wrote. “We encourage the NLU community to participate and to help close the large gap between the performance of current state-of-the-art approaches and a human upper bound.”

Today’s unveiling comes after the release of Google’s ActiveQA, a research project that aims to investigate the use of reinforcement learning to train AI agents for question answering. And it follows the open-sourcing of Bidirectional Encoder Representations from Transformers, or BERT, a framework that Google claims enables developers to train a “state-of-the-art” NLP model in 30 minutes on a single Cloud TPU (tensor processing unit, Google’s cloud-hosted accelerator hardware) or a few hours on a single graphics processing unit.