How Surfing the Web Improves Machine Learning

Image caption: The new approach was tested on extraction tasks for mass shooting data, as well as food contamination data. (Image courtesy of Narasimhan et al.)
MIT researchers have developed a new approach to machine-learning-based information extraction. Information extraction uses machine learning to automatically classify data items pulled from plain text, such as online articles, so that they can be used in statistical analysis.

The new technique makes machine learning a little more like human learning, a more natural fit for natural language processing. In two separate experiments, the new method outperformed conventional machine-learning techniques by about 10 percent.


More Human Machine Learning

Conventional approaches to machine-learning information extraction rely on vast amounts of training data, which increases the system's capacity to handle difficult problems. The new approach uses far less data, which more realistically reflects how much information is typically available, and then deals with that limited information much as a human would.

“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” said researcher Regina Barzilay. “That’s very different from what you or I would do. When you’re reading an article that you can’t understand, you’re going to go on the web and find one that you can understand.”

Similarly, the new system looks for more information when it needs it. It assigns each of its classifications a confidence score representing the likelihood that the extraction is correct given the available data. If the confidence score is too low, the system automatically searches the web for more information, analyzes the new data, and reconciles it with the initial extraction, repeating the process until it is confident in the results.
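Stripped to its essentials, that loop might look something like the sketch below. The extractor, web-search function, reconciliation step, confidence threshold, and article cap are all hypothetical stand-ins chosen for illustration, not the researchers' actual implementation.

```python
# A minimal sketch of the confidence-gated extraction loop described above.
# `extract`, `search_web`, `reconcile`, and the numeric settings are assumed
# placeholders, not the system's real components.

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff below which extra evidence is sought
MAX_EXTRA_ARTICLES = 10     # assumed cap on additional articles per source article

def extract_with_external_evidence(article_text, extract, search_web, reconcile):
    """Extract field values, consulting the web whenever confidence is too low.

    extract(text)    -> {field: (value, confidence)}                  (hypothetical)
    search_web(text) -> list of related article texts                 (hypothetical)
    reconcile(a, b)  -> merged dict keeping higher-confidence values  (hypothetical)
    """
    results = extract(article_text)
    pulled = 0
    for new_article in search_web(article_text):
        # Stop as soon as every field is confident enough, or the cap is hit.
        if min((conf for _, conf in results.values()), default=1.0) >= CONFIDENCE_THRESHOLD:
            break
        if pulled >= MAX_EXTRA_ARTICLES:
            break
        results = reconcile(results, extract(new_article))
        pulled += 1
    return {field: value for field, (value, conf) in results.items()}
```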

The researchers tested their approach on two extraction tasks: collecting U.S. mass shooting data and collecting food contamination data. In the first case, the system had to extract the name of the shooter, the location of the shooting, and the numbers of people wounded and killed; in the second, the food type, the type of contaminant, and the location of the incident. The target fields for each task are sketched below.
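For concreteness, those targets could be represented as simple field lists like the following. The field names are paraphrased from the article, not the researchers' exact labels.

```python
# Illustrative target schemas for the two extraction tasks.
# Field names are paraphrased from the article, not the dataset's actual labels.
SHOOTING_FIELDS = ("shooter_name", "location", "num_wounded", "num_killed")
CONTAMINATION_FIELDS = ("food_type", "contaminant", "location")

# A completed extraction for one article would then be a mapping from field to value,
# e.g. {"food_type": "...", "contaminant": "...", "location": "..."}.
```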

The system learned clusters of search terms associated with its target classifications. The names of mass shooters, for example, correlate with search terms like “police”, “identified”, “arrested”, and “charged”. On average, the system downloaded another nine or ten articles for each article it was asked to analyze.
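One simple way to picture how such clusters could drive the web searches is sketched below. The query format and the build_queries helper are assumptions made for illustration, not the paper's actual query scheme; only the example terms come from the article.

```python
# Hypothetical illustration of turning learned search-term clusters into web queries
# for a low-confidence field. The terms are from the article; everything else here
# is an assumption for illustration.
LEARNED_TERMS = {
    "shooter_name": ["police", "identified", "arrested", "charged"],
}

def build_queries(field, context):
    """Combine context from the original article with terms the model has
    learned to associate with the target field."""
    return [f"{context} {term}" for term in LEARNED_TERMS.get(field, [])]

# Example: looking for a shooter's name given a known location.
print(build_queries("shooter_name", "Springfield shooting"))
# ['Springfield shooting police', 'Springfield shooting identified', ...]
```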


Improving Information Extraction

“One of the difficulties of natural language is that you can express the same information in many, many different ways, and capturing all that variation is one of the challenges of building a comprehensive model,” said Chris Callison-Burch, a computer scientist not involved in the research. “[Barzilay and her colleagues] have this super-clever part of the model that goes out and queries for more information that might result in something that’s simpler for it to process. It’s clever and well-executed.”

Since the new method improves on conventional information extraction by about 10 percent, it's an appealing option for researchers analyzing certain kinds of information. Callison-Burch's research group is currently attempting to build a database of gun violence information, and he's excited about the potential of the new approach.

“We’ve crawled millions and millions of news articles, and then we pick out ones that the text classifier thinks are related to gun violence, and then we have humans start doing information extraction manually,” he said. “Having a model like Regina’s that would allow us to predict whether or not this article corresponded to one that we’ve already annotated would be a huge time savings. It’s something that I’d be very excited to do in the future.”

You can read the team’s paper here. To learn more about machine learning, read Crowdsourcing an AI Game Show Host.