Scientists Tested Natural Language Models To Predict Human Language Judgments
Natural language models can be used to test computational hypotheses about how people understand language. A team of researchers at Columbia University in New York, led by Tal Golan and Matthew Siegelman, assessed the model-human consistency of several language models using a distinctive experimental approach: controversial sentence pairs. For each controversial sentence pair, two language models disagree about which sentence is more likely to occur in the real world. The study covered nine language models, including n-gram, recurrent neural network, and transformer models.
The researchers generated hundreds of such controversial sentence pairs, either by selecting sentences from a corpus or by synthetically optimizing sentence pairs to maximize controversy. Human volunteers then judged which of the two sentences in each pair was more probable. The controversial sentence pairs effectively exposed model flaws and identified the models most closely aligned with human judgments. GPT-2 was the most human-consistent model tested, yet the experiments still revealed substantial gaps between its predictions and human perception.
The researchers tested nine models from three classes: n-gram models, recurrent neural networks, and transformers. The n-gram models were trained with the open-source Natural Language Toolkit (NLTK); the recurrent neural networks were trained using PyTorch architectures and optimization procedures; and the transformers were obtained from HuggingFace, an open-source model repository. Judgments were gathered from 100 native English speakers in an online experiment. In each session, participants were asked which of two sentences they would be "more likely to encounter in the world, as either speech or written text" and rated their confidence in each response on a 3-point scale.
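The n-gram branch of the comparison can be illustrated with a minimal add-one-smoothed bigram scorer. This is a toy stand-in, not the study's actual setup (the researchers trained their n-gram models with NLTK on a real corpus), but it shows how such a model assigns one sentence a higher probability than another:

```python
from collections import defaultdict
import math

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences padded with <s>/</s>."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    vocab = {t for s in corpus for t in s.lower().split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, unigrams, bigrams, vsize):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    lp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vsize))
    return lp

# A tiny invented corpus: a sentence resembling the training data
# scores higher than a scrambled version of the same words.
corpus = ["the dog barked", "the dog slept", "a cat slept"]
model = train_bigram(corpus)
print(log_prob("the dog slept", *model) > log_prob("slept dog the", *model))  # → True
```

The same "which sentence is more probable" question is posed to every model class in the study; only the machinery computing the probability differs.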
Although the model ranking was consistent with earlier work, GPT-2's pronounced failure to predict human responses to natural-versus-synthetic controversial pairs indicates that it does not adequately capture the computations underlying human processing of even short sentences. This result is not entirely surprising: GPT-2 is an off-the-shelf machine-learning model that was not designed with human psycholinguistic or physiological constraints in mind. Still, despite considerable inconsistency among human raters, a recent study found that GPT-2 could account for almost all of the explainable variance in human responses to natural sentences.
The researchers arranged 90 sentence pairs into ten sets of nine pairs each and assigned each set to a different group of ten participants. To assess model-human alignment, they computed the percentage of trials on which a model and a participant agreed about which sentence was more likely. On randomly sampled natural sentence pairs, all nine language models predicted human choices better than chance (50% accuracy). Between-model differences were analyzed with a Wilcoxon signed-rank test across the ten participant groups, a design that accounts for variability across both participants and sentence pairs.
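The alignment measure itself is simple to state in code. The sketch below uses hypothetical preference data (+1 or -1 encoding which sentence of a pair was chosen); it is an illustration of the agreement metric described above, not the study's analysis pipeline:

```python
def agreement(model_prefs, human_prefs):
    """Fraction of trials on which the model and a participant chose the
    same sentence of the pair (+1 = first sentence, -1 = second)."""
    assert len(model_prefs) == len(human_prefs)
    matches = sum(m == h for m, h in zip(model_prefs, human_prefs))
    return matches / len(model_prefs)

# Hypothetical six-trial session: model and participant agree on 4 of 6 pairs.
model_prefs = [+1, +1, -1, -1, +1, -1]
human_prefs = [+1, -1, -1, -1, +1, +1]
print(agreement(model_prefs, human_prefs))  # 4 of 6 trials match
```

Chance performance is 0.5, since with two sentences per pair a coin-flipping model would match a participant half the time on average; the study's models all exceeded that baseline on natural pairs.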
The researchers also devised a procedure for synthesizing controversial sentence pairs, in which naturally occurring sentences serve both as initializations for the synthetic sentences and as reference points that guide the synthesis. Starting from a natural sentence, they repeatedly replaced words with alternatives from a predefined vocabulary, making the synthetic sentence less probable under one language model while ensuring it remained at least as probable under another model.
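A greedy version of this replace-words loop can be sketched with two toy unigram scorers standing in for the real models. Everything here (the vocabulary, the weights, the single-swap greedy search) is an invented simplification of the optimization the researchers describe, meant only to show the accept/reject logic:

```python
import math

# Two toy "language models": unigram log-probabilities over a tiny vocabulary.
VOCAB = ["the", "dog", "cat", "slept", "quietly", "zorp"]

def make_unigram(weights):
    total = sum(weights.values())
    logp = {w: math.log(weights[w] / total) for w in VOCAB}
    return lambda sent: sum(logp[w] for w in sent)

model_a = make_unigram({"the": 5, "dog": 4, "cat": 4, "slept": 3, "quietly": 2, "zorp": 0.01})
model_b = make_unigram({"the": 5, "dog": 1, "cat": 1, "slept": 3, "quietly": 2, "zorp": 4})

def synthesize(sentence, reject, accept, steps=20):
    """Greedily replace one word at a time: lower the sentence's probability
    under `reject` while never letting it fall below the original sentence's
    probability under `accept`."""
    floor = accept(sentence)
    current = list(sentence)
    for _ in range(steps):
        best = None
        for i in range(len(current)):
            for w in VOCAB:
                cand = current[:i] + [w] + current[i + 1:]
                if accept(cand) >= floor and reject(cand) < reject(current):
                    if best is None or reject(cand) < reject(best):
                        best = cand
        if best is None:  # no swap improves the objective; stop
            break
        current = best
    return current

start = ["the", "dog", "slept"]
out = synthesize(start, reject=model_a, accept=model_b)
print(out)  # model_a now dislikes the sentence; model_b still accepts it
```

The resulting pair (original, synthesized) is controversial by construction: the `reject` model rates the synthetic sentence as far less probable, while the `accept` model rates it at least as probable as the natural one, so the two models must disagree about the pair.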
Human participants judged ten controversial synthetic-sentence pairs for each model pair. The researchers then assessed how well each model predicted human sentence choices across all of the controversial synthetic-sentence pairs in which it was one of the two models being compared.
The experiments showed that:
- Controversial sentence pairs can be generated in several ways: by selecting pairs of sentences from a corpus, or by modifying natural sentences until the models' predictions disagree.
- Controversial sentence pairs make it possible to efficiently discriminate between models that appear equally human-consistent on ordinary test material.
- Every natural language processing model class tested mistakenly assigns high probability to some unnatural sentences: a simple sentence can be modified so that its likelihood under a given model does not decrease, even though human judges rate the modified sentence as far less likely.
- This approach to comparing and testing models may yield new insights into which model classes best match human language perception, and which kinds of models should be developed in the future.