
Topics in Statistical Modeling for Unstructured Text Data with Application to Commonsense Inference


Commonsense inference is a critical capability of modern artificial intelligence (AI) systems: machines need commonsense knowledge to perform tasks the way humans do. Learning commonsense inference from text has been a long-standing challenge in natural language processing because of reporting bias -- people rarely state commonsense explicitly when communicating, although commonsense knowledge can be implicitly embedded in text; for example, "x is in y" implies "y is larger than x". Text data is unstructured, high-dimensional, sparse, and large-scale, properties that pose challenges for conventional statistical models attempting to learn such implicitly embedded commonsense knowledge. Recent advances in statistical language models based on deep neural networks tackle these challenges by learning vector representations of text. With powerful vector representations of language obtained by training neural language models on large-scale unlabeled corpora, researchers have achieved strong performance on many commonsense reasoning benchmarks. However, all existing methods rely on large-scale human-authored training data to achieve peak performance, and manual curation of training data at scale for each new commonsense domain can be prohibitively expensive. In addition, while large corpora are easy to obtain, training neural language models on them demands high computational cost even with advanced hardware. In this dissertation, we focus on developing novel statistical methodologies that reduce both the manual effort and the computational time required to build AI systems that perform commonsense inference. First, we propose an n-gram regularization method that uses large-corpus statistics to improve the computational efficiency of training recurrent neural network based language models. Second, we develop a multivariate extension of the Bradley-Terry model on word embeddings for comparing commonsense properties of objects.
We demonstrate that the new model outperforms existing methods while greatly reducing human annotation effort, and that labeling cost can be further reduced with a synthetic active learning strategy. Finally, we introduce G-DAUG, a novel generative data augmentation framework for low-resource commonsense reasoning tasks. G-DAUG consistently outperforms existing data augmentation methods and establishes new state-of-the-art results on multiple commonsense reasoning benchmarks. Beyond improvements in in-distribution accuracy, training with G-DAUG augmentation also enhances out-of-distribution generalization, showing greater robustness to adversarial and perturbed examples.
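To give a concrete sense of the second contribution, the sketch below illustrates a Bradley-Terry-style pairwise comparison built on word embeddings. This is a minimal illustration under assumptions: the object names, embedding dimension, and the linear score parameterization are hypothetical stand-ins, not the dissertation's actual model or data.

```python
import numpy as np

# Hypothetical sketch of a Bradley-Terry pairwise comparison on word
# embeddings. Each object's latent score for a property is a linear
# function of its embedding; the probability that one object exceeds
# another on that property is the logistic of the score difference.

rng = np.random.default_rng(0)
dim = 8

# Toy embeddings for two objects (stand-ins for pretrained word vectors).
emb = {
    "elephant": rng.normal(size=dim),
    "mouse": rng.normal(size=dim),
}

# Property-specific weight vector (learned from pairwise labels in
# practice; random here for illustration).
w = rng.normal(size=dim)

def score(obj: str) -> float:
    """Latent Bradley-Terry score: linear in the object's embedding."""
    return float(w @ emb[obj])

def p_greater(a: str, b: str) -> float:
    """P(a exceeds b on the property), logistic of the score difference."""
    return 1.0 / (1.0 + np.exp(-(score(a) - score(b))))

p = p_greater("elephant", "mouse")
# Complementary comparisons sum to one, as in the standard Bradley-Terry model.
print(round(p + p_greater("mouse", "elephant"), 6))  # → 1.0
```

Because the score is shared across all pairs through the embedding space, a model of this form can generalize a learned property comparison to objects never seen in the pairwise training data, which is what makes it attractive for reducing annotation effort.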
