Categorical Evaluation for Advanced Distributional Semantic Models

Kilgore, Andrew Reid (2016)
Honors Thesis (71 pages)
Committee Chair / Thesis Adviser: Choi, Jinho
Committee Members: Summet, Valerie H ; Wolff, Phillip
Research Fields: Computer science
Keywords: NLP; Natural Language Processing; Distributional Semantics; Analogy Testing
Program: College Honors Program, Computer Science
Permanent url:


Distributional Semantic word representation allows Natural Language Processing systems to extract and model an immense amount of information about a language. This technique maps words into a high-dimensional continuous space through the use of a single-layer neural network. This process has allowed for advances in many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests: questions of the form "If a is to a', then b is to what?" are answered by composing multiple word vectors and searching the vector space.
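As an illustration of how an analogy question can be answered by composing word vectors, the following is a minimal sketch of the vector offset method, assuming cosine similarity as the search criterion. The function names and the two-dimensional toy vectors are invented for this example and are not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, a_prime, b):
    """Answer 'a is to a_prime, then b is to what?' by vector offset."""
    # Compose the query vector: b + (a_prime - a).
    target = [
        vb + vap - va
        for va, vap, vb in zip(vectors[a], vectors[a_prime], vectors[b])
    ]
    # Search the vocabulary, skipping the three question words themselves.
    candidates = (w for w in vectors if w not in {a, a_prime, b})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors; real models use hundreds of dimensions.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
answer = solve_analogy(toy_vectors, "man", "woman", "king")
```

With these toy vectors the composed query lands exactly on the vector for "queen", so that word is returned. In a trained model the match is only approximate, which is why the nearest neighbour is searched for rather than an exact hit.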

During the neural network training process, each word is examined as a member of its context. Generally, a word's context is considered to be the elements adjacent to it within a sentence. Some work has examined the effect of expanding this definition, but the area remains largely unexplored. Further, no inquiry has been conducted as to the specific linguistic competencies of these models or whether modifying their contexts impacts the information they extract.
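The conventional notion of context described here, the elements adjacent to a word within a sentence, can be sketched as a sliding window over the tokens. This is a minimal sketch under that assumption; the function name and the `window` parameter are hypothetical, and the alternative contexts studied in the thesis are not shown.

```python
def window_contexts(tokens, window=2):
    """Return (word, context_word) pairs for a symmetric window."""
    pairs = []
    for i, word in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "down"]
pairs = window_contexts(sentence, window=1)
```

Changing what counts as a context amounts to replacing this pairing step with a different rule for choosing `tokens[j]`, while the rest of the training procedure stays the same.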

In this paper we propose a thorough analysis of the various lexical and grammatical competencies of distributional semantic models. We aim to leverage analogy tests to evaluate the most advanced distributional model across 14 different types of linguistic relationships. With this information we will then be able to investigate whether modifying the training context changes quality in any of these categories. Ideally, we will be able to identify training approaches that increase precision in specific linguistic categories, which will allow us to investigate whether these improvements can be combined by joining the information used in different training approaches to build a single, improved model.
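Evaluating a model separately for each type of linguistic relationship can be sketched as grouping analogy results by category and scoring each group. This is a minimal sketch that assumes a simple fraction-correct score; the category labels and sample results are invented for illustration and are not the thesis's 14 categories or its scoring scheme.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, predicted, expected) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in results:
        total[category] += 1
        if predicted == expected:
            correct[category] += 1
    # Fraction of analogy questions answered correctly per category.
    return {category: correct[category] / total[category] for category in total}

# Hypothetical results for two made-up categories.
toy_results = [
    ("plural", "cats", "cats"),
    ("plural", "mouses", "mice"),
    ("past-tense", "walked", "walked"),
]
scores = accuracy_by_category(toy_results)
```

Per-category scores of this kind make it possible to compare two training contexts category by category rather than by a single overall number.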

Table of Contents

1 Introduction

1.1 Thesis Statement

2 Background

2.1 Word Representation

2.1.1 Word Embeddings

2.2 Analogy Tests

2.2.1 Vector Offset

2.2.2 Analogies

2.2.3 Syntactic Test Set

2.2.4 Test Set Customization

2.2.5 Scoring - Precision and Recall

2.3 Linguistic Structure

2.3.1 Dependency Structure

2.3.2 Predicate Argument Structure

2.3.3 Morphemes

2.4 Neural Network Models

2.4.1 NNLM

2.4.2 RNNLM

2.4.3 Word2Vec

2.4.4 Contexts

3 Approach

3.1 Corpus

3.2 Syntactic Contexts

3.2.1 First Order Dependency (dep1)

3.2.2 Semantic Role Label Head (srl1)

3.2.3 Closest Dependency Siblings (sib1)

3.2.4 First and Second Closest Dependency Siblings

3.3 Composite Models

3.3.1 All Siblings (allsib)

3.3.2 Second Order Dependency (dep2)

3.3.3 Second Order Dependency with Head (dep2h)

3.3.4 Siblings with Dependents

3.4 Ensemble Models

3.4.1 Model Inclusion

3.4.2 Categorical Model Selection

3.5 Analogy Testing

3.5.1 Scoring

3.6 Implementation

3.6.1 Arbitrary and Dependency Contexts

3.6.2 Analogy Testing Framework

3.6.3 Ensemble Models

4 Experiments

4.1 EmoryNLP Word2Vec

4.2 Lexical Evaluation

4.3 Grammatical Evaluation

4.3.1 Rank Scoring

4.4 Context Analysis

4.5 Ensemble Models

4.5.1 Diminishing Information

5 Conclusion

5.1 Future Work

5.1.1 Additional Models

5.1.2 Ensemble Models

5.1.3 Vector Space Analysis

Appendix A - Glossary

Appendix B - Syntactic Context Comparisons


Honors Thesis, 71 pages (PDF, 1.8 MB)
Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.