Advanced Natural Language Engineering (G5114): Assessed coursework

Practical assignment (3000 words)

The Microsoft Research Sentence Completion Challenge (Zweig and Burges, 2011) requires a system to predict which word (from a set of 5 possibilities) is most likely to complete a sentence. In the labs you evaluated unigram and bigram models on this task. In this assignment you are expected to investigate at least 2 extensions or alternative approaches to making predictions. Your solution does not need to be novel. You might choose to investigate 2 of the following approaches, or 1 of the following approaches and 1 of your own devising.

Tri-gram (or even quadrigram) models
Word similarity methods, e.g. using GoogleNews vectors or WordNet
Combining n-gram methods with word similarity methods, e.g. distributional smoothing
Using a neural language model
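
As an illustration of the first approach in the list above, the following is a minimal sketch of trigram scoring for the sentence completion task. It is not the lab code: the function names (train_trigram_counts, trigram_prob, choose_candidate), the add-k smoothing with k=0.1, the padding tokens and the toy corpus are all assumptions made for this example. Smoothing is itself a hyper-parameter choice that you would need to discuss.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over tokenised sentences."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        # Hypothetical padding symbols marking sentence start and end.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts, vocab


def trigram_prob(w1, w2, w3, trigrams, contexts, vocab_size, k=0.1):
    """Add-k smoothed estimate of P(w3 | w1, w2); k is an assumed value."""
    return (trigrams[(w1, w2, w3)] + k) / (contexts[(w1, w2)] + k * vocab_size)


def choose_candidate(left, right, candidates, trigrams, contexts, vocab, k=0.1):
    """Return the candidate that maximises the product of the trigram
    probabilities affected by filling the blank.

    left and right are the token lists before and after the blank.
    """
    def score(candidate):
        tokens = ["<s>", "<s>"] + list(left) + [candidate] + list(right) + ["</s>"]
        pos = 2 + len(left)  # index of the candidate in the padded sentence
        p = 1.0
        # Only trigrams containing the candidate differ between choices.
        for i in range(pos, min(pos + 3, len(tokens))):
            p *= trigram_prob(tokens[i - 2], tokens[i - 1], tokens[i],
                              trigrams, contexts, len(vocab), k)
        return p

    return max(candidates, key=score)


# Toy corpus for illustration only (not the challenge training data).
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "slept", "on", "the", "mat"],
]
trigrams, contexts, vocab = train_trigram_counts(toy_corpus)
best = choose_candidate(["the", "cat"], ["on", "the", "mat"],
                        ["sat", "slept", "rug", "dog", "mat"],
                        trigrams, contexts, vocab)
```

Scoring only the trigrams that overlap the blank gives the same ranking as scoring the whole sentence, because every other trigram is identical across the 5 candidates. On real data you would probably sum log probabilities rather than multiply raw probabilities, and handle out-of-vocabulary words explicitly.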

It does not matter how well your method(s) perform. However, your methods should be clearly described, any hyper-parameters (whether fixed, varied or optimised) should be discussed, and there should be a clear comparison of the approaches with each other and with the unigram and bigram baselines, from both a practical and an empirical perspective.

You have been provided with the training and test data for this task in the labs. You may (and are expected to) use any of the code that you have developed throughout the labs, including code provided to you in the exercises or solutions. You may use any other resources to which you have access. You are encouraged to make use of one or more of WordNet, the Lin dependency thesaurus provided in NLTK and/or Word2Vec word embeddings. You may also download other resources from the Internet and make use of any Python libraries that you are familiar with.

Your report should end with your conclusions and areas for further work. You should also submit your code as an appendix. Your report (including figures and bibliography but not including the code appendix) should be no longer than 8 sides (3000 words of text plus figures and bibliography). Your code in the appendix should be clearly commented.

Marks will not be awarded simply for how well your system does or for programming wizardry. Marks will be awarded for clearly evaluating possible solutions to the sentence completion challenge.
