
IFT3335 - TP2

Using classification for word sense disambiguation

This practical work is to be carried out in groups of 2-3 people. It corresponds to 15% of the overall grade. Due date: December 11 before 23:59. (Due to the exam period, late submission is tolerated until December 20, 2022, 23:59, without penalty. After this deadline, a penalty of 10% per day will be applied.)

The purpose of this lab is to practice classification algorithms (supervised machine learning). We will deal with the word sense disambiguation problem, which aims at classifying a word used in a context into the appropriate sense class. The classification algorithms are already implemented in the Scikit-Learn library (https://scikit-learn.org/stable/). Your job in this lab is to use Scikit-Learn on data collections and examine the impact of different algorithms and features.

1. The task of word sense disambiguation

Word sense disambiguation consists of determining the meaning of a word that has several possible meanings. For example, the word "mouse" can refer to an animal or to a computer device. In the context of "The cat catches the mouse", the meaning referred to is that of the animal. This meaning can be determined by using the context of the word in the sentence. This is a basic task for text comprehension.

This task is typically performed using a classification approach. We assume that we have a set of meanings already defined, and a set of training examples: texts (sentences) containing occurrences of the ambiguous word, in which the meaning of each occurrence of the target word is manually annotated. Using these texts as examples, we train a classifier. This classifier is then used to disambiguate the word in new texts, i.e. to classify the word into the appropriate sense.

As with many machine learning tasks, it is difficult to know in advance which method is the most appropriate for disambiguating word senses. You therefore have to try several methods, with different features, to choose the one that works best for the task. This lab places you in this global context, and you are asked to test different methods.

2. Preparations

To get familiar with Scikit-Learn, check out the Scikit-Learn tutorial in the Machine Learning MOOC course. You are also advised to read the Scikit-Learn documentation online: https://scikit-learn.org/stable/. The following functionalities are particularly important for this TP:

  1. Preprocessing This allows you to load the data and do some preprocessing on it, e.g. selection of the data to process, transformation of data and attributes, etc. To start, load an existing dataset from the package (e.g. the iris dataset, or digits for handwriting recognition) and try it out.
  2. Classification Once the data is loaded, you can choose an algorithm and apply it to the data. This is separated into a training phase and a test phase. Once the algorithm is chosen, import it and then initialize it. Train it on the training data, then evaluate it on the test data to see its performance.
  3. Feature selection and weighting You can now try to select the attributes to use for classification. Depending on the data you are processing, you may want to ignore certain dimensions or features. There are different methods for selecting a subset of features to use in classification. This is useful when your data is noisy, with many attributes/features that do not help in classification; a cleaning (selection) is very beneficial in this case. Selection also speeds up processing, because fewer features have to be considered. Play freely with the text collections included in Scikit-Learn.

In particular, you need to transform text into a set of attributes/features. By default, each distinct word is considered an attribute or feature. Since a word may appear several times in a text, and its importance may differ depending on its frequency of occurrence, it is appropriate to use frequency to weight the importance of a word in a text. This transformation into a feature vector, and the weighting of features, is done in Scikit-Learn through the CountVectorizer and TfidfVectorizer classes. The first class creates a vector of features weighted by their frequency, and the second weights the features according to the TF-IDF scheme (TF = Term Frequency, IDF = Inverse Document Frequency; see https://fr.wikipedia.org/wiki/TF-IDF for a brief description). Read the options offered in Scikit-Learn: you can specify whether the result of this transformation produces a set of attributes (words) that are binary (present or absent) or carry a numerical weight (raw frequency, transformed TF, or TF combined with IDF).

A common practice in text classification and retrieval is to truncate words to keep only their roots. For example, the word "computer" will be truncated into "comput". This creates a single representation for a family of similar words (computer, computers, computing, compute, computes, computation); it is assumed that these morphological differences do not change the meaning of the words. This process is called stemming. Standard stemming methods are available in Python, including those offered by the NLTK library (see its documentation for a tutorial on stemming with NLTK). You will use the Porter stemmer, the most common one for this task. To integrate the NLTK stemmer into a Scikit-Learn vectorizer, see https://scikit-learn.org/stable/modules/feature_extraction.html.

2.1. Corpus to be treated for the TP

For this tutorial, we will use a set of annotated English sentences containing the ambiguous word interest, which can correspond to 6 different meanings, according to the Longman dictionary:

Sense 1 = 361 occurrences (15%) - readiness to give attention
Sense 2 = 11 occurrences (01%) - quality of causing attention to be given to
Sense 3 = 66 occurrences (03%) - activity, etc. that one gives attention to
Sense 4 = 178 occurrences (08%) - advantage, advancement or favor
Sense 5 = 500 occurrences (21%) - a share in a company or business
Sense 6 = 1252 occurrences (53%) - money paid for the use of money

The annotated text contains the result of a part-of-speech analysis + annotation of the meaning of the word interest. Here is an example:

[ yields/NNS ] on/IN [ money-market/JJ mutual/JJ funds/NNS ] continued/VBD to/TO slide/VB ,/, amid/IN [ signs/NNS ] that/IN [ portfolio/NN managers/NNS ] expect/VBP [ further/JJ declines/NNS ] in/IN [ interest_6/NN rates/NNS ] ./.
$$
[ longer/JJR maturities/NNS ] are/VBP thought/VBN to/TO indicate/VB [ declining/VBG interest_6/NN rates/NNS ] because/IN [ they/PP ] permit/VBP [ portfolio/NN managers/NNS ] to/TO retain/VB relatively/RB [ higher/JJR rates/NNS ] for/IN [ a/DT longer/JJR period/NN ] ./.

In this example, the brackets [ ] enclose a noun phrase. Each word is followed by its grammatical category (e.g. /NNS), and the ambiguous word, interest, is annotated with its meaning (_6, that is, the 6th meaning). Punctuation marks form their own category (as in ./. at the end of a sentence). Sentences are separated by a $$ line. This corpus contains 2369 instances of the word interest. A description of this corpus can be found at http://www.d.umn.edu/~tpederse/Data/README.int.txt. The corpus we use in this lab is taken from http://www.d.umn.edu/~tpederse/data.html.
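A hypothetical parser for this format (the function and variable names are ours) might look like the following sketch, which drops the noun-phrase brackets, splits each word/TAG token, and records the annotated sense:

```python
import re

def parse_corpus(text):
    """Return a list of (tokens, tags, sense) triples, one per sentence."""
    instances = []
    for sentence in text.split("$$"):       # sentences are separated by $$
        tokens, tags, sense = [], [], None
        for item in sentence.split():
            if item in ("[", "]"):          # drop noun-phrase brackets
                continue
            word, _, tag = item.rpartition("/")  # split word/TAG on last "/"
            m = re.fullmatch(r"interests?_(\d)", word)
            if m:
                sense = m.group(1)          # remember the annotated sense
                word = "interest"           # strip the sense label
            tokens.append(word.lower())
            tags.append(tag)
        if sense is not None:
            instances.append((tokens, tags, sense))
    return instances

sample = ("[ further/JJ declines/NNS ] in/IN "
          "[ interest_6/NN rates/NNS ] ./.")
print(parse_corpus(sample))
```

Using rpartition keeps the punctuation tokens correct: "./." splits into the word "." with category ".".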

2.2. The disambiguation process

To determine the meaning of the word, we use the information in its context. The contextual information must be extracted beforehand (this is not done by Scikit-Learn). Here are two types of contextual information:

  • The set of words before and after the target word (as a bag of words, without order). In the first example above, if we retain the 2 words before and the 2 words after the word interest, these words are {declines, in, rates, .}.
  • The grammatical categories of the surrounding words. For the same 4 words of the first example, we will have: "NNS", "IN", "NNS", ".". These categories are generally taken into account in order (C-2=NNS, C-1=IN, C1=NNS, and C2=.) so as to capture some of the syntactic structure.

These two groups of features are the ones you should use as a minimum. But there are some possible variations (which you can test):

  • When selecting the words around the target, it is possible to ignore very frequent and uninformative stopwords, such as in, or punctuation (in and further are stopwords - see Scikit-Learn's list of stopwords). This option is built into the Scikit-Learn vectorizers.
  • Truncation of words, using a stemming algorithm, as explained earlier.

In the literature, other types of features have been proposed and used. We suggest consulting the following page for a summary presentation: https://en.wikipedia.org/wiki/Word-sense_disambiguation You are encouraged to explore additional features; their use will be taken into account, and if they are non-trivial, bonus points may be awarded in the correction.
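As an illustration of the two minimum feature types (the function window_features and its feature-naming scheme are our own, not imposed by the TP), assuming each instance is a list of tokens with a parallel list of POS tags:

```python
def window_features(tokens, tags, target_index, k=2):
    """Features for one instance: bag of +/-k surrounding words,
    plus the ordered POS categories C-k..Ck around the target."""
    feats = {}
    # Feature type 1: unordered bag of surrounding words.
    lo = max(0, target_index - k)
    before = tokens[lo:target_index]
    after = tokens[target_index + 1:target_index + 1 + k]
    for w in before + after:
        feats["word=" + w] = 1
    # Feature type 2: POS categories, kept in order relative to the target.
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tags):
            feats[f"C{offset:+d}={tags[i]}"] = 1
    return feats

tokens = ["further", "declines", "in", "interest", "rates", "."]
tags = ["JJ", "NNS", "IN", "NN", "NNS", "."]
print(window_features(tokens, tags, tokens.index("interest"), k=2))
```

Dictionaries of this form can then be turned into the numeric matrix Scikit-Learn expects, for instance with sklearn.feature_extraction.DictVectorizer.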

2.3. The tasks to be carried out

  1. Preparation: You need to make a program capable of extracting features from annotated texts. In particular, think about and test the choice of elements to use as features for classification.
  2. Basic task
    • You have to test the performance of different classification algorithms. For this lab, you are asked to test the following algorithms available in Scikit-Learn, which are presented in the MOOC course:
    • a. Naive Bayes (choose Multinomial Naive Bayes): https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
    • b. Decision tree: https://scikit-learn.org/stable/modules/tree.html#classification
    • c. Random forest: https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
    • d. SVM: https://scikit-learn.org/stable/modules/svm.html
    • e. MultiLayerPerceptron (trying different numbers of hidden neurons): https://scikit-learn.org/stable/modules/neural_networks_supervised.html#multi-layer-perceptron
    For these algorithms, first use the 2 feature types described above. Test with different context window sizes (1, 2, 3, ..., or even the whole sentence), and observe the variation in disambiguation performance as a function of this size. However, do not do this for all algorithms (there would be too many combinations): test window sizes on one algorithm only, and assume that these features produce the same effects on the other algorithms.
  3. Additional tasks (optional) Think of features that might be interesting to explore, and justify why. Then define and implement an interesting method to test them.
  4. Additional method (optional) In addition to classical machine learning methods, deep learning is increasingly used for natural language processing tasks. In particular, recent studies typically exploit a pre-trained model like BERT. In this optional task, you are invited to explore the use of BERT for word sense classification: a basic BERT model generates a representation of the sentence, and this representation is passed to a perceptron layer to do the classification. Explore this option if you have completed the mandatory tasks and have the time and skills to explore BERT (TensorFlow, Keras, ...).
  5. Analysis in a report Analyze the classification results with different algorithms, and compare their performance. As an overall reflection, you are invited to think about the suitability of each algorithm to handle this word sense classification problem, its strengths and weaknesses, and possible improvements.
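The comparison loop for the basic task could be sketched as follows; here the iris dataset stands in for the real feature matrix, and the hyperparameters are illustrative only. In the TP, X would be the extracted context features and y the sense labels:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; replace with the "interest" feature matrix and sense labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The five algorithms required by the basic task.
classifiers = {
    "NaiveBayes": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000,
                         random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")
```

Varying hidden_layer_sizes in the MLP and the context window size in the feature extraction then gives the two sensitivity analyses the report asks for.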

3. To be returned

You must submit all programs used for feature extraction. Allocate a small part of your report to describing these programs and their use. In addition to the programs, you must also submit a report, approximately 10 pages in length. In your report, you should describe your preprocessing, the experiments you performed, the results obtained, and finally comparisons and analyses of the results. Your analyses should cover, at a minimum, the performance of the different algorithms and the impact of different options: stemming, the number of hidden neurons, and the window size for disambiguation. You are encouraged to analyze other aspects as well; describe freely what you find interesting in these experiments.

4. Evaluation scales

This practical work corresponds to 15% of the overall grade. Here are the criteria used to grade this TP:

  • Feature extraction program: 2 points
  • Tests with Naïve Bayes: 2 points
  • Tests with decision tree: 2 points
  • Tests with random forest: 2 points
  • Tests with SVM: 2 points
  • Tests with MultiLayerPerceptron: 2 points
  • Report: 3 points (the description, analyses and comparisons between different algorithms and options, analyses of results and conclusions are taken into account. Good structure and clarity of description are expected).
  • Bonus: Up to 2 bonus points if non-trivial characteristics are developed and tested, and/or if you have included the BERT-based method in your tests and analyses.