HW 1 (Programming Portion)
Naive Bayes and Decision Trees [15 points]
General Instructions
Upload only those Python files that you have modified for this assignment. These files should include naive_bayes.py, decision_tree.py, and crossval.py. You are welcome to create additional functions, files, or scripts, and you may also modify the included interfaces for existing functions in the given files if you prefer a different organization. For this homework, you will build two text categorization classifiers: one using Naive Bayes and the other using decision trees. You will also write general cross-validation code that applies to either of your classifiers.
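As a rough illustration of the kind of logic crossval.py will need, a k-fold split can be sketched as below. The function name, interface, and the (n_features, n_examples) data layout here are assumptions for illustration, not the starter code's actual conventions:

```python
import numpy as np

def cross_validate(train_fn, predict_fn, data, labels, folds=4):
    """Hypothetical k-fold cross-validation sketch.

    data: (n_features, n_examples) array; labels: (n_examples,) array.
    train_fn(data, labels) returns a model; predict_fn(data, model)
    returns predicted labels. Returns mean and std of held-out accuracy.
    """
    n = labels.shape[0]
    fold_ids = np.arange(n) % folds  # assign each example to a fold
    scores = []
    for k in range(folds):
        train_mask = fold_ids != k
        test_mask = ~train_mask
        model = train_fn(data[:, train_mask], labels[train_mask])
        predictions = predict_fn(data[:, test_mask], model)
        scores.append(np.mean(predictions == labels[test_mask]))
    return np.mean(scores), np.std(scores)
```

In practice you would also want to shuffle (or stratify) the examples before assigning folds, since the raw data may be ordered by category.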
Data and starter code
In the HW1 folder, you should find the 20newsgroups data set (also available from the original source http://qwone.com/~jason/20Newsgroups/). This data set consists of newsgroup posts from an earlier era of the Internet. The posts are in different categories, and the data set has become a standard benchmark for text classification methods. The data is represented in a bag-of-words format, where each post is represented by which words are present in it, without any consideration of the order of the words or their counts.

We have also provided a unit test class in tests.py, which contains unit tests for each type of learning model (see https://docs.python.org/3/library/unittest.html for more information about unittest). These unit tests may be easier to use for debugging in an IDE like PyCharm than the iPython notebook. A successful implementation should pass all unit tests and run through the entire iPython notebook without issues. You can run the unit tests from a *nix (Unix-like OS) command line with the command
python -m unittest -v tests
or you can use an IDE’s unit test interface. These tests are not foolproof: code that does not meet the requirements for full credit may still pass them (and, though it would be surprising, full-credit code may fail them). Please make sure to carefully read the test code and the messages it produces for guidance about the target accuracy you should aim for.
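To make the bag-of-words format concrete, here is a toy illustration of binary word-presence features. The vocabulary, posts, and dense (n_words, n_posts) layout below are invented for illustration only; the actual storage format used by the starter code may differ (for example, it may use sparse matrices):

```python
import numpy as np

# Hypothetical toy vocabulary and posts, for illustration only.
vocabulary = ["ball", "game", "orbit", "mars"]
posts = [
    "the game went to overtime after a great ball game",
    "the probe reached orbit around mars",
]

# Entry (j, i) is 1 iff vocabulary word j appears anywhere in post i.
features = np.zeros((len(vocabulary), len(posts)), dtype=int)
for i, post in enumerate(posts):
    words = set(post.split())
    for j, word in enumerate(vocabulary):
        features[j, i] = int(word in words)

# Word order and counts are discarded: "game" appears twice in the
# first post but is still recorded as a single 1.
print(features)
```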
Programming Tasks
Before starting the tasks, examine the entire codebase. Follow the code from the iPython notebook to see which methods it calls, and make sure you understand what all of the code does (you can run the iPython notebooks the same way you learned in the numpy tutorial). Your required tasks are:
- [0 points] Examine the iPython notebook test_predictors.ipynb. This notebook uses the learning algorithms and predictors you will implement in the first part of the assignment. Read through the data-loading code (load_all_data.py) and the experiment code to make sure you understand how each piece works.
- [0 points] Examine the function calculate_information_gain in decision_tree.py. The function takes in training data and training labels and computes the information gain for each feature: for each feature x_j, it computes the gain from splitting on x_j, G(x_j). Your function should return the vector [G(x_1), ..., G(x_n)]^T.
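The information gain of a binary feature is the reduction in label entropy from conditioning on that feature, G(x_j) = H(Y) - H(Y | x_j). The sketch below shows one way this could be computed; it assumes a dense (n_features, n_examples) binary array, whereas the provided calculate_information_gain may use different conventions (e.g., sparse data and vectorized counts):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(-np.sum(p * np.log2(p)))

def information_gain_sketch(data, labels):
    """Sketch of G(x_j) = H(Y) - H(Y | x_j) for each binary feature.

    data: (n_features, n_examples) binary array; labels: (n_examples,).
    Returns the vector [G(x_1), ..., G(x_n)].
    """
    base = entropy(labels)
    gains = np.zeros(data.shape[0])
    for j in range(data.shape[0]):
        present = data[j] == 1
        p_present = present.mean()
        conditional = (p_present * entropy(labels[present])
                       + (1 - p_present) * entropy(labels[~present]))
        gains[j] = base - conditional
    return gains
```

A feature that perfectly separates the two classes gets a gain equal to the full label entropy, while a feature independent of the labels gets a gain of zero.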