DS3000A - DS9000A Final Exam
Student ID #: XXXXXXXXX
Grade: __ / 100 + 10 Bonus
General Comments
- This exam integrates knowledge and skills acquired in the whole term. You are allowed to use any document and source on your computer and the internet, but you are NOT allowed to share documents, post questions to online forums, or communicate in any way with people inside or outside the class.
- Having any document sharing or communication tool (e.g. Discord, Teams, Outlook, Google Drive, etc.), either web-based or app-based, open on your laptop (or running in the background) is considered an act of cheating, and you will receive 0 pts for the exam.
- To finish the exam in the allotted time, you will have to work efficiently. Read the entirety of each question carefully.
- You need to submit your final notebook by 1:00PM on OWL to the Tests and Quizzes section; this is the same place where you downloaded the empty notebook and data. Late submissions will be scored 0 pts. To avoid technical difficulties, start your submission, at the latest, five to ten minutes before the deadline.
- Some questions demand a written answer - answer these in full English sentences in markdown cells.
- For your figures, ensure that all axes are labeled in an informative way. In some cases you may need to limit the x-axis and/or the y-axis to zoom in for interpretation.
- Ensure that your code runs correctly by choosing "Kernel -> Restart and Run All" before submitting to OWL.
Additional Guidance
- If at any point you are asking yourself "are we supposed to...", write your assumptions clearly and proceed according to those assumptions.
- If you have no clue how to approach a question, skip it and move on. Revisit the skipped one(s) after you are done with the rest.
Preliminaries
Feel free to add stuff.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, mean_squared_error, silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LinearRegression, RidgeCV, SGDClassifier
from sklearn.ensemble import RandomForestRegressor
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
seed=1220
np.random.seed(seed)
import warnings
warnings.filterwarnings('ignore')
Question 1 - Model Selection
You are going to work on a dataset listing the soccer players who participated in the 2022 FIFA World Cup. Our ultimate goal is to find the ML model (amongst four candidates) that best predicts a player's monetary value. The dataset has the following attributes:
- `Age`: Player age in years
- `Nationality`: Player nationality
- `Overall`: Player overall performance score (higher is better)
- `Potential`: Player potential score (higher is better)
- `Club`: Player's home soccer club
- `Value`: Player value, i.e., the amount of money in euros a club should pay in order to purchase the player (higher is better)
- `Wage`: Player stipend in euros (higher is better)
- `Preferred Foot`: Player's preferred foot to play
- `International Reputation`: Player's international fame (higher is better)
- `Weak Foot`: Performance score of the player's weak foot (higher is better)
- `Skill Moves`: Player move skill score (higher is better)
- `Body Type`: Player body type
- `Position`: Position the player holds on the pitch
- `Height`: Player height in centimeters
- `Weight`: Player weight in kilograms

Q 1.1 - [0.5] - Load `dataset_1.csv` as a pandas dataframe, name it `data`, and display its first 5 rows.
data = pd.read_csv('dataset_1.csv')
data.head()
Q 1.2 - [1] - Code to answer the following questions:
- Does the data contain any missing value(s)? How do you take care of them? [0.5]
- Do you see any suspicious value(s) in the statistical summary of the data? If so, explain why they are suspicious and handle them properly. [0.5]
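A minimal sketch of one way to answer both parts (median imputation is an assumption; state and justify your own strategy):

```python
# Count missing values per column
print(data.isna().sum())

# One option: fill numeric gaps with the column median (assumes missingness is sparse)
data = data.fillna(data.median(numeric_only=True))

# Statistical summary to flag suspicious values, e.g. a Height of 0
# or a negative Wage would be physically impossible
data.describe()
```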
Q 1.3 - [2] - The BMI is defined as the body mass divided by the square of the body height, and is expressed in units of kg/m². With this knowledge, see if you can do some meaningful feature extraction.
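A minimal sketch, assuming `Height` is in centimeters and `Weight` in kilograms as described in the attribute list:

```python
# BMI = weight (kg) / height (m)^2; Height is given in cm, so convert to meters first
data['BMI'] = data['Weight'] / (data['Height'] / 100) ** 2
```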
Q 1.4 - [4] - Use `sns.jointplot` to investigate the following relationships and apply proper transformations where needed:
- Value vs. Wage
- Value vs. Overall
- Wage vs. Overall
- Value vs. Potential
- Wage vs. Potential
Note: Where transformation is needed, use `sns.jointplot` twice (i.e., before and after transformation).
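As a sketch: monetary variables such as `Value` and `Wage` are typically right-skewed, so a log transform is a plausible candidate (an assumption to verify against your own plots):

```python
# Before transformation
sns.jointplot(data=data, x='Overall', y='Value')
plt.show()

# Strong right skew in Value suggests a log transform
data['LogValue'] = np.log1p(data['Value'])
sns.jointplot(data=data, x='Overall', y='LogValue')
plt.show()
```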
Q 1.5 - [2] - Output a table reporting, in descending order, the correlations between the numerical features and the target.
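One way to produce such a table, assuming `Value` is the target:

```python
# Pairwise correlations with the target, sorted high to low
corr = data.corr(numeric_only=True)['Value'].drop('Value')
corr.sort_values(ascending=False).to_frame('correlation with Value')
```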
Q 1.6 - [6] - Code the following:
- use pandas `get_dummies` to take care of the categorical variables, if any, [2]
- at this point, before proceeding to the next step, store the dataframe under a unique name because you will need it again in Questions 1.14 and 1.15, [1]
- use `train_test_split` with `random_state=seed` to put aside 20% of the data for testing purposes, [1]
- define an RMSE scorer function. [2]
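A sketch of these steps; `data_for_later` is an assumed name for the stored dataframe, and `Value` is assumed to be the target:

```python
# One-hot encode the categorical columns
data_enc = pd.get_dummies(data, drop_first=True)
data_for_later = data_enc.copy()  # reused in Questions 1.14 and 1.15

X = data_enc.drop(columns=['Value'])
y = data_enc['Value']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed)

# RMSE scorer; greater_is_better=False lets sklearn minimize it
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)
```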
Q 1.7 - [4] - Do the following:
- instantiate sklearn's linear regression with the default arguments and name it `model1`, [0.5]
- run shuffled 5-split KFold cross-validation on `model1` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- get the fitted coefficients from `model1` and use `sns.barplot` to show, in descending order, the 5 features that the model deems the most important. (Take the absolute values of the coefficients because we just want to see the most correlated ones and do not care whether the correlation is positive or negative.) [1]
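A sketch of the cross-validate, fit, and report pattern (the same pattern recurs in Questions 1.8, 1.9, and 1.11); reading "prediction" as training-set RMSE and "generalization" as test-set RMSE is an assumption:

```python
model1 = LinearRegression()

# Shuffled 5-split KFold; scores are negated because the scorer is a loss
kf = KFold(n_splits=5, shuffle=True, random_state=seed)
scores = -cross_val_score(model1, X_train, y_train, cv=kf, scoring=rmse_scorer)
print('CV RMSE per split:', scores)
print('mean:', scores.mean(), 'std:', scores.std())

model1.fit(X_train, y_train)
print('prediction RMSE:', rmse(y_train, model1.predict(X_train)))
print('generalization RMSE:', rmse(y_test, model1.predict(X_test)))

# Top-5 features by absolute coefficient
coefs = pd.Series(np.abs(model1.coef_), index=X_train.columns)
top5 = coefs.sort_values(ascending=False).head(5)
sns.barplot(x=top5.values, y=top5.index)
plt.xlabel('|coefficient|')
plt.show()
```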
Q 1.8 - [5] - Do the following:
- bundle `StandardScaler` with sklearn's cross-validated ridge linear regression (`RidgeCV`) into a `Pipeline` and name it `model2` (for the regressor use the default arguments except `alphas=[1e-10, 1e-5, 1]` and `store_cv_values=True`), [1]
- run shuffled 5-split KFold cross-validation on `model2` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- which entry in the `alphas` list did the model select for training? [0.5]
- get the fitted coefficients from `model2` and use `sns.barplot` to show, in descending order, the 5 features that the model deems the most important. (As before, take the absolute values of the coefficients.) [1]
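A sketch of the pipeline; the selected penalty is exposed as `alpha_` on the fitted `RidgeCV` step:

```python
model2 = Pipeline([
    ('scaler', StandardScaler()),
    # store_cv_values is the argument name assumed here; newer sklearn
    # releases rename it to store_cv_results
    ('ridge', RidgeCV(alphas=[1e-10, 1e-5, 1], store_cv_values=True)),
])

scores = -cross_val_score(model2, X_train, y_train, cv=kf, scoring=rmse_scorer)
print('CV RMSE per split:', scores, 'mean:', scores.mean(), 'std:', scores.std())

model2.fit(X_train, y_train)
print('selected alpha:', model2.named_steps['ridge'].alpha_)
```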
Q 1.9 - [4] - Do the following:
- instantiate sklearn's random forest regressor with the default arguments except `n_jobs=-1` and `random_state=seed`, and name it `model3`, [0.5]
- run shuffled 5-split KFold cross-validation on `model3` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- how many trees does this forest have? [0.5]
- use `barplot` to generate a variable (or feature) importance diagram from this model (limit the plot to the top 5 features). [0.5]
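The cross-validation and RMSE reporting follow the Question 1.7 sketch; what changes is the model and where the importances come from:

```python
model3 = RandomForestRegressor(n_jobs=-1, random_state=seed)
model3.fit(X_train, y_train)
print('number of trees:', model3.n_estimators)

# Tree ensembles expose feature importances directly
imp = pd.Series(model3.feature_importances_, index=X_train.columns)
top5 = imp.sort_values(ascending=False).head(5)
sns.barplot(x=top5.values, y=top5.index)
plt.xlabel('feature importance')
plt.show()
```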
Q 1.10 - [2] - Use the cross-validated grid search function to find the best possible values of `n_estimators` and `max_features` for the random forest. Here are the degrees of freedom to use: for `n_estimators` try `[50, 100, 150]`, and for `max_features` try every possible value.
Note: Only use 50% of total data (randomly sampled using the provided random seed) to fit the grid search function.
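A sketch of the grid search on a 50% random subsample, as the note requires; the `max_features` candidates shown are one reading of "every possible value" and should be treated as an assumption:

```python
# Randomly sample 50% of the data with the provided seed
sample = data_enc.sample(frac=0.5, random_state=seed)
X_s, y_s = sample.drop(columns=['Value']), sample['Value']

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_features': ['sqrt', 'log2', None],  # assumed interpretation
}
grid = GridSearchCV(RandomForestRegressor(n_jobs=-1, random_state=seed),
                    param_grid, scoring=rmse_scorer, cv=5)
grid.fit(X_s, y_s)
print('best parameters:', grid.best_params_)
```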
Q 1.11 - [4] - Do the following:
- take the random forest again, but this time use the best values found in the previous step (again with `n_jobs=-1` and `random_state=seed`), and name it `model4`, [1]
- run shuffled 5-split KFold cross-validation on `model4` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- use `barplot` to generate a variable (or feature) importance diagram from this model (limit the plot to the top 5 features). [0.5]
Q 1.12 - [1] - Based on your results, what features do you think are the most important ones? Which model do you trust for this purpose and why?
Q 1.13 - [1.5] - If you are asked to choose one final model for production, which one would you select? Explain why. Note: To answer this, take computational complexity into account alongside other criteria.
Q 1.14 - [10] - Take the dataframe that you set aside in Question 1.6 for this question. With `International Reputation` as the label, attempt nonlinear dimension reduction using 3-component t-SNE with `learning_rate='auto'`, `init='random'`, `perplexity=50`, `random_state=seed`, and `n_jobs=-1`. You will probably witness better separation with higher values of `n_iter`; however, for the sake of computation time do not go beyond 1500. There is no deterministic outcome to expect from this question; as long as your implementation is correct, you will get the full mark. Treat this as an unsupervised task. Do the following:
- instantiate a t-SNE model with the proper arguments, [2]
- fit the model properly, [2]
- make a 3D scatter plot of the components you get after dimension reduction and name the axes properly, [3]
- use the label to color-code the data points in your 3D plot, [2]
- why does t-SNE use a t-distribution and not a Gaussian? [1]
Note: If you do not know how to plot in 3D, do 2D for partial mark.
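A sketch; the label is used only to color the points, so the reduction itself stays unsupervised (`data_for_later` is the assumed name from the Question 1.6 sketch):

```python
X_all = data_for_later.drop(columns=['International Reputation'])
labels = data_for_later['International Reputation']

# n_iter is capped at 1500 per the question (newer sklearn renames it max_iter)
tsne = TSNE(n_components=3, learning_rate='auto', init='random',
            perplexity=50, n_iter=1500, random_state=seed, n_jobs=-1)
emb = tsne.fit_transform(X_all)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
sc = ax.scatter(emb[:, 0], emb[:, 1], emb[:, 2], c=labels, cmap='viridis', s=10)
ax.set_xlabel('t-SNE 1'); ax.set_ylabel('t-SNE 2'); ax.set_zlabel('t-SNE 3')
fig.colorbar(sc, label='International Reputation')
plt.show()
```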
Q 1.15 - [5] - Take the dataframe that you set aside in Question 1.6 for this question. We want to do a classification with `International Reputation` as the target class. This is going to be an imbalanced classification, but we don't care. We are interested to see whether we can get a better accuracy score if we do some clustering as a preprocessing step. Do the following:
- what would be the classification baseline accuracy for this dataframe? [1]
- use `train_test_split` with `random_state=seed` to set aside 20% of the data as a test set, [0.5]
- instantiate sklearn's stochastic gradient descent classifier with the proper loss for logistic regression and name it `clf`. Use elasticnet regularization with an `l1_ratio` of 0.7. Set `max_iter=2000`, `tol=1e-3`, `n_jobs=-1`, `random_state=seed`, [2]
- run 5-split `StratifiedKFold` cross-validation on `clf` and report the cross-validated accuracy of each fold as well as their mean and standard deviation. [1.5]
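A sketch; `SGDClassifier` performs logistic regression when its `loss` is the logistic loss (`'log_loss'` in recent sklearn, `'log'` in older versions), and the baseline is the majority-class frequency:

```python
Xc = data_for_later.drop(columns=['International Reputation'])
yc = data_for_later['International Reputation']

# Baseline: always predict the majority class
print('baseline accuracy:', yc.value_counts(normalize=True).max())

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=seed)

clf = SGDClassifier(loss='log_loss', penalty='elasticnet', l1_ratio=0.7,
                    max_iter=2000, tol=1e-3, n_jobs=-1, random_state=seed)

skf = StratifiedKFold(n_splits=5)
acc = cross_val_score(clf, Xc_train, yc_train, cv=skf, scoring='accuracy')
print('fold accuracies:', acc, 'mean:', acc.mean(), 'std:', acc.std())
```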
Q 1.16 - [8] - Do the following:
- bundle a 50-cluster `KMeans` (as a preprocessing step) and `clf` into a pipeline; set `random_state=seed` for the `KMeans`, [3]
- run 5-split `StratifiedKFold` cross-validation on the pipeline and report the cross-validated accuracy of each fold as well as their mean and standard deviation, [2]
- do you find the added preprocessing step effective? Why? [1]
- what transformations did the data undergo through this pipeline? [2]
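A sketch: used as a pipeline step, `KMeans` acts as a transformer that replaces each sample with its distances to the 50 cluster centers before `clf` sees it:

```python
cluster_pipe = Pipeline([
    ('kmeans', KMeans(n_clusters=50, random_state=seed)),  # transform: distances to centers
    ('clf', clf),
])
acc = cross_val_score(cluster_pipe, Xc_train, yc_train, cv=skf, scoring='accuracy')
print('fold accuracies:', acc, 'mean:', acc.mean(), 'std:', acc.std())
```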
Question 2 - [40] - Clustering
For this question we use a modified dataset from the UCI Machine Learning Repository. The data contains records of sales posts on a social media platform. Each record has information about the time the post was published and engagement metrics such as emotional reactions.
Q 2.1 - [1] - Load `dataset_2.csv` as a pandas dataframe, name it `df2`, and display its first 5 rows.
df2 = pd.read_csv('dataset_2.csv')
df2.head()
Q 2.2 - [8] - Do the following:
- How many observations and attributes do you see in the dataset? [1]
- Check for missing values and drop the columns that contain missing values. [1]
- Create a label encoder using `LabelEncoder` from sklearn and convert the categorical variable into numerics. [2]
- Keep a copy of the encoded version of `df2['data_type']` under a different name (e.g., `y`); you will need it in Question 2.6 as the true label. [1]
- Explain why it is a good idea to normalize the data for K-Means clustering. [1]
- Train a `MinMaxScaler` over the full dataset, but not `y`. [2]
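A sketch of the preparation steps; treating "the full dataset but not `y`" as everything except the label column is an assumption:

```python
print(df2.shape)          # (observations, attributes)
df2 = df2.dropna(axis=1)  # drop columns that contain missing values

le = LabelEncoder()
df2['data_type'] = le.fit_transform(df2['data_type'])
y = df2['data_type'].copy()  # true label for Question 2.6

# Scale everything except the label (assumed reading of the instruction)
X2 = MinMaxScaler().fit_transform(df2.drop(columns=['data_type']))
```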
Q 2.3 - [4] - Now that the data is ready, let's use `KMeans` with `random_state=seed` to plot k versus inertia for the model. Take k in `[2, 3, 4, 5, 6, 8]`.
Q 2.4 - [4] - Plot k versus silhouette score for the models fit in the previous question.
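A sketch covering both plots: each k gets its own fitted model, and its inertia and silhouette score are plotted against k:

```python
ks = [2, 3, 4, 5, 6, 8]
models = {k: KMeans(n_clusters=k, random_state=seed).fit(X2) for k in ks}

inertias = [models[k].inertia_ for k in ks]
sils = [silhouette_score(X2, models[k].labels_) for k in ks]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, inertias, marker='o')
ax1.set_xlabel('k'); ax1.set_ylabel('inertia')
ax2.plot(ks, sils, marker='o')
ax2.set_xlabel('k'); ax2.set_ylabel('silhouette score')
plt.tight_layout(); plt.show()
```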
Q 2.5 - [5] - According to the plots of Questions 2.3 and 2.4, select 4 values of k and generate silhouette diagrams for them.
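A sketch using yellowbrick's `SilhouetteVisualizer` (imported in the preliminaries); the four k values are placeholders for your own choices:

```python
for k in [3, 4, 5, 6]:  # placeholders; use the k's your plots support
    viz = SilhouetteVisualizer(KMeans(n_clusters=k, random_state=seed))
    viz.fit(X2)
    viz.show()
```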
Q 2.6 - [5] - Train the model (using the same seed) for the k's that you selected in the previous question and report the model accuracy per k. Hint: In order to calculate the number of correct cluster labels, you can use the labels that you set aside in Question 2.2 as the true labels for this question.
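Cluster indices are arbitrary, so before computing accuracy each cluster must be mapped to a class; one common sketch assigns every cluster the majority true label among its members:

```python
def cluster_accuracy(true_labels, cluster_labels):
    # Map each cluster to the most frequent true label inside it
    mapping = pd.Series(true_labels).groupby(cluster_labels).agg(lambda s: s.mode()[0])
    predicted = pd.Series(cluster_labels).map(mapping)
    return (predicted.values == np.asarray(true_labels)).mean()

for k in [3, 4, 5, 6]:  # your selected k's
    km = KMeans(n_clusters=k, random_state=seed).fit(X2)
    print(k, cluster_accuracy(y, km.labels_))
```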
Q 2.7 - [3] - Based on the insights generated in Questions 2.3 - 2.6, pick two values of k. Explain why, and support your choices with the results.

Q 2.8 - [6] - Do the following:
- In Question 2.2 you used `MinMaxScaler`. This time, instead of `MinMaxScaler`, use `StandardScaler()` to prepare the data once again. Train `KMeans` on this data with `random_state=seed` and the number of clusters equal to your first choice of k. [3]
- Apply a PCA transform to the data using 3 components and create a 3D scatter plot, differentiating data points by color. [3]
Note: If you do not know how to plot in 3D, do 2D for partial mark.
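A sketch of the rescale, cluster, project, and plot chain (Question 2.9 repeats it with the second choice of k); `k1 = 3` is a placeholder:

```python
X2_std = StandardScaler().fit_transform(df2.drop(columns=['data_type']))

k1 = 3  # placeholder for your first choice of k
km = KMeans(n_clusters=k1, random_state=seed).fit(X2_std)

proj = PCA(n_components=3).fit_transform(X2_std)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(proj[:, 0], proj[:, 1], proj[:, 2], c=km.labels_, cmap='tab10', s=10)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2'); ax.set_zlabel('PC3')
plt.show()
```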
Q 2.9 - [3] - Retrain the `KMeans` with the number of clusters equal to your second choice of k, and again apply a PCA transform to the data using 3 components and create a 3D scatter plot, differentiating data points by color.
Note: If you do not know how to plot in 3D, do 2D for partial mark.
Q 2.10 - [1] - After seeing the figures generated in Questions 2.8 and 2.9, what value of k would be your ultimate choice? Explain.
Question 3 - [10 Bonus] - ANN
Let's use the same dataset as in Question 1 to train an ANN to predict player values. You can use either PyTorch or TensorFlow.
Q 3.1 - [2] - Load `dataset_1.csv` as a pandas dataframe, and create the array of features `X` and target `y`. Use `train_test_split` with `random_state=seed, test_size=0.3` twice to get not only a training set and a test set but also a validation set. Use `StandardScaler()` to transform the X's.
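A sketch of the double split (the second split carves a validation set out of the 30% hold-out); reusing the Question 1 cleaning and encoding is an assumption:

```python
df = pd.get_dummies(pd.read_csv('dataset_1.csv'), drop_first=True)  # assumes Question 1 cleaning
X = df.drop(columns=['Value']).values
y = df['Value'].values

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=seed)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.3, random_state=seed)

# Fit the scaler on the training set only, then transform all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(X_train),
                          scaler.transform(X_val),
                          scaler.transform(X_test))
```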
Q 3.2 - [3] - Create an ANN with 4 layers:
- An input layer with 500 nodes
- A hidden layer with 100 nodes
- Another hidden layer with 50 nodes
- A single node output layer
It is up to you where and what type of activation function to use.
How many parameters must your ANN optimize?
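A sketch in Keras; with d input features, the four Dense layers contribute (d·500 + 500) + (500·100 + 100) + (100·50 + 50) + (50·1 + 1) trainable parameters, which `model.summary()` reports:

```python
n_features = X_train.shape[1]

model = Sequential([
    Dense(500, activation='relu', input_shape=(n_features,)),  # input layer, 500 nodes
    Dense(100, activation='relu'),                             # hidden layer
    Dense(50, activation='relu'),                              # hidden layer
    Dense(1),                                                  # single-node output
])
model.summary()  # prints the total number of trainable parameters
```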
Q 3.3 - [3] - Choose `mean_squared_error` for the loss and train the model (with `epochs=20`) over the training and validation sets.
Q 3.4 - [1] - Report both prediction and generalization loss of the model.
Q 3.5 - [1] - Plot the learning curve, i.e., epoch vs. training loss and validation loss.
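A sketch of training, loss reporting, and the learning curve (again reading "prediction" as training loss and "generalization" as test loss; the `adam` optimizer is an assumption):

```python
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_val, y_val))

print('prediction loss:', model.evaluate(X_train, y_train, verbose=0))
print('generalization loss:', model.evaluate(X_test, y_test, verbose=0))

# Learning curve from the history object
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.xlabel('epoch'); plt.ylabel('MSE loss')
plt.show()
```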
Warning!
Upload your complete notebook to the same place on OWL where you initially downloaded it. After uploading, click the "Submit for Grading" button and confirm. Late submissions are not allowed, so please start the submission process 10 minutes before the deadline.