DS3000A - DS9000A Final Exam
Student ID #: XXXXXXXXX
Grade: __ / 100 + 10 Bonus
General Comments
- This exam integrates knowledge and skills acquired in the whole term. You are allowed to use any document and source on your computer and the internet, but you are NOT allowed to share documents, post questions to online forums, or communicate in any way with people inside or outside the class.
- Having any document sharing or communication tool (e.g. Discord, Teams, Outlook, Google Drive, etc.), either web-based or app-based, open on your laptop (or running in the background) is considered an act of cheating, and you will receive 0 pts for the exam.
- To finish the exam in the allotted time, you will have to work efficiently. Read the entirety of each question carefully.
- You need to submit your final notebook by 1:00PM on OWL to the Tests and Quizzes section; this is the same place where you downloaded the empty notebook and data. Late submissions will be scored 0 pts. To avoid technical difficulties, start your submission, at the latest, five to ten minutes before the deadline.
- Some questions demand a written answer - answer these in full English sentences in markdown cells.
- For your figures, ensure that all axes are labeled in an informative way. In some cases you may need to limit the x-axis and/or the y-axis to zoom in for interpretation.
- Ensure that your code runs correctly by choosing "Kernel -> Restart and Run All" before submitting to OWL.
Additional Guidance
- If at any point you are asking yourself "are we supposed to...", write your assumptions clearly and proceed according to those assumptions.
- If you have no clue how to approach a question, skip it and move on. Revisit the skipped one(s) after you are done with the rest.
Preliminaries
Feel free to add stuff.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, mean_squared_error, silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LinearRegression, RidgeCV, SGDClassifier
from sklearn.ensemble import RandomForestRegressor
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
seed=1220
np.random.seed(seed)
import warnings
warnings.filterwarnings('ignore')
Question 1 - Model Selection
You are going to work on a dataset listing the soccer players who participated in the 2022 FIFA World Cup. Our ultimate goal is to find the ML model (amongst four candidates) that best predicts a player's monetary value. The dataset has the following attributes:
- `Age`: Player age in years
- `Nationality`: Player nationality
- `Overall`: Player overall performance score (higher is better)
- `Potential`: Player potential score (higher is better)
- `Club`: Player's home soccer club
- `Value`: Player value, i.e., the amount of money in euros a club should pay in order to purchase the player (higher is better)
- `Wage`: Player stipend in euros (higher is better)
- `Preferred Foot`: Player's preferred foot to play
- `International Reputation`: Player's international fame (higher is better)
- `Weak Foot`: Performance score of the player's weak foot (higher is better)
- `Skill Moves`: Player move skill score (higher is better)
- `Body Type`: Player body type
- `Position`: Position the player holds on the pitch
- `Height`: Player height in centimeters
- `Weight`: Player weight in kilograms

Q 1.1 - [0.5] - Load `dataset_1.csv` as a pandas dataframe, name it `data`, and display its first 5 rows.
data = pd.read_csv('dataset_1.csv')
data.head()
Q 1.2 - [1] - Code to answer the following questions:
- Does the data contain any missing value(s)? How do you take care of them? [0.5]
- Do you see any suspicious value(s) in the statistical summary of the data? If so, explain why they are suspicious and handle them properly. [0.5]
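A minimal sketch of one way to answer both parts (median imputation is an assumption; state and justify your own strategy):

```python
# Count missing values per column
print(data.isna().sum())

# One option: fill numeric gaps with the column median (assumes missingness is sparse)
data = data.fillna(data.median(numeric_only=True))

# Statistical summary to flag suspicious values, e.g. a Height of 0
# or a negative Wage would be physically impossible
data.describe()
```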
Q 1.3 - [2] - The BMI is defined as the body mass divided by the square of the body height, and is expressed in units of kg/m². With this knowledge, see if you can do some meaningful feature extraction.
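A minimal sketch, assuming `Height` is in centimeters and `Weight` in kilograms as described in the attribute list:

```python
# BMI = weight (kg) / height (m)^2; Height is given in cm, so convert to meters first
data['BMI'] = data['Weight'] / (data['Height'] / 100) ** 2
```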
Q 1.4 - [4] - Use `sns.jointplot` to investigate the following relationships and apply proper transformations where needed:
- Value vs. Wage
- Value vs. Overall
- Wage vs. Overall
- Value vs. Potential
- Wage vs. Potential
Note: Where transformation is needed, use `sns.jointplot` twice (i.e., before and after transformation).
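As a sketch: monetary variables such as `Value` and `Wage` are typically right-skewed, so a log transform is a plausible candidate (an assumption to verify against your own plots):

```python
# Before transformation
sns.jointplot(data=data, x='Overall', y='Value')
plt.show()

# Strong right skew in Value suggests a log transform
data['LogValue'] = np.log1p(data['Value'])
sns.jointplot(data=data, x='Overall', y='LogValue')
plt.show()
```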
Q 1.5 - [2] - Output a table reporting, in descending order, the correlations between the numerical features and the target.
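One way to produce such a table, assuming `Value` is the target:

```python
# Pairwise correlations with the target, sorted high to low
corr = data.corr(numeric_only=True)['Value'].drop('Value')
corr.sort_values(ascending=False).to_frame('correlation with Value')
```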
Q 1.6 - [6] - Code the following:
- use pandas `get_dummies` to take care of the categorical variables, if any, [2]
- at this point, before proceeding to the next step, store the dataframe under a unique name because you will need it again in Questions 1.14 and 1.15, [1]
- use `train_test_split` with `random_state=seed` to put aside 20% of the data for testing purposes, [1]
- define an RMSE scorer function. [2]
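A sketch of these steps; `data_for_later` is an assumed name for the stored dataframe, and `Value` is assumed to be the target:

```python
# One-hot encode the categorical columns
data_enc = pd.get_dummies(data, drop_first=True)
data_for_later = data_enc.copy()  # reused in Questions 1.14 and 1.15

X = data_enc.drop(columns=['Value'])
y = data_enc['Value']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed)

# RMSE scorer; greater_is_better=False lets sklearn minimize it
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)
```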
Q 1.7 - [4] - Do the following:
- instantiate sklearn's linear regression with the default arguments and name it `model1`, [0.5]
- run shuffled 5-split KFold cross-validation on `model1` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- get the fitted coefficients from `model1` and use `sns.barplot` to show, in descending order, the 5 features that the model deems the most important. (Take the absolute values of the coefficients because we just want to see the most correlated ones and do not care whether the correlation is positive or negative.) [1]
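A sketch of the cross-validate, fit, and report pattern (the same pattern recurs in Questions 1.8, 1.9, and 1.11); reading "prediction" as training-set RMSE and "generalization" as test-set RMSE is an assumption:

```python
model1 = LinearRegression()

# Shuffled 5-split KFold; scores are negated because the scorer is a loss
kf = KFold(n_splits=5, shuffle=True, random_state=seed)
scores = -cross_val_score(model1, X_train, y_train, cv=kf, scoring=rmse_scorer)
print('CV RMSE per split:', scores)
print('mean:', scores.mean(), 'std:', scores.std())

model1.fit(X_train, y_train)
print('prediction RMSE:', rmse(y_train, model1.predict(X_train)))
print('generalization RMSE:', rmse(y_test, model1.predict(X_test)))

# Top-5 features by absolute coefficient
coefs = pd.Series(np.abs(model1.coef_), index=X_train.columns)
top5 = coefs.sort_values(ascending=False).head(5)
sns.barplot(x=top5.values, y=top5.index)
plt.xlabel('|coefficient|')
plt.show()
```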
Q 1.8 - [5] - Do the following:
- bundle `StandardScaler` with sklearn's cross-validated ridge linear regression (`RidgeCV`) into a `Pipeline` and name it `model2` (for the regressor use the default arguments except `alphas=[1e-10, 1e-5, 1]` and `store_cv_values=True`), [1]
- run shuffled 5-split KFold cross-validation on `model2` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- which entry in the `alphas` list did the model select for training? [0.5]
- get the fitted coefficients from `model2` and use `sns.barplot` to show, in descending order, the 5 features that the model deems the most important. (As before, take the absolute values of the coefficients.) [1]
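A sketch of the pipeline; the selected penalty is exposed as `alpha_` on the fitted `RidgeCV` step:

```python
model2 = Pipeline([
    ('scaler', StandardScaler()),
    # store_cv_values is the argument name assumed here; newer sklearn
    # releases rename it to store_cv_results
    ('ridge', RidgeCV(alphas=[1e-10, 1e-5, 1], store_cv_values=True)),
])

scores = -cross_val_score(model2, X_train, y_train, cv=kf, scoring=rmse_scorer)
print('CV RMSE per split:', scores, 'mean:', scores.mean(), 'std:', scores.std())

model2.fit(X_train, y_train)
print('selected alpha:', model2.named_steps['ridge'].alpha_)
```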
Q 1.9 - [4] - Do the following:
- instantiate sklearn's random forest regressor with the default arguments except `n_jobs=-1` and `random_state=seed`, and name it `model3`, [0.5]
- run shuffled 5-split KFold cross-validation on `model3` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- how many trees does this forest have? [0.5]
- use `barplot` to generate a variable (or feature) importance diagram from this model (limit the plot to the top 5 features). [0.5]
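The cross-validation and RMSE reporting follow the Question 1.7 sketch; what changes is the model and where the importances come from:

```python
model3 = RandomForestRegressor(n_jobs=-1, random_state=seed)
model3.fit(X_train, y_train)
print('number of trees:', model3.n_estimators)

# Tree ensembles expose feature importances directly
imp = pd.Series(model3.feature_importances_, index=X_train.columns)
top5 = imp.sort_values(ascending=False).head(5)
sns.barplot(x=top5.values, y=top5.index)
plt.xlabel('feature importance')
plt.show()
```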
Q 1.10 - [2] - Use the cross-validated grid search function to find the best possible values of `n_estimators` and `max_features` for the random forest. Here are the degrees of freedom to use: for `n_estimators` try `[50, 100, 150]`, and for `max_features` try every possible value.
Note: Only use 50% of total data (randomly sampled using the provided random seed) to fit the grid search function.
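A sketch of the grid search on a 50% random subsample, as the note requires; the `max_features` candidates shown are one reading of "every possible value" and should be treated as an assumption:

```python
# Randomly sample 50% of the data with the provided seed
sample = data_enc.sample(frac=0.5, random_state=seed)
X_s, y_s = sample.drop(columns=['Value']), sample['Value']

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_features': ['sqrt', 'log2', None],  # assumed interpretation
}
grid = GridSearchCV(RandomForestRegressor(n_jobs=-1, random_state=seed),
                    param_grid, scoring=rmse_scorer, cv=5)
grid.fit(X_s, y_s)
print('best parameters:', grid.best_params_)
```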
Q 1.11 - [4] - Do the following:
- take the random forest again, but this time use the best values found in the previous step (again with `n_jobs=-1` and `random_state=seed`), and name it `model4`, [1]
- run shuffled 5-split KFold cross-validation on `model4` and report the cross-validated RMSE of each split as well as their mean and standard deviation, [1]
- fit the model, [0.5]
- report the prediction RMSE score, [0.5]
- report the generalization RMSE score, [0.5]
- use `barplot` to generate a variable (or feature) importance diagram from this model (limit the plot to the top 5 features). [0.5]
Q 1.12 - [1] - Based on your results, what features do you think are the most important ones? Which model do you trust for this purpose and why?
Q 1.13 - [1.5] - If you are asked to choose one final model for production, which one would you select? Explain why. Note: To answer this, take computational complexity into account alongside other criteria.
Q 1.14 - [10] - Take the dataframe that you set aside in Question 1.6 for this question. With `International Reputation` as the label, attempt nonlinear dimension reduction using 3-component t-SNE with `learning_rate='auto'`, `init='random'`, `perplexity=50`, `random_state=seed`, and `n_jobs=-1`. You will probably witness better separation with higher values of `n_iter`; however, for the sake of computation time do not go beyond 1500. There is no deterministic outcome to expect from this question; as long as your implementation is correct, you will get the full mark. Treat this as an unsupervised task. Do the following:
- instantiate a t-SNE model with the proper arguments, [2]
- fit the model properly, [2]
- make a 3D scatter plot of the components you get after dimension reduction and name the axes properly, [3]
- use the label to color-code the data points in your 3D plot, [2]
- why does t-SNE use a t-distribution and not a Gaussian? [1]
Note: If you do not know how to plot in 3D, do 2D for partial mark.
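A sketch; the label is used only to color the points, so the reduction itself stays unsupervised (`data_for_later` is the assumed name from the Question 1.6 sketch):

```python
X_all = data_for_later.drop(columns=['International Reputation'])
labels = data_for_later['International Reputation']

# n_iter is capped at 1500 per the question (newer sklearn renames it max_iter)
tsne = TSNE(n_components=3, learning_rate='auto', init='random',
            perplexity=50, n_iter=1500, random_state=seed, n_jobs=-1)
emb = tsne.fit_transform(X_all)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
sc = ax.scatter(emb[:, 0], emb[:, 1], emb[:, 2], c=labels, cmap='viridis', s=10)
ax.set_xlabel('t-SNE 1'); ax.set_ylabel('t-SNE 2'); ax.set_zlabel('t-SNE 3')
fig.colorbar(sc, label='International Reputation')
plt.show()
```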
Q 1.15 - [5] - Take the dataframe that you set aside in Question 1.6 for this question. We want to do a classification with `International Reputation` as the target class. This is going to be an imbalanced classification, but we don't care. We are interested to see whether we can get a better accuracy score if we do some clustering as a preprocessing step. Do the following:
- what would be the classification baseline accuracy for this dataframe? [1]
- use `train_test_split` with `random_state=seed` to set aside 20% of the data as a test set, [0.5]
- instantiate sklearn's stochastic gradient descent classifier with the proper loss for logistic regression and name it `clf`. Use elasticnet regularization with an `l1_ratio` of 0.7. Set `max_iter=2000`, `tol=1e-3`, `n_jobs=-1`, `random_state=seed`, [2]
- run 5-split `StratifiedKFold` cross-validation on `clf` and report the cross-validated accuracy of each fold as well as their mean and standard deviation. [1.5]
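A sketch; `SGDClassifier` performs logistic regression when its `loss` is the logistic loss (`'log_loss'` in recent sklearn, `'log'` in older versions), and the baseline is the majority-class frequency:

```python
Xc = data_for_later.drop(columns=['International Reputation'])
yc = data_for_later['International Reputation']

# Baseline: always predict the majority class
print('baseline accuracy:', yc.value_counts(normalize=True).max())

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=seed)

clf = SGDClassifier(loss='log_loss', penalty='elasticnet', l1_ratio=0.7,
                    max_iter=2000, tol=1e-3, n_jobs=-1, random_state=seed)

skf = StratifiedKFold(n_splits=5)
acc = cross_val_score(clf, Xc_train, yc_train, cv=skf, scoring='accuracy')
print('fold accuracies:', acc, 'mean:', acc.mean(), 'std:', acc.std())
```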
Q 1.16 - [8] - Do the following:
- bundle a 50-cluster `KMeans` (as a preprocessing step) and `clf` into a pipeline; set `random_state=seed` for the `KMeans`, [3]
- run 5-split `StratifiedKFold` cross-validation on the pipeline and report the cross-validated accuracy of each fold as well as their mean and standard deviation, [2]
- do you find the added preprocessing step effective? Why? [1]
- what transformations did the data undergo through this pipeline? [2]
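A sketch: used as a pipeline step, `KMeans` acts as a transformer that replaces each sample with its distances to the 50 cluster centers before `clf` sees it:

```python
cluster_pipe = Pipeline([
    ('kmeans', KMeans(n_clusters=50, random_state=seed)),  # transform: distances to centers
    ('clf', clf),
])
acc = cross_val_score(cluster_pipe, Xc_train, yc_train, cv=skf, scoring='accuracy')
print('fold accuracies:', acc, 'mean:', acc.mean(), 'std:', acc.std())
```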
Question 2 - [40] - Clustering
For this question we use a modified dataset from the UCI Machine Learning Repository. The data contains records of sales posts on a social media platform. Each record has information about the time the post was published and engagement metrics such as emotional reactions.
Q 2.1 - [1] - Load `dataset_2.csv` as a pandas dataframe, name it `df2`, and display its first 5 rows.
df2 = pd.read_csv('dataset_2.csv')
df2.head()
Q 2.2 - [8] - Do the following:
- How many observations and attributes do you see in the dataset? [1]
- Check for missing values and drop the columns that contain missing values. [1]
- Create a label encoder using `LabelEncoder` from sklearn and convert the categorical variable into numerics. [2]
- Keep a copy of the encoded version of `df2['data_type']` under a different name (e.g., `y`); you will need it in Question 2.6 as the true label. [1]
- Explain why it is a good idea to normalize the data for K-Means clustering. [1]
- Train a `MinMaxScaler` over the full dataset, but not `y`. [2]
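A sketch of the preparation steps; treating "the full dataset but not `y`" as everything except the label column is an assumption:

```python
print(df2.shape)          # (observations, attributes)
df2 = df2.dropna(axis=1)  # drop columns that contain missing values

le = LabelEncoder()
df2['data_type'] = le.fit_transform(df2['data_type'])
y = df2['data_type'].copy()  # true label for Question 2.6

# Scale everything except the label (assumed reading of the instruction)
X2 = MinMaxScaler().fit_transform(df2.drop(columns=['data_type']))
```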
Q 2.3 - [4] - Now that the data is ready, let's use `KMeans` with `random_state=seed` to plot k versus inertia for the model. Take k in `[2, 3, 4, 5, 6, 8]`.
Q 2.4 - [4] - Plot k versus silhouette score for the models fit in the previous question.
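A sketch covering both plots: each k gets its own fitted model, and its inertia and silhouette score are plotted against k:

```python
ks = [2, 3, 4, 5, 6, 8]
models = {k: KMeans(n_clusters=k, random_state=seed).fit(X2) for k in ks}

inertias = [models[k].inertia_ for k in ks]
sils = [silhouette_score(X2, models[k].labels_) for k in ks]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, inertias, marker='o')
ax1.set_xlabel('k'); ax1.set_ylabel('inertia')
ax2.plot(ks, sils, marker='o')
ax2.set_xlabel('k'); ax2.set_ylabel('silhouette score')
plt.tight_layout(); plt.show()
```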
Q 2.5 - [5] - According to the plots of Questions 2.3 and 2.4, select 4 values of k and generate silhouette diagrams for them.
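A sketch using yellowbrick's `SilhouetteVisualizer` (imported in the preliminaries); the four k values are placeholders for your own choices:

```python
for k in [3, 4, 5, 6]:  # placeholders; use the k's your plots support
    viz = SilhouetteVisualizer(KMeans(n_clusters=k, random_state=seed))
    viz.fit(X2)
    viz.show()
```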
Q 2.6 - [5] - Train the model (using the same seed) for the k's that you selected in the previous question and report the model accuracy per k. Hint: In order to calculate the number of correct cluster labels, you can use the labels that you set aside in Question 2.2 as the true labels for this question.
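Cluster indices are arbitrary, so before computing accuracy each cluster must be mapped to a class; one common sketch assigns every cluster the majority true label among its members:

```python
def cluster_accuracy(true_labels, cluster_labels):
    # Map each cluster to the most frequent true label inside it
    mapping = pd.Series(true_labels).groupby(cluster_labels).agg(lambda s: s.mode()[0])
    predicted = pd.Series(cluster_labels).map(mapping)
    return (predicted.values == np.asarray(true_labels)).mean()

for k in [3, 4, 5, 6]:  # your selected k's
    km = KMeans(n_clusters=k, random_state=seed).fit(X2)
    print(k, cluster_accuracy(y, km.labels_))
```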
Q 2.7 - [3] - Based on the insights generated in Questions 2.3 - 2.6, pick two values of k. Explain why, and support your choices with the results.

Q 2.8 - [6] - Do the following:
- In Question 2.2 you used `MinMaxScaler`. This time, instead of `MinMaxScaler`, use `StandardScaler()` to prepare the data once again. Train `KMeans` on this data with `random_state=seed` and the number of clusters equal to your first choice of k. [3]
- Apply a PCA transform to the data using 3 components and create a 3D scatter plot, differentiating data points by color. [3]
Note: If you do not know how to plot in 3D, do 2D for partial mark.
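A sketch of the rescale, cluster, project, and plot chain (Question 2.9 repeats it with the second choice of k); `k1 = 3` is a placeholder:

```python
X2_std = StandardScaler().fit_transform(df2.drop(columns=['data_type']))

k1 = 3  # placeholder for your first choice of k
km = KMeans(n_clusters=k1, random_state=seed).fit(X2_std)

proj = PCA(n_components=3).fit_transform(X2_std)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(proj[:, 0], proj[:, 1], proj[:, 2], c=km.labels_, cmap='tab10', s=10)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2'); ax.set_zlabel('PC3')
plt.show()
```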
Q 2.9 - [3] - Retrain the `KMeans` with the number of clusters equal to your second choice of k, and again apply a PCA transform to the data using 3 components and create a 3D scatter plot, differentiating data points by color.
Note: If you do not know how to plot in 3D, do 2D for partial mark.
Q 2.10 - [1] - After seeing the figures generated in Questions 2.8 and 2.9, what value of k would be your ultimate choice? Explain.
Question 3 - [10 Bonus] - ANN
Let's use the same dataset as in Question 1 to train an ANN to predict player values. You can use either PyTorch or TensorFlow.
Q 3.1 - [2] - Load `dataset_1.csv` as a pandas dataframe, and create the array of features `X` and target `y`. Use `train_test_split` with `random_state=seed, test_size=0.3` twice to get not only a training set and a test set but also a validation set. Use `StandardScaler()` to transform the X's.
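A sketch of the double split (the second split carves a validation set out of the 30% hold-out); reusing the Question 1 cleaning and encoding is an assumption:

```python
df = pd.get_dummies(pd.read_csv('dataset_1.csv'), drop_first=True)  # assumes Question 1 cleaning
X = df.drop(columns=['Value']).values
y = df['Value'].values

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=seed)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.3, random_state=seed)

# Fit the scaler on the training set only, then transform all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(X_train),
                          scaler.transform(X_val),
                          scaler.transform(X_test))
```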
Q 3.2 - [3] - Create an ANN with 4 layers:
- An input layer with 500 nodes
- A hidden layer with 100 nodes
- Another hidden layer with 50 nodes
- A single node output layer
It is up to you where and what type of activation function to use.
How many parameters must your ANN optimize?
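A sketch in Keras; with d input features, the four Dense layers contribute (d·500 + 500) + (500·100 + 100) + (100·50 + 50) + (50·1 + 1) trainable parameters, which `model.summary()` reports:

```python
n_features = X_train.shape[1]

model = Sequential([
    Dense(500, activation='relu', input_shape=(n_features,)),  # input layer, 500 nodes
    Dense(100, activation='relu'),                             # hidden layer
    Dense(50, activation='relu'),                              # hidden layer
    Dense(1),                                                  # single-node output
])
model.summary()  # prints the total number of trainable parameters
```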
Q 3.3 - [3] - Choose `mean_squared_error` for the loss and train the model (with `epochs=20`) over the training and validation sets.
Q 3.4 - [1] - Report both prediction and generalization loss of the model.
Q 3.5 - [1] - Plot the learning curve, i.e., epoch vs. training loss and validation loss.
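A sketch of training, loss reporting, and the learning curve (again reading "prediction" as training loss and "generalization" as test loss; the `adam` optimizer is an assumption):

```python
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_val, y_val))

print('prediction loss:', model.evaluate(X_train, y_train, verbose=0))
print('generalization loss:', model.evaluate(X_test, y_test, verbose=0))

# Learning curve from the history object
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.xlabel('epoch'); plt.ylabel('MSE loss')
plt.show()
```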
Warning!
Upload your complete notebook to the same place on OWL where you initially downloaded it. After uploading, click the "Submit for Grading" button and confirm. Late submissions are not allowed, so please start the submission process 10 minutes before the deadline.