IST 418: Big Data Analytics
- Professor: Christopher Dunham <cndunham@syr.edu>
- Faculty Assistant: David Garcia <dgarciaf@syr.edu>

All general instructions from prior assignments apply here. Make sure to Runtime > Restart and run all prior to File > Download > Download .ipynb and submitting the file to Blackboard.
%%bash
# Do not change or modify this cell
# Install pyspark (output is suppressed, so nothing is printed;
# the install is a no-op if pyspark is already present)
pip install pyspark >& /dev/null
# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=colleges_data_science_programs.csv
if [[ ! -f ./${data_file_1} ]]; then
    # download the data file from github and save it in this colab environment instance
    wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/${data_file_1} >& /dev/null
fi
Unsupervised learning
The colleges_data_science_programs dataset contains information about dozens of "data science" programs across the US.
Question 1:
This dataset contains many columns that we can use to understand how these data science programs differ from one another.
Question 1a
Read the colleges_data_science_programs.csv data file into a data frame named raw_ds_programs_text_df.
# Your code here
# Grading Cell Do not Modify
print("rows:", raw_ds_programs_text_df.count(), ", cols:", len(raw_ds_programs_text_df.columns))
display(raw_ds_programs_text_df.show(5))
Question 1b
- Starting with `raw_ds_programs_text_df`, create a new dataframe named `ds_programs_text_df` which simply adds a column named `text` to `raw_ds_programs_text_df`.
- The `text` column will have the concatenation of the following columns, separated by a space: `program`, `degree`, and `department` (find the appropriate function in the `fn` package).

An example from `ds_programs_text_df` should give you:

ds_programs_text_df.orderBy('id').first().text
'Data Science Masters Mathematics and Statistics'
# Your code here
# Grading Cell Do Not Modify
display(ds_programs_text_df.show(5))
display(ds_programs_text_df.select('text').show(5, truncate=False))
Question 2:
Question 2a
- Create a pipeline named `pipe_features` that creates a new dataframe `ds_features_df`. The `pipe_features` pipeline adds a column `features` to `ds_programs_text_df` that contains the `tfidf` of the `text` column.
- Make sure to create your pipeline using the natural language processing pipeline methodology as outlined in class and demonstrated in the in-class notebooks.
# Create ds_features_df here
# Grading Cell Do Not Modify
display(ds_features_df.show(5))
display(ds_features_df.select("features").show(5, truncate=False))
Question 2b
- Create a pipeline model `pipe_pca` that computes the first two principal components of the `features` column as computed by `pipe_features` and creates a new column named `scores`.
- Use `pipe_pca` to create a dataframe `ds_features_df1` with the columns `id`, `name`, `url`, and `scores`.
# create ds_features_df1 here
# Grading Cell Do Not Modify
display(ds_features_df1.show(5, truncate=False))
display(ds_features_df1.select("scores").show(5, truncate=False))
Question 3:
In this question you will write code that makes recommendations on programs closest to a program of interest.
- Create a function named `get_nearest_programs` that returns the 3 closest programs to a program of interest.
- The `get_nearest_programs` function shall take 1 argument: `program_of_interest`.
- The `get_nearest_programs` function shall return the 3 programs (as defined by the `name` column) closest to the program argument, as measured by L2 Euclidean distance. Do not return the program of interest argument as one of the names.
- Use the in-class recommender system case study as a reference for how to implement this. Use the pipeline and resulting `scores` column from the previous question as a starting point.
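The distance logic itself can be sketched without Spark; the program names and 2-D scores below are invented, and the real function would pull its scores from the pipeline of question 2b.

```python
import math

# made-up (name -> PCA score) pairs standing in for the real scores column
scores = {
    "Harvard University": (0.10, 0.20),
    "Syracuse University": (0.12, 0.18),
    "MIT": (0.90, 0.80),
    "Cornell University": (0.11, 0.25),
    "Stanford University": (0.50, 0.55),
}

def get_nearest_programs(program_of_interest, k=3):
    """Return the k names closest in L2 distance, excluding the query itself."""
    target = scores[program_of_interest]
    others = [name for name in scores if name != program_of_interest]
    return sorted(others, key=lambda p: math.dist(target, scores[p]))[:k]

print(get_nearest_programs("Harvard University"))
# ['Syracuse University', 'Cornell University', 'Stanford University']
```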
# Your code here
# Grading Cell Do Not Modify
get_nearest_programs('Harvard University')
Question 4:
- Create two Pandas dataframes `pc1_pd` and `pc2_pd` with the columns `word` and `abs_loading` that contain the top 5 sorted absolute values of loadings for the purposes of feature selection. Use `pipe_pca` from question 2b for your analysis. All data for your analysis shall be accessed through `pipe_pca`.
- Provide an interpretation of the loadings based on information provided in lecture, taking into account covariance or correlation.
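One hedged way to shape the output, using an invented vocabulary and loading matrix as stand-ins for what the fitted pipeline stages would expose:

```python
import numpy as np
import pandas as pd

# stand-ins: rows = words, columns = principal components
vocab = ["data", "science", "business", "analytics", "statistics", "masters"]
pc = np.array([
    [ 0.60, -0.10],
    [ 0.55,  0.05],
    [-0.30,  0.70],
    [-0.25,  0.65],
    [ 0.40, -0.20],
    [ 0.05,  0.10],
])

def top_loadings(component, n=5):
    """Top-n words by absolute loading on one component."""
    df = pd.DataFrame({"word": vocab, "abs_loading": np.abs(pc[:, component])})
    return df.sort_values("abs_loading", ascending=False).head(n).reset_index(drop=True)

pc1_pd = top_loadings(0)
pc2_pd = top_loadings(1)
print(pc1_pd)
```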
# your code here
# Grading cell do not modify
display(pc1_pd.head())
display(pc2_pd.head())
Your interpretation here:
Question 5:
- Create a new pipeline called pipe_pca1 where you fit the maximum possible number of principal components for this dataset.
- Create a scree plot and a plot of cumulative variance explained (exactly 2 plots).
- Describe 2 things. First, how many principal components you were able to create (the maximum number). Second, based on either the scree plot or the cumulative variance explained plot, describe how many principal components you would use if you were building a supervised machine learning model. Use bullets in the markdown cell to separate your 2 answers.
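The two plots can be sketched from an explained-variance array; the numbers below are invented stand-ins for the fitted PCA model's explained-variance values.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

explained = np.array([0.45, 0.25, 0.12, 0.08, 0.05, 0.05])  # invented values
components = range(1, len(explained) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(components, explained, marker="o")
ax1.set(title="Scree plot", xlabel="Principal component",
        ylabel="Variance explained")

cumulative = np.cumsum(explained)
ax2.plot(components, cumulative, marker="o")
ax2.set(title="Cumulative variance explained", xlabel="Principal component",
        ylabel="Cumulative proportion")
fig.savefig("scree_and_cumulative.png")
```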
# your code here
Your 2 explanations here
Question 6:
- Create a pipeline named pipe_pca2 that computes PCA scores for the first 2 principal components. Add a kmeans object to pipe_pca2 and compute kmeans with k = 5.
- Create a scatter plot which displays PC2 scores (y-axis) vs. PC1 scores (x-axis) where each point is colored by the cluster assignment. Include a plot legend.
- Look for interesting patterns in the clusters and label the points to learn something surprising or interesting about the data set. One example of what I am looking for is in the case study notebook, which has a plot showing that IST-718 is very close to IST-719, while IST-718 / IST-719 are far away from other courses. You are free to explore as you see fit; essentially, we are looking for you to find an interesting pattern within the clusters and label the points such that you learn something about the data set.
- Describe what surprising or interesting fact you learned. Your plot should be easy to read and labels should not be so dense that they are hard to read / on top of each other.
- The recommender notebook uses a normalizer object to produce the IST-718 / IST-719 plot mentioned above. The normalizer has the effect of scaling each data observation into a unit vector. This may or may not be useful to improve your visualization - you will have to try it and see if it helps. ONLY use the normalizer for visualizations in this assignment. The normalizer should not be included in any pipeline except if it is being used for visualization purposes.
# your code here
Your explanation here
Question 7:
- Starting with `pipe_pca1` from question 5, transform the pipeline and save the resulting dataframe to a variable named `pca_fun`.
- Extract the output of the standard scaler output column from the first row of `pca_fun` and store it in a variable named `row1_centered`.
- Manually compute 5 PCA scores by projecting `row1_centered` onto the first 5 loading vectors computed in your PCA object. Save the 5 projected PCA scores in a variable called `proj_scores`.
- Extract the first 5 PCA scores from the first row of the `pca_fun` scores column and save them in a variable named `pca_fun_scores`.
- The grading cell prints `proj_scores` and `pca_fun_scores` right next to each other. Compare `proj_scores` to `pca_fun_scores` and explain why they are the same or different.
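The check this question is after can be sketched in plain NumPy with invented data: projecting a centered row onto the loading vectors reproduces that row's PCA scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))          # invented data matrix
Xc = X - X.mean(axis=0)               # center, as a (mean-only) scaler would

# loading vectors = right singular vectors of the centered data, one per column
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
loadings = vt.T                       # shape (6, 6)

scores = Xc @ loadings                # PCA scores for every row
row1_centered = Xc[0]
proj_scores = row1_centered @ loadings[:, :5]   # manual projection, first 5 PCs

print(np.allclose(proj_scores, scores[0, :5]))  # True: identical by construction
```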
# your code here
# Grading Cell - do not modify
print(proj_scores)
print(pca_fun_scores)
Your explanation here: