IST 418: Big Data Analytics
- Professor: Christopher Dunham <cndunham@syr.edu>
- Faculty Assistant: David Garcia <dgarciaf@syr.edu>

All general instructions from prior assignments apply here. Make sure to Runtime > Restart and run all prior to File > Download > Download .ipynb and submitting the file to Blackboard.
%%bash
# Do not change or modify this cell
# Install pyspark (output is suppressed, so nothing is printed;
# the install is a no-op if pyspark is already present)
pip install pyspark >& /dev/null
# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=colleges_data_science_programs.csv
if [[ ! -f ./${data_file_1} ]]; then
    # download the data file from github and save it in this colab environment instance
    wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/${data_file_1} >& /dev/null
fi
Unsupervised learning
The colleges_data_science_programs dataset contains information about dozens of "data science" programs across the US.
Question 1:
This dataset contains many columns that we can use to understand how these data science programs differ from one another.
Question 1a
Read the colleges_data_science_programs.csv data file into a data frame named raw_ds_programs_text_df.
# Your code here
# Grading Cell Do not Modify
print("rows:", raw_ds_programs_text_df.count(), ", cols:", len(raw_ds_programs_text_df.columns))
display(raw_ds_programs_text_df.show(5))
Question 1b
- Starting with `raw_ds_programs_text_df`, create a new dataframe named `ds_programs_text_df` which simply adds a column named `text` to `raw_ds_programs_text_df`.
- The `text` column will have the concatenation of the following columns, separated by a space: `program`, `degree`, and `department` (find the appropriate function in the `fn` package).

An example from `ds_programs_text_df` should give you:

ds_programs_text_df.orderBy('id').first().text
'Data Science Masters Mathematics and Statistics'
# Your code here
# Grading Cell Do Not Modify
display(ds_programs_text_df.show(5))
display(ds_programs_text_df.select('text').show(5, truncate=False))
Question 2:
Question 2a
- Create a pipeline named `pipe_features` that creates a new dataframe `ds_features_df`. The `pipe_features` pipeline adds a column `features` to `ds_programs_text_df` that contains the `tfidf` of the `text` column.
- Make sure to create your pipeline using the natural language processing pipeline methodology as outlined in class and demonstrated in the in-class notebooks.
# Create ds_features_df here
# Grading Cell Do Not Modify
display(ds_features_df.show(5))
display(ds_features_df.select("features").show(5, truncate=False))
Question 2b
- Create a pipeline model `pipe_pca` that computes the first two principal components of the `features` column as computed by `pipe_features` and creates a new column named `scores`.
- Use `pipe_pca` to create a dataframe `ds_features_df1` with the columns `id`, `name`, `url`, and `scores`.
# create ds_features_df1 here
# Grading Cell Do Not Modify
display(ds_features_df1.show(5, truncate=False))
display(ds_features_df1.select("scores").show(5, truncate=False))
Question 3:
In this question you will write code that makes recommendations on programs closest to a program of interest.
- Create a function named `get_nearest_programs` that returns the 3 closest programs to a program of interest.
- The `get_nearest_programs` function shall take 1 argument: `program_of_interest`.
- The `get_nearest_programs` function shall return the 3 programs (as defined by the `name` column) closest to the program argument, as measured by L2 Euclidean distance. Do not return the program of interest argument as one of the names.
- Use the in-class recommender system case study as a reference for how to implement this. Use the pipeline and resulting `scores` column from the previous question as a starting point.
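The distance logic itself can be sketched without Spark; the program names and 2-D scores below are invented, and the real function would pull its scores from the pipeline of question 2b.

```python
import math

# made-up (name -> PCA score) pairs standing in for the real scores column
scores = {
    "Harvard University": (0.10, 0.20),
    "Syracuse University": (0.12, 0.18),
    "MIT": (0.90, 0.80),
    "Cornell University": (0.11, 0.25),
    "Stanford University": (0.50, 0.55),
}

def get_nearest_programs(program_of_interest, k=3):
    """Return the k names closest in L2 distance, excluding the query itself."""
    target = scores[program_of_interest]
    others = [name for name in scores if name != program_of_interest]
    return sorted(others, key=lambda p: math.dist(target, scores[p]))[:k]

print(get_nearest_programs("Harvard University"))
# ['Syracuse University', 'Cornell University', 'Stanford University']
```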
# Your code here
# Grading Cell Do Not Modify
get_nearest_programs('Harvard University')
Question 4:
- Create two Pandas dataframes `pc1_pd` and `pc2_pd` with the columns `word` and `abs_loading` that contain the top 5 sorted absolute values of loadings for the purposes of feature selection. Use `pipe_pca` from question 2b for your analysis. All data for your analysis shall be accessed through `pipe_pca`.
- Provide an interpretation of the loadings based on information provided in lecture, taking into account covariance or correlation.
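One hedged way to shape the output, using an invented vocabulary and loading matrix as stand-ins for what the fitted pipeline stages would expose:

```python
import numpy as np
import pandas as pd

# stand-ins: rows = words, columns = principal components
vocab = ["data", "science", "business", "analytics", "statistics", "masters"]
pc = np.array([
    [ 0.60, -0.10],
    [ 0.55,  0.05],
    [-0.30,  0.70],
    [-0.25,  0.65],
    [ 0.40, -0.20],
    [ 0.05,  0.10],
])

def top_loadings(component, n=5):
    """Top-n words by absolute loading on one component."""
    df = pd.DataFrame({"word": vocab, "abs_loading": np.abs(pc[:, component])})
    return df.sort_values("abs_loading", ascending=False).head(n).reset_index(drop=True)

pc1_pd = top_loadings(0)
pc2_pd = top_loadings(1)
print(pc1_pd)
```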
# your code here
# Grading cell do not modify
display(pc1_pd.head())
display(pc2_pd.head())
Your interpretation here:
Question 5:
- Create a new pipeline called pipe_pca1 where you fit the maximum possible number of principal components for this dataset.
- Create a scree plot and a plot of cumulative variance explained (exactly 2 plots).
- Describe 2 things. First, how many principal components you were able to create (the maximum number). Second, based on either the scree plot or the cumulative variance explained plot, describe how many principal components you would use if you were building a supervised machine learning model. Use bullets in the markdown cell to separate your 2 answers.
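The two plots can be sketched from an explained-variance array; the numbers below are invented stand-ins for the fitted PCA model's explained-variance values.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

explained = np.array([0.45, 0.25, 0.12, 0.08, 0.05, 0.05])  # invented values
components = range(1, len(explained) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(components, explained, marker="o")
ax1.set(title="Scree plot", xlabel="Principal component",
        ylabel="Variance explained")

cumulative = np.cumsum(explained)
ax2.plot(components, cumulative, marker="o")
ax2.set(title="Cumulative variance explained", xlabel="Principal component",
        ylabel="Cumulative proportion")
fig.savefig("scree_and_cumulative.png")
```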
# your code here
Your 2 explanations here
Question 6:
- Create a pipeline named pipe_pca2 that computes PCA scores for the first 2 principal components. Add a kmeans object to pipe_pca2 and compute kmeans with k = 5.
- Create a scatter plot which displays PC2 scores (y-axis) vs. PC1 scores (x-axis) where each point is colored by the cluster assignment. Include a plot legend.
- Look for interesting patterns in the clusters and label the points to learn something surprising or interesting about the data set. One example of what I am looking for is in the case study notebook, which has a plot showing that IST-718 is very close to IST-719, while IST-718 / IST-719 are far away from other courses. You are free to explore as you see fit; essentially, we are looking for you to find an interesting pattern within the clusters and label the points such that you learn something about the data set.
- Describe what surprising or interesting fact you learned. Your plot should be easy to read and labels should not be so dense that they are hard to read / on top of each other.
- The recommender notebook uses a normalizer object to produce the IST-718 / IST-719 plot mentioned above. The normalizer has the effect of scaling each data observation into a unit vector. This may or may not be useful to improve your visualization - you will have to try it and see if it helps. ONLY use the normalizer for visualizations in this assignment. The normalizer should not be included in any pipeline except if it is being used for visualization purposes.
# your code here
Your explanation here
Question 7:
- Starting with `pipe_pca1` from question 5, transform the pipeline and save the resulting dataframe to a variable named `pca_fun`.
- Extract the output of the standard scaler output column from the first row of `pca_fun` and store it in a variable named `row1_centered`.
- Manually compute 5 PCA scores by projecting `row1_centered` onto the first 5 loading vectors computed in your PCA object. Save the 5 projected PCA scores in a variable called `proj_scores`.
- Extract the first 5 PCA scores from the first row of the `pca_fun` scores column and save them in a variable named `pca_fun_scores`.
- The grading cell prints `proj_scores` and `pca_fun_scores` right next to each other. Compare `proj_scores` to `pca_fun_scores` and explain why they are the same or different.
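The check this question is after can be sketched in plain NumPy with invented data: projecting a centered row onto the loading vectors reproduces that row's PCA scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))          # invented data matrix
Xc = X - X.mean(axis=0)               # center, as a (mean-only) scaler would

# loading vectors = right singular vectors of the centered data, one per column
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
loadings = vt.T                       # shape (6, 6)

scores = Xc @ loadings                # PCA scores for every row
row1_centered = Xc[0]
proj_scores = row1_centered @ loadings[:, :5]   # manual projection, first 5 PCs

print(np.allclose(proj_scores, scores[0, :5]))  # True: identical by construction
```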
# your code here
# Grading Cell - do not modify
print(proj_scores)
print(pca_fun_scores)
Your explanation here: