IST 418: Big Data Analytics
- Professor: Christopher Dunham <cndunham@syr.edu>
- Faculty Assistant: David Garcia <dgarciaf@syr.edu>
General instructions:
- You are welcome to discuss the problems with your classmates, but you are not allowed to copy any part of your answers from them. Short code snippets from the internet are allowed. Code from the class textbooks or class-provided code may be copied in its entirety.
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for runtime errors with the following procedure: `Runtime` > `Restart and run all`. Any runtime error will result in a minimum penalty of half off.
- All plots shall include a descriptive title and axis labels. Plot legends shall be included where possible. Unless stated otherwise, plots can be made using any Python plotting package. It is understood that Spark data structures must be converted to something like NumPy or pandas prior to making plots. All mathematical operations, filtering, selection, etc., required by a homework question shall be performed in Spark prior to converting to NumPy or pandas.
- You are free to add additional code cells around the cells marked `your code here`.
- We reserve the right to take points off for operations that are extremely inefficient or "heavyweight". This is a big data class, and extremely inefficient operations make a big difference when scaling up to large data sets. For example, the Spark DataFrame `collect()` method is a very heavyweight operation and should not be used unless there is a real need for it. An example where `collect()` might be needed is preparing to make a plot after filtering a Spark DataFrame.
- `import *` is not allowed because it is considered very bad coding practice and in some cases can cause a significant delay in loading imports (which slows down the grading process). For example, the statement `from sympy import *` is not allowed. You must import the specific packages that you need.
- The graders reserve the right to deduct points for subjective issues we see in your code. For example, if we ask you to create a pandas DataFrame to display values from an investigation and you hard-code the values, we will take points off. This is only one of many things we could find in reviewing your code. In general, write your code as if you were submitting it for a peer code review in industry.
- Level of effort is part of our subjective grading. For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements. In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort. We feel that the students who did a better job deserve a better grade. We reserve the right to invoke level of effort grading at any time.
- Only use Spark, Spark machine learning, Spark DataFrames, RDDs, and MapReduce to solve all problems unless instructed otherwise.
- Your notebook must run from start to finish without requiring manual input by the graders. For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps. In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.
Medical Insurance Analysis
This assignment uses a medical insurance dataset with the following columns:
- age: age of primary beneficiary
- sex: female, male
- bmi: body mass index (kg / m^2), an objective measure of body weight relative to height; the ideal range is 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether the beneficiary smokes (yes / no)
- region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
- charges: Individual medical costs billed by health insurance
Note that you are required to split data into train / test / validation sets as needed to use in the pipelines created in this and future assignments.
%%bash
# Do not change or modify this cell
# Install pyspark (output is suppressed, so nothing is printed even if
# pyspark is already installed)
pip install pyspark &> /dev/null
# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=insurance.csv
if [[ ! -f ./${data_file_1} ]]; then
# download the data file from github and save it in this colab environment instance
wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/${data_file_1} &> /dev/null
fi
Question 1 (10 pts):
Read the data into a Spark DataFrame named `medical_df`. Columns should be named `age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges`. Print the resulting DataFrame schema and shape such that they are easy for the graders to find and interpret. Verify that your schema makes sense. If the schema does not make sense, fix it.
# Your code here
Grading Feedback Cell
Question 2 (10 pts):
Explore the data. Make a pair plot. Use a Spark built-in method to provide a statistical summary of `medical_df`. The resulting statistical summary shall be contained in a Spark DataFrame. Use exactly one Spark call to create the summary DataFrame.
Explain the following 2 items:
- What variables are positively and negatively correlated with charges.
- Provide a brief summary that highlights what is interesting about the summary statistics.
# your code here
Your explanation here
Grading Feedback Cell
Question 3 (10 pts):
Do some data exploration. Create 2 plots which highlight something interesting or surprising about the data. Provide descriptions of the 2 plots that you made: why did you make these plots, and what is interesting about them? You will be graded relative to the rest of the class on this question.
# Your code here
Your explanation here:
Grading Feedback Cell
Question 4 (10 pts):
In this question you will perform feature engineering. The DataFrame `medical_df` is not ready for use with linear regression because some columns are categorical. Create a new DataFrame named `fe_medical_df` (feature-engineered medical DataFrame) which adds new feature-engineered columns. Your feature engineering should take into account the best practices outlined in lecture. Encode the categorical data as columns of binary predictors. The feature engineering you perform in this question will have a direct effect on how well your model performs below. Encapsulate your feature engineering in a SparkML pipeline named `fe_pipe` (feature engineering pipe). `fe_pipe` shall save all resulting output data in a column named `features`. Future questions shall use the `features` column as regression training data.
Perform ONLY the requested transformations, no more and no less. Do not use a vector assembler in this problem.
Provide an explanation of exactly what feature engineering transformations you made for every column you transformed. We expect to see a separate explanation for each and every transformation performed.
# your code here
# Grading cell do not modify
display(fe_medical_df.show(10))
fe_medical_df.printSchema()
Your explanation here:
Grading Feedback Cell
Question 5 (10 pts):
Part 1: Create a new pipeline named `lr_pipe` which encapsulates `fe_pipe`, a vector assembler, and a linear regression object. Linear regression support objects are anything you need over and above what is in `fe_pipe` in order to successfully run linear regression.
Part 2: Write code that prints out the stage names of `lr_pipe` and `fe_pipe` such that they are easy for the graders to find and interpret. We don't expect to have to read code to interpret your results.
Part 3: Train and test `lr_pipe` using `medical_df`. To evaluate `lr_pipe`, first write a Spark expression to compute MSE on the resulting fitted model. Second, use a built-in Spark evaluator object to compute MSE. Print the results from both your expression and the built-in evaluator such that they are easy for the graders to find, interpret, and distinguish between the 2 test cases. We don't expect to have to reverse engineer your code to interpret the results.
# your part 1 code here
# your part 2 code here
# your part 3 code here
Grading Feedback Cell
Question 6 (10 pts):
The goal of this question is to build a pipeline which can be used to perform inference. Create a new pipeline named `inf_pipe` which encapsulates `fe_pipe` and adds new SparkML statistical components, linear regression support components, and a linear regression object. The goal is to compare the linear regression coefficients with each other in order to learn something about the data set. Exclude any features which are not useful to the analysis. `inf_pipe` shall use the `charges` column as the target. Score `inf_pipe` using a Spark built-in evaluator with the MSE scoring metric. The output transformed DataFrame shall be named `inf_medical_df`. The resulting `inf_pipe` shall include exactly one vector assembler.
Explanation: First, explain what SparkML statistical component(s) you added to `inf_pipe` that were needed in order to compare the linear regression coefficients with each other. Second, explain what features you excluded from the analysis (if any) and why.
# your code here
# Grading cell do not modify
display(inf_medical_df.show(10))
Grading Feedback Cell
Your explanation here:
Grading Feedback Cell
Question 7 (10 pts):
Extract the linear regression coefficients from `inf_pipe` and collect them in a pandas DataFrame named `inf_pd`. The `inf_pd` DataFrame shall have 2 columns: `predictor` and `value`. Load the `predictor` column with the names of the coefficients and the `value` column with the coefficient values from the linear regression model. Sort `inf_pd` by the `value` column in ascending order. Describe the most important positive and negative predictors found.
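The target shape of `inf_pd` can be sketched as below. The coefficient names and values here are hypothetical placeholders; in the assignment they come from the fitted model's `coefficients` attribute paired with the assembler's input column order.

```python
import pandas as pd

# Hypothetical coefficient names/values for illustration only
names = ["age", "bmi", "smoker_yes"]
values = [260.0, 320.0, 23800.0]

inf_pd = (pd.DataFrame({"predictor": names, "value": values})
            .sort_values("value", ascending=True)
            .reset_index(drop=True))
print(inf_pd)
```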
# your code here
# Grading cell do not modify
display(inf_pd)
Grading Feedback Cell
Your explanation here:
Grading Feedback Cell
Question 8 (10 pts):
Create a new DataFrame named `strat_med_df` (stratified medical DataFrame) by adding a new column named `rate_pool` to `fe_medical_df`. Create the `rate_pool` column by stratifying the `charges` column into charges greater than and less than the median of the `charges` column. Calculate the median using Spark and save it in a Python variable named `rate_median`. Assign an integer 0 to charges that are less than or equal to the median, and a 1 to charges greater than the median.
# your code here
# grading cell do not modify
display(strat_med_df.show(10))
print(rate_median)
Grading Feedback Cell
Question 9 (10 pts):
Create a new pipeline named `strat_pipe` which predicts the `rate_pool` column in `strat_med_df`. Train and test `strat_pipe` using `strat_med_df`. Score `strat_pipe` using a built-in Spark evaluator, 3-fold cross validation, and the AUC (area under the ROC curve) scoring metric. Use an empty `ParamGridBuilder` (empty grid) in the cross validator.
# your code here
Grading Feedback Cell
Question 10 (10 pts):
Create an ROC plot from the results of question 9 above. Explain the process of how an ROC curve is created (don't explain how your code works; explain how an ROC curve is created). Describe the main points of how an ROC curve is created and convince us that you understand the high-level process.
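As a reference for the process being asked about: an ROC curve is traced by sweeping a decision threshold over the model's scores and recording the false-positive rate and true-positive rate at each threshold. The hand-rolled sketch below, with hypothetical scores and labels, makes each step explicit.

```python
# Hypothetical scores/labels for illustration; in the assignment these come
# from the fitted classifier's predicted probabilities on the test set
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs by sweeping a threshold over the scores."""
    P = sum(labels)            # number of actual positives
    N = len(labels) - P        # number of actual negatives
    pts = []
    # Highest threshold first: the curve grows from (0, 0) toward (1, 1)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return [(0.0, 0.0)] + pts

pts = roc_points(scores, labels)
print(pts)
```

Plotting these points with FPR on the x-axis and TPR on the y-axis gives the ROC curve; the area under it is the AUC metric used in question 9.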
# your code here