IST 418: Big Data Analytics
- Professor: Christopher Dunham <cndunham@syr.edu>
- Faculty Assistant: David Garcia <dgarciaf@syr.edu>
General instructions:
- You are welcome to discuss the problems with your classmates, but you are not allowed to copy any part of your answers from them. Short code snippets from the internet are allowed. Code from the class textbooks or class-provided code may be copied in its entirety.
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for runtime errors with the following procedure: `Runtime` > `Restart and run all`. Any runtime error will result in a minimum penalty of half off.
- All plots shall include a descriptive title and axis labels. Plot legends shall be included where possible. Unless stated otherwise, plots can be made using any Python plotting package. It is understood that Spark data structures must be converted to something like NumPy or pandas prior to making plots. All mathematical operations, filtering, selection, etc., required by a homework question shall be performed in Spark prior to converting to NumPy or pandas.
- You are free to add additional code cells around the cells marked `your code here`.
- We reserve the right to take points off for operations that are extremely inefficient or "heavyweight". This is a big data class, and extremely inefficient operations make a big difference when scaling up to large data sets. For example, the Spark DataFrame `collect()` method is a very heavyweight operation and should not be used unless there is a real need for it. An example where `collect()` might be needed is preparing to make a plot after filtering a Spark DataFrame.
- `import *` is not allowed because it is considered very bad coding practice and in some cases can cause a significant delay in loading imports (which slows down the grading process). For example, the statement `from sympy import *` is not allowed. You must import the specific packages that you need.
- The graders reserve the right to deduct points for subjective issues we see in your code. For example, if we ask you to create a pandas DataFrame to display values from an investigation and you hard-code the values, we will take points off. This is only one of many things we could find in reviewing your code. In general, write your code as if you were submitting it for a peer code review in industry.
- Level of effort is part of our subjective grading. For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements. In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort. We feel that the students who did a better job deserve a better grade. We reserve the right to invoke level of effort grading at any time.
- Only use Spark, Spark machine learning, Spark DataFrames, RDDs, and MapReduce to solve all problems unless instructed otherwise.
- Your notebook must run from start to finish without requiring manual input by the graders. For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps. In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.
Medical Insurance Analysis
This assignment uses a medical insurance dataset with the following columns:
- age: age of primary beneficiary
- sex: female, male
- bmi: body mass index (kg / m^2), an objective measure of body weight relative to height; the ideal range is 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether the beneficiary smokes (yes / no)
- region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
- charges: Individual medical costs billed by health insurance
Note that you are required to split data into train / test / validation sets as needed to use in the pipelines created in this and future assignments.
%%bash
# Do not change or modify this cell
# Install pyspark (output is suppressed, so nothing is printed even if
# pyspark is already installed)
pip install pyspark &> /dev/null
# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=insurance.csv
if [[ ! -f ./${data_file_1} ]]; then
# download the data file from github and save it in this colab environment instance
wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/${data_file_1} &> /dev/null
fi
Question 1 (10 pts):
Read the data into a Spark DataFrame named `medical_df`. Columns should be named `age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges`. Print the resulting DataFrame schema and shape such that they are easy for the graders to find and interpret. Verify that your schema makes sense. If the schema does not make sense, fix it.
# Your code here
Grading Feedback Cell
Question 2 (10 pts):
Explore the data. Make a pair plot. Use a Spark built-in method to provide a statistical summary of `medical_df`. The resulting statistical summary shall be contained in a Spark DataFrame. Use exactly one Spark call to create the summary DataFrame.
Explain the following 2 items:
- What variables are positively and negatively correlated with charges.
- Provide a brief summary that highlights what is interesting about the summary statistics.
# your code here
Your explanation here
Grading Feedback Cell
Question 3 (10 pts):
Do some data exploration. Create 2 plots which highlight something interesting or surprising about the data. Provide descriptions of the 2 plots that you made: why did you make these plots, and what is interesting about them? You will be graded relative to the rest of the class on this question.
# Your code here
Your explanation here:
Grading Feedback Cell
Question 4 (10 pts):
In this question you will perform feature engineering. The DataFrame `medical_df` is not ready for use with linear regression because some columns are categorical. Create a new DataFrame named `fe_medical_df` (feature-engineered medical DataFrame) which adds new feature-engineered columns. Your feature engineering should take into account the best practices outlined in lecture. Encode the categorical data as columns of binary predictors. The feature engineering you perform in this question will have a direct effect on how well your model performs below. Encapsulate your feature engineering in a SparkML pipeline named `fe_pipe` (feature engineering pipe). `fe_pipe` shall save all resulting output data in a column named `features`. Future questions shall use the `features` column as regression training data.
Perform ONLY the requested transformations, no more and no less. Do not use a vector assembler in this problem.
Provide an explanation of exactly what feature engineering transformations you made for every column you transformed. We expect to see a separate explanation for each and every transformation performed.
# your code here
# Grading cell do not modify
display(fe_medical_df.show(10))
fe_medical_df.printSchema()
Your explanation here:
Grading Feedback Cell
Question 5 (10 pts):
Part 1: Create a new pipeline named `lr_pipe` which encapsulates `fe_pipe`, a vector assembler, and a linear regression object. Linear regression support objects are anything you need over and above what is in `fe_pipe` in order to successfully run linear regression.
Part 2: Write code that prints out the stage names of `lr_pipe` and `fe_pipe` such that they are easy for the graders to find and interpret. We don't expect to have to read code to interpret your results.
Part 3: Train and test `lr_pipe` using `medical_df`. To evaluate `lr_pipe`, first write a Spark expression to compute MSE on the resulting fitted model. Second, use a built-in Spark evaluator object to compute MSE. Print the results from both your expression and the built-in evaluator such that they are easy for the graders to find, interpret, and distinguish between the 2 test cases. We don't expect to have to reverse engineer your code to interpret the results.
# your part 1 code here
# your part 2 code here
# your part 3 code here
Grading Feedback Cell
Question 6 (10 pts):
The goal of this question is to build a pipeline which can be used to perform inference. Create a new pipeline named `inf_pipe` which encapsulates `fe_pipe` and adds new SparkML statistical components, linear regression support components, and a linear regression object. The goal is to compare the linear regression coefficients with each other in order to learn something about the data set. Exclude any features which are not useful to the analysis. `inf_pipe` shall use the `charges` column as the target. Score `inf_pipe` using a Spark built-in evaluator with the MSE scoring metric. The output transformed DataFrame shall be named `inf_medical_df`. The resulting `inf_pipe` shall include exactly one vector assembler.
Explanation: First, explain what SparkML statistical component(s) you added to `inf_pipe` that were needed in order to compare the linear regression coefficients with each other. Second, explain what features you excluded from the analysis (if any) and why.
# your code here
# Grading cell do not modify
display(inf_medical_df.show(10))
Grading Feedback Cell
Your explanation here:
Grading Feedback Cell
Question 7 (10 pts):
Extract the linear regression coefficients from `inf_pipe` and collect them in a pandas DataFrame named `inf_pd`. The `inf_pd` DataFrame shall have 2 columns: `predictor` and `value`. Load the `predictor` column with the names of the coefficients and the `value` column with the coefficient values from the linear regression model. Sort `inf_pd` by the `value` column in ascending order. Describe the most important positive and negative predictors found.
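The target shape of `inf_pd` can be sketched as below. The coefficient names and values here are hypothetical placeholders; in the assignment they come from the fitted model's `coefficients` attribute paired with the assembler's input column order.

```python
import pandas as pd

# Hypothetical coefficient names/values for illustration only
names = ["age", "bmi", "smoker_yes"]
values = [260.0, 320.0, 23800.0]

inf_pd = (pd.DataFrame({"predictor": names, "value": values})
            .sort_values("value", ascending=True)
            .reset_index(drop=True))
print(inf_pd)
```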
# your code here
# Grading cell do not modify
display(inf_pd)
Grading Feedback Cell
Your explanation here:
Grading Feedback Cell
Question 8 (10 pts):
Create a new DataFrame named `strat_med_df` (stratified medical DataFrame) by adding a new column named `rate_pool` to `fe_medical_df`. Create the `rate_pool` column by stratifying the `charges` column into charges greater than and less than the median of the `charges` column. Calculate the median using Spark and save it in a Python variable named `rate_median`. Assign an integer 0 to charges that are less than or equal to the median, and a 1 to charges greater than the median.
# your code here
# grading cell do not modify
display(strat_med_df.show(10))
print(rate_median)
Grading Feedback Cell
Question 9 (10 pts):
Create a new pipeline named `strat_pipe` which predicts the `rate_pool` column in `strat_med_df`. Train and test `strat_pipe` using `strat_med_df`. Score `strat_pipe` using a built-in Spark evaluator, 3-fold cross validation, and the AUC (area under the ROC curve) scoring metric. Use an empty `ParamGridBuilder` (empty grid) in the cross validator.
# your code here
Grading Feedback Cell
Question 10 (10 pts):
Create an ROC plot from the results of question 9 above. Explain the process of how an ROC curve is created (don't explain how your code works; explain how an ROC curve is created). Describe the main points of how an ROC curve is created and convince us that you understand the high-level process.
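As a reference for the process being asked about: an ROC curve is traced by sweeping a decision threshold over the model's scores and recording the false-positive rate and true-positive rate at each threshold. The hand-rolled sketch below, with hypothetical scores and labels, makes each step explicit.

```python
# Hypothetical scores/labels for illustration; in the assignment these come
# from the fitted classifier's predicted probabilities on the test set
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs by sweeping a threshold over the scores."""
    P = sum(labels)            # number of actual positives
    N = len(labels) - P        # number of actual negatives
    pts = []
    # Highest threshold first: the curve grows from (0, 0) toward (1, 1)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return [(0.0, 0.0)] + pts

pts = roc_points(scores, labels)
print(pts)
```

Plotting these points with FPR on the x-axis and TPR on the y-axis gives the ROC curve; the area under it is the AUC metric used in question 9.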
# your code here