IST 418: Big Data Analytics

Professor: Christopher Dunham <cndunham@syr.edu>
Faculty Assistant: David Garcia <dgarciaf@syr.edu>

All general instructions from prior assignments apply here. Make sure to run Runtime > Restart and run all before using File > Download > Download .ipynb and submitting the file to Blackboard.

I was very disappointed with the linear regression model accuracy on the insurance data set in homework 3. I'm sure you were disappointed too. In this homework, we will revisit the insurance data set and try to improve prediction scores. Specifically, we will use random forest and gradient boosting trees to see if we can improve upon the scores achieved in homework 3.

def shape(df):
    return df.count(), len(df.columns)

# Grading Cell
enable_grid_search = False

The following cell is used to read the insurance data set into the colab environment. Do not change or modify the following cell.

%%bash
# Do not change or modify this cell
# Need to install pyspark
# if pyspark is already installed, will print a message indicating pyspark already installed
pip install pyspark &> /dev/null

# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=insurance.csv

if [[ ! -f ./${data_file_1} ]]; then 
   # download the data file from github and save it in this colab environment instance
   wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/${data_file_1} &> /dev/null
fi

import pandas as pd
import numpy as np

#creating spark session and spark context
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, feature, regression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder

# init spark env
spark = SparkSession.builder.appName('spark-hw6').getOrCreate()
sc = spark.sparkContext

Assignment Specific Instructions

Your grade for grid search problems in this assignment will be based in part on level of effort and on how your model performance compares with that of other students in the class.

In this assignment, we will be comparing scores between random forest, gradient boosting trees, and deep learning. You are required to correctly use train / test / validation data sets for model comparison as outlined in lecture. Use train and test sets to train and score individual models during grid search. Only use validation data to compare scores between models. You must name your data sets exactly train, test, and validation so that the graders know what data set is being used in each question.
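
The split protocol described above can be sketched in plain Python. This is a minimal illustration on synthetic data, not the insurance data and not Spark: the toy one-parameter model, the `ridge` hyperparameter, the 0.6/0.2/0.2 weights, and all helper names are assumptions made for the example. It shows only the roles of the three sets: fit on train, choose hyperparameters by test score, and touch validation once for the final comparison.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def three_way_split(rows, weights=(0.6, 0.2, 0.2), seed=2023):
    """Assign each row to train, test, or validation by a seeded random draw."""
    rng = random.Random(seed)
    cut_train = weights[0]
    cut_test = weights[0] + weights[1]
    train, test, validation = [], [], []
    for row in rows:
        u = rng.random()
        if u < cut_train:
            train.append(row)
        elif u < cut_test:
            test.append(row)
        else:
            validation.append(row)
    return train, test, validation

def fit_slope(rows, ridge):
    """Toy model y = w * x; `ridge` is an L2 penalty acting as the hyperparameter."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + ridge)

def score(w, rows):
    """MSE of the toy model with slope w on the given rows."""
    return mse([y for _, y in rows], [w * x for x, _ in rows])

# Synthetic data: y = 3x plus Gaussian noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
rows = [(x, 3.0 * x + data_rng.gauss(0, 1)) for x in xs]

train, test, validation = three_way_split(rows)

# Grid search: fit on train, pick the hyperparameter by test score only.
candidates = [0.0, 10.0, 100.0]
test_scores = {ridge: score(fit_slope(train, ridge), test) for ridge in candidates}
best_ridge = min(test_scores, key=test_scores.get)

# Model comparison: validation is used only here, after tuning is finished.
best_w = fit_slope(train, best_ridge)
validation_mse = score(best_w, validation)
print("best ridge:", best_ridge)
print("validation mse:", validation_mse)
```

Keeping validation out of the tuning loop means the final score was never used to pick hyperparameters, so it stays a fair basis for comparing one model family against another.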

Question 1

Read the insurance data file into a spark data frame named medical_df. Drop any rows that contain NAN / Null values. Check the schema and fix if needed.
Perform needed feature engineering using only a string indexer to get ready for training decision trees. One hot encoding is not needed for random forest - do not use one hot encoding or any other transformations other than string indexing. Save the results in medical_df_fe.
Do not use a vector assembler in this question.

# Read the insurance data file into a spark data frame named `medical_df`.  Drop any rows that contain NAN / Null values.  Check the schema and fix if needed.  
medical_df = spark.read.csv('insurance.csv', header=True, inferSchema=True)
medical_df = medical_df.dropna() # drop NAN/null values
medical_df.printSchema()

# Perform needed feature engineering using **only** a string indexer to get ready for training decision trees.  One hot encoding is not needed for random forest - do not use one hot encoding or any other transformations other than string indexing. Save the results in `medical_df_fe`.
fe = feature.StringIndexer(
    inputCols=['sex', 'smoker', 'region'], 
    outputCols=['sex_indexed', 'smoker_indexed', 'region_indexed']).fit(medical_df)
medical_df_fe = fe.transform(medical_df)

# split dataset
train, test, validation = medical_df.randomSplit(weights=[0.9, 0.05, 0.05], seed=2023)
print('train: {}, test: {}, validation: {}'.format(train.count(), test.count(), validation.count()))

# Grading Cell do not modify
# Print the schema
medical_df_fe.printSchema()
#Print the shape
print('The shape of the dataframe is:', shape(medical_df))
# print the head
medical_df_fe.show()

The following questions will create a random forest regressor model. The goal is to see if we can improve upon the linear regression score from homework 3. You can find the spark documentation for the random forest regressor here.

Question 2

Create and train a random forest regressor model using a grid search in the cell below. Score your model using MSE. Your grid search must be entirely encapsulated in the if enable_grid_search statement. The enable_grid_search Boolean is defined in a grading cell above. You will disable the grid search before you submit by setting enable_grid_search to False. Setting enable_grid_search to False should not result in a runtime error. You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the enable_grid_search variable to False.

# your code here
if enable_grid_search:
    rf_pipeline = Pipeline(
        stages=[
            fe, 
            feature.VectorAssembler(
                inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
                outputCol='features'
            ),
            regression.RandomForestRegressor(labelCol='charges', featuresCol='features', seed=2023)])
    rf = rf_pipeline.getStages()[-1]

    # Create a parameter grid for the random forest regressor:
    param_grid = ParamGridBuilder()\
                     .addGrid(rf.maxDepth, [2, 3, 4, 5,])\
                     .addGrid(rf.numTrees, [10, 30, 50])\
                     .addGrid(rf.maxBins, [10, 20, 30])\
                     .build()
    
    rf_models = []
    for i, grid in enumerate(param_grid):
        # fit on train
        model = rf_pipeline.fit(train, grid)
        # evaluate on test; validation is reserved for comparing models
        evaluator = RegressionEvaluator(
            labelCol='charges',
            predictionCol='prediction',
            metricName='mse')
        rf_models.append((
            evaluator.evaluate(model.transform(test)),
            i,
            model
        ))
        print('Grid={:2d} mse on test={:.3f}'.format(i, rf_models[-1][0]))
    # sort by score
    rf_models.sort()
    index, model = rf_models[0][1:]
    params = param_grid[index]
    print('Best model params={}'.format(params))
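
As a rough illustration of the guard requirement, the sketch below uses plain Python rather than Spark. The `grid_values` dictionary and `build_grid` helper are hypothetical stand-ins for the `addGrid` calls above, and `best_params` simply repeats the values hard-coded in question 3. The grid is the Cartesian product of the value lists, and any code that runs after the guard relies only on names defined outside it.

```python
from itertools import product

# Same flag name as the grading cell; False is the value required at submission.
enable_grid_search = False

# Hypothetical plain-Python stand-in for the values passed to addGrid above.
grid_values = {
    "maxDepth": [2, 3, 4, 5],
    "numTrees": [10, 30, 50],
    "maxBins": [10, 20, 30],
}

def build_grid(values):
    """Return one dict per combination (Cartesian product of all value lists)."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

grid = build_grid(grid_values)
print(len(grid))  # 4 * 3 * 3 = 36 grid points

# Hard-coded outside the guard, so later cells never depend on the search.
best_params = {"maxDepth": 5, "numTrees": 50, "maxBins": 20}

if enable_grid_search:
    # Everything that depends on the search stays inside the guard.
    for i, params in enumerate(grid):
        print("would fit and score grid point", i, params)

# Code after the guard uses only names defined outside it.
print("final model would use", best_params)
```

With the flag set to False, the loop body never runs, and nothing after it refers to a variable created inside the `if` block, so a clean Run all completes without a NameError.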

Question 3

Create a pipeline named best_rf_pipe that hard codes the tuning parameters from the best model found by the grid search in question 2 above. Train and test best_rf_pipe. Score your model using validation data and the MSE scoring metric. Save train and validation MSE scores in variables named rf_train_mse and rf_validation_mse.

best_rf_pipe = Pipeline(
    stages=[
        fe, 
        feature.VectorAssembler(
            inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
            outputCol='features'
        ),
        regression.RandomForestRegressor(
            labelCol='charges', featuresCol='features', seed=2023,
            maxDepth=5,
            numTrees=50,
            maxBins=20)
    ]
)

best_rf_pipe_model = best_rf_pipe.fit(train)

evaluator = RegressionEvaluator(
            labelCol='charges',
            predictionCol='prediction',
            metricName='mse')

rf_train_mse = evaluator.evaluate(best_rf_pipe_model.transform(train))
rf_validation_mse = evaluator.evaluate(best_rf_pipe_model.transform(validation))

# Grading cell do not modify
print("rf_train_mse =", rf_train_mse)
print("rf_validation_mse =", rf_validation_mse)

Grading Feedback Cell

Question 4 (10 pts)

Use best_rf_pipe in question 3 for inference. Create a pandas data frame named rf_feature_importance which contains 2 columns: feature, and importance. Load the feature column with the feature name and the importance column with the feature importance score as determined by the random forest model. Sort the feature importances from high to low such that the most important feature is in the first row of the data frame.

rf_feature_importance = pd.DataFrame({
    'feature': ['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'],
    'importance': best_rf_pipe_model.stages[-1].featureImportances.toArray()
})
rf_feature_importance.sort_values('importance', ascending=False, inplace=True)
rf_feature_importance

# grading cell - do not modify
display(rf_feature_importance)

Question 5

Repeat question 2 but this time use a GBT regressor. Create and train a GBT regressor model using a grid search in the cell below. Score your model using MSE. Your grid search must be entirely encapsulated in the if enable_grid_search statement. The enable_grid_search Boolean is defined in a grading cell above. You will disable the grid search before you submit by setting enable_grid_search to False. Setting enable_grid_search to False should not result in a runtime error. You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the enable_grid_search variable to False.

# your code here
if enable_grid_search:
    gbt_pipeline = Pipeline(
        stages=[
            fe, 
            feature.VectorAssembler(
                inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
                outputCol='features'
            ),
            regression.GBTRegressor(labelCol='charges', featuresCol='features', seed=2023)])
    gbt = gbt_pipeline.getStages()[-1]

    # Create a parameter grid for the GBT regressor:
    param_grid = ParamGridBuilder()\
                     .addGrid(gbt.maxDepth, [2, 3, 4, 5,])\
                     .addGrid(gbt.maxBins, [10, 20, 30])\
                     .build()
    
    gbt_models = []
    for i, grid in enumerate(param_grid):
        # fit on train
        model = gbt_pipeline.fit(train, grid)
        # evaluate on test; validation is reserved for comparing models
        evaluator = RegressionEvaluator(
            labelCol='charges',
            predictionCol='prediction',
            metricName='mse')
        gbt_models.append((
            evaluator.evaluate(model.transform(test)),
            i,
            model
        ))
        print('Grid={:2d} mse on test={:.3f}'.format(i, gbt_models[-1][0]))
    # sort by score
    gbt_models.sort()
    index, model = gbt_models[0][1:]
    params = param_grid[index]
    print('Best model params={}'.format(params))
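
The selection step at the end of each grid search sorts tuples of (score, index, model). The sketch below, in plain Python, suggests why the index in the middle matters: tuples compare element by element, so a tie on the score falls through to the next element. `FakeModel` is a hypothetical stand-in for a fitted pipeline model that defines no ordering, and the scores are made-up numbers.

```python
class FakeModel:
    """Hypothetical stand-in for a fitted model; it defines no ordering."""
    def __init__(self, name):
        self.name = name

# Without a unique tiebreaker, a tied score forces Python to compare the models.
try:
    sorted([(1.0, FakeModel("a")), (1.0, FakeModel("b"))])
    tie_is_safe = True
except TypeError:
    tie_is_safe = False
print("sorting (score, model) with a tie is safe:", tie_is_safe)

# With the grid index in second position, ties are resolved before the model
# is ever compared, which is the layout used in the grid-search cells.
results = [
    (2.5, 0, FakeModel("a")),
    (1.0, 1, FakeModel("b")),
    (1.0, 2, FakeModel("c")),
]
results.sort()
best_score, best_index, best_model = results[0]
print(best_score, best_index, best_model.name)

# An equivalent selection that never looks at the model object at all.
alt_score, alt_index, alt_model = min(results, key=lambda r: (r[0], r[1]))
```

Using `min` with an explicit key is one alternative when only the single best entry is needed, since it makes the tie-breaking rule visible in the code.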

Question 6

This is a repeat of question 3 but for GBT. Create a pipeline named best_gbt_pipe that hard codes the tuning parameters from the best model found by the grid search in question 5 above. Train and test best_gbt_pipe using MSE as the scoring metric. Save train and validation MSE scores in variables named gbt_train_mse and gbt_validation_mse.

best_gbt_pipe = Pipeline(
    stages=[
        fe, 
        feature.VectorAssembler(
            inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
            outputCol='features'
        ),
        regression.GBTRegressor(
            labelCol='charges', featuresCol='features', seed=2023,
            maxDepth=4,
            maxBins=30)
    ]
)

best_gbt_pipe_model = best_gbt_pipe.fit(train)

evaluator = RegressionEvaluator(
            labelCol='charges',
            predictionCol='prediction',
            metricName='mse')

gbt_train_mse = evaluator.evaluate(best_gbt_pipe_model.transform(train))
gbt_validation_mse = evaluator.evaluate(best_gbt_pipe_model.transform(validation))

# Grading cell do not modify
print("gbt_train_mse =", gbt_train_mse)
print("gbt_validation_mse =", gbt_validation_mse)

Question 7

Create a pandas dataframe named rf_gbt_mse_compare which contains 3 columns: Model, Train MSE, and Validation MSE. Load the Model column with "RF" or "GBT", the Train MSE column with the corresponding train MSE, and the Validation MSE column with the corresponding validation MSE scores from the random forest / gradient boosted tree scores. Use rf_train_mse, rf_validation_mse, gbt_train_mse, and gbt_validation_mse variables to load the dataframe.

GBT models usually produce better scores than random forest. I am not sure if that will be the case for this dataset but you will be graded in comparison to other students' results in the class.

rf_gbt_mse_compare = pd.DataFrame({
    'Model': ['RF', 'GBT'],
    'Train MSE': [rf_train_mse, gbt_train_mse],
    'Validation MSE': [rf_validation_mse, gbt_validation_mse]
})

# Grading Cell Do Not Modify
display(rf_gbt_mse_compare)

Question 8

Set the enable_grid_search Boolean variable to False in the grading cell at the top of this notebook. Perform a Runtime -> Disconnect and delete runtime, followed by a Runtime -> Run all, to verify there are no runtime errors. Leave the enable_grid_search variable set to False and turn in your assignment.
