
IST 418: Big Data Analytics

All general instructions from prior assignments apply here. Make sure to select Runtime > Restart and run all prior to selecting File > Download > Download .ipynb and submitting the file to Blackboard.

%%bash
# Do not change or modify this cell
# Need to install pyspark
# if pyspark is already installed, pip exits without reinstalling (all output is suppressed)
pip install pyspark >& /dev/null 

# Download the data files from github
wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/US_Airline_Tweets.csv >& /dev/null
wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/stop_words.txt >& /dev/null

Sentiment Analysis

In this assignment, you will use a Twitter US airline dataset to perform sentiment analysis. Specifically, you will use Twitter data to predict the sentiment of tweets related to people's experiences with an airline.

# Grading cell
# The purpose of the following boolean is to enable or disable grid search (see question 6a).  
# During grading we want to turn grid search off.  
# You should test your code with grid search set to False before submitting.
# Your notebook should run in its entirety without crashing when enable_grid is
# set to False before submitting.
enable_grid = True

Question 1:

Read US_Airline_Tweets.csv into a Spark dataframe named tweets_df.

  • Drop all columns except airline_sentiment, airline, and text
  • Drop rows in which the airline_sentiment column is labeled with a neutral sentiment
  • Drop rows which contain NA / Null values in any column

Transform the airline_sentiment column such that a negative sentiment is equal to 0 and a positive sentiment is equal to 1. This dataset has a lot more negative than positive tweets.

  • Balance the dataset such that the percentage of negative and positive tweets is roughly 50% each. Your solution must randomly sample the dataset without replacement to perform balancing.
  • Determine and print the resulting percentage of positive and negative tweets in the dataframe such that it's easy for the graders to find and interpret your data.
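One way this could look (a sketch, not the required solution; the seed is arbitrary and the CSV is assumed to parse cleanly with default options):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

tweets_df = (
    spark.read.csv("US_Airline_Tweets.csv", header=True)
    .select("airline_sentiment", "airline", "text")
    .filter(F.col("airline_sentiment") != "neutral")
    .dropna()
    .withColumn("airline_sentiment",
                F.when(F.col("airline_sentiment") == "positive", 1).otherwise(0))
)

# Balance by downsampling the majority (negative) class; sampleBy does
# stratified sampling without replacement.
n_pos = tweets_df.filter("airline_sentiment = 1").count()
n_neg = tweets_df.filter("airline_sentiment = 0").count()
tweets_df = tweets_df.sampleBy("airline_sentiment",
                               fractions={0: n_pos / n_neg, 1: 1.0}, seed=42)

# Report the resulting class balance as percentages.
counts = tweets_df.groupBy("airline_sentiment").count().toPandas()
counts["percent"] = 100 * counts["count"] / counts["count"].sum()
print(counts)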
# your code here
# grading cell do not modify
tweets_pd = tweets_df.toPandas()
display(tweets_pd.head())
print(tweets_pd.shape)

Question 2:

Pre-process the data by creating a pipeline named tweets_pre_proc_pipe. Your pipeline should:

  • tokenize
  • remove stop words (note that stop words are downloaded as stop_words.txt)
  • do a TF-IDF transformation.

Fit and execute your pipeline, and create a new dataframe named tweets_pre_proc_df.

Print the shape of the resulting TF-IDF data such that it's easy for the graders to find and understand as num rows x num words.

Based on the shape of the TF-IDF data, would you expect a logistic regression model to overfit? Provide your explanation below.
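A minimal sketch of the pipeline (assuming stop_words.txt contains one word per line, and using CountVectorizer rather than HashingTF for the term-frequency step so the vocabulary stays recoverable for later questions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

# Load the downloaded stop word list.
with open("stop_words.txt") as f:
    stop_words = [w.strip() for w in f if w.strip()]

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered",
                           stopWords=stop_words)
cv = CountVectorizer(inputCol="filtered", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")

tweets_pre_proc_pipe = Pipeline(stages=[tokenizer, remover, cv, idf])
pre_proc_model = tweets_pre_proc_pipe.fit(tweets_df)
tweets_pre_proc_df = pre_proc_model.transform(tweets_df)

# Shape of the TF-IDF data: num rows x vocabulary size.
num_rows = tweets_pre_proc_df.count()
num_words = len(pre_proc_model.stages[2].vocabulary)
print(f"TF-IDF shape: {num_rows} rows x {num_words} words")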

# your code here
# grading cell do not modify
tweets_pre_proc_df.show(10)

Your explanation here:

Question 3:

Since IDF considers a word's frequency across all documents in a corpus, you can use IDF as a form of inference.

Examine the documentation for the Spark ML object that you used to create TF-IDF scores and learn how to extract the IDF scores for words in the corpus.

The fitted IDF model in your pipeline exposes an idf vector whose values attribute and tolist() method can be used to extract IDF values.

Create a pandas dataframe containing the 5 most important IDF scores named most_imp_idf.

Create another pandas dataframe containing the 5 least important IDF scores named least_imp_idf.

Each dataframe shall have 2 columns named word and idf_score.

Explain in words your interpretation of what the IDF scores mean.
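A sketch of the extraction, assuming the fitted preprocessing model (pre_proc_model above) and treating a higher IDF score as more important, since high IDF marks words that are rare across documents:

import pandas as pd

cv_model = pre_proc_model.stages[2]   # fitted CountVectorizer
idf_model = pre_proc_model.stages[3]  # fitted IDF

idf_df = pd.DataFrame({
    "word": cv_model.vocabulary,
    "idf_score": idf_model.idf.values.tolist(),
})

most_imp_idf = idf_df.nlargest(5, "idf_score").reset_index(drop=True)
least_imp_idf = idf_df.nsmallest(5, "idf_score").reset_index(drop=True)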

# your code here
# grading cell do not modify
display(most_imp_idf)
display(least_imp_idf)

Your explanation here:

Question 4:

Create a new recursive pipeline named lr_pipe which encapsulates tweets_pre_proc_pipe and adds a logistic regression model and any needed logistic regression support objects. Use default logistic regression hyperparameters.

Fit lr_pipe using tweets_df.

Score the model using ROC AUC. Report the resulting AUC such that it is easy for graders to find and interpret.
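A sketch of one approach; the 80/20 train/test split is an assumption, since the question does not prescribe an evaluation protocol:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(featuresCol="tfidf", labelCol="airline_sentiment")
lr_pipe = Pipeline(stages=[tweets_pre_proc_pipe, lr])  # nested pipeline

train_df, test_df = tweets_df.randomSplit([0.8, 0.2], seed=42)
lr_pipe_model = lr_pipe.fit(train_df)

evaluator = BinaryClassificationEvaluator(labelCol="airline_sentiment",
                                          metricName="areaUnderROC")
auc_1 = evaluator.evaluate(lr_pipe_model.transform(test_df))
print("lr_pipe ROC AUC:", auc_1)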

# your code here

Question 5:

Create 2 pandas dataframes named lr_pipe_df_neg and lr_pipe_df_pos which contain 2 columns: word and score.

Load the 2 dataframes with the top 10 words and logistic regression coefficients that contribute the most to negative and positive sentiments respectively.

Analyze the 2 dataframes and describe if the words make sense. Do the words look like they are really negative and positive? Provide a written response below.
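One way to pull the coefficients back out of the fitted pipeline (a sketch; it assumes lr_pipe_model from question 4 and the CountVectorizer vocabulary from question 2):

import pandas as pd

lr_model = lr_pipe_model.stages[-1]                    # fitted LogisticRegression
vocab = lr_pipe_model.stages[0].stages[2].vocabulary   # inner preprocessing pipeline

coef_df = pd.DataFrame({"word": vocab,
                        "score": lr_model.coefficients.toArray().tolist()})

# The most negative coefficients push predictions toward class 0 (negative
# sentiment); the most positive push toward class 1 (positive sentiment).
lr_pipe_df_neg = coef_df.nsmallest(10, "score").reset_index(drop=True)
lr_pipe_df_pos = coef_df.nlargest(10, "score").reset_index(drop=True)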

# your code here
# grading cell - do not modify
display(lr_pipe_df_neg)
display(lr_pipe_df_pos)

Your explanation here:

Question 6a:

The goal of this question is to try to improve the score from question 4 using an elastic net regularization grid search on a new pipeline named lr_pipe_1. lr_pipe_1 is the same as lr_pipe above, but we would like you to create a new pipe for grading purposes only. I'm not sure if it's possible to increase the score or not; you will be graded on level of effort to increase the score in relation to other students in the class. All of your grid search code should be inside the if enable_grid statement in the cell below. The enable_grid boolean is set to True in a grading cell above. If any of the grid search code executes outside of the if statement, you will not get full credit for the question. We want the ability to turn off the grid search during grading.

# your grid search (and only your grid search) code here
if enable_grid:
    # your grid search code here
    pass
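For reference, one possible shape for the guarded grid search (the parameter ranges are illustrative, not recommended values):

if enable_grid:
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr_1 = LogisticRegression(featuresCol="tfidf", labelCol="airline_sentiment")
    lr_pipe_1 = Pipeline(stages=[tweets_pre_proc_pipe, lr_1])

    # elasticNetParam is alpha (the L1/L2 mix); regParam is lambda (the strength).
    grid = (ParamGridBuilder()
            .addGrid(lr_1.elasticNetParam, [0.0, 0.5, 1.0])
            .addGrid(lr_1.regParam, [0.01, 0.1, 1.0])
            .build())

    cv = CrossValidator(estimator=lr_pipe_1,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(
                            labelCol="airline_sentiment",
                            metricName="areaUnderROC"),
                        numFolds=3)
    cv_model = cv.fit(train_df)

    best_lr = cv_model.bestModel.stages[-1]
    print("best AUC:", max(cv_model.avgMetrics))
    print("best alpha:", best_lr.getElasticNetParam())
    print("best lambda:", best_lr.getRegParam())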

Question 6b:

Build a new pipeline named lr_pipe_2 which uses the optimized model parameters from the grid search in question 6a above (the best model).

Create 2 variables named alpha and lambda_ and assign to them the best alpha and lambda produced by the grid search by hard coding the values. (Note that lambda itself is a reserved word in Python and cannot be used as a variable name.)

Fit and transform lr_pipe_2.

Compare AUC scores between lr_pipe_2 and lr_pipe from question 4.

Create a pandas dataframe named compare_1_df which encapsulates the comparison data.

compare_1_df shall have 2 columns: model_name and auc_score.
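
A sketch of the comparison, assuming the train/test split and auc_1 from question 4 are still in scope; the hard-coded values below are placeholders for whatever your grid search actually found:

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

alpha = 0.5     # placeholder best elastic net mixing parameter
lambda_ = 0.01  # placeholder best regularization strength ("lambda" is reserved)

lr_2 = LogisticRegression(featuresCol="tfidf", labelCol="airline_sentiment",
                          elasticNetParam=alpha, regParam=lambda_)
lr_pipe_2 = Pipeline(stages=[tweets_pre_proc_pipe, lr_2])
lr_pipe_2_model = lr_pipe_2.fit(train_df)

evaluator = BinaryClassificationEvaluator(labelCol="airline_sentiment",
                                          metricName="areaUnderROC")
auc_2 = evaluator.evaluate(lr_pipe_2_model.transform(test_df))

compare_1_df = pd.DataFrame({"model_name": ["lr_pipe", "lr_pipe_2"],
                             "auc_score": [auc_1, auc_2]})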

# your optimized model code here
# example
# alpha = 0.1
# lambda_ = 0.1  (lambda is a reserved word in Python, so use lambda_)

# lr_pipe_2 code here which uses the best alpha and lambda
# grading cell - do not modify
display(compare_1_df)

Question 7:

Perform inference on lr_pipe_2. Write code to report how many words were eliminated from the best model in question 6b above (if any) as compared to the model in question 4 above. Make sure your output is easy for the graders to find and interpret.

Describe in words how feature selection is performed using elastic net regularization.
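A sketch of the count, assuming the fitted models from questions 4 and 6b; the L1 component of elastic net drives some coefficients exactly to zero, and a zeroed word is effectively eliminated from the model:

q4_coefs = lr_pipe_model.stages[-1].coefficients.toArray()
q6_coefs = lr_pipe_2_model.stages[-1].coefficients.toArray()

n_zero_q4 = int((q4_coefs == 0).sum())
n_zero_q6 = int((q6_coefs == 0).sum())
print(f"zero coefficients, question 4 model: {n_zero_q4}")
print(f"zero coefficients, question 6b model: {n_zero_q6}")
print(f"words eliminated by regularization: {n_zero_q6 - n_zero_q4}")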

# your code here

Your explanation here:

Question 8:

Perform the same inference analysis that you did in question 5 but name the data frames lr_pipe_df_neg_1 and lr_pipe_df_pos_1.

Compare the word importance results with the results in question 5. Do the most positive and most negative words produced by using regularization better reflect positive and negative sentiment than the most positive and negative words produced by the model that did not use regularization? Provide a written response below.

# your code here
# grading cell - do not modify
display(lr_pipe_df_neg_1)
display(lr_pipe_df_pos_1)

Your explanation here:

Question 9 (BONUS):

Precision recall plots are very similar to receiver operating characteristic (ROC) curves. The high level steps for creating a precision recall curve are the same as the steps needed to create a ROC curve as outlined in lecture. Learn about precision recall curves.

Create a precision recall plot for the best model in question 6.

Describe below which axes are the same / different between the precision recall curve and the ROC curve.
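
A sketch using scikit-learn on the collected test-set predictions (an assumption: collecting to the driver is fine at this dataset's size, and lr_pipe_2_model from question 6b is the best model):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

preds_pd = (lr_pipe_2_model.transform(test_df)
            .select("airline_sentiment", "probability")
            .toPandas())
y_true = preds_pd["airline_sentiment"]
y_score = preds_pd["probability"].apply(lambda v: float(v[1]))  # P(class 1)

precision, recall, _ = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("Precision-recall curve (best model)")
plt.show()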

# your code here

Your explanation here:

Question 10 (MUST DO)

Make sure to set enable_grid to False in the grading cell above and run the notebook in its entirety before submitting to verify that there are no runtime errors.