
IST 418: Big Data Analytics

General instructions:

  • You are welcome to discuss the problems with your classmates, but you are not allowed to copy any part of your answers from them. Short code snippets from the internet are allowed. Code from the class textbooks or class-provided code can be copied in its entirety.
  • Google Colab is the official class runtime environment so you should test your code on Colab before submission.
  • Do not modify cells marked as grading cells or marked as do not modify.
  • Before submitting your work, remember to check for runtime errors with the following procedure: Runtime → Restart and run all. Any runtime error will result in a minimum penalty of half off.
  • All plots shall include a descriptive title and axis labels. Plot legends shall be included where possible. Unless stated otherwise, plots can be made using any Python plotting package. It is understood that spark data structures must be converted to something like numpy or pandas prior to making plots. All mathematical operations, filtering, selection, etc., required by a homework question shall be performed in spark prior to converting to numpy or pandas.
  • You are free to add additional code cells around the cells marked your code here.
  • We reserve the right to take points off for operations that are extremely inefficient or "heavy weight". This is a big data class, and extremely inefficient operations make a big difference when scaling up to large data sets. For example, the spark dataframe collect() method is a very heavy weight operation and should not be used unless there is a real need for it. An example where collect() might be needed is to get ready to make a plot after filtering a spark dataframe.
  • import * is not allowed because it is considered a very bad coding practice and in some cases can result in a significant delay (which slows down the grading process) in loading imports. For example, the statement from sympy import * is not allowed. You must import the specific packages that you need.
  • The graders reserve the right to deduct points for subjective things we see with your code. For example, if we ask you to create a pandas data frame to display values from an investigation and you hard code the values, we will take points off for that. This is only one of many different things we could find in reviewing your code. In general, write your code like you are submitting it for a code peer review in industry.
  • Level of effort is part of our subjective grading. For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements. In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort. We feel that the students who did a better job deserve a better grade. We reserve the right to invoke level of effort grading at any time.
  • Only use spark, spark machine learning, spark data frames, RDDs, and map reduce to solve all problems unless instructed otherwise.
  • Your notebook must run from start to finish without requiring manual input by the graders. For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps. In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.

Read the data files

The cell below reads the assignment data files from GitHub.

%%bash
# define an array of data file names
data_file_array=("indicator_gapminder_population.csv" "indicator_gapminder_under5mortality.csv" "indicator_life_expectancy_at_birth.csv" "indicator_undata_total_fertility.csv")

# for each data file
for file in "${data_file_array[@]}"; do
  # if the data file does not exist on the local computer
  if [[ ! -f ./${file} ]]; then 
    # download the data file from github and save it on the local computer
    wget https://raw.githubusercontent.com/cndunham/IST-418-Spring-2023-Data/master/un_indicator_data/${file} &> /dev/null
  fi  
done

Question 1 (10 pts)

In the game of roulette you can bet on several things, including whether the ball will land on black or red. In a black-or-red bet, if you win, you double your money. How does the casino make money? If you look at the possibilities, you realize that the chances of red and black are both slightly less than 1/2. There are two green spots, so the chance of landing on black (or red) is actually 18/38, or 9/19.
Create a utility function named get_outcome that can be used in a Monte Carlo simulation. The get_outcome function takes as an argument the number of times you play (or spin) the roulette wheel and returns the player's earnings for the number of spins specified. Assume that the player bets exactly one dollar on black for each spin of the wheel.
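
A minimal sketch of one way get_outcome could be written, assuming a numpy-based simulation in which each spin wins one dollar with probability 18/38 and loses one dollar otherwise:

import numpy as np

def get_outcome(n_spins):
    # Each $1 bet on black wins with probability 18/38 and loses otherwise.
    spins = np.random.choice([1, -1], size=n_spins, p=[18/38, 20/38])
    # Total earnings (in dollars) over all spins.
    return spins.sum()

Dividing the result by the number of spins, as the grading cell below does, gives the average earnings per play, which should land near the expected value of 18/38 - 20/38, about -0.053 dollars per play.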

# your code here
# Grading cell - do not change or delete
num_plays = 10000
get_outcome(num_plays) / num_plays

Question 2 (10 pts)

Use the get_outcome function defined above in a Monte Carlo simulation to study the distribution of total earnings. Run 4 simulations for number of roulette plays = 10, 25, 100, and 1000, where each of the 4 simulations is executed 500 times. Collect the results into a 2 dimensional numpy array named roulette_sim_array. Make histogram plots for each of the 4 simulations. Based on the histogram plots, describe what happens to total earnings as the number of plays increases.
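
One possible sketch, reusing get_outcome from Question 1; storing the results as 500 rows by 4 columns (one column per play count) is an assumption, since the question only asks for a 2 dimensional array:

import numpy as np
import matplotlib.pyplot as plt

play_counts = [10, 25, 100, 1000]
n_sims = 500

# Row i holds one total-earnings result for each of the 4 play counts.
roulette_sim_array = np.array(
    [[get_outcome(n) for n in play_counts] for _ in range(n_sims)]
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, earnings, n in zip(axes.ravel(), roulette_sim_array.T, play_counts):
    ax.hist(earnings, bins=30)
    ax.set_title(f"Total earnings over {n} plays (500 simulations)")
    ax.set_xlabel("Total earnings ($)")
    ax.set_ylabel("Count")
fig.tight_layout()
plt.show()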

# your code here
Grading Feedback Cell

Your explanation here:

Question 3 (10 pts)

Using the central limit theorem, create a pandas dataframe named roulette_df containing the sampling distribution of the means from the sample data in the numpy array above. The pandas dataframe shall have 4 columns labeled with the simulation names. Using data in the roulette_df, plot histograms for each of the sampling distributions - you should have 4 histograms in total.

The following question is based on central limit theorem sampling theory. Assuming you don't know the underlying distribution of the population from which the samples were drawn, some of the histograms are guaranteed to be Gaussian in shape, some are not guaranteed, and some are in a transition region. For each of the 4 simulations, describe whether you think the shape is guaranteed to be Gaussian, not guaranteed to be Gaussian, or in a transition area between a guarantee and no guarantee.
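
A sketch under one interpretation (an assumption): because each simulated run is already the sum of n independent plays, dividing each total in roulette_sim_array by its play count gives a sample mean, so each column forms a sampling distribution of the mean. The column labels below are one reasonable naming choice:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

play_counts = [10, 25, 100, 1000]

# Each column: 500 sample means (total earnings / number of plays) for one simulation.
roulette_df = pd.DataFrame(
    roulette_sim_array / np.array(play_counts),
    columns=[f"{n} plays" for n in play_counts],
)

axes = roulette_df.hist(bins=30, figsize=(10, 8))
for ax in axes.ravel():
    ax.set_xlabel("Mean earnings per play ($)")
    ax.set_ylabel("Count")
plt.tight_layout()
plt.show()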

# Your histogram code here
# Grading cell - do not modify
display(roulette_df.head())
Grading Feedback Cell

Your explanation here:

Question 4 (20 pts)

Create a new Monte Carlo simulation that calculates the probability that the casino loses money based on the number of times that a player plays roulette. Create a function p_casino_loss that takes as an argument the number of times that the player plays roulette (n_plays) and returns the probability that the casino loses money. Your code should simulate spinning the roulette wheel. Run the n_plays simulation a fixed large number of times (100 works) and return the average probability result. Using data collected from p_casino_loss, produce a line plot that shows the probability that the casino loses money vs. the number of games played, for numbers of games between 25 and 1000. Describe what the results of the simulation show.
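
One possible sketch, reusing get_outcome from Question 1 and treating "the casino loses money" as the player finishing with positive total earnings; the 25-play step size for the sweep is an assumption:

import matplotlib.pyplot as plt

def p_casino_loss(n_plays, n_trials=100):
    # Fraction of simulated sessions in which the player finishes ahead.
    wins = sum(get_outcome(n_plays) > 0 for _ in range(n_trials))
    return wins / n_trials

game_counts = list(range(25, 1001, 25))
loss_probs = [p_casino_loss(n) for n in game_counts]

plt.plot(game_counts, loss_probs)
plt.title("Probability the casino loses money vs. number of plays")
plt.xlabel("Number of plays")
plt.ylabel("P(casino loses money)")
plt.show()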

# your code here

Your explanation here:

Question 5 (10 pts)

Compute the following matrix dot product manually by creating 2 dimensional numpy arrays for each matrix, computing the matrix multiply using Python for loops, and loading a new 2 dimensional numpy array with the answer. Print the resulting numpy array.
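
A sketch of the triple-loop approach; the matrices given in the question are not reproduced here, so the 2 x 3 and 3 x 2 arrays below are placeholders to be replaced with the assignment's values:

import numpy as np

# Placeholder matrices - substitute the values from the question.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# The result has shape (rows of A, columns of B).
result = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):           # each row of A
    for j in range(B.shape[1]):       # each column of B
        for k in range(A.shape[1]):   # accumulate the dot product of row i and column j
            result[i, j] += A[i, k] * B[k, j]

print(result)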

# your code here

Question 6 (10 pts)

Read each of the 4 assignment csv files into pandas dataframes named population_df, morttality_df, life_exp_df, and fertility_df. Rename the column with the country names as "Country" in each of the dataframes. Hint: the bash data_file_array at the start of the assignment has the file names you need to load. You can also click the Colab file icon to the left to view the file names stored on the local Colab instance.
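
One possible sketch; assuming the country names are in the first column of each file (the exact original column label is not shown here):

import pandas as pd

def read_indicator(file_name):
    # Read one indicator file and rename its first column (the country names) to "Country".
    df = pd.read_csv(file_name)
    return df.rename(columns={df.columns[0]: "Country"})

population_df = read_indicator("indicator_gapminder_population.csv")
morttality_df = read_indicator("indicator_gapminder_under5mortality.csv")
life_exp_df = read_indicator("indicator_life_expectancy_at_birth.csv")
fertility_df = read_indicator("indicator_undata_total_fertility.csv")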

# your code here
# grading cell - do not modify
display(population_df.head())
display(morttality_df.head())
display(life_exp_df.head())
display(fertility_df.head())

Question 7 (10 pts)

The data frames from the question above are organized such that rows are countries and columns are years. Reorganize each data frame so that it contains only 3 columns: country, year, and a data value. This is known as the long or tidy format. For example, the population data frame columns start out as country, year, year, year, ..., year. After reorganizing, the population data frame will contain only 3 columns: country, year, and population. Save the reorganized data into new data frames named tidy_population_df, tidy_morttality_df, tidy_life_exp_df, and tidy_fertility_df. You are free to use any means necessary to perform this task, but the pandas melt function may be useful.
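
A sketch using pandas melt; the value-column names below (Population, Mortality, Life Expectancy, Fertility) are one reasonable choice, not names dictated by the assignment:

def to_tidy(df, value_name):
    # Unpivot the year columns into (Country, Year, value) rows.
    return df.melt(id_vars="Country", var_name="Year", value_name=value_name)

tidy_population_df = to_tidy(population_df, "Population")
tidy_morttality_df = to_tidy(morttality_df, "Mortality")
tidy_life_exp_df = to_tidy(life_exp_df, "Life Expectancy")
tidy_fertility_df = to_tidy(fertility_df, "Fertility")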

# your code here
# grading cell - do not modify
display(tidy_population_df.head())
print(tidy_population_df.size)
display(tidy_morttality_df.head())
print(tidy_morttality_df.size)
display(tidy_life_exp_df.head())
print(tidy_life_exp_df.size)
display(tidy_fertility_df.head())
print(tidy_fertility_df.size)

Question 8 (10 pts)

Join all 4 dataframes together such that the country, year, population, mortality, life expectancy, and fertility columns are collected together in the same dataframe. The join operation should not throw away any data. Name the new dataframe concat_df. Next, delete all rows where life expectancy and fertility are NaN.
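
One possible sketch using outer merges on Country and Year so that no rows are discarded; the column names follow the melt sketch above, and reading "life expectancy and fertility are NaN" as both values missing is an assumption:

from functools import reduce
import pandas as pd

frames = [tidy_population_df, tidy_morttality_df, tidy_life_exp_df, tidy_fertility_df]

# Outer merges keep every (Country, Year) pair from every dataframe.
concat_df = reduce(
    lambda left, right: pd.merge(left, right, on=["Country", "Year"], how="outer"),
    frames,
)

# Drop rows where both Life Expectancy and Fertility are NaN.
concat_df = concat_df.dropna(subset=["Life Expectancy", "Fertility"], how="all")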

# your code here
# grading cell - do not modify
display(concat_df.head())
print(concat_df.shape)

Question 9 (10 pts)

Using concat_df, report the child mortality rate and life expectancy in 2015 for these 5 countries (one possible approach is sketched after the list):

  1. Sri Lanka
  2. Poland
  3. Malaysia
  4. Pakistan
  5. Thailand
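
A sketch of one approach; the Mortality and Life Expectancy column names follow the earlier melt sketch, and whether the Year column holds strings like "2015" or integers depends on how the melt was done, so adjust the comparison as needed:

countries = ["Sri Lanka", "Poland", "Malaysia", "Pakistan", "Thailand"]

report_df = concat_df[
    concat_df["Country"].isin(countries) & (concat_df["Year"] == "2015")
][["Country", "Mortality", "Life Expectancy"]]

display(report_df)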
# Your code here