COMP9417 - Machine Learning Homework 1: Regularized Regression & Statistical Estimators
Introduction
In this homework we will explore gradient-based optimization and explain in detail how gradient descent can be implemented. Gradient-based algorithms have been crucial to the development of machine learning in the last few decades. The most famous example is the backpropagation algorithm used in deep learning, which is in fact just an application of a simple algorithm known as (stochastic) gradient descent. We will first implement gradient descent from scratch on a deterministic problem (no data), and then extend our implementation to solve a real-world regression problem. We then also look at the notions of bias, variance and MSE of statistical estimators.
Question 1. Gradient Based Optimization
The general framework for a gradient method for finding a minimizer of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined by

$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}), \qquad k = 0, 1, 2, \dots, \tag{1}$$

where $\alpha_k > 0$ is known as the step size, or learning rate. Consider the following simple example of minimizing the function $g(x) = 2\sqrt{x^3 + 1}$. We first note that $g'(x) = 3x^2(x^3+1)^{-1/2}$. We then need to choose a starting value of $x$, say $x^{(0)} = 1$. Let's also take the step size to be constant, $\alpha_k = \alpha = 0.1$. Then we have the following iterations:

$$x^{(1)} = x^{(0)} - 0.1 \times 3(x^{(0)})^2((x^{(0)})^3 + 1)^{-1/2} = 0.7878679656440357$$
$$x^{(2)} = x^{(1)} - 0.1 \times 3(x^{(1)})^2((x^{(1)})^3 + 1)^{-1/2} = 0.6352617090300827$$
$$x^{(3)} = 0.5272505146487477$$
$$\vdots$$

and this continues until we terminate the algorithm (as a quick exercise for your own benefit, code this up and compare it to the true minimum of the function, which is $x = -1$). This idea works for functions that have vector-valued inputs, which is often the case in machine learning. For example, when we minimize a loss function we do so with respect to a weight vector, $\theta$. When we take the step size to be constant at each iteration, this algorithm is known as gradient descent. For the entirety of this question, do not use any existing implementations of gradient methods; doing so will result in an automatic mark of zero for the entire question.
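The suggested exercise can be sketched in a few lines. This is one possible implementation, not part of the required submission; the variable names are my own, and the first three iterates reproduce the values shown above.

```python
import numpy as np

def grad_g(x):
    # derivative of g(x) = 2 * sqrt(x**3 + 1)
    return 3 * x**2 / np.sqrt(x**3 + 1)

alpha = 0.1          # constant step size, as in the example
x = 1.0              # starting point x^(0)
xs = [x]
for k in range(1, 101):
    x = x - alpha * grad_g(x)
    xs.append(x)
    if k <= 3:
        print(f"x({k}) = {x}")
```

Running this reveals the point of the exercise: for $x > 0$ the gradient $g'(x) = 3x^2/\sqrt{x^3+1}$ shrinks to zero as $x \to 0$, so the iterates stall near $x = 0$ and never reach the true minimum at $x = -1$ (where $g' $ is in fact unbounded).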
(a) Consider the following optimisation problem:

$$\min_{x \in \mathbb{R}^n} f(x), \qquad \text{where} \qquad f(x) = \frac{1}{2}\|Ax - b\|_2^2 + \frac{\gamma}{2}\|x\|_2^2,$$

and where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ are defined as

$$A = \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & 0 & 2 \\ 0 & 1 & 2 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ 2 \\ 2 \end{bmatrix},$$

and $\gamma$ is a positive constant. Run gradient descent on $f$ using a step size of $\alpha = 0.1$ and $\gamma = 0.2$ and a starting point of $x^{(0)} = (1, 1, 1, 1)$. You will need to terminate the algorithm when the following condition is met: $\|\nabla f(x^{(k)})\|_2 < 0.001$. In your answer, clearly write down the version of the gradient steps (1) for this problem. Also, print out the first 5 and last 5 values of $x^{(k)}$, clearly indicating the value of $k$, in the form:

k = 0, x(k) = [1, 1, 1, 1]
k = 1, x(k) = ...
k = 2, x(k) = ...
...

What to submit: an equation outlining the explicit gradient update, a print out of the first 5 (k = 5 inclusive) and last 5 rows of your iterations. Use the round function to round your numbers to 4 decimal places. Include a screen shot of any code used for this section and a copy of your python code in solutions.py.
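A from-scratch sketch of the required loop (NumPy only, consistent with the rule against using existing optimizer implementations) might look as follows; the gradient used is $\nabla f(x) = A^\top(Ax - b) + \gamma x$, which follows directly from the definition of $f$. Treat this as an outline to check your own work against, not as the submitted answer.

```python
import numpy as np

# Problem data as given in the question.
A = np.array([[1., 2., 1., 1.],
              [1., 1., 0., 2.],
              [0., 1., 2., 1.]])
b = np.array([3., 2., 2.])
gamma, alpha = 0.2, 0.1   # regularization constant and step size

def grad_f(x):
    # nabla f(x) = A^T (A x - b) + gamma * x
    return A.T @ (A @ x - b) + gamma * x

x = np.ones(4)            # starting point x^(0) = (1, 1, 1, 1)
history = [x.copy()]
while np.linalg.norm(grad_f(x)) >= 0.001:   # termination condition
    x = x - alpha * grad_f(x)
    history.append(x.copy())

# First values (k = 0..5) and last 5 values, rounded to 4 decimal places.
for k in list(range(6)) + list(range(len(history) - 5, len(history))):
    print(f"k = {k}, x(k) = {np.round(history[k], 4)}")
```

As a sanity check on the first update: $\nabla f(x^{(0)}) = A^\top(Ax^{(0)} - b) + \gamma x^{(0)} = (4.2, 8.2, 6.2, 8.2)$, so $x^{(1)} = (0.58, 0.18, 0.38, 0.18)$.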
problem. Consider the CarSeats data provided in CarSeats.csv. It contains 400 observations, with each observation describing child car seats for sale at one of 400 stores. The features in the data set are outlined below:

• Sales: unit sales (in thousands) at each location
• CompPrice: price charged by competitor at each location
• Income: local income level (in thousands of dollars)
• Advertising: advertising budget (in thousands of dollars)
• Population: local population size (in thousands)
• Price: price charged by store at each site
• ShelveLoc: a categorical variable with levels Bad, Good and Medium describing the quality of the shelf location of the car seat
• Age: average age of the local population
• Education: education level at each location
• Urban: a categorical variable with levels No and Yes to describe whether the store is in an urban location or in a rural one
• US: a categorical variable with levels No and Yes to describe whether the store is in the US or not.

The target variable is Sales. The goal is to learn to predict the amount of Sales as a function of a subset of the above features. We will do so by running Ridge Regression (Ridge), which is defined as follows
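Whatever form the Ridge objective takes, the three categorical features above (ShelveLoc, Urban, US) must first be converted to numeric columns. A minimal sketch using pandas one-hot encoding is below; the real data would be loaded with `pd.read_csv("CarSeats.csv")`, and the two rows of values here are invented stand-ins used only to illustrate the column schema.

```python
import pandas as pd

# Stand-in frame with the CarSeats schema (values are invented for illustration).
df = pd.DataFrame({
    "Sales":       [9.5, 11.2],
    "CompPrice":   [138, 111],
    "Income":      [73, 48],
    "Advertising": [11, 16],
    "Population":  [276, 260],
    "Price":       [120, 83],
    "ShelveLoc":   ["Bad", "Good"],
    "Age":         [42, 65],
    "Education":   [17, 10],
    "Urban":       ["Yes", "Yes"],
    "US":          ["Yes", "No"],
})

# One-hot encode the categorical columns; the target is Sales.
X = pd.get_dummies(df.drop(columns="Sales"),
                   columns=["ShelveLoc", "Urban", "US"])
y = df["Sales"]
print(X.columns.tolist())
```

Each categorical level becomes its own indicator column (e.g. ShelveLoc_Bad, ShelveLoc_Good), which the regression can then consume directly.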