COMP3411/COMP9814 23T1 Artificial Intelligence Assignment 2 - Reward-based learning agents
Due: Week 7, Thursday, 30 March 2023, 10 pm
Activities
In this assignment, you are asked to implement modified versions of the temporal-difference Q-learning and SARSA algorithms. The modified algorithms will use a modified version of the ϵ-greedy action selection method. You will use another version of gridworld. The new code can be found here. The modification of the method includes the following two aspects:
- Random numbers will be obtained sequentially from a file.
- When exploring, the worst action will be taken instead of a random one.
The random numbers are available in the file random_numbers.txt. The file contains 500,000 random numbers between 0 and 1, generated with seed = 999 using numpy.random.random as follows:
import numpy as np
np.random.seed(999)
random_numbers=np.random.random(500000)
np.savetxt('random_numbers.txt', random_numbers)
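If it is helpful, the saved numbers can be loaded back and consumed in the same order, for example (a minimal sketch; the helper name next_random and the index variable are only illustrative):

import numpy as np

random_numbers = np.loadtxt('random_numbers.txt')  # load all numbers once
rnd_index = 0  # position of the next unused number

def next_random():
    # Return the next random number from the file, in order.
    global rnd_index
    rnd = random_numbers[rnd_index]
    rnd_index += 1
    return rnd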
1.1 Part 1 (5 marks)
Modifying the TODO sections in Qlearning.py, create an action selection method that receives a state as an argument and returns an action. Consider the following:
- The method must use the random numbers in the provided file sequentially.
- If the random number rnd < ϵ, the method returns an exploratory action, which in this case will be the worst action, i.e., the one with the lowest Q-value. Otherwise, the method returns the greedy action, i.e., the one with the greatest Q-value.
- If more than one action shares the lowest or greatest Q-value, the first occurrence should be considered.
- To read the random numbers, you could either load them into an array (or similar structure) and keep an index of the current position, or read the file line by line (see the sketch after this list).
- The defined method will be used in the do_step method when selecting actions.
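A minimal sketch of such an action selection method is shown below. The attribute names used here (self.random_numbers, self.rnd_index, self.q_table, self.epsilon) are assumptions and may differ from those in Qlearning.py:

import numpy as np

def select_action(self, state):
    # Consume the next random number from the file, in order.
    rnd = self.random_numbers[self.rnd_index]
    self.rnd_index += 1
    q_values = self.q_table[state]  # assumed: Q-values of all actions in this state
    if rnd < self.epsilon:
        # Explore: take the worst action (lowest Q-value);
        # np.argmin returns the first occurrence on ties.
        return int(np.argmin(q_values))
    # Exploit: take the greedy action (greatest Q-value);
    # np.argmax also returns the first occurrence on ties.
    return int(np.argmax(q_values))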
1.2 Part 2 (5 marks)
Modifying the TODO sections in SARSA.py, create the on-policy SARSA method. Consider the following for the implementation:
- Use the same modified version of the ϵ-greedy action selection method as in Q-learning.
- In this case, you should start reading random numbers from the beginning; however, take into account that SARSA will use two random numbers at each step.
- Implement the do_step method to perform the update of state-action pairs using SARSA (a rough sketch is given below).
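A rough sketch of the SARSA update is given below, assuming hypothetical attribute names (self.alpha, self.gamma, self.q_table) and the action selection method from Part 1; adapt it to the actual structure of SARSA.py:

def do_step(self, state, action, reward, next_state):
    # On-policy: the next action is chosen with the same modified
    # epsilon-greedy method, which consumes a second random number.
    next_action = self.select_action(next_state)
    # SARSA update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    td_target = reward + self.gamma * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
    return next_action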
...