
COMP3411/COMP9814 23T1 Artificial Intelligence Assignment 2 - Reward-based learning agents

Due: Week 7, Thursday, 30 March 2023, 10 pm

Activities

In this assignment, you are asked to implement modified versions of the temporal-difference Q-learning and SARSA algorithms. The modified algorithms will use a modified version of the ϵ-greedy action selection method. You will use another version of gridworld; the new code can be found here. The modification of the action selection method involves the following two aspects:

  • Random numbers will be obtained sequentially from a file.
  • When exploring, the worst action will be taken instead of a random one.

The random numbers are available in the file random_numbers.txt. The file contains 500,000 random numbers between 0 and 1, generated with seed = 999 using numpy.random.random as follows:

import numpy as np
np.random.seed(999)
random_numbers = np.random.random(500000)
np.savetxt('random_numbers.txt', random_numbers)
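To consume these numbers sequentially in your agent, one option is to load them all into an array once and step through them with an index. A minimal sketch, assuming the file sits in the working directory (the names random_numbers, random_index and next_random are illustrative, not part of the provided code):

import numpy as np

# Load all 500,000 pre-generated numbers once and keep a cursor,
# so that every call consumes the next number in file order.
random_numbers = np.loadtxt('random_numbers.txt')
random_index = 0

def next_random():
    global random_index
    rnd = random_numbers[random_index]
    random_index += 1
    return rnd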

1.1 Part 1 (5 marks)

Modifying the TODO sections in Qlearning.py, create an action selection method that receives a state as an argument and returns an action. Consider the following:

  1. The method must use the random numbers from the provided file sequentially.
  2. If the random number rnd < ϵ, the method returns an exploratory action, which in this case is the worst action, i.e., the one with the lowest Q-value. Otherwise, the method returns the greedy action, i.e., the one with the greatest Q-value.
  3. If more than one action shares the lowest or greatest Q-value, the first occurrence should be considered.
  4. To read the random numbers you could either load them into an array (or similar structure) and read each position while keeping an index, or read the file line by line.
  5. The defined method will be used in the do_step method when selecting actions (a sketch of such a selection method is given after this list).
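A minimal sketch of such a selection method, assuming the agent stores Q-values in a table indexed by state (each entry an array with one Q-value per action), keeps the loaded numbers in self.random_numbers with a cursor self.random_index, and exposes self.epsilon; these attribute names are illustrative and will differ from the ones used in the provided Qlearning.py:

import numpy as np

def select_action(self, state):
    # Consume the next random number from the file, strictly in order.
    rnd = self.random_numbers[self.random_index]
    self.random_index += 1

    q_values = self.q_table[state]  # Q-values of every action in this state

    if rnd < self.epsilon:
        # Explore: take the WORST action, i.e. the one with the lowest Q-value.
        # np.argmin returns the first occurrence when several actions tie.
        action = int(np.argmin(q_values))
    else:
        # Exploit: take the greedy action, i.e. the one with the highest Q-value.
        # np.argmax also returns the first occurrence on ties.
        action = int(np.argmax(q_values))
    return action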

1.2 Part 2 (5 marks)

Modifying the TODO sections in SARSA.py, implement the on-policy SARSA method. Consider the following for the implementation:

  • Use the same modified version of the ϵ-greedy action selection method as in Q-learning.
  • In this case, you should start reading the random numbers from the beginning of the file; however, take into account that SARSA will use two random numbers at each step.
  • Implement the do_step method to perform the update of state-action pairs using SARSA (a sketch is given after this list).
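For orientation, a minimal sketch of such an update, assuming hypothetical attributes self.env (with a step function returning the next state, the reward and a terminal flag), self.q_table, self.alpha and self.gamma, and the action selection method sketched in Part 1; the actual structure of SARSA.py may differ:

def do_step(self, state, action):
    # Apply the chosen action and observe the outcome.
    next_state, reward, done = self.env.step(action)

    # Choose the NEXT action with the same modified epsilon-greedy method;
    # this on-policy choice is what consumes the second random number per step.
    next_action = self.select_action(next_state)

    # SARSA update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    # (terminal-state handling is omitted here for brevity)
    td_target = reward + self.gamma * self.q_table[next_state][next_action]
    td_error = td_target - self.q_table[state][action]
    self.q_table[state][action] += self.alpha * td_error

    return next_state, next_action, reward, done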

...