Programming Assignment 3

Assigned: March 20

Due: April 10

In this assignment you will build a program that learns to play a simple game of chance in somewhat the same way that the AlphaZero program learned to play chess, Go, and other games. (We will discuss the differences later in the semester. This assignment includes the easy part of AlphaZero; what it leaves out is the hard part.) The game is a variant of the card game “Blackjack”. Two players alternately roll dice, and each keeps track of their own total across turns. Each player is trying to reach a total that lies in a specified target range, between a fixed low value and a fixed high value (inclusive). If a player reaches a total in the target range, they immediately win. If they exceed the high value, they immediately lose. On each turn, the player chooses how many dice to roll, between 1 and a fixed maximum. The game thus has four parameters:

NSides, the number of sides of the die. The faces are numbered 1 to NSides and all outcomes are equally likely.
LTarget, the lowest winning value.
UTarget, the highest winning value.
NDice, the maximum number of dice a player may roll.

For instance, with NSides=6, LTarget=15, UTarget=17, NDice=2, the following are three possible games. (The players are not necessarily playing well in the games below, just legally.)

Game 1:
Player A rolls 2 dice, which come up 5 and 6. A total: 11.
Player B rolls 2 dice, which come up 3 and 4. B total: 7.
Player A rolls 2 dice, which come up 5 and 5. A total: 21. A loses.
Game 2:
Player A rolls 2 dice, which come up 3 and 4. A total: 7.
Player B rolls 2 dice, which come up 5 and 6. B total: 11.
Player A rolls 2 dice, which come up 3 and 1. A total: 11.
Player B rolls 1 die, which comes up 4. B total: 15. B wins.
Game 3:
Player A rolls 2 dice, which come up 3 and 4. A total: 7.
Player B rolls 2 dice, which come up 1 and 4. B total: 5.
Player A rolls 2 dice, which come up 2 and 5. A total: 14.
Player B rolls 2 dice, which come up 3 and 5. B total: 13.
Player A rolls 1 die, which comes up 1. A total: 15. A wins.
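
The rules above can be sketched in Python as follows. This is only an illustration, not required code: the names GameParams, play_game, choose_dice, and roll_die are hypothetical, and the choice to record each move as a tuple (X, Y, J) is an assumption made for convenience. The dice roller and the move chooser are passed in as functions so that a game can be played randomly or replayed from a fixed list of rolls.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameParams:
    """The four game parameters (hypothetical container, not part of the handout)."""
    n_sides: int   # NSides
    l_target: int  # LTarget
    u_target: int  # UTarget
    n_dice: int    # NDice


def play_game(params, choose_dice, roll_die):
    """Play one game and return (winner, history).

    winner is 0 for Player A and 1 for Player B.
    history holds one (x, y, j) tuple per move: the mover's total, the
    opponent's total, and the number of dice the mover chose to roll.
    """
    scores = [0, 0]
    history = []
    mover = 0  # Player A moves first
    while True:
        x, y = scores[mover], scores[1 - mover]
        j = choose_dice(x, y)
        if not 1 <= j <= params.n_dice:
            raise ValueError(f"illegal number of dice: {j}")
        history.append((x, y, j))
        scores[mover] += sum(roll_die() for _ in range(j))
        if scores[mover] > params.u_target:
            return 1 - mover, history  # mover exceeded UTarget and loses
        if scores[mover] >= params.l_target:
            return mover, history      # mover landed in the target range and wins
        mover = 1 - mover              # otherwise the other player moves


# Example: both players choose the number of dice uniformly at random.
params = GameParams(n_sides=6, l_target=15, u_target=17, n_dice=2)
rng = random.Random(0)
winner, history = play_game(
    params,
    choose_dice=lambda x, y: rng.randint(1, params.n_dice),
    roll_die=lambda: rng.randint(1, params.n_sides),
)
print("winner:", "AB"[winner], "moves:", history)
```

Replaying Game 1 with this sketch (always 2 dice, rolls 5, 6, 3, 4, 5, 5) ends with Player B as the winner after three recorded moves.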

The learning algorithm to be used is as follows: The machine plays a series of games with itself, playing both sides. Initially, both players play randomly; as the series progresses, they should play better and better. The program maintains two three-dimensional integer matrices of size LTarget × LTarget × (NDice + 1); these dimensions suffice because a game continues only while both totals are below LTarget. The matrices are WinCount[X,Y,J] and LoseCount[X,Y,J], where X is the current point count for the player about to play; Y is the point count for the opponent; and J is the number of dice the current player rolls. The value of WinCount[X,Y,J] is the number of times that the current player has eventually won when the state was ⟨X, Y⟩ and the current player rolled J dice. LoseCount[X,Y,J] is the number of times that they lost. After each game, these two matrices get updated, reflecting the result of the game. For instance, after completing Game 1 above:

Game 1:
Player A rolls 2 dice, which come up 5 and 6. A total: 11.
Player B rolls 2 dice, which come up 3 and 4. B total: 7.
Player A rolls 2 dice, which come up 5 and 5. A total: 21. A loses.

the result will be that LoseCount[0,0,2], LoseCount[11,7,2], and WinCount[0,11,2] will all be incremented by 1. After completing Game 3:

Game 3:
Player A rolls 2 dice, which come up 3 and 4. A total: 7.
Player B rolls 2 dice, which come up 1 and 4. B total: 5.
Player A rolls 2 dice, which come up 2 and 5. A total: 14.
Player B rolls 2 dice, which come up 3 and 5. B total: 13.
Player A rolls 1 die, which comes up 1. A total: 15. A wins.

the result will be that WinCount[0,0,2], LoseCount[0,7,2], WinCount[7,5,2], LoseCount[5,14,2], and WinCount[14,13,1] will all be incremented by 1.
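
The update after a game can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names new_counts and update_counts are hypothetical, the matrices are stored as nested lists (any 3-D integer array would do), and a finished game is assumed to be recorded as a list of (X, Y, J) tuples in move order, with Player A making the first move.

```python
def new_counts(l_target, n_dice):
    """Return a zeroed LTarget x LTarget x (NDice + 1) matrix as nested lists.

    Index J = 0 is never used, since a player rolls at least one die.
    """
    return [[[0] * (n_dice + 1) for _ in range(l_target)]
            for _ in range(l_target)]


def update_counts(win_count, lose_count, history, winner):
    """Credit every move of a finished game to WinCount or LoseCount.

    history[i] = (x, y, j) is the i-th move: mover's total, opponent's
    total, dice rolled. Player A (0) makes the even-numbered moves and
    Player B (1) the odd-numbered ones. winner is 0 or 1.
    """
    for i, (x, y, j) in enumerate(history):
        mover = i % 2
        if mover == winner:
            win_count[x][y][j] += 1
        else:
            lose_count[x][y][j] += 1


# Game 3 from the text, with LTarget=15 and NDice=2; Player A (0) wins.
win_count = new_counts(15, 2)
lose_count = new_counts(15, 2)
game3 = [(0, 0, 2), (0, 7, 2), (7, 5, 2), (5, 14, 2), (14, 13, 1)]
update_counts(win_count, lose_count, game3, winner=0)
print(win_count[14][13][1], lose_count[5][14][2])
```

Each move is credited from the point of view of the player who made it, which is why Player B's moves in Game 3 land in LoseCount while Player A's land in WinCount.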

...
