In a previous story, we introduced the concept of Q-learning. Today, we will implement that concept and build a Q-learner in Python. Let’s get started!

The code in this story is part of our *MAD from scratch* project where *MAD* stands for *machine learning, artificial intelligence, and data science*. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

## Table of Contents:

- Updating rule of a Q-learner
- A simple Q-learning example
- Initialization of Q-values
- Pseudo-code for Q-learner
- Implement Q-learner in Python
- How good is our Q-learner?
- ADP learner vs MC learner vs Q-learner

## Updating rule of a Q-learner

Before we jump into coding, let’s quickly review how a Q-learner works. As the name implies, the learning target of a Q-learner is Q-values of state-action pairs. In other words, we are using Q-learner to learn *Q*(*s*, *a*).

For any MDP, we always have

*V*(*s*) = max*ₐ* *Q*(*s*, *a*)

In a previous story, we explained this relation between the values of states and the values of state-action pairs.

Assume at a certain step of a game, we are at state *s*, we choose an action *a*, and we arrive at the next state *s*′. For this particular step, we experienced the state-action pair (*s*, *a*), and the result of this experience can be assessed by

*r*(*s*) + *γ* max*ₐ* *Q*(*s*′, *a*)

Using the concept of updating and step size we introduced in our previous story, we know that this assessed value is the *Target* learned in this step. In other words, from this step, we estimate that *Q*(*s*, *a*) should be a value *somewhere* near this.

Let’s break this into details:

- *Old Estimate*: Q-value of that state-action pair before this step, *Q*(*s*, *a*)[*i*−1]
- *New Estimate*: Q-value of that state-action pair after this step, *Q*(*s*, *a*)[*i*]
- *Target*: *r*(*s*) + *γ* max*ₐ* *Q*(*s*′, *a*)
- *Step Size*: an arbitrary constant value *α*, also called *learning rate*

Now, we have the updating rule of Q-learning:

*Q*(*s*, *a*)[*i*] = *Q*(*s*, *a*)[*i*−1] + *α* (*r*(*s*) + *γ* max*ₐ* *Q*(*s*′, *a*) − *Q*(*s*, *a*)[*i*−1])

## A simple Q-learning example

To visualize how a Q-learner works, let’s use an extremely simple example. Assume we have four states in a game, from *s*₁ to *s*₄, with *s*₄ as the ending state. Only one action *a* is available in each state. For the sole purpose of demonstration, this game has no stochasticity: whenever we play it, we move from *s*₁ to *s*₂ to *s*₃ to *s*₄ and receive the rewards *r*(*s*₁) = −2, *r*(*s*₂) = 4, and *r*(*s*₃) = 1.

Let’s use a discount factor *γ* of 0.5. In theory, what are the Q-values associated with each state-action pair?

Because only one action *a* is available in each state in this simple example, we have

*V*(*sᵢ*) = max*ₐ* *Q*(*sᵢ*, *a*) = *Q*(*sᵢ*, *a*)

Since we experience the series of states [*s*₁, *s*₂, *s*₃, *s*₄] in every game, we have

*Q*(*sᵢ*, *a*) = *r*(*sᵢ*) + *γ* *Q*(*s*ᵢ₊₁, *a*)

In theory, the value of *s*₄ does not affect any choice of policy because *s*₄ is an ending state. Here, we simply assume *Q*(*s*₄, *a*) is 0. We can then calculate the Q-values theoretically:

*Q*(*s*₃, *a*) = *r*(*s*₃) + *γ* *Q*(*s*₄, *a*) = 1 + 0.5 × 0 = 1
*Q*(*s*₂, *a*) = *r*(*s*₂) + *γ* *Q*(*s*₃, *a*) = 4 + 0.5 × 1 = 4.5
*Q*(*s*₁, *a*) = *r*(*s*₁) + *γ* *Q*(*s*₂, *a*) = −2 + 0.5 × 4.5 = 0.25

Now let’s see if a Q-learner can achieve these same Q-values. For now, we initialize all the Q-values as zero at the beginning. We will discuss how initialization affects the Q-learner later in this story.

We play our first game, and get the results of each state-action pair:

Now we can apply our updating rule of Q-learning. For now, we use a step size or learning rate of *α* = 0.3:

*Q*(*s*₁, *a*) = 0 + 0.3 × (−2 + 0.5 × 0 − 0) = −0.6
*Q*(*s*₂, *a*) = 0 + 0.3 × (4 + 0.5 × 0 − 0) = 1.2
*Q*(*s*₃, *a*) = 0 + 0.3 × (1 + 0.5 × 0 − 0) = 0.3

After the first game, our Q-values are updated from all zeros to [−0.6, 1.2, 0.3]. Now let’s play the game again.

Then we play the game for the third time:

Then we repeat this game for another time…

OK, now we get the idea. Let’s use a simple Python script to check the results. To drive the point home, no vectorization is used in this code snippet.

If we run this snippet, we get this result:

As shown in the figure, at the beginning, the Q-values for all three state-action pairs are zero. After the first game, they are −0.6, 1.2, and 0.3, respectively. After the second game, they become −0.84, 2.085, and 0.51, respectively. After about 20 games, the Q-values converge to their theoretical values of [0.25, 4.5, 1] as calculated previously, indicating that our Q-learner correctly learns the Q-values we want it to learn.

## Initialization of Q-values

During this simple example, we mentioned that we initialized all three Q-values as zero. It turns out that they do not have to be zeros. In fact, these Q-values can start from literally any value. For example, let’s try initializing the Q-values as 10, 2, and 8, respectively.

They converge to [0.25, 4.5, 1] again.

Or maybe initialize them as 15, -20, and 10.

Still [0.25, 4.5, 1].

Why is that? Why doesn’t the initialization of Q-values affect the final results? Let’s use *Q*(*s*₃, *a*) as an example. As we only have one action in this example, we write *Q*(*sᵢ*, *a*) as *Q*(*sᵢ*) for short.

After game 1, we have

*Q*(*s*₃)[1] = (1 − *α*) *Q*(*s*₃)[0] + *α* (*r*(*s*₃) + *γ* *Q*(*s*₄)[0])

Here *Q*(*s*₃)[0] is our initial value, or our initial *guess*, while *r*(*s*₃) + *γ* *Q*(*s*₄)[0] is determined by this game and is a *somewhat correct* value. If *α* is 0.3, this means that before the first game, the Q-value is 100% guess, while after the first game, the Q-value is 70% guess and 30% *learned*.

After the second game, we have

*Q*(*s*₃)[2] = (1 − *α*) *Q*(*s*₃)[1] + *α* (*r*(*s*₃) + *γ* *Q*(*s*₄)[1])
= (1 − *α*)² *Q*(*s*₃)[0] + (1 − *α*) *α* (*r*(*s*₃) + *γ* *Q*(*s*₄)[0]) + *α* (*r*(*s*₃) + *γ* *Q*(*s*₄)[1])

Now the Q-value after the second game is 49% guess and 51% learned.

Similarly, after the third game,

*Q*(*s*₃)[3] = (1 − *α*) *Q*(*s*₃)[2] + *α* (*r*(*s*₃) + *γ* *Q*(*s*₄)[2]) = (1 − *α*)³ *Q*(*s*₃)[0] + learned terms

Now the Q-value is 34.3% guess and 65.7% learned.

By the same token, we can derive that after *k* games, only (1 − *α*)*ᵏ* of our Q-value comes from the initial guess. The rest comes from what we *learned*. If we use 0.3 as *α*, after just 10 games the initial guess contributes only 2.8% of the Q-value, and after 20 games that ratio drops further to 0.08%. That is why, no matter what our initial guess is, we always converge to the same results.
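We can confirm this numerically. The sketch below reruns the chain example from the three initializations tried above; regardless of the starting guess, the printed Q-values settle at the same place:

```python
# Rerun the chain example from several different initial Q-values.
rewards = [-2, 4, 1]   # r(s1), r(s2), r(s3); the terminal state s4 gives 0
gamma, alpha = 0.5, 0.3

def run(q_init, n_games):
    q = list(q_init)
    for _ in range(n_games):
        for i in range(3):
            q_next = q[i + 1] if i + 1 < 3 else 0.0
            q[i] += alpha * (rewards[i] + gamma * q_next - q[i])
    return q

for q_init in ([0, 0, 0], [10, 2, 8], [15, -20, 10]):
    # Every initialization converges toward [0.25, 4.5, 1].
    print(q_init, "->", [round(v, 3) for v in run(q_init, 50)])
```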

Of course, this is an over-simplified example. In real-world applications, because of stochasticity, the convergence process is usually much longer than this. Nevertheless, because of this property of the Q-learner we showed here, a Q-learner will converge to the desired Q-values as long as we have enough training episodes.

## Pseudo-code for Q-learner

Our Q-learner uses a similar API to the previous ADP learner and MC learner, with separate percept and actuate functions. Here we implement the updating rule of the Q-learner explained above in the percept function.

The actuate function is similar to that of the previous MC learner.

Compared to an MC learner, which updates its policy game-by-game, a Q-learner updates its policy step-by-step, as shown in line 4 of the pseudo-code of the Q-learner percept function. As a result, only the probability threshold *ϵ* is updated between games to control the transition from exploration to exploitation. This could also be done step-by-step: if we chose that approach, we would add it to our percept function and there would be no need for a separate episode-update function. Here, for the purpose of comparing with the previous learners, we keep this function and update *ϵ* game-by-game.
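As a rough illustration of the functions described above (not the repo’s actual implementation — the method signatures and ϵ-greedy details here are assumptions), a Q-learner class might be sketched like this:

```python
import random

class QLearner:
    def __init__(self, n_states, n_actions, gamma=0.9, alpha=0.3,
                 epsilon=1.0, xi=0.99):
        self.q = [[0.0] * n_actions for _ in range(n_states)]  # Q(s, a) table
        self.gamma, self.alpha = gamma, alpha
        self.epsilon, self.xi = epsilon, xi  # exploration threshold and its decay
        self.prev_state = None
        self.prev_action = None

    def actuate(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            action = random.randrange(len(self.q[state]))
        else:
            action = max(range(len(self.q[state])),
                         key=lambda a: self.q[state][a])
        self.prev_state, self.prev_action = state, action
        return action

    def percept(self, reward, next_state):
        # Step-by-step update toward the target r(s) + gamma * max_a Q(s', a).
        q_sa = self.q[self.prev_state][self.prev_action]
        target = reward + self.gamma * max(self.q[next_state])
        self.q[self.prev_state][self.prev_action] = q_sa + self.alpha * (target - q_sa)

    def episode_update(self):
        # Decay epsilon between games to shift from exploration to exploitation.
        self.epsilon *= self.xi
```

On the one-action chain example from earlier, repeatedly calling actuate, percept, and episode_update drives the Q-table to the same [0.25, 4.5, 1] values.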

As with the ADP learner and MC learner, we use the following pseudo-code to repeatedly play the game *N* times.
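A hypothetical Python version of that driver loop could look like the sketch below. The environment interface (reset, step, and reward methods) is invented here for illustration and is not the repo’s actual API:

```python
# Repeatedly play the game N times, updating the learner step-by-step.
def train(env, learner, n_episodes):
    totals = []
    for _ in range(n_episodes):
        state = env.reset()          # start a new game
        total, done = 0.0, False
        while not done:
            action = learner.actuate(state)
            next_state, done = env.step(action)
            reward = env.reward(state)       # r(s) of the state we acted from
            learner.percept(reward, next_state)
            total += reward
            state = next_state
        total += env.reward(state)   # collect the ending state's reward
        learner.episode_update()     # update epsilon between games
        totals.append(total)
    return totals
```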

## Implement Q-learner in Python

We then use the following solver to see how our Q-learner performs in our example problem.

To compare with the previous ADP and MC learners, we train our Q-learner for the same 1000 episodes.

## How good is our Q-learner?

Again, we use the same example problem from the book *Artificial Intelligence: A Modern Approach* to assess the performance of our Q-learner.

If we set *ξ* as 0.99, we have the following results. Initially, our Q-values are 100% guess. As a result, we get total rewards of −1700, −900, −400… As our Q-learner learns more and more, we get better and better results.

If we change *ξ* to 0.999, we explore more and delay the exploitation. As a result, the curve of the total rewards rises more slowly than in the case of *ξ* = 0.99.
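To see why, assume *ϵ* starts at 1 and is multiplied by *ξ* after every game (the game-by-game update described earlier). A quick calculation shows how much exploration remains after 400 games under the two settings:

```python
# Remaining exploration probability after k games, assuming epsilon
# is multiplied by xi once per game.
def epsilon_after(xi, k, eps0=1.0):
    return eps0 * xi ** k

print(epsilon_after(0.99, 400))   # ~0.018: almost pure exploitation
print(epsilon_after(0.999, 400))  # ~0.67: still exploring two-thirds of the time
```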

## ADP learner vs MC learner vs Q-learner

Let’s compare the results from our ADP learner, MC learner, and Q-learner. If we look at the winning percentage with *ξ* as 0.99, after about 400 games both the ADP learner and the Q-learner achieve a 100% winning percentage. The MC learner performs the worst of the three, with a final winning percentage of about 80% after 1000 games.

It seems that the ADP learner and Q-learner perform equally well. However, winning the game is not the whole story. We also need to look at the training time of these learners. Here we use wall-clock time for a simple comparison. For the case of *ξ* as 0.99, starting with the same random seed and training for the same 1000 games, the ADP learner takes roughly 72 seconds, the MC learner 43 seconds, and the Q-learner 15.5 seconds. If we take training time into consideration, the Q-learner is the best performer, with the shortest training time and a final 100% winning percentage. Of the other two, the MC learner uses less training time but suffers from a lower winning percentage, while the ADP learner has the best winning percentage but suffers from the longest training time.

So that’s our simple example and code implementation of Q-learning. Feel free to leave a comment if you find this useful or have any suggestions. We will talk about more topics on machine learning and AI. Stay tuned and see you next time. 😄