Let's use "reinforcement learning" to train a self-driving taxi on the most efficient routes for picking up and dropping off passengers.
To start with, we'll create a "driving area grid" via the Python "OpenAI Gym" library.
KEY: in the rendered grid below, the letters R, G, B, and Y mark the four pickup/dropoff locations, the highlighted square is the taxi's current position, "|" is a wall the taxi cannot drive through, and ":" is an open street.
import gym
import random

random.seed(1234)                  # seed Python's random number generator so the exploration choices are repeatable

streets = gym.make("Taxi-v2").env  # build the "Taxi-v2" grid-world environment
streets.render()                   # draw the current state of the grid
This 5x5 "streets" grid is defined by:
- 25 possible taxi positions (the 5x5 grid of squares)
- 5 possible passenger locations (the four pickup points R, G, B, and Y, plus "inside the taxi")
- 4 possible destinations (R, G, B, or Y)
This means that there are 25 x 5 x 4 = 500 possible grid "states", and in each of those states the taxi can take one of the following 6 "actions" (a quick sanity check of the state numbering follows this list):
- Move South
- Move North
- Move East
- Move West
- Pickup the passenger
- Dropoff the passenger
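Out of curiosity, you can check how those three components get packed into a single state number: the Taxi environment provides encode() and decode() helpers for exactly this. Here's a minimal sketch, reusing the streets environment created above (the example values are arbitrary):
example_state = streets.encode(1, 4, 2, 3)  # taxi at row 1, col 4; passenger at location 2; destination 3
print(example_state)                        # a single integer between 0 and 499
print(list(streets.decode(example_state)))  # recovers [1, 4, 2, 3]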
Let's use "Q-Learning" for our reinforcement learning algorithm. The "Q" stands for the "Quality" value learned for each state/action pair, driven by the rewards and penalties the environment hands out (the update rule Q-Learning applies to these rewards is shown right after this list):
- A successful dropoff earns 20 points
- Every time step spent on a trip costs 1 penalty point
- Picking up or dropping off at an illegal location costs 10 penalty points
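Concretely, after every simulated step Q-Learning nudges the Q value of the state/action pair it just took toward the reward it received plus the discounted value of the best action available from the state it landed in:

Q(state, action) <- (1 - learning_rate) * Q(state, action) + learning_rate * (reward + discount_factor * max Q(next_state, over all actions))

This is exactly the update applied inside the training loop further below; learning_rate controls how quickly new information overwrites old estimates, and discount_factor controls how much future rewards count relative to immediate ones.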
...establish the initial settings for our taxi:
...and examine the grid and "reward table" for this initial state.
initial_state = streets.encode(2, 3, 2, 0)  # taxi at (row 2, col 3); passenger at location 2; destination location 0
streets.s = initial_state                   # force the environment into that state
streets.render()
streets.P[initial_state]                    # the "reward table" for this state
The above output has one row for each of our 6 possible actions (move South, North, East, or West, pickup, or dropoff, in that order), with each row containing:
- the probability assigned to that action from this state
- the state the taxi would move to next
- the reward for taking that action
- whether that action results in a successful dropoff (ending the trip)
So, given this starting point, the first row shows that moving South would put the taxi into state number 368, subtract 1 "step taken" penalty point, and would not result in a successful dropoff.
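To make those four fields easier to read, here's a small optional check that unpacks the first row (the variable names below are just labels added for illustration; streets.P[state][action] holds a single (probability, next_state, reward, done) tuple because this grid's moves are deterministic):
probability, next_state, reward, done = streets.P[initial_state][0][0]  # row 0 = move South
print(probability)  # 1.0 -- this grid has no randomness in its moves
print(next_state)   # 368 -- the state reached by moving South
print(reward)       # -1  -- the per-step penalty
print(done)         # False -- no successful dropoff yet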
Our next step is to train our taxi over 10,000 simulated runs. At each step, there will be a 10% chance of taking a random, exploratory action and a 90% chance of taking the action with the highest Q value learned so far.
import numpy as np
q_table = np.zeros([streets.observation_space.n, streets.action_space.n])
learning_rate = 0.1    # how strongly each new experience overrides the old Q value
#learning_rate = 0.5
discount_factor = 0.6  # how much future rewards count relative to immediate ones
exploration = 0.1      # chance of taking a random, exploratory action instead of the best-known one
#exploration = 0.5
epochs = 10000         # number of simulated trips to train over
for taxi_run in range(epochs):
    state = streets.reset()
    done = False

    while not done:
        random_value = random.uniform(0, 1)
        if (random_value < exploration):
            action = streets.action_space.sample() # Explore a random action
        else:
            action = np.argmax(q_table[state]) # Use the action with the highest q-value

        next_state, reward, done, info = streets.step(action)

        # Q-Learning update: blend the old Q value with the reward just received
        # plus the discounted best Q value of the state we landed in.
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q

        state = next_state
Now that we have a table of Q values for guiding our "optimal next step", let's look at the values for our initial state.
q_table[initial_state]
The 4th value (Move West, action index 3) is the highest. This makes sense, since moving West is our most direct path toward picking up our passenger from our initial state.
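Since those six Q values are indexed in the same order as the actions, a small optional snippet can report the greedy choice by name (the action_names list below is just a readability label I'm adding, matching the Taxi environment's action ordering):
action_names = ["move South", "move North", "move East", "move West", "pickup", "dropoff"]
best_action = np.argmax(q_table[initial_state])
print("Best action from the initial state: " + action_names[best_action]) # expect "move West"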
Now let's animate the taxi's behavior given our learned Q values.
from IPython.display import clear_output
from time import sleep
#numTrips = 500
numTrips = 10
totalTripSteps = 0
for tripnum in range(1, numTrips + 1):
    state = streets.reset()
    done = False
    trip_length = 0

    while not done and trip_length < 25:
        action = np.argmax(q_table[state]) # Always follow the learned (greedy) policy
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render(mode='ansi'))
        sleep(.5)
        state = next_state
        trip_length += 1

    totalTripSteps += trip_length
    sleep(2)
avgStepsPerTrip = totalTripSteps / numTrips
print("Average Steps Per Trip: " + str(avgStepsPerTrip))