How To Build A Self-Driving Taxi With Reinforcement Learning

Let's use "reinforcement learning" to train a self-driving on the most efficient routes for picking up and dropping off passengers.

To start with, we'll create a "driving area grid" via the Python "OpenAI Gym" library.

KEY:

  • R/G/B/Y = Pickup / dropoff sites
  • BLUE letter = Pickup site
  • MAGENTA letter = Dropoff site
  • SOLID LINE = Wall taxi cannot cross
  • YELLOW-FILLED RECTANGLE = Empty taxi
  • GREEN-FILLED RECTANGLE = Taxi with passenger on board

In [6]:
import gym
import random

random.seed(1234)

streets = gym.make("Taxi-v2").env
streets.render()
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

This 5x5 "streets" grid is defined by:

  • 25 possible taxi locations
  • 4 possible destinations
  • 5 possible passenger locations (either inside taxi or at one of the 4 destinations)

This means that there are 25 x 4 x 5 = 500 possible grid "states", and for each of these 500 states we will learn a "quality" (Q) value for each of the following 6 "actions", listed in the order the environment numbers them (0-5); a quick sanity check of these counts follows the list:

  • Move SOUTH
  • Move NORTH
  • Move EAST
  • Move WEST
  • PICKUP Passenger
  • DROPOFF Passenger
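
As a quick sanity check (a small sketch, reusing the "streets" environment created above), we can ask the environment for these counts directly:

# Confirm the sizes of the state and action spaces match the 500 / 6 counts above.
print(streets.observation_space.n)   # 500 possible states
print(streets.action_space.n)        # 6 possible actions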

Let's use "Q-Learning" for our reinforcement learning algorithm and assign the following "Quality" points for each state:

  • CORRECT PICKUP / DROPOFF = +20 points
  • INCORRECT PICKUP / DROPOFF = -10 points
  • STEP TAKEN EN ROUTE TO DESTINATION = -1 point
  • STEP TAKEN ACROSS WALL = (not possible; the taxi simply stays in place)

...establish the initial settings for our taxi:

  • Initial taxi location: (2, 3)
  • Passenger pickup location: 2 (the "Y" square)
  • Destination: 0 (the "R" square)

...and examine the grid and "reward table" for this initial state.

In [7]:
initial_state = streets.encode(2, 3, 2, 0) # (taxi_row, taxi_col, passenger_location, destination)
streets.s = initial_state

streets.render()
streets.P[initial_state]
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

Out[7]:
{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}

The above reward table is a dictionary with one entry for each of our 6 possible actions (move S/N/E/W, pickup, or dropoff), with each entry containing:

  • Transition probability = The probability that the action produces this outcome (always 1.0 here, since the environment is deterministic)
  • State number = Which of our 500 possible states this action leads to
  • Reward points = The reward / penalty for taking this action
  • Whether a successful dropoff takes place (ending the trip)

So, given this starting point, the first row shows that moving South (action 0) would put the taxi into state number 368, subtract 1 "step taken" penalty point, and would not result in a successful dropoff.
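
We can sanity-check that reading of the table by decoding state 368 back into its parts (a small sketch using the environment's decode() helper; the expected output assumes the same starting state as above):

# Split state 368 back into (taxi_row, taxi_col, passenger_location, destination).
# Starting from (2, 3) and moving South should leave the taxi at row 3, column 3,
# with the passenger location (2) and destination (0) unchanged.
print(list(streets.decode(368)))   # expected: [3, 3, 2, 0]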

Our next step is to train our taxi over 10,000 simulated runs. At each step, there will be a 10% chance of taking a random, exploratory action and a 90% chance of taking the action with the highest learned Q value.

In [8]:
import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1
#learning_rate = 0.5
discount_factor = 0.6
exploration = 0.1
#exploration = 0.5
epochs = 10000

for taxi_run in range(epochs):
    state = streets.reset()
    done = False
    
    while not done:
        random_value = random.uniform(0, 1)
        if (random_value < exploration):
            action = streets.action_space.sample() # Explore a random action
        else:
            action = np.argmax(q_table[state]) # Use the action with the highest q-value
            
        next_state, reward, done, info = streets.step(action)
        
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        # Q-learning update: blend the old Q value with the observed reward plus the discounted best future Q value
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q
        
        state = next_state

Now that we have a table of Q values for guiding our "optimal next step", let's look at the values for our initial state.

In [9]:
q_table[initial_state]
Out[9]:
array([-2.41254836, -2.41406562, -2.40893493, -2.3639511 , -8.08002866,
       -7.27836681])

The 4th value (Move WEST, action 3) is the highest. This makes sense, since moving West heads directly toward the passenger waiting at "Y", whom we must pick up before driving to the destination at "R".
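
We can confirm that reading programmatically (a small sketch, assuming the q_table trained above):

# The index of the largest Q value is the action the greedy policy would take
# from the initial state; index 3 is "move West".
best_action = np.argmax(q_table[initial_state])
print(best_action)   # expected: 3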

Now let's animate the taxi's behavior given our learned Q values.

In [10]:
from IPython.display import clear_output
from time import sleep

#numTrips = 500
numTrips = 10
totalTripSteps = 0
for tripnum in range(1, numTrips + 1):
    state = streets.reset()
   
    done = False
    trip_length = 0
    
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render(mode='ansi'))
        sleep(.5)
        state = next_state
        trip_length += 1
    totalTripSteps += trip_length   
    sleep(2)
avgStepsPerTrip = totalTripSteps / numTrips
print("Average Steps Per Trip: " + str(avgStepsPerTrip))
Trip number 10 Step 9
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Average Steps Per Trip: 15.5
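
The animated loop above only samples 10 trips (with a 25-step cap and rendering delays), so the reported average is fairly noisy. For a steadier estimate, we can replay the learned greedy policy over many more trips without any rendering; this is a small sketch that reuses the q_table and the same 25-step cap:

# Evaluate the greedy policy over many trips, without rendering,
# to get a more reliable average trip length.
eval_trips = 500
total_steps = 0
for _ in range(eval_trips):
    state = streets.reset()
    done = False
    steps = 0
    while not done and steps < 25:
        action = np.argmax(q_table[state])
        state, reward, done, info = streets.step(action)
        steps += 1
    total_steps += steps
print("Average Steps Per Trip: " + str(total_steps / eval_trips))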