
Our modern lives are enabled by algorithms. Think of the last time you ordered a T-shirt online. Behind the scenes, a complex network of servers handled your order, your credit card information, and the delivery. Each of these steps involves decisions, often made under constraints. Improving these algorithms, even by a fraction of a percent, can improve user experiences and cut costs for businesses.
Examples include,
This article explores ride-sharing as a case study because of its visual nature. The challenge is to match drivers and riders (passengers) in the most profitable way. I will explore different solutions, comparing traditional optimization with deep reinforcement learning techniques, and ultimately combine the two to outperform either approach on its own.
The concepts, techniques, and takeaways from this project can be applied to other resource-allocation problems like the ones described above.
<aside> 💡
If you want to create your own custom environment, take a look at the official Gymnasium guide on creating a custom environment.
</aside>
Unlike supervised or unsupervised learning, reinforcement learning does not start with a dataset to train on. Instead, the agent gathers experience by interacting with the world, progressively building its own dataset. Ideally, the agent would interact with the same world where it is deployed. However, that is often not possible; with self-driving vehicles, for example, incorrect decisions can lead to serious accidents. Consequently, we often build a virtual world that acts as a safe playground for the agent to explore and learn in. Instead of creating a new virtual world from scratch each time, one usually creates a gym environment. A gym environment follows the standardized format defined by the Gym API, originally developed by OpenAI and now maintained by the Farama Foundation. Every gym environment is a class with four core methods: reset, step, render, and close. By following this standard, we can easily experiment and swap in different algorithms.
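To make the API concrete, here is a minimal sketch of a class that follows the Gym interface shape: reset returns (observation, info) and step returns (observation, reward, terminated, truncated, info). The toy walk-to-a-goal task and all names below are my own illustration, not part of the ride-sharing environment; a real environment would subclass gymnasium.Env.

```python
class GoalWalkEnv:
    """Toy 1-D world: start at 0 and reach `goal` by moving left or right."""

    def __init__(self, goal: int = 5, max_steps: int = 20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self, seed=None):
        """Start a new episode; returns (observation, info)."""
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action: int):
        """Apply one action (0 = left, 1 = right).

        Returns (observation, reward, terminated, truncated, info),
        matching the shape of the Gym API's step method.
        """
        self.position += 1 if action == 1 else -1
        self.steps += 1
        terminated = self.position == self.goal   # reached the goal
        truncated = self.steps >= self.max_steps  # ran out of time
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}
```

Because every environment exposes the same reset/step contract, the interaction loop that trains an agent looks identical no matter which world sits behind it.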

Figure: the CartPole gym environment. The agent must learn to keep the pole from falling by moving the cart left or right.
The reset and step functions define the logic of the virtual world and are the subject of this article. The full class definition (and all related code) can be found in my Colab Notebook.
To model a ride-share service, we need drivers and riders. The assumptions in this article are,
Putting this into code gives us the following class definitions,
import numpy as np

class Driver:
    def __init__(self, start_position: list):
        # Store the position as a float array so it can be updated in place.
        self.position = np.array(start_position, dtype=float)
        self.speed = 1 / 3  # fraction of a grid cell traveled per step
        self.has_passenger = False
        self.pickup, self.dropoff = None, None

    @property
    def isavailable(self):
        """A driver is available when it has no assigned trip."""
        return self.pickup is None or self.dropoff is None

    def assign(self, pickup: list, dropoff: list):
        self.pickup, self.dropoff = pickup, dropoff

    def step(self):
        """Move toward the current target; pick up and drop off the passenger."""
        if self.isavailable:
            return
        # Successful drop-off: clear the assignment.
        if self.has_passenger and np.allclose(self.position, self.dropoff):
            self.has_passenger = False
            self.pickup, self.dropoff = None, None
            return
        # Pick up the passenger once we reach the pickup point.
        if np.allclose(self.position, self.pickup):
            self.has_passenger = True
        # Head toward the drop-off if carrying a passenger, else toward the pickup.
        destination = self.dropoff if self.has_passenger else self.pickup
        angle = np.arctan2(destination[1] - self.position[1],
                           destination[0] - self.position[0])
        # Round the direction vector so movement snaps to the 8 grid directions.
        self.position += self.speed * np.round(np.array([np.cos(angle), np.sin(angle)]))
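With Driver objects in hand, the environment's step function also needs to decide which driver serves which waiting rider. As a simple baseline for illustration (the greedy_match helper below is my own sketch, not code from the notebook), each pickup can be matched to the nearest still-free driver:

```python
import numpy as np

def greedy_match(driver_positions, pickup_positions):
    """Assign each pickup to the nearest still-unassigned driver.

    Returns a list of (driver_index, pickup_index) pairs. Greedy matching
    is a baseline for comparison; it is not the optimal assignment in general.
    """
    drivers = np.asarray(driver_positions, dtype=float)
    pickups = np.asarray(pickup_positions, dtype=float)
    free = set(range(len(drivers)))
    matches = []
    for j, pickup in enumerate(pickups):
        if not free:
            break  # more riders than available drivers
        # Nearest free driver to this pickup by Euclidean distance.
        i = min(free, key=lambda i: np.linalg.norm(drivers[i] - pickup))
        free.remove(i)
        matches.append((i, j))
    return matches
```

Because greedy matching commits to each rider one at a time, it can miss globally better pairings; that gap is exactly what the optimization and learning approaches later in the article try to close.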