🟡 Introduction to Reinforcement Learning: Training an Agent in a Simple Environment


Objective

The main goal of this project is to introduce you to the fundamentals of Reinforcement Learning (RL) by training an agent to solve a simple control task. You will learn how agents interact with environments, implement basic RL algorithms, and understand key concepts such as exploration vs. exploitation, state representation, and policy evaluation.


Learning Outcomes

By completing this project, you will:

  • Understand the core concepts of Reinforcement Learning, including agents, environments, states, actions, rewards, policies, and value functions.
  • Learn how to use OpenAI Gym to create and simulate environments.
  • Implement a basic RL algorithm, specifically Q-Learning, from scratch.
  • Gain experience in training an agent to perform a task and evaluating its performance.
  • Learn how to handle continuous state spaces through discretization.
  • Develop skills in visualizing and analyzing agent performance.

Prerequisites and Theoretical Foundations

1. Python Programming Fundamentals

  • Variables, data types, and basic operations.
  • Control structures (if/else, loops).
  • Functions and classes.
  • List comprehensions and basic data structures (lists, tuples, dictionaries).
Python code examples:
# Function definition
def greet(name):
    return f"Hello, {name}!"

# Class definition
class Agent:
    def __init__(self, name):
        self.name = name

    def act(self, state):
        # Decide on an action based on the state
        pass

# Control structures
for i in range(5):
    if i % 2 == 0:
        print(f"{i} is even")
    else:
        print(f"{i} is odd")

# List comprehension
squares = [x**2 for x in range(10)]

2. NumPy Essentials

  • Array creation and manipulation.
  • Basic array operations (addition, multiplication).
  • Array indexing and slicing.
  • Mathematical operations on arrays.
NumPy code examples:
import numpy as np

# Array creation
zeros_array = np.zeros((3, 3))
random_array = np.random.rand(3, 3)

# Element-wise operations
sum_array = zeros_array + random_array

# Indexing and slicing
first_row = random_array[0, :]
element = random_array[1, 2]

# Mathematical operations
mean_value = np.mean(random_array)
max_value = np.max(random_array)

3. Basic Mathematics

  • Linear Algebra Fundamentals: Vectors, matrices, and matrix operations.
  • Probability Concepts: Random variables, probability distributions, expected value.
  • Basic understanding of calculus is helpful but not mandatory.
Mathematical concepts and examples:

Probability Example:

  • Expected value of a random variable ( X ): [ E[X] = \sum_{x} x \cdot P(X = x) ] (a quick numeric check follows).
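
For instance, a minimal numeric check of this formula for a fair six-sided die (purely illustrative, not part of the project code):

import numpy as np

# Expected value of a fair six-sided die: E[X] = sum over x of x * P(X = x)
outcomes = np.arange(1, 7)           # possible values of X: 1, 2, ..., 6
probabilities = np.full(6, 1 / 6)    # uniform probabilities
expected_value = np.sum(outcomes * probabilities)
print(expected_value)  # ≈ 3.5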

Matrix Operations Example:

import numpy as np

# Define a transition matrix
transition_matrix = np.array([
    [0.7, 0.3],
    [0.4, 0.6]
])

# Current state vector
current_state = np.array([1, 0])  # Starting in state 0

# Compute next state probabilities
next_state_probs = np.dot(current_state, transition_matrix)
print(next_state_probs)  # Output: [0.7, 0.3]

List of Theoretical Concepts

Core RL Concepts
  1. Reinforcement Learning Framework:

    • Agent: The learner or decision-maker.
    • Environment: The external system the agent interacts with.
    • State (s): A representation of the current situation.
    • Action (a): Choices the agent can make.
    • Reward (r): Feedback signal indicating the desirability of the state/action.
  2. The RL Loop:

    1. The agent observes the current state ( s_t ).
    2. The agent selects an action ( a_t ) based on its policy.
    3. The environment transitions to a new state ( s_{t+1} ) and provides a reward ( r_{t+1} ).
    4. The agent updates its policy based on the experience (a minimal code sketch of this loop follows the key terms below).
  3. Key Terms:

    • Policy (( \pi )): The agent’s strategy for selecting actions, mapping states to actions.
    • Value Function: Estimates how good a state (or state-action pair) is in terms of expected future rewards.
    • Q-Function (( Q(s, a) )): The expected return (cumulative future reward) of taking action ( a ) in state ( s ) and following the policy thereafter.
    • Exploration vs. Exploitation: Balancing the act of trying new actions to discover their effects (exploration) and choosing known actions that yield high rewards (exploitation).
    • Learning Rate (( \alpha )): Determines how much new information overrides old information during learning.
    • Discount Factor (( \gamma )): Determines the importance of future rewards compared to immediate rewards.
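
To make steps 1–4 concrete, here is a minimal sketch of the interaction loop using the classic Gym API with a purely random policy (the same CartPole environment used later in this project; step 4 is only a comment because a random agent does not learn):

import gym

env = gym.make('CartPole-v1')
state = env.reset()                                     # 1. observe the current state s_t
done = False
while not done:
    action = env.action_space.sample()                  # 2. select an action a_t (here: at random)
    next_state, reward, done, info = env.step(action)   # 3. receive s_{t+1} and r_{t+1}
    # 4. a learning agent would update its policy here using (state, action, reward, next_state)
    state = next_state
env.close()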

Skills Gained

  • Understanding core Reinforcement Learning concepts and terminology.
  • Implementing basic RL algorithms like Q-Learning from scratch.
  • Using OpenAI Gym to simulate and interact with environments.
  • Handling continuous state spaces through discretization.
  • Implementing exploration strategies (e.g., epsilon-greedy).
  • Evaluating and visualizing agent performance.

Tools Required

  • Python 3.7+
  • NumPy: For numerical computations.
  • Matplotlib: For plotting and visualization.
  • OpenAI Gym: For environment simulation.
  • Jupyter Notebook or any Python IDE (e.g., VSCode, PyCharm).

Install the required libraries using:

pip install numpy matplotlib gym

Note: the code in this project uses the classic Gym API (gym versions below 0.26), where env.reset() returns the state and env.step() returns four values. If you install a newer Gym release or Gymnasium, those signatures differ, so either adapt the snippets accordingly or pin an older version (e.g., pip install "gym<0.26").

Steps and Tasks

1. Understanding the Environment

We will use the CartPole-v1 environment from OpenAI Gym, a classic control problem where the goal is to keep a pole balanced on a moving cart.

Tasks:

  • Explore the environment: Understand the state and action spaces.
  • Run a random agent: Observe the performance when actions are selected randomly.

Implementation:

import gym

# Create the environment
env = gym.make('CartPole-v1')

# Print action and state space information
print(f"Action space: {env.action_space}")  # Discrete(2)
print(f"Observation space: {env.observation_space}")  # Box(4,)

print(f"Observation space high values: {env.observation_space.high}")
print(f"Observation space low values: {env.observation_space.low}")

# Run one episode with random actions
state = env.reset()
done = False
total_reward = 0

while not done:
    env.render()
    action = env.action_space.sample()  # Random action (0 or 1)
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    state = next_state

env.close()
print(f"Total reward from random agent: {total_reward}")
Explanation
  • State Space: The environment provides a 4-dimensional continuous state vector (a labeled printout follows this list):

    1. Cart position.
    2. Cart velocity.
    3. Pole angle.
    4. Pole angular velocity.
  • Action Space: There are two discrete actions:

    • 0: Push cart to the left.
    • 1: Push cart to the right.
  • Observation:

    • Running the environment with a random agent typically results in poor performance, highlighting the need for a learning agent.
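
To connect the raw numbers to these four quantities, you can print a single observation with labels (a small illustrative check; the label strings are our own):

import gym

env = gym.make('CartPole-v1')
state = env.reset()
labels = ["cart position", "cart velocity", "pole angle", "pole angular velocity"]
for name, value in zip(labels, state):
    print(f"{name}: {value:.4f}")
env.close()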

2. Discretizing the State Space

Since the state space is continuous, we’ll discretize it to apply tabular Q-Learning.

Tasks:

  • Define bins: Create bins for each state variable to discretize the continuous state space.
  • Implement a function: Map continuous states to discrete states (indices).

Implementation:

import numpy as np

def create_bins(num_bins=10):
    # The true observation-space bounds for the two velocities are infinite,
    # so we clip them to a practical range for discretization.
    bins = [
        np.linspace(-4.8, 4.8, num_bins),     # Cart position
        np.linspace(-4, 4, num_bins),         # Cart velocity (practical range)
        np.linspace(-0.418, 0.418, num_bins), # Pole angle (±0.418 rad ≈ ±24 degrees)
        np.linspace(-4, 4, num_bins)          # Pole angular velocity (practical range)
    ]
    return bins

def discretize_state(state, bins):
    """Convert continuous state to discrete state indices"""
    discrete_state = []
    for i in range(len(state)):
        # Digitize returns the index of the bin each state element falls into
        index = np.digitize(state[i], bins[i]) - 1  # Subtract 1 for zero-based indexing
        # Ensure index is within bounds
        index = min(max(index, 0), len(bins[i]) - 1)
        discrete_state.append(index)
    return tuple(discrete_state)

# Example usage
bins = create_bins()
state = env.reset()
discrete_state = discretize_state(state, bins)
print(f"Discrete state: {discrete_state}")
Explanation
  • Bins:

    • We use np.linspace to create equally spaced bins for each state variable.
    • The number of bins (num_bins) can be adjusted to balance between state representation granularity and computational resources.
  • Discretization Function:

    • np.digitize determines which bin each state variable falls into (a one-line check follows this list).
    • The function maps continuous values to discrete indices, forming a discrete, tuple-based representation of the state.
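
If the index arithmetic is unclear, a one-line check of np.digitize on the cart-position edges (assuming the bins defined above) shows where the -1 adjustment comes from:

# np.digitize returns the count of bin edges <= x (for increasing edges)
edges = np.linspace(-4.8, 4.8, 10)   # the cart-position bin edges from create_bins
print(np.digitize(0.0, edges))       # 5 -> bin index 4 after subtracting 1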

3. Implementing the Q-Learning Agent

We’ll create an agent that learns an optimal policy using the Q-Learning algorithm.

Tasks:

  • Initialize the Q-table: A multi-dimensional array representing Q-values for state-action pairs.
  • Implement the Q-Learning update rule.
  • Implement an epsilon-greedy policy for action selection.

Implementation:

class QLearningAgent:
    def __init__(self, state_bins, action_size, learning_rate=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.state_bins = state_bins
        self.action_size = action_size
        self.q_table = np.zeros(state_bins + (action_size,))
        self.alpha = learning_rate  # Learning rate
        self.gamma = gamma          # Discount factor
        self.epsilon = epsilon      # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: select a random action
            return np.random.randint(self.action_size)
        else:
            # Exploit: select the action with the highest Q-value for the current state
            return np.argmax(self.q_table[state])

    def update_q_value(self, state, action, reward, next_state, done):
        # Q-Learning update rule
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + self.gamma * self.q_table[next_state + (best_next_action,)] * (1 - int(done))
        td_error = td_target - self.q_table[state + (action,)]
        self.q_table[state + (action,)] += self.alpha * td_error

    def decay_epsilon(self):
        # Decay the exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
Explanation
  • Q-Table Initialization:

    • The Q-table is initialized with zeros and has dimensions corresponding to the discretized state space plus the action dimension.
  • Action Selection:

    • Epsilon-Greedy Policy:
      • With probability ( \epsilon ), the agent explores by selecting a random action.
      • With probability ( 1 - \epsilon ), the agent exploits by choosing the action with the highest Q-value.
  • Q-Value Update Rule:

    • The agent updates its Q-values with the Q-Learning update rule, derived from the Bellman equation: [ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] ]
    • This update reduces the temporal-difference (TD) error between the current Q-value estimate and the observed reward plus discounted future rewards (a short numeric example follows this list).
  • Epsilon Decay:

    • Gradually reducing ( \epsilon ) encourages the agent to explore less over time and exploit learned knowledge more.
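
As a worked example of a single update, suppose the current estimate is Q(s, a) = 0.5, the agent receives r = 1, the best next Q-value is 0.6, α = 0.1, and γ = 0.99 (numbers chosen purely for illustration):

# One Q-Learning update with illustrative numbers
alpha, gamma = 0.1, 0.99
q_sa, reward, max_next_q = 0.5, 1.0, 0.6

td_target = reward + gamma * max_next_q   # 1.0 + 0.99 * 0.6 = 1.594
td_error = td_target - q_sa               # 1.594 - 0.5 = 1.094
q_sa += alpha * td_error                  # 0.5 + 0.1 * 1.094
print(q_sa)                               # ≈ 0.6094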

4. Training the Agent

We’ll train the agent over multiple episodes, allowing it to learn from interactions with the environment.

Tasks:

  • Run training episodes: Iterate over a specified number of episodes.
  • Implement the learning loop: The agent interacts with the environment, updates Q-values, and collects rewards.
  • Track performance: Record the total reward per episode for analysis.

Implementation:

def train_agent(env, agent, bins, num_episodes=1000, max_steps_per_episode=200):
    rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        discrete_state = discretize_state(state, bins)
        total_reward = 0
        done = False

        for step in range(max_steps_per_episode):
            action = agent.choose_action(discrete_state)
            next_state, reward, done, info = env.step(action)
            next_discrete_state = discretize_state(next_state, bins)
            total_reward += reward

            # Update Q-value
            agent.update_q_value(discrete_state, action, reward, next_discrete_state, done)

            discrete_state = next_discrete_state

            if done:
                break

        # Decay epsilon
        agent.decay_epsilon()

        rewards.append(total_reward)

        # Print progress every 100 episodes
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards[-100:])
            print(f"Episode {episode + 1}/{num_episodes}, Average Reward (last 100 episodes): {avg_reward:.2f}, Epsilon: {agent.epsilon:.4f}")

    return rewards
Explanation
  • Episode Loop:

    • For each episode, reset the environment and initialize the state.
    • The agent selects actions, observes rewards and next states, and updates Q-values.
    • The episode ends when the environment signals done or the maximum number of steps is reached.
  • Performance Tracking:

    • Collect the total reward per episode to monitor learning progress.
    • Periodically print the average reward and current epsilon value to observe trends.
  • Epsilon Decay:

    • After each episode, decay the exploration rate to shift the agent’s focus from exploration to exploitation.

5. Evaluating the Agent

After training, we evaluate the agent’s performance without exploration to assess how well it has learned.

Tasks:

  • Run evaluation episodes: Using the learned policy without exploration.
  • Measure performance: Calculate the average total reward over evaluation episodes.

Implementation:

def evaluate_agent(env, agent, bins, num_episodes=10, max_steps_per_episode=200):
    agent.epsilon = 0  # Disable exploration
    total_rewards = []

    for episode in range(num_episodes):
        state = env.reset()
        discrete_state = discretize_state(state, bins)
        total_reward = 0
        done = False

        for step in range(max_steps_per_episode):
            env.render()
            action = agent.choose_action(discrete_state)
            next_state, reward, done, info = env.step(action)
            discrete_state = discretize_state(next_state, bins)
            total_reward += reward

            if done:
                break

        total_rewards.append(total_reward)
        print(f"Evaluation Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")

    env.close()
    average_reward = np.mean(total_rewards)
    print(f"Average Total Reward over {num_episodes} Evaluation Episodes: {average_reward:.2f}")
Explanation
  • Evaluation without Exploration:

    • By setting agent.epsilon = 0, the agent always selects the action with the highest Q-value.
  • Rendering:

    • We call env.render() to visualize the agent’s performance during evaluation.
  • Performance Metrics:

    • We compute the total reward for each evaluation episode and calculate the average total reward.

6. Visualizing Training Progress

Plotting the rewards over episodes helps in analyzing the agent’s learning curve and identifying trends.

Implementation:

import matplotlib.pyplot as plt

def plot_rewards(rewards):
    plt.figure(figsize=(12, 6))
    plt.plot(rewards, label='Episode Reward')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Training Progress')
    plt.legend()
    plt.grid(True)
    plt.show()

    # Calculate and plot the moving average
    window_size = 50
    moving_avg = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')
    plt.figure(figsize=(12, 6))
    plt.plot(range(window_size - 1, len(rewards)), moving_avg, label=f'{window_size}-Episode Moving Average', color='orange')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.title('Moving Average of Rewards')
    plt.legend()
    plt.grid(True)
    plt.show()
Explanation
  • Episode Rewards Plot:

    • Displays the total reward per episode.
    • Helps identify whether the agent’s performance is improving over time.
  • Moving Average Plot:

    • Smoothing the rewards using a moving average provides a clearer view of the overall learning trend.
    • The window size determines the level of smoothing.

7. Running the Complete Project

We will now bring all components together and execute the training and evaluation of the agent.

Implementation:

def main():
    env = gym.make('CartPole-v1')
    bins = create_bins(num_bins=10)
    state_bins = tuple(len(b) for b in bins)  # one dimension per state variable, e.g. (10, 10, 10, 10)
    action_size = env.action_space.n

    agent = QLearningAgent(state_bins=state_bins, action_size=action_size)

    num_training_episodes = 2000
    rewards = train_agent(env, agent, bins, num_episodes=num_training_episodes)

    plot_rewards(rewards)

    print("Starting evaluation...")
    evaluate_agent(env, agent, bins, num_episodes=5)

if __name__ == "__main__":
    main()
Explanation
  • Environment and Agent Initialization:

    • Create the CartPole environment.
    • Define the discretization bins and determine the dimensions for the Q-table.
  • Training:

    • Train the agent using the train_agent function.
    • Record the rewards for visualization.
  • Visualization:

    • Plot the training rewards to analyze learning progress.
  • Evaluation:

    • Evaluate the agent’s performance over a few episodes.

8. Next Steps and Improvements

After successfully implementing and understanding the basic RL agent, consider the following enhancements to deepen your learning:

  1. Parameter Tuning:

    • Adjust the Number of Bins:
      • Experiment with different numbers of bins for discretization to find a balance between state representation accuracy and computational efficiency.
    • Modify Learning Rate and Discount Factor:
      • Test different values for the learning rate (( \alpha )) and discount factor (( \gamma )) to see how they affect learning speed and stability.
    • Epsilon Decay Strategy:
      • Try different epsilon decay rates or strategies (e.g., linear decay, exponential decay) to optimize the exploration-exploitation balance.
  2. Algorithm Enhancements:

    • Double Q-Learning:
      • Implement Double Q-Learning to address overestimation bias in Q-Learning.
    • Function Approximation:
      • Use function approximators like neural networks (Deep Q-Networks) to handle continuous state spaces without discretization.
    • Eligibility Traces:
      • Implement SARSA(λ) or Q(λ) to consider the sequence of states and actions leading to rewards.
  3. Advanced Exploration Strategies:

    • Softmax Action Selection:
      • Use a softmax (Boltzmann) distribution over Q-values to select actions probabilistically, favoring higher Q-values while still allowing exploration (see the sketch after this list).
    • Upper Confidence Bound (UCB):
      • Implement UCB to balance exploration and exploitation based on uncertainty estimates.
  4. Applying to Different Environments:

    • Test on Other OpenAI Gym Environments:
      • Apply your agent to environments like MountainCar-v0 or Acrobot-v1, adapting your approach as necessary.
    • Custom Environments:
      • Create simple custom environments to challenge your agent in new ways.
  5. Performance Monitoring and Logging:

    • Enhanced Visualization:
      • Plot additional metrics like maximum reward per episode or Q-value distributions.
    • Logging Libraries:
      • Integrate logging tools or frameworks (e.g., TensorBoard) to monitor training in real-time.
  6. Handling Continuous Actions:

    • Policy Gradient Methods:
      • Explore algorithms like REINFORCE or Actor-Critic methods that can handle continuous action spaces.
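
As an example of one of these extensions, here is a minimal sketch of softmax (Boltzmann) action selection over a row of Q-values; the function name and temperature parameter are our own illustrative choices, not part of the project code above:

import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    preferences = q_values / temperature
    preferences = preferences - np.max(preferences)   # subtract the max for numerical stability
    probs = np.exp(preferences) / np.sum(np.exp(preferences))
    return np.random.choice(len(q_values), p=probs)

# Higher Q-values are chosen more often, but lower ones still get explored
q_row = np.array([0.2, 1.0, 0.5])
print(softmax_action(q_row, temperature=0.5))

Lower temperatures make the selection greedier; higher temperatures make it closer to uniform random.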

9. Common Issues and Solutions

Issue: Agent’s performance plateaus or does not improve.

Possible Solutions:

  • Check the Discretization:

    • Ensure the bins cover the full range of possible state values.
    • Increase the number of bins for more precise state representation, but be cautious of the curse of dimensionality.
  • Adjust Hyperparameters:

    • Learning Rate (( \alpha )):
      • If learning is unstable, try decreasing the learning rate.
    • Discount Factor (( \gamma )):
      • A higher gamma places more emphasis on future rewards.
    • Epsilon Decay:
      • Adjust the decay rate to ensure sufficient exploration.
  • Modify the Reward Structure:

    • Introduce penalties for undesirable behaviors (e.g., large pole angles) to guide learning; a small reward-shaping sketch follows.
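
For instance, a minimal reward-shaping sketch for CartPole might subtract a small penalty proportional to the pole angle before the Q-update (the 0.1 scaling factor and function name are arbitrary illustrative choices):

def shape_reward(reward, next_state, angle_penalty=0.1):
    """Subtract a small penalty proportional to the pole angle (illustrative scaling)."""
    pole_angle = next_state[2]   # third component of the CartPole observation
    return reward - angle_penalty * abs(pole_angle)

# In train_agent, pass shape_reward(reward, next_state) to update_q_value instead of the raw reward.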

Issue: Agent performs well during training but poorly during evaluation.

Possible Solutions:

  • Ensure Exploration is Disabled During Evaluation:
    • Set agent.epsilon = 0 before evaluation to prevent random actions.
  • Overfitting to Exploration:
    • The agent may rely on exploration to achieve higher rewards; consider adjusting the epsilon decay schedule to encourage learning of the optimal policy.

Issue: High Variance in Rewards Across Episodes.

Possible Solutions:

  • Increase the Number of Training Episodes:
    • Allow the agent more time to learn consistent behavior.
  • Use Moving Averages:
    • Analyze the moving average of rewards to assess trends over time.

10. Conclusion

In this project, you have:

  • Gained an understanding of the foundational concepts of Reinforcement Learning.
  • Implemented a basic Q-Learning agent to solve the CartPole-v1 environment.
  • Learned how to discretize continuous state spaces for tabular methods.
  • Explored the balance between exploration and exploitation using epsilon-greedy strategies.
  • Evaluated and visualized your agent’s performance over time.

This foundational knowledge prepares you for more advanced RL topics, such as:

  • Deep Reinforcement Learning: Using neural networks to approximate value functions or policies.
  • Policy Gradient Methods: Directly optimizing the policy without relying on value functions.
  • Model-Based RL: Learning a model of the environment to plan ahead.
  • Multi-Agent RL: Extending RL concepts to environments with multiple agents.