Introduction to Reinforcement Learning: Training an Agent in a Simple Environment
Objective
The main goal of this project is to introduce you to the fundamentals of Reinforcement Learning (RL) by training an agent to solve a simple control task. You will learn how agents interact with environments, implement basic RL algorithms, and understand key concepts such as exploration vs. exploitation, state representation, and policy evaluation.
Learning Outcomes
By completing this project, you will:
 Understand the core concepts of Reinforcement Learning, including agents, environments, states, actions, rewards, policies, and value functions.
 Learn how to use OpenAI Gym to create and simulate environments.
 Implement a basic RL algorithm, specifically Q-Learning, from scratch.
 Gain experience in training an agent to perform a task and evaluating its performance.
 Learn how to handle continuous state spaces through discretization.
 Develop skills in visualizing and analyzing agent performance.
Prerequisites and Theoretical Foundations
1. Python Programming Fundamentals
 Variables, data types, and basic operations.
 Control structures (if/else, loops).
 Functions and classes.
 List comprehensions and basic data structures (lists, tuples, dictionaries).
Python code examples:
# Function definition
def greet(name):
    return f"Hello, {name}!"
# Class definition
class Agent:
    def __init__(self, name):
        self.name = name
    def act(self, state):
        # Decide on an action based on the state
        pass
# Control structures
for i in range(5):
    if i % 2 == 0:
        print(f"{i} is even")
    else:
        print(f"{i} is odd")
# List comprehension
squares = [x**2 for x in range(10)]
2. NumPy Essentials
 Array creation and manipulation.
 Basic array operations (addition, multiplication).
 Array indexing and slicing.
 Mathematical operations on arrays.
NumPy code examples:
import numpy as np
# Array creation
zeros_array = np.zeros((3, 3))
random_array = np.random.rand(3, 3)
# Element-wise operations
sum_array = zeros_array + random_array
# Indexing and slicing
first_row = random_array[0, :]
element = random_array[1, 2]
# Mathematical operations
mean_value = np.mean(random_array)
max_value = np.max(random_array)
3. Basic Mathematics
 Linear Algebra Fundamentals: Vectors, matrices, and matrix operations.
 Probability Concepts: Random variables, probability distributions, expected value.
 Basic understanding of calculus is helpful but not mandatory.
Mathematical concepts and examples:
Probability Example:
 Expected value of a random variable \( X \): \[ E[X] = \sum_{x} x \cdot P(X = x) \]
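As a concrete instance of the expected-value formula, the following sketch computes the expected value of a fair six-sided die. The die and the variable names are illustrative assumptions and are not part of the project code.

```python
import numpy as np

# Hypothetical example: a fair six-sided die
outcomes = np.arange(1, 7)          # possible values x: 1..6
probabilities = np.full(6, 1 / 6)   # P(X = x) for each outcome

# E[X] = sum over x of x * P(X = x)
expected_value = np.sum(outcomes * probabilities)
print(expected_value)
```

In RL, the same weighted sum appears whenever a value function averages future rewards over the possible outcomes.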
Matrix Operations Example:
import numpy as np
# Define a transition matrix
transition_matrix = np.array([
    [0.7, 0.3],
    [0.4, 0.6]
])
# Current state vector
current_state = np.array([1, 0])  # Starting in state 0
# Compute next state probabilities
next_state_probs = np.dot(current_state, transition_matrix)
print(next_state_probs)  # Output: [0.7 0.3]
List of Theoretical Concepts
Core RL Concepts

Reinforcement Learning Framework:
 Agent: The learner or decision-maker.
 Environment: The external system the agent interacts with.
 State (s): A representation of the current situation.
 Action (a): Choices the agent can make.
 Reward (r): Feedback signal indicating the desirability of the state/action.

The RL Loop:
 The agent observes the current state \( s_t \).
 The agent selects an action \( a_t \) based on its policy.
 The environment transitions to a new state \( s_{t+1} \) and provides a reward \( r_{t+1} \).
 The agent updates its policy based on the experience.

Key Terms:
 Policy (\( \pi \)): The agent's strategy for selecting actions, mapping states to actions.
 Value Function: Estimates how good a state (or state-action pair) is in terms of expected future rewards.
 Q-Function (\( Q(s, a) \)): The expected return (cumulative future reward) of taking action \( a \) in state \( s \) and following the policy thereafter.
 Exploration vs. Exploitation: Balancing the act of trying new actions to discover their effects (exploration) and choosing known actions that yield high rewards (exploitation).
 Learning Rate (\( \alpha \)): Determines how much new information overrides old information during learning.
 Discount Factor (\( \gamma \)): Determines the importance of future rewards compared to immediate rewards.
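The Q-function and discount factor definitions both rest on the discounted return, the sum of future rewards with each later reward weighted by a further factor of gamma. The sketch below illustrates this with a made-up reward sequence; the function name discounted_return is a hypothetical helper, not part of the project code.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: a reward of 1 at each of three steps
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))   # undiscounted: 3.0
print(discounted_return(rewards, gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
```

A gamma close to 1 makes the agent value long-term outcomes almost as much as immediate ones, while a small gamma makes it short-sighted.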
Skills Gained
 Understanding core Reinforcement Learning concepts and terminology.
 Implementing basic RL algorithms like Q-Learning from scratch.
 Using OpenAI Gym to simulate and interact with environments.
 Handling continuous state spaces through discretization.
 Implementing exploration strategies (e.g., epsilon-greedy).
 Evaluating and visualizing agent performance.
Tools Required
 Python 3.7+
 NumPy: For numerical computations.
 Matplotlib: For plotting and visualization.
 OpenAI Gym: For environment simulation.
 Jupyter Notebook or any Python IDE (e.g., VSCode, PyCharm).
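Install the required libraries. The code in this project uses the classic Gym API, in which env.reset() returns only the observation and env.step() returns four values; gym releases from 0.26 onward changed both signatures, so pin an earlier version:
pip install numpy matplotlib "gym<0.26"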
Steps and Tasks
1. Understanding the Environment
We will use the CartPole-v1 environment from OpenAI Gym, a classic control problem where the goal is to keep a pole balanced on a moving cart.
Tasks:
 Explore the environment: Understand the state and action spaces.
 Run a random agent: Observe the performance when actions are selected randomly.
Implementation:
import gym
# Create the environment
env = gym.make('CartPole-v1')
# Print action and state space information
print(f"Action space: {env.action_space}")  # Discrete(2)
print(f"Observation space: {env.observation_space}")  # Box(4,)
print(f"Observation space high values: {env.observation_space.high}")
print(f"Observation space low values: {env.observation_space.low}")
# Run one episode with random actions
state = env.reset()
done = False
total_reward = 0
while not done:
    env.render()
    action = env.action_space.sample()  # Random action (0 or 1)
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    state = next_state
env.close()
print(f"Total reward from random agent: {total_reward}")
Explanation

State Space: The environment provides a 4-dimensional continuous state vector:
 Cart position.
 Cart velocity.
 Pole angle.
 Pole angular velocity.

Action Space: There are two discrete actions:
 0: Push cart to the left.
 1: Push cart to the right.

Observation:
 Running the environment with a random agent typically results in poor performance, highlighting the need for a learning agent.
2. Discretizing the State Space
Since the state space is continuous, we'll discretize it to apply tabular Q-Learning.
Tasks:
 Define bins: Create bins for each state variable to discretize the continuous state space.
 Implement a function: Map continuous states to discrete states (indices).
Implementation:
import numpy as np
def create_bins(num_bins=10):
    bins = [
        np.linspace(-4.8, 4.8, num_bins),      # Cart position
        np.linspace(-4, 4, num_bins),          # Cart velocity
        np.linspace(-0.418, 0.418, num_bins),  # Pole angle (~24 degrees)
        np.linspace(-4, 4, num_bins)           # Pole angular velocity
    ]
    return bins
def discretize_state(state, bins):
    """Convert continuous state to discrete state indices"""
    discrete_state = []
    for i in range(len(state)):
        # Digitize returns the index of the bin each state element falls into
        index = np.digitize(state[i], bins[i]) - 1  # Subtract 1 for zero-based indexing
        # Ensure index is within bounds
        index = min(max(index, 0), len(bins[i]) - 1)
        discrete_state.append(index)
    return tuple(discrete_state)
# Example usage
bins = create_bins()
state = env.reset()
discrete_state = discretize_state(state, bins)
print(f"Discrete state: {discrete_state}")
Explanation

Bins:
 We use np.linspace to create equally spaced bins for each state variable.
 The number of bins (num_bins) can be adjusted to balance between state representation granularity and computational resources.

Discretization Function:
 np.digitize determines which bin each state variable falls into. The function maps continuous values to discrete indices, forming a discrete representation of the state.
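To make the mapping from continuous values to indices concrete, here is a minimal one-dimensional sketch using the same np.digitize-then-clamp pattern. The range [-1, 1], the five edges, and the helper name to_index are illustrative assumptions.

```python
import numpy as np

# Hypothetical 1-D example: 5 bin edges spanning [-1, 1]
edges = np.linspace(-1.0, 1.0, 5)   # [-1. , -0.5,  0. ,  0.5,  1. ]

def to_index(value, edges):
    # np.digitize returns i such that edges[i-1] <= value < edges[i]
    index = np.digitize(value, edges) - 1
    # Clamp so out-of-range values fall into the first or last index
    return int(min(max(index, 0), len(edges) - 1))

for value in (-2.0, -0.75, 0.1, 0.99, 3.0):
    print(value, "->", to_index(value, edges))
```

Values below the lowest edge share index 0 with the first interval, and values at or above the highest edge get the last index, so every observation maps to a valid table position.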
3. Implementing the Q-Learning Agent
We'll create an agent that learns an optimal policy using the Q-Learning algorithm.
Tasks:
 Initialize the Q-table: A multidimensional array representing Q-values for state-action pairs.
 Implement the Q-Learning update rule.
 Implement an epsilon-greedy policy for action selection.
Implementation:
class QLearningAgent:
    def __init__(self, state_bins, action_size, learning_rate=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.state_bins = state_bins
        self.action_size = action_size
        self.q_table = np.zeros(state_bins + (action_size,))
        self.alpha = learning_rate  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: select a random action
            return np.random.randint(self.action_size)
        else:
            # Exploit: select the action with the highest Q-value for the current state
            return np.argmax(self.q_table[state])
    def update_q_value(self, state, action, reward, next_state, done):
        # Q-Learning update rule
        best_next_action = np.argmax(self.q_table[next_state])
        # The (1 - int(done)) factor drops the bootstrap term on terminal transitions
        td_target = reward + self.gamma * self.q_table[next_state + (best_next_action,)] * (1 - int(done))
        td_error = td_target - self.q_table[state + (action,)]
        self.q_table[state + (action,)] += self.alpha * td_error
    def decay_epsilon(self):
        # Decay the exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
Explanation

Q-Table Initialization:
 The Q-table is initialized with zeros and has dimensions corresponding to the discretized state space plus the action dimension.

Action Selection:
 Epsilon-Greedy Policy:
 With probability \( \epsilon \), the agent explores by selecting a random action.
 With probability \( 1 - \epsilon \), the agent exploits by choosing the action with the highest Q-value.

Q-Value Update Rule:
 The agent updates its Q-values using the Bellman equation: \[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]
 This update aims to minimize the temporal-difference (TD) error between the predicted Q-value and the observed reward plus discounted future rewards.

Epsilon Decay:
 Gradually reducing \( \epsilon \) encourages the agent to explore less over time and exploit learned knowledge more.
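A single application of the Q-value update rule can be traced by hand. The sketch below uses a made-up 2-state, 2-action table (not the CartPole table) with assumed current estimates, so the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical toy table: 2 states x 2 actions, not the CartPole table
q_table = np.zeros((2, 2))
q_table[1] = [0.5, 2.0]        # assumed current estimates for the next state

alpha, gamma = 0.1, 0.99
state, action, reward, next_state = 0, 1, 1.0, 1

# TD target: r + gamma * max_a' Q(s', a')
td_target = reward + gamma * np.max(q_table[next_state])
# TD error: target minus the current estimate Q(s, a)
td_error = td_target - q_table[state, action]
# Move Q(s, a) a fraction alpha toward the target
q_table[state, action] += alpha * td_error

print(td_target, q_table[state, action])
```

Here the target is 1 + 0.99 * 2.0 = 2.98, and with a learning rate of 0.1 the estimate moves from 0 to about 0.298, one tenth of the way toward the target.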
4. Training the Agent
We'll train the agent over multiple episodes, allowing it to learn from interactions with the environment.
Tasks:
 Run training episodes: Iterate over a specified number of episodes.
 Implement the learning loop: The agent interacts with the environment, updates Q-values, and collects rewards.
 Track performance: Record the total reward per episode for analysis.
Implementation:
def train_agent(env, agent, bins, num_episodes=1000, max_steps_per_episode=200):
    rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        discrete_state = discretize_state(state, bins)
        total_reward = 0
        done = False
        for step in range(max_steps_per_episode):
            action = agent.choose_action(discrete_state)
            next_state, reward, done, info = env.step(action)
            next_discrete_state = discretize_state(next_state, bins)
            total_reward += reward
            # Update Q-value
            agent.update_q_value(discrete_state, action, reward, next_discrete_state, done)
            discrete_state = next_discrete_state
            if done:
                break
        # Decay epsilon
        agent.decay_epsilon()
        rewards.append(total_reward)
        # Print progress every 100 episodes
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards[-100:])
            print(f"Episode {episode + 1}/{num_episodes}, Average Reward (last 100 episodes): {avg_reward:.2f}, Epsilon: {agent.epsilon:.4f}")
    return rewards
Explanation

Episode Loop:
 For each episode, reset the environment and initialize the state.
 The agent selects actions, observes rewards and next states, and updates Q-values.
 The episode ends when the environment signals done or the maximum number of steps is reached.

Performance Tracking:
 Collect the total reward per episode to monitor learning progress.
 Periodically print the average reward and current epsilon value to observe trends.

Epsilon Decay:
 After each episode, decay the exploration rate to shift the agent's focus from exploration to exploitation.
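It can help to know how long the exploration phase lasts under a given schedule. The sketch below counts the episodes needed for a multiplicative decay to reach its floor, assuming the default values epsilon=1.0, epsilon_decay=0.995, and epsilon_min=0.01 and one decay step per episode.

```python
import math

epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01

# Count episodes until the multiplicative decay reaches the floor
episodes = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    episodes += 1

# Closed form: smallest n with epsilon_decay**n <= epsilon_min
closed_form = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))
print(episodes, closed_form)
```

With these values the floor is reached after roughly 920 episodes, so in a longer run the remaining episodes are almost purely greedy. A slower decay rate extends exploration if the agent settles too early.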
5. Evaluating the Agent
After training, we evaluate the agent's performance without exploration to assess how well it has learned.
Tasks:
 Run evaluation episodes: Using the learned policy without exploration.
 Measure performance: Calculate the average total reward over evaluation episodes.
Implementation:
def evaluate_agent(env, agent, bins, num_episodes=10, max_steps_per_episode=200):
    agent.epsilon = 0  # Disable exploration
    total_rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        discrete_state = discretize_state(state, bins)
        total_reward = 0
        done = False
        for step in range(max_steps_per_episode):
            env.render()
            action = agent.choose_action(discrete_state)
            next_state, reward, done, info = env.step(action)
            discrete_state = discretize_state(next_state, bins)
            total_reward += reward
            if done:
                break
        total_rewards.append(total_reward)
        print(f"Evaluation Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")
    env.close()
    average_reward = np.mean(total_rewards)
    print(f"Average Total Reward over {num_episodes} Evaluation Episodes: {average_reward:.2f}")
Explanation

Evaluation without Exploration:
 By setting agent.epsilon = 0, the agent always selects the action with the highest Q-value.

Rendering:
 We call env.render() to visualize the agent's performance during evaluation.

Performance Metrics:
 We compute the total reward for each evaluation episode and calculate the average total reward.
6. Visualizing Training Progress
Plotting the rewards over episodes helps in analyzing the agent's learning curve and identifying trends.
Implementation:
import numpy as np
import matplotlib.pyplot as plt
def plot_rewards(rewards):
    plt.figure(figsize=(12, 6))
    plt.plot(rewards, label='Episode Reward')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Training Progress')
    plt.legend()
    plt.grid(True)
    plt.show()
    # Calculate and plot the moving average
    window_size = 50
    moving_avg = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')
    plt.figure(figsize=(12, 6))
    # The first average is available once window_size episodes have finished
    plt.plot(range(window_size - 1, len(rewards)), moving_avg, label=f'{window_size}-Episode Moving Average', color='orange')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.title('Moving Average of Rewards')
    plt.legend()
    plt.grid(True)
    plt.show()
Explanation

Episode Rewards Plot:
 Displays the total reward per episode.
 Helps identify whether the agent's performance is improving over time.

Moving Average Plot:
 Smoothing the rewards using a moving average provides a clearer view of the overall learning trend.
 The window size determines the level of smoothing.
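The effect of the moving average is easiest to see on a tiny input. The sketch below applies np.convolve with mode='valid' to a made-up five-element reward sequence with a window of 3; the numbers are chosen only for illustration.

```python
import numpy as np

# Hypothetical reward sequence, chosen only to illustrate the smoothing
rewards = [10, 20, 30, 40, 50]
window_size = 3

# Each output is the mean of window_size consecutive rewards
moving_avg = np.convolve(rewards, np.ones(window_size) / window_size, mode='valid')
print(moving_avg)        # three windows: (10,20,30), (20,30,40), (30,40,50)
print(len(moving_avg))   # len(rewards) - window_size + 1
```

Because mode='valid' only keeps full windows, the smoothed curve is shorter than the raw one by window_size - 1 points, which is why its x-axis starts at episode window_size - 1.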
7. Running the Complete Project
We will now bring all components together and execute the training and evaluation of the agent.
Implementation:
def main():
    env = gym.make('CartPole-v1')
    bins = create_bins(num_bins=10)
    # One Q-table dimension per state variable, sized by its number of bin edges
    state_bins = tuple(len(b) for b in bins)
    action_size = env.action_space.n
    agent = QLearningAgent(state_bins=state_bins, action_size=action_size)
    num_training_episodes = 2000
    rewards = train_agent(env, agent, bins, num_episodes=num_training_episodes)
    plot_rewards(rewards)
    print("Starting evaluation...")
    evaluate_agent(env, agent, bins, num_episodes=5)
if __name__ == "__main__":
    main()
Explanation

Environment and Agent Initialization:
 Create the CartPole environment.
 Define the discretization bins and determine the dimensions for the Q-table.

Training:
 Train the agent using the train_agent function.
 Record the rewards for visualization.

Visualization:
 Plot the training rewards to analyze learning progress.

Evaluation:
 Evaluate the agent's performance over a few episodes.
8. Next Steps and Improvements
After successfully implementing and understanding the basic RL agent, consider the following enhancements to deepen your learning:

Parameter Tuning:
 Adjust the Number of Bins:
 Experiment with different numbers of bins for discretization to find a balance between state representation accuracy and computational efficiency.
 Modify Learning Rate and Discount Factor:
 Test different values for the learning rate (\( \alpha \)) and discount factor (\( \gamma \)) to see how they affect learning speed and stability.
 Epsilon Decay Strategy:
 Try different epsilon decay rates or strategies (e.g., linear decay, exponential decay) to optimize the exploration-exploitation balance.

Algorithm Enhancements:
 Double Q-Learning:
 Implement Double Q-Learning to address overestimation bias in Q-Learning.
 Function Approximation:
 Use function approximators like neural networks (Deep Q-Networks) to handle continuous state spaces without discretization.
 Eligibility Traces:
 Implement SARSA(λ) or Q(λ) to consider the sequence of states and actions leading to rewards.
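As a starting point for the Double Q-Learning enhancement, the following is a minimal sketch of one update step under stated assumptions: two tables indexed by a state tuple plus an action, a coin flip to pick which table to update, and the hypothetical function name double_q_update. It is one common formulation, not a drop-in replacement for the agent class in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """One Double Q-Learning step on two tables indexed as q[state + (action,)]."""
    # Randomly pick which table to update; the other one evaluates the action
    if rng.random() < 0.5:
        q_update, q_eval = q_a, q_b
    else:
        q_update, q_eval = q_b, q_a
    # Select the greedy action with the table being updated ...
    best_next_action = int(np.argmax(q_update[next_state]))
    # ... but take its value from the other table
    next_value = q_eval[next_state + (best_next_action,)]
    td_target = reward + gamma * next_value * (1 - int(done))
    q_update[state + (action,)] += alpha * (td_target - q_update[state + (action,)])

# Hypothetical usage with a tiny 3-state, 2-action problem
q_a = np.zeros((3, 2))
q_b = np.zeros((3, 2))
double_q_update(q_a, q_b, state=(0,), action=1, reward=1.0, next_state=(1,), done=False)
print(q_a[0, 1], q_b[0, 1])
```

Separating action selection from action evaluation is what reduces the overestimation. When choosing actions, a common choice is to act greedily with respect to the sum of the two tables.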

Advanced Exploration Strategies:
 Softmax Action Selection:
 Use a softmax function over Q-values to select actions probabilistically, favoring higher Q-values but allowing exploration.
 Upper Confidence Bound (UCB):
 Implement UCB to balance exploration and exploitation based on uncertainty estimates.
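Softmax action selection can be sketched in a few lines. The function name softmax_action and the temperature parameter are assumptions introduced for illustration; the temperature controls how strongly higher Q-values are favored.

```python
import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(q_values, dtype=float) / temperature
    scaled -= np.max(scaled)                 # subtract max for numerical stability
    exp_values = np.exp(scaled)
    probabilities = exp_values / np.sum(exp_values)
    action = rng.choice(len(probabilities), p=probabilities)
    return int(action), probabilities

# Hypothetical Q-values for a two-action state
action, probs = softmax_action([1.0, 2.0], temperature=1.0, rng=np.random.default_rng(0))
print(action, probs)
```

A high temperature makes the probabilities nearly uniform (more exploration), while a temperature close to zero approaches greedy selection.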

Applying to Different Environments:
 Test on Other OpenAI Gym Environments:
 Apply your agent to environments like MountainCar-v0 or Acrobot-v1, adapting your approach as necessary.
 Custom Environments:
 Create simple custom environments to challenge your agent in new ways.

Performance Monitoring and Logging:
 Enhanced Visualization:
 Plot additional metrics like maximum reward per episode or Q-value distributions.
 Logging Libraries:
 Integrate logging tools or frameworks (e.g., TensorBoard) to monitor training in real time.

Handling Continuous Actions:
 Policy Gradient Methods:
 Explore algorithms like REINFORCE or Actor-Critic methods that can handle continuous action spaces.
9. Common Issues and Solutions
Issue: Agent's performance plateaus or does not improve.
Possible Solutions:

Check the Discretization:
 Ensure the bins cover the full range of possible state values.
 Increase the number of bins for more precise state representation, but be cautious of the curse of dimensionality.
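The cost of adding bins can be estimated before training. The sketch below counts Q-table entries for a few bin counts, assuming the four CartPole state variables and two actions used in this project; the specific bin counts are arbitrary examples.

```python
num_state_variables = 4   # CartPole observation has four components
num_actions = 2

# One Q-value per (discrete state, action) pair
sizes = {
    num_bins: num_bins ** num_state_variables * num_actions
    for num_bins in (6, 10, 20)
}
for num_bins, table_entries in sizes.items():
    print(f"{num_bins} bins per variable -> {table_entries} Q-values")
```

Doubling the bins per variable multiplies the table size by 16 here, and every extra entry needs its own visits before its estimate becomes useful, so finer bins usually require more training episodes.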

Adjust Hyperparameters:
 Learning Rate (\( \alpha \)):
 If learning is unstable, try decreasing the learning rate.
 Discount Factor (\( \gamma \)):
 A higher gamma places more emphasis on future rewards.
 Epsilon Decay:
 Adjust the decay rate to ensure sufficient exploration.

Modify the Reward Structure:
 Introduce penalties for undesirable behaviors to guide learning.
Issue: Agent performs well during training but poorly during evaluation.
Possible Solutions:
 Ensure Exploration is Disabled During Evaluation:
 Set agent.epsilon = 0 before evaluation to prevent random actions.
 Overfitting to Exploration:
 The agent may rely on exploration to achieve higher rewards; consider adjusting the epsilon decay schedule to encourage learning of the optimal policy.
Issue: High Variance in Rewards Across Episodes.
Possible Solutions:
 Increase the Number of Training Episodes:
 Allow the agent more time to learn consistent behavior.
 Use Moving Averages:
 Analyze the moving average of rewards to assess trends over time.
10. Conclusion
In this project, you have:
 Gained an understanding of the foundational concepts of Reinforcement Learning.
 Implemented a basic Q-Learning agent to solve the CartPole-v1 environment.
 Learned how to discretize continuous state spaces for tabular methods.
 Explored the balance between exploration and exploitation using epsilon-greedy strategies.
 Evaluated and visualized your agent's performance over time.
This foundational knowledge prepares you for more advanced RL topics, such as:
 Deep Reinforcement Learning: Using neural networks to approximate value functions or policies.
 Policy Gradient Methods: Directly optimizing the policy without relying on value functions.
 Model-Based RL: Learning a model of the environment to plan ahead.
 Multi-Agent RL: Extending RL concepts to environments with multiple agents.