# Rewards and Policies
Rewards and policies are fundamental concepts that guide agent learning in Reinforcement Learning.
## 🎯 Rewards
A reward is the feedback the agent receives for its actions, indicating how good or bad its decisions were.
### Characteristics of Rewards
- Immediate - Received right after the action
- Scalar - A number indicating action quality
- Can be negative - To penalize bad actions
- Guide learning - The agent learns to maximize them
### Example in Ants Saga
- +10 - Collect food
- -1 - Collide with obstacle
- +1 - Move towards food
- -0.1 - Move without purpose
## Policies
A policy is the strategy the agent uses to decide what action to take in each state.
### Types of Policies
- Deterministic - Always chooses the same action for a state
- Stochastic - Chooses actions based on probabilities
- Greedy - Always chooses the best known action
- ε-Greedy - Mix of greedy and random actions
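As a rough sketch of the first two types (the action set, state names, and probabilities below are made up for illustration, not taken from Ants Saga); greedy and ε-greedy selection are shown in the Policy Implementation section further down:

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set

# Deterministic: a fixed state -> action mapping always returns the same action
deterministic_policy = {"near_food": "right", "at_nest": "up"}

def stochastic_policy(action_probs):
    # Stochastic: sample an action from a probability distribution
    return np.random.choice(ACTIONS, p=action_probs)

print(deterministic_policy["near_food"])        # always "right"
print(stochastic_policy([0.1, 0.1, 0.1, 0.7]))  # usually "right", sometimes another move
```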
### Policy Representation
- Function - π(s) = a
- Table - Q-table with state-action values
- Neural Network - Deep Q-Network (DQN)
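A minimal sketch of the first two representations, assuming integer-encoded states and four actions; a DQN would replace the table with a network that maps a state to the same vector of Q-values:

```python
import numpy as np

N_STATES, N_ACTIONS = 25, 4                 # hypothetical grid of 25 cells, 4 moves
q_table = np.zeros((N_STATES, N_ACTIONS))   # table representation: Q(s, a)

def pi(state):
    # Function representation: pi(s) = a, here the greedy action from the table
    return int(np.argmax(q_table[state]))
```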
## 💰 Value Functions
### State Value V(s)
The expected cumulative reward obtained by starting in state s and following policy π.
### Action Value Q(s,a)
The expected cumulative reward obtained by taking action a in state s and then following policy π.
### Relationship
- V(s) = max_a Q(s,a) for the optimal policy
- Q(s,a) = R(s,a) + γV(s') for the next state s' (in general, an expectation over the possible next states)
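A small worked example of both identities, using hypothetical values and a discount factor γ = 0.9:

```python
GAMMA = 0.9  # discount factor (hypothetical choice)

# Hypothetical Q-values for one state s with three possible actions
q_s = {"up": 2.0, "right": 5.0, "down": -1.0}

# V(s) = max_a Q(s, a): value of the best action in s
v_s = max(q_s.values())  # 5.0

# Q(s, a) = R(s, a) + gamma * V(s'): immediate reward plus discounted next-state value
r_sa = 1.0     # e.g. the +1 "move towards food" reward
v_next = 4.0   # hypothetical value of the next state s'
q_sa = r_sa + GAMMA * v_next  # 1.0 + 0.9 * 4.0 = 4.6
```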
## 🎮 Examples in Ants Saga
### Reward Design
```python
def calculate_reward(state, action, next_state):
    """Per-step reward for the ant (values match the table above).

    Assumes states are dicts with the keys 'food_collected',
    'hit_obstacle' and 'distance_to_food'; adapt this to the real
    Ants Saga state representation.
    """
    reward = 0.0
    # Collect food
    if next_state["food_collected"]:
        reward += 10
    # Avoid obstacles
    if next_state["hit_obstacle"]:
        reward -= 1
    # Move towards food
    if next_state["distance_to_food"] < state["distance_to_food"]:
        reward += 1
    else:
        # Penalty for wandering without making progress
        reward -= 0.1
    return reward
```
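For example, under the assumed dictionary-style state from the sketch above, a step that brings the ant one cell closer to the food without collecting it or colliding scores +1 (the action name here is hypothetical):

```python
state = {"food_collected": False, "hit_obstacle": False, "distance_to_food": 4}
next_state = {"food_collected": False, "hit_obstacle": False, "distance_to_food": 3}
print(calculate_reward(state, "move_up", next_state))  # 1.0
```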
### Policy Implementation
```python
import random
import numpy as np

def epsilon_greedy_policy(state, q_table, epsilon):
    """With probability epsilon explore; otherwise exploit the Q-table."""
    if random.random() < epsilon:
        # Explore: random action index
        return random.choice(range(q_table.shape[1]))
    # Exploit: best known action for this state
    return int(np.argmax(q_table[state]))
```
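In training, epsilon typically starts high and decays over episodes so the agent explores early and exploits later. A sketch, reusing the integer-encoded states and q_table assumed earlier:

```python
epsilon = 1.0
for episode in range(500):
    state = 0  # hypothetical starting state index
    action = epsilon_greedy_policy(state, q_table, epsilon)
    # ... environment step and Q-value update would go here ...
    epsilon = max(0.05, epsilon * 0.995)  # decay towards a small exploration floor
```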
## 🔧 Best Practices
### Reward Design
- Sparse rewards - Only reward final goals
- Dense rewards - Reward intermediate progress
- Shaped rewards - Guide learning with hints
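A standard way to add such hints without changing which policy is optimal is potential-based reward shaping. The sketch below uses negative distance to food as the potential, which assumes the dictionary-style state used in the examples above:

```python
GAMMA = 0.9  # discount factor (hypothetical choice)

def shaped_reward(base_reward, state, next_state):
    # Potential phi(s): being closer to food means a higher potential
    phi_s = -state["distance_to_food"]
    phi_next = -next_state["distance_to_food"]
    # Shaping term F(s, s') = gamma * phi(s') - phi(s), added to the environment reward
    return base_reward + GAMMA * phi_next - phi_s
```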
### Policy Learning
- Start simple - Use basic algorithms first
- Monitor exploration - Balance exploration/exploitation
- Evaluate regularly - Check learning progress
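To make the last two points concrete, one simple check is to periodically run the current policy greedily (epsilon = 0) and track the average return. Here env is a hypothetical environment whose reset() returns a state and whose step(action) returns (state, reward, done):

```python
import numpy as np

def evaluate(env, q_table, episodes=10):
    """Average return of the greedy policy over a few evaluation episodes."""
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = int(np.argmax(q_table[state]))  # greedy, no exploration
            state, reward, done = env.step(action)
            total += reward
    return total / episodes
```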