# Rewards and Policies
Rewards and policies are fundamental concepts that guide agent learning in Reinforcement Learning.
## 🎯 Rewards
A reward is the feedback the agent receives for its actions, indicating how good or bad its decisions were.
### Characteristics of Rewards
- Immediate - Received right after the action
- Scalar - A number indicating action quality
- Can be negative - To penalize bad actions
- Guide learning - The agent learns to maximize them
### Example in Ants Saga
- +10 - Collect food
- -1 - Collide with obstacle
- +1 - Move towards food
- -0.1 - Move without purpose
## Policies
A policy is the strategy the agent uses to decide what action to take in each state.
### Types of Policies
- Deterministic - Always chooses the same action for a state
- Stochastic - Chooses actions based on probabilities
- Greedy - Always chooses the best known action
- ε-Greedy - Mix of greedy and random actions
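As a rough sketch of the first two types (the action set, state names, and probabilities below are made up for illustration, not taken from Ants Saga); greedy and ε-greedy selection are shown in the Policy Implementation section further down:

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set

# Deterministic: a fixed state -> action mapping always returns the same action
deterministic_policy = {"near_food": "right", "at_nest": "up"}

def stochastic_policy(action_probs):
    # Stochastic: sample an action from a probability distribution
    return np.random.choice(ACTIONS, p=action_probs)

print(deterministic_policy["near_food"])        # always "right"
print(stochastic_policy([0.1, 0.1, 0.1, 0.7]))  # usually "right", sometimes another move
```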
### Policy Representation
- Function - π(s) = a
- Table - Q-table with state-action values
- Neural Network - Deep Q-Network (DQN)
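A minimal sketch of the first two representations, assuming integer-encoded states and four actions; a DQN would replace the table with a network that maps a state to the same vector of Q-values:

```python
import numpy as np

N_STATES, N_ACTIONS = 25, 4                 # hypothetical grid of 25 cells, 4 moves
q_table = np.zeros((N_STATES, N_ACTIONS))   # table representation: Q(s, a)

def pi(state):
    # Function representation: pi(s) = a, here the greedy action from the table
    return int(np.argmax(q_table[state]))
```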
## 💰 Value Functions
### State Value V(s)
The expected cumulative reward obtained by starting in state s and following policy π.
### Action Value Q(s,a)
The expected cumulative reward obtained by taking action a in state s and then following policy π.
### Relationship
- V(s) = max_a Q(s,a) for the optimal policy
- Q(s,a) = R(s,a) + γV(s') for the next state s' (in general, an expectation over the possible next states)
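A small worked example of both identities, using hypothetical values and a discount factor γ = 0.9:

```python
GAMMA = 0.9  # discount factor (hypothetical choice)

# Hypothetical Q-values for one state s with three possible actions
q_s = {"up": 2.0, "right": 5.0, "down": -1.0}

# V(s) = max_a Q(s, a): value of the best action in s
v_s = max(q_s.values())  # 5.0

# Q(s, a) = R(s, a) + gamma * V(s'): immediate reward plus discounted next-state value
r_sa = 1.0     # e.g. the +1 "move towards food" reward
v_next = 4.0   # hypothetical value of the next state s'
q_sa = r_sa + GAMMA * v_next  # 1.0 + 0.9 * 4.0 = 4.6
```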
## 🎮 Examples in Ants Saga
### Reward Design
```python
def calculate_reward(state, action, next_state):
    """Per-step reward for the ant (values match the table above).

    Assumes states are dicts with the keys 'food_collected',
    'hit_obstacle' and 'distance_to_food'; adapt this to the real
    Ants Saga state representation.
    """
    reward = 0.0
    # Collect food
    if next_state["food_collected"]:
        reward += 10
    # Avoid obstacles
    if next_state["hit_obstacle"]:
        reward -= 1
    # Move towards food
    if next_state["distance_to_food"] < state["distance_to_food"]:
        reward += 1
    else:
        # Penalty for wandering without making progress
        reward -= 0.1
    return reward
```
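For example, under the assumed dictionary-style state from the sketch above, a step that brings the ant one cell closer to the food without collecting it or colliding scores +1 (the action name here is hypothetical):

```python
state = {"food_collected": False, "hit_obstacle": False, "distance_to_food": 4}
next_state = {"food_collected": False, "hit_obstacle": False, "distance_to_food": 3}
print(calculate_reward(state, "move_up", next_state))  # 1.0
```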
### Policy Implementation
```python
import random
import numpy as np

def epsilon_greedy_policy(state, q_table, epsilon):
    """With probability epsilon explore; otherwise exploit the Q-table."""
    if random.random() < epsilon:
        # Explore: random action index
        return random.choice(range(q_table.shape[1]))
    # Exploit: best known action for this state
    return int(np.argmax(q_table[state]))
```
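In training, epsilon typically starts high and decays over episodes so the agent explores early and exploits later. A sketch, reusing the integer-encoded states and q_table assumed earlier:

```python
epsilon = 1.0
for episode in range(500):
    state = 0  # hypothetical starting state index
    action = epsilon_greedy_policy(state, q_table, epsilon)
    # ... environment step and Q-value update would go here ...
    epsilon = max(0.05, epsilon * 0.995)  # decay towards a small exploration floor
```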
## 🔧 Best Practices
### Reward Design
- Sparse rewards - Only reward final goals
- Dense rewards - Reward intermediate progress
- Shaped rewards - Guide learning with hints
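A standard way to add such hints without changing which policy is optimal is potential-based reward shaping. The sketch below uses negative distance to food as the potential, which assumes the dictionary-style state used in the examples above:

```python
GAMMA = 0.9  # discount factor (hypothetical choice)

def shaped_reward(base_reward, state, next_state):
    # Potential phi(s): being closer to food means a higher potential
    phi_s = -state["distance_to_food"]
    phi_next = -next_state["distance_to_food"]
    # Shaping term F(s, s') = gamma * phi(s') - phi(s), added to the environment reward
    return base_reward + GAMMA * phi_next - phi_s
```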
### Policy Learning
- Start simple - Use basic algorithms first
- Monitor exploration - Balance exploration/exploitation
- Evaluate regularly - Check learning progress
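To make the last two points concrete, one simple check is to periodically run the current policy greedily (epsilon = 0) and track the average return. Here env is a hypothetical environment whose reset() returns a state and whose step(action) returns (state, reward, done):

```python
import numpy as np

def evaluate(env, q_table, episodes=10):
    """Average return of the greedy policy over a few evaluation episodes."""
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = int(np.argmax(q_table[state]))  # greedy, no exploration
            state, reward, done = env.step(action)
            total += reward
    return total / episodes
```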