Rewards and Policies

Rewards and policies are fundamental concepts that guide agent learning in Reinforcement Learning.

🎯 Rewards

A reward is the feedback the agent receives for its actions, indicating how good or bad its decisions were.

Characteristics of Rewards

  • Immediate - Received right after the action
  • Scalar - A number indicating action quality
  • Can be negative - To penalize bad actions
  • Guide learning - The agent learns to maximize cumulative reward over time

Example in Ants Saga

  • +10 - Collect food
  • -1 - Collide with obstacle
  • +1 - Move towards food
  • -0.1 - Move without purpose

📊 Policies

A policy is the strategy the agent uses to decide what action to take in each state.

Types of Policies

  • Deterministic - Always chooses the same action for a state
  • Stochastic - Chooses actions based on probabilities (see the sketch after this list)
  • Greedy - Always chooses the best known action
  • ε-Greedy - Mix of greedy and random actions
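
To make the first two types concrete, here is a minimal sketch in Python (the states, actions, and probabilities are invented for illustration): a deterministic policy is a fixed state-to-action mapping, while a stochastic policy samples from a probability distribution over actions.

import random

# Deterministic: one fixed action per state
deterministic_policy = {"near_food": "move_forward", "blocked": "turn_left"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic: sample an action according to per-state probabilities
stochastic_policy = {
    "near_food": {"move_forward": 0.8, "turn_left": 0.1, "turn_right": 0.1},
}

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]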

Policy Representation

  • Function - π(s) = a
  • Table - Q-table with state-action values (a small example follows this list)
  • Neural Network - Deep Q-Network (DQN)
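
As a rough sketch of the table representation (the sizes below are illustrative, not taken from Ants Saga), a Q-table stores one value per state-action pair, and the greedy policy derived from it simply reads off the best action for each state:

import numpy as np

n_states, n_actions = 25, 4                  # illustrative sizes
q_table = np.zeros((n_states, n_actions))    # Q(s, a) values, one row per state

def greedy_policy(state):
    # pi(s) = argmax_a Q(s, a)
    return int(np.argmax(q_table[state]))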

💰 Value Functions

State Value V(s)

The expected cumulative reward obtained by starting in state s and following policy π.

Action Value Q(s,a)

The expected cumulative reward obtained by taking action a in state s and then following policy π.
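
In standard RL notation, both quantities are expectations over the discounted return (γ is the discount factor used in the relationship below):

  • V_π(s) = E[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s ]
  • Q_π(s,a) = E[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … | S_t = s, A_t = a ]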

Relationship

  • V(s) = max_a Q(s,a) under the optimal policy
  • Q(s,a) = R(s,a) + γV(s'), where s' is the next state and γ is the discount factor
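
As a small worked example with made-up numbers: if an action yields R(s,a) = 10 (food collected), the next state has V(s') = 5, and γ = 0.9, then Q(s,a) = 10 + 0.9 × 5 = 14.5.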

🎮 Examples in Ants Saga

Reward Design

def calculate_reward(state, action, next_state):
    # Assumes next_state is a dict of boolean event flags for this step
    reward = 0.0

    # Collect food
    if next_state.get("food_collected"):
        reward += 10

    # Avoid obstacles
    if next_state.get("hit_obstacle"):
        reward -= 1

    # Move towards food
    if next_state.get("closer_to_food"):
        reward += 1

    # Penalty for wandering
    if next_state.get("no_progress"):
        reward -= 0.1

    return reward
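
A quick usage sketch, under the same assumption that next_state carries boolean event flags (the flag names are only illustrative):

next_state = {"food_collected": True, "hit_obstacle": False,
              "closer_to_food": True, "no_progress": False}
print(calculate_reward(state=None, action=None, next_state=next_state))  # 11.0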

Policy Implementation

import random
import numpy as np

def epsilon_greedy_policy(state, q_table, epsilon):
    actions = range(q_table.shape[1])  # q_table: 2D array of shape (states, actions)
    if random.random() < epsilon:
        # Explore: random action
        return random.choice(actions)
    else:
        # Exploit: best known action for this state
        return int(np.argmax(q_table[state]))
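
For context, a minimal usage sketch (the table size and ε value are illustrative):

q_table = np.zeros((25, 4))   # e.g. 25 grid cells x 4 possible moves
action = epsilon_greedy_policy(state=7, q_table=q_table, epsilon=0.1)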

🔧 Best Practices

Reward Design

  • Sparse rewards - Only reward final goals
  • Dense rewards - Reward intermediate progress
  • Shaped rewards - Guide learning with hints
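
One common way to implement shaped rewards without changing which policy is optimal is potential-based shaping, which adds γΦ(s') - Φ(s) on top of the base reward. A minimal sketch, assuming states expose a distance_to_food field (an invented detail, not from Ants Saga):

def shaping_bonus(state, next_state, gamma=0.9):
    # Illustrative potential: the closer to food, the higher the potential
    def phi(s):
        return -s["distance_to_food"]
    # F(s, s') = gamma * phi(s') - phi(s)
    return gamma * phi(next_state) - phi(state)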

Policy Learning

  • Start simple - Use basic algorithms first
  • Monitor exploration - Balance exploration/exploitation (a simple ε-decay schedule is sketched after this list)
  • Evaluate regularly - Check learning progress
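
For the exploration/exploitation balance, one simple option is to decay ε across episodes so the agent starts exploratory and becomes greedier over time (the schedule values below are illustrative):

epsilon = 1.0            # start fully exploratory
epsilon_min = 0.05       # never stop exploring entirely
epsilon_decay = 0.995    # multiplicative decay per episode

for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy_policy(...) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)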

📚 Further Reading