A typical reinforcement learning (RL) problem has some basic elements:
Environment: the physical world in which the agent operates.
State: the current situation of the agent.
Reward: feedback from the environment.
Policy: the method used to map the agent's state to actions.
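These four elements interact in a simple loop: the agent observes the state, the policy picks an action, and the environment returns a new state and a reward. Below is a minimal Python sketch of that loop; the LineEnvironment class and random_policy function are hypothetical stand-ins for illustration, not part of any particular RL library.

```python
# Minimal sketch of the agent-environment loop (illustrative assumptions:
# a toy one-dimensional world with a rewarded position at 5).

import random

class LineEnvironment:
    """Environment: a toy world where the agent walks along a line."""

    def __init__(self):
        self.state = 0  # State: current situation (position) of the agent

    def step(self, action):
        self.state += action                  # apply the chosen action
        reward = 1 if self.state == 5 else 0  # Reward: feedback from the environment
        done = self.state == 5
        return self.state, reward, done

def random_policy(state):
    """Policy: maps the agent's state to an action (here it ignores the state)."""
    return random.choice([-1, +1])

env = LineEnvironment()
state, done = 0, False
for _ in range(10_000):  # cap the episode length for safety
    action = random_policy(state)
    state, reward, done = env.step(action)
    if done:
        break
```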
We can think of the policy as the agent's strategy. For example, imagine a world where a robot (the agent) moves around a room, and its task is to reach a target point (x, y), where it receives a reward. Here:
The room is the environment.
The robot's current position is the state.
The policy is what the agent (the robot) does to accomplish the task. Different robots might follow different policies:
Policy #1: dumb robots wander around randomly until they accidentally end up in the right place.
Policy #2: other robots may, for some reason, learn to hug the walls for most of the route.
Policy #3: smart robots plan the route in their "head" and go straight to the goal.
Obviously, some policies are better than others, and there are multiple ways to assess them, but the goal of RL is to learn the best policy; in this example, that is Policy #3. In other words, the policy defines the learning agent's way of behaving at a given time: the agent uses it to decide which action to perform in its current state. A small grid-world sketch of this comparison follows below.
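To make the comparison concrete, here is a sketch of the room example as a small grid world, contrasting the random policy (Policy #1) with the planned, straight-to-the-goal policy (Policy #3). The room size, the target at (5, 5), the reward of +1 on arrival, and the helper names are assumptions made for illustration.

```python
# Illustrative grid-world version of the room example.
# Assumed details: a 6x6 room, target at (5, 5), reward of +1 on arrival.

import random

TARGET = (5, 5)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

def step(state, action):
    """Environment dynamics: move the robot, keep it inside the room, return reward."""
    x = min(max(state[0] + action[0], 0), 5)
    y = min(max(state[1] + action[1], 0), 5)
    new_state = (x, y)
    reward = 1 if new_state == TARGET else 0
    return new_state, reward

def random_policy(state):
    """Policy #1: wander around randomly."""
    return random.choice(ACTIONS)

def planned_policy(state):
    """Policy #3: head straight toward the target."""
    if state[0] < TARGET[0]:
        return (1, 0)  # move right until the x-coordinate matches
    return (0, 1)      # then move up until the y-coordinate matches

def run(policy, start=(0, 0)):
    """Run one episode and count the steps until the reward is collected."""
    state, steps = start, 0
    while state != TARGET:
        state, _ = step(state, policy(state))
        steps += 1
    return steps

print("random policy steps: ", run(random_policy))
print("planned policy steps:", run(planned_policy))
```

Running the sketch shows why the policies differ in quality: the planned policy reaches the target in exactly 10 steps, while the random policy typically takes far longer.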