TF2-based reinforcement learning algorithm implementations, 50+ recipes (基于tf2的强化学习算法实现-50+个recipe.zip) - CSDN Library

12 files in total

py: 12

Points required: 5 · 104 views · Uploaded 2024-05-11 13:48:55 · 21 KB ZIP

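Reinforcement learning (RL), also called evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent that interacts with an environment and learns a policy to maximize return or reach a specific goal. Its defining feature is that there is no supervised data, only a reward signal. The common model for RL is the standard Markov decision process (MDP). Depending on the given conditions, RL divides into model-based RL and model-free RL, and into active RL and passive RL. Variants include inverse RL, hierarchical RL, and RL for partially observable systems. Algorithms for solving RL problems fall into two classes: policy search algorithms and value function algorithms.

RL theory is inspired by behaviorist psychology. It emphasizes online learning and tries to keep a balance between exploration and exploitation. Unlike supervised and unsupervised learning, RL does not require any data to be given in advance; instead it obtains learning information and updates model parameters from the rewards (feedback) that the environment returns for its actions. RL problems are discussed in information theory, game theory, automatic control, and other fields, and RL has been used to explain equilibria under bounded rationality and to design recommender systems and robot interaction systems. Some sophisticated RL algorithms possess, to a degree, general intelligence for solving complex problems and can reach human-level performance in Go and video games.

RL is also widely applied in engineering. For example, Facebook released the open-source RL platform Horizon, which uses RL to optimize large-scale production systems. In healthcare, RL systems can propose treatment policies for patients; they can find an optimal policy from past experience without prior information such as a mathematical model of the biological system, which makes RL-based systems more broadly applicable. In short, RL is a learning process in which an agent interacts with an environment with the goal of maximizing cumulative reward, and it has shown strong application potential in many fields.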
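The value-function approach and the exploration-exploitation balance described above can be made concrete with a minimal sketch of tabular Q-learning with an epsilon-greedy policy. The 5-state chain environment, the function names (`chain_step`, `epsilon_greedy`, `q_learning`), and the hyperparameter values are illustrative assumptions and are not taken from the archive.

```python
import random

# Hypothetical toy MDP (an assumption, not part of the archive):
# states 0..4 in a row; action 0 moves left, action 1 moves right.
# Reaching state 4 ends the episode with reward 1; all other steps give 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4


def chain_step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties randomly so the untrained agent does not get stuck.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Model-free value-function learning: only rewards are observed."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = chain_step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy action for every non-terminal state.
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a])
                 for s in range(GOAL)]
print(greedy_policy)  # [1, 1, 1, 1]: always move right, toward the goal
```

The agent is never told which action is correct. It only receives the reward signal, which is the distinction from supervised learning drawn above.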


基于tf2的强化学习算法实现-50+个recipe.zip (12 files)

content

ch_01

neu_evo_agent.py 3KB

nn_rl_discrete.py 2KB

nn_rl_continuous.py 5KB

SimpleImageViewer.py 915B

build_env_mechanism.py 7KB

value_function_utils.py 5KB

ch_02

stochastic_env.py 7KB

monte_carlo_rl.py 5KB

value_based_rl.py 4KB

policy_gradients.py 4KB

sarsa_algorithm.py 1KB

temporal_diff.py 5KB

""" The learning environment is a simulator that provides observations for the RL agent, supports a set of actions that the RL agent can perform by executing the actions, and returns the resultant/new observation as a result of the agent taking the action. """ import gym import numpy as np from typing import List class MazeEnv(gym.Env): """Maze learning environment that represents a simple 2D map with cells representing the location of the agent, their goal, walls, coins, and empty space:""" def __init__(self, stochastic=True): self.map = np.asarray(['SWFWG', 'OOOOO', 'WOOOW', 'FOWFW']) self.observation_space = gym.spaces.Discrete(1) self.dim = (4, 5) self.img_map = np.ones(self.dim) if stochastic: self.slip = True self.distinct_states = 112 self.action_space = gym.spaces.Discrete(4) self.obstacles = [(0, 1), (0, 3), (2, 0), (2, 4), (3, 2), (3, 4)] for x in self.obstacles: self.img_map[x[0]][x[1]] = 0 # Clock-wise action slip for stochasticity self.slip_action_map = {0: 3, 1: 2, 2: 0, 3: 1, } self.slip_probability = 0.1 self.start_pos = (0, 0) self.goal_pos = (0, 4) # a lookup table in the form of a dictionary to map indices to cells # in the Maze environment: self.index_to_coordinate_map = { 0: (0, 0), 1: (1, 0), 2: (3, 0), 3: (1, 1), 4: (2, 1), 5: (3, 1), 6: (0, 2), 7: (1, 2), 8: (2, 2), 9: (1, 3), 10: (2, 3), 11: (3, 3), 12: (0, 4), 13: (1, 4), } self.coordinate_to_index_map = dict((v, k) for k, v in self.index_to_coordinate_map.items()) self.state = self.coordinate_to_index_map[self.start_pos] def num2coin(self, n: int): """a method that will handle the coins and their statuses in the Maze, where 0 means that the coin wasn't collected by the agent and 1 means that the coin was collected by the agent""" coinlist = [ (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1), ] return list(coinlist[n]) def coin2num(self, v: List): if sum(v) < 2: return np.inner(v, [1, 2, 3]) else: return np.inner(v, [1, 2, 3]) + 1 def set_state(self, 
state: int) -> None: """a setter function to set the state of the environment. This is useful for algorithms such as value iteration, where each and every state needs to be visited in the environment for it to calculate values: Args: state (int): A valid state in the Maze env int: [0, 112] """ self.state = state def reset(self): """return the initial state""" self.state = self.coordinate_to_index_map[self.start_pos] return self.state def step(self, action, slip=True): """ Run one step into the Maze env Args: state (Any): Current index state of the maze action (int): Discrete action for up, down, left, right slip (bool, optional): Stochasticity in the env. Defaults to True. Raises: ValueError: If invalid action is provided as input Returns: Tuple : Next state, reward, done, _ """ self.slip = slip if self.slip: if np.random.rand() < self.slip_probability: action = self.slip_action_map[action] # update the state of the maze based on the action that's taken: cell = self.index_to_coordinate_map[int(self.state / 8)] if action == 0: c_next = cell[1] r_next = max(0, cell[0] - 1) elif action == 1: c_next = cell[1] r_next = min(self.dim[0] - 1, cell[0] + 1) elif action == 2: c_next = max(0, cell[1] - 1) r_next = cell[0] elif action == 3: c_next = min(self.dim[1] - 1, cell[1] + 1) r_next = cell[0] else: raise ValueError(f"Invalid action:L{action}") # determine whether the agent has reached the goal: if (r_next == self.goal_pos[0]) and (c_next == self.goal_pos[1]): v_coin = self.num2coin(self.state % 8) self.state = (8 * self.coordinate_to_index_map[(r_next, c_next)] + self.state % 8) return self.state, float(sum(v_coin)), True else: # handle cases when the action results in hitting an obstacle/wall: if (r_next, c_next) in self.obstacles: return self.state, 0.0, False else: # e action leads to collecting a coin: v_coin = self.num2coin(self.state % 8) if (r_next, c_next) == (0, 2): v_coin[0] = 1 elif (r_next, c_next) == (3, 0): v_coin[1] = 1 elif (r_next, c_next) == (3, 3): 
v_coin[2] = 1 self.state = 8 * self.coordinate_to_index_map[(r_next, c_next)] + self.coin2num( v_coin) return self.state, 0.0, False def render(self): """implement a render function that will print out a text version of the current state of the Maze environment:""" cell = self.index_to_coordinate_map[int(self.state / 8)] desc = self.map.tolist() desc[cell[0]] = ( desc[cell[0]][:cell[1]] + "\x1b[1;34m" # Blue font + "\x1b[4m" # Underline + "\x1b[1m" # Bold + "\x1b[7m" # Reversed + desc[cell[0]][cell[1]] + "\x1b[0m" + desc[cell[0]][cell[1] + 1:] ) print("\n".join("".join(row) for row in desc)) if __name__ == "__main__": env = MazeEnv() obs = env.reset() env.render() done = False step_num = 1 action_list = ['UP', 'DOWN', 'RIGHT', 'LEFT'] while not done: # sample a random action from the action space action0 = env.action_space.sample() next_obs, reward, done = env.step(action0) print(f"step {step_num} action:L {action_list[action0]} reward:{reward} done: {done}") step_num += 1 env.render() env.close() """ Our map, as defined in step 1 in the How to do it... section, represents the state of the learning environment. The Maze environment defines the observation space, the action space, and the rewarding mechanism for implementing a Markov decision process (MDP). We sampled a valid action from the action space of the environment and stepped the environment with the chosen action, which resulted in us getting the new observation, reward, and a done status Boolean (representing whether the episode has finished) as the response from the Maze environment. The env.render() method converts the environment's internal grid representation into a simple text/string grid and prints it for easy visual understanding. 我们的地图，如“如何做到”部分中的步骤1所定义的，代表了学习环境的状态。迷宫环境定义了观察空间、行动空间和实施马可夫决策过程(mDP)的奖励机制。我们从环境的行为空间中采样了一个有效的行为，并用所选择的行为步进了环境，这导致我们得到了新的观察、奖励和一个已完成的状态布尔值(表示该事件是否已经完成)作为迷宫环境的响应。 Render ()方法将环境的内部网格表示转换为一个简单的文本/字符串网格，并将其打印出来，以便于视觉理解。 """
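The MazeEnv code above packs two pieces of information into one integer state: `8 * cell_index + coin_code`, where the cell index comes from `index_to_coordinate_map` and the coin code from the table in `num2coin`. The sketch below unpacks and repacks that encoding without needing gym. The helper names `decode_state` and `encode_state` are hypothetical and do not exist in the archive; the two lookup tables are copied from the class.

```python
from typing import Sequence, Tuple

# Copied from MazeEnv.index_to_coordinate_map: cell index -> (row, col).
INDEX_TO_COORDINATE = {
    0: (0, 0), 1: (1, 0), 2: (3, 0), 3: (1, 1), 4: (2, 1),
    5: (3, 1), 6: (0, 2), 7: (1, 2), 8: (2, 2), 9: (1, 3),
    10: (2, 3), 11: (3, 3), 12: (0, 4), 13: (1, 4),
}
COORDINATE_TO_INDEX = {v: k for k, v in INDEX_TO_COORDINATE.items()}

# Copied from MazeEnv.num2coin: position n holds the flags for coin code n
# (1 = collected), for the coins at (0, 2), (3, 0), and (3, 3) in that order.
COIN_TABLE = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
]


def decode_state(state: int) -> Tuple[Tuple[int, int], Tuple[int, int, int]]:
    """Hypothetical helper: split a flat state into ((row, col), coin flags)."""
    cell_index, coin_code = divmod(state, 8)
    return INDEX_TO_COORDINATE[cell_index], COIN_TABLE[coin_code]


def encode_state(cell: Tuple[int, int], coins: Sequence[int]) -> int:
    """Hypothetical helper: inverse of decode_state."""
    return 8 * COORDINATE_TO_INDEX[cell] + COIN_TABLE.index(tuple(coins))


# Start cell with no coins collected.
print(decode_state(0))                    # ((0, 0), (0, 0, 0))
# Goal cell (0, 4) with all three coins collected.
print(encode_state((0, 4), (1, 1, 1)))    # 103
```

With 14 free cells and 8 coin combinations, the valid states are the integers 0 through 111, which matches `distinct_states = 112` in the class.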
