[Free] Reinforcement learning MATLAB code - CSDN Wenku

2 files in total

.m files: 2

Reinforcement learning

3 stars (rated higher than 75% of resources) · Points required: 0 · Uploaded 2014-08-11 19:39:35 · 2KB RAR

Resource details

Package contents

.rar (2 files)

Q-learning

ReinforcementLearning.m 3KB

RandomPermutation.m 929B

I have written the simple MATLAB code below for this tutorial example; you can modify it for your needs. Copy the two functions into separate files and run ReinforcementLearning. To model the environment, you need to build the immediate reward matrix R: put 0 for any door that does not lead directly to the goal and 100 for any door that does. For unconnected states, use minus infinity (-Inf). Since we want to maximize the Q values, such a very negative value is never considered; the code only picks actions whose reward is at least 0. The states are numbered 1 to N (N = 6 in the previous example). The code outputs only the normalized Q matrix. You can experiment with the parameter gamma to see how it influences the results.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Q learning of a single agent moving through N rooms
% Matlab Code companion of
% Q Learning by Example, by Kardi Teknomo
% (http://people.revoledu.com/kardi/)
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function q=ReinforcementLearning
clc;
format short
format compact

% Two inputs: R and gamma
% Immediate reward matrix;
% row and column = states; -Inf = no door between rooms
R=[-inf,-inf,-inf,-inf,   0, -inf;
   -inf,-inf,-inf,   0,-inf,  100;
   -inf,-inf,-inf,   0,-inf, -inf;
   -inf,   0,   0,-inf,   0, -inf;
      0,-inf,-inf,   0,-inf,  100;
   -inf,   0,-inf,-inf,   0,  100];
gamma=0.80;             % learning parameter

q=zeros(size(R));       % initialize Q as zero
q1=ones(size(R))*inf;   % initialize previous Q as a big number
count=0;                % counter

for episode=0:50000
    % random initial state
    y=randperm(size(R,1));
    state=y(1);

    % select any action from this state
    x=find(R(state,:)>=0);        % possible actions (doors) of this state
    if ~isempty(x)                % find returns a row vector, so test emptiness, not size(x,1)
        x1=RandomPermutation(x);  % randomize the possible actions (RandomPermutation.m is the second file in the archive)
        x1=x1(1);                 % select an action
    end

    qMax=max(q,[],2);             % max Q value over all actions of each state
    q(state,x1)=R(state,x1)+gamma*qMax(x1);
    state=x1;

    % break on convergence: small deviation of q for 1000 consecutive episodes
    if sum(sum(abs(q1-q)))<0.0001 && sum(sum(q>0))>0
        if count>1000
            episode    % report last episode
            break      % for
        else
            count=count+1;  % increment counter while the deviation of q is small
        end
    else
        q1=q;
        count=0;  % reset counter when the deviation of q from the previous q is large
    end
end

% normalize q
g=max(max(q));
if g>0
    q=100*q/g;
end
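The reward-matrix recipe in the description (0 for a door that does not lead directly to the goal, 100 for a door into the goal, -Inf where there is no door) can be sketched in Python as below. The helper name `build_reward_matrix` and its door-list input are hypothetical; the sketch assumes every door can be used in both directions and that staying in the goal room earns 100, which is what the R matrix in the MATLAB listing encodes.

```python
import math


def build_reward_matrix(n, doors, goal):
    """Build an n-by-n immediate reward matrix (hypothetical helper).

    Rooms are numbered 1..n as in ReinforcementLearning.m; the returned
    list of lists is 0-indexed, so room k lives at index k - 1.
    """
    R = [[-math.inf] * n for _ in range(n)]  # -Inf: no door between rooms
    for a, b in doors:
        # Assumption: each door can be used in both directions.
        for src, dst in ((a, b), (b, a)):
            R[src - 1][dst - 1] = 100 if dst == goal else 0
    # Staying in the goal room is also rewarded, matching R(6,6) = 100.
    R[goal - 1][goal - 1] = 100
    return R


# Doors of the six-room example encoded in the MATLAB listing.
R = build_reward_matrix(6, [(1, 5), (2, 4), (2, 6), (3, 4), (4, 5), (5, 6)], goal=6)
```

Listing doors once and mirroring them keeps the matrix symmetric in its connectivity, which is easy to get wrong when typing R by hand.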

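For readers without MATLAB, here is a minimal Python port of the same loop, written under our own assumptions: rooms are re-indexed from 0, Python's `random` module stands in for `randperm` and RandomPermutation.m, and the function name `q_learning` and its keyword arguments are hypothetical. It applies the same update as the MATLAB line, Q(s, a) = R(s, a) + gamma · max_b Q(a, b), where the chosen action a is also the next room, and the same stopping rule (total change below 1e-4 for more than 1000 consecutive episodes).

```python
import math
import random

NEG_INF = -math.inf

# Immediate reward matrix from the MATLAB listing, re-indexed from 0
# (MATLAB room k is index k - 1 here); -inf marks "no door".
R = [
    [NEG_INF, NEG_INF, NEG_INF, NEG_INF, 0, NEG_INF],
    [NEG_INF, NEG_INF, NEG_INF, 0, NEG_INF, 100],
    [NEG_INF, NEG_INF, NEG_INF, 0, NEG_INF, NEG_INF],
    [NEG_INF, 0, 0, NEG_INF, 0, NEG_INF],
    [0, NEG_INF, NEG_INF, 0, NEG_INF, 100],
    [NEG_INF, 0, NEG_INF, NEG_INF, 0, 100],
]


def q_learning(R, gamma=0.8, max_episodes=50000, tol=1e-4, patience=1000, seed=0):
    """Return the normalized Q matrix (hypothetical port of ReinforcementLearning.m)."""
    rng = random.Random(seed)
    n = len(R)
    q = [[0.0] * n for _ in range(n)]            # initialize Q as zero
    q_prev = [[math.inf] * n for _ in range(n)]  # snapshot of Q, starts "big"
    count = 0
    for _ in range(max_episodes + 1):
        state = rng.randrange(n)  # random initial state
        actions = [a for a in range(n) if R[state][a] >= 0]  # doors out of this room
        if not actions:
            continue
        action = rng.choice(actions)  # random action; it is also the next room
        # The right-hand side reads row `action` before the assignment,
        # just as MATLAB computes qMax before updating q.
        q[state][action] = R[state][action] + gamma * max(q[action])
        deviation = sum(abs(q_prev[i][j] - q[i][j]) for i in range(n) for j in range(n))
        if deviation < tol and any(v > 0 for row in q for v in row):
            if count > patience:
                break
            count += 1
        else:
            q_prev = [row[:] for row in q]  # take a new snapshot
            count = 0
    g = max(max(row) for row in q)
    if g > 0:
        q = [[100 * v / g for v in row] for row in q]
    return q


q = q_learning(R)
for row in q:
    print(["%.1f" % v for v in row])
```

The normalized values can be checked by hand. Because the goal room has a self-loop with reward 100, its Q value settles where Q = 100 + 0.8 · Q, i.e. at 500, and that maximum is what the code divides by. Moves into the goal therefore read 100, and each extra step away from the goal multiplies the value by gamma = 0.8, giving 80, 64 and 51.2.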