Study Notes, Part 1


BCQ

Paper Walkthrough
1. Main Problems
2. Methods and Framework
  2.1 Policy Constraint
    2.1.1 Addressing Extrapolation Error
  2.2 The BCQ Algorithm
3. Code Walkthrough
  3.1 main: 3.1.1 interact_with_environment, 3.1.2 train_BCQ, 3.1.3 main
  3.2 DDPG: 3.2.1 Actor, 3.2.2 Critic, 3.2.3 DDPG
  3.3 BCQ: 3.3.1 Actor, 3.3.2 Critic, 3.3.3 VAE, 3.3.4 BCQ
  3.4 utils: 3.4.1 ReplayBuffer
4. Summary
  4.1 Data Collection Method
  4.2 The Batch-Constrained Idea
  4.3 Improved Q-Value Estimation

Paper Walkthrough

1. Main Problems

Data absence: offline RL relies solely on a fixed dataset, which cannot contain every possible transition.

Distribution shift: the main cause is the mismatch between the learned policy and the behavior policy, which arises because offline RL, unlike online RL, cannot learn by interacting with the environment.

2. Methods and Framework

2.1 Policy Constraint

If the selected (s, a) pairs can be kept as similar as possible to the data in dataset D, the problems above can be addressed. The paper therefore proposes the following objectives: (1) minimize the distance between the selected actions and the actions that exist in the dataset; (2) lead to states similar to those found in the dataset; (3) maximize the value function. The first point is the most important: only if it is satisfied can the second and third be estimated accurately.

2.1.1 Addressing Extrapolation Error

First, the transition probability p_B of the MDP M_B induced by the batch is defined:
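The figure with this formula did not survive; as best I recall from the BCQ paper (Fujimoto et al., 2019), the batch transition probability is estimated from transition counts in B, roughly:

p_B(s' \mid s, a) = \frac{N(s, a, s')}{\sum_{\tilde{s}} N(s, a, \tilde{s})}

where N(s, a, s') is the number of times the transition (s, a, s') appears in B.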

Next, an error function is defined to explain where these errors come from (Q is the true value function we want to learn, and Q_B is the value function induced by the dataset):
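Reconstructing the missing formula from memory of the paper, the extrapolation error is the gap between the two value functions:

\epsilon_{\mathrm{MDP}}(s, a) = Q^{\pi}(s, a) - Q^{\pi}_{B}(s, a)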

Through derivation this can be rewritten in the following form:
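Again from recollection of the paper, the derivation gives a Bellman-like recursion in which the error depends only on the mismatch between the true transitions p_M and the batch transitions p_B:

\epsilon_{\mathrm{MDP}}(s,a) = \sum_{s'} \Big[ \big(p_M(s' \mid s,a) - p_B(s' \mid s,a)\big)\Big(r(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}_{B}(s',a')\Big) + p_M(s' \mid s,a)\,\gamma \sum_{a'} \pi(a' \mid s')\, \epsilon_{\mathrm{MDP}}(s',a') \Big]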

From this derivation we can see that when the transition probabilities p_M and p_B agree, ε_MDP = 0, which in turn makes the error in the following expression zero as well:
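I am not certain which expression the lost figure showed here; my best guess, based on the paper, is the total extrapolation error weighted by the policy's visitation distribution:

\epsilon^{\pi}_{\mathrm{MDP}} = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi(a \mid s)\, \big|\epsilon_{\mathrm{MDP}}(s, a)\big|

which is zero exactly when the batch covers the transitions that π actually visits.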

With this result, when ε_MDP = 0 and the initial state s_0 is contained in the batch, the batch constraint can be combined with Q-learning (BCQL), giving the following update:
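The BCQL update referred to here is, as I recall, standard Q-learning restricted to actions that appear in the batch:

Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \Big( r + \gamma \max_{a' \,\text{s.t.}\, (s', a') \in B} Q(s', a') \Big)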

From this update, the following two theorems are obtained:

1. With learning rate α and standard sampling from the environment, BCQL converges to the optimal action-value function Q*.
2. Given a deterministic MDP and a coherent dataset B, with learning rate α, BCQL converges to Q^{π_B}(s, a), where π_B is the optimal batch-constrained policy.

2.2 The BCQ Algorithm

Extending BCQL to continuous action spaces gives the BCQ algorithm. To satisfy the batch-constrained condition, BCQ uses a generative model, a VAE. For a given state, BCQ uses the generative model to produce a set of actions similar to those in the batch, and then uses the Q network to pick the highest-valued one. In addition, the value-estimation step penalizes unfamiliar future states, in a way similar to Clipped Double Q-learning. As a result, BCQ learns a policy whose state-action visitation distribution is close to that of the dataset. Concretely, the idea is to use a similarity probability (how likely the pair (s, a) is under the data in D) to reduce the errors caused by extrapolation. Estimating this probability directly is difficult, so a generative model is introduced to approximate it; here a VAE is used, together with the Q network, to select the action with the highest estimated value. A perturbation model is also added to widen exploration around the sampled actions, with its output restricted to a bounded range, which finally yields the following behavior policy:
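Reconstructed from the paper (a_i are the n candidate actions sampled from the generative model G_ω, and ξ_φ is the perturbation model bounded by Φ):

\pi(s) = \underset{a_i + \xi_{\phi}(s, a_i, \Phi)}{\arg\max}\; Q_{\theta}\big(s,\, a_i + \xi_{\phi}(s, a_i, \Phi)\big), \qquad \{a_i \sim G_{\omega}(s)\}_{i=1}^{n}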

In the expression above, the values of n and Φ determine whether the algorithm behaves more like imitation learning or like reinforcement learning. When n = 1 and Φ = 0, the algorithm is pure imitation learning, i.e., a one-to-one reproduction of the behavior policy in dataset D; as n grows large and Φ approaches the full action range, BCQ becomes similar to Q-learning. The perturbation model is trained with an objective similar to DDPG's:
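As I remember it from the paper, the perturbation model maximizes the value of the perturbed action in DPG style:

\phi \leftarrow \underset{\phi}{\arg\max} \sum_{(s, a) \in B} Q_{\theta}\big(s,\, a + \xi_{\phi}(s, a, \Phi)\big), \qquad a \sim G_{\omega}(s)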

Finally, BCQ estimates action values with a variant of Clipped Double Q-learning, in which two action-value networks are trained and the minimum of the two would normally be taken as the estimate. BCQ modifies this by combining the two estimates in a new way (the pseudocode figure from the original post is not reproduced here):
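The modified target, reconstructed from the paper and matching the lmbda term in the code below, is a convex combination of the two target critics weighted toward the minimum:

r + \gamma \max_{a_i} \Big[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \Big]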

The VAE used in the middle of the algorithm works as follows. The VAE G_ω is defined by two networks, an encoder E_ω1(s, a) and a decoder D_ω2(s, z), where ω = {ω1, ω2}. The encoder takes a state-action pair and outputs the mean μ and standard deviation σ of a Gaussian N(μ, σ). The state s, together with a latent vector z sampled from this Gaussian, is passed to the decoder D_ω2(s, z), which outputs an action. The networks follow the default architecture (Figure 10 in the paper), except with two hidden layers of size 750 instead of 400 and 300, and are trained on the reconstruction mean-squared error together with a KL regularization term. Because both distributions are Gaussian, the KL term has a simple closed form:
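As far as I remember, the VAE loss is reconstruction error plus the closed-form Gaussian KL term (this matches vae_loss = recon_loss + 0.5 * KL_loss in the code below):

\mathcal{L}_{\mathrm{VAE}} = \sum \big(a - \tilde{a}\big)^2 + \lambda_{\mathrm{KL}}\, D_{\mathrm{KL}}\big( \mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, 1) \big),
\qquad
D_{\mathrm{KL}} = -\tfrac{1}{2} \sum \big( 1 + \log \sigma^2 - \mu^2 - \sigma^2 \big)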

3. Code Walkthrough

The code consists of four parts: main, DDPG, BCQ, and utils.

3.1 main

main has three parts. Part 1: use DDPG to generate the 1M transitions we need. Part 2: train BCQ. Part 3: the entry point, which loads data from the generated buffer for BCQ training and sets up the relevant hyperparameters.

3.1.1 interact_with_environment

This function interacts with the environment to generate the data we need (eval_policy, used below, is a helper defined elsewhere in main.py and not shown here):

# Imports used throughout main.py
import argparse
import os

import gym
import numpy as np
import torch

import BCQ
import DDPG
import utils


def interact_with_environment(env, state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize and load policy
    policy = DDPG.DDPG(state_dim, action_dim, max_action, device)  # , args.discount, args.tau)
    if args.generate_buffer:
        policy.load(f"./models/behavioral_{setting}")

    # Initialize buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)

    evaluations = []

    state, done = env.reset(), False
    episode_reward = 0
    episode_timesteps = 0
    episode_num = 0

    # Interact with the environment for max_timesteps
    for t in range(int(args.max_timesteps)):

        episode_timesteps += 1

        # Select action with noise
        if (
            (args.generate_buffer and np.random.uniform(0, 1) < args.rand_action_p) or
            (args.train_behavioral and t < args.start_timesteps)
        ):
            action = env.action_space.sample()
        else:
            action = (
                policy.select_action(np.array(state))
                + np.random.normal(0, max_action * args.gaussian_std, size=action_dim)
            ).clip(-max_action, max_action)

        # Perform action
        next_state, reward, done, _ = env.step(action)
        done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

        # Store data in replay buffer
        replay_buffer.add(state, action, next_state, reward, done_bool)

        state = next_state
        episode_reward += reward

        # Train agent after collecting sufficient data
        if args.train_behavioral and t >= args.start_timesteps:
            policy.train(replay_buffer, args.batch_size)

        if done:
            # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
            print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
            # Reset environment
            state, done = env.reset(), False
            episode_reward = 0
            episode_timesteps = 0
            episode_num += 1

        # Evaluate episode
        if args.train_behavioral and (t + 1) % args.eval_freq == 0:
            evaluations.append(eval_policy(policy, args.env, args.seed))
            np.save(f"./results/behavioral_{setting}", evaluations)
            policy.save(f"./models/behavioral_{setting}")

    # Save final policy
    if args.train_behavioral:
        policy.save(f"./models/behavioral_{setting}")

    # Save final buffer and performance
    else:
        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/buffer_performance_{setting}", evaluations)
        replay_buffer.save(f"./buffers/{buffer_name}")

3.1.2 train_BCQ

This function trains BCQ offline from the saved buffer:

def train_BCQ(state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize policy
    policy = BCQ.BCQ(state_dim, action_dim, max_action, device, args.discount, args.tau, args.lmbda, args.phi)

    # Load buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)
    replay_buffer.load(f"./buffers/{buffer_name}")

    evaluations = []
    episode_num = 0
    done = True
    training_iters = 0

    while training_iters < args.max_timesteps:
        pol_vals = policy.train(replay_buffer, iterations=int(args.eval_freq), batch_size=args.batch_size)

        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/BCQ_{setting}", evaluations)

        training_iters += args.eval_freq
        print(f"Training iterations: {training_iters}")

3.1.3 main

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="Hopper-v3")                 # OpenAI gym environment name
    parser.add_argument("--seed", default=0, type=int)                # Sets Gym, PyTorch and Numpy seeds
    parser.add_argument("--buffer_name", default="Robust")            # Prefix for the saved buffer files
    parser.add_argument("--eval_freq", default=5e3, type=float)       # How often (in time steps) the policy is evaluated
    parser.add_argument("--max_timesteps", default=1e6, type=int)     # Maximum number of time steps to run / train for
    parser.add_argument("--start_timesteps", default=25e3, type=int)  # Time steps with a purely random policy before DDPG training starts
    parser.add_argument("--rand_action_p", default=0.3, type=float)   # Probability of taking a random action when generating the buffer
    parser.add_argument("--gaussian_std", default=0.3, type=float)    # Std of the Gaussian exploration noise
    parser.add_argument("--batch_size", default=100, type=int)        # Mini-batch size sampled from the buffer
    parser.add_argument("--discount", default=0.99)                   # Discount factor
    parser.add_argument("--tau", default=0.005)                       # Target network update rate
    parser.add_argument("--lmbda", default=0.75)                      # Weighting for clipped double Q-learning in BCQ
    parser.add_argument("--phi", default=0.05)                        # Maximum perturbation hyper-parameter for BCQ
    parser.add_argument("--train_behavioral", action="store_true")    # If true, train behavioral (DDPG)
    parser.add_argument("--generate_buffer", action="store_true")     # If true, generate buffer
    args = parser.parse_args()

    print("---------------------------------------")
    if args.train_behavioral:
        print(f"Setting: Training behavioral, Env: {args.env}, Seed: {args.seed}")
    elif args.generate_buffer:
        print(f"Setting: Generating buffer, Env: {args.env}, Seed: {args.seed}")
    else:
        print(f"Setting: Training BCQ, Env: {args.env}, Seed: {args.seed}")
    print("---------------------------------------")

    if args.train_behavioral and args.generate_buffer:
        print("Train_behavioral and generate_buffer cannot both be true.")
        exit()

    if not os.path.exists("./results"):
        os.makedirs("./results")

    if not os.path.exists("./models"):
        os.makedirs("./models")

    if not os.path.exists("./buffers"):
        os.makedirs("./buffers")

    env = gym.make(args.env)

    env.seed(args.seed)
    env.action_space.seed(args.seed)
    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Either collect data (train DDPG / generate the buffer) or train BCQ offline
    if args.train_behavioral or args.generate_buffer:
        interact_with_environment(env, state_dim, action_dim, max_action, device, args)
    else:
        train_BCQ(state_dim, action_dim, max_action, device, args)

3.2 DDPG

Part 1: the Actor network, which selects the action a. Part 2: the Critic network, which evaluates the value function Q. Part 3: the DDPG class, which trains the Actor and Critic and updates their parameters.

3.2.1 Actor

# Imports used throughout DDPG.py
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))

3.2.2 Critic

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        q = F.relu(self.l1(torch.cat([state, action], 1)))
        q = F.relu(self.l2(q))
        return self.l3(q)

3.2.3 DDPG

class DDPG(object):
    # Initialize networks, optimizers and hyper-parameters
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters())

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters())

        self.discount = discount
        self.tau = tau
        self.device = device

    # Select an action for a single state
    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(self.device)
        return self.actor(state).cpu().data.numpy().flatten()

    # Train the actor and critic on one mini-batch
    def train(self, replay_buffer, batch_size=100):
        # Sample replay buffer
        state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

        # Compute the target Q value
        target_Q = self.critic_target(next_state, self.actor_target(next_state))
        target_Q = reward + (not_done * self.discount * target_Q).detach()

        # Compute the current Q value
        current_Q = self.critic(state, action)

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Compute actor loss
        actor_loss = -self.critic(state, self.actor(state)).mean()

        # Optimize the actor
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

    # Save model and optimizer parameters
    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")
        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")

    # Load model and optimizer parameters
    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)

3.3 BCQ

BCQ has four parts. Part 1: the Actor (perturbation) network, which produces the action a. Part 2: the Critic network, which evaluates the value function Q. Part 3: the VAE, which models the behavior policy in dataset D and generates candidate actions. Part 4: the training loop.

3.3.1 Actor

# Imports used throughout BCQ.py
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, phi=0.05):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action
        self.phi = phi

    def forward(self, state, action):
        # Perturb the VAE-sampled action by at most phi * max_action
        a = F.relu(self.l1(torch.cat([state, action], 1)))
        a = F.relu(self.l2(a))
        a = self.phi * self.max_action * torch.tanh(self.l3(a))
        return (a + action).clamp(-self.max_action, self.max_action)

3.3.2 Critic

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

        self.l4 = nn.Linear(state_dim + action_dim, 400)
        self.l5 = nn.Linear(400, 300)
        self.l6 = nn.Linear(300, 1)

    def forward(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)

        q2 = F.relu(self.l4(torch.cat([state, action], 1)))
        q2 = F.relu(self.l5(q2))
        q2 = self.l6(q2)
        return q1, q2

    def q1(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)
        return q1

3.3.3 VAE

class VAE(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim, max_action, device):
        super(VAE, self).__init__()
        self.e1 = nn.Linear(state_dim + action_dim, 750)
        self.e2 = nn.Linear(750, 750)

        self.mean = nn.Linear(750, latent_dim)     # mean of the latent Gaussian
        self.log_std = nn.Linear(750, latent_dim)  # log standard deviation of the latent Gaussian

        self.d1 = nn.Linear(state_dim + latent_dim, 750)
        self.d2 = nn.Linear(750, 750)
        self.d3 = nn.Linear(750, action_dim)

        self.max_action = max_action
        self.latent_dim = latent_dim
        self.device = device

    def forward(self, state, action):
        z = F.relu(self.e1(torch.cat([state, action], 1)))
        z = F.relu(self.e2(z))

        mean = self.mean(z)
        # Clamped for numerical stability
        log_std = self.log_std(z).clamp(-4, 15)
        std = torch.exp(log_std)
        z = mean + std * torch.randn_like(std)

        # Decode the sampled latent into an action
        u = self.decode(state, z)

        return u, mean, std

    # Decoder: maps (state, latent) to an action
    def decode(self, state, z=None):
        # When sampling from the VAE, the latent vector is clipped to [-0.5, 0.5]
        if z is None:
            z = torch.randn((state.shape[0], self.latent_dim)).to(self.device).clamp(-0.5, 0.5)

        a = F.relu(self.d1(torch.cat([state, z], 1)))
        a = F.relu(self.d2(a))
        return self.max_action * torch.tanh(self.d3(a))

3.3.4 BCQ

class BCQ(object):
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005, lmbda=0.75, phi=0.05):
        latent_dim = action_dim * 2

        self.actor = Actor(state_dim, action_dim, max_action, phi).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-3)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=1e-3)

        self.vae = VAE(state_dim, action_dim, latent_dim, max_action, device).to(device)
        self.vae_optimizer = torch.optim.Adam(self.vae.parameters())

        self.max_action = max_action
        self.action_dim = action_dim
        self.discount = discount
        self.tau = tau
        self.lmbda = lmbda
        self.device = device

    # Select the highest-valued action among 100 perturbed VAE samples
    def select_action(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state.reshape(1, -1)).repeat(100, 1).to(self.device)
            action = self.actor(state, self.vae.decode(state))
            q1 = self.critic.q1(state, action)
            ind = q1.argmax(0)
        return action[ind].cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=100):
        for it in range(iterations):
            # Sample replay buffer / batch
            state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

            # Variational auto-encoder training
            recon, mean, std = self.vae(state, action)
            recon_loss = F.mse_loss(recon, action)
            KL_loss = -0.5 * (1 + torch.log(std.pow(2)) - mean.pow(2) - std.pow(2)).mean()
            vae_loss = recon_loss + 0.5 * KL_loss

            self.vae_optimizer.zero_grad()
            vae_loss.backward()
            self.vae_optimizer.step()

            # Critic training
            with torch.no_grad():
                # Duplicate next state 10 times
                next_state = torch.repeat_interleave(next_state, 10, 0)

                # Compute value of perturbed actions sampled from the VAE
                target_Q1, target_Q2 = self.critic_target(next_state, self.actor_target(next_state, self.vae.decode(next_state)))

                # Soft Clipped Double Q-learning
                target_Q = self.lmbda * torch.min(target_Q1, target_Q2) + (1. - self.lmbda) * torch.max(target_Q1, target_Q2)

                # Take max over each action sampled from the VAE
                target_Q = target_Q.reshape(batch_size, -1).max(1)[0].reshape(-1, 1)

                target_Q = reward + not_done * self.discount * target_Q

            current_Q1, current_Q2 = self.critic(state, action)
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Perturbation model / action training
            sampled_actions = self.vae.decode(state)
            perturbed_actions = self.actor(state, sampled_actions)

            # Update through DPG
            actor_loss = -self.critic.q1(state, perturbed_actions).mean()

            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the target networks
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

3.4 utils

utils handles data storage and loading; most implementations follow this same template.

3.4.1 ReplayBuffer

# Imports used throughout utils.py
import numpy as np
import torch


class ReplayBuffer(object):
    def __init__(self, state_dim, action_dim, device, max_size=int(1e6)):
        self.max_size = max_size
        self.ptr = 0    # index where the next transition will be written
        self.size = 0   # number of transitions currently stored

        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.not_done = np.zeros((max_size, 1))

        self.device = device

    # Add a new transition
    def add(self, state, action, next_state, reward, done):
        self.state[self.ptr] = state
        self.action[self.ptr] = action
        self.next_state[self.ptr] = next_state
        self.reward[self.ptr] = reward
        self.not_done[self.ptr] = 1. - done

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    # Sample a mini-batch of transitions
    def sample(self, batch_size):
        # Indices of the sampled transitions
        ind = np.random.randint(0, self.size, size=batch_size)

        return (
            torch.FloatTensor(self.state[ind]).to(self.device),
            torch.FloatTensor(self.action[ind]).to(self.device),
            torch.FloatTensor(self.next_state[ind]).to(self.device),
            torch.FloatTensor(self.reward[ind]).to(self.device),
            torch.FloatTensor(self.not_done[ind]).to(self.device)
        )

    # Save the buffer to disk
    def save(self, save_folder):
        np.save(f"{save_folder}_state.npy", self.state[:self.size])
        np.save(f"{save_folder}_action.npy", self.action[:self.size])
        np.save(f"{save_folder}_next_state.npy", self.next_state[:self.size])
        np.save(f"{save_folder}_reward.npy", self.reward[:self.size])
        np.save(f"{save_folder}_not_done.npy", self.not_done[:self.size])
        np.save(f"{save_folder}_ptr.npy", self.ptr)

    # Load a previously saved buffer
    def load(self, save_folder, size=-1):
        reward_buffer = np.load(f"{save_folder}_reward.npy")

        # Adjust crt_size if we're using a custom size
        size = min(int(size), self.max_size) if size > 0 else self.max_size
        self.size = min(reward_buffer.shape[0], size)

        self.state[:self.size] = np.load(f"{save_folder}_state.npy")[:self.size]
        self.action[:self.size] = np.load(f"{save_folder}_action.npy")[:self.size]
        self.next_state[:self.size] = np.load(f"{save_folder}_next_state.npy")[:self.size]
        self.reward[:self.size] = reward_buffer[:self.size]
        self.not_done[:self.size] = np.load(f"{save_folder}_not_done.npy")[:self.size]

4. Summary

This section summarizes the methods used in BCQ, extends them a little, explores other possibilities, and draws some comparisons.

4.1 Data Collection Method

The paper uses DDPG to collect the data. The main reasons for this choice are as follows:

DDPG is an off-policy algorithm that learns from a replay buffer, which matches how offline RL samples from a dataset, and it provides a natural baseline to contrast with BCQ, demonstrating why an ordinary off-policy algorithm cannot be applied directly to the offline setting. In addition, parts of BCQ are built directly on DDPG and share components with it (for example, the DPG-style perturbation training), and the algorithm is also very close to TD3: both select actions using a deterministic policy plus a bounded perturbation.

4.2 The Batch-Constrained Idea

BCQ's constraint was proposed to address the out-of-distribution and distribution-shift problems. The main idea is to tie the learned policy to the behavior policy, restricting the selected state-action pairs as much as possible to the known dataset B. This batch-constrained idea is realized by training with a VAE that imitates the data, plus a perturbation model. The downside is that performance depends heavily on the quality of the dataset, since the policies BCQ explores remain highly similar to the policies in the data. One possible extension: BCQ essentially reproduces the dataset's state-action distribution exactly, but perhaps we only need to guarantee that the selected state-action pairs exist in the dataset, without forcing their probabilities to match. We could define a divergence between the learned policy and the behavior policy and set a threshold so that a certain amount of constraint is kept without being exceeded, as sketched below. This would avoid out-of-distribution actions, but distribution shift could still occur, so how to handle distribution shift remains a point worth considering.
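One way to make this idea concrete (my own sketch, not something from the BCQ paper; D is a generic divergence such as KL, π_b the behavior policy, and ε the threshold mentioned above):

\max_{\pi}\; \mathbb{E}_{s \sim B}\big[ Q(s, \pi(s)) \big] \quad \text{subject to} \quad D\big( \pi(\cdot \mid s) \,\|\, \pi_b(\cdot \mid s) \big) \le \epsilon \quad \text{for all } s \in B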

4.3 Improved Q-Value Estimation

We can also see that the Q-value estimate was modified (see the target formula in Section 2.2). The change to the Clipped Double Q-learning formula is to take a convex combination of the two estimates, giving a higher weight to the minimum. This curbs overestimation while also reducing the influence of rarely seen states.
