Derivation of Generalized Advantage Estimation (GAE)

Posted on 2026-03-09 | Category: Reinforcement Learning

The core logic of Generalized Advantage Estimation (GAE) in reinforcement learning, as implemented below:

```python
import torch
import verl.utils.torch_functional as verl_F  # provides masked_whiten


def compute_gae_advantage_return(
    token_level_rewards: torch.Tensor,
    values: torch.Tensor,
    response_mask: torch.Tensor,
    gamma: float,
    lam: float,
):
    """Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        values: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length). [EOS] mask. Tokens after [EOS] have mask zero.
        gamma: `(float)`
            discount factor used in RL
        lam: `(float)`
            lambda value for computing Generalized Advantage Estimation
            (https://arxiv.org/abs/1506.02438)

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """
    with torch.no_grad():
        lastgaelam = 0
        advantages_reversed = []
        gen_len = token_level_rewards.shape[-1]
        # Sweep backwards over the response, applying the GAE recursion
        # A_t = delta_t + gamma * lam * A_{t+1} at each step.
        for t in reversed(range(gen_len)):
            # V(s_{t+1}) is taken to be 0 past the last response token
            nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
            # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
            lastgaelam = delta + gamma * lam * lastgaelam
            advantages_reversed.append(lastgaelam)
        advantages = torch.stack(advantages_reversed[::-1], dim=1)

        # lambda-return target for the value function: R_t = A_t + V(s_t)
        returns = advantages + values
        # Normalize advantages to zero mean / unit variance over valid tokens
        advantages = verl_F.masked_whiten(advantages, response_mask)
    return advantages, returns
```

Derivation:
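A sketch of the standard derivation, following the GAE paper cited in the docstring (https://arxiv.org/abs/1506.02438); here $r_t$ is the token-level reward and $V(s_t)$ the critic value, i.e. `token_level_rewards[:, t]` and `values[:, t]` in the code above.

Define the one-step TD residual of the value function:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

The $k$-step advantage estimator telescopes into a sum of TD residuals:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$

GAE($\gamma$, $\lambda$) is the exponentially weighted average of all $k$-step estimators:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

The second equality follows by swapping the order of summation: $\delta_{t+l}$ appears in every $\hat{A}_t^{(k)}$ with $k > l$, picking up total weight $(1-\lambda)\,\gamma^l \sum_{k=l+1}^{\infty} \lambda^{k-1} = (\gamma\lambda)^l$. Splitting off the first term gives the backward recursion the code runs:

$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$$

which is exactly `lastgaelam = delta + gamma * lam * lastgaelam`, with the boundary conditions $\hat{A}_T = 0$ and $V(s_T) = 0$ past the final response token. Finally, `returns = advantages + values` computes the $\lambda$-return target $\hat{R}_t = \hat{A}_t + V(s_t)$ used to train the critic, and `masked_whiten` normalizes the advantages over the non-masked response tokens.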
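As a quick sanity check, a minimal sketch of calling the function above (shapes and values are illustrative, not from the original post; it assumes `compute_gae_advantage_return` and verl's `masked_whiten` are available in scope):

```python
import torch

# Batch of 2 responses, 4 tokens each, with a sparse reward on the final token
bs, response_length = 2, 4
token_level_rewards = torch.zeros(bs, response_length)
token_level_rewards[:, -1] = 1.0
values = torch.randn(bs, response_length)      # critic estimates V(s_t)
response_mask = torch.ones(bs, response_length)  # all tokens valid here

advantages, returns = compute_gae_advantage_return(
    token_level_rewards, values, response_mask, gamma=1.0, lam=0.95
)
print(advantages.shape, returns.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```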