news 2026/5/25 19:25:04

"From the Past to Intelligence": Complete CPT Reinforcement Learning Implementation (PyTorch, Actor-Critic + CPT)


张小明

Front-end Development Engineer



Project Overview

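This implementation integrates Cumulative Prospect Theory (CPT) into an Actor-Critic reinforcement learning framework, so that the agent does not only maximize expected return but also reflects human psychological traits such as loss aversion, probability distortion, and reference-point dependence. In the code in this article, loss aversion and reference dependence enter through the CPT value function; the probability-weighting parameters (gamma_gain, gamma_loss) are defined but not applied.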

This is the practical path "from the past (traditional RL) to intelligence (behavioral intelligence)".
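
As a rough illustration of the three traits named above, the sketch below evaluates the Tversky-Kahneman value function and probability-weighting function in plain Python. The parameter defaults (alpha = beta = 0.88, lambda = 2.25, gamma = 0.61) are the ones used in this article's code; the function names `cpt_value` and `cpt_weight` are illustrative assumptions and are not part of the implementation.

```python
def cpt_value(x, reference=0.0, alpha=0.88, beta=0.88, lambda_loss=2.25):
    """Value of outcome x, judged as a gain or loss relative to `reference`."""
    d = x - reference
    if d >= 0:
        return d ** alpha
    return -lambda_loss * (-d) ** beta


def cpt_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting w(p) for p in [0, 1]."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


if __name__ == "__main__":
    # Loss aversion: with alpha == beta, a loss of 100 weighs 2.25 times
    # as much as a gain of 100.
    print(cpt_value(100.0), cpt_value(-100.0))
    # Reference dependence: the same outcome becomes a loss under a higher reference.
    print(cpt_value(50.0, reference=0.0), cpt_value(50.0, reference=80.0))
    # Probability distortion: rare events are overweighted, likely ones underweighted.
    print(cpt_weight(0.01), cpt_weight(0.99))
```

Only `cpt_value` has a counterpart in the training code; `cpt_weight` shows what applying gamma_gain or gamma_loss could look like.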

Core Features

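  • PyTorch implementation
  • Actor-Critic (A2C style)
  • CPT value function + advantage reshaping
  • Adaptive reference point
  • Continuous action space support (LunarLanderContinuous-v2)
  • Clear, extensible code structure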
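
To make "CPT value function + advantage reshaping" and the adaptive reference point concrete, here is a minimal pure-Python sketch of the steps applied to one trajectory: discounted returns, a shift by the reference point, the CPT value function, then standardization. The helper names and the sample rewards are made up for illustration.

```python
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def cpt_value(d, alpha=0.88, beta=0.88, lambda_loss=2.25):
    """Prospect-theory value of a gain or loss d measured from the reference."""
    return d ** alpha if d >= 0 else -lambda_loss * (-d) ** beta


def cpt_advantages(rewards, reference=0.0, gamma=0.99):
    """Standardized CPT values of the discounted returns."""
    values = [cpt_value(g - reference) for g in discounted_returns(rewards, gamma)]
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std, like torch.Tensor.std()
    return [(v - mean) / (std + 1e-8) for v in values]


if __name__ == "__main__":
    rewards = [1.0, -2.0, 3.0, 0.5]  # hypothetical per-step rewards
    print(discounted_returns(rewards))
    print(cpt_advantages(rewards, reference=0.0))
    # Raising the reference turns every return into a loss before standardizing.
    print(cpt_advantages(rewards, reference=5.0))
```

Because losses are scaled by lambda_loss, moving the reference point changes the relative spacing of the advantages, not just their offset.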

Complete Code

from collections import deque

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


# ====================== 1. CPT module ======================
class CumulativeProspectTheory:
    def __init__(self, alpha=0.88, beta=0.88, lambda_loss=2.25,
                 gamma_gain=0.61, gamma_loss=0.69, reference=0.0):
        self.alpha = alpha              # concavity for gains
        self.beta = beta                # convexity for losses
        self.lambda_loss = lambda_loss  # loss-aversion coefficient (about 2.25)
        # Probability-weighting exponents: stored, but not applied anywhere below.
        self.gamma_gain = gamma_gain
        self.gamma_loss = gamma_loss
        self.reference = reference      # reference point (can be adjusted dynamically)

    def value_function(self, x):
        """Prospect-theory value function v(x)."""
        x = torch.as_tensor(x, dtype=torch.float32)
        gains = x.clamp(min=0.0) ** self.alpha
        losses = -self.lambda_loss * (-x).clamp(min=0.0) ** self.beta
        return torch.where(x >= 0, gains, losses)

    def compute_cpt_advantages(self, rewards, gamma=0.99):
        """CPT advantages for one whole trajectory (rewards in time order)."""
        returns = []
        R = 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        returns = torch.tensor(returns, dtype=torch.float32)
        # Gains and losses relative to the reference point
        cpt_values = self.value_function(returns - self.reference)
        # Standardize
        return (cpt_values - cpt_values.mean()) / (cpt_values.std() + 1e-8)


# ====================== 2. Actor-Critic network ======================
class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Actor: parameters of the action distribution
        self.actor_mean = nn.Linear(hidden_dim, action_dim)
        self.actor_logstd = nn.Parameter(torch.zeros(action_dim))
        # Critic: state value
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.shared(x)
        mean = self.actor_mean(x)
        std = torch.exp(self.actor_logstd.expand_as(mean))
        value = self.critic(x)
        return mean, std, value


# ====================== 3. CPT agent ======================
class CPTActorCriticAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, device=None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.gamma = gamma
        self.cpt = CumulativeProspectTheory()
        self.model = ActorCritic(state_dim, action_dim).to(self.device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        # Transitions of the current episode only. Sampling random transitions
        # from a replay buffer would break the return computation, which needs
        # one trajectory in time order.
        self.memory = []

    def select_action(self, state):
        state = torch.as_tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        with torch.no_grad():
            mean, std, _ = self.model(state)
        action = torch.distributions.Normal(mean, std).sample()
        # Return the raw sample so its log-probability stays consistent;
        # the training loop clips what is sent to the environment.
        return action.squeeze(0).cpu().numpy()

    def store_transition(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def update(self):
        """One on-policy update from the stored episode, then clear the memory."""
        if len(self.memory) < 2:  # std() needs at least two steps
            self.memory.clear()
            return 0.0
        states, actions, rewards, _, _ = zip(*self.memory)
        self.memory.clear()
        states = torch.as_tensor(np.array(states), dtype=torch.float32, device=self.device)
        actions = torch.as_tensor(np.array(actions), dtype=torch.float32, device=self.device)

        # CPT advantages, using the agent's own discount factor
        advantages = self.cpt.compute_cpt_advantages(
            [float(r) for r in rewards], self.gamma).to(self.device)

        # Forward pass
        means, stds, values = self.model(states)
        dist = torch.distributions.Normal(means, stds)
        log_probs = dist.log_prob(actions).sum(dim=-1)

        # Losses: the critic regresses onto the CPT advantages
        actor_loss = -(log_probs * advantages).mean()
        critic_loss = nn.functional.mse_loss(values.squeeze(-1), advantages)
        loss = actor_loss + 0.5 * critic_loss

        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
        self.optimizer.step()
        return loss.item()

    def adapt_reference(self, recent_rewards, alpha=0.3):
        """Dynamically adjust the reference point: alpha times the recent mean reward."""
        if recent_rewards:
            self.cpt.reference = float(np.mean(recent_rewards)) * alpha


# ====================== 4. Training function ======================
def train(episodes=1200, save_interval=200):  # save_interval is currently unused
    # Recent Gymnasium releases (1.0 and later) provide this environment as
    # "LunarLanderContinuous-v3"; adjust the ID if "-v2" is rejected.
    env = gym.make("LunarLanderContinuous-v2")
    state_dim = env.observation_space.shape[0]  # 8
    action_dim = env.action_space.shape[0]      # 2
    agent = CPTActorCriticAgent(state_dim, action_dim)
    reward_history = []
    recent_rewards = deque(maxlen=50)
    print("Starting CPT-Actor-Critic training...")

    for episode in range(episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            action = agent.select_action(state)
            # Clip to the range the environment allows
            env_action = np.clip(action, -1.0, 1.0)
            next_state, reward, terminated, truncated, _ = env.step(env_action)
            done = terminated or truncated
            agent.store_transition(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

        # Update once per episode, on that episode's trajectory
        agent.update()
        recent_rewards.append(total_reward)
        reward_history.append(total_reward)

        # Adaptive reference point
        if episode % 20 == 0:
            agent.adapt_reference(list(recent_rewards))

        if episode % 50 == 0 or episode == episodes - 1:
            avg_reward = np.mean(reward_history[-50:])
            print(f"Episode {episode:4d} | "
                  f"Reward: {total_reward:7.1f} | "
                  f"Avg(50): {avg_reward:7.1f} | "
                  f"Ref Point: {agent.cpt.reference:.2f}")

    env.close()
    # Save the model
    torch.save(agent.model.state_dict(), "cpt_actor_critic_final.pth")
    print("Training finished. Model saved.")
    return reward_history


if __name__ == "__main__":
    rewards = train(episodes=1500)
    # Plot the learning curve; the moving average starts at episode window - 1
    window = 50
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")
    plt.figure(figsize=(10, 6))
    plt.plot(rewards)
    plt.plot(np.arange(window - 1, len(rewards)), moving_avg)
    plt.title("CPT Actor-Critic Training Curve (LunarLanderContinuous)")
    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.legend(["Raw", "Moving Avg"])
    plt.grid(True)
    plt.show()
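
The actor term in `update` relies on the log-probability of each stored action under a diagonal Gaussian policy, summed over the action dimensions (the `dist.log_prob(actions).sum(dim=-1)` step). The following plain-Python sketch spells out that formula; `gaussian_log_prob` is a hypothetical helper written for this illustration, not a function from the code above or from PyTorch.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under independent Normal(mean[i], std[i]) per dimension."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += (-((a - m) ** 2) / (2.0 * s * s)
                  - math.log(s)
                  - 0.5 * math.log(2.0 * math.pi))
    return total


if __name__ == "__main__":
    mean, std = [0.0, 0.5], [1.0, 1.0]
    # An action at the mean is more likely than one far from it.
    print(gaussian_log_prob([0.0, 0.5], mean, std))
    print(gaussian_log_prob([1.0, -1.0], mean, std))
```

The policy-gradient loss multiplies this quantity by the advantage of the same step, so actions with positive CPT advantage become more probable after the update.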

How to Run

pip install "gymnasium[box2d]" torch matplotlib numpy
python cpt_rl.py

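Want to take this further? Let me know which direction:

  • Switch to a PPO version
  • Use SAC (Soft Actor-Critic) + CPT
  • Support multi-agent and custom environments
  • Add visualization and TensorBoard support
  • Comparison experiments (traditional RL vs. CPT-RL)

Just ask, and I will provide the corresponding version.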

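Moving from traditional RL to agents that incorporate human behavioral preferences is one practical step of "From the Past to Intelligence".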
