
Paper Quick-Read Notes | 2025.12 (2)

纪音悦 · 2 hours ago
Contents

  • Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning
  • Some offline RL works with labeled data / expert demos + unlabeled data
  • Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
  • Unsupervised-to-Online Reinforcement Learning
  • (HILP) Foundation Policies with Hilbert Representations
  • Multi-Task Learning as Multi-Objective Optimization
  • Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences
  • MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration
  • Absolute Zero: Reinforced Self-play Reasoning with Zero Data
  • CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
  • auto-curriculum learning (Jiang et al., 2021b)
  • Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL
  • Unsupervised Skill Discovery via Recurrent Skill Training
  • Learning to Discover Skills through Guidance
  • One After Another: Learning Incremental Skills for a Changing World
  • Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching
  • Horizon Generalization in Reinforcement Learning
  • HIQL: Offline Goal-Conditioned RL with Latent States as Actions
  • Contrastive Preference Learning: Learning from Human Feedback without RL
  • Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
  • Rethinking Reward Modeling in Preference-based Large Language Model Alignment
  • DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
  • Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
  • Data Center Cooling System Optimization Using Offline Reinforcement Learning
  • SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
  • Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
  • Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
  • Thinkless: LLM Learns When to Think
  • Learning to Reason without External Rewards

Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning


  • arXiv: https://arxiv.org/abs/2302.08738
  • Source: came across it by chance; AAAI 2023.
  • Main content: proposes two unsupervised / self-supervised techniques for PbRL to exploit unlabeled data online. 1. Treat every unlabeled segment as human-preferred, take [R1 R2 ... RH] as a reward vector, and do contrastive learning through a somewhat mysterious triplet loss; 2. Encourage the distances between state embeddings inside the reward model (I did not look closely at what exactly these are) to respect temporal distance, via an MSE loss. A rough sketch of both losses follows this list.
  • Did not read it closely.
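
Below is a minimal PyTorch sketch of how the two auxiliary losses might look, based purely on the summary above. The reward-model architecture (`RewardModel`, with a state-embedding network `phi` and a linear reward head), the choice of anchor / positive / negative segments in the triplet loss, and the |i - j| temporal target are my assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Per-step reward model: state embedding phi followed by a linear reward head."""
    def __init__(self, obs_dim, emb_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))
        self.head = nn.Linear(emb_dim, 1)

    def reward_vector(self, seg):
        # seg: (B, H, obs_dim) -> per-step reward vector [R1, ..., RH] of shape (B, H)
        return self.head(self.phi(seg)).squeeze(-1)

def triplet_reward_loss(model, preferred, unlabeled, dispreferred, margin=1.0):
    # Technique 1 (assumed form): treat unlabeled segments as preferred, so their
    # reward vectors are pulled towards labeled-preferred segments and pushed away
    # from dispreferred ones.
    anchor = model.reward_vector(preferred)
    positive = model.reward_vector(unlabeled)
    negative = model.reward_vector(dispreferred)
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

def temporal_distance_loss(model, seg):
    # Technique 2 (assumed form): pairwise distances between state embeddings within
    # a segment should match the corresponding temporal gaps |i - j|.
    z = model.phi(seg)                                    # (B, H, emb_dim)
    emb_dist = torch.cdist(z, z)                          # (B, H, H)
    idx = torch.arange(seg.shape[1], dtype=z.dtype, device=z.device)
    time_dist = (idx[:, None] - idx[None, :]).abs()       # (H, H)
    return F.mse_loss(emb_dist, time_dist.expand_as(emb_dist))
```

Here each segment batch has shape (B, H, obs_dim), and unlabeled segments play the role of positives, which is one way to read "treat all unlabeled segments as preferred".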
Some offline RL works with labeled data / expert demos + unlabeled data

Besides CDS and UDS, there are also:

  • The Provable Benefits of Unsupervised Data Sharing for Offline Reinforcement Learning, https://arxiv.org/abs/2302.13493, ICLR 2023, a senior labmate's work. It looks quite theoretical; I have not read it.
  • CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning, https://arxiv.org/abs/2104.07749, CoRL 2023:
      ◦ Calibrated latent guidance: learn latent representations of state-action pairs with a CVAE, but apply a key regularization that forces all expert-data embeddings to collapse to the origin (mean / variance ≈ 0). Expert behavior is thus "tied" to a single point in latent space, and any sample's distance to that point naturally forms a task-oriented intrinsic reward: the more expert-like the sample, the higher the reward. No adversarial training, no temporal modeling; the distance is the reward. A rough sketch follows below.
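
A rough PyTorch sketch of the CLUE idea as described in the bullet above, assuming a CVAE that reconstructs the action conditioned on the state. The network sizes, the calibration term applied to expert samples, and the negative-distance intrinsic reward are my guesses, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, obs_dim, act_dim, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)
        # Decoder reconstructs the action conditioned on (z, state).
        self.dec = nn.Sequential(nn.Linear(z_dim + obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def encode(self, s, a):
        h = self.enc(torch.cat([s, a], dim=-1))
        return self.mu(h), self.logvar(h)

    def loss(self, s, a, is_expert, calib_weight=1.0):
        mu, logvar = self.encode(s, a)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = F.mse_loss(self.dec(torch.cat([z, s], dim=-1)), a)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        # Calibration (assumed form): push expert embeddings towards the origin,
        # i.e. mean ~ 0 and variance ~ 0, so expert behavior collapses to one point.
        calib = (mu.pow(2) + logvar.exp()).sum(-1)
        calib = (calib * is_expert.float()).mean()
        return recon + kl + calib_weight * calib

    @torch.no_grad()
    def intrinsic_reward(self, s, a):
        # Closer to the expert point (the origin) => higher reward.
        mu, _ = self.encode(s, a)
        return -mu.norm(dim=-1)
```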
