Contents
- Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning
- Some offline RL works on labeled data / expert demos + unlabeled data
- Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
- Unsupervised-to-Online Reinforcement Learning
- (HILP) Foundation policies with hilbert representations
- Multi-Task Learning as Multi-Objective Optimization
- Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences
- MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
- auto-curriculum learning (Jiang et al., 2021b)
- Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL
- Unsupervised Skill Discovery via Recurrent Skill Training
- Learning to Discover Skills through Guidance
- One After Another: Learning Incremental Skills for a Changing World
- Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching
- Horizon Generalization in Reinforcement Learning
- HIQL: Offline Goal-Conditioned RL with Latent States as Actions
- Contrastive Preference Learning: Learning from Human Feedback without RL
- Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
- Rethinking Reward Modeling in Preference-based Large Language Model Alignment
- DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
- Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
- Data Center Cooling System Optimization Using Offline Reinforcement Learning
- SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
- Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
- Thinkless: LLM Learns When to Think
- Learning to Reason without External Rewards
Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning
- arXiv: https://arxiv.org/abs/2302.08738
- Source: came across it by chance; AAAI 2023.
- Summary: proposes two unsupervised / self-supervised techniques for PbRL to exploit unlabelled data online. 1. Treat every unlabelled segment as human-preferred, take [R_1, R_2, ..., R_H] as a reward vector, and run contrastive learning on it via a somewhat mysterious triplet loss; 2. Encourage the distances between state embeddings inside the reward model (did not look closely at what exactly these are) to match temporal distances, trained with an MSE loss. A rough sketch of both losses follows this list.
- Did not read in detail.
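Below is a minimal sketch of how the two auxiliary losses could look, reconstructed only from the description above rather than the paper's code; the `RewardModel` architecture, the segment shapes, and the normalization of temporal distances are my own assumptions.

```python
# Hedged sketch, not the paper's implementation: a per-step reward model plus
# (1) a triplet/contrastive loss on whole reward vectors of segments treated as
# "preferred", and (2) an MSE loss aligning embedding distances with time gaps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(              # state-action embedding (assumed)
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )
        self.head = nn.Linear(emb_dim, 1)          # per-step reward

    def forward(self, obs, act):                   # obs: (B, H, obs_dim), act: (B, H, act_dim)
        z = self.encoder(torch.cat([obs, act], dim=-1))   # (B, H, emb_dim)
        r = self.head(z).squeeze(-1)                      # (B, H) reward vector [R_1 ... R_H]
        return r, z

def triplet_reward_loss(r_anchor, r_pos, r_neg, margin=1.0):
    """Contrastive loss on reward vectors: an unlabelled segment (treated as
    preferred) should be closer to another preferred segment than to a negative one."""
    return F.triplet_margin_loss(r_anchor, r_pos, r_neg, margin=margin)

def temporal_distance_loss(z):
    """MSE loss pushing within-segment embedding distances toward the
    (normalized) temporal gap between time steps; scaling is simplified here."""
    B, H, _ = z.shape
    t = torch.arange(H, dtype=torch.float32, device=z.device)
    d_time = (t[None, :] - t[:, None]).abs() / (H - 1)    # (H, H), in [0, 1]
    d_emb = torch.cdist(z, z)                             # (B, H, H)
    return F.mse_loss(d_emb, d_time.expand(B, -1, -1))
```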
Some offline RL works on labeled data / expert demos + unlabeled data
Besides CDS and UDS, there are also:
- The Provable Benefits of Unsupervised Data Sharing for Offline Reinforcement Learning, https://arxiv.org/abs/2302.13493 , ICLR 2023, a senior labmate's work. Looks quite theoretical; have not read it.
- CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning, https://arxiv.org/abs/2104.07749 , CoRL 2023:
  - Calibrated Latent Guidance: learn latent representations of state-action pairs with a CVAE, but add a key regularizer that forces all expert-data embeddings to collapse to the origin (mean / variance ≈ 0). Expert behavior is thus tied to a single point in latent space, and any sample's distance to that point naturally serves as a task-oriented intrinsic reward: the more expert-like the sample, the higher the reward. No adversarial training, no temporal modeling; distance is the reward. (A rough sketch follows below.)
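Below is a minimal sketch of the calibration idea as I read it, not CLUE's released code; only the encoder side and the calibration term are shown (the CVAE reconstruction/ELBO terms are omitted), and the module names, architecture, and reward scale are illustrative assumptions.

```python
# Hedged sketch of "calibrated latent guidance": a Gaussian encoder over (s, a),
# a regularizer pulling expert embeddings to mean 0 / variance ~0, and
# distance-to-origin used as an intrinsic reward for arbitrary transitions.
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def forward(self, obs, act):
        h = self.net(torch.cat([obs, act], dim=-1))
        return self.mu(h), self.log_std(h)

def expert_calibration_loss(mu_e, log_std_e):
    """Collapse expert embeddings to the origin: penalize both the mean and the
    variance so expert (s, a) pairs map to (approximately) a single point."""
    return (mu_e.pow(2) + log_std_e.exp().pow(2)).sum(dim=-1).mean()

@torch.no_grad()
def intrinsic_reward(encoder, obs, act, scale=1.0):
    """Distance of a sample's embedding to the calibrated expert point at the
    origin, mapped to a reward: closer to expert behavior -> higher reward."""
    mu, _ = encoder(obs, act)
    return -scale * mu.norm(dim=-1)
```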