Value learning

From Wikipedia, the free encyclopedia

Value learning is a research area within artificial intelligence (AI) and AI alignment that focuses on building systems capable of inferring, acquiring, or learning human values, goals, and preferences from data, behavior, and feedback. The aim is to ensure that advanced AI systems act in ways that are beneficial and aligned with human well-being, even in the absence of explicitly programmed instructions.[1][2]

Unlike traditional AI that focuses purely on task performance, value learning aims to ensure that AI decisions are ethically and socially acceptable. It is analogous to teaching a child right from wrong—guiding an AI to recognize which actions align with human moral standards and which do not. The process typically involves identifying relevant values (such as safety or fairness), collecting data that reflects those values, training models to learn appropriate responses, and iteratively refining their behavior through feedback and evaluation. Applications include minimizing harm in autonomous vehicles, promoting fairness in financial systems, prioritizing patient well-being in healthcare, and respecting user preferences in digital assistants. Compared to earlier techniques, value learning shifts the focus from mere functionality to understanding the underlying reasons behind choices, aligning machine behavior with human ethical expectations.[3]

Motivation

The motivation for value learning stems from the observation that humans are often inconsistent, unaware, or imprecise about their own values. Hand-coding a complete ethical framework into an AI is considered infeasible due to the complexity of human norms and the unpredictability of future scenarios. Value learning offers a dynamic alternative, allowing AI to infer and continually refine its understanding of human values from indirect sources such as behavior, approval signals, and comparisons.[4][5]

A foundational critique of traditional reinforcement learning (RL) by Daniel Dewey highlights its limitations in aligning artificial general intelligence (AGI) with human values. Dewey argues that RL systems optimize fixed reward signals, which can incentivize harmful or deceptive behavior if such actions increase reward. As an alternative, he proposes value-learning agents that maintain uncertainty over utility functions and update their beliefs based on interaction. These agents aim not to maximize a static reward but to infer what humans truly value. This probabilistic framework enables adaptive alignment with complex, initially unspecified goals and is viewed as a foundational step toward safer AGI.[6]
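
A minimal sketch of this idea, using illustrative numbers and an assumed approval model rather than Dewey's formalism: the agent keeps a probability distribution over a few candidate utility functions, updates it by Bayes' rule when a human approves or disapproves of an outcome, and then picks the outcome with the highest expected utility under the updated distribution.

```python
import numpy as np

# Three candidate utility functions over three possible outcomes.
# Rows are hypotheses, columns are outcomes; the numbers are illustrative.
candidate_utilities = np.array([
    [1.0, 0.0, 0.5],   # hypothesis A: outcome 0 is most valuable
    [0.0, 1.0, 0.5],   # hypothesis B: outcome 1 is most valuable
    [0.2, 0.2, 1.0],   # hypothesis C: outcome 2 is most valuable
])
posterior = np.ones(3) / 3  # uniform prior over the hypotheses

def update(posterior, outcome, approved, noise=0.1):
    """Bayes update, assuming the human approves an outcome with a
    probability that increases with its utility (plus a noise floor)."""
    likelihood = candidate_utilities[:, outcome] * (1 - noise) + noise
    if not approved:
        likelihood = 1.0 - likelihood
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

def choose_outcome(posterior):
    """Pick the outcome with the highest expected utility under the posterior."""
    expected_utility = posterior @ candidate_utilities
    return int(np.argmax(expected_utility))

# Two approvals of outcome 1 shift belief toward hypothesis B,
# and the agent's preferred outcome shifts with it.
posterior = update(posterior, outcome=1, approved=True)
posterior = update(posterior, outcome=1, approved=True)
print(posterior, choose_outcome(posterior))
```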

The growing importance of value learning is reflected in how AI products are increasingly evaluated and marketed. A notable shift occurred with the release of GPT-4 in March 2023, when OpenAI emphasized not just technical improvements but also enhanced alignment with human values. This marked one of the first instances where a commercial AI product was promoted based on ethical considerations. The trend signals a broader transformation in AI development—prioritizing principles like fairness, accountability, safety, and privacy alongside performance. As AI systems become more integrated into society, aligning them with human values is critical for public trust and responsible deployment.[7]

Key approaches

One central technique is inverse reinforcement learning (IRL), which aims to recover a reward function that explains observed behavior. IRL assumes that the observed agent acts (approximately) optimally and infers the underlying preferences from its choices.[8][9]
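
The following sketch illustrates the idea in a deliberately simplified one-step setting, assuming a Boltzmann-rational demonstrator and a linear reward over known option features (full IRL methods such as Ng and Russell's operate on complete Markov decision processes); all numbers are illustrative. The reward weights are fitted by gradient ascent on the likelihood of the demonstrated choices.

```python
import numpy as np

# Feature vectors of the options available to the demonstrator
# (e.g., column 0 = speed, column 1 = safety); numbers are illustrative.
options = np.array([
    [0.9, 0.1],   # fast but risky
    [0.4, 0.8],   # slower but safe
    [0.2, 0.3],   # slow and mediocre
])
demonstrations = [1, 1, 0, 1]  # indices of the options the demonstrator chose

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Maximum-likelihood fit of linear reward weights w, assuming the demonstrator
# picks option i with probability softmax(w . phi_i) (Boltzmann rationality).
w = np.zeros(2)
learning_rate = 0.2
for _ in range(1000):
    probs = softmax(options @ w)
    grad = np.zeros_like(w)
    for choice in demonstrations:
        # Gradient of the log-likelihood: chosen features minus expected features.
        grad += options[choice] - probs @ options
    w += learning_rate * grad / len(demonstrations)

print("inferred reward weights:", w)
```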

Cooperative inverse reinforcement learning (CIRL) extends IRL to model the AI and human as cooperative agents with asymmetric information. In CIRL, the AI observes the human to learn their hidden reward function and chooses actions that support mutual success.[10][11]
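
A toy calculation, not the formulation of Hadfield-Menell et al., can illustrate why observation is valuable to such an agent: when the robot is unsure which of two hidden reward functions the human has, first watching one perfectly informative human choice and then acting yields a higher expected shared reward than committing to an action under the prior.

```python
import numpy as np

# Two hypotheses about the human's hidden reward over two possible actions.
# Rows are hypotheses, columns are the rewards of action 0 and action 1.
theta_rewards = np.array([
    [1.0, 0.0],   # hypothesis 0: action 0 is the valuable one
    [0.0, 1.0],   # hypothesis 1: action 1 is the valuable one
])
prior = np.array([0.5, 0.5])

# Expected shared reward if the robot commits to an action under the prior.
act_now = (prior @ theta_rewards).max()

# Expected shared reward if the robot first watches the human choose their
# preferred action (assumed perfectly informative here) and then copies it.
observe_then_act = sum(prior[h] * theta_rewards[h].max() for h in range(2))

print("act immediately:", act_now)             # 0.5
print("observe, then act:", observe_then_act)  # 1.0
```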

Another approach is preference learning, where humans compare pairs of AI-generated behaviors or outputs, and the AI learns which outcomes are preferred. This method underpins successful applications in training language models and robotics.[12][13]
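
A common realization of this idea, used for example by Christiano et al., is a Bradley-Terry-style model in which each output receives a scalar reward and the probability that humans prefer output A over output B is the logistic function of the reward difference. The sketch below fits such a model with a linear reward over hypothetical output features; the features and comparison data are illustrative assumptions.

```python
import numpy as np

# Feature vectors of four candidate outputs (illustrative numbers only).
features = np.array([
    [1.0, 0.2],
    [0.3, 0.9],
    [0.6, 0.6],
    [0.1, 0.1],
])
# Pairwise human judgments: (index of preferred output, index of rejected output).
comparisons = [(0, 3), (1, 3), (1, 2), (0, 2)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fit linear reward weights by gradient ascent on the Bradley-Terry
# log-likelihood, where P(a preferred to b) = sigmoid(r(a) - r(b)).
w = np.zeros(2)
learning_rate = 0.3
for _ in range(1000):
    grad = np.zeros_like(w)
    for preferred, rejected in comparisons:
        p = sigmoid(features[preferred] @ w - features[rejected] @ w)
        grad += (1.0 - p) * (features[preferred] - features[rejected])
    w += learning_rate * grad / len(comparisons)

print("learned reward for each output:", features @ w)
```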

Recent research introduces a framework for learning human values directly from behavioral data, without relying on predefined models or external annotations. The method distinguishes between value specifications (contextual definitions of individual values) and value systems (an agent's prioritization among those values). A demonstration in route choice modeling, using tailored IRL techniques, infers how agents weigh options such as speed, safety, or scenic routes. The results indicate that value learning from demonstrations can capture complex decision-making preferences, supporting the feasibility of value-aligned AI in applied settings.[14]

Concept alignment

A major challenge in value learning is ensuring that AI systems interpret human behavior using conceptual models similar to those of humans. Recent research distinguishes between "value alignment" and "concept alignment", the latter referring to the internal representations that humans and machines use to describe the world. Misalignment in conceptual models can lead to serious errors even when the value-inference mechanism itself is accurate.[15]

Challenges

Value learning faces several difficulties:

  • Ambiguity of human behavior – Human actions are noisy, inconsistent, and context-dependent.[16]
  • Reward misspecification – The inferred reward may not fully capture human intent, particularly under imperfect assumptions.[17]
  • Scalability – Methods that work in narrow domains often struggle with generalization to more complex or ethical environments.[18]

Research from Purdue University finds that AI training datasets disproportionately emphasize certain human values, such as utility and information-seeking, while underrepresenting others like empathy, civic responsibility, and human rights. Applying a value taxonomy grounded in moral philosophy, the researchers found that AI systems trained on these datasets may struggle in morally complex or socially sensitive contexts. To address these gaps, the study used reinforcement learning from human feedback (RLHF) and value annotation to audit and guide dataset improvements. This work underscores the importance of comprehensive value representation in data and contributes tools for more equitable, value-aligned AI development.[19]

Hybrid and cultural approaches

Recent work highlights the importance of integrating diverse moral perspectives into value learning. One framework, HAVA (Hybrid Approach to Value Alignment), incorporates explicit (e.g., legal) and implicit (e.g., social norm) values into a unified reward model.[20] Another line of research explores how inverse reinforcement learning can adapt to culturally specific behaviors, such as in the case of "culturally-attuned moral machines" trained on different societal norms.[21]
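
The following sketch is a generic illustration of a hybrid reward, not the HAVA formulation itself: an explicit component penalizes violations of hard-coded (e.g., legal) rules, an implicit component scores conformity with norms learned from data, and the two are mixed into a single reward signal. The rule, the stand-in norm model, and the mixing weight are all hypothetical.

```python
def explicit_reward(action):
    """Penalty for violating a hard-coded (e.g., legal) rule; hypothetical rule."""
    return -10.0 if action.get("exceeds_speed_limit", False) else 0.0

def implicit_reward(action, norm_model):
    """Score from a model of social norms learned from data;
    `norm_model` is a stand-in for any learned scoring function."""
    return norm_model(action)

def combined_reward(action, norm_model, weight_explicit=0.7):
    """Mix the explicit and implicit value signals into a single reward."""
    return (weight_explicit * explicit_reward(action)
            + (1.0 - weight_explicit) * implicit_reward(action, norm_model))

# Example usage with a trivial stand-in norm model.
norm_model = lambda a: 1.0 if a.get("yields_to_pedestrian", False) else -0.5
action = {"exceeds_speed_limit": False, "yields_to_pedestrian": True}
print(combined_reward(action, norm_model))  # 0.3
```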

An important global policy initiative supporting the goals of value learning is UNESCO’s Recommendation on the Ethics of Artificial Intelligence, unanimously adopted by 194 member states in 2021. Although the term "value learning" is not explicitly used, the document emphasizes the need for AI to operationalize values such as human dignity, justice, inclusiveness, sustainability, and human rights. It establishes a global ethical framework grounded in four core values and ten guiding principles, including fairness, transparency, and human oversight. Tools like the Readiness Assessment Methodology (RAM) and Ethical Impact Assessment (EIA) help translate these principles into practice.[22]

Applications

Value learning is being applied in:

  • Robotics – Teaching robots to cooperate with humans in household or industrial tasks.[23]
  • Large language models – Aligning chatbot behavior with user intent using preference feedback and reinforcement learning.[18]
  • Policy decision-making – Informing AI-assisted decisions in governance, healthcare, and safety-critical environments.[20]

References

  1. ^ Russell, Stuart (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
  2. ^ Xu, Jisheng; Lin, Ding; Fong, Pangkit; Fang, Chongrong; Duan, Xiaoming; He, Jianping (June 2025). "Reward Models in Deep Reinforcement Learning: A Survey". arXiv:2506.09876 [cs.RO].
  3. ^ "What is Value Learning?". BytePlus. BytePlus. Retrieved 28 June 2025.
  4. ^ Ng, Andrew Y.; Stuart Russell (2000). Algorithms for Inverse Reinforcement Learning (PDF). Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000). Stanford, CA, USA: Morgan Kaufmann. pp. 663–670.
  5. ^ Christiano, Paul F.; Jan Leike; Tom B. Brown; Miljan Martic; Shane Legg; Dario Amodei (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). Curran Associates, Inc. pp. 4299–4307. arXiv:1706.03741.
  6. ^ Dewey, Daniel (2011). "Learning What to Value" (PDF). Machine Intelligence Research Institute. MIRI. Retrieved 28 June 2025.
  7. ^ Abernethy, Jacob; Candelon, François; Evgeniou, Theodoros; Gupta, Abhishek; Lostanlen, Yves (March 2024). "Bring Human Values to AI". Harvard Business Review. Retrieved 28 June 2025.
  8. ^ Ng, Andrew Y.; Stuart Russell (May 2000). "Algorithms for Inverse Reinforcement Learning" (PDF). Proceedings of the Seventeenth International Conference on Machine Learning (ICML). Stanford, CA: Morgan Kaufmann: 663–670.
  9. ^ Deshpande, Saurabh; Walambe, Rahee; Kotecha, Ketan; Selvachandran, Ganeshsree; Abraham, Ajith (26 March 2025). "Advances and applications in inverse reinforcement learning: a comprehensive review". Neural Computing and Applications. 37 (17): 11071–11123. doi:10.1007/s00521-025-11100-0. Retrieved 24 June 2025.
  10. ^ Hadfield-Menell, Dylan; Anca Dragan; Pieter Abbeel; Stuart Russell (5 December 2016). Cooperative Inverse Reinforcement Learning. Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS 2016). Curran Associates, Inc. pp. 3916–3924. Retrieved 24 June 2025.
  11. ^ Malik, Asma; et al. (2018). Efficient Bellman Updates for Cooperative Inverse Reinforcement Learning.
  12. ^ Christiano, Paul F.; et al. (2017). Deep reinforcement learning from human preferences. arXiv:1706.03741.
  13. ^ Xu, Jisheng; Lin, Ding; Fong, Pangkit; Fang, Chongrong; Duan, Xiaoming; He, Jianping (2025). "Reward Models in Deep Reinforcement Learning: A Survey". arXiv:2506.09876 [cs.RO].
  14. ^ Holgado-Sánchez, Andrés; Bajo, Javier; Billhardt, Holger; Ossowski, Sascha; Arias, Joaquín (2025). Value Learning for Value-Aligned Route Choice Modeling via Inverse Reinforcement Learning. Value Engineering in Artificial Intelligence. Lecture Notes in Computer Science. Springer Nature Switzerland. pp. 40–60. doi:10.1007/978-3-031-85463-7_3.
  15. ^ Rane, Aditya; et al. (2024). "Transmutation operators and complete systems of solutions for the radial bicomplex Vekua equation". Journal of Mathematical Analysis and Applications. 536 (2). arXiv:2305.09150. doi:10.1016/j.jmaa.2024.128224.
  16. ^ Skalse, Tobias (2025). "Misspecification in IRL". AI Alignment Forum.
  17. ^ Zhou, Weichao; Li, Wenchao (2024). "Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment". arXiv:2410.23680 [cs.LG].
  18. ^ a b Cheng, Wei; et al. (2025). "Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment". arXiv:2505.09612 [stat.ML].
  19. ^ Obi, Ike (6 February 2025). "AI datasets have human values blind spots − new research". The Conversation. Purdue University and partners. Retrieved 28 June 2025.
  20. ^ a b Varys, Kryspin (2025). "HAVA: Hybrid Approach to Value Alignment". arXiv:2505.15011 [cs.AI].
  21. ^ Oliveira, Nigini; et al. (2023). "Culturally-Attuned Moral Machines: Implicit Learning of Human Value Systems by AI through Inverse Reinforcement Learning". arXiv:2312.04578 [cs.AI].
  22. ^ "Recommendation on the Ethics of Artificial Intelligence". UNESCO. UNESCO. Retrieved 28 June 2025.
  23. ^ Deshpande, Saurabh; Walambe, Rahee; Kotecha, Ketan; Selvachandran, Ganeshsree; Abraham, Ajith (26 March 2025). "Advances and applications in inverse reinforcement learning: a comprehensive review". Neural Computing and Applications. 37 (17): 11071–11123. doi:10.1007/s00521-025-11100-0. Retrieved 24 June 2025.