
Session 21 - Mehdi Khamassi - Ensuring Alignment With Human Values

Published at 07:00 PM

How can we ensure AI systems’ alignment with human values?

Minimizing the negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. This is the focus of a great deal of current research. However, most current work addresses the issue only from a technical point of view, e.g., by improving methods relying on reinforcement learning (RL) from human feedback, neglecting what alignment philosophically and ethically means and what it requires. Moreover, most current RL research uses fixed reward functions, often specified in advance, whereas open-ended learning AI agents might be progressively exposed to a growing number of rules, conventions, norms and human/societal values.
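To make this contrast concrete, here is a minimal, purely illustrative Python sketch (not from the talk; all names and penalty rules are hypothetical): a classic RL agent optimizes a reward function frozen at design time, whereas an open-ended agent might see its reward signal progressively reshaped as it adopts new norms during its lifetime.

```python
# Illustrative only: contrast between a fixed reward function and one
# that is progressively reshaped as new norms/conventions are adopted.

def fixed_reward(state, action):
    """Classic RL setting: the reward function is frozen at design time."""
    return 1.0 if action == "deliver_package" else 0.0

class NormModulatedReward:
    """Open-ended setting: norms acquired over the agent's lifetime
    add penalty terms on top of the original task reward."""
    def __init__(self):
        self.norms = []  # each norm maps (state, action) -> penalty

    def adopt_norm(self, norm):
        self.norms.append(norm)

    def __call__(self, state, action):
        task_reward = fixed_reward(state, action)
        penalty = sum(norm(state, action) for norm in self.norms)
        return task_reward - penalty

reward = NormModulatedReward()
# A convention adopted later in the agent's life: don't cross lawns.
reward.adopt_norm(lambda s, a: 0.5 if s.get("on_lawn") else 0.0)
print(reward({"on_lawn": True}, "deliver_package"))  # 0.5 instead of 1.0
```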

In this presentation, I'll present two recent pieces of work. The first proposes to distinguish strong from weak value alignment of AI systems with human values. Strong alignment requires cognitive abilities (either human-like or different from humans'), such as understanding and reasoning about agents' intentions and about their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to recognize situations in which human values risk being flouted. To illustrate this distinction, we propose a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment, which we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal, to better highlight the specific cognitive abilities that we think are still lacking for strong alignment.
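For readers curious what such a nearest-neighbor probe can look like in practice, here is a hypothetical sketch using pre-trained GloVe vectors via gensim; this is not the authors' actual analysis code, and the value words chosen are illustrative.

```python
# Hypothetical sketch of a nearest-neighbor probe on word embeddings,
# in the spirit of the analysis described above (not the authors' code).
import gensim.downloader as api

# Pre-trained GloVe vectors distributed with gensim (downloads on first use).
model = api.load("glove-wiki-gigaword-100")

for value in ["fairness", "dignity", "privacy"]:
    neighbors = model.most_similar(value, topn=5)
    print(value, "->", [word for word, _score in neighbors])
# If the listed neighbors diverge from what human raters associate with a
# value, the model's semantic representation of that value differs from ours.
```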

In the second part, I'll present a novel extension of the RL framework that we propose, which we call the 'Purpose' framework. It is based on a three-level motivational system (operational level, motivational level, purpose level) for open-ended learning agents. Extending the motivational reinforcement learning formalism, it relates the purpose level (rules, conventions and norms at the societal level, but also missions assigned to artificial agents by humans) to the motivational level (modulating the agents' homeostatic, epistemic, social and mission drives), which in turn determines the multidimensional reward function used by the RL agents at the operational level. I will finish the presentation by discussing the perspectives that this new framework opens: not only increasing the learning possibilities of open-ended learning agents while ensuring that high-level norms and values are respected, but also helping to formalize constraints on the level of autonomy of these artificial agents that humans collectively wish to see imposed.
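As a rough illustration of how such a three-level architecture could be wired, here is a schematic Python sketch; the class names, drives and weighting rules are hypothetical placeholders, not the formalism presented in the talk.

```python
# Schematic sketch of a three-level motivational architecture in the spirit
# of the 'Purpose' framework described above; names and weightings are
# illustrative assumptions, not the authors' formalism.
from dataclasses import dataclass, field

@dataclass
class PurposeLevel:
    """Societal rules/norms plus missions assigned by humans."""
    norms: dict = field(default_factory=dict)     # e.g. {"no_harm": 0.5}
    missions: dict = field(default_factory=dict)  # e.g. {"tidy_room": 0.8}

@dataclass
class MotivationalLevel:
    """Drive weights modulated top-down by the purpose level."""
    drives: dict = field(default_factory=lambda: {
        "homeostatic": 1.0, "epistemic": 1.0, "social": 1.0, "mission": 0.0})

    def modulate(self, purpose: PurposeLevel):
        # Toy rule: active missions boost the mission drive; strong norms
        # boost the social drive (sensitivity to others' expectations).
        self.drives["mission"] = sum(purpose.missions.values())
        self.drives["social"] += sum(purpose.norms.values())

def multidimensional_reward(drive_weights: dict, drive_signals: dict) -> dict:
    """Operational level: one reward component per drive, to be consumed
    by a multi-objective RL agent."""
    return {d: drive_weights[d] * drive_signals.get(d, 0.0)
            for d in drive_weights}

purpose = PurposeLevel(norms={"no_harm": 0.5}, missions={"tidy_room": 0.8})
motivation = MotivationalLevel()
motivation.modulate(purpose)
print(multidimensional_reward(motivation.drives,
                              {"epistemic": 0.2, "mission": 1.0}))
```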

About Mehdi Khamassi

Mehdi Khamassi is a CNRS research director at the Institute of Intelligent Systems and Robotics, located on the Sorbonne University campus in Paris, France. He is also the co-director of the new master's program in cognitive sciences at Sorbonne University and Université Paris Cité. His main research topics include decision-making mechanisms and reinforcement learning in the biological brain and robots, the role of different types of rewards (social, non-social, informational) in learning, and the ethical issues raised by autonomous machine decision-making.

🔗 website
💼 LinkedIn
🦋 Bluesky
🐘 Mastodon / 🐥 Piaille

Details

Date and Time: Thursday, 27th of March 2025 - 7pm
Location: Sony CSL, 6 rue Amyot, 75005 Paris
Registration: https://lu.ma/nrvllvbk