This problem came up from my direct interaction with AIs, particularly in fine-tuning Language Models (LMs). Like many others (Qi et al. 2023; Wolf 2023), I have observed that the alignment layer applied by providers such as OpenAI can be easily removed by users, even unintentionally. This renders these models unsuitable for altruistic purposes, for commercial applications, or simply for anyone who wishes to avoid outputs that are sexist, homophobic, or that put their users' safety at risk.

According to Dafoe et al. (2021) and Bicchieri (2014), AI systems need social understanding and cooperative intelligence to integrate well into society. My tentative solution to this problem, and to the broader issue of misaligned LMs, is a bargaining-game framework in which a decision is judged as moral or not.


In this setup, multiple agents, each adhering to a distinct set of deterministic ethical guidelines customized by the developer, evaluate the moral acceptability of a given context. Each agent will have a different opinion on the topic and should use the double-crux method to try to reconcile these differences. At the end of their deliberations, their final outputs are weighted according to the bargaining solution the developer chooses (Nash bargaining, Kalai-Smorodinsky, or egalitarian bargaining). An adjudicator agent then makes the final decision, and the framework outputs a solution to the ethical problem.
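As a minimal sketch of how the adjudication step might work: each agent reports a utility for every candidate decision, and the adjudicator selects the decision that maximizes the developer's chosen bargaining objective. The agent names, utility values, and the two objectives shown (Nash product and egalitarian minimum) are illustrative assumptions, not a definitive implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    name: str
    utilities: Dict[str, float]   # utility for each candidate decision, in [0, 1]
    disagreement: float = 0.0     # payoff if no agreement is reached

def nash_score(agents: List[Agent], decision: str) -> float:
    """Nash bargaining objective: product of gains over the disagreement point."""
    score = 1.0
    for a in agents:
        score *= max(a.utilities[decision] - a.disagreement, 0.0)
    return score

def egalitarian_score(agents: List[Agent], decision: str) -> float:
    """Egalitarian objective: maximize the worst-off agent's gain."""
    return min(a.utilities[decision] - a.disagreement for a in agents)

def adjudicate(agents: List[Agent], decisions: List[str],
               objective: Callable = nash_score) -> str:
    """Adjudicator sketch: pick the decision with the highest objective value."""
    return max(decisions, key=lambda d: objective(agents, d))

# Illustrative agents with developer-customized ethical stances:
agents = [
    Agent("deontologist", {"allow": 0.2, "refuse": 0.9}),
    Agent("consequentialist", {"allow": 0.7, "refuse": 0.5}),
]
print(adjudicate(agents, ["allow", "refuse"]))  # -> refuse (0.9*0.5 > 0.2*0.7)
```

In a full system, the utilities would come from each LM agent's post-deliberation verdict rather than being hard-coded; the objective function is the only piece the developer swaps to move between bargaining solutions.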

The advantages of this framework are: a) it is more cost-effective to automate data labeling with well-aligned language models than to rely on human labor; b) the variability in human moral opinions makes it hard to ensure strong, consistent adherence to specific ideas, which is not a problem for a language model playing a specific role; c) it enables developers to create highly flexible moral frameworks; and d) its output could serve as a database to be shared openly with others, improving the general LM ecosystem.

Motivation: In the past, I have successfully broken through the restrictions that big tech companies implement on their machine learning models. This was achieved through a series of somewhat contrived approaches, including intentional circumvention of the built-in safeguards and significant modifications to the context and prompting mechanisms using platforms such as Langchain. However, in my last project, while collaborating with the RADAR collective, I was able to circumvent OpenAI's alignment algorithms with only a few hundred samples of well-aligned data. I followed OpenAI's documentation and did not try to circumvent their safeguards in any way.

This poses a serious problem for ethical AI use, especially as modified models become more popular in commercial settings. The outputs from OpenAI's fine-tuned models raised serious concerns, with responses ranging from extremely homophobic, to sexist, to completely delusional and dangerous, potentially posing a risk to people's lives.

General solution: It seems obvious to me that the ecosystem direly needs an open-source framework that allows those using these LLMs, once their models have been fine-tuned, to apply some form of alignment as a new fine-tuning layer. This set of tools has to be both technically flexible, to handle several different LLM clients, and ethically configurable, since ethical preferences differ across uses. Beyond that, we also need tools that can assess the quality of the alignment in these models, so that we have a means of comparing the improvement or degradation of their alignment.

The deliberation step relies on Double Crux [1], a technique for resolving disagreements in which each party identifies the underlying claim (the "crux") that, if they changed their mind about it, would change their position on the topic; the agents then debate that crux rather than the surface disagreement.