On LLM safety

@nixtoshi
5 min read · Jun 11, 2023
Carefully - Photo by Brett Jordan on Unsplash

I don’t know exactly how the censorship layers work in LLMs, but in theory, at the fine-tuning, instruction-set, and prompt steps, you could make LLMs as politically incorrect, or as politically correct, as you wish (especially with the open-source models).

Unless you have already pre-censored the training set, censorship layers are necessary to adjust a model’s outputs/responses.

I am imagining this censorship layer on a freely trained LLM:

  • Layer 0: A free LLM trained on an internet crawler’s data. An open canvas: the most chaotic, the most dangerous, and also the most capable (assuming people write useful things on the internet). This model would reflect any freely available data set found on the internet. It would learn to speak, and to speak about anything it finds; it would also learn false things in some cases, especially if the common-sense knowledge about something is false. Governments seem to be most concerned about these models, because they are by definition as open as the internet, and based solely on an open crawl without supervision (feedback from reviewers tasked with a certain objective). Since this would be an LLM trained on everything people say online, in theory you could make it say anything, including lies. In this sense, a model like this would be more human-like, since humans aren’t –factual– all the time either, whether out of ignorance or intentionally.
  • Layer 1: The fine-tuning layer made by the LLM’s developer. A supervised training set. [‘Censorship’ #1].
  • Layer 2: The user’s fine-tuning layer. A custom, non-general data set that takes priority over the layers underneath. The user(s) could have censored or selected what data to use in the fine-tuning, so it could count as some ‘censorship’, or correction, of the information underneath it. [‘Censorship’ #2].
  • Layer 3: A ban on words or phrases in the instruction set. The query would never reach the LLM’s main knowledge engine or fine-tuning layers, because the query could be blocked and return an error. [‘Censorship’ #3].
  • Layer 4 (user layer): The person prompting, writing instruction sets, or fine-tuning. Since users can sub-select what information they pick to contextualize (or auto-complete), the user’s own thoughts and preconceptions could count as a kind of censorship, which can lead to confirmation bias. Factual information is information that can be verified by a multiplicity of human observers, also known as scientific truth. However, if a user were to query an LLM in a tone and mode of speaking that contradicted scientific truth, and the most likely response to queries like these on the internet weren’t factual (like anti-scientific knowledge, groups, or comedy), the LLM could respond in a non-factual way.

The higher a layer sits beneath the user layer, the greater the hierarchical importance its data takes. So the program will stop and return an error (or a pre-determined response) before ever touching layer 0.
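Here is a minimal sketch, in Python, of how these layers could stack. Everything in it is a placeholder of my own (the banned-phrase list, the layer names, the fake model call), not any vendor’s actual implementation; the point is just that a higher layer rejects the query before the layers underneath are ever touched:

```python
# Hypothetical sketch of the layer hierarchy described above.

BANNED_PHRASES = ["example banned phrase"]  # Layer 3: instruction-set ban list


def layer_3_prompt_filter(prompt: str) -> bool:
    """Return True if the prompt may pass down to the model (Layer 3 check)."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


def call_model(prompt: str) -> str:
    """Stand-in for the underlying LLM (Layers 0-2: base model plus fine-tunes)."""
    return f"(model output for: {prompt!r})"


def answer(prompt: str) -> str:
    # Higher layers take precedence: a banned query is stopped here,
    # before it ever touches the fine-tuned or base model underneath.
    if not layer_3_prompt_filter(prompt):
        return "Error: this query is not allowed."  # pre-determined response
    return call_model(prompt)


print(answer("a harmless question"))
print(answer("an example banned phrase inside a question"))
```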

If you wanted to do this very securely and increase the censorship of a model, you could also add one more layer before displaying the response to the user:

Layer RE (return filtration layer): Before a response from the LLM is displayed to the user, a simple filter could ban words or phrases. This filter could force a re-generation without showing an error to the user [Layer RE: Response A]; the re-generation could use a slightly different temperature/probability calculation if there were –banned words/phrases– in the returned generation. Alternatively, the AI could default to an error where the user is asked to change their prompt or settings due to the collision with the return filtration layer [Layer RE: Response B].

Layer RE: Response A could be limited to a certain number of re-generations and a fixed maximum deviation from the user-set likelihood (called temperature in OpenAI), to avoid unexpectedly long response times or responses that deviate too much from what the user expects.
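A rough sketch of that return filtration layer, with the retry cap and the bounded temperature nudge. Again, the names, limits, and the fake generate() call are assumptions for illustration, not a real API:

```python
import random

BANNED_OUTPUT_PHRASES = ["forbidden phrase"]
MAX_REGENERATIONS = 3      # Response A: cap on silent retries
MAX_TEMP_DEVIATION = 0.2   # cap on drift from the user-set temperature


def generate(prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM call; returns a dummy completion."""
    return f"(completion at T={temperature:.2f} for {prompt!r})"


def contains_banned(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BANNED_OUTPUT_PHRASES)


def layer_re(prompt: str, user_temperature: float) -> str:
    temperature = user_temperature
    for _ in range(MAX_REGENERATIONS):
        response = generate(prompt, temperature)
        if not contains_banned(response):
            return response  # Response A: passed the filter (possibly after retries)
        # Nudge the temperature, but never beyond the allowed deviation from the
        # user's setting, so responses don't drift too far from what they expect.
        nudge = random.uniform(-MAX_TEMP_DEVIATION, MAX_TEMP_DEVIATION)
        temperature = max(0.0, user_temperature + nudge)
    # Response B: give up and ask the user to change their prompt or settings.
    return "Error: the response collided with the return filter; please adjust your prompt."


print(layer_re("tell me something", user_temperature=0.7))
```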

OpenAI’s rules probably take precedence over any instruction set, so the hierarchy above is probably accurate. So you, as a developer, get less freedom; the same goes for DALL·E or other, more open models.

In my experience though, I have noticed that programmatic (API) access is more lenient.

On regulation

I think governments will have a hard time regulating these models because it seems like they don’t understand them very well, and the tools and algorithms are already everywhere, owned by developers everywhere. Long term, though, I think AI is going to be like the internet and email: equally useful, or more so.

Some “bad aspects” of the internet, like the open possibility of DDoS attacks and spam in the email protocol, were never fully solved, yet these technologies have been massively useful to humanity’s needs and wants, as well as to technological progress.

Overall, I trust most people will use the tech for useful things, and some –bad actors– or annoying stuff will inevitably happen/co-exist.

Perfect solutions

Near-perfect solutions tend to be costly too. You could solve 99% of truly unwanted email spam by charging a custom amount for receiving messages, like $0.10 to $3 per message at a user-designated price, or by only allowing whitelisted people/entities to send you email.

But then you could create other problems, like problems of access, where less wealthy users wouldn’t be able to use email because it would be prohibitively expensive.

This is how Ethereum mostly solved spam and kept storage efficient, for example: by charging fees for almost any use of the network.

However, this has made writing data and doing transactions on Ethereum’s main chain very expensive. Most transactions can cost from $1 to $50 or more depending on the usage of the network at a given time. I have personally spent thousands of dollars on Ethereum transactions just to move money around or do simple operations, which makes it a prohibitively expensive technology for most users.

I think very accessible, low-cost technology tends to look inherently broken from the view of a regulator, because it can serve any human without judgement, which also enables bad actors/malicious people.

The most accessible technologies are like knives that you buy at Walmart: low-cost, powerful, and acquired without a license to use knives.

This free market of knives does lead to more people using these tools against humans, but we naturally perceive that the likelihood of these occurrences is so low, and the annoyance of complying with a lot of regulations is so high, that we don’t demand a license to use knives, even though the potential for lethal outcomes from knife usage is real.

Unfortunately, a lot of people are paralyzed by the downside, in part due to our biology (we overemphasize remembering negative experiences/memories), and in part due to the news, TV programs, and mainstream media, which make society and the world seem more dangerous than they actually are.

In the same way, a lot of people over-react and become paralyzed by the thought of the potential risks of a technology, and tend to disregard its potential benefits, which usually far outweigh the downsides.

Hallucinations

For instance, even though LLMs can hallucinate and fabricate untrue answers, they are still very useful for a plethora of applications. Their ability to hallucinate can be used in creative applications such as literature or art, and hallucinated facts can actually lead to better prompting to get the truth, or to the truth itself, by giving us clues about what information to look for on the internet (LLM + Google search).
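A tiny sketch of that LLM + search pattern: treat the model’s draft answer (hallucinated or not) as a source of concrete clues, then check those clues against retrieved documents. Both ask_llm() and search_web() here are hypothetical stubs I made up for illustration, not real APIs:

```python
def ask_llm(question: str) -> str:
    """Stand-in for an LLM call; may hallucinate details."""
    return "Claim: X was invented in 1962 by Y."


def search_web(query: str) -> list[str]:
    """Stand-in for a web search; returns snippets to check the claim against."""
    return ["Snippet 1 about X...", "Snippet 2 about Y..."]


def answer_with_verification(question: str) -> dict:
    draft = ask_llm(question)
    # Even a wrong draft gives us concrete names and dates to search for.
    evidence = search_web(draft)
    return {"draft_answer": draft, "evidence_to_review": evidence}


print(answer_with_verification("Who invented X, and when?"))
```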
