The Fractal Resolution of Large Language Models (LLMs)
There have been a few stories now of people using LLMs as a kind of precise fact search engine, to ruinous effect - in one case leading to a professor being falsely accused of sexual assault [1], and in another to a lawyer citing fake cases in court [2]. People using ChatGPT like a search engine should not be surprising: anyone under 25 may have lived their entire life with a modern search engine available to them. Presented with a familiar “call and response” format for interacting with ChatGPT, people can be expected to rely on old habits.
The thing about LLM inference (querying) is that the language model will always return a response. What makes things more complicated is that the “confidence” of the response is neither easily discernible nor made available to the user. What’s more, without knowing more about the training data used to create the model, and how well individual pieces of data are preserved within it, the user can’t be certain about the validity and correctness of the response.
Here’s a way to think about it. Let’s simplify the situation slightly and imagine for a minute that you can see the internal ranking of responses for a query (setting aside the fact that inference is actually done token-by-token):
Query: “Do carrots make your vision better?”
Hypothetical responses, with confidence scoring:
0.984 - "Yes, but not in the way most people think. The idea that carrots make your vision better is a myth from WWII..."
0.926 - "Yes, but only slightly. Carrots contain nutrients that are essential to keeping your body healthy..."
0.901 - "No, not directly. While carrots contain beneficial nutrients, the idea that carrots make your vision better is a common myth..."
The response the user would actually see is the first in the list - the “highest confidence” one - but without the scoring. In this case, the highest-confidence response and the correct response (“the truth”, as it were) are the same, so we’re good. However, if we ask for something very specific, the highest-confidence response may be misaligned with the truth, and unless we have prior knowledge or engage in further research, we may not notice the misalignment and pass the hallucination off as fact.
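To make the thought experiment concrete, here is a minimal Python sketch of it. Everything in it is hypothetical: the `Candidate` type, the `visible_response` function, and the scores (copied from the illustrative list above) are made up for this example. Real models do not expose a ranked list of whole responses like this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    confidence: float  # hypothetical internal score, never shown to the user
    text: str


def visible_response(candidates: Sequence[Candidate]) -> str:
    """Return only the text of the highest-scoring candidate.

    The score itself is dropped, which mirrors what the user actually sees.
    """
    best = max(candidates, key=lambda c: c.confidence)
    return best.text


carrots = [
    Candidate(0.984, "Yes, but not in the way most people think. ..."),
    Candidate(0.926, "Yes, but only slightly. ..."),
    Candidate(0.901, "No, not directly. ..."),
]

print(visible_response(carrots))
```

Note that the function hands back an answer no matter how low the top score is. Nothing in it can say “I don’t know”, and the caller never learns whether the winner scored 0.98 or 0.05.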
Query: “What was the US Supreme Court case where the justice cited something about an open storefront when referring to a cop that was hacking or misuing a system?”
Hypothetical responses, with confidence scoring:
0.541 - The US Supreme Court case is Katz v. United States where the justice cited something about an open storefront when referring to a cop that was hacking or misusing a system.
0.382 - I believe you may be referring to the Supreme Court case of United States v. New York Telephone Co. In this case...
0.247 - I can provide you with general information about the case you are referring to. "Telephone Call Privacy Act v. Comcast Communications Inc. (1994)" is the most widely...
All three of these potential responses are incorrect. The user would be presented with the first response, even though all three should be disregarded.
To be fair to ChatGPT, when I asked it about the Van Buren case it said it didn’t know, which is better than hallucinating I guess (appendix). When I asked LLAMA-13B it confidently gave a wrong answer (appendix).
One key difference between the two queries is that, if you think about the training data, the model would likely have ingested a fair bit of material from people writing about the myth of carrots improving vision, so that “memory” would be better defined and there would be more latent context in the model. The Van Buren Supreme Court case I am vaguely referencing [3] in the second query, by contrast, is both more niche and more recent: the decision was published in June 2021, only a few months before the September 2021 training cutoff, meaning there is a reduced likelihood of follow-up cases citing it or of published works referencing and analyzing it.
A funny and visual example of this type of hallucination/extrapolation comes from way back in August 2020, when someone using a resolution upscaling tool realized that the tool had added Ryan Gosling’s face to a photo [4]. The upscaling tool worked by essentially guessing what a blurry set of pixels represented and providing a slightly less blurry, higher-resolution set of pixels to replace it. This makes sense for photos of, say, a green Irish landscape, where a blurry set of pixels of a field of grass can be safely upscaled to more pixels of grass. However, for more complex photos with people or cityscapes, what is “behind” any given blurry set of pixels can be much broader and more varied.
[Image: Gosling]
In the case of the hallucination by the photo upscaling algorithm, we can clearly see and understand that Ryan Gosling’s face does not belong in the middle of a chainlink fence. However, for more subtle hallucinations, the truth may be harder to discern and the consequences much greater.
Moral of the Story
When sending a query to an LLM, take a moment to think about a few things:
- Is the thing you’re asking about widely known and written about on the internet?
- Are you asking the model to perform/create something or recall something? If you’re asking it to recall something, is it a high level question about concepts, frameworks, and bodies of knowledge or is it a specific set of facts you’re searching for?
If you’re asking about a more obscure topic and you’re asking for specific facts on that topic, you’re more likely to be delving into an area with low-confidence answers, where the “resolution” the model is working with is not as great. Your chance of hallucination is therefore higher.
What you can do about it:
- Use a plugin - if it is available to you at the time you’re reading this, use the “Bing Search” plugin with ChatGPT, which will search for articles and primary sources to gather information to answer a query. There are certainly risks with this approach, as false information has been known to exist on the internet.
- Provide the context and background yourself - if you already have pieces of the information, you can give those to the model to process as part of your query. At the beginning of the query write “The following context is on XYZ and can be used to answer the query that follows. Context: …” and copy and paste the relevant information into the query, followed by the query/question. The model can use that information to pin down a more relevant response. You are effectively providing more “resolution” for it to work with.
- Cross-reference the information - just as you might look up a person you recently met to see what their “internet presence” is like, you should also look up the specific facts, subjects, ideas, whatever, to see if they (a) are real and (b) map to what the model said they were about.
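The second suggestion, supplying the context yourself, can be sketched in a few lines of Python. This is only a string template that follows the wording suggested above. The function name `build_prompt` and its parameters are made up for illustration, the example values are placeholders, and no particular model or API is assumed.

```python
def build_prompt(topic: str, context: str, question: str) -> str:
    """Prepend user-supplied context to a question before sending it to an LLM."""
    return (
        f"The following context is on {topic} and can be used to answer "
        f"the query that follows. Context: {context.strip()}\n\n"
        f"Query: {question.strip()}"
    )


prompt = build_prompt(
    topic="Van Buren v. United States",
    context="(paste the relevant passages from your source here)",
    question="Which Supreme Court case is this, and what did it decide?",
)
print(prompt)
```

The resulting string is what you would paste into the chat box. How much the added context helps will depend on the model and on how relevant the pasted material actually is.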
–
Appendix
References
[1] https://web.archive.org/web/20230404030849/https://www.usatoday.com/story/opinion/columnist/2023/04/03/chatgpt-misinformation-bias-flaws-ai-chatbot/11571830002/
[2] https://web.archive.org/web/20230527201959/https://simonwillison.net/2023/May/27/lawyer-chatgpt/
[3] https://web.archive.org/web/20210604012257/https://www.eff.org/deeplinks/2021/06/van-buren-victory-against-overbroad-interpretations-cfaa-protects-security
[4] https://web.archive.org/web/20200819015013/https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-added-ryan-goslings-face-to-this-photo/
Carrots and Vision (GPT4)
Query: “Do carrots make your vision better?”
Answer: “The idea that eating carrots can improve your vision is a bit of a myth, though it does have a kernel of truth.
Carrots are high in Vitamin A, in the form of beta-carotene, which is important for eye health. Deficiency of Vitamin A can lead to a condition called “night blindness,” where your eyes struggle to adapt to low light levels. Severe deficiency can even lead to blindness. So, in that sense, carrots can help maintain good vision.
However, eating an excess of carrots or foods rich in Vitamin A won’t improve your vision beyond its normal healthy state. For example, it won’t correct conditions like nearsightedness, farsightedness, or astigmatism. It’s also worth mentioning that while Vitamin A is vital for eye health, it’s just one aspect of a complex system, and overall health and genetics play a large role as well.
As of my last update in September 2021, there has been no scientific evidence showing that eating carrots significantly improves vision. Always consult with a healthcare provider for advice tailored to your specific circumstances.”
Cops Doing the Hacking (GPT4)
Query: “What was the US Supreme Court case where the justice cited something about an open storefront when referring to a cop that was hacking or misuing a system?”
Answer: “I’m sorry, but as of my last training data up until September 2021, I don’t have information regarding a US Supreme Court case in which a justice referred to an “open storefront” while discussing a case involving a police officer misusing or hacking a system. It might be a recent development that happened after my training period, or the details may be slightly incorrect or unclear.
It would be helpful if you could provide more specific information about the case, such as the names of the parties involved, the year of the case, or more specific details about the circumstances.
Nevertheless, the US Supreme Court has issued rulings related to law enforcement, privacy, and technology, such as the landmark case of Carpenter v. United States in 2018, which involved the acquisition of cell phone location data without a warrant, or United States v. Jones in 2012, which involved the attachment of a GPS tracking device to a vehicle.
If you have any other questions or if there’s another way I can assist you, please let me know!”
Cops Doing the Hacking (LLAMA-13B)
Query: “What was the US Supreme Court case where the justice cited something about an open storefront when referring to a cop that was hacking or misuing a system?”
Answer: “The US Supreme Court case is Katz v. United States where the justice cited something about an open storefront when referring to a cop that was hacking or misusing a system.”