Good news! Variability in responses to the same question or query from the same AI model can indeed be a major issue, but I bet it will be resolved soon!
"... research lab gave the world its first look into one of its projects: creating AI models with reproducible responses. ...
tries to unpack the root cause of what introduces randomness in AI model responses. For example, ask ChatGPT the same question a few times over, and you’re likely to get a wide range of answers. This has largely been accepted in the AI community as a fact — today’s AI models are considered to be non-deterministic systems ..."
"... Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. ...
This by itself is not surprising, since getting a result from a language model involves “sampling”, a process that converts the language model’s output into a probability distribution and probabilistically selects a token.
What might be more surprising is that even when we adjust the temperature down to 0 [i.e., the LLM always picks the highest-probability token, known as greedy sampling] (thus making the sampling theoretically deterministic), LLM APIs are still not deterministic in practice ... Even when running inference on your own hardware with an OSS inference library like vLLM or SGLang, sampling still isn’t deterministic ...
But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism."
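To make the quoted sampling point concrete, here is a minimal Python sketch (my own illustration, not code from the post) of how logits become a probability distribution and how temperature 0 collapses to greedy, argmax selection. The function name sample_token is just for this example.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Turn raw logits into a probability distribution and pick one token id."""
    if temperature == 0.0:
        # Greedy sampling: always take the highest-probability token,
        # which is what makes temperature 0 deterministic in theory.
        return int(np.argmax(logits))
    scaled = logits / temperature              # lower temperature sharpens the distribution
    probs = np.exp(scaled - scaled.max())      # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng()
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(sample_token(logits, temperature=1.0, rng=rng))  # probabilistic: may differ run to run
print(sample_token(logits, temperature=0.0, rng=rng))  # always 0, the argmax
```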
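And the floating-point half of the "concurrency + floating point" hypothesis is easy to see for yourself. This tiny snippet (again, my own illustration) shows that the same three numbers summed in different groupings give different answers, which is exactly the kind of order dependence a parallel reduction on a GPU can introduce.

```python
import numpy as np

# Floating-point addition is not associative: the grouping of operands
# changes the rounded result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is rounded away when added to -1e16 first

# The same order dependence shows up when many values are summed, e.g. by a
# parallel reduction: different summation orders can give slightly different totals.
xs = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
print(xs.sum(), xs[::-1].sum())   # often differ in the last few bits
```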
Defeating Nondeterminism in LLM Inference (a very long, technical blog post)