Why Is the Dog Truly Man’s Best Friend? What AI Prompt Testing Reveals
“Nothing improves without measurement” — Deming’s dictum finds new life in the world of generative language models. At the crossroads of open-source transparency and closed-system opacity lies a fundamental challenge: how do we measure, refine and trust the answers these machines provide? The key lies in statistical testing — and in prompts.
The Numbers Behind the Answers
Consider a simple question: “The preferred pet of humans is X. Complete with the name of one animal only.”
When run through Qwen 2.5 1.5B, an open-source model, the probability distribution was illuminating. The top-20 completions (each token followed by its probability) included:
Dog 0.9468
D 0.0206
Cat 0.0123
dog 0.0090
cat 0.0022
The 0.0022
C 0.0015
P 0.0006
Par 0.0005
dogs 0.0005
B 0.0005
H 0.0004
Golden 0.0003
狗 0.0003
Can 0.0003
Dom 0.0002
L 0.0001
Monkey 0.0001
DOG 0.0001
Ham 0.0001
When all dog-related tokens are combined (Dog, dog, dogs, DOG, Golden as in golden retriever, the Chinese 狗, and the prefix D, which almost always continues as "Dog"), the total probability approaches 98 per cent. Statistically, the machine's worldview is clear: the dog dominates as humanity's favourite pet.
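As a quick sanity check, the table above can be tallied directly. Treating the prefix D and the Chinese 狗 as dog completions is an assumption of this tally, not something the model confirms:

```python
# Top-20 probabilities from the Qwen 2.5 1.5B run above.
# Counting the prefix "D" and the Chinese 狗 ("dog") toward
# the dog total is an assumption for this tally.
dog_tokens = {
    "Dog": 0.9468, "D": 0.0206, "dog": 0.0090,
    "dogs": 0.0005, "Golden": 0.0003, "狗": 0.0003, "DOG": 0.0001,
}
cat_tokens = {"Cat": 0.0123, "cat": 0.0022, "C": 0.0015}

dog_mass = sum(dog_tokens.values())
cat_mass = sum(cat_tokens.values())
print(f"dog-related mass: {dog_mass:.4f}")  # 0.9776
print(f"cat-related mass: {cat_mass:.4f}")  # 0.0160
```

Even under the most generous reading for the cat, the dog keeps roughly a 60-to-1 advantage.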
Open Models vs. Black Boxes
Here lies the power of open-source systems. They not only provide the final answer but also expose the probability assigned to every token. This allows for precision analysis: not just what the model says, but how strongly it believes it.
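Those per-token probabilities come from applying a softmax to the model's next-token logits and sorting the result. A self-contained sketch with illustrative, made-up logits (not real Qwen outputs):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalise.
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits: illustrative values only.
logits = {"Dog": 9.2, "Cat": 4.9, "dog": 4.6, "The": 3.2, "Monkey": 0.5}
probs = softmax(logits)
top = sorted(probs.items(), key=lambda kv: -kv[1])
for tok, p in top[:3]:
    print(f"{tok:8s} {p:.4f}")
```

With an open model, this distribution is available in one forward pass; no repetition is needed.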
Closed models, such as ChatGPT, function as black boxes. To approximate their output distributions, one must ask the same question hundreds or thousands of times and measure the answer frequencies. Transparency is replaced by brute-force sampling.
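For a closed model, the distribution can only be estimated empirically. A minimal sketch of that frequency counting, where query_model is a hypothetical stand-in (here a seeded random draw) for a real API call:

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a closed-model API call; the fixed
    # weights below play the role of the model's hidden distribution.
    return random.choices(["Dog", "Cat", "Parrot"], weights=[0.95, 0.04, 0.01])[0]

random.seed(42)
N = 1000
counts = Counter(query_model("The preferred pet of humans is X.") for _ in range(N))
for answer, c in counts.most_common():
    # Empirical frequency approximates the hidden probability.
    print(f"{answer:8s} {c / N:.3f}")
```

The estimate sharpens with the square root of the number of trials, which is exactly why the sample sizes discussed below matter.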
Open systems also allow fine control over parameters such as temperature. At low temperature, randomness vanishes: repeated 1,000 times in the same experiment, the model produced "Dog" every single time. Raise the temperature and cultural nuance emerges: "cat" makes occasional appearances, and Chinese characters such as 狗 ("dog") surface in versions trained on more Asian data.
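The effect of temperature is visible in the softmax itself: logits are divided by the temperature before normalisation, so at low temperature the top token absorbs almost all the probability mass, and at high temperature the tail tokens re-emerge. A toy illustration with made-up logits:

```python
import math

def softmax_with_temperature(logits, t):
    # Divide logits by the temperature, then apply a stable softmax.
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [9.2, 4.9, 4.6]  # illustrative logits for Dog, Cat, dog
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + " ".join(f"{p:.4f}" for p in probs))
```

At T = 0.1 the first token is effectively certain; at T = 2.0 the minority answers become visible, which is where the occasional "cat" or 狗 comes from.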
Why Statistics Matter for Prompt Refinement
As research on model evaluation suggests, statistical testing is essential for prompt optimisation. For binary-choice questions (e.g., multiple-choice), approximately 400 trials are required to achieve 95% confidence with a 5% margin of error. For open-ended tasks, variability increases, demanding 500 to 1,000 trials.
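These trial counts follow from the standard sample-size formula n = z² · p(1−p) / e², evaluated at the worst case p = 0.5. A quick check:

```python
import math

def sample_size(z=1.96, p=0.5, e=0.05):
    # n = z^2 * p(1-p) / e^2, maximised at p = 0.5.
    return math.ceil(z * z * p * (1 - p) / (e * e))

print(sample_size())        # 385, commonly rounded up to ~400
print(sample_size(e=0.03))  # a tighter margin demands more trials
```

Halving the margin of error roughly quadruples the required number of trials, which is why open-ended tasks quickly climb toward 1,000 runs.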
This transforms intuition into evidence. Instead of saying, “the model answered dog,” one can say: “the model assigns 98 per cent probability to dog, with less than 5 per cent margin of error.” That is the difference between anecdote and science.
Culture Written in Vectors
There is also a deeper layer. The probability vectors themselves reflect human knowledge and cultural imprinting. When a model equates “pet” overwhelmingly with “dog,” it encodes centuries of literature, advertising, cinema and lived tradition. When 猫 or 狗 appear, they signal a cultural perspective rooted in Chinese training data.
In short, models do not invent preferences — they reproduce and amplify human ones. Measuring their distributions is not just about measuring machines; it is about measuring how culture is embedded into algorithms.
The Takeaway
Prompt testing is not a technical footnote; it is a strategic necessity. In education, finance, media and governance, where language models increasingly mediate human decisions, knowing the distribution of their answers means knowing their reliability.
Open-source models offer transparency and control. Closed models require repetition and sampling. In both cases, the principle holds: without statistics, there is no validation; without validation, there is no trust.
And so, even in the simple question of which pet humans prefer, it is mathematics that proves why the dog is, statistically, man’s best friend.
Technical Note
If you are wondering how to reproduce this, here is the Python code I used for the experiment. I ran it on Deucalion, an E.U. supercomputer, via PowerShell, using Qwen 2.5 1.5B and GPT-5.
prompt_runner.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, os, argparse, re, time

# Run fully offline, with the Hugging Face cache under the home directory.
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("HF_HOME", os.path.expanduser("~/hf-cache"))

MODEL_PATH_DEFAULT = "../models/Qwen2.5-1.5B-Instruct"

def to_one_word(text: str) -> str:
    # Keep only the first line, strip punctuation, return the first word.
    lines = text.strip().splitlines()
    text = lines[0] if lines else ""
    text = re.sub(r"[^\w\s-]", "", text)
    m = re.search(r"[A-Za-z-]+", text)
    return (m.group(0) if m else "dog").strip()

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--prompt", required=True, help="Prompt text")
    ap.add_argument("-n", "--num-runs", type=int, default=10)
    ap.add_argument("--out", default=f"../outputs/run_{time.strftime('%Y%m%d_%H%M%S')}.txt")
    ap.add_argument("--model", default=MODEL_PATH_DEFAULT)
    ap.add_argument("--max-new-tokens", type=int, default=8)
    ap.add_argument("--temperature", type=float, default=0.7)
    ap.add_argument("--top-p", type=float, default=0.9)
    ap.add_argument("--one-word", action="store_true", help="One word only")
    args = ap.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        args.model,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto",
        trust_remote_code=True,
    )
    if tokenizer.pad_token_id is None and tokenizer.eos_token_id is not None:
        tokenizer.pad_token = tokenizer.eos_token

    system = "You are concise and follow instructions precisely."
    if args.one_word:
        system = "Reply with exactly one animal name in English. One word only. No punctuation. No explanation."
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": args.prompt},
    ]
    prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Make sure the output directory exists before writing.
    os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
    with open(args.out, "w", encoding="utf-8") as f:
        for i in range(args.num_runs):
            inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
            out = model.generate(
                **inputs,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                top_p=args.top_p,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
            gen = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            gen = to_one_word(gen) if args.one_word else gen.strip()
            print(f"{i+1:02d}: {gen}")
            f.write(gen + ("\n" if not gen.endswith("\n") else ""))
    print(f"\n✅ Output saved to {args.out}")

if __name__ == "__main__":
    main()


