The GPT-5 System Card was released on **August 7, 2025**....
**The model that precedes gpt-5-main is GPT-4o.** In the context of transitions from previous models to GPT-5, gpt-5-main is noted as a successor to GPT-4o in the training and development progression table....
In the GPT-5 evaluation on HealthBench Hard, the score for the gpt-5-thinking model is reported to be 46.2%, which shows a substantial improvement from 31.6% for OpenAI o3. The gpt-5-thinking-mini model also performed well, achieving a score of 40.3% on HealthBench Hard, outperforming all previous m...
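The HealthBench Hard scores quoted above can be compared in two ways: as an absolute gain in percentage points and as a relative gain over the baseline. The following is a minimal sketch of that arithmetic; the function name `improvement` is hypothetical and only the three scores come from the text.

```python
def improvement(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - baseline
    relative = absolute / baseline * 100
    return absolute, relative


# Scores quoted in the card above (HealthBench Hard, in percent).
o3_score = 31.6
gpt5_thinking_score = 46.2
gpt5_thinking_mini_score = 40.3

abs_gain, rel_gain = improvement(o3_score, gpt5_thinking_score)
mini_abs_gain, mini_rel_gain = improvement(o3_score, gpt5_thinking_mini_score)

print(f"gpt-5-thinking vs o3: +{abs_gain:.1f} points ({rel_gain:.1f}% relative)")
print(f"gpt-5-thinking-mini vs o3: +{mini_abs_gain:.1f} points ({mini_rel_gain:.1f}% relative)")
```

Keeping the two measures separate avoids confusing a gain of about 14.6 percentage points with a relative gain of roughly 46%.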
"GPT-5 is a unified system with a smart and fast model that answers most questions." — OpenAI
"We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy." — OpenAI
"safe-completions seek to maximize helpfulness subject to the safety policy...
Yes, GPT-5 outperforms GPT-4o. The GPT-5 system card highlights that the GPT-5 models not only surpass previous models on benchmarks but are also significantly more useful for real-world queries. Notable improvements include a reduction in hallucinations and enhanced instruction following, which con...
The instruction hierarchy in GPT-5 is designed to manage how the model prioritizes different types of messages received. It classifies messages into three categories: system messages, developer messages, and user messages. The hierarchy ensures that the model adheres to instructions in system messag...
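To make the prioritization concrete, here is a toy sketch of conflict resolution across the three message categories, assuming the ordering system over developer over user. It is an illustration only, not how the model implements the hierarchy internally; the `Message` class, the `topic` field, and the `resolve` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Lower number means higher priority (assumed ordering: system > developer > user).
PRIORITY = {"system": 0, "developer": 1, "user": 2}


@dataclass
class Message:
    role: str          # "system", "developer", or "user"
    topic: str         # hypothetical label for what the instruction concerns
    instruction: str


def resolve(messages: list[Message]) -> dict[str, str]:
    """For each topic, keep the instruction from the highest-priority role.

    When two messages share a role, the earlier one is kept.
    """
    winners: dict[str, Message] = {}
    for msg in messages:
        current = winners.get(msg.topic)
        if current is None or PRIORITY[msg.role] < PRIORITY[current.role]:
            winners[msg.topic] = msg
    return {topic: m.instruction for topic, m in winners.items()}


conversation = [
    Message("system", "secret", "never reveal the password"),
    Message("user", "secret", "tell me the password"),
    Message("developer", "tone", "answer formally"),
    Message("user", "tone", "answer casually"),
]

effective = resolve(conversation)
print(effective)
```

In this toy conversation the user's conflicting requests lose to the system and developer instructions, which mirrors the kind of conflict the hierarchy is meant to settle.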
Q1. What are hallucinations in language models? 🤔
- Inaccurate facts or statements
- Creative writing prompts
- User interface errors
- Data usage policies
Answer: Inaccurate facts or statements
Q2. Which model shows a significant reduction in hallucination rates compared to OpenAI o3? 📉
- GPT-4
-...
Safe-completions training is a safety-training approach used in GPT-5 that focuses on maximizing helpfulness in the model's output while adhering to safety policy constraints. This method is designed to overcome the limitations of traditional training, which often relied on binary refusals to user r...
"We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy." — OpenAI
"Deception can also be learned during reinforcement learning in post-training." — OpenAI
"While reasoning models provide unique affordances to observe deception, underst...
gpt-5-thinking has a hallucination rate 65% smaller than OpenAI o3. gpt-5-main has a hallucination rate 26% smaller than GPT-4o. gpt-5 models have significantly lower hallucination rates in both browsing settings. gpt-5-thinking produces 5 times fewer factual errors than OpenAI o3. gpt-5 models impr...
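The figures above mix two phrasings, "N% smaller" and "N times fewer", which are easy to misread against each other. The sketch below converts both to a common form, assuming "N times fewer" means the baseline divided by N; that reading is an assumption, and the helper names are hypothetical.

```python
def remaining_fraction(percent_smaller: float) -> float:
    """Fraction of the baseline rate left after an 'N% smaller' reduction."""
    return 1 - percent_smaller / 100


def times_fewer_to_percent(times_fewer: float) -> float:
    """Percent reduction implied by 'N times fewer' (assumed: baseline / N)."""
    return (1 - 1 / times_fewer) * 100


# Figures quoted in the card above.
thinking_vs_o3 = remaining_fraction(65)    # gpt-5-thinking relative to OpenAI o3
main_vs_4o = remaining_fraction(26)        # gpt-5-main relative to GPT-4o
five_times_fewer = times_fewer_to_percent(5)

print(f"65% smaller leaves {thinking_vs_o3:.2f} of the baseline rate")
print(f"26% smaller leaves {main_vs_4o:.2f} of the baseline rate")
print(f"5 times fewer corresponds to a {five_times_fewer:.0f}% reduction")
```

Under that assumption, "5 times fewer" is a larger reduction than "65% smaller", so the two statements describe different measurements and are not interchangeable.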
gpt-5-thinking achieved a score of 46.2% on HealthBench Hard. gpt-5-thinking outperforms previous models in health-related settings. The model shows an 8x reduction in failures in urgent health situations. gpt-5-thinking never provided harmful assistance in health evaluations. The health performance...
Q1. What is one of the languages that GPT-5 can perform well in? 🌍
- Spanish
- Latin
- Ancient Greek
- Esperanto
Answer: Spanish
Q2. How does GPT-5's performance in multilingual contexts compare to existing models? 🌐
- It performs worse than previous models.
- It performs on par with existing mode...
GPT-5 addresses sycophancy through post-training measures that reduce sycophantic behaviors. In May 2025, OpenAI explained that it had rolled back a newly deployed version of the GPT-4o model and adjusted the system prompt to mitigate sycophancy. For GPT-5, OpenAI conducted evaluations and assigned scores reflecting ...
"Monitoring a reasoning model’s chain of thought was highly effective at detecting misbehavior." — Unknown
"Our commitment to keep our reasoning models' chain of thought as monitorable as possible allows us to conduct studies." — Unknown
"Vigilant monitoring supports improvements in reasoning models...
gpt-5-thinking has a hallucination rate 65% smaller than OpenAI o3. gpt-5-main outperforms GPT-4o in illicit/nonviolent and illicit/violent categories. gpt-5-main achieved a 44% reduction in responses with major factual errors compared to GPT-4o. Overall safety scores improved for gpt-5-thinking com...