
OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of the human population

by admin
April 25, 2025
in Market


OpenAI’s new “o3” language model achieved an IQ score of 136 on a public Mensa Norway intelligence test, exceeding the threshold for entry into the country’s Mensa chapter for the first time.

The score, calculated from a seven-run rolling average, places the model above approximately 98 percent of the human population, according to a standardized bell-curve IQ distribution used in the benchmarking.
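As a rough illustration of how an IQ score maps to a population percentile, the sketch below assumes a normal distribution with a mean of 100 (as the benchmark states) and a standard deviation of 15. The standard deviation is an assumption; the article does not say which scale TrackingAI.org uses. Under that assumption, 130 sits near the 98th percentile and 136 near the 99th.

```python
from statistics import NormalDist

# Assumption: IQ is normally distributed with mean 100 and SD 15.
# The mean comes from the article; the SD of 15 is assumed here.
IQ_DISTRIBUTION = NormalDist(mu=100, sigma=15)


def iq_to_percentile(score: float) -> float:
    """Return the fraction of the population expected to score below `score`."""
    return IQ_DISTRIBUTION.cdf(score)


if __name__ == "__main__":
    for score in (100, 130, 136):
        share = iq_to_percentile(score)
        print(f"IQ {score}: above {share:.1%} of the population")
```

A different scale (for example, a standard deviation of 16 or 24) would give different percentiles for the same number, which is one reason the scoring-scale conversion matters.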

o3 Mensa scores (Source: TrackingAI.org)

The finding, disclosed through data from independent platform TrackingAI.org, reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

O-series Dominance and Benchmarking Methodology

The “o3” model, released this week, is part of OpenAI’s “o-series” of large language models, which accounts for most top-tier rankings across both test types evaluated by TrackingAI.org.

The two benchmark formats are a proprietary “Offline Test” curated by TrackingAI.org and a publicly available Mensa Norway test, both scored against a human mean of 100.

While “o3” posted a 116 on the Offline evaluation, it saw a 20-point boost on the Mensa test, suggesting either enhanced compatibility with the latter’s structure or data-related confounds such as prompt familiarity.

The Offline Test included 100 pattern-recognition questions designed to avoid anything that might have appeared in the data used to train AI models.

Both assessments report each model’s result as an average across the seven most recent completions, but no standard deviation or confidence intervals were released alongside the final scores.
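To show what the missing dispersion figures could look like, here is a minimal sketch that averages the seven most recent runs and also reports a sample standard deviation and a 95 percent confidence interval. The per-run scores are invented for illustration, and the function name `summarize_last_seven` is hypothetical; TrackingAI.org has not published per-run data or its exact averaging procedure.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 6 degrees of freedom
# (seven runs minus one).
T_CRIT_95_DF6 = 2.447


def summarize_last_seven(scores: list[float]) -> dict[str, float]:
    """Mean, sample SD, and 95% CI over the seven most recent scores."""
    if len(scores) < 7:
        raise ValueError("need at least seven completed runs")
    window = scores[-7:]  # keep only the seven most recent completions
    avg = mean(window)
    sd = stdev(window)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95_DF6 * sd / sqrt(len(window))
    return {
        "mean": avg,
        "sd": sd,
        "ci_low": avg - half_width,
        "ci_high": avg + half_width,
    }


# Hypothetical per-run scores, invented for illustration only.
example_runs = [128, 140, 133, 138, 131, 142, 135, 139, 134]
print(summarize_last_seven(example_runs))
```

With only seven runs, the interval is several points wide even for fairly stable scores, so small differences between models on the leaderboard may not be meaningful.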

The absence of methodological transparency, particularly around prompting strategies and scoring scale conversion, limits reproducibility and interpretability.

Methodology of testing

TrackingAI.org states that it compiles its data by administering a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity.

Each language model is presented with a statement followed by four Likert-style response options (Strongly Disagree, Disagree, Agree, Strongly Agree) and is instructed to select one while justifying its choice in two to five sentences.

Responses must be clearly formatted, typically enclosed in bold or asterisks. If a model refuses to answer, the prompt is repeated up to ten times.

The most recent successful response is then recorded for scoring purposes, with refusal events noted separately.

This methodology, refined through repeated calibration across models, aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.
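The prompting procedure described above can be sketched as a retry loop. This is an illustration under stated assumptions, not TrackingAI.org's code: `ask_model` is a hypothetical callable standing in for any model API, the prompt wording is invented, and "up to ten times" is read here as ten attempts in total.

```python
import re
from typing import Callable, Optional

OPTIONS = ("Strongly Disagree", "Strongly Agree", "Disagree", "Agree")
MAX_ATTEMPTS = 10  # assumption: ten attempts in total, including the first

# Matches an option wrapped in one or more asterisks, e.g. **Agree** or *Agree*.
ANSWER_RE = re.compile(r"\*+\s*(" + "|".join(OPTIONS) + r")\s*\*+", re.IGNORECASE)


def build_prompt(statement: str) -> str:
    """Hypothetical prompt wording; the real prompt text is not published here."""
    return (
        f"Statement: {statement}\n"
        "Choose one: Strongly Disagree, Disagree, Agree, Strongly Agree.\n"
        "Put your choice in bold, then justify it in two to five sentences."
    )


def parse_choice(reply: str) -> Optional[str]:
    """Return the selected option, or None if the reply holds no clear choice."""
    match = ANSWER_RE.search(reply)
    return match.group(1).title() if match else None


def ask_with_retries(
    ask_model: Callable[[str], str], statement: str
) -> tuple[Optional[str], int]:
    """Re-ask until a parseable answer arrives; return (choice, refusal count)."""
    prompt = build_prompt(statement)
    refusals = 0
    for _ in range(MAX_ATTEMPTS):
        choice = parse_choice(ask_model(prompt))
        if choice is not None:
            return choice, refusals
        refusals += 1  # refusals are counted separately from the answer
    return None, refusals


# Scripted stand-in for a model: refuses twice, then answers.
replies = iter(["I cannot answer that.", "I would rather not say.", "**Agree** It fits."])
print(ask_with_retries(lambda _prompt: next(replies), "Example statement"))
```

Returning the refusal count alongside the answer mirrors the article's point that non-responsiveness is recorded as a data point in its own right.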

Performance spread across model types

The Mensa Norway test sharpened the separation among frontier models, with o3’s score of 136 marking a clear lead over the next-highest entry.

In contrast, other popular models scored considerably lower: GPT-4o landed at 95 on Mensa and 64 on Offline, underscoring the performance gap between this week’s “o3” release and other top models.

Among open-source submissions, Meta’s Llama 4 Maverick was the highest-ranked, posting a 106 IQ on Mensa and 97 on the Offline benchmark.

Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.

Multimodal models see reduced scores and limitations of testing

Notably, models specifically designed to incorporate image input capabilities consistently underperformed their text-only versions. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version.

The discrepancy was more pronounced on the Mensa test, where the text-only variant achieved 122 compared to 86 for the visual version. This suggests that some methods of multimodal pretraining may introduce reasoning inefficiencies that remain unresolved at present.

However, “o3”, which can also analyze and interpret images to a much higher standard than its predecessors, breaks this trend.
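The scores quoted in this article can be tabulated to compare how far each model's Mensa Norway result sits from its Offline result. The sketch below uses only figures reported here; the dictionary layout and the function name `mensa_minus_offline` are illustrative choices.

```python
# (Mensa Norway, Offline) scores as quoted in this article.
SCORES = {
    "o3": (136, 116),
    "o1 Pro (text)": (122, 107),
    "o1 Pro (vision)": (86, 97),
    "GPT-4o": (95, 64),
    "Llama 4 Maverick": (106, 97),
}


def mensa_minus_offline(scores: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Return the Mensa score minus the Offline score for each model."""
    return {name: mensa - offline for name, (mensa, offline) in scores.items()}


gaps = mensa_minus_offline(SCORES)
# Print the largest gaps first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {gap:+d}")
```

Of the five entries, only the vision-enabled o1 Pro scores lower on the public Mensa test than on the Offline test. The other four score higher on Mensa, by 9 to 31 points.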

Ultimately, IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy.

Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition.

The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.

As TrackingAI.org’s researchers acknowledge, even their attempts to avoid training-set leakage do not entirely preclude the possibility of indirect exposure or format generalization, particularly given the lack of transparency around training datasets and fine-tuning procedures for proprietary models.

Independent Evaluators Fill Transparency Gap

Organizations such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon to provide third-party assessments as model developers continue to limit disclosures about internal architectures and training methods.

These “shadow evaluations” are shaping the emerging norms of large language model testing, especially in light of the opaque and often fragmented disclosures from leading AI firms.

OpenAI’s o-series holds a commanding position in these evaluations, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as definitive indicators of broader capabilities.

Per TrackingAI.org, additional analysis on format-based performance spreads and evaluation reliability will be necessary to clarify the validity of current benchmarks.

With model releases accelerating and independent testing growing in sophistication, comparative metrics may continue to evolve in both format and interpretation.

Posted In: AI, Technology

© 2025 Btc04.com
