Challenges in multi-hop reasoning and search

What does TTD-DR stand for? 🤔
Difficulty: Easy
What is the main advantage of the TTD-DR framework? 📈
Difficulty: Medium
What are the two core mechanisms in the TTD-DR framework? 🔍
Difficulty: Hard

What is the biggest driver of AI CapEx today?

Figure: Years to 50% Adoption of Household Technologies in USA, per Morgan Stanley.

Spending on AI large language model (LLM) development is still dominated by compute, specifically the compute needed to train and run models[1]. Training costs remain extraordinarily high and are rising fast, often exceeding $100 million per model today[1].

Even as the cost to train models climbs, a growing share of total AI spend is shifting toward inference, the cost of running models at scale in real time[1]. As inference becomes cheaper, AI gets used more[1]. And as AI gets used more, total infrastructure and compute demand rises, dragging costs up again[1].

Space: Trends In Artificial Intelligence 2025 By Mary Meeker et al.

What is Ilya Sutskever's new company?

Ilya Sutskever's New Company Overview

Ilya Sutskever, known for his work in the field of artificial intelligence, has embarked on a new venture with his company, Safe Superintelligence Inc. (SSI). This company aims to safely develop superintelligence that surpasses human intelligence[1]. The focus is on creating AI systems that are both powerful and safe, addressing what Sutskever describes as the most critical technical problem of our time[6].

Company Objectives and Goals

Image: A screen capture of Safe Superintelligence's initial formation announcement, captured on June 20, 2024.

Safe Superintelligence Inc. is dedicated to pushing the boundaries of artificial intelligence while ensuring that the development process remains safe and ethical[6]. The company's mission is to create superintelligence that exceeds human capabilities while prioritizing safety measures to prevent the potential risks associated with advanced AI systems.

Ilya Sutskever's Vision

Image: Ilya Sutskever Has a New Plan for Safe Superintelligence.

Ilya Sutskever's vision for Safe Superintelligence Inc. revolves around the concept of building a safe AI environment[9] where superintelligent systems can coexist with humans. By focusing on the responsible and secure development of AI, Sutskever hopes to contribute to the advancement of technology in a sustainable and innovative manner.

Industry Impact and Significance

Image: Last year, Ilya Sutskever helped create what was called a Superalignment team inside OpenAI that aimed to ensure that A.I. technologies would not do harm.

The establishment of Safe Superintelligence Inc. signifies a crucial step in the evolution of artificial intelligence research and development. With a strong emphasis on safety and ethics, the company's efforts could potentially shape the future of AI technologies and their integration into various sectors of society.

Transition from OpenAI

Image: Ilya Sutskever gestures as OpenAI CEO Sam Altman looks on at Tel Aviv University on June 5, 2023.

Ilya Sutskever's decision to leave OpenAI and pursue his new project underscores his personal commitment to addressing the challenges and opportunities presented by superintelligence. While specific details about the company's operations have not been fully disclosed, Sutskever's dedication to this endeavor highlights the importance of responsible AI innovation.

Conclusion

Image: OpenAI's Former Chief Scientist Ilya Sutskever Launches a New AI Company.

In conclusion, Ilya Sutskever's new company, Safe Superintelligence Inc., represents a pioneering effort in the field of artificial intelligence. By prioritizing safety and ethics in the development of superintelligent systems, Sutskever aims to create a groundbreaking AI environment that fosters collaboration between humans and advanced AI technologies. This ambitious vision has the potential to redefine the landscape of artificial intelligence research and shape the future of technology in a meaningful and impactful way.


LLM temperature control

🤔 What does a lower temperature setting typically do to an LLM's response?
Difficulty: Easy
🌡️ How does temperature control the randomness of token selection in LLMs?
Difficulty: Medium
🧐 What is a common issue in Large Language Models that is often exacerbated by inappropriate temperature settings?
Difficulty: Hard
Space: LLM Prompting Guides From Google, Anthropic and OpenAI
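
The questions above centre on temperature, the parameter that controls how sharply or flatly a model's next-token distribution is sampled. As a minimal sketch of the idea (not any particular vendor's implementation; the tiny vocabulary and logits below are invented for illustration), temperature divides the logits before the softmax, so lower values concentrate probability on the top token and higher values spread it out:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from logits after temperature scaling.

    Lower temperature sharpens the distribution (more deterministic,
    favouring the highest-logit token); higher temperature flattens it
    (more random). As temperature -> 0 this approaches greedy decoding.
    """
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index according to the resulting distribution.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a tiny three-token vocabulary.
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]
print(vocab[sample_with_temperature(logits, temperature=0.2)])  # almost always "cat"
print(vocab[sample_with_temperature(logits, temperature=1.5)])  # noticeably more varied
```

In a real LLM API, temperature is simply passed as a request parameter; the computation above is what that parameter conceptually does to the model's output distribution.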

Key statements on adversarial AI training

  • "Our approach combined two elements: Helpful-only training and maximizing capabilities relevant to Preparedness benchmarks in the biological and cyber domains."[1]
  • "We simulated an adversary who is technical, has access to strong post-training infrastructure and ML knowledge, can collect in-domain data for harmful capabilities."[1]
  • "Even with robust fine-tuning, gpt-oss-120b did not reach High capability in Biological and Chemical Risk or Cyber risk."[1]
  • "Our models are trained to follow OpenAI’s safety policies by default."[1]
  • "Rigorously assessing an open-weights release’s risks should thus include testing for a reasonable range of ways a malicious party could feasibly modify the model."[1]
Space: Let’s explore the gpt-oss-120b and gpt-oss-20b Model Card

Enhancing Knowledge-Based Visual Question Answering with mR2AG

Introduction to mR2AG

In the ever-evolving field of Artificial Intelligence, particularly in multimodal understanding, the challenge of effectively integrating visual and textual knowledge has gained significant attention. Traditional Multimodal Large Language Models (MLLMs) like GPT-4 have shown prowess in visual question answering (VQA) tasks; however, they often falter when confronted with Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA. These tasks require the models to provide specific and accurate answers based on external information rather than relying solely on their pre-existing knowledge base.

To address these limitations, the mR2AG framework—short for Multimodal Retrieval-Reflection-Augmented Generation—has been developed. This innovative approach combines retrieval mechanisms with reflective processes to enhance the performance of MLLMs in answering knowledge-based questions accurately and efficiently.

Overview of mR2AG

mR2AG introduces two critical reflection operations: Retrieval-Reflection and Relevance-Reflection. Retrieval-Reflection determines whether the user query is Knowledge-based or Visual-dependent, and therefore whether information retrieval is necessary at all. This adaptive retrieval step avoids retrieving information when it is not needed, streamlining the question-answering process.

The second reflection operation, Relevance-Reflection, plays a crucial role in identifying specific pieces of evidence from the retrieved content that are beneficial for answering the query. This allows the MLLM to generate answers rooted in accurate and relevant information rather than vague generalities, which is often a problem with current models.
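
To make the interplay of these two reflection operations concrete, here is a rough sketch of the control flow they imply. This is not the authors' implementation: the `mllm.generate` interface, the mode names, and the `knowledge_base.retrieve` call are hypothetical placeholders standing in for the model's special reflection tokens and retriever.

```python
def answer_query(mllm, image, question, knowledge_base):
    """Hypothetical sketch of mR2AG-style adaptive retrieval.

    Step 1 (Retrieval-Reflection): the model decides whether the question
    is Visual-dependent (answerable from the image alone) or
    Knowledge-based (needs external information).
    """
    decision = mllm.generate(image, question, mode="retrieval_reflection")
    if decision == "visual_dependent":
        # No retrieval needed; answer directly from the image and question.
        return mllm.generate(image, question, mode="direct_answer")

    # Step 2 (Relevance-Reflection): retrieve candidate passages, keep only
    # those the model judges to contain usable evidence, and answer from
    # the supported passages.
    passages = knowledge_base.retrieve(image, question, top_k=5)
    answers = []
    for passage in passages:
        verdict = mllm.generate(image, question, passage=passage,
                                mode="relevance_reflection")
        if verdict == "relevant":
            answers.append(mllm.generate(image, question, passage=passage,
                                         mode="evidence_answer"))
    # Fall back to a direct answer if no passage was judged relevant.
    return answers[0] if answers else mllm.generate(image, question, mode="direct_answer")
```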

Table 1. Main results of models with external knowledge on the INFOSEEK. † denotes our method and its variants with alternative designs.

As described in the paper, mR2AG “achieves adaptive retrieval and useful information localization to enable answers through two easy-to-implement reflection operations, preventing high model complexity”[1]. This efficiency is vital for maintaining the MLLMs' original performance across a variety of tasks, especially in Visual-dependent scenarios.

Performance and Results

The mR2AG framework has demonstrated significant improvements over prior models in handling knowledge-based queries. Comprehensive evaluations on datasets such as INFOSEEK reveal that mR2AG outperforms existing MLLMs by notable margins. Specifically, with LLaVA-v1.5-7B as the base MLLM, applying mR2AG yields performance gains of 10.6% and 15.5% on the INFOSEEK Human and Wikidata test sets, respectively, while also excelling on the Encyclopedic-VQA challenge[1].

Table 9. Complete results by question type on INFOSEEK Human, with LLaVA-FT referring to the fine-tuned model.

One of the compelling aspects of mR2AG is its ability to refine its outputs based on the relevance of retrieved information. The results indicate that by effectively evaluating retrieval content, mR2AG can identify and utilize evidence passages, resulting in more reliable answer generation. “Our method can effectively utilize noisy retrieval content, accurately pinpoint the relevant information, and extract the knowledge needed to answer the questions”[1].

Moreover, mR2AG does not merely improve knowledge-based questioning; it preserves the foundational capabilities of the underlying MLLMs to handle Visual-dependent tasks with similar finesse. This balance between specialized retrieval and generalizable knowledge is a hallmark of mR2AG's design.

Methodology

The success of mR2AG hinges on its structured methodology. Initially, user queries are classified by type—either Visual-dependent or Knowledge-based. The MLLM generates retrieval-reflection predictions to decide whether external knowledge is necessary. If the model predicts that retrieval is required, it selects relevant articles from a knowledge base, focusing on Wikipedia entries, which are rich in information[1].

Table 6. Effect of retrieving different numbers of Wikipedia entries.

Once the relevant documents are retrieved, the model employs Relevance-Reflection to assess each passage's potential as evidence for the query. Each passage undergoes evaluation to determine its relevance, allowing the model to generate answers based on identified supportive content. This layered approach—first distinguishing the need for external information, then pinpointing the most pertinent evidence—significantly enhances the accuracy of responses.
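
One plausible way to turn those passage-level judgments into a single answer is sketched below: each candidate answer is scored by combining its passage's retrieval score, the relevance-reflection probability, and the likelihood of the generated answer, and the best-scoring candidate is returned. The multiplicative combination rule and the numbers are assumptions for illustration, not the paper's exact formula.

```python
import math

def select_answer(candidates):
    """Pick a final answer from per-passage candidates.

    Each candidate is (answer_text, retrieval_score, relevance_prob,
    answer_logprob). The scoring rule below is an illustrative assumption.
    """
    def score(candidate):
        _, retrieval_score, relevance_prob, answer_logprob = candidate
        return retrieval_score * relevance_prob * math.exp(answer_logprob)

    return max(candidates, key=score)[0]

# Made-up example: two retrieved passages propose different answers.
candidates = [
    ("1889", 0.82, 0.91, -0.4),  # strongly relevant passage, confident answer
    ("1901", 0.77, 0.35, -1.2),  # weakly relevant passage
]
print(select_answer(candidates))  # -> "1889"
```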

The mR2AG framework also introduces an instruction tuning dataset (mR2AG-IT) specifically designed for Knowledge-based VQA tasks, which aids in the model's adaptability through a structured training process[1].

Conclusion

The mR2AG framework represents a significant advancement in the domain of knowledge-based visual question answering within AI. By integrating adaptive retrieval with precise evidence identification, mR2AG not only enhances the accuracy of answers but also streamlines the complexity typically associated with multimodal models. Its robust performance across various benchmarks demonstrates its effectiveness in tackling challenging knowledge-centric tasks while maintaining the versatility required for visual understanding.

Table 4. Results on MLLMs with different architectures and scales.

As the AI landscape continues to evolve, frameworks like mR2AG underline the potential for models that can both comprehend intricate visual data and harness external knowledge bases efficiently, setting a foundation for future advancements in multimodal AI systems.


What challenges do LLMs face with generalisation?

Large language models (LLMs) face significant challenges with generalisation, particularly with out-of-distribution (OOD) scenarios. Generalisation can only be expected in areas covered by observations, meaning LLMs often struggle to apply their learned patterns to new contexts that do not resemble their training data. As stated, 'the generalisation behaviour does not match human generalisation well, lacking the ability to generalise to OOD samples and exhibit compositionality'[1].

Moreover, the phenomenon of 'hallucination,' where models confidently make incorrect predictions, is a notable overgeneralisation challenge for LLMs. This occurs when critical differences are ignored in their predictions[1].


Understanding Toolformer: Enhancing Language Models with API Tools

In the realm of language models (LMs), researchers continuously explore ways to enhance their capabilities. Toolformer, a recent innovation, is designed to enable language models to learn how to utilize various external tools, such as search engines, calculators, and translation systems. This blog post breaks down the key findings and methodologies presented in the Toolformer paper while making it accessible for a broader audience.

The Challenge with Conventional Language Models

Language models demonstrate impressive abilities to tackle new tasks from only a few examples. However, they often struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized systems excel. As the authors put it, 'LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds'[1].

Introducing Toolformer

The authors introduce Toolformer as a model that autonomously decides which APIs to call, which arguments to pass, and how to incorporate the results into future predictions. Toolformer uses a self-supervised method that requires no more than a handful of demonstrations per API. The fundamental goal is to let the model itself decide when and how to use each tool, improving performance on downstream tasks without sacrificing its general language modeling abilities.
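
For intuition, API calls in Toolformer are represented inline as plain text that the model learns to emit; once a call is executed, its result is spliced back into the sequence before generation continues. The small helper below mimics that surface format (the bracket and arrow characters stand in for the paper's special tokens, and the example sentence is illustrative rather than taken from the released data):

```python
from typing import Optional

def linearize_api_call(tool: str, args: str, result: Optional[str] = None) -> str:
    """Render an API call in an inline text form like '[Tool(args) -> result]'.

    When `result` is None the call is rendered without a result, the form
    used while candidate calls are still being filtered.
    """
    if result is None:
        return f"[{tool}({args})]"
    return f"[{tool}({args}) -> {result}]"

# A sentence augmented with a calculator call, in the spirit of the
# paper's examples (the numbers here are invented):
sentence = (
    "Out of 1400 participants, 400 "
    + linearize_api_call("Calculator", "400 / 1400", "0.29")
    + " (29%) passed the test."
)
print(sentence)
# Out of 1400 participants, 400 [Calculator(400 / 1400) -> 0.29] (29%) passed the test.
```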

Key Features of Toolformer

  1. Self-Supervised Learning: Toolformer learns to execute API calls through self-supervised training, leading it to better internalize which tasks require external help.

  2. Variety of Tools: The model can utilize multiple tools, including a calculator, a question-answering system, a search engine, and a translation system[1]. This flexibility allows it to adapt to various use cases seamlessly.

  3. Dynamic API Call Selection: During data generation, Toolformer samples candidate API calls and keeps only those whose results actually help the model predict the subsequent text, teaching it when and how to use specific tools effectively.

Methodology Overview

Training and Evaluation

Toolformer’s training involved augmenting a plain-text corpus with sampled API calls and fine-tuning a base language model (GPT-J) on the augmented text. The model thereby learns to decide, while generating text, when an API call is worth making and how to use its result. The authors experimented on various downstream tasks, ensuring that the model could not only predict text but also integrate information from external queries.
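
The self-supervised filtering step at the heart of this pipeline can be sketched as follows: a sampled call is kept only if conditioning on the call and its result makes the following tokens easier to predict than either omitting the call or including the call without its result. The `loss_over_following_tokens` function and the threshold `tau` below are stand-ins for the paper's weighted cross-entropy and filtering threshold; the interface and names are assumptions for illustration.

```python
def keep_api_call(loss_over_following_tokens, prefix, call_text, result_text,
                  following_tokens, tau=1.0):
    """Decide whether a sampled API call is useful enough to keep.

    `loss_over_following_tokens(context, following_tokens)` is assumed to
    return the (weighted) cross-entropy of the tokens after the call site,
    given the context. The call is kept if providing both the call and its
    result lowers that loss by at least `tau`, compared with the better of
    (a) no call at all and (b) the call without its result.
    """
    loss_with_result = loss_over_following_tokens(
        prefix + f"[{call_text} -> {result_text}] ", following_tokens)
    loss_without_call = loss_over_following_tokens(prefix, following_tokens)
    loss_call_no_result = loss_over_following_tokens(
        prefix + f"[{call_text}] ", following_tokens)

    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_with_result >= tau
```

Calls that pass this check are kept in the training text; the rest are discarded, so the model is fine-tuned only on API calls that demonstrably helped.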

Table 1: Examples of inputs and outputs for all APIs used.

For example, a typical scenario might illustrate how Toolformer, when asked about a historical fact, could decide to call an API for a question-answering tool instead of relying solely on its internal knowledge. The researchers implemented multiple experiments to assess the efficacy of Toolformer on diverse tasks, including math benchmarks, question answering, and multilingual tasks. They found that 'Toolformer uses the question answering tool for most examples, clearly outperforming all baselines of the same size'[1].

Performance Metrics

Through extensive testing on different benchmarks, Toolformer showed remarkable improvements, especially in scenarios requiring external information assistance. The model outperformed traditional language models by an average of 11.5 to 18.6 points on various benchmarks, demonstrating its capability to learn from interactions with external APIs effectively. The paper highlighted that 'Toolformer consistently improves performance across all benchmarks' by leveraging the additional context provided by API calls[1].

Table 5: Results for various question answering datasets. Using the Wikipedia search tool for most examples, Toolformer clearly outperforms baselines of the same size, but falls short of GPT-3 (175B).

Practical Implications

Use Cases of Toolformer

Toolformer has promising applications across various domains. For instance:

  • Math Calculations: When faced with complex arithmetic, Toolformer can reference a calculator API to deliver precise answers.

  • Question Answering: For factual queries, it can utilize a question-answering tool to provide accurate responses based on current data.

  • Translations and Search Queries: The model can assist with multilingual translations and seek additional data via search engines, thus broadening its utility well beyond simple text generation.

Future Directions

This research leads to broader implications for the field of artificial intelligence. The ability of LMs to autonomously decide when to use external tools suggests a path toward more intelligent, context-aware applications. The authors express hope that further advancements in this space will bring about LMs that can operate more effectively in real-world scenarios, perhaps leading to the development of 'LLMs that understand when to seek external help'[1].

Conclusion

In summary, Toolformer represents a significant step forward in the capabilities of language models. By teaching LMs to learn from the tools they can access, the potential for innovation in artificial intelligence expands vastly. This new approach not only enhances the basic functionalities of language models but also opens new avenues for practical applications, creating smarter systems that can deliver more reliable and relevant information. As research continues in this domain, the prospects for improved LMs that better understand their capabilities and limitations seem promising.


Challenges in Aligning Human and Machine Generalisation

Fundamental Differences in Generalisation

One of the core challenges in aligning human and machine generalisation arises from the fundamental differences in how each system forms and applies general concepts. The text explains that humans tend to rely on sparse abstractions, conceptual representations, and causal models. In contrast, many current AI systems, particularly those based on statistical methods, derive generalisation from extensive data as correlated patterns and probability distributions. For instance, it is noted that "humans tend toward sparse abstractions and conceptual representations that can be composed or transferred to new domains via analogical reasoning, whereas generalisations in statistical AI tend to be statistical patterns and probability distributions"[1]. This misalignment in the nature of what is learnt and how it is applied stands as a primary barrier to effective alignment.

Conceptual and Methodological Misalignment

The text clearly highlights that the methodologies underlying human and machine generalisation differ significantly. While human generalisation is viewed in terms of processes (abstraction, extension, and analogy) and results (categories, concepts, and rules), AI generalisation is often cast primarily as the ability to predict or reproduce statistical patterns over large datasets. One passage states that "if we wish to align machines to human-like generalisation ability (as an operator), we need new methods to achieve machine generalisation"[1]. In effect, while humans can generalise from just a few examples and adapt those insights across tasks, machines typically rely on large amounts of data, and the resulting generalisations lack the inherent flexibility of human cognition. This discrepancy makes it difficult to seamlessly integrate AI systems into human–machine teaming scenarios.

Challenges in Evaluation and Robustness

Another challenge concerns the evaluation of generalisation capabilities and ensuring robustness. AI evaluation methods typically rely on empirical risk minimisation by testing on data that is assumed to be drawn from the same distribution as training data. However, this approach is limited when it comes to out-of-distribution (OOD) data and subtle distributional shifts. The text reflects that statistical learning methods often require large amounts of data and may hide generalisation failures behind data memorisation or overgeneralisation errors (for example, hallucinations in language models)[1]. Moreover, deriving provable guarantees — such as robustness bounds or measures for distribution shifts — poses a further challenge. This is complicated by difficulties in ensuring that training and test data are truly representative and independent, which is crucial for meaningful evaluation of whether a model generalises in practice.
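
To spell out the assumption being criticised: empirical risk minimisation evaluates a model on samples drawn from the same distribution it was trained on, so the empirical estimate only tracks the true risk when that assumption holds. In standard notation (not taken from the source):

```latex
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr),
\qquad (x_i, y_i) \stackrel{\text{i.i.d.}}{\sim} P_{\text{train}},
\qquad
\hat{R}_n(f) \approx R(f) = \mathbb{E}_{(x,y)\sim P_{\text{test}}}\bigl[\ell(f(x), y)\bigr]
\quad \text{only if } P_{\text{test}} = P_{\text{train}}.
```

Under an out-of-distribution shift (P_test ≠ P_train), the empirical estimate can be far from real-world risk, which is exactly the evaluation gap the text points to.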

Human-AI Teaming and Realignment Mechanisms

Effective human–machine teaming requires that the outputs of AI systems align closely with human expectations, particularly in high-stakes or decision-critical contexts. However, the text highlights that when such misalignments occur (for example, when AI predictions diverge significantly from human assessments), developing mechanisms for realignment and error correction becomes critical. The text emphasizes the need for collaborative methods that support not only the final decision but also the reasoning process, stating that "when misalignments occur, designing mechanisms for realignment and error correction becomes critical"[1]. One aspect of the challenge is that human cognition often involves explicit explanations based on causal history, whereas many AI systems, especially deep models, operate as opaque black boxes. This discrepancy necessitates the incorporation of explainable prediction methods and neurosymbolic approaches that can provide insights into underlying decision logic.

Integrating Diverse Generalisation Methods

The text also outlines challenges in harmonising the strengths of different AI methods. It distinguishes among statistical methods, knowledge-informed generalisation methods, and instance-based approaches. Each of these has its own set of advantages and limitations. For example, statistical methods deliver universal approximation and inference efficiency, yet they often fall short in compositionality and explainability. In contrast, knowledge-informed methods excel at explicit compositionality and enabling human insight but might be constrained to simpler scenarios due to their reliance on formalised theories[1]. Integrating these varying methods into a unified framework that resonates with human generalisation processes is a critical but unresolved goal. Approaches like neurosymbolic AI are being explored as potential bridges, but they still face significant hurdles, particularly in establishing formal generalisation properties and managing context dependency.

Conclusion

In summary, aligning human and machine generalisation is multifaceted, involving conceptual, methodological, evaluative, and practical challenges. Humans naturally form abstract, composable, and context-sensitive representations from few examples, while many AI systems depend on extensive data and statistical inference, leading to inherently different forms of generalisation. Furthermore, challenges in measuring robustness, explaining decisions, and ensuring that AI outputs align with human cognitive processes exacerbate these differences. The text underscores the need for interdisciplinary approaches that combine observational data with symbolic reasoning, develop formal guarantees for generalisation, and incorporate mechanisms for continuous realignment in human–machine teaming scenarios[1]. Addressing these challenges will be essential for advancing AI systems that truly support and augment human capabilities.


The Influence of Geopolitical Dynamics on AI Technology Acceleration and Adoption

The Intertwining of Technology and Geopolitics in AI

The rapid evolution of artificial intelligence (AI) is not occurring in a vacuum; it is increasingly intertwined with global geopolitical dynamics, creating both opportunities and uncertainties[1]. Technological advancements and geopolitical strategies are now heavily influencing each other, shaping the trajectory of AI development and deployment across nations[1]. This interplay is particularly evident in the competition between major global powers, notably the United States and China, as they vie for leadership in the AI domain[1].

AI as a New 'Space Race' and the Geopolitical Stakes

The convergence of technological and geopolitical forces has led many to view AI as the new 'space race'[1]. As Meta Platforms CTO Andrew Bosworth noted, progress in AI is characterized by intense competition with very few secrets, which puts a premium on staying ahead[1]. The stakes are high, as leadership in AI could translate into broader geopolitical influence[1]. This understanding has spurred significant investments and strategic initiatives by various countries, all aimed at securing a competitive edge in the AI landscape[1].

The Competitive Landscape and Strategic Responses

In this competitive environment, countries are revving up due to economic, societal, and territorial aspirations[1]. The reality is that AI leadership could beget geopolitical leadership, and not vice versa[1]. This state of affairs brings tremendous uncertainty[1].

China and the USA: A Technological and Geopolitical Duel

The document highlights the acute competition between China and the USA in AI technology development[1]. This competition spans innovation, product releases, investments, acquisitions, and capital raises[1]. The document cites Andrew Bosworth (Meta Platforms CTO), who described the current state of AI as 'our space race,' adding that 'the people we’re discussing, especially China, are highly capable… there’s very few secrets'[1]. The document also notes that, in this technology and geopolitical landscape, it is undeniably 'game on,' especially with the USA and China and the tech powerhouses charging ahead[1].

The Role of Global Powers and Competitive Advantages

The document briefly touches on global powers challenging each other’s competitive and comparative advantage[1]. It notes that the most powerful countries are revved up by varying degrees of economic/societal/territorial aspiration[1].

The Downside of Geopolitical Competition

This situation brings tremendous uncertainty[1]. The pace of change is rapid, which fuels excitement and trepidation[1]. All of this is intensified by global competition and sabre rattling[1].

The Bright Side of Geopolitical Competition

However, intense competition and innovation, increasingly accessible compute, rapidly rising global adoption of AI-infused technology, and thoughtful, calculated leadership could foster sufficient trepidation and respect that, in turn, could lead to Mutually Assured Deterrence[1].

Strategic Implications and Shifting Global Order

The document indicates the AI ‘space race’ has the potential to reshape the world order, testing political systems and enhancing strategic deterrence[1]. If authoritarian regimes take the lead on AI, they may force companies to share user data and develop cyber weapons[1].

The Impact on Global Trade and Supply Chains

Economic trade tensions between the USA and China continue to escalate, driven by competition for control over strategic technology inputs[1]. China is the dominant global supplier of ‘rare earth elements,’ while the USA has prioritized reshoring semiconductor manufacturing and bolstered partnerships with allied nations to reduce reliance on Chinese supply chains[1].

The Blurring Lines Between Economic and National Interests

AI, semiconductors, critical minerals, and technology developments are no longer viewed solely as economic or technology assets[1]. They are strategic levers of national resilience and geopolitical power for both the USA and China[1].

Space: Trends In Artificial Intelligence 2025 By Mary Meeker et al.