(Training) Compute Thresholds — Features and Functions in AI Governance

This article is also available as a PDF here.

Disclaimer: This article summarizes my current view on (training) compute thresholds based on my research and discussions with experts, governments, think tanks, and other stakeholders over the past year. While not meant to be authoritative, it aims to provide a condensed overview of how I believe compute thresholds should be understood and used. More authoritative pieces will be published in the future (maybe this article gets turned into one).
This article draws from work scattered across published papers, upcoming research, and shared memos, including Pistillo, Van Arsdale, Heim, and Winter (forthcoming) and Sastry, Heim, Belfield, Anderljung, Brundage, Hazell, O'Keefe, Hadfield, et al. (2024).
Please note that I reference many papers and often summarize them briefly. For more nuanced discussions, refer to the original sources.
I consider this article itself a summary of my understanding and of existing work, so the summary below is a summary of the summary, with even less nuance.
Be aware that the footnotes are located at the end of each section, rather than at the end of the document.

Summary

  • (Training) compute thresholds serve as a trigger for further evaluation and scrutiny of AI models, rather than being the sole basis for determining the regulatory framework applicable to a given model.
  • Compute thresholds offer several advantages that are difficult to achieve with other metrics, making them a useful complement (Section 3):
    • Risk-tracking: Higher training compute is associated with greater model capabilities and potential risks.
    • Quantifiability and ease of measurement: Training compute is a quantifiable metric that is relatively straightforward and cost-effective to calculate.
    • Difficulty of circumvention: Reducing training compute to evade regulation is likely to simultaneously reduce a model's capabilities and risks.
    • Knowable before development and deployment: Training compute can be estimated prior to a model's development and deployment, facilitating proactive measures.
    • External verifiability: Compute usage can potentially be verified by external parties without compromising sensitive information.
    • Targeted regulatory scope: The metric is proportionately higher for models that cost more to develop, minimizing the burden on smaller actors while focusing on the most well-resourced ones.
  • Regulation of frontier models based on compute thresholds is primarily concerned with ensuring government visibility and the capacity to act if these models are found to present serious societal-scale risks. It is not intended to address all possible downstream impacts of AI on society, many of which should be regulated at the use level. (Section 4)
  • Regulations based on compute thresholds should be used along with other sector-specific regulations and broader AI governance measures, which are better suited to address downstream impacts. (Section 4)
  • While not perfect, compute thresholds are currently one of the best metrics available. They provide a valuable starting point for identifying potentially high-risk models and triggering further scrutiny, while also offering a range of practical benefits that make them well-suited for regulatory purposes. (Section 5)
  • I address commonly asked questions, such as adjusting the threshold over time to account for algorithmic advancements, the potential use of "effective compute", and the practical considerations surrounding the measurement. (FAQ)

1. Background

Figure 1: Executive Order 14110 introduced a notification requirement for models trained with more than 10^26 operations (and 10^23 operations if trained using primarily biological sequence data). The EU AI Act presumes a general-purpose AI model to have high-impact capabilities if it is trained with more than 10^25 operations. Figure adapted from Sastry, Heim, Belfield, Anderljung, Brundage, Hazell, O'Keefe, Hadfield, et al. (2024). You can find the data and updates in the Epoch Database.

The 2023 Executive Order 14110, "Ensuring the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," introduces a range of AI governance measures. Section 4 of the order leverages training compute thresholds as a criterion for classifying AI models that warrant additional scrutiny due to potential safety and security concerns. Specifically, it targets:

“(i) any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations;”

The executive order mandates U.S. AI companies to proactively notify the government about any ongoing or planned activities concerning the development of frontier models. It also requires these companies to share the results of red-team safety tests and instructs the new AI Safety Institute within the National Institute of Standards and Technology (NIST) to develop evaluation standards. These requirements are designed to capture future developments in AI. At the time of writing, no publicly known AI model meets the 10^26 training operations threshold[1], whereas one model appears to meet the biological sequence data threshold.

The European Union's AI Act also leverages training compute thresholds to identify general-purpose AI models with systemic risks. Article 51 (2) states:[2], [3]

“A general-purpose AI model shall be presumed to have high impact capabilities pursuant to paragraph 1, point (a), when the cumulative amount of computation used for its training measured in FLOPs is greater than 10^25.”

China, too, is considering similar measures, as indicated by recent discussions in policy circles.

The adoption of compute thresholds in the U.S., the ongoing discussions in the EU, and the potential for similar measures in other countries highlight the importance of and interest in this approach in AI governance. However, debates have emerged regarding the effectiveness of compute thresholds as indicators of capabilities and their potential to become obsolete quickly. This article aims to bring clarity and nuance to these discussions by examining the role of compute thresholds, their advantages and limitations, and how I think they can be used in AI governance frameworks.


  1. Throughout this article, I use the terms "(training) operations" or "(training) compute threshold" to emphasize that the focus is on the computational resources used specifically during the training phase of an AI model's lifecycle. It's important to note that there can be various types of "compute thresholds" used for AI governance. For example, Executive Order 14110 also includes a reporting requirement for computing clusters above a certain performance, which is another type of "compute threshold" that is not directly tied to the training process itself but rather to the computing infrastructure used by AI developers. ↩︎

  2. Recommendation 1: When establishing compute thresholds, it is advisable for the European Union to use the term "operations" rather than specifying a particular type of operation (such as floating-point operations or FLOP). Using the broader term "operations" ensures that the threshold remains agnostic to the specific type of computational operations performed during training, providing flexibility. This approach also maintains consistency with the terminology used in the US Executive Order 14110.
    Recommendation 2: When referring to the quantity of floating-point operations used during training, it is preferable to use the singular term "FLOP" (rather than "FLOPs" or "FLOPS"). This helps to avoid confusion with the term "FLOP/s" (floating-point operations per second), which is a measure of computational performance, not the total amount of compute used. ↩︎

  3. The use of "presumed" is significant here, as it suggests that the compute threshold is not the sole determinant of risk. The EU AI Act also includes provisions for updating the threshold based on technological developments, as stated in Article 51 (3):
    “The Commission shall adopt delegated acts in accordance with article 97 to amend the thresholds listed in paragraphs 2 and 3 of this Article, as well as to supplement benchmarks and indicators in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency, when necessary, for these thresholds to reflect the state of the art.”
    The mention of "hardware efficiency" in this context is somewhat misplaced. Hardware efficiency improvements do not directly impact the threshold itself, but rather make it cheaper to achieve a given amount of compute. In other words, as hardware becomes more efficient, AI developers can train models using the same amount of compute at a lower cost. However, this does not necessarily mean that the threshold for what constitutes a "high-risk" model should be adjusted based on hardware efficiency improvements alone. Instead, the focus should be on algorithmic improvements that directly affect the relationship between compute and model capabilities. For a more detailed discussion of this issue, see our paper “Increased Compute Efficiency and the Diffusion of AI Capabilities” and the FAQ below. ↩︎

2. What are Training Compute Thresholds?

Training compute refers to the total number of operations required to train an AI model. Training an AI model is an iterative process where an architecture is exposed to a large amount of data, allowing the model to learn from it. This learning can be supervised, where the model is provided with labeled examples, or unsupervised, where the model learns from unlabeled data.

During the training process, the computer performs a vast number of mathematical operations—it is "crunching numbers." These operations are executed by the computer's processors (each with a certain processing performance, i.e., how many operations it can execute per second, FLOP/s or OP/s), and the total number of operations required to train the model can be quantified.[1] This total is a quantity, expressed in operations. We are agnostic here as to whether these are integer operations, floating-point operations (FLOP), or other operations.[2] We broadly refer to them as operations.[3]
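To make the distinction between a rate (FLOP/s) and a quantity (FLOP) concrete, here is a minimal back-of-the-envelope sketch; the peak throughput and utilization numbers are illustrative assumptions, not measurements of any particular training run:

```python
# Back-of-the-envelope estimate of total training compute from hardware usage.
# All numbers are illustrative assumptions.

PEAK_FLOP_PER_S = 312e12     # assumed peak throughput of one NVIDIA A100 (FLOP/s)
UTILIZATION = 0.40           # assumed fraction of peak actually achieved in practice
NUM_CHIPS = 1
SECONDS_PER_WEEK = 7 * 24 * 3600

total_flop = PEAK_FLOP_PER_S * UTILIZATION * NUM_CHIPS * SECONDS_PER_WEEK
print(f"One A100 for one week ≈ {total_flop:.1e} FLOP")  # roughly 7.5e19 FLOP
```

FLOP/s describes how fast the chip computes; the product above is a plain count of operations, which is the quantity that training compute thresholds refer to.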

In recent years, the scale of AI training has grown significantly, with increases in the amount of data used and the number of model parameters, and therefore in the amount of compute required for training.

A training compute threshold is essentially a value assigned to this metric which, when exceeded, triggers a specific action or requirement. What this action or requirement is can vary, and it is a critical design choice.

In summary, training compute is a quantifiable metric that can be used to classify AI models based on the amount of computational resources ("compute") required to train them. By setting these thresholds, governing bodies can differentiate between AI models above and below the threshold, applying distinct requirements and oversight measures to each group.


  1. Some hardware engineers might roll their eyes, and yes, I feel you. This requires more nuance, some of it discussed in the FAQ below and in forthcoming work, but the broad strokes here are sufficient for understanding it. ↩︎

  2. Presently, most AI training predominantly uses floating point numbers, but this could change in the future. ↩︎

  3. It's important to note that the term "operations" in this context refers to the total quantity of computations performed, not to be confused with the performance — the rate of operations per second (FLOP/s or OP/s). The latter describes the computational performance of an integrated circuit. For example, running an NVIDIA A100 processor for a week results in a total number of X FLOP being executed, which represents the quantity of operations used to train an AI model. See “FLOP for Quantity, FLOP/s for Performance” for a longer elaboration. ↩︎

Figure 2: Historical trends of training compute usage for notable AI models. Researchers and companies are scaling compute for training AI models, not because they like to waste money, but because it leads to predictably better capabilities. Figure from Sastry, Heim, Belfield, Anderljung, Brundage, Hazell, O'Keefe, Hadfield, et al. (2024).

3. Features of Training Compute (Thresholds)

Training compute (thresholds) offer several key features that make them a valuable tool for AI governance. Let's briefly discuss these features:

  • (1) Risk-tracking: Training compute is indicative of a model's capabilities and, therefore, of its potential risks.

    • The amount of compute used to train a model correlates with model capabilities and, consequently, with the potential risks it presents. AI research has identified relationships, known as scaling laws, between a model's training compute and its capabilities. As models become more capable, they tend to pose greater risks if misused or if they exhibit unexpected and potentially harmful behaviors.
    • Furthermore, the capabilities of a model serve as a good proxy for how widely it will be used, how heavily it will be relied upon, and the potential for it to be catastrophically misused.
    • This aspect of training compute—its correlation with capabilities and risk—is perhaps the most controversial and warrants a more in-depth discussion. I briefly touch upon this topic later in the FAQ and refer to the literature.[1]
  • (2) Quantifiability and ease of measurement: Training compute is a quantifiable metric that is relatively easy and inexpensive to calculate for various models.

    • To effectively enforce regulation based on training compute, it is essential that the metric can be easily calculated. This is the case for training compute, which can be directly calculated from model specifications or inferred from hardware usage data.[2]
    • Unlike other metrics, which may quickly become outdated (e.g., benchmarks) or are multidimensional (e.g., data quality and type), training compute provides a quantifiable and unidimensional metric that can be directly calculated or inferred with minimal effort.
  • (3) Difficulty of circumvention: Training compute is difficult to reduce without also decreasing a model's risks.

    • Training compute is relatively robust to circumvention attempts, as reducing the amount of compute used to train a model will generally decrease its capabilities and, consequently, lower risk. This is because, for a given architecture and algorithm, the amount of compute used is directly related to the model's performance and potential risks.
    • Unlike other metrics, such as the score on a specific benchmark, which can be manipulated without significantly affecting the model's general capabilities, reducing training compute is more likely to cause a decrease in the model's performance and associated risks.[3]
    • However, it is important to note that improvements in algorithmic efficiency can reduce the amount of compute required over time for a given level of performance. We address the question of modifying the training compute threshold to account for research advances below.[4]
  • (4) Knowable before development and deployment: Training compute can be calculated before the model is deployed, and even before it is trained.

    • Training compute can be known and estimated ahead of development and deployment[5], which is important because many requirements and security precautions may apply to how a model is developed and deployed.
    • By measuring training compute before training begins, developers can implement compute-indexed precautions during the training process. For example, they can ensure strong cybersecurity measures are in place for compute-intensive training runs, reducing the risk of model theft or unauthorized access.
  • (5) External verifiability: The possibility for external and internal verification of compute usage, without disclosing proprietary details, enhances compliance.

    • Ideally, measurements of training compute should be verifiable by diverse external parties through protocols that maintain the confidentiality of proprietary information. This could enable verifiable commitments across companies and even states.
    • Compute providers can aid in the verification of reporting requirements under compute threshold regulations. This is particularly desirable for cloud/compute service providers, as they can monitor and verify compute usage without infringing on the confidentiality of AI developers, in contrast to other measures that may require access to sensitive model details.[6]
  • (6) Targeted regulatory scope: Training compute is proportionately higher for models that cost more to develop, minimizing the burden on smaller actors while focusing on the most well-resourced ones. The amount of compute directly corresponds to the amount of financial resources required for the model development (Figure 3).

    • The cost of training a model scales with the amount of compute used. For example, a model trained with 10^26 operations will cost approximately 10 times more than a model trained with 10^25 operations.
    • The large amounts of compute required to train state-of-the-art models are typically only available to well-resourced organizations. By setting compute thresholds at appropriate levels, regulators can focus on the most advanced and potentially risky models without imposing undue burdens on smaller actors in the AI ecosystem, such as startups, small businesses, or academic researchers.
    • However, it is important to note that the cost of compute decreases over time due to increasing computational price-performance. As the cost of compute falls, the financial barrier to reaching certain compute thresholds may also decrease, potentially allowing less well-resourced actors to develop models that meet the threshold.[7] We discuss this in “What Increasing Compute Efficiency Means for the Proliferation of Dangerous Capabilities.”[8]

  1. For a more comprehensive discussion of the relationship between compute, capabilities, and risk, see “Computing Power and the Governance of Artificial Intelligence” and “Frontier AI Regulation: Managing Emerging Risks to Public Safety.” ↩︎

  2. See the FAQ for more details on how to measure training compute. ↩︎

  3. For example, an actor cannot simply decide to use less compute while maintaining the same level of performance. In contrast, an actor might be able to adjust other metrics, such as the score on a specific benchmark, to avoid regulation without substantially impacting the model's general capabilities. ↩︎

  4. The need to adapt compute thresholds over time to account for algorithmic advancements is not immediately obvious (see the FAQ for a more detailed discussion). However, if such adjustments are deemed necessary, compute thresholds offer a relatively straightforward way to do so. ↩︎

  5. Training compute can be estimated before model development using the architectural details and the amount of training data, as outlined in Method 1 of this article. AI companies carefully plan their training runs, as training state-of-the-art models often requires tens of thousands of GPUs and costs on the order of millions of dollars. Given the significant computational resources and financial investment involved, companies have a strong incentive to accurately estimate training compute beforehand to ensure efficient resource allocation and budget planning. ↩︎

  6. Unlike capability measures, which often require access to sensitive model details, compute providers can feasibly detect "above threshold compute" usage without infringing on the confidentiality of AI developers. This is because compute usage can be monitored and verified at an aggregate level, without the need to access specific details of the model architecture, training data, or other proprietary information. ↩︎

  7. The decreasing cost of compute due to hardware price-performance improvements does pose a challenge; however, the pace is slower than other AI trends. Currently, the increasing investments required for frontier AI capabilities and algorithmic improvements are growing faster than the rate of hardware price-performance gains. ↩︎

  8. An AI Chatbot also suggested: “Another potential feature could be the incentivization of computational efficiency. By setting compute thresholds, there could be an implicit encouragement for developers to optimize their training processes, promoting more efficient use of resources without compromising model capabilities. This aligns with sustainable AI development and can be an essential aspect of governance.” ↩︎

Figure 3: Cost and compute used for training AI models. The amount of compute directly corresponds to the amount of financial resources required for model development. Figure adapted from Sastry, Heim, Belfield, Anderljung, Brundage, Hazell, O'Keefe, Hadfield, et al. (2024). (While the exact cost calculations vary across different sources due to the inherent uncertainty involved, the rough estimates tend to fall within a similar ballpark range. For the purposes of this analysis, I assume a cost of $1.8 per hour for an NVIDIA H100, which is a rather optimistic assumption. Also, note that this is only the cost of acquiring the computational resources. It does not include staff costs and all the necessary development beyond the final training run.)
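As a rough illustration of how compute maps onto cost, here is a hedged sketch that reuses the $1.8-per-H100-hour assumption from Figure 3; the peak throughput and utilization figures are my own illustrative assumptions, and actual training runs will differ:

```python
# Rough mapping from training compute to GPU rental cost.
# Illustrative assumptions: ~1e15 FLOP/s peak per H100, 40% utilization,
# $1.80 per H100-hour (the assumption used in Figure 3).

H100_PEAK_FLOP_PER_S = 1e15
UTILIZATION = 0.40
PRICE_PER_GPU_HOUR = 1.80

def rental_cost_usd(training_flop: float) -> float:
    effective_flop_per_s = H100_PEAK_FLOP_PER_S * UTILIZATION
    gpu_hours = training_flop / effective_flop_per_s / 3600
    return gpu_hours * PRICE_PER_GPU_HOUR

for flop in (1e25, 1e26):
    print(f"{flop:.0e} FLOP -> ~${rental_cost_usd(flop) / 1e6:.0f}M in GPU rental")
# Order-of-magnitude outputs: roughly $12-13M for 1e25 FLOP and ~$125M for 1e26 FLOP.
```

The tenfold cost gap between the two runs reflects the point above: the financial barrier scales with the threshold.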

4. The Role of Compute Thresholds in AI Governance

Figure 4: A conceptual framework for the role of compute thresholds in AI governance. Compute thresholds can serve as an initial filter, identifying potentially high-risk foundation models that warrant further scrutiny through capability assessments. Models that are deemed high-risk based on these assessments are then subject to specific regulations, in addition to the general and sector-specific regulations that apply to all AI models. Models that fall below the compute threshold or are not classified as high-risk are subject only to these general and sector-specific regulations.

Compute thresholds offer a unique set of features that make them a valuable tool for AI governance. They are quantifiable, risk-correlated, externally verifiable, and difficult to circumvent, making them an attractive option for regulators seeking to classify AI models and impose appropriate oversight. The specific role of compute thresholds depends on the context in which they are applied.[1] I argue that:

  • (i) compute thresholds can and should be used as a first trigger, and
  • (ii) are not the sole determinant for all regulations.

Compute Thresholds as a First Trigger

One of the primary ways I conceptualize compute thresholds is as a trigger for further scrutiny. Regulators need a quick and cheap way to classify AI models pre-development/deployment and identify those that may warrant higher regulatory scrutiny. Compute thresholds serve this purpose by acting as an initial filter, with more detailed evaluations and assessments required for models that exceed the threshold (Figure 4).

The ambition is to identify training compute thresholds that would serve as intuitive starting points to identify potential models of concern and help define the limited subset of high-compute models that would be subject to regulation.[2] Once a certain level of training compute is reached, AI models are presumed to have a higher risk of displaying dangerous capabilities (especially unknown dangerous capabilities) and, hence, are subject to stricter oversight and requirements.[3]

More generally, when designing AI governance frameworks, policymakers need to consider two key aspects:

  • (i) what determines if a model falls within the scope of the specific AI regulations triggered by the compute threshold (i.e., the "regulatory framework" for high-compute models)
  • (ii) what requirements are imposed on a model once it exceeds the compute threshold and is subject to these specific AI regulations

For the first aspect, easily measurable criteria that are known before model deployment, such as training compute, are preferable for legal certainty and clarity in determining which models are subject to the additional regulations. For the second aspect, a more nuanced approach sensitive to factors like deployment context and evaluation results is appropriate in determining the specific requirements and oversight measures imposed on these models.

Not the Sole Determinant for All Regulations

Compute thresholds are not meant to be the sole determinant for all AI regulations.[4] Instead, they are designed to target a specific subset of AI models, such as "frontier AI models" or "high-impact foundation models."[5] This does not mean that compute thresholds should replace other sector-specific regulations or broader AI governance measures.

Moreover, regulating frontier models is not the same as regulating all possible impacts of AI on society. Downstream impacts due to distribution methods (e.g., harms caused by building a legal advisor on top of a frontier model) should be regulated at the use level. Frontier AI regulation is primarily concerned with ensuring governments have visibility and the capacity to act if frontier models are found to present serious societal-scale risks, rather than targeting specific societal impacts.[6]

To define the upper tier of AI models subject to additional regulation, policymakers could consider proxies of (a) model capabilities and (b) downstream impacts. For the latter, metrics such as the number of customers, number of downstream users, and number of high-risk system providers that report using the foundation model could be used.


  1. For example, for compute providers, compute thresholds can be used to determine the level of know-your-customer (KYC) requirements for users of large-scale computing resources. In the EU’s AI Act, compute thresholds are used to presume general-purpose AI models with systemic risks, triggering additional obligations for their providers. Similarly, in the United States, Executive Order 14110 sets compute thresholds for reporting requirements and safety evaluations. ↩︎

  2. Compute thresholds are not claimed to be an exhaustive measure of risk. One reason for using compute as a trigger is that evaluations are often imprecise and narrowly focused on a specific domain or test. In contrast, compute thresholds do not claim to be precise; they simply indicate that a significant amount of work went into the system, suggesting a need for greater caution. This is a benefit, not a bug, as it complements the more targeted nature of capability evaluations and other metrics. ↩︎

  3. We have many similar thresholds in place across various domains. While none of them are perfect, they serve as a first classifier/trigger. For example, cars are classified based on weight (e.g., anything above 3.5 tons), buildings are categorized by height (e.g., skyscrapers), businesses are regulated based on revenue, drones are subject to different rules depending on their weight, and there are regulations on the distance you must keep from whales during whale watching. Other examples include car emissions standards and speed limits. These thresholds, while imperfect, provide a starting point for targeted regulation and oversight. ↩︎

  4. While I generally have reservations about using compute thresholds as the sole measure, my views might be more relaxed if the threshold is set sufficiently high and the regulatory burden is kept relatively low. For example, I suggest using compute thresholds to classify Know Your Customer (KYC) requirements and other potential measures for the usage of large compute providers. I believe this approach is fair, as it only targets a handful of companies (see “Governing Through the Cloud”). ↩︎

  5. I have some disagreements with specific compute thresholds, e.g., for models trained on biological data. I discuss this in the FAQ below. ↩︎

  6. Here’s a paraphrase of a perspective that was shared with me: A common mistake is trying to encode desired outcomes directly into the legislation, such as regulating AI based on its capabilities and societal impact. However, laws do not directly produce their intended outcomes; instead, those outcomes are the result of a complex system. Currently, we lack a clear understanding of which evaluations or capabilities to regulate, and using such conditions as the basis for registration or other requirements would likely lead to gaming and high costs for both regulators and regulated entities. Compute thresholds, while not perfect, provide a more objective and measurable starting point for targeted regulation. ↩︎

5. Conclusion

While not perfect, compute thresholds are currently one of the best tools available for AI governance. They offer a quantifiable, risk-correlated, and externally verifiable metric that can inform regulatory decisions while minimizing circumvention and targeting the most potentially impactful models.

The discussion of whether capability measures are better than compute thresholds is, in my view, misguided. Capability measures are supplemental, and while they might play a more important role in the future, compute thresholds remain a valuable tool in the current AI governance landscape.
Compute thresholds are not self-sufficient for governance[1]; they are one tool among many. Their effectiveness depends on the specific context and the design of the overall governance framework. By understanding the strengths and limitations of compute thresholds, policymakers can make informed decisions about when and how to use them as part of a comprehensive approach to AI governance.


  1. It is important to recognize that compute, and therefore also compute thresholds, are just one tool in the AI governance toolkit and are not self-sufficient for comprehensive governance. For a more in-depth discussion on the limitations of compute governance and the circumstances under which it can be a potent governance node, I recommend Section 5 of the "Computing Power and the Governance of AI,” and my blog post “Crucial Considerations for Compute Governance.” ↩︎

Appendix

A. FAQ

Disclaimer: This section is more opinionated, and everything here is open to challenge. Please reach out. I plan to update this over time.

Content includes:

  • A.1 How can we account for algorithmic efficiency improvements in compute thresholds?
  • A.2 What is the optimal frequency for reviewing and updating compute thresholds?
  • A.3 Is effective compute a viable alternative to raw compute thresholds in addressing algorithmic efficiency?
  • A.4 Is it practical for AI developers to reduce compute use to avoid regulation?
  • A.5 How do post-training enhancements affect the use of compute thresholds?
  • A.6 FLOP, integer, or operations — what is the most appropriate metric for training compute thresholds?
  • A.7 FLOP, FLOPs, FLOPS, or FLOP/s
  • A.8 How do you measure it? What about fine-tuning?
  • A.9 What about domain-specific compute thresholds, such as the 10^23 operations threshold for models trained primarily on biological sequence data?
  • A.10 Capabilities measures are better!
  • A.11 Which other measures could one consider, and how do they score?

A.1 How can we account for algorithmic efficiency improvements in compute thresholds?

  • As algorithmic efficiency improves, some argue for a decrease in the compute threshold over time. Continued AI research reduces the amount of compute needed to obtain any fixed capability level, meaning less compute is required to build higher-capability models. Adjusting the threshold at the same pace as algorithmic efficiency improvements would maintain its correspondence to the same level of capabilities.
  • While improved algorithmic efficiency may justify lowering the compute threshold, it shouldn't be the primary driver for such changes. It merely defines the upper limit for how quickly these thresholds could change if the goal is to consistently regulate the same level of capability (a toy illustration of this upper limit follows after this list).
  • Factors such as the offense-defense balance in AI models and societal resilience against unregulated AI suggest the threshold may need slower adjustments or remain unchanged despite increases in algorithmic efficiency. Lowering the compute threshold in line with increased algorithmic efficiency or maintaining the same capability threshold may not always be necessary, especially if unregulated models offer more benefits than potential harm and society feels adequately protected against them.
  • Over time, society could become better at managing risks from models. A model of a given capability level could be less risky in the future, or experiments could show that future models—whose risks are currently uncertain—pose limited risks. In such cases, the threshold could even be increased over time to focus on the most potentially harmful systems.
  • We discuss the implications of increasing compute efficiency (the combined effect of improved hardware price-performance and algorithmic efficiency) at length in this paper: ”Increased Compute Efficiency and the Diffusion of AI Capabilities”. A brief summary is available on the GovAI blog: “What Increasing Compute Efficiency Means for the Proliferation of Dangerous Capabilities”. Most importantly, learning what's happening at the frontier provides insight into "what's about to come down the pipe."
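As a purely illustrative sketch of the "upper limit" point above: if the sole goal were to keep regulating the same capability level, the threshold would fall at most at the pace of algorithmic efficiency gains. The doubling time below is an assumed placeholder, not an empirical estimate:

```python
# Illustrative only: how fast a threshold would fall *if* it tracked algorithmic
# efficiency one-for-one. The doubling time is an assumed placeholder value.

INITIAL_THRESHOLD_FLOP = 1e26
EFFICIENCY_DOUBLING_TIME_YEARS = 1.0  # assumption for illustration

def capability_equivalent_threshold(years_elapsed: float) -> float:
    # Halve the threshold each time algorithmic efficiency doubles.
    return INITIAL_THRESHOLD_FLOP / 2 ** (years_elapsed / EFFICIENCY_DOUBLING_TIME_YEARS)

for years in range(4):
    print(f"After {years} year(s): {capability_equivalent_threshold(years):.1e} FLOP")
```

As argued above, this mechanical schedule is an upper bound on how quickly adjustment could be justified, not a recommendation to follow it.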

A.2 What is the optimal frequency for reviewing and updating compute thresholds?

  • As mentioned above, the frequency of updates is unclear and depends on various factors.
  • It goes without saying that any AI regulation should have the ability to be updated. While this may be challenging, I'm skeptical about AI regulations that lack a mechanism for adaptation. AI is a rapidly evolving field, and everyone should have learned this lesson by now.

A.3 Is effective compute a viable alternative to raw compute thresholds in addressing algorithmic efficiency?

  • Algorithmic efficiency improvements, including advancements in training methods and architecture, have reduced the amount of compute required to train models to similar performance levels over time.
  • As a result, there is growing interest in using effective compute, which accounts for both increases in training compute and algorithmic advancements. Effective compute is relative, as improvements in algorithmic efficiency are measured against a specific model.
  • However, I challenge the use of effective compute for now, for the reasons outlined below (a toy formulation is sketched after this list). Given these challenges, I recommend using training compute thresholds for regulatory purposes, with government agencies regularly reviewing and adjusting these thresholds as needed.
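A minimal way to express the idea (my own toy formulation, not a standard or agreed definition): effective compute scales physical training compute by an algorithmic-efficiency factor measured relative to a reference model and algorithm.

```python
# Toy formulation of "effective compute" relative to a reference algorithm.
# The efficiency gain factor is an assumed input; in practice it is precisely
# the quantity that is hard to measure and standardize.

def effective_compute(physical_flop: float, efficiency_gain_vs_reference: float) -> float:
    return physical_flop * efficiency_gain_vs_reference

# Example: a 1e25-FLOP training run using algorithms 3x more efficient than the
# reference counts as 3e25 "effective FLOP" under this toy definition.
print(f"{effective_compute(1e25, 3.0):.0e} effective FLOP")
```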

Challenges to Effective Compute:

  • Selecting Performance Metrics: Choosing the right performance metric to define the "effectiveness" of training compute is challenging due to the wide range of available metrics, from technical aspects like test loss to real-world benchmarks and actual impact. Technical metrics may not accurately represent real-world capabilities, while more practical metrics are subject to the unpredictable emergence of new capabilities.

Figure 5: The spectrum of performance metrics ranges from technical aspects like test loss to more practical and real-world benchmarks such as performance on human tests, and even the actual impact or utility in the world. To measure effective compute, one needs to choose a performance metric or a combination of them. Another dimension to consider is the prompting method or the number of trials used in the evaluation process.

  • Post-Training Improvements: Performance gains can result from post-training techniques like fine-tuning or prompting methods, which require minimal compute. Determining the appropriate point in the model's lifecycle for performance measurement remains unclear. Recent discussions around benchmarks have highlighted the challenges posed by prompting techniques in accurately assessing model capabilities.
  • Limited Research: Research on algorithmic efficiency is sparse, with only a few papers published. The existing work covers a wide range of improvement estimates over time, highlighting the difficulty of this research area.
  • Perception of Exhaustiveness: Effective compute can create a false sense of being a comprehensive measure—as it combines capabilities and compute—potentially leading to oversight and underestimating the need for further evaluations.
  • Regulating Consistent Capability Levels: Effective compute suggests a regulatory focus on maintaining consistent performance levels over time, which may not always be practical or warranted. Factors such as the offense-defense balance in AI models and societal resilience against unregulated AI indicate that this threshold might require more gradual or no adjustments (as discussed in the first questions).
  • Ease of Measurement and Verification: Training compute is simpler to measure and is usually known prior to training, making it a reliable trigger for further evaluations. In contrast, effective compute requires a more in-depth analysis of performance, which is only possible after training, and potentially access to proprietary information, posing challenges in measurement and verification.
  • Disclosure of Proprietary Information: Publishing effective compute data alongside standard compute metrics could expose a company's algorithmic progress, potentially intensifying competitive dynamics in the AI industry.

Therefore,

  • This assessment focuses on the current practical and inherent limitations of using effective compute as a regulatory measure. More research is needed to develop standardized measures for effective compute that address the challenges and limitations outlined here. If successful, an effective compute measure could become the "GDP per capita" of AI performance (e.g., by creating “a nominal basket of benchmarks”).
  • Effective compute is currently most suitable for internal company use (e.g., for Responsible Scaling Policies[1]), where the necessary tools and insights for accurate assessment are readily available. External entities, including regulatory bodies, may struggle to accurately assess effective compute due to the lack of standardized methods and limited access to models. Moreover, regulatory bodies may not possess the technical know-how required to accurately assess effective compute.

A.4 Is it practical for AI developers to reduce compute use to avoid regulation?

  • As discussed above, "compute" is not something that you can easily adjust downwards while maintaining model performance. There is already enormous pressure to optimize compute use.[2]
  • In addition, as we discuss in "Governing Through the Cloud" on the question of "Can't you just structure a training run across multiple CSPs to stay below the threshold?":
    • “[..] Moreover, the strategic implementation of a compute threshold (Pistillo et al., forthcoming) could serve as a deterrent against “structured training” practices. Next-generation AI models generally necessitate an order of magnitude more training compute than their predecessors to achieve meaningful advancements in performance.[3] A compute threshold set slightly above the current models, for example, the one used in the AI Executive Order, reflects that future, more advanced models will not just require marginal incremental compute increases but rather an exponential scale-up (usually an order of magnitude). Practically, this leads to the necessity of involving more than ten compute providers to distribute the workload sufficiently to circumvent the threshold. This requirement imposes substantial logistical and performance challenges. Simply identifying more than ten compute providers with sufficient compute capacity over which to structure a workload is a formidable challenge, and doing so without detection is even more difficult. Furthermore, next-generation models—given the exponential increase in training compute—increase the challenge even further. Should the exponential scaling of training compute persist, it would necessitate a proportionally exponential increase in the number of compute providers involved, exacerbating the challenges even further.”

A.5 How do post-training enhancements affect the use of compute thresholds?

  • Post-training enhancements can indeed significantly improve model performance.

  • Regulators should be aware that AI capabilities might be substantially improved through post-training enhancements. We observe that, with current methods, capabilities can be increased by an amount equivalent to a 5x to 20x increase in training compute (a toy compute-equivalent calculation follows after this list).

  • However, there is likely a limit to these improvements, and the base capabilities determined by the initial architecture and pre-trained model remain critical.

  • When setting compute thresholds, regulators should account for potential post-training enhancements and include a safety buffer.[4] The thresholds should be based on the observed and enhanced capabilities of models like GPT-4 or Gemini, rather than the capabilities of the base model alone.

  • For example, when you set a threshold at 10^25 operations, it is because you've assessed that models like GPT-4 or Gemini should be subject to the desired regulation. This assessment is not based on the capabilities of the base model but rather on the observed and evaluated capabilities achieved, including post-training improvements.

  • Similarly, the Executive Order uses 10^26 operations because it assesses that no current model poses capabilities that are of significant concern, even after accounting for potential post-training enhancements.

  • While we might see these models improve over time, I don't expect these improvements to be able to overcome or "skip multiple generations" of compute scaling. In other words, post-training enhancements alone are unlikely to enable models to achieve the capabilities of models trained with multiple orders of magnitude more compute.
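To make the notion of a "compute-equivalent" gain concrete, here is a toy calculation (my own illustration, using the 5x to 20x range mentioned above; the base compute figure is arbitrary):

```python
# Toy compute-equivalent calculation for post-training enhancements.
# The gain factor (5x-20x per the estimate above) varies by model and method.

def enhanced_compute_equivalent(pretraining_flop: float, gain_factor: float) -> float:
    return pretraining_flop * gain_factor

BASE_FLOP = 5e24  # arbitrary illustrative pre-training compute
for gain in (5, 10, 20):
    print(f"{BASE_FLOP:.0e} FLOP base + {gain}x enhancements "
          f"≈ {enhanced_compute_equivalent(BASE_FLOP, gain):.0e} FLOP-equivalent")
```

This is why thresholds should be compared against observed, enhanced capability levels rather than base-model capabilities alone.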

A.6 FLOP, integer, or operations — what is the most appropriate metric for training compute thresholds?

  • The EU AI Act only discusses floating point operations (FLOP).
  • As mentioned above, I recommend using the term "operations" rather than specifying a particular type of operation (such as FLOP) when establishing compute thresholds. This approach ensures flexibility and consistency with the terminology used in the US Executive Order 14110.

A.7 FLOP, FLOPs, FLOPS, or FLOP/s?

  • Let’s use the term “FLOP” for training compute (thresholds) and “FLOP/s” for performance (of chips)! FLOP/s refers to the computational performance of an integrated circuit, representing the number of operations executed per second. In contrast, FLOP denotes the quantity of operations (e.g., running an NVIDIA A100 for a week results in a total number of X FLOP being executed). For a more detailed explanation, see my blog post: FLOP for Quantity, FLOP/s for Performance.

A.8 How do you measure it?

  • We suggest measuring training compute by plugging architectural details and hyperparameters into formulas of training compute, as described in Method 1 of “Estimating Training Compute of Deep Learning Models” (a common approximation is sketched below).
  • There is currently no standardized method for measuring a model's training compute. That introduces ambiguity regarding "what counts" and the appropriate measurement approach. However, given that training compute grows by orders of magnitude between model generations, the exact accuracy of these methods is not critical.
  • More important is an agreement on whether we count “fine-tuning”, “enhancements”, or other measures. I hope for more guidance on this in the future.
  • While some in the EU AI Act suggest measuring all aspects of compute usage, I believe this approach is complicated and not feasible. Instead, I recommend only measuring “pre-training” compute. I can share some guidance on standardized measurement methods on request.
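For dense transformer models, a commonly used approximation behind the Method 1 formulas referenced above is training compute ≈ 6 × N × D, where N is the parameter count and D the number of training tokens. A minimal sketch, with illustrative numbers that do not refer to any particular model:

```python
# Common approximation for dense transformer pre-training compute:
#   total FLOP ≈ 6 * parameters * training tokens
# (forward + backward pass; the exact constant varies with architectural details).

def approx_training_flop(num_parameters: float, num_training_tokens: float) -> float:
    return 6 * num_parameters * num_training_tokens

# Illustrative example: a 70-billion-parameter model trained on 2 trillion tokens.
print(f"{approx_training_flop(70e9, 2e12):.1e} FLOP")  # ≈ 8.4e23 FLOP
```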

Figure 6: I recommend using pre-training compute and not including compute used in further enhancement processes.

What about fine-tuning?

  • I suggest not measuring fine-tuning compute for the purpose of compute thresholds. Including fine-tuning compute is impractical because fine-tuning is often offered to customers and repeated many times for a given underlying foundation model.
  • Re-measuring (and potentially resubmitting or retriggering red-teaming) for every fine-tuning occurrence would be burdensome. For example, if the Executive Order requires reporting every model above a training compute threshold, this would necessitate reporting every newly fine-tuned model. While fine-tuning and other techniques can significantly increase capabilities, this should be addressed from the beginning by designing the "pre-training compute threshold" adequately (as discussed above).
  • While certain fine-tuned models might warrant additional reporting, not every model does. The criteria for additional reporting should be based on factors other than the amount of increased compute used, such as the type of data used or the elicited capabilities.
  • Furthermore, in today's large language model (LLM) training, supervised fine-tuning (SFT) and reinforcement learning (RL) compute is small compared to pre-training compute (perhaps ≈1%, or up to ≈10% in the largest case we are aware of). As a result, pre-training compute is currently a fairly good approximation of total LLM training compute.
  • Existing LLM scaling laws describe model performance as a function of pre-training compute, not SFT nor RL compute. Therefore, reports of pre-training compute could uniquely be used to predict base model capabilities, which are arguably most relevant to identifying the need for evaluations.

A.9 What about domain-specific compute thresholds, such as the 10^23 operations threshold for models trained primarily on biological sequence data?

  • I have reservations about domain-specific compute thresholds, and the specific bio threshold in Executive Order 14110, for several reasons:

    • The regulations I discuss primarily apply to general-purpose, "frontier models" or foundation models. Designing multiple thresholds to capture every model that might pose a risk may not be the most appropriate approach. Regulatory compliance for these models should be tailored to their specific characteristics and risks.

      • For example, the red-teaming or evaluation process for a bio-focused model should differ from that of a general-purpose model, as the methods and considerations are dramatically different.[5]
    • The regulatory cost of compliance should be considered when determining the appropriateness of a metric. If the compliance burden is too high, the metric may not be suitable. Depending on the threshold, it might encompass a level of compute regularly used by many actors, making it an overinclusive and ineffective regulatory measure.

    • Models have already crossed this threshold, and the current definition can be gamed (though there's a straightforward fix for this).[6]

  • Compute thresholds are not intended to address all risks. Most effective measures at the intersection of biology and AI will likely be “traditional bio-risk interventions” rather than AI-specific ones. Threats from bio-AI models may require other forms of regulation, with the most cost-effective measures lying in the domain of traditional bio-risk interventions, such as those addressing pandemics. E.g., wastewater sequencing and synthesis procurement screening (included in Executive Order 14110) are valuable, and not AI-specific.

  • Measures that limit the type of data a model can be trained on may be more promising, although enforcement could be challenging. Certain datasets should perhaps not be public.

  • The concept of compute thresholds is supported by the observation of scaling laws across different domains. However, given the differences in “slope” and “starting point”, different domains would require different thresholds. For example, the current frontier of models trained primarily on biological sequence data might only sit at 10^23 operations.[7] While the same features as above would apply, certain thresholds are harder to enforce, as the required computational performance is achievable at lower costs. Where exactly to draw this line is a separate question.

A.10 Capabilities measures are better!

  • I disagree. Capability measures are supplemental to compute thresholds, not a replacement for them.
  • Comprehensive evaluations, including substantial red-teaming, will require significant effort, making it crucial to identify which models should undergo that scrutiny. Easy-to-run evaluations may be too easy to game, raising concerns about companies designing models to perform poorly on benchmarks while still excelling at the capabilities the benchmarks are meant to measure.
  • Compute thresholds and capability assessments should be viewed as complementary approaches. Compute thresholds excel at providing a quick, easily measurable, and externally verifiable way to identify potentially high-risk models and trigger further scrutiny. Capability assessments offer a more nuanced and context-specific evaluation of a model's risks and potential impacts.
  • By using compute thresholds as a first-pass filter and then applying capability assessments to the identified high-risk models, regulators can create a more efficient and effective governance process (Figure 4). As the science of measuring and assessing AI capabilities matures, capability assessments can be increasingly integrated into the governance framework alongside compute thresholds.

A.11 Which other measures could one consider, and how do they score?

  • I'd rule out effective compute for now, for the reasons discussed above.

  • Several alternative measures have been considered, but each has its limitations:[8]

    • Effective compute: Ruled out for the reasons discussed above.
    • Training dataset size: Data quality matters in addition to quantity, and there is currently no objective way to measure data quality. Dataset size also affects training compute and is partly captured by that metric.
    • Training algorithms or model architecture: This metric does not track risk, as a randomly initialized model with a given architecture would not be high-risk. It is also manipulable, as developers could modify the algorithm or architecture to avoid high-risk classification.
    • Number of model parameters: Parameter count does not track risk, as a randomly initialized model with many parameters would not be high-risk. It is also manipulable through techniques like model pruning.
    • Evaluations of specific capabilities: The current science of assessing a model's capabilities is nascent, and evaluations can only be conducted after training, making it unclear to developers whether their model will be classified as high-risk.
    • Number of citizens affected by a model: This metric is difficult to measure objectively and is not known before deployment.

  1. AI companies can use effective compute internally, e.g., for selecting review- or checkpoints for responsible scaling policies. ↩︎

  2. Some have argued that compute is the bottleneck in AI development. ↩︎

  3. Scaling laws suggest that achieving substantial enhancements in performance on downstream tasks necessitates exponential increases in training compute. This principle underscores that compute investments grow exponentially for comparatively linear improvements in task performance (Owen, 2024). ↩︎

  4. See, e.g., Davidson et al., who suggest incorporating a "safety buffer" by restricting capabilities that are projected to reach dangerous levels through future improvements in post-training enhancements. ↩︎

  5. Additionally, red-teaming biological design tools poses its own risks: Red teams should probably not involve wet lab validation of model predictions due to unacceptable biosafety and biosecurity risks. And, stringent information security and secrecy measures would be necessary to prevent leaks or publication of findings. ↩︎

  6. As the Epoch article points out, the “trained primarily on biological sequence data” description is easily gameable: a developer can train on additional non-biological tokens to avoid being classified as "primarily biological sequence data". However, this can be fixed by specifying that the computing power was used to train "with" the biological training data, regardless of the presence of non-bio tokens/data. ↩︎

  7. The appropriate compute threshold may need to vary not only between large language models (LLMs) and biological data-trained models (BDTs), but also across different types of BDTs. For instance, a threshold of 10^23 operations could be suitable for protein function prediction models, but a lower threshold may be warranted for models focused on immune escape prediction or protein folding. ↩︎

  8. I’d love to hear more suggestions. ↩︎

| Potential metric | Correlation with risk | Quantifiability and ease of measurement | Difficulty of circumvention | Knowable before development and deployment | External verifiability | Targeted regulatory scope |
| --- | --- | --- | --- | --- | --- | --- |
| Training compute | Medium to high | High | High | Yes | High | High |
| Number of parameters | Medium to low | High | Medium | Yes | High | Medium to high |
| Amount of training data | Medium | Medium | Medium | Yes | High | Medium to high |
| Capabilities | High | Low | Medium to low | Yes before depl., not dev. | Medium | High |
| Performance (pre-training loss) | Medium to high | Medium | High | Yes before depl., not dev. | Medium | Medium to high |
| Impacts | Very high | Low | Medium | No | Medium | High |
| Applications | High for some risks; not others | Medium | Medium | Sometimes | Medium | Medium |

Table 1: First-pass evaluation of potential metrics for AI governance, based on the criteria discussed here (h/t Mauricio).


Acknowledgments

Thanks to all my colleagues, collaborators, friends, and regulatory targets for the valuable conversations on this topic and brief feedback on this article. I also appreciate Claude’s help with editing this article.