In machine learning, a neural scaling law is a scaling law relating parameters of a family of neural networks. [1] [2]
In general, a neural model can be characterized by 4 parameters: size of the model, size of the training dataset, cost of training, error rate after training. Each of these four variables can be precisely defined into a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".[ citation needed] These are usually written as (number of parameters, dataset size, computing cost, loss).
In most cases, the size of the model is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. [3] In sparse models, during every inference, only a fraction of the parameters are used. In comparison, most other kinds of neural networks, such as Transformer networks, always use all their parameters during every inference.
The size of the training dataset is usually quantified by the number of data points it contains. Larger training datasets are typically preferred as they provide a richer and more diverse source of information for the model to learn from. This in turn can lead to improved generalization performance when the model is applied to unseen data. [4] However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used in most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes would have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. [5]
In some cases, a small amount of high quality data suffices for finetuning, and more data does not improve performance. [5]
The cost of training is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required to train the model). It's important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware like GPUs or TPUs.
The cost of training a neural model is a function of several factors including the size of the model, the size of the training dataset, the complexity of the training algorithm, and the computational resources available. [4] In particular, doubling the training dataset does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an " epoch").
The performance of a neural model is evaluated based on its ability to accurately predict the output given the input data. Common metrics for evaluating model performance include: [4]
Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
The 2017 paper [2] is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like , with , the paper found that .
Of the factors they varied, only task can change the exponent . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have while another might have . The also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like for another exponent .
They studied machine translation with LSTM (), generative language modelling with LSTM (), ImageNet classification with ResNet (), and speech recognition ().
A 2020 analysis [8] studied statistical relations between over a wide range of values and found similar scaling laws, over the range of , , and over multiple modalities (text, video, image, text to image, etc.). [8]
In particular, the scaling laws it found are (Table 1 of [8]):
The scaling law of was confirmed during the training of GPT-3 (Figure 3.1 [9]).
One particular scaling law (" Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have: [10]where the variables are
and the statistical parameters are
Although Besiroglu et. al. [12] claims that the statistical estimation is slightly off, and should be .
The statistical laws were fitted over experimental data with .
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed , we can uniquely solve for all 4 variables that minimizes . This provides us with the optimal for any fixed :Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on.
There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of . One can also directly fit a statistical law for without going through the detour, for which one obtains:or as tabulated:
/ FLOP | / FLOPs of training Gopher | ||
---|---|---|---|
400 Million | 1.92e+19 | 1/29968 | 8.0 Billion |
1 Billion | 1.21e+20 | 1/5706 | 20.2 Billion |
10 Billion | 1.23e+22 | 1/2819 | 205.1 Billion |
67 Billion | 5.76e+23 | 1 | 1.5 Trillion |
175 Billion | 3.85e+24 | 6.7 | 3.7 Trillion |
280 Billion | 9.90e+24 | 17.2 | 5.9 Trillion |
520 Billion | 3.43e+25 | 59.5 | 11.0 Trillion |
1 Trillion | 1.27e+26 | 221.3 | 21.2 Trillion |
10 Trillion | 1.30e+28 | 22515.9 | 216.2 Trillion |
In simpler terms, the Chinchilla scaling law for training Transformer language models suggests that when given an increased budget (in FLOPs), to achieve compute-optimal, the number of model parameters (N) and the number of tokens for training the model (D) should scale in approximately equal proportions. This conclusion differs from the previous scaling law for neural language models, [11] which states that N should be scaled faster than D. The discrepancy arises from setting different cycle lengths for cosine learning rate schedulers. In estimating the Chinchilla scaling, the authors set the cycle length to be the same as the training steps, as experimental results indicate that larger cycles overestimate the loss of the models.
As Chinchilla scaling has been the reference point for many large-scaling training runs, there had been a concurrent effort to go "beyond Chinchilla scaling", meaning to modify some of the training pipeline in order to obtain the same loss with less effort, or deliberately train for longer than what is "Chinchilla optimal".
Usually, the goal is to make the scaling law exponent larger, which means the same loss can be trained for much less compute. For instance, filtering data can make the scaling law exponent larger. [13]
Another strand of research studies how to deal with limited data, as according to Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet. [14] found that augmenting the dataset with a mix of "denoising objectives" constructed from the dataset improves performance. [15] studies optimal scaling when all available data is already exhausted (such as in rare languages), so one must train multiple epoches over the same dataset (whereas Chinchilla scaling requires only one epoch). The Phi series of small language models were trained on textbook-like data generated by large language models, for which data is only limited by amount of compute available. [16]
Chinchilla optimality was defined as "optimal for training compute", whereas in actual production-quality models, there will be a lot of inference after training is complete. "Overtraining" during training means better performance during inference. [17] LLaMA models were overtrained for this reason. Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal. [18]
A 2022 analysis [19] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:
in which refers to the quantity being scaled (i.e. , , , number of training steps, number of inference steps, or model input size) and refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters are found by statistical fitting.
On a log–log plot, when is not too large and is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the transitions between the segments are called "breaks", hence the name Broken Neural Scaling Laws (BNSL).
The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/ self-supervised learning, and reinforcement learning (single agent and multi-agent).
The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include ResNets, Transformers, MLPs, MLP-Mixers, Recurrent Neural Networks, Convolutional Neural Networks, Graph Neural Networks, U-Nets, Encoder-Decoder (and Encoder-only) (and Decoder-only) Models, Ensembles (and Non-Ensembles), MoE (Mixture of Experts) (and Non-MoE) Models, and Sparse Pruned (and Non-Sparse Unpruned) Models.
Vision transformers, similar to language transformers, exhibit scaling laws. A 2022 research trained vision transformers, with parameter counts , on image sets of sizes , for computing (in units of TPUv3-core-days). [20]
After training the model, it is finetuned on ImageNet training set. Let be the error probability of the finetuned model classifying ImageNet test set. They found .
Ghorbani, Behrooz et al. [21] studied scaling laws for neural machine translation (specifically, English as source, and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost or dataset size ). They varied They found three results:
The authors hypothesize that source-natural datasets have uniform and dull target sentences, and so a model that is trained to predict the target sentences would quickly overfit.
[23] trained Transformers for machine translations with sizes on dataset sizes . They found the Kaplan et al (2020) [11] scaling law applied to machine translation: . They also found the BLEU score scaling as .
Hernandez, Danny et al. [24] studied scaling laws for transfer learning in language models. They trained a family of Transformers in three ways:
The idea is that pretraining on English should help the model achieve low loss on a test set of Python text. Suppose the model has parameter count , and after being finetuned on Python tokens, it achieves some loss . We say that its "transferred token count" is , if another model with the same achieves the same after training on Python tokens.
They found for pretraining on English text, and for pretraining on English and non-Python code.
{{
cite book}}
: CS1 maint: multiple names: authors list (
link)
In machine learning, a neural scaling law is a scaling law relating parameters of a family of neural networks. [1] [2]
In general, a neural model can be characterized by 4 parameters: size of the model, size of the training dataset, cost of training, error rate after training. Each of these four variables can be precisely defined into a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".[ citation needed] These are usually written as (number of parameters, dataset size, computing cost, loss).
In most cases, the size of the model is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. [3] In sparse models, during every inference, only a fraction of the parameters are used. In comparison, most other kinds of neural networks, such as Transformer networks, always use all their parameters during every inference.
The size of the training dataset is usually quantified by the number of data points it contains. Larger training datasets are typically preferred as they provide a richer and more diverse source of information for the model to learn from. This in turn can lead to improved generalization performance when the model is applied to unseen data. [4] However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used in most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes would have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. [5]
In some cases, a small amount of high quality data suffices for finetuning, and more data does not improve performance. [5]
The cost of training is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required to train the model). It's important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware like GPUs or TPUs.
The cost of training a neural model is a function of several factors including the size of the model, the size of the training dataset, the complexity of the training algorithm, and the computational resources available. [4] In particular, doubling the training dataset does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an " epoch").
The performance of a neural model is evaluated based on its ability to accurately predict the output given the input data. Common metrics for evaluating model performance include: [4]
Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
The 2017 paper [2] is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like , with , the paper found that .
Of the factors they varied, only task can change the exponent . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have while another might have . The also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like for another exponent .
They studied machine translation with LSTM (), generative language modelling with LSTM (), ImageNet classification with ResNet (), and speech recognition ().
A 2020 analysis [8] studied statistical relations between over a wide range of values and found similar scaling laws, over the range of , , and over multiple modalities (text, video, image, text to image, etc.). [8]
In particular, the scaling laws it found are (Table 1 of [8]):
The scaling law of was confirmed during the training of GPT-3 (Figure 3.1 [9]).
One particular scaling law (" Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have: [10]where the variables are
and the statistical parameters are
Although Besiroglu et. al. [12] claims that the statistical estimation is slightly off, and should be .
The statistical laws were fitted over experimental data with .
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed , we can uniquely solve for all 4 variables that minimizes . This provides us with the optimal for any fixed :Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on.
There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of . One can also directly fit a statistical law for without going through the detour, for which one obtains:or as tabulated:
/ FLOP | / FLOPs of training Gopher | ||
---|---|---|---|
400 Million | 1.92e+19 | 1/29968 | 8.0 Billion |
1 Billion | 1.21e+20 | 1/5706 | 20.2 Billion |
10 Billion | 1.23e+22 | 1/2819 | 205.1 Billion |
67 Billion | 5.76e+23 | 1 | 1.5 Trillion |
175 Billion | 3.85e+24 | 6.7 | 3.7 Trillion |
280 Billion | 9.90e+24 | 17.2 | 5.9 Trillion |
520 Billion | 3.43e+25 | 59.5 | 11.0 Trillion |
1 Trillion | 1.27e+26 | 221.3 | 21.2 Trillion |
10 Trillion | 1.30e+28 | 22515.9 | 216.2 Trillion |
In simpler terms, the Chinchilla scaling law for training Transformer language models suggests that when given an increased budget (in FLOPs), to achieve compute-optimal, the number of model parameters (N) and the number of tokens for training the model (D) should scale in approximately equal proportions. This conclusion differs from the previous scaling law for neural language models, [11] which states that N should be scaled faster than D. The discrepancy arises from setting different cycle lengths for cosine learning rate schedulers. In estimating the Chinchilla scaling, the authors set the cycle length to be the same as the training steps, as experimental results indicate that larger cycles overestimate the loss of the models.
As Chinchilla scaling has been the reference point for many large-scaling training runs, there had been a concurrent effort to go "beyond Chinchilla scaling", meaning to modify some of the training pipeline in order to obtain the same loss with less effort, or deliberately train for longer than what is "Chinchilla optimal".
Usually, the goal is to make the scaling law exponent larger, which means the same loss can be trained for much less compute. For instance, filtering data can make the scaling law exponent larger. [13]
Another strand of research studies how to deal with limited data, as according to Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet. [14] found that augmenting the dataset with a mix of "denoising objectives" constructed from the dataset improves performance. [15] studies optimal scaling when all available data is already exhausted (such as in rare languages), so one must train multiple epoches over the same dataset (whereas Chinchilla scaling requires only one epoch). The Phi series of small language models were trained on textbook-like data generated by large language models, for which data is only limited by amount of compute available. [16]
Chinchilla optimality was defined as "optimal for training compute", whereas in actual production-quality models, there will be a lot of inference after training is complete. "Overtraining" during training means better performance during inference. [17] LLaMA models were overtrained for this reason. Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal. [18]
A 2022 analysis [19] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:
in which refers to the quantity being scaled (i.e. , , , number of training steps, number of inference steps, or model input size) and refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters are found by statistical fitting.
On a log–log plot, when is not too large and is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the transitions between the segments are called "breaks", hence the name Broken Neural Scaling Laws (BNSL).
The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/ self-supervised learning, and reinforcement learning (single agent and multi-agent).
The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include ResNets, Transformers, MLPs, MLP-Mixers, Recurrent Neural Networks, Convolutional Neural Networks, Graph Neural Networks, U-Nets, Encoder-Decoder (and Encoder-only) (and Decoder-only) Models, Ensembles (and Non-Ensembles), MoE (Mixture of Experts) (and Non-MoE) Models, and Sparse Pruned (and Non-Sparse Unpruned) Models.
Vision transformers, similar to language transformers, exhibit scaling laws. A 2022 research trained vision transformers, with parameter counts , on image sets of sizes , for computing (in units of TPUv3-core-days). [20]
After training the model, it is finetuned on ImageNet training set. Let be the error probability of the finetuned model classifying ImageNet test set. They found .
Ghorbani, Behrooz et al. [21] studied scaling laws for neural machine translation (specifically, English as source, and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost or dataset size ). They varied They found three results:
The authors hypothesize that source-natural datasets have uniform and dull target sentences, and so a model that is trained to predict the target sentences would quickly overfit.
[23] trained Transformers for machine translations with sizes on dataset sizes . They found the Kaplan et al (2020) [11] scaling law applied to machine translation: . They also found the BLEU score scaling as .
Hernandez, Danny et al. [24] studied scaling laws for transfer learning in language models. They trained a family of Transformers in three ways:
The idea is that pretraining on English should help the model achieve low loss on a test set of Python text. Suppose the model has parameter count , and after being finetuned on Python tokens, it achieves some loss . We say that its "transferred token count" is , if another model with the same achieves the same after training on Python tokens.
They found for pretraining on English text, and for pretraining on English and non-Python code.
{{
cite book}}
: CS1 maint: multiple names: authors list (
link)