Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. [1] [2] Neural networks are an important part of the network approach and connectionist approach to cognitive science. These networks use computer simulations to try to model human behaviours, such as memory and learning. Catastrophic interference is an important issue to consider when creating connectionist models of memory. It was originally brought to the attention of the scientific community by research from McCloskey and Cohen (1989) [1] and Ratcliff (1990) [2]. It is a radical manifestation of the ‘sensitivity-stability’ dilemma [3] or the ‘stability-plasticity’ dilemma [4]. Specifically, these dilemmas refer to the problem of building an artificial neural network that is sensitive to, but not disrupted by, new information. Lookup tables and connectionist networks lie on opposite sides of the stability-plasticity spectrum. [5] The former remain completely stable in the presence of new information but lack the ability to generalize, i.e. infer general principles, from new inputs. On the other hand, connectionist networks like the standard backpropagation network are very sensitive to new information and can generalize from new inputs. Backpropagation models can be considered good models of human memory insofar as they mirror the human ability to generalize, but these networks often exhibit less stability than human memory. Notably, these backpropagation networks are susceptible to catastrophic interference. This is considered an issue when attempting to model human memory because, unlike these networks, humans typically do not show catastrophic forgetting. Thus, catastrophic interference must be eliminated from these backpropagation models in order to enhance their plausibility as models of human memory.
In order to understand catastrophic interference, it is important to understand the components of an artificial neural network and, more specifically, the behaviour of a backpropagation network. The following account of neural networks is summarized from Rethinking Innateness: A Connectionist Perspective on Development by Elman et al. (1996). [6]
Artificial neural networks are inspired by biological neural networks. They use mathematical models, namely algorithms, to do things such as classifying data and learning patterns in data. Information is represented in these networks through patterns of activation, known as distributed representations.
The basic components of artificial neural networks are nodes (or units) and weights. Nodes are simple processing elements, which can be considered artificial neurons. These units can act in a variety of ways. They can act like sensory neurons and collect inputs from the environment, they can act like motor neurons and send an output, they can act like interneurons and relay information, or they may perform all three functions. A backpropagation network is often a three-layer neural network that includes input nodes, hidden nodes, and output nodes (see Figure 1). The hidden nodes allow the input to be transformed into an internal representation, akin to a mental representation. These internal representations give the backpropagation network its ability to capture abstract relationships between different input patterns.
The nodes are also connected to each other, so they can send activation to one another like neurons. These connections can be unidirectional, creating a feedforward network, or they can be bidirectional, creating a recurrent network. Each of the connections between the nodes has a weight, or strength, and it is in these weights that the knowledge is ‘stored’. A weight multiplies the output of the sending node. Weights can be excitatory (a positive value) or inhibitory (a negative value). For example, if a node has an output of 1.0 and it is connected to another node with a weight of -0.5, then the second node will receive an input signal of (1.0 x -0.5) = -0.5. Since any one node can receive multiple inputs, all of these inputs must be summed to calculate the net input.
The net input ($\mathrm{net}_i$) to a node i is defined as:
$\mathrm{net}_i = \sum_j w_{ij}\, o_j$ [2]
where $w_{ij}$ = the weight between node i and node j, and $o_j$ = the activation of node j, i.e. one element of the input vector to node i.
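As a minimal sketch of this weighted sum, assuming plain Python lists and a hypothetical helper named `net_input` (neither comes from the source), the calculation might look like this:

```python
def net_input(weights, activations):
    """Return the net input to one node: the sum of each incoming
    activation multiplied by the weight on its connection."""
    if len(weights) != len(activations):
        raise ValueError("need exactly one weight per incoming activation")
    return sum(w * o for w, o in zip(weights, activations))


# The single-connection example from the text: output 1.0, weight -0.5.
single = net_input([-0.5], [1.0])

# A node receiving from three senders (illustrative values only).
combined = net_input([-0.5, 0.25, 1.0], [1.0, 1.0, 0.5])

print(single, combined)
```

The second call shows an inhibitory connection being outweighed by two excitatory ones, so the net input ends up positive.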
Once the input has been sent from the input layer to the hidden layer, a hidden node may then send an output to the output layer. The output of any given node depends on the net input to that node and on the response function of that node. In the case of a three-layer backpropagation network, the response function is a non-linear, logistic function. This function allows a node to behave in an all-or-none fashion towards high or low input values and in a more graded and sensitive fashion towards mid-range input values. Because the function is steepest in the mid-range and flattens at the extremes, a given change in net input alters a node's output most when the net input is mid-range and least when it is extreme. The net input is transformed into an output that can be sent on to the output layer by:
$o_i = \dfrac{1}{1 + \exp(-\mathrm{net}_i)}$ [2]
where $o_i$ = the activation of node i
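The following sketch evaluates the standard logistic function, 1 / (1 + exp(-net)), at a few net input values to illustrate the graded mid-range and the saturating extremes described above. The function name `logistic` and the sample values are chosen here for illustration only.

```python
import math


def logistic(net):
    """Logistic response function: maps a net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


# Mid-range net inputs give graded outputs; large magnitudes saturate
# towards 0 or 1.
for net in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"net = {net:+.1f} -> output = {logistic(net):.4f}")
```

A net input of zero gives an output of exactly 0.5, and the curve is symmetric about that point.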
An important feature of neural networks is that they can learn. Simply put, this means that they can change the outputs they produce as they are given new inputs. Backpropagation specifically refers to how the network is trained, i.e. how the network is told to learn. A backpropagation network learns by comparing the actual output of each output unit to its desired output. The desired output is known as a 'teacher', and it can be the same as the input, as in the case of auto-associative/auto-encoder networks, or it can be completely different from the input. Either way, learning which requires a teacher is called supervised learning. The difference between the actual and desired output constitutes an error signal. This error signal is then fed back, or backpropagated, to the nodes in order to modify the weights in the neural network. Backpropagation first modifies the weights between the hidden layer and the output layer, then modifies the weights between the input units and the hidden units. The changes in the weights help to decrease the discrepancy between the actual and desired output. However, learning is typically incremental in these networks. This means that a network will require a series of presentations of the same input before it arrives at the weight changes that result in the desired output. The weights are usually set to random values for the first learning trial, and after many trials the weights become better able to represent the desired output. The process of converging on an output is called settling. This kind of training is based on the error signal and the backpropagation learning algorithm / delta rule:
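Error signal at an output node: $e_i = (t_i - o_i)\, o_i\, (1 - o_i)$ [2]
Error signal at a hidden node: $e_i = o_i\, (1 - o_i) \sum_k w_{ik}\, e_k$ [2]
Weight change: $\Delta w_{ij} = k\, e_i\, o_j$ [2]
where $e_i$ = the error signal at node i, $t_i$ = the target output of node i, $o_i$ = the activation (actual output) of node i, $o_j$ = the activation of node j, i.e. one element of the input vector to node i, $\Delta w_{ij}$ = the weight change between node i and node j, $w_{ik}$ = the weight between hidden node i and output node k, $e_k$ = the error signal at output node k, and k (without a subscript role) = the learning rate. Note that the subscript k in $w_{ik}$ and $e_k$ indexes output nodes and is distinct from the learning rate k.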
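The three rules above can be sketched as a single training step for a tiny feedforward network. This is an illustrative sketch under stated assumptions, not the implementation used in the cited work: the 2-2-1 layout, the starting weights, the learning rate of 0.5, the absence of bias terms, and all function names are chosen here for the example. In the code, `w_hidden[i][j]` holds the weight between input j and hidden node i, and `w_out[k][i]` holds the weight between hidden node i and output node k.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_hidden, w_out):
    """Return (hidden activations, output activations) for input x."""
    hidden = [logistic(sum(w * o for w, o in zip(row, x))) for row in w_hidden]
    output = [logistic(sum(w * o for w, o in zip(row, hidden))) for row in w_out]
    return hidden, output


def train_step(x, target, w_hidden, w_out, rate):
    """Apply one delta-rule update in place; return the pre-update output."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at each output node: (t - o) * o * (1 - o).
    e_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
    # Error signal at each hidden node, using the weights as they were
    # before this update: o * (1 - o) * sum over output nodes of w * e.
    e_hid = [
        h * (1.0 - h) * sum(w_out[k][i] * e_out[k] for k in range(len(w_out)))
        for i, h in enumerate(hidden)
    ]
    # Weight change: rate * error signal of receiver * activation of sender.
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += rate * e_out[k] * hidden[i]
    for i, row in enumerate(w_hidden):
        for j in range(len(row)):
            row[j] += rate * e_hid[i] * x[j]
    return output


# Illustrative starting weights (arbitrary, not taken from the source).
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2, -0.1]]
x, target = [1.0, 0.0], [0.9]

before = train_step(x, target, w_hidden, w_out, rate=0.5)[0]
after = forward(x, w_hidden, w_out)[1][0]
print(f"output before: {before:.4f}, after one step: {after:.4f}")
```

A single step moves the output only slightly towards the target, which is why the text describes learning as incremental. Weights attached to an input whose activation is zero are left unchanged, because the weight change is proportional to the sender's activation.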
Catastrophic interference arises when learning is sequential. Sequential training involves the network learning one set of input-output patterns until the error is reduced below a specific criterion, then training the network on another set of input-output patterns. Specifically, a backpropagation network will forget information if it first learns input A and then learns input B. Catastrophic interference is not seen when learning is concurrent or interleaved. Interleaved training means the network learns both sets of input-output patterns at the same time, i.e. as AB. Weights are only changed when the network is being trained and not when the network is being tested on its response.
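The contrast between sequential and interleaved training can be illustrated with a deliberately simplified sketch: a single logistic unit with three input weights, trained with the delta rule, rather than a full three-layer backpropagation network. The two patterns, the targets of 0.9 and 0.1, the learning rate, and the number of epochs are all hypothetical values chosen so that the patterns share one input line.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def output(weights, x):
    return logistic(sum(w * o for w, o in zip(weights, x)))


def train(weights, patterns, rate=0.5, epochs=5000):
    """Delta-rule training on each (input, target) pair in turn, per epoch.
    Updates the weight list in place and returns it."""
    for _ in range(epochs):
        for x, t in patterns:
            o = output(weights, x)
            e = (t - o) * o * (1.0 - o)
            for j in range(len(weights)):
                weights[j] += rate * e * x[j]
    return weights


# Two hypothetical patterns that share the middle input line.
A = ([1.0, 1.0, 0.0], 0.9)
B = ([0.0, 1.0, 1.0], 0.1)

# Sequential: learn A first, then train on B alone.
seq = train([0.0, 0.0, 0.0], [A])
a_after_a = output(seq, A[0])
train(seq, [B])
a_after_b = output(seq, A[0])

# Interleaved: A and B presented in alternation.
mix = train([0.0, 0.0, 0.0], [A, B])

print(f"sequential:  A after A = {a_after_a:.3f}, A after B = {a_after_b:.3f}")
print(f"interleaved: A = {output(mix, A[0]):.3f}, B = {output(mix, B[0]):.3f}")
```

In the sequential run, training on B drags down the shared weight, so the response to A drifts well away from its target even though A was never presented again. In the interleaved run the unit settles on weights that satisfy both patterns. A unit this small shows partial rather than total forgetting, so it should be read as an illustration of the mechanism and not as a reproduction of the published simulations.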
To summarize, backpropagation networks:
The term catastrophic interference was originally coined by McCloskey and Cohen (1989) but was also brought to the attention of the scientific community by research from Ratcliff (1990) [2].
McCloskey and Cohen (1989) noted the problem of catastrophic interference during two different experiments with backpropagation neural network modelling.
McCloskey and Cohen tried to reduce interference through a number of manipulations, including changing the number of hidden units, changing the value of the learning rate parameter, overtraining on the A-B list, freezing certain connection weights, and using target values of 0 and 1 instead of 0.1 and 0.9. However, none of these manipulations satisfactorily reduced the catastrophic interference exhibited by the networks.
Overall, McCloskey and Cohen (1989) concluded that:
Ratcliff (1990) used multiple sets of backpropagation models applied to standard recognition memory procedures, in which the items were sequentially learned. [2] After inspecting the recognition performance models he found two major problems:
Many researchers have suggested that the main cause of catastrophic interference is overlap in the representations at the hidden layer of distributed neural networks. [11] [12] [13] In a distributed representation, any given input will tend to create changes in the weights to many of the nodes. Catastrophic forgetting occurs because, when many of the weights where ‘knowledge is stored’ are changed, it is impossible for prior knowledge to be kept intact. During sequential learning, the inputs become mixed, with the new input being superimposed on top of the old input. [12] Another way to conceptualize this is through visualizing learning as movement through a weight space. [14] This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network can possess. When a network first learns to represent a set of patterns, it has found a point in weight space which allows it to recognize all of the patterns that it has seen. [13] However, when the network learns a new set of patterns sequentially, it will move to a place in the weight space that allows it to recognize only the new patterns. [13] To recognize both sets of patterns, the network must find a place in weight space that can represent both the new and the old output. One way to do this is by connecting a hidden unit to only a subset of the input units. This reduces the likelihood that two different inputs will be encoded by the same hidden units and weights, and so decreases the chance of interference. [12] Indeed, a number of the proposed solutions to catastrophic interference involve reducing the amount of overlap that occurs when storing information in these weights.
Many of the early techniques for reducing representational overlap involved making either the input vectors or the hidden unit activation patterns orthogonal to one another. Lewandowsky and Li (1995) [15] noted that the interference between sequentially learned patterns is minimized if the input vectors are orthogonal to each other. Two input vectors are said to be orthogonal if the pairwise products of their elements sum to zero. For example, the patterns [0,0,1,0] and [0,1,0,0] are orthogonal because (0x0 + 0x1 + 1x0 + 0x0) = 0. One of the techniques which can create orthogonal representations at the hidden layers involves bipolar feature coding (i.e., coding using -1 and 1 rather than 0 and 1) [13]. Orthogonal patterns tend to produce less interference with each other. However, not all learning problems can be represented using these types of vectors, and some studies report that the degree of interference is still problematic with orthogonal vectors. [2] Simple techniques such as varying the learning rate parameters in the backpropagation equation were not successful in reducing interference. Varying the number of hidden nodes has also been used to try to reduce interference. However, the findings have been mixed, with some studies finding that more hidden units decrease interference [16] and other studies finding that they do not. [1] [2]
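The orthogonality test described above is a dot product, which can be sketched as follows. The helper names `dot` and `are_orthogonal` and the bipolar and overlapping example patterns are illustrative choices made here; only the first pair of vectors comes from the text.

```python
def dot(u, v):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def are_orthogonal(u, v):
    return dot(u, v) == 0


# The pair from the text.
p, q = [0, 0, 1, 0], [0, 1, 0, 0]
print(dot(p, q), are_orthogonal(p, q))

# Bipolar (-1/+1) patterns can be orthogonal even though every
# element is active in both; these two are hypothetical examples.
r, s = [1, 1, -1, -1], [1, -1, 1, -1]
print(dot(r, s), are_orthogonal(r, s))

# Binary patterns that share an active element are not orthogonal.
print(dot([1, 1, 0, 0], [0, 1, 1, 0]))
```

One way to see why orthogonality helps, at least for a single node that receives the input patterns directly: the delta rule changes each weight in proportion to the sender's activation, so training on one pattern shifts the node's net input for another pattern by an amount proportional to the dot product of the two patterns. When that dot product is zero, the shift is zero.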
Below are a number of techniques which have empirical support for successfully reducing catastrophic interference in backpropagation neural networks:
$A_{new} = A_{old} + \alpha(1 - A_{old})$ for nodes to be sharpened, i.e. more activated
$A_{new} = A_{old} - \alpha A_{old}$ for all other nodes
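The two activation update rules above can be sketched as one function. This sketch assumes that the nodes to be sharpened are the `n_sharpen` most active ones and that α lies between 0 and 1; both the selection rule as coded and the names `sharpen` and `n_sharpen` are assumptions made for illustration.

```python
def sharpen(activations, n_sharpen, alpha):
    """Return new activations after one sharpening pass.

    The n_sharpen most active nodes move towards 1;
    all other nodes move towards 0.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    chosen = set(ranked[:n_sharpen])
    return [
        a + alpha * (1.0 - a) if i in chosen else a - alpha * a
        for i, a in enumerate(activations)
    ]


# Hypothetical hidden-layer activations.
hidden = [0.2, 0.8, 0.5]
sharpened = sharpen(hidden, n_sharpen=1, alpha=0.5)
print(sharpened)
```

With these values the most active node moves from 0.8 towards 1 while the other two are halved, so fewer nodes carry substantial activation for the pattern.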
$\Delta w_{ij} = k\, \delta_i\, d_i$
where $\Delta w_{ij}$ = the weight change between nodes i and j, k = the learning rate, $\delta_i$ = the error signal, and $d_i$ = the novelty vector