According to [3], a lower learning rate is needed when either the visible or the hidden layer is linear or rectified linear.
• lrW linear: Learning rate for the weights when either the hidden or the visible layer is linear or rectified linear (default = 0.001)
• lrVb linear: Learning rate for the visible biases when either the hidden or the visible layer is linear or rectified linear (default = 0.001)
• lrHb linear: Learning rate for the hidden biases when either the hidden or the visible layer is linear or rectified linear (default = 0.001)
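As a minimal sketch of how these parameters might be selected during training, the snippet below switches to the lower "linear" rates when any layer is linear or rectified linear. The underscored key names, the plain lrW/lrVb/lrHb counterparts, and the layer-type strings are assumptions for illustration, not the toolbox's actual identifiers.

```python
# Hypothetical helper: pick the standard or the lower "linear" learning rates.
def select_learning_rates(visible_type, hidden_type, params):
    """Return learning rates for (weights, visible biases, hidden biases)."""
    if "linear" in (visible_type, hidden_type) or "relu" in (visible_type, hidden_type):
        # lower rates recommended in [3] for linear / rectified linear units
        # (key names with underscores are assumed, not taken from the toolbox)
        return params["lrW_linear"], params["lrVb_linear"], params["lrHb_linear"]
    return params["lrW"], params["lrVb"], params["lrHb"]
```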
According to [3], a good range for the L2 weight decay coefficient is between 0.01 and 0.00001.
• weightPenaltyL2: L2 weight decay coefficient (default = 0.0002)
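As a rough illustration of where this coefficient acts, L2 weight decay typically enters the contrastive divergence weight gradient as an extra shrinkage term. The function and argument names below are hypothetical and only meant to show the role of weightPenaltyL2.

```python
# Illustrative sketch only: how an L2 weight decay coefficient usually enters
# the CD weight gradient. pos_assoc / neg_assoc are the data-driven and
# reconstruction-driven statistics (hypothetical names).
def weight_gradient(pos_assoc, neg_assoc, W, weightPenaltyL2=0.0002, batch_size=1):
    # (positive - negative) associations averaged over the minibatch,
    # minus the L2 penalty that pulls the weights toward zero
    return (pos_assoc - neg_assoc) / batch_size - weightPenaltyL2 * W
```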
• initMomentum: momentum value used while epoch ≤ momentumEpochThres (default = 0.5)
• finalMomentum: momentum value used while epoch > momentumEpochThres (default = 0.9)
• momentumEpochThres: epoch number after which the final momentum is used; for epochs 1 to momentumEpochThres the initial momentum is used (default = 5)
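The schedule defined by these three parameters can be summarised in a few lines; this is only a sketch using the defaults listed above.

```python
# Momentum schedule as described by the three parameters above.
def momentum_for_epoch(epoch, initMomentum=0.5, finalMomentum=0.9,
                       momentumEpochThres=5):
    # epochs 1..momentumEpochThres use the initial momentum,
    # later epochs switch to the final momentum
    return initMomentum if epoch <= momentumEpochThres else finalMomentum
```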
• type: training type of contrastive divergence (default = 1)
– 1 is what Hinton suggests in [3], i.e., “When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling. Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions. When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states.”
– 2 is consistent with theory. Check trainRBM for more details on
how these approaches are implemented.
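The difference between the two settings can be sketched for a binary-binary RBM as follows. The type-1 branch follows the quoted recipe from [3]; the type-2 branch assumes that “consistent with theory” means sampling binary states throughout the Gibbs step, which should be verified against trainRBM. All function names here are illustrative, not the toolbox's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # draw stochastic binary states from the given probabilities
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_step(v_data, W, vb, hb, cd_type=1):
    """One CD-1 step for a binary-binary RBM (illustrative sketch only)."""
    # hidden units driven by the data: stochastic binary states in both variants
    h_prob = sigmoid(v_data @ W + hb)
    h_state = sample_bernoulli(h_prob)

    v_prob = sigmoid(h_state @ W.T + vb)
    if cd_type == 1:
        # type 1 (Hinton [3]): probabilities for the reconstruction and for the
        # learning statistics, no sampling when driven by reconstructions
        v_recon = v_prob
        h_recon = sigmoid(v_recon @ W + hb)
        pos, neg = v_data.T @ h_prob, v_recon.T @ h_recon
    else:
        # type 2 (assumed reading of "consistent with theory"): keep sampling
        # binary states throughout the chain; see trainRBM for the exact rule
        v_recon = sample_bernoulli(v_prob)
        h_recon = sample_bernoulli(sigmoid(v_recon @ W + hb))
        pos, neg = v_data.T @ h_state, v_recon.T @ h_recon
    return pos, neg, v_recon
```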