I am curious: why did you initialize from a truncated normal with stddev = sqrt(2 / <input_size>)? Why not just use a truncated normal with, say, stddev = 0.1 for all layers?
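For context, a stddev of sqrt(2 / fan_in) is the He-style scaling commonly used with ReLU layers: it keeps the activation variance roughly constant as the layer width changes, whereas a fixed stddev of 0.1 is only well matched to one particular fan-in. A minimal sketch of the two choices (TF 1.x API and a 784-unit input assumed for illustration):

import numpy as np
import tensorflow as tf

fan_in = 784  # e.g. image_size * image_size for 28x28 inputs
# He-style scaling: stddev tied to the fan-in of the layer
he_init = tf.truncated_normal([fan_in, 1024], stddev=np.sqrt(2.0 / fan_in))
# the fixed alternative asked about: the same stddev regardless of layer width
fixed_init = tf.truncated_normal([fan_in, 1024], stddev=0.1)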
Help! When I run the following code, my loss function diverges. Can someone please explain why?
batch_size = 128
# regularisation parameter
beta = 0.001

# neural network with 2 hidden layers
hidden_nodes1 = 1024
hidden_nodes2 = 512
keep_prob = 0.5  # probability of dropout
initial_learning_rate = 0.5

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    hidden_weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    hidden_biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights1) + hidden_biases1)
    hidden_layer_drop1 = tf.nn.dropout(hidden_layer1, keep_prob)  # dropout added

    hidden_weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2]))
    hidden_biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer_drop1, hidden_weights2) + hidden_biases2)
    hidden_layer_drop2 = tf.nn.dropout(hidden_layer2, keep_prob)  # dropout added

    weights = tf.Variable(tf.truncated_normal([hidden_nodes2, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    logits = tf.matmul(hidden_layer_drop2, weights) + biases
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    loss = loss + beta * tf.nn.l2_loss(weights)

    # Optimizer. The learning rate decays with the number of steps taken.
    global_step = tf.Variable(0)  # count the number of steps taken
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               100000, 0.95, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_relu1 = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights1) + hidden_biases1)
    valid_relu2 = tf.nn.relu(tf.matmul(valid_relu1, hidden_weights2) + hidden_biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(valid_relu2, weights) + biases)
    test_relu1 = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights1) + hidden_biases1)
    test_relu2 = tf.nn.relu(tf.matmul(test_relu1, hidden_weights2) + hidden_biases2)
    test_prediction = tf.nn.softmax(tf.matmul(test_relu2, weights) + biases)
@zhuanquan I would assume the loss is being computed incorrectly somewhere. If I drop your initial learning rate by an order of magnitude or more, it begins to minimize, but it does not converge for me either when I start at 0.5.
@zhuanquan I would suggest initializing your weight variables with a standard deviation between 0.1 and 0.2, i.e. weights = tf.Variable(tf.truncated_normal([size], stddev=stdvalue)).
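A sketch that combines the two suggestions above, applied to the variable names from the question (the exact stddev and learning rate here are illustrative, not tuned values):

initial_learning_rate = 0.05  # roughly an order of magnitude below 0.5
hidden_weights1 = tf.Variable(
    tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
hidden_weights2 = tf.Variable(
    tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
weights = tf.Variable(
    tf.truncated_normal([hidden_nodes2, num_labels], stddev=0.1))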
Why do you use np.random.choice(np.arange(5)) instead of just np.random.choice(5)? I am looking at the docs (https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html) and wondering what I am missing. Are they the same or slightly different? Or is it just written this way for clarity?
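For what it's worth, the two calls should be interchangeable here: when np.random.choice is given an int n, it samples as if it had been given np.arange(n). A quick check (NumPy assumed):

import numpy as np

np.random.seed(0)
a = np.random.choice(np.arange(5))  # samples uniformly from [0, 1, 2, 3, 4]
np.random.seed(0)
b = np.random.choice(5)             # an int n is treated like np.arange(n)
assert a == b                       # same draw given the same seed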
The problem asks for restricting the training data, which is achieved with offset = batch_size * np.random.choice(np.arange(5)). The number of steps does not need to be reduced to show the overfitting.
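To illustrate how that offset restricts the data (the names train_dataset and train_labels are assumed from the assignment's training loop): the minibatch can only start at one of five positions, so training keeps reusing the same five batches.

offset = batch_size * np.random.choice(np.arange(5))  # offset in {0, 128, 256, 384, 512}
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]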