# Getting Started in TensorFlow

## A look at a very simple neural network in TensorFlow

This is an introduction to working with TensorFlow. It works through an example of a very simple neural network, walking through the steps of setting up the input, adding operators, setting up gradient descent, and running the computation graph.

## A simple neural network

Let’s start with code. We’re going to construct a very simple neural network that computes a linear regression between two variables, y and x. It tries to find the best $w_1$ and $w_2$ for the function $y = w_2 x + w_1$ given the data. The data we’re going to give it is toy data: a linear relationship perturbed with random noise.

This is what the network looks like:

(Figure: diagram of the simple network.)

Here is the TensorFlow code for this simple neural network and the results of running this code:

#@test {"output": "ignore"}
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# Set up the data with a noisy linear relationship between X and Y.
num_examples = 50
X = np.array([np.linspace(-2, 4, num_examples), np.linspace(-6, 6, num_examples)])
X += np.random.randn(2, num_examples)
x, y = X
x_with_bias = np.array([(1., a) for a in x]).astype(np.float32)

losses = []
training_steps = 50
learning_rate = 0.002

with tf.Session() as sess:
    # Set up all the tensors, variables, and operations.
    input = tf.constant(x_with_bias)
    target = tf.constant(np.transpose([y]).astype(np.float32))
    weights = tf.Variable(tf.random_normal([2, 1], 0, 0.1))

    tf.initialize_all_variables().run()

    yhat = tf.matmul(input, weights)
    yerror = tf.sub(yhat, target)
    loss = tf.nn.l2_loss(yerror)

    # The gradient descent operation: each run moves weights toward lower loss.
    update_weights = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    for _ in range(training_steps):
        # Repeatedly run the operations, updating the TensorFlow variable.
        update_weights.run()
        losses.append(loss.eval())

    # Training is done, get the final values for the graphs
    betas = weights.eval()
    yhat = yhat.eval()

# Show the fit and the loss over time.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 4)
ax1.scatter(x, y, alpha=.7)
ax1.scatter(x, np.transpose(yhat), c="g", alpha=.6)
line_x_range = (-4, 6)
ax1.plot(line_x_range, [betas[0] + a * betas[1] for a in line_x_range], "g", alpha=0.6)
ax2.plot(range(0, training_steps), losses)
ax2.set_ylabel("Loss")
ax2.set_xlabel("Training steps")
plt.show()

In the remainder of this notebook, we’ll go through this example in more detail.

## From the beginning

Let’s walk through exactly what this is doing from the beginning. We’ll start with what the data looks like, then we’ll look at this neural network, what is executed when, what gradient descent is doing, and how it all works together.

## The data

This is a toy data set. We have 50 (x, y) data points. At first, the data is perfectly linear.

#@test {"output": "ignore"}
num_examples = 50
X = np.array([np.linspace(-2, 4, num_examples), np.linspace(-6, 6, num_examples)])
plt.figure(figsize=(4,4))
plt.scatter(X[0], X[1])
plt.show()

Then we perturb it with noise:

#@test {"output": "ignore"}
X += np.random.randn(2, num_examples)
plt.figure(figsize=(4,4))
plt.scatter(X[0], X[1])
plt.show()

## What we want to do

What we’re trying to do is calculate the green line below:

#@test {"output": "ignore"}
# np.polyfit returns the coefficients highest degree first: [slope, intercept].
weights = np.polyfit(X[0], X[1], 1)
plt.figure(figsize=(4,4))
plt.scatter(X[0], X[1])
line_x_range = (-3, 5)
plt.plot(line_x_range, [weights[1] + a * weights[0] for a in line_x_range], "g", alpha=0.8)
plt.show()

Remember that our simple network looks like this:

(Figure: diagram of the simple network.)

That’s equivalent to the function $\hat{y} = w_2 x + w_1$. What we’re trying to do is find the “best” weights $w_1$ and $w_2$. That will give us the green regression line above.
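As a quick check of that equivalence, the sketch below multiplies a few bias-augmented inputs by a weight column and compares the result with $w_2 x + w_1$ computed directly. It uses plain NumPy rather than TensorFlow, and the three x values and the weights 0.5 and 3.0 are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical inputs, not the notebook's data.
x = np.array([-1.0, 0.0, 2.0])

# Pair a constant 1 with each x, mirroring x_with_bias in the notebook code.
x_with_bias = np.array([(1.0, a) for a in x])

# Hypothetical weights: w1 (bias weight) = 0.5, w2 (slope) = 3.0.
weights = np.array([[0.5], [3.0]])

# Matrix form: each row computes 1 * w1 + x * w2.
yhat_matrix = x_with_bias @ weights      # shape (3, 1)

# Scalar formula applied elementwise.
yhat_formula = 0.5 + 3.0 * x             # shape (3,)

print(yhat_matrix.ravel())
print(yhat_formula)
```

Both prints show the same three numbers, which is why a single matrix multiply can stand in for the whole network.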

What are the best weights? They’re the weights that minimize the difference between our estimate $\hat{y}$ and the actual y. Specifically, we want to minimize the sum of the squared errors, so minimize $\sum{(\hat{y} - y)^2}$, which is known as the L2 loss. So, the best weights are the weights that minimize the L2 loss.
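To make the idea of “best” weights concrete, here is a minimal sketch that scores two candidate lines on a tiny hand-made data set. The data points and the helper name sum_squared_error are illustrative assumptions, not part of the notebook’s code.

```python
# Hypothetical toy data lying roughly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]


def sum_squared_error(w1, w2, xs, ys):
    """Sum of squared differences between y_hat = w2 * x + w1 and y."""
    return sum((w2 * x + w1 - y) ** 2 for x, y in zip(xs, ys))


good_fit = sum_squared_error(1.0, 2.0, xs, ys)  # close to the underlying line
poor_fit = sum_squared_error(0.0, 0.5, xs, ys)  # a line that is far too flat

print(good_fit, poor_fit)
```

The candidate closer to the underlying line gets a much smaller score, and that score is the quantity gradient descent will push down.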

Gradient descent starts with random weights for $\hat{y} = w_2 x + w_1$ and gradually moves those weights toward better values.

It does that by following the downward slope of the error surface. Imagine the possible errors we could get with different weights as a landscape. From whatever weights we have, moving in some directions will increase the error, like going uphill, and moving in others will decrease the error, like going downhill. We want to roll downhill, always moving the weights toward lower error.

How does gradient descent know which way is downhill? It follows the partial derivatives of the L2 loss. The partial derivative is like a velocity: it says how the error will change if we change a weight. We want to move in the direction of lower error, so we move each weight opposite to the sign of its partial derivative.
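The sign of a partial derivative can be seen numerically. The sketch below estimates the derivative of the loss with respect to $w_2$ by nudging $w_2$ slightly in each direction (a central finite difference). The data, the function names, and the step size eps are assumptions made for illustration. TensorFlow computes these derivatives from the graph rather than by nudging.

```python
# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def loss(w1, w2):
    """Half the sum of squared errors for y_hat = w2 * x + w1."""
    return 0.5 * sum((w2 * x + w1 - y) ** 2 for x, y in zip(xs, ys))


def partial_w2(w1, w2, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(w2)."""
    return (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)


# Slope too small (0 instead of 2): the derivative is negative,
# so increasing w2 lowers the loss.
slope_too_small = partial_w2(1.0, 0.0)

# Slope too large (4 instead of 2): the derivative is positive,
# so decreasing w2 lowers the loss.
slope_too_large = partial_w2(1.0, 4.0)

print(slope_too_small, slope_too_large)
```

In both cases, moving the weight against the sign of the derivative heads toward the true slope of 2.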

So, what gradient descent does is start with random weights and gradually walk those weights toward lower error, using the partial derivatives to know which direction to go.

## The code again

Let’s go back to the code now, walking through it with many more comments in the code this time:

#@test {"output": "ignore"}
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Set up the data with a noisy linear relationship between X and Y.
num_examples = 50
X = np.array([np.linspace(-2, 4, num_examples), np.linspace(-6, 6, num_examples)])
# Add random noise (gaussian, mean 0, stdev 1)
X += np.random.randn(2, num_examples)
# Split into x and y
x, y = X
# Add the bias node which always has a value of 1
x_with_bias = np.array([(1., a) for a in x]).astype(np.float32)

# Keep track of the loss at each iteration so we can chart it later
losses = []
# How many iterations to run our training
training_steps = 50
# The learning rate, also known as the step size. This changes how far
# we move down the gradient toward lower error at each step. Steps that are
# too large risk overshooting; steps that are too small slow the learning.
learning_rate = 0.002

# In TensorFlow, we need to run everything in the context of a session.
with tf.Session() as sess:
    # Set up all the tensors.
    # Our input layer is the x value and the bias node.
    input = tf.constant(x_with_bias)
    # Our target is the y values. They need to be massaged to the right shape.
    target = tf.constant(np.transpose([y]).astype(np.float32))
    # Weights are a variable. They change every time through the loop.
    # Weights are initialized to random values (gaussian, mean 0, stdev 0.1)
    weights = tf.Variable(tf.random_normal([2, 1], 0, 0.1))

    # Initialize all the variables defined above.
    tf.initialize_all_variables().run()

    # Set up all operations that will run in the loop.
    # For all x values, generate our estimate on all y given our current
    # weights. So, this is computing y = w2 * x + w1 * bias
    yhat = tf.matmul(input, weights)
    # Compute the error, which is just the difference between our
    # estimate of y and what y actually is.
    yerror = tf.sub(yhat, target)
    # We are going to minimize the L2 loss. The L2 loss is half the sum of
    # the squared error for all our estimates of y. This penalizes large
    # errors a lot, but small errors only a little.
    loss = tf.nn.l2_loss(yerror)

    # Perform gradient descent.
    # This essentially just updates weights, like weights -= grads * learning_rate
    # using the partial derivative of the loss with respect to the
    # weights. Stepping against the gradient moves us toward lower error.
    update_weights = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # At this point, we've defined all our tensors and run our initialization
    # operations. We've also set up the operations that will repeatedly be run
    # inside the training loop. All the training loop is going to do is
    # repeatedly call run, inducing the gradient descent operation, which has
    # the effect of repeatedly changing weights by a small amount in the
    # direction (given by the partial derivatives, or gradient) that will
    # reduce the error (the L2 loss).
    for _ in range(training_steps):
        # Repeatedly run the operations, updating the TensorFlow variable.
        sess.run(update_weights)

        # Here, we're keeping a history of the losses to plot later
        # so we can see the change in loss as training progresses.
        losses.append(loss.eval())

    # Training is done, get the final values for the charts
    betas = weights.eval()
    yhat = yhat.eval()

# Show the results.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 4)
ax1.scatter(x, y, alpha=.7)
ax1.scatter(x, np.transpose(yhat), c="g", alpha=.6)
line_x_range = (-4, 6)
ax1.plot(line_x_range, [betas[0] + a * betas[1] for a in line_x_range], "g", alpha=0.6)
ax2.plot(range(0, training_steps), losses)
ax2.set_ylabel("Loss")
ax2.set_xlabel("Training steps")
plt.show()

This version of the code has a lot more comments at each step. Read through the code and the comments.

The core piece is the loop, which contains a single run call. Each call to run executes update_weights, the operation built by GradientDescentOptimizer, along with every operation it depends on (yhat, yerror, loss, and the gradient computation). Running update_weights has the side effect of assigning to weights, so the variable weights changes each time through the loop.

The result is that, in each iteration of the loop, the code processes the entire input data set, generates the estimate $\hat{y}$ for each $x$ given the current weights $w_i$, finds all the errors $\hat{y} - y$ and the resulting L2 loss, and then changes the weights $w_i$ by a small amount in the direction that will reduce the L2 loss.

After many iterations of the loop, the amount we are changing the weights gets smaller and smaller, and the loss gets smaller and smaller, as we narrow in on near optimal values for the weights. By the end of the loop, we should be near the lowest possible values for the L2 loss, and near the best possible weights we could have.
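The shrinking steps can be observed directly. The following is a NumPy-only sketch of the same full-batch loop, assuming data generated the same way as in the notebook but with a fixed random seed. The names rng and step_sizes are introduced here for illustration and are not part of the TensorFlow code.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
num_examples = 50
x = np.linspace(-2, 4, num_examples) + rng.randn(num_examples)
y = np.linspace(-6, 6, num_examples) + rng.randn(num_examples)

x_with_bias = np.column_stack([np.ones(num_examples), x])  # rows of (1, x)
target = y.reshape(-1, 1)

weights = rng.normal(0, 0.1, size=(2, 1))
learning_rate = 0.002
losses = []
step_sizes = []

for _ in range(50):
    yerror = x_with_bias @ weights - target    # errors for every example
    losses.append(0.5 * np.sum(yerror ** 2))   # half the sum of squared errors
    gradient = x_with_bias.T @ yerror          # partial derivatives, shape (2, 1)
    step = learning_rate * gradient
    step_sizes.append(np.linalg.norm(step))    # how far the weights move
    weights -= step                            # move downhill

print("loss: first", losses[0], "last", losses[-1])
print("step size: first", step_sizes[0], "last", step_sizes[-1])
```

Both the loss and the size of the weight update are smaller at the end of the run than at the start, because the gradient flattens out as the weights approach the minimum.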

## The details

This code works, but there are still a few black boxes that are worth diving into here. l2_loss? GradientDescentOptimizer? What exactly are those doing?

One way to understand exactly what those are doing is to do the same thing without using those functions. Here is equivalent code that computes the L2 loss (half the sum of squared errors), the gradients (derivatives), and the gradient descent update from scratch.

#@test {"output": "ignore"}

# Use the same input data and parameters as the examples above.
# We're going to build up a list of the errors over time as we train to display later.
losses = []

with tf.Session() as sess:
    # Set up all the tensors.
    # The input is the x values, each paired with a bias value of 1.
    input = tf.constant(x_with_bias)
    # We're trying to find the best fit for the target y values.
    target = tf.constant(np.transpose([y]).astype(np.float32))
    # Let's set up the weights randomly
    weights = tf.Variable(tf.random_normal([2, 1], 0, 0.1))

    tf.initialize_all_variables().run()

    # learning_rate is the step size, so how much we jump from the current spot
    learning_rate = 0.002

    # The operations in the operation graph.
    # Compute the predicted y values given our current weights
    yhat = tf.matmul(input, weights)
    # How much does this differ from the actual y?
    yerror = tf.sub(yhat, target)
    # The L2 loss: half the sum of the squared errors
    loss = 0.5 * tf.reduce_sum(tf.mul(yerror, yerror))
    # The partial derivatives of the loss with respect to each weight
    gradient = tf.reduce_sum(tf.transpose(tf.mul(input, yerror)), 1, keep_dims=True)
    # Change the weights by subtracting derivative with respect to that weight
    update_weights = tf.assign_sub(weights, learning_rate * gradient)

    # Repeatedly run the operation graph over the training data and weights.
    for _ in range(training_steps):
        sess.run(update_weights)

        # Here, we're keeping a history of the losses to plot later
        # so we can see the change in loss as training progresses.
        losses.append(loss.eval())

    # Training is done, compute final values for the graph.
    betas = weights.eval()
    yhat = yhat.eval()

# Show the results.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 4)
ax1.scatter(x, y, alpha=.7)
ax1.scatter(x, np.transpose(yhat), c="g", alpha=.6)
line_x_range = (-4, 6)
ax1.plot(line_x_range, [betas[0] + a * betas[1] for a in line_x_range], "g", alpha=0.6)
ax2.plot(range(0, training_steps), losses)
ax2.set_ylabel("Loss")
ax2.set_xlabel("Training steps")
plt.show()

This code looks very similar to the code above, but without using l2_loss or GradientDescentOptimizer. Let’s look at exactly what it is doing instead.

This code is the key difference:

loss = 0.5 * tf.reduce_sum(tf.mul(yerror, yerror))

gradient = tf.reduce_sum(tf.transpose(tf.mul(input, yerror)), 1, keep_dims=True)

update_weights = tf.assign_sub(weights, learning_rate * gradient)

The first line calculates the L2 loss manually. It’s the same as l2_loss(yerror), which is half of the sum of the squared error, so $\frac{1}{2} \sum (\hat{y} - y)^2$. With this code, you can see exactly what the l2_loss operation does. It’s the total of all the squared differences between the target and our estimates. And minimizing the L2 loss will minimize how much our estimates of $y$ differ from the true values of $y$.
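A tiny worked example of that formula, written in plain Python so the arithmetic is easy to follow. The helper name half_sum_of_squares and the three error values are hypothetical.

```python
def half_sum_of_squares(errors):
    """Half the sum of squared errors, the quantity described above."""
    return 0.5 * sum(e * e for e in errors)


yerror = [1.0, -2.0, 3.0]            # hypothetical differences y_hat - y
loss = half_sum_of_squares(yerror)   # 0.5 * (1 + 4 + 9)

# Doubling every error quadruples the loss, so large errors dominate.
loss_doubled = half_sum_of_squares([2 * e for e in yerror])

print(loss, loss_doubled)
```

The factor of one half does not change which weights are best. It only makes the derivative tidier, since the 2 from the square cancels it.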

The second line calculates $$\begin{bmatrix}\sum{(\hat{y} - y) \cdot 1} \\ \sum{(\hat{y} - y) \cdot x_i}\end{bmatrix}$$ What is that? It’s the partial derivatives of the L2 loss with respect to $w_1$ and $w_2$, the same thing tf.gradients(loss, weights) would compute from the graph. Not sure about that? Let’s look at it in more detail. The gradient calculation gets the partial derivatives of the loss with respect to each of the weights so we can change those weights in the direction that will reduce the loss. The L2 loss is $L = \frac{1}{2} \sum (\hat{y} - y)^2$, where $\hat{y} = w_2 x + w_1$. So, using the chain rule and substituting in for $\hat{y}$ in the derivative, $\frac{\partial L}{\partial w_2} = \sum{(\hat{y} - y) \cdot x_i}$ and $\frac{\partial L}{\partial w_1} = \sum{(\hat{y} - y) \cdot 1}$. GradientDescentOptimizer does these calculations automatically for you based on the graph structure.
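The multiply, transpose, and sum in the second line can look opaque, so here is a NumPy sketch of the same steps on three made-up examples. The array values are hypothetical, and NumPy’s keepdims stands in for the keep_dims argument in the TensorFlow code.

```python
import numpy as np

# Hypothetical values: three examples, each row is (bias, x).
inputs = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
yerror = np.array([[0.5], [-1.0], [2.0]])  # y_hat - y for each example

# Same steps as the second line: broadcast multiply, transpose, sum each row.
gradient = np.sum(np.transpose(inputs * yerror), axis=1, keepdims=True)

# The same quantity written as a single matrix product.
gradient_matmul = inputs.T @ yerror

# First row: sum of the errors. Second row: sum of each error times its x.
print(gradient)
```

The first entry is the derivative with respect to $w_1$ and the second is the derivative with respect to $w_2$, matching the two sums in the text.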

The third line is equivalent to weights -= learning_rate * gradient, so it subtracts the gradient after scaling it by the learning rate (to avoid jumping too far each time, which risks moving in the wrong direction). It’s also the same thing that GradientDescentOptimizer(learning_rate).minimize(loss) does in the earlier code. Gradient descent updates the weights by subtracting the gradient scaled by the learning rate, which is exactly what assign_sub(weights, learning_rate * gradient) does.
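To see why the scaling matters, the sketch below applies the same update rule to a one-weight problem with two different learning rates. The function descend, the target value 3.0, and both learning rates are assumptions picked to make the contrast obvious. They are not values from the notebook.

```python
def descend(learning_rate, steps=10, start=0.0, best=3.0):
    """Minimize 0.5 * (w - best)**2 by repeated w -= learning_rate * gradient."""
    w = start
    for _ in range(steps):
        gradient = w - best            # derivative of 0.5 * (w - best)**2
        w -= learning_rate * gradient  # same form as the assign_sub update
    return w


w_small_steps = descend(0.5)  # the distance to the best value halves each step
w_large_steps = descend(2.5)  # each step overshoots and lands farther away

print(w_small_steps, w_large_steps)
```

With the smaller rate the weight settles next to 3.0. With the larger rate every update jumps past the minimum by more than it started with, so the weight runs away instead of converging.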

Hopefully, this other code gives you a better understanding of what the operations we used previously are actually doing. In practice, you’ll want to use those high level operators most of the time rather than calculating things yourself. For this toy example and simple network, it’s not too bad to compute and apply the gradients yourself from scratch, but things get more complicated with larger networks.