DLT Experiment 2
This experiment classifies movie reviews as positive or negative using the text of the review.
This is an example of binary, or two-class, classification, an important and widely applicable
kind of machine learning problem.
We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet
Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for
testing. The training and testing sets are balanced, meaning they contain an equal number of
positive and negative reviews.
CODE:
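A minimal sketch of the loading step, assuming the tensorflow_datasets package is installed; the variable names train_data and test_data are illustrative assumptions, not part of the original experiment.

import tensorflow as tf
import tensorflow_datasets as tfds

# Download the IMDB reviews dataset as (text, label) pairs.
# The "imdb_reviews" dataset provides the 25,000 / 25,000 train/test split.
train_data, test_data = tfds.load(
    name="imdb_reviews",
    split=["train", "test"],
    as_supervised=True)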
Let's take a moment to understand the format of the data. Each example is a sentence
representing the movie review and a corresponding label. The sentence is not pre-processed
in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1
is a positive review.
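As an illustrative check (using the train_data dataset from the loading sketch above), we can pull a first batch and print a review and its labels:

CODE:
# Take a batch of 10 (text, label) pairs from the training set.
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
print(train_examples_batch[0])  # raw, unprocessed review text
print(train_labels_batch)       # integer labels: 0 = negative, 1 = positive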
In this example, the input data consists of sentences. The labels to predict are either 0 or 1.
One way to represent the text is to convert sentences into embedding vectors. We can use a
pre-trained text embedding as the first layer, which has two advantages: we don't have to
worry about text preprocessing, and we can benefit from transfer learning.
For this example we will use a model from TensorFlow Hub called google/nnlm-en-dim50/2.
Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and
try it out on a couple of input examples. Note that the output shape of the produced
embeddings is as expected: (num_examples, embedding_dimension).
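A sketch of that layer follows, reusing train_examples_batch from the earlier snippet; the TensorFlow Hub URL below is the standard location of the google/nnlm-en-dim50/2 model.

CODE:
import tensorflow_hub as hub

# Wrap the pre-trained text embedding as a Keras layer.
# trainable=True allows the embedding weights to be fine-tuned during training.
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

# Embed three example reviews; the result has shape (3, 50),
# i.e. (num_examples, embedding_dimension).
print(hub_layer(train_examples_batch[:3]))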
Let's now build the full model, as sketched in the code after this list:
1. The first layer is the TensorFlow Hub layer, which uses the pre-trained model to map a
sentence into its embedding vector.
2. This fixed-length output vector is piped through a fully-connected (Dense) layer with
16 hidden units.
3. The last layer is densely connected with a single output node. This outputs logits: the
log-odds of the true class, according to the model.
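A sketch of the full model as a Keras Sequential stack, using the hub_layer defined above:

CODE:
# Stack the three layers described in the list above.
model = tf.keras.Sequential([
    hub_layer,                                     # 1. pre-trained embedding layer
    tf.keras.layers.Dense(16, activation='relu'),  # 2. 16 hidden units
    tf.keras.layers.Dense(1)                       # 3. single output node (logits)
])

model.summary()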
Hidden units
The above model has two intermediate or "hidden" layers, between the input and output. The
number of outputs (units, nodes, or neurons) is the dimension of the representational space
for the layer. In other words, the amount of freedom the network is allowed when learning an
internal representation.
If a model has more hidden units (a higher-dimensional representation space), and/or more
layers, then the network can learn more complex representations. However, it makes the
network more computationally expensive and may lead to learning unwanted patterns—
patterns that improve performance on training data but not on the test data. This is
called overfitting, and we'll explore it later.
A model needs a loss function and an optimizer for training. Since this is a binary
classification problem and the model outputs logits (a single-unit layer with a linear
activation), we'll use the binary_crossentropy loss function with from_logits=True.
This isn't the only choice for a loss function; you could, for instance,
choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with
probabilities—it measures the "distance" between probability distributions, or in our case,
between the ground-truth distribution and the predictions.
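A sketch of compiling the model accordingly; the choice of the Adam optimizer is an assumption for illustration, since the text only says that an optimizer is needed.

CODE:
# Configure the model for training: Adam optimizer, binary cross-entropy on logits.
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])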