Turn Data Into Models
Turn Data Into Models
Input: 5, 8, 2, 9
Process: Add the values [5 + 8 + 2 + 9] then divide by the number of inputs [4]
Output: 6
This simple set of rules for turning an input into an output is an example of an
algorithm. Algorithms have been written to perform some pretty sophisticated tasks.
But some tasks have so many rules (and exceptions) that it’s impossible to capture
them all in a hand-crafted algorithm. Swimming is a good example of a task that is
hard to encapsulate as a set of rules. You might get some advice before jumping in
the pool, but you only really figure out what works once you’re trying to keep your
head above water. Some things are learned best by experience.
What if we could train a computer in the same way? Not by tossing it into
a pool, but by letting it figure out what works to succeed at a task? But just like
learning to swim is very different from learning to speak a foreign language, the
kind of training depends on the task.
Experience Required
Imagine that every time you went to the store to pick up milk, you tracked details
of the trip in a spreadsheet. It’s a little weird, but go with it. You set up the
following columns.
Is it the weekend?
Time of day
Is it raining or not?
Distance to store
Total minutes of trip
After several trips you start getting a feel for how conditions affect how long
it’ll take. Like, rain makes the drive longer, but it also means fewer people are
shopping. Your brain makes connections between the inputs (weekend [W], time [T],
raining [R], distance [D]) and the output (minutes [M]).
But how can we get a computer to notice trends in the data so it can estimate too?
One way is the guess-and-check method. Here’s how you do it.
Step 1: Assign all of your inputs a “weight.” This is a number that represents how
strongly an input should affect the output. It’s OK to start with the same weight
for everything.
Step 2: Use the weights with your existing data (and some clever math we won’t get
into here) to estimate the minutes for a milk run. We can compare the estimate to
the historic data. It’ll be way off, but that’s OK.
Step 3: Let the computer guess a new weight for each input, making some a little
more important than others. For example, the time of day might be more important
than whether or not it’s raining.
Step 4: Rerun the calculations to check if the new weights result in a better
estimate. If so, it means the weights are a better fit, and changing in the right
direction.
Step 5: Repeat steps 3 and 4, letting the computer tweak weights until its
estimates aren’t getting any better.
At this point the computer has settled on weights for each input. If you think of
weight as how strongly an input is connected to the output, you can make a diagram
that uses line-thickness to represent the weight of a connection.
For this example it looks like the time of day has the strongest connection, but
apparently rain doesn’t make much of a difference.
This process of guess-and-check has created a model of our milk runs. And like a
model boat, we can take it to the pool to see if it floats, so to speak. That means
testing it in the real world. So for your next several milk runs, before you leave,
have the model estimate how long it’ll take. If it’s right enough times in a row,
you can confidently let it do the estimating for every future trip.
Second, not all data is the same. In our milk run example, the spreadsheet is what
we would call structured data. It is well organized, with labels on every column so
you know the significance of every cell. In contrast, unstructured data would be
something like a news article, or an unlabeled image file. The kind of data that
you have available will affect what kind of training you can do.
Third, the structured data from our spreadsheet lets computers do supervised
learning. It’s considered supervised because we can make sure every piece of input
data has a matching, expected output that we can verify. Conversely, unstructured
data is used for unsupervised learning, which is when AI tries to find connections
in the data without really knowing what it’s looking for.
Letting the computer figure out a single weight for each input is just one kind of
training regimen. But often interconnected systems are more complicated than what
1-to-1 weighting can represent. Thankfully, as you learn in the next unit, there
are other ways to train!