
cs50 Cybersecurity Lecture1-720p MBR-en

Uploaded by

rnj1230

[MUSIC PLAYING]

DAVID J. MALAN: All right.


This is CS50's Introduction to Cybersecurity.
My name is David Malan, and this week, we'll focus on securing data.
Last week, recall, we focused on accounts,
and particularly one of the mechanisms by which we
protect our accounts is generally by way of these things called passwords.
But we focused last time really on our having the responsibility
to keep these things secure.
And yet, there's another party involved whenever
you have an account with a username and a password,
and that's the server or app that is actually
storing that password in some form long-term
so that you can actually authenticate yourself--
that is, prove to this application or website
that you are who you claim to be.
Well, in the simplest form, perhaps these servers
that are storing our usernames and passwords for which we have registered
are maybe doing something very simple like this.
For instance, if a website or app has two users at the moment, at least,
Alice and Bob, and suppose for simplicity
that Alice's password is apple and Bob's password is banana,
you could imagine that website or that app simply storing
in a very simple text file these key-value pairs--
username, colon, password, new line.
Username, colon, password, new line.
And in fact, that's actually very commonly
how passwords are stored on systems, at least certain operating
systems like Linux, not necessarily as simply as this.
They often have a little more information off to the right there,
but in essence, it's the username and password.
But this wouldn't be a good thing to store the passwords exactly like this.
Why?
Well, suppose that this website or this app and its database
are somehow hacked by an adversary.
If someone gains access to that file containing these usernames
and passwords, well, at that point, they literally
have everyone's username and password.
And we talked last time about attacks like credential
stuffing whereby an adversary, once they know your username and password on one
system, they can try stuffing that username
and password into other systems, other websites,
other apps just in hopes that you are, unfortunately, using the same username
and password elsewhere as well.
So this is generally not a good thing if an adversary gets access
to everyone's usernames and passwords.
And even though, of course, in an ideal world, that would never happen,
we should probably, as the administrators,
as the creators of this website or app, we
should probably do everything we can to at least minimize
the fallout, the downsides, the damages that might result if,
and dare I say, when our database or this text file here is somehow compromised.
So how might we go about doing that here?
Rather than just storing apple and banana in clear text,
so to speak, literally in the English words themselves,
why don't we go ahead and employ a technique known as hashing?
Now if you've studied computer science before,
you might actually know this phrase in the context of hash tables and data
structures.
Well, it turns out the idea in this world of securing data is very similar,
and in fact, this is a technique that's incredibly common for solving
all sorts of problems.
Well, what do we mean by hashing in this context?
Hashing is the process of taking a password as input
and somehow converting it to a so-called hash or hash value.
Now these hash values don't look like English.
They're typically strings of text that might have letters, might have numbers,
but they're generally of some fixed length.
And in this case here, when we go about taking our password as input,
converting it somehow via an algorithm or some code that we wrote,
we want to convert it into this hash value
and then store that hash value in that database of passwords instead.
So here's a proverbial black box, and let's stipulate for the moment
that I have no idea how hashing works, but I do know that this box can do it.
So how do I think about this process?
Well generally speaking, there's going to be some input to this box.
Ultimately, I want to get some output from that box.
And what this box really represents is, in fact, a hash function.
You can think of this as a device like some kind of machine;
you can think of it like a program, some piece of software;
or you can even think about it as a mathematical function that operates
simply on numbers coming in as input.
In fact, if you're mathematically inclined,
though we won't use this syntax often, you can think of that hash function
as being represented by f, you can think of the input as being represented by x,
and you can think of the output of this process as being so-called f of x.
If you're not familiar with that notation, that's fine,
but this directly connects hashing to basic mathematics
as well that you might encounter before long.
But what we care about is passing into this black box a password
and getting out a hash, and then storing that hash and not
the password in our database or text file of usernames and passwords.
So how might we go about doing this?
Well, if I were to provide apple as an input to this hash function,
let's think about the simplest hash function possible
that doesn't output apple, but some representation of apple
that I can eventually store in that database.
So I'm going to propose very simply that maybe the simplest hash function
we can come up with-- and indeed, if you've studied computer science
or taken CS50 itself, you might recall that we
can hash our inputs based on specific letters therein.
So apple starts with A. So you know what?
A is the first letter of the English alphabet.
So I'm going to create a hash function here
pictorially that outputs 1 whenever the input happens to start with an A,
as does apple.
Meanwhile, if we pass in banana, I'm going
to have this hash function output 2 because B
is the second letter of the English alphabet.
And dot-dot-dot, we might get to cherry or other passwords as well that might
output 3 and beyond.
And you could imagine doing this for all letters of the English alphabet.
Now unfortunately, this isn't the best hash function
because it's fairly simplistic.
And in fact, I can quickly think of some other fruits like avocados
that also start with A and that would give me the same hash value.
And that's actually a characteristic we'll come back
to whereby when you hash values, there can actually be ambiguities,
potentially, whereby two inputs might actually have the same output,
and we'll consider eventually what the implications of that might be.
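
The first-letter hash function just described can be sketched in a few lines of Python. This is only an illustration of the toy function from the lecture; the name `first_letter_hash` is made up here, and it assumes the input starts with an English letter.

```python
def first_letter_hash(password: str) -> int:
    """Toy hash: position of the first letter in the English alphabet (a=1, b=2, ...)."""
    if not password:
        raise ValueError("this toy hash needs a non-empty input")
    first = password[0].lower()
    if not "a" <= first <= "z":
        raise ValueError("this toy hash only handles inputs that start with a letter")
    return ord(first) - ord("a") + 1


for word in ["apple", "banana", "cherry", "avocado"]:
    # apple -> 1, banana -> 2, cherry -> 3, avocado -> 1
    print(word, first_letter_hash(word))
```

Note that apple and avocado both produce 1, which is exactly the ambiguity mentioned above: two different inputs with the same output.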
But for now, I dare say that's a little too simplistic.
And what might be better than outputting 1 or 2 or 3
is a little something more cryptic, because that's just too helpful.
That's too much of a hint.
If I see that your hash value is 1, I at least
know that your password now clearly starts with an A, which means at best,
I can do 1/26th the amount of work to figure out what it actually is.
So we want these hashes generally to be a little weird-looking and really
unguessable and not leak any information.
So for instance, a very common older hash function for apple might actually
output this-- ..ekWXa83dhiA with some mixed uppercase and lowercase letters
therein.
Now it looks weird, you probably can't and shouldn't see any kind of pattern
in there.
There is a fancy math formula that took as input apple
and outputted as its hash value that string of text
there, but in and of itself, it doesn't really leak any information
like the number 1 or 2 or 3 would.
So we've already made an improvement.
Banana, meanwhile, would look like this.
And cherry, meanwhile, would look like that.
So notice that these values are indeed quite different.
So using this better hash function, I claim, that doesn't just
look at the first letter of the alphabet,
but looks at maybe all of the letters in the input--
C-H-E-R-R-Y in this case, we can probably come up with something more
interesting, more cryptic-looking, if you will,
like the values that we've just seen.
So let me propose now that what we should do in our database of passwords
is not store alice, apple, bob, banana, but let's instead
store the hashes of apple and banana respectively.
So instead in this password database, I'm going to store this instead.
The exact same values that we just saw coming as outputs from that black box,
but in this case now, I'm storing in my database
of passwords usernames and hash values.
Now why is this perhaps a good thing?
Well, one, if someone now attacks this server
and somehow gains access to all of these usernames and hashes, what they don't
have is an entire list of passwords.
So they can't quite as easily go about credential stuffing and figuring out
maybe if this database will give me access to my accounts somewhere else.
I'm at least creating some work for the adversary.
But at the same time, I feel like I've kind of broken the whole system
because previously, presumably, when you log into a website or app
and you type in your username and then you type in your password, what
does the website or app probably do?
Well, once that username and password are sent over the internet,
typically to that server, well, the server probably
compares what you typed in against the username in their database,
or their text file, and the server compares
what you typed in as your password against whatever
password is in their database.
But now we have a problem.
We have you typing the username and we do have the username
still in the database.
Case in point, Alice and Bob are still here.
But what we don't have is apple and banana.
We've replaced those altogether with hashes.
So even if you type in-- or Alice types in apple,
well we don't want to compare A-P-P-L-E to this because it obviously
doesn't match; and Bob's banana, we don't want to compare against this
because it's not going to match; and so forth.
So what can we do?
Well, the way authentication typically works on the server side
when using hashing is as follows.
When you first create an account or register for this website or app,
you type in, if you're Alice, Alice, Enter, and then apple, for instance,
Enter.
That username, Alice, that password, apple, are sent to the server.
But what the server does before saving the username and password
is it runs that hash function on Alice's password,
which is apple, converts it thereafter to this value, and stores Alice's
username and the hash of Alice's password only and throws away apple,
deletes it, it forgets it in memory.
What then happens next?
Well, the next time Alice tries to log into this website--
maybe the next day, a week from then, a year from then for the second or third
or more time, what happens?
Well, Alice types in Alice as her username, hopefully
apple as her password, hits Enter, those get sent to the server as usual,
and obviously the server can't just compare
username against username and password against password
because it doesn't have the password in its database, so
what can the server do?
The server can repeat the very same process,
taking Alice's password as inputted, A-P-P-L-E,
run it through the exact same hash function a day, a week, a year later,
and then compare that resulting hash value to whatever is stored in this
text file or database.
And now admittedly, we're creating a whole lot more work for ourselves,
but it's not that big a deal because this is just a math function,
or if you know how to program, it's just a few lines of code
that you've written in software that converts passwords to hash values.
And honestly, nowadays, you wouldn't even be writing most of this code
yourself, you'd be using a library, third-party code that someone
else smarter than you, maybe, has written and gotten it just right,
no bugs or mistakes, so you're just relying on someone else's code
anyway to achieve this goal.
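
The register-then-login flow described above might look roughly like the following sketch. Everything here is an assumption for illustration: the lecture does not name a hash function at this point, so SHA-256 from Python's standard `hashlib` stands in for the black box, and `password_db`, `register`, and `login` are hypothetical names. A fast, unsalted hash like this is not what a real system should use for passwords; it only shows the shape of the process.

```python
import hashlib
import hmac

# Hypothetical in-memory stand-in for the server's text file or database.
password_db: dict[str, str] = {}


def hash_password(password: str) -> str:
    # Assumption: SHA-256 plays the role of the lecture's black-box hash function.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def register(username: str, password: str) -> None:
    # Store only the hash; the plaintext password is never saved.
    password_db[username] = hash_password(password)


def login(username: str, password: str) -> bool:
    stored = password_db.get(username)
    if stored is None:
        return False
    # Hash what was typed in and compare it to the stored hash.
    return hmac.compare_digest(stored, hash_password(password))


register("alice", "apple")
register("bob", "banana")
print(login("alice", "apple"))   # True
print(login("alice", "banana"))  # False
```

The server never compares passwords directly; it repeats the hashing step at login and compares hash values instead.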
But the upside now, to be clear, is if this file is compromised somehow,
the server's hacked into and this data is leaked,
at least they only know the usernames on your system, not the actual passwords.
And let me pause here and see if there's any questions on this technique
of hashing for passwords specifically.
STUDENT: You said yourself, we are using libraries
more often than writing the hash functions ourselves if we are not
taking the course on CS50.
So then it's easy to hack these hashes, right?
Because we can go through 10, 40, I don't
know, hash functions that are available in the libraries,
and then you can reverse the hash results, is that right?
DAVID J. MALAN: Almost.
You can do exactly what you described first whereby
you use the same library, the same code, to create hash values to then compare
those against what's in the database, but generally, these hashes
are not reversible, per se.
You can compare them, but you can't reverse the process
for reasons we'll come back to.
But your intuition is right.
And so really, the takeaway here is that we
haven't made our system absolutely secure,
we've made it relatively more secure.
Why?
Because we've increased the cost to the adversary, to the hacker.
They now have to do more work to figure out what the actual passwords are
if they want to benefit from this hack.
So again, it just raises the bar, it does not
keep the adversary necessarily out or even
stop them from figuring out one person's password,
but it might take them a lot more time, it
might take them a lot more resources like server or cloud costs or money,
or it might even heighten the risk before they actually are successful.
How about one other question here on hashing?
STUDENT: If the password is intercepted before--
after the website is hacked and the password
is intercepted before it's encrypted, so wouldn't that pose a problem?
DAVID J. MALAN: Yes, absolutely.
Then all bets are off.
Everything we just discussed is not useful
at all if the adversary has actually intercepted the password
before it has even been hashed.
Now thankfully, there's going to be solutions to that problem, too,
and we'll come to them today, but for now, focusing only on hashes,
it solves one problem but not all.
In fact, it turns out that those attacks we talked about last time with respect
to our accounts are still possible.
You can still use a dictionary, for instance, of English words,
or better yet, a dictionary of English fruits,
and you could, one fruit at a time, run each of those values as input
into the same hash function, the library or code
that you're using to achieve this, and then
that's going to give you one hash value after another.
And you could compare each of those hash values
against whatever is in the database or the file of passwords
that you, the hacker in this story, might have actually stolen somehow.
You have to do more work though, because it's
no longer as simple as just comparing apple against apple and banana
against banana.
You actually have to do some work.
You have to do some computational work.
And if the file is only a few values, of course, not a big deal.
If it's thousands or millions of rows, it might actually take a lot more
time, energy, and effort.
So again, we're just raising the bar, but not keeping the adversary
out altogether.
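
A dictionary attack against stolen hashes can be sketched as follows. This assumes, purely for illustration, that the server used unsalted SHA-256 and that the adversary knows it; the `stolen` table and the word list are made-up data.

```python
import hashlib


def hash_password(password: str) -> str:
    # Assumption: the same (hypothetical) hash function the server used.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Made-up stolen file: usernames mapped to hash values, no passwords.
stolen = {
    "alice": hash_password("apple"),
    "bob": hash_password("banana"),
}

# A small "dictionary of English fruits".
fruits = ["apple", "avocado", "banana", "cherry", "grape"]


def dictionary_attack(stolen_hashes: dict[str, str], words: list[str]) -> dict[str, str]:
    cracked = {}
    for word in words:
        candidate = hash_password(word)  # the computational work per guess
        for user, stored in stolen_hashes.items():
            if stored == candidate:
                cracked[user] = word
    return cracked


print(dictionary_attack(stolen, fruits))  # {'alice': 'apple', 'bob': 'banana'}
```

The adversary has to hash every candidate word, which is the extra work the lecture refers to; with a short word list and weak passwords, that work is still small.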
And even if you don't have a dictionary available,
and even if the passwords are not all fruits in English,
well, you can still, as the adversary, resort to brute-force attacks.
And you can try even the simplest of passwords like 0000 or maybe eight
0's instead, and you can hash that and see what the resulting hash value is
and compare that against what's in the database.
Then you can try 00000001, hash that, compare that against
what's in the database, and then move on to the next and the next,
doing this not just for numbers, but for letters as well.
A, A, A, A, A, A, A, A, A, hash that and compare.
Eventually, apple will be on that list.
Eventually, banana will be on that list.
But there, too, the brute force attack is still
going to take some amount of time.
So it's just increasing the cost or the complexity
for the adversary in this particular case.
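
A brute-force attack, in which every possible string is tried in order, can be sketched like this. As before, unsalted SHA-256 is only an assumed stand-in for the server's hash function, and the target password `0042` is made up so the example finishes quickly.

```python
import hashlib
import itertools
import string
from typing import Optional


def hash_password(password: str) -> str:
    # Assumption: the same (hypothetical) hash function the server used.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def brute_force(target_hash: str, alphabet: str, max_length: int) -> Optional[str]:
    # Try every string over the alphabet, shortest first: 0, 1, ..., 00, 01, ...
    for length in range(1, max_length + 1):
        for candidate in itertools.product(alphabet, repeat=length):
            guess = "".join(candidate)
            if hash_password(guess) == target_hash:
                return guess
    return None


target = hash_password("0042")
print(brute_force(target, string.digits, 4))  # 0042
```

Each extra character, or each larger alphabet, multiplies the number of candidates, which is why longer passwords raise the cost of this attack.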
But there's yet another threat that's possible in the context now
of the hashes, which is worth knowing about.
There's a term of art known as a rainbow table, which
is a very beautiful way of saying that adversaries in advance
might have already hashed all possible English words in a dictionary.
Adversaries might have already hashed all possible passwords of length 4
or 5 or 6 or 7 or 8 or something else.
And maybe if they have a big enough hard drive,
they are storing a big table, like an Excel file
or a CSV file of all of the words that they've tried, all of the passwords
they've tried, and all of the hash values they've already computed.
Then it's even easier.
Then they don't even need to do a brute-force attack, per se,
hashing and hashing and hashing and hashing.
Then they can just compare, compare, compare.
Because indeed, a rainbow table simply contains
all of the passwords they've tried, all of the hash values they've generated,
and so they just compare left to right whatever
the user typed in against the hash value they've already computed.
Now for certain hash functions, this threat of a rainbow table
is just not feasible.
You might need terabytes or petabytes of data, which means a lot of hard drives
and a lot of money, so there are potential downward pressures
on this kind of an attack, but it can certainly speed things up.
Certainly if you're pre-computing-- that is,
pre-calculating some of the hashes for at least words
in an English dictionary, and certainly some short list like all
of the fruits in the world.
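
The precomputed table described above can be sketched as a simple lookup from hash value back to password. This is a plain lookup table, which matches the lecture's description; rainbow tables in the stricter technical sense use chains of hashes to save space, which this sketch does not attempt. The hash function and word list are assumptions for illustration.

```python
import hashlib
from typing import Optional


def hash_password(password: str) -> str:
    # Assumption: unsalted SHA-256 stands in for the server's hash function.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Done once, in advance: hash every word the adversary cares about.
words = ["apple", "avocado", "banana", "cherry", "grape"]
precomputed = {hash_password(word): word for word in words}


def lookup(stolen_hash: str) -> Optional[str]:
    # No hashing at attack time, only a comparison against the table.
    return precomputed.get(stolen_hash)


print(lookup(hash_password("cherry")))  # cherry
print(lookup(hash_password("durian")))  # None
```

The cost moves from computation at attack time to storage ahead of time, which is why the size of the table becomes the limiting factor.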
But there's another problem that we might
encounter on the server with regard to our passwords.
Alice might have a password of apple, Bob might have a password of banana,
but suppose that both Carol and Charlie have a password of cherry.
And just by coincidence, they both chose the same password
and are in this same database.
Now we've already concluded, I think, that we definitely don't
want to store the plaintext passwords.
We don't want to store literally in the clear apple, banana, cherry, and cherry
because this is just too easy for the adversary to do bad things with it.
So we at least want to hash this, but here's
where hashing can leak information, so to speak.
If I go ahead and use the same function I've
been using to hash apple and banana and now cherry,
what do you notice about Carol's and Charlie's hash values?
Curiously, but maybe not surprisingly, they're exactly the same.
That's, after all, how functions typically work,
be it in math or in software, in code.
If you pass the exact same input, unless there's some randomness going on,
you're going to get the same output again and again.
Now why is this a big deal?
Well, if some adversary attacks this database and gains
access to all of these usernames and hashes,
we have leaked information in the sense that the adversary, just
by glancing at this file, knows that, OK, I
don't know what Carol's password is or what Charlie's password is,
but I know it's the same password, and that alone
might be enough information to figure out with higher probability what it is.
Maybe Carol and Charlie are related.
So maybe you focus on words or numbers that are common to both of them.
Maybe there's some information that's implied by this if they both are--
they both like the same TV shows, they both like the same movies.
You can try to find, in your mind, maybe the intersection of information that
might lead you, with higher probability, to figure out,
without brute force, even, what Carol's password is and Charlie's password is.
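
How an adversary might spot this leak can be sketched as follows: group usernames by hash value and report any hash shared by more than one user. The database contents and the use of unsalted SHA-256 are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def hash_password(password: str) -> str:
    # Assumption: unsalted SHA-256 stands in for the server's hash function.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Made-up stolen database of usernames and unsalted hashes.
db = {
    "alice": hash_password("apple"),
    "bob": hash_password("banana"),
    "carol": hash_password("cherry"),
    "charlie": hash_password("cherry"),
}


def users_sharing_a_hash(database: dict[str, str]) -> list[list[str]]:
    groups = defaultdict(list)
    for user, hash_value in database.items():
        groups[hash_value].append(user)
    # Any group with two or more users reveals a shared password.
    return [sorted(users) for users in groups.values() if len(users) > 1]


print(users_sharing_a_hash(db))  # [['carol', 'charlie']]
```

The adversary learns that Carol and Charlie share a password without cracking anything.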
So this is a common problem, and we only have four users in this database.
You can imagine having many more.
Odds are, some of us are going to have the same username-- not
the same username, some of us are going to have the same passwords.
In fact, without raising your hands or admitting to this for the whole world
to see, do any of you have a password of 1234 on some website or app?
Maybe a little harder?
12345678?
Something very simple like this.
Maybe it's an account you don't really care about.
Well, that's a perfect example of where, if you
have an account on the same system as someone else here in the classroom,
you're going to have, in that database, presumably, the same hash values,
and that might be alone enough information to leak and increase
the probability that you, and not Alice or Bob,
are actually compromised with respect to your account.
So how can we fix this?
Well, it turns out there's another technique in the world of data
that we can use to perturb this process.
And you can think of it metaphorically as
like sprinkling a little bit of salt on the hash function
so as to change what its output is.
It's not random, per se, but you are perturbing the output
so that it's much less likely that two people with the same passwords
are going to have the same hash value.
So how does this work?
In this case before, when we passed in cherry as our input,
we got the same hash again and again.
But let me propose that we modify our hash function to take two inputs now.
Not just the password, but also a salt value, so to speak.
A little bit of a sprinkling of, in this case, just two characters--
two numbers, two letters, or a combination thereof.
Now this hash function that I'm describing is still
going to output a hash value, but notice,
it's different from the one before, and even if you don't quite remember
what it was before, it was not this.
But worth noting is that in the output of this hash function
now is the salt itself.
So the salt isn't something that's meant to be private or secret or secure, it's
just sprinkled in there to make sure that whatever hash value comes out
of this black box is a little bit different than if you
had put a different salt value instead.
So for instance, suppose that for Carol and for Charlie,
we use different salts.
And that's the idea.
Different users should have different salt values
just in case they choose the same passwords.
So instead of 50 and cherry, suppose that Charlie
uses a salt value of, say, 49.
49 is not a number that Charlie or you or me have to pick.
This is all done by the server automatically,
picking a random two characters like 4-9 or 5-0.
But notice what just happened.
If I rewind to cherry with a salt of 50, this was the hash value, the first two
characters of which are the salt. If, though,
I change the salt from 50 to 49, the hash changes completely,
and it prefixes it with now 49 instead of 50.
This ensures that even if Carol and Charlie have the exact same password,
there's no way I, the adversary, am going to know by looking at it.
Because indeed, what ends up in the file now are these two values.
One is prefixed with 50, one is prefixed with 49, the rest of the hash values
clearly are completely different.
So again, the upside is this approach where the hash function
takes two inputs, the password and a salt,
and then outputs one hash value means that we're not leaking information
except--
except-- so there is a corner case--
if by chance, by bad luck, the system chooses
the same salt for both Carol and Charlie,
yes, there might still be information leaked.
And honestly, that may very well happen if you've
got thousands, millions of users, then you're
going to run out of two-character possibilities,
you're going to have to reuse salt.
But the idea is that we're just trying to put
downward pressure on the probability of being attacked successfully.
We're trying to equivalently raise the bar to the adversary
so that they are not as likely to gain access to my data or, in turn,
my account.
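
The two-input hash function described above can be sketched as follows. This is not the older algorithm shown in the lecture; as an assumption for illustration, the salt is simply prepended to the password before hashing with SHA-256, and the salt is then prefixed to the output so it is stored alongside the hash. The 64-character salt alphabet is also an assumption, modeled on traditional Unix password hashes.

```python
import hashlib
import secrets
import string

# Assumed salt alphabet: 52 letters + 10 digits + "." and "/" = 64 characters.
SALT_ALPHABET = string.ascii_letters + string.digits + "./"


def new_salt(length: int = 2) -> str:
    # The server picks the salt at random; the user never chooses it.
    return "".join(secrets.choice(SALT_ALPHABET) for _ in range(length))


def salted_hash(password: str, salt: str) -> str:
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    # The salt is not secret, so it is stored as a prefix of the hash value.
    return salt + digest


carol = salted_hash("cherry", "50")
charlie = salted_hash("cherry", "49")
print(carol[:2], charlie[:2])    # 50 49
print(carol[2:] == charlie[2:])  # False

# With two characters there are only 64 * 64 possible salts.
print(len(SALT_ALPHABET) ** 2)   # 4096
```

With only 4,096 possible two-character salts under this assumed alphabet, a system with many thousands of users must reuse salts, which is the corner case mentioned above.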
Questions now on salting or hashing itself?
STUDENT: Oh, I'm curious.
Where do we store the salt?
DAVID J. MALAN: So where do you store the salt?
The salt is actually stored in the hash value itself,
according to this algorithm, in the first two characters.
And the value of storing the salt in the first two characters of the hash
is as follows.
The next time Carol logs in, she types in her username, Carol, and hits Enter.
The server now knows, OK, I'm expecting a password from Carol,
let's see what she types in.
Suppose that she types in correctly cherry.
Now the system is not storing cherry, so it's not
going to compare literally what Carol typed in,
but it is going to hash cherry, but first, the system is going to check,
what is Carol's hash-- what is Carol's salt?
And it's going to infer as much by looking at Carol's hash value
and looking only at the first two characters by convention.
Then what the server is going to do, it's going to take whatever Carol typed
in, cherry, C-H-E-R-R-Y, it's going to pass in 50, 5-0, and then hopefully,
it's going to get back this same value here,
this whole string in yellow.
And if those match, then Carol will be considered authenticated.
By contrast, if the username happens to be Charlie and Charlie hits Enter,
then what the server is going to do is look at Charlie's hash value,
grab the first two characters for Charlie's salt,
use that salt and cherry as the input to the hash function,
and hope that the result is Charlie's value, not Carol's.
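
The login check just described, where the server reads the salt from the first two characters of the stored value, can be sketched as follows. The salted hash function here is an assumed stand-in (salt prepended to the password, then SHA-256), not the algorithm from the lecture, and `db` and `verify` are hypothetical names.

```python
import hashlib
import hmac

SALT_LENGTH = 2  # by convention, the first two characters are the salt


def salted_hash(password: str, salt: str) -> str:
    # Assumption: salt + SHA-256 stands in for the lecture's hash function.
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return salt + digest


# Made-up database: same password, different salts.
db = {
    "carol": salted_hash("cherry", "50"),
    "charlie": salted_hash("cherry", "49"),
}


def verify(username: str, password: str) -> bool:
    stored = db.get(username)
    if stored is None:
        return False
    salt = stored[:SALT_LENGTH]              # infer the salt from the stored value
    candidate = salted_hash(password, salt)  # rehash what the user typed
    return hmac.compare_digest(stored, candidate)


print(verify("carol", "cherry"))    # True
print(verify("charlie", "cherry"))  # True
print(verify("carol", "banana"))    # False
```

Because each check reuses that user's own salt, Carol's attempt is compared only against Carol's stored value, never Charlie's.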
Really good question.
Other questions on salting or hashing?
STUDENT: Is there any sense in rehashing a password?
So hashing it a first time to get a string,
then rehashing it for a second string?
Or it's just impractical?
DAVID J. MALAN: No, you could certainly hash the value multiple times,
but a good hash function should not require that of you.
Especially now, more recent modern hashes, one of which
we'll look at in a moment, they should have sufficiently calculated
and proven characteristics that allow you to hash it just once
and you will get a seemingly random string
that represents whatever that input is.
And here, too, is where I should emphasize
that when it comes to this world of hashing and salting
and today's other topics ultimately, these are not wheels
that you or I should be reinventing.
Unless you are the researcher or the company that's actually
developing the algorithms, stress-testing them, analyzing them theoretically
and practically. So often in industry or the real world,
when people like you and me invent our own systems for storing information,
we just haven't spent nearly as much time
or we're just not nearly as sharp as some of the security researchers
out there who have really given this some thought.
So when it comes to all things security-- and let me get on my soapbox
here and say, you and I should not be solving these problems unless it is
your full-time job or calling in life.
There's just too many corner cases unless you're
collaborating with a smart team.
All right.
With that said, here is what hashes generally
look like nowadays in practice.
For the sake of discussion, I deliberately
chose a fairly simple hash function that was using a fairly short salt,
just two characters, and a fairly short hash value as output.
Here, in a smaller font, no less, is how Alice's and Bob's and Carol's
and Charlie's passwords would probably be
stored nowadays using a more recent modern hash function
that, notice, by the sheer length of the text on the screen,
outputs a much larger value.
If you're familiar from computer science with the notion of bits,
0's and 1's that are used to store information in systems,
these hash values use many more bits, many more 0's and 1's.
You and I as humans are seeing them as alphabetical letters and as numbers,
but underneath the hood, these are just more and more 0's and 1's
that the computer is storing, which means it's much, much less likely
that someone who steals this kind of file
is going to be able to figure out efficiently what
those original passwords were.
And you can see, too, that for both Carol and Charlie,
even though their passwords are still cherry,
these two strings along the bottom look completely different.
Except in one location here.
It turns out that the scheme a lot of systems have adopted
is that if you look between dollar signs at the beginning of what
seems to be the hash value, you'll see a code like y
or other numbers or letters as well.
That's a little cheat sheet that tells the system exactly what hash function
was used to generate the rest of it.
And that's in the documentation that you can read online
for any number of hash functions.
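
Reading that cheat sheet can be sketched as follows. The identifiers in `KNOWN_SCHEMES` are ones commonly documented for Unix-style password hashes; the example strings passed in are made-up placeholders, not real hash values, and `scheme_of` is a hypothetical helper.

```python
# Commonly documented identifiers that appear between the first two dollar signs.
KNOWN_SCHEMES = {
    "1": "MD5-crypt",
    "2a": "bcrypt",
    "2b": "bcrypt",
    "5": "SHA-256-crypt",
    "6": "SHA-512-crypt",
    "y": "yescrypt",
}


def scheme_of(stored: str) -> str:
    parts = stored.split("$")
    # A value like "$y$..." splits into ["", "y", ...].
    if len(parts) < 3 or parts[0] != "":
        return "unknown (no $id$ prefix)"
    return KNOWN_SCHEMES.get(parts[1], f"unknown id {parts[1]!r}")


# Placeholder strings, made up for illustration only.
print(scheme_of("$y$j9T$examplesalt$examplehash"))  # yescrypt
print(scheme_of("$6$examplesalt$examplehash"))      # SHA-512-crypt
print(scheme_of("..ekWXa83dhiA"))                   # unknown (no $id$ prefix)
```

Storing the identifier with the hash lets a system keep verifying old values while moving new accounts to a newer hash function.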
So that's just to say, when you create an account on some new website or app,
if they are doing things well in a manner consistent with best practices
and they are being mindful of your security, they are probably in a file
or in a database or some other mechanism storing
values that look quite like these based on whatever password you actually
typed in.
In fact, just to give you a sense of how easy or difficult
it might be to crack passwords-- that is, figure out what they are based only
on these hashes, in the case of our first hash function
whereby we had a fairly short hash value being outputted
with or without the salt, turns out, there's
18 quintillion possible hash values.
Now that's a lot.
That's bigger than last time's quadrillion value.
But, with enough time, enough money, and enough cloud computing,
those early hash functions can be broken.
That is, with enough time and energy, you can probably
figure out what someone's password is.
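
The 18 quintillion figure is consistent with a 64-bit hash value, that is, 2 to the 64th power; that correspondence is an assumption here, since the lecture does not state the bit length. The guessing rate below is likewise a made-up number, chosen only to show how the arithmetic works.

```python
# Assumption: the older hash function yields 64 bits of output.
possible_hashes = 2 ** 64
print(f"{possible_hashes:,}")  # 18,446,744,073,709,551,616

# Hypothetical rate: one billion guesses per second on a single machine.
GUESSES_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years = possible_hashes / GUESSES_PER_SECOND / SECONDS_PER_YEAR
print(round(years))  # roughly 585 years at this assumed rate
```

Dividing that work across many machines shrinks the time proportionally, which is why enough money and cloud computing can put such a search within reach.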
If you fast forward to the other strings that I showed you
on the screen, the much longer ones that use more bits, so to speak,
then you have this many possible hash values nowadays.
And I actually did look up how to pronounce this,
but based on reading it on my screen, I wasn't actually sure
how to say the word since this is a really big number that my mathematician
colleagues could do a better job pronouncing.
But given how many digits are on the screen,
given how many commas are on the screen here,
this is a really big number such that you
and I probably don't need to worry about an adversary using brute force
and still being able to figure out by the end of time
what the corresponding password might be unless there
are other weaknesses in the system.
Now speaking of weaknesses.
Has anyone ever forgotten your password?
Yes, of course.
But have you ever gone to a website or app, clicked that link that says,
Forgot Password, question mark, in hopes of getting an email of some sort
so that you can reset the password?
I mean, odds are, almost everyone here has experienced that.
But has anyone ever clicked on that link, gotten back
an email that actually contains your password
so that you're just immediately reminded what it is?
I'm seeing a few nods of the head.
You can copy-paste it, then, into the website.
Do not use that website anymore.
That is evidence of-- that is a symptom of a website or application not
following best practices.
Why?
Well, if it is the case that the website can email you your password,
that means they can see and they know what your password is.
That means this database, this text file we've
been talking about is probably vulnerable to some hacker
eventually getting into it and stealing all of those usernames
and passwords in the clear, no less.
Because recall what these hashes are.
They're generally meant to be irreversible.
When you take as input apple, banana, and cherry,
the output looks completely different with no obvious relationship to what
those original passwords actually were.
And so if that's what's being stored in the database,
the company who made that website, the person who made that website or app,
they should not be able to reverse that process either,
otherwise surely, the adversary can.
So it is the case, and I've experienced this myself, often
from smaller shops or companies that maybe haven't really
invested a lot of time or care into their website,
if they are able to email you your original password,
it is, by definition, not secure.
And it's certainly not up to today's standards, it's just too easy
for it to be compromised.
So maybe minimally stop using that service
and make sure you're not using that password anywhere else.
Maximally, maybe send them a note explaining your concern
and maybe linking them to some reference online--
maybe this video-- in which you explain why you have that concern.
Questions, then, on forgetting passwords or hashing or salting?
STUDENT: So as you said, some companies may not be practicing these hashes
and maybe practicing something very bad.
So if I were, let's say, a company and I--
because of my practices, I had a leak of passwords and all the data,
do I as a company have any obligations or responsibility for what
happened since I have all the customer's data and all their passwords,
do I have any obligations or responsibilities?
DAVID J. MALAN: It's a really good, a noble question.
The answer to that ethically is probably yes you should, quite simply.
However, the more nuanced answer is that it's probably
going to depend on the industry that you're in, the country that you're in,
any regulatory requirements that your company faces which might
oblige you to report out in that way.
So I would read up on the context that's specific to you yourself.
And I will say, unfortunately, it is not that common in the world, I dare say,
that companies document and detail publicly
when there have been security exploits.
They might announce that something indeed has happened,
but it is rare that companies will go into any amount of detail.
Now this is understandable because, one, they're already embarrassed,
if not in legal trouble or financial trouble, because that has happened
already, and, two, they probably, typically, don't
want to provide other adversaries-- other future attackers-- with more
information about their systems and the weaknesses that those systems have.
The downside, of course, societally, is that each of us
is secretly getting attacked in ways we didn't
expect, learning things that would be ideal to share
with others in the world.
This itself is actually a big question in the world of cybersecurity,
just how much and how often to share, especially when
you discover a bug or a mistake in someone's system,
do you tell them privately, do you tell the world publicly?
These are ethical questions that we'll touch on indeed in the coming days
as well.
Allow me to propose that separate from these concerns here,
we can come back to some of those recommendations
that we started the class with from this, the National Institute
of Standards and Technology.
Notice that this was one other quote we did not share last time.
A recommendation from NIST is that "Verifiers
SHALL store memorized secrets in a form that
is resistant to offline attacks.
Memorized secrets SHALL be salted and hashed
using a suitable one-way key derivation function.
Their purpose is to make each password guessing trial
by an attacker who has obtained a password hash file expensive,
and therefore, the cost of a guessing attack high or prohibitive."
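To make that recommendation a bit more concrete, here is a minimal sketch in Python of salting and hashing with a key derivation function. It assumes PBKDF2 from the standard library's hashlib as the "suitable" function, and the iteration count and the names hash_password and verify_password are illustrative choices, not anything NIST or this lecture prescribes.

```python
import hashlib
import hmac
import os

# Assumed cost factor, purely illustrative; a real system would tune this.
ITERATIONS = 100_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key). Only these are stored, never the password."""
    salt = os.urandom(16)  # a fresh random salt for every user
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(attempt: str, salt: bytes, stored_key: bytes) -> bool:
    """Re-derive a key from the login attempt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(key, stored_key)


salt, stored = hash_password("apple")
print(verify_password("apple", salt, stored))   # True
print(verify_password("banana", salt, stored))  # False
```

Notice that the server in this sketch never needs the original password again, only the salt and the derived key, which is why it could not email your password back to you even if it wanted to.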
So when I refer to best practices, I'm really
referring to actual documentation like this, either from the United States,
from other countries, from other companies.
There are indeed these best practices, and among our goals
for this class is to expose you to some of those,
both on the consumer side-- you and me as individual computer users,
but also on the corporate or the academic side
as well as to what you should be doing when
you are in a position of being responsible for someone else's data as
well.
Now as for the actual hash functions to use nowadays,
these are just some of them that are generally recommended nowadays
that can be categorized as SHA-2 and SHA-3.
These refer to fairly sophisticated mathematical functions that
take as input, typically, a password, or some input more generally,
and then output a hash value thereof.
There are other algorithms, too, that can even
be used to verify the authenticity and integrity of messages as well.
In fact, today, we'll also focus on how we can use primitives like these
to ensure that data was not actually changed in transit when you sent it
over the internet from one person to another.
But ultimately, what we've been focusing on and what you've seen on this list
here are what are generally known as one-way hash functions.
That is, these are mathematical functions,
or, in the context of programming, these are
functions written in code, languages like Python
or otherwise, that take as input a string of arbitrary length.
That is, a password that's this long, maybe this long, maybe this long,
but what's key to these cryptographic functions
is they output a hash value of fixed length
that is always this many bytes or characters or this many bytes
or characters.
That is, it doesn't matter how short or how long the password is,
these cryptographic, these one-way hash functions are one-way in the sense
that they take a potentially infinite domain, if you
know this term for mathematics, and condense it into a finite range.
That is, a huge number of values, all possible passwords in the world,
to just a finite list of possible hash values.
It might be a long list of possible hash values,
but indeed, no matter how long the input string of text
is, if the hash value is of some fixed length--
16 characters, 32 characters, something else--
there's only a finite number of those values.
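As a small illustration of that fixed-length property, here is a sketch using SHA-256 (a member of the SHA-2 family) from Python's hashlib. The helper name sha256_hex is just a convenience introduced for this example.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hash text with SHA-256 and return the digest as hexadecimal."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Inputs of very different lengths all produce 64 hex characters (32 bytes).
for password in ["apple", "banana", "a" * 1000]:
    print(len(password), "characters in,", len(sha256_hex(password)), "hex characters out")
```

The same holds for the SHA-3 variants that hashlib offers, such as hashlib.sha3_256, whose output is also 32 bytes regardless of the input's length.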
Now there's an implication of this.
When you take a really large input space or domain mathematically
and map it to a smaller finite range, so to speak,
mathematically, it turns out that if you do try to reverse the process,
there will be multiple inputs that yield the same output.
Think about it this way.
If you've got 100 possible passwords in the world,
but you only have 10 possible hash values--
so 100 passwords, 10 hash values, you have
to figure out how to put all of those passwords into 10 buckets, so to speak.
So surely, some of those passwords are going to be in the same bucket.
Think about it in terms of the English alphabet.
If we stuck with that original hash function where A was 1, B was 2,
C was 3, presumably Z was 26, there's more than one fruit
that starts with a--
apple, avocado, and so forth.
So there, too, you are going to have multiple fruits mapping
to the same finite range of values, hash values 1 through 26.
What that means is that if an adversary, or even you, the owner of the system,
look at that hash value and see the number 1,
you don't know if the password was apple or avocado or some other word that
started with A. And so that's what we mean by one-way hash functions.
You cannot reliably reverse the process by any means and know definitively what
the original input is.
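That first-letter hash function is simple enough to sketch in a few lines of Python. The name toy_hash is made up for this example, and of course no real system should hash passwords this way.

```python
def toy_hash(word: str) -> int:
    """Map a word to 1 through 26 based on its first letter (A=1, ..., Z=26)."""
    return ord(word[0].upper()) - ord("A") + 1


# Group some fruits by hash value to see which ones share a bucket.
buckets: dict[int, list[str]] = {}
for fruit in ["apple", "avocado", "banana", "cherry"]:
    buckets.setdefault(toy_hash(fruit), []).append(fruit)

print(buckets)  # {1: ['apple', 'avocado'], 2: ['banana'], 3: ['cherry']}
```

Given only the hash value 1, there is no way to tell whether the input was apple or avocado, which is the sense in which the function is one-way.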
Now there is a catch.
That technically means on some systems, it
might be possible to log in with apple or avocado, or more
generally, your actual password and some other seemingly random password that
might make no sense to you, but just because mathematically it
has the same hash value, that password, too, might let you into the system.
But the idea is, especially as we're using really large numbers of bits,
really long hash values, the probability of you or me,
or even an adversary, guessing
what those other inputs--
those other passwords-- might be is just so small that we tend not to worry about it as well.
The algorithms we've looked at on the board
here are also known as cryptographic hash functions, which
means they have utility in the world of cryptography
where the world of cryptography is all about the practice and the study
of securing data.
Securing data while in transit from one point to another or while
at rest on your own system.
Let's go ahead here and take a five-minute break,
and when we come back, we'll explore precisely that world of cryptography
with respect to our data.
All right.
We're back.
And indeed, cryptography is all about the practice and study
of securing our data, particularly when we want to transmit it
from one person to another.
So cryptography can be broken down into a couple of different categories, one
of which is codes.
And codes are not the type of code that you might write in Python or the like.
It has nothing to do with software, but rather,
a mapping between what we'll call code words
and the actual message or true reading that those words represent.
Here, for instance, is an actual book from over 100 years ago
that was used to map these code words in the left column
to these, indeed, messages or true readings on the right.
The idea is that if one party wanted
to send a secure message to another party,
they wouldn't just write it out in plain English.
Why?
Because if that message, written on a piece of paper or parchment,
were intercepted by another human, that other human,
assuming they, too, know English, could just
read the actual message, the so-called plaintext.
In a code, though, you can convert the words
that you want to say to code words that make no sense necessarily
to someone who's intercepted the message in and of itself
unless they, too, have this book.
Now you can imagine this being a fairly time-consuming process
because when the recipient receives that message, unless they've memorized
all of these pages, these code words and the meanings thereof,
they have to do quite a bit of work flipping through their copy of the book
in order to figure out what that message is.
But the fact that they have a copy of the book, too,
is a potential threat because if one party or another had their code
book stolen, then any of the messages they've sent can now be decoded,
so to speak, by looking them up retrospectively.
And any future messages, if the owners of the book
don't realize that code book has been taken, so, too,
could those messages be translated.
Not to mention the fact, it's fairly cumbersome.
This alone is page 187.
And so that's quite a bit of codes and quite a bit of work
just to achieve this layer of indirection.
But there are some terms of art here that are worth knowing,
and you might actually use in everyday context,
but not necessarily for the same purpose.
So encode, what do we mean by that?
It means taking a plaintext message,
be it in English or any human language, and taking that as input
and producing as output codetext.
So the codetext might be a short succinct
sequence of words that might actually be English words,
but they're not meant to mean what they normally mean.
They're meant to be looked up in the code book
to figure out what the message is actually trying to say.
Meanwhile, decode, as you might expect, is the opposite.
You take as input the codetext that you have received as the recipient,
you use that same code book to look up the code words
and figure out what the actual message is in order
to get the original plaintext, be it in English or any other human language
that the code book is designed for.
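In software terms, a code book is little more than a lookup table. Here is a minimal sketch in Python in which the phrases and code words are entirely invented for illustration. They do not come from the book on the screen.

```python
# Hypothetical code book: every entry here is made up for this example.
CODE_BOOK = {
    "ATTACK AT DAWN": "NIGHTINGALE",
    "HOLD POSITION": "LANTERN",
    "RETREAT": "MARIGOLD",
}

# The recipient needs the same book, read in the other direction.
REVERSE_BOOK = {code: phrase for phrase, code in CODE_BOOK.items()}


def encode(phrases: list[str]) -> list[str]:
    """Convert plaintext phrases to code words."""
    return [CODE_BOOK[phrase] for phrase in phrases]


def decode(code_words: list[str]) -> list[str]:
    """Convert code words back to plaintext phrases."""
    return [REVERSE_BOOK[word] for word in code_words]


codetext = encode(["HOLD POSITION", "ATTACK AT DAWN"])
print(codetext)          # ['LANTERN', 'NIGHTINGALE']
print(decode(codetext))  # ['HOLD POSITION', 'ATTACK AT DAWN']
```

Anyone who obtains the dictionary can decode every message, past or future, which is the same weakness as a stolen physical book.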
But there's an alternative to codes, if only because those code books can
get very cumbersome indeed, they can be taken and compromised and the like.
So it's not necessarily the best system in that you
need to physically keep something like that secure, let alone
do so efficiently when converting.
So there are also what we'll call ciphers.
And ciphers are more algorithmic in nature.
So if you have taken a computer science or a programming course,
you already have the predisposition to thinking algorithmically and taking
a big problem and breaking it down into smaller pieces
and then applying some kind of logic, sometimes again and again,
in order to solve some problem.
So ciphers focus on exactly that.
They don't focus on maybe words or phrases.
They might focus on individual letters instead or even bits
if it's in the context nowadays of computers.
So in the world of ciphers, you might have actually
seen them in popular culture.
So here, for instance, is just one frame from a famous film known as A Christmas
Story, at least here in the US.
It plays like every day all day long on a couple of TV channels
around Christmas time, but this here is Ralphie,
one of the main characters in the movie, and in his hands
here is this secret decoder pin that he tried so hard to get through the mail,
and the secret decoder pin was from Little Orphan Annie herself.
And what it does is implement mechanically a cipher,
converting one letter to a number and back.
But the thing twists left and right so that you can actually
figure out what the mapping might be.
So this is more of a cipher because it's operating at a lower level--
not in entire words or phrases, but one letter at a time.
And it's a repeatable process that Ralphie, in this case,
can apply again and again to all of the letters of the secret message.
In World War II, the Germans, for instance,
had the Enigma Machine that you might have read about or seen
depicted in films, and this was a mechanical implementation
of this same idea of a cipher.
But instead of using mathematics or gears turning just this way and that,
it was much more mechanical.
It was with rotors and lights and the like,
but it, too, was implementing a cipher and could
be configured with different inputs in order
to influence exactly what the output would be.
But that, too, is a physical device, and we'll focus here for the most part,
though, on things more digital, things that you can ultimately, for instance,
nowadays implement much more readily and much more scalably in software.
But the words we'll use are pretty much the same.
To encipher a message means to take that message in English or any other
language, or so-called plaintext, and convert it, not surprisingly,
to ciphertext as output. Meanwhile, the reverse--
or rather, an equivalent term here that you might know as well is to encrypt.
Same idea, synonyms for our purposes, plaintext to ciphertext.
To encipher or to encrypt.
Nowadays, encrypt is probably the more common of those terms.
Meanwhile, decipher would be the opposite of that,
to actually take the ciphertext that someone else has sent to you,
run it through an algorithm or cipher, and get back the plaintext.
Meanwhile, decrypt would be a synonym for that phrase, which
refers to exactly the same process of taking ciphertext as input
and outputting plaintext as output.
So how do we configure these ciphers so that you and I
can use the same algorithm but customize them, not only with our own messages,
but also with our own settings so that just because you and I might
want to send the same plaintext doesn't mean that the ciphertext has
to actually be identical?
And indeed, in the world of cryptography,
it's quite recommended that you and I use public, well-documented,
tried-and-tested algorithms,
but we do keep one piece of information secret so that our use of that cipher,
that algorithm is specific to us.
And this customization, this configuration
are generally known as keys.
Now keys, much like a physical key to a lock on an actual door to your home,
a key is what unlocks the capabilities of this cipher,
but it's a key that needs to be known and used not only by you, typically,
but also by the recipient.
So that by having copies of the same key,
you can not only encrypt messages or encipher them,
but you can also decrypt or decipher those messages, too.
Now what are these keys in practice?
They're not physical objects in the virtual world,
but really just really big numbers.
And often, there's some mathematical significance of these numbers,
and sometimes those numbers don't even look like numbers.
They might be presented on your phone or your laptop
or desktop actually as letters of an alphabet
and maybe even with some punctuation, too.
But at the end of the day, they're really just numbers, or, of course,
if you know a bit of computer science already,
they're really just 0's and 1's.
But it's perhaps helpful to think about them metaphorically as
akin to these physical keys.
Now how are these keys actually used?
Well, within the world of cryptography, there
are different types of encryption.
And the first we'll look at is known as secret key cryptography.
The presumption is that the security of your data
relies on the secrecy of some key.
So if A wants to send a message to B, then A and B
must keep secret whatever key they are using to configure
their choice of algorithms.
So what do we mean by that?
Well, secret key cryptography, specifically
in the context of encryption and scrambling data,
is also known as symmetric key encryption for the reason
that both A and B in this story are going to use the exact same key.
And we'll contrast this in just a bit with asymmetric key encryption,
which solves other problems as well.
So let's consider the process of encryption,
much like the process of hashing, as being this black box.
Somehow or other, this black box is going to encrypt information for me.
Taking as input my plaintext and hopefully outputting as output
my ciphertext that I can actually send over the internet or some other channel
to a recipient as well.
So in the context, then, of secret key encryption,
the picture looks a little something like this.
Not only do you pass as input to the algorithm
your plaintext message in English or any other human language,
you also pass a key.
And for now, just think of that key as a number that you and the other person
have somehow agreed upon in advance.
That algorithm, then, will ultimately output the ciphertext.
And to be clear, the motivation for that key
is to ensure that if I and you and you and you and you are all
using the exact same encryption algorithm,
it's not going to be obvious if and when we're
sending the exact same messages because that,
too, per our discussion of passwords, would leak information.
Maybe you don't care about the information being leaked,
but it's probably not a good thing if, just because someone else is getting
some message, that makes it more likely that an adversary can
infer what it is you sent because the ciphertext just so happens
to look the same.
We want our ciphertext to be unique to each of our transmissions.
So, let's consider a simple, simple example.
Suppose that the message I want to send is just as short as the capital letter
A, and suppose that the key that I want to use is as simple as the number 1.
These are not good best practices, but we'll
use them for the sake of discussion.
Let me propose that the simplest algorithm I can perhaps think of
is actually one that would take A as input and 1 as input and output B.
And you can perhaps infer where this is going.
If I instead provide B as input and 1 as input for the plaintext and key
respectively, then the output is C.
So believe it or not, in yesteryear, Julius Caesar
was known to use an algorithm like this whereby this algorithm, Caesar Cipher,
is what's generally known as a rotational cipher,
because you're rotating the letters of the English alphabet.
A becomes B, B becomes C. And I bet if we continue this logic,
we can go around from Z becoming A as well.
Now this, of course, is being applied at the moment
to very short messages that are not that useful.
Sending A or B or C is not particularly useful in general,
but it's demonstrating how we can encipher or encrypt
our plaintext into our ciphertext.
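That rotation can be sketched in Python as follows, assuming, as in these examples, that the message is in capital letters A through Z. The function name encrypt is simply a choice for this sketch, and anything that isn't a capital letter is passed through unchanged.

```python
def encrypt(plaintext: str, key: int) -> str:
    """Rotate each capital letter forward by key places, wrapping Z around to A."""
    ciphertext = []
    for ch in plaintext:
        if "A" <= ch <= "Z":
            position = (ord(ch) - ord("A") + key) % 26
            ciphertext.append(chr(position + ord("A")))
        else:
            ciphertext.append(ch)  # leave spaces and punctuation alone
    return "".join(ciphertext)


print(encrypt("A", 1))      # B
print(encrypt("B", 1))      # C
print(encrypt("Z", 1))      # A
print(encrypt("HELLO", 1))  # IFMMP
```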
However, when someone receives this message,
they need to know not only what algorithm I used to encrypt it--
in this case, Caesar Cipher or a rotational cipher more generally,
but they also need to know what the key is.
And the key might not be as simple as 1.
Here, for instance, is an example of 13.
If your key is 13 and your plaintext is A,
then your ciphertext should be N, because that is 13 places away from A,
and so now the algorithm seems a little less obvious.
13 is also representative of something that's long been known on the internet
as ROT13 for R-O-T-1-3-- rotate 13 places.
It's a very popular way of scrambling information
but not in a way that you intend to be secure.
Historically, it was often used for like movie spoilers online.
If you want to make something a spoiler before there was CSS and blurring
effects on websites and whatnot, you could just
scramble it so it looks completely encrypted,
but it's very easy for someone else with a click of a button
even to just decrypt it.
However, I would recommend that you not use a key of 26 because why?
Well, at least in English, there's only 26 letters of the alphabet, capital A
through capital Z in this case.
So a key of 26 is going to output for your ciphertext
the exact same thing as your plaintext.
So there's another joke on the internet whereby ROT26 is twice
as secure as ROT13 because 13 times 2 is 26,
and obviously, that's not the case deductively here.
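The joke is easy to check for yourself. Here is a short sketch in Python, with a helper named rotate introduced just for this example, showing that rotating by 13 twice, or by 26 once, lands you right back where you started.

```python
def rotate(text: str, key: int) -> str:
    """Rotate capital letters by key places; other characters are unchanged."""
    return "".join(
        chr((ord(ch) - ord("A") + key) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text
    )


message = "SPOILER"
scrambled = rotate(message, 13)

print(scrambled != message)              # True: ROT13 does change the text
print(rotate(scrambled, 13) == message)  # True: applying ROT13 again undoes it
print(rotate(message, 26) == message)    # True: ROT26 changes nothing at all
```

Python also happens to ship a ROT13 codec, so codecs.encode("SPOILER", "rot_13") should give the same scrambled text as rotate("SPOILER", 13).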
Now of course, this particular algorithm and keys of this small size,
1 through 26, not at all secure.
Why?
Well honestly, I don't even need a computer to crack this cipher.
I can probably take out a piece of paper and pencil
and just try all possible numbers from 1 to 25--
I don't need to even waste my time with 26--
and just figure out via brute force what keys someone might have used
to send a message using this algorithm.
Not just on single-letter messages, but maybe on every individual letter
of their message.
Wouldn't take me that long to probably figure this out by brute force by hand.
And with code, my gosh, I could write some Python code probably
that does it even faster than that.
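Here is roughly what that Python code might look like, as a sketch. The ciphertext below is a made-up example, not the one on the screen, and the names shift_back and candidates are assumptions for illustration.

```python
def shift_back(ciphertext: str, key: int) -> str:
    """Undo a rotation of key places on capital letters."""
    return "".join(
        chr((ord(ch) - ord("A") - key) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in ciphertext
    )


# Hypothetical intercepted message, enciphered with some unknown key.
intercepted = "KHOOR ZRUOG"

# Brute force: try every key from 1 to 25 and look for readable English.
candidates = {key: shift_back(intercepted, key) for key in range(1, 26)}
for key, guess in candidates.items():
    print(key, guess)
```

One of the 25 lines printed is readable English (here, key 3 gives HELLO WORLD), and a human can spot it at a glance.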
So here on the screen is some ciphertext that I created in advance.
And I'll stipulate that this ciphertext was enciphered
using that same rotational cipher, but I'm not
going to tell you just yet what key I actually used.
It was originally an English message in all capital letters.
So the task at hand now is to decrypt this, I dare say.
Whether you are the intended recipient of the message or maybe maliciously,
you've intercepted my transmission with this message in it,
and now you're trying to brute force your way through by trying,
and by the looks of some heads going down and some scribbling, 1 or 2 or 3.
I bet we could also brute force our way through this algorithm, but how?
How does the decrypting process work?
It's really just the same thing in reverse.
If this now is our picture and you have ciphertext as your input,
you should be able to pass the same key as input--
1, for instance or 13 or, with no good reason,
26, and get back out the plaintext.
But of course, the decryption algorithm is indeed the opposite
because you don't want to just add one position
or add two positions or three positions, you want to subtract 1 or 2 or 3 or 13.
You want to go in reverse, so to speak.
And so, if I were to pass in B as the ciphertext and 1 as the key,
well, the plaintext decrypted should, of course, be A.
And that holds now for all of the other letters of the alphabet,
assuming I'm reversing this process, in order to decrypt.
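In code, decryption under this cipher is just the same rotation with subtraction in place of addition. A minimal sketch in Python, with function names chosen for this example:

```python
def encrypt(plaintext: str, key: int) -> str:
    """Add key to each capital letter's position, wrapping around the alphabet."""
    return "".join(
        chr((ord(ch) - ord("A") + key) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in plaintext
    )


def decrypt(ciphertext: str, key: int) -> str:
    """Subtract key instead of adding it, which is the same as rotating by -key."""
    return encrypt(ciphertext, -key)


print(decrypt("B", 1))   # A
print(decrypt("N", 13))  # A
print(decrypt("A", 1))   # Z, wrapping around in reverse
print(decrypt(encrypt("SECRET", 7), 7))  # SECRET
```

The round trip only works if sender and receiver use the same key, which is what makes this a symmetric cipher.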
And now, I'll let you glance at the screen here for just a moment
and see if you yourselves can't figure out
what this ciphertext is trying to say.
And if you like the idea of figuring this out,
if you want to get better at this particular skill,
you are an aspiring cryptanalyst, I dare say, focusing
on this world of cryptanalysis.
And this, too, itself is a job, I dare say particularly with governments,
trying to decrypt messages that might very well have been encrypted.
Now hopefully the world is using more secure algorithms
than these simple rotational ciphers.
And what do I mean by secure?
Hopefully they're using keys that are much bigger than small numbers like 1
through 25.
Hopefully they're using much, much, much larger numbers, many more bits, if only
so that it takes you and me, when we try to apply cryptanalysis to ciphertext,
it takes us way, way longer than this particular algorithm alone.
Now I don't want to keep you in suspense,
but I also don't want to spoil this if you'd like to try your hand at this.
So go ahead and close your eyes if you don't want to see the answer to this,
or I suppose you can just look away from your screen.
But in five seconds, I'll reveal what the plaintext actually is--
and some of you, if you've seen that movie I mentioned,
will know immediately why this is the way it is,
but otherwise, you might just see this as an advertisement of sorts.
So here we go.
Your chance to close your eyes in 5, 4, 3, 2, 1.

From some faces, some of you have seen this movie around the holidays,
but now, I've taken it off the screen and we'll move on now
with some actual algorithms.
If you'd like to come back on replay and actually see what the answer is,
we'll, of course, leave it on-demand.
So what are some of the actual algorithms
used nowadays for encryption that are best practices?
This rotational cipher that I described earlier, Caesar's simple one,
is not to be recommended.
It's wonderful for demonstration sake and discussion's sake,
but it's not something you should be using in practice unless, for instance,
you're in, say, middle school trying to send a message on a piece of paper
through your classroom of classmates and worried
that the teacher might intercept it and the teacher probably
doesn't have the instinct to or the care to actually
brute force their way through it and figure out what the key is.
But that's the level of security you're getting with something
like that rotational cipher.
But in the real world, with our phones and desktops and laptops today,
generally used are AES or Triple DES, both of which
are popular algorithms that have been vetted by the world
and are very commonly used as secret key encryption ciphers
or symmetric key encryption ciphers, which, to be clear,
require that both the sender and the receiver
know and use the exact same key.
And for our purposes today, let me just stipulate
that the mathematics of these two and other algorithms
is much more sophisticated and documented in textbooks,
and that, therefore, makes it much harder for the adversary
to figure out, as by trying 25 different keys, what the actual key in use
might be.
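To get a feel for the difference in scale, here is a back-of-the-envelope sketch in Python. It compares the 25 useful keys of the rotational cipher with a 128-bit key, which is one of the key sizes AES supports. The attacker's speed of a trillion guesses per second is an assumption picked purely for illustration.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
GUESSES_PER_SECOND = 10**12  # assumed attacker speed, purely illustrative


def years_to_try_all(keyspace: int) -> float:
    """Years needed to try every key in the keyspace at the assumed rate."""
    return keyspace / GUESSES_PER_SECOND / SECONDS_PER_YEAR


print(years_to_try_all(25))      # rotational cipher: a tiny fraction of a second
print(years_to_try_all(2**128))  # 128-bit key: on the order of 10**19 years
```

Even if the assumed rate is off by a factor of a million in the attacker's favor, the second number stays far beyond any human time scale.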
Questions now about secret key cryptography or any of the primitives
we've just discussed?
STUDENT: So is it possible that if someone hacks the-- like
gets to know about the hash value-- the hash function of a company that it
is using, he might be able to use the hash values
and use-- like find a reverse function and then get the passwords for that?
DAVID J. MALAN: A good question.
I wouldn't worry mathematically about someone reversing the hash functions,
if only because with all of the ones that are in popular use
today in modern systems, there are a lot of smart mathematicians, computer
scientists, professionals who have vetted, if not proven mathematically,
that these things work as expected.
However, if the passwords that have been hashed are relatively easy to guess,
or if the adversary just gets lucky with whatever technique
they are using, it is absolutely possible to find at least a password,
an input that maps to that hash value, but often
not without significant effort.
And so generally, a company does not want to,
should not try to keep proprietary or secret what
hash function they're using, what encryption algorithm they're using.
If anything, I dare say, it should be reassuring
to the public if and when companies are using best practices and de facto
standards. All of these algorithms are designed
to keep secret not the algorithm itself, which literally can be found
in like university textbooks nowadays and on Wikipedia and beyond,
but rather, to keep secret the thing that's designed to be secret,
which is the key.
And now, if you're using too small of a key like I did originally,
well, then you're just using the algorithm poorly, perhaps.
But so long as you're adhering to best practices
and picking a really big, recommended-sized key,
then things mathematically should be trustworthy.
STUDENT: For an attacker, rather than like basically cracking a hash
or cracking an algorithm, wouldn't it be easier
to just try and access the basic server database
and access the hash function like generated code?
So rather, access how the specific algorithm works.
That way, they can basically just reverse-engineer it?
DAVID J. MALAN: Everything you described is possible.
However, I would push back on this assumption
that the company should try to keep its hash algorithm secure or hidden.
You should trust in the mathematics of what
we're discussing today, both in the context of hashes
and in the context of encryption.
And I've pulled back up on the screen here the number of possible hashes
that exist when using one of the most modern standards for hashing passwords.
This is such a big number--
I dare say, I don't remember how many atoms are in the universe,
but I'm going to guess it's fewer than this, maybe.
The idea is, intuitively, that if the search space of possible hash values
or the search space of possible keys is so darn big,
both you and I, not to speak darkly, are going
to be dead before the attacker actually figures out what
that password or that hash actually is.
So that's generally the presumption.
Most of what we do today in terms of security all boils down
to probabilities and trying to drive the probability of being exploited way,
way, way down, even though, if your password is still 00000000,
doesn't matter if there's this many or more possibilities if the adversary
tries that one first.
So keeping algorithms secret, keeping ciphers secret
is generally not best practice.
You should be trusting that the math and the probabilities
will protect your data if you are using these algorithms correctly.
And how about one more question before we resume?
STUDENT: How cipher work with word?
Not number, like with words, how it work?
How we can cipher-- or cryptograph like our latest with words, not the number,
how it can be work?
DAVID J. MALAN: OK, so if your key is a word and not a number,
let me first say that generally when it comes to encryption,
the keys are not words.
These are not passwords, they're not meant to be used in quite the same way.
These keys are generally generated by the computer for you,
and so as such, they're just random numbers for the most part.
With that said, even if it is a word like apple, there are ways--
and you would learn this in a class like CS50
itself-- to convert a word to the underlying numeric representation.
There's a system called ASCII or Unicode.
So capital A is actually the number 65 in most systems.
Capital B is the number 66.
But we can go one level deeper.
There's actually a pattern of 0's and 1's that represent A's and B's and C's
and so forth, so we can convert everything in the world of computers
to numbers.
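You can see that conversion for yourself with a few lines of Python. This sketch uses the built-in ord function, which returns a character's Unicode code point (identical to ASCII for these letters), and the variable names are just illustrative.

```python
word = "APPLE"

# Each letter's numeric representation.
codes = [ord(ch) for ch in word]
print(codes)  # [65, 80, 80, 76, 69]

# One level deeper: the same numbers as patterns of eight 0's and 1's.
bits = [format(code, "08b") for code in codes]
print(bits[0])  # 01000001, which is capital A

# The whole word can even be treated as one big number.
as_number = int.from_bytes(word.encode("ascii"), "big")
print(as_number)
```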
And for that, let me encourage you to take CS50x online.
So that, then, is secret key cryptography
or symmetric key cryptography, but it doesn't solve all of our problems,
because I've taken for granted throughout this whole discussion
that the sender and the receiver have a shared secret between them.
Whether it's a simple key like 1 or 2 or 13--
hopefully not 26-- or hopefully some much bigger value.
But there's kind of a chicken and the egg problem there,
so to speak, in English whereby how do you actually establish
a shared secret between parties A and B if A and B have never talked before,
in fact?
So for instance, if you're visiting Amazon.com
for the first time, a popular e-commerce website, or gmail.com for your email,
ideally, and you probably know this already
from just living in the real world nowadays,
ideally you want that connection to Amazon or Gmail to be encrypted,
to be scrambled in some way.
Why?
Well, you don't want your password being stolen by someone.
You don't want your credit card number being intercepted by someone.
You don't want your personal emails being read by other people.
So it stands to reason that encryption is generally a good thing.
And you've seen this, perhaps, in the URL bar
via something called HTTPS where the S literally is meant to mean Secure.
But odds are, you don't know anyone personally at amazon.com
and you don't know anyone personally at gmail.com.
So what key are you going to use to communicate securely
with these websites, not to mention new websites that don't even exist today
but might come online tomorrow, how do you establish a shared secret
with someone else?
So that's a fundamental gotcha or caveat with symmetric key
or secret key encryption, is that it assumes
that you have a shared secret between you and the other person.
But the chicken and the egg scenario comes
in whereby the only way to establish a shared secret
would be to send it to the other person securely,
but if you can't communicate securely, you can't even
send them the secret you want to use.
So you're caught in this deadlock.
Thankfully, thanks to math, there are ways
that we can solve this, too, via not symmetric key cryptography,
but public key cryptography, otherwise known as asymmetric key cryptography.
And among the algorithms here might be these, something called Diffie-Hellman,
MQV, RSA, and others as well.
And I dare say, on this list, maybe RSA is among the most well-known.
It's perhaps an acronym you've actually seen in the wild.
Now what do we mean by public key cryptography,
or more specifically, public key encryption?
Well, in the world of public key encryption,
or asymmetric key encryption, the asymmetry
is implying that you actually don't use one key between the two people
A and B. You actually use two keys.
In the world of public key encryption, everyone in the world
has both a public key and a private key.
And these two are just really big numbers.
There is a mathematical relationship between these numbers, the public key
and the private key, but that's a relationship
that your phone or your laptop or your desktop
figures out when generating these values for you.
So unlike our previous discussion of passwords, which you and I as humans do
choose and memorize or store in our password managers,
when it comes to keys, these are generally,
in the world of public key cryptography, generated for you.
And as the name suggests, the whole purpose of these keys
is to tell the whole world if you want what your public key is.
It is not in any way secret.
You can literally email it out, you can put it in the signature of every email,
you can post it on your website, on social media.
The whole point of the public key is to make it, indeed, public.
But, suffice it to say, the private key should be kept secret by you,
private by you on your own device.
That should never be shared with anyone else.
But the cool thing about public key cryptography and the mathematics
underlying it is that if you share your public key
with someone else on the internet, they can use that public key
to encrypt a message and then send it to you over email
or chat or any other technology.
And if you had to guess, what is the only key
in the world that can decrypt a message that has
been encrypted with your public key?
The only key in the world that can decrypt
a message that has been encrypted with your public key is your private key.
That's what the mathematical relationship ultimately does for you.
So, pictorially here, if this is our algorithm that
implements this idea of public key encryption,
let's see what the inputs and outputs should be.
If the goal is to send a message to you and you
have shared with the world your public key, whoever is sending you
this message uses your public key, their plaintext message, and out of that
comes ciphertext.
That, then, is how asymmetric key encryption works.
Meanwhile, when you receive that message,
you can use your own private key and the ciphertext you've just
received to get back the plaintext.
And this is what we mean by asymmetric.
Unlike secret key cryptography or symmetric key cryptography where
you're using the same key back and forth, plus 1 or minus 1
in the case of the rotational cipher, with asymmetric encryption,
you are using one key for the encryption process and another key for the decryption
process.
So that's what's fundamentally different.
RSA is one of the most popular algorithms for this.
The browsers you probably use every day are probably
using some variant of RSA underneath the hood.
We won't get into great detail about the mathematics,
but one of the most important details about RSA
is that it relies on really big prime numbers.
In fact, in a nutshell, what happens with RSA is your computer or your phone
chooses a really big prime number called p.
It then chooses a really big other prime number called q.
Then it multiplies them together to get a new value, we'll call it n.
And it uses that value n in the resulting mathematics
that the algorithm's authors came up with, dot-dot-dot.
The presumption here is that when you take a really big prime number
and multiply it against a really big other prime number,
it is really hard to figure out from the product of those numbers
what the original p and q were.
And if you're a little hazy on prime numbers,
it's a number that can be only-- that can only be divided by itself and 1.
And indeed, we can use those, coming up with two big ones,
multiply it together in order to get this value n that is subsequently
used in the rest of the mathematics.
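To get a feel for why the size of p and q matters, here is a sketch in Python that recovers p and q from n by simply trying every possible divisor. The function name and the tiny primes 61 and 53 are assumptions chosen for illustration. With numbers this small the search is instant; with primes that are hundreds of digits long, the same loop would not finish in any reasonable amount of time.

```python
def factor_semiprime(n):
    """Find p and q with p * q == n by trial division (only feasible for small n)."""
    candidate = 2
    while candidate * candidate <= n:
        if n % candidate == 0:
            return candidate, n // candidate
        candidate += 1
    return None  # no divisor found, so n itself is prime (or too small)

n = 61 * 53
print(n)                    # 3233
print(factor_semiprime(n))  # (53, 61)
```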
What are the rest of those mathematics?
In essence, this.
And this will be the scariest-looking formulas you perhaps
see over the course of this class.
The value n I just described is ultimately used to divide values.
If you're unfamiliar with mod here, this means, in this context,
to take the remainder of some value.
So what are we doing?
Here is a quick summary of how encryption and decryption work
with RSA.
If someone has some message m that they want to send to you,
and you have come up with, via the dot-dot-dot process
that I alluded to earlier, your own public key e there.
Well then, someone can take their message,
encrypt it by raising that message to the power of e, the exponent of e,
and then divide it, divide it, divide it, divide it by n
and figure out what the remainder is when dividing by n.
That then gives you a value called c for ciphertext.
When you then receive that message c, you can use your private key,
known here as d, and you raise the ciphertext,
its numeric value, to the power of d-- that is, the exponent in d, and you
divide, divide, divide by n in order to figure out
that remainder, which will give you back the original message.
Now that is a significant oversimplification of what's going on,
but that's the essence of the algorithm.
It has to do with picking two very large prime numbers,
multiplying them together to get that value n,
and then using n as well as other values that, dot-dot-dot, are generated
by the algorithm for you, e and d, in order to encrypt and decrypt messages
ultimately.
And this is what's generally known as modular arithmetic.
It involves lots of division and division and division
in order to come up with these remainders,
but ultimately, it is a very secure way to asymmetrically share information
without having to agree on one shared key in advance,
but rather, using a public and a private key instead.
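Here is a toy version of those steps in Python, using tiny primes so the numbers stay readable. The specific values (61, 53, 17, and the message 65) are assumptions for illustration. The line that derives d fills in part of the dot-dot-dot using the standard textbook construction, which this lecture does not cover. Real systems use vetted libraries and far larger numbers, plus extra safeguards that are omitted here. This sketch needs Python 3.8 or later for the modular inverse.

```python
# Toy RSA with tiny primes; real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                   # 3233, part of both the public and private key
phi = (p - 1) * (q - 1)     # helper value used only while generating keys
e = 17                      # public exponent (assumed; shares no factor with phi)
d = pow(e, -1, phi)         # private exponent, chosen so e * d leaves remainder 1 mod phi

def encrypt(m, e, n):
    """c = m to the power of e, mod n."""
    return pow(m, e, n)

def decrypt(c, d, n):
    """m = c to the power of d, mod n."""
    return pow(c, d, n)

m = 65
c = encrypt(m, e, n)
print(c)                 # 2790
print(decrypt(c, d, n))  # 65
```

Note that anyone can run encrypt, because e and n are public, but only the holder of d can run decrypt.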
Now there are other techniques that come with this world
of public key cryptography, and another technique is that of key exchange.
So by contrast, if you do actually want to establish
some kind of shared secret, there are alternative algorithms
that different humans have invented over the years.
So there are alternatives to one algorithm or another,
and one of these alternatives is actually
called Diffie-Hellman, named after another pair of authors here.
So here is the essence of the mathematics for this algorithm,
the goal of which is indeed key exchange.
To figure out, using fancy mathematics, how both A and B can come up
with the same value that they can then use as a shared secret,
but without anyone who intercepts any of their messages
being able to figure out what is that shared value, that shared secret.
So what's the essence of the math here?
Well, you first pick a value g, which is called a generator.
It can be as simple as the number 2.
And you pick a big prime number, call it p here.
And those are agreed-upon in advance.
Meanwhile, person A, say Alice, picks her own private key, lowercase a,
which is another really big number, and then she does this math: g
to the power of a mod p, which gives her a value, capital A.
And again, mod refers to taking the remainder of some value.
Meanwhile, B, or Bob, still uses the same g, still uses the same p,
picks his own private key, lowercase b, and raises g to the power of b modulo p,
and that gives him back this value capital
B, whereas Alice had capital A. Then, it turns out that Alice and Bob can
send those values across the internet--
capital A one way, capital B the other way, and thanks to some fancy modular arithmetic
here, too, Alice can take Bob's capital B value and raise it
to the power of her lowercase a value, which effectively gives you
g to the power of a times b mod p.
Bob, meanwhile, can take Alice's capital A value that was sent to him,
raise it to the power of his private key, lowercase b, and then mod p.
So calculate the remainder with respect to p.
The end result, and it's totally fine if these mathematics
are uncomfortable for you or whoo!
Just know that, thanks to some basic principles of mathematics,
this results in both Alice and Bob having the exact same value--
we'll call it s for shared secret--
even though the value never went across the internet in its entirety.
Alice sent part of it this way, Bob sent part of it this way,
but because Alice and Bob held on to private values, the little A
and the little B, they kept that to themselves, they're
able to do these mathematics that ensure that they both came up
with the same value even though you or I, if we intercepted
any one of those messages, we could not figure out what it is.
And now that they have a shared secret s,
they can use that using any of those other symmetric
ciphers we talked about earlier.
AES I put on the board briefly, triple DES I put on the board briefly.
Heck, we could even use this in a rotational cipher
if we really wanted to, but not, indeed, best practice.
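The exchange can be sketched in a few lines of Python. The numbers here (g of 5, p of 23, and private values 4 and 3) are toy values assumed for illustration. In practice p is enormous and the private values are generated randomly by the device.

```python
g, p = 5, 23            # public values agreed upon in advance (toy-sized)

a = 4                   # Alice's private value, never sent
b = 3                   # Bob's private value, never sent

A = pow(g, a, p)        # Alice computes this and sends it to Bob
B = pow(g, b, p)        # Bob computes this and sends it to Alice

s_alice = pow(B, a, p)  # Alice combines Bob's value with her private value
s_bob = pow(A, b, p)    # Bob combines Alice's value with his private value

print(A, B)             # 4 10
print(s_alice, s_bob)   # 18 18
```

An eavesdropper sees g, p, A, and B, but not a or b, and with realistically large numbers there is no known efficient way to recover the shared value from those alone.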
So again, don't worry so much about focusing on the mathematics,
but if you were to take a higher-level class in theoretical computer science,
these are intellectual rabbit holes that you could go down to better understand
how the software works.
And now to my comments earlier about not trying
to invent your own cryptographic functions,
this is the kind of reason why.
This is the degree of sophistication that you and I take
for granted in our phones, our laptops, and desktops
that have been vetted by industry and academics alike.
Generally best practice is to rely on standards
that have been tried and tested rather than
try to come up with your own creative cryptosystem, so to speak,
that may very well have faults that you yourself do not know.
And the icing on the cake, if you're curious as
to the underlying mathematics, is that this is the value
Alice and Bob are both ultimately calculating: g to the power of little a times little b, mod p.
But more on that in a higher-level mathematics course if indeed
of interest.
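For the curious, the reason both sides agree fits in a few lines. The notation is a convention assumed here for clarity: lowercase a and b are the private values, and uppercase A and B are the values sent across the internet.

```latex
\begin{aligned}
A &= g^{a} \bmod p, \qquad B = g^{b} \bmod p \\
s &= B^{a} \bmod p = (g^{b})^{a} \bmod p = g^{ab} \bmod p \\
  &= (g^{a})^{b} \bmod p = A^{b} \bmod p
\end{aligned}
```

The middle step works because raising a power to a power multiplies the exponents, and taking remainders along the way does not change the final remainder.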
How about one final building block that you
get from this world of public key cryptography,
and this is one that's going to be increasingly omnipresent,
I do think, in our world, especially as we move away
from very archaic paper-pencil signatures
that you might write with a pen on a paper,
and rather, moving to what we'll call digital signatures as well.
It turns out that once you're comfortable with the idea
of public key cryptography generally involving a public key
and a private key, the first of which is literally public,
you can share it with the world; the second of which is meant to be private,
kept only to you.
And if you can take at face value my claim
that through appropriate mathematics, there's
a relationship possible between these two numbers,
that whereas one can encrypt data, the other can decrypt,
even if you don't care to get into the specifics of the mathematics,
but you just agree that, OK, that sounds reasonable to me,
that that math can work, we can now use that building block
of a public key and a private key to solve other problems as well.
Not just encrypt messages from point A to point B
and back, but rather, to sign information, sign documents,
even, and say, yes, this was signed by David or someone else.
So how does this work?
In the world of digital signatures, here's
a few more acronyms of algorithms that are commonly
used even though we'll continue to simplify them in our discussion.
DSA, ECDSA, RSA, and others can be used to give you
the ability to sign documents or other pieces of information digitally.
So what does it mean to sign something digitally?
It's not at all like this with a unique signature,
it's all mathematics involved.
So, here, then, might be our algorithm for digitally signing
some document or piece of information.
And I claim that the input to this process is a message.
A letter that you've written, a contract that you want to sign,
something that you want to put your digital signature on.
And the output of this process initially is going to be a hash.
So we can use any number of hash functions
we talked about earlier that take as input an arbitrary length
input, like a message, a document, an essay, a contract,
and produce as output a fixed length hash value.
So we've seen that and we've stipulated that is indeed
possible, similar in spirit to our password discussion earlier.
You can even do it for larger inputs than passwords.
You can do it for entire documents as well.
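As a quick sketch of that fixed-length property, here is some Python using the standard library's hashlib. SHA-256 is an assumption chosen for illustration; any cryptographic hash function would make the same point, and the sample inputs are made up.

```python
import hashlib

def digest(text):
    """Return the SHA-256 hash of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short_hash = digest("apple")
long_hash = digest("This is a very long contract. " * 1000)

# Both hashes have the same length even though the inputs differ hugely in size.
print(len(short_hash), len(long_hash))  # 64 64
```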
Once you have that hash, here's how you digitally sign the document.
You use your private key, you pass that as input, as well as the hash value
you just computed a moment ago into the digital signature algorithm,
and the output of that process is a signature.
So if you think about this intuitively, what are we doing?
Well, we're taking an arbitrary-sized document.
Maybe it's a letter that you've written, maybe it's
a contract that you've written that you need to sign that might be short
or it might be really long.
Here's where the value of cryptographic hash functions comes in.
Recall that a cryptographic hash function, by definition,
takes an arbitrary-sized input and reduces it to a fixed-sized output.
So it doesn't matter how big the original
was, you can distill it into a distinct representation that's shorter.
So, per this diagram, if you take that hash value
and you encrypt it with your private key, what we say
is that the output of that process, which
is just a really big number or some sequence of weird-looking text,
is your digital signature.
Now this is a little weird because what we're doing now
is the opposite of public key encryption.
With public key encryption, remember, someone else
used your public key to encrypt a message to you
and you used your private key to decrypt it.
But in the case of digital signatures, the story gets flipped upside-down.
You use your private key and a hash of your message
to digitally sign your document and the output of that is a signature-- again,
a number or some string of text.
And you send that signature to the recipient saying, this
is my digital signature, you can verify it now if you so choose.
And they should.
So that invites the question, well, how does the recipient
verify your digital signature?
How do they know that this weird-looking sequence of characters or numbers
actually was signed by you?
Well, recall that you have not only a private key, but a public key as well.
And that public key is accessible to everyone, including that recipient.
And so, what happens is this.
When that recipient gets your document and your digital signature,
so to speak, they probably want to and should verify the digital signature
to confirm that, yes, you signed off on that document or contract.
So what does that box look like?
Well, they have received not only the document itself, the so-called message,
they've also received your digital signature.
So you've sent them two things.
And the digital signature, you can think of it like a human signature,
but it's, of course, a big number or a string of text.
But they've sent you two things-- the document and that signature.
So what do you do?
You take the document you've received and you run it
through the exact same publicly available hash
function, because the document might be long,
so you want to collapse it into a short hash representation
thereof, just like our use of passwords.
So that you can just do easily, no private information involved.
But then what do you do?
You then take the public key of the person who signed this document, you
take the signature that they claim is their signature,
and you decrypt their signature with their public key.
That should output the exact same hash that you just calculated.
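Both halves of that story, signing and verifying, can be sketched in Python with toy numbers. Everything specific here is an assumption for illustration: the tiny RSA key (3233, 17, 2753), the use of SHA-256, and the step that shrinks the hash to fit under the toy n. Real signature schemes such as DSA, ECDSA, or RSA with proper padding are more involved and should come from a vetted library.

```python
import hashlib

# Toy RSA key pair: (n, e) is public, d is private.
n, e, d = 3233, 17, 2753

def toy_hash(message):
    """Hash the message with SHA-256, then shrink it to fit under the toy n."""
    full = int.from_bytes(hashlib.sha256(message.encode("utf-8")).digest(), "big")
    return full % n

def sign(message, d, n):
    """Signer: 'encrypt' the hash of the message with the private key."""
    return pow(toy_hash(message), d, n)

def verify(message, signature, e, n):
    """Recipient: 'decrypt' the signature with the public key and compare hashes."""
    return pow(signature, e, n) == toy_hash(message)

contract = "I agree to the terms."
signature = sign(contract, d, n)

print(verify(contract, signature, e, n))            # True
print(verify(contract, (signature + 1) % n, e, n))  # False, the signature was altered
```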
So to summarize, the message itself, the document in this story, is public.
It's not encrypted, it's not something you really worry about being private.
What you really care about in this story is
that it was signed by a specific person.
So if that message, that document is available to both the sender
and the receiver, both of them do this first process of hashing the message,
hashing the document just to get some succinct representation thereof.
So it's not this big, it's this big.
Makes the math quicker and easier.
However, what the recipient does is upon receiving not only that message, which
they just hashed, but also your claimed digital signature,
they try to decrypt your signature using your public key.
And here, too, just as the private key can
reverse the encryption done by a public key,
so can the public key reverse the encryption done by a private key.
So if the recipient mathematically gets the exact same hash
after decrypting what you sent them, it must be the case
mathematically that the only person in the world who
could have signed this document is, in fact, you
because they have your public key.
And maybe some third party, some registry,
some company has said, yes, that is David Malan's public key,
you can trust that.
And so, if David Malan's private key has not been compromised,
you can trust that any signature that you can decrypt with my public key
must have been encrypted with my private key.
And it takes a while, I think, for these ideas, and certainly the mathematics
to sink in, but for now, if you just trust
that there's two big numbers in the world, one public, one private,
there's a mathematical relationship between them such that one can reverse
the effects of the other in either direction,
we humans can use this now not only to secure
our messages per our discussion of encryption,
we can also use it to authenticate messages
and attest, yes, this came from David Malan or did not.
And unlike a human signature on a piece of paper
that can obviously just be photographed, duplicated, traced over,
the security of digital signatures relies on keeping your private key private,
and that notion does not exist in the world of human signatures,
and so in that sense, digital signatures are objectively better
than our old-form human ones.
Questions now?
And I know that's a lot, and it's OK if it didn't all go down at once.
Questions on digital signatures, public key encryption or decryption,
or anything prior?
STUDENT: Would these public and private keys be attributed to, what,
your IP address?
DAVID J. MALAN: A good question.
To what are they attributed?
Not to your IP address typically.
They are typically stored in a registry, like a central registry that
knows that this is Vlad's public key, this is David's public key and so
forth.
And it relies on a system of trust and transitivity.
So if you trust this third party company that is storing all of our public keys,
then you can trust whoever it is "they" are, in turn, trusting.
Or it can be more distributed.
Your public key can literally be distributed
in the footer of your emails.
It can be posted on your website.
It can be on your LinkedIn profile or the like.
And so long as other people in the world trust
your emails or your website or LinkedIn, they
can trust that that is, in fact, your public key.
So different ways to implement that system of trust.
Other questions?
STUDENT: Hashing uses a mathematical function and encryption uses
a mathematical function plus a key.
Like the Caesar Cipher basically uses the simple function plus the key.
Is that analogy correct?
DAVID J. MALAN: Yes, that is correct.
And if it helps you-- this is an oversimplification,
but it's generally helpful, I think, to think of hashing as one-way.
So you can only convert a value to a hash value but not the opposite.
But encryption is like two-way--
it's reversible hashing, so to speak.
The output still looks weird and random, but you can undo the process.
And one way to think about this is in the world of hashing,
because I claim that you can take like an infinite domain,
like any possible message you want to send, and convert it
to a finite range--
for instance, all A-words could be a hash value of 1,
all B-words could have a hash value of 2.
That simple example already captures the reality
that if you only have the hash values 1, 2,
I have no idea what the original input is.
And it doesn't matter how hard I try, I'm never going to figure it out
because it could be apple or avocado or something else that starts with A.
So hashing in that sense, one-way hashing throws away information such
that it's not recoverable.
But encryption does the opposite.
It would be pretty useless if encryption threw away information
because the whole point of encryption is to secure messages and information
we want to send.
So encryption is reversible; hashing, in general, is not.
And, as you know, the key, no pun intended, to encryption
is necessary so that you can reverse the process in a way that
remains secret to other people.
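The first-letter example can be written out in Python to show how information gets thrown away. The function name is made up, and this is of course not a real cryptographic hash function.

```python
def first_letter_hash(word):
    """Toy hash: words starting with 'a' map to 1, 'b' to 2, and so on."""
    return ord(word[0].lower()) - ord("a") + 1

print(first_letter_hash("apple"))    # 1
print(first_letter_hash("avocado"))  # 1
print(first_letter_hash("banana"))   # 2
```

Given only the output 1, there is no way to tell whether the input was apple or avocado, which is the sense in which hashing is one-way.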
How about one more question, and then we'll take a short break
and then we'll come back and wrap up.
STUDENT: Is there any possibility to spoof the signatures?
DAVID J. MALAN: Short answer, no.
Like so long as you are using a standard that we believe
to be correct and not compromised, so long as your private key has not
been stolen by someone or no one's taken it off of your phone or your computer,
they should not-- it should not be possible to forge it.
The probability is so, so, so low, it should be the least of your concerns
is the idea.
Now it turns out, there is yet one other application
of this world of public key cryptography that solves a problem from last time.
Recall that we ended our first class on a note of emphasizing
that passwords and password managers can improve our security if used properly,
but there's another technology that's becoming increasingly available.
And it's colloquially called passkeys.
Or more technically, it's an implementation
of a standard called web authentication.
And it turns out that these passkeys, which
are available on certain platforms and certain websites and evermore
will be available soon quite shortly, they, too,
rely on public and private keys as follows.
And thankfully now, as fancy as the mathematics
we're alluding to today sound, there really are only two ways
to use these public and private keys--
to either encrypt with one and decrypt with the other or vice versa.
So we have just a fairly basic building block
that we can use in one direction or another.
So how do passkeys work?
In the near-future, as you will find, when
you go to certain websites or applications,
you probably will not be prompted as frequently to type in a username
and pick a password, which is to say, you
don't have to generate a hard-to-guess password,
you don't have to memorize a hard-to-guess password.
You don't have to even store a hard-to-guess password in a password
manager because passkeys eliminate passwords.
It moves us more toward a world of passwordless accounts.
Now how can that be?
Because up until now, we've been using usernames and passwords
to authenticate ourselves.
Well, it turns out, we humans have been getting really good at this math,
even if it doesn't feel like it today, we've
been getting really good at using mathematics
to solve these problems as well.
So imagine the following scenario.
When you go to a website in the future or app,
rather than being prompted to create a username and password,
you'll just be prompted to create a passkey.
What that means is your laptop or desktop or phone will probably
prompt you with some form of factor.
They'll ask you for your fingerprint or they'll ask you for a scan of your face
or maybe a pin code, a short number that you type in just
to demonstrate with high probability that you
are authorized to be using this device and creating this account.
What then will your device and the website do?
Your device will generate a public key and a private key
just for that one website or app.
Your device will send the public key to that new website, along with your user
ID or username, some identifying information
so that they know you're David or someone else.
But you don't send a password.
You only send to the website or app your public key.
And you keep private, within your browser
or some other piece of software, your corresponding private key.
And to be clear, this public-private key pair is used only for this one website.
You'll do this repeatedly, but automatically
for every other website in the world in this model.
So what happens when you're not registering for that website, which
you've just done, but you want to log into it tomorrow,
next week, or next year?
Well, assuming you still have that same device
or you're using some kind of cloud service
that synchronizes all of your passkeys, your public and private keys,
across devices--
so you haven't lost these passkeys, here's
how you would log in to the website tomorrow, next week, or next year.
The website would send you, when you visit, a challenge,
and a challenge is like some little message.
It's like a number or a word or a phrase.
It's some piece of randomly-generated data
that the website wants you to digitally sign.
Well, how do you digitally sign information?
I proposed earlier that you can use your private key
and pass that key and that challenge, which is just a random input given
to you by the website, into your digital signature algorithm, this black box.
And the output of that, as before, is your signature.
And what does your device do?
It sends that signature for that challenge to the website.
And if you followed along earlier well enough,
you might now realize where we're going with this.
How does the website now verify that that is, in fact, your signature?
That this did come from David's device and not some adversary online?
The website, because it stored your public key yesterday,
last week, or last year,
will use your public key to decrypt your signature
using the same algorithm to get back hopefully the same challenge value.
And if the output of this verification process
matches the challenge the website sent you a second before,
it must be the case mathematically that you
are, in fact, who you claim to be because it
was your device that registered for this website a day, a week,
a year ago as well.
So again, if we trust in the mathematics here
and we trust that these algorithms allow us to encrypt information and decrypt
it using a public key and private key, or conversely,
a private key and public key, we can, with very, very high confidence,
probabilistically say, yes, this is David Malan,
I'm going to allow him back into this account.
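The registration and login steps can be sketched as a challenge and a response in Python. This is a heavily simplified illustration with assumed toy RSA numbers and made-up function names. The actual web authentication standard signs structured data with a real signature algorithm and a hash, none of which is modeled here.

```python
import secrets

# Registration: the device makes a key pair and gives the website only the public half.
n, e, d = 3233, 17, 2753   # toy RSA numbers; d never leaves the device
website_record = {"username": "david", "public_key": (n, e)}

def website_make_challenge():
    """Website: pick a fresh random number for the device to sign."""
    return secrets.randbelow(n - 2) + 2

def device_sign(challenge):
    """Device: sign the challenge with the private key."""
    return pow(challenge, d, n)

def website_verify(challenge, signature, public_key):
    """Website: undo the signature with the stored public key and compare."""
    n_pub, e_pub = public_key
    return pow(signature, e_pub, n_pub) == challenge

challenge = website_make_challenge()
signature = device_sign(challenge)
print(website_verify(challenge, signature, website_record["public_key"]))  # True
```

Because the challenge is random each time, a signature captured during one login is of no use for the next one.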
So what's the implication of this passwordless world that
uses passkeys, or web authentication more technically?
It means that we're getting out of the business, potentially,
as a society of having to remember dozens
or hundreds or thousands of different passwords for all of our accounts.
It does require, though, that we don't lose the device or the devices that
registered for these websites or apps, but again, increasingly,
the world is providing cloud services, whether it's with Apple or Microsoft
or Google or others, that presumably can synchronize
your passkeys across devices, and we'll conclude ultimately today
by talking about how they can be synchronized securely, even
without Google and Microsoft and Apple knowing what your own passkeys are, so
long as they provide us with a certain technical guarantee.
So the upside of this is we can move away from passwords,
and you can even share these passkeys with other people if you so choose.
The catch is, right now, they're not omnipresently
available on every website out there.
It's probably going to take some time for the world to come on board,
but I do dare say, in the coming weeks, months, and years,
you will see passkeys increasingly offered to you.
And so indeed, the next time you visit a website
that asks you, hey, do you want to register with your fingerprint
or with your face or with a PIN code?
And you're never even asked for a password, odds are,
it's using this passkey technology instead.
Well, let's go ahead and take one more five-minute break here,
and when we come back, we'll talk about securing data
as it's moving back and forth and sitting on our own systems.
All right, so we are back.
And allow me to claim that we now have a bunch of ways
to hash data and also encrypt data and also now, decrypt data.
So how can we use these building blocks to solve
some other perhaps familiar problems?
Well, there's this notion of encryption in transit,
which is a fancy way of saying that you and I probably prefer nowadays
that our data be encrypted whenever it's traveling from point A
to point B. Whether that point B is Amazon.com, Gmail.com,
WhatsApp, or any other service that we're communicating with,
we ideally want no one in between us-- some machine in the middle, so
to speak, to be able to get at that same data.
Because in particular, what you should be worried about
is a scenario like this where if Alice is trying to communicate with Bob,
you might worry that there's some eavesdropper, so to speak,
named Eve between Alice and Bob.
And maybe this is via wires nowadays on the internet.
Maybe it's somehow wirelessly.
Maybe Eve actually represents a company that Alice and Bob
are communicating between, like Gmail or Outlook or the like.
So encryption in transit, though, is important to distinguish
from other forms of encryption.
In particular here, Alice might very well
have an encrypted connection not to an eavesdropper, per se, but just
a third party like Gmail.
So assume that Eve here is Gmail.
And meanwhile, Bob, when checking his email account,
has an encrypted connection to Eve as well, which, in this story now,
is Gmail.
So Alice has a secure connection to Gmail and Bob
has a secure connection to Gmail as well,
but that does not mean necessarily that Alice has a secure connection to Bob.
Security does not really work through transitivity, so to speak.
This might very well mean that the data is only
encrypted while in transit from A to E and from B to E,
but that doesn't mean that Eve, or Gmail in this story,
can't be reading all of Alice's and Bob's emails.
And indeed, that is technically possible on Google's end.
They, of course, run all of the servers that your Gmail accounts might be on.
There's nothing technically probably stopping them
from reading anything and everything.
Now hopefully they have policies.
Hopefully very few humans actually have the privileges or the authorization
to even do anything close to that.
But technically speaking, just because Alice has a secure connection to Gmail
and Bob has a secure connection to Gmail,
that doesn't mean that their communications will
be encrypted entirely between A and B. And there are lots of examples of this
as well.
Zoom, for instance, when it comes to video conferencing,
you might have an encrypted connection to Zoom,
I might have an encrypted connection to Zoom.
That does not necessarily mean that Zoom couldn't be Eve in this story
listening and watching everything that we're saying while video conferencing
as well.
So encryption in transit is good in that it at least keeps random people out
of the picture because they don't have access to these encrypted channels,
but if there is this third party, this machine in the middle
or company in the middle, even they might have access to data that we
do not want them to have access to.
So what, then, is a stronger alternative?
Increasingly possible, increasingly available, and something you as a user
should be looking for with greater frequency is what
we would call end-to-end encryption.
This is a stronger guarantee whereby you can
trust that Alice's connection to Bob is, in fact, secure
even if-- not pictured here, there are 1, 2, 3, 4 machines in the middle,
companies in the middle, eavesdroppers in the middle.
If you use encryption properly end-to-end,
you can ensure that the only thing Eve or Google or Zoom can see
is just your ciphertext, the seemingly random strings of text
or 0's and 1's that represent your encrypted data, but without your key,
they have no idea what that data actually is.
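To make that concrete, here is a minimal sketch of the idea, assuming Alice and Bob already share a secret key. It uses a toy one-time pad built from XOR, not the protocol any real service uses, and the names `xor_bytes` and `relay` are hypothetical. The point is only that the party in the middle handles ciphertext and nothing else.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching byte of key (toy one-time pad)."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def relay(blob: bytes) -> bytes:
    """Hypothetical middle party (Eve, or a mail or chat service).

    It forwards whatever it receives; ciphertext is all it ever sees.
    """
    return blob


# Assumption: Alice and Bob agreed on this key ahead of time,
# and the relay never learns it.
message = b"meet at noon"
shared_key = secrets.token_bytes(len(message))

# Alice encrypts before the message leaves her device.
ciphertext = xor_bytes(message, shared_key)

# The relay passes the ciphertext along; Bob decrypts with the same key.
received = relay(ciphertext)
plaintext = xor_bytes(received, shared_key)
```

Real end-to-end systems also have to solve how the two ends agree on keys in the first place, which this sketch skips entirely.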
So end-to-end encryption isn't necessarily in most
companies' best interest.
Why?
Well, companies like Gmail tend to presumably mine our data,
whether it's for advertising purposes or otherwise.
And so it's sometimes in companies' interest to have access to your data
to keep it secure on their servers, but still
in a way that they have access to it.
Now that might not be comfortable for you.
And so there are alternatives.
For instance, iMessage for Apple users and WhatsApp
internationally are known in particular for offering end-to-end encryption
which, if implemented truthfully and technically correctly,
should guarantee that even though your messages might
be going through WhatsApp servers, no employee at WhatsApp
can actually see your messages because it's encrypted
all the way from A to B, even though it's
going through a potential eavesdropper.
But that depends on exactly what form of encryption you're using,
and if it's not end-to-end, it might only
be encrypted in transit such that Eve, that eavesdropper,
might indeed have access to the data.
So as to how you can use end-to-end encryption,
it's an option that a service must provide to you in this case
or you must choose services that offer it.
It's not necessarily something that's always available,
but it is increasingly available in different software.
So let's now consider a fairly mundane operation,
but one that has implications for these same technologies and solutions.
That is, deleting a file, be it on your Mac or your PC
or your phone or some other device.
Now where is data stored in your devices?
Well generally, it might be in a device like this,
a large, somewhat older but large hard drive that
can store lots and lots of files and folders,
or perhaps something smaller known as a solid state
drive that might store information entirely digitally
without any moving parts.
And even smaller might be something like this
that you carry around like a USB stick, and they are even smaller nowadays,
too, that similarly stores some data digitally.
Now how do we go about deleting files from a computer or any
of these devices?
Well, you typically click it and drag it somewhere, or maybe you right-click it
or maybe you tap and drag it to some trash or the like.
There's any number of user interface mechanisms for deleting files,
but let's consider for our purposes what happens underneath the hood.
So let me stipulate that your hard drive, your solid state
drive, your USB stick just contains ultimately
a whole bunch of 0's and 1's, and those 0's and 1's represent your files
and folders.
So when you go about deleting a file, by dragging it
to the recycle bin on Windows, or dragging it to the Trash
Can on macOS, what actually happens?
Well, it turns out, not anything at all, really.
When you recycle a file on Windows or when you trash a file on macOS,
it doesn't actually get deleted in the sense that you and I might expect.
By delete it, I mean it's gone.
I don't want to be able to find it anywhere.
OK, wait a minute, though.
Of course, we all know by now, at least on computers,
you at least have to empty the Recycle Bin or empty the Trash Can.
So OK, maybe I missed that step.
But even then, contrary to what you might expect,
emptying the Recycle Bin, emptying the Trash Can also does not generally
delete the data.
And here's where I'd, again, emphasize, wait a minute,
when I delete a file, I want it gone, removed from my computer altogether.
But what macOS and Windows and operating systems in general tend to do instead,
when you even empty the Recycle Bin or Trash Can,
they don't actually get rid of the file, per se, they just forget where it is.
Somewhere in the computer's memory, there's
like a spreadsheet of sorts, some kind of database or table
with at least two columns, one of which has the name of your file
or the location of your file, the other of which
has some kind of reference to which 0's and 1's on your actual computer
implement that specific file.
Maybe these 0's and 1's are for one file, these 0's and 1's are
for another file, and so forth.
So somewhere, your computer is keeping track of what
is where physically on your computer.
But when you delete a file by emptying the Trash or Recycle Bin,
the computer just, eh, forgets where it is.
And more importantly, it frees up the space so it can be used later.
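That bookkeeping can be sketched with a toy model. Everything here is an assumption made for illustration: the "disk" is one flat run of bytes, the table maps a file name to a start position and length, and the names `file_table`, `free_space`, and `delete_file` are hypothetical. Real file systems are far more involved, but the behavior on delete is similar in spirit.

```python
# Toy model: the "disk" is a flat run of bytes, and the table maps each
# file name to the (start, length) of its bytes on that disk.
disk = bytearray(b"SECRET-REPORThello-world")
file_table = {
    "report.txt": (0, 13),
    "hello.txt": (13, 11),
}
free_space = []  # regions the system may hand out to new files later


def read_file(name: str) -> bytes:
    """Look up where the file lives, then return those bytes."""
    start, length = file_table[name]
    return bytes(disk[start:start + length])


def delete_file(name: str) -> None:
    """Forget where the file is and mark its space as reusable.

    The bytes themselves are left untouched on the disk.
    """
    region = file_table.pop(name)
    free_space.append(region)


delete_file("report.txt")

# The table no longer knows about report.txt, but its bytes are still there.
leftover = bytes(disk[0:13])
```

After the delete, `leftover` still holds the old contents, which is exactly the remnant problem being described.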
So what do I mean by that?
Well, suppose I do go ahead and delete a file
and empty the Recycle Bin or Trash Can, and suppose
that these yellow 0's and 1's represent the file that I no longer care about.
Well, what's actually going to happen underneath the hood, so to speak,
of the computer?
Well eventually, some of those yellow 0's and 1's might just
get reused for other files.
In other words, these 0's and 1's highlighted in yellow
represent a file that used to be there, but is not.
That is equivalent to saying some other file can now use those same
0's and 1's.
And so here's some random 0's and 1's that maybe overwrite some of the file,
but not all of it.
Notice, there's still a bunch of yellow 0's and 1's here
in my depiction of my computer.
So it turns out that over time, yes, your file will probably
get actually deleted.
What do I mean by that?
Eventually those 0's and 1's will be repurposed, changed from 1 to 0,
changed from 0 to 1 such that your file, for all intents and purposes,
is actually gone, because it's been repurposed, that space, altogether.
But notice, at least at this point in time,
and shortly after you delete a file, even if you've created or downloaded
new files, there might still be parts of your files
around, which means that sensitive Word document or Excel file or images
that you had on your computer, there might still be remnants of them,
just a few lines from any of those.
So you should realize that deleting a file doesn't really get rid of it
in the way you might expect or hope.
To do that, you need to be a little better with practices.
Now what do I mean by this?
Secure deletion is another beast altogether.
And typically when we delete files, they're not deleted securely.
They're not deleted typically in a way that you would hope.
So secure deletion does what you might really hope for, get rid of this file
altogether.
So if we go back to the original contents of my computer
with all of these here 0's and 1's, and suppose
that I want to delete this file here at the top of the screen,
in an extreme ideal world, those 0's and 1's would just be gone.
Like that's pretty darn secure.
Those bits, those 0's and 1's, they don't even exist anymore.
Now this is probably not the best way to securely delete information
because if I just got rid of those 0's and 1's somehow, like my hard drive
is getting like literally smaller and smaller
in terms of how much stuff I can put on it if I don't have as many bits
or 0's and 1's available.
So that's probably not the best long-term solution
because it's expensive.
It's like getting rid of some of my capacity.
So we don't actually do that, but how might we securely delete a file?
I don't think we want to just wait and hope that those 0's and 1's eventually
get reused by the system because we might still
be left with some remnants which might not be ideal.
So what we can do when securely deleting a file is something like this--
change all of the 0's and 1's that we don't care about anymore or want,
change them all to 0's.
And this will effectively securely delete the file
because now the 1's that were previously there
that represented some piece of information are just completely gone.
Or equivalently, I could change them all to 1's.
Or I could even change it to random 0's and 1's.
The point is, to securely delete a file, you
should change all of the 0's and 1's to at least some other pattern
so that the file is effectively gone.
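As a rough sketch of that overwrite step, the function below replaces a file's contents in place with zeros, or with random bytes, before the file is removed. The name `overwrite_in_place` is hypothetical, and there is an important assumption baked in: that the operating system and the drive really write the new bytes over the old ones. On solid state drives and some file systems that is not guaranteed, so treat this as an illustration of the idea rather than a reliable tool.

```python
import os
import secrets
import tempfile


def overwrite_in_place(path: str, random_fill: bool = False) -> None:
    """Replace every byte of the file with zeros (or random bytes)."""
    size = os.path.getsize(path)
    fill = secrets.token_bytes(size) if random_fill else b"\x00" * size
    with open(path, "r+b") as f:  # open for update without truncating
        f.write(fill)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the write to the device


# Demo on a throwaway temporary file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"sensitive contents")

overwrite_in_place(demo_path)

with open(demo_path, "rb") as f:
    after = f.read()  # now all zeros, same length as before

os.remove(demo_path)  # only now forget where the file was
```

Passing `random_fill=True` swaps the zeros for random bytes, which matches the third option mentioned above.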
Now how can you use this to your benefit?
Well, some operating systems nowadays support
what's called full-disk encryption, and this is good for a number of reasons.
One, if you enable a feature called full-disk encryption,
which is actually a specific incarnation of an idea known as encryption at rest.
Encryption in transit refers, of course, to your data going back and forth
from point A to point B. Encryption at rest
means it's just sitting there on your device, in your pocket, or on your lap
or on your desktop, sitting unused, maybe on or off.
So when it comes to full-disk encryption or encryption at rest,
you ideally want all of your data somehow encrypted on your Mac,
on your PC, on your phone.
And only when you log in with your password or maybe
your fingerprint or your face should that data be decrypted automatically,
and this can happen pretty darn fast nowadays with modern hardware,
so you can actually
use it and interact with that device.
So why is this advantageous?
Well, one, if your device gets stolen, so long
as you're not logged into it, so long as it's locked,
so long as the lid is closed, so long as it's unplugged or any other number
of scenarios, at least if someone takes your laptop from the table in Starbucks
or the cafe, well, hopefully, if you have
a good password or good biometrics, they're
not going to be able to get any of your data.
They can maybe delete all of your data and they can
sell your computer, they can use your computer, but they probably,
if you're practicing best practices, don't have access
to the data that's on the system.
Why?
Because it's completely encrypted at rest and they don't know your password,
they don't have your fingerprint, they don't have your face,
they should not be able to decrypt that data.
So in other words, if this is my unencrypted data,
the way I want it and need it when I'm using my computer,
full-disk encryption, at rest, would change my entire computer
to look random.
These are random 0's and 1's now that I generated by using,
for instance, my password or my fingerprint or my face.
And this is what your hard drive or your solid state drive
should look like when the lid is closed, when the power is off.
When you are logged out of it, it should be random 0's and 1's.
And the upside of this now is that, again,
if it's stolen while in this state, there's no data to be used
by the adversary because it looks like random 0's and 1's.
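Here is a minimal sketch of encryption at rest, assuming the key is derived from a password. The key derivation uses `hashlib.pbkdf2_hmac` from Python's standard library, but the cipher itself (`toy_cipher`, a hypothetical name) is a simple XOR against a keystream and is for illustration only. Real full-disk encryption relies on vetted ciphers and often dedicated hardware.

```python
import hashlib
import secrets


def derive_key(password: str, salt: bytes) -> bytes:
    """Stretch a password into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data with a keystream expanded from the key.

    Illustration only, not a vetted cipher. Applying it twice with the
    same key returns the original data.
    """
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


salt = secrets.token_bytes(16)  # stored alongside the encrypted data
disk_contents = b"tax-return.pdf, photos, saved passwords"

# Locking the device: what sits on the drive looks like random bytes.
key = derive_key("correct horse battery staple", salt)
at_rest = toy_cipher(disk_contents, key)

# Logging in with the right password recovers the original data.
unlocked = toy_cipher(at_rest, key)

# A thief guessing the wrong password gets only garbage.
wrong_guess = toy_cipher(at_rest, derive_key("guess", salt))
```

The design choice worth noticing is that the password never touches the data directly: it is first stretched into a key, which makes each guess slower for an adversary.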
Better yet, if you deliberately want to get rid of the device
because you want to trade it in for resale value,
because you want to donate it to someone else,
because you want to sell it to someone online,
when using full-disk encryption, the upside
is that so long as you had a really hard-to-guess password, your data is,
for all intents and purposes, securely deleted already.
Because unless the new buyer figures out or knows
your password or has your same fingerprint or has your same face,
they're not going to be able to access any of your data anyway.
And this is important nowadays because it turns out, with modern hardware,
even if you might want to change all of the 0's and 1's to all 0's or all 1's
or all random data, it turns out that today's hardware can fail over time.
So even little USB sticks or solid state drives over time can kind of wear out.
But they're smart enough, thanks to software
known as firmware inside of it, as soon as the device realizes, wait a minute,
those bits over there aren't working properly anymore,
the device might not let you change them to all 0's or all 1's or random 0's
and 1's anymore.
It might just leave them as is forever.
Which is to say, it's even more important to start
using full-disk encryption, encryption at rest,
when you first get a device because that way,
you can trust that even if parts of the device degrade over time,
all of the data that's there and has been there
was at least encrypted with one of your passwords or one of your biometrics
in the past.
So this is the kind of feature to look for in your Mac, your PC, or your phone
to ensure that it is somehow enabled.
Thankfully, once you log back in with your password,
it goes back to the original data and you can use it.
Of course, then, an implication of this best practice
is that if you lose your laptop or your phone
or your desktop's password, or your fingerprint somehow changes,
or your face sufficiently changes, you might be locked out
of all of your data, too, but again, that's
just another example of this trade-off between usability and security as well.
Now a downside, an evil side to full-disk encryption
is ransomware, which is how adversaries are monetizing attacks.
It's not uncommon nowadays for hackers, for adversaries,
when they get into a system, whether it's your laptop
or, for instance, a corporate network, or in some cases, hospital
systems or a city's own computer networks, to not try to do any damage
or just do something like spam or cryptocurrency mining,
but to actually encrypt all of the data on these systems they somehow
accessed online.
Why?
Well, if they encrypt all of the data they can then ask for a ransom
and say, listen, if you don't give me this many bitcoins,
I'm not going to give you the key that I used to encrypt your data.
And if you poke around online, there have been many examples of this,
unfortunately, where hackers have gotten into systems that were not
very well-protected, all of the data therein was encrypted,
and this is an opportunity for the adversaries
to try to extort, say, financial gain from a situation
by then only handing you the keys, if ever, once you've actually paid up.
And there, too, there's the risk, as in any ransom scenario,
where who even knows if they're going to give you the proper key in the end,
but this is increasingly a concern for municipalities, for companies,
for universities, and the like.
So just as we have some upsides here, there,
too, is this trade-off in what you can do.
And lastly, we thought we'd end on a note about the future
because this is a topic that will come up
and has come up over time, this topic of quantum computing.
So for those less familiar, we've been talking a lot
about bits, 0's and 1's today, and at the end
of the day that's how today's computer systems are implemented.
Patterns of 0's and 1's to represent numbers and letters and colors
and videos and sounds and everything.
We've been discussing today data more generally.
Now typically, in our world now, a bit, a binary digit, can either be a 0
or it can be a 1, as per the diagram we had on the screen in these examples.
Either a 0 or a 1.
In the world of quantum computing, thanks to some very fancy physics
and quantum mechanics in particular, it is possible,
it seems, physically, for us to implement the idea of bits a little bit
differently using quantum techniques.
And there's this idea of not just a bit, but a quantum bit or qubit whose power
derives from the reality that physically, you
can implement a qubit in such a way that it is representing both a 0
and a 1 at the exact same time.
So it can be not in just one state, so to speak,
one condition at once, but two states at once.
And if you have two qubits, they can be in four states at once.
If you have three, they can be in eight states at once.
If you have 32 of them, they can be in 4 billion states at once.
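The arithmetic behind those numbers is just doubling: each extra bit or qubit doubles the count, so n of them give 2 to the power n. A quick sketch, with the function name `states` being a hypothetical helper:

```python
def states(n: int) -> int:
    """Number of distinct patterns that n bits (or n qubits) can represent."""
    return 2 ** n


for n in (1, 2, 3, 32):
    print(n, states(n))
# prints:
# 1 2
# 2 4
# 3 8
# 32 4294967296
```

So the "4 billion" figure for 32 qubits is 4,294,967,296, rounded down.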
Now what's the implication of this?
Well, when we talk about cryptography, when
we talk about hashing, when we talk about just very large numbers
and trying to figure out via brute force or some other mechanism
what some input to a function was, if you have exponentially more computing
capabilities by being able to do not one or two
things at a time with individual bits, but two or four or eight or 4
billion things at once, it stands to reason
that if adversaries have access to quantum computing before you
and I do, then all of the security you and I now rely on
and that we've talked about today could suddenly become insecure.
Because we're trusting right now that it's just
going to take the adversary a lot, a lot,
a lot of time, maybe money, maybe resources,
maybe risk to attack our accounts.
But if they have exponentially more resources than you and me,
then our data really is at risk.
And all of the mathematics we've been trusting need to be hardened instead.
Now hopefully you and I will have access to quantum computing at the same time
as or ideally before all of these adversaries,
so hopefully our algorithms for securing information
will continue to evolve along with these technologies.
So this isn't necessarily something you need to worry about for now.
Indeed, I think after today, we have more than enough to worry about.
So for today, that's all.
We'll see you next time.
