Python LinkedIn
Inheritance
- [Instructor] One of the core concepts of object-oriented programming is the notion of
inheritance, and in this example, we're going to see how that works in Python. Inheritance
defines a way for a given class to inherit attributes and methods from one or more base
classes. This makes it easy to centralize common functionality and data in one place instead
of having it spread out and duplicated across multiple classes. Let's go ahead and open up our
inheritance_start file, and in this example, you can see that I have three classes. There's a
book, magazine, and newspaper class. Each one of these classes represents a type of
publication, and each of them has a set of attributes that are relevant to that publication type.
So books have a title and a price along with the author's name and their number of pages. A
newspaper also has a title and a price, but they are published on a periodical basis and have a
publisher instead of an author. Magazines have a title and price too and have a recurring
publishing period and a publisher, and you can also see down here that there's some code that
creates each kind of object, and then, accesses some data on each of them. So let's go ahead
and run this as is. Alright, and you can see that in the output, we have the book's author,
which is this line of code right here, we have the newspaper publisher's name, and then the
prices of each one. Now at the moment, each of these is a standalone implementation of its
own class, but let's go back to the code, and you can see that there's a considerable amount of
duplication among the data that each class holds. So, for example, all three classes have
attributes for title and price, and the newspaper and magazine classes also have the same
attributes for period and publisher. So we can improve the organization of these classes and
make it easier to introduce new classes by implementing some inheritance and class
hierarchy. So let's start with the most obvious duplication, which is the title and price
attributes. So one way we can handle this is by defining a new base class called publication,
and then, have that class define some common attributes. So I'll put my init function in here,
and we'll give that title and price parameters, and then we'll just set self.title equal to title,
and then the same thing for price. Alright, so now, we can fix the book class and have it
inherit from the publication class, and then we're going to put the name of the base class in the
parentheses here, and now what we need to do is call the superclass's init function, and
then, we can just take off the title and price, and then just have the book specific attributes in
the book class. Now we could do the same thing with newspaper and magazine classes, but
there's some duplication here too. Both of these classes have period and publisher attributes,
so that's a pretty good hint that we can collect those in a superclass too. So let's go ahead and
make another base class and we'll call this one periodical, and we'll have that class based on
publication, and then, once again, we'll create the init, and a periodical will take self, title, price, period, and publisher, so then we call the superclass's init for the title and the price, and then, we'll have the periodical class define the period and the publisher. Alright, so let's go
ahead and save that. Okay, so now, we have a class hierarchy with publication at the top, and
book inherits from that, and then we have periodical, and then, that inherits from publication
as well, and now we have to fix magazine and newspaper to inherit from periodical. So let's go
ahead and do that, so now we have the base classes, and what we're going to do is call the
superclass's init for each of these, and we'll pass in the title, price, period, and publisher, and
then we no longer need these, and then I can just do the same thing here for the newspaper
class. Alright, so now, we should be able to run our original code that creates these objects
and accesses the data without any changes, so let's go ahead and try that, and when I execute
the code, you can see that the output is the same as before. So we're getting the same results,
but with much better code organization, which is one of the main benefits of inheritance. So I
can now add properties that are specific to each kind of publication just in one place and I
only have one place to edit them if I want to change the names of any of these attributes going
forward.
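To recap the refactoring described above, here is a minimal sketch of the resulting class hierarchy. The class and attribute names follow the walkthrough; the sample values at the bottom are made up for illustration, and the exact parameter order in the course's inheritance_start file may differ.

```python
class Publication:
    def __init__(self, title, price):
        self.title = title
        self.price = price


class Periodical(Publication):
    def __init__(self, title, price, period, publisher):
        super().__init__(title, price)
        self.period = period
        self.publisher = publisher


class Book(Publication):
    def __init__(self, title, author, pages, price):
        super().__init__(title, price)
        self.author = author
        self.pages = pages


class Magazine(Periodical):
    def __init__(self, title, price, period, publisher):
        super().__init__(title, price, period, publisher)


class Newspaper(Periodical):
    def __init__(self, title, price, period, publisher):
        super().__init__(title, price, period, publisher)


# Sample objects (titles and values are illustrative)
b1 = Book("War and Peace", "Leo Tolstoy", 1225, 39.95)
n1 = Newspaper("NY Times", 6.0, "Daily", "New York Times Company")

print(b1.author)     # the book's author
print(n1.publisher)  # the newspaper's publisher
print(b1.price, n1.price)
```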
Abstract base classes
- [Narrator] So now that we've seen how inheritance works in Python, let's move on to a
related topic called abstract base classes. There's a fairly common design pattern in
programming where you want to provide a base class that defines a template for other classes
to inherit from, but with a couple of twists. So first, you don't want consumers of your base
class to be able to create instances of the base class itself. Because it's just intended to be a
blueprint. It's just an idea. And you want subclasses to provide concrete implementations of
that idea. And then second, you want to enforce the constraint that there are certain methods
in the base class that subclasses have to implement. And this is where abstract base classes
become really useful. So let's go ahead and open up the abstract_start file. Let's
imagine that we're building a drawing program that lets the user create different kinds of two
dimensional shapes. And we want the program to be extensible so that new shape types can be
added. So you can see here that I've defined a base class called graphic shape, and it has a
function called calcArea that is currently empty, right? There's no implementation here. And
then I have two subclasses, circle and square, both of which inherit from graphic shape. So
the scenario here is that we want each shape to inherit from graphic shape. We want to
enforce that every shape implements the calcArea function, and we want to prevent the
graphic shape class itself from being instantiated on its own. Now, if I run the existing code
that I have here, you'll see that none of these constraints are currently enforced. So I can
instantiate the graphic shape. And if I run this, you'll see that the calcArea function returns
nothing, right? Because we didn't override that in the subclasses. So to fix this, let's go back to
the code, I'm going to use the ABC module from the standard library. So what I'm going to do
is from ABC, I'm going to import ABC. And I'm going to use abstract method. Alright, so the
first thing I'm going to do is have graphic shape inherit from the ABC base class. And that
stands for abstract base class. Then I'm going to use the abstract method decorator to indicate
that the calcArea function is an abstract method. So this tells Python that there's no
implementation in the base class. And each subclass has to override this method. So now
you'll see if I run this, well, now I get an error from trying to instantiate the graphic shape
right? It says can't instantiate abstract class graphic shape. So let's go ahead and comment that
out. All right, and now let's try it again. Now I'm getting another error, right? It says that my
subclass didn't override the calcArea method. So now I need to fix that too. Let's go back to
the code. And so for the circle, I'm going to write def calcArea. And that's going to return
3.14, which is pi, right? Times the radius of the circle squared. And then for the square, I'll do
the same thing. And that's going to return self dot side times self dot side, okay? All right, so
now I've satisfied all the conditions. I'm no longer trying to instantiate the graphic shape by
itself, and now both of my subclasses, override calcArea. So let's run this again. And now you
can see that everything is working. So abstract base classes can be a very useful tool for
enforcing a set of constraints among the consumers of your classes. So it's worth taking the
time to experiment with these and understand their benefits.
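Here is a minimal sketch of the finished example as described above. The constructor parameters (radius and side) are assumptions for illustration; the course's abstract_start file may differ slightly.

```python
from abc import ABC, abstractmethod


class GraphicShape(ABC):
    @abstractmethod
    def calcArea(self):
        # No implementation here; every subclass must override this
        pass


class Circle(GraphicShape):
    def __init__(self, radius):
        self.radius = radius

    def calcArea(self):
        # 3.14 is used as an approximation of pi, as in the walkthrough
        return 3.14 * (self.radius ** 2)


class Square(GraphicShape):
    def __init__(self, side):
        self.side = side

    def calcArea(self):
        return self.side * self.side


# g = GraphicShape()  # would raise: can't instantiate abstract class
c = Circle(10)
print(c.calcArea())   # 314.0

s = Square(12)
print(s.calcArea())   # 144
```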
What are magic methods?
- [Instructor] In this chapter, we're going to spend some time
learning about Python's so-called magic methods, which are a set of methods that Python
automatically associates with every class definition. Your classes can override these methods
to customize a variety of behavior and make them act just like Python's built-in classes. Now
there's quite a few of these methods, and I'm not going to cover all of them. Instead, I'm going
to demonstrate the ones that are most useful and commonly employed. Using these methods,
you can customize how your objects are represented as strings, both for display to the user
and for debugging purposes. You can control how attributes are accessed on an object both
for when they are set, and for when they are retrieved. You can add capabilities to your
classes that enable them to be used in expressions such as testing for equality, or other
comparison operations like greater than or less than. And then finally we'll see how to make
an object callable just like a function, and how that can be used to make code more concise
and readable. Features like these are what gives Python its flexibility and power, and in this
chapter, we'll see examples of how they can be put to good use.
String representation
- [Instructor] The first set of magic methods that we're going to learn about are the ones that
Python uses to generate string representations of objects, and we got a bit of a peek at
this in the prior chapter when we worked on object composition, but we're going to see much
more of it now. So let's go ahead and open up the magicstr_start file, and you can see that I
have my book class defined with some properties, along with a couple of statements to create
some book objects. So there are two magic string functions, one is called str and one is called
repr. The str function is used to provide a user-friendly string description of the object, and is
usually intended to be displayed to the user. The repr function is used to generate a more
developer-facing string that ideally, can be used to recreate the object in its current state. It's
commonly used for debugging purposes, so it gets used to display a lot of detailed
information. So these functions get invoked on an object in a variety of ways. So for example,
when you call the print function and pass in the object, or when you use the str or repr casting
functions, these methods will get called. So let's run our code as it currently is before we
override these functions. You can see that I'm creating two book objects and then printing
them out. So let's go ahead and run this. And so here, in the output, you can see that when I
print each object, I just get a vague string that identifies the class name and its location in
memory. So let's make that a little bit better. Let's go ahead and add the str function, and you
can see that these are double underscore function names, indicating that they are Python
magic functions. So when I override the str function, I get to decide what the string
representation looks like. So I'm going to return a formatted string, in this case it's going to be
self.title by, and then it's going to be self.author, and it costs self.price. All right, so now let's
rerun the code. Now you can see that when I print these objects out, there's a much nicer
string representation of each book object, containing their realtime data. All right, so let's go
back to the code. Now let's add the repr function, and it's going to return a slightly different
string. So this is going to return a formatted string as well, and just a whole bunch of
properties. It's going to say title equals self.title, and then author equals, and then I'll just print
the author information, and then price. All right, and now let's add a couple more function
calls to convert the book objects to strings by using str and repr directly. So for the first
example, I'm actually going to call the str function on b1, and then I'll use repr on b2. All
right, so let's go ahead and save, and now let's run the code again. All right, so now you can
see that when I print the objects or call str directly, the str function gets used. And when I call
the repr function, that causes my double underscore version of repr to be used instead. So
each of these functions is totally optional for you to override, but it's usually a pretty good
idea to at least define the repr function for classes that you create in order to make debugging
easier.
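As a recap, here is a minimal sketch of the Book class with both magic methods overridden as described above; the attribute set and the sample titles are assumptions for illustration.

```python
class Book:
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

    def __str__(self):
        # User-friendly description, used by print() and str()
        return f"{self.title} by {self.author}, costs {self.price}"

    def __repr__(self):
        # Developer-facing detail, handy for debugging
        return f"title={self.title}, author={self.author}, price={self.price}"


b1 = Book("War and Peace", "Leo Tolstoy", 39.95)
b2 = Book("The Catcher in the Rye", "JD Salinger", 29.95)

print(b1)        # uses __str__
print(str(b1))   # uses __str__
print(repr(b2))  # uses __repr__
```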
Equality and comparison
- [Instructor] Plain objects in Python, by default, don't know how to compare themselves to
each other. But we can teach them how to do so by using the equality and comparison magic
methods. So, let's go ahead and open up the magiceq_start file. And let's see how to do this.
So, once again, we have a book class defined with some attributes, and then a few variables
that create some book objects. So b one and b three, both contain the same information. But
watch what happens when I try to compare them to each other. So, I'm going to write print,
and then b one double equals b three. Now, when I run this, you can see that the output is
false, which is weird, because all the attribute values are the same as each other. So, the title is
the same, the price is the same, the author is the same. The reason this happens is because
Python doesn't do an attribute by attribute comparison on objects. It just compares two
different instances to each other and sees that they're not the exact same object in memory.
And therefore it says, oh false, they're not equal to each other. So, Python's flexibility gives us
an easy object-oriented way of addressing this problem. The magic method named eq, gets
called on your object when it is compared to another object. So, let's go ahead and implement
that. And override the eq method, and that takes my object as well as the object it's being
compared to. So, to see if two books are equal, we can just compare the attributes of each one.
So, I'm going to return whether self dot title is equal to value's title, and self dot author is equal to the other object's author, and the price is equal to the other object's price. Okay. We should also
make sure that we throw an exception if we're passed an object that's not a book to compare
against. So, let me just do that right now. So, I'm going to say if not, is instance, and I'm going
to take the value that we were passed and compare it to the book class. And if it's not, I'm going
to raise a value error that says can't compare book to a non-book. Alright, so now we have
that code in place, we can perform the comparison again. And let's add one that we also know
is false. So, we'll print b one is equal to b two, and those attributes are different. So, the first
one should be true and the second one should be false. So, let's go ahead and run. Alright, sure
enough, we have true and we have false. So, what we've essentially done is add the equality
check behavior to our book object. And if I try to compare a book to something else, if I say
print b one, is equal to 42. Let's run that. And you can see that I'm raising a value error here
because I can't compare a book to something that's not a book. Alright, so we can also
perform other kinds of comparisons by overriding the corresponding magic method. So, let's
add the ability to perform comparisons to our book class. Suppose we wanted to be able to
perform a greater than or equal to operation like this. We want to be able to say b two is
greater than or equal to b one. Or suppose we wanted to be able to do a less than comparison.
Suppose we wanted to be able to say, hey, is b two less than b one. So, there are magic
methods that correspond to all the different kinds of logical operators. Greater than, less than,
greater than or equal to, so on. Now, that's a lot of methods. So, I'm not going to demonstrate
all of them. But let's go ahead and add support for both of these. So, I'll scroll back up, and I'll
override the greater than or equal to function. And that takes my object and the comparison
one. And then once again, we'll just copy that check to make sure that we're comparing to a book. And
what we'll do here is we'll just return self dot price is greater than or equal to value price. So,
let's do a comparison based upon price alone. And then let's do the same thing for less than, so
I'll just go ahead and copy this code and paste it in down here, change this to lt. And then once
again, I'll change this operator to less than. Okay, so now let's run that code. And let's
comment out our prior example for the equality check. So, I'm going to run this. Alright, and
we can see that b two's price is 29.95, and that is not greater than b one's price, which is 39.95, so
that evaluates to false. And b two is less than b one based upon price. So, now we have the
ability to do comparisons of book objects. And what's really neat about this is that, now that
we have added the less than support, we automatically gained the ability to have our books be
sortable. So, let's go back to the code, and let's make a quick list of our books in some random
order that we know is not sorted. Alright, so the built-in sort function uses the less than
operator to perform sorting. So, now we can do this, we can write books dot sort, and then I
can print out each book title in the newly sorted list. And I'll use a comprehension for this. So,
I'll print book title for book in books. Alright, so let's go ahead and comment that out. And
now when I run this code, we can see that the books are now all sorted in order from low to
high, based on price. And like I said, there are a lot of these methods you can implement in
your base classes, and they're documented in Python's data model. And that's this right here.
So, the documentation for the data model contains all of these methods. If you just go ahead and click on the Special method names heading over here and scroll down, you'll see a lot
of these special magic method names. These are all the comparison ones right here. So, just
go ahead and click on that Special names link in the sidebar, and you can read through these
at your own pace.
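Pulling the steps above together, here is a minimal sketch of the comparison support described; the sample books and prices are illustrative.

```python
class Book:
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

    def __eq__(self, value):
        if not isinstance(value, Book):
            raise ValueError("Can't compare book to a non-book")
        return (self.title == value.title
                and self.author == value.author
                and self.price == value.price)

    def __ge__(self, value):
        if not isinstance(value, Book):
            raise ValueError("Can't compare book to a non-book")
        return self.price >= value.price

    def __lt__(self, value):
        if not isinstance(value, Book):
            raise ValueError("Can't compare book to a non-book")
        return self.price < value.price


b1 = Book("War and Peace", "Leo Tolstoy", 39.95)
b2 = Book("The Catcher in the Rye", "JD Salinger", 29.95)
b3 = Book("War and Peace", "Leo Tolstoy", 39.95)

print(b1 == b3)  # True: every attribute matches
print(b1 == b2)  # False
print(b2 >= b1)  # False: compared by price alone
print(b2 < b1)   # True

# Because __lt__ is defined, a list of books can be sorted by price
books = [b1, b3, b2]
books.sort()
print([book.title for book in books])
```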
Attribute Access
- [Instructor] Python's magic methods also give you complete control over how an object's attributes are
accessed. Your class can define methods that intercept the process any time an attribute is set
or retrieved. So let's open up the magicattr_start file and we'll see how this works. So here in
my code, I have the Book class which defines some attributes and also overrides the __str__
function that we saw earlier in the chapter to print out a nice version of the object. So let's
start off by seeing how we can control access when an attribute's value is retrieved. So Python
lets us define a magic method called __getattribute__ which is called whenever the value of
an attribute is accessed. So what I'm going to do is override this function. I want
__getattribute__. And that is going to be passed, my object and the name of the attribute. So
this gives us the opportunity to perform any operations on the value before it gets returned. So
for example, you can see that we have an internal attribute named discount. So let's imagine
that we wanted to automatically apply the discount whenever the price attribute is retrieved.
So first, we can check to see if the name argument is equal to price, so we know that that's the
price attribute being accessed. And now, here's where things get tricky. So since we're already
inside the function that's going to get called whenever an attribute value gets accessed, we
can't just refer to this object's attributes by name because then this function will just get
recursively called over and over again and it will eventually crash. So what we need to do is
get the value of the current price by calling the superclass version of __getattribute__. So I'm
going to write p =, and then I'm going to call the superclass' __getattribute__, and I'll ask for
the price, and then I'll do the same thing for the discount, and I'll name that d, and we'll put in
_discount. And then, I'll just calculate the discounted price. So I'll return p minus p times d. And if
we're not operating on the price attribute, we'll just go ahead and call the superclass'
__getattribute__ so everything just works normally. So I'll just call __getattribute__ with the
name. All right, so let's scroll down and you can see I've got some code that creates a couple
of Book objects. Let's go ahead and try setting the price of book one to 38.95, and then let's go
ahead and print b1. And remember, the print statement will trigger the __str__ output. So I'll
run this. And we can see, in the output, that the price of the book is reduced by 10% but,
clearly, the other attributes, the title and the author are unaffected. So we can also control the
setting of an attribute by overriding the __setattr__ function, so let's try that next. So here, I'm
going to override __setattr__, and that takes a reference to my object, the name of the attribute
and the new value. So in this case, let's use __setattr__ to enforce that when the price attribute
is set, the caller is using a floating point number. So once again, we'll check to see if name is
equal to price, and if it is, we'll check to see if the type of the value that we were passed is not
float. Then, we're going to raise an error. And we'll just say, the price attr must be a float.
Okay, so now we're going to raise an error if the value that we were passed is not a floating
point number. So if we pass that test, then we'll just call the superclass' version of __setattr__.
So we'll just go ahead and return. And once again, we'll call the superclass, and we'll pass in
the name and the value. All right, so once again, let's try this out. I will comment out my
previous example, and I'll set the price of book two to be 40, as an integer, so that should
cause an error, and then, we'll print b2. So I'm going to save and I'm going to run this. And
when I run it, you can see that I'm getting my error right here, it says the price attr must be a
float. So let's go back to the code. Now, I could either fix this by making this a
floating point number or I can cast it to a floating point number by using the float function.
All right, let's save and run it again. And now you can see it works fine, right? So I set it to be
40, and then 10% off of 40 is 36. All right, let's do one more example. Let's go back to the
code. There's another magic method that lets us customize the retrieval of attributes, but it
only gets called if the given attribute doesn't exist. It's called __getattr__, as opposed to __getattribute__. So let's go ahead and give that a try. So I'll define that method. And that's
going to take self and name. All right, so this version of the function only gets called when normal attribute lookup fails, either because __getattribute__ raises an AttributeError or because the attribute doesn't actually exist. So let's go ahead and comment out my overriding of __getattribute__.
And then, let's implement this version. So all we're going to do is return the name and we're
going to say, is not here. All right, and then, let's try to access an attribute that we know
doesn't exist. So I'll comment this out and I'll just try to print out, you know, b1., and I'll call it
randomprop. All right, so let's go ahead and run this. And you can see that now I'm getting the
output, randomprop is not here. Now, you could use this to generate attributes on the fly, or to extend the syntax for accessing attributes, but just like with the other attribute methods, you need to be careful that you don't enter into a recursive loop. But by using these
attribute methods, Python gives you a great amount of flexibility and control over how
attributes are retrieved and set in your classes.
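Here is a minimal sketch combining the three attribute hooks discussed above. Note that in the walkthrough __getattribute__ was commented out before demonstrating __getattr__; in this sketch they coexist, which works because __getattr__ is only invoked after the superclass lookup raises AttributeError. The 10% _discount value and the sample data are assumptions.

```python
class Book:
    def __init__(self, title, author, price):
        super().__init__()
        self.title = title
        self.author = author
        self.price = price
        self._discount = 0.10  # assumed 10% internal discount

    def __str__(self):
        return f"{self.title} by {self.author}, costs {self.price}"

    def __getattribute__(self, name):
        # Called for *every* attribute access
        if name == "price":
            # Go through the superclass to avoid infinite recursion
            p = super().__getattribute__("price")
            d = super().__getattribute__("_discount")
            return p - (p * d)
        return super().__getattribute__(name)

    def __setattr__(self, name, value):
        # Enforce that price is always set as a float
        if name == "price" and type(value) is not float:
            raise ValueError("The 'price' attribute must be a float")
        return super().__setattr__(name, value)

    def __getattr__(self, name):
        # Only reached when normal lookup fails with AttributeError
        return name + " is not here!"


b1 = Book("War and Peace", "Leo Tolstoy", 39.95)
b1.price = 38.95
print(b1)              # price is shown with the discount applied
print(b1.randomprop)   # "randomprop is not here!"
# b1.price = 40        # would raise: must be a float; use float(40) instead
```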
Callable objects
- [Instructor]
To finish up this chapter, we'll take a look at the magic method that enables Python objects to
be callable just like any other function. Now that might sound a little bit weird, but it's easy to
understand when you see it in action. So let's go ahead and open up the magiccall_start file,
and once again, you can see that I have my book class, and it's already implementing the str
magic function that we learned about earlier. And I've got some code that creates a couple of
book instances with titles, and authors, and prices. So what I'm going to do is this, I'm going
to override the call function that lets me treat the instance of the object like a function. And
I'll define the function to take the same parameters as the init function. Now you can also
define the function to take a variable number of arguments, but that's a little more advanced,
and I want to focus here on the feature itself. So I'm going to define call, and this method will
be invoked when I call my object. So it gets a copy of the object, and then I'll pass in the title,
the author, and the price, and then for the function body, I'll just assign the parameters to the
object attributes, pretty much just like I have here in my init function. So I'll just go ahead and
do that. Okay, so now let's try this out. So first what I'm going to do is print book1's values,
then I'm going to call the object like a function to change the values of the object's attributes.
So I'll write b1, and then I'm going to call it, just like a function. And this time I'll change the
title to something else, and then author's going to stay the same, and the price is going to
change, and then I'll print the book again, and remember, this will use the str method to do so.
Okay. So let's try this out. And when I run the code, you can see here that I'm changing the
value of the object's attributes by calling the object as if it were a function. So here's the
original version, and here is the version after I've changed the values. And that's one of the benefits of this technique: if you have objects whose attributes change frequently, or are often modified together, it can result in more compact code that's also easier to read.
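Here is a minimal sketch of the Book class with __call__ added as described above; the sample titles and prices are illustrative.

```python
class Book:
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

    def __str__(self):
        return f"{self.title} by {self.author}, costs {self.price}"

    def __call__(self, title, author, price):
        # Invoked when the instance itself is called like a function
        self.title = title
        self.author = author
        self.price = price


b1 = Book("War and Peace", "Leo Tolstoy", 39.95)
print(b1)

# Calling the object re-assigns several attributes in one step
b1("Anna Karenina", "Leo Tolstoy", 49.95)
print(b1)
```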
Defining a data class
- [Instructor] As we've been working through this course, you may have noticed a pattern with
each of our examples so far. And that is that one of the main use cases for creating classes in
Python is to contain and represent data. Our code creates classes like this book class here in
the data class start file and then uses the init function to store values on the instance of the
class. And you might be wondering, well, if this is such a common pattern, then why doesn't
Python just automate this? Why do I have to explicitly store each argument on the object by
setting attributes on the self parameter? Well, starting in Python 3.7, you actually don't. In 3.7,
Python introduced a new feature called the data class, which helps to automate the creation
and managing of classes that mostly exist just to hold data. And that's what we're going to
focus on in this chapter. You can read more about data classes at this link in the Python docs.
But for now, let's go back to our code, and let's begin by converting our book class into a
version that uses a data class. So to do that, the first thing we need to do is import the data
class from the data classes module. So from dataclasses we're going to import dataclass. And
again, this only works in 3.7 and later, so make sure you have at least that version of Python
on your computer. Next, we're going to use the dataclass decorator to indicate that the book
class is going to be a dataclass. Then, we get rid of the init function, and let's go ahead and fix
the indenting here. And we're going to get rid of each of these self keywords, because there's
no more self keyword right now. And then we need to annotate each of these attributes with
the new type hints that were introduced in Python 3.5. So I'm going to get rid of each of the
equal signs, I'm going to write title, and then colon str. And then the author is also going to be
a string, so that's a str. Pages is going to be an int. And price is going to be a float. And guess
what? That's pretty much all we have to do. Now, there's quite a bit going on here, so let me
explain. At first glance, it looks like what we're doing is defining class attributes instead of
instance attributes. But what's going to happen behind the scenes is that the dataclass
decorator code will actually rewrite this class to automatically add the init function where
each of these attributes will be initialized on the object instance. And the second thing you
notice here is these type hints. These are required for data classes to work. But in keeping
with Python's philosophy of being flexible, their type isn't actually enforced. So you can see
here, we have some existing code that creates some book instances and then accesses some
fields. And we don't even need to change our existing code as long as the parameters are
passed in the same order and they are. We've got title, author, pages, and price. So everything
looks fine. So let's go ahead and run what we have. I'm going to save and then run this. So
here in the output you can see we've got the title and the author of what is that, book one and
book two, right. So, we can access attributes on the object just as we could before. But
dataclasses have more benefits than just concise code. They also automatically implement
both the repr and eq magic methods we learned about earlier in the course. So, for example, I
can add a print statement here to just print out book one. And we can compare two objects
with each other, so I'm going to make another book that has the same values as book one. And
I'll call that book three. And then I'll add a comparison to see if book one is equal to book
three. All right, so, now let's go ahead and run again. And you can see that the printed output
of the book object automatically contains all of the data attributes and the equality comparison
also just automatically works. All right, so one more thing to demonstrate for our intro to data
classes. So they are just like any other Python classes. There's nothing really special about
them. They're a Python class just like any other Python class. So if I want to add a regular
Python method to my class, it's really straightforward to do so. So I can just go ahead and
create a method called bookinfo, and I'll just return some formatted string that contains
self.title and self.author. And then let's go ahead and modify some attributes of one of the
books. And then let's go ahead and call that method, b1.bookinfo. All right, so let's go ahead
and comment out some of these others. Okay, so let's go ahead and run this. And there you
can see that our book info method works just as expected. So data classes let you write a lot
more concise code and skip a lot of the boilerplate that comes along with the init method and
initializing object instances but at the same time, they're just regular Python classes and you
can use them just as you would any other Python class.
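Here is a minimal sketch of the converted Book data class, including the regular bookinfo method mentioned above (requires Python 3.7 or later); the sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    author: str
    pages: int
    price: float

    # Data classes are ordinary classes, so regular methods work too
    def bookinfo(self):
        return f"{self.title}, by {self.author}"


b1 = Book("War and Peace", "Leo Tolstoy", 1225, 39.95)
b2 = Book("The Catcher in the Rye", "JD Salinger", 234, 29.95)
b3 = Book("War and Peace", "Leo Tolstoy", 1225, 39.95)

print(b1.title, b2.author)
print(b1)          # __repr__ is generated automatically
print(b1 == b3)    # __eq__ is generated automatically: True
print(b1.bookinfo())
```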
push()
- [Female Instructor] The first method we can flesh out is this push method, and this is the
method that's going to allow us to add an item onto the stack. We've already specified the item
here and so now we just need to append that to the list that we've created above. So we'll just
say self dot items dot append and we'll pass in the item that we want to add. Now since this is
a learning exercise, I do think it's a good idea that we add docstrings to each of these
methods. So I'm just going to write a brief note about what this method does. So I'll say, it
accepts an item as a parameter, and appends it to the end of our list. And it returns nothing.
Now we talked earlier about how appending to the end of a list happens in constant time. So
let's make a note about the run time as well and the reason I think it's important to do this is
data structures are often covered in interviews and typically, your interviewer will ask you
about the run time. So I'm just going to say that the run time for this method is O(1), or
constant time, because appending to the end of a list happens in constant time. And the reason
this is constant time is that what's actually happening is Python is going directly to the last item of the list. It's indexing to the last item, and indexing into a list, as you may know, happens in constant time. So there we go. To test that my new push method actually works,
we can run that Python file interactively in the terminal. I'm using python3, and the way that I
can run that is like this. And so, now all of the code that I just wrote is available in my
terminal. So I can create my own stack object and, at this point the only method that we have
is the push method. So I can try that out by passing in apple, a string containing the word
apple. Now remember that the variable that we're using here is items. So I can go back to the
terminal, type in my stack dot items, and that will show me now what is in my list. If we add
another item, let's say, banana. So we'll do my stack dot push and we'll pass in a string
containing the word banana. I expect that banana will appear to the right of apple. Let's see if
that's correct. It is, and the reason this is the case is because we always consider the right side
of the list to be the top of the stack, and we can only add to and remove from the top, so every time we add something else, it's always going to show up on the rightmost side of that list.
We'll do my stack dot items again to check and there we go.
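Here is a minimal sketch of the Stack class as built so far, assuming the items list is initialized in __init__ as described earlier; the fruit strings mirror the examples above.

```python
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        """Accepts an item as a parameter and appends it to the end
        of the list. Returns nothing.

        The runtime is O(1), or constant time, because appending to
        the end of a list happens in constant time.
        """
        self.items.append(item)


# Interactive-style usage (e.g. after running `python3 -i` on the file)
my_stack = Stack()
my_stack.push("apple")
my_stack.push("banana")
print(my_stack.items)  # ['apple', 'banana'] - the right side is the top
```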
pop()
- [Instructor] The second method we can work on is our Pop Method. And this is the method
that's going to allow us to remove an item from the top of the stack. Now, we know already
that the list data type has a built in Pop Method. So, we're just going to make use of that. So,
we will return whatever value is given to us by self.items.pop. Now, we don't have to specify
an index here, because we always are going to be wanting it to return us the last item from the
list. That's how the Pop Method works. If you don't give it any parameters, it will always
return the last item. That's pretty much it. Let's add another docstring, as we did for Push.
So, what does this method do? This method returns the last item. We should say, removes and
returns the last item from the list, which is also the top item of the stack. And just as we did
before, we should talk about the run time here, as well. And the run time is also constant time,
just as it was for Push. Because really, all we're doing is, we are indexing to that last item in
the list and then returning it. So, all it does is index to the last item of the list. Let's go ahead
and test this out in the terminal. I will again run Python 3 interactively, and I will create my
own stack object. Now, at this point we have two methods. We have Push and Pop. And I
want to try the Pop Method without having pushed anything first, to see what's going to
happen. So, I'll just do my_stack.pop and I'm kind of expecting an error to happen here. And
that's what happens. The IndexError says that I'm trying to pop from an empty list, or in other
words, I'm trying to remove something from a list that doesn't have anything in it. Now, what
this tells me is, this is an opportunity to go back into my code and create a condition to look
for this. Let's add in that condition statement here. So, what I want to say is, as long as there
are items in the list, give me the last item. To do that Pythonically, I can say, if self.items, and
this basically means if there is something in self.items, return the popped value. Otherwise,
we may want to just return none. So, I'm going to save this. We'll go back to our terminal. I'll
have to restart my interpreter, but that's okay. I'll do python3 -i and then the name of my file.
Okay, so, we'll recreate our stack object. And now, let's try calling that Pop Method again.
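Here is a minimal sketch of the pop method with the empty-list guard added, shown alongside push from the previous section:

```python
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        """Removes and returns the last item from the list, which is
        also the top item of the stack.

        The runtime is O(1), or constant time, because all it does is
        index to the last item of the list.
        """
        if self.items:
            return self.items.pop()
        return None


my_stack = Stack()
print(my_stack.pop())  # None - popping an empty stack no longer raises
my_stack.push("apple")
print(my_stack.pop())  # 'apple'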
peek()
- We
now have the basic functionality for our stack class because we have coded the push and pop
methods. So let's move on to these other three extra methods if you will. The next one is
called peek. And what this is going to do is show us what the next value is that's ready to be
popped. So in other words, this should show us the item that's on the top of the stack. We
want to return that item as well. So all we need to do here is we just need to return whatever
value or whatever item is in the last index of the list. So that would be self.items and then we
can index into the negative first position. So I'm going to save this and we'll go back to the
terminal here, fire up the interpreter. I'm going to create another stack object and I'll just push
one thing onto it this time. Going to stick with our apple example. So now if I call peek I
expect it to show us apple. And it does. But what would happen if I tried to call peek on an
empty stack? We can test that out ourselves. So I'm going to pop off apple. I can double check
that my stack is empty now. And it is. So if I call my_stack.peek, at this point I get an error.
The reason I get an error is because I'm trying to index into the last item of a list but the list
doesn't have anything in it. So let's add a condition for that into our code here. It'll be pretty
similar to the condition that we used in our pop method. I want to say, as long as there are
items in that list, so if self.items, then return the last item in the list. Otherwise, just return
none. So I'll save. Let's try this one more time. Forgot to re-run my interpreter here. There we
go. So my stack equals and we don't have to add anything this time, right, cause we just want
to test what happens to our peek method on an empty list. We should get nothing. And that's
what happens. Cool. So let's go back to our code and we'll add in the docstring and a note
about the run time as we have been. Alright, so this method returns the last item in the list
which is also the item at the top of the stack. Clean up my capitalization here. Here we go.
And let's make a note about the run time. So again since all we're really doing is indexing into
a list and we know that indexing into a list is done in constant time, that's the only thing we
need to say here. So this method runs in constant time because indexing into a list is done in
constant time.
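Here is a minimal sketch of the Stack class with peek added as described above (the other extra methods mentioned are omitted here):

```python
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        if self.items:
            return self.items.pop()
        return None

    def peek(self):
        """Returns the last item in the list, which is also the item
        at the top of the stack.

        This method runs in constant time because indexing into a
        list is done in constant time.
        """
        if self.items:
            return self.items[-1]
        return None


my_stack = Stack()
my_stack.push("apple")
print(my_stack.peek())  # 'apple'
my_stack.pop()
print(my_stack.peek())  # None - peeking an empty stack returns None
```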
Queue
Queues as a linear abstract data type
- [Instructor] Let's talk about what makes the queue class unique,
especially in comparison to stacks. Here's a high level definition of what a queue is. Queues
hold a collection of items in the order in which they were added. Items are added to the back
of a queue and removed from the front of the queue. If you think of a queue as a line of
customers, a customer adds themselves to the end of the line and eventually leaves from the
front of the line. This is called first-in first-out or FIFO. Note that this is subtly different from
stacks which are last-in first-out. Queues also preserve order so when you think about people
joining a line, the order is preserved there as well. Just as we did for the stack class, we can
see if there is a built-in data type we could use to implement the queue behind the scenes.
Again, we can use a list because a list is an ordered collection of items that we can modify.
When we code our queue class, we're going to use the right side or end of a list to represent
the front of the queue. We do this because we can remove an item from the end of a list in
constant time. So if we make the end of the list match the front of the queue, we can always
remove items from the queue in constant time. This means we'll have to use the left side of
the list as the rear of our queue. Consequently, we'll be adding items to the queue in linear
time. It's linear time because when we insert an item into a list's zeroth index, it causes all the
other items to have to shift one index to the right. The larger our queue gets, the longer it will
take to enqueue an item into it. We could have arbitrarily chosen to represent the queue as a
list where we remove from the left side in linear time and add from the right side in constant
time. We'd still wind up with one operation being O(1) and the other operation being O(n). The basic functionality for a queue is getting items into and out of the list we are using
as a representation of the queue. When we're talking about adding items to this list, the word
we use is enqueue. And when we're talking about removing items, the word we use is
dequeue. Any data type that can be stored in a list can be stored in a queue. And we call a
queue limited access because we can only access the data from one place which is the front of
the queue. We may also want to know whether or not a queue is empty, how many items are
in it, or which item is at the front of the queue. We'll create a method for each of those
capabilities later on. The queue linear data structure is especially useful when we need to
process information in the same order in which the information became available. For
example, a print queue, so that documents are printed in the order in which they were sent to
the machine. Queues have another unique property about them and that is that they are also
recursive data structures. A queue is either empty or it consists of a front item and the rest of
which is also a queue. We won't discuss recursion in this course but knowing that a queue is a
recursive data structure may come in handy in an interview.
enqueue()
- [Instructor] Let's start coding the minimal functionality we need for our Queue class. We'll
start with the enqueue method. Again, as we talked about, we have to pass in as a parameter
the item that we want to add into the queue, and in the body of our method, we're going to
need to do something with that. Now we know that we're going to be operating on our empty
list, which is called items, but instead of appending the item to the end of the list, like we did
for a stack, we want to now insert the item into the zeroth index of the list. And we talked
about how the reason we do that is because we want to sort of save the end of the list for
popping and use the front of the list for inserting. So I'll save that. We can go back to the
terminal. I'll start up the Python interpreter here, and we can try this out. I'll create my own
Queue object first, and next I will enqueue an item into it. So my_q.enqueue(), and let's add in
a string containing the word apple.
Measuring algorithm performance
- [Narrator] Because algorithms are designed to work on sets of data
and solve computational problems, it's important to understand how to talk about algorithm
performance. This is an important factor in how you choose a particular algorithm to solve a
programming problem as well as understanding how your program will behave under
different circumstances. So what we want to do is measure how the performance of an algorithm changes based on the size of the input set of data. You'll often hear a term called
Big-O notation used to describe algorithm performance. This notation format is used to
describe how a particular algorithm performs as the size of the set of input grows over time.
And the reason the letter O is used is because the growth rate of an algorithm's time
complexity is also referred to as the order of operation. It usually describes the worst case
scenario of how long it takes to perform a given operation. And it's important to note that
many different algorithms and data structures have more than one Big-O value. Data
structures, for example, can usually perform multiple types of operations such as inserting or
searching for values, each of which has its own order of operation. So let's take a look at
some of the common Big-O notation terms to see what they mean in real-world scenarios.
And as we go through the course we'll see many of these in action, so don't worry too much
about understanding them completely right now. So I've listed each of these items in
ascending order of time complexity. The simplest example is what's called constant time, and
that corresponds to a Big-O of one. And essentially that means that the operation in question
doesn't depend on the number of elements in the given data set. So a good example of this is
calculating whether a number is even or odd, or looking up a specific element index in an
array. Next is the order of log n which is called logarithmic time. And a typical example of
this kind of operation is finding a specific value in a sorted array using a binary search. So as
the number of items in the sorted array grows, it only takes a logarithmic time relationship to
find any given item. The next step up from there is linear time which corresponds to a Big-O
of n, and this level of time complexity corresponds to a typical example of searching for an
item in an unsorted array. So as more items are added to the array in an unsorted fashion, it
takes a corresponding linear amount of time to perform a search. After that we have order of n
times log n or what's called log-linear time complexity. And some good examples of this kind
of operation are some sorting algorithms like heap sort and merge sort. And then there's order
of n squared, which is called quadratic time complexity, and as you've probably guessed it's
not a very good level of performance, because what that means is that as the number of items
in the data set increases, the time it takes to process them increases at the square of that
number. So some examples of this type of operation are some of the simpler sorting
algorithms like the bubble sort and the selection sort. Now this is not an exhaustive list of the
various types of time complexity rankings. Believe it or not, there are actually some worse
ones than the quadratic time scale. But this is a good representative list of the ones that you
are likely to encounter when programming, and it's a good way of comparing the different
performance levels to each other. And again, we'll see more of these as we go through the
course.
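As a rough illustration of a few of these complexity classes in Python, here is a small sketch; these example functions are not from the course files, just common stand-ins for O(1), O(n), and O(log n) operations.

```python
from bisect import bisect_left


def is_even(n):
    # O(1), constant time: doesn't depend on the size of any collection
    return n % 2 == 0


def linear_search(items, target):
    # O(n), linear time: may have to look at every element of an unsorted list
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1


def binary_search(sorted_items, target):
    # O(log n), logarithmic time: repeatedly halves a sorted list
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1


data = list(range(1000))
print(is_even(42))               # True
print(linear_search(data, 999))  # 999 (worst case: scans the whole list)
print(binary_search(data, 999))  # 999 (roughly 10 comparisons)
```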
The volume, velocity, and variety of big data
- [Narrator] The one thing we know absolutely for sure about big data is that
it's really big. There's a lot of data, it's big. But you know, what counts as big data changes with the
times. Once upon a time, a bunch of punch cards might've been big data or back in 1969, this was a
massive amount of programming. This is Margaret Hamilton with the code for the Apollo Guidance
Computer for which she eventually won the Presidential Medal of Freedom. And then what might be
big at one time becomes normal or even small at another. So for example, back in 1992, I got my first
computer, an Apple Macintosh Classic II with the optional larger 80 megabyte hard drive. I got all the
way through grad school. I wrote my PhD thesis on that computer and now I have a relatively modest
MacBook Pro, and when I'm home it's connected to ten terabytes of external storage. That's 125,000
times as much storage as my first computer and truthfully, I've bumped up against the limits. So it's
massive by comparison to my 1992 self but it's middling for consumers and really puny by any
commercial standards. And so, if we want to think about really, what do we mean by big data given
this relative shifting frame of reference, well, one thing that's pretty consistent is the definition that
relies on what are called the three Vs of big data. And so, the first V, the first characteristic of big
data, is volume, simply meaning there's a lot of it. And the second is velocity, having to do with the
speed with which the data arrives. And the third one is the variety or the nature and the format of
the data. Take those together and you have what most people would consider big data and I want to
talk about each of these a little bit separately. So volume is the most obvious one. This is when you
have more data than fits in a computer's RAM, its memory, or maybe you have more data than fits
on a single hard drive and you have to use servers and distributed storage. I mean think, for example,
about the data on Facebook's 2+ billion users. You can't put that on a single computer. Or the
information on Amazon's 120+ million items that they sell online. Keeping track of all that is
obviously going to overwhelm any one computer, or even a collection of computers. Next is velocity.
You know, a gentle breeze through the trees is nice but a hurricane is a whole other situation. The
velocity refers to data that comes in rapidly and changes frequently. So think about, for instance, it's
been estimated that nearly 200 million emails are sent each minute of each day. Or that five billion
videos are watched on YouTube every day. If you're trying to keep track of this stuff as it happens, it's
going to be a completely overwhelming job. And then the third V is variety. Data comes in a lot of
different formats and you can have video, photos, audio, you can have GPS coordinates with time
and location, you can have social network connections between people, and all of these represent
distinct kinds of data from the regular rows and columns of numbers and letters that you would
expect to find in, like, a spreadsheet. And all of these require special ways of storing, managing,
manipulating, and analyzing the data. Taken together, those usually constitute big data. Another way
to think about it is big data is data that is hard to manage. It's the idea like some animals are a little
more challenging to deal with and have not been domesticated, like the zebra. Big data's a little like
the zebra of the data world. It's simply not easy to work with, not through conventional standards,
and you're going to have to be very adaptable to get the value out of that data. Now, I do want to say
something historical about the term. This is Google Trends data on the search popularity for the term
"big data" on Google, and we have data from 2011 through 2019. And what you can see is there's an
obvious peak right there in the middle at October of 2014. That is when the Google searches for "big
data" were the most common overall. Now, this doesn't mean that people don't care about big data
anymore. You see how it's gone down maybe a third, maybe even 50% since then. Well, there's a
saying that a fish, or in this case, a seahorse, would be the last to discover water. That's because it's
everywhere around them. It's literally the medium in which they live and move. The same thing is
true for big data. While the searches for big data may have declined a little over the past few years, it's not because nobody cares about big data anymore, it's because big data has become the air that we
now breathe, the water that we move in. Big data has become the new normal data for use in data
science and machine learning and artificial intelligence, and because of that, understanding what big
data means, the special challenges that it creates, and how to work with it is as relevant now as it
ever has been.
Social media and the Internet of Things
- [Instructor] The big data revolution has happened because data has grown,
but it hasn't happened in a gradual process like a little sprout coming up out of the ground. I mean,
that would be an example of what's called linear growth. For each step forward in time, you move
the same step upward to get this nice, straight line. That's easy to deal with. Rather, data growth has
been exponential. Here I take the exact same line but I put it way down here because here's an
exponential curve. This is actually two to the power of X, and you don't even get to see how far it
goes over to 10 because, at that point, it's got a value of over 1,000. 1,024 to be exact. That's a whole
lot more than the 10 we had with linear growth. This is huge, but you know what? The data
revolution is growing even faster than this. We can shrink that one down and we can draw a curve
that shows the X to the X. So, we take the number on the bottom, the four, the five, the 10, and raise
that to its own power. We do that, we get a curve like this. You don't even get to see how far it goes
because by the time it gets to 10, it's at 10 billion. This is the kind of growth that we're dealing with in
the data world, and my personal impression is that, at this exact moment, we're right here at this
inflection point where things are taking off. And so, when we're talking about data growth, it's less of
a carefully nurtured sprout popping up through the ground and more of a barely-controlled
explosion. Really, it's an exciting time, and there are a couple of major contributors to this explosive
growth. The first major candidate for this explosive growth is social media, and the second one is the
Internet of Things. Now, there are many others, but these really play a starring role in the
extraordinary growth of the big data world. Let's talk about how this works. First off, there is the
feedback. We have the ouroboros, the snake that's feeding on itself, and this is like social media because social media begets more social media. I mean, think about it. A person puts a post online,
that post gets liked. Each like is an additional piece of data with a fair amount of metadata that goes
with it. Somebody puts an image online. That image gets tagged and it gets shared with anybody
who's in it, and it gets put into forms that use the same tags, or somebody has a follower, they put
something online and those followers share that content and multiply the reach of the post. Also,
of course, the fact that more and more people are online and more and more people have social
media profiles, and so there's this incredible growth. It's not just that it's doubling, it's the X to the X
kind of growth. It just goes absolutely through the roof with social media. Now, another one is the
Internet of Things, that's IoT, and the classical examples of Internet of Things include things like
smart homes. So, I've got a smart thermostat. It knows when we're home, it talks to our phones. It
can tell what's going on. And maybe you've got lights and a security system, and maybe you've got a
smart lock, and maybe you've got all sorts of other systems that are connected with each other.
They're communicating constantly. And outside of the home, you can have a smart grid, where the
city knows about the traffic levels, it knows about how much electricity is going between the
generators and the various buildings and houses. It knows what's going on with the water system.
There is so much information being exchanged here through sensors and networks that it, again, has
the explosive growth. Another one that falls into this category is self-driving cars, which gather an
extraordinary amount of data from the sensors they have all around them and they communicate to
each other, and the time will soon come that they communicate directly with the road and with the
traffic signals. There are so many different ways to gather all this data, and it gets even more
complicated because the quality of the data has changed over time. Now, by that I don't mean good
data versus bad data, I mean that we start with text. You know, text comes first. When you send text
messages, you started with your little flip phone and you send your little, tiny text message, which
actually was text. Text can be measured in kilobytes. But soon, for instance, you learn how to do
audio and you can send a voice message or a sensor for your home can give you the audio of what's
happening. That's usually measured in megabytes. And then, eventually, you get to video. Say, for
instance, you have a doorbell. It's now giving you HD video over your phone of who's there at the
door. Video can be measured in gigabytes. And so, we're not only increasing the number of things that
provide the data, but we're going up the ladder to kinds, or qualities, of data that are much, much
larger and also might move from the structured to the large and unstructured format, again, from
text to audio to video, and what those correspond to are the three V's of big data: volume and
velocity and variety. And so, the social media revolution, which is still going on very actively, and the
Internet of Things, which, again, is still just at the beginning, have contributed massively to the
explosive growth of data that constitutes both the challenge and really makes the fertile ground for
the promise of big data.
Data warehouses, data lakes, and the cloud
- [Instructor] I love old books. I love the feel, I love the look. I love the formal
typesetting, the page-long paragraphs, the formal language, the complete absence of footnotes and
references. But they're musty, they're fragile, and really, they're risky to lend out. You don't know if
they're going to come back intact. They're hardly how I would want to store my operational data.
And so after the era of books as data, we moved on to digital data sets. But so often the data did, and frequently still does, exist as a local file on one person's own computer, where it can't do much good for anyone beyond that one person's desktop. Organizations have always been trying to find
ways to get the message out. And several developments in data storage have helped with that goal
and helped deal with the massive quantities of data that are part and parcel of big data. The first
solution, the first step, in this direction was the creation of data warehouses. A data warehouse is a
unified place to keep an organization's data sets. So think of it as a server that has all the data sets on
it. Now there are several advantages to data warehouses. Number one, the data is typically well-
structured and curated. And it's well-structured in part because data warehouses often could only
take one kind of data, and it would usually be a relational database with the rows and columns of
tables. And in addition to that, a data warehouse may contain discrete organizational units. So, for
instance, accounting might have their own part of the warehouse, and sales might have their own
part of the warehouse. And this was one way to organize things. Now these were very popular, say,
for instance, back in the '70s and '80s. But there was a problem in that you ran the risk of ending up
with a data dead zone, an abandoned data warehouse. The reliance on structured data worked well
for a lot of purposes, but the explosion of big data brought in enormous amounts of unstructured
and semi-structured data that didn't fit well in the systems. In addition, warehouses sometimes
perpetuated the existence of organizational silos, or these walls between units, that didn't
communicate well. And truthfully, that undermined some of the promise of the data revolution. And
so for many organizations the data warehouse kind of came, it solved some problems, but then it
seemed to bring along some potential problems of its own. That led to the development of our next
solution and next metaphor, the data lake. This is a data storage that holds data that is structured,
and semi-structured, and unstructured. Any kind of data can go in there, so it's enormously flexible.
And it's designed to do away with data silos, so that all of the organization's data from every unit can go
in there. And so it's potentially available to whoever needs it. And really, it's the existence of data
lakes through things like Hadoop that made the big data revolution possible. But just as with data
warehouses, there are some potential risks with data lakes, and that is you can end up with a data
swamp. If the data lake isn't well maintained, if mistakes in the data aren't fixed, if adequate
metadata isn't provided, if multiple, but different, versions of the same data are coexisting, if formats
aren't adapted to the software used, you run the risk of getting this swamp-like situation, which can
be a dangerous place to get lost, and really becomes a place where data go to die. And so people
came to recognize there were some limitations to this approach as well, which leads us to the
current preferred solution, which is the cloud. The idea here is that a data cloud is something that
contains data, like a data lake, that is structured, semi-structured, and unstructured, so it's very
flexible in terms of the contents. But it doesn't rely on a local server. It doesn't rely on local
maintenance of the equipment, because it's off somewhere else. I mean, this is when you use things
like Amazon Web Services, better known as AWS, or Microsoft Azure, or Google Cloud, among
others. And also, because it's fully online, it's more accessible. And truthfully, it's more adaptable.
You can get more space immediately. You can get more computing resources immediately as you go.
And so there's some huge advantages to storing and working with big data in the cloud. But, as with
every other solution, there are some risks, and I'm going to call it data dissipation. The major risks of
storing data in the cloud have to do, first, with security, where the organization may not be keeping
track of who has access to the resources, or it might be susceptible to hacking. And second, there's
cost containment. When things are infinitely expandable at any time, you may not be keeping track
of the costs that you're racking up. And third, there's the remote possibility that the services you use
might become temporarily inaccessible, or they may disappear completely. And so you do run the
risk of your data kind of dissipating, or not knowing what's happening there. Because we're in the
data cloud era, it's worth pointing out a couple of variations that have become particularly popular.
One is the multicloud, also called polynimbus. This is using several different cloud storage and
computing providers simultaneously. So you might use Microsoft Azure and Google Cloud, and also some IBM cloud computing resources. And the point here is you actually can integrate them so they're in a single heterogeneous environment, as though they were one system. Now,
that's one popular approach because it lets you use the strengths of each of these providers. There's
also hybrid cloud. This is where you're using both a public cloud, like AWS, and you're using a secure
private cloud. Think of a local server. And what that lets you do is it lets you keep a lot tighter control
on what happens here. And it's possible also to move back and forth. You can take things from the private cloud and push them into the public cloud when you need to expand or when you need transparency. And at times when you need extra security, you bring them from the public back to the private. And so there are a lot of
variations that can be used to meet the demands of your organization's data computing and the
regulatory environment. And so there're a few things to think about when looking at these different
approaches: the data warehouse, the data lake, the data cloud, along with the multicloud and the
hybrid cloud. Number one is accessibility. Some of these allow more access to more people and from
different situations. The cloud's going to be the best at that. The data warehouse is going to be the
most limited in that particular respect. There's also the flexibility. What kinds of data can you put into
it? And also, can you scale up your resources as needed? And then the third one is security. How
much control do you have over your data? How much control do you have over your costs? And how
are you able to meet the demands of the regulations that you're working with? Using these three
criteria, as well as things like the cost to purchase and maintain and the learning curve to get started
on any one of these, your organization can choose. Do you want a data warehouse? Do you want a
data lake? Or do you want to work in the cloud? Or maybe there's even a situation where you do
want to keep those files on a single computer for the tightest possible control. But any one of these is
going to allow you to make an informed choice about how you deal with your data, especially all the
variations that constitute big data and the promise that it has for your organization.
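To make the cloud option a bit more concrete, here is a minimal sketch, not something shown in the course itself, of pushing a file into cloud storage with Python and the boto3 library for AWS S3; the bucket name, file names, and key are hypothetical, and credentials are assumed to be configured already.

```python
# A minimal sketch, assuming boto3 is installed and AWS credentials are set up.
# The bucket and file names below are hypothetical.
import boto3

s3 = boto3.client("s3")

# Push a local file into a (hypothetical) data lake bucket in the cloud
s3.upload_file(
    Filename="sales_2024.csv",              # local file to upload
    Bucket="example-data-lake-bucket",      # hypothetical bucket name
    Key="raw/sales/sales_2024.csv",         # where it lands inside the bucket
)

# Anyone with access can later pull it back down, from anywhere
s3.download_file("example-data-lake-bucket", "raw/sales/sales_2024.csv", "sales_copy.csv")
```

Equivalent calls exist in the SDKs for Azure and Google Cloud, which is part of what makes the multicloud setups described above workable.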
Big data for business strategy
- [Instructor] You may work in a massive multinational corporation or you may work in a
scrappy young startup, but no matter where you work, you have to make decisions about how you're
going to make your business work and reach its potential. And data should always be central to those
decisions. Now, maybe you have a simple situation. Maybe you already have consistent profits going
up and up and up each month, so in that case, you know, just keep doing what you're doing, do more of the same, 'cause, you know, maybe you've cornered the market on some new wonder widget and you're growing 10x every year. By all means, just keep doing what you're doing and enjoy your uninterrupted prosperity. But reality is rarely that simple: customers' attention wanders, competitors arise, the market gets saturated, and economic environments change, so you have to adapt in that
case. Also, you need to identify potential profits, things that you could do to add to your business and
potential risks, things that potentially represent a threat to your operations. Now, you can use data
to answer these questions and maybe you've got some simple methods, maybe you just look at your
sales numbers in a database: up, up, up, down, up, up. Or maybe you have customer data in CRM software, that's customer relationship management, and CRM software can be very basic or very sophisticated, depending on what you have. Also, you have social media accounts and you have websites and you can look at the built-in analytics for each of those. Those can give you individual glimpses into what's going on in your business. But there's a major problem with all of those
approaches: everybody's doing them. Remember, if you're in the business world, you're in a
competitive environment and you need a competitive advantage. So it's pretty obvious that you can't
simply do the same things that all of your competitors are doing. At best, you would keep even with
them. But it won't put you ahead, you need a competitive advantage and this is where big data
comes in. Remember, big data is data that's characterized by its unusual volume, its high velocity,
and its astounding variety of formats and sources. What these together mean is that big data has the
potential of giving you a competitive advantage in your business' operations. There's a few different
things, for instance, you can identify new patterns in your data, you can find relationships, you can
find trends that other people can't, because you're able to look at it in a more detailed, nuanced, and responsive way. You can use novel data sources. This is a cell; I know some local companies that do amazing work on biological experiments, conducting hundreds of thousands of them per week, then using microscopes to take photos of the cells and then using artificial intelligence to process those photos. It's a novel data source, it's extremely high-quality data, and it's actually allowing them to get a huge advantage in the pharmaceutical market. Also, the intelligent use of big data through data science,
machine learning and artificial intelligence allows you to adapt dynamically in real time, not just
quarterly or monthly or yearly. It allows you to identify new trends and new markets and start
engaging with them immediately. You can become an agile organization through the proper use of
big data. Now let me say just a little bit more about each of these. With the new patterns, big data
brings with it a whole new suite of algorithms, new ways of processing. It's true that a lot of traditional methods, like linear regression, can be extraordinarily effective with the proper data. But sometimes your questions go beyond that and you do need the algorithms that come with the big data revolution. You can focus on exceptional cases, especially because you're going to have a lot of data, remember, it's big data, and you can focus on those edge cases and identify possibly a new niche that's been untapped. You can follow nonlinear patterns; many other methods assume you can draw a straight line or put a flat plane through the data, while many of the big data methods allow you to get much more intricate than that. And there are new methods for identifying clusters of potential clients, maybe a subgroup that has not been completely addressed. By using the methods that come along with the big data revolution, you can find those patterns and put them to work for your business.
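As a hedged illustration of the clustering idea just mentioned, and not an example from the course, here is a minimal sketch using scikit-learn's KMeans on made-up customer data; the two features and the choice of three clusters are entirely hypothetical.

```python
# A minimal sketch, assuming scikit-learn and NumPy are installed.
# The customer features and the number of clusters are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [annual spend in dollars, purchases per year]
customers = np.array([
    [200, 2], [220, 3], [250, 2],        # low-spend, infrequent buyers
    [1200, 15], [1100, 14], [1300, 18],  # high-spend, frequent buyers
    [600, 30], [650, 28], [580, 35],     # moderate-spend, very frequent buyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # which cluster each customer was assigned to
print(kmeans.cluster_centers_)  # the center of each discovered segment
```

In practice the interesting part is interpreting the segments the algorithm finds, which is where the context expertise discussed later in the course comes back in.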
Big data for applications
- One of the great paradoxes of big data is that sometimes all of the technological wizardry, all
the equipment, all the huge amount of effort that goes into it, is put into motion to help with
something as theoretically simple as choosing whether to go left or right at the fork in the road. The
idea here is that big data is frequently used to create small data. The small amount of information,
the direction, the couple of restaurant recommendations that the end user wants. It's really another
example of the general principle that all analysis, all data analysis, is an act of simplification. You are
taking something very complex, something potentially huge and overwhelming, and reducing it, while hopefully still maintaining the general form and feel, to something that is more usable and actionable.
Now, in terms of applications based on big data some of the really common ones are things like
directions from point A to point B, or maybe something like finding jobs online, or getting health
feedback. By the way, I intentionally use cell phones in all three of these because mobile devices are
going to be one of the greatest ways that people interact with big data. You've got the huge server
farms on your end. They've got their phone in their hand at the other end and that's how you're
trying to meet with the end user. Now, think about directions, just for example. In terms of big data
what you're doing is you're bringing things in like GPS coordinates for where the person is, where
they want to go. You're bringing in map data, so information about the roads, the shape files that
show the map. You're bringing in traffic conditions, historical patterns and real time estimates. You're
getting user preferences. Do they like to be on the freeways? Do they want the scenic route? Do they
want to avoid tolls? And you can also bring in data from other travelers to identify potential slow
downs or detours in the road. This is a very complex set of data and that's why it is big data. It's the
variety, the volume, and things can be coming in quickly with the velocity. But you're boiling it
down to, "Go straight, turn left, turn right here." That is big data being put into the best possible use
for an application with consumer orientation. Or take something like a job search which can be a very
time consuming, stressful, unpleasant experience. But with big data applications you can take, for
instance, a person's physical location, where they work, where they say they want to work. You can
match their skills, you might even be able to assess those skills directly. You can look at the
endorsements that people give them and match those skills and endorsements to the requirements
of the jobs that you have available. And then, probably one of the most important things is, you can
connect that person through a professional network to other people who have similar jobs, maybe
even at the same company. It tremendously facilitates the job search process. And again, these are
many different kinds of data. That's how big data is used. To compress it down to a small number of
jobs and recommendations for the end user. And then as a final example, there are health
applications where you're doing things like connecting with medical history, usually self-reported, where the person who owns the phone puts the information into their account, possibly collecting data from their medical provider's records, though that's another big obstacle. But you can also
get sensor data from their watch and their phone. How much are they moving around? What's their
heart rate? How did they sleep last night? What's the temperature around them? All of that
information, again, very dramatically different kinds, can be integrated into the health app. And you
can even use that to connect with emergency help. Again, think about how all this information needs to come together so the watch can say, "maybe you fell," and, "maybe you need emergency help." That's taking big data and distilling it down to the essence of the small data
to get the customer, or the client, the thing that they want. All three of these are fabulous examples
of how the big data revolution has made it possible for companies to provide unprecedented service,
both in terms of its kind of service and its personalization, and that fulfills the promise of big
data in the consumer realm.
- [Instructor] Big data isn't just a supersized version of small data. It's qualitatively different in a
number of ways that actually matter to what you're doing. In the book, Principles of Big Data:
Preparing, Sharing, and Analyzing Complex Information, Jules Berman outlines 10 ways, maybe you
could say they're the top 10, that big data is different from small data, and I want to walk you
through those. First, they differ in their goals. For small data, the goal is usually a specific, singular
goal. We're trying to accomplish this one task by analyzing the data. On the other hand, with big
data, the goals evolve and they can redirect over time. You may have one when you start, but things
can take unexpected directions. Second is location. Small data is usually in one place and often, in
one computer file or one floppy disk. But big data is much larger because it is big, and the data may
be spread across multiple servers in multiple locations anywhere on the internet. Third, the data
structure and content. Small data is typically structured in a single table, rows and columns in a table,
like a spreadsheet. Big data, on the other hand, can be semi-structured or unstructured across many
different sources, and it's putting those things together that constitutes one of the major challenges.
The fourth difference is data preparation. Small data is usually prepared by the end user for their
own goals, so the person who's putting it in knows why it's there and knows what they're
trying to accomplish. On the other hand, big data is generally a team sport and the data can be
prepared by many people who may not be the end users or the analysts at all, and so the degree of
coordination that is required for big data is extraordinarily more advanced. The fifth difference has to
do with longevity. Small data may be kept for only a limited amount of time after the project is
finished. Maybe a few months, maybe a few years, but after that, it can go away. It doesn't really
matter. Big data, on the other hand, may be stored in perpetuity and it actually may become part of
later projects, so future data might be added to existing data, historical data, and data from other
sources can come in. It evolves over time. The sixth difference has to do with measurements. Small
data is typically measured in standardized units using one protocol, usually because one person's
doing it and it all happens in one point in time. On the other hand, big data can come in many
different formats that are measured in many different units and gathered with many different
protocols by different people and different times and places. And so there is no assumption of
standardization or uniformity, at least in the raw version of the data. The seventh difference has to
do with reproducibility. With small data, projects can usually be reproduced if needed; if your data goes bad or goes missing, you can do it over again. In scientific circles, this kind of replicability is actually considered a very good thing. With big data, however, replication, meaning gathering all the data over again, simply may not be possible or feasible. Bad data may have to be
identified through some forensic process, and there may be attempts to directly repair things or you
may have to do without it. The eighth difference has to do with the stakes involved. With small data,
because the projects are not huge, the risks are generally limited. If the project doesn't work out or if
a mistake is made, it's usually not catastrophic. But with big data projects, because there's so much
time and effort and people's work invested in it, the risks are enormous. They can cost hundreds of
millions of dollars and lost data or bad data can doom a project, maybe even the organization that's
running the project. The ninth difference goes by a peculiar title, for those of you who are not in
computer science, and that's introspection. This is a term that comes from object-oriented
programming languages like C++, Java, Perl, and so on. It has to do with where the data is and
identifying the data. In small data, you generally have well-organized and individual data points that
are easy to locate and they often have clear metadata so you know where they come from. You know
what the values mean. With big data, however, you have many different files in potentially many
different formats, and it can be difficult to actually locate the data points that you're looking for. And if they are not well documented, it's easy for things to slip through the cracks, and it may be difficult to interpret the data and know exactly what each value means. And the 10th difference has to do
with analysis. In small data, you can generally analyze all the data in one procedure on one machine,
but big data, again, because it's big, the data may need to be broken apart. It may need to be
analyzed in several different steps using different methods and then trying to combine the results at
the end. All of these differences combine to make it so that big data isn't just bigger. It's not just more data, or "lots of data," as people sometimes call it. It's a qualitatively different approach to solving
problems that requires qualitatively different methods and that is both the challenge and the
promise of big data.
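As a brief aside to illustrate what introspection means in a programming language, and not anything drawn from Berman's book, here is a tiny Python sketch; the Measurement class is invented purely for illustration.

```python
# A minimal sketch of programming-language introspection: an object describing
# itself at runtime. The Measurement class here is hypothetical.
class Measurement:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

m = Measurement(150, "pounds")

print(type(m).__name__)   # the object's class name: 'Measurement'
print(vars(m))            # its attributes and values: {'value': 150, 'unit': 'pounds'}
print(dir(m)[-2:])        # a peek at the names it exposes, e.g. ['unit', 'value']
```

Well-documented data points with clear metadata play a similar role for data: they let you ask the data what it is and where it came from.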
The three facets of data science
- [Instructor] Elsewhere in this course, I've introduced you to the three V's of big data.
Those are volume and velocity and variety. And when you put those together, you have big data. But
you may be familiar with another three-part Venn diagram, this time serving as the tripartite definition of data science, which involves hacking skills, math and statistics, and substantive expertise, and together those constitute data science. And this model was created in 2013 by Drew
Conway. Now if you want to use shorter names for each of these, you can call them code and quant
and context. By the way, most people have an intuitive feel for why coding and quantitative skills are
important to data science, but context or domain expertise might need a little bit of explanation. And
I like this quote about why context expertise matters. Organizations already have people who know
their own data better than mystical data scientists and this is a key. The internal people already
gained experience and ability to model, research and analyze. Learning Hadoop or any other big data
platform or method is easier than learning the company's business. And that's by Svetlana Sicular of
Gartner, Inc. And so the idea here is it's easier to take somebody who understands your business and
who understands the way things work within that domain and get them the technical skills than it is
to do the other thing. Now the data science Venn diagram also labels the intersections of pairs of
circles. So for instance, here at the intersection of coding ability and quantitative ability is machine
learning according to Drew Conway. And then over here at the intersection of quant and context is
traditional research. And then finally over here in code and context, Drew Conway calls it the danger
zone, and I have different things to say about that, but the important thing to know here is that data
science and big data for many years were taken as practically synonymous terms. I want to show you
in some of the following videos how the two relate but also the ways in which they diverge and can
represent different practices with different goals.
Data science without big data
- [Instructor] Elsewhere, I introduced you to the three-part definition of data science that revolves around
coding and quantitative abilities and context. And while data science has been strongly associated
with big data, it's still possible to do data science projects without involving all of the elements of big
data. So, for example, you can use data science on projects that have volume without velocity or
variety, so for instance, a very large and static data set with a consistent format. Genetic data's going
to fall into that category, or maybe something in data mining or predictive analytics. You may be
trying to predict a single outcome like whether a person will click on an ad on a webpage. In this
case, the data set might have thousands of variables and billions of cases, but all of it's in a consistent
format. The size is a challenge, but otherwise, it's a relatively, at least in theory, straightforward
analysis. You can also have velocity without volume or variety. So, think about streaming data that
has a consistent structure. That might be looking at the number of hits on webpages or searches for a
particular term, like searches for big data or data science in the last decade on Google Trends. You
can also get data from a sensor like temperature readings that can come in hundreds of times per
second. So, this is data stream mining or sometimes called the real-time classification of streaming
sensor data. So, you have to do data science to be able to process information that's coming in this
fast and adding on to it, even if it doesn't ultimately constitute a large volume or if it has a very
consistent structure. And then finally, there's variety in the data without necessarily having velocity
or volume. So, think about, for instance, facial recognition in a personal photograph collection. You might have, say, a few thousand photos. That doesn't exactly constitute big data, but
because it's photographs, it does have a lot of variety in the data. But an even better example is a
data visualization of a complex, but static data set. That involves the data science tool set, even
though it's distinct from big data because it's characterized only by one of the three V's. And so, data
science and big data do have a lot of history in common, and a lot of practices in common, but you
can see from this that you can do data science, even when you don't have all of the elements of big
data.
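To make the velocity-without-volume case above a little more concrete, here is a minimal sketch, under stated assumptions, of handling a fast stream of sensor readings; the sensor_stream() generator, the window size, and the anomaly threshold are all hypothetical.

```python
# A minimal sketch: a rolling average over a fast stream of simulated
# temperature readings. The sensor_stream() generator, window size, and
# threshold are hypothetical stand-ins for a real sensor feed.
import random
from collections import deque

def sensor_stream(n_readings=1000):
    """Simulate temperature readings arriving hundreds of times per second."""
    for _ in range(n_readings):
        yield 20.0 + random.gauss(0, 0.5)

window = deque(maxlen=100)                 # keep only the 100 most recent readings
for reading in sensor_stream():
    window.append(reading)
    rolling_avg = sum(window) / len(window)
    if abs(reading - rolling_avg) > 1.5:   # flag a reading far from the recent average
        print(f"Possible anomaly: {reading:.2f} (rolling average {rolling_avg:.2f})")
```

The data here never piles up into a large volume; the challenge is simply keeping up with how fast it arrives.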
Big data without data science
-
[Instructor] You're familiar by now with the 3 Vs of big data, the Volume and Velocity and Variety of
the data. And you're also familiar with data science, which involves Coding and Quantitative ability
and Context. Now, just as I have shown you that you can do data science without needing big data,
you can actually do big data without needing the full data science toolkit, and you can still get
valuable and actionable insights from those analyses. So let's take a look at, for instance, when you
would use the coding and the quantitative abilities of data science without necessarily needing the
context or the domain expertise. The best example in this case is machine learning or specifically,
deep learning neural networks. In this case, you can create a black box, which means you don't
necessarily know exactly how the algorithm is processing the data, but you can still get very useful
categorizations and identifications of cases that fit, cases that don't fit, of anomalies, and you can find out how you can best serve your clients even without having the strict domain expertise that's
normally involved with that. On the other hand, there are projects that involve context expertise and
quantitative ability without necessarily requiring the ability to code. Now, in Drew Conway's Data
Science Venn Diagram, he calls this traditional research. Now, a person with traditional research
training, as is common in my own academic field, is probably going to be very, very good at their
context, they'll be experts on it, and they're probably sufficiently skilled with quantitative abilities to
get their work done. Chances are, they're not prepared to deal with the volume, velocity or variety of
big data on their own. So in this case, those who are best skilled at traditional research may not
actually be able to engage in a big data project the way it's traditionally defined. And then, finally,
there's what Drew Conway calls the danger zone. He says that's when you have the context or the
content expertise and the ability to code, without having quantitative ability. Now, I can think of lots
of situations where that's not going to cause problems. For instance, in the big data world, word
counts, where you take large amounts of text, either from individual documents, like this, or from a
collection of social media posts, and counting how often the words happen. There's nothing
statistically sophisticated about that, but it can be a great way of processing data, getting an overall
feel for what's happening in the document and seeing how things change over time. These word
counts involve, again, no statistical information except for saying how common this happens and this
happens. But they are absolutely big data tasks. Text analysis, especially natural language processing,
is one of the foundational tasks of data science, even if it doesn't involve specific quantitative,
mathematical or statistical tasks. And so, the point of this is, you can do big data projects without necessarily having all of the elements of the data science toolkit, and that lets you know they are separate disciplines, even if they frequently overlap.
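As a minimal sketch of the word-count idea described above, here is one way it might look in Python; the sample posts are invented, and a real project would pull text from documents or a social media feed.

```python
# A minimal sketch of a word count: tally how often each word appears in a
# batch of text. The sample posts below are made up purely for illustration.
from collections import Counter
import re

posts = [
    "Big data is changing how businesses compete",
    "Data science and big data often overlap",
    "Counting words is simple but surprisingly useful",
]

counts = Counter()
for post in posts:
    words = re.findall(r"[a-z']+", post.lower())   # crude tokenization
    counts.update(words)

print(counts.most_common(5))   # the five most frequent words
```

Nothing here is statistically sophisticated, but the same loop scales to millions of posts, which is exactly the point being made above.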
Big data and privacy
- [Instructor] Most tech companies would think that their wildest dreams had all come true if they
could bring in $5 billion of revenue, but to lose $5 billion because you didn't adequately protect
people's privacy, well, that would be about your worst nightmare, and that's something that
Facebook gets to deal with at this exact moment because of a landmark fine for violating the privacy
requirements of the European Union's General Data Protection Regulation, or GDPR. Now the
European Union's GDPR, which has been enforced since May of 2018, puts some important restrictions in place. It puts limits on data: not all data that is accessible, data that you can see online or that you can get from people by asking them, can be legally used. For instance, you may ask for people's
phone numbers to enable two-factor authentication, but you can only use the phone number for that purpose. You can't use it for audience targeting or segmentation. Also, there's a dramatically changing landscape when it comes to privacy. The laws and regulations are evolving rapidly, and so are the interpretations of those regulations. So we're seeing how the GDPR gets interpreted and how it plays
out, and then there are things like the California Consumer Privacy Act, which will be enforced in 2020. It shares many of the same elements as the GDPR, and many states and most countries have their own versions of privacy regulations which, unfortunately, don't always mesh, and so it makes
working in an international environment, which is what people on the internet do, a little more
challenging. Now if you want to know specifically what kind of information is considered private by
these regulations, here's an answer. Demographic data that indicates racial or ethnic origin, political
opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data for
the purpose of uniquely identifying a natural person, data concerning health, or data concerning a
natural person's sex life or sexual orientation, as well as things like web behavior and IP addresses.
That's a huge amount of information that, once upon a time, would have been part of the free-for-all
for internet marketing, but now that information's restricted, and this list, by the way, is from Article
9 of the GDPR. Well, what can you do? There are a few major things that are going to help you get
started on proper privacy. So for instance, the GDPR requires explicit affirmative consent to gather
personal data. People have to check the box to say yes, and it can't already be checked, it has to be
blank, and they have to check it, and they have to give you permission to gather personal data. But in
the process, you have to justify the collection. You have to explain how that data is collected, for
what purposes, how it's going to be processed. And then, really a critical one here is that the
customers control the data. The customers can download all of the data that you collect on them. If
they think that there's an error, they can correct it and you have to make the revisions, and they can
ask you to completely delete all their information from your records. So these are the sorts of things
that might fall under common sense, you know, civil rights or human rights on the internet, but
they're things that had to be codified in the GDPR, and they are now part of the legal and regulatory
landscape. Now it does bring up another question. We have a lot of kinds of information that are not
allowed, but how close do you have to be to those things? We have what you might call the minimal,
or the minimalist interpretation, and the idea here is that only variables that directly record personal
characteristics are blocked. So you would have to have a variable or a field that says Race or Ethnicity or Religion, and it would have to be in answer to that direct question. And so the minimal version says, well, it only counts if you ask that specific question. The problem is that nearly all variables on everything
are correlated with these kinds of personal characteristics. It's not difficult to infer a person's race or
religion or education or so many other things by, for instance, the websites they visit. And so it turns
out that the minimal approach, which only looks at direct questions, is ineffective protection, because all of these protected categories can be inferred through other processes. Then there's what you
might call the maximal approach, and that is that all variables that are even correlated with personal
characteristics are blocked and can't be used in targeting audiences and marketing. Well, this
approach basically removes all of the data, everything is out, and the problem is that the maximalist approach makes modeling, that is, creating statistical models to provide the services that your company is designed to provide, nearly impossible. And so, again, it's not a workable
solution. So obviously, you're going to have to find some kind of balance between those two
extremes. It's going to have to depend on, among other things, the field that you're working in, the
kind of data that you have available to you, and of course, your local laws and regulations. And if you
want more information about some of these challenges and some potential solutions, we have
another course available, AI Accountability Essential Training, which will be a great introduction to
these topics for anybody working in big data, data science, machine learning, and artificial
intelligence. That way you can start doing your work, and you can do it in a way that is productive,
that is useful, that provides something of value for your clients and your customers without running
afoul of the laws, the regulations, and the social practices that you're dealing with.
Data governance
- [Narrator] My friend is a choreographer and they've got a box full of early work recorded on Super
Eight, on Betamax, and a whole bunch of other orphan video formats. One of these days they'll get
them transferred to digital, but until then, these videos constitute a sort of data dead zone. There's
data there, it's in their hands, they just can't do anything with it. It's actually an example of data
governance gone wrong. Data governance is a lot like taking care of your dance videos or maybe your
garden. In a garden there's a lot of work to do to make sure your plants are healthy and happy, and
the same holds true for taking care of your data. There are four major elements of data governance.
They concern the availability of data, the usability, the integrity of that data, and the security, and I'll
talk a little bit about each of these separately. But I just want to mention, these things are hard
enough to do when you have only one machine to take care of. I think all of us have had an example
of losing the data file, having it be gone for good, or trying to open a file and have it be corrupted.
Well it's exponentially more complicated when you have big data, for several reasons. Number one,
there's more people, there's more hands in the data. There's more people working who may not be
communicating well and that dramatically increases the risk of working at cross-purposes. There are
many more machines, instead of a single laptop, you've got a server or collections of servers, or who
knows what out there in the cloud. And so you have a lot of different hardware and potentially
different software involved. And then you have more formats for your data 'cause big data
necessarily involves many different kinds of data, think about the variability. You have a lot of
different formats, and you may be working in many different programming languages 'cause they
work well for different things. All of these multiply the complexity of data governance and make that
project much more important and challenging when working with big data. Let's talk about the
availability first. The question is, where is the data, can you find it? Is it connected to your current
systems, or is it sitting on a hard drive somewhere and you don't know where it is? Is the data restricted,
can only certain people get access to it? As important as computer security is, the moment you don't
have the password to the data that you need, you know you're facing an important availability issue.
Second is usability. What format is used for storing the data? Is it on a floppy disk back in your
backpack? What language is used for processing the data? There's actually a fair amount of work still
done in Fortran, but you're going to have to work a little harder to find somebody who's very
comfortable working in that. Or is metadata available, is there something that describes where the
data comes from, how it was collected, who the sample was, what the variables mean and what the
values mean? These are all critical questions to make the data usable and interpretable within the
context of your project. Then there's integrity. Are the data files free from corruption? Anybody
who's had to deal with this for a single file knows how difficult it can be when a file gets mucked up. If it happens to an entire server, you've got an extraordinarily huge problem. Is the code that's used to process and analyze the data complete, and is it functional? And you have to worry about
whether, for instance, the code or its dependencies have been updated between now and then, which is why it's common practice to include usable copies of the packages that you used when you were doing the programming. Those packages update a lot, and this way you always have the version that you know worked.
And then are the relations intact? And by that I mean the connections between the different tables
and your data. You have to have a way of referring between them to make a complete data set. Are
those still valid? And then finally there is the issue of security. Are the data and the code safe from
hackers? Remember, data's enormously valuable, and it's a target for anybody who's looking to get a
little bit of extra stuff for free. Or in a big data project, because there's so many people involved you
have to be sure that the data and the code are safe from accidental changes by unauthorized users. I
think about, for instance, just my genealogical data. I log in there, somebody's made a change, they went back to the same wrong thing that had already been corrected, and now you've got this big problem. Think about what happens on Wikipedia, the little editing wars. But people sometimes log in and if they have
access, they might make a change without realizing how much impact they're having on other major
projects. And then the last one, to get back to our Fortran issue, are they future-proof? Can you store
the data, can you store the code, in such a way that you know it will be usable at least a few years
from now, maybe not forever, but at least a few years? So for instance, if possible, don't store your
data in a proprietary format which requires a license and requires the company that made it to
update things. If possible, store it in an open format. CSV files are nice; they don't keep the metadata, so you'll need other options for that, but use something that you know will be around for at least a few years. That way you can guarantee at least the short-term longevity of your data, and the whole thing makes it easier to manage and to govern the data that is used in your data science projects over time, so you can get both more insight and more meaning, and truthfully, more value, too, out of your
project.
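As a hedged sketch of the open-format advice above, and not the instructor's own workflow, here is one way to store a small data set as plain CSV with a JSON metadata sidecar; the file names, fields, and example rows are all hypothetical.

```python
# A minimal sketch: keep the data in an open CSV file and put the metadata the
# CSV can't hold into a small JSON sidecar. All names and values are hypothetical.
import csv, json
from datetime import date

rows = [
    {"customer_id": 1001, "state": "CA", "signup_date": "2023-04-12"},
    {"customer_id": 1002, "state": "NY", "signup_date": "2023-05-03"},
]

with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

metadata = {
    "source": "CRM export (hypothetical)",
    "collected_on": str(date.today()),
    "fields": {
        "customer_id": "internal unique identifier",
        "state": "two-letter US state code",
        "signup_date": "ISO 8601 date the account was created",
    },
}
with open("customers.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Both files are plain text, so they should stay readable for years without depending on any proprietary software or license.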
Structured, semi-structured, and unstructured data
- [Instructor] I have a friend who worked as an archivist for a
major fashion designer in New York City. Their job was to go through the disorganized racks and piles
of clothes from several decades of runway shows, match them with the videos of the shows and then
label and document and organize everything. It was a monumental task that took over a year but
now the design house has a documented history of their creative work and in addition they can
actually find what they're looking for. In fashion, as in life, a little bit of structure can go a long way.
When it comes to data and structure, there are three major rubrics or categories, there is structured
data, semi-structured data, and unstructured data, and I'll say a couple of words about each of these.
Structured data is the kind that you generally think of when you think of data. It's rows and columns
organized and labeled. The variables are predefined, the last name goes here, the first name goes
here, the zip code goes here. Spreadsheets are a perfect example of structured data, provided that
each row is a case and each column is a variable or field. Relational databases are also a perfect
example of structured data, things are well-defined, you know where things are, you know how they
connect and it makes analysis very easy. On the other hand, here in the data science world we get to
deal a lot with semi-structured data. Now, this involves variables that are defined, but they're defined on a case-by-case basis and marked with tags, and those tags can vary freely. Good examples of this are HTML, that's the Hypertext Markup Language that creates webpages, or XML, the Extensible Markup Language, and JSON, the JavaScript Object Notation, and all of these are used extensively on the web.
And what you see here on the right is some data that I've used in other examples but saved not as a
CSV file, 'cause it's normally in a spreadsheet, but as a JSON file, and so you can see that within the curly brackets is each case, and then each line defines first the name of that variable or field, like state, and then gives the value for that case, California and so on. Now, these have consistent names, but they can vary from one document or one case to another. And then finally,
there's unstructured data, this is when the variables or the fields are not labeled or identified,
instead you have a lot of information there and you have to figure out what it is. Free text is an example: if you import the text of a book, or you take photos or videos or audio waves, like we have right here, those are extremely common forms of data, but traditionally they haven't been used much in analysis because they're not structured, they're not labeled, they're not processed in a way that makes analysis easy. And that gets at a little paradox here: if we take these three versions of data and look at
their ease of analysis, structured data is by far the best, it's ready to go, it's all set up. Semi-
structured data takes a little bit of work but you can import a package into your programming
language that knows how to parse it, and it adds a few extra steps but it's not a big deal. Unstructured
data on the other hand, requires sometimes an amazing amount of mental and statistical gymnastics
to get the data set for analysis. So from ease, it's structured to semi-structured, to unstructured
being the hardest. On the other hand, if you want to talk about the quantity of data that is
theoretically available out there in the world, well there's more unstructured data than anything else,
especially when you consider all the social media posts, all the books ever written, all the songs ever
recorded, all the movies ever made, all of those are unstructured. And there's an enormous wealth
there but it takes some work to get at it. Semi-structured, there's a lot of data that, for instance, is on
the internet, social media data is in certain respects semi-structured and webpages are largely semi-
structured at least in the gross elements. And then structured data, the stuff that's all set up and
ready to go, well it's the easiest to work with and it's what I turn to first, it is probably the least
common in terms of overall quantity of data that's theoretically available. This means that there's a
problem, this mismatch between the ease of analysis and the quantity and that means that basically
you have to find a way to take the unstructured data and to put some kind of structure on it. You can
do that by either manually labeling the data, like grading papers and giving a score to them or taking
a document and saying what it's about. You can do it yourself, you can hire other people to do it, you
can wait for users to label their own photographs, or what's shown itself to be really useful these
days is to use machine learning to create and apply the labels. This is one of the great benefits of deep learning neural networks. They take messy unstructured data, like audio, video, MRI scans, and so
on, and through a sophisticated procedure they are able to label the data and do so at scale, which is
one of the great benefits. So it's one way of dealing with this mismatch between the ease of analysis
and the quantity of the data available. But what I want you to know overall is that there are these different kinds of data: the structured is easy and ready to go, semi-structured is pretty common, and unstructured is an enormous untapped resource, and that's one of the major challenges of big data and, by extension, data science and machine learning and artificial intelligence: to take that unstructured data and get the value and the meaning out of it.
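To make the contrast above concrete, here is a minimal sketch, not the instructor's actual example file, of the same tiny data set expressed as structured CSV and as semi-structured JSON, parsed with Python's standard library; the states and populations are only illustrative.

```python
# A minimal sketch: the same two records as structured CSV and as
# semi-structured JSON. The values are made up for illustration.
import csv, json, io

csv_text = "state,population\nCalifornia,39500000\nTexas,30000000\n"
json_text = '[{"state": "California", "population": 39500000}, {"state": "Texas", "population": 30000000}]'

# Structured: predefined rows and columns, like a spreadsheet
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["state"], row["population"])

# Semi-structured: each case carries its own field names as tags
for record in json.loads(json_text):
    print(record["state"], record["population"])
```

The extra step for the JSON version is small, which is why semi-structured data is only a little harder to analyze than structured data, while unstructured data takes far more work.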
Challenges with data preparation
- [Narrator] Salmon can be farmed or they can be caught wild. But either way it takes a fair amount of
work before they are turned into this. Everybody knows that food prep is an important although time
consuming and frequently tedious part of cooking. There is a similar principle in any big data project.
The rule of thumb is that about 80 percent of the time on a big data project is spent preparing the data.
And that's been my own experience. Now there are several reasons why this may be the case. It
includes things like how the data is entered. If you're using wild-caught data, meaning data that you found out there in the world and that maybe was entered as free text, you have to look at things like place names. Here are four different ways of indicating California: you can write it out, you can use various abbreviations, and the inclusion of a period, at least by default, marks it as a separate answer from the one without a period. Or when people are putting in dates, here are four different ways of writing the same date. Now humans looking at these know that they refer to the same
thing. But a computer will read them by default as different things and you have to do a little bit of
clean up work to make sure that the dates are all read the same way and formatted the same way.
Or phone numbers. Again, the same phone number in four different ways. You know for instance
that if you have these in a spreadsheet they're going to sort separately. But you can reformat them
so the software knows that they are all phone numbers and handles them appropriately. And this kind of clean-up work becomes an important part of getting things ready for your big data project. Then there
are other issues like units of measurement. If you ask someone's weight and they put 150 or 68 or
10.7, well, 150 is pounds, 68 kilograms is the same weight, and 10.7 stone, if you're dealing with someone from the United Kingdom, is also about 150 pounds. So there are several different ways of measuring these. And we all know about examples where, for instance, NASA has not adequately converted
from one unit to another and space probes have crashed as a result of that. Then there are errors. If
someone puts in 2.7 children, that's not possible. You can't have fractional children; you can have an average that's fractional, but not a count. Or 270 children, that's too many. Or negative 27; you can't have a negative number on a counted value like this. And then of course there's missing data. In some data sets, an 8 might indicate a missing value, a negative 999 might indicate a missing value, or a period, or simply an empty space. And there are different kinds of missingness. You can have not a number, not available, missing completely at random. All of them mean different things. And when you're preparing your
data, you have to go through these and be thorough in getting them all into the same format with the
same meaning and interpretation all the way through. And of course that's to say nothing of the
amount of processing required for dealing with unstructured data like cell phone videos. And it gets
to an interesting question. Where is the return on investment and where's the balance point? Should
you clean wild data? Or should you gather new data that is in the structure and with the encodings
that you want? So for instance, here's a microscope image of cancer cells. And if you're dealing things
like health records the project can be a nightmare. Which is why companies like for instance
recursion pharmaceuticals who do drug development where they find applications of existing drugs
to orphan diseases. They figured out that it was cheaper, faster, and easier to develop their own
technology to gather and analyze data from half a million experiments per week with the goal is to
hitting a million per week. They did the math and realized that gathering their own data with their
own propriety methods so they could then feed it into their algorithms for analyzing the data would
be the best way to reach their goals as opposed to trying to clean the wild caught data. And so, it
becomes a trade off and depending on your exact circumstances, you might be able to use the wild
data. You may need to gather new custom bespoke data. But no matter how you decide to go about
it, you do need to give yourself adequate time for the demanding work of preparing data in your big
data project. Remember that 80/20 rule: again, about 80 percent of the time on a project is spent getting ready for it, and 20 percent for the analysis. You can find ways to facilitate that, to streamline that, and get to the meaning and the insight that you're looking for faster and more efficiently.
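As a hedged sketch of the clean-up steps described above, and not the instructor's own example, here is how a few of those fixes might look with pandas; the tiny DataFrame and the specific cleaning rules are hypothetical and would need tuning for real data.

```python
# A minimal sketch, assuming pandas is installed. The data and rules below are
# hypothetical illustrations of the clean-up problems described in the text.
import pandas as pd

df = pd.DataFrame({
    "state": ["California", "CA", "Calif.", "CA."],
    "signup": ["3/1/2024", "2024-03-01", "March 1, 2024", "01 Mar 2024"],
    "phone": ["(801) 555-0100", "801-555-0100", "801.555.0100", "8015550100"],
    "children": [2, 270, -27, 999],   # 999 used here as a missing-value code
})

# Standardize the different spellings of the same place name
state_map = {"California": "CA", "CA": "CA", "Calif.": "CA", "CA.": "CA"}
df["state"] = df["state"].map(state_map)

# Parse each differently formatted date individually into one datetime type
df["signup"] = df["signup"].apply(pd.to_datetime)

# Keep only the digits of each phone number so they sort and match consistently
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Treat impossible counts and the 999 missing-value code as missing
df.loc[(df["children"] < 0) | (df["children"] >= 20), "children"] = None

print(df)
```

Even this toy example hints at why preparation eats up so much of the schedule: every column needs its own rules, and every rule needs checking.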
- [Instructor] In "Alice's Adventures in Wonderland," by Lewis Carroll, Alice, here on the left, gets
some life advice from the Duchess, here on the right. This is what the Duchess tells Alice: "Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise." Now, remember, Lewis Carroll was an accomplished mathematician and logician. This is a
logically consistent and meaningful statement, but it's nearly impossible to follow. Fortunately, the
Duchess provided a shorter version as well, where she simply says, "Be what you would seem to be."
Or even shorter, be authentic. Now, any time you work with data, you're attempting to simplify the
data so that it makes sense and you can get some insight. It is always a matter of simplification.
That's the purpose of data analysis and big data projects, and that's one of the major things you can
do with data visualization. Now, in terms of how you frame things, it's important to remember who is
making and implementing the final decision. Is it a machine learning algorithm? So, decisions by
machines, you don't need to explain them, you don't necessarily need to visualize them. It's going to
know based on its calculations whether to recommend a particular product to somebody or not. But
for decisions that are made by humans, that's where they're generally looking for the principles or
the abstract concepts that come out of the analysis that they can use to guide their own decisions.
And in that case, visualization is going to be critical to get the value out of your big data project.
Remember, humans are visual animals, and you get much higher bandwidth with graphics. A picture
is, after all, worth 1000 words. And so, let me make a few recommendations for your graphics in a big
data project or anywhere. Number one, make the graphic as simple as possible. Now, what I mean by that is as simple as fits the data, and no simpler. You want to make it as clear as possible to
focus on the data and not get distracted by other things going on in the graphics. Second, it's a really
good idea to provide a method to give detail as needed. Let them click on things, drill down, get
some of the source data, get additional context. That gives richness to a graphic, especially one that
is initially rather simple. And then, to state the obvious, provide labels for your axes. What's on the X,
what's on the Y? What are the different groups? And give the sources for your data so the meaning
of it is complete. And then there are a few things to be cautious about. First, be cautious of 3D
graphics. Now, there's one kind which is the false third dimension, like you see right here where you
have bars that have depth. There's basically no reason to ever do that. The other one is a 3D graphic
like a 3D scatterplot. It's really cool as long as it's moving, but as soon as it stops, it's very difficult to
read, and I personally never use them, and I recommend that people avoid them unless there's some
really compelling reason to include it. Another one that's a little contentious is the issue of
interactivity, not just with VR goggles, but really putting radio buttons and sliders on your graphics.
The risk is this. People will get distracted by the interactivity, and they'll just start playing with it,
spinning around your graph and clicking on things to see what happens. At that point, the interface
has taken attention away from the data itself, and that's a delicate balance, something you need to
think about carefully. Closely related to that is the use of novel graphs like this one. They're cool,
they're pretty, they attract attention, they are often very hard to read and hard to read well enough
to know what to do about it. Now, that doesn't mean don't use them. It means be very careful and
thoughtful when you think about using a novel kind of graph. Truthfully, in terms of the best ROI or
return on investment, or the best value, the most information from a graphic, it's going to be really
simple ones. Bar charts, histograms, line charts, and scatterplots will probably get you 90%, 95% of
the value you need from data visualization, even on a large big data project. It doesn't matter that the data set is massive or that it gets complicated. By the time you've boiled it down, simplified it to the level the decision-makers need in order to use it productively, these kinds of graphs will probably give you the kind of insight you need without distracting people from the meaning of the data, and that is the main purpose of visualization in any big data project.
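To make those recommendations concrete, here is a minimal sketch in Python with matplotlib, assuming made-up region names and revenue figures purely for illustration: one plain bar chart, labeled axes with units, a title, and a source note, and nothing else competing for attention.

    # A minimal sketch of "keep it simple, label everything."
    # The categories and values below are invented for the example.
    import matplotlib.pyplot as plt

    regions = ["North", "South", "East", "West"]   # hypothetical groups
    revenue = [42, 31, 57, 24]                     # hypothetical values, in $ millions

    fig, ax = plt.subplots()
    ax.bar(regions, revenue, color="steelblue")    # one simple, familiar chart type

    ax.set_title("Quarterly Revenue by Region")    # say what the chart shows
    ax.set_xlabel("Region")                        # what's on the X
    ax.set_ylabel("Revenue ($ millions)")          # what's on the Y, with units
    fig.text(0.01, 0.01, "Source: internal sales database (example data)", fontsize=8)

    plt.tight_layout()
    plt.show()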
- [Instructor] Back several generations in my family tree, I have some ancestors who were miners.
They would go down to work in the mines early Monday morning and wouldn't come back up until Saturday night. It was backbreaking, lung-blackening work that I'm grateful I will never have to do. But to be perfectly clear, that's not what we're talking about when we talk about data mining. Rather, data mining is the practice of working with large datasets to find potentially valuable insight. Think of it this way: computer algorithms, especially deep learning neural networks, are able
to go through millions of images and automatically identify all of the pictures with cats in them and
show you where the cats are, they draw the frame around them. Data mining's a little like that, but
instead of finding cats, maybe you're trying to find new customers, or new treatments for diseases,
or students with exceptional potential, really anything that's important to you using the same
general concept. The idea is pretty simple: you get a lot of data into your computer, millions of cases
and maybe thousands of variables, and you have the algorithm start looking for patterns of some
kind in the data. Now, there is one major risk in doing all of this, and that is what I personally call Baseball Stats. That's not to be confused with the real data of baseball, but you get things like this: the batter has a career record of three for three against left-handed relief pitchers in interleague play before the eighth inning in away games when the team is leading by at least four runs after the all-star break, and you have to wonder how the batter has even been up to bat three times. This is capitalizing on chance; these are such small numbers that the patterns are meaningless, even if they give people something to talk about. The major challenge in data mining is to find patterns, but
not get these kinds of false positives, and in fact, false positives are one of the major risks. That is, if somebody rolls a 12 with two dice, that might mean something if it's a good outcome, and if they do it again you might suddenly think they know how to roll a 12. Well, that's not really the case, because what you're failing to do is adequately account for the role of chance; lots of things can happen at random. A person can flip 10 heads in a row. It doesn't happen very often, but there's a known probability for it, and it's important to always know what normal variation is, and maybe slightly beyond that, so you can say, this is normal. When you start looking at many, many possible effects, that becomes critical, and that's one reason that any time you do a data mining project, you especially need to be careful about doing verification. That's the double-checking to see whether your pattern holds up in a different dataset, and it's why it's common in data science to break your original dataset into what's called a training dataset and a testing or validation dataset to control for these tendencies toward false positives.
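As a minimal sketch of that training/testing split, assuming a placeholder feature matrix X and outcome y generated just for the example, scikit-learn's train_test_split holds out a portion of the data so any pattern found in the training set can be checked against cases it has never seen.

    # A minimal sketch of the holdout split described above.
    # X and y are made-up placeholders for whatever cases and variables you are mining.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 20))     # 1,000 hypothetical cases, 20 variables
    y = rng.integers(0, 2, size=1000)   # a hypothetical yes/no outcome

    # Hold out 30% of the cases; a pattern found in the training data
    # only counts if it also shows up in the held-out testing data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    print(X_train.shape, X_test.shape)  # (700, 20) (300, 20)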
If you can do that, there are some great insights you can get out of data mining. For instance, you can find new market segments, groups of people whose interests have not been adequately addressed before; that's another niche, something you can act on.
You can also look for new kinds of product and market match, especially when it turns out that a
product can be used in unexpected, novel ways. There are things like market basket analysis where
you can say that anytime a person buys this and this, there's an increased probability they will also
buy this, so if you're doing e-commerce, you show them that extra item online or you give them an
incentive to buy it. And then there's a whole separate category of time series mining, where you're looking for events that happened over time, or the closely related sequence mining, where you're not saying when things happened, just that one thing happened before this and after that. Both of those can be enormously useful in predicting trends in the interactions you have with businesses or customers.
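Here is a rough sketch of the market basket idea mentioned above, using nothing but plain Python over a handful of invented example baskets; real projects would typically use dedicated association-rule tooling, but the support, confidence, and lift calculations are the same.

    # A minimal sketch of market basket analysis: how much more likely is item B
    # when item A is already in the basket? The baskets below are made-up examples.
    from itertools import combinations
    from collections import Counter

    baskets = [
        {"bread", "butter", "jam"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "jam"},
        {"bread", "butter", "milk"},
    ]

    pair_counts = Counter()
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    n = len(baskets)
    for (a, b), count in pair_counts.items():
        support = count / n                       # how often A and B appear together
        confidence = count / item_counts[a]       # P(B in basket | A in basket)
        lift = confidence / (item_counts[b] / n)  # > 1 means B is more likely given A
        print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")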
Now, if you'd like more information on this, we have another course called Data Science
Foundations: Data Mining, which gives you some concrete examples of how each of these kinds of
analyses can work, and how the results can be interpreted and put into effect. But either way, the basic idea here is the same: when you've got data, and when you've got a lot of data, you have
something that potentially has great value if you know how to go through it, and mine through it,
and find those patterns that you and your organization can act on.