Part 1 Lectures
Detecting outbreaks
two weeks ahead
of CDC data
The unreasonable effectiveness of Deep
Learning (CNNs)
2012 ImageNet challenge:
Classify 1 million images into 1000 classes.
Where does data come from?
“Big Data” Sources
User Generated (Web & Mobile)
It’s All Happening On-line
Every:
Click
Ad impression
Billing event ….
Fast Forward, pause,… .
Server request
Transaction
Network message
Fault
…
Data, data everywhere…
There's certainly a lot of it!
[Figure: data volumes on a logarithmic scale; labeled values: 120 PB, 1 Exabyte, 5 EB, 800 EB]
2_ Data Wrangling
Data wrangling is the process of converting data from its
raw form to a tidy form ready for analysis. Data
wrangling is an important step in data preprocessing and
includes several processes like data importing, data
cleaning, data structuring, string processing, HTML
parsing, handling dates and times, handling missing data,
and text mining.
The process of data wrangling is a critical step for any
data scientist. Very rarely is data easily accessible in a
data science project for analysis. It is more likely for the
data to be in a file, a database, or extracted from
documents such as web pages, tweets, or PDFs. Knowing
how to wrangle and clean data will enable you to derive
critical insights from your data that would otherwise be
hidden.
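A few of these steps (string cleanup, date parsing, and handling a missing value) can be sketched with the standard library alone. The inline CSV text, the column names, and the helper tidy are made up for illustration; a real project would more likely read from a file or database.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data: stray spaces, inconsistent case, one missing age.
raw = """name,signup,age
 Alice ,2021-03-05,34
BOB,2021-04-17,
"""

def tidy(row):
    """Convert one raw CSV row (all strings) into clean, typed values."""
    return {
        "name": row["name"].strip().title(),                            # string processing
        "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),  # date parsing
        "age": int(row["age"]) if row["age"] else None,                 # None marks missing data
    }

rows = [tidy(r) for r in csv.DictReader(io.StringIO(raw))]
for r in rows:
    print(r)
```

Each row ends up as a dictionary with a cleaned name, a date object, and either an integer age or None.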
3_ Data Visualization
Data Visualization is one of the most important branches
of data science. It is one of the main tools used to
analyze and study relationships between different
variables. Data visualization (e.g., scatter plots, line
graphs, bar plots, histograms, qqplots, smooth densities,
boxplots, pair plots, heat maps, etc.) can be used for
descriptive analytics. Data visualization is also used in
machine learning for data preprocessing and analysis,
feature selection, model building, model testing, and
model evaluation. When preparing a data visualization,
keep in mind that data visualization is more of
an art than a science. To produce a good visualization,
you need to put several pieces of code together to get an
excellent end result.
4_ Outliers
An outlier is a data point that is very different from the
rest of the dataset. Outliers are often just bad data, e.g.,
due to a malfunctioning sensor, a contaminated
experiment, or human error in recording data.
Sometimes, outliers could indicate something real such
as a malfunction in a system. Outliers are very common
and are expected in large datasets. One common way to
detect outliers in a dataset is by using a box plot.
Outliers can significantly degrade the predictive power
of a machine learning model. A common way to deal
with outliers is to simply omit the data points. However,
removing outliers that reflect real behavior paints too
optimistic a picture of the data and leads to unrealistic
models. More advanced methods for dealing with outliers
include the RANSAC method.
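The rule a box plot uses to mark outliers can be sketched in a few lines: points beyond 1.5 times the interquartile range (IQR) from the quartiles are flagged. The function name iqr_outliers and the sensor readings are made up for illustration, and statistics.quantiles requires Python 3.8 or later.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr         # whisker limits
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one obviously bad value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
print(iqr_outliers(readings))  # [55.0]
```

Whether a flagged point is dropped or kept is still a judgment call, since it may be bad data or a real event.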
5_ Data Imputation
Most datasets contain missing values. The easiest way to deal
with missing data is simply to throw away the data point.
However, removing samples or dropping entire feature
columns is often not feasible because we might lose too much
valuable data. In this case, we can use different interpolation
techniques to estimate the missing values from the other
training samples in our dataset. One of the most common
interpolation techniques is mean imputation, where we
simply replace the missing value with the mean value of the
entire feature column. Other options for imputing missing
values are median or most frequent (mode), where the latter
replaces the missing values with the most frequent values.
Whatever imputation method you employ in your model, you
have to keep in mind that imputation is only an approximation,
and hence can produce an error in the final model. If the data
supplied was already preprocessed, you would have to find out
how missing values were handled. What percentage of the
original data was discarded? What imputation method was
used to estimate missing values?
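The three strategies mentioned above (mean, median, most frequent) can be sketched in plain Python. Here None stands in for a missing value, and the function name impute and the sample column are assumptions made for illustration; libraries such as scikit-learn provide their own imputers.

```python
import statistics

def impute(column, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in column if v is not None]
    estimators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "most_frequent": statistics.mode,
    }
    fill = estimators[strategy](observed)
    return [fill if v is None else v for v in column]

column = [1, None, 2, 2, None, 7]        # hypothetical feature column
print(impute(column, "mean"))            # fills with 3
print(impute(column, "median"))          # fills with 2.0
print(impute(column, "most_frequent"))   # fills with 2
```

The filled-in values differ by strategy, which is one reason to record which method was used.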
6_ Data Scaling
Scaling your features will help improve the quality and
predictive power of your model.
In order to bring features to the same scale, we could
decide to use either normalization or standardization of
features. Most often, we assume data is normally
distributed and default towards standardization, but that
is not always the case. It is important that before
deciding whether to use either standardization or
normalization, you first take a look at how your features
are statistically distributed. If the feature tends to be
uniformly distributed, then we may use normalization
(MinMaxScaler). If the feature is approximately
Gaussian, then we can use standardization
(StandardScaler). Again, note that whether you employ
normalization or standardization, these are also
approximate methods and are bound to contribute to
the overall error of the model.
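The arithmetic behind the two options can be sketched in plain Python. These functions are illustrative stand-ins, not the scikit-learn classes named above: min_max_scale maps values to the range [0, 1], and standardize subtracts the mean and divides by the population standard deviation. Both assume the feature is not constant. The height values are made up.

```python
import statistics

def min_max_scale(values):
    """Normalization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

heights = [150, 160, 170, 180, 190]   # hypothetical feature
print(min_max_scale(heights))         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights))
```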
Goal of Data Science
Data Science – One Definition
Textbook
• Required:
– Data Science from Scratch (DSS) by Joel
Grus
easier_to_read_list_of_lists = [ [1, 2, 3],
                                 [4, 5, 6],
                                 [7, 8, 9] ]
Alternatively:
long_winded_computation = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \
9 + 10 + 11 + 12 + 13 + 14 + \
15 + 16 + 17 + 18 + 19 + 20
Modules
• Certain features of Python are not loaded by
default
• In order to use these features, you’ll need to
import the modules that contain them.
• E.g.
import matplotlib.pyplot as plt
import numpy as np
Variables and objects
• A variable is created the first time it is assigned a value
– No need to declare type
– Types are associated with objects not variables
• X = 5
• X = [1, 3, 5]
• X = 'python'
– Assignment creates references, not copies
X = [1, 3, 5]
Y = X
X[0] = 2
print(Y) # Y is [2, 3, 5]
Assignment
• You can assign to multiple names at the same
time
x, y = 2, 3
• To swap values
x, y = y, x
• Assignments can be chained
x=y=z=3
• Accessing a name before it’s been created (by
assignment), raises an error
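A short sketch of the points above. The name not_yet_assigned is hypothetical and is deliberately never assigned, so that the lookup fails with NameError.

```python
x, y = 2, 3
x, y = y, x        # swap: x is now 3, y is now 2
a = b = c = 3      # chained assignment

try:
    print(not_yet_assigned)   # hypothetical name that was never assigned
except NameError as err:
    message = str(err)
    print(message)
```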
Arithmetic
• a = 5 + 2 # a is 7
• b = 9 - 3. # b is 6.0
• c = 5 * 2 # c is 10
• d = 5**2 # d is 25
• e = 5 % 2 # e is 1
• Two or more string literals (i.e. the ones enclosed between quotes) next to
each other are automatically concatenated
s1 = 'Py' 'thon'
s2 = s1 + '2.7'
real_long_string = ('this is a really long string. '
                    'It has multiple parts, '
                    'but all in one line.')
List - 1
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6
• Get the i-th element of a list
x = [i for i in range(10)] # is the list [0, 1, ..., 9]
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
• Get a slice of a list
one_to_four = x[1:5] # [1, 2, 3, 4]
first_three = x[:3] # [0, 1, 2]
last_three = x[-3:] # [7, 8, 9]
three_to_end = x[3:] # [3, 4, ..., 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [0, 1, 2, ..., 9]
another_copy_of_x = x[:3] + x[3:] # [0, 1, 2, ..., 9]
List - 2
• Check for memberships
1 in [1, 2, 3] # True
0 in [1, 2, 3] # False
• Concatenate lists
x = [1, 2, 3]
y = [4, 5, 6]
x.extend(y) # x is now [1,2,3,4,5,6]
x = [1, 2, 3]
y = [4, 5, 6]
z = x + y # z is [1,2,3,4,5,6]; x is unchanged.
• List unpacking (multiple assignment)
x, y = [1, 2] # x is 1 and y is 2
[x, y] = 1, 2 # same as above
x, y = [1, 2] # same as above
x, y = 1, 2 # same as above
_, y = [1, 2] # y is 2, didn't care about the first element
List - 3
• Modify content of list
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[2] = x[2] * 2 # x is [0, 1, 4, 3, 4, 5, 6, 7, 8]
x[-1] = 0 # x is [0, 1, 4, 3, 4, 5, 6, 7, 0]
x[3:5] = [v * 3 for v in x[3:5]] # x is [0, 1, 4, 9, 12, 5, 6, 7, 0]
x[5:7] = [] # x is [0, 1, 4, 9, 12, 7, 0]
del x[:2] # x is [4, 9, 12, 7, 0]
del x[:] # x is []
del x # referencing to x hereafter is a NameError
• Strings can also be sliced, but they cannot be modified (they are immutable)
s = 'abcdefg'
a = s[0] # 'a'
x = s[:2] # 'ab'
y = s[-3:] # 'efg'
s[:2] = 'AB' # this will cause an error (TypeError)
s = 'AB' + s[2:] # s is now 'ABcdefg'
The range() function
for i in range(5):
    print(i) # will print 0, 1, 2, 3, 4 (on separate lines)
for i in range(2, 5):
    print(i) # will print 2, 3, 4
for i in range(0, 10, 2):
    print(i) # will print 0, 2, 4, 6, 8
for i in range(10, 2, -2):
    print(i) # will print 10, 8, 6, 4
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
range() in Python 2 and 3
• In Python 2, range(5) returns the list [0, 1, 2, 3, 4]
• In Python 3, range(5) is an object that can be iterated and indexed,
but it is not a list (it is a lazy sequence)
print(range(3)) # in Python 3, prints "range(0, 3)"
print(range(3)) # in Python 2, prints "[0, 1, 2]"
print(list(range(3))) # prints [0, 1, 2] in Python 3
x = range(5)
print(x[2]) # in Python 2, prints "2"
print(x[2]) # in Python 3, also prints "2"
a = list(range(10))
b = a # b refers to the same list as a
b[0] = 100
print(a) # [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = list(range(10))
b = a[:] # b is a copy of a
b[0] = 100
print(a) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
tuples
• Similar to lists, but are immutable
• a_tuple = (0, 1, 2, 3, 4)
• other_tuple = 3, 4
Note: a tuple is defined by the comma, not the parentheses,
which are only used for convenience. So a = (1)
is not a tuple, but a = (1,) is.
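A quick check of the note above, written as a small sketch. The variable names are arbitrary.

```python
a = (1)     # just the int 1 in parentheses
b = (1,)    # a one-element tuple: the comma makes it a tuple
c = 3, 4    # a tuple written without parentheses

print(type(a).__name__, type(b).__name__, type(c).__name__)

try:
    b[0] = 2                  # tuples are immutable
except TypeError as err:
    error_text = str(err)
    print(error_text)
```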
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
• Get all items
In Python 3, the following return iterable view objects, not lists
all_keys = grades.keys() # view of all keys
all_values = grades.values() # view of all values
all_pairs = grades.items() # view of (key, value) tuples
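A small sketch of how these view objects behave in Python 3. The contents of grades are hypothetical, chosen only so the example runs on its own.

```python
grades = {"Joel": 80, "Tim": 95}     # hypothetical contents

all_keys = grades.keys()             # a view, not a list
grades["Kate"] = 100                 # add a key after creating the view

print(list(all_keys))                # the view reflects the new key
print("Kate" in grades)              # membership test on keys
print(grades.get("Ann", 0))          # get() returns a default instead of raising KeyError
```

Wrap a view in list(...) when an actual list is needed, for example to index into it.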
Difference between Python 2 and Python 3:
Iterable objects vs lists
• In Python 3, range() returns a lazy iterable object.
– Values are created when needed
x = range(10000000) # fast
– Can be accessed by index
x[10000] # allowed, fast
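The laziness can be observed directly. This sketch compares the memory footprint of a large range object with that of a much shorter list; exact byte counts vary by Python build, so only the comparison is printed.

```python
import sys

x = range(10_000_000)            # created instantly; no ten-million-item list is built
small_list = list(range(1000))   # an actual list, for comparison

print(x[10000])                  # indexing works without materializing the values
print(len(x))
print(9_999_999 in x)            # membership is computed, not searched
print(sys.getsizeof(x) < sys.getsizeof(small_list))
```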
• all all(a)
Out[136]: False
Comparison
Operation   Meaning
<           strictly less than
<=          less than or equal
>           strictly greater than
>=          greater than or equal
==          equal
!=          not equal
is          object identity
is not      negated object identity

a = [0, 1, 2, 3, 4]
b = a
c = a[:]
a == b
Out[129]: True
a is b
Out[130]: True
a == c
Out[132]: True
a is c
Out[133]: False
Bitwise operators: & (AND), | (OR), ^ (XOR), ~(NOT), << (Left Shift), >> (Right Shift)
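A brief sketch of the bitwise operators on small integers. The values 6 and 3 are arbitrary.

```python
a, b = 6, 3        # binary 110 and 011

print(a & b)       # 2   (010)
print(a | b)       # 7   (111)
print(a ^ b)       # 5   (101)
print(~a)          # -7  (for integers, ~a equals -a - 1)
print(1 << 4)      # 16  (shift left by 4 bits)
print(16 >> 2)     # 4   (shift right by 2 bits)
```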
Control flow - 2
• loops
x = 0
while x < 10:
    print(x, "is less than 10")
    x += 1
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print(x)
Exceptions
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")
https://docs.python.org/3/tutorial/errors.html
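The same pattern is often wrapped in a function so the caller gets a value back instead of a crash. The name safe_divide and the choice to return None on failure are illustrative assumptions, not part of the lecture.

```python
def safe_divide(a, b):
    """Return a / b, or None if b is zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("cannot divide by zero")
        return None
    else:
        # the else branch runs only when no exception was raised
        return result

print(safe_divide(1, 2))   # 0.5
print(safe_divide(0, 0))   # prints the message, then None
```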
Functions - 1
• Functions are defined using def
def double(x):
    """this is where you put an optional docstring
    that explains what the function does.
    for example, this function multiplies its
    input by 2"""
    return x * 2
• You can call a function after it is defined
z = double(10) # z is 20
• You can give default values to parameters
def my_print(message="my default message"):
    print(message)
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b = 5) # same as above
subtract(b = 5, a = 0) # same as above
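The definition of subtract does not appear in these notes. The calls above are consistent with a function whose parameters both default to 0, so the following is an assumed reconstruction rather than the original slide.

```python
def subtract(a=0, b=0):
    """Assumed definition: both parameters default to 0."""
    return a - b

print(subtract(10, 5))       # 5
print(subtract(0, 5))        # -5
print(subtract(b=5))         # -5, a takes its default of 0
print(subtract(b=5, a=0))    # -5, keyword arguments can be given in any order
```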
Functions - 3
• Functions are objects too
In [12]: def double(x): return x * 2
...: DD = double;
...: DD(2)
...:
Out[12]: 4
In [16]: def apply_to_one(f):
...: return f(1)
...: x=apply_to_one(DD)
...: x
...:
Out[16]: 2
Functions – lambda expression
• Small anonymous functions can be created
with the lambda keyword.
In [18]: y=apply_to_one(lambda x: x+4)
In [19]: y
Out[19]: 5
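Lambdas are most useful as short, throwaway arguments to other functions. This sketch repeats apply_to_one so that it runs on its own, then passes a lambda as the key argument of the built-in max. The pairs list is made up.

```python
def apply_to_one(f):
    """Call f with the argument 1."""
    return f(1)

y = apply_to_one(lambda x: x + 4)    # 5

# Hypothetical (label, score) pairs; pick the pair with the largest score.
pairs = [("b", 2), ("a", 3), ("c", 1)]
best = max(pairs, key=lambda pair: pair[1])
print(y, best)
```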
In [68]: even_numbers = []
In [69]: for x in range(5):
    ...:     if x % 2 == 0:
    ...:         even_numbers.append(x)
    ...: even_numbers
Out[69]: [0, 2, 4]
List comprehension - 3
• More complex examples:
# create 100 pairs (0,0) (0,1) ... (9,8), (9,9)
pairs = [(x, y)
for x in range(10)
for y in range(10)]
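For comparison, the explicit loop and the nested example can be written side by side as comprehensions. The third list, increasing_pairs, is an extra illustration showing that a later for clause may use variables from an earlier one.

```python
# Same result as the explicit loop with append
even_numbers = [x for x in range(5) if x % 2 == 0]

# All 100 pairs (0,0), (0,1), ..., (9,9)
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]

# Only pairs with x < y; the inner range depends on x
increasing_pairs = [(x, y)
                    for x in range(10)
                    for y in range(x + 1, 10)]

print(even_numbers, len(pairs), len(increasing_pairs))
```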
In [204]: double(b)
Traceback (most recent call last):
…
TypeError: unsupported operand type(s) for *: 'int' and 'range'
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [208]: def is_even(x): return x%2==0
...: a=[0, 1, 2, 3]
...: list(filter(is_even, a))
...:
Out[208]: [0, 2]
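A sketch of all three tools together. In Python 3, map and filter are built-ins that return lazy iterators (hence the list(...) calls), while reduce must be imported from functools.

```python
from functools import reduce

def is_even(x):
    return x % 2 == 0

a = [0, 1, 2, 3]

doubled = list(map(lambda x: x * 2, a))          # apply a function to every element
evens = list(filter(is_even, a))                 # keep elements where the function is True
total = reduce(lambda acc, x: acc + x, a, 0)     # fold the list into a single value

print(doubled, evens, total)
```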
https://docs.python.org/3/tutorial/inputoutput.html
Files - output
https://docs.python.org/3/tutorial/inputoutput.html
Module math
Command name          Description
abs(value)            absolute value (built-in)
ceil(value)           rounds up
cos(value)            cosine, in radians
floor(value)          rounds down
log(value)            logarithm, base e
log10(value)          logarithm, base 10
max(value1, value2)   larger of two values (built-in)
min(value1, value2)   smaller of two values (built-in)
round(value)          nearest whole number (built-in)
sin(value)            sine, in radians
sqrt(value)           square root

Constant   Description
e          2.7182818...
pi         3.1415926...

Note: abs, max, min, and round are built-in functions, not part of the
math module. The module's own absolute-value function is math.fabs.
# preferred.
import math
math.fabs(-0.5)
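A short sketch of calling these functions. Names prefixed with math. come from the module; abs and round are built-ins and need no import. The variable names are arbitrary.

```python
import math

root = math.sqrt(16)              # 4.0
up = math.ceil(2.1)               # 3
down = math.floor(2.7)            # 2
natural_log = math.log(math.e)    # approximately 1.0
log_base_10 = math.log10(1000)    # approximately 3.0
half_turn = math.cos(math.pi)     # approximately -1.0

magnitude = abs(-0.5)             # built-in, not math.abs
nearest = round(2.6)              # built-in

print(root, up, down, natural_log, log_base_10, half_turn, magnitude, nearest)
```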
• Numpy
– Key module for scientific computing
– Convenient and efficient ways to handle multi dimensional
arrays
• pandas
– DataFrame
– Flexible data structure of labeled tabular data
• Matplotlib: for plotting
• Scipy: solutions to common scientific computing problems
such as linear algebra, optimization, statistics, and sparse
matrices
Module paths
• In order to be able to find a module called myscripts.py,
the interpreter scans the list sys.path of directory names.
• The module must be in one of those directories.
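The search path can be inspected and extended at run time. The directory name below is a placeholder, not a real location. Appending to sys.path affects only the current interpreter session.

```python
import sys

# Print the directories the interpreter searches, in order
for directory in sys.path:
    print(directory)

# Hypothetical directory that would contain myscripts.py
extra_dir = "/path/to/my/modules"
if extra_dir not in sys.path:
    sys.path.append(extra_dir)   # after this, "import myscripts" would also look there
```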
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> "Hello" * 3
'HelloHelloHello'
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
• We can change lists in place.
• Name li still points to the same memory
reference when we’re done.
Tuples are immutable
>>> t = (23, 'abc', 4.56, (2,3), 'def')
>>> t[2] = 3.14
Traceback (most recent call last):
  File "<pyshell#75>", line 1, in -toplevel-
    t[2] = 3.14
TypeError: object doesn't support item assignment
>>> li.sort(key=some_function)
# sort in place using a user-defined key function
# (Python 2 also accepted a comparison function: li.sort(some_function))
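In Python 3, list.sort takes a key function rather than a comparison function. An old-style comparison function can still be used by wrapping it with functools.cmp_to_key. The word list and the function by_length_desc are made up for illustration.

```python
from functools import cmp_to_key

words = ["pear", "fig", "banana"]

# Key function: sort by length, shortest first
words.sort(key=len)
shortest_first = list(words)

def by_length_desc(a, b):
    """Old-style comparison: negative if a should come before b."""
    return len(b) - len(a)

# Wrap the comparison function so Python 3 accepts it
words.sort(key=cmp_to_key(by_length_desc))
longest_first = list(words)

print(shortest_first, longest_first)
```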
Tuple details
• The comma is the tuple creation operator, not
parens
>>> 1,
(1,)