Lec1 - For Upload Complete
Image credits to
Source: indeed.com
Data Science: Why all the Excitement?
e.g.,
Google Flu Trends:
Detecting outbreaks
two weeks ahead
of CDC data
Why All the Excitement?
Data and Election 2012 (cont.)
• …that was just one of several ways that Mr. Obama’s campaign
operations, some unnoticed by Mr. Romney’s aides in Boston,
helped save the president’s candidacy. In Chicago, the campaign
recruited a team of behavioral scientists to build an extraordinarily
sophisticated database
…that allowed the Obama campaign not only to alter the very
nature of the electorate, making it younger and less white, but
also to create a portrait of shifting voter allegiances. The power of
this operation stunned Mr. Romney’s aides on election night, as
they saw voters they never even knew existed turn out in places
like Osceola County, Fla.
New York Times, Wed Nov 7, 2012
• The White House Names Dr. DJ Patil as the First U.S. Chief Data
Scientist, Feb. 18th 2015
The unreasonable effectiveness of Deep
Learning (CNNs)
2012 Imagenet challenge:
Classify 1 million images into 1000 classes.
The unreasonable effectiveness of Deep
Learning (CNNs)
Performance of deep learning systems over time:
Human performance
5.1% error
2015
Difference between BI and Data Science
BI stands for business intelligence, which is also used to analyze business information.
Some differences between BI and data science:
“Big Data” Sources
It’s All Happening On-line
User Generated (Web & Mobile)
Every:
Click
Ad impression
Billing event
…
Fast forward, pause, …
Server request
Transaction
Network message
Fault
…
There's certainly a lot of it!
Data, data
everywhere…
[Figure: data volume on a logarithmic scale, with axis marks at 1 Exabyte and 1 Zettabyte; labeled values: 120 PB, 5 EB, 800 EB, 1.8 ZB, 8.0 ZB]
Goal of Data Science
Clean,
prep
Evaluate
Interpret
Example data science applications
• Marketing: predict the characteristics of high lifetime
value (LTV) customers, which can be used to support
customer segmentation, identify upsell opportunities,
and support other marketing initiatives
• Logistics: forecast how many of which items will be needed
and where, which enables lean
inventory and prevents out-of-stock situations
• Healthcare: analyze survival statistics for different
patient attributes (age, blood type, gender, etc.) and
treatments; predict risk of readmission based on
patient attributes, medical history, etc.
More Examples
• Transaction Databases → Recommender systems (Netflix), Fraud
Detection (Security and Privacy)
Searches for
“MySpace”
Searches for
“Facebook”
Data Makes Everything Clearer?
http://techcrunch.com/2014/01/23/facebook-losing-users-princeton-losing-credibility/
Machine learning in Data Science
A data scientist should be familiar with machine
learning, because many machine learning algorithms
are widely used in data science. Some of the most
common ones are:
• Regression
• Decision tree
• Clustering
• Principal component analysis
• Support vector machines
• Naive Bayes
• Artificial neural network
Solve a problem in Data Science using
Machine learning algorithms
Data Science Lifecycle
The data science life cycle consists of
the six phases described below.
1. Discovery: The first phase is discovery, which involves asking the right
questions. When you start a data science project, you need to determine
its basic requirements, priorities, and budget. In this phase, we identify
everything the project needs, such as the number of people, technology,
time, data, and an end goal, and then frame the business problem as an
initial hypothesis.
2. Data preparation: Data preparation is also known as data munging. In this
phase, we perform the following tasks:
• Data cleaning
• Data reduction
• Data integration
• Data transformation
Once these tasks are complete, the data is ready for the later phases.
3. Model planning: In this phase, we decide which methods and techniques
to use to establish the relationships between input variables. We apply
exploratory data analysis (EDA), using statistical formulas and
visualization tools, to understand the relationships between variables and
to see what the data can tell us. Common tools used for model planning are:
• SQL Analysis Services
• R
• SAS
• Python
4. Model building: In this phase, the process of model building
starts. We create datasets for training and testing, then
apply techniques such as association,
classification, and clustering to build the model.
Some common model building tools:
• SAS Enterprise Miner
• WEKA
• SPSS Modeler
• MATLAB
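As a rough illustration of the "datasets for training and testing" step, the sketch below splits a list of records at random using only the Python standard library. The helper name `train_test_split_simple`, the 80/20 ratio, and the fixed seed are assumptions made for this example; they are not part of the lecture.

```python
import random

def train_test_split_simple(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of records and split it into (train, test) lists."""
    shuffled = list(records)                # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                      # stand-in for real observations
train, test = train_test_split_simple(data)
print(len(train), len(test))
```

The model is then fitted on `train` only, and `test` is held back to estimate how well it generalizes.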
5. Operationalize: In this phase, we will deliver the final reports
of the project, along with briefings, code, and technical
documents. This phase provides you a clear overview of
complete project performance and other components on a
small scale before the full deployment.
6. Communicate results: In this phase, we check whether we reached
the goal set in the initial phase, and we
communicate the findings and final results to the business
team.
Brief introduction of Python
• Invented in the Netherlands in the early 1990s by Guido
van Rossum
• Open sourced from the beginning
• Considered a scripting language, but is much more
– No compilation needed
– Scripts are evaluated by the interpreter, line by line
– Functions need to be defined before they are called
Different ways to run python
• Call python program via python interpreter from a Unix/windows
command line
– $ python testScript.py
– Or make the script directly executable, with additional header lines in the script
• Using python console
– Typing in python statements. Limited functionality
>>> 3 +3
6
>>> exit()
• Using ipython console
– Typing in python statements. Very interactive.
In [167]: 3+3
Out [167]: 6
– Typing in %run testScript.py
– Many convenient “magic functions”
Anaconda for python3
• We’ll be using anaconda which includes python
environment and an IDE (spyder) as well as many
additional features
– Can also use Enthought
• Most python modules needed in data science are
already installed with the anaconda distribution
• Install with python 3.6 (and install python 2.7 as
secondary from anaconda prompt)
• Key diff between Python 2 and python 3
Formatting
• Many languages use curly braces to delimit blocks of code. Python
uses indentation. Incorrect indentation causes an error.
• Comments start with #
• Colons start a new block in many constructs, e.g. function
definitions, if statements, for, while
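• Whitespace is ignored inside parentheses and brackets, so long
expressions can span several lines:
easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]
Alternatively, use a backslash to continue a statement on the next line:
long_winded_computation = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \
                          9 + 10 + 11 + 12 + 13 + 14 + \
                          15 + 16 + 17 + 18 + 19 + 20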
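A small sketch of these formatting rules, using made-up names: the colon opens a block, the indented lines form its body, and brackets or a trailing backslash let a statement span several lines.

```python
# A colon starts a block; consistent indentation marks what belongs to it.
def describe(n):
    if n % 2 == 0:
        label = "even"      # inside the if block
    else:
        label = "odd"       # inside the else block
    return label            # back at function level

# Inside brackets, line breaks are ignored, so no backslash is needed.
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

# Outside brackets, a backslash continues the statement on the next line.
total = 1 + 2 + 3 + \
        4 + 5

print(describe(4), grid[1][2], total)
```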
Modules
• Certain features of Python are not loaded by
default
• In order to use these features, you’ll need to
import the modules that contain them.
• E.g.
import matplotlib.pyplot as plt
import numpy as np
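The same import forms work for any module. The sketch below sticks to standard-library modules so that it runs without extra installs; the choice of `math`, `random`, and `collections` is only for illustration.

```python
import math                       # import the whole module
import random as rnd              # import under a shorter alias
from collections import Counter   # import a single name from a module

print(math.sqrt(16))              # names are reached through the module name
rnd.seed(0)                       # the alias is used in place of the full name
roll = rnd.randint(1, 6)          # an integer between 1 and 6 inclusive
counts = Counter("banana")        # Counter is used directly, without a prefix
print(roll, counts["a"])
```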
Variables and objects
• A variable is created the first time it is assigned a
value
– No need to declare type
– Types are associated with objects, not variables
• X = 5
• X = [1, 3, 5]
• X = 'python'
– Assignment creates references, not copies
X = [1, 3, 5]
Y = X
X[0] = 2
print(Y) # Y is [2, 3, 5]
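To make the "references, not copies" point concrete, here is a short sketch that contrasts plain assignment with an explicit copy. The lowercase names are chosen for this example only.

```python
x = [1, 3, 5]
y = x            # y refers to the same list object as x
z = list(x)      # z is a new list with the same elements

x[0] = 2
print(y)         # the change is visible through y
print(z)         # z is unaffected
print(y is x, z is x)

x = "python"     # rebinding x does not touch the list that y refers to
print(type(x).__name__, y)
```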
Assignment
• You can assign to multiple names at the same time
x, y = 2, 3
• To swap values
x, y = y, x
• Assignments can be chained
x=y=z=3
• Accessing a name before it’s been created (by
assignment), raises an error
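The assignment forms above can be tried together in one short sketch. The name `undefined_name` is deliberately never assigned, to show the error raised in that case.

```python
x, y = 2, 3
x, y = y, x                   # swap without a temporary variable
a = b = c = 3                 # all three names refer to the same object

try:
    print(undefined_name)     # never assigned, so this raises NameError
except NameError as err:
    message = str(err)

print(x, y, a, b, c)
print(message)
```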
Arithmetic
• a=5+2 # a is 7
• b = 9 - 3. # b is 6.0
• c=5*2 # c is 10
• d = 5**2 # d is 25
• e=5%2 # e is 1
x = [1, 2, 3]
y = [4, 5, 6]
z = x + y # z is [1,2,3,4,5,6]; x is unchanged.
• List unpacking (multiple assignment)
x, y = [1, 2] # x is 1 and y is 2
[x, y] = 1, 2 # same as above
x, y = [1, 2] # same as above
x, y = 1, 2 # same as above
_, y = [1, 2] # y is 2, didn't care about the first element
List - 3
• Modify content of list
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[2] = x[2] * 2 # x is [0, 1, 4, 3, 4, 5, 6, 7, 8]
x[-1] = 0 # x is [0, 1, 4, 3, 4, 5, 6, 7, 0]
x[3:5] = [v * 3 for v in x[3:5]] # x is [0, 1, 4, 9, 12, 5, 6, 7, 0]; note that x[3:5] * 3 would repeat the slice, not multiply its elements
x[5:7] = [] # x is [0, 1, 4, 9, 12, 7, 0]
del x[:2] # x is [4, 9, 12, 7, 0]
del x[:] # x is []
del x # referencing to x hereafter is a NameError
• Strings can also be sliced, but they cannot be modified (they are immutable)
s = 'abcdefg'
a = s[0] # 'a'
x = s[:2] # 'ab'
y = s[-3:] # 'efg'
s[:2] = 'AB' # this will cause a TypeError
s = 'AB' + s[2:] # s is now 'ABcdefg'
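A compact sketch of the difference between mutable lists and immutable strings under slicing. The values are arbitrary and chosen only to keep the result easy to follow.

```python
x = list(range(9))             # [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[3:5] = ["a", "b", "c"]       # slice assignment may change the list's length
del x[-2:]                     # drop the last two elements

s = "abcdefg"
try:
    s[0] = "A"                 # strings do not support item assignment
except TypeError:
    s = "A" + s[1:]            # build a new string instead

print(x)
print(s)
```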
The range() function
for i in range(5):
print (i) # will print 0, 1, 2, 3, 4 (in separate lines)
for i in range(2, 5):
print (i) # will print 2, 3, 4
for i in range(0, 10, 2):
print (i) # will print 0, 2, 4, 6, 8
for i in range(10, 2, -2):
print (i) # will print 10, 8, 6, 4
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
... print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
Range() in python 2 and 3
• In python 2, range(5) is equivalent to [0, 1, 2, 3, 4]
• In python 3, range(5) is an object which can be iterated and indexed,
but is not identical to [0, 1, 2, 3, 4] (its values are produced lazily)
x = range(5)
print (x[2]) # in python 2, will print 2
print (x[2]) # in python 3, will also print 2
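The following sketch, written for Python 3, shows what the range object supports and how it differs from a list. The variable names are illustrative.

```python
r = range(5)
print(r)                       # shows the range object, not its elements
print(list(r))                 # materialize the values as a list
print(r[2], len(r), 3 in r)    # indexing, length and membership all work
print(r == [0, 1, 2, 3, 4])    # a range never compares equal to a list

big = range(10**12)            # cheap to create: no elements are stored
print(big[-1])                 # computed on demand
```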
a = list(range(10))
b = a # b refers to the same list as a
b[0] = 100
print(a) # [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = list(range(10))
b = a[:] # b is a copy of a
b[0] = 100
print(a) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
tuples
• Similar to lists, but are immutable
• a_tuple = (0, 1, 2, 3, 4)
• Other_tuple = 3, 4
• Another_tuple = tuple([0, 1, 2, 3, 4])
• Heterogeneous_tuple = ('john', 1.1, [1, 2])
Note: a tuple is defined by the comma, not the parentheses, which are
used only for convenience. So a = (1) is not a tuple, but a = (1,) is.
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
• Get all items
In python 3, the following do not return lists but iterable view objects
all_keys = grades.keys() # a view of all keys
all_values = grades.values() # a view of all values
all_pairs = grades.items() # a view of (key, value) tuples
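A runnable sketch of these dictionary operations. The contents of `grades` are hypothetical, since the slide that defines it is not part of this extract.

```python
# Hypothetical contents, chosen so that "Joel" is present and "Kate" is not.
grades = {"Joel": 80, "Tim": 95}

kates_grade = grades.get("Kate", 0)   # get() returns a default instead of raising KeyError
keys = grades.keys()                  # a view object in Python 3, not a list
grades["Kate"] = 100                  # views reflect later changes to the dict

print("Joel" in grades, kates_grade)
print(list(keys))                     # convert a view to a list when one is needed
print(sorted(grades.items()))
```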
Difference between python 2 and
python 3: Iterable objects vs lists
• In Python 3, range() returns a lazy iterable object.
– Values are created when needed
x = range(10000000) # fast
– Can be accessed by index
x[10000] # allowed, fast
• any
any(a)
Out[135]: True
• all
all(a)
Out[136]: False
Comparison
Operation   Meaning
<           strictly less than
a = [0, 1, 2, 3, 4]
b = a
c = a[:]
Bitwise operators: & (AND), | (OR), ^ (XOR), ~(NOT), << (Left Shift), >> (Right Shift)
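A brief sketch that uses the `a`, `b`, `c` lists from this slide to contrast equality (`==`) with identity (`is`), followed by the bitwise operators applied to small integers chosen for illustration.

```python
a = [0, 1, 2, 3, 4]
b = a                    # same object as a
c = a[:]                 # copy: equal contents, different object

print(a == b, a is b)    # equal and identical
print(a == c, a is c)    # equal but not identical
print(1 < 2 <= 2)        # comparisons can be chained

# Bitwise operators act on the binary representation of integers.
bits = (6 & 3, 6 | 3, 6 ^ 3, ~6, 6 << 1, 6 >> 1)
print(bits)
```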
Control flow - 2
• loops
x = 0
while x < 10:
    print (x, "is less than 10")
    x += 1
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print (x)
Exceptions
try:
    print(0 / 0)
except ZeroDivisionError:
    print ("cannot divide by zero")
https://docs.python.org/3/tutorial/errors.html
Functions - 1
• Functions are defined using def
def double(x):
    """this is where you put an optional docstring
    that explains what the function does.
    for example, this function multiplies its
    input by 2"""
    return x * 2
• You can call a function after it is defined
z = double(10) # z is 20
• You can give default values to parameters
def my_print(message="my default message"):
    print (message)
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b = 5) # same as above
subtract(b = 5, a = 0) # same as above
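The definition of `subtract` is not shown in this extract. A definition consistent with the calls above, assuming both parameters default to 0, might look like this:

```python
# Assumed definition: both parameters default to 0.
def subtract(a=0, b=0):
    return a - b

print(subtract(10, 5))
print(subtract(0, 5))
print(subtract(b=5))          # a falls back to its default of 0
print(subtract(b=5, a=0))     # keyword arguments may appear in any order
```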
Functions - 3
• Functions are objects too
In [12]: def double(x): return x * 2
...: DD = double;
...: DD(2)
...:
Out[12]: 4
In [16]: def apply_to_one(f):
...: return f(1)
...: x=apply_to_one(DD)
...: x
...:
Out[16]: 2
Functions – lambda expression
• Small anonymous functions can be created
with the lambda keyword.
In [18]: y=apply_to_one(lambda x: x+4)
In [19]: y
Out[19]: 5
In [68]: even_numbers = []
In [69]: for x in range(5):
...: if x % 2 == 0:
...: even_numbers.append(x)
...: even_numbers
Out[69]: [0, 2, 4]
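The loop above can also be written as a list comprehension, which builds the list in a single expression. The sketch below shows that form next to a lambda passed to `map`; the names `squares` and `even_squares` are illustrative.

```python
# Same result as the explicit loop with append().
even_numbers = [x for x in range(5) if x % 2 == 0]

# A comprehension can transform as well as filter.
squares = [x * x for x in range(5)]

# A lambda is convenient for a short function that is used only once.
even_squares = list(map(lambda x: x * x, even_numbers))

print(even_numbers, squares, even_squares)
```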
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [203]: def double(x): return 2*x
     ...: b=range(5)
     ...: list(map(double, b))
Out[203]: [0, 2, 4, 6, 8]
In [204]: double(b)
Traceback (most recent call last):
…
TypeError: unsupported operand type(s) for *: 'int' and 'range'
In [205]: [double(i) for i in range(5)]
Out[205]: [0, 2, 4, 6, 8]
In [208]: def is_even(x): return x%2==0
...: a=[0, 1, 2, 3]
...: list(filter(is_even, a))
...:
Out[208]: [0, 2]
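The slides show `map` and `filter` but no example of `reduce`. In Python 3, `map` and `filter` are built in, while `reduce` must be imported from `functools`. A minimal sketch, with the helper `add` and the sample list made up for illustration:

```python
from functools import reduce

def add(x, y):
    return x + y

numbers = [1, 2, 3, 4]
total = reduce(add, numbers)                       # ((1 + 2) + 3) + 4
product = reduce(lambda x, y: x * y, numbers, 1)   # third argument is the starting value

print(total, product)
```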
https://docs.python.org/3/tutorial/inputoutput.html
Files - output
https://docs.python.org/3/tutorial/inputoutput.html
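The slide content for file output is not included in this extract. As a minimal sketch of writing and then reading a text file with the built-in `open`, assuming a temporary directory and a made-up file name:

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:        # "w" creates the file or overwrites it
    f.write("first line\n")
    f.write("second line\n")

with open(path) as f:             # the default mode is "r" (read text)
    lines = [line.rstrip("\n") for line in f]

print(lines)
```

The `with` statement closes the file automatically, even if an error occurs inside the block.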
Module math
Command name          Description
abs(value)            absolute value
ceil(value)           rounds up
cos(value)            cosine, in radians
floor(value)          rounds down
log(value)            logarithm, base e
log10(value)          logarithm, base 10
max(value1, value2)   larger of two values
min(value1, value2)   smaller of two values
round(value)          nearest whole number
sin(value)            sine, in radians
sqrt(value)           square root
Constant              Description
e                     2.7182818...
pi                    3.1415926...
Note: abs, max, min and round are built-in functions, not part of the math
module; the math module's absolute-value function is math.fabs.
# preferred.
import math
math.fabs(-0.5)
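A short sketch that exercises several entries from the table. The input values are arbitrary.

```python
import math

print(math.ceil(2.1), math.floor(2.9))    # round up / round down
print(math.sqrt(9), math.log10(1000))     # square root, base-10 logarithm
print(math.log(math.e))                   # natural logarithm
print(math.cos(0), math.sin(0))           # arguments are in radians
print(math.pi, math.e)                    # constants

# These are built-in functions, so they need no import.
print(abs(-0.5), max(2, 7), min(2, 7), round(2.6))
```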
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> "Hello" * 3
'HelloHelloHello'
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
• We can change lists in place.
• Name li still points to the same memory
reference when we’re done.
Tuples are immutable
>>> t = (23, 'abc', 4.56, (2,3), 'def')
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -toplevel-
t[2] = 3.14
TypeError: object doesn't support item assignment
>>> li.sort(key=some_function)
# sort in place using a user-defined key function
# (Python 3; Python 2 also accepted a comparison function as the first argument)
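A minimal sketch of sorting with a key function in Python 3. The word list and the choice of key functions are made up for illustration.

```python
words = ["pear", "fig", "banana"]

words.sort(key=len)                                    # in place, shortest first
by_last_letter = sorted(words, key=lambda w: w[-1])    # sorted() returns a new list

print(words)
print(by_last_letter)
```

`list.sort` changes the list and returns None, whereas `sorted` leaves its argument unchanged.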
Tuple details
• The comma is the tuple creation operator, not
parens
>>> 1,
(1,)
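The role of the comma can be checked directly, as in the sketch below. It also shows that immutability applies to the tuple itself and not to mutable objects stored inside it; the values are arbitrary.

```python
a = (1)        # just the integer 1 in parentheses
b = (1,)       # one-element tuple: the comma makes it a tuple
c = 1, 2       # parentheses are optional
empty = ()     # the empty tuple is the one case written with parentheses alone

print(type(a).__name__, type(b).__name__, type(c).__name__, len(empty))

t = (23, "abc", [2, 3])
try:
    t[0] = 99                  # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

t[2].append(4)                 # a mutable element can still be changed in place
print(t)
```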