Part 1 Lectures

The document serves as an introduction to data science, covering its definition, the importance of data wrangling, visualization, and handling outliers and missing data. It emphasizes the significance of Python programming in data science and outlines the course structure, including topics such as statistics, machine learning, and practical applications in various fields. Additionally, it discusses the growing demand for data scientists and the vast sources of data available today.

Uploaded by

Rabia Shaikh


Introduction to Data Science

Part I: Course intro & Python tutorial

Plan for this lecture
• Data Science - why all the excitement
• What is data science
• Course information – syllabus, grading, etc.
• Basic Python programming
Data Scientists are in high demand
Also in academia
Data Science: Why all the Excitement?
e.g.,
Google Flu Trends:

Detecting outbreaks
two weeks ahead
of CDC data

New models are estimating

which cities are most at risk
for spread of the Ebola virus.

The unreasonable effectiveness of Deep
Learning (CNNs)
2012 Imagenet challenge:
Classify 1 million images into 1000 classes.

Where does data come from?

“Big Data” Sources
It’s All Happening On-line
Every:
Click
Ad impression
Billing event
Fast forward, pause, …
Server request
Transaction
Network message
Fault
…
User Generated (Web & Mobile)
Internet of Things / M2M
Health/Scientific Computing

Graph Data
Lots of interesting data
has a graph structure:
• Social networks
• Communication networks
• Computer Networks
• Road networks
• Citations
• Collaborations/Relationships
• …

Some of these graphs can get

quite large (e.g., Facebook*
user graph)

Data, data everywhere…
There's certainly a lot of it!

[Figure: data produced each year, logarithmic scale from 1 petabyte to 1 zettabyte. 2002: 5 EB; 2006: 161 EB; 2009: 800 EB; 2011: 1.8 ZB; 2015: 8.0 ZB. Reference points: human brain's capacity, 14 PB; a life in video, 60 PB; 100 years of HD video + audio, 120 PB.]

1 Petabyte = 1000 TB
1 TB = 1000 GB
References

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!

“Data is the New Oil”
– World Economic Forum 2011
“Data Science” an Emerging Field

O’Reilly Radar report, 2011

Data Science – A Definition

Data Science is the science which uses computer
science, statistics, machine learning,
visualization, and human-computer interaction
to collect, clean, integrate, analyze, visualize, and
interact with data to create data products.

2_ Data Wrangling
Data wrangling is the process of converting data from its
raw form to a tidy form ready for analysis. Data
wrangling is an important step in data preprocessing and
includes several processes like data importing, data
cleaning, data structuring, string processing, HTML
parsing, handling dates and times, handling missing data,
and text mining.
The process of data wrangling is a critical step for any
data scientist. Very rarely is data easily accessible in a
data science project for analysis. It is more likely for the
data to be in a file, a database, or extracted from
documents such as web pages, tweets, or PDFs. Knowing
how to wrangle and clean data will enable you to derive
critical insights from your data that would otherwise be
hidden.
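As a rough illustration of the wrangling steps described above (importing, cleaning strings, parsing dates, marking missing data), the sketch below tidies a small CSV using only the standard library. The column names and values are invented for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data: stray whitespace, inconsistent case, a missing age.
raw = """name,signup_date,age
 Alice ,2021-03-01,34
BOB,2021-03-05,
carol,2021-04-12,29
"""

tidy = []
for row in csv.DictReader(io.StringIO(raw)):
    name = row["name"].strip().title()                               # clean strings
    signup = datetime.strptime(row["signup_date"], "%Y-%m-%d").date()  # parse dates
    age = int(row["age"]) if row["age"].strip() else None            # mark missing data
    tidy.append({"name": name, "signup_date": signup, "age": age})

for record in tidy:
    print(record)
```

In practice a library such as pandas would usually handle these steps, but the individual operations are the same.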

3_ Data Visualization
Data visualization is one of the most important branches
of data science. It is one of the main tools used to
analyze and study relationships between different
variables. Data visualization (e.g., scatter plots, line
graphs, bar plots, histograms, Q-Q plots, smooth densities,
boxplots, pair plots, heat maps, etc.) can be used for
descriptive analytics. Data visualization is also used in
machine learning for data preprocessing and analysis,
feature selection, model building, model testing, and
model evaluation. When preparing a data visualization,
keep in mind that data visualization is more of
an art than a science. A good visualization usually
takes several pieces of code, combined and refined
until the end result is clear.

4_ Outliers
An outlier is a data point that is very different from the
rest of the dataset. Outliers are often just bad data, e.g.,
due to a malfunctioning sensor, a contaminated
experiment, or human error in recording data.
Sometimes, outliers indicate something real, such
as a malfunction in a system. Outliers are very common
and are expected in large datasets. One common way to
detect outliers in a dataset is by using a box plot.
Outliers can significantly degrade the predictive power
of a machine learning model. A common way to deal
with outliers is to simply omit the data points. However,
removing outliers that reflect real behavior makes the data
look cleaner than it is and can lead to unrealistic models.
Advanced methods for dealing with outliers include the
RANSAC method.
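The box-plot approach mentioned above can be sketched numerically. This example assumes the conventional 1.5 × IQR whisker rule, which the slides do not spell out, and the readings are made up.

```python
import statistics

# Made-up sensor readings; 95 is a deliberately planted outlier.
readings = [10, 12, 11, 13, 12, 11, 10, 95]

q1, _, q3 = statistics.quantiles(readings, n=4)   # the three quartile cut points
iqr = q3 - q1                                     # interquartile range
lower = q1 - 1.5 * iqr                            # lower whisker limit
upper = q3 + 1.5 * iqr                            # upper whisker limit

outliers = [r for r in readings if r < lower or r > upper]
kept = [r for r in readings if lower <= r <= upper]

print(outliers)   # [95]
```

Whether the flagged points should be dropped is a separate decision; the rule only identifies candidates.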

5_ Data Imputation
Most datasets contain missing values. The easiest way to deal
with missing data is simply to throw away the data point.
However, removing samples or dropping entire feature
columns is often not feasible because we might lose too much
valuable data. In this case, we can use different interpolation
techniques to estimate the missing values from the other
training samples in our dataset. One of the most common
interpolation techniques is mean imputation, where we
simply replace the missing value with the mean value of the
entire feature column. Other options for imputing missing
values are median or most frequent (mode), where the latter
replaces the missing values with the most frequent value.
Whatever imputation method you employ in your model, you
have to keep in mind that imputation is only an approximation,
and hence can introduce error into the final model. If the data
supplied was already preprocessed, you would have to find out
how missing values were handled. What percentage of the
original data was discarded? What imputation method was
used to estimate missing values?
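The three imputation strategies described above (mean, median, most frequent) can be sketched with the standard library. The feature column and the helper name `impute` are invented for illustration.

```python
import statistics

# Made-up feature column; None marks a missing value.
column = [2.0, None, 4.0, 4.0, None, 10.0]
observed = [v for v in column if v is not None]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

mean_imputed = impute(column, statistics.mean(observed))      # fills with 5.0
median_imputed = impute(column, statistics.median(observed))  # fills with 4.0
mode_imputed = impute(column, statistics.mode(observed))      # fills with 4.0

print(mean_imputed)   # [2.0, 5.0, 4.0, 4.0, 5.0, 10.0]
```

Note how the single large value (10.0) pulls the mean above the median, so the choice of strategy changes the filled-in values.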

6_ Data Scaling
Scaling your features can help improve the quality and
predictive power of your model, especially for models
that are sensitive to the magnitude of feature values.
In order to bring features to the same scale, we could
decide to use either normalization or standardization of
features. Most often, we assume data is normally
distributed and default towards standardization, but that
is not always the case. Before
deciding whether to use standardization or
normalization, first take a look at how your features
are statistically distributed. If the feature tends to be
uniformly distributed, then we may use normalization
(MinMaxScaler). If the feature is approximately
Gaussian, then we can use standardization
(StandardScaler). Again, note that whether you employ
normalization or standardization, these are also
approximate methods and are bound to contribute to
the overall error of the model.
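The two scaling options can be written out directly from their formulas. This is a plain-Python sketch on made-up values, not a use of the scaler classes named above, which appear to refer to scikit-learn; the population standard deviation is used here as an assumption.

```python
import statistics

feature = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values

# Normalization (min-max scaling): map values onto the interval [0, 1].
lo, hi = min(feature), max(feature)
normalized = [(v - lo) / (hi - lo) for v in feature]

# Standardization: subtract the mean, divide by the standard deviation.
mu = statistics.mean(feature)
sigma = statistics.pstdev(feature)     # population standard deviation (assumed)
standardized = [(v - mu) / sigma for v in feature]

print(normalized)      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)    # values centered on 0 with unit standard deviation
```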
Goal of Data Science

Turn data into data products.

Iterative steps of machine learning
Example data science applications
• Marketing: predict the characteristics of high lifetime
value (LTV) customers, which can be used to
support customer segmentation, identify upsell
opportunities, and support other marketing
initiatives
• Healthcare: analyze survival statistics for different
patient attributes (age, blood type, gender, etc.)
and treatments; predict risk of re-admittance
based on patient attributes, medical history, etc.
More Examples
• Transaction Databases → Recommender systems
(Netflix), Fraud Detection (Security and Privacy)
• Wireless Sensor Data → Smart Home, Real-time
Monitoring, Internet of Things
• Text Data, Social Media Data → Product Review and
Consumer Satisfaction (Facebook, Twitter, LinkedIn)
Data Science – One Definition
Textbook
• Required:
– Data Science from Scratch (DSS) by Joel Grus
– Python for Data Analysis (PDA) by Wes McKinney
– Free e-book: Think Stats (TS) by Allen B. Downey
• Optional: Python Data Science Handbook (PDSH) by Jake VanderPlas
Tentative course content (subject to change)
• Week 1-2:
– Python basics
– Basic plotting: line graph, bar chart, scatter plot
– Basic statistics: mean, median, standard deviation
– Matplotlib & Numpy
• Week 3-5:
– More statistics:
• Continuous distribution, correlation, hypothesis testing
– Probability
– Linear algebra
• Week 6: midterm
• Week 7-8: data in/out, transformation, pandas. Project description out.
• Week 9-10: linear algebra, regression
• Week 11-12: classification
• Week 13-14: clustering
• Week 15: networks
• Week 13-15: Final project presentations
Brief introduction of Python
• Invented in the Netherlands in the early 1990s by Guido
van Rossum
• Open source from the beginning
• Considered a scripting language, but is much more
– No compilation needed
– Scripts are evaluated by the interpreter, line by line
– Functions need to be defined before they are called
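A small sketch of what "defined before they are called" means in practice: a name only has to exist at the moment the call actually runs. The function names here are made up for the example.

```python
def run_too_early():
    # greet() does not exist yet the first time this function is called below.
    return greet()

try:
    run_too_early()
except NameError as err:
    print("called too early:", err)

def greet():
    return "hello"

# Now that the def statement for greet has executed, the same call succeeds.
print(run_too_early())   # hello
```

Because the interpreter executes the file top to bottom, the order of the `def` statements and the calls is what matters, not the order in which functions mention each other.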
Different ways to run python
• Call a Python program via the Python interpreter from a Unix/Windows command line
– $ python testScript.py
– Or make the script directly executable, with additional header lines in the script
• Using python console
– Typing in python statements. Limited functionality
>>> 3 +3
6
>>> exit()
• Using ipython console
– Typing in python statements. Very interactive.
In [167]: 3+3
Out [167]: 6
– Typing in %run testScript.py
– Many convenient “magic functions”
Anaconda for python3
• We’ll be using Anaconda, which includes a Python
environment and an IDE (Spyder) as well as many
additional features
– Can also use Enthought
• Most Python modules needed in data science are
already installed with the Anaconda distribution
• Install with Python 3.6 (and install Python 2.7 as
secondary from the Anaconda prompt)
• Key diff between Python 2 and python 3
Python programming in <2 hours
• This is not a comprehensive Python language class
• Will focus on the parts of the language that are worth
attention and useful in data science
• Two parts:
– Basics - today
– More advanced – next week and/or as we go
• Comprehensive Python language reference and
tutorial available in Anaconda Navigator under
“Learning” and on python.org
Formatting
• Many languages use curly braces to delimit blocks of code.
Python uses indentation. Incorrect indentation causes an error.
• Comments start with #
• Colons start a new block in many constructs, e.g. function
definitions, if-then clause, for, while
for i in [1, 2, 3, 4, 5]:
    # first line in "for i" block
    print(i)
    for j in [1, 2, 3, 4, 5]:
        # first line in "for j" block
        print(j)
        # last line in "for j" block
        print(i + j)
    # last line in "for i" block
    print(i)
print("done looping")
• Whitespace is ignored inside parentheses and
brackets.
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 +
                           9 + 10 + 11 + 12 + 13 + 14 +
                           15 + 16 + 17 + 18 + 19 + 20)

list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]

Alternatively, use a backslash to continue a line:
long_winded_computation = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \
                          9 + 10 + 11 + 12 + 13 + 14 + \
                          15 + 16 + 17 + 18 + 19 + 20
Modules
• Certain features of Python are not loaded by
default
• In order to use these features, you’ll need to
import the modules that contain them.
• E.g.
import matplotlib.pyplot as plt
import numpy as np
Variables and objects
• A variable is created the first time it is assigned a value
– No need to declare type
– Types are associated with objects, not variables
• X = 5
• X = [1, 3, 5]
• X = 'python'
– Assignment creates references, not copies
X = [1, 3, 5]
Y = X
X[0] = 2
print(Y) # Y is [2, 3, 5]
Assignment
• You can assign to multiple names at the same
time
x, y = 2, 3
• To swap values
x, y = y, x
• Assignments can be chained
x=y=z=3
• Accessing a name before it’s been created (by
assignment), raises an error
Arithmetic
• a = 5 + 2 # a is 7
• b = 9 - 3. # b is 6.0
• c = 5 * 2 # c is 10
• d = 5 ** 2 # d is 25
• e = 5 % 2 # e is 1

Built in numerical types: int, float, complex

• f = 7 / 2 # in python 2, f will be 3, unless “from __future__ import division”
• f = 7 / 2 # in python 3 f = 3.5
• f = 7 // 2 # f = 3 in both python 2 and 3
• f = 7 / 2. # f = 3.5 in both python 2 and 3
• f = 7 / float(2) # f is 3.5 in both python 2 and 3
• f = int(7 / 2) # f is 3 in both python 2 and 3
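The slide shows `7 // 2` and `int(7 / 2)` giving the same answer. A short sketch (Python 3) of where the two differ, which is worth knowing before relying on either: floor division rounds down, while `int()` truncates toward zero.

```python
floor_pos = 7 // 2        # 3
trunc_pos = int(7 / 2)    # 3
floor_neg = -7 // 2       # -4: floor division rounds toward negative infinity
trunc_neg = int(-7 / 2)   # -3: int() truncates toward zero
quotient, remainder = divmod(7, 2)   # (3, 1): both results in one call

print(floor_pos, trunc_pos, floor_neg, trunc_neg, quotient, remainder)
```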
String - 1
• Strings can be delimited by matching single or double quotation
marks
single_quoted_string = 'data science'
double_quoted_string = "data science"
escaped_string = 'Isn\'t this fun'
another_string = "Isn't this fun"

real_long_string = 'this is a really long string. \
It has multiple parts, \
but all in one line.'

• Use triple quotes for multi line strings

multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
String - 2
• Use raw strings to output backslashes
tab_string = "\t" # represents the tab character
len(tab_string) # is 1

not_tab_string = r"\t" # represents the characters '\' and 't'

len(not_tab_string) # is 2

• Strings can be concatenated (glued together) with the + operator, and
repeated with *
s = 3 * 'un' + 'ium' # s is 'unununium'

• Two or more string literals (i.e. the ones enclosed between quotes) next to
each other are automatically concatenated
s1 = 'Py' 'thon'
s2 = s1 + '2.7'
real_long_string = ('this is a really long string. '
                    'It has multiple parts, '
                    'but all in one line.')
List - 1
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6
• Get the i-th element of a list
x = [i for i in range(10)] # is the list [0, 1, ..., 9]
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
• Get a slice of a list
one_to_four = x[1:5] # [1, 2, 3, 4]
first_three = x[:3] # [0, 1, 2]
last_three = x[-3:] # [7, 8, 9]
three_to_end = x[3:] # [3, 4, ..., 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [0, 1, 2, ..., 9]
another_copy_of_x = x[:3] + x[3:] # [0, 1, 2, ..., 9]
List - 2
• Check for memberships
1 in [1, 2, 3] # True
0 in [1, 2, 3] # False
• Concatenate lists
x = [1, 2, 3]
y = [4, 5, 6]
x.extend(y) # x is now [1,2,3,4,5,6]

x = [1, 2, 3]
y = [4, 5, 6]
z = x + y # z is [1,2,3,4,5,6]; x is unchanged.
• List unpacking (multiple assignment)
x, y = [1, 2] # x is 1 and y is 2
[x, y] = 1, 2 # same as above
x, y = [1, 2] # same as above
x, y = 1, 2 # same as above
_, y = [1, 2] # y is 2, didn't care about the first element
List - 3
• Modify content of list
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[2] = x[2] * 2 # x is [0, 1, 4, 3, 4, 5, 6, 7, 8]
x[-1] = 0 # x is [0, 1, 4, 3, 4, 5, 6, 7, 0]
x[3:5] = x[3:5] * 3 # list * 3 repeats the slice: x is [0, 1, 4, 3, 4, 3, 4, 3, 4, 5, 6, 7, 0]
x[5:6] = [] # removes the element at index 5: x is [0, 1, 4, 3, 4, 4, 3, 4, 5, 6, 7, 0]
del x[:2] # x is [4, 3, 4, 4, 3, 4, 5, 6, 7, 0]
del x[:] # x is []
del x # referencing x hereafter raises a NameError

• Strings can also be sliced. But they cannot be modified (they are immutable)
s = 'abcdefg'
a = s[0] # 'a'
x = s[:2] # 'ab'
y = s[-3:] # 'efg'
s[:2] = 'AB' # this will cause an error
s = 'AB' + s[2:] # s is now 'ABcdefg'
The range() function
for i in range(5):
    print (i) # will print 0, 1, 2, 3, 4 (in separate lines)
for i in range(2, 5):
    print (i) # will print 2, 3, 4
for i in range(0, 10, 2):
    print (i) # will print 0, 2, 4, 6, 8
for i in range(10, 2, -2):
    print (i) # will print 10, 8, 6, 4
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
Range() in python 2 and 3
• In python 2, range(5) is equivalent to [0, 1, 2, 3, 4]
• In python 3, range(5) is an object which can be iterated,
but is not identical to [0, 1, 2, 3, 4] (a lazy sequence)
print (range(3)) # in python 3, will see "range(0, 3)"
print (range(3)) # in python 2, will see "[0, 1, 2]"
print (list(range(3))) # will print [0, 1, 2] in python 3

x = range(5)
print (x[2]) # in python 2, will print "2"
print (x[2]) # in python 3, will also print "2"

x[2] = 5 # in python 2, x becomes [0, 1, 5, 3, 4]
x[2] = 5 # in python 3, will cause an error (a TypeError)
Ref to lists
• What is the expected output of the following code?

a = list(range(10))
b = a
b[0] = 100
print(a) # [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]: b is another name for the same list

a = list(range(10))
b = a[:]
b[0] = 100
print(a) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]: b is a copy, so a is unchanged
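One caveat worth sketching: a slice copy such as `a[:]` is shallow. It creates a new outer list, but nested lists are still shared. The standard library's `copy.deepcopy` copies the nested objects as well.

```python
import copy

a = [[1, 2], [3, 4]]
shallow = a[:]              # new outer list, same inner lists
deep = copy.deepcopy(a)     # new outer list and new inner lists

a[0][0] = 100               # mutate an inner list
a.append([5, 6])            # mutate the outer list

print(shallow)   # [[100, 2], [3, 4]]
print(deep)      # [[1, 2], [3, 4]]
```

The appended `[5, 6]` shows up in neither copy, but the change to the inner list is visible through the shallow copy.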
tuples
• Similar to lists, but are immutable
• a_tuple = (0, 1, 2, 3, 4)
• other_tuple = 3, 4
• another_tuple = tuple([0, 1, 2, 3, 4])
• heterogeneous_tuple = ('john', 1.1, [1, 2])
Note: a tuple is defined by the comma, not the parentheses, which are
only used for convenience. So a = (1) is not a tuple, but a = (1,) is.
• Can be sliced, concatenated, or repeated
a_tuple[2:4] # will print (2, 3)
• Cannot be modified
a_tuple[2] = 5
TypeError: 'tuple' object does not support item assignment
Tuples - 2
• Useful for returning multiple values from
functions
def sum_and_product(x, y):
    return (x + y), (x * y)
sp = sum_and_product(2, 3) # equals (5, 6)
s, p = sum_and_product(5, 10) # s is 15, p is 50
• Tuples and lists can also be used for multiple
assignments
x, y = 1, 2
[x, y] = [1, 2]
(x, y) = (1, 2)
x, y = y, x
Dictionaries
• A dictionary associates values with unique keys
empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = { "Joel" : 80, "Tim" : 95 } # dictionary literal

• Access/modify value with key
joels_grade = grades["Joel"] # equals 80

try:
    kates_grade = grades["Kate"] # "Kate" is not a key yet
except KeyError:
    print("no grade for Kate!")

grades["Tim"] = 99 # replaces the old value
grades["Kate"] = 100 # adds a third entry
num_students = len(grades) # equals 3
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False

• Use “get” to avoid KeyError and add default value
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default default is None

• Get all items
all_keys = grades.keys() # return a list of all keys
all_values = grades.values() # return a list of all values
all_pairs = grades.items() # a list of (key, value) tuples

# Which of the following is faster?
'Joel' in grades # faster. Hashtable lookup.
'Joel' in all_keys # slower when all_keys is a list (as in python 2)
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False

• Use “get” to avoid KeyError and add default value
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default default is None

• Get all items. In python 3, the following do not return lists but
iterable objects
all_keys = grades.keys() # all keys (a list in python 2)
all_values = grades.values() # all values (a list in python 2)
all_pairs = grades.items() # all (key, value) tuples (a list in python 2)
Difference between python 2 and python 3:
Iterable objects vs lists
• In Python 3, range() returns a lazy iterable object.
– Value created when needed
x = range(10000000) # fast
– Can be accessed by index
x[10000] # allowed. fast
• Similarly, dict.keys(), dict.values(), and dict.items()
(also map, filter, zip, see next)
– Value can NOT be accessed by index
– Can convert to list if really needed
– Can use for loop to iterate
keys = grades.keys()
keys[0] # error
for key in keys: print (key) # ok
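A short sketch of working with the view objects described above in Python 3: converting to a list when an index is needed, and the fact that a view tracks later changes to its dictionary. The sample grades are made up.

```python
grades = {"Joel": 80, "Tim": 95}
keys = grades.keys()            # a view object in Python 3, not a list

first_key = list(keys)[0]       # convert to a list when indexing is needed
also_first = next(iter(keys))   # or take the first item without building a list

grades["Kate"] = 100            # the view reflects later changes to the dict
print(len(keys))                # 3
print(sorted(keys))             # ['Joel', 'Kate', 'Tim']
```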
Control flow - 1
• if-else
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"
print (message)
parity = "even" if x % 2 == 0 else "odd"

• Difference between python 2 and python 3 print
• In python 2, print is a statement
• print(message) and print message are both valid
• In python 3, print is a function
• Only print(message) is valid
Truthiness
• True
• False
• None
• and
• or
• not
• any
• all
All keywords are case sensitive.
0, 0.0, [], (), '', None are considered False. Most other values are True.
In [137]: print ("True") if '' else print ('False')
False
a = [0, 0, 0, 1]
any(a)
Out[135]: True
all(a)
Out[136]: False
Comparison
Operation   Meaning
<           strictly less than
<=          less than or equal
>           strictly greater than
>=          greater than or equal
==          equal
!=          not equal
is          object identity
is not      negated object identity

a = [0, 1, 2, 3, 4]
b = a
c = a[:]
a == b
Out[129]: True
a is b
Out[130]: True
a == c
Out[132]: True
a is c
Out[133]: False

Bitwise operators: & (AND), | (OR), ^ (XOR), ~ (NOT), << (Left Shift), >> (Right Shift)
Control flow - 2
• loops
x = 0
while x < 10:
    print (x, "is less than 10")
    x += 1

What happens if we forgot to indent?

Keyword pass in loops: does nothing, empty statement placeholder
for x in range(10):
    pass

for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print (x)
Exceptions
try:
    print (0 / 0)
except ZeroDivisionError:
    print ("cannot divide by zero")

https://docs.python.org/3/tutorial/errors.html
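Beyond the basic try/except above, Python also offers `else` and `finally` clauses. The sketch below uses a made-up helper, `safe_divide`, to show when each clause runs.

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("cannot divide by zero")
        return None
    else:
        return result            # runs only if no exception was raised
    finally:
        print("done dividing")   # runs in every case, even after a return

print(safe_divide(6, 3))   # 2.0
print(safe_divide(1, 0))   # None
```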
Functions - 1
• Functions are defined using def
def double(x):
    """this is where you put an optional docstring
    that explains what the function does.
    for example, this function multiplies its
    input by 2"""
    return x * 2
• You can call a function after it is defined
z = double(10) # z is 20
• You can give default values to parameters
def my_print(message="my default message"):
    print (message)

my_print("hello") # prints 'hello'
my_print() # prints 'my default message'
Functions - 2
• Sometimes it is useful to specify arguments by name
def subtract(a=0, b=0):
    return a - b

subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b = 5) # same as above
subtract(b = 5, a = 0) # same as above
Functions - 3
• Functions are objects too
In [12]: def double(x): return x * 2
...: DD = double;
...: DD(2)
...:
Out[12]: 4
In [16]: def apply_to_one(f):
...:     return f(1)
...: x=apply_to_one(DD)
...: x
...:
Out[16]: 2
Functions – lambda expression
• Small anonymous functions can be created
with the lambda keyword.
In [18]: y=apply_to_one(lambda x: x+4)

In [19]: y
Out[19]: 5

In [104]: def small_func(x): return x+4

...: apply_to_one(small_func)
Out[104]: 5
lambda expression - 2
• Small anonymous functions can be created
with the lambda keyword.
In [22]: pairs = [(2, 'two'), (3, 'three'), (1, 'one'), (4, 'four')]
...: pairs.sort(key=lambda pair: pair[0])
...: pairs
Out[22]: [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]

In [107]: def getKey(pair): return pair[0]

...: pairs.sort(key=getKey)
...: pairs
Out[107]: [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
Sorting list
• sorted(list): keeps the original list intact and returns
a new sorted list
• list.sort(): sorts the original list in place
x = [4,1,2,3]
y = sorted(x) # is [1,2,3,4], x is unchanged
x.sort() # now x is [1,2,3,4]

• Change the default behavior of sorted
# sort the list by absolute value from largest to smallest
x = [-4,1,-2,3]
y = sorted(x, key=abs, reverse=True) # is [-4,3,-2,1]
# sort the grades from highest to lowest
# using an anonymous function
# (python 3 does not allow "lambda (name, grade): grade")
newgrades = sorted(grades.items(),
                   key=lambda item: item[1], # item is a (name, grade) tuple
                   reverse=True)
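As an alternative to a lambda for the key function, the standard library's `operator.itemgetter` does the same job. The sketch below also shows one common way to sort on two criteria at once; the sample data is invented.

```python
from operator import itemgetter

grades = {"Joel": 80, "Tim": 95, "Kate": 100}   # sample data

# Sort (name, grade) pairs by grade, highest first.
by_grade = sorted(grades.items(), key=itemgetter(1), reverse=True)
print(by_grade)   # [('Kate', 100), ('Tim', 95), ('Joel', 80)]

# Sort by grade descending, then by name ascending to break ties.
records = [("Ann", 90), ("Bob", 95), ("Cy", 90)]
ranked = sorted(records, key=lambda r: (-r[1], r[0]))
print(ranked)     # [('Bob', 95), ('Ann', 90), ('Cy', 90)]
```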
List comprehension
• A very convenient way to create a new list

In [51]: squares = [x * x for x in range(5)]

In [52]: squares
Out[52]: [0, 1, 4, 9, 16]

In [64]: for x in range(5): squares[x] = x * x
...: squares
Out[64]: [0, 1, 4, 9, 16]
List comprehension - 2
• Can also be used to filter list
In [65]: even_numbers = [x for x in range(5) if x % 2 == 0]
In [66]: even_numbers
Out[66]: [0, 2, 4]

In [68]: even_numbers = []
In [69]: for x in range(5):
...:     if x % 2 == 0:
...:         even_numbers.append(x)
...: even_numbers
Out[69]: [0, 2, 4]
List comprehension - 3
• More complex examples:
# create 100 pairs (0,0) (0,1) ... (9,8), (9,9)
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]

# only pairs with x < y,
# range(lo, hi) equals
# [lo, lo + 1, ..., hi - 1]
increasing_pairs = [(x, y)
                    for x in range(10)
                    for y in range(x + 1, 10)]
Functools: map, reduce, filter
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [203]: def double(x): return 2*x
...: b=range(5)
...: list(map(double, b))
Out[203]: [0, 2, 4, 6, 8]

In [205]: [double(i) for i in range(5)]
Out[205]: [0, 2, 4, 6, 8]

In [204]: double(b)
Traceback (most recent call last):
…
TypeError: unsupported operand type(s) for *: 'int' and 'range'
Functools: map, reduce, filter
In [208]: def is_even(x): return x%2==0
...: a=[0, 1, 2, 3]
...: list(filter(is_even, a))
...:
Out[208]: [0, 2]

In [209]: [x for x in a if is_even(x)]

Out[209]: [0, 2]
Functools: map, reduce, filter
In [216]: from functools import reduce
In [217]: reduce(lambda x, y: x+y, range(10))
Out[217]: 45

In [220]: reduce(lambda x, y: x*y, [1, 2, 3, 4])

Out[220]: 24
zip
• Useful for combining multiple lists into a list of
tuples
In [238]: list(zip(['a', 'b', 'c'], [1, 2, 3], ['A', 'B', 'C']))
Out[238]: [('a', 1, 'A'), ('b', 2, 'B'), ('c', 3, 'C')]
In [245]: names = ['James', 'Tom', 'Mary']
...: grades = [100, 90, 95]
...: list(zip(names, grades))
...:
Out[245]: [('James', 100), ('Tom', 90), ('Mary', 95)]
Argument unpacking
• zip(*[a, b, c]) same as zip(a, b, c)
In [252]: gradeBook = [['James', 100],
...:                   ['Tom', 90],
...:                   ['Mary', 95]]
...: [names, grades]=zip(*gradeBook)
In [253]: names
Out[253]: ('James', 'Tom', 'Mary')
In [254]: grades
Out[254]: (100, 90, 95)

In [259]: list(zip(['James', 100], ['Tom', 90], ['Mary', 95]))
Out[259]: [('James', 'Tom', 'Mary'), (100, 90, 95)]
args and kwargs
• Convenient for taking variable number of
unnamed and named parameters
In [260]: def magic(*args, **kwargs):
...:     print ("unnamed args:", args)
...:     print ("keyword args:", kwargs)
...: magic(1, 2, key="word", key2="word2")
...:
unnamed args: (1, 2)
keyword args: {'key': 'word', 'key2': 'word2'}
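One common use of `*args` and `**kwargs` is forwarding whatever arguments were received to another function unchanged. The wrapper below, `logged`, is a made-up helper for illustration.

```python
def logged(func):
    """Wrap func so each call is printed before being forwarded unchanged."""
    def wrapper(*args, **kwargs):
        print("calling", func.__name__, "with", args, kwargs)
        return func(*args, **kwargs)   # unpack again when forwarding
    return wrapper

def subtract(a=0, b=0):
    return a - b

logged_subtract = logged(subtract)
print(logged_subtract(10, 5))    # 5
print(logged_subtract(b=5))      # -5
```

The wrapper never needs to know how many parameters `subtract` takes, which is what makes the pattern reusable.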
Useful methods and modules
• The Python Tutorial
– Input and Output
• The Python Standard Library Reference
– Common string methods
– Regular expression operations
– Numeric and Mathematical Modules
– CSV File Reading and Writing
Files - input
inflobj = open('data', 'r')    Open the file 'data' for input
S = inflobj.read()             Read whole file into one string
S = inflobj.read(N)            Reads N bytes (N >= 1)
L = inflobj.readline()         Read one line
L = inflobj.readlines()        Returns a list of line strings

https://docs.python.org/3/tutorial/inputoutput.html
Files - output

outflobj = open('data', 'w')   Open the file 'data' for writing
outflobj.write(S)              Writes the string S to file
outflobj.writelines(L)         Writes each of the strings in list L to file
outflobj.close()               Closes the file

https://docs.python.org/3/tutorial/inputoutput.html
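In practice, files are usually opened in a `with` block, which closes the file automatically even if an error occurs. A minimal round-trip sketch, writing to a throwaway temporary directory (the file name is illustrative):

```python
import os
import tempfile

lines = ["first line\n", "second line\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")

    with open(path, "w") as out_file:   # closed automatically on exit
        out_file.writelines(lines)      # writelines does not add newlines

    with open(path, "r") as in_file:
        read_back = in_file.readlines() # each string keeps its trailing newline

print(read_back)   # ['first line\n', 'second line\n']
```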
Module math
Command name          Description
abs(value)            absolute value (built-in function; math.fabs is the math version)
ceil(value)           rounds up
cos(value)            cosine, in radians
floor(value)          rounds down
log(value)            logarithm, base e
log10(value)          logarithm, base 10
max(value1, value2)   larger of two values (built-in function)
min(value1, value2)   smaller of two values (built-in function)
round(value)          nearest whole number (built-in function)
sin(value)            sine, in radians
sqrt(value)           square root

Constant   Description
e          2.7182818...
pi         3.1415926...

# preferred.
import math
math.fabs(-0.5)

# This is fine
from math import fabs
fabs(-0.5)

# bad style. Many unknown names in name space.
from math import *
fabs(-0.5)
Module random
• Generating random numbers is important in
statistics
In [75]: import random
...: four_uniform_randoms = [random.random() for _ in range(4)]
...: four_uniform_randoms
...:
Out[75]:
[0.5687302894847388,
0.6562738117250464,
0.3396960191199996,
0.016968446644451407]
• Other useful functions: seed(), randint, randrange, shuffle, etc.
• Type in “random” and then use tab completion to see available
functions and use “?” to see docstring of function.
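A brief sketch of the other functions listed above. Seeding the generator makes a "random" sequence repeatable, which helps when results need to be reproduced; the seed value 42 is arbitrary.

```python
import random

random.seed(42)                       # fix the starting point of the generator
first_run = [random.random() for _ in range(3)]

random.seed(42)                       # same seed, same sequence
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)        # True

deck = list(range(10))
random.shuffle(deck)                  # shuffles the list in place
die = random.randint(1, 6)            # integer from 1 to 6, both ends included
even = random.randrange(0, 10, 2)     # one of 0, 2, 4, 6, 8
print(deck, die, even)
```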
Important python modules for data science

• Numpy
– Key module for scientific computing
– Convenient and efficient ways to handle multi dimensional
arrays
• pandas
– DataFrame
– Flexible data structure of labeled tabular data
• Matplotlib: for plotting
• Scipy: solutions to common scientific computing problems
such as linear algebra, optimization, statistics, and sparse
matrices
Module paths
• In order to be able to find a module called myscripts.py,
the interpreter scans the list sys.path of directory names.
• The module must be in one of those directories.

>>> import sys

>>> sys.path
['C:\\Python26\\Lib\\idlelib', 'C:\\WINDOWS\\system32\\python26.zip',
'C:\\Python26\\DLLs', 'C:\\Python26\\lib', 'C:\\Python26\\lib\\plat-win',
'C:\\Python26\\lib\\lib-tk', 'C:\\Python26', 'C:\\Python26\\lib\\site-packages']
>>> import myscripts
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
import myscripts.py
ImportError: No module named myscripts.py
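If a module lives in a directory that is not on `sys.path`, one option is to append that directory at run time. The sketch below creates a throwaway module (the name `myscripts_demo` is made up) so the example is self-contained; note that the module is imported by name, without the `.py` extension.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module in a temporary directory (names are made up).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "myscripts_demo.py"), "w") as f:
    f.write("def hello():\n    return 'hello from myscripts_demo'\n")

sys.path.append(module_dir)        # make the directory searchable
importlib.invalidate_caches()      # make sure the new file is noticed
myscripts_demo = importlib.import_module("myscripts_demo")

print(myscripts_demo.hello())      # hello from myscripts_demo
```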
Appendix
Sequence types: Tuples, Lists, and Strings
Sequence Types
1. Tuple: ('john', 32, [CMSC])
· A simple immutable ordered sequence of
items
· Items can be of mixed types, including
collection types
2. Strings: "John Smith"
– Immutable
– Conceptually very much like a tuple
3. List: [1, 2, 'john', ('up', 'down')]
· Mutable ordered sequence of items of mixed
types
Similar Syntax
• All three sequence types (tuples, strings, and
lists) share much of the same syntax and
functionality.
• Key difference:
– Tuples and strings are immutable
– Lists are mutable
• The operations shown in this section can be
applied to all sequence types
– most examples will just show the operation
performed on one
Defining Sequence
• Define tuples using parentheses and commas
>>> tu = (23, 'abc', 4.56, (2,3), 'def')
• Define lists using square brackets and commas
>>> li = ["abc", 34, 4.34, 23]
• Define strings using quotes (", ', or """).
>>> st = "Hello World"
>>> st = 'Hello World'
>>> st = """This is a multi-line
string that uses triple quotes."""
Accessing one element
• Access individual members of a tuple, list, or string
using square bracket “array” notation
• Note that all are 0 based…
>>> tu = (23, 'abc', 4.56, (2,3), 'def')
>>> tu[1] # Second item in the tuple.
'abc'
>>> li = ["abc", 34, 4.34, 23]
>>> li[1] # Second item in the list.
34
>>> st = "Hello World"
>>> st[1] # 2nd character in string. Still str type
'e'
Positive and negative indices
>>> t = (23, 'abc', 4.56, (2,3), 'def')
Positive index: count from the left, starting with 0
>>> t[1]
'abc'
Negative index: count from right, starting with –1
>>> t[-3]
4.56
Slicing: return copy of a subset
>>> t = (23, 'abc', 4.56, (2,3), 'def')
Return a copy of the container with a subset of the
original members. Start copying at the first index,
and stop copying before second.
>>> t[1:4]
('abc', 4.56, (2,3))
Negative indices count from end
>>> t[1:-1]
('abc', 4.56, (2,3))
Slicing: return copy of a subset

>>> t = (23, 'abc', 4.56, (2,3), 'def')
Omit the first index to make a copy starting from the beginning of the container
>>> t[:2]
(23, 'abc')
Omit the second index to make a copy starting at the first index and going to the end
>>> t[2:]
(4.56, (2,3), 'def')
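As a rough illustration of the slicing rules, the sketch below collects the four slice forms in one place and checks that a sequence split at any index can be glued back together. The names middle, no_ends, head, tail, rejoined, and word are assumptions made for this example.

```python
t = (23, 'abc', 4.56, (2, 3), 'def')

middle = t[1:4]    # items at indices 1, 2, and 3 (the stop index is excluded)
no_ends = t[1:-1]  # same items: -1 refers to the last position, which is excluded
head = t[:2]       # omitted start means "from the beginning"
tail = t[2:]       # omitted stop means "to the end"

# Splitting at the same index and concatenating gives back an equal tuple.
rejoined = t[:2] + t[2:]

# A slice has the same type as the sequence it came from.
word = "Hello World"[0:5]

print(middle, no_ends, head, tail, rejoined == t, word)
```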
Copying the Whole Sequence
• [ : ] makes a copy of an entire sequence
>>> t[:]
(23, 'abc', 4.56, (2,3), 'def')
• Note the difference between these two lines for mutable sequences
>>> l2 = l1 # Both names refer to the same list; changing one affects both
>>> l2 = l1[:] # Independent (shallow) copy; two separate lists
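The difference between sharing a reference and copying is easier to see with a small experiment. This is a sketch, not part of the original slides; alias, copy, nested, and shallow are illustrative names. It also shows that [:] makes a shallow copy, so inner lists are still shared.

```python
l1 = [1, 2, 3]
alias = l1    # a second name for the same list object
copy = l1[:]  # a new list holding the same items

l1.append(4)  # visible through alias, not through copy

# [:] copies only the outer list; nested lists are shared between the two.
nested = [[1, 2], [3, 4]]
shallow = nested[:]
nested[0].append(99)  # shallow[0] is the same inner list, so it changes too

print(alias, copy, shallow)
```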
The ‘in’ Operator
• Boolean test whether a value is inside a
container:
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False

• For strings, tests for substrings

>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
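One detail worth trying out: for lists and tuples, in compares against whole elements, while for strings it looks for a substring. The following sketch uses made-up values (nested, found_inner, found_element) to show the difference.

```python
nested = [1, (2, 3)]

found_inner = 2 in nested         # 2 is inside the tuple, not an element of the list
found_element = (2, 3) in nested  # the tuple itself is an element

a = 'abcde'
contiguous = 'cd' in a      # 'cd' appears as a contiguous substring
not_contiguous = 'ac' in a  # 'a' and 'c' are both present but not adjacent

print(found_inner, found_element, contiguous, not_contiguous)
```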
The + Operator
The + operator produces a new tuple, list, or string whose value is
the concatenation of its arguments.

>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)

>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]

>>> "Hello" + " " + "World"
'Hello World'
The * Operator
• The * operator produces a new tuple, list, or string
that “repeats” the original content.
>>> (1, 2, 3) * 3
(1, 2, 3, 1, 2, 3, 1, 2, 3)

>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]

>>> "Hello" * 3
'HelloHelloHello'
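Both + and * build a new object and leave their operands untouched, and + only works between sequences of the same type. A minimal sketch with illustrative names (a, b, c, d, mixed_ok):

```python
a = [1, 2, 3]
b = [4, 5, 6]

c = a + b  # new list; a and b are unchanged
d = a * 2  # new list with the contents of a repeated twice

# Concatenating different sequence types (here a tuple and a list) raises TypeError.
try:
    (1, 2) + [3, 4]
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(c, d, a, mixed_ok)
```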
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
• We can change lists in place.
• The name li still refers to the same object when we're done.
Tuples are immutable
>>> t = (23, 'abc', 4.56, (2,3), 'def')
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in <module>
t[2] = 3.14
TypeError: 'tuple' object does not support item assignment

• You can’t change a tuple.

• You can make a fresh tuple and assign its reference
to a previously used name.
>>> t = (23, 'abc', 3.14, (2,3), 'def')
• Because tuples are immutable, they can be somewhat faster and more memory-efficient than lists, although the difference is usually small.
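The sketch below walks through the same idea in script form: item assignment on a tuple fails, but the name can be rebound to a new tuple. It also shows one practical consequence of immutability that the slide does not mention: a tuple of hashable items can be used as a dictionary key. The names message and grid are made up for this example.

```python
t = (23, 'abc', 4.56)

# Item assignment is rejected for tuples.
try:
    t[2] = 3.14
    message = None
except TypeError as err:
    message = str(err)

# Build a new tuple and rebind the name instead.
t = t[:2] + (3.14,)

# A tuple whose items are all hashable can serve as a dict key; a list cannot.
grid = {(0, 0): 'origin', (1, 0): 'east'}

print(message)
print(t, grid[(0, 0)])
```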
Operations on Lists Only

>>> li = [1, 11, 3, 4, 5]

>>> li.append('a') # Note the method syntax
>>> li
[1, 11, 3, 4, 5, 'a']

>>> li.insert(2, 'i')
>>> li
[1, 11, 'i', 3, 4, 5, 'a']
The extend method vs +
• + creates a fresh list with a new memory ref
• extend operates on list li in place.
>>> li.extend([9, 8, 7])
>>> li
[1, 11, 'i', 3, 4, 5, 'a', 9, 8, 7]
• Potentially confusing:
– extend takes a list (or any other iterable) as an argument.
– append takes a single item as an argument.
>>> li.append([10, 11, 12])
>>> li
[1, 11, 'i', 3, 4, 5, 'a', 9, 8, 7, [10, 11, 12]]
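To make the contrast concrete, here is a small sketch comparing extend, append, and + on the same list. The variable names before_id, same_object, and new_list are illustrative.

```python
li = [1, 2]
before_id = id(li)

li.extend([3, 4])  # adds each item of the argument: [1, 2, 3, 4]
li.append([5, 6])  # adds the argument itself as one item: [1, 2, 3, 4, [5, 6]]

# Both methods modified the existing list object in place.
same_object = id(li) == before_id

# + leaves li alone and returns a different list object.
new_list = li + [7]

print(li, len(li), same_object, new_list is li)
```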
Operations on Lists Only
Lists have many methods, including index, count, remove, reverse, and sort
>>> li = ['a', 'b', 'c', 'b']
>>> li.index('b') # index of 1st occurrence
1
>>> li.count('b') # number of occurrences
2
>>> li.remove('b') # remove 1st occurrence
>>> li
['a', 'c', 'b']
Operations on Lists Only
>>> li = [5, 2, 6, 8]

>>> li.reverse() # reverse the list in place
>>> li
[8, 6, 2, 5]

>>> li.sort() # sort the list in place
>>> li
[2, 5, 6, 8]

>>> li.sort(key=some_function)
# sort in place using a user-defined key function (Python 3; Python 2 also accepted a comparison function)
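In Python 3, list.sort takes an optional key function (and a reverse flag) rather than a comparison function. The sketch below uses made-up data (pairs, words) to show a custom ordering, and contrasts the in-place sort method with the built-in sorted, which returns a new list.

```python
pairs = [('Tom', 90), ('James', 100), ('Mary', 95)]

# Sort in place by the second field, highest score first.
pairs.sort(key=lambda p: p[1], reverse=True)

# sorted() returns a new list and leaves the original untouched.
words = ['banana', 'Apple', 'cherry']
ordered = sorted(words, key=str.lower)  # case-insensitive ordering

# sort() works in place and returns None, so do not assign its result.
result = [3, 1, 2].sort()

print(pairs)
print(ordered, words, result)
```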
Tuple details
• The comma is the tuple creation operator, not
parens
>>> 1,
(1,)

• Python shows parens for clarity (best practice)

>>> (1,)
(1,)

• Don't forget the comma!

>>> (1)
1

• A trailing comma is required only for singletons, not for other tuples

• Empty tuples have a special syntactic form
>>> ()
()
>>> tuple()
()
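The comma rules can be verified with type checks. This sketch also shows tuple unpacking, which follows from the same "the comma makes the tuple" rule but is not covered on the slide; all variable names are illustrative.

```python
single = 1,        # the comma makes this a one-item tuple
not_a_tuple = (1)  # parentheses alone just group the expression: this is an int
empty = ()         # the special form for the empty tuple

# Commas on the right pack values into a tuple; commas on the left unpack them.
point = 3, 4
x, y = point

print(type(single).__name__, type(not_a_tuple).__name__, len(empty), x, y)
```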
Summary: Tuples vs. Lists
• Lists are slower but more powerful than tuples
– Lists can be modified, and they have lots of handy operations and methods
– Tuples are immutable and have fewer features
• To convert between tuples and lists, use the list() and tuple() functions:
li = list(tu)
tu = tuple(li)
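A common pattern is to convert a tuple to a list, modify the list, and convert back. The following is a minimal sketch of that round trip; tu2 and chars are names introduced for this example.

```python
tu = (3, 1, 2)

li = list(tu)  # a new, mutable list with the same items
li.sort()
li.append(4)

tu2 = tuple(li)  # a new tuple built from the modified list

# list() and tuple() accept any iterable, including strings.
chars = list("abc")

print(tu, li, tu2, chars)
```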
