UNIT I Material
LECTURE NOTES
Data science: Definition, Datafication, Exploratory Data Analysis, The Data
science process, A data scientist’s role in this process.
NumPy Basics: The NumPy ndarray: A Multidimensional Array Object, Creating
ndarrays, Data Types for ndarrays, Operations between Arrays and Scalars, Basic
Indexing and Slicing, Boolean Indexing, Fancy Indexing, Data Processing Using
Arrays, Expressing Conditional Logic as Array Operations, Methods for Boolean
Arrays, Sorting, Unique.
Data Science:
Definition
⮚ So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is
it pure hype? And if it’s new and if it’s real, what does that mean?
⮚ “What is Data Science?” Here is Metamarkets CEO Mike Driscoll’s answer:
⮚ Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools
and materials, coupled with a theoretical understanding of what’s possible.
⮚ Drew Conway’s Venn diagram of data science from 2010 places data science at the intersection
of hacking skills, math and statistics knowledge, and substantive expertise.
⮚ Data science is an emerging field in industry, and as yet, it is not well defined as an academic
subject.
⮚ Over the past few years, there’s been a lot of hype in the media about “data science” and “Big
Data.”
⮚ Data Science is a blend of various tools, algorithms, and machine learning principles, with the
goal of discovering hidden patterns in raw data.
⮚ Data Science is primarily used to make decisions and predictions making use of predictive
causal analytics, prescriptive analytics (predictive plus decision science) and machine
learning.
We have massive amounts of data about many aspects of our lives, and, simultaneously, an
abundance of inexpensive computing power.
Shopping, communicating, reading news, listening to music, searching for information,
expressing our opinions—all this is being tracked online
It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on.
It’s not only the massiveness that makes all this new data interesting (or poses challenges).
It’s that the data itself, often in real time, becomes the building blocks of data products.
On the Internet, this means Amazon recommendation systems, friend recommendations on
Facebook, film and music recommendations, and so on.
We’re witnessing the beginning of a massive, culturally saturated feedback loop where our
behavior changes the product and the product changes our behavior.
Datafication:
Datafication is the process of “taking all aspects of life and turning them into data.” As
examples, Twitter datafies stray thoughts; LinkedIn datafies professional networks.
Datafication is an interesting concept and led us to consider its importance with respect to
people’s intentions about sharing their own data.
We are being datafied, or rather our actions are, and when we “like” someone or something
online, we are intending to be datafied, or at least we should expect to be.
But when we merely browse the Web, we are unintentionally, or at least passively, being
datafied through cookies.
When we walk around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors, cameras, or Google Glass.
Once we datafy things, we can transform their purpose and turn the information into
new forms of value.
Exploratory Data Analysis (EDA):
Exploratory data analysis (EDA) is the first step toward building a model.
It’s traditionally presented as a bunch of histograms and stem-and-leaf plots.
But EDA is a critical part of the data science process.
In EDA, there is no hypothesis and there is no model. The “exploratory” aspect means that your
understanding of the problem you are solving, or might solve, is changing as you go.
The basic tools of EDA are plots, graphs and summary statistics. Generally speaking, it’s a method of
systematically going through the data, plotting distributions of all variables (using box plots), plotting
time series of data, transforming variables, looking at all pairwise relationships between variables
using scatterplot matrices, and generating summary statistics for all of them.
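A minimal sketch of these basic EDA moves in code (illustrative only; the data here is made up):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: 1,000 observations of a single variable.
data = np.random.randn(1000) * 15 + 100

# Summary statistics for the variable.
print("mean:", data.mean(), "std:", data.std())
print("quartiles:", np.percentile(data, [25, 50, 75]))

# Plot the distribution: a histogram and a box plot.
fig, axes = plt.subplots(1, 2)
axes[0].hist(data, bins=30)
axes[1].boxplot(data)
plt.show()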
But as much as EDA is a set of tools, it’s also a mindset. And that mindset is about your relationship
with the data. You want to understand the data—gain intuition, understand the shape of it, and try to
connect your understanding of the process that generated the data to the data itself.
EDA happens between you and the data and isn’t about proving anything to anyone else yet.
There are important reasons anyone working with data should do EDA. Namely,
o to gain intuition about the data;
o to make comparisons between distributions;
o for sanity checking, to find out where data is missing or if there are outliers; and
o to summarize the data.
In the context of data generated from logs, EDA also helps with debugging the logging process.
In the end, EDA helps you make sure the product is performing as intended.
The Data Science Process:
1) First we have the Real World. Inside the Real World are lots of people busy at various activities.
2) Specifically, we’ll start with raw data—logs, Olympics records, Enron employee emails, or recorded
genetic material.
3) We want to process this to make it clean for analysis. So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you want to call it. To do this we use tools such as Python,
shell scripts, R, or SQL, or all of the above.
4) Eventually we get the data down to a nice format, like something with columns:
name | event | year | gender | event time
5) Once we have this clean dataset, we should be doing some kind of EDA. In the course of doing EDA,
we may realize that it isn’t actually clean because of duplicates, missing values, absurd outliers, and
data that wasn’t logged or was incorrectly logged.
6) Next, we design our model to use some algorithm like k-nearest neighbors (k-NN), linear regression,
Naive Bayes, or something else.
The model we choose depends on the type of problem we’re trying to solve, of course, which could be
a classification problem, a prediction problem, or a basic description problem.
7) We then can interpret, visualize, report, or communicate our results.
8) Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier, or a search
ranking algorithm, or a recommendation system.
9) NOTE: Now the key here that makes data science special and distinct from statistics is that this data
product then gets incorporated back into the real world, and users interact with that product, and that
generates more data, which creates a feedback loop.
10) This is very different from predicting the weather, say, where your model doesn’t influence the
outcome at all.
NumPy Basics:
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python
counterparts and use significantly less memory.
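To see the difference, you can time the same operation both ways (a rough sketch building on my_arr and my_list above; exact numbers vary by machine):
import time

start = time.perf_counter()
my_arr2 = my_arr * 2                  # vectorized: one NumPy operation
numpy_secs = time.perf_counter() - start

start = time.perf_counter()
my_list2 = [x * 2 for x in my_list]   # pure Python: loops over a million items
python_secs = time.perf_counter() - start

print(numpy_secs, python_secs)        # the NumPy version is typically far faster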
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast,
flexible container for large datasets in Python.
Arrays enable you to perform mathematical operations on whole blocks of data.
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements
must be the same type.
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object
describing the data type of the array:
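These examples assume data was created earlier as a small array of random values (a reconstruction consistent with the outputs below; the In number is illustrative):
In [15]: data = np.random.randn(2, 3)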
In [17]: data.shape
Out[17]: (2, 3)
In [18]: data.dtype
Out[18]: dtype('float64')
Creating ndarrays:
The easiest way to create an array is to use the array function. This accepts any sequence-like object
(including other arrays) and produces a new NumPy array containing the passed data.
In [19]: data1 = [6, 7.5, 8, 0, 1]
In [20]: arr1 = np.array(data1)
In [21]: arr1
Out[21]: array([ 6. , 7.5, 8. , 0. , 1. ])
Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:
In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [23]: arr2 = np.array(data2)
In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
np.array tries to infer a good data type for the array that it creates. The data type is stored in a special
dtype metadata object:
In [28]: arr2.dtype
Out[28]: dtype('int64')
In addition to np.array, there are a number of other functions for creating new arrays. As examples,
zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an
array without initializing its values to any particular value. To create a higher dimensional array with
these methods, pass a tuple for the shape:
In [29]: np.zeros(10)
Out[29]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [30]: np.zeros((3, 6))
Out[30]:
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
In [31]: np.empty((2, 3, 2))
Out[31]:
array([[[ 0., 0.],
[ 0., 0.],
[ 0., 0.]],
[[ 0., 0.],
[ 0., 0.],
[ 0., 0.]]])
NOTE: It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may
return uninitialized “garbage” values.
arange is an array-valued version of the built-in Python range function:
In [32]: np.arange(15)
Out[32]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
If you have an array of strings representing numbers, you can use astype to convert
them to numeric form:
In [44]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
In [45]: numeric_strings.astype(float)
Out[45]: array([ 1.25, -9.6 , 42. ])
There are shorthand type code strings you can also use to refer to a dtype:
In [49]: empty_uint32 = np.empty(8, dtype='u4')
In [50]: empty_uint32
Out[50]:
array([ 0, 1075314688, 0, 1075707904, 0,
1075838976, 0, 1072693248], dtype=uint32)
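The examples that follow assume a small floating-point array (a reconstruction consistent with the outputs below; the In number is illustrative):
In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])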
Arithmetic operations with scalars propagate the scalar argument to each element in the array:
In [55]: 1 / arr
Out[55]:
array([[ 1. , 0.5 , 0.3333],
       [ 0.25 , 0.2 , 0.1667]])
In [56]: arr ** 0.5
Out[56]:
array([[ 1. , 1.4142, 1.7321],
       [ 2. , 2.2361, 2.4495]])
Basic Indexing and Slicing
If you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or
broadcast) to the entire selection:
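For example (reconstructing the setup the outputs below assume; the In numbers are illustrative):
In [60]: arr = np.arange(10)
In [61]: arr[5:8] = 12
In [62]: arr
Out[62]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
In [67]: arr_slice = arr[5:8]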
An important first distinction from Python’s built-in lists is that array slices are views on the
original array.
This means that the data is not copied, and any modifications to the view will be reflected in the
source array.
Now, when I change values in arr_slice, the mutations are reflected in the original
array arr:
In [68]: arr_slice[1] = 12345
In [69]: arr
Out[69]: array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])
The “bare” slice [:] will assign to all values in an array:
In [70]: arr_slice[:] = 64
In [71]: arr
Out[71]: array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
NOTE: If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy
the array—for example, arr[5:8].copy().
With higher dimensional arrays, you have many more options. In a two-dimensional array, the
elements at each index are no longer scalars but rather one-dimensional arrays:
In [72]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [73]: arr2d[2]
Out[73]: array([7, 8, 9])
Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can
pass a comma-separated list of indices to select individual elements.
So these are equivalent:
In [74]: arr2d[0][2]
Out[74]: 3
In [75]: arr2d[0, 2]
Out[75]: 3
In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional
ndarray consisting of all the data along the higher dimensions.
In [76]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [77]: arr3d
Out[77]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0] is a 2 × 3 array:
In [78]: arr3d[0]
Out[78]:
array([[1, 2, 3],
       [4, 5, 6]])
Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a
1-dimensional array:
In [84]: arr3d[1, 0]
Out[84]: array([7, 8, 9])
This expression is the same as though we had indexed in two steps:
In [85]: x = arr3d[1]
In [86]: x
Out[86]:
array([[ 7, 8, 9],
[10, 11, 12]])
In [87]: x[0]
Out[87]: array([7, 8, 9])
Note that in all of these cases where subsections of the array have been selected, the returned arrays
are views.
Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:
In [90]: arr2d
Out[90]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [91]: arr2d[:2]
Out[91]:
array([[1, 2, 3],
[4, 5, 6]])
As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements
along an axis. It can be helpful to read the expression arr2d[:2] as “select the first two rows of arr2d.”
You can pass multiple slices just like you can pass multiple indexes:
In [92]: arr2d[:2, 1:]
Out[92]:
array([[2, 3],
[5, 6]])
When slicing like this, you always obtain array views of the same number of dimensions.
By mixing integer indexes and slices, you get lower dimensional slices. For example, I can select the
second row but only the first two columns like so:
In [93]: arr2d[1, :2]
Out[93]: array([4, 5])
Similarly, I can select the third column but only the first two rows like so:
In [94]: arr2d[:2, 2]
Out[94]: array([3, 6])
Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional
axes by doing:
In [95]: arr2d[:, :1]
Out[95]:
array([[1],
       [4],
       [7]])
Boolean Indexing
Let’s consider an example where we have some data in an array and an array of names with
duplicates. I’m going to use here the randn function in numpy.random to generate some random
normally distributed data:
In [98]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [99]: data = np.random.randn(7, 4)
In [100]: names
Out[100]:
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
In [101]: data
Out[101]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
Suppose each name corresponds to a row in the data array and we wanted to select all the rows with
corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also
vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:
In [102]: names == 'Bob'
Out[102]: array([ True, False, False, True, False, False, False], dtype=bool)
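This boolean array can be passed when indexing the array, selecting the rows where the value is True (a reconstruction; the rows match Out[101] above):
In [103]: data[names == 'Bob']
Out[103]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
       [ 1.669 , -0.4386, -0.5397, 0.477 ]])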
You can even mix and match boolean arrays with slices or integers.
To select everything but 'Bob', you can either use != or negate the condition using ~:
In [106]: names != 'Bob'
Out[106]: array([False, True, True, False, True, True, True], dtype=bool)
In [107]: data[~(names == 'Bob')]
Out[107]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
The ~ operator can be useful when you want to invert a general condition:
In [108]: cond = names == 'Bob'
In [109]: data[~cond]
Out[109]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
To select two of the three names, combining multiple boolean conditions, use
boolean arithmetic operators like & (and) and | (or):
In [110]: mask = (names == 'Bob') | (names == 'Will')
In [111]: mask
Out[111]: array([ True, False, True, True, True, False, False], dtype=bool)
In [112]: data[mask]
Out[112]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241]])
Selecting data from an array by boolean indexing always creates a copy of the data, even if the
returned array is unchanged.
NOTE: The Python keywords and and or do not work with boolean arrays. Use & (and) and | (or)
instead.
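Setting values with boolean arrays works by assigning to the selection. For instance, to set all of the negative values in data to 0 (a step reflected in the outputs below; the In number is illustrative):
In [113]: data[data < 0] = 0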
Setting whole rows or columns using a one-dimensional boolean array is also easy:
In [115]: data[names != 'Joe'] = 7
In [116]: data
Out[116]:
array([[ 7. , 7. , 7. , 7. ],
[ 1.0072, 0. , 0.275 , 0.2289],
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[ 0. , 0. , 0. , 0. ]])
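The syllabus also lists methods for boolean arrays; a brief sketch of the most common ones (illustrative values):
# Boolean values are coerced to 1 (True) and 0 (False) in arithmetic,
# so sum counts True values; any and all test for at least one / all True.
bools = np.array([False, False, True, False])
bools.any()                       # True: at least one value is True
bools.all()                       # False: not every value is True
(np.random.randn(100) > 0).sum()  # number of positive values (varies per run)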
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
Suppose we had an 8 × 4 array:
In [117]: arr = np.empty((8, 4))
In [118]: for i in range(8):
.....: arr[i] = i
In [119]: arr
Out[119]:
array([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.]])
To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of
integers specifying the desired order:
In [120]: arr[[4, 3, 0, 6]]
Out[120]:
array([[ 4., 4., 4., 4.],
[ 3., 3., 3., 3.],
[ 0., 0., 0., 0.],
[ 6., 6., 6., 6.]])
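Passing multiple index arrays does something slightly different: it selects a one-dimensional array of elements corresponding to each tuple of indices. The remarks below assume arr has been redefined (a reconstruction consistent with the outputs that follow; the In numbers are illustrative):
In [122]: arr = np.arange(32).reshape((8, 4))
In [124]: arr[[1, 5, 7, 2], [0, 3, 1, 2]]
Out[124]: array([ 4, 23, 29, 10])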
Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of how many dimensions
the array has (here, only 2), the result of fancy indexing with multiple index arrays is always one-dimensional.
The behavior of fancy indexing in this case is a bit different from what some users might have
expected (myself included), which is the rectangular region formed by selecting a subset of the
matrix’s rows and columns. Here is one way to get that:
In [125]: arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
Out[125]:
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute
the axes (for extra mind bending):
In [132]: arr = np.arange(16).reshape((2, 2, 4))
In [133]: arr
Out[133]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [134]: arr.transpose((1, 0, 2))
Out[134]:
array([[[ 0, 1, 2, 3],
        [ 8, 9, 10, 11]],
       [[ 4, 5, 6, 7],
        [12, 13, 14, 15]]])
Here, the axes have been reordered with the second axis first, the first axis second, and the last axis
unchanged.
Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes,
which takes a pair of axis numbers and switches the indicated axes to rearrange the data:
In [135]: arr
Out[135]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [136]: arr.swapaxes(1, 2)
Out[136]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes similarly returns a view on the data without making a copy.
The np.meshgrid function takes two 1D arrays and produces two 2D matrices corresponding to all
pairs of (x, y) in the two arrays. Evaluating a function over a grid of values is then a matter of writing
the same expression you would write with two points:
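A minimal sketch, assuming we want to evaluate sqrt(x^2 + y^2) across a grid of values (the In numbers are illustrative):
In [155]: points = np.arange(-5, 5, 0.01) # 1,000 equally spaced points
In [156]: xs, ys = np.meshgrid(points, points)
In [157]: z = np.sqrt(xs ** 2 + ys ** 2)
In [158]: z.shape
Out[158]: (1000, 1000)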
Sorting
Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method:
In [195]: arr = np.random.randn(6)
In [196]: arr
Out[196]: array([ 0.6095, -0.4938, 1.24 , -0.1357, 1.43 , -0.8469])
In [197]: arr.sort()
In [198]: arr
Out[198]: array([-0.8469, -0.4938, -0.1357, 0.6095, 1.24 , 1.43 ])
You can sort each one-dimensional section of values in a multidimensional array in place along an axis
by passing the axis number to sort:
In [199]: arr = np.random.randn(5, 3)
In [200]: arr
Out[200]:
array([[ 0.6033, 1.2636, -0.2555],
[-0.4457, 0.4684, -0.9616],
[-1.8245, 0.6254, 1.0229],
[ 1.1074, 0.0909, -0.3501],
[ 0.218 , -0.8948, -1.7415]])
In [201]: arr.sort(1)
In [202]: arr
Out[202]:
array([[-0.2555, 0.6033, 1.2636],
[-0.9616, -0.4457, 0.4684],
[-1.8245, 0.6254, 1.0229],
[-0.3501, 0.0909, 1.1074],
[-1.7415, -0.8948, 0.218 ]])
The top-level method np.sort returns a sorted copy of an array instead of modifying
the array in-place. A quick-and-dirty way to compute the quantiles of an array is to
sort it and select the value at a particular rank:
In [203]: large_arr = np.random.randn(1000)
In [204]: large_arr.sort()
In [205]: large_arr[int(0.05 * len(large_arr))] # 5% quantile
Out[205]: -1.5311513550102103
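Unique
The syllabus also lists Unique. np.unique returns the sorted unique values in an array; for example, with the names array from earlier (the In numbers are illustrative):
In [206]: np.unique(names)
Out[206]: array(['Bob', 'Joe', 'Will'], dtype='<U4')
In [207]: np.unique(np.array([3, 3, 3, 2, 2, 1, 1, 4, 4]))
Out[207]: array([1, 2, 3, 4])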