
PYTHON FOR DATA SCIENCE

LECTURE NOTES
UNIT I

Data science: Definition, Datafication, Exploratory Data Analysis, The Data Science Process, A
Data Scientist's Role in This Process.
NumPy Basics: The NumPy ndarray: A Multidimensional Array Object, Creating ndarrays, Data
Types for ndarrays, Operations between Arrays and Scalars, Basic Indexing and Slicing, Boolean
Indexing, Fancy Indexing, Data Processing Using Arrays, Expressing Conditional Logic as Array
Operations, Methods for Boolean Arrays, Sorting, Unique.

Data Science:
Definition
⮚ So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is
it pure hype? And if it's new and if it's real, what does that mean?
⮚ To the question "What is Data Science?" here's Metamarket CEO Mike Driscoll's answer:
⮚ Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools
and materials, coupled with a theoretical understanding of what's possible.
⮚ Drew Conway's Venn diagram of data science from 2010 places data science at the intersection
of hacking skills, math and statistics knowledge, and substantive expertise.


⮚ Data science is an emerging field in industry, and as yet, it is not well defined as an academic
subject.
⮚ Over the past few years, there's been a lot of hype in the media about "data science" and "Big
Data."
⮚ Data Science is a blend of various tools, algorithms, and machine learning principles with the
goal to discover hidden patterns from the raw data.
⮚ Data Science is primarily used to make decisions and predictions making use of predictive
causal analytics, prescriptive analytics (predictive plus decision science) and machine
learning.




⮚ Data science is the study of data. It involves developing methods of recording, storing, and
analyzing data to effectively extract useful information. The goal of data science is to gain
insights and knowledge from any type of data — both structured and unstructured.
⮚ Data science is related to computer science, but is a separate field. Computer science involves
creating programs and algorithms to record and process data, while data science covers any
type of data analysis, which may or may not use computers.
⮚ Data science is more closely related to the mathematical field of statistics, which includes the
collection, organization, analysis, and presentation of data.
⮚ Because of the large amounts of data modern companies and organizations maintain, data
science has become an integral part of IT.
⮚ For example, a company that has petabytes of user data may use data science to develop
effective ways to store, manage, and analyze the data.
⮚ Data science combines multiple fields, including statistics, scientific methods, artificial
intelligence (AI), and data analysis, to extract value from data.



Datafication

 We have massive amounts of data about many aspects of our lives, and, simultaneously, an
abundance of inexpensive computing power.
 Shopping, communicating, reading news, listening to music, searching for information,
expressing our opinions—all this is being tracked online.
 It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on.
 It’s not only the massiveness that makes all this new data interesting (or poses challenges).
It’s that the data itself, often in real time, becomes the building blocks of data products.
 On the Internet, this means Amazon recommendation systems, friend recommendations on
Facebook, film and music recommendations, and so on.
 We’re witnessing the beginning of a massive, culturally saturated feedback loop where our
behavior changes the product and the product changes our behavior.
 Datafication is the process of "taking all aspects of life and turning them into data." As
examples, Twitter datafies stray thoughts, and LinkedIn datafies professional networks.
 Datafication is an interesting concept and led us to consider its importance with respect to
people’s intentions about sharing their own data.
 We are being datafied, or rather our actions are, and when we "like" someone or something
online, we are intending to be datafied, or at least we should expect to be.
 But when we merely browse the Web, we are unintentionally, or at least passively, being
datafied through cookies.
 When we walk around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors, cameras, or Google glasses.
 Once we datafy things, we can transform their purpose and turn the information into
new forms of value.



Exploratory Data Analysis

 Exploratory data analysis (EDA) is the first step toward building a model.
 It's traditionally presented as a bunch of histograms and stem-and-leaf plots.
 But EDA is a critical part of the data science process.
 In EDA, there is no hypothesis and there is no model. The "exploratory" aspect means that your
understanding of the problem you are solving, or might solve, is changing as you go.
 The basic tools of EDA are plots, graphs, and summary statistics. Generally speaking, it's a method of
systematically going through the data, plotting distributions of all variables (using box plots), plotting
time series of data, transforming variables, looking at all pairwise relationships between variables
using scatterplot matrices, and generating summary statistics for all of them. (A minimal code sketch
appears at the end of this section.)
 But as much as EDA is a set of tools, it’s also a mindset. And that mindset is about your relationship
with the data. You want to understand the data—gain intuition, understand the shape of it, and try to
connect your understanding of the process that generated the data to the data itself.
 EDA happens between you and the data and isn’t about proving anything to anyone else yet.


 There are important reasons anyone working with data should do EDA. Namely,
o to gain intuition about the data;
o to make comparisons between distributions;
o for sanity checking, to find out where data is missing or if there are outliers; and
o to summarize the data.
 In the context of data generated from logs, EDA also helps with debugging the logging process.
 In the end, EDA helps you make sure the product is performing as intended.
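
As a minimal sketch of these basic tools in code (assuming pandas and matplotlib are installed; df here
is a hypothetical numeric dataset standing in for real data):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric dataset standing in for real data
df = pd.DataFrame({'age': [23, 35, 31, 47, 29],
                   'income': [38000, 52000, 46000, 81000, 43000]})

print(df.describe())               # summary statistics for every column
df.plot.box()                      # distributions of all variables as box plots
pd.plotting.scatter_matrix(df)     # all pairwise relationships
plt.show()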



The Data Science Process

1) First we have the Real World. Inside the Real World are lots of people busy at various activities.
2) Specifically, we'll start with raw data—logs, Olympics records, Enron employee emails, or recorded
genetic material.
3) We want to process this to make it clean for analysis. So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you want to call it. To do this we use tools such as Python,
shell scripts, R, or SQL, or all of the above. (A minimal sketch appears after this list.)
4) Eventually we get the data down to a nice format, like something with columns:
name | event | year | gender | event time
5) Once we have this clean dataset, we should be doing some kind of EDA. In the course of doing EDA,
we may realize that it isn't actually clean because of duplicates, missing values, absurd outliers, and
data that wasn't actually logged or was incorrectly logged.
6) Next, we design our model to use some algorithm like k-nearest neighbors (k-NN), linear regression,
Naive Bayes, or something else. The model we choose depends on the type of problem we're trying to
solve, of course, which could be a classification problem, a prediction problem, or a basic description
problem.
7) We then can interpret, visualize, report, or communicate our results.
8) Alternatively, our goal may be to build or prototype a "data product"; e.g., a spam classifier, a search
ranking algorithm, or a recommendation system.
9) NOTE: Now the key here that makes data science special and distinct from statistics is that this data
product then gets incorporated back into the real world, and users interact with that product, and that
generates more data, which creates a feedback loop.
10) This is very different from predicting the weather, say, where your model doesn't influence the
outcome at all.



11) Take this loop into account in any analysis you do by adjusting for any biases your model caused.
Your models are not just predicting the future, but causing it!
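
A minimal sketch of the munging pipeline from step 3, assuming a hypothetical pipe-delimited raw file
of Olympics results (the filename, layout, and cleaning rules are all illustrative):

import csv

# Hypothetical raw file whose rows look like: name|event|year|gender|event time
with open('raw_results.txt') as f:
    reader = csv.reader(f, delimiter='|')
    clean = []
    for name, event, year, gender, event_time in reader:
        clean.append({
            'name': name.strip().title(),      # normalize capitalization
            'event': event.strip(),
            'year': int(year),                 # enforce numeric types
            'gender': gender.strip().upper(),
            'event_time': float(event_time),   # seconds
        })
# clean now matches the name | event | year | gender | event time format in step 4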

A Data Scientist’s Role in This Process


 This model so far seems to suggest this will all magically happen without human intervention.
 But, someone has to make the decisions about what data to collect, and why. That person needs to
be formulating questions and hypotheses and making a plan for how the problem will be attacked.
 And that someone is the data scientist or our beloved data science team.
 It is clear that the data scientist needs to be involved in this process throughout, meaning they are
involved in the actual coding as well as in the higher-level process.



NumPy Basics:
 NumPy, short for Numerical Python, is one of the most important foundational packages for
numerical computing in Python.
 NumPy includes:
 ndarray, an efficient multidimensional array
 Mathematical functions for fast operations on entire arrays
 Tools for reading/writing array data to disk
 Linear algebra, random number generation, and Fourier transform capabilities
 For most data analysis applications, the main areas of functionality to focus on are:
• Fast vectorized array operations for data munging and cleaning, subsetting and filtering,
transformation, and any other kinds of computations
• Common array algorithms like sorting, unique, and set operations
• Efficient descriptive statistics and aggregating/summarizing data
• Data alignment and relational data manipulations for merging and joining together
heterogeneous datasets
• Expressing conditional logic as array expressions instead of loops with if-elif-else branches
• Group-wise data manipulations (aggregation, transformation, function application)
 One of the reasons NumPy is so important for numerical computations in Python is
because it is designed for efficiency on large arrays of data. There are a number of
reasons for this:
o NumPy internally stores data in a contiguous block of memory, independent of
other built-in Python objects.
o NumPy arrays also use much less memory than built-in Python sequences.
o NumPy operations perform complex computations on entire arrays without the
need for Python for loops.
 To give you an idea of the performance difference, consider a NumPy array of one million
integers, and the equivalent Python list:

import numpy as np

my_arr = np.arange(1000000)

my_list = list(range(1000000))

%time my_arr2 = my_arr * 2


Wall time: 2 ms

%time my_list2 = [x * 2 for x in my_list]


Wall time: 82.6 ms

 NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python
counterparts and use significantly less memory.



The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast,
flexible container for large datasets in Python.
Arrays enable you to perform mathematical operations on whole blocks of data.
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements
must be the same type.
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object
describing the data type of the array.
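For example, assuming data is a small 2 × 3 array of random floats, created with data = np.random.randn(2, 3):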
In [17]: data.shape
Out[17]: (2, 3)
In [18]: data.dtype
Out[18]: dtype('float64')

Creating ndarrays:
The easiest way to create an array is to use the array function. This accepts any sequence-like object
(including other arrays) and produces a new NumPy array containing the passed data.
In [19]: data1 = [6, 7.5, 8, 0, 1]
In [20]: arr1 = np.array(data1)
In [21]: arr1
Out[21]: array([ 6. , 7.5, 8. , 0. , 1. ])
Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:
In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [23]: arr2 = np.array(data2)
In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
np.array tries to infer a good data type for the array that it creates. The data type is stored in a special
dtype metadata object:
In [28]: arr2.dtype
Out[28]: dtype('int64')
In addition to np.array, there are a number of other functions for creating new arrays. As examples,
zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an
array without initializing its values to any particular value. To create a higher dimensional array with
these methods, pass a tuple for the shape:
In [29]: np.zeros(10)
Out[29]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [30]: np.zeros((3, 6))
Out[30]:
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
In [31]: np.empty((2, 3, 2))
Out[31]:
array([[[ 0., 0.],
[ 0., 0.],
[ 0., 0.]],
[[ 0., 0.],
[ 0., 0.],
[ 0., 0.]]])
NOTE: It's not safe to assume that np.empty will return an array of all zeros. In some cases, it may
return uninitialized "garbage" values.
arange is an array-valued version of the built-in Python range function:
In [32]: np.arange(15)
Out[32]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
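np.ones works the same way as np.zeros, and np.full fills an array with an arbitrary value. A quick
sketch, with expected output shown as comments:
np.ones((2, 2))        # array([[1., 1.], [1., 1.]])
np.full((2, 2), 7.5)   # array([[7.5, 7.5], [7.5, 7.5]])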

Data Types for ndarrays


The data type or dtype is a special object containing the information (or metadata, data about data)
the ndarray needs to interpret a chunk of memory as a particular type of data:
In [33]: arr1 = np.array([1, 2, 3], dtype=np.float64)
In [34]: arr2 = np.array([1, 2, 3], dtype=np.int32)
In [35]: arr1.dtype
Out[35]: dtype('float64')
In [36]: arr2.dtype
Out[36]: dtype('int32')
dtypes are a source of NumPy’s flexibility for interacting with data coming from other systems.
The numerical dtypes are named the same way:
a type name, like float or int, followed by a number indicating the number of bits per element.



You can explicitly convert or cast an array from one dtype to another using ndarray's astype method:
In [37]: arr = np.array([1, 2, 3, 4, 5])
In [38]: arr.dtype
Out[38]: dtype('int64')
In [39]: float_arr = arr.astype(np.float64)
In [40]: float_arr.dtype
Out[40]: dtype('float64')
In this example, integers were cast to floating point. If I cast some floating-point numbers to be of
integer dtype, the decimal part will be truncated:
In [41]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In [42]: arr
Out[42]: array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In [43]: arr.astype(np.int32)
Out[43]: array([ 3, -1, -2, 0, 12, 10], dtype=int32)

If you have an array of strings representing numbers, you can use astype to convert
them to numeric form:
In [44]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
In [45]: numeric_strings.astype(float)
Out[45]: array([ 1.25, -9.6 , 42. ])



If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError
will be raised.
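For example, the following sketch would raise an error (the exact message varies across NumPy versions):
np.array(['1.25', 'abc']).astype(float)   # raises ValueError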

You can also use another array's dtype attribute:
In [46]: int_array = np.arange(10)
In [47]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
In [48]: int_array.astype(calibers.dtype)
Out[48]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a dtype:
In [49]: empty_uint32 = np.empty(8, dtype='u4')
In [50]: empty_uint32
Out[50]:
array([ 0, 1075314688, 0, 1075707904, 0,
1075838976, 0, 1072693248], dtype=uint32)

Operations between Arrays and Scalars


Arrays are important because they enable you to express batch operations on data without writing any
for loops. NumPy users call this vectorization.
Any arithmetic operations between equal-size arrays applies the operation element-wise:
In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In [52]: arr
Out[52]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [53]: arr * arr
Out[53]:
array([[ 1., 4., 9.],
[ 16., 25., 36.]])
In [54]: arr - arr
Out[54]:
array([[ 0., 0., 0.],
[ 0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:
In [55]: 1 / arr
Out[55]:
array([[ 1. , 0.5 , 0.3333],
[ 0.25 , 0.2 , 0.1667]])
In [56]: arr ** 0.5
Out[56]:
array([[ 1. , 1.4142, 1.7321],
[ 2. , 2.2361, 2.4495]])

Comparisons between arrays of the same size yield boolean arrays:


In [57]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])



In [58]: arr2
Out[58]:
array([[ 0., 4., 1.],
[ 7., 2., 12.]])
In [59]: arr2 > arr
Out[59]:
array([[False, True, False],
[ True, False, True]], dtype=bool)

Basic Indexing and Slicing


NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your
data or individual elements.
One-dimensional arrays are simple; on the surface they act similarly to Python lists:
In [60]: arr = np.arange(10)
In [61]: arr
Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [62]: arr[5]
Out[62]: 5
In [63]: arr[5:8]
Out[63]: array([5, 6, 7])
In [64]: arr[5:8] = 12
In [65]: arr
Out[65]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or
broadcast, as we'll say henceforth) to the entire selection.

An important first distinction from Python’s built-in lists is that array slices are views on the
original array.
This means that the data is not copied, and any modifications to the view will be reflected in the
source array.

To give an example of this, I first create a slice of arr:


In [66]: arr_slice = arr[5:8]
In [67]: arr_slice
Out[67]: array([12, 12, 12])

Now, when I change values in arr_slice, the mutations are reflected in the original
array arr:
In [68]: arr_slice[1] = 12345
In [69]: arr
Out[69]: array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])
The “bare” slice [:] will assign to all values in an array:
In [70]: arr_slice[:] = 64
In [71]: arr



Out[71]: array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])

NOTE: If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy
the array—for example, arr[5:8].copy().
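A quick sketch of the difference, continuing with arr from above:
arr_copy = arr[5:8].copy()
arr_copy[:] = 0    # modifies only the copy; arr itself is unchanged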

With higher dimensional arrays, you have many more options. In a two-dimensional array, the
elements at each index are no longer scalars but rather one-dimensional arrays:
In [72]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [73]: arr2d[2]
Out[73]: array([7, 8, 9])
Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can
pass a comma-separated list of indices to select individual elements.
So these are equivalent:
In [74]: arr2d[0][2]
Out[74]: 3
In [75]: arr2d[0, 2]
Out[75]: 3

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional
ndarray consisting of all the data along the higher dimensions.
In [76]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [77]: arr3d
Out[77]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0] is a 2 × 3 array:
In [78]: arr3d[0]



Out[78]:
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]:
In [79]: old_values= arr3d[0].copy()
In [80]: arr3d[0] = 42
In [81]: arr3d
Out[81]:
array([[[42, 42, 42],
[42, 42, 42]],
[[ 7, 8, 9],
[10, 11, 12]]])
In [82]: arr3d[0] = old_values
In [83]: arr3d
Out[83]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a
one-dimensional array:
In [84]: arr3d[1, 0]
Out[84]: array([7, 8, 9])
This expression is the same as though we had indexed in two steps:
In [85]: x = arr3d[1]
In [86]: x
Out[86]:
array([[ 7, 8, 9],
[10, 11, 12]])
In [87]: x[0]
Out[87]: array([7, 8, 9])
Note that in all of these cases where subsections of the array have been selected, the returned arrays
are views.

Indexing with slices


Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:
In [88]: arr
Out[88]: array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
In [89]: arr[1:6]
Out[89]: array([ 1, 2, 3, 4, 64])

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:
In [90]: arr2d
Out[90]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [91]: arr2d[:2]
Out[91]:
array([[1, 2, 3],
[4, 5, 6]])
As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements
along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."
You can pass multiple slices just like you can pass multiple indexes:
In [92]: arr2d[:2, 1:]
Out[92]:
array([[2, 3],
[5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions.
By mixing integer indexes and slices, you get lower dimensional slices. For example, I can select the
second row but only the first two columns like so:
In [93]: arr2d[1, :2]
Out[93]: array([4, 5])
Similarly, I can select the third column but only the first two rows like so:
In [94]: arr2d[:2, 2]
Out[94]: array([3, 6])

Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional
axes by doing:
In [95]: arr2d[:, :1]



Out[95]:
array([[1],
[4],
[7]])
Of course, assigning to a slice expression assigns to the whole selection:
In [96]: arr2d[:2, 1:] = 0
In [97]: arr2d
Out[97]:
array([[1, 0, 0],
[4, 0, 0],
[7, 8, 9]])

Boolean Indexing

Let's consider an example where we have some data in an array and an array of names with
duplicates. I'm going to use the randn function in numpy.random to generate some random
normally distributed data:
In [98]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [99]: data = np.random.randn(7, 4)
In [100]: names
Out[100]:
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],dtype='<U4')
In [101]: data
Out[101]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with
corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also
vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:
In [102]: names == 'Bob'
Out[102]: array([ True, False, False, True, False, False, False], dtype=bool)

This boolean array can be passed when indexing the array:


In [103]: data[names == 'Bob']
Out[103]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.669 , -0.4386, -0.5397, 0.477 ]])
The boolean array must be of the same length as the array axis it’s indexing.

You can even mix and match boolean arrays with slices or integers.



In these examples, I select from the rows where names == 'Bob' and index the columns, too:
In [104]: data[names == 'Bob', 2:]
Out[104]:
array([[ 0.769 , 1.2464],
[-0.5397, 0.477 ]])
In [105]: data[names == 'Bob', 3]
Out[105]: array([ 1.2464, 0.477 ])

To select everything but 'Bob', you can either use != or negate the condition using ~:
In [106]: names != 'Bob'
Out[106]: array([False, True, True, False, True, True, True], dtype=bool)
In [107]: data[~(names == 'Bob')]
Out[107]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])

The ~ operator can be useful when you want to invert a general condition:
In [108]: cond = names == 'Bob'
In [109]: data[~cond]
Out[109]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])

To select two of the three names, combining multiple boolean conditions, use boolean arithmetic
operators like & (and) and | (or):
In [110]: mask = (names == 'Bob') | (names == 'Will')
In [111]: mask
Out[111]: array([ True, False, True, True, True, False, False], dtype=bool)
In [112]: data[mask]
Out[112]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241]])

Selecting data from an array by boolean indexing always creates a copy of the data, even if the
returned array is unchanged.

NOTE: The Python keywords 'and' and 'or' do not work with boolean arrays. Use & (and) and | (or)
instead.



Setting values with boolean arrays works in a common-sense way. To set all of the negative values in
data to 0 we need only do:
In [113]: data[data < 0] = 0
In [114]: data
Out[114]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, 0. , 0.275 , 0.2289],
[ 1.3529, 0.8864, 0. , 0. ],
[ 1.669 , 0. , 0. , 0.477 ],
[ 3.2489, 0. , 0. , 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[ 0. , 0. , 0. , 0. ]])

Setting whole rows or columns using a one-dimensional boolean array is also easy:
In [115]: data[names != 'Joe'] = 7
In [116]: data
Out[116]:
array([[ 7. , 7. , 7. , 7. ],
[ 1.0072, 0. , 0.275 , 0.2289],
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[ 0. , 0. , 0. , 0. ]])

Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
Suppose we had an 8 × 4 array:
In [117]: arr = np.empty((8, 4))
In [118]: for i in range(8):
.....: arr[i] = i
In [119]: arr
Out[119]:
array([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.]])

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of
integers specifying the desired order:
In [120]: arr[[4, 3, 0, 6]]
Out[120]:
array([[ 4., 4., 4., 4.],
[ 3., 3., 3., 3.],
[ 0., 0., 0., 0.],
[ 6., 6., 6., 6.]])

Using negative indices selects rows from the end:
In [121]: arr[[-3, -5, -7]]
Out[121]:
array([[ 5., 5., 5., 5.],
[ 3., 3., 3., 3.],
[ 1., 1., 1., 1.]])
Passing multiple index arrays does something slightly different; it selects a one-dimensional
array of elements corresponding to each tuple of indices:
In [122]: arr = np.arange(32).reshape((8, 4))
In [123]: arr
Out[123]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
In [124]: arr[[1, 5, 7, 2], [0, 3, 1, 2]]
Out[124]: array([ 4, 23, 29, 10])

Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of how many dimensions
the array has (here, only 2), the result of fancy indexing is always one-dimensional.

The behavior of fancy indexing in this case is a bit different from what some users might have
expected (myself included), which is the rectangular region formed by selecting a subset of the
matrix’s rows and columns. Here is one way to get that:
In [125]: arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
Out[125]:
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
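Another way to get the same rectangular region is the np.ix_ helper, which converts two 1D integer
arrays into an indexer that selects the corresponding rows and columns:
arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]   # same result as the chained indexing above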
Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.



Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying
data without copying anything. Arrays have the transpose method and also the
special T attribute:
In [126]: arr = np.arange(15).reshape((3, 5))
In [127]: arr
Out[127]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
In [128]: arr.T
Out[128]:
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])

In [129]: arr = np.random.randn(6, 3)


In [130]: arr
Out[130]:
array([[-0.8608, 0.5601, -1.2659],
[ 0.1198, -1.0635, 0.3329],
[-2.3594, -0.1995, -1.542 ],
[-0.9707, -1.307 , 0.2863],
[ 0.378 , -0.7539, 0.3313],
[ 1.3497, 0.0699, 0.2467]])

A common use of the transpose is computing the inner matrix product with np.dot:
In [131]: np.dot(arr.T, arr)
Out[131]:
array([[ 9.2291, 0.9394, 4.948 ],
[ 0.9394, 3.7662, -1.3622],
[ 4.948 , -1.3622, 4.3437]])

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute
the axes (for extra mind bending):
In [132]: arr = np.arange(16).reshape((2, 2, 4))
In [133]: arr
Out[133]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [134]: arr.transpose((1, 0, 2))
Out[134]:
array([[[ 0, 1, 2, 3],



[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])

Here, the axes have been reordered with the second axis first, the first axis second, and the last axis
unchanged.
Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes,
which takes a pair of axis numbers and switches the indicated axes to rearrange the data:
In [135]: arr
Out[135]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [136]: arr.swapaxes(1, 2)
Out[136]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes similarly returns a view on the data without making a copy.



Data Processing Using Arrays:
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array
expressions that might otherwise require writing loops. This practice of replacing explicit loops with
array expressions is commonly referred to as vectorization.
As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular
grid of values. The np.meshgrid function takes two 1D arrays and produces two 2D matrices
corresponding to all pairs of (x, y) in the two arrays.
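A minimal sketch of this step, with an illustrative grid spacing:
points = np.arange(-5, 5, 0.01)   # 1,000 equally spaced points
xs, ys = np.meshgrid(points, points)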

Now, evaluating the function is a matter of writing the same expression you would write with two points:
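z = np.sqrt(xs ** 2 + ys ** 2)   # continuing the sketch above; evaluated across the whole grid at once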

Expressing Conditional Logic as Array Operations


The numpy.where function is a vectorized version of the ternary expression x if condition else y.
Suppose we had a boolean array and two arrays of values:
In [165]: xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
In [166]: yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
In [167]: cond = np.array([True, False, True, True, False])
Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and
otherwise take the value from yarr. A list comprehension doing this might look like:
In [168]: result = [(x if c else y)
.....: for x, y, c in zip(xarr, yarr, cond)]
In [169]: result
Out[169]: [1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5]



This has multiple problems. First, it will not be very fast for large arrays (because all the work is
being done in interpreted Python code). Second, it will not work with multidimensional arrays.
With np.where you can write this very concisely:
In [170]: result = np.where(cond, xarr, yarr)
In [171]: result
Out[171]: array([ 1.1, 2.2, 1.3, 1.4, 2.5])
The second and third arguments to np.where don't need to be arrays; one or both of them can be
scalars. A typical use of where in data analysis is to produce a new array of values based on another
array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive
values with 2 and all negative values with –2. This is very easy to do with np.where:

In [172]: arr = np.random.randn(4, 4)


In [173]: arr
Out[173]:
array([[-0.5031, -0.6223, -0.9212, -0.7262],
[ 0.2229, 0.0513, -1.1577, 0.8167],
[ 0.4336, 1.0107, 1.8249, -0.9975],
[ 0.8506, -0.1316, 0.9124, 0.1882]])
In [174]: arr > 0
Out[174]:
array([[False, False, False, False],
[ True, True, False, True],
[ True, True, True, False],
[ True, False, True, True]], dtype=bool)
In [175]: np.where(arr > 0, 2, -2)
Out[175]:
array([[-2, -2, -2, -2],
[ 2, 2, -2, 2],
[ 2, 2, 2, -2],
[ 2, -2, 2, 2]])
You can combine scalars and arrays when using np.where. For example, I can replace all positive
values in arr with the constant 2 like so:
In [176]: np.where(arr > 0, 2, arr) # set only positive values to 2
Out[176]:
array([[-0.5031, -0.6223, -0.9212, -0.7262],
[ 2. , 2. , -1.1577, 2. ],
[ 2. , 2. , 2. , -0.9975],
[ 2. , -0.1316, 2. , 2. ]])
The arrays passed to np.where can be more than just equal-sized arrays or scalars.
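For instance, np.where calls can be nested to express more than two outcomes. A sketch with
hypothetical boolean arrays cond1 and cond2, producing 0, 1, 2, or 3 depending on which
conditions hold:
result = np.where(cond1 & cond2, 0,
                  np.where(cond1, 1,
                           np.where(cond2, 2, 3)))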



Methods for Boolean Arrays
There are two additional methods, any and all, useful especially for boolean arrays.
any tests whether one or more values in an array is True, while all checks if every value is True:
In [192]: bools = np.array([False, False, True, False])
In [193]: bools.any()
Out[193]: True
In [194]: bools.all()
Out[194]: False
These methods also work with non-boolean arrays, where non-zero elements evaluate to True.
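A quick illustration with non-boolean arrays:
np.array([0, 0, 3]).any()   # True: at least one element is non-zero
np.array([1, 0, 3]).all()   # False: 0 evaluates to False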

Sorting
Like Python's built-in list type, NumPy arrays can be sorted in place with the sort method:
In [195]: arr = np.random.randn(6)
In [196]: arr
Out[196]: array([ 0.6095, -0.4938, 1.24 , -0.1357, 1.43 , -0.8469])
In [197]: arr.sort()
In [198]: arr
Out[198]: array([-0.8469, -0.4938, -0.1357, 0.6095, 1.24 , 1.43 ])

You can sort each one-dimensional section of values in a multidimensional array in place along an
axis by passing the axis number to sort:
In [199]: arr = np.random.randn(5, 3)
In [200]: arr
Out[200]:
array([[ 0.6033, 1.2636, -0.2555],
[-0.4457, 0.4684, -0.9616],
[-1.8245, 0.6254, 1.0229],
[ 1.1074, 0.0909, -0.3501],
[ 0.218 , -0.8948, -1.7415]])
In [201]: arr.sort(1)
In [202]: arr
Out[202]:
array([[-0.2555, 0.6033, 1.2636],
[-0.9616, -0.4457, 0.4684],
[-1.8245, 0.6254, 1.0229],
[-0.3501, 0.0909, 1.1074],
[-1.7415, -0.8948, 0.218 ]])

The top-level method np.sort returns a sorted copy of an array instead of modifying
the array in-place. A quick-and-dirty way to compute the quantiles of an array is to
sort it and select the value at a particular rank:
In [203]: large_arr = np.random.randn(1000)
In [204]: large_arr.sort()
In [205]: large_arr[int(0.05 * len(large_arr))] # 5% quantile
Out[205]: -1.5311513550102103



Unique
NumPy has some basic set operations for one-dimensional ndarrays. A commonly
used one is np.unique, which returns the sorted unique values in an array:
In [206]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [207]: np.unique(names)
Out[207]:
array(['Bob', 'Joe', 'Will'],
dtype='<U4')
In [208]: ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
In [209]: np.unique(ints)
Out[209]: array([1, 2, 3, 4])
Contrast np.unique with the pure Python alternative:
In [210]: sorted(set(names))
Out[210]: ['Bob', 'Joe', 'Will']
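Another of these set operations, np.in1d, tests membership of the values in one array in another
(continuing with ints from above):
In [211]: np.in1d(ints, [2, 3, 6])
Out[211]: array([ True, True, True, True, True, False, False, False, False], dtype=bool)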

