Unit 1 To 5
UNIT I
1.1 Benefits and uses of data science and big data
Data science and big data are used almost everywhere in both commercial and
noncommercial settings. Commercial companies in almost every industry use data science
and big data to gain insights into their customers, processes, staff, competition, and products.
Many companies use data science to offer customers a better user experience, as well as to
cross-sell, up-sell, and personalize their offerings. A good example of this is Google
AdSense, which collects data from internet users so relevant commercial messages can be
matched to the person browsing the internet. Human resource professionals use people
analytics and text mining to screen candidates, monitor the mood of employees, and study
informal networks among coworkers. People analytics is the central theme in the book
Moneyball: The Art of Winning an Unfair Game. In the book (and movie) we saw that the
traditional scouting process for American baseball was random, and replacing it with
correlated signals changed everything. Relying on statistics allowed them to hire the right
players and pit them against the opponents where they would have the biggest advantage.
Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services. Governmental organizations
are also aware of data’s value. Many governmental organizations not only rely on internal
data scientists to discover valuable information, but also share their data with the public. You
can use this data to gain insights or build data-driven applications. Data.gov is but one
example; it’s the home of the US Government’s open data. A data scientist in a governmental
organization gets to work on diverse projects such as detecting fraud and other criminal
activity or optimizing project funding. A well-known example was provided by Edward
Snowden, who leaked internal documents of the American National Security Agency and the
British Government Communications Headquarters that show clearly how they used data
science and big data to monitor millions of individuals. Those organizations collected 5
billion data records from widespread applications such as Google Maps, Angry Birds, email,
and text messages, among many other data sources. Nongovernmental organizations (NGOs)
are also no strangers to using data. They use it to raise money and defend their causes. The
World Wildlife Fund (WWF), for instance, employs data scientists to increase the
effectiveness of their fundraising efforts. Universities use data science in their research but
also to enhance the study experience of their students. The rise of massive open online
courses (MOOC) produces a lot of data, which allows universities to study how this type of
learning can complement traditional classes.
1.2 Facets of data
In data science and big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data
are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a
record. As such, it’s often easy to store structured data in tables within databases or Excel
files (figure 1.1). SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases. You may also come across structured data that might
give you a hard time storing it in a traditional relational
database. Hierarchical data such as a family tree is one such example.
Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
1.2.6 Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that
they’ll increase video capture to approximately 7 TB per game for the purpose of live, in-
game analytics. High-speed cameras at stadiums will capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to two baselines.
Another example: a company called DeepMind succeeded at creating an algorithm that's
capable of learning how to play video games. This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning. It’s a remarkable feat that prompted
Google to buy the company for their own Artificial Intelligence (AI) development plans.
1.2.7 Streaming data
While streaming data can take almost any of the previous forms, it has an extra property. The
data flows into the system when an event happens instead of being loaded into a data store in
a batch. Although this isn't really a different type of data, it is treated separately because of
this extra property. Examples are the "What's trending" topics on Twitter, live sporting or
music events, and the stock market.
1.3.5 Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you found in
the previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative
process that involves selecting the variables for the model, executing the model, and model
diagnostics.
1.3.6 Presentation and automation
Finally, you present the results to your business. These results can take many forms, ranging
from presentations to research reports. Sometimes you’ll need to automate the execution of
the process because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
1 The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project. In every serious
project this will result in a project charter.
2 The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner. The
result is data in its raw form, which probably needs polishing and transformation before it
becomes usable.
3 Now that you have the raw data, it’s time to prepare it. This includes transforming the data
from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect
and correct different kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to data
visualization and modeling.
4 The fourth step is data exploration. The goal of this step is to gain a deep understanding of
the data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5 Finally, we get to the sexiest part: model building (often referred to as “data modeling”
throughout this book). It is now that you attempt to gain the insights or make the predictions
stated in your project charter. Now is the time to bring out the heavy guns, but remember
research has taught us that often (but not always) a combination of simple models tends to
outperform one complicated model. If you’ve done this phase right, you’re almost done.
6 The last step of the data science model is presenting your results and automating the
analysis, if needed. One goal of a project is to change a process and/or make better decisions.
You may still need to convince the business that your findings will indeed change the
business process as expected. This is where you can
shine in your influencer role. The importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to perform the business process over
and over again, so automating the project will save time.
In reality you won’t progress in a linear way from step 1 to step 6. Often you’ll regress and
iterate between the different phases. This process ensures you have a well-defined research
plan, a good understanding of the business question, and clear deliverables before you even
start looking at data. The first steps of your process focus on getting high-quality data as
input for your models. This way your models will perform better later on. In data science
there’s a well-known saying: Garbage in equals garbage out.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you
succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.
Start with data stored within the company
Your first act should be to assess the relevance and quality of the data that’s readily available
within your company. Most companies have a program for maintaining key data, so much of
the cleaning work may already be done. This data can be stored in official data repositories
such as databases, data marts, data warehouses, and data lakes maintained by a team of IT
professionals. The primary goal of a database is data storage, while a data warehouse is
designed for reading and analyzing that data. A data mart is a subset of the data warehouse
and geared toward serving a specific business unit. While data warehouses and data marts are
home to preprocessed data, data lakes contain data in its natural or raw format. But the
possibility exists that your data still resides in Excel files on the desktop of a domain expert.
Don’t be afraid to shop around
If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. For instance, Nielsen and GFK are
well known for this in the retail industry. Other companies provide data so that you, in turn,
can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and
Facebook.
Expect to spend a good portion of your project time doing data correction and cleansing,
sometimes up to 80%. Most of the errors you’ll encounter during the data gathering phase are
easy to spot, but being too careless will make you spend many hours solving data issues that
could have been prevented during data import. You’ll investigate the data during the import,
data preparation, and exploratory phases. During data retrieval, you check to see if the data is
equal to the data in the source document and look to see if you have the right data types. With
data preparation, the focus is on the content of the variables: you want to get rid of typos
and other data entry errors and bring the data to a common standard among the data sets.
For example, you might correct USQ to USA and United Kingdom to UK. During the
exploratory phase your focus shifts to what you can learn from the data.
Step 3: Cleansing, integrating, and transforming data
The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your
task now is to prepare it for use in the modelling and reporting phase. Doing so is
tremendously important because your models will perform better and you’ll lose less time
trying to fix strange output. Your model needs the data in a specific format, so data
transformation will always come into play.
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify
data errors; We do a regression to get acquainted with the data and detect the influence of
individual observations on the regression line. When a single observation has too much
influence, this can point to an error in the data, but it can also be a valid point.
Most errors of this type are easy to fix with simple assignment statements and if-then-else
rules:
if x == "Godo":
    x = "Good"
elif x == "Bade":
    x = "Bad"
REDUNDANT WHITESPACE
Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
In one project the cleaning during the ETL phase wasn't well executed, and keys in one table
contained a whitespace at the end of a string. This caused a mismatch of keys such as
“FR ” – “FR”, dropping the observations that couldn't be matched. In Python you can use the
strip() function to remove leading and trailing spaces.
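A minimal sketch of this fix in Python, assuming the keys sit in an ordinary list (the variable names are illustrative, not from the original case):

raw_keys = ["FR ", "DE", " NL", "FR"]
# strip() removes leading and trailing whitespace, so "FR " and "FR" match again
clean_keys = [key.strip() for key in raw_keys]
print(clean_keys)  # ['FR', 'DE', 'NL', 'FR']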
OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure 2.6. The plot on the top shows no outliers,
whereas the plot on the bottom shows possible outliers on the upper side when a normal
distribution is expected. The high values in the bottom graph can point to outliers when
assuming a normal distribution.
Missing values aren’t necessarily wrong, but you still need to handle them separately; certain
modeling techniques can’t handle missing values.
When integrating two data sets, you have to pay attention to their respective units of
measurement. Some data sets can contain prices per gallon and others can contain prices per liter. A
simple conversion will do the trick in this case.
DIFFERENT LEVELS OF AGGREGATION
Having different levels of aggregation is similar to having different types of
measurement.
An example of this would be a data set containing data per week versus one containing data
per work week.
APPENDING TABLES
Appending tables means adding the observations of one table to those of another table with the same structure (stacking the rows).
TRANSFORMING DATA
Relationships between an input variable and an output variable aren’t always linear. Take, for
instance, a relationship of the form y = ae^(bx). Taking the log of y turns this into the linear
relationship ln y = ln a + bx, which simplifies the estimation problem dramatically.
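A small illustrative sketch of this transformation in Python; the constants a and b are made up for the example:

import numpy as np

a, b = 2.0, 0.5                      # assumed example constants
x = np.linspace(0, 5, 20)
y = a * np.exp(b * x)                # exponential relationship y = a * e^(bx)

# After the log transform the relationship is linear: ln(y) = ln(a) + b*x,
# so an ordinary linear (least squares) fit recovers the parameters.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, np.exp(intercept))      # approximately 0.5 and 2.0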
The visualization techniques you use in this phase range from simple line graphs or
histograms, as shown in figure 2.15, to more complex diagrams such as Sankey and network
graphs.
These plots can be combined to provide even more insight, as shown in figure 2.16.
Overlaying several plots is common practice. In figure 2.17 we combine simple graphs into a
Pareto diagram, or 80-20 diagram. Figure 2.18 shows another technique: brushing and
linking. With brushing and linking you combine and link different graphs and tables (or
views) so changes in one graph are automatically transferred to the other graphs.
Two other important graphs are the histogram shown in figure 2.19 and the boxplot
shown in figure 2.20.
In a histogram a variable is cut into discrete categories and the number of occurrences in each
category are summed up and shown in the graph. The boxplot doesn't show how many
observations are present but does offer an impression of the distribution within categories. It
can show the maximum, minimum, median, and other characterizing measures at the same time.
The techniques you’ll use now are borrowed from the field of machine learning, data mining,
and/or statistics. most models consist of the following main steps:
1 Selection of a modelling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
2.6.1 Model and variable selection
You’ll need to select the variables you want to include in your model and a modelling
technique. Your findings from the exploratory analysis should already give a fair idea of what
variables will help you construct a good model.
You’ll need to consider model performance and whether your project meets all the
requirements to use your model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?
2.6.2 Model execution
Once you’ve chosen a model you’ll need to implement it in code. Luckily, most
programming languages, such as Python, already have libraries such as StatsModels or
Scikit-learn. These packages implement several of the most popular modeling techniques.
We created predictor values that are meant to predict how the target variables behave. For a
linear regression, a “linear relation” between each x (predictor) and the y (target) variable is
assumed, as shown in figure 2.22.
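As an illustration, a minimal sketch of fitting such a linear regression with StatsModels (assuming it is installed); the generated data and variable names are assumptions for the example, not the book's exact case:

import numpy as np
import statsmodels.api as sm

np.random.seed(0)
predictors = np.random.random((100, 2))                 # two made-up predictor variables
target = predictors.dot([0.4, 0.6]) + np.random.normal(0, 0.05, 100)

X = sm.add_constant(predictors)                         # add the intercept term
results = sm.OLS(target, X).fit()                       # ordinary least squares fit
print(results.summary())                                # R-squared, coefficients, p-values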
■ Model fit—For this the R-squared or adjusted R-squared is used. This measure is an
indication of the amount of variation in the data that gets captured by the model. The
difference between the adjusted R-squared and the R-squared is minimal here because the
adjusted one is the normal one + a penalty for model complexity.
A model gets complex when many variables (or features) are introduced. You don’t need a
complex model if a simple model is available, so the adjusted R-squared punishes you for
overcomplicating. At any rate, 0.893 is high, and it should be because we cheated.
■ Predictor variables have a coefficient—For a linear model this is easy to interpret.
Detecting influences is more important in scientific studies than perfectly fitting models (not
to mention more realistic).
■ Predictor significance—Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there. This is what the p-value is about. If the p-value is lower than
0.05, the variable is considered significant by most people. It means there's a 5% chance the
predictor doesn’t have any influence.
Linear regression works if you want to predict a value, but what if you want to classify
something? Then you go to classification models, the best known among them being k-
nearest neighbors.
Don’t let knn.score() fool you; it returns the model accuracy, but by “scoring a model” we
often mean applying it on data to make a prediction.
prediction = knn.predict(predictors)
Now we can use the prediction and compare it to the real thing using a confusion
matrix.
metrics.confusion_matrix(target,prediction)
The confusion matrix shows we have correctly predicted 17+405+5 cases, so that’s good.
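A compact sketch of this classification workflow with scikit-learn; the synthetic data is an assumption for illustration, not the book's dataset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

np.random.seed(0)
predictors = np.random.random((500, 2))                          # made-up features
target = (predictors[:, 0] + predictors[:, 1] > 1).astype(int)   # made-up labels

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, target)

print(knn.score(predictors, target))                 # model accuracy
prediction = knn.predict(predictors)                 # "scoring" in the prediction sense
print(metrics.confusion_matrix(target, prediction))  # rows: true class, columns: predicted class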
2.6.3 Model diagnostics and model comparison
You’ll be building multiple models from which you then choose the best one based on
multiple criteria. Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used to
evaluate the model afterward.
The principle here is simple: the model should work on unseen data. The model is then
unleashed on the unseen data and error measures are calculated to evaluate it. Multiple error
measures are available, and in figure 2.26 we show the general idea on comparing models.
The error measure used in the example is the mean square error.
Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and take the mean of the squared errors over all predictions.
To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%),
without showing the other 20% of data to the model.
Once the model is trained, we predict the values for the other 20% of the variables based on
those for which we already know the true value, and calculate the model error with an error
measure. Then we choose the model with the lowest error. In this example we chose model 1
because it has the lowest total error.
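A minimal sketch of such a holdout comparison with scikit-learn; the 80/20 split mirrors the text, while the data and the two candidate models are assumptions for the example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(0)
X = np.random.random((1000, 3))                       # 1,000 made-up observations
y = X.dot([1.0, 2.0, 3.0]) + np.random.normal(0, 0.1, 1000)

# Hold out 20% of the data; the models never see it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("model 1", LinearRegression()),
                    ("model 2", DecisionTreeRegressor(max_depth=2))]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(name, mse)                                  # choose the model with the lowest error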
Many models make strong assumptions, such as independence of the inputs, and you have to
verify that these assumptions are indeed met. This is called model diagnostics.
After you’ve successfully analyzed the data and built a well-performing model, you’re ready
to present your findings to the world. Sometimes people get so excited about your work that
you’ll need to repeat it over and over again because they value the predictions of your models
or the insights that you produced.
For this reason, you need to automate your models. This doesn’t always mean that you have
to redo all of your analysis all the time. Sometimes it’s sufficient that you implement only the
model scoring; other times you might build an application that automatically updates reports,
Excel spreadsheets, or PowerPoint presentations. The last stage of the data science process is
where your soft skills will be most useful, and yes, they're extremely important.
Data Mining
Data mining should have been more appropriately named “knowledge mining from
data,” which is unfortunately somewhat long. However, the shorter term, knowledge mining,
may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a
vivid term characterizing the process that finds a small set of precious nuggets from a great
deal of raw material (Figure 1.3).
In addition, many other terms have a similar meaning to data mining—for example,
knowledge mining from data, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in
the process of knowledge discovery.
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for
mining. The data mining step may interact with the user or a knowledge base. The interesting
patterns are presented to the user and may be stored as new knowledge in the knowledge
base.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (e.g., customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse (see
Section 4.1.6). The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs
to generate SQL code to be executed at a server. Examples of gateways include ODBC
(Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by
Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
2. The middle tier is an OLAP (Online analytical processing) server that is typically
implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational
DBMS that maps operations on multidimensional data to standard relational operations); or
(2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
● Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in
between a relational back-end server and client front-end tools. They use a relational
or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to support missing pieces. ROLAP servers include optimization for each
DBMS back end, implementation of aggregation navigation logic, and additional tools
and services. ROLAP technology tends to have greater scalability than MOLAP
technology. The DSS server of Microstrategy, for example, adopts the ROLAP
approach.
● Multidimensional OLAP (MOLAP) servers: These servers support
multidimensional data views through array-based multidimensional storage engines.
They map multidimensional views directly to data cube array structures. The
advantage of using a data cube is that it allows fast indexing to precomputed
summarized data. Notice that with multidimensional data stores, the storage
utilization may be low if the dataset is sparse. Many MOLAP servers adopt a two-
level storage representation to handle dense and sparse data sets: Denser subcubes are
identified and stored as array structures, whereas sparse subcubes employ
compression technology for efficient storage utilization.
● Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP
and MOLAP technology, benefiting from the greater scalability of ROLAP and the
faster computation of MOLAP. For example, a HOLAP server may allow large
volumes of detailed data to be stored in a relational database, while aggregations are
kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid
OLAP server.
● Specialized SQL servers: To meet the growing demand of OLAP processing in
relational databases, some database system vendors implement specialized SQL
servers that provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
Example 2.7 Median. Let’s find the median of the data from Example 2.6. The data are
already sorted in increasing order. There is an even number of observations (i.e., 12);
therefore, the median is not unique. It can be any value within the two middlemost values of
52 and 56 (that is, within the sixth and seventh values in the list). By convention, we assign
the average of the two middlemost values as the median; that is, median = (52 + 56) / 2 = 54,
or $54,000.
The quartiles give an indication of a distribution’s center, spread, and shape. The first
quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The
third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest
25%) of the data. The second quartile is the 50th percentile. As the median, it gives the
center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the interquartile range
(IQR) and is defined as
IQR = Q3-Q1.
Interquartile range. The quartiles are the three values that split the sorted data set into four
equal parts. 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 Thus, the quartiles for this data are
the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and
Q3 is $63,000. Thus, the interquartile range is IQR = 63-47 = $16,000.
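A quick check of this example in Python, using the textbook's convention of taking the 3rd and 9th sorted values as Q1 and Q3 (note that np.percentile interpolates and can give slightly different quartile values):

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # already sorted, in $1,000s

q1 = data[2]                       # 3rd sorted value -> 47
q3 = data[8]                       # 9th sorted value -> 63
iqr = q3 - q1                      # 16

median = (data[5] + data[6]) / 2   # average of the two middle values -> 54.0
print(q1, q3, iqr, median)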
The variance of N observations x1, x2, ..., xN for a numeric attribute is
σ² = (1/N) Σ (x_i − x̄)²
where x̄ is the mean value of the observations, as defined in Eq. (2.1). The standard
deviation, σ, of the observations is the square root of the variance, σ².
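A small sketch of these formulas in NumPy, reusing the salary data from the interquartile-range example (values in $1,000s):

import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = data.mean()
variance = data.var()        # population variance: mean of the squared deviations
std_dev = data.std()         # population standard deviation: square root of the variance
print(mean, variance, std_dev)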
UNIT II
Types of data
THREE TYPES OF DATA
Any statistical analysis is performed on data, a collection of actual observations or
scores in a survey or an experiment. The precise form of a statistical analysis often depends
on whether data are qualitative, ranked, or quantitative.
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes
(0 or 1) that represent a class or category. Ranked data consist of numbers (1st, 2nd, . . .
40th place) that represent relative standing within a group. Quantitative data consist of
numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count.
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
Discrete and Continuous Variables
Quantitative variables can be further distinguished in terms of whether they are
discrete or continuous. A discrete variable consists of isolated numbers separated by gaps.
Examples include most counts, such as the number of children in a family (1, 2, 3, etc., but
never 1 1/2).
A continuous variable consists of numbers whose values, at least in theory, have no
restrictions. Examples include amounts, such as weights of male statistics students; durations,
such as the reaction times of grade school children to a fire alarm; and standardized test
scores, such as those on the Scholastic Aptitude Test (SAT).
An experiment is a study in which the investigator decides who receives the special
treatment (for example, whether or not subjects receive special training).
2.1 FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA
Data are grouped into class intervals with 10 possible values each. The bottom class
includes the smallest observation (133), and the top class includes the largest observation
(245). The distance between bottom and top is occupied by an orderly series of classes. The
frequency ( f ) column shows the frequency of observations in each class and, at the bottom,
the total number of observations in all classes.
2.2 GUIDELINES
The “Guidelines for Frequency Distributions” box lists seven rules for producing a well-
constructed frequency distribution. The first three rules are essential and should not be
violated. The last four rules are optional and can be modified or ignored as circumstances
warrant.
For class 130-139 the cumulative frequency is 3, since there are no lower classes.
For class 140-149 the cumulative frequency is 1 + 3 = 4.
For class 150-159 the cumulative frequency is 1 + 3 + 17 = 21.
The cumulative percent for a class is given by (cumulative frequency / total number of
observations) × 100.
Example: (3/53) × 100 = 5.66, which rounds to 6.
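The same arithmetic in NumPy, using only the three class frequencies quoted above (3, 1, and 17) and the stated total of 53 observations:

import numpy as np

freq = np.array([3, 1, 17])                    # classes 130-139, 140-149, 150-159
cum_freq = np.cumsum(freq)                     # [ 3  4 21]
cum_percent = np.round(cum_freq / 53 * 100)    # [ 6.  8. 40.]
print(cum_freq, cum_percent)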
Percentile Ranks
When used to describe the relative position of any score within its parent distribution,
cumulative percentages are referred to as percentile ranks. The percentile rank of a score
indicates the percentage of scores in the entire distribution with similar or smaller values
than that score.
The weight distribution described in Table 2.2 appears as a histogram in Figure 2.1.
A casual glance at this histogram confirms previous conclusions: a dense concentration of
weights among the 150s, 160s, and 170s, with a spread in the direction of the heavier
weights. Let’s pinpoint some of the more important features of histograms.
1. Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
2. Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
3. The intersection of the two axes defines the origin at which both numerical scales
equal 0.
4. Numerical scales always increase from left to right along the horizontal axis and from
bottom to top along the vertical axis. It is considered good practice to use wiggly lines
to highlight breaks in scale, such as those along the horizontal axis in Figure 2.1,
between the origin of 0 and the smallest class of 130–139.
5. The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency
polygons may be constructed directly from frequency distributions.
Draw a vertical line to separate the stems, which represent multiples of 10, from the space to
be occupied by the leaves, which represent multiples of 1.
Selection of Stems
Stem values are not limited to units of 10. Depending on the data, a stem value such as 1,
100, 1000, or even .1, .01, .001, and so on, can be selected.
For instance, an annual income of $23,784 could be displayed as a stem of 23 (thousands)
and a leaf of 784. (Leaves consisting of two or more digits, such as 784, are separated by
commas.)
Normal
Any distribution that approximates the normal shape. The familiar bell-shaped
silhouette of the normal curve can be superimposed on many frequency distributions.
Bimodal
Any distribution that approximates the bimodal shape in panel B, might, as suggested
previously, reflect the coexistence of two different types of observations in the same
distribution. For instance, the distribution of the ages of residents in a neighborhood
consisting largely of either new parents or their infants has a bimodal shape.
Positively Skewed
The two remaining shapes in Figure 2.3 are lopsided. A lopsided distribution caused by a few
extreme observations in the positive direction (to the right of the majority of observations) is
positively skewed.
Negatively Skewed
A lopsided distribution caused by a few extreme observations in the negative direction (to the
left of the majority of observations) is negatively skewed.
2.10 A GRAPH FOR QUALITATIVE (NOMINAL) DATA
The frequency distribution for responses to the question “Do you have a Facebook profile?”
appears as a bar graph in Figure 2.4. As with histograms, equal segments along the horizontal
axis are allocated to the different words or classes that appear in the frequency distribution
for qualitative data.
3.1 MODE
The mode reflects the value of the most frequently occurring score. In the distribution of the
terms (in years) served by U.S. presidents, four years is the modal term, since the greatest
number of presidents, 7, served this term. Note that the mode equals 4 years.
More Than One Mode
Distributions can have more than one mode (or no mode at all).
Distributions with two obvious peaks, even though they are not exactly the same height, are
referred to as bimodal. Distributions with more than two peaks are referred to as
multimodal. The presence of more than one mode might reflect important differences among
subsets of data.
3.2 MEDIAN
The median reflects the middle value when observations are ordered from least to most.
3.3 MEAN
The mean is found by adding all scores and then dividing by the number of
scores.
Sample or Population?
Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores)
or as a sample (a subset of scores).
Formula for Sample Mean
The symbol X̄ (read "X bar") designates the sample mean, and the formula becomes
X̄ = ΣX / n
that is, the balance point for a sample, found by dividing the sum of the values of all scores in
the sample by the number of scores in the sample.
The modal infant death rate of 4 describes the most typical rate (since it occurs most
frequently, five times, in Table 3.4).
The median infant death rate of 7 describes the middle-ranked rate (since the United States,
with a death rate of 7, occupies the middle-ranked, or 10th, position among the 19 ranked
countries).
The mean infant death rate of 30.00 describes the balance point for all rates (since the sum of
all rates, 570, divided by the number of countries, 19, equals 30.00).
Unlike the mode and median, the mean is very sensitive to extreme scores, or outliers.
Interpreting Differences between Mean and Median
● Ideally, when a distribution is skewed, report both the mean and the median.
● Appreciable differences between the values of the mean and median signal the
presence of a skewed distribution.
● If the mean exceeds the median the underlying distribution is positively skewed.
● If the median exceeds the mean, the underlying distribution is negatively skewed.
Describing Variability
Variability is the measures of amount by which scores are dispersed or scattered in a
distribution.
In Figure 4.1, each of the three frequency distributions consists of seven scores with
the same mean (10) but with different variabilities. Try to rank the three distributions from
least to most variable. Your intuition was correct if you concluded that distribution A has the least
variability, distribution B has intermediate variability, and distribution C has the most
variability. For distribution A with the least (zero) variability, all seven scores have the same
value (10). For distribution B with intermediate variability, the values of scores vary slightly
(one 9 and one 11), and for distribution C with most variability, they vary even more (one 7,
two 9s, two 11s, and one 13).
FIGURE 4.1
Three distributions with the same mean (10) but different amounts of variability. Numbers
in the boxes indicate distances from the mean.
4.2 RANGE
The range is the difference between the largest and smallest scores.
In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to
10); distribution B, the moderately variable, has an intermediate range of 2 (from 11 to 9);
and distribution C, the most variable, has the largest range of 6 (from 13 to 7).
Shortcomings of Range
1. The range has several shortcomings. First, since its value depends on only two scores
—the largest and the smallest—it fails to use the information provided by the
remaining scores.
2. The value of the range tends to increase with increases in the total number of scores.
4.3 VARIANCE
The mean of all squared deviation scores.
The variance also qualifies as a type of mean, that is, as the balance point for some
distribution. In the case of the variance, each original score is re-expressed as a distance or
deviation from the mean by subtracting the mean.
Reconstructing the Variance
In distribution C of Figure 4.1, one score coincides with the mean of 10, four scores (two 9s and two 11s)
deviate 1 unit from the mean, and two scores (one 7 and one 13) deviate 3 units from the
mean, yielding a set of seven deviation scores: one 0, two –1s, two 1s, one –3, and one 3.
(Deviation scores above the mean are assigned positive signs; those below the mean are
assigned negative signs.)
Mean of the Squared Deviations
Multiplying each deviation by itself generates a set of squared deviation scores, all
of which are positive. Adding the consistently positive values of all squared deviation scores
and then dividing by the total number of scores produces the mean of all squared deviation
scores, also known as the variance.
Here’s a hypothetical example to demonstrate how variance works. Let’s say returns
for stock in
Company ABC are 10% in Year 1, 20% in Year 2, and 15% in Year 3
. The average of these three returns is 5%. The differences between
each return and the average are 5%, 15%, and 20%
for each consecutive year.
Squaring these deviations yields 0.25%, 2.25%, and 4.00%, respectively. If we add
these squared deviations, we get a total of 6.5%. When you divide the sum of 6.5% by one
less the number of returns in the data set, as this is a sample (2 = 3-1), it gives us a variance
of 3.25% (0.0325).
Taking the square root of the variance yields a standard deviation of 18% (0.0325 = 0.18
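The same computation in NumPy, as a quick sanity check of the example above:

import numpy as np

returns = np.array([0.10, 0.20, -0.15])   # Company ABC returns for Years 1-3

sample_variance = returns.var(ddof=1)     # divide by n - 1 = 2 because this is a sample
sample_std = returns.std(ddof=1)
print(sample_variance)   # 0.0325
print(sample_std)        # about 0.18, i.e., 18%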
STANDARD DEVIATION
Taking the square root of the variance produces a new measure, known as the standard
deviation, that describes variability in the original units of measurement: the standard
deviation is the square root of the mean of all squared deviations from the mean, that is,
standard deviation = √variance
For most frequency distributions, a majority of all scores fall within one standard deviation
on either side of the mean. This same pattern describes a wide variety of frequency
distributions, including the two shown in Figure 4.3, where the lowercase letter 's' represents
the standard deviation. As suggested in the top panel of Figure 4.3,
if the distribution of IQ scores for a class of fourth graders has a mean (X̄) of 105 and
a standard deviation (s) of 15, a majority of their IQ scores should be within one standard
deviation on either side of the mean, that is, between 90 and 120.
FIGURE 4.3
Some generalizations that apply to most frequency distributions
The sum of squares for a population is defined as
SS = Σ(X − μ)²
where SS represents the sum of squares, Σ directs us to sum over the expression to its right,
and (X − μ)² denotes each of the squared deviation scores. To compute SS:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation
score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
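A short sketch applying these three steps to distribution C from Figure 4.1 (scores 7, 9, 9, 10, 11, 11, 13), treated here as a small population:

import numpy as np

scores = np.array([7, 9, 9, 10, 11, 11, 13])   # distribution C, mean = 10

deviations = scores - scores.mean()            # step 1: deviation scores
squared = deviations ** 2                      # step 2: squared deviations
ss = squared.sum()                             # step 3: sum of squares -> 22.0

variance = ss / len(scores)                    # population variance, about 3.14
std_dev = np.sqrt(variance)                    # population standard deviation, about 1.77
print(ss, variance, std_dev)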
The computational formula for the population sum of squares is
SS = ΣX² − (ΣX)²/N
where ΣX², the sum of the squared X scores, is obtained by first squaring each X score and
then summing all squared X scores; (ΣX)², the square of the sum of all X scores, is obtained
by first adding all X scores and then squaring the sum of all X scores; and N is the population
size.
For a sample, the variance and standard deviation are
s² = SS/(n − 1) and s = √(SS/(n − 1))
where s² and s represent the sample variance and sample standard deviation, and SS is the
sample sum of squares.
When n deviations about the sample mean are used to estimate variability in the population,
only n − 1 are free to vary. As a result, there are only n − 1 degrees of freedom, that is,
df = n − 1. One df is lost because of the zero-sum restriction.
Expressed in terms of degrees of freedom, s² = SS/df and s = √(SS/df), where s² and s
represent the sample variance and standard deviation, SS is the sum of squares, and df = n − 1.
FIGURE 5.1
Relative frequency distribution for heights of 3091 men. Source: National Center for Health
Statistics, 1960–62, Series 11, No.14. Mean updated by authors.
.10 of these men, that is, one-tenth of 3091 (3091/10), or about 309 men, are 70 inches tall.
Only half of the bar at 66 inches is shaded to adjust for the fact that any height between 65.5
and 66.5 inches is reported as 66 inches, whereas eligible applicants must be shorter than
exactly 66 inches, that is, 66.0 inches.
FIGURE 5.2
Normal curve superimposed on the distribution of heights.
Different Normal Curves
For example, changing the mean height from 69 to 79 inches produces a new normal
curve that, as shown in panel A of Figure 5.3, is displaced 10 inches to the right of the
original curve. Dramatically new normal curves are produced by changing the value of the
standard deviation. As shown in panel B of Figure 5.3, changing the standard deviation from
3 to 1.5 inches produces a more peaked normal curve with smaller variability, whereas
changing the standard deviation from 3 to 6 inches produces a shallower normal curve with
greater variability.
Because of their common mathematical origin, every normal curve can be interpreted
in exactly the same way once any distance from the mean is expressed in standard deviation
units.
5.2 z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation of the
original distribution, respectively.
Converting to z Scores
To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation of
heights) and solve for z as follows:
z = (66 − 69) / 3 = −3 / 3 = −1.00
This informs us that the cutoff height is exactly one standard deviation below the
mean. Knowing the value of z, we can use the table for the standard normal curve to find the
proportion of eligible FBI applicants. First, however, we’ll make a few comments about the
standard normal curve.
5.3 STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores
will always produce a new distribution that approximates the standard normal curve. The
standard normal curve always has a mean of 0 and a standard deviation of 1.
However, to verify (rather than prove) that the mean of a standard normal distribution equals
0, replace X in the z score formula with μ, the mean of any (nonstandard) normal distribution,
and then solve for z:
z = (μ − μ) / σ = 0 / σ = 0
Likewise, to verify that the standard deviation of the standard normal distribution equals 1,
replace X in the z score formula with μ + 1σ, the value corresponding to one standard
deviation above the mean for any (nonstandard) normal distribution, and then solve for z:
z = (μ + 1σ − μ) / σ = σ / σ = 1
Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and
a standard deviation of 1.
4. Find the target area. Refer to the standard normal table, using the bottom legend, as
the z score is negative. The arrows in Table 5.1 show how to read the table. Look up
column A’ to 1.00 (representing a z score of –1.00), and note the corresponding
proportion of .1587 in column C’: This is the answer, as suggested in the right part of
Figure 5.6. It can be concluded that only .1587 (or .16) of all of the FBI applicants
will be shorter than 66 inches.
Example: Finding Proportions between Two Scores
Look up column A′ to a negative z score of –1.00 (remember, you must imagine the negative
sign), and note the corresponding proportion of .1587 in column C′. Likewise, look up
column A′ to a z score of –1.67, and note the corresponding proportion of .0475 in column C
′. Subtract the smaller proportion from the larger proportion to obtain the proportion between
the two z scores: .1587 − .0475 = .1112.
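A quick numerical check of these table lookups using SciPy's standard normal distribution (assuming SciPy is available):

from scipy.stats import norm

p1 = norm.cdf(-1.00)     # proportion below z = -1.00, about .1587
p2 = norm.cdf(-1.67)     # proportion below z = -1.67, about .0475
print(p1)
print(p1 - p2)           # proportion between the two z scores, about .1112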
UNIT III
6.2 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
A dot cluster that has a slope from the lower left to the upper right, as in panel A of Figure
6.2, reflects a positive relationship. Small values of one variable are paired with small values
of the other variable, and large values are paired with large values.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect
relationship between two variables. In practice, perfect relationships are most unlikely.
Curvilinear Relationship
Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and therefore
reflects a curvilinear relationship. Descriptions of these relationships are more complex
than those of linear relationships.
Key Properties of r
1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
Sign of r
A number with a plus sign (or no sign) indicates a positive relationship, and a number
with a minus sign indicates a negative relationship. For example, an r with a plus sign
describes the positive relationship between height and weight shown in panel A of Figure 6.2,
and an r with a minus sign describes the negative relationship between heavy smoking and
life expectancy shown in panel B.
Numerical Value of r
The more closely a value of r approaches either –1.00 or +1.00, the stronger (more
regular) the relationship. Conversely, the more closely the value of r approaches 0, the
weaker (less regular) the relationship. In Figure 6.3, notice that the values of r shift from .75
to .27 as the analysis for pairs of IQ scores shifts from a relatively strong relationship for
identical twins to a relatively weak relationship for foster parents and foster children.
Interpretation of r
Located along a scale from –1.00 to +1.00, the value of r supplies information about
the direction of a linear relationship—whether positive or negative—and, generally,
information about the relative strength of a linear relationship—whether relatively weak (and
a poor describer of the data) because r is in the vicinity of 0, or relatively strong (and a good
describer of the data) because r deviates from 0 in the direction of either +1.00 or –1.00.
Range Restrictions
The value of the correlation coefficient declines whenever the range of possible X or Y scores
is restricted.
For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an
r of .70 for the positive relationship between height and weight for all college students. If,
however, the range of heights along Y is restricted to students who stand over 6 feet 2 inches
(or 74 inches) tall, the abbreviated dot cluster loses its obvious slope because of the more
homogeneous weights among tall students. Therefore, as depicted in Figure 6.5, the value of
r drops to .10.
Verbal Descriptions
An r of .70 for the height and weight of college students could be translated into “Taller
students tend to weigh more”
The correlation coefficient can be calculated as r = SPxy / √(SSx · SSy), where
SPxy = Σ(X − X̄)(Y − Ȳ) is the sum of products, and the two sum of squares terms in the
denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)².
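A small sketch comparing this formula with NumPy's built-in np.corrcoef; the five (X, Y) pairs are made up for illustration:

import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])      # assumed example data
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

sp_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_x = np.sum((x - x.mean()) ** 2)
ss_y = np.sum((y - y.mean()) ** 2)

r = sp_xy / np.sqrt(ss_x * ss_y)
print(r)
print(np.corrcoef(x, y)[0, 1])               # same value from NumPy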
FIGURE 7.2
Prediction of 15.20 for Emma (using the regression line).
Predictive Errors
Figure 7.3 illustrates the predictive errors that would have occurred if the regression
line had been used to predict the number of cards received by the five friends.
FIGURE 7.3
Predictive errors.
The placement of the regression line minimizes not the total predictive error but
the total squared predictive error, that is, the total for all squared predictive errors.
When located in this fashion, the regression line is often referred to as the least squares
regression line.
For example, substituting X = 13 into the least squares equation Y′ = .80X + 6.40 gives
Y′ = (.80)(13) + 6.40 = 16.8.
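A brief sketch of how such a least squares line can be obtained and used with NumPy. The five (X, Y) pairs below are invented for illustration; the last line reuses the text's equation Y′ = .80X + 6.40:

import numpy as np

x = np.array([1, 3, 4, 6, 8])          # assumed cards sent
y = np.array([6, 9, 10, 11, 14])       # assumed cards received

slope, intercept = np.polyfit(x, y, 1) # least squares slope and intercept
print(slope, intercept)

print(0.80 * 13 + 6.40)                # the text's equation predicts 16.8 for X = 13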
7.4 STANDARD ERROR OF ESTIMATE, sy|x (read "s sub y given x") [the error caused during
prediction]
The standard error of estimate is defined as
sy|x = √(SSy(1 − r²) / (n − 2))
where SSy is the sum of squares for the Y scores (cards received by the five friends), that is,
SSy = Σ(Y − Ȳ)².
7.5 ASSUMPTIONS
Linearity
Use of the regression equation requires that the underlying relationship be linear. You need
to worry about violating this assumption only when the scatterplot for the original correlation
analysis reveals an obviously bent or curvilinear dot cluster, such as illustrated in Figure 6.4.
In the unlikely event that a dot cluster describes a pronounced curvilinear trend, consult
more advanced statistical techniques.
Homoscedasticity
Use of the standard error of estimate, sy|x, assumes that except for chance, the dots in the
original scatterplot will be dispersed equally about all segments of the regression line. You
need to worry about violating this assumption of homoscedasticity only when the scatterplot
reveals a dramatically different type of dot cluster, such as that shown in Figure 7.4.
Figure 7.4
7.6 INTERPRETATION OF r2
Squared correlation coefficient, r2: a measure of predictive accuracy that supplements the
standard error of estimate, sy|x. Even though our ultimate goal is to show the relationship
between r2 and predictive accuracy, we will initially concentrate on two kinds of predictive
errors—those due to the repetitive prediction of the mean and those due to the regression
equation.
Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for all
five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their five Y
scores. Panel B shows the corresponding predictive errors for all five friends when a series of
different Y′ values, obtained from the least squares equation (shown as the least squares line),
is used to predict each of their five Y scores.
Positive and negative errors indicate that Y scores are either above or below their
corresponding predicted scores.
Overall, as expected, errors are smaller when customized predictions of Y′ from the least
squares equation can be used than when only the repetitive prediction of Ȳ can be used.
The error variability for the repetitive prediction of the mean can be designated as SSy,
since each Y score is expressed as a squared deviation from Ȳ and then summed.
Using the errors for the five friends shown in Panel A of Figure 7.5, this becomes SSy = 80.
The error variability for the customized predictions from the least squares equation
can be designated as SSy|x.
Using the errors for the five friends shown in Panel B of Figure 7.5, we obtain SSy|x = 28.8.
To obtain an SS measure of the actual gain in accuracy due to the least squares predictions,
subtract the residual variability from the total variability, that is, subtract
SSy|x from SSy, to obtain SSy − SSy|x = 80 − 28.8 = 51.2.
To express this difference, 51.2, as a gain in accuracy relative to the original error variability
for the repetitive prediction of Ȳ, divide it by SSy: 51.2/80 = .64.
This result, .64 or 64 percent, represents the proportion or percent gain in predictive accuracy.
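The same arithmetic as a tiny Python check, using the error variabilities quoted above:

ss_y = 80.0           # error variability when always predicting the mean
ss_y_given_x = 28.8   # error variability of the least squares predictions

r_squared = (ss_y - ss_y_given_x) / ss_y
print(r_squared)      # 0.64, i.e., a 64 percent gain in predictive accuracy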
The square of the correlation coefficient, r2, always indicates the proportion of total
variability in one variable that is predictable from its relationship with the other
variable.
Small Values of r2
A value of r2 in the vicinity of .01, .09, or .25 reflects a weak, moderate, or strong
relationship, respectively.
r2 provides us with a straightforward measure of the worth of our least squares predictive
effort.
Table 7.4 lists the top 10 hitters in the major leagues during 2014 and shows how they fared
during 2015. Notice that 7 of the top 10 batting averages regressed downward, toward the .260s,
the approximate mean for all hitters during 2015. Incidentally, it is not true that, viewed as a
group, all major league hitters are headed toward mediocrity. Hitters among the top 10 in
2014, who were not among the top 10 in 2015, were replaced by other mostly above-average
hitters, who also were very lucky during 2015. Observed regression toward the mean occurs
for individuals or subsets of individuals, not for entire groups.
Some trainees were praised after very good landings, while others were reprimanded
after very bad landings. On their next landings, praised trainees did more poorly and
reprimanded trainees did better. It was concluded, therefore, that praise hinders but a
reprimand helps performance!
A valid conclusion considers regression toward the mean. It’s reasonable to assume
that, in addition to skill, chance plays a role in landings. Some trainees who made very
good landings were lucky, while some who made very bad landings were unlucky.
UNIT IV
The Basics of NumPy Arrays
Data manipulation in Python is nearly synonymous with NumPy array manipulation. This
section uses NumPy array manipulation to access data and subarrays, and to split, reshape,
and join arrays.
NumPy Array Attributes
Useful array attributes include dtype (the data type of the array's elements), itemsize, which
lists the size (in bytes) of each array element, and nbytes, which lists the total size (in bytes)
of the array. For a three-dimensional integer array with 60 elements (for example, shape
(3, 4, 5)):
dtype: int64
itemsize: 8 bytes
nbytes: 480 bytes # 8 bytes * 60 [m*n*k]
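A short sketch that reproduces attribute values like these; the seed and shapes are assumptions chosen to be consistent with the x1 and x2 values used later in this section:

import numpy as np

np.random.seed(0)                           # seed for reproducibility
x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array, 60 elements

print("dtype:", x3.dtype)                   # int64 on most 64-bit platforms
print("itemsize:", x3.itemsize, "bytes")    # 8 bytes per element
print("nbytes:", x3.nbytes, "bytes")        # 8 * 60 = 480 bytes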
Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value
(counting from zero) by specifying the desired index in square
brackets.
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
# example values produced by a (seeded) random generator
In[6]: x1[0]
Out[6]: 5
In[7]: x1[4]
Out[7]: 7
To index from the end of the array, you can use negative
indices:
Recall x1: array([5, 0, 3, 3, 7, 9])
In[8]: x1[-1]
Out[8]: 9
In[9]: x1[-2]
Out[9]: 7
x2
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
Elements of a two-dimensional array are accessed with a comma-separated (row, column) index:
In[11]: x2[0, 0]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
Values can also be modified using index notation. Recall x1: array([5, 0, 3, 3, 7, 9]).
In[15]: x1[0] = 3.14159 # this will be truncated, because the array has a fixed integer type!
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
In[16]: x = np.arange(10)
x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5] # first five elements
Out[17]: array([0, 1, 2, 3, 4])
In[23]: x[5::-2] # every other element, reversed, starting from index 5
Out[23]: array([5, 3, 1])
x[6::-2]
array([6, 4, 2, 0])
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple
slices separated by commas.
For example
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # first two rows, first three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[ 7, 8],
[ 1, 7]])
In: x2[::-1] # reverse the order of the rows
array([[1, 6, 7, 7],
[7, 6, 8, 8],
[3, 5, 2, 4]])
=============================================
In: x2[::-1, ::-1] # reverse the rows and the columns
array([[7, 7, 6, 1],
[8, 8, 6, 7],
[4, 2, 5, 3]])
x2
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
x2[::-2]
array([[1, 6, 7, 7],
[3, 5, 2, 4]])
x2[::-2,::-2]
array([[7, 6],
[4, 5]])
x2[::-3]
array([[1, 6, 7, 7]])
89
x2[::-3,::-3]
array([[7, 1]])
A single row can be accessed more compactly by omitting the empty slice:
In[30]: print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
Array slices return views rather than copies of the array data. If we take a subarray such as
x2_sub = x2[:2, :2] and modify it, we'll see that the original array is changed:
In[33]: x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In[34]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
It is sometimes useful to instead explicitly copy the data within a subarray using the copy() method:
In[35]: x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 5]
[ 7 6]]
If we now modify this subarray, the original array is not
touched:
In[36]: x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 5]
[ 7 6]]
In[37]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Reshaping of Arrays
The most flexible way of doing this is with the reshape()
method
For example, to put the numbers 1 through 9 in a 3×3 grid:
In: grid = np.arange(1, 10).reshape((3, 3))
print(grid)
out:
[[1 2 3]
[4 5 6]
[7 8 9]]
the size of the initial array must match the size of the reshaped
array
For a one-dimensional array x = np.array([1, 2, 3]), reshape can also turn it into a column vector:
In[41]: x.reshape((3, 1))
Out[41]: array([[1],
[2],
[3]])
Array Concatenation and Splitting
All of the preceding routines worked on single arrays. It’s also
possible to combine multiple arrays into one, and to conversely split a
single array into multiple arrays.
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily
accomplished through the routines np.concatenate, np.vstack, and
np.hstack.
In[43]: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y]) # Merges two arrays
Out[43]: array([1, 2, 3, 3, 2, 1])
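For joining arrays of mixed dimensions it is often clearer to use np.vstack (vertical stack) and np.hstack (horizontal stack), named in the text above. A minimal sketch with small illustrative arrays:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
np.vstack([x, grid])     # stack vertically: result is a 3x3 array
# array([[1, 2, 3],
#        [9, 8, 7],
#        [6, 5, 4]])
y = np.array([[99],
              [99]])
np.hstack([grid, y])     # stack horizontally: result is a 2x4 array
# array([[ 9,  8,  7, 99],
#        [ 6,  5,  4, 99]])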
Splitting of arrays
The opposite of concatenation is splitting, implemented by the routines np.split,
np.hsplit, and np.vsplit. For a 4×4 array grid = np.arange(16).reshape((4, 4)), splitting
it into upper and lower halves:
In[52]: upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
In[53]: left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
Aggregations
Note that Python's built-in sum function and NumPy's np.sum function are not identical:
np.sum is aware of multiple array dimensions and, because it executes in compiled code,
is much faster on large arrays.
Minimum and Maximum
Similarly, Python has built-in min and max functions, used to find the minimum value
and maximum value of any given array (big_array here is a large array of random values):
In[5]: min(big_array), max(big_array)
Out[5]: (1.1717128136634614e-06, 0.9999976784968716)
For min, max, sum, and several other NumPy aggregates, a shorter
syntax is to use methods of the array object itself:
In[8]: print(big_array.min(), big_array.max(), big_array.sum())
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a
row or column
In[9]: M = np.random.random((3, 4)) # 3 rows and 4 columns
print(M)
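Aggregation functions take an axis argument specifying the axis along which the aggregate is computed; a minimal sketch reusing the random matrix M above (the results are random, so only their shapes are indicated):
M.sum()          # aggregate over the whole array: a single number
M.min(axis=0)    # minimum of each column: an array of 4 values
M.max(axis=1)    # maximum of each row: an array of 3 values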
Computation on Arrays: Broadcasting
Broadcasting is a set of rules for applying binary ufuncs (addition, subtraction,
multiplication, etc.) on arrays of different sizes (Figure 2-4 of the source illustrates
this visually). The simplest case is adding a scalar to an array a = np.array([0, 1, 2]):
In[3]: a + 5
Out[3]: array([5, 6, 7])
Broadcasting also works between arrays of different shapes. Here a is the
one-dimensional array above and b is a column vector, b = np.arange(3)[:, np.newaxis]:
print(a)
print(b)
[0 1 2] # a
[[0] # b
[1]
[2]]
In[7]: a + b
Out[7]: array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules that determine the interaction
between the two arrays:
● Rule 1: If the two arrays differ in their number of dimensions, the shape of the one
with fewer dimensions is padded with ones on its leading (left) side.
● Rule 2: If the shape of the two arrays does not match in any dimension, the array
with shape equal to 1 in that dimension is stretched to match the other shape.
● Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error
is raised.
Broadcasting example 1
Let’s look at adding a two-dimensional array to a one-dimensional
array:
In[8]: M = np.ones((2, 3))
a = np.arange(3)
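The example stops short of the result; a short sketch of how the rules resolve it (the shapes follow directly from the definitions of M and a):
# M.shape = (2, 3), a.shape = (3,)
# Rule 1: pad a's shape with a leading one  -> a.shape = (1, 3)
# Rule 2: stretch that 1 up to 2            -> a.shape = (2, 3)
M + a
# array([[1., 2., 3.],
#        [1., 2., 3.]])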
Broadcasting example 2
example where both arrays need to be broadcast
In[10]: a = np.arange(3).reshape((3, 1))
b = np.arange(3)
a.shape = (3, 1)
b.shape = (3,)
out: [[0]
[1]
[2]]
[0 1 2]
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
Rule 2 says we then stretch each of these ones to match the corresponding size of the
other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
Because the resulting shapes match, the arrays are compatible:
In[11]: a + b
Out[11]: array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Broadcasting example 3
Finally, consider an example in which the two arrays are not compatible:
In[12]: M = np.ones((3, 2))
a = np.arange(3)
[0 1 2] # a output
M.shape = (3, 2)
a.shape = (3,)
Rule 1 tells us that we must pad the shape of a with ones:
M.shape -> (3, 2)
a.shape -> (1, 3)
By rule 2, the first dimension of a is stretched to match that of M:
M.shape -> (3, 2)
a.shape -> (3, 3) # the second dimensions (2 and 3) now disagree, and neither is 1
By rule 3, the final shapes do not match, so these two arrays are incompatible, and
attempting the operation raises an error:
In[13]: M + a
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
Broadcasting in Practice
Centering an array: ufuncs allow a NumPy user to avoid explicitly writing slow Python
loops, and broadcasting extends this ability. A common example is centering an array of
data. Suppose we have 10 observations of 3 values each, stored in a 10×3 array
X = np.random.random((10, 3)):
Out:
([[0.6231582 , 0.62830284, 0.48405648],
[0.4893788 , 0.96598238, 0.99261057],
[0.18596872, 0.26149718, 0.41570724],
[0.74732252, 0.96122555, 0.03700708],
[0.71465724, 0.92325637, 0.62472884],
[0.53135009, 0.20956952, 0.78746706],
[0.67569877, 0.45174937, 0.53474695],
[0.91180302, 0.61523213, 0.18012776],
[0.75023639, 0.46940932, 0.11044872],
[0.86844985, 0.07136273, 0.00521037]])
We can center the X array by subtracting the column-wise mean from each element:
In[18]: Xmean = X.mean(0) # mean of each of the 3 columns
In[19]: X_centered = X - Xmean # broadcasting: (10, 3) minus (3,)
We can check that the centered array has a near-zero mean:
In[20]: X_centered.mean(0)
Out[20]: array([ 0.00000000e+00, -1.11022302e-16, -6.66133815e-17])
When plotting a two-dimensional function with plt.imshow, the relevant arguments are:
z, the array to display; origin='lower', which places the [0, 0] index of z at the
lower-left corner of the plot; extent, the (left, right, bottom, top) boundaries of the
image; and cmap, the color map.
Comparisons, Masks, and Boolean Logic
(In the rainfall example these notes draw on, daily precipitation values are stored in
an array named inches, and plt.hist(inches, 2); draws a quick histogram of them.)
Comparison operators are implemented in NumPy as element-wise ufuncs that return Boolean
arrays. For an integer array x:
In[13]: x < 6
Out[13]: array([[ True, True, True, True],
[False, False, True, True],
[True, True, False, False]], dtype=bool)
Counting entries
In[15]: # how many values less than 6?
np.count_nonzero(x < 6) # counts the entries of x that are less than 6
Out[15]: 8
np.all and np.any can also be applied along a particular axis. For example:
In[20]: np.all(x < 8, axis=1) # are all values in each row less than 8?
Out[20]: array([ True, False, True], dtype=bool)
Here all the elements in the first and third rows are less than 8, while this is not the
case for the second row.
BOOLEAN OPERATORS
We have already seen how to count, say, all days with rain less than four inches. But
what about all days with rain less than four inches and greater than one inch? This is
handled with Python's bitwise logic operators &, |, ^, and ~, which NumPy overloads as
element-wise ufuncs on (usually Boolean) arrays; a short sketch follows.
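A minimal sketch of combining Boolean conditions (assuming, as in the rainfall example, an array inches of daily precipitation values; the resulting counts depend on the data):
np.sum((inches > 0.5) & (inches < 1))        # days with between 0.5 and 1.0 inches of rain
np.sum(~((inches <= 0.5) | (inches >= 1)))   # the same count, rewritten via De Morgan's laws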
Boolean Arrays as Masks
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets
of the data themselves. Returning to the x array from before:
In[27]: x < 5
Out[27]: array([[False, True, True, True],
[False, False, True, False],
[True, True, False, False]], dtype=bool)
Now to select these values from the array, we can simply index on
this Boolean array;
this is known as a masking operation:
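A minimal sketch of masking (assuming x is the 3×4 integer array used in the comparisons above; the result is a one-dimensional array of the values of x where the mask is True):
x[x < 5]   # e.g. array([0, 3, 3, 3, 2, 4]) for the x shown earlier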
In[33]: bin(42)
Out[33]: '0b101010' #binary representation
In[34]: bin(59)
Out[34]: '0b111011' #binary representation
When you use and or or on arrays, Python tries to evaluate the truth of the entire
object, which is ambiguous, so an error is raised; use the element-wise operators & and
| instead:
In[38]: A or B
ValueError: The truth value of an array with more than one element
is...
5. Fancy Indexing
We'll look at another style of array indexing, known as fancy indexing: instead of
passing single indices (e.g., arr[0]) or slices (e.g., arr[:5]), we pass arrays of
indices, which lets us quickly access and modify complicated subsets of an array's
values. For example, for a random ten-element integer array x:
[51 92 14 71 60 20 82 86 74 74]
Fancy indexing also works in multiple dimensions. For a 3×4 array X (here
X = np.arange(12).reshape((3, 4))), the first index array refers to the row and the
second to the column:
In[6]: row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
Out[6]: array([ 2, 5, 11])
The first value in the result is X[0, 2], the second is X[1, 1], and
the third is X[2, 3]. The pairing of indices in fancy indexing follows
all the broadcasting rules.
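Because the pairing follows the broadcasting rules, combining a column vector of row indices with the col array yields a two-dimensional result; a minimal sketch using the X, row, and col arrays above:
X[row[:, np.newaxis], col]   # row broadcast to a (3, 1) column against the (3,) col
# array([[ 2,  1,  3],
#        [ 6,  5,  7],
#        [10,  9, 11]])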
Modifying Values with Fancy Indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to
modify parts of an array. For example (assuming, as in the standard example, that
x = np.arange(10), the index array i = [2, 1, 8, 4], and x[i] has already been set to
99):
x[i] -= 10 # subtract 10 at each fancy-indexed position
print(x)
[ 0 89 89 3 89 5 6 7 89 9]
Repeated indices with these operations can give surprising results. With x a zeros
array whose first element was previously set to 6 (as in the standard example):
In[21]: i = [2, 3, 3, 4, 4, 4]
x[i] += 1 # evaluated as x[i] = x[i] + 1, so repeated assignments overwrite rather than accumulate
x
Out[21]: array([ 6., 0., 1., 1., 1., 0., 0., 0., 0., 0.])
To accumulate at repeated indices, use the at() method of ufuncs on a fresh zeros array,
e.g. np.add.at(x, i, 1), which gives:
[ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
Structured Data: NumPy's Structured Arrays
Structured arrays use compound dtypes so that fields can be accessed by name; here data
is a structured array with (among others) an 'age' field:
In[15]: data['age']
Out[15]: array([25, 45, 37, 19], dtype=int32)
If we view our data as a record array instead, we can access this with
slightly fewer keystrokes:
The downside is that for record arrays, there is some extra overhead
involved in accessing the fields.
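A minimal sketch of the record-array view (assuming data is the structured array with an 'age' field shown above):
data_rec = data.view(np.recarray)   # view the structured array as a record array
data_rec.age                        # fields become attributes: same values as data['age']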
Series as dictionary
Like a dictionary, the Series object provides a mapping from a
collection of keys to a collection of values:
In[1]: import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], #values
index=['a', 'b', 'c', 'd']) # keys
data
Out[1]: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In[2]: data['b'] # returns the value of b in OUT[1]
Out[2]: 0.5
Series objects can also be modified with a dictionary-like syntax: just as you can
extend a dictionary by assigning to a new key, you can extend a Series by assigning to a
new index value:
In[6]: data['e'] = 1.25
data
Out[6]: a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
Series as one-dimensional array
A Series also provides array-style item selection via slices, masking, and fancy
indexing:
In[9]: # masking
data[(data > 0.3) & (data < 0.8)]
Out[9]: b 0.50
c 0.75
dtype: float64
● when you are slicing with an explicit index (i.e., data['a':'c']), the
final index is included in the slice, while when you’re slicing
with an implicit index (i.e., data[0:2]), the final index is
excluded from the slice.
data['a':'c']
Out[7]: a 0.25
b 0.50
c 0.75
data[0:2]
a 0.25
b 0.50
Indexers: loc and iloc
Slicing and indexing conventions can be a source of confusion when a Series has an
explicit integer index. Consider a Series data with values 'a', 'b', 'c' and explicit
index 1, 3, 5:
1 a
3 b
5 c
dtype: object
The loc attribute allows indexing and slicing that always references the explicit index:
In[15]: data.loc[1:3]
Out[15]: 1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references
the implicit Python-style index:
In[16]: data.iloc[1]
Out[16]: 'b'
In[17]: data.iloc[1:3]
Out[17]: 3 b
5 c
dtype: object
A third indexing attribute, ix, was a hybrid of the two and, for Series objects, was
equivalent to standard []-based indexing; note that ix has been deprecated and removed
in recent versions of Pandas, so loc and iloc are the indexers to use.
DataFrame as a dictionary
In[18]: area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})#area variable
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})#population variable
data = pd.DataFrame({'area':area, 'pop':pop})
data
In[19]: data['area']
Out[19]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
Columns can also be accessed with attribute-style access, e.g. data.area instead of
data['area']. Though this is a useful shorthand, keep in mind that it does not work in
all cases: if the column name is not a valid Python identifier, or if it clashes with an
existing DataFrame method, attribute access fails or refers to the method instead. For
example, a DataFrame already has a pop() method, so data.pop refers to that method
rather than to the 'pop' column.
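A quick check of this behavior (a sketch, assuming the data DataFrame with 'area' and 'pop' columns defined above):
data.area is data['area']   # attribute access returns the 'area' column (True in the classic example)
data.pop is data['pop']     # False: data.pop is the DataFrame's pop() method, not the column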
Matrix transpose
In[25]: data.T
Out[25]:
California Florida Illinois New York Texas
area 4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01
The familiar NumPy-style indexers also work on a DataFrame; for example, selecting the
first three rows and first two columns with iloc:
In[28]: data.iloc[:3, :2]
Out[28]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
Adding a derived column, e.g. data['density'] = data['pop'] / data['area'], gives the
following example DataFrame:
Out[23]:
area pop density
California 423967 38332521 90.413926
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740
In[34]: data[1:3]
Out[34]: area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
Operating on Data in Pandas: ufuncs preserve index and column labels. Consider a small
random DataFrame df:
df
Out[3]: A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
Any item for which one or the other does not have an entry is marked
with NaN, or “Not a Number,” which is how Pandas marks missing
data
Example:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2]) #values with index are added
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A+B
Out[9]: 0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
If using NaN values is not the desired behavior, the fill value can be specified
explicitly by using the corresponding object method in place of the operator: calling
A.add(B) is equivalent to calling A + B, but it also accepts an optional fill_value
argument (see the sketch below).
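A minimal sketch, continuing with the Series A and B defined above (entries missing from either Series are treated as 0 before the addition):
A.add(B, fill_value=0)
# 0    2.0
# 1    5.0
# 2    9.0
# 3    5.0
# dtype: float64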
A similar type of alignment takes place for both columns and indices when operating on
DataFrames. Here A is a 2×2 DataFrame with columns A and B (its values appear in the
fill example below), and B is a 3×3 DataFrame with columns B, A, and C:
Out[12]: B A C
0 4 0 9
1 5 8 0
2 9 2 6
In[13]: A + B
Out[13]: A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
Here we’ll fill with the mean of all values in A (which we compute by
first stacking the rows of A):
In[14]: fill = A.stack().mean() # all values in A are stacked and
added to find mean = 4.5 obtained from (1+5+11+1)/4
A.add(B, fill_value=fill)
Conceptually, A (with values [[1, 11], [5, 1]]) is padded with the fill value 4.5 in
every position where it has no entry and is then added element-wise to B, giving:
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Handling Missing Data
Pandas can represent missing entries with the Python object None, but an array holding
None must use dtype=object, and operations on object arrays are much slower than on
arrays with a native numeric type, as a quick timing of a sum over a large array shows:
dtype = object
10 loops, best of 3: 78.2 ms per loop
dtype = int
100 loops, best of 3: 3.06 ms per loop
The other missing-data sentinel, NaN (Not a Number), is a special floating-point value,
and arithmetic with NaN always produces NaN:
In[6]: 1 + np.nan
Out[6]: nan
In[7]: 0 * np.nan
Out[7]: nan
Aggregates over an array containing a NaN are also NaN (here vals2 is an array such as
np.array([1, np.nan, 3, 4])):
In[8]: vals2.sum(), vals2.min(), vals2.max()
Out[8]: (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these
missing values
In[9]: np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[9]: (8.0, 1.0, 4.0)
Dropping null values: consider the DataFrame
In[17]: df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
Out[17]: 0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
By default, dropna() will drop all rows in which any null value is
present:
In[18]: df.dropna()
Out[18]: 0 1 2
1 2.0 3.0 5 # displays the only row with no missing values
you can drop NA values along a different axis; axis=1 drops all
columns containing a null value:
In[19]: df.dropna(axis='columns')
Out[19]: 2 # displays only the column with no missing values
0 2
1 5
2 6
In[20]: df[3] = np.nan # add column 3 to df.
df
Out[20]: 0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[21]: df.dropna(axis='columns', how='all')
Out[21]: 0 1 2 # with how='all', only columns that are entirely NaN (here column 3) are dropped
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
the thresh parameter lets you specify a minimum number of
non-null values for the row/column to be kept:
In[22]: df.dropna(axis='rows', thresh=3)
Out[22]: 0 1 2 3
1 2.0 3.0 5 NaN
Filling null values: consider the Series
In[23]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[23]:
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
We can fill NA entries with a single value, such as zero:
In[24]: data.fillna(0)
Out[24]:
a 1.0 # the NaN entries are filled with 0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
We can specify a forward fill, which propagates the previous value forward:
In[25]: data.fillna(method='ffill')
Out[25]:
a 1.0 # each NaN is filled with the previous valid value
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
Or a back fill, which propagates the next valid value backward:
In[26]: data.fillna(method='bfill')
Out[26]:
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
For DataFrames, the options are similar, but we can also specify an
axis along which the fills take place:
In[27]: df
Out[27]:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[28]: df.fillna(method='ffill', axis=1) # forward-fill along axis 1: each NaN takes the value to its left
Out[28]:
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
Notice that if a previous value is not available during a forward fill,
the NA value remains.
Hierarchical Indexing
While Pandas does provide Panel and Panel4D objects that natively
handle three-dimensional and four-dimensional data, a far more
common pattern in practice is to make use of hierarchical indexing
(also known as multi-indexing) to incorporate multiple index levels
within a single index.
creation of MultiIndex objects
In[1]: import pandas as pd
import numpy as np
The bad way is to index the Series with plain Python tuples of (state, year) keys; this
produces the desired result, but it is not as clean (or as efficient for large datasets)
as a true MultiIndex, so we use a MultiIndex instead.
● In this case the index levels are the state names and the years, and each data point
carries multiple labels encoding these levels.
● Reindexing the Series with the MultiIndex shows the hierarchical representation of
the data; a sketch of the assumed setup follows.
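A minimal sketch of the setup these lines presumably rely on (the state/year pairs and population values are taken from the output shown below; the tuple-keyed Series is then given a proper MultiIndex):
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)    # Series keyed by plain tuples
index = pd.MultiIndex.from_tuples(index)     # convert the tuples into a MultiIndex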
In[6]: pop = pop.reindex(index)
pop
Out[6]: California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Blank entry indicates the same value as the line above it.
In[7]: pop[:, 2010] #access data of 2010
Out[7]: California 37253956
New York 19378102
Texas 25145561
dtype: int64
The unstack() method will quickly convert a multiply-indexed Series into a
conventionally indexed DataFrame:
In[8]: pop_df = pop.unstack()
pop_df
Out[8]:
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
stack() method provides the opposite operation:
In[9]: pop_df.stack()
Out[9]:
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
So why do we need hierarchical indexing at all? Just as we were able to use
multi-indexing to represent two-dimensional data within a one-dimensional Series, we can
also use it to represent data of three or more dimensions in a Series or DataFrame.
Now we add another column with population under 18.
In[10]: pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
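With this extra dimension in place, ufuncs and broadcasting work as usual; a minimal sketch computing the fraction of the population under 18 from the pop_df defined above and viewing it by year:
f_u18 = pop_df['under18'] / pop_df['total']   # element-wise fraction, still multiply-indexed
f_u18.unstack()                               # rows: states, columns: years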
Combining Datasets: Concat and Append
Simple concatenation of Series objects is done with pd.concat:
In[6]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
Out[6]: 1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result ends up with duplicate indices (the
make_df helper used below is sketched first):
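The make_df helper is not defined in this excerpt; a plausible definition, matching how it is used here to build small labeled DataFrames, is:
def make_df(cols, ind):
    """Quickly make a DataFrame with string entries such as 'A0', 'B1', ..."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# e.g. make_df('AB', [0, 1]) has columns A and B, rows 0 and 1, entries A0, B0, A1, B1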
In[9]: x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))
print("ValueError:", e)
ValueError: Indexes have overlapping values: [0, 1]
Ignoring the index. Sometimes the index itself does not matter, and
you would prefer it to simply be ignored. You can specify this option
using the ignore_index flag. With this set to True, the concatenation
will create a new integer index for the resulting Series:
By default, the entries for which no data is available are filled with NA values. To
change this, we can specify one of several options for the join and join_axes parameters
of the concatenate function. By default, the join is a union of the input columns
(join='outer'), but we can change this to an intersection of the columns using
join='inner' (note that join_axes has been removed in recent versions of Pandas; the
same effect is obtained by reindexing the result):
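A minimal sketch of the inner join on columns, using the make_df helper sketched earlier (df5 and df6 are illustrative names):
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
pd.concat([df5, df6])                 # outer join: the union of columns, with NaN where data is missing
pd.concat([df5, df6], join='inner')   # inner join: only the shared columns B and C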
Many-to-many joins
If the key column in both the left and right arrays contains duplicates, then the
result is a many-to-many merge, as in the sketch below.
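A minimal sketch of a many-to-many merge (the DataFrames are illustrative; pd.merge joins them on their shared 'group' column, and each employee row is repeated once per matching skill):
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding',
                               'linux', 'spreadsheets', 'organization']})
pd.merge(df1, df5)   # many-to-many: the 'group' key is duplicated on both sides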
Pivot Tables
A pivot table takes column-wise data as input and groups the entries into a
two-dimensional table that provides a multidimensional summary. The standard
illustration is the Titanic passenger data, sketched below.
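A minimal sketch of a pivot table on the Titanic data (this assumes the dataset is loaded through Seaborn's load_dataset; 'survived', 'sex', and 'class' are column names in that dataset):
import seaborn as sns
titanic = sns.load_dataset('titanic')
# mean survival rate, grouped by sex (rows) and passenger class (columns)
titanic.pivot_table('survived', index='sex', columns='class')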
UNIT V
Visualization with Matplotlib
Color version available online at
https://jakevdp.github.io/PythonDataScienceHandbook/
https://matplotlib.org/
General Matplotlib Tips
In[1]: import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style directive to choose appropriate aesthetic styles for our figures
In[2]: plt.style.use('classic')
Plotting from a script
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100) #numpy.linspace(start, stop, num=50)
Return evenly spaced numbers over a specified interval.
Returns num evenly spaced samples, calculated over the interval [start, stop].
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
The plt.show() command should be used only once per Python session, most often at the
very end of the script; multiple show() commands can lead to unpredictable
backend-dependent behavior.
Plotting from an IPython shell
IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this
mode, you can use the %matplotlib magic command after starting ipython:
In [1]: %matplotlib # enables the drawing of matplotlib figures in the IPython
environment
Using matplotlib backend: TkAgg
In [2]: import matplotlib.pyplot as plt
Plotting from an IPython notebook
The IPython notebook is a browser-based interactive data analysis tool that can combine
narrative, code, graphics, HTML elements, and much more into a single executable
document
• %matplotlib notebook will lead to interactive plots embedded within the
notebook
• %matplotlib inline will lead to static images of your plot embedded in the
Notebook
In[3]: %matplotlib inline
In[4]: import numpy as np
Adjusting the Plot: Line Colors
The plt.plot() function takes a color keyword that can be specified in several ways:
In[6]:
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values between 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
plt.axis('equal'); # equal aspect ratio: one unit in x equals one unit in y
Labeling Plots
In[14]: plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
In[15]:
plt.plot(x, np.sin(x), '-g', label='sin(x)') # green label sin
plt.plot(x, np.cos(x), ':b', label='cos(x)') # blue label cos
plt.axis('equal')
plt.legend();
In[3]: rng = np.random.RandomState(0) # seed value, produces same random numbers again
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker, label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
In[4]: plt.plot(x, y, '-ok'); # line (-), circle marker (o), black (k)
Visualizing Errors
Basic Errorbars
In[1]: %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
In[2]: x = np.linspace(0, 10, 50)
# start, stop, no. of points.
# Return evenly spaced numbers over a specified interval.
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
# xerr / yerr: the error sizes, given as a scalar or an array of positive values
# fmt: an optional format string controlling the appearance of markers and lines;
# '.k' draws the data points as black dots
Continuous Errors
In[4]: from sklearn.gaussian_process import GaussianProcess # from an old scikit-learn version; replaced by GaussianProcessRegressor (see the sketch below)
# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)
# Compute the Gaussian process fit
# cubic correlation function
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
random_start=100)
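Because GaussianProcess is no longer available in current scikit-learn, here is a rough equivalent sketch using today's GaussianProcessRegressor API (the kernel choice and plotting details are assumptions, not the original code); the key point is that plt.fill_between draws the continuous error band:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std                                   # roughly a 95% confidence region

plt.plot(xdata, ydata, 'or')                      # observed points
plt.plot(xfit, yfit, '-', color='gray')           # fitted mean
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)         # continuous error band
plt.xlim(0, 10);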
Legend for Size of Points
Sometimes the legend must reflect the sizes of scatter points, which can be done by
plotting labeled empty ('dummy') scatter points of the desired sizes:
for area in [100, 300, 500]: # one dummy point per legend entry
plt.scatter([], [], c='k', alpha=0.3, s=area, label=str(area) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, title='City Area')
plt.title('California Cities: Area and Population');
Multiple Legends
In[10]: fig, ax = plt.subplots()
lines = []
styles = ['-', '--', '-.', ':']
x = np.linspace(0, 10, 1000)
for i in range(4):
lines += ax.plot(x, np.sin(x - i * np.pi / 2), styles[i], color='black')
ax.axis('equal')
# specify the lines and labels of the first legend
ax.legend(lines[:2], ['line A', 'line B'],
loc='upper right', frameon=False)
# Create the second legend and add the artist manually.
from matplotlib.legend import Legend
leg = Legend(ax, lines[2:], ['line C', 'line D'],
loc='lower right', frameon=False)
ax.add_artist(leg);
Customizing Colorbars
In[1]: import matplotlib.pyplot as plt
plt.style.use('classic')
In[2]: %matplotlib inline
import numpy as np
In[3]: x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(I)
plt.colorbar();
Choosing the colormap
In[4]: plt.imshow(I, cmap='gray');
def grayscale_cmap(cmap):
"""Return a grayscale version of the given colormap"""
cmap = plt.cm.get_cmap(cmap)
colors = cmap(np.arange(cmap.N))
def view_colormap(cmap):
"""Plot a colormap with its grayscale equivalent"""
cmap = plt.cm.get_cmap(cmap)
colors = cmap(np.arange(cmap.N))
cmap = grayscale_cmap(cmap)
grayscale = cmap(np.arange(cmap.N))
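The two function definitions above are truncated in these notes; a sketch of plausible completions (the luminance weights and the two-strip display are assumptions modeled on the usual grayscale-conversion approach):
from matplotlib.colors import LinearSegmentedColormap

def grayscale_cmap(cmap):
    """Return a grayscale version of the given colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    # convert the RGBA colors to perceived grayscale luminance
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)

def view_colormap(cmap):
    """Plot a colormap with its grayscale equivalent"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    grayscale = grayscale_cmap(cmap)(np.arange(cmap.N))
    fig, ax = plt.subplots(2, figsize=(6, 2),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow([colors], extent=[0, 10, 0, 1], aspect='auto')
    ax[1].imshow([grayscale], extent=[0, 10, 0, 1], aspect='auto')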
In[6]: view_colormap('jet')
In[7]: view_colormap('viridis')
In[8]: view_colormap('cubehelix')
In[9]: view_colormap('RdBu')
Multiple Subplots
In[8]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax) # births_by_date: a Series of average daily births, indexed by date
# Add labels to the plot
ax.annotate("New Year's Day", xy=('2012-1-1', 4100), xycoords='data',
xytext=(50, -30), textcoords='offset points', # xytext: position of the text, as an offset (in points) from xy
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=-0.2"))
Customizing Ticks
Major and Minor Ticks
In[1]: %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
In[2]: ax = plt.axes(xscale='log', yscale='log')
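Each axis has major and minor tick locator and formatter objects that control where ticks are placed and how they are labeled; a minimal sketch inspecting them (the axes are recreated here so the sketch is self-contained):
ax = plt.axes(xscale='log', yscale='log')
print(ax.xaxis.get_major_locator())    # locator for the major ticks (a LogLocator here)
print(ax.xaxis.get_minor_locator())    # locator for the minor ticks
print(ax.xaxis.get_major_formatter())  # formatter that turns major tick positions into labels
print(ax.xaxis.get_minor_formatter())  # formatter for the minor tick labels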
Since it is tedious to adjust these settings each time you create a plot, it is often
better to change the defaults.
Changing the Defaults: rcParams
Each time Matplotlib loads, it defines a runtime configuration (rc) containing the
default style for every plot element you create; this configuration can be adjusted at
any time using the plt.rc convenience routine.
In[4]: IPython_default = plt.rcParams.copy()
In[5]: from matplotlib import cycler
colors = cycler('color',
['#EE6666', '#3388BB', '#9988DD',
'#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='gray')
plt.rc('ytick', direction='out', color='gray')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
In[6]: x = np.random.randn(1000) # some data to plot (assumed here; not shown in the notes)
plt.hist(x);
Stylesheets
In[8]: plt.style.available[:5] # names of the first five available Matplotlib styles
Out[8]: ['fivethirtyeight',
'seaborn-pastel',
'seaborn-whitegrid',
'ggplot',
'grayscale']
The basic way to switch to a stylesheet is to call:
plt.style.use('stylename')
but keep in mind that this changes the style for the rest of the session. Alternatively,
the style context manager sets a style temporarily, only for the enclosed block:
with plt.style.context('stylename'):
make_a_plot()
Let’s create a function that will make two basic types of plot:
In[9]: def hist_and_lines():
np.random.seed(0)
fig, ax = plt.subplots(1, 2, figsize=(11, 4))
ax[0].hist(np.random.randn(1000))
for i in range(3):
ax[1].plot(np.random.rand(10))
ax[1].legend(['a', 'b', 'c'], loc='lower left')
Default style
In[10]: # reset rcParams
plt.rcParams.update(IPython_default);
Now let’s see how it looks (Figure 4-85):
In[11]: hist_and_lines()
FiveThirtyEight style
In[12]: with plt.style.context('fivethirtyeight'):
hist_and_lines()
Similarly we have ggplot, Bayesian Methods for Hackers style, Dark background,
Grayscale, Seaborn style
Three-Dimensional Plotting in Matplotlib
In[1]: from mpl_toolkits import mplot3d
In[2]: %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In[3]: fig = plt.figure()
ax = plt.axes(projection='3d')
Geographic Data with Basemap
The Basemap toolkit provides many functions for drawing map features, for example:
drawparallels()
Draw lines of constant latitude
drawmeridians()
Draw lines of constant longitude
drawmapscale()
Draw a linear scale on the map
• Whole-globe images
bluemarble()
Project NASA’s blue marble image onto the map
shadedrelief()
Project a shaded relief image onto the map
etopo()
Draw an etopo relief image onto the map
warpimage()
Project a user-provided image onto the map
Plotting Data on Maps
contour()/contourf()
Draw contour lines or filled contours
imshow()
Draw an image
pcolor()/pcolormesh()
Draw a pseudocolor plot for irregular/regular meshes
plot()
Draw lines and/or markers
scatter()
Draw points with markers
quiver()
Draw vectors
barbs()
Draw wind barbs
drawgreatcircle()
Draw a great circle
Example: California Cities
In[10]: import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
In[11]: from mpl_toolkits.basemap import Basemap # import needed for the Basemap class
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', # Lambert conformal projection, high resolution
lat_0=37.5, lon_0=-119,
width=1E6, height=1.2E6)
m.shadedrelief() #draw shaded satellite image
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area,cmap='Reds',
alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7) # Set the color limits of the current image.
# make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a,
label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
labelspacing=1, loc='lower left');
Visualization with Seaborn
In[1]: import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In[2]: # Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0) #cumulative sum of elements (partial sum of
sequence)
In[3]: # Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
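For comparison, the same plot can be drawn after switching on Seaborn's styling; a minimal sketch (seaborn is assumed to be installed):
import seaborn as sns
sns.set()   # apply seaborn's default style to all subsequent Matplotlib plots
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');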