08-MBA-DATA ANALYTICS - Data Science and Business Analysis - Unit 1
SEMESTER-I
All rights reserved. No part of this book may be reproduced in any form without permission
in writing from Team Lease Edtech Pvt. Ltd.
CONTENT
STRUCTURE
1.2 Introduction
1.3.5 Privacy
1.9.4 Regression
1.9.5 Heteroskedasticity
1.10 Summary
● Formulate and use appropriate models of data analysis to uncover hidden solutions to
business-related challenges
● Interpret data findings effectively to any audience, orally, visually, and in written
formats
1.2 INTRODUCTION
George E. P. Box states that “All models are wrong, but some of them are useful.” So what exactly
is “data science?” You must have heard the term many times from various people. Though
there seems to be no exact definition of data science, a lot of people have coined definitions.
Before we begin, let us understand a few basic concepts.
The word data is of Latin origin, and its literal meaning is “that which is given.” Data can
come from a single source or from many sources. Over time, the word has been used in many
contexts, and each use has put forth a slightly different meaning. One of the most used and
proven definitions of data is “something given or admitted”: anything that is a conclusion of
a study, observation, analysis, or occurrence.
UNESCO has also defined data: according to it, data are concepts, facts, and illustrations
represented in a formal manner. This formal representation is suitable for interpretation and
communication, and it can be processed by humans or by automatic methods.
But that's not the end of it. Over time, requirements have changed and data has taken a special
place in society. From the smallest of spaces to huge rooms, data covers a lot of details,
materials, specifics, proofs, things, and places. Data has led to the evolution of data science
(the study of data), and data science has led to technologies such as big data. In today's world,
big data is used to run businesses, and hence it is important to understand and learn this
concept. The increasing use of the internet and social media has made data an integral part of
our lives. The internet and social media platforms have created many channels of
information. It has not just increased the scope of business and work but has also played an
important role in generating more data. The use of such data has been increasing ever since.
Be it YouTube, Twitter, Instagram, or any other platform, data creation, collection, and
analysis are everywhere.
But you can’t simply say that everything around us is data. It is just a piece of information
until you add and integrate it into a complete study. Before the integration of analytics, data
is nothing more than just noise. And this chapter is all about helping you identify the
difference between data and information. It also plays an important role in helping you
understand how important it is to have a complete model and theory of things. It is already
proven that we live in a world where data is everything. If you want to convert information
into data accurately, remember that it is all about application and analytics. The data you use
for your business, studies, or any other purpose is based on well-founded theories and expert
judgments that lead to conclusions, and convert noise into data that can be used for business
growth, learning, and much more.
With extensive networks and usage, data has become very important to mankind. Therefore,
data science, the study and collection of data, has also become very important. For quite
some time now, data science has been transforming businesses. Various companies have
been using not just normal data but medical data to gather insights and important information
about individuals. The CODATA Task Group on Accessibility and Dissemination of Data
(CODATA/ADD) founded in 1975 states that the need for data categorization arose due to
the variety of data available. This group has also developed several methods to classify data
into different categories.
Figure 1.1
According to the officials of the CODATA Task Group on Accessibility and Dissemination
of Data (CODATA/ADD), the various types of data according to science are:
Figure 1.2
There is a lot of data that can be categorized with time as its main factor. Time being the
main factor, there are broadly two types of data that can be defined. One is the Time
Independent Data and the second one is Time-Dependent Data. Let us briefly describe both.
Figure 1.3
● Time-independent data
Time-independent data can be understood from its name: it is data that is independent of
time. To understand this, consider geoscience and astronomy. In both these sciences, time is
not very important, as these fields deal with rocks, stars, and geological structures.
● Time-dependent data
The next category is time-dependent data, where events happen only once in a long while and
rarely recur. Since recurrence is infrequent, time is a very important factor.
● Location-independent data
You might have studied different concepts in Physics and Chemistry. All of those things/
concepts are based on analytical data and are independent of location. This is how science
defines data independent of location.
● Location-dependent data
On the other hand, there is a lot of data that depends on location. Everything you study in
astronomy or earth science belongs to this category. Likewise, people who study rocks and
their composition, or how old a particular rock is, are gathering data that is location-specific.
Hence, such data is location-dependent.
Now there is another category of data defined by science, and for data scientists, it is very
important to not just understand it, but also make use of this category in their work and
studies. Data is also defined and categorized based on the mode of generation. There are
three modes of data generation. These are listed below:
1. Primary Data - As the name suggests, primary data is obtained from observations and
experiments. These are taken from the values determined by the experiment based on
samples. For example, time, length, or velocity; all of these are primary data derived
after conducting experiments.
2. Derived Data - Then comes derived data. In many areas, derived data is also
known as reformatted data. It combines observations with theoretical models to reach
conclusions. As the name suggests, derived data is obtained from a set of observations.
Figure 1.4
3. Predicted Data - Theoretical data is also known as predicted data and is derived from
various theoretical calculations. In studies of this kind, data is obtained from rigorous
calculations that make use of basic or fundamental constants, e.g., any data that is
computed rather than observed, such as data involving celestial mechanics.
Figure 1.5
There are many "Vs" in big data: three of them are volume, velocity, and variety. Big data
pushes up against the computing ability of typical databases. This is the element of volume.
The scale at which data is produced is mind-boggling. Google's Eric Schmidt noted that until
2003, just five exabytes of data had been produced by all human beings (an exabyte is
1000^6 bytes, or a billion billion bytes). Today, we produce five exabytes of data every two
days.
The key explanation behind this is the proliferation of "interaction" data, a modern
development in comparison to "transaction" data. Interaction data is generated by tracking
events in our increasingly interactive everyday lives, such as browsing activity, geolocation
data, RFID data, sensors, personal digital recorders such as Fitbits and tablets, satellites, etc.
We are living in the "Internet of Things" (or IoT) today, and it is generating vast volumes of
information, all of which we seem to have an endless desire to examine. Additional Vs of big
data are discussed in other quarters.
Figure 1.6
A good data scientist will be adept at managing volume not just technically in a database
sense, but by building algorithms to make intelligent use of the size of the data as efficiently
as possible. Things change when you have gargantuan data because almost all correlations
become significant, and one might be tempted to draw spurious conclusions about causality.
For many modern business applications today, extraction of correlation is sufficient, but good
data science involves techniques that extract causality from these correlations as well.
Figure 1.7
In many cases, detecting correlations is useful as is. For example, consider the classic case of
Google Flu Trends, see Figure. The figure shows the high correlation between flu incidence
and searches about "flu" on Google; see Ginsberg et al. (2009). Searches on the keyword
“flu” do not result in the flu itself! Of course, the incidence of searches on this keyword is
influenced by flu outbreaks. The interesting point here is that even though searches about flu
do not cause flu, they correlate with it, and may at times even be predictive of it, simply
because searches may lead actual reported levels of flu: the two occur concurrently, but
official flu figures take time to be reported. And while searches may be predictive, the cause
of the searches is the flu itself, one variable feeding on the other in a repeating cycle. Hence,
prediction is a major outcome of correlation, and has led to the recent buzz around the
subfield of "predictive analytics." There are entire conventions devoted to this facet of
correlation, such as the wildly popular PAW (Predictive Analytics World). Pattern
recognition is in; causality is passé.
Data velocity is accelerating. Streams of tweets, Facebook entries, financial information, etc.,
are being generated by more users at an ever-increasing pace. Whereas velocity increases
data volume, often exponentially, it might shorten the window of data retention or
application. For example, high-frequency trading relies on information at the micro-second timescale.
Figure 1.8
Finally, the variety in data is much greater than ever before. Models that relied on just a
handful of variables can now avail of hundreds of variables, as computing power has
increased. The scale of change in volume, velocity, and variety of the data now available
calls for new econometrics, and a range of tools even for single questions. This book aims to
introduce the reader to a variety of modeling concepts and econometric techniques essential
for a well-rounded data scientist.
Data science is more than the mere analysis of large data sets. It is also about the creation of
data. The field of "text-mining" expands available data enormously since there is so much
more text being generated than numbers. The creation of data from varied sources, and its
quantification into information is known as “datafication.”
Data science is also more than “machine learning,” which is about how systems learn from
data. Systems may be trained to use data to make decisions, and training is a continuous
process, where the system updates its learning and (hopefully) improves its decision-making
ability with more data. A spam filter is a good example of machine learning. As we feed it
more data it keeps changing its decision rules, using a Bayesian filter, thereby remaining
ahead of the spammers. It is this ability to adaptively learn that prevents spammers from
gaming the filter, as highlighted in Paul Graham’s interesting essay titled “A Plan for Spam”.
Credit card approvals are also based on neural networks, another popular machine learning
technique.
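As a toy illustration of the learning idea (not the production filters described above), a naive
Bayes classifier can be fit in R with the e1071 package on a few hand-made word-presence
features; the words and labels below are purely hypothetical:

library(e1071)
# tiny labelled training set: word-presence features and a spam/ham label
train = data.frame(
  free  = factor(c("yes", "yes", "no", "no", "no", "yes")),
  offer = factor(c("yes", "no",  "no", "yes", "no", "yes")),
  class = factor(c("spam", "spam", "ham", "ham", "ham", "spam"))
)
filter = naiveBayes(class ~ ., data = train)     # learn word-given-class probabilities
predict(filter, train[1, c("free", "offer")])    # classify a message containing both words

Feeding the model more labelled messages and refitting is the adaptive updating the text
refers to.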
However, machine-learning techniques favour data over judgment, and good data science
requires a healthy mix of both. Judgment is needed to accurately contextualize the setting for
analysis and to construct effective models. A case in point is Vinny Bruzzese, known as the
“mad scientist of Hollywood” who uses machine learning to predict movie revenues. He
asserts that mere machine learning would be insufficient to generate accurate predictions. He
complements machine learning with judgment generated from interviews with screenwriters,
surveys, etc., "to hear and understand the creative vision, so our analysis can be
contextualized.”
Machine intelligence is re-emerging as the new incarnation of AI (a field that many feel has
not lived up to its promise). Machine learning promises and has delivered on many questions
of interest, and is also proving to be quite a game-changer, as we will see later in this chapter
and as discussed in many preceding examples. What makes it so appealing? Hilary Mason
suggests four characteristics of machine intelligence that make it interesting: it is usually
based on a theoretical breakthrough and is therefore well-grounded in science; it changes the
existing economic paradigm; it results in commoditization (e.g., Hadoop); and it makes
available new data that leads to further data science.
1.3.1 Supervised and Unsupervised Learning
Figure 1.9
Systems may learn in two broad ways, through “supervised” and “unsupervised” learning. In
supervised learning, a system produces decisions (outputs) based on input data. Both spam
filters and automated credit card approval systems are examples of this type of learning. So is
linear discriminant analysis (LDA). The system is given a historical data sample of inputs
and known outputs, and it "learns" the relationship between the two using machine learning
techniques, of which there are several. Judgment is needed to decide which technique is most
appropriate for the task at hand.
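As a concrete sketch, linear discriminant analysis can be run in R with the lda function from
the MASS package; here the built-in iris data stands in for a labelled business data set such as
approved/declined credit applications:

library(MASS)
fit = lda(Species ~ ., data = iris)      # learn the input-output relationship from labelled data
pred = predict(fit, iris)$class          # decisions produced from the inputs
table(Predicted = pred, Actual = iris$Species)   # how well the learned rule reproduces the labels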
Unsupervised learning is a process of reorganizing and enhancing the inputs to place
structure on unlabelled data. A good example is cluster analysis, which takes a collection of
entities, each with several attributes, and partitions the entity space into sets or groups based
on the closeness of the attributes of all entities. It reorganizes the data, but it also enhances
the data by labelling the data with additional tags (in this case a cluster number/name). Factor
analysis is also an unsupervised learning technique. The origin of this terminology is unclear,
but it presumably arises from the fact that there is no clear objective function that is
maximized or minimized in unsupervised learning, so no “supervision” is required to reach
an optimal. However, this is not necessarily true in general, and we will see examples of
unsupervised learning (such as community detection in the social web), where the outcome
depends on measurable objective criteria.
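A minimal sketch of cluster analysis in R uses the built-in kmeans function on the iris
measurements; the cluster number attached to each row is exactly the added label referred to
above:

set.seed(1)
km = kmeans(iris[, 1:4], centers = 3)   # partition on the closeness of the four attributes
head(km$cluster)                        # the cluster label attached to each observation
table(km$cluster, iris$Species)         # compare the unsupervised labels with the known species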
Figure 1.10
Data science is about making predictions and forecasts. There is a difference between the
two. The statistician-economist Paul Saffo has suggested that predictions aim to identify one
outcome, whereas forecasts encompass a range of outcomes. To say that “it will rain
tomorrow” is to make a prediction, but to say that “the chance of rain is 40%” (implying that
the chance of no rain is 60%) is to make a forecast, as it lays out the range of possible
outcomes with probabilities. We make weather forecasts, not predictions. Predictions are
statements of great certainty, whereas forecasts exemplify the range of uncertainty. In the
context of these definitions, the term predictive analytics is a misnomer for its goal is to
make forecasts, not mere predictions.
1.3.3 Innovation and Experimentation
Data science is about new ideas and approaches. It merges new concepts with fresh
algorithms. Take for example the A/B test, which is nothing but the online implementation of
a real-time focus group. Different subsets of users are exposed to A and B stimuli
respectively, and responses are measured and analyzed. It is widely used for website design.
This approach has been in place for more than a decade, and in 2011 Google ran more than
7,000 A/B tests. Facebook, Amazon, Netflix, and several other firms use A/B testing widely.
The social web has become a teeming ecosystem for social science experiments. The
potential to learn about human behaviour using innovative methods is much greater now than
ever before.
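As a small sketch of how an A/B comparison might be analysed in R, consider purely
hypothetical counts of conversions out of 1,000 visitors shown each version; a two-sample
test of proportions summarizes the evidence:

ab = prop.test(x = c(52, 79), n = c(1000, 1000))  # conversions under versions A and B
ab    # reports the two proportions, a confidence interval, and a p-value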
Figure 1.11
A good data scientist will take care not to overreach when drawing conclusions from big data.
Because there are so many variables available, and plentiful observations, correlations are
often statistically significant but devoid of basis. In the immortal words of the bard, empirical
results from big data may be "a tale told by an idiot, full of sound and fury, signifying
nothing." One must be careful not to read too much into the data. More data does not
guarantee less noise, and signal extraction may be no easier than with less data.
Adding more columns (variables in the cross-section) to the data set, but not more rows (time
dimension) is also fraught with danger. As the number of variables increases, more
characteristics are likely to be related statistically. Overfitting models in-sample is much
more likely with big data, leading to poor performance out-of-sample.
Researchers also have to be careful to explore the data fully, and not terminate their research
the moment a viable result, especially one that the researcher is looking for, is attained. With
big data, the chances of stopping at a suboptimal or, worse, an intuitively appealing yet
wrong result become very high. It is like asking a class of students a question. In a very
large college class, the chance that someone will quickly provide a plausible yet off-base
answer is very high, which often short-circuits the opportunity for others in the class to think
more deeply about the question and provide a much better answer.
Figure 1.12
Nassim Taleb describes these issues elegantly - “I am not saying there is no information in
big data. There is plenty of information. The problem – the central issue – is that the needle
comes in an increasingly larger haystack.” The fact is, one is not always looking for needles
or Taleb’s black swans, and there are plenty of normal phenomena about which robust
forecasts are made possible by the presence of big data.
1.3.5 Privacy
Figure 1.13
The emergence of big data coincides with a gigantic erosion of privacy. Humankind has
always been torn between the need for social interaction and the urge for solitude and
privacy. One trades off against the other. Technology has simply sharpened the divide and
made the slope of this trade-off steeper. It has provided tools of social interaction that steal
privacy much faster than in the days before the social web.
Rumors and gossip are now old-world phenomena. They required bilateral transmission. The social
web provides multilateral revelation, where privacy no longer capitulates a battle at a time,
but the entire war is lost at one go. And data science is the tool that enables firms,
governments, individuals, benefactors and predators, et al, en masse, to feed on privacy’s
carcass.
The loss of privacy is manifested in the practice of human profiling through data science. Our
web presence increases entropically as we move more of our life’s interactions to the web, be
they financial, emotional, organizational, or merely social. And as we live more and more of
our lives in this new social media, data mining and analytics enable companies to construct
very accurate profiles of who we are, often better than what we might do ourselves. We are
moving from "know thyself" to knowing everything about almost everyone.
If you have a Facebook or Twitter presence, rest assured you have been profiled. For
instance, let’s say you tweeted that you were taking your dog for a walk. Profiling software
now increments your profile with an additional tag - pet owner. An hour later you tweet that
you are returning home to cook dinner for your kids. Your profile is now further tagged as a
parent. As you can imagine, even a small Twitter presence ends up being dramatically
revealing about who you are. Information that you provide on Facebook and Twitter, your
credit card spending pattern, and your blog, allows the creation of a profile that is accurate
and comprehensive, and probably more objective than the subjective and biased opinion you
have of yourself. A machine knows better. And you are the product!
Humankind leaves an incredible trail of “digital exhaust” comprising phone calls, emails,
tweets, GPS information, etc., that companies use for profiling. It is said that 1/3 of people
have a digital identity before being born, initiated with the first sonogram from a routine
hospital visit by an expectant mother. The half-life of non-digital identity or the average age
of digital birth is six months, and within two years 92% of the US population has a digital
identity.
Those of us who claim to be safe from revealing our privacy by avoiding all forms of social
media are simply profiled as agents with a "low digital presence." It might be interesting to
ask such people whether they would like to reside in a profile bucket that is more likely to
attract government interest than a profile bucket with a more average digital presence. In this
age of profiling, the best way to remain inconspicuous is not to hide, but to remain as average
as possible, to be mostly lost within a large herd.
Privacy is intricately and intrinsically connected to security and efficiency. The increase in
transacting on the web, and the confluence of profiling, has led to massive identity theft. Just
as in the old days, when a thief picked your lock and entered your home, most of your
possessions were at risk. It is the same with electronic break-ins, except that there are many
more doors to break in from and so many more windows through which an intruder can
unearth revealing information. And unlike a thief who breaks into your home, a hacker can
reside in your electronic abode for quite some time without being detected, an invisible
parasite slowly causing damage. While you are blind, you are being robbed blind. And unlike
stealing your worldly possessions, stealing your very persona and identity is the cruellest cut
of them all.
Based on buyers’ profiles, the seller will offer each buyer the price he is willing to pay on the
demand curve. Profiling helps sellers capture consumer surplus and eat into the region of
missed sales. Targeting brings benefits to sellers and they actively pursue it. The benefits
outweigh the costs of profiling, and the practice is widespread as a result. Profiling also fine-
tunes price segmentation, and rather than break buyers into a few segments, usually two,
each profile becomes a separate segment, and the granularity of price segmentation is
modulated by the number of profiling groups the seller chooses to model.
Of course, there is an insidious aspect to profiling, which has existed for quite some time,
such as targeting conducted by tax authorities. Also, we will not take kindly to insurance
companies profiling us any more than they already do. Profiling is also undertaken to snare
terrorists. However, there is a danger in excessive profiling. A very specific profile for a
terrorist makes it easier for their ilk to game detection as follows: send several possible
suicide bombers through airport security and see who is repeatedly pulled aside for
screening and who is not. Repeating this exercise enables a terrorist cell to learn which
candidates do not fall into the profile. They may then use them for the execution of a terror
act, as they are unlikely to be picked up for the special screening. The antidote?
Randomization of people picked for a special screening in searches at airports, which makes
it hard for a terrorist to always assume no likelihood of detection through screening.
Automated invasions of privacy naturally lead to human responses, not always rational or
predictable. This is articulated in Campbell’s Law: “The more any quantitative social
indicator (or even some qualitative indicator) is used for social decision-making, the more
subject it will be to corruption pressures and the more apt it will be to distort and corrupt the
social processes it is intended to monitor." We are in for an interesting period of interaction
between man and machine, where the battle for privacy will take centre stage.
My view of data science is one where theories are implemented using data, some of it big
data. This is embodied in an inference stack comprising (in sequence): theories, models,
intuition, causality, prediction, and correlation. The first three constructs in this chain are
from Emanuel Derman’s wonderful book on the pitfalls of models.
Theories are statements of how the world should be or is and are derived from axioms that
are assumptions about the world, or precedent theories. Models are implementations of
theory, and in data science are often algorithms based on theories that run on data. The
results of running a model lead to intuition, i.e., a deeper understanding of the world based on
theory, model, and data. Whereas there are schools of thought that suggest data is all we
need and theory is obsolete, this author disagrees. Still, the unreasonable effectiveness of big
data cannot be denied. Chris Anderson, in his Wired magazine article "The End of Theory,"
argues that with enough data, correlations are sufficient and traditional models become
unnecessary.
In contrast, the academic Thomas Davenport writes in his foreword to Siegel (2013) that
models are key, and should not be eschewed as data grows:
But the point of predictive analytics is not the relative size or unruliness of your data, but
what you do with it. I have found that “big data often means small math,” and many big data
practitioners are content just to use their data to create some appealing visual analytics.
That’s not nearly as valuable as creating a predictive model.
Once we have established intuition for the results of a model, it remains to be seen whether
the relationships we observe are causal, predictive, or merely correlational. Theory may be
causal and tested as such. Granger (1969) causality is often stated in mathematical form
for two stationary time series of data as follows. In a standard bivariate system,

Y_t = a_1 + Σ_{j=1}^{p} b_j Y_{t−j} + Σ_{j=1}^{p} c_j X_{t−j} + u_t
X_t = a_2 + Σ_{j=1}^{p} d_j X_{t−j} + Σ_{j=1}^{p} g_j Y_{t−j} + v_t

X is said to Granger-cause Y if the coefficients c_j on the lagged values of X are jointly
significantly different from zero, i.e., if lagged X adds explanatory power for Y over and
above lagged Y. Causality is a hard property to establish, even with a theoretical foundation,
as the causal effect has to be well-entrenched in the data.
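In R, such a test can be run with the grangertest function from the lmtest package; its
built-in ChickEgg data set gives a quick illustration (a sketch, not tied to the data used
elsewhere in this chapter):

library(lmtest)
data(ChickEgg)                             # annual chicken population and egg production
grangertest(egg ~ chicken, order = 3, data = ChickEgg)   # do lagged chickens help explain eggs?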
And then there is correlation, at the end of the data science inference chain.
Contemporaneous movement between two variables is quantified using correlation. In many
cases, we uncover correlation, but no prediction or causality. A correlation has great value to
firms attempting to tease out beneficial information from big data. And even though it is a
linear relationship between variables, it lays the groundwork for uncovering nonlinear
relationships, which are becoming easier to detect with more data. The surprising parable
about Walmart finding that purchases of beer and diapers seem to be highly correlated
resulted in these two somewhat oddly-paired items being displayed on the same aisle in
supermarkets. Unearthing correlations of sales items across the population quickly leads to
different business models aimed at exploiting these correlations, such as my book buying
inducement from Barnes & Noble, where my “fly and buy” predilection is easily exploited.
Correlation is often all we need, eschewing human cravings for causality. As Mayer-
Schönberger and Cukier (2013) so aptly put it, we are satisfied “... not knowing why but only
what.”
In the data scientist mode of thought, relationships are multifaceted correlations amongst
people. Facebook, Twitter, and many other platforms map human relationships
using graph theory, exploiting the social web in an attempt to understand better how people
relate to each other, intending to profit from it. We use correlations on networks to mine the
social graph, understanding better how different social structures may be exploited. We
answer questions such as where to seed a new marketing campaign, which members of a
network are more important than the others, how quickly will information spread on the
network, i.e., how strong is the “network effect”?
Data science is about the quantization and understanding of human behaviour, the holy grail
of social science. In the following chapters, we will explore a wide range of theories,
techniques, data, and applications of a multi-faceted paradigm. We will also review the new
technologies developed for big data and data science, such as distributed computing using the
Dean and Ghemawat (2004) MapReduce paradigm developed at Google, and implemented as
the open-source project Hadoop at Yahoo!. When data gets super-sized, it is better to move
algorithms to the data than the other way around. Just as big data has inverted database
paradigms, so is big data changing the nature of inference in the study of human behaviour.
Ultimately, data science is a way of thinking, for social scientists, using computer science.
Business analysis requires the use of numerous mathematical methods, from arithmetic and
calculus to statistics and econometrics, with implementations in diverse programming
languages and applications. It calls for strategic skills as well as strong reasoning and the
capacity to pose insightful questions and to deploy data to address questions.
The presence of the web as the main forum for business and marketing has spawned massive
volumes of data, pushing companies to strive to leverage vast knowledge stores to improve
their competitive advantage. As a result, corporations in Silicon Valley (and elsewhere)
recruit a new form of employee known as "data scientists" whose job is to evaluate "big data"
using methods such as those you can learn in this course.
This chapter discusses some of the geometry, statistics, linear algebra, and equations that
you might not have seen for several years. It is more enjoyable than it seems. We will even
learn how to use certain mathematical packages along the way. We will review some of the
typical equations and analyses that you may have encountered in classes you have previously
taken. You will refresh some old ideas, discover new ones, and grow technically proficient
with the software.
It is necessary to start with the basic mathematical constant e = 2.718281828..., which
underlies the exponential function exp(·). This function is also written as e^x, where x may
be a real or complex variable. It occurs in many fields, particularly in finance, where it is
used for continuous compounding and discounting of capital at an interest rate r over a
period t.
The percent change in y: since the natural logarithm ln(·) is the inverse of the exponential
function, if y = e^x then ln(y) = x, and dy/y = dx is the percent change in y. Recall also that
the first derivative of e^x is e^x itself. Continuous compounding arises as the limit

lim_{n→∞} (1 + r/n)^{nt} = e^{rt}

This is the forward value of one dollar. Present value is just the reverse. Therefore, the price
today of a dollar received t years from today is P = e^(−rt). The yield of a bond is:

r = −(1/t) ln(P)
In bond mathematics, the negative of the percentage price sensitivity of a bond to changes in
interest rates is known as “Duration”:

D = −(1/P) (dP/dr) = −(1/e^(−rt)) (−t e^(−rt)) = t

The derivative dP/dr is the price sensitivity of the bond to changes in interest rates and is
negative. Further dividing this by P gives the percentage price sensitivity. The minus sign in
front of the definition of duration converts the negative number to a positive one.
The “Convexity” of a bond is the percentage price sensitivity based on the second derivative,
i.e.,

C = (1/P) (d²P/dr²) = (1/e^(−rt)) (t² e^(−rt)) = t²

Because the second derivative is positive, we know that the bond pricing function is convex.
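These formulas are easy to verify numerically in R, for example with r = 0.05 and t = 10:

r = 0.05; t = 10
P = exp(-r * t)                       # price today of one dollar received in t years
-(1/t) * log(P)                       # recovers the yield, 0.05
dP = function(r) -t * exp(-r * t)     # first derivative of P with respect to r
-(1/P) * dP(r)                        # duration, equal to t = 10
d2P = function(r) t^2 * exp(-r * t)   # second derivative of P with respect to r
(1/P) * d2P(r)                        # convexity, equal to t^2 = 100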
The normal distribution is the workhorse of many models in the social sciences and is
assumed to generate much of the data that comprises the big data universe. Interestingly, most
phenomena (variables) in the real world are not normally distributed. They tend to be “power
law” distributed, i.e., many observations of low value, and very few of high value. The
probability distribution declines from left to right and does not have the characteristic hump
shape of the normal distribution. An example of data that is distributed thus is income
distribution (many people with low income, very few with high income). Other examples are
word frequencies in languages, population sizes of cities, number of connections of people in
a social network, etc.
Still, we do need to learn about the normal distribution because it is important in statistics,
and the central limit theorem does govern much of the data we look at. Examples of
approximately normally distributed data are stock returns and human heights.
If x ∼ N(μ, σ²), that is, x is normally distributed with mean μ and variance σ², then the
probability “density” function for x is:

f(x) = (1/(σ √(2π))) exp(−(x − μ)²/(2σ²))

and the standard normal distribution is the special case with μ = 0 and σ = 1.
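In R, the functions dnorm, pnorm, and rnorm give the normal density, cumulative
distribution, and random draws; a quick check of the formula above:

set.seed(1)
x = rnorm(100000, mean = 0, sd = 1)   # draws from the standard normal
c(mean(x), sd(x))                     # close to 0 and 1
dnorm(0)                              # density at the mean, 1/sqrt(2*pi), about 0.3989
pnorm(1.96)                           # cumulative probability, about 0.975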
We will be using linear algebra in many of the models that we explore in this book. Linear
algebra requires the manipulation of vectors and matrices. We will also use vector calculus.
Vector algebra and calculus are very powerful methods for tackling problems that involve
solutions in spaces of several variables, i.e., in high dimensions. The parsimony of using
vector notation will become apparent as we proceed. This introduction is very light and
meant for a reader who is mostly uninitiated in linear algebra.
Consider a vector of stock returns R = (R_1, R_2, . . ., R_N)′. This is a random vector,
because each return R_i, i = 1, 2, . . ., N, comes from its own distribution, and the returns of
all these stocks are correlated. This random vector's probability law is represented as a joint
or multivariate probability distribution. Note that we use a bold font to
denote the vector R.
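A small simulated sketch of such a random vector in R (the numbers are illustrative only):

set.seed(1)
R = matrix(rnorm(3 * 250, mean = 0.001, sd = 0.02), ncol = 3)  # 250 observations of 3 returns
colnames(R) = c("R1", "R2", "R3")
colMeans(R)   # the mean vector
cov(R)        # the covariance matrix describing how the returns move together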
In this chapter, we develop some expertise in using the R statistical package. There are many
tutorials available now on the web. See the manuals on the R website www.r-project.org.
There is also a great book titled “The Art of R Programming” by Norman Matloff. Another
useful book is “Machine Learning for Hackers” by Drew Conway and John Myles White.
If you want to directly access the system you can issue system commands as follows:
system (“<command>" )
For example, a command such as system("ls -lt | grep <lastname>") will list all directory
entries that contain the given last name in reverse chronological order. Here a Unix command
is being used, so this will not work on a Windows machine, but it will certainly work on a
Mac or Linux box.
However, you are hardly going to be issuing commands at the system level, so you are
unlikely to use the system command very much.
To get started, we need to grab some data. Go to Yahoo! Finance and download some
historical data in an Excel spreadsheet, re-sort it into chronological order, then save it as a
CSV file. Read the file into R as follows:
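A minimal sketch, assuming the file was saved under the hypothetical name stockdata.csv:

stk = read.csv("stockdata.csv")   # read the downloaded CSV file into a data frame
stk = stk[nrow(stk):1, ]          # reverse the row order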
The last command reverses the sequence of the data if required. We can download stock data
using the quantmod package. Note: to install a package you can use the drop-down menus on
Windows and Mac operating systems, and use a package installer on Linux. Or issue the
following command:
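In R, a package is installed with install.packages; for quantmod, and then loading it and
fetching the data (the last two lines are an assumption about how the YHOO object used
below was created):

install.packages("quantmod")
library(quantmod)
getSymbols("YHOO")   # creates an object YHOO; its sixth column holds the adjusted close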
Figure 1.14
>yhoo = as.matrix(YHOO[,6])
We now go ahead and concatenate columns of data into one stock data set.
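Assuming aapl, csco, and ibm were extracted from their tickers in the same way as yhoo
above, one way to combine them is:

stkdata = cbind(yhoo, aapl, csco, ibm)   # one matrix with a column per stock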
Now, compute daily returns. This time, we compute log returns, i.e., continuously
compounded returns. The mean returns are:
> n = length(stkdata[,1])
>n
> rets = log(stkdata[2:n,]/stkdata[1:(n−1),])
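For instance, the mean of each return column can be shown to four significant digits with a
command like:

> print(colMeans(rets), digits = 4)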
Notice the print command that allows you to choose the number of significant digits.
For more flexibility and better handling of data files in various formats, you may also refer to
the readr package. It has many useful functions.
Finding roots of nonlinear equations is often required, and R has several packages for this
purpose. Here we examine a few examples.
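As one illustration, base R's uniroot function finds a root of a function on a given interval:

f = function(x) x^3 - 2 * x - 5
uniroot(f, interval = c(1, 3))$root    # approximately 2.0946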
Figure 1.15
1.9.4 Regression
min_β e′e = (Y − Xβ)′ (Y − Xβ) = Y′(Y − Xβ) − (Xβ)′(Y − Xβ)

Note that this expression is a scalar. Differentiating w.r.t. β gives the following f.o.c.:

−2X′Y + 2X′Xβ = 0
⟹ β = (X′X)⁻¹ (X′Y)
Figure 1.15
The variance of each estimated coefficient is given by the diagonal of Var(β̂) = σ²(X′X)⁻¹;
one should compute this and check that each coefficient in the regression is statistically
significant.
Example: Let’s do a regression and see whether AAPL, CSCO, and IBM can explain the
returns of YHOO. This uses the data we had downloaded earlier.
> Y = as.matrix(rets[,1])
> X = as.matrix(rets[,2:4])   # assuming AAPL, CSCO, and IBM are columns 2-4 of rets
> X = cbind(matrix(1,n,1),X)
Here is a simple regression run on some data from the 2005-06 NCAA
basketball season, i.e., the March Madness statistics. The data is stored in a space-delimited file
called ncaa.txt. We take the performance metric to be the number of games played, with
more successful teams playing more playoff games, and then try to see which variables
explain it best. We apply a simple linear regression using the R command lm, which
stands for “linear model.”
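A sketch of the call, with GMS used as a stand-in name for the games-played column (the
actual column names in ncaa.txt may differ):

ncaa = read.table("ncaa.txt", header = TRUE)   # read the space-delimited file
res = lm(GMS ~ ., data = ncaa)                 # regress games played on all other columns
res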
The command lm returns an "object", here given the name res. This object contains
various details about the regression result, and can then be passed to other functions that will
format and present various versions of the result. For example, using the following command
gives a nicely formatted version of the regression output, and you should try to use it when
presenting regression results.
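In R, the summary function provides such a formatted report for an lm object:

> summary(res)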
(The output is not shown here to not repeat what we saw in the previous regression.) Data
frames are also objects. Here, objects are used in the same way as the term is used in object-
oriented programming (OOP), and similarly, R supports OOP as well.
Direct regression implementing the matrix form is as follows (we had derived this earlier):
wuns = matrix(1, n, 1)
X = cbind(wuns, x)
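The estimator β = (X′X)⁻¹(X′Y) derived earlier maps directly into R's matrix operators:

b = solve(t(X) %*% X) %*% (t(X) %*% Y)   # beta = (X'X)^(-1) (X'Y)
b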
Note that this is the same result as we had before, but it gave us a chance to look at some of
the commands needed to work with matrices in R.
1.9.5 Heteroscedasticity
Simple linear regression assumes that the standard error of the residuals is the same for all
observations. Many regressions suffer from the failure of this condition. The term for this is
“heteroskedastic” errors: “hetero” means different, and “skedastic” refers to the scatter, or
variance, of the errors. In other words, the error variance differs across observations.
We can first test for the presence of heteroscedasticity using a standard Breusch-Pagan test
available in R. This resides in the lmtest package, which is loaded before running the test.
We can see that there is very little evidence of heteroscedasticity in the standard errors, as the
p-value is not small. However, let's go ahead and correct the t-statistics for
heteroscedasticity as follows, using the hccm function (from the car package). The name
“hccm” stands for heteroscedasticity-corrected covariance matrix.
> z = cbind(wuns,x)
> tstats
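A sketch of the full sequence, assuming res is the fitted lm object from the regression above
(bptest comes from the lmtest package and hccm from the car package):

library(lmtest)    # provides bptest
library(car)       # provides hccm
bptest(res)                            # Breusch-Pagan test for heteroscedasticity
vc = hccm(res)                         # heteroscedasticity-corrected covariance matrix
tstats = coef(res) / sqrt(diag(vc))    # corrected t-statistics
tstats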
Resolving heteroscedasticity
● Regression with robust standard errors – the ordinary least squares coefficient estimates
are kept, but the standard errors are replaced with heteroscedasticity-robust ones, since
under heteroscedasticity OLS no longer produces the best linear unbiased estimators.
● Generalized least squares with an unknown form of variance – feasible generalized least
squares estimators model the error variance and recover the best linear unbiased estimator.
When data is autocorrelated, i.e., has a dependence on time, not accounting for it results in
spuriously high statistical significance. Intuitively, this is because observations are treated
as independent when they are correlated in time, and therefore, the true number of
observations is effectively less.
In efficient markets, the correlation of returns from one period to the next should be close to
zero. We use the returns stored in the variable rets (based on Google stock) from much
earlier in this chapter.
In the data there only seems to be statistical significance at the eighth lag. We may regress
leading values on lags to see if the coefficient is significant.
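A minimal sketch, assuming rets here is a single vector of returns:

acf(rets)                                   # sample autocorrelations of the return series, by lag
n = length(rets)
summary(lm(rets[2:n] ~ rets[1:(n - 1)]))    # regress leading values on one lag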
The autocorrelation can be resolved in various ways, for example:
● By including dummy variables in the data
Vector auto-regression is also known as VAR (not the same thing as Value-at-Risk, denoted VaR). VAR is useful for
estimating systems where there are simultaneous regression equations, and the variables
influence each other. So in a VAR, each variable in a system is assumed to depend on lagged
values of itself and the other variables. The number of lags may be chosen by the
econometrician based on what is the expected decay in time-dependence of the variables in
the VAR.
In the following example, we examine the inter-relatedness of the returns of the following three
tickers: SUNW, MSFT, IBM. For vector auto-regressions (VARs), we run the following R
commands:
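The commands are along the following lines (a sketch, assuming rets is a matrix whose
columns are the SUNW, MSFT, and IBM return series):

res = ar(rets)    # multivariate auto-regression; the lag order is chosen by AIC
res$order         # the chosen order
res$ar            # coefficients on the lagged values of each series
res$aic           # AIC by lag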
The “order” of the VAR is how many lags are significant. In this example, the order is 1.
Hence, when the “ar” command is given, it shows the coefficients on the lagged values of the
three series at just one lag. For example, for SUNW, the lagged coefficients are -0.0098,
0.0222, and 0.0021, respectively, on lagged SUNW, MSFT, and IBM. The Akaike Information
Criterion (AIC) tells us which lag is significant, and we see below that this is lag 1.
Interestingly, we see that each of the tickers has a negative relation to its lagged value, but a
positive correlation with the lagged values of the other two stocks. Hence, there is positive
cross autocorrelation amongst these tech stocks.
Prediction trees are the natural outcome of recursive partitioning of the data. They are also a
basic form of clustering carried out at successive stages. Standard cluster analysis results in a
"flat" partition, whereas recursive node splitting creates a multi-level tree of clusters. The
approach used here is CART, which stands for classification and regression trees. Yet
prediction trees are distinct from vanilla clustering in a significant respect – there is a
dependent variable, i.e. a category or a value (e.g. a score) that one is trying to forecast.
Suppose we want to estimate the credit score of a person using age, salary, and education as
explanatory variables, and suppose salary is the best predictor of the three. Then, at the top of
the tree, salary would be the branching variable, i.e., if salary is less than some threshold, we
move down the left branch of the tree; otherwise, we go down the right. It could be that at the
second level we use education to create the next bifurcation, and at the third level we use age.
A variable can also be used again at more than one level.
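A sketch of how such a tree might be fit in R, using the rpart package and a hypothetical data
frame credit with columns score, age, salary, and education:

library(rpart)
# credit is a hypothetical data frame with columns score, age, salary, and education
fit = rpart(score ~ age + salary + education, data = credit)   # recursive partitioning
plot(fit)
text(fit)    # draw the fitted tree with its splitting rules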
1.10 SUMMARY
Data science uses complex machine learning algorithms to build predictive models. In
practice, data science is already helping the airline industry predict travel disruptions. Data
science is an essential part of any industry given the massive amounts of data that are
produced. Data science is one of the most discussed issues in the industry these days. Its
prevalence has increased over the years, and businesses have begun to adopt data science
strategies to grow their market and enhance customer loyalty. Idea analysis is the first
phase of the data science process; its purpose is to clarify the problem by studying the
business model. Because raw data is often not directly usable, data processing is the
most critical part of the data science lifecycle. The data scientist must first review the data
and find any holes or data that may not add meaning. Using various analytical tools and
techniques, data scientists can manipulate the data with the goal of 'discovering' useful
information. The data used for analysis can be from multiple sources and present in various
formats. Machine learning is where computers use algorithms to improve and "learn"
over time as they interact with more data. With machine learning, you can feed a
computer terabytes and petabytes of data, so that it learns to discriminate among patterns and
refines its decision rules, building on the underlying human-written programming, to produce
the desired outcome.
1. What is linear regression? What do the terms p-value, coefficient, and r-squared value
mean? Write the significance of each of these components.
1. Study the image given below. Which graphs are being referred to here?
a. Exploratory
b. Inferential
c. Causal
2. Choose the model that sums the importance over each boosting iteration.
a. Boosted trees
b. Bagged trees
b. Set
c. Value
d. Subset
a. Probability
b. Hypothesis
c. Causal
a. Visual techniques
b. Assumptions
c. Fixed models
a. Frequency
b. Summarized
c. Raw
a. Raw
b. Processed
c. Synchronized
d. Filtered
a. Data cleansing
b. Data integration
c. Data replication
d. Data duplication
a. MCV
b. MCB
c. MARS
d. MCRS
Answers: