0% found this document useful (0 votes)

353 views208 pages

Data Science Tips and Tricks To Learn Data Science Theories Effectively

Uploaded by

enock-readers

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

353 views208 pages

Data Science Tips and Tricks To Learn Data Science Theories Effectively

Uploaded by

enock-readers

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 208

Data Science

Tips and Tricks to learn Data Science

Theories Effectively
Ó Copyright 2020 by William Vance - All rights reserved.
This document is geared towards providing exact and reliable information in regards to the topic
and issue covered. The publication is sold with the idea that the publisher is not required to
render accounting, officially permitted or otherwise qualified services. If advice is necessary,
legal or professional, a practiced individual in the profession should be ordered.
- From a Declaration of Principles which was accepted and approved equally by a Committee of
the American Bar Association and a Committee of Publishers and Associations.
In no way is it legal to reproduce, duplicate, or transmit any part of this document in either
electronic means or in printed format. Recording of this publication is strictly prohibited, and any
storage of this document is not allowed unless with written permission from the publisher. All
rights reserved.
The information provided herein is stated to be truthful and consistent, in that any liability, in
terms of inattention or otherwise, by any usage or abuse of any policies, processes, or directions
contained within is the solitary and utter responsibility of the recipient reader. Under no
circumstances will any legal responsibility or blame be held against the publisher for any
reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Respective authors own all copyrights not held by the publisher.
The information herein is offered for informational purposes solely and is universal as so. The
presentation of the information is without a contract or any type of guarantee assurance.
The trademarks that are used are without any consent, and the publication of the trademark is
without permission or backing by the trademark owner. All trademarks and brands within this
book are for clarifying purposes only and are owned by the owners themselves, not affiliated
with this document.
Contents
Introduction
Chapter One: What Is Data Science?
Two broad aspects of data science
The Four V's Of Data Science
Machine learning
Supervised and Unsupervised Learning
Privacy
Theories, Models, Intuition, Causality, Prediction, Correlation
Conclusion
Chapter Two: Getting Started with Data Science
Exponentials, Logarithms, and Compounding
Normal Distribution
Poisson distribution.
Moment of continuous random variables
How to combine random variables
Vector Algebra
Diversification
Matrix calculus
Conclusion
Chapter Three: R - Statistic Packages
System command
Matrices
Descriptive Statistics
Higher-Ordered Moments
Brownian Motion in R
Maximum Likelihood estimation
GARCH/ARCH Models
How Bivariate random variables works
Multivariate random variables
Portfolio computation in R
Regression
Heteroskedasticity
Auto-regressive model
Vector Auto-Regression (VAR)
Conclusion.
Chapter Four: Data Handling and Other Useful Things
Data Extraction Of Stocks Using Quantmod
How to use the merge function
Using The Data.Table Package
Conclusion
Chapter Five: Markowitz Mean-Variance Problem
Markowitz Mean-variance Problem
Solution in R
How To Solve The Problem Using The Quadprog Package
Risk Budgeting
Conclusion
Chapter Six: Bayes Theorem
Correlated Default (Conditional Default)
Continuous and More Formal Exposition
Bayes Rule in Marketing
Bayes Models in Credit Rating Transitions
Accounting Fraud
Conclusion
Chapter Seven: More Than Words - Extracting Information From News
What is News Analysis
Algorithms
Crawlers and Scrapers
Pre-processing Text
Term Frequency - Inverse Document Frequency (TF - IDF)
Wordclouds
Word Count Multiplier
Vector Distance Classifier (VDC)
Metrics
Confusion Matrix
Precision and Recall
Accuracy
False Positives
Sentiment Error
Correlation
Phase-Lag Metrics
Economic Significant
Text Summarization
Conclusion
Chapter Eight: Bass Model
The Bass Model
The Basic idea
Software
Calibration
Sales Peak
Conclusion
Chapter Nine: Extracting Dimensions: Discriminant and Factor Analysis
Discriminant Analysis
Notation and assumption
Implementation with R
Splitting into multiple groups
Factor Analysis
Principal Components Analysis (PCA)
The Difference between FA and PCA
Factor Rotation
Conclusion
Chapter 10: Auction
Auctions
Types of Auction.
How To Determine The Value Of An Auction
Bidder Types
Benchmark Model (BM)
Properties and results of Benchmark Model
Auction Math
Optimization By Bidders
Treasury Auctions
UPA or DPA
Mechanism Design
Clicks (Advertising Auctions)
Next-price Auction
Laddered Auction
Conclusion
Chapter 11: Limited Dependent Variables
Limited Dependent Variables
Logit
Prohit
Slopes
Maximum-Likelihood Estimation (MLE)
Multinomial Logit
Limited Dependent Variables in VC Syndication
Endogeneity.
Conclusion
Chapter Twelve: Fourier Analysis And Network Theory
Fourier Analysis
Fourier series
Angular Velocity
Fourier Series
Complex Algebra
From Trigs to complex
Collapsing and Simplifying
Fourier Transform
Application To Probability Functions
Graph Theory.
Features of Graphs
Chapter 13: Searching Graph
Breadth-First-Search
Strongly Connected Components.
Dijkstra's Shortest Path Algorithm
Degree Distributions
Diameter
Fragility
Centrality
Communities
Modularity
Conclusion
Chapter 14: Neural Networks
Overview of Neural Networks
Nonlinear Regression
Perceptrons
Squashing Functions
How does NN works?
Logit/Probit Model
Connection To Hyperplanes
Feedback/Backpropagation
Chapter 15: One Or Zero: Optimal Digital Portfolio
Optimal Digital Portfolio
Modeling Digital Portfolio
Conclusion
Conclusion
Introduction
There is a popular joke that a data scientist is someone who knows more computer science than a
statistician and knows more statistics than a computer scientist. While it is true that to become a
good data scientist, you don't need just the knowledge of computers and statistics, to become a
professional data scientist, one must be exceedingly brilliant in both computer science and
statistics. This is because the knowledge of both fields is what enables data scientists to be able
to create insight from scattered data on the computer screen. In the same vein, the knowledge of
fields outside these two helps the data scientist asks intelligent questions, both formal and
informal, on data. He or she uses this to generate a clear insight.
This is why, in most cases, a data scientist is well trained in not just computer-related fields but
also other fields like economics, business, law and many more. This helps the scientist to become
vast in questions required in any domain he or she is working with. It also helps him or her
summarize this question as required. A good example of a site that uses questions from
additional domain aside computer and statistics is the dating site OkCupid. Before an individual
can become a member of the group, he or she would be required to answer a lot of questions.
The site uses the answers given to these questions to summarize the personal characteristics of
the candidate and then merge him or her with the right person available. As a result, to be able to
construct these questions intelligently, the scientist is required to be vast in the field of
psychology and humanity.
Similarly, Facebook, one of the most popular social media, like the dating site mentioned above,
also uses some personal questions. The reason for this is to make it easy for families and friends
to find and connect with the person.
In relation to this, in 2012, during Obama campaign in the US, he employed a lot of data scientist
to use their skill in finding out voters the parts of the country that need extra . This birthed the
different fundraising appeals and programs that were put in place before the election. This
singular effort played an important role in the president's reelection and shows that at each
passing year, data scientists are becoming more needed and valuable.
The growing rate of both social media and the internet, as the basic channels, to get any
information on any field imaginable, has generated more data than we can comprehend. At every
passing, users upload videos on YouTube. Not only this, the rate at which people are using
Facebook and Twitter grows at each passing day. Business is drawn to the digital world. Digital
marketing is fast becoming entrenched — all this is more data than we can imagine.
These brief explanations show that the domain of data science is vast and required continuous
training not just in the field of the computer but also in other fields where data scientists would
be needed. Only a few scientists are actually armed with the required capacity. In a business
organization, for instance, a data scientist is required to help generate applicable business
intelligence.
However, data is not information, and until an analysis is added to it, it is just noise. This is
where this book comes in. It explains the different theories and models, applications and
calculations required in data science. The books explain the different aspects of data analysis and
science.
Added to this is the indisputable fact that the world is fast drowning in data consumption. Every
click on a website is being tracked and monitored by data science; every smartphone is built with
a capacity to deliver your location at every time it is needed accurately. Quantified sellers are
always on pedometers-on-steroid use to records their movement, heart rate, diet, habit, and even
their sleeping method. The internet represents the happening place. It contains information
ranging from the database, encyclopedia, domain-specific details on music, sport and any event
of your choice. Not only this, on the internet, is information such as government statistics,
academic write-ups, textbooks, products of different ranges, games and many more.
These data contain information that seems elusive to everyone. This information is what would
be explained in this book. Data science has indeed risen more than we often thought. It is fast
becoming the most sorted after the field in computer studies. The world is beginning to recognize
the impact of the data scientist. This book focuses on explaining the rudimentary aspects of data
science. It explains in detail the important theories and models used in data science. If the focus
on thoroughly explaining how theories are applied in data science and how these theories can be
used to explain some of the basic aspects of data science.
The book is divided into fifteen chapters, and each chapter focuses on explaining thoroughly and
indelibly all there is to know about data science. The first few chapters explain in detail what
data science is and the different aspects of data science. The other chapters concentrate on the
various theories and models used in data science.
At the end of the book, the learner is expected to be knowledgeable in the use of theories and
models in data science. He or she is also expected to be able to construct intelligent questions
and forecast or predict the result of the questions and its importance to the domain at large.
Chapter One: What Is Data Science?
With the rising rate of technology today, data science is transforming businesses and careers.
Both medical cares and law firms use data in their domain. Business organizations are not cut out
of the use of data. According to Josh Wills of Cloudera, “A data scientist is a person who is
better at statistics than any software engineer and better at software engineering than any
statistician.” Therefore, a data scientist is someone who is excellent at software engineering and
statistics. A data scientist knows all there is about business models and paradigms.
Data science covers generating new ideas and approaches and merging new theories with new
algorithms. It entails the mastery of both statistics and computer studies. Data science focuses
more on data and algorithms and the various theories and models that can be used to interpret the
data and algorithms. Data science encompasses the creation of an excellent data analysis plan
and the implementation of that data using theories and models.
To create a good data analysis plan, McKinsey argues that it must contain these three elements:
analytic model, interlinked data inputs, and decision-support tools. Similarly, Halevy, Norvig,
and Pereira argued in a seminal paper published in 2009 that even simple theories and models,
with big data, have the potential to do better than complex models with fewer data. This is to
show that the effectiveness of data analysis is not dependent on how large the data is. The size of
the data has little to play in the effectiveness of the data. One of the popular data scientist Hilary
Mason, explained that the making of “data products” entails three important aspects. The first is
the data itself, the second is technical excellence which encompasses machine learning and the
third part is people and process; this entails talent. A good example of a data product that
comprises all three elements is Google Maps. Hilary went further to list out the skill required to
be an excellent data scientist. These include math and stats, communication and coding.
However, before the mastery and implementation of these skills, every individual who aims to
become a good data scientist must possess one valuable ability. This is the ability to ask relevant
and excellent questions. The answer to these questions is what unlock values for consumers,
companies, and even society at large. Therefore, becoming a good data scientist begins with the
ability to ask relevant questions and to solve a problem. Becoming a good data science requires
the mastery of not just computer or statistics, a data scientist is knowledgeable in other fields of
study.
Two Broad Aspects of Data Science
Predictions and Forecasts
These two aspects play a very important role in the major purpose of data science. However,
there is a difference between how these two works. Forecasts cover a range of outcomes, while
prediction focuses on identifying just one outcome. Take, for instance, the statement "it will
snow this week" is a prediction. However, when we say there is a 50% probability that it would
snow this week. This means that there is also a 50% probability that it would not snow.
Prediction portrays certainty while the forecast gives a range of possibilities.
The Four V's Of Data Science
The big data is made up of several V's however, for the purpose of this book, these four will be
our concentration.
Volume
Velocity
Variety
Veracity
Volume
The volume of big data is fast exceeding the capacity of most databases. Data generation scale
has risen to the extent that what used to be the record generated in years is now being generated
in two days. This is confirmed by a very popular data analyst, Google's Eric Schmid. He pointed
out that as of 2003, what was generated in the volume of big data is 5 exabytes of data - an
exabyte is 1000x6 bytes or a billion billions of bytes. Today, what was generated in 2003 is
generated in just two days.
The reason for this massive rise in data volume is simply the introduction of interaction data.
This is the new kind of data that produces more results than the transaction data that was used
before its development. Interaction data entails the recording of daily activities in apps, which
may include activities in a browser, RFID data, geo-location data, and so on. All these activities
gather more than millions of thousands of data in just a day. Truth is we are in the age of
“internet of things” (or IoT); every little thing concerning us is made available on the internet or
social media and this is generating constant significant growth in the quantity of data.
As such, a good data scientist is expected to be very skilled at managing data volume. This is not
limited to the technical database alone but also the ability to build algorithms that help manage
the size of the data intelligently. This is partly because, when dealing with big data, none of the
correlations is left out, all of them are very important. Hence, to be able to manage this
effectively, a good data scientist should be vast in techniques that would help extract causality
from correlations.
Velocity
As the volume of data increases, the velocity also increases. Facebook entries increase every
single second; tweets are growing on Twitter; at every second, information is seemed and
generated by users. The rising increase of velocity also affects the volume of data. However, this
might affect the window of data application.
Variety
The variety of data has grown higher than before. Before now, models that work with variables
are very minimum. However, these have increased as computing power increases. The increase
and change in velocity, volume, and variety of data required new econometrics and a couple of
new tools for setting questions in data science.
As we progress in the book, we explained the different econometric techniques and modeling
concepts that would be given.
However, it is important to note that data science is not limited to the analysis of large data; it
also requires the creation of data.
Machine Learning
Machine learning is just an aspect of data science. This implies that data science entails more
than just machine learning. Systems are usually trained on data to make decisions. However, this
is a continuous process. As the system is continuously trained on decision making, the capacity
of the system improves with more data. A good example of a system trained in decision making
is the spam filter. As the data grows, the spam filter uses a Bayesian filter to change its decision
making. This happens constantly and helps keep the filter ahead of spammers. It is this ability to
stay ahead of spammers that them from gaming the filter.
However, machine-learning techniques favor data over judgment. Good data science is required
to combine both aspects excellently. Machine learning is fast progressing; Hilary Mason
highlighted four characteristics of machine learning. The characteristic is listed below and would
be explained in detail in subsequent chapters.
1. Machine learning is usually based on a theoretical breakthrough and is therefore well-
grounded in science.
1. Machine learning changes the existing economic paradigm.
2. The result is commoditization (e.g., Hadoop)
3. it makes available new data that leads to further data science.

Supervised and Unsupervised Learning

In explaining machine learning, it was mentioned that systems are trained in decision making,
and this training is a continuous process. There are two majors ways a system can be trained.
These include supervised and unsupervised learning.
In supervised learning, the output is produced based on the input data. To do this, the system is
provided with historical data of input and output. The system then uses one of the several
machine techniques to learn the relationship between the two. Judgment would also be used to
decide which of the machine learning techniques would be appropriate. Examples of supervised
learning include automated credit card approval and spam filter.
In unsupervised learning, inputs are arranged and reorganized in order to structure unlabeled
data. This is done by reorganizing data and labeling them with tags. A good example is factor
analysis

Privacy
As the volume, velocity, and variety of data increase, individual privacy is flooded. As humans,
we are often torn between engaging in social interaction with others and maintaining our privacy,
and technology has made it happen in such a way that the war against maintaining our privacy
grows every blessed day. Daily we are in a struggle to keep our private information from leaking
out to the public. The reason for this is simply as a result of the involvement of data science.
Stakeholders, governments, business owners, and more useful data service to gain access to
people's private information. We are very conversant with sites asking personal questions
ranging from our age, marital status, place of origin, occupation and even address to our resident
areas. These kinds of questions help leak out private information.
Additionally, the loss of privacy is also caused by what is known as "human profiling." This is a
situation whereby the more we move our daily activities to the web, the more companies and
organizations use data mining and analysis to construct profiles of who we are even more than
we often realize. For instance, when we tweet "taking my dog for a walk," this information is
incremented through data analysis as "owner of a pet" when we tweet " going home to cook for
my kids," this is incremented as "a mother." This shows that a little information that appears like
an ordinary tweet to us on Facebook or Twitter reveals more than we often know. In succinct, a
machine knows you better than you think.
Aside from tweets on social media, phone calls, GPS location tracker, and emails are also part of
what companies and organizations can use to create human profiling. Those who don't often
tweet information about themselves are being recorded as people of low digital profile. However,
to create a balance, it is advisable not to hide so as not to be known but to maintain an average
profile as possible.
Human profiling means the separation of a targeted space for a separate audience or group of
people. This allows attention to be paid to that group of persons through what is known as price
discrimination. For instance, if my profile shows that I am an influential person and very rich, I
am likely to start receiving internet sales pitches from popular companies who have gathered that
I often buy the product they are into. Profiling allows companies or business organization to
reach their audience faster and more accurately.
Profiling is also used to snare terrorists, however, care should be taken not to engage in excessive
profiling. We are in an age where there is an interesting battle between man and machine for
their privacy. Let's be careful what information we disclose on social media.
Theories, Models, Intuition, Causality, Prediction, Correlation
Data science entails the implementation of theories and models. Data science also makes use of
intuition, causality, prediction, and correlation. Theories are a statement about how the world
should be or should not be. These statements are often derived from axioms that are assumed on
the nature of the world or from existing theories. Models, however, are the implementation of
theories. This is often achieved through the use of algorithms and variables. Intuition is the result
of a running model. This means intuition is a profound understanding of the world with the aid of
data, theories, and models.
Once the intuition for the result of a model is established, what is left is to determine if the
relationship observed between model and intuition is that of prediction, causality, or correlation.
Causality is usually stated in a mathematical form or structure. Theories might be causal. To
arrive at a causal effect, it must be deeply entrenched in data. This is why causality is very
difficult to establish, even with the use of theoretical foundations.
At the end of the inference chain in data science, the movement between two variables is often
determined by correlation. Correlation is of utmost importance to firms hoping to tease out
information from big data. Although correlation deals with a linear relationship between
variables, it could also lay the background for finding out nonlinear relationship, an aspect which
is becoming more and more flexible with the use of data.
In data science, a relationship is a multifaceted correlation among people. Social media, like
Twitter, Facebook, Instagram, and so on, use graph theory to datafy human relationships. The
aim of this is to understand how people relate to each other to make some profit from it. Data
science encompasses the understanding of how humans relate with one another and to
understand the behavior of a human generally. This aspect is the focus of social science.
Conclusion
This chapter explained in detail who a data scientist is, what is data science and the features a
good data science should have. The chapter also looked at the characteristics of good data,
machine learning, and the two major types of machine learning. In the subsequent chapters of
this book, we will consider theories, models, data application and techniques. We will also
explore some of the recent technologies created for big data and data science.
Chapter Two: Getting Started with Data Science
This chapter explores some of the mathematical models, statistics, and algebra of data science.
We would be looking at some equations prevalent in data analysis and how business
organizations use this in carrying out their duties.
Data analysis calls for technical expertise and excellence. It calls for the ability to use various
quantitative tools. These tools range from statistics to calculus and algebra, and of course,
econometrics. There are various tools used in data analysis; in this chapter, some of the tools
would be explained in detail. The outline that would be covered in this chapter include:
Exponentials, Logarithms, and Compounding
Normal distribution
Poisson distribution
Vector Algebra
Matrix calculus
Diversification
Exponentials, Logarithms, and Compounding
In this section, we would start our explanation from the most basic mathematical constant we are
familiar with. This is “e = 2.718281828...”, which is also the function “exp(·).” This function is
usually written as "ex." Here, x can either be a real or complex variable. This type of
mathematical constant is very popular in finance and other related areas. In finance, we use the
constant in the continuous compounding and discounting of money at a stipulated interest rate
which is (r) and a time frame (t).
Let's assume y = ex, any change in the value of x would also result in a change in the percentage
of y. The reason for this is simple. In (y) =x, In (.) is the inverse function of the exponential
function and also a natural logistic function.
Remember that the first derivative of this function is the equation dy/dx =ex. e is a constant, and
it is defined as the limit of a particular function
The limit of a successively shorter interval over discrete compounding is what is known as
exponential compounding. Let's assume that time frame (t) is split into intervals per year. The
equation for the compounding of a dollar from zero time frame to a time (t) years at a given
interval (n) at per annum rate can be written as:

When the limit of n rises to infinity, it leads to continuous compounding. The equation is written
like this:

The above equation is just the forward value for one dollar. To calculate the present value, we do
a reverse equation. Hence the price today of a dollar collected t years from today is P=e-rt. What
we got now is a bond. The yield of this bond is:

Duration is the negative of the percentage price sensitivity of a bond to changes in interest rate.
The equation represents this:
The percentage price sensitivity of a bond in relation to its second derivative is its convexity. The
equation for this is:
Normal Distribution
This aspect is the benchmark of many models in social science. This is because it is widely
believed to produce virtually all the data needed in the big data. It is quite interesting that most
phenomenon in the real world is "power law" distributed. This implies a very few observation of
high value as against many observation of low value. In this type of distribution, the probability
distribution does not have the features of normal distribution. Rather what we have here is left to
right decline in a probability distribution.
A good example of data distributed in this format is income distribution ( very few observations
of high income and many observations of low income). Other examples include the population of
cities, frequencies in language, and so on.
The normal distribution is very important in statistics. A good example of this type of
distribution is human heights and stock returns. In a normal distribution, if x is normally
distributed with the mean and variance, the probability density for x equals to

The distribution function gives the cumulative probability

The notation N(·) or Φ(·) instead of F(·) is often used because the normal distribution is
symmetric. The “standard normal” distribution is: x ∼ N(0,1).
Poisson Distribution
This is also known as a rare-event distribution. The density function for this type of distribution
is

In this type of distribution, there is only one parameter, the mean λ. The function of density is
above the discrete values of n. Both the variance and the mean in Poisson distribution is
represented by λ. The Poisson is a discrete-support distribution, its values range from n =
(0,1,2,3,4,5....)
Moment of Continuous Random Variables
The formulas that would be reviewed in this sector are very necessary because any analysis of
data entails the use of these formulas. In our review, we would use the random variable x and the
probability density function f(x) to arrive at the first four moments.
Mean (first moment or average) =

The power of the variable results in a higher nth order moment. These types of moments are
non-central. The formula for this is
The next central moment is the variance. Moments of the demeaned variable is also known as
central moments.
Variance = Var(x) = E[x−E(x)]2 = E(x2)−[E(x)]2
The square root of the variance is the standard deviation, i.e., σ = √Var(x). The next moment is
skewness
The value of skewness is in relation to the degree of asymmetric in probability density. If there is
more occurrence of values in the left-hand side than the right, the distribution is left-skewed.
When the values fall more on the right-hand side, what we have is right-skewed.
The last normalized central distribution is the kurtosis.
The standard distribution value for Kurtosis is 3. Excess kurtosis occurs when the standard
distribution value is minus 3. A distribution with excess kurtosis is called leptokurtic.
How to Combine Random Variables
Here are the simple formats to combine random variables
1. Means are scalable and addictive, E(ax +by) = aE(x)+bE(y)
2. When a, b are scaler values and x, y are addictive, the variance of random plus scaled
variables is
Var(ax +by) = a2Var(x)+b2Var(y)+2abCov(x,y)
3. The equation for the covariance and correlation between two random variables is
Vector Algebra
In most of the models we will explore in this book, what we will be using are linear algebra and
vector calculus. Linear algebra encompasses the use of both vector and matrices, while vector
algebra and calculus are very effective in handling issues, including solutions in spaces of
several variables.
A good example is a high dimension. In this book, the use of vector calculus would be examined
in the context of a stock portfolio. The return of each stock is defined as:

What we have in the above equation is a random vector. This is because each return comes from
its own distribution. Also, there is a correlation in the return of all these stocks.
We can also define a Unit vector as :

The unit vector would be used further in subsequent chapters, especially for analysis. A set of
portfolio weights represents a portfolio vector. This implies the fraction of the portfolio invested
in each stock.

The sum of all portfolios must be 1. The equation for this is:

A good observation of the line above would show that there are two ways to calculate the sum of
a portfolio. The first way is by summation notation, while the second used a simple verbal
algebraic statement. Vectors are represented by the two elements at the left-hand side of the
equation while the elements at the right-hand side is a scalar.
Vector notation can also be used to compute statistics and the quantities of portfolios. The
formula for the portfolio return is
In the above equation, the quantities at the left-hand side represent the scalar, while the right-
hand side is the vector.
Diversification
Here we would examine the power of using vector algebra with an application. To explore how
diversification works, we would be using vector and summation math. When the number of non-
perfected correlations in a stock portfolio increases, diversification happens. This creates a
reduction in portfolio variance. Now, to compute the variance, we would use the portfolio weight
w and the covariance vector of stock return R. Σ represents this. In our calculation, the formula
for a portfolio return variance would first be written as:

However, if the return is independent, the formula collapses to

But if an equal amount is invested in each asset and return are independent, the formula we have
is

In the above equation, the first term is the average variance, while the second term is the average
covariance. What we have at the end of our equation is an outstanding result of a diversified
portfolio. In this type of portfolio, the variance of stock does not play any role in portfolio risk.
The variance of the stock is the average of off-diagonal terms in the covariance matrix or vector.
Matrix Calculus
Matrix calculation is merely the function of countless variables. Just as a function can be
amended in multivariable calculus, functions are also amendable in matrix calculus. However,
the simplest among this is using vector and matrix. Here we can take the derivative of in just a
single step. For instance, let's assume

and

The fraction for f(w) will be wB. What we have here is a function of two variables w1,w2. When
we write out f(w) in long-form, what we will arrive at is 3w1 +4w2. The derivative of f(w) in
relation to w1 is ∂f/∂w1 = 3, while the derivative of f(w) in relation to w2 is ∂f/∂w2 = 4. When
we compare this with vector B, the value for df/dw is B.
The insight in this form of calculation is that when vectors are treated as regular scalars and
calculus are conducted accordingly, the result we will arrive at is a vector derivative.
Conclusion
In this chapter, we have successfully covered some models in data calculation. We have also
explored some of the basic statistics in data science. We considered mathematical calculations
like vector, matrix, calculus, variables, and so on. The next chapter will examine more about
theories and models in data science.
Chapter Three: R - Statistic Packages
This chapter would examine some of the useful steps for using R - statistic packages. For a great
user interface that comes with using the R package, it is advisable to download and install
RStudio; this can be done by visiting www.rstudio.com. However, it is necessary first to install R
from the R project page, www.r-project.org. Now let's get started with some R statistics basic
programming skills. The outlines that would be covered in this chapter include:
System command
Matrix
Descriptive statistics
High-ordered moments
Brownian motion in R
GARCH/ARCH Model
Heteroskedasticity
Regression model
System Command
To access the system directly, you can issue system command using the following technique:
system( "<command>" )
For example
system( " ls_lt_ |_ grep_Das" )
will list out all the directory entries that have my name in chronological order. However, this
kind of command would not work on a Windows machine because I am using a UNIX
command. This can only work with a Linux box or Mac.
Loading data
To get started with this, there is a need to get some data. Here are the steps to do this:
1. Go to Yahoo
2. Download and save some historical data into Excel spreadsheet
3. Restructure the order of the data chronologically
4. Save the work as a CSV file
5. Use the following method to read the file into R

If required, the last command in the above would reverse the structure of the data sequence.
Stock can be download using the quantmod package.
Note: the drop-down menu on Window and Mac can be used to install a package. In Linux, use
the package installer. The following command can also be used to achieve this.
install. packages( "quantmod" )
Now we can start using the package

We can also create a direct column of stock using the following formula:
Next, concatenate the data columns into a single stock data set

Now we will log in return in continuous-time to compute daily returns. The mean returns
include:

We will also compute the correlation matrix and the covariance matrix.

In our program, we will notice that the print layout made it easy to select several significant
digits.
To make the data files easy to work within all formats, you can use the reader package. It has
many tangible functions.
Matrices
In this section, we would examine the basic command needed to manipulate and create a matrix
in the R project. We will be creating a 4x3 matrix with some random numbers shown below:

When transposing the matrix, we would notice the reversion in the dimensions of the number

For easy multiplication of a matrix, the matrix to be multiplied must conform with each other.
This implies that the number of rows of the matrix at the right must be equal to the number of the
columns of the matrix on the left. The resultant matrix that has the sum of the computational
would contain the number of columns of the matrix at the left and the rows of the matrix at the
right.
Descriptive Statistics
Here, we would be using the same data to compute different descriptive statistics. The first step
to do this is to
Read a CSV data file into R file

Now we have our stock data intact, we can compute daily returns and then convert the
returns into an annualized return. The result of this action is shown below:

When we compute the daily and annualized return, the result is as follows:
Higher-Ordered Moments
Two major moments arise in return distribution; they are skewness and kurtosis. To show how
this works, we would be using a moment library in R.
Skewness = E[(X−µ)3] ÷ σ3
The meaning of skewness is that one tail is fatter than the other. A fatter right(left) tail means
that the skewness is positive (negative).
Kurtosis = E[(X −µ)4] ÷ σ4
In Kurtosis, the two tails are fatter than the normal distribution. In a normal distribution,
skewness is zero, and kurtosis is 3. Excess kurtosis occurs when the value of kurtosis is minus
three.
Brownian Motion in R
Stock motion law often major in Brownian Motion, especially its geometry.
dS(t) = µdS(t) = µS(t) dt+σS(t) dB(t), S(0) = S0
This kind of equation is a stochastic differential equation (SDE). The equation is an SDE because
it explains the random movement of the stock (t), the coefficient of the stock (µ) and (σ). The
drift of the process of the stock is determined by µ while σ determines the volatility. Brownian
motion determines the randomness B(t). Unlike the deterministic differential equation, which is
only a function of time, this aspect is more general. In SDE, the solution is always a random
function and not a determinant function. The time interval (h) solution is as follows:

The presence of B(h) ∼ N(0,h) in the solution is what gives the function its random function
quality. B(h) can also be written as the random variable (h) ∼ N(0,h), where ∼ N(0,1). The
presence of the exponential return makes the price of stock lognormal.
Maximum Likelihood Estimation
In MLE Estimation, our concern is to find the parameters {µ,σ} that causes the probability of
seeing the empirical sequence of returns R(t). To carry out this estimation, we would be using a
probability function. Here are the steps to carry out this estimation:
A quick overview of the normal distribution x ∼ N(µ,σ2)
Next, we do a density function

The formula for a standard normal distribution is x ∼ N(0,1).For the accepted normal
distribution: F(0)=12.
In the following equation, the probability density is normal.

Where α=(µ−1/2σ2)h. For periods t=1,2,... T the entire series likelihood is

It is very simple (computationally) now to maximize.

This kind of action is known as log-likelihood; it is very easy to use in the R project. The
first step to do this is to generate the log-likelihood function.

After this, we can now go ahead and do the MLE using the NLM (non-linear
minimalization) package in R. This uses a Newton-type algorithm.
GARCH/ARCH Models
GARCH represents “Generalized Auto-Regressive Conditional Heteroskedasticity." Rob Engle
was the first who invented ARCH, which later earned him a Nobel Prize. This was later extended
to GARCH by Tim Bollerslev. The emphasis of ARCH models is that volatility tends to cluster,
i.e., volatility for period t, is auto-correlated with volatility from the period (t −1), or other
preceding periods. When a time series follows a random walk, it can be a model like this:

Under ARCH, the variance is always auto-correlated. So we will have:

In GARCH, the stock is conditionally normal and independent. However, because of the changes
in variance, it is not identically distributed.
How Bivariate Random Variables Work
Two independent random (e1,e2) ∼ N(0,1) can be converted into two correlated random variables
(x1, x2) with correlation ρ using the following transformation method:

This implies that we can generate 10,000 pairs of variables using the R code explained below
Multivariate Random Variables
This is generated by using the Cholesky decomposition. Cholesky stands for a covariance matrix,
which is a product of two matrices. Covariance can be written in decomposed form Σ = L L.
Here, L represents a lower triangular matrix. There can also be an alternative decomposition for
upper triangular, here U = L. Each component that makes up the decomposition becomes a
square root of the covariance matrix.
Cholesky decomposition is very good at generating a correlated random number from a
distribution with mean vector µ and covariance matrix Σ. Assuming we have a scalar random
variable e ∼ (0,1) we can use this to change the variable into x ∼ (µ,σ2), we generate e and then
set x = µ+σe. However, if instead of a scalar, we have a vector e = [e1,e2,...,en] T ∼ (0,I) . This
can be transformed into a vector of correlated random variables x = [x1,x2,...,xn] ∼ (µ,Σ), by
computing: x =µ+Le
Portfolio Computation in R
Its variance usually calculates a portfolio's risk. When there is an increase in n (the number of
securities in the portfolio), this initiates a reduction in the variance. This continues to the point
that it becomes the same as the average covariance of the total assets. The following result shows
what happens when the variance demo through the R function.
Regression
A multivariate linear regression has:
Y =X·β+e
This is the value for Y ∈ Rt×1, X ∈ Rt×n, and β ∈ Rn×1. The overall regression solution is
β = (XX)−1(XY) ∈ Rn×1.
To arrive at the above result, we minimize the sum of squared error.
It is noteworthy that this expression is a scalar.
Heteroskedasticity
In simple linear regression, it is assumed that the standard error of the residual is the same for all
observations. A lot of regression suffers from this type of situation. This type of error is what is
known as "heteroskedastic error." "Hetero" means "different" while "skedastic" means
"dependent on type."
A heteroskedastic error can be tested by using the standard Breusch-Pagan test available in R.
This is found in the lmtest package and should be loaded before running the test.

In the above test, there is a little heteroskedastic error in the standard. This is seen in the
appearance of the p-value. We would correct this using the hccm function. This stands for
heteroskedasticity corrected covariance matrix would be as follows:\
In the above program, we use the hccm to generate a new covariance matrix vb of the
coefficients. Next, we generated the standard error as the square root of the diagonal of the
covariance matrix. With the help of these revised standard errors, we divided the coefficients by
the new standard error. This helps us to recompute the t-statistics.
Auto-Regressive Model
Whenever a data is autocorrelated, that is, generates a dependence on time, not giving an account
of this is tantamount to unnecessary high statistical significance. This is because when an
observation is correlated with time, they are often seen as independent, thereby limiting the true
number of observations.
In an efficient market, the correlation of time from one period to the other should be close to
zero.

The program above is for immediate consecutive periods, referred to as the first-ordered
autocorrelation. This can be examined across many staggered periods by using the R functions in
the package car.

When the DW is close to 2, there is usually no traces of autocorrelation. When the DW statistic
is less than two, it is positive autocorrelation; when it is greater than 2, it is negative
autocorrelation.
Vector Auto-Regression (VAR)
This is very useful for estimating systems where the variables influence each other, and there are
simultaneous regression equations. Therefore in VAR, each variable in a system depends on the
lagged value of itself and on other variables. To choose the number of lag values, we use the
econometrician to choose the expected decay in the time-dependence of the variable in VAR.
Conclusion
In this chapter, we explored in detail the various models and features of the R package. We also
examined the types of regression models in R-statistic. In the next chapter, we will examine data
handling using the R package.
Chapter Four: Data Handling and Other Useful Things
This chapter would focus on some of the alternative programs in R different from what we have
examined in the previous chapter. We will also constantly draw reference from the topics under
the R package treated in the former chapter. Here, we would explore some of the very strong
packages of R. Especially those that use sql-like operations on both the handling of small data
and the handling of big data. The topics we would be considered in this chapter include:
Data extraction of stocks using quantmod.
How to use the merge function
How to apply a class of functions
Getting interest data rate from FRED
How to handle data using lubricate
Using the data.table package
Data Extraction Of Stocks Using Quantmod
Here we will be using the stock package treated in the previous chapter to get a few initial data.

When the length of each stock series is printed, you will find out that they are not the same. Our
next action is to covert closing adjusted prices of every stock into separate data.frames. Here are
the steps to do this:
Construct a list of data.frames. This is important because data.frames are stored in lists

Each data.frames should have a column. This would be used later to join the separate
stock data.frames that were previously created to a single composite data.frames.

Next, we will use a join to integrate all the stocks, adjusted closing prices into one
data.frame. The aim of this is to merge; this can be done through a union join or through
an intersect join. Intersect join is the default.
We will observe that the stock table contains the number of rolls in the stock index.
This has limited observations than individual stocks. Because what we are dealing
with is an intersect join, part of the rows will be dropped.
Plot all stocks into a single data frame with the use of ggplots 2. This is more
functional than the basic plot function. However, to use ggplots 2, we would first use
the basic plot function.
Next, the data would be converted into returns. These could be either log returns or
continuously compounded returns.
The data.frame returns can be used to present the descriptive statistics of returns
Next, the correlation matrix of returns should be computed.
After this, the correlogram for the six return series should be displayed. This would
help us see the relationship between all variables in the data set.
How to Use the Merge Function
Data frames are similar to tables or spreadsheets. However, they are very much like a database.
When we want to merge two data frames, it is the same as joining two databases. R program has
the merge function for this.
Now, let's assume we already have a list of ticker symbols that we want to produce a well-
detailed data frame from these tickers. The first thing we would do is to go through the input
name of the tickers. Let's assume the tickers are in a file named tickers.csv; the diameter of the
file is the sign of a colon. This would be read like this:
tickers = read. table(" tickers . csv" ,header=FALSE, sep=" : " )
We arrive at two columns of data from the line of code read in the file. The upper part of the file
contains the six rows listed below:
> head( tickers )
V1 V2
1. NasdaqGS ACOR
2. NasdaqGS AKAM
3. NYSE ARE
4. NasdaqGS AMZN
5. NasdaqGS AAPL
6. NasdaqGS AREX
The next line identified below list out the numbers of input tickers while the third line renames
the data frames columns. The tickers' column is renamed "symbols" because the data frame that
would be merged with it shares a similar name. This column is the index that the two data frames
are joined.

The next action is to read in the list of every stock on NYSE, Nasdaq, and AMEX. This is shown
as follows:
The upper part of the Nasdaq would contain the following:

Our next action is to join all three data frames into a single data frame. Then we check the
number of rows in the merged file by checking the dimensions. These two actions are shown as
follows:
co _ names = rbind (nyse _ names, nasdaq _ names, amex _ names)
>dim (co _ names)
[1] 6801 8
Lastly, we would join the ticker symbol file and the exchange data into one using the merge
function. This would extend the ticker file to contain information in the exchange file.

Now, let's assume we wish to search for the names of the CEO of all the 98 companies listed in
our program. Since we don't have any available document containing the information we seek,
we can easily download. However, a site such as Google Finance Page has the information we
seek. Our next auction is to write R code and use this to scrape the data on the Google Finance
page one after the other. Once we extract the CEO's name, we augment the tickers' data frame
using the R code.

We would notice that the R code that augments the tickers' data frame did this with the stringr
package. This helps simplify the string. When we are through with the extraction of the names,
we then search the line that has the name "Chief Executive." Here is the final data frame with the
name of the CEOs
How To Use The Apply Class Of Functions
Most times, functions are expected to be applied to many cases. The parameter for the cases may
be provided in a matrix, vector, or lists. This is similar to using different sets of a parameter to
repeat evaluations of a function by running a loop through a set of values. In the illustration
below, we use the apply function to compute the means return of all stock. The function of the
data it is merged with is the first argument, the second is either 1 (by rows) or 2 (by columns)
while the function being evaluated is the third.

We will notice that the function returns the means column of the data. Not only this, the function
that applies to a list is the lappy, while sappy works with vectors and matrices. The Mapply uses
multiple arguments. To verify our work, we can easily use the colMeans function.
How To Get Interest Rate Data From FRED
FRED stands for Federal Reserve Economic Data. This is an authenticated data interest rate
source. It is managed by the St. Louis Reserve Bank and warehoused at this warehouse
https://research.stlouisfed.org/fred2/. Now, let's assume we want to download the data directly
using the R in FRED. For us to be able to achieve this, we would write out some codes. Although
before the website was changed, there was a site for this since it is so, we would easily roll in our
own code in R.

We will use the function above to download our data and to produce a list of economic time
series. The data would be used as an index to join the individual series as a single series. Also,
we download maturity interest rates (yields) beginning from the maturity of one month
(DGS1MO) to thirty years (DGS30).
Now we have a data frame that contains all the series we are interested in. Next, we sort the
data.frame by date, but before this we first convert the date into number strings as shown below
NA represents missing values. Note that there are values represented by "-99." Although both
NA and -99 can be wiped out, we leave them because they represent times when there was no
yield for that maturity.
How To Handle Dates Using Lubricate
Assuming we want to sort out the data.frames of failed banks. We would need to do this month
by month, day by day, and week by week. This definitely requires the use of dates package. A
very unique and useful tool developed by Hadley Wickham is the lubricate package.
We would do the same sorting we did here with a month to see if we will record any form of
seasonality.

There is no seasonality with the monthly sorting, let's try with daily sorting
From our counts, we would observe that counts are indeed lower at the start and end of each
month.
Using The Data.Table Package
This is a very brilliant package written by Matt Dowle. The function of the package is to allow
data.frame works as a database. Not only this, but it also allows the proper and effective handling
of massive quantities of data. The effectiveness IP address of a company known as
h2o:http://h2o.ai/ has now embedded this technology. To see how this works, we will be using
some downloadable crime data statistics for California. Next, we will create a csv file and place
our data inside so that it can be easily read into R.
data = read . csv ("CA_Crimes_Data_2004−2013.csv", header=TRUE)
Now it is easy to convert the data into a database
library ( data . table )
D_T = as . data. table( data )
Now let's see how this works, we will notice that the syntax of this looks very much like the
syntax of data.frame. As a result, we would only print a section of the name and not all.
print ( dim(D_T)

One of the unique characteristics of the database is that it can be index by making any column in
the index key. Once this is done, it is easier to compute subtotals and even generate plots from
them.
We would notice that the type of output generated looks like that of the data.table. It also
includes classes from DataFrames too. Our next action is to plot the result of the data.table the
same way we plot that of the data.frame.
Using the p l y r Table
Hadley Wickham writes this package. It is very useful to apply functions to tables of data
(data.frames). In our program, we would want also to write a custom function, it is in writing this
function that this package comes in. In R function, we can either use the p l y r class of package
or the data.table to handle data.frame as database.

Next, we would use the filter function to subset the rows of the dataset we want to select for
further analysis.
Also, Data.table provides a unique way to carry out statistics. Below are the steps to do this:
1. Group data by standpoint.
2. Use the groups to produce statistics
3. Choose the option that allows you to count the number of trips beginning from the first
station and also allows you to calculate the average time of each trips.
Conclusion
This chapter explains in detail how data is handled in R packages. Explanations on how to merge
data to functions, apply functions to data and use the various options available for handling big
and small data were provided. The next chapter examines Data statistics.
Chapter Five: Markowitz Mean-Variance Problem
This chapter examines the Markowitz mean-variance problem. This type of problem is not only
popular in data science, but its solution is also widely used. In this chapter, we will cover the
following outlines:
Markowitz mean-variance problem
How to solve the problem using the quadprog package
Risk Budgeting
Markowitz Mean-Variance Problem
This is a very popular portfolio optimization. The solution to this type of problem is still widely
used today. However, our major aim in this chapter is on the portfolio of n asset. This implies
that the return of E (rp), and a variance denoted as Var(rp). Our portfolio weight is represented
by w ∈ Rn. The meaning of this is that in allocating values to the assets, take, for instance, we
want to allocate $1 to the asset. It means that each $1 is allocated into various assets. The total
value of the sum of our weight is 1.
Quadratic (Markowitz) Problem
This optimization problem can be defined as this. We want our result to achieve the pre-specific
level of expected mean return, and its variance(risk) avoided as much as we can.

The ½ we have in front of the above variance is for mathematical neatness. The function of this
would be explained as we progress in this chapter. The scaling of the objective function by a
constant does not affect the minimized solution. There are two types of constraints working with
our variance above. The expected mean return is forced into a specific mean return E(rp) by the
first constraint. The second constraint, also known as the fully invested constraint ensures that
the weight of the portfolio is up to 1. These two constraints are equality constraints.
The type of problem explained above is a Lagrangian problem; it requires that we use the
Lagrangian multipliers {λ1,λ2} to embed the constraints into the objective function. What we
will have after this action is a civilization problem.
We will take derivative with respect to w, λ1, and λ2, to minimize this function and then arrive at
the first-order conditions started as follows:

The first equation represented by (*) is an n equation system. This is because the derivative is
taken with respect to all the elements of the vector w. This is why we arrive at a total of (n+2) as
our first-order condition. From(*)
Let's take note of these observations:
Since Σ is positive, this means that Σ-¹ would also be positive and: B>0, C>0.
Taking the solutions for λ1,λ2, we would find out the solution for w using this formula

The above equation is the expression for the optimization equation weight when the variance is
minimized for a given amount of expected return E(rp). Once the inputs to the problems µ and Σ
is given, the vectors g and h become fixed.
E(rp) can be varied to get a set of frontier (optimal or efficient) portfolios w
Therefore, these two portfolios g, g, and h produce the entire frontier.
Solution in R
We can use R to create a function to return the optimal portfolio weight. To do this, we will use
the following formula
We can call the function of an expected return and then enter the example of a mean return
vector and the covariance matrix of returns.

The output is the vector of the optimal portfolio weight.

However, we will get a different output when the expected return is changed to 0.10

To get the expected return of 0.18 in the first example, we would notice that we shorten some
low-risk assets and lengthen some medium and high-risk assets. However, when the expected
return was reduced to 0.10, all the weight are positive.
How To Solve The Problem Using The Quadprog Package
This is an optimizer that uses linear constraint to take a quadratic objective function. As a result,
this is exactly what we need to solve the mean-variance portfolio problem we just treated.
Another significant use of this package is that we can use additional inequality constraints. For
instance, whenever we don't feel like granting short sales of any asset, we can easily bound the
weight to lie between zero and one. The manual below shows the specification of the quadprog
package.

In the setup of the problem, we are dealing with, with no short cuts and three securities. We will
have the following bvec and Amat\

The constraints will be modulated by meq = 2. This states that the first two constraints will be
equality constraints, while the last two will be greater than equal to a constraint.
The package code would be run in this format:

After running the code, our expected result would be 0.18, with a short-selling that allows:
[ 1] −0.3575931 0.8436676 0.5139255
This is exactly the same result we got in Markowitz's solution. When we restrict short-selling, we
will arrive at the same 0.10 we got in Markowitz.
Risk Budgeting
A single problem can have a different view of the Markowitz optimization problem. To control
this, we use one of the recent approaches to risk portfolio construction. We construct a portfolio
where the risk construction of all assets is equal. This approach is known as "Risk Parity."
Another portfolio where all the risk contributes the same quota of the total return of the portfolio
would also be created. This type of approach is known as the "Performance Parity Approach."
Assuming its weights portray the portfolio, the risk becomes the function of its weight and is
denoted by R(w). The standard deviation of the portfolio is as follows:

The risk function of this kind of risk is homogenous. This implies that if the size of the portfolio
is doubled, then the risk measures also doubles. This is also known as the homogeneity property
of risk measurement. Homogeneity is one of the coherence in risk measurement explained by
Eber, Artzner, Health, and Delbaen (1999). Once a risk measurement meets the requirements of
homogeneity, the next step is to apply Euler's theory to decompose the risk into the amount given
by each asset.
Let's assume that we define the risk measurement to be the standard deviation of portfolio return;
the risk decomposition would require the measurement of the risk along with all its weight. This
is shown as follows:

We can verify the sum of total risk using this procedure:

Conclusion
In this chapter, we observe the Markowitz problem in Data science and the various packages that
can solve this problem. The next chapter examines Bayes’ theorems and the types of models
under it.
Chapter Six: Bayes Theorem
This theorem deals with coincidence and reality. A very good explanation of the theory is
explicated on Wikipedia http://en.wikipedia.org/wiki/Bayes theorem and a video by Professor
Persi Diaconis's talk on Bayes on Yahoo video. In business, we often encounter questions
bothering on reality and coincidence. A good example of the question is, is Warren Buffet's
investment success a coincidence? How do we answer the question? Do we use our prior
knowledge of the probability of Buffet being able to beat the market, or do we check the
performance of the business over time? It is in answering this question that Buffet rule comes in.
The rule follows from the decomposition of joint probability. Here is the formula
Pr[A ∩ B] = Pr(A|B) Pr(B) = Pr(B|A) Pr(A)
The last two terms in the equation can be restated like this:

The example we would be using is the Aid Test

This is a very intriguing test. Applying the Bayes Theorem implies that if you are diagnosed with
Aid, there is a chance that you don't have the disease; however, if you are diagnosed with not
having the disease, there is a good chance that this is true.
We would use the equation { Pos, Neg} as a positive or negative diagnosis of having AIDS.
While {Dis, NoDis} would represent having or not having the diseases. Taking the US report on
AIDS as a case study, there are over 1.5 million AID cases in a population of over 300 million
people in the USA. This implies that the probability of people with AID in the country is 0.5%.
Since the probability percentage is half a percent, in doing a random test to discover someone
with AID, we would use half a percent probability. The percentage of accuracy is 99%. Our
equation would be as follows:
Pr(Pos|Dis) = 0.99
For those without the disease, the accuracy test is
Pr(Neg|NoDis) = 0.95
In finding out the probability of having the disease when the test says so, we would compute our
confirmation accuracy of the AIDS test using Bayle's Rule
From our calculation above, the chance of having the diseases when the test is positive is 9%.
Now we would calculate the chance of not having it when the test says positive
Pr(NoDis|Pos) = 1− Pr(Dis|Pos) = 1−0.09 = 0.91
The question now is, what is the chance of having the disease when the test says negative? This
is often a worry to some. Using Bayle's theorem, our calculation would be as follows:

From our test, when the test is negative, there is a very slim chance that you might have it, so
there is nothing to worry about.
Correlated Default (Conditional Default)
Bayes’ theorem is very effective for verifying conditional default information. Bond fault
managers are not as concerned with the correlation of defaults in the bond of their portfolio as
much as they are concerned with the conditional default of bond. This means that they are
concerned with the conditional probability of bond. To calculate this, some of the modern
financial institutions already develop tools to obtain the conditional default of firms.
Let's assume that we already know that firm 1 has a default probability P1 = 1%, and firm 2 has a
default probability P2=3%. Assuming the default of both firms is 40% in a year, however, if
either bond default, what is the probability of default of the other conditional on the first default?
Despite the limited information on the firm's probability of default, we can still use Bayes
theorem to define the conditional probability of interest. Here are the steps to calculate this:
define di, i = 1,2. This is the default indicator for the two firms
define di = 1 if the firms default.
define di =0 if the firms did not.
We would note the following in our Bayes application
E(d1) = 1.p1 +0.(1− p1) = p1 = 0.01.
Likewise
E(d2) = 1.p2 +0.(1− p2) = p2 = 0.03.
With Bernoulli distribution, we would be able to determine the standard deviation of d1 and d2.

In the above calculation, p12 is the default probability for the two firms. Our conditional
probabilities would be:
p(d1|d2) = p12/p2 = 0.0070894/0.03 = 0.23631
p(d2|d1) = p12/p1 = 0.0070894/0.01 = 0.70894
From the result of this conditional probability, it can be summarized that once the firm begins to
defect, the default contagion would start getting severe.
Continuous and More Formal Exposition
There is some very significant expression in Bayesian approaches. These expressions are
posterior, prior, and likelihood. These expressions would be explained in detail in this section.
Usually, in standard notation, we are concerned with the parameter of a θ, the mean of a
distribution of some data x. However, in Bayesian theory, we won't only be concentrating on the
value of θ, we would also be exploring the distribution value of θ beginning with some prior
assumption about this distribution. Therefore, we would begin with p(θ); this is referred to as
prior distribution. We then move to the data x and combine our prior distribution value to it to
get the posterior distribution p(θ|x). However, to do this, we are required to compute the
probability of seeing the data x given our prior p(θ). This probability is given by the likelihood
function L(x|θ). Assuming that we already know the variance of our data x as o². When we apply
our Bayesian theory, we would have:

If we assume that both the prior distribution for the mean and the likelihood are normal, then we
would have:
If this be the case, our posterior value would be

When the prior distribution and posterior distribution are of the same form, they are a
"conjugate" with respect to the specific likelihood function. However, if we observe n new value
of x, the new posterior would be:
Bayes Net
Bayes Net is a network diagram that can be used to visualize joint distributions over several
outcomes/events and higher-dimension Bayes problem. The net is a directed acyclic graph (
referred to as DAG). This means that circles are not permitted in the graph.
To understand how Bayes Networks, we would be using an example of economic distress.
Distress can be noticed at these three levels: economy level (E = 1), industry level (I = 1), and
the firm level (F = 1). Economy distress can result in industry distress, but this may or may not
lead to firm distress. The diagram below shows the flow of causality. It is noteworthy that the
probability in our first table is unconditional, but all others are conditional.

In our conditional probabilities, each pair adds up to 1. The channels in the table the arrows in
the Bayes net diagram.
In the first diagram, we would notice that there are three channels in the Bayes net. Channel a
stands for the inducement of industry distress from economic distress; channel b stands for the
inducement of firm distress directly from industry distress. The last channel c stands for the
inducement of firm distress directly from industry distress.
The question that arises from this net is, what is the probability that the industry is distressed if
the firm is in distressed? The calculation for this problem is stipulated below:
Bayes Rule in Marketing
In one of the widest market research campaign: pilot marketing, Bayes showed up in a very easy
manner. Let's assume we have a project with a value x. Now, if the product fails (F), the payoff
is -70; however, if it is successful (S), the payoff is +100. The probability of these two
happenings is
Pr(S) = 0.7, Pr(F) = 0.3
We can easily check that our expected end is E(x) = 49. Assuming we were able to get protection
for a failed product, the protection would be a put option of the real option; it’s worth rate would
be 0.3 ×70 = 21. Since the put option is what saves all the loss recorded by the failed product,
value is the expected loss, condition on loss. This is usually seen as the value of "perfect
information" by market researchers.
However, suppose there is an intermediate choice rather than proceeding with the product launch
after the odds, we would have done a pilot test. Although this is not always accurate, it is
reasonably sophisticated.
The test signal of the pilot test is (T+) or failure (T-). Our probabilities in pilot test would be as
follows:

The pilot test above gives only a valid reading of success 80% of the time. The probability that
the pilot signal gives a positive result can be computed as follows:
Pr(T+) = Pr(T+|S)Pr(S)+Pr(T+|F)Pr(F)
= (0.8)(0.7) +(0.3)(0.3) = 0.65
Negative result can be computed as follows:
Pr(T−) = Pr(T−|S)Pr(S)+Pr(T−|F)Pr(F)
= (0.2)(0.7) +(0.7)(0.3) = 0.35
This would allow us to compute the following:
Now that we have these conditional probabilities, let us re-evaluate our product launch. If the
result of the pilot test is positive, what do we expect of the value of our product launch. This
would be as follows:
E(x|T+) = 100Pr(S|T+)+(−70)Pr(F|T+)
= 100(0.86154)−70(0.13846)
= 76.462
But if the test is negative, the value of our launch is
E(x|T−) = 100Pr(S|T−)+(−70)Pr(F|T−)
= 100(0.4)−70(0.6)
= −2
Now that we know the value of both the negative pilot test and positive pilot test, our overall
value of pilot test would be:
E(x) = E(x|T+)Pr(T+)+E(x|T−)Pr(T−)
= 76.462(0.65) +(0)(0.35)
= 49.70
Without the pilot test, the incremental value over the case is 0.70.
Bayes Models in Credit Rating Transitions
Most times, companies or business organizations are allocated to credit rating classes. Unlike
default probability, credit rating is a more coarse bucket of credit. Also, updating the credit rating
class in the section tends to be very slow. As a result, the DFG models use a Bayesian approach
to develop a model of rating changes that uses contemporaneous data on default probabilities.
Accounting Fraud
Bayesian inference can also be used to detect accounting fraud and audits. When fraudulence is
suspected, an auditor can use a Bayesian hypothesis of fraud to verify past data and assess the
chance that the current fraud situation has been ongoing for a while.
Conclusion
In this chapter, we have examined the main focus and use of the Bayes Model. We examined
Bayesian Net and how we use Bayesian to explain conditional default information. In the next
chapter we examine News Analysis in Data science, algorithms, word count, and more.
Chapter Seven: More Than Words - Extracting Information From
News
This chapter explains in detail the concept of news extracting. Wikipedia defines news analysis
as the measurement of the various qualitative and quantitative attributes of textual news stories.
Some of these attributes are sentiment, relevance, and novelty. Expressing news stories as
numbers the manipulation of everyday information mathematically and statistically.” The chapter
examines the various analytical techniques in news extraction, the various news analytic
software, method, and the sets of metrics that can be used for the assessments of analytic
performance. The outlines that would be covered in this chapter include:
What is News Analysis?
Algorithms
Scrapers and Crawlers
Pre-possessing Test
Term Frequency - Inverse Document Frequency (TF - IDF)
Text Classification
Word Count Multiplier
Metrics
Text Summarization
What is News Analysis
This is an umbrella term that covers a set of formulas, techniques, and statistics used to classify
and summarized public sources of information. It also includes metrics that are used to assess
analytics. The field of News analysis is very broad; it covers aspects such as machine learning,
information retrieval, network theory, statistic learning theory, and collaborative filtering.
However, all these can be broken into three broad categories of news analysis: text, content, and
context.
Text in news analytics entails the visceral aspect of news, i.e., words, phrases, sentences,
document headings and so on. The main purpose of analytics here is to convert text into
information. This action is carried out by these three means:
Signing the text
Classifying the text
Summarizing it into its main component.
During the summarization process, analytic discard text that is not relevant while separating
information that is of higher signal content.
The next layer of news analytic is content. Content works on the domain of text by expanding its
images, text forms ( blogs, emails, pages, etc.), time, formats (XML, HTML), etc. Content
enriches text such that it asserts quality and veracity that can be explored in analytics. For
instance, a blog can be streamed to have a higher-quality than a stock message-board post;
however, when financial information is streamed with Dow Jones, it can have more value than a
blog.
The last layer of news analytic is the context. This is simply the relationship between information
items. This can also refer to the network relationship of news. In exploring the relationship
between context and news analytic, Das, Martinez-Jerez, and Tufano (2005), a clinical study of
four companies examines the relationship of news analytic to message-board postings. Similarly,
Das and Sisk (2005) explore the social networks of message-board postings to find out if the
rules of a portfolio can be created with the network connections between stocks. A good example
of an analytic that functions at all these three levels are Google's PageRank algorithms.
Algorithms have a lot of features; the kernel of these features are context while others are text
and content. Context is the kernel of algorithms because search is the most popularly used news
analytics. However, this depends on the number of highly ranked pages pointing to it.
From our explanation so far, it can be deduced that news analytics is where algorithms and data
meet. This is where tension is generated between the two. This is why there has been a heated
debate on which of the two should be more than the other. This debate was brought up in a talk
at the 17th ACM Conference on Information Knowledge and Management (CIKM '08), Peter
Norvig, Google's director of research, made his preference by stating that it is better to have
more data than algorithms. According to him, “data is more agile than code.” On the one hand,
this might sound reasonable okay, on the other, too much data can make algorithms become
useless thereby leading to overfitting.
When we talk about algorithms and data and which among the two should be more than the
other, this debate made it seems as if there is no correlation or relationship between the two.
However, this is not the case. To start with, news data shares the same three broad classifications
that news analytic has, i.e., text, content, and context. The level of complexity of any of these
three depends on which one is dominant. Generally, in news data, the simplest among the three is
text analysis. The context that applies to network relationships can be quite difficult. For
example, a community-detection algorithm can be very difficult to compare to word-count
algorithms which are very simple, almost naive. The community-detection algorithm has more
complicated memory requirements and logic.
The tension between the two aspects, News data, and news algorithms, is managed and
controlled by domain specificity. This implies the quantity of customization needed to
implement news analytic. It is quite interesting that low-complexity algorithms more domain
specificity than high-complexity ones. For example, the previous illustration we use, community-
detection would need little domain knowledge because it is applicable to a wide range of the
graph. However, this is not the case with word-count algorithms. A word-count algorithm
requires domain knowledge of grammar, lexicon and even syntax. Not only this, political
messages would be read differently and separated from medical messages.
Algorithms
Crawlers and Scrapers
Crawlers are set.of algorithms that are used to generate a series of web pages that may be used to
search for news content. The software derives it's name "crawler" from the way it works. It starts
from some web pages and crawls to others. By this, the algorithms make it choose from the
series of web pages it gathered. The commonest approach to choosing a page out of the
numerous ones gathered is to move from the current page to a page that is linked to hyper-
referenced. Significantly, a crawler uses heuristics to explore the tree from any given node and
used this to determine useful paths among the numerous ones before choosing which ones to
focus on.
Web scrapers download the details of any web page chosen; it may or may not format the web
page for analysis. Virtually all programming language has its own modules used for web
scrapping. The modules contain some inbuilt function that is directed connected to the web.
Once the functions are opened, it makes it easy to download user-specific or crawler-specific
URLs. The popularity of web analysis has made most statistic packages come with its own
inbuilt web scraping functions. For instance, R functions come with its own web scraping
function in its base distribution. Whenever we want to read a page into a vector line, we can
easily download and use a single-line command.
Excel, which is the most widely used spreadsheet, has its own inbuilt web scraping function.
This can be downloaded from the Data ----- GetExternal command tree. Once we download the
web scraping function, it can be transferred into a worksheet and then operated as desired. We
can also set up excel such that it refreshes the content constantly.
Gone are the days when to use web-scraping code; we will need to write it in Java, C, Python or
Perl. Today, we can use tools like R to handle statistical analysis, algorithms, and data. With R,
these three can be written within the same software. This is to say, data science progress daily.

Pre-Processing Text
Often times we think that no text can be dirtier than text from external sources; this is not the
case. Text from web pages is dirtier than text from external sources. Before applying news
analysis on algorithms, they must be first cleaned. The process of cleaning up algorithms before
applying news analysis on them is what is known as pre-processing. The first process to cleaning
algorithms is by using HTML cleanup; this process removes all HTML tags from the body of
messages. Example of these tags include <p>, <BR>&quot,etc. The next cleanup is with
abbreviations. Here we expand abbreviations to their full forms. All abbreviated phrases and
contractions are written out in full. For instance, it's is written out as it is, ain't is written out as
are not, etc. The third cleanup is a negative expression. An expression containing negative words
would mean the opposite of the negative expression. To handle this, we first detect the negative
words such as not, no and never. Then we tag the remaining words in the sentence where the
negative words are used. This would help reverse the meaning of the sentence.
Another significant aspect of pre-processing is the stem. This aspect deals with the root words. In
stem, words are replaced and represented by their roots words. This would make it possible for
the tenses of the words not to be treated differently. There are various types of stemming
algorithms available in a programming language. Popular among these stemming is Porter
stemmer discovered in 1980. Stemming varies from language to language. Hence, it is language-
dependent.
Term Frequency - Inverse Document Frequency (TF - IDF)
This is a scheme used to weigh the usefulness of rare words in a document. The TF-IDF uses a
very easy calculation and does not have any strong theoretical basis. It is simply the importance
of a word (w), in a document (d) in a corpus (C). Since this is a function of the three aspects, we
will write it as TF-IDF(w, d, C), It is a product of term frequency (TF) and inverse document
frequency (IDF).

Frequency is calculated like this

Where d is the number of words in a document, the frequency equation would be rewritten as:
TF(w,d) = ln[f(w,d)]
The above equation is known as Log Normalization. There is another form of normalization
known as Double Normalization. The formula for this is

The formula for inverse document frequency is:

The formula for the score of weight for a given word w in document d and corpus c is
TF-IDF(w,d,C) = TF(w,d)× IDF(w,C)
We will illustrate this using the application below:
When we run this code, here is the result we will arrive at:

The code can be written into a function, after which we then examine the TF-IDF for all words.
These can be used to weigh other words in further analysis.
Word Clouds
You can make a word cloud from this document. It would come out like this:

Text Classification
Bayes Classifier
This is the most widely used classifier today. Bayes Classifier simply takes some part of the text
and then assign it to one of the pre-determined set of category. The classifier is first trained on a
pre-classified initial corpus before it is applied to the text. It is this trained data that produces the
prior probabilities needed for the Bayesian analysis of the text. Next, we applied the classifier to
an out of sample text to obtain the posterior probability of textual categories. The text is then
applied to the category that has the highest posterior probability.
To see how this works, we would use an e1071 R package that contains the function of naive
Bayes. Next, we would use iris data that contain detail of the flower. Then we will take a
classifier to go through the flower data and identify which one among the numerous flower is it.
To list out the set of data loaded on our R package, we would use the following
Next, we call a prediction test to predict a single data or to generate a confusion matrix in this
format:
In the above table, the mean and standard deviation of the table is given. Basic Bayes calculation
would take the following pattern:

F stands for the type of flower, while a, b, c, and d stand for the four attributes of the flower.
Note that we didn't compute the denominator because it is still the same for the calculation of
Pr[F=1|a,b,c,d],Pr[F=2|a,b,c,d],or Pr[F=3|a, b, c, d]
Support Vector Machines (SVM)
This is a kind of classifier technique. It is very similar to cluster analysis but also applicable to
very high+ dimensional spaces. SVM can be best described by taking every text message as a
vector in high-dimension space. The number of data can be taken as similar to the number of
words in a dictionary. As a very simple example, we would use the same flower data set we use
in the naive Bayes.

SMV is very fast and can be used in news analytics.

Word Count Multiplier
Word count is the simplest form of classifier. Every language inference works with words. This
implies that words are the main factor of every language inference. Hence it varies from domain
to domain. FC Bartlett states that Words can indicate the qualitative and relational features of a
situation in their general aspect just as directly as, and perhaps even more satisfactorily than, they
can describe its particular individuality, This is, in fact, what gives to language its intimate
relation to thought processes.”
To get started with a word count classifier, the user would first determine the lexicon of the
words relating to the classification problem being treated. For example, if the text is to be
classified into optimistic versus pessimistic economic news. The user would first want to
separate the lexicon of the bad news from that of the good news. To do this, he or she would
need the use of domain knowledge in designing the lexicon of the words. Hence, unlike the
Bayesian classifier, a word count classifier is language-specific.
If, while counting out the numbers of words in each category, the number of words in each
category exceeds the other, the text message is assigned to the aspect with the highest lexical
counts.
Vector Distance Classifier (VDC)
In VDC, messages are seen as a word vector. As a result, every hand-tagged, pre-classified text
message in the corpus of training becomes a comparison vector. This is called set the Rule Set.
To assign a classification to a text message, it is first compared to the ruleset. Classification is
assigned based on how the ruleset is to the vector space. The measure of proximity is provided
by the angle between the message vector (M) and the vectors in the ruleset (S)

A search engine specifically index page as a word vector. When a search query is presented A
search engine essentially indexes pages by representing the text as a word vector. When a search
query is presented, the vector distance cos(θ) ∈ (0,1) is computed for the search query with all
indexed pages to find the pages with which the angle is the least, i.e., where cos(θ) is the
greatest. Presenting the best-match ordered list is made by sorting all indexed pages by their
angle with the search query.
In news analytics, when using the vector distance classier for news analytics, the classification
algorithm takes the new text sample and then finds the best match by computing the angle of the
message with all the text pages in the indexes training corpus. After this, pages with the same
tags are classified as the best matches. To implement the classifier, all that is required is only
linear algebra functions and sorting out routines readily available in virtually all the
programming environments.
Discriminant-Based Classier
All the classifiers we have examined so far do not weigh words differently. It is either they do
not weigh the words at all, as evidenced with SVM or Bayes classifier, or they weigh some parts
of the words while ignoring the other, as is the case with word count classifiers. Discriminant-
Based Classifier weighs words base on their discriminant value. Among the popularly used tool
for this purpose is Fisher's Discriminant.
In our example, we will take the mean value of each term for each category as = µi. The mean
stands for an average number of times word w appears in a text message of category i. The text
message itself would be index ad j. To evaluate the number of times word w occur in a text
message j of category i, our for this would be mij. The discriminant function can be written as:

We would consider the case we observe previously in this study, the economic evaluation we
grouped into an optimistic and pessimistic group. Let assume the word "dismal" appears once, in
the entire text, the word would be grouped as pessimistic and would not appear in the optimistic
class. The across-class variation of the word is positive, while the within-class variation is zero.
In this kind of situation, the denominator of the equation would be zero. We would conclude by
saying that the word "dismal" is an infinitely-powerful discriminant and should be evaluated with
a large weight in any word-count algorithms.
Metrics
Analytics developed without metrics is incomplete. In every developing analytics, it is important
to create measures that would examine whether or not the analytics are generating classifications
that are economically useful, statistically useful, and stable. However, there are some criteria
every analytic must meet for it to be statistically useful. These criteria would ensure the
classification power and accuracy. When an analytic is both economically useful and statistically
valid, it increases the quality of the analytics. Stability helps an analytic to perform effectively
in-sample and out-of-sample.
Confusion Matrix
This is a classic tool used in assessing classification accuracy. For n categories, the confusion
matrix would be of dimension n × n. The column stands for the correct category of the text,
while the rows represent the category given by the analytic algorithm. For each cell (i, j), the
number of text messages in type j and classified as type i are contained in the cell matrix. The
number of times the algorithm got the correct classification is stated in the cells on the diagonal
of the confusion. When this is sorted out, every other cell is a classification error. The rows and
columns of the classification can only be dependent on each other if an algorithm has no
classification ability. The statistics examined for rejection under the null statistics are as follows:

A(i,j) represents the numbers observed in the confusion matrix, while E(i, j) stands for the
expected numbers when there is no classification under the null. If T(j) stands for the total
column and (Ti) stands for the total across row i of the confusion matrix, then

(n −1)2 can be used to calculate the degree of freedom of the x2 statistics. This statistic is very
easy to calculate and can be used for any n model.
Precision and Recall
Two results can emerge from the creation of the confusion matrix. They are Precision or Recall.
Precision is also known as positive predictive value. This is simply the fraction of identified
positives that are really positives. It is the measurement of the validity of precision. Take, for
instance, we want to find out the number of people on LinkedIn who are looking for a job if our
algorithms find n of these kinds of people while only m are looking for jobs. Our precision value
would be m/n.
Recall, on the other hand, is also known as sensitivity. This is the number of positives that are
truly identified. A recall is the measure of the completeness of prediction. Using our LinkedIn
example, since the value of the actual people looking for a job is m, our recall formula would be
m/n. For instance, let's assume our recall confusion matrix is

For the above confusion matrix, our value for precision is 10/12, while recall is 10/11. This
implies that precision is related to the probability of false positives (Type 1 error). This is one
minus precision. However, recall is related to the probability of false-positive (Type 2 error).
This is simply one minus recall.
Accuracy
The measure of algorithm accuracy over a classification scheme is simply the percentage of text
that is accurately classified. This measurement can be done both out-of-sample and in-sample.
Here is the formula to compute this off our confusion matrix
False Positives
It is better to have a failure to classify than to have an improper classification. For instance, in a 2
×2 scheme, i.e., a two-category n=2, every off-dimension matrix in the confusion matrix is a
false positive. This implies that, when n >2, it means some classification errors are worse than
the other.
Calculating the percentage of false positives is a very important metric to work with. This can be
calculated by dividing the total classification undertaken by the weighted count or simple count
of classification.
Sentiment Error
An aggregate measure of sentiment may be computed once many texts or articles are computed.
This means that aggregation is very useful to cancel classification error.
Sentiment error is simply the percentage of the value we would get when there is no
classification error and the percentage difference between the computed aggregate sentiments.
Correlation
Having examined some of the vital aspects of news analysis, the question that comes to mind is,
how would the sentiment from news correlate with financial time series? Leinweber and Sisk
provide the explanation to this question in their paper published in 2010.
In the paper, they explained crucial differences in cumulative excess returns between strong
positive sentiment and strong negative sentiment days over prediction horizons of a week or a
quarter. Therefore, it can be inferred that the event studied are focused on point-in-time
correlation triggers. The visual correlation metric is the simplest correlation. Here we can see
how the sentiments and the returns track each other.
Phase-Lag Metrics
A unique case of lead-lag analysis is the correlation across sentiments and return time series.
This may be summed up as looking for correlations in the matrix. In simple term, a graphical
lead-lag analysis finds graph pattern across two series and examine if there are any ways pattern
in one time series can be predicted with the other. In other words, is there a way we can use the
sentiment data generated in algorithms to in-stock series. This type of graphical examination is
called the phase-lag analysis.
Economic Significant
We can evaluate news analysis using economic significance as a yardstick. In using economic
significance as a yardstick, we would be asking the following question, do the algorithms help
reduce risk my delivering profitable opportunities? Or does it not? This kind of evaluation would
help us identify a set of stocks that would perform significantly better than the other.
Economic metrics contain a lot of research and performances for news analysis. In fact,
Leinweber and Sisk, in the paper published in 2010, explained that there is exploitable alpha in
news streams. Economic analysis can make use of risk management and credit analysis areas to
validate news analysis.
Text Summarization
Text can be easily summarized using statistics. The simplest form of text summarizer works
more on the sentence-based model used in sorting the sentences in a document in descending
order. When this occurs, the most overlap words are arranged first, then others followed it. For
instance, let's assume an article D has a sentence si,i = 1,2,...,m. In this m sentence, each si
represents a set of words. To summarize the text, we would use the 3 similarity index to compute
each pairwise overlap between sentences.
To get the sentence overlap, we would find the ratio of the size of the intersection of the two
sentences, so and sj, divided by the size of the union of the two sets. Next, the similarity score of
every sentence is computed as the row sum of the Jaccard similarity matrix.

After obtaining the row sum, we sort them out; the summary is the first n sentence based on the
value.

Conclusion
We have explained in detail what news analysis is and how it is carried out. We examined the
vital features of news analysis and the different models that can be used to carry out this analysis.
Also, we examined how errors can be avoided or contained to the barest minimum when carrying
out news analysis. The vital aspect of word-count was explained in detail. In the next chapter, we
look at one of the important models in data science.
Chapter Eight: Bass Model
This chapter explains in detail all there is to know about Base Model. The chapter would cover
the following outlines:
The Bass Model
Calibration
Sales Peak
The Bass Model
The Bass Model is one of the classic models in Marketing literature. This was discovered in
1969 and has become one of the best models for predicting the market share of products that are
newly introduced and even matured products. The main focus of the model is the adoption rate
of a product must follow these two basic conditions:
the propensity of customers to adopt the product without the influence of social
influences
the additional propensity that the product would be adopted because other customers
have.
This is why, at some point in a very good product, the influence of the early adopters get so
strong that it affects or stir others to adopt the product. Usually, this is seen as a product of the
network. However, Frank Bass had already completed all there is to know about the influence of
early adopters on a very good product before the advent of the network effect. That is to say,
product adoption resulting from the influence of early adopters is not necessarily a product of the
network.
The bass model explains in detail how the information of the first few sales of a product can be
used to predict or forecast the product's future sale. Although this model seems to be more of a
marketing model, it can be used to determine the value of a start-up business by analyzing the
cash flow of the business.

The Basic Idea

Here, we would follow the exposition of the Bass Model. Let's take, for instance, that cumulative
probability of a product by a single individual from a time zone of zero to time t is F(t). The
probability of product at time t is the density function f(t)=F(t). Given that there is no purchase
so far, the rate of purchase would be

Modeling this is very similar to how we model the adoption rate of a product for a given time t.
Using the Bass model, this adoption rate can be defined as :

P in the equation can be assumed to be the independent rate of a consumer adopting the product,
while q is the rate of imitation. This is because it modulates the impact of the consumer adopting
the product from the cumulative intensity of adoption F(t).
With our analysis of the p and q of the product, we can use our findings to forecast the adoption
of the product.
Software
Free software can be used to solve an ordinary differential equation. Among the most popularly
used open-source package is the Maxima. This is available for download in many places. Here is
what the basic solution for differential equation in Maxima looks like:
Maxima 5.9.0 http://maxima.sourceforge.net
This was distributed under the GNU Public License. Bug reporting information is provided by
the function bug_report()

Note that the function ¹/ 1−F was processed from the left and not from the right as the software
seems to be working. This is why Maxima would be used to solve the partial fraction results in
simple integral. The result of this would be

The above result is the correct one. Another very simple tool that is effective in calculating
small-scale symbolic calculation is WolframAlpha. This can be downloaded at
www.wolframalpha.com.
Calibration
How do we find out the coefficient of p and q in our previous Bass model? Since we already
have the current sales history of the product, this can be easily fit into the adoption curve. Below
is the formula.to calculate this:
Sales in any period are: s(t) = m f(t).
Cumulative sales up to a given time t are: S(t) = m F(t)
Since we already have the formula, we will go ahead and substitute f(t) and F(t) in the Bass
equation. This would give us:

This can be rewritten as

s(t) = [p+q S(t)/m][m−S(t)]
Therefore:

We will be
using this equation in another example to understand it perfectly. Now let us examine the
ongoing sales for iPhone product as an example. First, we would read our quarterly sales already
stored in a file; after this, we will carry out a Bass model analysis. Next, we will R code to
compute it:
Now we will fit in the model and then plot our actual sales overlaid on the forecast.
Sales Peak
From our calculation so far, calculating the sales peak is very easy. All we need to differentiate
f(t) with respect to t, and then set the result equal to zero. This is shown as follows:
t ∗ = argmaxt f(t)
This is the same as the solution to f¹ (t)=0.
The calculation is very simple, the formula is

Therefore, for the values p = 0.01 and q = 0.2, we will have

Now for our iPhone sales, the computation of the sales peak would give us:
In our calculation, we would observe that the peak happens in half a year. The number of quarter
that passed before the sales peak is 31.
Conclusion
In this chapter, we carried out an extensive explanation of the Bass Model. Also, an explanation
of how to use the Bass Model to determine the future of sales and calculate sales peak in
business. In the next chapter, we examine how dimensions are extracted in Data science.
Chapter Nine: Extracting Dimensions: Discriminant and Factor
Analysis
This chapter covers the analysis of large data sets. It explains in detail all there is to know about
the analysis of large data. We would be using the two common approaches of large data analysis:
Discriminant analysis and Factor analysis. The two data would help us understand the most
important structural components of any big data. In discriminant analysis, for example, we would
be developing models that would help us group population size into two broad components:
males vs. female, immigrants versus indigene and so on. With factor analysis, we would be able
to beat down large data on population into explanatory variables. Here are the outlines that
would be covered in this chapter:
Discriminant Analysis
Notation and Assumption
Discriminant Function
Eigensystem
Factor Analysis
Difference between discriminant analysis and factor analysis
Factor Rotation
Discriminant Analysis
Discriminant analysis is an attempt to explain categorical data by creating a dichotomous split of
observations. For instance, let's assume that we want to split our large business data into two
categories. One category is for the bad creditors, and the other is for the good creditors. In DA,
the bad and good creditors are referred to as dependent variables or criterion variables. The
variable we use in explaining the split in the criterion variable is referred to as explanatory or
predictor variable. We can assume the criterion variable to be the left-hand side variables while
the explanatory variable is the right-hand variables.
The significant property of DA is that left-hand variables are qualitative. This implies that aside
from their numerical value, they are naturally of good qualities. A good example of how DA
works is the admission process of universities and other tertiary institutions. Every university has
a specific cut-off for each department a student might want to apply for. The cut-off mark is what
separates the student that would be admitted from those that won't be admitted. Now, this cut-off
mark is determined with the aid of DA.
In a very simple term, DA is the tool that quantitative explanatory variables are used to explain
qualitative criterion variables. This does not mean that DA only works with two categories. The
number of categories that we use DA is not restricted to just two; it starts from two or more.
Notation and Assumption
Let's assume that there are N groups or categories indexed by i = 2...N.
In each of the N groups, there are observations yj, indexed by j = 1...Mi. Note that the group does
not necessarily need to have the same size.
We have a set of predictor or explanatory x = [x1,x2,...,xK]. There must be a valid reason for
choosing this so that y can have a group where it belongs. Therefore, the value of the kth variable
for group i, observation j, is denoted as xijk.
Observations must be mutually exclusive. This implies that each member of a group cannot
belong to the other group.
Cov(xi) = Cov(xj). that is, the explanatory variable of all group have the same K×K covariance
matrix.
Discriminant Function
The main focus of DA is to find a discriminant function that best defines and separates one group
from the other. The most common approach is to use a linear DA. However, the function might
be nonlinear. The function of the DA takes this formula:

The discriminant weight is the ak coefficients.

In carrying out our analysis, we would need a score for the cut-off C. Take, for instance, N = 2.
This implies that there are 2 groups, the observation would fall into group one if D >C while it
will fall into group two if D ≤ C.
Therefore, the objective function is to select {{ak}, C} in such a way that classification error is
reduced.
The formula C = D({xk};{ak} is the formula of a hyperplane that splits the observation space
into different parts depending on the number of groups we are dealing with. If we are dealing
with two groups, the space of observation would be split into two parts.
Implementation with R
In this section, we would be using data for the top 64 teams in the 2005-06 NCAA tournament
and then implement a discriminant function model on the data. The data for the program is as
follows:
Now we will run some of the command stored in the lda.R on the program

Therefore the first 32 members of the team would form our category 1 (y=1), the last 32 would
form the category 2 (y=0). The result of our discriminant analysis is:

We can extract some useful result as follows:

The singular value decomposition value is contained in the last line. This is also the Fisher's
discriminant level that provides the ratio of the between-and-within group standard deviation on
the linear discriminant variables. The squares of these linear discriminant values are the
canonical F-statistics.
Confusion Matrix
The confusion matrix has been explained previously in this study. It is a tabular presentation of
both actual and predicted values. The following R command would be used to generate the
confusion matrix for the basketball team in our previous example.

In the above command, we would observe that both 5 and 64 have been wrongly classified. To
assess this, we would be computing the x² statistics for the confusion matrix. To do this, we
would first define the confusion matrix as

The above matrix shows some classification ability. However, what happens when our model
does not have any confusion ability? It means that our matrix would have no relationship
between the rows and columns; hence the average number would be drawn based on the total of
rows and columns. Since the total of rows and columns for our program is 32, our matrix with no
confusion ability would look like this:

The total number of squared normalized differences in the cell of an individual matrix is Text
Statistics. The formula for this is:
Splitting into Multiple Groups
If we want to split our NCAA team into groups, for instance, we want to split the group into
four, we simply used the following commands:
Eigen Systems
Here, we will be exploring some components of matrices that would help us in data
classification. To get started, we would first download the Treasury Interest rate date from the
FRED website: http://research.stlouisfed.org/fred2/. This can be assessed in a file named
tryrates.txt. After this, we simply read the file

An M×M matrix A has attendant M eigenvectors V and eigenvalue λ if we can write

λV=AV
Beginning with the A matrix, the decomposition for the eigenvalue would be both V and λ. For
the Matrix M, we would be finding both the eigenvalue and eigenvectors since there is no
explanation or equation for this. Hence we would require that λ=0. Then we would set Matrix A
as the covariance matrix for the rates of different maturities.
Next, we calculate the eigenvectors and the eigenvalues of the covariance matrix. Let's assume
the covariance matrix is the total of the rate of connection between the rates of maturities in our
data set. However, we don't know the number of dimensions present in this data. For each
dimension of commonality, our focus is on the importance of the dimension (eigenvalue) and
the influence of the dimension on each rate ( value of eigenvector). The highest eigenvalue is the
most important dimension. This is also known as "principal eigenvalue" with its corresponding
principal eigenvector. The eigenvalue and the eigenvector are what make up the Eigen pair. This
is why it is called the "eigenvalue decomposition matrix."
Factor Analysis
This is simply the use of eigenvalue decomposition in finding the basic structure of data. When
we have a data set of both observation and explanatory variables, we use factor analysis to
achieve decomposition of these two properties:
First, generate a reduced dimension set of explanatory variables. This is also known as extracted,
derived, or discovered factors. The generated factors must be uncorrelated with each other.
Generate data reduction. This implies that you suggest a limited set of variables. For our set of
variables, each subset is the manifestation of an abstract underlying dimension.
Notation
Original explanatory variables: xik,k = 1...K.
Observations: yi,i = 1...N.
Factors: Fj, j = 1...M.
M<K.
The Idea
We will notice from our observation of the rate data that there are eight different kinds of rates.
Now to model the data of each of the eight data, we would need a separate driver leading to K =
8 underlying factor. However, this would go against the essence of factor analysis. Factor
analysis aims at reducing the number of existing drivers. As a result, we would go with a smaller
value of M < K factors.
The most important focus here is to project the variables x ∈ RK onto the reduced factor set F
∈ RM. This would help us explain most of the variables by the factors. Therefore, what we are
looking for is a relation:
x = BF
B in the equation stands for B = {bkj} ∈ RK×M is a matrix of factor “loadings” for the
variables. With B matrix, we can represent x in smaller dimension M. The entries in matrix B
can either be negative or positive. When the entries are negative, it means that the entries are
negatively correlated with the factors. When it is positive, it is positively correlated. We aim to
use the relation of y to a reduced F to replace the correlation of y to x.
Once the set of factors have been defined, the N observations y can be expressed in terms of
factor through a factor score matrix A ={aij} ∈ RN×M in this form
y = AF
Also, the factor score can either be positive or negative. We have different ways in which
transformation from factors to variables can be carried out. We would observe some of these
ways as we progress.
Principal Components Analysis (PCA)
Principal Components Analysis take each component and view it as a weighted combination of
the other variables. Although this is not how Factor Analysis implementation is carried out, it is
one of the most popular ways in Factor Analysis.
The covariance matrix is the starting point of PCA. The main focus of the PCA is to extract the
principal eigenvector by using the eigenvalue analysis of the matrix. The analysis can be done
using the R command. Below is an example of this action:
The Difference Between FA and PCA
The major difference between PCA and FA is that, for computational purposes, PCA assumes
that all variables are common with all unique factors equal set to zero, while FA assumes that
there is some unique variance. Also, PCA can be taken as a subset of FA. The FA model that is
chosen, determines the level of unique variance. Hence we can summarize by saying the FA
model is an open system while PCA is a closed system. FA factors focus on decomposing the
correlation matrix into both unique portions and common ones.
Factor Rotation
When the factors are rotated, it sometimes makes the variables load better on the factors. Most
times, the factors carry out this function automatically. This kind of action is called factor
rotation.
Here are the steps to factor rotation
Recall that x variables were decomposed as follows:
x =BF+e
Since x is dimension K, B ∈ RK×M, F ∈ RM, and e is a K-dimension vector. It means that
Cov(x) = BB+ψ
Recall that the matrix of factor loading is B. If this is the case, the system remains unchanged if
BG replaces b. Here G ∈ RM×M and G are orthogonal. Next, we call G a “rotation” of B.
Conclusion
Discriminant Analysis and Factor Analysis are two important models that have made the
extraction of big data an easy feat. This chapter has provided an extensive explanation of these
two factors. In the next chapter, we examine an interesting topic in Data science: Auctions.
Chapter 10: Auction
This chapter examines the various types of auction formats we have and the different principles
of revenue maximization and bidding theories associated with the auction. The outlines that
would be covered in this chapter include:
Auction
Types of Auction
How to determine the values of auction
Types of Bidders
Benchmark model
Auction Math
UPA and DPA
Clicks
Auctions
Auction involves some of the oldest market forms today, but are still widely used for marketing
mechanism and selling off assets. Hal Varian, Chief Economist at Google (NYT, Aug 1, 2002),
gave a very interesting definition of an auction, he stated that “Auctions, one of the oldest ways
to buy and sell, have been reborn and revitalized on the Internet.
One of the most popular online computer-managed auction is eBay. Because of its great
advantage and its economic value, it has become relatively popular. An online computer-
managed auction can be used for marketing almost any product.
Features in Auctions
There are various features in an auction, but the most important is the information asymmetric
between seller and buyer. Although the basic assumption in the auction is that the seller knows
more about the product than the buyer, it is not uncommon for buyers to have different
information about the product. This is often because buyers always take note of negative
information from other bidders. In this chapter, we would examine how information asymmetric
plays a great role in bidding products.
Equally significant is the fact that the market mechanism for auction is relatively explicit. This
means prices and revenue are direct consequences of the auction design. In contrast to this, other
market mechanisms are usually more implicit than explicit. A very common example is the case
of commodities. Here a market mechanism is based on demand and supply.
Examples of Auctions
We have various examples of an active auction, some of these include auction of arts and
valuables, Google ad auctions, Treasury Securities, eBay, and even the New York Stock
Exchange. All these are good examples of a continuous call auction. Depending on the product
being auctioned off, an auction can be either single units or multiple. A good example of a single
unit auction is arts, while Treasury Securities is an example of multiple auctions.

Types of Auctions
The main types of the auction include:
English (E), the highest bidder wins. This is an open kind of auction. It is called an open auction
because the progression of bids is revealed to the participants. Generally, the price of products is
in ascending order.
Dutch (D). This is also an open kind of auction. However, product prices in this type of auction
are in descending order. The auctioneer starts from the highest prices to the lowest. The winner
of the bid is usually the first bidder.
Ist price sealed auction (iP). Here, the bid is sealed and not revealed. The winner of the bidder is
the highest bidder.
2nd price sealed bid (2P): this is very similar to the (iP). However, the only difference between
the two is that unlike (iP) where the first highest bidder wins, here the second-highest bidder
wins.
Anglo-Dutch (AD): this type of auction starts off as an open auction but gets sealed when it is
left with only two bidders.
How To Determine The Value Of An Auction
The two most important aspects of an auction are the value and the price. However, the value of
a product to be auctioned can only be determined by the nature of the product. Here are two of
the model to determine the value of a product being auctioned:
Independent private values model: this model states that the individual bidder determines the
valuation of the product. This is very common with an art auction
Common-values model: Here, the bidders aim to discover the common price of the product
being auction. This is because there is usually an after a market where common value is traded.
A good example of this auction model is Treasury Securities.
Bidder Types
The types of the bidder and the assumption made by the bidders about the product determines the
revenue that would be generated from the auction. There are two major types of bidder:
Symmetric: In this type of bidder, the bidders share the same probability distribution of products
and stop-out (SP) prices. Stop-out price means the price of the lowest winning bid for the last
unit sold. This assumption is very good when the competition is high.
Non-symmetric or Asymmetric. This bidder type has different values distribution. This usually
occurs where the market is segmented. A good example is the bidding of firms in the M&A deal.
Benchmark Model (BM)
Benchmark Model is the simplest model that can be used to analyze auction. This model is based
on four major assumptions. The assumptions are explained below:
Risk-neutrality of bidders: this implies that utility function is not needed in the analysis of
auction
Private-values model: here, all bidders are welcome to their own reserved value for the products.
This implies that there is a distribution of bidder's private value.
Symmetric bidders: all bidders the same distribution of product value. This was already
explained in the types of bidders.
Winners' payment is based on bids alone.
Properties and Results of Benchmark Model
D = iP, that is, the Ist price and Dutch auction type are equivalent to bidders. This is because, in
each auction type, the bidder has to choose how high or low he or she would bid without the
knowledge of other bidders
In Benchmark Model, the most important thing is to bid according to how valuable the product is
to you. This is obvious in D and iP because both mechanisms do not entail bidders seeing any
other lower bids. Hence the bidder bids according to how valuable the product is to him or her
and watch if the bid wins. In other mechanisms like the 2P, when you bid too high, you overplay,
and when you bid too low, you lose. The best way to bid is according to how valuable the
product is to you. For the E-auction mechanism, it is advisable to keep bidding until the price
cross your level of valuation.
Equilibrium types:
Dominant: This is a situation whereby bidders bid with respect to their true valuation of the
product, not minding what other bidders are bidding. Satisfied by E and 2P.
Nash: here, bids are chosen according to the best guess of other bidders' bid and hence satisfied
by D and iP.
Auction Math
Now we will move away from the theoretical explanation of auction and apply some auction
equilibrium formula. To start with, F would be the probability distribution of the bids while vi is
the true value of the ith bidder on a 0 and 1 continuum. Let's say that we rank bidders in order of
their true valuation vi. How then do we define F(vi)? Let's take for instance, that the bid is drawn
from a beta distribution F on v ∈ (0,1) so that the probability of a very low bid and a very high
bid is lower than a bid around the mean of the distribution. Our for the expected difference
between the first and second highest bidder v1 and v2 is:
D =[1−F(v2)](v1 −v2)
This implies that the difference between the first and second bids would be multiplied by the
probability that v2 is the second-highest bidder. Or we assume the probability to be that there is a
higher bidder than v2. Now, from the first-order condition, i.e., the sellers' point of view, our
formula is:

Given that bidders are symmetric in BM, v1 ≡ d v2. ≡ d means equivalent in distribution. This
means that:

Since the expected revenue is equivalent to the expected second price, we would rearrange the
equation to get our equation for the second price:
Optimization By Bidders
The main aim of any bidder i is to find out the function/bidding rule B that is a function of the
private value vi such that
bi = B(vi)
In the above equation, bi stands for the actual bidder; when there is any n bidder, we will have

The goal of each bidder is to maximize his or her expected profit in relation to the true valuation
of the product. This is:

Now we are going to invoke the notion of bidder symmetry. The first step to this is to optimize
by taking ∂πi/∂bi = 0. We can only arrive at this optimization formula by first taking the sum of
all derivative of profit relative to the bidder's value like this:

Since ∂πi/∂bi = 0, the partial derivative of profit with respect to personal valuation is reduced.
The partial derivation is taken from this equation:

Next, we will take vi as the lowest bid, then integrate the two former equations to get:

When we equate the formula for the value of expected profit, we would have:

Assuming F is a uniform, we would have:

We will observe that our bid is shaded slightly from our personal valuation. This implies that we
bid less than the true value of the product; this would give room for profit. Bids are increased as
the level. to personal value increase and bidders increase, that is :
Treasury Auctions
Our previous explanation is based on a single unit auction. In this section, we would be moving
from a single unit to one of the most popular multiple units, Treasury auctions. Treasury auctions
are the mechanisms employed by the government and other similar bodies to issue its bills,
bonds, and notes. Usually, an auction is performed on Wednesday. This implies that bids are
received up until the afternoon of the day it is to be auctioned. After which the quantities
requested are supplied to the top bidders until there is no remaining supply of securities. Before
the auction or trade, Treasury auction is referred to as pre-market or when-issued. It is in this
market that bidders get an indication of prices that might result in a tighter clustering of the bid
in the auction.
Treasury auction is made up of two broad types of dealers: the small independent dealers and the
primary dealers. The primary dealers entail investment houses, big banks, and so on. Most times,
the auction is played among the primary dealers. The primary dealers place competitive prices on
the item to be auction while the small independent dealers play with non-competitive prices.
Usually, the value placed on the item being auctioned is based on information about the
secondary market of the item. The secondary market occurs immediately after the primary
market. This implies that the assumed profit the item is likely to attract at the secondary market
is what influenced the bidder's price at the auction. The likely price of the item at the secondary
market is usually gotten from the when-issued market.
The winner at the Treasury Securities often leaves with more regret than pleasure because he or
she is aware that he has bid with more money that is overplayed. In Treasury securities, this
phenomenon is referred to as the "winners curse." Before the auction takes place, the fed
government and other participants in the Treasury Securities try to mitigate the winner's curse.
This is because someone with less propensity of regret would bid at a higher price.
UPA or DPA
UPA stands for "uniform price auction," while DPA stands for "discriminating price auction."
DPA is mostly used and more preferred in Treasury Securities while UPA is only introduced
recently. For DPA, the highest bidder gets his bid quantity at the price he or she bids for. The
next highest bidder gets the same and this continues until the last bidder and the last item. This
implies that in Treasury Securities, each winning is filled at a price, hence bidding price varies.
This is what is known and refer to as discriminating in price.
However, for UPA, the highest bidder gets his or her bid quantity at the stop-out price, i.e., the
price of the last winning. The next highest bidder also gets the same until the Treasury Securities
supply is exhausted. This means that UPA uses single-price auction.
Although DPA tends to yield more revenue, UPA has shown to be more successful. This is
because the winners' curse is mitigated in UPA. All bidders bid the same, unlike DPA were to
win, you would have to pay more than other bidders.
Mechanism Design
To achieve a good auction mechanism, consider the following:
The starting price of the item to be auctioned off.
Is collusion contained to the barest minimum?
Is there a truthful value revelation? This is also referred to as "truthful bidding."
Is the product efficient? that is, the maximization of utility of auctioneer and bidders
Is it too expensive to play?
Fairness to both sides, whether big or small, high or low.
Clicks (Advertising Auctions)
One of the popular program that allows easy creation of advertisements that would appear on
important sites like the Google search result page and other related sites is the Google AdWords
program. Google AdWords is different from the Google AdSense program. Google AdSense is
the one that delivers Google AdWords to other sites. Depending on the type of ad displayed on
the site, Google pays web publishers based on the number of clicks and the number of
impressions gathered by the ad.
In this section, we would be explaining some of the basic features of a search engine
advertisement model using the research paper written by Aggarwal, Goel, and Motwani (2006).
There are three stages of search page experience used by the search engine advertisement
program. These stages include cost per click (CPC), cost by thousand views (CPM) and cost by
acquisition(CPA). Among these three, CPC is the most widely used. Under CPC, we have two
models:
a). Revenue ranking (the Google Model)
b). Direct ranking (the overture model).
The merchant pays the next click price. This price is different from that of the second auction.
However, this statement is not so in revenue ranking, as would be seen in our example.
Asymmetric: there is no incentive to overbid, only to underbid
Iterative: this is a situation whereby a bidder places many bids and watch for the response of
these bids. The reason for this is to uncover the ordering of bids by other bidders. But this is not
as simple as it sounds. In fact, Google often provides the Google Bid Stimulator or GBS so that
sellers can easily use AdWords to figure out optimal bids.
The utility of auctioneers and merchant will be maximized if revenue ranking is true. This is
known as auction efficiency
Innovation: the laddered auction. Randomized weight is attached to bids. This implies that if the
sum of weight is 1, then the type of ranking used is direct ranking. However, if it is CTR, i.e.,
Click Through Rate, then the type of ranking used is revenue-based, revenue ranking.
The following steps highlighted below can be used by merchants to figure out the maximum cost
per click(CPA) of each.
Maximum Profitable CPA. This is simply the margin of profit on the product. For instance, if the
cost price of the product is #200 and the selling price is #300, the profit margin is simply #100.
This is also the maximum amount a seller would pay for the CPA.
Conversion Rate (CR). CR is simply the rate gathered based on the number of times a click
results in a sale. To calculate this, we simply divide the number of sales by the number of clicks.
For instance, if for every 100 clicks, 5 sales are recorded. The CR is 5%.
Value Per Click (VPR). This is simply the CR multiplied by CPA. Using our example, our VPR
is simply 005 × 100 = #5
Determine the profit-maximizing CPC bid. The more the bid reduces, the more the number of
clicks reduce and the more the CPC and revenue reduce. However, this might not affect the
profit because it is possible for the profit after acquisition to rise. We can easily use the Google
Bid Simulator to find the number of clicks expected at every turn. Also, it is important to note
that the price you bid is not the same as the price for a click. This is because it is based on
revenue ranking, i.e., a next-price auction. Hence, the Google model determines the actual price
that would be paid. The equation for profits is :
(VPC − CPC) × #Clicks = ( CPA× CR − CPC) × #Clicks
Therefore, for a #4 bid, the profit would be:
(5 −407.02/154) ×154 = $362.98
Next-Price Auction
The CPC of the next-price auction is based on the price of the click right after a bid is made. This
implies that, if, for instance, the winning bid is for position j on the search screen, the price paid
is that of the winning bid at position j +1.
Laddered Auction
The main idea of a laddered auction is to set the position of CPC as:

The expected revenue to Google by ad impression is the lhs. This model aims to maximize
revenue for Google and at the same time, make the auction system very easy and effective for
merchants. If the result of the laddered auction is a truthful equilibrium, this is a good one for
Google. Note that the weights wi are arbitrary. Hence it is not disclosed to the merchants.
Conclusion
From our explanation so far, it is obvious that auction is still in vogue and not yet sidelined.
Additionally, it takes some mastery and skills to be able to perform effectively at any auction.
Data science is indeed an all-encompassing domain. The next chapter will examine limited
dependent variables.
Chapter 11: Limited Dependent Variables
This chapter examines the different approaches to creating and working with dependent
variables. The chapter covers the following outlines:
Limited Dependent Variables
Logit
Prohit
Slopes
Limited Dependent Variables
Dependent variables are limited when the variables are discrete, binomial, or multinomial.
However, most times, we use the continuous variables for the dependent (y) variable to run a
regression. A good example is when we run regressions on an income of education. Hence we
will need a different approach to run regression on these types of variables.
A unique type of limited dependent variables is discrete dependent variables. Some of the
examples of models that use this dependent variable are Logit and Prohit model. They are often
referred to as qualitative response (QR) models.
A discrete dependent variable often occurs as binary by taking values of {0,1}. When we regress
this, we get a probability model. However, if we just regress from the left-hand side of one and
zero on a suite of right-hand side variables, this could be fit as linear regression. If we should
take another observation with the right-hand side value, for instance, x = {x1,x2,..., xk}we could
use the fitted coefficient to compute the value of the y variables. Except by unusual coincidence,
the value would not be 0 or 1.
In a limited dependent variable, we would also explain the reason for the results in the allocation
of categories. There is also a relationship between limited dependent variables and classifier
models. This is because classifier models focus on allocating observation to categories, in the
same vein some examples of limited dependent variables focus on explaining whether a firm is
syndicated or not, whether a person is employed or not, and whether or not a firm is solvent and
so on.
It is important to note that most times, these fitted values might not even be between 0 and 1 in
linear regression. It means that we could choose a nonlinear regression to ensure that the fitted
value of y is restricted to 0 and 1. After this, we could get a model and fit in a probability. To
achieve this, we use any of the two models mentioned in our explanation, i.e., Logit or Prohit.
Logit
This is also known as logistic regression. This type of model takes the form highlighted below:

Our focus here is to fit in the coefficient {β0, β1,..., βk}. Note that this would be done
irrespective of coefficients (x) ∈ (−∞,+∞), but y ∈ (0,1). When f (x) → −∞, y → 0,and
when f(x) → +∞, y → 1. This model can be rewritten as

Here Λ (lambda) stands for logit.

The model generates an S-shaped curve which can be plotted in this form

The probability that y=1 is the fitted value of y.

Prohit
This is very similar to the Logit except that the normal distribution replaces the probability
function. The equation for the nonlinear equation is as follows:
y = Φ [f(x)], f(x) = β0 + β1x1 + ... βkx
The cumulative normal probably function is represented by the Φ sign. Like the Logit model,
irrespective of the coefficients, f(x) ∈ (−∞, + ∞), but y ∈ (0, 1). When f(x) → −∞, y → 0,
and when f(x) → +∞, y → 1.
Analysis
The two models Prohit and Logit explain how we would set in our binary model, i.e.:
Pr[y = 1] = F(β¹x)
Where x stands for the vector of explanatory variables, β is a vector of coefficient, and F is the
Prohit/Logit function.
y^ = F(β¹x)
y^ stands for the fitted value of a given x and y. In any of these cases, the function of the Logit or
Prohit remains, as we have stated above. Of course
Pr[y = 0] = 1−F(βx)
This model can be expressed in conditional expectation form as
E[y|x] = F(β¹x)(y = 1)+[1−F(β¹x)](y = 0) = F(β¹x)
Slopes
It is very easy to observe how dependent variables change whenever there is a change in the
right-hand side variables in linear regression. This is not so with nonlinear models. To get started
with linear regression, let's recall that the value for y lies within the range of (0, 1). Our concern
now is in how any change in the value of explanatory variables leads to a change in E(y|x). First,
we would take the derivative as:

Now we will compute this as the means of the regressor for each model. The following is the
result of the Logit model:

This can be written as:

This can be rewritten as:

Now using the Prohit model, our result is :

Here the normal density function is φ(.)

Maximum-Likelihood Estimation (MLE)
For the above estimation, using the g l m function is done by R using MLE. To write out this
formula, recall that we already agree that each LHS variable is y = {0,1}. The function is likely
to take this shape:

The log-likelihood would take this form:

While we take this derivative to maximize the log-likelihood

This gives us a system of the equation that can be used to solve β. Likelihood equation is a
collective name for the system of first-order conditions. To get the t-stat for a coefficient, we
simply divide its value by its standard deviation. The standard deviation is gotten from the
answer to the question, how does the coefficient set β change when the log-likelihood changes?
Our interest is in ∂β/∂lnL. The reciprocal of this has already been computed above. Next, we
define:
g = ∂lnL /∂β
After this, we define the second derivative. This is also known as the Hessian matrix.

Note that the following equation are valid:

E(g) = 0 (this is a vector)
Var(g) = E(gg)−E(g)2 = E(gg)
= −E(H) (this is a non-trivial proof)
Next, we call
I(β) = −E(H)
the information matrix. This is because heuristically, the variation in log-likelihood with changes
in beta is given by Var(g) = −E(H) = I(β). The inverse of this gives the variance of β. Hence, we
will have:
Var(β) → I(β)−1
To get the t-statistics, we divide the value of β by the square root of the diagonal of the matrix.
Multinomial Logit
To use this package, we would need the nnet package. This can take the following form:

Then we set
Limited Dependent Variables in VC Syndication
It is indisputable that not all venture-backed firms would end up making a successful exit either
through a buyout, an IPO, or through another exit route. Here we would be measuring the
probability of a firm making a successful exit by examining a large sample of firms. Hence, a
successful exit would be designated as S = 1 while an unsuccessful exit would be S = 0. We
would fit a Prohit model to the data by using matrix X of explanatory variables. Next, we define
S to be based on a latent threshold variable S ∗ such that:

And the latent variable is modeled as

The probability of exit, which entails the E(S) for all financing rounds, is provided by the fitted
model.

Using the standard likelihood method, the vector of coefficient fitted in the Prohit model is γ.
The cumulative normal distribution is represented by Φ(.).
Endogeneity
Assuming we want to look at the impact of syndication in a successful venture. Success in a
syndicated venture is a product of two broad aspects of VC expertise. To start with, syndicate are
very effective for picking good firm while VC is very effective for selecting a good project to
invest in. VC is a selection hypothesis discovered by Lerner (1994). Since the process of
syndicate entails the derivation of a second opinion by the VCs, this means that a syndicate
provides evidence that the project is a very good one. Aside from this, the syndicate can be used
to provide detailed monitoring as a result of its ability to bring a wide range of skills to the
venture.
A dummy variable for syndication permits a firsthand estimation of whether or not syndication
has any impact on performance, while a regression of variables allows for a return on different
characteristics of the firm, although it can be said or assumed that syndicated firm is large of
higher performance capacity, irrespective of whether they chose to syndicate or not. VC tends to
prefer better firms that might likely syndicate. Not only can this VC identify these firms. In this
kind of situation, the added value from syndicate is revealed by the coefficient of the dummy
variables, especially when there is no value. As a result, we first correct specifications for
endogeneity. Next, we check whether or not dummy variables are significant.
The correction specification that would be adopted for endogeneity is the one suggested by
Greene (2011). The required model would be briefly summarized as follows. However, before
then, here is the performance regression:

For the above equation, Y stands for the performance variable while S is the dummy variable
that takes the value of 1 if there is syndication, if there is no syndication, the value is zero. δ is
the coefficient that reveals the differences in performance often caused by syndication. If there is
no difference in performance, it means that there is no difference in performance across the two
firms or that the X variables are enough to explain the differences in performance across the
firms.
However, since this is the same variables that determine whether or not there is syndication, it
means that we have an endogeneity issue. This would be resolved by adding some corrections to
the above model. The sign € standing for the error term would be affected by our corrections.
When the firm is syndicated, and our value for S is 1, then the adjustment in the € sign would be:

where ρ = Corr(,u), and the standard deviation of € is σ€. It means that

For firms that are not syndicated, our result would be:
This can be estimated by linear cross-sectional regression as:

The estimation model would take the form of both the none syndicated equation and the cross-
regression model. If this is the case, β would be forced to remain constant in all the firms without
initiating any additional constraint. Hence, the specification would maintain the same OLS form.
However, if after the endogeneity correction, δ remains constant, this supports the hypothesis
that syndication is an initiator of differences in performance. If the coefficients {δ, βm} remain
significant, then for each syndicate performance round, the expected differences would be:

The method explained above is one of the best ways to address the treatment effect. Another
effective way to do this is first to use a Prohit model and then set m(γX) = Φ(γX). This approach
is what is referred to as an instrumental variables approach.
Endogeneity - Some Theories to Wrap Up
This is a situation that arises as a result of the correlation of error in terms of regression and
independent variables. Endogeneity can be highlighted as:

This can happen in the following listed ways:

Measurement error: this occurs when X is measured for error. In such a situation, we have X =
X+e. The regression formula is, therefore:

Hence we have

Omitted variable: assuming the equation for the true model is:

But the problem now is that we don't X2, which is a correlate of X1. This implies that in the error
term, we will no longer have: E(Xi · u) = 0, ∀ i.
Simultaneity: This is a situation where both Y and X are determined jointly. A good example is
the joint use of high way and high school because they go together. The structural setting of this
kind of situation can be highlighted as:

When we try to solve this equation, what we get is a reduced-form version of the model:

Next, we compute the above and get:

Conclusion
This chapter has extensively covered all there is to know about limited Dependent
variables. In the next chapter we’ll cover Fourier Analysis and Network Theory.
Chapter Twelve: Fourier Analysis And Network Theory
This chapter would cover the following outlines:
Fourier Analysis
Fourier series
Solving the coefficients
Complex Algebra
Courier Transform
Fourier Analysis
This analysis entails numerous different connections between infinite series, vector theory,
complex number, and geometry. Different applications such as fitting economic time series,
wavelets, pricing time series, and generating risk-neutral pricing applications can be carried out
using Fourier analysis.
Fourier Series
Introduction:
These are series used to determine periodical time series by combining sine and cosine waves.
The time it takes one cycle of the wave is called "period" T, while the number of cycles per
second is the "frequency of the wave" f. The formula for this is:
f = 1/T
Unit circle
We would be using some basic geometry to explain this

In our circle above, if a=1, the circle is the unit circle. There is a relationship or link between the
unit circle and the sine wave. To understand this, we would plot another diagram:

In the second circle, the height of the unit vector on the circle traces out the circle as we rotate
through the angles. For radius a, we would arrive at a sine wave with aptitude a. This can be
written as:
f (θ) = asin(θ)
Angular Velocity
Velocity is simply the distance per time in a given direction. In angular velocity, distance is
measured in degree, that is, the degree per unit of time. Usually, angular velocity is represented
by the symbol w. The formula can be written as :

The function in the first equation can be written in time as :

f (t) = asinωt
Fourier Series
As already explained in the short introduction, a Fourier series is simply the collection of both
sine and cosine waves. When these two are summed up, it would closely approximate any given
waveform. Fourier series can be expressed in terms of both sine and cosine wave as :

We would need the a0 because the waves may not be asymmetric around the x-axis.
Radian
The figure below defines the angle of a radian. Degrees are presented in units of radians.

The angle in the above diagram is a radian. This is approximately 57.2958 degrees. This is a bit
lower than the 60 degrees expected of an equilateral triangle. Note that since the circumference is
2πa, 57.2958π = 57.2958×3.142 = 180 degrees. Hence for the unit circle, we would have:

The Fourier series can be rewritten as:

Our next action is how to solve the coefficient.
How to solve the coefficient
The first thing to note is that sine and cosine are orthogonal. Hence, their inner product is zero.
Therefore we will have:

This implies that when we multiply two waves and then integrate the resultant wave from 0 to T
unless the two waves have the same frequency, our result would be zero (0). With this, the way
we solve the coefficient of the Fourier series is highlighted below. We integrate both side of the
equation above from 0 to T to arrive at this:

Except for the first term, all other terms are zero. This implies that for the terms we arrive at
zero, the sine and cosine are integrated above the cycle. Hence we arrive at :

Let's try another integral:

Except for the terms in a1 cos(ωt) cos(ωt), all other terms are zero. This is because we are
multiplying two waves with the same frequency. Therefore we would get:

How do we arrive at the above?

Note that integrating cos(ωt) over one circle would be equals to zero for unit amplitude. When
we multiply cos (ωt) by itself, we flip all the wave segments from below the zero lines to above
the zero lines. With this, half the area from 0 to T is filled by the product wave. Hence, we get
T/2. Therefore:

We can use this method to solve all an all we do is multiply by cos (nωt) and then integrate. We
can also use this to solve bn. We simply multiply by a sine (nωt) and then integrate. This forms
the basis of the result of Fourier series coefficients highlighted below:
Complex Algebra
Recall that :

This generated the popular Euler's formula:

Also recall that, cos(−θ) = cos(θ), and sin(−θ) = −sin(θ). Also note that if θ = π, then

This can be written as

And as an equation that entails five major mathematical constants and three operators. These are
i,π,e,0,1}, and {+,−,=}.
From Trigs to Complex
Using the last two equations above, we would arrive at this:

Going back to the Fourier series:

How then do we solve this? Start with:

Then

How you get rid of a0

The first thing we would do here is to expand the first summation n =0; we would arrive at :

The expression can be written as:

Collapsing and Simplifying
Our focus here is to collapse the two terms. First, let's take note of the following:

When we apply this idea, we will get:

All we have to do is rename An to Cn for clarity. The big turnout of this process is that we have
been able to contain {a0, a bn} into one coefficient set Cn. We will write the following for
completeness sake:
Fourier Transform
With this technique, we can go from the Fourier series, which uses a period T to aperiodic
waves. This is simply to let the period go to infinity. This implies that the frequency gets very
small. To do our analysis, we will substitute f(t) with g(t). This is because we now need to use f
or ∆f to denote frequency. Recall that:

This can be alternatively written in frequency terms as:

Next, we substitute this into the formula of g(t) and get:

Taking the formula of limits as:

This gives a double integral

df stands for frequency domain while dt stands for the time domain. Thus the Fourier transform
moves from the time domain to a frequency domain.

Inverse Fourier Transform moves from the frequency domain to the time domain.

Fourier coefficient is as follows:

We would note that there is an incredible similarity between transform and coefficient. Take note
of the following:
The coefficients give the amplitude of each component wave
The transform gives the area of a component wave of frequency f. This can be seen because the
transform does not have the divide by T in it.
For every frequency F, the transform gives the rate of occurrence with that frequency, relative to
other waves.
The Fourier breaks a complex aperiodic wave into simple periodic waves.
Application To Probability Functions
Characteristic functions
The expectation of the following function of F gives the characteristics function of x

where f(x) is the probability density of x. By Taylor series for eisx we have:

Where mj is the jth moment. It is therefore easy to see that:

Network Theory
Network science is gaining a group in the business world today. The term "network effect" is
widely used in conceptual terms to describe the gain from piggybacking on connections in the
business world. The steady rise in the effect of the network on business today is only the tip of
the iceberg. As the cost of the network and its analysis drops rapidly, business organizations
would make use of them more and more.
Network theories are also used by firms to find communities of consumers to partition and drive
traffic in their marketing efforts. Networks are also used to understand how information flows in
Network.
Graph Theory
This is the first stage in understanding how network theory works. This is why a business student
is usually taught a digression in graph theory in other to understand how a network works.
The graph is simply a picture of a network. The picture consists of the relationship between
entities. The entities are called nodes or vertices ( set V), while the relationship is called the
edges of a graph (set E). Therefore a graph G is described as :
G =(V,E)
There are two basic types of graphs. When the edges e ∈ E of a graph are not tipped with
arrows signifying some causality or direction, this type of graph is called an “undirected” graph.
However, if the graph is tipped with direction, it is called a “directed” graph. When the
connections (edges) between vertices v ∈ V have weights on them, the graph is called a
“weighted graph.” However, when the graph has no weight on them, it's “unweighted.” For any
pair of vertices (u,v) in an unweighted graph, we have:

However, the value of w(u,v) is unrestricted in a weighted graph. This can also be negative.
A directed graph can either be cyclic or acyclic. For cyclic graphs, there is a path from a source
node leading back to the node itself. This is not the case with acyclic graphs. Direct acyclic
graphs are represented with the term "dag."
Moreover, a graph can be represented by its adjacent matrix. This is simply the matrix A =
{w(u,v)}, ∀ u, v. We can also take the transpose of the matrix. In the case of a directed graph,
this would simply reverse the direction of all the edges.
Features of Graphs
There are various features of a graph. These include the number of graph nodes and the
distribution of links across nodes. The edges, which are the links and the structure of the nodes,
determine the extent the nodes are connected, and the importance of individual nodes. This also
determines the flow of networks.
A simple bifurcation of graphs suggest two types:
Random graph
Scale-free graph
These two graphs are portrayed in an article in the Scientific American written by Barabasi and
Bonabeau (2003).

The random graph can be plotted by putting in place some sets of n nodes and then connecting
pairs of nodes randomly with some probability p. The higher the connected probabilities, the
more the edges contained in the graph. The distribution of the number of edges in each graph
will either be more or less Gaussian because there is a mean number of edges (n.p) with some
range around the mean. The left graph in the above figure is an example of this. In the graph, the
distribution of links is shown in bell-shapes. When a number d gives the number of links of a
node, the distribution of nodes in a random graph would be f(d) ∼ N(µ,σ2), where µ is the mean
number of nodes with variance σ2.
The structure of a scale-free graph is a hub and spoke. Most nodes in the graph have very few
links. However, there are some nodes with a large number of links. In our graph, the distribution
of links is shown on the right side. This is not bell-shaped at all, rather it is more exponential.
Although there is a mean for this distribution, this is not representative of either the hub-nodes or
non-hub nodes. Since the mean is not representative of the population, the distribution of links in
this type of graph is scale-free. The network of this type of graph is also known as a scale-free
network.
The distribution of nodes in a scale-free graph is often approximated by a power-law
distribution, i.e., f(d) ∼ d−α, where usually, nature seems to have stipulated that 2 ≤ α ≤ 3, by
some curious twist of fate. The log-log plot of this distribution is linear.
The majority of networks in the world today aim to be scale-free. The reason for this is explained
in the article by Barabasi and Albert (1999). To explain this, they developed the Theory of
Preferential Attachment, which stated that as network progress and new nodes are added to it, the
new nodes usually attach itself to existing nodes that have most of the links. As a result,
influential nodes gain more connection and evolves into a hub and spoke structure.
The structure of the graph also determines some of the properties of the graph. For instance, a
scale-free graph performs excellently in the transmission of information, and in moving air
traffic passengers. This is why airports are arranged in this format. A scale-free network is also
very good for random attacks. If, for example, a terrorist attacks an airport, unless it hits a hub,
the damage is usually minimal.
In the rest of this chapter, we would examine financial network risk and many more interesting
graphs.
Chapter 13: Searching Graph
In the previous chapter, we provided some introduction to Graph Theory in data science; in this
chapter, we would be exploring the two broad types of searching graphs. These include depth-
first-search (DFS) and breadth-first search (BFS). The reason we are concerned with this is that
DFS, i.e., depth-first-theory is very good at searching communities in social networks while BFS
works well with finding the shortest connections in networks. This chapter would cover the
following outlines:
Depth-First-Search
Breadth-First-Search
Strongly Connected Components
Dijkstra's Shortest Path Algorithm
Degrees Distribution
Diameter
Fragility
Centrality
Communities
Modularity
Depth-First-Search
This begins by taking a vertex and use this to produce a tree of connected vertices recurring
downward until there is no way to do this again. Here are the algorithms for DFS:

From this algorithm, we can generate two subtrees as follows:

The numbers of nodes show the sequence in which the nodes are accessed by the program.
Usually, the output of a DFS is less detailed and presented in a very simple sequence in which
the nodes are first visited. A good example of a DFS is the graph below:
We would notice that the DFS graph is a set of trees. The tree itself is a special kind of graph. It
is inherently acyclic when the graph is acyclic. This implies that a cyclic graph would have the
DFS trees at the back edges. This process can be interpreted as the partitioning of vertices into a
subset of connected groups.
In applying this to business, it is necessary first to understand why they are different. Secondly,
the ability to target these separate groups to different business questions and responses. Firms
and business organizations that make use of these types of data use algorithms to find out
"communities." Within communities, BFS is then applied to determine the connection of
networks and the nearness of these connections.
Also, DFS can be used to find out the connectedness of the networks. With the use of DFS, we
can determine how close entities are to each other in a network. These analyses often suggest a
"small world's phenomenon or what is colloquially referred to as " six degrees of separation."
Our next focus is to examine how DFS is implemented in the igraph package. We would be
making use of this process all through this chapter. Our example below shows how a paired
vertex list is used to create a graph.

To verify our result, we would be plotting a graph

Breadth-First-Search
From a source vertex s, BFS explores the edges E to find out all the reachable vertices on the
graph. This is carried out in a manner that proceeds to find a frontier of vertices k distant from s.
The search moves on to locating vertices K + 1 away from the source after it has located all such
vertices. The basic difference between DFS and BFS is that while BFS covers all the vertices
while DFS goes all the way for without covering all the vertices at a single search.
To implement BFS, we first label each with its distance from the source vertex. An example of
BFS is the graph below.

.
In the above graph, it is easy to determine the nearest graph. Now, when a positive reaction is
gotten from someone in the population, this would help target the nearest neighbor cost-
effectively. This is simply done by defining the edges of connections. The algorithms for BFS is
as follows:

BFS can also produce a tree, the level of the tree is determined by how close or far it is from the
source vertex.
Strongly Connected Components
The best place to cluster the members of a network is in a directed graph. This is done by finding
the strongly connected components on the graph. Hence, an SCC is a subgroup of the vertices
U ⊂ V in a graph with the properties for all pairs of its vertices (u,v) ∈ U, both vertices are
reachable from each other. Below is an example of a graph broken down into its strongly
connected components:

SCC is very useful for partitioning a graph into tight units. Not only this, but it also generates
local feedbacks. This implies that when a member of SCC is targeted, all the members of the
SCC components are targeted, and the stimulus is moved across the SCC.
igraph has emerged as the most popular packages for analysis graphs. It has versions in Python,
C, and R. This package can also be used to plot and generate the random graph in R.
Dijkstra's Shortest Path Algorithm
This is one of the most widely used algorithms in theoretical computer science. When a source
vertex is given on a weighted, directed graph, the algorithms find the shortest path to all other
nodes from source s. w(u, v) denotes the weight between two vertices. Dijkstra's algorithm
works for graphs where w(u,v)≥0. For negative weights, the Bellman-Ford algorithm is used.
Below is the algorithms:

Below is an example of a graph that Dijkstra algorithms have been applied to.
Degree Distributions
The number of links a node has to other nodes in a network is the degree of a node. The degree
distribution is the probability of distribution of the nodes. In a directed network, there are two
types of degrees. One is for in-degree, and the other is for out-degree. However, in an undirected
network, the degree distribution is simply the number of edges contained in a node. It is
important to note that the weight of the edges does not play any role in computing the degree
distribution of the nodes. Although there are times when there would be a need to avail of this
information.
Diameter
This is the longest shortest distance between any two nodes across all the nodes. It can be
computed as follows:

We can cross-check this using the following command:

Note that the paths that are of length 7 are 18 in number. However, this is a duplicate. Thus we
run these paths in the two directions. This would give us 9 pairs of nodes that have the longest
shortest distance between them.
Fragility
This is simply a quality of a network based on its degree distribution. The question that arises
from this is, in comparing two networks of the same degree, how do we assess on which network
contagion is more likely? This can be the first finding if the network is a scale-free network. This
is because a scale-free network tends to spread the variable of interest, irrespective of whether it
is a flu, information, or financial malaise. Also, in a scale-free network, the greater the
preponderance of central hubs, the greater the probability of networks. This is because a few
nodes already have a concentration of the degree. Hence, the higher the concentration, the higher
the scale-free, and the higher the fragility.
To measure concentration, the economist has been using a unique package for a long time. This
package is the Herndahl-Hirschman index. The index is quite technical to compute because it is
the average degree squared for n nodes, i.e.,

The more degrees get concentrated on a few nodes, the more the metric H increases, keeping the
total degrees of the network constant. For instance, let's assume we have a graph of three nodes
each with a degree {1, 1, 4} versus another graph of three nodes {2,2,2}. The value for metric H
would increase in the former than the latter. The former would have H = 18, while the latter
would have H = 12. To calculate the fragility, we simply normalize H by the average degree.
Here is the formula for this:

In the three nodes example we are using, the fragility is simply 3 and 2 respectively. Other
normalization can be chosen too, for instance, the denominator E (d)². To compute this is not
simply, as it requires a single line of code.
Centrality
This is the property of vertices in a network. Taking the adjacent matrix A = {w(u, v)} we went
ahead and generated a measure of the "influence" of all the vertices in a network. Taking the
influence of the vertex i as xi. The influence of each vertex is contained in column x, what then
is the influence? To answer this question, let's take some moment to observe the web page. The
more the links on the web page, the more the influence from its main page to other pages. This
shows that influence is interdependent.
x=Ax
We can simply add a scalar to this to get:

When we add a scalar, we get an eigensystem. When we decompose this, we get the principal
eigenvector. The value of this gives us the influence of each member. With this method, we can
find the most influential network. There are numerous applications to this data to real data. This
eigenvector centrality is exactly what Google trademarked as PageRank, even though they did
not invent eigenvector centrality.
Other concepts of centrality are "betweenness." This is simply the proportion of the shortest path
that goes through a node in relation to the other paths that go through the node. The formula for
this is :

Here, na,b is the number of shortest paths from node a to node b, and na,b(v) are the number of
those paths that traverse through vertex v. Below is an example:
Communities
These are simply the spatial agglomeration of the vertex that tends to connect with each other
than with others. To identify these agglomerations is a cluster detection problem, a
computationally difficult (NP-hard) one. This is so because we allow each cluster to have
different sizes. This, in turn, permits porous boundaries such that members both within and
outside their preferred clusters. The solution to this is where communities come in.
This is simply constructed by optimizing modularity. Modularity is a metric of the differences
between the number of within-community construction and the expected number of
constructions. Because of the large computational complexity involved in sifting through all
possible partitions, identifying communities is not an easy feat.
The whole idea of community formation started with Simon (1962). In his view, he explained
that complex systems with several entities usually have coherent subsystems, or communities,
that serve specific functional purposes. To understand the functional forces underlying these
entities, it is important to identify communities embedded in larger entities. To understand this,
we would be examining more definitions of the community detection method.
The community detection method is the method in which nodes are partitioned into clusters with
the tendency to interact with each other. Hence, all nodes cannot belong to the same community;
neither can we fix the number of the community at a time. Also, we would allow each
community to have different sizes. Having beaten our partitioning into a more flexible task, our
challenge now is to find the best partition because the number of possible partitions is very large.
However, since the community detection method aims at identifying clusters that are internally
tightknit. This is the same as finding a partition of clusters to maximize the observed number of
connections between cluster members minus what is expected conditional on the connections
within the cluster, aggregated across all clusters. Therefore we will go for partitioning with high
modularity Q.

In the above equation, Aij is the (i, j)-th entry in the adjacency matrix. This implies that i is the
number of connections in which i and j are jointly connected. The total number of transactions or
the degree of i that node i participated in is di = ∑j Aij. While m = 1 2 ∑ij Aij is the sum of all
edge weights in matrix A. When nodes i and j are from the same community, the function δ(i, j)
is an indicator equal to 1.0. However, when they are not, the function is zero. Q is bounded in
[-1, +1]. When Q > 0, it implies that intra-community connections are more than the expected
number.
Modularity
To understand this, we would use a very simple example before exploring the possible different
interpretations of modularity. The calculation that would be adopted in our example is based on
the measure given by Newman (2006). Also, since we have been using the igraph package in R,
our codes to compute modularity would be presented with this package.
To start with, let's assume we have a network of five nodes {A,B,C,D,E}, and the weights of the
edges are: A : B = 6, A : C = 5, B : C = 2, C : D = 2, and D : E = 10. Let's assume that a
community detection algorithm assigned {A, B, C} to one community and {D, E} to another.
This implies that we have only two communities. The adjacent matrix graph for this would be:

Now, let's first detect the communities:

This can be carried out with another algorithm called the "fast-greedy" approach
The Kronecker delta matrix that examines this community detection would be:

The modularity score would be:

Here, the sum of the edge weight in the graph is m = 1/2 ∑ij Aij = 1 2m·δij (13.4) 2 ∑i. Aij is the
(i, j)-th entry in the adjacency matrix. This implies the weight of the edge between nodes i and j,
while the degree of node i is di = ∑j Aij. The Kronecker's delta is the function δij. When the
nodes i and j are from the same community, δij takes the value 1. However, when they are not
from the same community, δij takes value zero. Matrix Aij − di× dj 2m is the center of this
formula. It entails the modularity, which produces a score that increases when the number of
connections within a community is more than the expected proportion of connections if they are
assigned at random depending on the degree of each node. The score takes a value ranging from
−1 to +1 as it is normalized by dividing by 2m. When Q > 0, it simply means that the number of
connections within communities is more than that between communities. Below is the program
code that takes in the adjacency matrix and delta matrix:

Now, we would be computing modularity in the R package by using a canned function. To do

this, we would first enter the two matrices and then call the function. We would also ensure that
we get the same function as the formula above:
Next, we would be repeating the same analysis we did with the first program. The aim of this is
to show how we can detect communities using walk trap algorithms. Then we will use this
community to determine how modularity is computed. The first step to achieve this is to convert
the adjacent matrix into a graph so that the community detection algorithms can use it.

We would notice that in the above program, the algorithms have separated the first three nodes
into one community and the last two nodes into another community. The size variable above
shows the size of each community. The next thing to do now is computing the modularity.

This is a confirmation of the value we arrived at when we used the implementation of the
formula.
Conclusion
This chapter has extensively covered the rudimentary aspects of a graph in data science. The next
chapter examines the Neural Network.
Chapter 14: Neural Networks
In this chapter, we would be treating one of the commonest nonlinear regressions. So far, what
we have been concentrating on are linear regressions. This chapter provides a thorough analysis
of nonlinear regression through the exploration of Neural Networks. The outlines that would be
covered in this chapter include:
Overview of Neural Networks
Nonlinear Regression
Perceptions
Squashing functions
Research applications
Overview of Neural Networks
These are some of the forms of nonlinear regression. Recall that in linear regression, we have:
Y =X¹β+e
Here our value for X ∈ Rt×n, while the solution for regression is simply equal to β = (X¹X)
−¹(X¹Y).
To get this, we simply minimize the sum of squared error:

When we different w.r.t. β would give us the following f.o.c:

We would be examining this using the markowitzdata.txt data set.

We would observe that, at the end of the program listing, our formula for the coefficients of the
minimized least-squares problem β=(XX)−1(XY) matches what we got from the regression
command lm.
Nonlinear Regression
This type of regression takes this form:
Y = f(X;β) + e
In the above formula, f(.) is a nonlinear function. Computing the coefficient in the nonlinear
regression is exactly the same as computing in linear regression.

When we differentiate w.r.t. β would give us the following f.o.c:

Perceptrons
Neural networks are unique types of nonlinear regressions. This is because, in this type of
nonlinear regression, the decision system for which NN is built imitates the same way the human
brain works. This implies that the decision system of the Neural Networks works in a perceptive
manner. Hence, the basic building block of NN is perceptron.
In Neural Networks, a perceptron is similar to the neuron in the human brain. Like the human
brain, it first takes the input like the sensory in a real brain, and they produce an output signal.
The entire network of perceptron in Neural Networks is called a neural net.
For instance, let's assume we want to carry out an application for a credit card. This would
include that we provide some of our personal information like age, business, location, sex, and so
on. This information is passed to a series of perceptrons in a parallel form. The layer of the first
perceptron is the first "layer" of assessment. Each of the perceptrons then produces an output
signal that might be sent to another series of the perceptron to work on; these second series also
emit another output signal. The second layer of output is referred to as the "hidden" perceptron
layer. After running a series of hidden layers, all the signals are then sent to a single perceptron;
this last perceptron produces the signal for deciding whether or not you are qualified for a credit
card.
From our explanation, it can be inferred that perceptron may emit continuous signal or binary
(0,1) signals. In our credit card example, the perceptron of a signal is a binary one. A binary
perceptron is implemented by means of "squashing" functions. A very simple squashing function
is the one that issues a 1 if the function value is positive and a 0 if it is negative. The formula for
this is highlighted below:

In the above equation, g(x) represent any function taking the negative or positive value.
When a neural network contains many layers, it is known as a “multi-layered” perceptrons.
While all the perceptrons together are referred to as a big, single perceptrons. Below is an
example of this kind of neural network:
Neural net models are very similar to Deep Learning. In deep learning, the number of hidden
layers is significantly higher than what was usually arrived at in the past when computational
power is generally limited. Today, deep learning nets has risen to above 20-30 layers. This
resulted in the unique ability of neural nets imitating the same process the human brain works.
Most times, Binary NNs are seen as a category of classier systems. This is because, as a classifier
system, they are often used to divide members of a population into different classes. Aside from
Binary NNs, NNs with continuous output are fast becoming popular.
Squashing Functions
This is more general than the binary. In simple terms, squashing function is a process whereby
the output signal is squashed into a narrow range, usually (0,1). A very common choice of
squashing function is the sigmoid function, popularly known as a logistic function. The formula
for this is

Where w is the adjustable weight. Another very common choice is the Probit function:
f (x) = Φ(w x)
where Φ(·) is the cumulative normal distribution function.
How do NN works?
The simplest way to see how NN works are to observe the simplest NN. This is simply an NN
with a single perceptron producing a binary output. The perceptron has n inputs, with values xi,i
= 1...n and current weights wi, i = 1...n. It generates an output y. The “net input” is defined as:

The output signal is y = 0 when the net input is greater than a threshold T. However; this is not
the case is if it is less than T if it is less than T, the output is y = 0. The actual output is referred
to as the “desired” output and is represented by d = {0,1}. Hence, the “training” data provided to
the NN comprises both the inputs xi and the desired output d. The output of our single perceptron
model will be the sigmoid function of the net input, which is

The error in the NN for a given input set is

Here, the size of the training set is m in yj. To get the optimal NN, we simply find the optimal
NN; we find the weights wi that minimizes this error function E. Once the optimal weights are
gotten, we have a calibrated “feedforward” neural net.
The multilayered perceptron for a given squashing function f and input x = [x1,x2,...,xn], would
give us an output at the hidden layer of

The final output of NN is:

In the above equation, the nested structure of the neural net is obvious.
Logit/Probit Model
A good example of a Logit model is the special model we have above. However, the model
becomes a probit regression model once the squashing function is taken to the cumulative normal
distribution. However, whether Logit or Probit, the model is fitted by minimizing squared errors,
not by maximum likelihood, which is how standard logit/probit models are parameterized.
Connection To Hyperplanes
It is important to note that in binary squashing functions, we passed the net input through a
sigmoid function and then compared to the threshold level T. This sigmoid function is a
monotone one. Hence, this means that there must be a level T in which the net input ∑ni=1 wi xi
must be for the result to be on the cusp. The following is the equation for a hyperplane

This also means that observations in n-dimensional space of the inputs xi must lie on one side or
the other of this hyperplane. When it lies above the hyperplane, then y = 1, else y = 0. From our
explanation so far, single perceptrons in neural nets have a simple geometrical intuition.
Feedback/Backpropagation
The major difference between ordinary neural net and nonlinear regressions is feedback.
Feedback plays a vital role in the neural net performance. Neural nets learn from their feedbacks.
The techniques used in implementing feedback is what is called backpropagation.
Assuming what we have is a calibrated NN. We would have to obtain another observation of
data and then run it through the NN. To get the error for this observation, we would compare the
value of the output y and the desired observation d. If the error detected is a very large one, the
best way to correct this is to update the weight in the NN. This would enable it to self-correct.
The process of updating the weight in NN to self-correct is what is known as "backpropagation."
The benefit of backpropagation is to avoid the long process of doing a full-refitting task. With
just simple rules, correction can be made in a gradual process.
Let's look at backpropagation using a single perceptron on this very simple example. Considering
the jth perceptron, the sigmoid value would be:

where yj is the output of the jth perceptron, and xij is the ith input to the jth perceptron. The error
that would be gotten from this observation is (yj − dj). Recall that E = 1/2 ∑mj=1(yj − dj)2. The
change in error can be computed with respect to the jth output.

Next, we examine how the error changes with input values:

We can now define the value of interest. This is simply the change in error value with respect to
the weights

In this case, we have a single equation for each observation j and each weight wi. It is important
to note that wi apply to all perceptrons. A unique case is where every perceptron has its own
perceptron, that is wij. Here, instead of updating just a single observation, updating would be
done on many observations. If this is the case, the error derivative would be:
∂E/∂wi = ∑j(yj −dj)yj(1−yj)xij
Therefore, wi would be reduced to E if ∂E/∂wi > 0. How do we achieve this? It is in this aspect
that some art and judgment is implemented. To start with, when the weight wi needs to be
shrunk, there is a tuning parameter 0 < γ < 1 that works for this; similarly, when the derivative is
∂E/∂wi < 0, then wi would be increased by dividing it by γ.
Chapter 15: One Or Zero: Optimal Digital Portfolio
Digital assets are a binary investment. This means that their payoff is either small or large. In this
chapter, we would be exploring some key features of optimal digital portfolio assets; these
include assets such as credit assets, venture investments and lotteries. The outlines that would be
covered in this chapter include:
Optimal Digital Portfolio
Modeling Digital Portfolios
Optimal Digital Portfolio
These kinds of portfolio shares correlated assets with joint Bernoulli distributions. In our
explanation, we would be using an easy and fast recursive technique to obtain the return on the
distribution of the portfolio. We would also be using the example to generate guidelines on how
digital asset investors should construct their portfolio. Recently, it has been discovered that
portfolios are better when they are constructed homogeneous in the size of the assets.
It is important to note that the distribution of return on digital portfolios is usually fat-tailed and
extremely skewed. The venture fund is a very good example of this kind of portfolio. Bernoulli
distribution is a simple representation of digital portfolio payoff. Bernoulli distribution has very
little or no payoff for a failed asset; however, its payoff for a successful portfolio is large. The
probability of achieving success in digital investment is relatively small. Hence, optimizing the
portfolio in digital investment is not amenable to the standard technique used for mean-variance
optimization.
Therefore, in our explanation, we would be using a technique based on the standard recursive
used for modeling the return distribution in Bernoulli distribution.
Modeling Digital Portfolio
Let's take, for instance, that an investor has an option of n investments in digital assets. The
investment are indexed i = 1,2,..., n. For each investment, there is a probability of success qi that
would yield si dollar. Given the probability (1 − qi), the investment and the start-up would fail.
All the money used for the investment would become a total waste. The payout of cashflow for
such investment is:

Below is the function to determine whether an investment is successful or not :

Here the ρi ∈ [0,1] is a coefficient that correlates a normalized common factor X ~ N(0,1) with
a threshold yi. Correlation is driven among the digital assets in the portfolio by the common
factor. Hence, we assume that Zi ∼ N(0,1) and Corr(X,Zi) = 0, ∀ i. Therefore, ρi × ρj stands
for the correlation between assets i and j.
Note that the mean and variance of yi are: E(yi) = 0,Var(yi) = 1, ∀ i. Conditional on X, the
values of yi are all independent, as Corr(Zi,Zj) = 0. Next, we formalize the probability model for
the success or failure of the investment by first defining a variable xi with distribution function
F(·), such that F(xi) = qi, gives us the probability of success of the digital investment.
Conditional on a fixed value of X, the probability of success of the ith investment would be:

Taking the normal of value to be F, we would have :

From the above equation, the cumulative normal density function is Φ(.). Taking the value for
the level of the common factor as X, asset correlation ρ, and the unconditional success
probabilities qi, our conditional success probability for each asset would be pXi. The more X
varies, the more our pXi varies. We would be choosing the function F(xi) as the cumulative
normal probability function for the numerical examples we would be treating.
An investment is deemed successful I it has a high payoff Si. The flow of cash from the portfolio
is a random variable
The sum of all digital assets cashflow would give us the maximum cash flow that would be
generated by the portfolio. This is because every single outcome is a success.

To simplify the issue, we would simply assume Si is an integer, and we round off the amount
nearest to the significant digit. Hence if the amount nearest to the significant digit is a million,
each Si would be a million integer.
Let's recall that in our previous formula, conditional on a value of X, the probability of success
of digital asset i is given as pXi. This recursion technique would be made it easy for us to
generate the portfolio cashflow probability for each level of X. Based on this, we simply use the
marginal distribution for X represented by g(X) to compose these conditional (on X)
distributions into the unconditional distribution of all the portfolios. As a result, the total
probability of cash flow from the portfolio, conditional on X would be f(C|X).
Conclusion
Data science is wider than computer science or statistic; however, to excel in the field, the
knowledge of these two is very necessary. From our explanations so far, a lot has been explored
under these two fields. For instance, Fourier analysis, Data extraction, Limited dependent
variables are all under the field of statistics, while algorithms, linear and nonlinear regression,
Auctions, Network theories, neural networks and more are prevalent in the field of computer
science.
In explaining some of the theories in the book, we use a recursion technique borrowed from
many different portfolios and examples. We also explicate the major difference between
nonlinear and linear regression, using this as a background into exploring what optimal digital
portfolio entails. Some popularly used theories in data science, such as Bass, Bayes,
GARCH/ARCH, and many more, were explained in detail. Not only this, we explore different
models like the Markowitz model, Eigensystem, Factor analysis and many more.
Important areas such as web sourcing with the use of API, Text classifier, Word-count classifier,
and many more were observed in detail. The approaches employed in each chapter of the book
are very simple. Broad statistics on models and theories were beaten done into explanatory
details so that readers are able to grasp all these areas.
With consistent study and practice, data scientists are guaranteed to excel in their field.

R Data Analysis Projects
100% (1)
R Data Analysis Projects
361 pages
Machine Learning Methods in Environmental Sciences
100% (2)
Machine Learning Methods in Environmental Sciences
365 pages
The Power of Ritual in Prehistory
100% (7)
The Power of Ritual in Prehistory
412 pages
Mathematical Statistics
From Everand
Mathematical Statistics
S. Wilks
5/5 (2)
Data Analytics For Absolute Beginners A Deconstructed Guide To Data Literacy 1081762462 9781081762469
100% (1)
Data Analytics For Absolute Beginners A Deconstructed Guide To Data Literacy 1081762462 9781081762469
133 pages
Innovations in Big Data Mining
100% (2)
Innovations in Big Data Mining
286 pages
(Smtebooks - Com) Big Data Processing With Hadoop 1st Edition
100% (1)
(Smtebooks - Com) Big Data Processing With Hadoop 1st Edition
255 pages
(Studies in Big Data) Mamta Mittal - Valentina E. Balas - Lalit Mohan Goyal - Raghvendra Kumar - Big Data Processing Using Spark in Cloud (2019, Springer) PDF
No ratings yet
(Studies in Big Data) Mamta Mittal - Valentina E. Balas - Lalit Mohan Goyal - Raghvendra Kumar - Big Data Processing Using Spark in Cloud (2019, Springer) PDF
274 pages
Data Mining A Tutorial-Based Primer, Second Edition PDF
100% (1)
Data Mining A Tutorial-Based Primer, Second Edition PDF
530 pages
Data Science and Big Data An Environment of Computational Intelligence
100% (4)
Data Science and Big Data An Environment of Computational Intelligence
303 pages
TensorFlow Developer Certification Guide
From Everand
TensorFlow Developer Certification Guide
Patrick J
No ratings yet
Data Visualization with Python: Exploring Matplotlib, Seaborn, and Bokeh for Interactive Visualizations (English Edition)
From Everand
Data Visualization with Python: Exploring Matplotlib, Seaborn, and Bokeh for Interactive Visualizations (English Edition)
Dr. Pooja
No ratings yet
Data Scientist Pocket Guide: Over 600 Concepts, Terminologies, and Processes of Machine Learning and Deep Learning Assembled Together
From Everand
Data Scientist Pocket Guide: Over 600 Concepts, Terminologies, and Processes of Machine Learning and Deep Learning Assembled Together
Mohamed Sabri
No ratings yet
Building Transformer Models with PyTorch 2.0: NLP, computer vision, and speech processing with PyTorch and Hugging Face (English Edition)
From Everand
Building Transformer Models with PyTorch 2.0: NLP, computer vision, and speech processing with PyTorch and Hugging Face (English Edition)
Prem Timsina
No ratings yet
Data Structures and Algorithms With Python 100 Coding Q A Code of Code by Cakal Yasin 1
100% (2)
Data Structures and Algorithms With Python 100 Coding Q A Code of Code by Cakal Yasin 1
327 pages
The Architecture Portfolio Guidebook
100% (3)
The Architecture Portfolio Guidebook
279 pages
Applied Longitudinal Analysis Lecture Notes
No ratings yet
Applied Longitudinal Analysis Lecture Notes
475 pages
A Primer On Process Mining Practical Skills With Python and Graphviz
No ratings yet
A Primer On Process Mining Practical Skills With Python and Graphviz
102 pages
Data Science and Big Data Computing - Frameworks and Methodologies
90% (10)
Data Science and Big Data Computing - Frameworks and Methodologies
332 pages
Machine Learning Models and Algorithms For Big Data Classification
50% (2)
Machine Learning Models and Algorithms For Big Data Classification
364 pages
Innovations in Classification Data Science Daniel Baier, Klaus Dieter
No ratings yet
Innovations in Classification Data Science Daniel Baier, Klaus Dieter
620 pages
Advances in Big Data PDF
100% (2)
Advances in Big Data PDF
364 pages
Python For The Busy Java Developer: The Language, Syntax, and Ecosystem
100% (1)
Python For The Busy Java Developer: The Language, Syntax, and Ecosystem
79 pages
Data Science and Big Data Computing PDF
100% (2)
Data Science and Big Data Computing PDF
332 pages
Ai Analytics in Production PDF
No ratings yet
Ai Analytics in Production PDF
137 pages
Abhijit Ghatak (Auth.) - Machine Learning With R (2017, Springer Singapore) PDF
100% (1)
Abhijit Ghatak (Auth.) - Machine Learning With R (2017, Springer Singapore) PDF
224 pages
Data Science Solutions With Python Fast and Scalable Models Using
100% (1)
Data Science Solutions With Python Fast and Scalable Models Using
128 pages
Linear Algebra For Computational Sciences and Engineering - Ferrante Neri (Springer, 2016)
100% (1)
Linear Algebra For Computational Sciences and Engineering - Ferrante Neri (Springer, 2016)
472 pages
Supercharge Your Data Science Career
100% (1)
Supercharge Your Data Science Career
20 pages
Data Science Ebook
0% (1)
Data Science Ebook
18 pages
Guide To Intelligent Data Science: Michael R. Berthold Christian Borgelt Frank Höppner Frank Klawonn Rosaria Silipo
100% (1)
Guide To Intelligent Data Science: Michael R. Berthold Christian Borgelt Frank Höppner Frank Klawonn Rosaria Silipo
427 pages
MML Book 2 PDF
100% (2)
MML Book 2 PDF
421 pages
Effective Amazon Machine Learning
From Everand
Effective Amazon Machine Learning
Alexis Perrier
No ratings yet
Machine Learning With R Cookbook - Sample Chapter
100% (1)
Machine Learning With R Cookbook - Sample Chapter
41 pages
The Essential R Reference
From Everand
The Essential R Reference
Mark Gardener
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Guide Python Data Science
100% (2)
Guide Python Data Science
13 pages
Dynamics of Data Science
No ratings yet
Dynamics of Data Science
104 pages
Economic Strain Seen in Forecast Through Autumn: China Targeted As U.S. Weighs Ban On Visitors
100% (1)
Economic Strain Seen in Forecast Through Autumn: China Targeted As U.S. Weighs Ban On Visitors
50 pages
Learn D3.Js Simple Way - Nuno Correia
No ratings yet
Learn D3.Js Simple Way - Nuno Correia
129 pages
3 Multiple Linear Regression: Estimation and Properties: Ezequiel Uriel Universidad de Valencia Version: 09-2013
100% (1)
3 Multiple Linear Regression: Estimation and Properties: Ezequiel Uriel Universidad de Valencia Version: 09-2013
37 pages
Python For Data Science Extended Ebook PDF
100% (5)
Python For Data Science Extended Ebook PDF
56 pages
2020-07-01consumer Reports - July 2020 (USA)
No ratings yet
2020-07-01consumer Reports - July 2020 (USA)
81 pages
Deep Learning For Cloud and Mobile
100% (2)
Deep Learning For Cloud and Mobile
42 pages
Big Data Analytics Methods and Applications Jovan Pehcevski
100% (5)
Big Data Analytics Methods and Applications Jovan Pehcevski
430 pages
The Essential R Reference. Preview Sample
No ratings yet
The Essential R Reference. Preview Sample
60 pages
Machine Learning in the AWS Cloud: Add Intelligence to Applications with Amazon SageMaker and Amazon Rekognition
From Everand
Machine Learning in the AWS Cloud: Add Intelligence to Applications with Amazon SageMaker and Amazon Rekognition
Abhishek Mishra
No ratings yet
Cruel Sea Claims A Kind Soul: THE Party'S Over
No ratings yet
Cruel Sea Claims A Kind Soul: THE Party'S Over
94 pages
SQL For Data Science
75% (4)
SQL For Data Science
350 pages
The Times (2020-07-10) 100720081649
No ratings yet
The Times (2020-07-10) 100720081649
68 pages
The Washington Post December 09, 2020 PDF
No ratings yet
The Washington Post December 09, 2020 PDF
87 pages
'Stay Away' Selfish Acts: Tassie Closes Its Borders To Victorians High-Range Drink Drivers Put Lives at Risk
No ratings yet
'Stay Away' Selfish Acts: Tassie Closes Its Borders To Victorians High-Range Drink Drivers Put Lives at Risk
66 pages
Data Preprocessing in Data Mining PDF
100% (3)
Data Preprocessing in Data Mining PDF
327 pages
The New York Post January 05, 2021
No ratings yet
The New York Post January 05, 2021
52 pages
Class Clowns: Parents Rip Blas-Carranza's Half-In, Half-Out, Half-Assed School Plan
No ratings yet
Class Clowns: Parents Rip Blas-Carranza's Half-In, Half-Out, Half-Assed School Plan
54 pages
BA in Data Science
No ratings yet
BA in Data Science
22 pages
As Warming Makes Parts of The Planet Less and Less Livable An Epic Climate Migration Has Begun
No ratings yet
As Warming Makes Parts of The Planet Less and Less Livable An Epic Climate Migration Has Begun
48 pages
Michael Cohen Back in Prison After This Pic: THE Last Supper
No ratings yet
Michael Cohen Back in Prison After This Pic: THE Last Supper
50 pages
Building An Effective Data Science Practice
100% (2)
Building An Effective Data Science Practice
376 pages
Boston Globe December 09, 2020 PDF
No ratings yet
Boston Globe December 09, 2020 PDF
40 pages
2016 Book PrinciplesOfDataMining PDF
100% (3)
2016 Book PrinciplesOfDataMining PDF
530 pages
Grab A 10 Rishi Dishi: Depp Wanted Witch Amber Set On Fire'
No ratings yet
Grab A 10 Rishi Dishi: Depp Wanted Witch Amber Set On Fire'
34 pages
2020 07 21 Daily Star PDF
No ratings yet
2020 07 21 Daily Star PDF
50 pages
The Wall Street Journal - 08.07.2020
No ratings yet
The Wall Street Journal - 08.07.2020
32 pages
President Is Not Above The Law,' Justices Decide: Schools Facing Crush of Costs To Open Safely
No ratings yet
President Is Not Above The Law,' Justices Decide: Schools Facing Crush of Costs To Open Safely
52 pages
Google Tells Staffers To Work Remotely Until Next Summer: Any Data, Anywhere
No ratings yet
Google Tells Staffers To Work Remotely Until Next Summer: Any Data, Anywhere
32 pages
Usa Today: USA Reaches 3 Million Cases
No ratings yet
Usa Today: USA Reaches 3 Million Cases
28 pages
G.O.P. Relief Plan Slices Extra Pay For Unemployed: Behind 2 Covid Studies, Big Talk and Big Plans
No ratings yet
G.O.P. Relief Plan Slices Extra Pay For Unemployed: Behind 2 Covid Studies, Big Talk and Big Plans
48 pages
UltimateGuidetoDataScienceInterviews 2
100% (4)
UltimateGuidetoDataScienceInterviews 2
87 pages
The Wall Street Journal - 09.07.2020
No ratings yet
The Wall Street Journal - 09.07.2020
30 pages
Foundations of Deep Reinforcement Learning Theory and Practice in Python (Laura Graesser, Wah Loon Keng) (Z-Library)
100% (3)
Foundations of Deep Reinforcement Learning Theory and Practice in Python (Laura Graesser, Wah Loon Keng) (Z-Library)
413 pages
Learning Probabilistic Graphical Models in R - Sample Chapter
No ratings yet
Learning Probabilistic Graphical Models in R - Sample Chapter
37 pages
Data Science Crash Course SharpSight
100% (6)
Data Science Crash Course SharpSight
107 pages
Skymind The Math Behind Neural Networks
100% (1)
Skymind The Math Behind Neural Networks
17 pages
The Wall Street Journal - 16.07.2020
No ratings yet
The Wall Street Journal - 16.07.2020
34 pages
We're Warriors For Our Sons': Trump Loses Bid To Keep Taxes Secret
No ratings yet
We're Warriors For Our Sons': Trump Loses Bid To Keep Taxes Secret
31 pages
2018 Book DataScienceAndPredictiveAnalyt PDF
100% (3)
2018 Book DataScienceAndPredictiveAnalyt PDF
851 pages
Financial Times Europe - 08.07.2020
No ratings yet
Financial Times Europe - 08.07.2020
18 pages
Safety Net From Centuries Past: The Quest For Progress Drowns A Turkish Gem A New Love For Merkel in Germany
No ratings yet
Safety Net From Centuries Past: The Quest For Progress Drowns A Turkish Gem A New Love For Merkel in Germany
16 pages
Master Data Science Essentials 2015-11 SHARPSIGHTLABS
No ratings yet
Master Data Science Essentials 2015-11 SHARPSIGHTLABS
11 pages
Tactical Asset Allocation With Macro Views:: Quantitative Research
No ratings yet
Tactical Asset Allocation With Macro Views:: Quantitative Research
12 pages
User Guide of GARCH-MIDAS and DCC-MIDAS MATLAB Programs
No ratings yet
User Guide of GARCH-MIDAS and DCC-MIDAS MATLAB Programs
12 pages
Database Management Systems by Raghu Ramakrishnan: Special Features of Book
No ratings yet
Database Management Systems by Raghu Ramakrishnan: Special Features of Book
3 pages
Data Science Theory, Analysis and Applications - Memon - Ahmed
100% (12)
Data Science Theory, Analysis and Applications - Memon - Ahmed
345 pages
Advance Statistical Methods in Data Science Chen
100% (5)
Advance Statistical Methods in Data Science Chen
229 pages
Practical Machine Learning R
90% (10)
Practical Machine Learning R
149 pages
Intelligent Techniques For Data Science
100% (12)
Intelligent Techniques For Data Science
282 pages
2019 Book DataScienceAndBigDataAnalytics
100% (15)
2019 Book DataScienceAndBigDataAnalytics
418 pages
9.3 Chương 9
No ratings yet
9.3 Chương 9
66 pages
Introduction To Data ScienceA Python Approach To Concepts, Techniques and Applications PDF
100% (10)
Introduction To Data ScienceA Python Approach To Concepts, Techniques and Applications PDF
227 pages
Data Engineering Cookbook
89% (9)
Data Engineering Cookbook
88 pages
Algorithms For Data Science 1st Brian Steele (WWW - Ebook DL - Com)
100% (15)
Algorithms For Data Science 1st Brian Steele (WWW - Ebook DL - Com)
438 pages
Multivariate DCC-GARCH Model: Elisabeth Orskaug
No ratings yet
Multivariate DCC-GARCH Model: Elisabeth Orskaug
88 pages
A Step-By-Step Explanation of Principal Component Analysis (PCA) - Built in
No ratings yet
A Step-By-Step Explanation of Principal Component Analysis (PCA) - Built in
8 pages
SSA Beginners Guide v9
No ratings yet
SSA Beginners Guide v9
22 pages
2022 SCHEME PG MST - Compressed - 0
No ratings yet
2022 SCHEME PG MST - Compressed - 0
84 pages
Q Function
No ratings yet
Q Function
9 pages
Information Geometry Univariate Time Series
No ratings yet
Information Geometry Univariate Time Series
12 pages
3 Sls
No ratings yet
3 Sls
31 pages
Dynamic programming The Ultimate Step-By-Step Guide
From Everand
Dynamic programming The Ultimate Step-By-Step Guide
Gerardus Blokdyk
No ratings yet
Case International Diversification
No ratings yet
Case International Diversification
5 pages
Narcoland The Mexican Drug Lords and Their Godfathers
No ratings yet
Narcoland The Mexican Drug Lords and Their Godfathers
1 page
Robust - Robust Variance Estimates
No ratings yet
Robust - Robust Variance Estimates
25 pages
Principal Component Analysis
No ratings yet
Principal Component Analysis
9 pages
Manly Caps1-3 PDF
No ratings yet
Manly Caps1-3 PDF
37 pages
PCA Analysis
No ratings yet
PCA Analysis
28 pages
C:/Users/User/Downloads/model - Amos - Ads - Final - Amw: Analysis Summary Date and Time
No ratings yet
C:/Users/User/Downloads/model - Amos - Ads - Final - Amw: Analysis Summary Date and Time
37 pages
Lavaan
No ratings yet
Lavaan
54 pages
PG 1
No ratings yet
PG 1
38 pages
Covariance Matrix
No ratings yet
Covariance Matrix
6 pages
The Journal of Finance - 2022 - CHERNOV - Pricing Currency Risks
No ratings yet
The Journal of Finance - 2022 - CHERNOV - Pricing Currency Risks
38 pages
Principal Components Analysis and Redundancy Analysis
No ratings yet
Principal Components Analysis and Redundancy Analysis
18 pages
Sequence Weights: Stephen F. Altschul
No ratings yet
Sequence Weights: Stephen F. Altschul
17 pages
Young 2001
No ratings yet
Young 2001
22 pages
Journal of Statistical Software
No ratings yet
Journal of Statistical Software
14 pages
Computers & Geosciences: Mohammad Ali Goudarzi, Marc Cocard, Rock Santerre
No ratings yet
Computers & Geosciences: Mohammad Ali Goudarzi, Marc Cocard, Rock Santerre
12 pages
Assignment 2 FIN435.5 Group 2
No ratings yet
Assignment 2 FIN435.5 Group 2
10 pages

Data Science Tips and Tricks To Learn Data Science Theories Effectively

Uploaded by

Data Science Tips and Tricks To Learn Data Science Theories Effectively

Uploaded by

Data Science

Tips and Tricks to learn Data Science

Supervised and Unsupervised Learning

The distribution function gives the cumulative probability

However, if the return is independent, the formula collapses to

Where α=(µ−1/2σ2)h. For periods t=1,2,... T the entire series likelihood is

It is very simple (computationally) now to maximize.

Under ARCH, the variance is always auto-correlated. So we will have:

The output is the vector of the optimal portfolio weight.

We can verify the sum of total risk using this procedure:

The example we would be using is the Aid Test

Frequency is calculated like this

The formula for inverse document frequency is:

SMV is very fast and can be used in news analytics.

The Basic Idea

This can be rewritten as

Therefore, for the values p = 0.01 and q = 0.2, we will have

The discriminant weight is the ak coefficients.

We can extract some useful result as follows:

An M×M matrix A has attendant M eigenvectors V and eigenvalue λ if we can write

Assuming F is a uniform, we would have:

Here Λ (lambda) stands for logit.

The probability that y=1 is the fitted value of y.

This can be written as:

This can be rewritten as:

Here the normal density function is φ(.)

The log-likelihood would take this form:

While we take this derivative to maximize the log-likelihood

Note that the following equation are valid:

And the latent variable is modeled as

where ρ = Corr(,u), and the standard deviation of € is σ€. It means that

This can happen in the following listed ways:

Next, we compute the above and get:

The function in the first equation can be written in time as :

The Fourier series can be rewritten as:

Let's try another integral:

How do we arrive at the above?

This generated the popular Euler's formula:

This can be written as

Going back to the Fourier series:

How then do we solve this? Start with:

How you get rid of a0

The expression can be written as:

When we apply this idea, we will get:

This can be alternatively written in frequency terms as:

Next, we substitute this into the formula of g(t) and get:

Taking the formula of limits as:

This gives a double integral

Fourier coefficient is as follows:

Where mj is the jth moment. It is therefore easy to see that:

From this algorithm, we can generate two subtrees as follows:

To verify our result, we would be plotting a graph

We can cross-check this using the following command:

Now, let's first detect the communities:

The modularity score would be:

Now, we would be computing modularity in the R package by using a canned function. To do

When we different w.r.t. β would give us the following f.o.c:

We would be examining this using the markowitzdata.txt data set.

When we differentiate w.r.t. β would give us the following f.o.c:

The error in the NN for a given input set is

The final output of NN is:

Next, we examine how the error changes with input values:

Below is the function to determine whether an investment is successful or not :

Taking the normal of value to be F, we would have :

You might also like