Data Science Tips and Tricks To Learn Data Science Theories Effectively
Privacy
As the volume, velocity, and variety of data increase, individual privacy is steadily eroded. As humans, we are often torn between engaging in social interaction with others and maintaining our privacy, and technology has tilted that balance so that the battle to maintain our privacy grows every single day. Daily we struggle to keep our private information from leaking out to the public, and much of that pressure comes from data science itself. Stakeholders, governments, business owners, and others use data services to gain access to people's private information. We are all familiar with sites asking personal questions ranging from our age, marital status, place of origin, and occupation to the address of our residence. Answers to these kinds of questions leak private information.
Additionally, the loss of privacy is also driven by what is known as "human profiling." The more we move our daily activities to the web, the more companies and organizations use data mining and analysis to construct profiles of who we are, often in more detail than we realize. For instance, when we tweet "taking my dog for a walk," data analysis records us as "owner of a pet"; when we tweet "going home to cook for my kids," we are recorded as "a mother." A snippet of information that looks like an ordinary post on Facebook or Twitter reveals more than we often know. In short, a machine may know you better than you think.
Aside from posts on social media, phone calls, GPS location data, and emails are also part of what companies and organizations can use to build these human profiles. Those who rarely post information about themselves are simply recorded as people with a low digital profile. To strike a balance, it is advisable not to hide completely but to maintain as modest a profile as possible.
Profiling also means carving out a targeted segment of the audience. This allows particular attention to be paid to that group of people, for example through price discrimination. If my profile shows that I am an influential and wealthy person, I am likely to start receiving internet sales pitches from companies that have worked out which products I tend to buy. Profiling allows companies and business organizations to reach their audience faster and more accurately.
Profiling is also used to catch terrorists; however, care should be taken not to engage in excessive profiling. We are in an age in which people and machines are locked in an interesting battle over privacy, so let's be careful what information we disclose on social media.
Theories, Models, Intuition, Causality, Prediction, Correlation
Data science entails the implementation of theories and models, and it also makes use of intuition, causality, prediction, and correlation. Theories are statements about how the world should or should not be. These statements are often derived from axioms assumed about the nature of the world or from existing theories. Models are implementations of theories, usually achieved through the use of algorithms and variables. Intuition is the result of running a model: a profound understanding of the world gained with the aid of data, theories, and models.
Once the intuition for the result of a model is established, what is left is to determine whether the relationship observed between model and intuition is one of prediction, causality, or correlation. Causality is usually stated in a mathematical form or structure, and theories may be causal. To establish a causal effect, the claim must be deeply entrenched in the data. This is why causality is very difficult to establish, even with strong theoretical foundations.
At the end of the inference chain in data science, the co-movement between two variables is often measured by correlation. Correlation is of utmost importance to firms hoping to tease information out of big data. Although correlation captures the linear relationship between variables, it can also lay the groundwork for finding nonlinear relationships, an exercise that becomes more and more feasible as more data becomes available.
In data science, a relationship is a multifaceted correlation among people. Social media platforms such as Twitter, Facebook, and Instagram use graph theory to datafy human relationships. The aim is to understand how people relate to each other and to make some profit from that understanding. Data science therefore encompasses understanding how humans relate to one another and, more generally, understanding human behavior, an aspect that is also the focus of social science.
Conclusion
This chapter explained in detail who a data scientist is, what data science is, and the features a good data scientist should have. The chapter also looked at the characteristics of good data, machine learning, and the two major types of machine learning. In the subsequent chapters of this book, we will consider theories, models, data applications, and techniques. We will also explore some of the recent technologies created for big data and data science.
Chapter Two: Getting Started with Data Science
This chapter explores some of the mathematical models, statistics, and algebra used in data science. We will look at some equations prevalent in data analysis and at how business organizations use them in carrying out their work.
Data analysis calls for technical expertise and excellence, and for the ability to use various quantitative tools. These tools range from statistics to calculus and algebra, and of course econometrics. There are various tools used in data analysis; in this chapter, some of them are explained in detail. The outline covered in this chapter includes:
Exponentials, Logarithms, and Compounding
Normal distribution
Poisson distribution
Vector Algebra
Matrix calculus
Diversification
Exponentials, Logarithms, and Compounding
In this section, we start our explanation with the most basic mathematical constant we are familiar with: e = 2.718281828..., which underlies the exponential function exp(·), usually written as e^x, where x can be either a real or a complex variable. This constant is very popular in finance and related areas, where we use it for the continuous compounding and discounting of money at a stipulated interest rate r over a time frame t.
Let's assume y = e^x; then any change in the value of x results in a percentage change in y. The reason is simple: ln(y) = x, where ln(·) is the inverse of the exponential function, also known as the natural logarithm.
Remember that the first derivative of this function is dy/dx = e^x. The constant e itself is defined as the limit
e = lim (1 + 1/n)^n as n → ∞
Continuous (exponential) compounding is the limit of discrete compounding over successively shorter intervals. Let's assume the time frame t is split into n intervals per year. The compounding of one dollar from time zero to time t years, at a per annum rate r compounded over n intervals per year, is written as
(1 + r/n)^(n·t)
As n rises to infinity, this becomes continuous compounding:
lim (1 + r/n)^(n·t) = e^(r·t)
The above expression is just the forward value of one dollar. To calculate the present value, we invert the equation. Hence the price today of a dollar to be collected t years from now is P = e^(−rt). What we have now is a bond, and the yield of this bond is
r = −(1/t)·ln(P)
Duration is the negative of the percentage price sensitivity of the bond to changes in the interest rate:
D = −(1/P)·dP/dr = t
Convexity is the percentage price sensitivity of the bond given by its second derivative:
C = (1/P)·d²P/dr² = t²
Normal Distribution
This distribution is the benchmark for many models in social science, because it is widely believed to describe much of the data needed in big data work. It is quite interesting, though, that many phenomena in the real world are "power law" distributed, which implies a very few observations of high value against many observations of low value. In that type of distribution, the probability distribution does not have the features of the normal distribution; rather, the density declines from left to right.
A good example of data distributed in this way is income (very few observations of high income and many observations of low income). Other examples include the populations of cities, word frequencies in language, and so on.
The normal distribution is very important in statistics. Good examples of approximately normally distributed data are human heights and stock returns. If x is normally distributed with mean µ and variance σ², the probability density of x is
f(x) = 1/(σ√(2π)) · exp( −(x − µ)² / (2σ²) )
The notation N(·) or Φ(·) is often used instead of F(·) for the normal distribution, which is symmetric. The "standard normal" distribution is x ∼ N(0,1).
Poisson Distribution
This is also known as a rare-event distribution. The density function for this type of distribution is
f(n) = e^(−λ)·λ^n / n!
There is only one parameter, the mean λ, and the density is defined over the discrete values of n. Both the mean and the variance of the Poisson distribution are equal to λ. The Poisson is a discrete-support distribution whose values range over n = 0, 1, 2, 3, 4, 5, ...
Moments of Continuous Random Variables
The formulas reviewed in this section are necessary because almost any analysis of data makes use of them. In our review, we use the random variable x and its probability density function f(x) to arrive at the first four moments.
Mean (first moment or average) = E(x) = ∫ x·f(x) dx
Raising the variable to a power gives a higher, nth-order moment; these moments are non-central. The formula is
E(x^n) = ∫ x^n·f(x) dx
The next moment is the variance; moments of the demeaned variable are known as central moments.
Variance = Var(x) = E[x − E(x)]² = E(x²) − [E(x)]²
The square root of the variance is the standard deviation, i.e., σ = √Var(x). The next moment is skewness:
Skewness = E[(x − E(x))³] / σ³
The value of skewness reflects the degree of asymmetry in the probability density. If values occur more in the left-hand tail than in the right, the distribution is left-skewed; when values fall more in the right-hand tail, it is right-skewed.
The last normalized central moment is the kurtosis:
Kurtosis = E[(x − E(x))⁴] / σ⁴
For a normal distribution, kurtosis equals 3. Excess kurtosis is kurtosis minus 3, and a distribution with positive excess kurtosis is called leptokurtic.
How to Combine Random Variables
Here are simple rules for combining random variables:
1. Means are scalable and additive: E(ax + by) = aE(x) + bE(y)
2. When a, b are scalar values and x, y are random variables, the variance of the sum of scaled variables is
Var(ax + by) = a²Var(x) + b²Var(y) + 2abCov(x,y)
3. The covariance and correlation between two random variables are
Cov(x,y) = E(xy) − E(x)E(y) and Corr(x,y) = Cov(x,y) / (σx·σy)
Vector Algebra
In most of the models we will explore in this book, we will use linear algebra and vector calculus. Linear algebra encompasses the use of both vectors and matrices, while vector algebra and calculus are very effective for handling problems in spaces of several variables, i.e., in high dimensions. In this book, the use of vector calculus is examined in the context of a stock portfolio. The vector of stock returns is defined as
R = [R1, R2, ..., Rn]′
What we have in the above equation is a random vector, because each return comes from its own distribution; the returns of the stocks are also correlated with one another.
We can also define a unit vector as
1 = [1, 1, ..., 1]′
The unit vector will be used in subsequent chapters, especially for analysis. A portfolio is represented by a vector of portfolio weights w = [w1, w2, ..., wn]′, the fractions of the portfolio invested in each stock.
The portfolio weights must sum to 1. The equation for this is
w1 + w2 + ... + wn = w′1 = 1
A close look at the line above shows that there are two ways to write the sum of the portfolio weights: the first uses summation notation, while the second uses a compact vector (algebraic) expression. The two elements on the left-hand side of the equation are vectors, while the right-hand side is a scalar.
Vector notation can also be used to compute portfolio statistics and quantities. The formula for the portfolio return is
E(rp) = w′E(R) = w1·E(R1) + w2·E(R2) + ... + wn·E(Rn)
In this equation, the quantity on the left-hand side is a scalar, while the right-hand side is built from vectors.
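As a minimal sketch of these vector operations in R (the weights and expected returns below are made-up numbers for illustration only):
# hypothetical portfolio weights and expected stock returns
w  = c(0.3, 0.5, 0.2)          # portfolio weights, must sum to 1
ER = c(0.08, 0.12, 0.05)       # expected returns of the three stocks
sum(w)                         # fully-invested check: w'1 = 1
t(w) %*% rep(1, length(w))     # the same check as a vector product
t(w) %*% ER                    # portfolio expected return E(rp) = w'E(R)
sum(w * ER)                    # the same number via summation notation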
Diversification
Here we examine the power of vector algebra with an application. To explore how diversification works, we use vector and summation math. Diversification happens when the number of non-perfectly correlated stocks in a portfolio increases, and it reduces portfolio variance. To compute the variance, we use the portfolio weight vector w and the covariance matrix of the stock returns R, denoted Σ. The portfolio return variance is first written as
Var(rp) = w′Σw
If an equal amount 1/n is invested in each asset, the formula becomes
Var(rp) = (1/n) · (average variance) + (1 − 1/n) · (average covariance)
In this equation, the first term involves the average variance and the second the average covariance. As n grows, the first term vanishes, which is the striking result for a diversified portfolio: in such a portfolio, the variances of the individual stocks play no role in portfolio risk. The portfolio variance converges to the average of the off-diagonal terms (the average covariance) of the covariance matrix.
Matrix Calculus
Matrix calculus extends calculus to functions of many variables. Just as functions can be differentiated in multivariable calculus, functions of vectors and matrices can be differentiated in matrix calculus, and the simplest cases work directly with vectors and matrices, so we can take the derivative in a single step. For instance, let's assume
w = [w1, w2]′ and B = [3, 4]′
and define the function f(w) = w′B. What we have here is a function of two variables, w1 and w2. When we write out f(w) in long form, we arrive at 3w1 + 4w2. The derivative of f(w) with respect to w1 is ∂f/∂w1 = 3, while the derivative of f(w) with respect to w2 is ∂f/∂w2 = 4. Comparing this with the vector B, we see that df/dw = B.
The insight in this calculation is that when vectors are treated like regular scalars and the calculus is carried out accordingly, the result is a vector derivative.
Conclusion
In this chapter, we covered some of the mathematics used in data calculations and explored some of the basic statistics of data science. We considered vectors, matrices, calculus, random variables, and so on. The next chapter examines the R statistical package and how to get started with it.
Chapter Three: R - Statistic Packages
This chapter examines some useful steps for working with R and its statistical packages. For a better user interface when using R, it is advisable to download and install RStudio by visiting www.rstudio.com; however, it is necessary first to install R itself from the R project page, www.r-project.org. Now let's get started with some basic R programming skills. The topics covered in this chapter include:
System command
Matrix
Descriptive statistics
Higher-ordered moments
Brownian motion in R
GARCH/ARCH Model
Heteroskedasticity
Regression model
System Command
To access the operating system directly, you can issue a system command using the following syntax:
system( "<command>" )
For example,
system("ls -lt | grep Das")
will list all the directory entries that contain my name, in chronological order. However, this kind of command will not work on a Windows machine because it is a UNIX command; it only works on a Linux box or a Mac.
Loading data
To get started, we need some data. Here are the steps:
1. Go to Yahoo! Finance
2. Download and save some historical data into an Excel spreadsheet
3. Reorder the data chronologically
4. Save the work as a CSV file
5. Read the file into R with read.csv
If required, the last command reverses the order of the data sequence. Stock data can also be downloaded using the quantmod package.
Note: the drop-down menus on Windows and Mac can be used to install a package; in Linux, use the package installer. The following command can also be used:
install.packages("quantmod")
Now we can start using the package. We extract the adjusted closing prices of each stock into separate columns and concatenate the columns into a single stock data set. We then compute daily returns as continuously compounded (log) returns, along with their means, and finally the correlation matrix and the covariance matrix of returns.
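Here is a minimal sketch of these steps using quantmod; the tickers are arbitrary examples, and Yahoo's data service must be reachable for the download to work:
library(quantmod)
tickers = c("AAPL", "MSFT", "IBM")           # hypothetical example tickers
getSymbols(tickers, src = "yahoo")           # downloads one xts object per ticker
# concatenate the adjusted closing prices into one data set
prices = merge(Ad(AAPL), Ad(MSFT), Ad(IBM))
# continuously compounded (log) daily returns
rets = na.omit(diff(log(prices)))
colMeans(rets)    # mean daily returns
cor(rets)         # correlation matrix
cov(rets)         # covariance matrix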
When the results are printed, R's layout makes it easy to read several significant digits. To make data files easy to work with in all formats, you can use the readr package, which has many convenient functions.
Matrices
In this section, we examine the basic commands needed to create and manipulate a matrix in R. We will create a 4x3 matrix of random numbers, as shown below.
When we transpose the matrix, we notice that its dimensions are reversed. For two matrices to be multiplied, they must conform: the number of rows of the matrix on the right must equal the number of columns of the matrix on the left. The resulting matrix has the number of rows of the matrix on the left and the number of columns of the matrix on the right.
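A small sketch of these matrix operations in R (the entries are random, so your output will differ):
A = matrix(rnorm(12), nrow = 4, ncol = 3)   # 4x3 matrix of random numbers
dim(A)        # 4 3
B = t(A)      # transpose: the dimensions are reversed
dim(B)        # 3 4
# B (3x4) times A (4x3): columns of the left matrix match rows of the right
C = B %*% A
dim(C)        # 3 3: rows of the left matrix, columns of the right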
Descriptive Statistics
Here, we use the same data to compute various descriptive statistics. The first step is to read a CSV data file into R. Once the stock data is loaded, we can compute daily returns and then convert them into annualized returns, as sketched below.
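A minimal sketch, assuming a CSV file of daily adjusted closing prices with a date column followed by one column per stock (the file name and layout are assumptions):
data = read.csv("stockdata.csv", header = TRUE)   # hypothetical file of daily prices
prices = data[, -1]                    # drop the date column
rets   = apply(log(prices), 2, diff)   # daily log returns
# annualize assuming roughly 252 trading days per year
ann_mean = colMeans(rets) * 252
ann_sd   = apply(rets, 2, sd) * sqrt(252)
ann_mean
ann_sd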
Higher-Ordered Moments
Two further moments arise in return distributions: skewness and kurtosis. To show how these work, we use the moments library in R.
Skewness = E[(X − µ)³] ÷ σ³
Skewness means that one tail is fatter than the other: a fatter right (left) tail means that the skewness is positive (negative).
Kurtosis = E[(X − µ)⁴] ÷ σ⁴
Kurtosis measures how fat the two tails are relative to the normal distribution. In a normal distribution, skewness is zero and kurtosis is 3; excess kurtosis is the value of kurtosis minus three.
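A short sketch using the moments package (the rets matrix of daily returns is assumed from the previous step):
library(moments)
apply(rets, 2, skewness)   # sample skewness of each return series
apply(rets, 2, kurtosis)   # sample kurtosis; subtract 3 for excess kurtosis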
Brownian Motion in R
Stock prices are often modeled with Brownian motion, in particular geometric Brownian motion:
dS(t) = µS(t) dt + σS(t) dB(t), S(0) = S0
This kind of equation is a stochastic differential equation (SDE) because it describes the random movement of the stock S(t) through the coefficients µ and σ: µ determines the drift of the stock process, while σ determines its volatility, and the randomness comes from the Brownian motion B(t). Unlike a deterministic differential equation, whose solution is a function of time only, this setting is more general: the solution of an SDE is a random function, not a deterministic one. Over a time interval h, the solution is
S(t + h) = S(t) · exp[ (µ − σ²/2)·h + σ·B(h) ]
The presence of B(h) ∼ N(0, h) in the solution is what gives it its random character. B(h) can also be written as √h·e, where e ∼ N(0,1). Because the return appears inside an exponential, the stock price is lognormal.
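A minimal simulation sketch of this geometric Brownian motion (the parameter values are arbitrary examples):
set.seed(1)
S0 = 100; mu = 0.10; sigma = 0.20    # hypothetical parameters
h  = 1/252                           # one trading day
n  = 252                             # one year of daily steps
e = rnorm(n)                                                  # standard normal shocks
S = S0 * exp(cumsum((mu - sigma^2/2)*h + sigma*sqrt(h)*e))    # simulated price path
plot(c(S0, S), type = "l", xlab = "day", ylab = "stock price")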
Maximum Likelihood Estimation
In maximum likelihood estimation (MLE), our concern is to find the parameters {µ,σ} that maximize the probability of seeing the empirical sequence of returns R(t). To carry out this estimation, we use a probability (likelihood) function. Here are the steps:
A quick review of the normal distribution, x ∼ N(µ,σ²), whose density function is
f(x) = 1/(σ√(2π)) · exp( −(x − µ)² / (2σ²) )
The standard normal distribution is x ∼ N(0,1), and for the standard normal distribution F(0) = 1/2.
Assuming the observed returns are drawn from this normal density, the product of the densities gives the likelihood, and its logarithm gives the log-likelihood, which is very easy to work with in R. The first step is to write out the log-likelihood function.
After this, we can carry out the MLE using the nlm (non-linear minimization) function in R, which uses a Newton-type algorithm, as sketched below.
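A minimal sketch of this approach using base R's nlm, which minimizes, so we pass the negative log-likelihood (the simulated data here is a stand-in for real returns):
set.seed(2)
R = rnorm(1000, mean = 0.0005, sd = 0.02)    # stand-in return series
# negative log-likelihood of an i.i.d. normal sample
negloglik = function(p, x) {
  mu = p[1]; sigma = abs(p[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
# nlm minimizes the negative log-likelihood from a starting guess
fit = nlm(negloglik, p = c(0, 0.01), x = R)
fit$estimate    # estimated {mu, sigma}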
GARCH/ARCH Models
GARCH stands for "Generalized Auto-Regressive Conditional Heteroskedasticity." Robert Engle invented ARCH, which later earned him a Nobel Prize, and Tim Bollerslev later extended it to GARCH. The emphasis of ARCH models is that volatility tends to cluster, i.e., volatility for period t is auto-correlated with volatility from period (t − 1) or other preceding periods. When a time series follows a random walk with such volatility clustering, a standard GARCH(1,1) specification of its returns is
r(t) = µ + e(t), with e(t) ∼ N(0, σ²(t))
σ²(t) = ω + α·e²(t−1) + β·σ²(t−1)
In GARCH, the return is conditionally normal and independent; however, because the variance changes over time, the returns are not identically distributed.
How Bivariate Random Variables Work
Two independent random variables (e1, e2) ∼ N(0,1) can be converted into two correlated random variables (x1, x2) with correlation ρ using the transformation
x1 = e1 and x2 = ρ·e1 + √(1 − ρ²)·e2
This means we can generate 10,000 correlated pairs of variables using the R code sketched below.
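A minimal sketch of this transformation in R (ρ = 0.6 is an arbitrary example value):
set.seed(3)
rho = 0.6                 # example correlation
e1 = rnorm(10000)
e2 = rnorm(10000)
x1 = e1
x2 = rho * e1 + sqrt(1 - rho^2) * e2
cor(x1, x2)               # should be close to 0.6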
Multivariate Random Variables
Correlated multivariate random variables are generated using the Cholesky decomposition, which writes the covariance matrix as a product of two matrices: Σ = L·L′, where L is a lower triangular matrix. There is also an alternative decomposition into an upper triangular matrix U = L′. Each component of the decomposition acts as a square root of the covariance matrix.
The Cholesky decomposition is very useful for generating correlated random numbers from a distribution with mean vector µ and covariance matrix Σ. If we have a scalar random variable e ∼ (0,1) and want to change it into x ∼ (µ, σ²), we generate e and then set x = µ + σe. If instead of a scalar we have a vector e = [e1, e2, ..., en]′ ∼ (0, I), it can be transformed into a vector of correlated random variables x = [x1, x2, ..., xn]′ ∼ (µ, Σ) by computing x = µ + L·e.
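A short sketch in R; note that R's chol() returns the upper triangular factor, so we transpose it to get L (the µ and Σ below are made-up examples):
mu    = c(0.01, 0.02, 0.015)                     # hypothetical mean vector
Sigma = matrix(c(0.04, 0.01, 0.00,
                 0.01, 0.09, 0.02,
                 0.00, 0.02, 0.16), nrow = 3)    # hypothetical covariance matrix
L = t(chol(Sigma))          # lower triangular factor, Sigma = L %*% t(L)
set.seed(4)
e = matrix(rnorm(3 * 10000), nrow = 3)           # independent N(0,1) draws
x = mu + L %*% e                                 # correlated draws, one column per observation
cov(t(x))                   # should be close to Sigma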
Portfolio Computation in R
A portfolio's risk is usually measured by its variance. As n (the number of securities in the portfolio) increases, the portfolio variance falls, until it approaches the average covariance of the assets. The following R sketch shows what happens to the variance as n grows.
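A minimal demonstration, assuming every asset has the same variance and every pair of assets the same covariance (the numbers are illustrative):
# portfolio variance for equal weights 1/n, common variance and covariance
port_var = function(n, sigma2 = 0.04, covar = 0.01) {
  w     = rep(1/n, n)
  Sigma = matrix(covar, n, n)
  diag(Sigma) = sigma2
  as.numeric(t(w) %*% Sigma %*% w)
}
sapply(c(1, 2, 5, 10, 50, 100), port_var)
# the variance falls toward the average covariance (0.01) as n grows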
Regression
A multivariate linear regression has the form
Y = X·β + e
where Y ∈ R^(t×1), X ∈ R^(t×n), and β ∈ R^(n×1). The least-squares solution of the regression is
β = (X′X)⁻¹(X′Y) ∈ R^(n×1)
To arrive at this result, we minimize the sum of squared errors
e′e = (Y − Xβ)′(Y − Xβ)
It is noteworthy that this expression is a scalar.
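A short sketch comparing the matrix formula with R's built-in lm() on simulated data:
set.seed(5)
t_obs = 200
X = cbind(1, rnorm(t_obs), rnorm(t_obs))       # intercept plus two regressors
beta_true = c(1, 2, -0.5)
Y = X %*% beta_true + rnorm(t_obs)
# closed-form least-squares solution: beta = (X'X)^{-1} X'Y
beta_hat = solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat
# the same fit with lm(); "-1" drops lm's own intercept since X already has one
coef(lm(Y ~ X - 1))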
Heteroskedasticity
In simple linear regression, it is assumed that the standard error of the residual is the same for all observations. Many regressions violate this assumption; errors whose variance differs across observations are known as "heteroskedastic" errors ("hetero" means "different," while "skedastic" refers to scatter or dispersion).
Heteroskedastic errors can be tested for using the standard Breusch-Pagan test available in R. It lives in the lmtest package, which should be loaded before running the test.
If the test shows some heteroskedasticity in the errors (visible in the p-value), we correct for it using the hccm function, which stands for heteroskedasticity-corrected covariance matrix. We use hccm to generate a new covariance matrix vb of the coefficients, take the square root of the diagonal of this matrix as the revised standard errors, and divide the coefficients by the new standard errors to recompute the t-statistics, as sketched below.
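A sketch of the test and the correction, assuming a fitted model like the regression example above; bptest comes from the lmtest package and hccm from the car package:
library(lmtest)
library(car)
fit = lm(Y ~ X - 1)              # model from the regression sketch above
bptest(fit)                      # Breusch-Pagan test; a small p-value suggests heteroskedasticity
vb = hccm(fit)                   # heteroskedasticity-corrected covariance matrix
se_robust = sqrt(diag(vb))       # revised standard errors
coef(fit) / se_robust            # recomputed t-statistics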
Auto-Regressive Model
Whenever data is autocorrelated, that is, exhibits dependence over time, failing to account for this produces spuriously high statistical significance: observations that are correlated over time are treated as though they were independent, so the effective number of observations is smaller than it appears.
In an efficient market, the autocorrelation of returns from one period to the next should be close to zero.
The computation above is for immediately consecutive periods, referred to as first-order autocorrelation. It can be examined across many staggered (lagged) periods using the R functions in the car package.
When the Durbin-Watson (DW) statistic is close to 2, there is usually no trace of autocorrelation; when the DW statistic is less than 2, there is positive autocorrelation, and when it is greater than 2, there is negative autocorrelation.
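A brief sketch using car's Durbin-Watson test on a fitted regression (reusing the hypothetical fit from above; max.lag controls how many staggered periods are examined):
library(car)
durbinWatsonTest(fit, max.lag = 3)   # DW statistics for lags 1 to 3
# values near 2 indicate no autocorrelation; below 2 positive, above 2 negative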
Vector Auto-Regression (VAR)
Vector auto-regression (VAR) is very useful for estimating systems in which the variables influence each other and the regression equations are simultaneous. In a VAR, each variable in the system depends on lagged values of itself and of the other variables. The number of lags is chosen by the econometrician based on the expected decay in the time-dependence of the variables in the VAR.
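As an illustration (the choice of package is an assumption, since the text does not name one), the vars package provides a VAR estimator:
library(vars)
set.seed(6)
y = cbind(r1 = rnorm(200), r2 = rnorm(200))   # stand-in for two return series
fit_var = VAR(y, p = 2, type = "const")       # VAR with 2 lags and a constant
summary(fit_var)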
Conclusion
In this chapter, we explored various features of R and its packages, and we examined several types of regression models in R. In the next chapter, we will examine data handling using R.
Chapter Four: Data Handling and Other Useful Things
This chapter focuses on some alternative tools in R beyond those examined in the previous chapter, and we will frequently refer back to the R topics treated there. Here, we explore some of R's most powerful packages, especially those that support SQL-like operations for handling both small data and big data. The topics considered in this chapter include:
Data extraction of stocks using quantmod
How to use the merge function
How to use the apply class of functions
Getting interest rate data from FRED
How to handle dates using lubridate
Using the data.table package
Data Extraction Of Stocks Using Quantmod
Here we use the quantmod package introduced in the previous chapter to fetch some initial data. When the length of each stock series is printed, you will find that they are not the same. Our next step is to convert the adjusted closing prices of each stock into separate data.frames. Here are the steps:
Construct a list of data.frames, since the individual data.frames are stored in a list.
Give each data.frame a date column; this is used later to join the separate stock data.frames into a single composite data.frame.
Next, use a join to integrate all of the stocks' adjusted closing prices into one data.frame. The merge can be done through a union join or an intersect join; intersect is the default (see the sketch after this list). We will observe that the merged table contains the number of rows of the stock index, which has fewer observations than the individual stocks; because this is an intersect join, some rows are dropped.
Plot all stocks from the single data frame using ggplot2, which is more flexible than the basic plot function (though we can first use the basic plot function).
Next, convert the data into returns, either discrete returns or continuously compounded (log) returns.
The returns data.frame can be used to present descriptive statistics of the returns. Next, compute the correlation matrix of returns, and finally display the correlogram for the six return series to see the relationships among all the variables in the data set.
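A sketch of the join step; dfs here is a hypothetical list of data.frames, each with a Date column and one adjusted-close column per stock:
stocks = Reduce(function(x, y) merge(x, y, by = "Date"), dfs)   # intersect join by default
head(stocks)
rets = apply(log(stocks[, -1]), 2, diff)    # continuously compounded returns
summary(rets)                               # descriptive statistics
cor(rets)                                   # correlation matrix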
How to Use the Merge Function
Data frames are similar to tables or spreadsheets, but they behave very much like database tables. Merging two data frames is the same as joining two tables in a database, and R provides the merge function for this.
Now, let's assume we already have a list of ticker symbols and want to produce a detailed data frame from these tickers. The first thing to do is read in the ticker names. Let's assume the tickers are in a file named tickers.csv whose delimiter is the colon character. It is read like this:
tickers = read.table("tickers.csv", header=FALSE, sep=":")
This line of code reads the file into two columns of data. The top of the file contains the six rows listed below:
> head(tickers)
  V1 V2
1 NasdaqGS ACOR
2 NasdaqGS AKAM
3 NYSE ARE
4 NasdaqGS AMZN
5 NasdaqGS AAPL
6 NasdaqGS AREX
The next line of code lists the number of input tickers, while the following line renames the data frame's columns. The tickers' column is renamed "symbols" because the data frame that will be merged with it shares the same column name; this column is the index on which the two data frames are joined.
The next step is to read in the list of every stock on the NYSE, Nasdaq, and AMEX, shown as follows.
The top of the Nasdaq listing contains the following:
Our next action is to combine all three data frames into a single data frame and then check the number of rows in the merged file by checking its dimensions. These two actions are shown as follows:
co_names = rbind(nyse_names, nasdaq_names, amex_names)
> dim(co_names)
[1] 6801 8
Lastly, we join the ticker symbol file and the exchange data into one data frame using the merge function. This extends the ticker file to contain the information in the exchange file.
Now, let's assume we wish to find the names of the CEOs of all 98 companies in our list. Since we don't have a document containing the information we seek, we can download it; a site such as the Google Finance page has this information. Our next action is to write R code to scrape the data from the Google Finance pages one after the other. Once we extract each CEO's name, we augment the tickers' data frame using R code.
The R code that augments the tickers' data frame does this with the stringr package, which simplifies string handling. Once the page text is extracted, we search for the line that contains the words "Chief Executive." The final data frame contains the names of the CEOs.
How To Use The Apply Class Of Functions
Often, a function needs to be applied to many cases, whose parameters may be provided in a matrix, vector, or list. This is similar to repeating the evaluation of a function by looping through a set of parameter values. In the illustration below, we use the apply function to compute the mean return of each stock. The first argument is the data the function is applied to, the second is the margin (1 for rows, 2 for columns), and the third is the function being evaluated.
We will notice that the function returns the column means of the data. In addition, lapply applies a function to a list, sapply works with vectors and matrices (simplifying the result), and mapply takes multiple arguments. To verify our work, we can use the colMeans function.
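A quick sketch, reusing the hypothetical rets matrix of returns from earlier:
apply(rets, 2, mean)    # mean return of each stock (by columns)
colMeans(rets)          # the same result, as a check
lapply(as.data.frame(rets), mean)    # list version via lapply
sapply(as.data.frame(rets), mean)    # simplified to a named vector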
How To Get Interest Rate Data From FRED
FRED stands for Federal Reserve Economic Data, an authoritative source of interest rate data. It is managed by the St. Louis Federal Reserve Bank and hosted at https://research.stlouisfed.org/fred2/. Now, let's assume we want to download the data directly into R from FRED. To achieve this, we write some code of our own; a ready-made facility existed before the website was changed, but it is easy enough to roll our own code in R.
We use this function to download the data and to produce a list of economic time series. The dates are used as an index to join the individual series into a single data set. We download interest rates (yields) for maturities from one month (DGS1MO) to thirty years (DGS30).
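One simple way to sketch this download, as an alternative to hand-rolled code, is quantmod's FRED interface (the series shown are only a subset of the maturities, for illustration):
library(quantmod)
series = c("DGS1MO", "DGS1", "DGS10", "DGS30")     # a subset of the maturities
getSymbols(series, src = "FRED")                   # one xts object per series
rates = merge(DGS1MO, DGS1, DGS10, DGS30)          # joined on the date index
tail(rates)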
Now we have a data frame that contains all the series we are interested in. Next, we sort the data.frame by date, but before this we first convert the dates into number strings, as shown below.
NA represents missing values; note that some values are represented by "-99." Although both NA and -99 could be wiped out, we leave them because they represent times when there was no yield for that maturity.
How To Handle Dates Using Lubridate
Suppose we want to sort the data.frame of failed banks month by month, day by day, and week by week. This requires a package for handling dates, and a very useful tool developed by Hadley Wickham is the lubridate package.
We first sort by month to see whether we can detect any form of seasonality. There is no seasonality in the monthly sorting, so let's try sorting by day of the month. From the counts, we observe that bank failures are indeed lower at the start and end of each month.
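A minimal sketch, assuming a data.frame fb of failed banks with a date column named Closing.Date in month/day/year form (the data set and the column format are assumptions):
library(lubridate)
d = mdy(fb$Closing.Date)        # parse the dates
table(month(d))                 # failures grouped by calendar month
table(mday(d))                  # failures grouped by day of the month
table(week(d))                  # failures grouped by week of the year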
Using The Data.Table Package
This is a very clever package written by Matt Dowle. It allows a data.frame to work like a database and supports fast, efficient handling of massive quantities of data; the technology has also been embedded by h2o (http://h2o.ai/). To see how this works, we use some downloadable crime statistics for California, saved as a csv file so that it can easily be read into R.
data = read.csv("CA_Crimes_Data_2004-2013.csv", header=TRUE)
Now it is easy to convert the data into a data.table:
library(data.table)
D_T = as.data.table(data)
We will notice that the syntax looks very much like that of a data.frame. Because the table is large, we print only its dimensions rather than the entire object:
print(dim(D_T))
One of the unique characteristics of a data.table is that it can be indexed by making any column the key. Once this is done, it is easy to compute subtotals and even to generate plots from them.
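A sketch of keying and aggregating; the column names Year and ViolentCrimes are assumptions about the crime file's layout:
library(data.table)
setkey(D_T, Year)                                     # index the table by Year
# subtotal a (hypothetical) crime count by year
crimes_by_year = D_T[, list(total = sum(ViolentCrimes)), by = Year]
crimes_by_year
barplot(crimes_by_year$total, names.arg = crimes_by_year$Year)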
We will notice that the output generated is still a data.table (it also inherits the data.frame class). Our next action is to plot the result of the data.table the same way we would plot from a data.frame.
Using the plyr Family of Packages
This package family was written by Hadley Wickham. It is very useful for applying functions to tables of data (data.frames). In our program, we may also want to write a custom function, and it is in writing such functions that these packages come in. In R, we can use either the plyr family of packages or data.table to handle a data.frame like a database.
Next, we use the filter function to subset the rows of the dataset that we want to select for further analysis.
These tools also provide a clean way to compute grouped statistics, as sketched below. The steps are:
1. Group the data by start point (station).
2. Use the groups to produce statistics.
3. Count the number of trips beginning at each start station and calculate the average duration of those trips.
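A dplyr-style sketch of these steps (dplyr is the modern successor to plyr); trips is a hypothetical data.frame of bike trips, and the column names Start.Station and Duration are assumptions:
library(dplyr)
trip_stats = trips %>%
  group_by(Start.Station) %>%                 # 1. group by start point
  summarise(n_trips  = n(),                   # 2./3. count trips from each station
            avg_time = mean(Duration))        #      and the average trip duration
arrange(trip_stats, desc(n_trips))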
Conclusion
This chapter explained in detail how data is handled with R packages: how to merge data, apply functions to data, and use the various options available for handling big and small data. The next chapter examines the Markowitz mean-variance problem.
Chapter Five: Markowitz Mean-Variance Problem
This chapter examines the Markowitz mean-variance problem. This problem is not only famous in data science; its solution is also still widely used. In this chapter, we will cover the following topics:
Markowitz mean-variance problem
How to solve the problem using the quadprog package
Risk Budgeting
Markowitz Mean-Variance Problem
This is a very famous portfolio optimization problem, and its solution is still widely used today. Our focus in this chapter is a portfolio of n assets, with expected return E(rp) and variance Var(rp). The portfolio weights are denoted w ∈ Rn: if we have, say, $1 to allocate to the assets, the weights describe the fraction of that $1 allocated to each asset, and the sum of the weights is 1.
Quadratic (Markowitz) Problem
The optimization problem can be stated as follows: we want the portfolio to achieve a pre-specified level of expected return while its variance (risk) is kept as small as possible:
min over w of (1/2) w′Σw
subject to w′µ = E(rp) and w′1 = 1
The ½ in front of the variance is for mathematical neatness; its role will become clear as we progress in this chapter, and scaling the objective function by a constant does not change the minimizing solution. There are two constraints on this variance minimization. The first forces the portfolio's expected return to equal the specified mean return E(rp); the second, also known as the fully invested constraint, ensures that the portfolio weights sum to 1. Both are equality constraints.
This is a problem of the Lagrangian type: we use Lagrange multipliers {λ1, λ2} to embed the constraints into the objective function, which turns it into an unconstrained minimization problem.
To minimize this function, we take derivatives with respect to w, λ1, and λ2 and arrive at the first-order conditions stated as follows.
The first equation, denoted (*), is a system of n equations, because the derivative is taken with respect to every element of the vector w; together with the two constraints, this gives a total of (n + 2) first-order conditions. From (*):
Let's note these observations: since Σ is positive definite, Σ⁻¹ is also positive definite, and B > 0, C > 0.
Substituting the solutions for λ1 and λ2, we find the solution for w. The resulting expression gives the optimal portfolio weights that minimize the variance for a given level of expected return E(rp). Once the inputs to the problem, µ and Σ, are given, the vectors g and h are fixed.
E(rp) can then be varied to trace out the set of frontier (optimal or efficient) portfolios w.
Therefore, the two portfolios g and h generate the entire frontier.
Solution in R
We can write an R function that returns the optimal portfolio weights, as sketched below. We call the function with a target expected return, an example mean return vector, and a covariance matrix of returns.
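A sketch of the standard closed-form frontier solution; the mean vector and covariance matrix below are made-up inputs, not the book's original data:
markowitz = function(mu, Sigma, Er) {
  n    = length(mu)
  one  = rep(1, n)
  Sinv = solve(Sigma)
  A = as.numeric(t(one) %*% Sinv %*% mu)
  B = as.numeric(t(mu)  %*% Sinv %*% mu)
  C = as.numeric(t(one) %*% Sinv %*% one)
  D = B * C - A^2
  g = (B * (Sinv %*% one) - A * (Sinv %*% mu)) / D
  h = (C * (Sinv %*% mu)  - A * (Sinv %*% one)) / D
  g + h * Er                       # optimal weights for target expected return Er
}
# hypothetical inputs: three assets ordered from low to high risk
mu    = c(0.05, 0.10, 0.15)
Sigma = diag(c(0.01, 0.04, 0.09))
markowitz(mu, Sigma, 0.18)         # an aggressive target return
markowitz(mu, Sigma, 0.10)         # a moderate target return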
For an expected return target of 0.18 in the first example, we notice that we short some low-risk assets and go long some medium- and high-risk assets. However, when the expected return target is reduced to 0.10, all the weights are positive.
How To Solve The Problem Using The Quadprog Package
quadprog is an optimizer that minimizes a quadratic objective function subject to linear constraints, which is exactly what we need to solve the mean-variance portfolio problem we just treated. Another significant benefit of this package is that we can add inequality constraints. For instance, if we do not wish to allow short sales of any asset, we can bound the weights to lie between zero and one. The package manual gives the full specification of the quadprog interface.
In the setup of the problem we are dealing with, with no short sales and three securities, we have the corresponding bvec and Amat. The constraints are modulated by meq = 2, which states that the first two constraints are equality constraints, while the remaining ones are greater-than-or-equal-to constraints.
The package code would be run in the format sketched below.
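A hedged sketch of the call using quadprog's solve.QP; the inputs reuse the hypothetical mu and Sigma from the Markowitz sketch above, and the constraint matrix shown is one standard way to encode the problem:
library(quadprog)
Dmat = Sigma                              # quadratic term: (1/2) w' Sigma w
dvec = rep(0, 3)                          # no linear term in the objective
# short sales allowed: only the two equality constraints w'mu = Er, w'1 = 1
Amat = cbind(mu, rep(1, 3))
bvec = c(0.18, 1)
solve.QP(Dmat, dvec, Amat, bvec, meq = 2)$solution
# no short sales: append w >= 0 constraints (the target must then be feasible)
Amat2 = cbind(mu, rep(1, 3), diag(3))
bvec2 = c(0.10, 1, rep(0, 3))
solve.QP(Dmat, dvec, Amat2, bvec2, meq = 2)$solution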
For the example treated in the Markowitz section, running the code with an expected return of 0.18 and short selling allowed gives:
[1] −0.3575931 0.8436676 0.5139255
This is exactly the same result we got from the Markowitz solution. When we restrict short selling, we arrive at the same weights we obtained for the 0.10 target in the Markowitz solution.
Risk Budgeting
Risk budgeting offers a different view of the same Markowitz optimization problem and is one of the more recent approaches to portfolio construction. One version constructs a portfolio in which the risk contribution of every asset is equal; this approach is known as "risk parity." Another version constructs a portfolio in which every asset contributes the same share of the total return of the portfolio, an approach known as "performance parity."
If the portfolio is described by its weights w, its risk is a function of those weights, denoted R(w). Taking the risk measure to be the standard deviation of the portfolio return,
R(w) = σ(w) = √(w′Σw)
This risk function is homogeneous: if the size of every position in the portfolio is doubled, the risk measure also doubles. This is known as the homogeneity property of a risk measure, one of the coherence properties of risk measures described by Artzner, Delbaen, Eber, and Heath (1999). Once a risk measure satisfies homogeneity, the next step is to apply Euler's theorem to decompose the risk into the amount contributed by each asset.
With the risk measure defined as the standard deviation of portfolio return, the risk decomposition requires the derivative of the risk with respect to each weight:
R(w) = w1·∂R(w)/∂w1 + w2·∂R(w)/∂w2 + ... + wn·∂R(w)/∂wn
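A small sketch of this Euler decomposition for the standard-deviation risk measure, reusing the hypothetical Sigma from the Markowitz sketch (for σ(w) = √(w′Σw), the marginal risks are Σw/σ(w)):
w = c(0.3, 0.5, 0.2)                                  # hypothetical portfolio weights
sigma_p = sqrt(as.numeric(t(w) %*% Sigma %*% w))      # portfolio risk R(w)
marginal = as.vector(Sigma %*% w) / sigma_p           # dR/dw for each asset
risk_contrib = w * marginal                           # each asset's risk contribution
risk_contrib
sum(risk_contrib)     # Euler: the contributions add back up to sigma_p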
Conclusion
In this chapter, we examined the Markowitz problem in data science and the various packages that can solve it. The next chapter examines Bayes' theorem and the types of models built on it.
Chapter Six: Bayes Theorem
This theorem deals with separating coincidence from reality. A very good explanation of the theorem is given on Wikipedia (http://en.wikipedia.org/wiki/Bayes theorem) and in a talk by Professor Persi Diaconis on Bayes, available on Yahoo video. In business, we often encounter questions bordering on reality versus coincidence. A good example: is Warren Buffett's investment success a coincidence? How do we answer that question? Do we use our prior knowledge of the probability that Buffett can beat the market, or do we check the performance of his business over time? It is in answering such questions that Bayes' rule comes in. The rule follows from the decomposition of a joint probability:
Pr[A ∩ B] = Pr(A|B) Pr(B) = Pr(B|A) Pr(A)
The last two terms in the equation can be rearranged as
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
In the classic disease-testing application of this rule, when the test comes back negative there is only a very slim chance of actually having the disease, so there is little to worry about.
Correlated Default (Conditional Default)
Bayes' theorem is very effective for working out conditional default information. Bond fund managers are not as concerned with the correlation of defaults among the bonds in their portfolio as they are with conditional defaults, that is, with the conditional probability that one bond defaults given that another has. To calculate this, modern financial institutions have developed tools to obtain the conditional default probabilities of firms.
Let's assume we already know that firm 1 has a default probability p1 = 1% and firm 2 has a default probability p2 = 3%, and that the default correlation of the two firms over a year is 40%. If either bond defaults, what is the probability of default of the other, conditional on the first default? Despite the limited information on the firms' probabilities of default, we can still use Bayes' theorem to obtain the conditional probability of interest. Here are the steps:
Define di, i = 1, 2, as the default indicator for the two firms:
di = 1 if firm i defaults
di = 0 if it does not
We note the following in our Bayes application:
E(d1) = 1·p1 + 0·(1 − p1) = p1 = 0.01
Likewise,
E(d2) = 1·p2 + 0·(1 − p2) = p2 = 0.03
Because d1 and d2 have Bernoulli distributions, their standard deviations are σ1 = √(p1(1 − p1)) and σ2 = √(p2(1 − p2)). The joint default probability of the two firms is then
p12 = E(d1·d2) = ρ·σ1·σ2 + p1·p2
Our conditional probabilities are:
p(d1|d2) = p12/p2 = 0.0070894/0.03 = 0.23631
p(d2|d1) = p12/p1 = 0.0070894/0.01 = 0.70894
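The arithmetic can be checked with a few lines of R:
p1 = 0.01; p2 = 0.03; rho = 0.40
sig1 = sqrt(p1 * (1 - p1))           # Bernoulli standard deviations
sig2 = sqrt(p2 * (1 - p2))
p12 = rho * sig1 * sig2 + p1 * p2    # joint default probability
p12                                  # 0.0070894
p12 / p2                             # p(d1 | d2) = 0.23631
p12 / p1                             # p(d2 | d1) = 0.70894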
From these conditional probabilities, we can conclude that once one firm defaults, default contagion becomes much more severe.
Continuous and More Formal Exposition
Some very significant expressions in Bayesian approaches are the posterior, the prior, and the likelihood; they are explained in this section. In standard notation, we are interested in a parameter θ, say the mean of the distribution of some data x. In the Bayesian approach, we do not concentrate only on a single value of θ; we work with a distribution for θ, beginning with some prior assumption about that distribution. We therefore start with p(θ), referred to as the prior distribution. We then bring in the data x and combine it with the prior to obtain the posterior distribution p(θ|x). To do this, we need the probability of seeing the data x given θ, which is supplied by the likelihood function L(x|θ). Assuming we already know the variance σ² of our data x, applying Bayes' theorem gives
p(θ|x) = L(x|θ)·p(θ) / p(x) ∝ L(x|θ)·p(θ)
If we assume that both the prior distribution for the mean and the likelihood are normal, say θ ∼ N(µ0, σ0²) and x|θ ∼ N(θ, σ²), then the posterior is also normal, with
posterior mean = (µ0/σ0² + x/σ²) / (1/σ0² + 1/σ²) and posterior variance = 1 / (1/σ0² + 1/σ²)
When the prior distribution and posterior distribution have the same form, as here, the prior is said to be a "conjugate" with respect to the specific likelihood function. If we then observe n new values of x with sample mean x̄, the new posterior has
posterior mean = (µ0/σ0² + n·x̄/σ²) / (1/σ0² + n/σ²) and posterior variance = 1 / (1/σ0² + n/σ²)
Bayes Net
A Bayes net is a network diagram that can be used to visualize joint distributions over several outcomes/events and to reason about higher-dimensional Bayes problems. The net is a directed acyclic graph (referred to as a DAG), which means that cycles are not permitted in the graph.
To understand how Bayes nets work, we use an example of economic distress. Distress can appear at three levels: the economy level (E = 1), the industry level (I = 1), and the firm level (F = 1). Economy-wide distress can result in industry distress, which may or may not lead to firm distress. The diagram below shows the flow of causality. It is noteworthy that the probabilities in the first table are unconditional, while all the others are conditional; in the conditional probability tables, each pair adds up to 1. The channels in the tables correspond to the arrows in the Bayes net diagram.
In the diagram, we notice that there are three channels in the Bayes net. Channel a stands for the inducement of industry distress by economic distress; channel b stands for the inducement of firm distress by industry distress; and the last channel, c, stands for the inducement of firm distress directly by economic distress.
The question that arises from this net is: what is the probability that the industry is distressed, given that the firm is in distress? The calculation of Pr(I = 1 | F = 1) is stipulated below.
Bayes Rule in Marketing
Bayes' rule shows up very naturally in one of the most widely used market research techniques: pilot (test) marketing. Let's assume we have a product launch with value x. If the product fails (F), the payoff is −70; if it succeeds (S), the payoff is +100. The probabilities of these two outcomes are
Pr(S) = 0.7, Pr(F) = 0.3
We can easily check that the expected value is E(x) = 49. Suppose we could buy protection against a failed product; that protection would be a put option on the real option, worth 0.3 × 70 = 21. Since the put option covers the entire loss from a failed product, its value is the expected loss, i.e., the loss conditional on failure times the probability of failure. Market researchers often describe this as the value of "perfect information."
Suppose, however, that there is an intermediate choice: rather than proceeding with the product launch at these odds, we can first run a pilot test. The pilot test is not perfectly accurate, but it is reasonably informative. It emits a success signal (T+) or a failure signal (T−), with the following conditional probabilities:
Pr(T+|S) = 0.8, Pr(T−|S) = 0.2
Pr(T+|F) = 0.3, Pr(T−|F) = 0.7
That is, the pilot test gives a valid reading of success only 80% of the time. The probability that the pilot signal is positive can be computed as follows:
Pr(T+) = Pr(T+|S)Pr(S)+Pr(T+|F)Pr(F)
= (0.8)(0.7) +(0.3)(0.3) = 0.65
Negative result can be computed as follows:
Pr(T−) = Pr(T−|S)Pr(S)+Pr(T−|F)Pr(F)
= (0.2)(0.7) +(0.7)(0.3) = 0.35
This allows us to compute the following posterior probabilities:
Pr(S|T+) = Pr(T+|S)Pr(S)/Pr(T+) = 0.56/0.65 = 0.86154, Pr(F|T+) = 0.13846
Pr(S|T−) = Pr(T−|S)Pr(S)/Pr(T−) = 0.14/0.35 = 0.4, Pr(F|T−) = 0.6
Now that we have these conditional probabilities, let us re-evaluate our product launch. If the result of the pilot test is positive, the expected value of the product launch is:
E(x|T+) = 100Pr(S|T+)+(−70)Pr(F|T+)
= 100(0.86154)−70(0.13846)
= 76.462
But if the test is negative, the value of our launch is
E(x|T−) = 100Pr(S|T−)+(−70)Pr(F|T−)
= 100(0.4)−70(0.6)
= −2
Now that we know the value of the launch after both a positive and a negative pilot result, the overall value with the pilot test is:
E(x) = E(x|T+)Pr(T+)+E(x|T−)Pr(T−)
= 76.462(0.65) +(0)(0.35)
= 49.70
Since the launch has negative value (−2) after a negative pilot result, we would simply not launch in that case, which is why the second term uses 0 rather than −2. Without the pilot test the value of the launch is 49, so the incremental value of the pilot test over the no-test case is 0.70.
Bayes Models in Credit Rating Transitions
Companies and business organizations are usually allocated to credit rating classes. Unlike a default probability, a credit rating is a coarser bucketing of credit quality, and rating classes tend to be updated slowly. As a result, the DFG models use a Bayesian approach to develop a model of rating changes that uses contemporaneous data on default probabilities.
Accounting Fraud
Bayesian inference can also be used to detect accounting fraud in audits. When fraud is suspected, an auditor can use a Bayesian hypothesis of fraud, verify it against past data, and assess the chance that the current fraud has been ongoing for a while.
Conclusion
In this chapter, we examined the main ideas and uses of Bayes' theorem. We examined Bayes nets and how Bayesian reasoning explains conditional default information. In the next chapter, we examine news analysis in data science: algorithms, word counts, and more.
Chapter Seven: More Than Words - Extracting Information From
News
This chapter explains in detail the concept of extracting information from news. Wikipedia defines news analysis as the measurement of the various qualitative and quantitative attributes of textual news stories, attributes such as sentiment, relevance, and novelty; "expressing news stories as numbers permits the manipulation of everyday information mathematically and statistically." The chapter examines the various analytical techniques in news extraction, the various news analytics software and methods, and the sets of metrics that can be used to assess analytic performance. The topics covered in this chapter include:
What is News Analysis?
Algorithms
Scrapers and Crawlers
Pre-processing Text
Term Frequency - Inverse Document Frequency (TF - IDF)
Text Classification
Word Count Multiplier
Metrics
Text Summarization
What is News Analysis
News analysis is an umbrella term covering a set of formulas, techniques, and statistics used to classify and summarize public sources of information, together with the metrics used to assess the analytics themselves. The field of news analysis is very broad; it covers areas such as machine learning, information retrieval, network theory, statistical learning theory, and collaborative filtering. However, all of it can be broken into three broad categories of news analysis: text, content, and context.
Text in news analytics refers to the visceral aspect of news, i.e., words, phrases, sentences, document headings, and so on. The main purpose of analytics here is to convert text into information. This is carried out by three means:
Signing the text
Classifying the text
Summarizing it into its main components
During the summarization process, the analytics discard text that is not relevant while separating out the information with the highest signal content.
The next layer of news analytics is content. Content expands the domain of text to include images, text forms (blogs, emails, pages, etc.), timing, and formats (XML, HTML), among others. Content enriches text by attaching indications of quality and veracity that can be exploited in analytics. For instance, a blog may be deemed of higher quality than a stock message-board post, and financial information carried by Dow Jones may have more value than a blog.
The last layer of news analytics is context, which is simply the relationship between information items; it can also refer to the network relationships of news. Exploring the relationship between context and news analytics, Das, Martinez-Jerez, and Tufano (2005) present a clinical study of four companies that examines the relationship of news analytics to message-board postings. Similarly, Das and Sisk (2005) explore the social networks of message-board postings to find out whether portfolio rules can be created from the network connections between stocks. A good example of an analytic that functions at all three levels is Google's PageRank algorithm. The algorithm has many features; its kernel is context, while other features rely on text and content. Context is the kernel of the algorithm because a page's rank depends on the number of highly ranked pages pointing to it, and search is the most widely used form of news analytics.
From our explanation so far, it can be deduced that news analytics is where algorithms and data meet, and this is where tension is generated between the two: there has been a heated debate about which of the two should matter more. The debate was brought up in a talk at the 17th ACM Conference on Information and Knowledge Management (CIKM '08), in which Peter Norvig, Google's director of research, stated his preference for more data over better algorithms; according to him, "data is more agile than code." On the one hand, this might sound reasonable; on the other, too much data can render an algorithm useless, leading to overfitting.
When we debate whether data or algorithms should dominate, it can seem as though there is no relationship between the two, but that is not the case. To start with, news data shares the same three broad classifications as news analytics: text, content, and context. The complexity of the analysis depends on which of the three is dominant. Generally, text analysis is the simplest of the three, while context, which relies on network relationships, can be quite difficult. For example, a community-detection algorithm can be very complicated compared to a word-count algorithm, which is simple, almost naive; the community-detection algorithm has more demanding memory requirements and logic.
The tension between the two aspects, news data and news algorithms, is managed and controlled by domain specificity, i.e., the amount of customization needed to implement the news analytics. It is quite interesting that low-complexity algorithms often require more domain specificity than high-complexity ones. For example, the community-detection algorithm from the previous illustration needs little domain knowledge because it applies to a wide range of graphs. This is not the case with word-count algorithms: a word-count algorithm requires domain knowledge of grammar, lexicon, and even syntax. Moreover, political messages would need to be read differently from medical messages.
Algorithms
Crawlers and Scrapers
Crawlers are sets of algorithms used to generate a series of web pages that may then be searched for news content. The software derives its name "crawler" from the way it works: it starts from some web pages and crawls from them to others, and the algorithm then chooses which of the gathered pages to visit. The most common approach to choosing the next page is to move from the current page to one of the pages it hyper-references. Importantly, a crawler uses heuristics to explore the tree from any given node, determining which of the many possible paths are useful before choosing which ones to focus on.
Web scrapers download the details of a chosen web page and may or may not format the page for
analysis. Virtually every programming language has modules for web scraping. These modules contain
built-in functions that connect directly to the web; once the functions are called, downloading
user-specified or crawler-specified URLs becomes easy. The popularity of web analysis has led most
statistical packages to ship with their own web scraping functions. For instance, R comes with a
web scraping function in its base distribution: whenever we want to read a page into a vector of
text lines, we can download it with a single-line command.
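As a minimal sketch, base R's readLines can pull a page into a character vector, one element per line; the URL below is a placeholder, so substitute the page you actually want to scrape:

# Read a web page into a vector of text lines using base R.
page <- readLines("https://www.example.com/", warn = FALSE)

# Inspect the first few lines of raw HTML.
head(page)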
Excel, the most widely used spreadsheet, also has a built-in web scraping facility, accessed
through the Data → Get External Data command tree. Once a web query is set up, its results can be
placed into a worksheet and operated on as desired. We can also set up Excel so that it refreshes
the content regularly.
Gone are the days when web-scraping code had to be written in Java, C, Python, or Perl. Today we
can use tools like R to handle statistical analysis, algorithms, and data, and all three can be
written within the same software. Data science progresses daily.
Pre-Processing Text
We often think that no text can be dirtier than text from external feeds, but this is not the case:
text scraped from web pages is dirtier still. Before news analytics algorithms are applied, the
text must first be cleaned. This clean-up process is known as pre-processing. The first step is
HTML clean-up, which removes all HTML tags from the body of a message; examples of such tags
include <p>, <BR>, etc. The next step deals with abbreviations: abbreviated phrases and
contractions are expanded to their full forms, so that "it's" is written out as "it is," "ain't" as
"is not," and so on. The third step handles negation. An expression containing a negative word
means the opposite of the same expression without it. To handle this, we first detect negative
words such as "not," "no," and "never," and then tag the remaining words in the sentence in which
they appear; this tagging helps reverse the meaning of the sentence.
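A minimal sketch of these clean-up steps in base R; the sample sentence and the tiny contraction table are illustrative only, and a real lexicon would be much larger:

raw <- "<p>It's NOT a <BR>great quarter, ain't looking good</p>"

# 1. HTML clean-up: strip anything that looks like a tag.
txt <- gsub("<[^>]+>", " ", raw)

# 2. Expand a few contractions (illustrative; extend as needed).
txt <- gsub("it's", "it is", txt, ignore.case = TRUE)
txt <- gsub("ain't", "is not", txt, ignore.case = TRUE)

# 3. Negation tagging: mark every word that follows a negative word.
words <- tolower(strsplit(txt, "\\s+")[[1]])
words <- words[words != ""]
neg <- which(words %in% c("not", "no", "never"))
if (length(neg) > 0) {
  idx <- (min(neg) + 1):length(words)
  words[idx] <- paste0("NEG_", words[idx])
}
words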
Another significant aspect of pre-processing is stemming, which deals with root words: words are
replaced by, and represented as, their roots, so that different tenses and inflections of a word
are not treated as different words. Various stemming algorithms are available in most programming
languages; the most popular is the Porter stemmer, introduced in 1980. Stemming varies from
language to language and is therefore language-dependent.
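As an illustrative sketch, assuming the SnowballC package (one common R implementation of the Porter stemmer) is installed:

library(SnowballC)

# Different inflections collapse to a common root.
wordStem(c("running", "runs", "ran", "walked", "walking"), language = "porter")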
Term Frequency - Inverse Document Frequency (TF - IDF)
This is a scheme used to weight the usefulness of rare words in a document. TF-IDF uses a very
simple calculation and does not have a strong theoretical basis. It simply measures the importance
of a word w in a document d within a corpus C. Since it is a function of these three quantities,
we write it as TF-IDF(w, d, C); it is the product of term frequency (TF) and inverse document
frequency (IDF).
If f(w, d) denotes the number of times word w appears in document d, one common choice of term
frequency is
TF(w, d) = ln[f(w, d)]
This is known as log normalization. Another form, known as double normalization, is (in its
standard form)
TF(w, d) = 0.5 + 0.5 × f(w, d) / max_v f(v, d)
where the maximum is taken over all words v in document d.
The inverse document frequency weights a word by how rare it is across the corpus; in its most
common form, IDF(w, C) = ln[|C| / |{d ∈ C : w ∈ d}|], the log of the number of documents divided
by the number of documents containing w. The score for a given word w in document d and corpus C
is then
TF-IDF(w, d, C) = TF(w, d) × IDF(w, C)
We can illustrate this with a short function that computes the TF-IDF score of every word in a
small corpus; the highest-weighted words can then be used to weight terms in further analysis.
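Here is a minimal sketch in R. It uses the IDF form given above and a shifted log-normalized TF (ln f + 1, so that single occurrences are not zeroed out); the three-document corpus is made up purely for illustration:

docs <- list(
  c("stocks", "rally", "on", "earnings"),
  c("earnings", "miss", "sends", "stocks", "lower"),
  c("central", "bank", "holds", "rates")
)

tf_idf <- function(w, d, docs) {
  f   <- sum(docs[[d]] == w)                      # term count in document d
  if (f == 0) return(0)
  tf  <- log(f) + 1                               # shifted log normalization
  df  <- sum(sapply(docs, function(x) w %in% x))  # documents containing w
  idf <- log(length(docs) / df)
  tf * idf
}

tf_idf("stocks", 1, docs)    # appears in two of three documents
tf_idf("central", 3, docs)   # rare word, higher weight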
Word Clouds
A word cloud can be drawn from the term frequencies of a document, giving a quick visual picture of
which words dominate.
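A minimal sketch, assuming the wordcloud package is installed; the word vector is illustrative:

library(wordcloud)

# Term frequencies from a small set of cleaned words, then a simple cloud.
words <- c("earnings", "stocks", "stocks", "rally", "earnings", "rates", "bank", "earnings")
freqs <- table(words)
wordcloud(names(freqs), as.numeric(freqs), min.freq = 1, colors = "darkblue")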
Text Classification
Bayes Classifier
This is the most widely used classifier today. A Bayes classifier takes a piece of text and assigns
it to one of a pre-determined set of categories. The classifier is first trained on a
pre-classified initial corpus before it is applied to new text; it is this training data that
produces the prior probabilities needed for the Bayesian analysis. Next, we apply the classifier to
out-of-sample text to obtain the posterior probability of each textual category, and the text is
assigned to the category with the highest posterior probability.
To see how this works, we can use the e1071 R package, which contains a naive Bayes function,
together with the iris data set that ships with R and contains measurements of flowers from three
species. We train the classifier on the flower data and ask it to identify the species of each
flower; we then call the prediction function either to predict a single observation or to generate
a confusion matrix.
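A minimal sketch of this workflow, assuming e1071 is installed:

library(e1071)
data(iris)

# Train a naive Bayes classifier on the four flower measurements.
nb <- naiveBayes(Species ~ ., data = iris)

# Predict a single flower, then the whole sample.
predict(nb, iris[1, -5])
pred <- predict(nb, iris[, -5])

# Confusion matrix: predicted species versus actual species.
table(pred, iris$Species)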
The fitted model reports, for each class, the mean and standard deviation of every attribute. The
basic Bayes calculation then takes the following pattern:
Pr[F = i | a, b, c, d] ∝ Pr[a | F = i] × Pr[b | F = i] × Pr[c | F = i] × Pr[d | F = i] × Pr[F = i]
Here F stands for the type of flower, while a, b, c, and d stand for the four attributes of the
flower. Note that we do not compute the denominator because it is the same for
Pr[F = 1 | a, b, c, d], Pr[F = 2 | a, b, c, d], and Pr[F = 3 | a, b, c, d].
Support Vector Machines (SVM)
This is another kind of classifier. It is similar in spirit to cluster analysis but is applicable
to very high-dimensional spaces. SVM is best described by viewing every text message as a vector in
a high-dimensional space, where the number of dimensions can be taken to be the number of words in
the dictionary. As a very simple example, we can reuse the flower data set from the naive Bayes
illustration.
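A minimal sketch, again assuming the e1071 package, which also provides an SVM implementation:

library(e1071)
data(iris)

# Fit an SVM classifier on the same flower data.
sv <- svm(Species ~ ., data = iris)

# Compare predicted and actual species.
table(predict(sv, iris[, -5]), iris$Species)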
A search engine essentially indexes pages by representing the text of each page as a word vector.
When a search query is presented, the vector distance cos(θ) ∈ (0,1) is computed between the query
and all indexed pages to find the pages for which the angle is smallest, i.e., where cos(θ) is
greatest. The best-match ordered list is produced by sorting all indexed pages by their angle with
the search query.
In news analytics, when the vector-distance classifier is used, the classification algorithm takes
the new text sample and finds the best match by computing its angle with all the text pages in the
indexed training corpus; the tags of the best-matching pages are then assigned to the new text. To
implement the classifier, all that is required are linear algebra functions and sorting routines,
which are readily available in virtually all programming environments.
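A minimal sketch of the underlying cosine calculation in base R; the three-term vocabulary and the tiny index are made up for illustration:

# Term-count vectors for two indexed pages and a query (vocabulary of 3 words).
index <- rbind(page1 = c(2, 0, 1),
               page2 = c(0, 3, 1))
query <- c(1, 0, 1)

# Cosine of the angle between the query and each indexed page.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
scores <- apply(index, 1, cosine, b = query)

# Best matches first: largest cos(theta) = smallest angle.
sort(scores, decreasing = TRUE)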
Discriminant-Based Classifier
All the classifiers examined so far either do not weight words at all, as with the SVM and Bayes
classifiers, or weight some words while ignoring others, as with word-count classifiers. A
discriminant-based classifier weights words according to their discriminant value. The most popular
tool for this purpose is Fisher's discriminant.
In our notation, let µi be the mean value of a term for category i, that is, the average number of
times word w appears in a text message of category i; text messages are indexed by j, and mij
denotes the number of times word w occurs in message j of category i. Fisher's discriminant for
word w can then be written as the ratio of the across-class variation to the within-class
variation:
F(w) = [ average over i of (µi − µ)² ] / [ average over i of the within-category variance of mij ]
where µ is the mean occurrence of w across all categories.
Consider the case examined earlier in this study, in which economic commentary is grouped into an
optimistic and a pessimistic class. Suppose the word "dismal" appears exactly once in every message
of the pessimistic class and never in the optimistic class. The across-class variation of the word
is then positive, while its within-class variation is zero, so the denominator of the discriminant
is zero. We would conclude that "dismal" is an infinitely powerful discriminant and should be given
a large weight in any word-count algorithm.
Metrics
Analytics developed without metrics are incomplete. When developing analytics, it is important to
create measures that examine whether the analytics are generating classifications that are
economically useful, statistically valid, and stable. There are criteria every analytic must meet
to be statistically useful; these criteria ensure classification power and accuracy. When an
analytic is both economically useful and statistically valid, the quality of the analytic
increases. Stability ensures that an analytic performs as well out-of-sample as it does in-sample.
Confusion Matrix
This is a classic tool for assessing classification accuracy. For n categories, the confusion
matrix is of dimension n × n. The columns stand for the correct category of the text, while the
rows represent the category assigned by the analytic algorithm. Cell (i, j) contains the number of
text messages that are actually of type j but were classified as type i. The cells on the diagonal
count the cases the algorithm classified correctly; every other cell counts a classification error.
If an algorithm has no classification ability, the rows and columns of the confusion matrix are
independent of each other. The statistic used to test for rejection of this null of no
classification ability is
χ² = Σi Σj [A(i, j) − E(i, j)]² / E(i, j)
where A(i, j) is the number observed in cell (i, j) of the confusion matrix and E(i, j) is the
number expected under the null of no classification ability. If T(j) stands for the total down
column j, T(i) for the total across row i, and T for the grand total, then
E(i, j) = T(i) × T(j) / T
The degrees of freedom of the χ² statistic are (n − 1)². This statistic is very easy to calculate
and can be used for any n.
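A minimal sketch of this test in R, using a made-up 2 × 2 confusion matrix:

# Rows: predicted category; columns: actual category (illustrative counts).
A <- matrix(c(10, 1,
               2, 7), nrow = 2, byrow = TRUE)

# Expected counts under the null of no classification ability.
E <- outer(rowSums(A), colSums(A)) / sum(A)

# Chi-squared statistic and its degrees of freedom (n - 1)^2.
chi2 <- sum((A - E)^2 / E)
dof  <- (nrow(A) - 1)^2
c(chi2 = chi2, dof = dof, p.value = 1 - pchisq(chi2, dof))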
Precision and Recall
Two measures emerge from the confusion matrix: precision and recall. Precision, also known as the
positive predictive value, is the fraction of identified positives that really are positives; it
measures the validity of the predictions. Suppose, for instance, we want to find the people on
LinkedIn who are looking for a job: if our algorithm flags n such people but only m of them are
actually looking, our precision is m/n.
Recall, on the other hand, is also known as sensitivity: it is the fraction of the true positives
that are actually identified, and it measures the completeness of the predictions. Continuing the
LinkedIn example, if M people are actually looking for a job and our algorithm correctly identifies
m of them, recall is m/M. For instance, suppose the confusion matrix contains 10 true positives, 2
false positives, and 1 false negative. Precision is then 10/12, while recall is 10/11. Precision is
related to the probability of false positives (Type I error): that probability is one minus
precision. Recall is related to the probability of false negatives (Type II error): that
probability is one minus recall.
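A tiny sketch of these calculations in R, using the counts from the example above:

tp <- 10; fp <- 2; fn <- 1

precision <- tp / (tp + fp)   # 10/12
recall    <- tp / (tp + fn)   # 10/11
c(precision = precision, recall = recall)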
Accuracy
The accuracy of an algorithm over a classification scheme is simply the percentage of text that is
correctly classified; it can be measured both in-sample and out-of-sample. Off the confusion
matrix, it is computed as
Accuracy = Σi A(i, i) / Σi Σj A(i, j)
that is, the sum of the diagonal cells divided by the total number of classifications.
False Positives
It is better to fail to classify than to classify improperly. In a 2 × 2 scheme, i.e., a
two-category setting with n = 2, every off-diagonal cell of the confusion matrix is a false
positive. When n > 2, some classification errors are worse than others.
The percentage of false positives is an important metric to track: it is calculated by dividing the
simple or weighted count of off-diagonal classifications by the total number of classifications
undertaken.
Sentiment Error
An aggregate measure of sentiment may be computed once many texts or articles have been classified;
aggregation is useful because individual classification errors tend to cancel. Sentiment error is
the percentage difference between the computed aggregate sentiment and the value we would have
obtained had there been no classification error.
Correlation
Having examined some of the vital aspects of news analytics, the question that comes to mind is:
how does the sentiment extracted from news correlate with financial time series? Leinweber and Sisk
address this question in a paper published in 2010. They document crucial differences in cumulative
excess returns between strong positive-sentiment and strong negative-sentiment days over prediction
horizons of a week or a quarter. The events studied are therefore focused on point-in-time
correlation triggers. The simplest correlation metric is visual: plotting the sentiment series
against the return series and observing how they track each other.
Phase-Lag Metrics
A special case of lead-lag analysis is correlation between the sentiment and return time series
across different lags. In simple terms, a graphical lead-lag analysis looks for patterns across the
two series and asks whether the pattern in one series can be used to predict the other; in other
words, can the sentiment data generated by the algorithms be used to predict the stock return
series? This type of graphical examination is called phase-lag analysis.
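A minimal sketch of such a lead-lag check in base R, using simulated sentiment and return series in place of real data:

set.seed(1)
sentiment <- arima.sim(list(ar = 0.5), n = 250)
returns   <- 0.3 * c(rep(0, 2), head(sentiment, -2)) + rnorm(250, sd = 0.5)

# Cross-correlation function: spikes at negative lags indicate sentiment leading returns.
ccf(as.numeric(sentiment), returns, lag.max = 10, main = "Sentiment vs. returns")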
Economic Significance
We can also evaluate news analytics using economic significance as a yardstick. Here we ask: do the
algorithms help reduce risk or deliver profitable opportunities, or do they not? This kind of
evaluation helps us identify a set of stocks that performs significantly better than the rest.
There is a substantial body of research on economic metrics for news analytics; Leinweber and Sisk
(2010), for example, argue that there is exploitable alpha in news streams. Risk management and
credit analysis are further areas in which economic analysis can be used to validate news
analytics.
Text Summarization
Text can be summarized easily using simple statistics. The simplest text summarizer works on a
sentence-based model that sorts the sentences of a document in descending order of how much they
overlap with the rest of the document; the sentences with the greatest overlap come first. For
instance, suppose an article D has sentences si, i = 1, 2, ..., m, where each si is treated as a
set of words. To summarize the text, we use the Jaccard similarity index to compute the pairwise
overlap between sentences: the overlap of two sentences si and sj is the size of the intersection
of the two word sets divided by the size of their union. The similarity score of each sentence is
then computed as the row sum of the Jaccard similarity matrix. After obtaining the row sums, we
sort them; the summary consists of the first n sentences by this value.
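A minimal sketch of this summarizer in base R; the three-sentence "document" is made up for illustration:

sentences <- c("the market rallied on strong earnings",
               "strong earnings lifted the market today",
               "the committee left interest rates unchanged")

# Each sentence becomes a set of words.
sets <- lapply(sentences, function(s) unique(strsplit(tolower(s), "\\s+")[[1]]))

# Pairwise Jaccard similarity: |intersection| / |union|.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
m <- length(sets)
J <- outer(1:m, 1:m, Vectorize(function(i, j) jaccard(sets[[i]], sets[[j]])))

# Rank sentences by row sums and keep the top n as the summary.
n <- 1
sentences[order(rowSums(J), decreasing = TRUE)[1:n]]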
Conclusion
We have explained in detail what news analytics is and how it is carried out. We examined the vital
features of news analytics and the different models that can be used to carry out the analysis. We
also examined how errors can be avoided or kept to a minimum, and the important role of word counts
was explained in detail. In the next chapter, we look at one of the important models in data
science.
Chapter Eight: Bass Model
This chapter explains in detail all there is to know about the Bass Model. The chapter covers the
following outline:
The Bass Model
Calibration
Sales Peak
The Bass Model
The Bass Model is one of the classic models in the marketing literature. It was introduced by Frank
Bass in 1969 and has become one of the best models for predicting the market share of newly
introduced, and even mature, products. The model's main premise is that the adoption rate of a
product is driven by two basic forces:
the propensity of customers to adopt the product independently of social influences
the additional propensity to adopt the product because other customers have already adopted it.
This is why, for a very good product, the influence of the early adopters eventually becomes strong
enough to stir others to adopt the product. Today this is usually viewed as a network effect;
however, Frank Bass had already worked out the influence of early adopters on a good product before
the advent of the network-effect literature. That is to say, product adoption driven by the
influence of early adopters is not necessarily a network phenomenon.
The Bass model explains in detail how the first few periods of a product's sales can be used to
forecast the product's future sales. Although it appears to be a marketing model, it can also be
used to value a start-up business by projecting the cash flows of the business.
Let F(t) be the cumulative fraction of the market that has adopted the product by time t, and
f(t) = dF/dt its density. The Bass model defines the adoption rate as
f(t) / [1 − F(t)] = p + q F(t)
Here p can be interpreted as the independent rate at which a consumer adopts the product, while q
is the rate of imitation: it modulates the impact on a consumer of the cumulative intensity of
adoption F(t). Once p and q have been estimated for a product, we can use them to forecast the
product's adoption.
Software
Free software can be used to solve this ordinary differential equation. Among the most popular
open-source packages is Maxima, available for download in many places
(http://maxima.sourceforge.net) and distributed under the GNU Public License; its bug_report()
function provides bug-reporting information. In Maxima, the term 1/(1 − F) in the Bass equation is
handled by a partial-fraction expansion, which reduces the problem to simple integrals. Solving the
equation this way yields the well-known closed form
F(t) = [1 − e^(−(p+q)t)] / [1 + (q/p) e^(−(p+q)t)]
Another very simple tool for small-scale symbolic calculation is WolframAlpha, available at
www.wolframalpha.com.
Calibration
How do we find the coefficients p and q of our Bass model? Since we already have the product's
sales history, it can be fit to the adoption curve. Sales in any period are s(t) = m f(t), and
cumulative sales up to time t are S(t) = m F(t), where m is the total market size. Substituting
f(t) = [p + q F(t)][1 − F(t)] and F(t) = S(t)/m into these expressions gives
s(t) = p m + (q − p) S(t) − (q/m) S(t)²
so per-period sales are a quadratic function of cumulative sales, and the coefficients of that
quadratic recover p, q, and m.
We will use this equation in an example to make it concrete. Let us examine the ongoing sales of
the iPhone: first we read the quarterly sales, already stored in a file, and then carry out the
Bass model analysis with R code.
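A minimal sketch of this calibration in R; the file name and column name are hypothetical, and the quadratic regression follows the equation above:

# Hypothetical file: one column 'sales' of quarterly iPhone unit sales.
isales <- read.csv("iphone_sales.csv")
s <- isales$sales
S <- cumsum(s) - s          # cumulative sales up to the start of each quarter

# Quadratic regression: s(t) = b0 + b1*S(t) + b2*S(t)^2.
fit <- lm(s ~ S + I(S^2))
b <- unname(coef(fit))

# Recover m from b2*m^2 + b1*m + b0 = 0, then p and q.
m <- max(Re(polyroot(c(b[1], b[2], b[3]))))
p <- b[1] / m
q <- -b[3] * m
t_star <- log(q / p) / (p + q)   # time of peak sales (see the Sales Peak section)
c(m = m, p = p, q = q, t_star = t_star)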
Now we fit the model and plot the actual sales overlaid on the forecast.
Sales Peak
From our calibration, computing the time of peak sales is easy: all we need to do is differentiate
f(t) with respect to t and set the result equal to zero. This is
t* = argmax_t f(t)
which is the same as solving f′(t) = 0. The calculation is very simple and yields the formula
t* = ln(q/p) / (p + q)
For our iPhone sales, the computation of the sales peak shows the peak arriving in about half a
year; the number of quarters that pass before the sales peak is 31.
Conclusion
In this chapter, we carried out an extensive explanation of the Bass Model. We also explained how
to use the Bass Model to forecast future sales and to calculate the sales peak of a business. In
the next chapter, we examine how dimensions are extracted in data science.
Chapter Nine: Extracting Dimensions: Discriminant and Factor
Analysis
This chapter covers the analysis of large data sets using two common approaches: discriminant
analysis and factor analysis. These two approaches help us understand the most important structural
components of any big data set. In discriminant analysis, for example, we develop models that group
a population into two broad components: male versus female, immigrant versus indigene, and so on.
With factor analysis, we are able to distill large population data into a small number of
explanatory factors. Here are the outlines that would be covered in this chapter:
Discriminant Analysis
Notation and Assumption
Discriminant Function
Eigensystem
Factor Analysis
Difference between discriminant analysis and factor analysis
Factor Rotation
Discriminant Analysis
Discriminant analysis (DA) attempts to explain categorical data by creating a dichotomous split of
observations. For instance, suppose we want to split our business data into two categories, one for
bad creditors and the other for good creditors. In DA, the bad/good creditor variable is referred
to as the dependent or criterion variable, and the variables we use to explain the split in the
criterion variable are referred to as explanatory or predictor variables. We can think of the
criterion variable as the left-hand-side variable and the explanatory variables as the
right-hand-side variables.
The significant property of DA is that the left-hand-side variable is qualitative: aside from any
numerical coding, it is categorical in nature. A good example of how DA works is the admission
process of universities and other tertiary institutions. Every university has a specific cut-off
mark for each department a student might apply to; the cut-off mark separates the students who will
be admitted from those who will not, and it can be determined with the aid of DA. In simple terms,
DA is a tool in which quantitative explanatory variables are used to explain a qualitative
criterion variable. DA is not restricted to two categories; it works with two or more.
Notation and Assumption
Let's assume that there are N groups or categories, indexed by i = 1...N.
In each group i there are observations yj, indexed by j = 1...Mi; the groups do not need to have
the same size.
We have a set of predictor or explanatory variables x = [x1, x2, ..., xK], chosen so that they help
explain which group an observation y belongs to. The value of the kth variable for group i,
observation j, is denoted xijk.
Groups are mutually exclusive: each observation belongs to exactly one group.
Cov(xi) = Cov(xj); that is, the explanatory variables of all groups have the same K×K covariance
matrix.
Discriminant Function
The main focus of DA is to find a discriminant function that best separates one group from the
other. The most common approach is linear DA, although the function may also be nonlinear. The
linear discriminant function takes the form
D(x) = a1 x1 + a2 x2 + ... + aK xK
with the coefficients chosen to maximize the separation between the groups. In the NCAA example
used here, the first 32 teams form category 1 (y = 1) and the last 32 form category 2 (y = 0). In
the output of the discriminant analysis we observe that observations 5 and 64 have been wrongly
classified. To assess the fit, we compute the χ² statistic for the confusion matrix of predicted
versus actual categories.
The resulting matrix shows some classification ability. But what happens when the model has no
classification ability at all? In that case there is no relationship between the rows and columns
of the matrix, and the expected cell counts are determined purely by the row and column totals.
Since each row and column total in our example is 32 (out of 64 teams), every expected cell under
no classification ability would be 32 × 32/64 = 16. The test statistic is the total of squared,
normalized differences between observed and expected cells, i.e.,
χ² = Σi Σj [A(i, j) − E(i, j)]² / E(i, j)
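A minimal sketch of this kind of discriminant analysis in R, assuming the MASS package and a hypothetical data frame of team statistics (the data below are simulated for illustration):

library(MASS)

# Hypothetical data: 64 teams, category 1 for the first 32, 0 for the rest.
set.seed(42)
ncaa <- data.frame(y      = factor(rep(c(1, 0), each = 32)),
                   score  = c(rnorm(32, 80, 5), rnorm(32, 72, 5)),
                   rebnds = c(rnorm(32, 38, 4), rnorm(32, 34, 4)))

# Linear discriminant analysis and its confusion matrix.
fit  <- lda(y ~ score + rebnds, data = ncaa)
pred <- predict(fit)$class
conf <- table(predicted = pred, actual = ncaa$y)
conf

# Chi-squared test of the confusion matrix against the null of no ability.
chisq.test(conf)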
Splitting into Multiple Groups
If we want to split our NCAA teams into more than two groups, for instance into four, we simply
define a four-level criterion variable and run the same commands.
Eigen Systems
Here we explore some properties of matrices that help with data classification. To get started, we
first download Treasury interest-rate data from the FRED website,
http://research.stlouisfed.org/fred2/, saved in a file named tryrates.txt, and then simply read the
file.
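A minimal sketch of reading the rates and extracting an eigensystem in R; the column layout of tryrates.txt (a date column followed by rate columns) is assumed, so treat it as a placeholder:

# Assumed layout: a date column followed by constant-maturity rate columns.
rates <- read.table("tryrates.txt", header = TRUE)

# Drop the date column and compute the covariance matrix of the rates.
X  <- as.matrix(rates[, -1])
cv <- cov(X)

# Eigenvalues and eigenvectors: the largest eigenvalues identify the
# dominant components of interest-rate movements.
es <- eigen(cv)
es$values
head(es$vectors[, 1:3])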
Types of Auctions
The main types of auction include:
English (E): the highest bidder wins. This is an open auction, so called because the progression of
bids is revealed to the participants; prices ascend over the course of the auction.
Dutch (D): this is also an open auction, but product prices descend. The auctioneer starts from the
highest price and moves downward, and the winner is the first bidder to accept a price.
1st-price sealed-bid auction (1P): bids are sealed and not revealed; the winner is the highest
bidder.
2nd-price sealed-bid auction (2P): this is very similar to 1P, except that the highest bidder still
wins but pays the second-highest bid.
Anglo-Dutch (AD): this type of auction starts as an open auction but switches to sealed bids when
only two bidders remain.
How To Determine The Value Of An Auction
The two most important aspects of an auction are value and price. The value of the product being
auctioned depends on the nature of the product. Here are two models of value:
Independent private values model: each bidder has his or her own private valuation of the product.
This is common in art auctions.
Common-values model: here, the bidders aim to discover a common price of the product being
auctioned, because there is usually an after-market in which the common value is traded. A good
example of this auction model is Treasury securities.
Bidder Types
The types of bidders, and the assumptions bidders make about each other, determine the revenue
generated by the auction. There are two major types of bidders:
Symmetric: bidders share the same probability distribution of valuations and of stop-out (SP)
prices, where the stop-out price is the price of the lowest winning bid for the last unit sold.
This assumption is reasonable when competition is high.
Non-symmetric or asymmetric: bidders have different value distributions. This usually occurs when
the market is segmented; a good example is bidding by firms in an M&A deal.
Benchmark Model (BM)
The benchmark model is the simplest model for analyzing auctions. It is based on four major
assumptions:
Risk-neutrality of bidders: no utility function is needed in the analysis of the auction.
Private-values model: every bidder has his or her own reserved value for the product, i.e., there
is a distribution of bidders' private values.
Symmetric bidders: all bidders share the same distribution of product values, as explained under
bidder types.
Winners' payment is based on bids alone.
Properties and Results of Benchmark Model
D = 1P; that is, the Dutch and 1st-price auctions are equivalent for bidders, because in each the
bidder must choose how high to bid without knowledge of the other bids.
In the benchmark model, the central result is to bid according to how valuable the product truly is
to you. This is obvious in D and 1P, because neither mechanism lets a bidder see any lower bids:
the bidder bids his or her value and waits to see whether the bid wins. In a mechanism like 2P,
bidding too high risks overpaying and bidding too low risks losing, so the best strategy is again
to bid according to the product's value to you. In the English (E) mechanism, it is advisable to
keep bidding until the price crosses your valuation.
Equilibrium types:
Dominant: bidders bid their true valuation, not minding what other bidders are bidding; this is
satisfied by E and 2P.
Nash: bids are chosen according to the best guess of other bidders' bids; this is satisfied by D
and 1P.
Auction Math
Now we move from the theoretical description of auctions to some auction equilibrium mathematics.
Let F be the probability distribution of valuations and vi the true value of the ith bidder on a
(0, 1) continuum, with bidders ranked in order of their true valuations vi. How, then, is F(vi)
specified? Suppose, for instance, that valuations are drawn from a beta distribution F on
v ∈ (0, 1), so that very low and very high values are less probable than values around the mean of
the distribution. The expected difference between the first- and second-highest valuations v1 and
v2 is
D = [1 − F(v2)](v1 − v2)
This is the difference between the first and second valuations multiplied by [1 − F(v2)], the
probability that a valuation exceeds v2, i.e., that v2 really is the second-highest bidder. From
the seller's point of view, the first-order condition sets the derivative of this expression to
zero. Given that bidders are symmetric in the BM, v1 ≡d v2, where ≡d means equal in distribution.
Since the expected revenue equals the expected second price, the first-order condition can be
rearranged to give an equation for the expected second price.
Optimization By Bidders
The main aim of any bidder i is to find the bidding rule B, a function of the private value vi,
such that
bi = B(vi)
where bi is the actual bid; when there are n bidders, there are n such rules. The goal of each
bidder is to maximize his or her expected profit relative to the true valuation of the product.
We now invoke the notion of bidder symmetry. The first step is to optimize by setting
∂πi/∂bi = 0, obtained by differentiating the bidder's expected profit with respect to the bid.
Because ∂πi/∂bi = 0, the total derivative of profit with respect to personal valuation reduces to
its direct partial derivative. Taking vi down to the lowest possible valuation, integrating the two
preceding equations, and then equating the resulting expressions for expected profit, we find that
the optimal bid is shaded slightly below the personal valuation: we bid less than the true value of
the product, which leaves room for profit. The bid rises toward the personal value as the personal
value and the number of bidders increase.
Treasury Auctions
Our previous discussion covered single-unit auctions. In this section we move to one of the most
popular multi-unit auctions: Treasury auctions. Treasury auctions are the mechanism employed by the
government and similar bodies to issue bills, bonds, and notes. An auction is usually performed on
a Wednesday: bids are received up until the afternoon of the auction day, after which the
quantities requested are allocated to the top bidders until the supply of securities is exhausted.
Before the auction, securities trade in what is called the pre-market or when-issued market; it is
in this market that bidders get an indication of prices, which tends to result in a tighter
clustering of bids in the auction.
Treasury auctions involve two broad types of dealers: small independent dealers and primary
dealers, the latter comprising investment houses, big banks, and so on. Most of the auction is
played among the primary dealers, who place competitive bids, while the small independent dealers
submit non-competitive bids. The value placed on the securities being auctioned is based on
information about their secondary market, which opens immediately after the auction: the profit a
bidder expects to make in the secondary market influences the price bid at the auction, and the
likely secondary-market price is usually gleaned from the when-issued market.
The winner in a Treasury auction often leaves with more regret than pleasure, aware of having bid
more than was necessary; this phenomenon is referred to as the "winner's curse." The federal
government and the other participants in the Treasury market try to mitigate the winner's curse
before the auction takes place, because a bidder with less fear of regret will bid at a higher
price.
UPA or DPA
UPA stands for "uniform price auction," while DPA stands for "discriminating price auction." DPA
has long been the preferred format for Treasury securities, with UPA introduced only recently. In a
DPA, the highest bidder receives his or her bid quantity at the price bid; the next-highest bidder
receives the same treatment, and so on until the last item is allocated. Each winning bid is
therefore filled at its own price, which is why the format is known as discriminating in price.
In a UPA, by contrast, the highest bidder receives his or her bid quantity at the stop-out price,
i.e., the price of the last winning bid, and the next-highest bidder likewise, until the supply of
securities is exhausted; UPA is thus a single-price auction.
Although DPA might appear to yield more revenue, UPA has proven more successful, because the
winner's curse is mitigated in UPA: all winners pay the same price, unlike in a DPA, where winning
means paying more than other bidders.
Mechanism Design
To achieve a good auction mechanism, consider the following:
The starting price of the item to be auctioned off.
Is collusion contained to the barest minimum?
Is there truthful value revelation? This is also referred to as "truthful bidding."
Is the mechanism efficient, that is, does it maximize the joint utility of auctioneer and bidders?
Is it too expensive to play?
Is it fair to both sides, whether big or small, high or low?
Clicks (Advertising Auctions)
One popular program that allows easy creation of advertisements appearing on prominent pages, such
as the Google search results page and related sites, is the Google AdWords program. Google AdWords
is distinct from Google AdSense, which is the program that delivers AdWords advertisements to other
sites. Depending on the type of ad displayed, Google pays web publishers based on the number of
clicks and the number of impressions the ad gathers.
In this section we explain some basic features of a search-engine advertising model, following the
research paper by Aggarwal, Goel, and Motwani (2006). Search-engine advertising is priced in three
main ways: cost per click (CPC), cost per thousand impressions (CPM), and cost per acquisition
(CPA); among these, CPC is the most widely used. Under CPC there are two ranking models:
a) Revenue ranking (the Google model)
b) Direct ranking (the Overture model)
In these next-price arrangements the merchant pays the price of the click ranked just below his or
her own, which is not the same as a second-price auction; and, as we will see in the example, the
statement has to be qualified further under revenue ranking.
Asymmetric: there is no incentive to overbid, only to underbid.
Iterative: a bidder places many bids and watches the responses, in order to uncover the ordering of
the other bidders' bids. This is not as simple as it sounds; in fact, Google provides the Google
Bid Simulator so that sellers using AdWords can figure out their optimal bids.
If revenue ranking is truthful, the joint utility of auctioneer and merchants is maximized; this is
known as auction efficiency.
Innovation: the laddered auction. Randomized weights are attached to the bids. If the weights are
all equal to 1, the ranking reduces to direct ranking; if the weights are the click-through rates
(CTR), the ranking is revenue-based, i.e., revenue ranking.
The following steps can be used by a merchant to figure out the profit-maximizing cost per click
(CPC) of an ad.
Maximum profitable CPA: this is simply the profit margin on the product. For instance, if the cost
price of the product is $200 and the selling price is $300, the profit margin is $100; this is also
the maximum amount a seller should pay per acquisition.
Conversion rate (CR): the rate at which clicks turn into sales, calculated by dividing the number
of sales by the number of clicks. If every 100 clicks produce 5 sales, the CR is 5%.
Value per click (VPC): this is simply the CR multiplied by the CPA. Using our example, VPC =
0.05 × 100 = $5.
Determine the profit-maximizing CPC bid. As the bid is reduced, the number of clicks falls, and
with it the CPC spend and the revenue; but profit need not fall, because profit per acquisition can
rise. The Google Bid Simulator can be used to find the number of clicks expected at each bid level.
Note also that the price you bid is not the price you pay per click: under revenue ranking, i.e., a
next-price auction, the Google model determines the actual price paid. The equation for profit is
(VPC − CPC) × #Clicks = (CPA × CR − CPC) × #Clicks
Therefore, for a $4 bid that is expected to deliver 154 clicks at a total cost of $407.02, the
profit is
(5 − 407.02/154) × 154 = $362.98
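A tiny sketch of this arithmetic in R, using the figures quoted above:

cpa_margin <- 100      # maximum profitable CPA ($)
cr         <- 0.05     # conversion rate
vpc        <- cpa_margin * cr   # value per click = $5

clicks     <- 154      # clicks expected at the chosen bid
total_cost <- 407.02   # total cost of those clicks

profit <- (vpc - total_cost / clicks) * clicks
profit   # about 362.98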
Next-Price Auction
The CPC in a next-price auction is based on the price of the bid immediately below one's own: if
the winning bid lands in position j on the search screen, the price paid is that of the winning bid
at position j + 1.
Laddered Auction
The main idea of the laddered auction is to set the CPC of each position so that the left-hand side
of the pricing equation equals the expected revenue to Google per ad impression. The model aims to
maximize revenue for Google while keeping the auction system simple and effective for merchants; if
the laddered auction produces a truthful equilibrium, that is a good outcome for Google. Note that
the weights wi are arbitrary and are not disclosed to the merchants.
Conclusion
From our discussion so far, it is obvious that auctions are still very much in vogue and have not
been sidelined. It also takes some mastery and skill to perform effectively in any auction. Data
science is indeed an all-encompassing domain. The next chapter will examine limited dependent
variables.
Chapter 11: Limited Dependent Variables
This chapter examines the different approaches to creating and working with limited dependent
variables. The chapter covers the following outline:
Limited Dependent Variables
Logit
Probit
Slopes
Limited Dependent Variables
Dependent variables are limited when they are discrete, binomial, or multinomial, whereas ordinary
regression assumes a continuous dependent (y) variable, as when we regress income on education. We
therefore need a different approach for these types of variables.
A particular type of limited dependent variable is the discrete dependent variable. Models that use
such variables, like the logit and probit models, are often referred to as qualitative response
(QR) models.
A discrete dependent variable often occurs as a binary variable taking values in {0, 1}. When we
regress it, we get a probability model. If we simply regress the left-hand side of ones and zeros
on a suite of right-hand-side variables, this can be fit as a linear regression; but if we then
take another observation with right-hand-side values x = {x1, x2, ..., xk} and use the fitted
coefficients to compute y, the value will not be 0 or 1 except by unusual coincidence.
With limited dependent variables, we also want to explain the reasons behind the allocation of
observations to categories. There is a close relationship between limited dependent variable models
and classifier models: classifier models focus on allocating observations to categories, and in the
same vein, limited dependent variable models focus on explaining whether a firm is syndicated or
not, whether a person is employed or not, whether a firm is solvent or not, and so on.
It is important to note that in linear regression the fitted values will often not even lie between
0 and 1. This means we should choose a nonlinear specification that restricts the fitted value of y
to the interval (0, 1); we then obtain a model that can be interpreted as a probability. To achieve
this, we use either of the two models mentioned above, the logit or the probit.
Logit
The logit model, also known as logistic regression, takes the form
y = e^f(x) / (1 + e^f(x)),   where f(x) = β0 + β1x1 + ... + βkxk
Our focus is to fit the coefficients {β0, β1, ..., βk}. Note that f(x) ∈ (−∞, +∞) while
y ∈ (0, 1): when f(x) → −∞, y → 0, and when f(x) → +∞, y → 1. The model can be rewritten as
ln[y / (1 − y)] = f(x)
so that the log-odds are linear in the explanatory variables.
We can then evaluate the fitted function at the means of the regressors for each model. For the
logit model, differentiating the log-likelihood with respect to the coefficients and setting the
derivatives to zero gives a system of equations that can be solved for β; "likelihood equations" is
the collective name for this system of first-order conditions. To get the t-statistic for a
coefficient, we simply divide its value by its standard deviation. The standard deviation comes
from asking how the coefficient set β changes when the log-likelihood changes; our interest is in
∂β/∂ ln L, whose reciprocal has already been computed above. Next, we define the gradient
g = ∂ ln L / ∂β
and then the matrix of second derivatives, also known as the Hessian matrix,
H = ∂² ln L / ∂β ∂β′
Then we set the covariance matrix of the estimated coefficients to the negative inverse of the
Hessian, whose diagonal supplies the squared standard errors used in the t-statistics.
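In practice the fitting is done by a canned routine; a minimal sketch using base R's glm with simulated data:

set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))   # data generated from a logit model

# Logistic regression: coefficients, standard errors, and z statistics.
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(fit)$coefficients

# Fitted probability at the means of the regressors.
predict(fit, newdata = data.frame(x1 = mean(x1), x2 = mean(x2)), type = "response")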
Limited Dependent Variables in VC Syndication
It is indisputable that not all venture-backed firms end up making a successful exit, whether
through a buyout, an IPO, or another exit route. Here we measure the probability of a firm making a
successful exit by examining a large sample of firms; a successful exit is designated S = 1 and an
unsuccessful exit S = 0. We fit a probit model to the data using a matrix X of explanatory
variables, defining S in terms of a latent threshold variable S* such that S = 1 when S* > 0 and
S = 0 otherwise. The fitted model provides the probability of exit, i.e., E(S), for all financing
rounds. The vector of coefficients γ in the probit model is fitted by standard likelihood methods,
and Φ(·) denotes the cumulative normal distribution.
Endogeneity
Suppose we want to look at the impact of syndication on venture success. Success in a syndicated
venture is the product of two broad aspects of VC expertise. First, VCs are effective at picking
good firms and good projects to invest in; this is the selection hypothesis of Lerner (1994). Since
syndication entails obtaining a second opinion from another VC, a syndicated investment provides
evidence that the project is a good one. Second, a syndicate can provide more detailed monitoring,
because it brings a wider range of skills to the venture.
A dummy variable for syndication permits a first-pass estimate of whether syndication has any
impact on performance, while a regression on firm characteristics allows returns to differ across
firms. It may be, however, that syndicated firms are simply of higher quality, irrespective of
whether they chose to syndicate: VCs tend to prefer better firms, better firms are more likely to
be syndicated, and VCs can identify such firms. In that situation the coefficient on the dummy
variable may suggest added value from syndication even when there is none. As a result, we first
correct the specification for endogeneity and then check whether the dummy variable remains
significant.
The endogeneity correction adopted here is the one suggested by Greene (2011). The model is briefly
summarized as follows, starting from the performance regression
Y = βX + δS + ε
where Y is the performance variable and S is the dummy variable that takes the value 1 if the firm
is syndicated and 0 if it is not; δ is the coefficient that captures the difference in performance
attributable to syndication. If δ is zero, either there is no difference in performance across the
two kinds of firms or the X variables are sufficient to explain the differences in performance.
However, since the same variables also determine whether or not a firm syndicates, we have an
endogeneity problem, which is resolved by adding correction terms to the model; these corrections
adjust the error term ε. When the firm is syndicated (S = 1) the conditional mean of ε takes one
adjusted form, and when the firm is not syndicated (S = 0) it takes another. The resulting
specification can then be estimated by a linear cross-sectional regression.
The estimation model thus nests both the non-syndicated equation and the cross-sectional regression
model; β is forced to remain constant across all firms without imposing any additional constraint,
so the specification retains the same OLS form. If, after the endogeneity correction, δ remains
significant, this supports the hypothesis that syndication is a source of differences in
performance. If the coefficients {δ, βm} remain significant, the expected difference in performance
for each syndicated financing round can be computed from the fitted model.
The method explained above is one way to address the treatment effect. Another effective approach
is to first use a probit model and then set m(γX) = Φ(γX); this is what is referred to as an
instrumental variables approach.
Endogeneity - Some Theories to Wrap Up
Endogeneity arises when the error term of a regression is correlated with the independent
variables. It can be highlighted in two common forms.
Omitted variables: suppose the true model contains a regressor X2 that we do not observe and that
is correlated with X1. Because X2 is absorbed into the error term, we no longer have
E(Xi · u) = 0 for all i.
Simultaneity: this is a situation in which Y and X are determined jointly; a commonly cited example
is the joint use of highways and high schools, because they go together. The structural setting
consists of simultaneous equations for Y and X, and solving them yields a reduced-form version of
the model.
Conclusion
This chapter has covered limited dependent variables in detail. In the next chapter we cover
Fourier analysis and network theory.
Chapter Twelve: Fourier Analysis And Network Theory
This chapter would cover the following outlines:
Fourier Analysis
Fourier series
Solving the coefficients
Complex Algebra
Fourier Transform
Fourier Analysis
Fourier analysis draws on numerous connections between infinite series, vector theory, complex
numbers, and geometry. Applications such as fitting economic and pricing time series, wavelets, and
risk-neutral pricing can all be carried out using Fourier analysis.
Fourier Series
Introduction:
Fourier series are used to represent periodic time series by combining sine and cosine waves. The
time one cycle of the wave takes is called the period T, while the number of cycles per second is
the frequency f of the wave. The two are related by
f = 1/T
Unit circle
We can explain this with some basic geometry. Consider a circle of radius a; if a = 1, it is the
unit circle. There is a close link between the unit circle and the sine wave: as the unit vector
rotates around the circle, its height traces out a sine wave as a function of the angle. For a
radius a, we get a sine wave with amplitude a, written as
f(θ) = a sin(θ)
Angular Velocity
Velocity is distance per unit of time in a given direction. For angular velocity, distance is
measured as the angle traversed per unit of time; angular velocity is usually denoted ω and, for a
wave of period T, is given by
ω = 2π/T = 2πf
A general periodic function can then be written as the Fourier series
f(t) = a0 + Σ (n = 1 to ∞) [ an cos(nωt) + bn sin(nωt) ]
We need the constant term a0 because the wave may not be symmetric around the x-axis.
Radian
A radian is the angle subtended at the center of a circle by an arc equal in length to the radius;
degrees can equally be expressed in units of radians. One radian is approximately 57.2958 degrees,
a little less than the 60 degrees of an equilateral triangle. Since the circumference is 2πa, we
have 57.2958 × π ≈ 180 degrees, and hence for the unit circle
2π radians = 360 degrees
Sine and cosine waves of different frequencies are orthogonal: when we multiply two such waves and
integrate the product from 0 to T, the result is zero unless the two waves have the same frequency.
This is the key to solving for the coefficients of the Fourier series. Integrating both sides of
the series from 0 to T, every term except the first vanishes, because each sine and cosine
integrates to zero over a full cycle, leaving
a0 = (1/T) ∫[0,T] f(t) dt
Multiplying the series by cos(ωt) and integrating, every term except the one in
a1 cos(ωt) cos(ωt) vanishes, because that is the only product of two waves with the same frequency;
therefore
a1 = (2/T) ∫[0,T] f(t) cos(ωt) dt
The same method solves for every an: multiply by cos(nωt) and integrate. Likewise for bn: multiply
by sin(nωt) and integrate. This gives the Fourier series coefficients
an = (2/T) ∫[0,T] f(t) cos(nωt) dt,   bn = (2/T) ∫[0,T] f(t) sin(nωt) dt,   n = 1, 2, ...
Complex Algebra
Recall Euler's formula,
e^(iθ) = cos(θ) + i sin(θ)
Also recall that cos(−θ) = cos(θ) and sin(−θ) = −sin(θ). Note that if θ = π, then
e^(iπ) + 1 = 0
an equation that contains five major mathematical constants, {i, π, e, 0, 1}, and three operators,
{+, −, =}.
From Trigs to Complex
Using the last two identities, the cosines and sines can be written in terms of complex
exponentials, cos(θ) = [e^(iθ) + e^(−iθ)]/2 and sin(θ) = [e^(iθ) − e^(−iθ)]/(2i), so the Fourier
series can be rewritten in complex form as
f(t) = Σ (n = −∞ to ∞) Cn e^(inωt)
All we have done is rename the combined coefficients Cn for clarity; the big payoff is that the
coefficient sets {a0, an, bn} have been folded into the single set Cn. For completeness,
Cn = (1/T) ∫[0,T] f(t) e^(−inωt) dt.
Fourier Transform
With this technique we can go from the Fourier series, which assumes a period T, to aperiodic
waves: we simply let the period go to infinity, so that the fundamental frequency becomes
infinitesimally small. For this analysis we write the signal as g(t) rather than f(t), because f
(or ∆f) is now needed to denote frequency. The transform pair, in one standard convention, is
G(f) = ∫ g(t) e^(−2πi f t) dt,   g(t) = ∫ G(f) e^(2πi f t) df
Here dt indicates the time domain and df the frequency domain: the Fourier transform moves from the
time domain to the frequency domain, and the inverse Fourier transform moves from the frequency
domain back to the time domain.
A closely related object is the characteristic function of a random variable x,
φ(s) = ∫ e^(isx) f(x) dx, where f(x) is the probability density of x; expanding e^(isx) in a Taylor
series links the characteristic function to the moments of x.
In a weighted graph, by contrast, the value of the edge weight w(u, v) is unrestricted and can even
be negative.
A directed graph can be either cyclic or acyclic: in a cyclic graph there is a path from some node
leading back to the node itself, which is not the case in an acyclic graph. Directed acyclic graphs
are represented by the term "dag."
Moreover, a graph can be represented by its adjacency matrix, which is simply the matrix
A = {w(u, v)} for all u, v. We can also take the transpose of this matrix; in the case of a
directed graph, transposition simply reverses the direction of all the edges.
Features of Graphs
A graph has various features, including the number of nodes and the distribution of links across
nodes. The edges (the links) and the structure of the nodes determine how connected the nodes are
and how important individual nodes are; this in turn determines the flow across the network.
A simple bifurcation of graphs suggests two types:
Random graphs
Scale-free graphs
These two types are portrayed in a Scientific American article by Barabasi and Bonabeau (2003).
A random graph can be constructed by laying down a set of n nodes and then connecting each pair of
nodes with some probability p: the higher the connection probability, the more edges the graph
contains. The distribution of the number of edges per node is then more or less Gaussian, because
there is a mean number of edges (n·p) with some spread around that mean, so the distribution of
links is bell-shaped. If d denotes the number of links of a node, the degree distribution of a
random graph is f(d) ∼ N(µ, σ²), where µ is the mean number of links and σ² the variance.
The structure of a scale-free graph is hub and spoke: most nodes have very few links, but a few
nodes have a very large number of links. The distribution of links is not bell-shaped at all;
rather, it decays more like an exponential. Although this distribution has a mean, the mean is
representative of neither the hub nodes nor the non-hub nodes. Because the mean is not
representative of the population, the distribution of links in this type of graph is called
scale-free, and the network is known as a scale-free network.
The distribution of links in a scale-free graph is often approximated by a power-law distribution,
i.e., f(d) ∼ d^(−α), where, by some curious twist of fate, nature seems to have stipulated that
2 ≤ α ≤ 3. The log-log plot of this distribution is linear.
Most networks in the world today turn out to be scale-free. The reason is explained in an article
by Barabasi and Albert (1999), who developed the theory of preferential attachment: as a network
grows and new nodes are added, each new node tends to attach itself to the existing nodes that
already have the most links. As a result, influential nodes gain even more connections, and the
network evolves into a hub-and-spoke structure.
The structure of a graph also determines some of its properties. For instance, a scale-free graph
performs excellently at transmitting information and at moving air-traffic passengers, which is why
airport networks are arranged in this way. A scale-free network is also robust to random attacks:
if, for example, a terrorist attacks an airport, the damage is usually minimal unless a hub is hit.
In the rest of this chapter, we would examine financial network risk and many more interesting
graphs.
Chapter 13: Searching Graph
In the previous chapter, we provided an introduction to graph theory in data science; in this
chapter, we explore the two broad ways of searching graphs: depth-first search (DFS) and
breadth-first search (BFS). We care about these because DFS is very good at finding communities in
social networks, while BFS works well for finding the shortest connections in networks. This
chapter covers the following outlines:
Depth-First-Search
Breadth-First-Search
Strongly Connected Components
Dijkstra's Shortest Path Algorithm
Degree Distributions
Diameter
Fragility
Centrality
Communities
Modularity
Depth-First-Search
This begins by taking a vertex and using it to produce a tree of connected vertices, recursing downward until no further progress can be made. The output of a DFS is quite simple: it is the sequence in which the nodes are first visited by the program.
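Here is a minimal recursive DFS sketch in base R; the adjacency list is hypothetical and stands in for an example graph:

# Hypothetical adjacency list: two connected groups, {1,2,3,4} and {5,6}
adj = list(c(2, 3), c(1, 4), c(1), c(2), c(6), c(5))
visited = rep(FALSE, length(adj))
visit_order = c()
dfs_visit = function(v) {
  visited[v] <<- TRUE
  visit_order <<- c(visit_order, v)                  # record the sequence of first visits
  for (u in adj[[v]]) if (!visited[u]) dfs_visit(u)  # recurse downward
}
for (v in seq_along(adj)) if (!visited[v]) dfs_visit(v)  # one tree per connected group
visit_order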
Notice that the output of a DFS is a set of trees (a forest). A tree is itself a special kind of graph and is inherently acyclic; if the original graph contains cycles, these show up as back edges that are not part of the DFS trees. The process can therefore be interpreted as partitioning the vertices into subsets of connected groups.
In applying this to business, it is necessary first to understand why these groups are different, and second, to be able to target the separate groups with different business questions and responses. Firms and business organizations that use these types of data apply algorithms to find such "communities." Within communities, BFS is then applied to determine how entities are connected and how near those connections are.
Also, these searches can be used to measure the connectedness of a network. With BFS in particular, we can determine how close entities are to each other in a network; such analyses often suggest a "small world" phenomenon, or what is colloquially referred to as "six degrees of separation."
Our next focus is to examine how these searches are implemented in the igraph package, which we will make use of throughout this chapter. The example below shows how a paired vertex list is used to create a graph and then searched with DFS.
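This is a sketch with igraph; the vertex-pair list is hypothetical:

library(igraph)
# A paired vertex (edge) list: each row is one edge
pairs = matrix(c("A","B", "A","C", "B","D", "C","E", "E","F"),
               ncol = 2, byrow = TRUE)
g = graph_from_edgelist(pairs, directed = FALSE)
dfs(g, root = "A")$order   # order in which the vertices are first visited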
In a graph like the one above, it is easy to determine the nearest neighbors of any vertex. Now, when a positive reaction is obtained from someone in the population, this helps target that person's nearest neighbors cost-effectively, simply by following the edges of connection. The BFS procedure is as follows:
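A minimal sketch with igraph, again using a hypothetical edge list:

library(igraph)
g = make_graph(edges = c("A","B", "A","C", "B","D", "C","E"), directed = FALSE)
res = bfs(g, root = "A", dist = TRUE)
res$order   # vertices in the order they are first reached
res$dist    # level of each vertex, i.e., its distance from the root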
BFS also produces a tree; the level of each vertex in the tree is determined by how close to or far from the source vertex it is.
Strongly Connected Components
A natural way to cluster the members of a network represented by a directed graph is to find the strongly connected components (SCCs) of the graph. An SCC is a subset of the vertices U ⊂ V with the property that for all pairs of its vertices (u,v) ∈ U, both vertices are reachable from each other. Below is an example of a graph broken down into its strongly connected components:
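A minimal sketch with igraph; the directed edge list is hypothetical:

library(igraph)
# Vertices 1-3 form one cycle, vertices 4-5 another; vertex 6 stands alone
g = make_graph(edges = c(1,2, 2,3, 3,1, 3,4, 4,5, 5,4, 5,6), directed = TRUE)
scc = components(g, mode = "strong")
scc$membership   # which strongly connected component each vertex belongs to
scc$csize        # size of each component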
SCCs are very useful for partitioning a graph into tight units. Not only that, they also capture local feedback: when one member of an SCC is targeted, all the members of that SCC are effectively targeted, because the stimulus moves across the SCC.
igraph has emerged as one of the most popular packages for analyzing graphs. It has versions in Python, C, and R, and it can also be used to plot graphs and to generate random graphs.
Dijkstra's Shortest Path Algorithm
This is one of the most widely used algorithms in theoretical computer science. Given a source vertex s on a weighted, directed graph, the algorithm finds the shortest path from s to all other nodes. Let w(u,v) denote the weight of the edge between two vertices. Dijkstra's algorithm works for graphs where w(u,v) ≥ 0; for negative weights, the Bellman-Ford algorithm is used instead. Below, the algorithm is applied to an example graph.
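This is a minimal sketch with igraph; the weighted, directed edge list is hypothetical:

library(igraph)
g = make_graph(edges = c(1,2, 1,3, 2,3, 2,4, 3,4), directed = TRUE)
E(g)$weight = c(4, 1, 2, 5, 1)                               # non-negative edge weights
distances(g, v = 1, mode = "out", algorithm = "dijkstra")    # shortest distances from vertex 1
shortest_paths(g, from = 1, to = 4, output = "vpath")$vpath  # one shortest path, as a vertex sequence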
Degree Distributions
The degree of a node is the number of links it has to other nodes in the network. The degree distribution is the probability distribution of these degrees across the nodes. In a directed network, there are two types of degree: the in-degree and the out-degree. In an undirected network, the degree of a node is simply the number of edges incident on it. It is important to note that the weights of the edges play no role in computing the degree distribution of the nodes, although there are times when one may want to make use of that information as well.
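A minimal igraph sketch, on an illustrative random directed graph:

library(igraph)
set.seed(1)
g = sample_gnp(50, 0.1, directed = TRUE)
degree(g, mode = "in")                  # in-degree of each node
degree(g, mode = "out")                 # out-degree of each node
degree_distribution(g, mode = "all")    # relative frequency of each total degree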
Diameter
This is the longest shortest distance between any two nodes, taken across all pairs of nodes. Note that if, say, 18 paths attain the maximum length of 7, these are counted in both directions, so the duplicates leave 9 distinct pairs of nodes that have the longest shortest distance between them. The diameter can be computed as follows:
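For instance, a sketch with igraph on an illustrative random graph:

library(igraph)
set.seed(11)
g = sample_gnp(30, 0.1)
diameter(g)             # the longest shortest path length
farthest_vertices(g)    # one pair of vertices that realizes the diameter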
Fragility
This is a property of a network that follows from its degree distribution. The question that arises is: comparing two networks with the same average degree, how do we assess on which network contagion is more likely? A first indication is whether the network is scale-free, because a scale-free network tends to spread the variable of interest, irrespective of whether it is flu, information, or financial malaise. Also, in a scale-free network, the greater the preponderance of central hubs, the greater the probability of contagion, because a few nodes already account for a concentration of the degree. Hence, the higher the concentration, the more scale-free the network, and the higher its fragility.
To measure concentration, economists have long used a simple measure: the Herfindahl-Hirschman index. The index is straightforward to compute; here we take it to be the average squared degree across the n nodes, i.e., H = (1/n) ∑i di².
The more the degrees get concentrated on a few nodes, the more the metric H increases, keeping the total degree of the network constant. For instance, assume we have a graph of three nodes with degrees {1, 1, 4} versus another graph of three nodes with degrees {2, 2, 2}. The value of H is higher for the former than for the latter: the former has H = 18/3 = 6, while the latter has H = 12/3 = 4. To calculate the fragility, we simply normalize H by the average degree:
Fragility = E(d²) / E(d)
In the three-node example above, the fragility is therefore 3 and 2, respectively. Other normalizations can be chosen too, for instance using E(d)² as the denominator. Computing this is simple, requiring just a single line of code.
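Sketched in R for the two illustrative degree sequences:

d1 = c(1, 1, 4); d2 = c(2, 2, 2)
H1 = mean(d1^2); H2 = mean(d2^2)    # Herfindahl-Hirschman concentration: 6 and 4
H1 / mean(d1)                       # fragility of the first graph: 3
H2 / mean(d2)                       # fragility of the second graph: 2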
Centrality
This is a property of the vertices in a network. Taking the adjacency matrix A = {w(u, v)}, we can generate a measure of the "influence" of every vertex in the network. Denote the influence of vertex i as xi; the influences of all vertices are collected in the column vector x. What, then, is influence? To answer this question, consider web pages: a page is more influential the more it is linked to by other influential pages. This shows that influence is interdependent, which we can write as
x = Ax
We can introduce a scalar λ to get
Ax = λx
which is an eigensystem. When we decompose it, the principal eigenvector gives us the influence of each member. With this method, we can find the most influential vertices in the network. There are numerous applications of this idea to real data. This eigenvector centrality is essentially what Google trademarked as PageRank, even though they did not invent eigenvector centrality.
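A minimal sketch with igraph, on an illustrative random graph:

library(igraph)
set.seed(3)
g = sample_gnp(20, 0.2)
eigen_centrality(g)$vector   # influence of each vertex: the principal eigenvector of A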
Another concept of centrality is "betweenness." This is the proportion of all shortest paths between other pairs of nodes that pass through a given node. The formula for this is
b(v) = ∑(a≠b≠v) na,b(v) / na,b
Here, na,b is the number of shortest paths from node a to node b, and na,b(v) is the number of those paths that pass through vertex v. Below is an example:
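Sketched with igraph on a small hypothetical graph:

library(igraph)
g = make_graph(edges = c(1,2, 2,3, 3,4, 2,5, 5,4), directed = FALSE)
betweenness(g, normalized = TRUE)   # share of shortest paths passing through each vertex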
Communities
These are simply spatial agglomerations of vertices that tend to connect with each other more than with others. Identifying these agglomerations is a cluster detection problem, and a computationally difficult (NP-hard) one, because we allow each cluster to have a different size, which in turn permits porous boundaries such that members connect both within and outside their preferred clusters. This is where communities come in.
Communities are constructed by optimizing modularity. Modularity is a metric of the difference between the number of within-community connections and the expected number of such connections. Because of the large computational complexity involved in sifting through all possible partitions, identifying communities is not an easy feat.
The whole idea of community formation started with Simon (1962). In his view, complex systems with several entities usually have coherent subsystems, or communities, that serve specific functional purposes. To understand the functional forces underlying these entities, it is important to identify the communities embedded in the larger entity. To do so, we examine the community detection method in more detail.
The community detection method partitions nodes into clusters whose members tend to interact with one another. We do not require all nodes to belong to a single community, nor do we fix the number of communities in advance; we also allow each community to have a different size. Having made the partitioning task this flexible, the challenge is to find the best partition, because the number of possible partitions is very large. Since community detection aims at identifying clusters that are internally tight-knit, this is the same as finding a partition of clusters that maximizes the observed number of connections between cluster members minus what is expected conditional on the degrees of the nodes, aggregated across all clusters. Therefore we look for a partition with high modularity Q:
Q = (1/(2m)) ∑i,j [ Aij − (di dj)/(2m) ] δ(i, j)
In the above equation, Aij is the (i, j)-th entry in the adjacency matrix, i.e., the weight of the connection between nodes i and j. The total number of connections (or transactions) that node i participates in, i.e., its degree, is di = ∑j Aij, while m = (1/2) ∑i,j Aij is the sum of all edge weights in matrix A. When nodes i and j are in the same community, the indicator function δ(i, j) equals 1; when they are not, it is zero. Q is bounded in [−1, +1]. When Q > 0, intra-community connections exceed the expected number.
Modularity
To understand this, we would use a very simple example before exploring the possible different
interpretations of modularity. The calculation that would be adopted in our example is based on
the measure given by Newman (2006). Also, since we have been using the igraph package in R,
our codes to compute modularity would be presented with this package.
To start with, let's assume we have a network of five nodes {A,B,C,D,E}, and the weights of the
edges are: A : B = 6, A : C = 5, B : C = 2, C : D = 2, and D : E = 10. Let's assume that a
community detection algorithm assigned {A, B, C} to one community and {D, E} to another.
This implies that we have only two communities. The adjacency matrix of this graph is the symmetric 5 × 5 matrix with the edge weights above as its entries and zeros elsewhere, and the Kronecker delta matrix implied by this community assignment has entries equal to 1 whenever two nodes fall in the same community and 0 otherwise. The same community assignment can also be produced by an algorithm known as the "fast-greedy" approach.
Here, m = (1/2) ∑i,j Aij is the sum of the edge weights in the graph, Aij is the (i, j)-th entry in the adjacency matrix, i.e., the weight of the edge between nodes i and j, and the degree of node i is di = ∑j Aij. The Kronecker delta is the function δij: when nodes i and j are from the same community, δij takes the value 1; when they are not, it takes the value zero. The term Aij − (di × dj)/(2m) is the core of the formula. It produces a modularity score that increases when the number of connections within a community is greater than the proportion expected if connections were assigned at random according to the degree of each node. The score takes a value ranging from −1 to +1, as it is normalized by dividing by 2m. When Q > 0, there are more connections within communities than between them. Below is program code that takes in the adjacency matrix and the community assignment:
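A sketch in R with igraph of what such a program might look like: build the weighted graph from the example's edge weights, detect communities with the fast-greedy algorithm, and compute the modularity of the resulting partition.

library(igraph)
A = matrix(0, 5, 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))
A["A","B"] = 6; A["A","C"] = 5; A["B","C"] = 2; A["C","D"] = 2; A["D","E"] = 10
A = A + t(A)                                            # symmetric weighted adjacency matrix
g = graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
cfg = cluster_fast_greedy(g, weights = E(g)$weight)     # "fast-greedy" community detection
membership(cfg)                                         # should assign {A,B,C} and {D,E} to two communities
sizes(cfg)                                              # size of each community
modularity(g, membership(cfg), weights = E(g)$weight)   # should be roughly 0.41 for this partition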
Notice that the algorithm separates the first three nodes into one community and the last two nodes into another. The sizes output shows the size of each community. The next step is to compute the modularity, which confirms the value obtained when we apply the formula directly.
Conclusion
This chapter has extensively covered the rudimentary aspects of a graph in data science. The next
chapter examines the Neural Network.
Chapter 14: Neural Networks
In this chapter, we treat one of the most common forms of nonlinear regression. So far, we have concentrated on linear regressions; this chapter provides a thorough analysis of nonlinear regression through the exploration of neural networks. The outline covered in this chapter includes:
Overview of Neural Networks
Nonlinear Regression
Perceptrons
Squashing functions
Research applications
Overview of Neural Networks
Neural networks are a form of nonlinear regression. Recall that in linear regression, we have:
Y = Xβ + e
Here X ∈ R^(t×n), and the solution for the regression is simply β = (X′X)⁻¹(X′Y).
To get this, we simply minimize the sum of squared errors, (Y − Xβ)′(Y − Xβ). In a nonlinear regression, the model becomes Y = g(X) + e, where g(X) may be any function taking negative or positive values.
When a neural network contains many layers, it is known as a "multi-layer perceptron." Taken together, the connected perceptrons behave like one big, single perceptron.
Neural net models are closely related to deep learning. In deep learning, the number of hidden layers is significantly higher than what was usual in the past, when computational power was more limited; today, deep learning nets have risen to 20-30 layers and more. This gives neural nets a unique ability to imitate the way the human brain works.
Most of the time, binary NNs are seen as a category of classifier systems, because, as classifiers, they are often used to divide members of a population into different classes. Aside from binary NNs, NNs with continuous output are fast becoming popular.
Squashing Functions
This is more general than the binary output. In simple terms, a squashing function is a process whereby the output signal is squashed into a narrow range, usually (0,1). A very common choice of squashing function is the sigmoid function, popularly known as the logistic function. The formula for this is
f(x) = 1 / (1 + e^(−w·x))
where w is the adjustable weight. Another very common choice is the probit function:
f(x) = Φ(w·x)
where Φ(·) is the cumulative normal distribution function.
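A sketch of both squashing functions in base R; the weights and inputs are illustrative:

sigmoid = function(x, w) 1 / (1 + exp(-sum(w * x)))   # logistic: squashes into (0, 1)
probit  = function(x, w) pnorm(sum(w * x))            # cumulative normal squashing
x = c(0.5, -1.2, 2.0)
w = c(0.3, 0.8, -0.1)
sigmoid(x, w)
probit(x, w)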
How Do NNs Work?
The simplest way to see how an NN works is to observe the simplest NN: a single perceptron producing a binary output. The perceptron has n inputs with values xi, i = 1...n, and current weights wi, i = 1...n, and it generates an output y. The “net input” is defined as the weighted sum ∑i wi xi. The output signal is y = 1 when the net input is greater than a threshold T, and y = 0 when it is less than T. The “desired” output is represented by d = {0,1}. Hence, the “training” data provided to the NN comprise both the inputs xi and the desired output d. The output of our single-perceptron model is the sigmoid function of the net input. Writing yj for the model output on the j-th of the m training observations, the error function to be minimized is E = 1/2 ∑mj=1 (yj − dj)². To calibrate the NN, we find the weights wi that minimize this error function E. Once the optimal weights are obtained, we have a calibrated “feedforward” neural net.
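For concreteness, a minimal single-perceptron sketch in R; the training data and weights are made up for illustration:

sigmoid = function(z) 1 / (1 + exp(-z))
set.seed(5)
x = matrix(rnorm(5 * 3), nrow = 5, ncol = 3)   # m = 5 observations, n = 3 inputs
d = c(0, 1, 1, 0, 1)                           # desired binary outputs
w = rep(0.1, 3)                                # current weights
y = as.vector(sigmoid(x %*% w))                # perceptron output for each observation
E = 0.5 * sum((y - d)^2)                       # error function to be minimized
E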
For a multilayer perceptron with a given squashing function f and input x = [x1, x2, ..., xn], the output of node j in the hidden layer is f(∑i wij xi); these hidden-layer outputs are then combined and squashed again to produce the final output. The nested structure of the neural net is obvious in this composition of functions.
Logit/Probit Model
The special model above is essentially a logit model. The model becomes a probit regression model once the squashing function is taken to be the cumulative normal distribution. However, whether logit or probit, the model here is fitted by minimizing squared errors, not by maximum likelihood, which is how standard logit/probit models are estimated.
Connection To Hyperplanes
It is important to note that with a binary squashing function, we pass the net input through a sigmoid and then compare it with the threshold level T. Since the sigmoid function is monotone, there is a level of the net input ∑i wi xi at which the result is exactly on the cusp. The equation
∑i wi xi = T
defines a hyperplane. This means that observations in the n-dimensional space of the inputs xi must lie on one side or the other of this hyperplane: when an observation lies above the hyperplane, y = 1, else y = 0. Thus, single perceptrons in neural nets have a simple geometrical intuition.
Feedback/Backpropagation
The major difference between a neural net and an ordinary nonlinear regression is feedback. Feedback plays a vital role in neural net performance: neural nets learn from feedback. The technique used to implement feedback is called backpropagation.
Suppose we have a calibrated NN and obtain a new observation of data, which we run through the NN. To get the error for this observation, we compare the value of the output y with the desired value d. If the error is large, the best way to correct it is to update the weights in the NN so that it self-corrects. This process of updating the weights of an NN so that it self-corrects is known as "backpropagation." The benefit of backpropagation is that it avoids the long process of a full refitting exercise; with simple rules, corrections are made gradually.
Let's look at backpropagation for a single perceptron with a very simple example. For the j-th observation, the sigmoid value is
yj = 1 / (1 + e^(−∑i wi xij))
where yj is the output for the j-th observation and xij is the i-th input for that observation. The error from this observation is (yj − dj). Recall that E = 1/2 ∑mj=1 (yj − dj)². The change in error with respect to the j-th output is therefore ∂E/∂yj = (yj − dj).
We can now define the value of interest, namely the change in error with respect to the weights, ∂E/∂wi = (yj − dj) yj (1 − yj) xij. In this case, we have one such equation for each observation j and each weight wi. It is important to note that the same weights wi apply to all observations; a more general case is one where every perceptron has its own weights, that is, wij. Instead of updating on just a single observation, the updating can be done over many observations, in which case the error derivative is:
∂E/∂wi = ∑j(yj −dj)yj(1−yj)xij
Therefore, wi should be reduced, to lower E, whenever ∂E/∂wi > 0. How do we achieve this? This is where some art and judgment come in. When the weight wi needs to be shrunk, it is multiplied by a tuning parameter 0 < γ < 1; similarly, when ∂E/∂wi < 0, wi is increased by dividing it by γ.
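A sketch of one such multiplicative update in R, continuing with made-up data; gamma is the tuning parameter described above:

sigmoid = function(z) 1 / (1 + exp(-z))
set.seed(5)
x = matrix(rnorm(5 * 3), nrow = 5, ncol = 3)    # m = 5 observations, n = 3 inputs
d = c(0, 1, 1, 0, 1)                            # desired outputs
w = rep(0.1, 3)                                 # current weights
gamma = 0.9                                     # tuning parameter, 0 < gamma < 1
y = as.vector(sigmoid(x %*% w))
grad = colSums((y - d) * y * (1 - y) * x)       # dE/dw_i, summed over the observations
w = ifelse(grad > 0, w * gamma, w / gamma)      # shrink or grow each weight as described
w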
Chapter 15: One Or Zero: Optimal Digital Portfolio
Digital assets are binary investments: their payoff is either very small (often zero) or large. In this chapter, we explore some key features of optimal portfolios of digital assets, which include assets such as credit assets, venture investments, and lotteries. The outline covered in this chapter includes:
Optimal Digital Portfolio
Modeling Digital Portfolios
Optimal Digital Portfolio
These kinds of portfolios hold correlated assets whose payoffs have a joint Bernoulli distribution. In our explanation, we use an easy and fast recursive technique to obtain the return distribution of the portfolio, and we use the example to develop guidelines on how digital asset investors should construct their portfolios. One finding is that such portfolios are better when they are constructed to be homogeneous in the size of the assets.
It is important to note that the return distribution of digital portfolios is usually fat-tailed and extremely skewed; a venture fund is a very good example of this kind of portfolio. A Bernoulli distribution is a simple representation of a digital payoff: it has very little or no payoff for a failed asset, but a large payoff for a successful one. The probability of success in digital investments is relatively small. Hence, optimizing a digital portfolio is not amenable to the standard technique of mean-variance optimization.
Therefore, in our explanation, we use a technique based on a standard recursion for modeling the Bernoulli return distribution of the portfolio.
Modeling Digital Portfolio
Suppose an investor has a choice of n investments in digital assets, indexed i = 1, 2, ..., n. Each investment succeeds with probability qi and then pays Si dollars; with probability (1 − qi) the start-up fails and the entire investment is lost. The cashflow payout of such an investment is therefore Ci = Si with probability qi and Ci = 0 with probability (1 − qi).
Correlation among the digital assets in the portfolio is driven by a common factor. Let the latent performance of asset i be yi = ρi X + sqrt(1 − ρi²) Zi, where ρi ∈ [0,1] is a coefficient that correlates the normalized common factor X ∼ N(0,1) with yi, and Zi ∼ N(0,1) with Corr(X, Zi) = 0, ∀ i. Therefore, ρi × ρj is the correlation between assets i and j.
Note that the mean and variance of yi are E(yi) = 0 and Var(yi) = 1, ∀ i, and conditional on X, the yi are all independent, as Corr(Zi, Zj) = 0. Next, we formalize the probability model for the success or failure of an investment by defining a threshold xi with distribution function F(·), such that F(xi) = qi gives the probability of success of the digital investment.
Conditional on a fixed value of X, the probability of success of the i-th investment is
pXi = Φ[ (F⁻¹(qi) − ρi X) / sqrt(1 − ρi²) ]
where Φ(·) is the cumulative normal distribution function. Given the level of the common factor X, the asset correlation coefficient ρi, and the unconditional success probability qi, the conditional success probability of each asset is pXi; as X varies, so does pXi. For the numerical examples, we choose F(·) to be the cumulative normal distribution function.
An investment is deemed successful if it generates its high payoff Si. The cashflow from the portfolio is then the random variable C = ∑i Ci. The maximum cashflow that can be generated by the portfolio is the sum of the payoffs of all the digital assets, which occurs when every single outcome is a success.
To simplify matters, we assume each Si is an integer, rounding the amount to the nearest significant digit. Hence, if that digit is a million, each Si is an integer number of millions.
Recall from the formula above that, conditional on a value of X, the probability of success of digital asset i is pXi. A recursion technique makes it easy to generate the probability distribution of portfolio cashflow for each level of X. We then use the marginal distribution of X, denoted g(X), to compose these conditional (on X) distributions into the unconditional distribution of portfolio cashflows. The probability distribution of cashflow from the portfolio, conditional on X, is written f(C|X).
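A hedged sketch of this recursion in R: the success probabilities q, integer payoffs s, factor loadings rho, and the grid over X are all illustrative values, and F is taken to be the cumulative normal as in the text.

q   = c(0.05, 0.10, 0.15)   # unconditional success probabilities
s   = c(2, 3, 1)            # integer payoffs (say, in millions)
rho = c(0.3, 0.3, 0.3)      # factor loadings

cond_dist = function(X) {
  p = pnorm((qnorm(q) - rho * X) / sqrt(1 - rho^2))   # conditional success probabilities pXi
  f = c(1, rep(0, sum(s)))                            # start with P(C = 0 | X) = 1
  for (i in seq_along(s)) {                           # add one asset at a time
    g = f * (1 - p[i])                                # asset i fails: cashflow unchanged
    idx = (s[i] + 1):length(f)
    g[idx] = g[idx] + f[idx - s[i]] * p[i]            # asset i succeeds: cashflow shifts up by s[i]
    f = g
  }
  f                                                   # f(C | X) for C = 0, 1, ..., sum(s)
}

# Compose the conditional distributions over a grid of X values, weighting by g(X) = dnorm(X)
Xg  = seq(-4, 4, by = 0.1)
wts = dnorm(Xg)
f_C = Reduce(`+`, Map(function(x, w) w * cond_dist(x), Xg, wts)) / sum(wts)
round(f_C, 4)                                         # unconditional distribution of portfolio cashflow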
Conclusion
Data science is wider than computer science or statistics; however, to excel in the field, knowledge of both is necessary. From our explanations so far, a lot has been explored under these two fields. For instance, Fourier analysis, data extraction, and limited dependent variables fall under the field of statistics, while algorithms, linear and nonlinear regression, auctions, network theory, neural networks, and more are prevalent in the field of computer science.
In explaining some of the theories in the book, we used recursion techniques and drew on many different portfolios and examples. We also explained the major difference between nonlinear and linear regression, using this as background for exploring what an optimal digital portfolio entails. Some popular theories used in data science, such as Bass, Bayes, GARCH/ARCH, and many more, were explained in detail. Not only this, we explored different models such as the Markowitz model, eigensystems, factor analysis, and many more.
Important areas such as web sourcing with the use of APIs, text classifiers, word-count classifiers, and many more were examined in detail. The approach employed in each chapter of the book is very simple: broad treatments of models and theories were broken down into explanatory detail so that readers are able to grasp all these areas.
With consistent study and practice, data scientists can expect to excel in their field.