Combinepdf
PRELIM QUIZ 1
Question 1
Complete
Flag question
Question text
It transforms data into actionable intelligence for business purposes.
Select one:
a.
Text Analytics
b.
Business Intelligence
c.
Data Mining
d.
Statistics Analytics
Question 2
Complete
Flag question
Question text
It is used in an organization’s strategic and tactical business decision making.
Select one:
a.
data visualization
b.
text analytics
c.
business intelligence
d.
data mining
Question 3
Complete
Flag question
Question text
The following are artifacts used in data analysis EXCEPT:
Select one:
a.
ANOVA
b.
pivot tables
c.
graphs
d.
statistical tools
Question 4
Complete
Flag question
Question text
The following processes are used in data analysis EXCEPT:
Select one:
a.
transforming
b.
collecting
c.
inspecting
d.
cleansing
Question 5
Complete
Flag question
Question text
_____________ includes identifying groups of data records.
Select one:
a.
Text Analytics
b.
Statistics Analytics
c.
Cluster analysis
d.
Business Intelligence
Question 6
Complete
Question text
Which of the following types of text is processed in text analytics?
Select one:
a.
structured
b.
unstructured
c.
unorganized
d.
raw
Question 7
Complete
Flag question
Question text
Which of the following is NOT a method used in data analysis?
Select one:
a.
Business Intelligence
b.
Text Analytics
c.
Statistics Analytics
d.
Data Mining
Question 8
Complete
Flag question
Question text
It is a free software programming language.
Select one:
a.
Orange
b.
WEKA
c.
Knime
d.
R-programming
Question 9
Complete
Flag question
Question text
It has the goal of discovering useful information to support decision making.
Select one:
a.
data mining
b.
data visualization
c.
data analysis
d.
database
Question 10
Complete
Flag question
Question text
It makes complex data more understandable and usable.
Select one:
a.
data mining
b.
text analytics
c.
data visualization
d.
business intelligence
Question 11
Complete
Flag question
Question text
The goal is to transform raw data into understandable business information.
Select one:
a.
Data mining
b.
text analytics
c.
data visualization
d.
business intelligence
Question 12
Complete
Flag question
Question text
What is the process of deriving useful information from text?
Select one:
a.
Statistics Analytics
b.
Text Analytics
c.
Data Mining
d.
Business Intelligence
Question 13
Complete
Question text
It extracts meaningful numerical indices from information and makes it available to
statistical and machine learning algorithms.
Select one:
a.
Text analytics
b.
data mining
c.
business intelligence
d.
data visualization
Question 14
Complete
Flag question
Question text
___________ uses artifacts to present data visually.
Select one:
a.
Text Analytics
b.
data visualization
c.
Statistics Analytics
d.
Data Mining
Question 15
Complete
Flag question
Question text
What programming language does Orange use?
Select one:
a.
python
b.
Fortran
c.
Cobol
d.
JAVA
Question 16
Complete
Flag question
Question text
It is a powerful tool that shows the network of data.
Select one:
a.
WEKA
b.
Orange
c.
Knime
d.
Rapid Miner
Question 17
Complete
Flag question
Question text
Which of the following data mining techniques is predictive?
Select one:
a.
tracking pattern
b.
clustering
c.
classification
d.
outlier detection
Question 18
Complete
Flag question
Question text
It includes identifying groups of data records.
Select one:
a.
cluster analysis
b.
data mining
c.
data analysis
d.
database
Question 19
Complete
Flag question
Question text
Which of the following is NOT a goal in data mining?
Select one:
a.
evaluating data
b.
collecting data
Question 20
Complete
Question text
Select one:
a.
Text Analytics
b.
Data Mining
c.
Statistics Analytics
d.
Business Intelligence
PRELIM QUIZ 2
Question 1
Complete
Flag question
Question text
a.
b.
c.
d.
Question 2
Complete
Flag question
Question text
What is an organized collection of information and set of information used to manage
that operation?
Select one:
a.
ADT
b.
ML
c.
data structure
d.
data science
Question 3
Complete
Flag question
Question text
3A + B =
Select one:
a.
b.
c.
d.
Question 4
Complete
Flag question
Question text
The intersection of the two sets A = {2, 3} and B = {4, 5} is a
Select one:
a.
singleton
b.
singular
c.
null set
d.
nonsingular
Question 5
Complete
Flag question
Question text
What is the earlier name for data science?
Select one:
a.
datology
b.
dataology
c.
datatology
d.
datalogy
Question 6
Complete
Question text
Which is NOT a characteristic feature of data structure?
Select one:
a.
Question 7
Complete
Flag question
Question text
An array is a good example of _________data structure.
Select one:
a.
static
b.
dynamic
c.
linear
d.
nonlinear
Question 8
Complete
Flag question
Question text
What is the size of the product of a 5x6 matrix and a 6x8 matrix?
Select one:
a.
5x5
b.
8x8
c.
8x5
d.
5x 8
Question 9
Complete
Flag question
Question text
If A = {2, 3} and B = {4, 5}, which of the following is a Cartesian product of the two sets?
Select one:
a.
Question 10
Complete
Flag question
Question text
a.
AB is not possible
b.
AB=BA
c.
A + B = B+ A
d.
BC=CB
Question 11
Complete
Flag question
Question text
What is a data structure that has a fixed size?
Select one:
a.
dynamic
b.
linear
c.
nonlinear
d.
static
Question 12
Complete
Flag question
Question text
ML means:
Select one:
a.
Machine Learning
b.
Mobile Learning
c.
Math Learning
d.
Machine Landscaping
Question 13
Complete
Flag question
Question text
Addition and subtraction of matrices are possible only if the two or more matrices:
Select one:
a.
b.
Question 14
Complete
Flag question
Question text
The two sets A = {2, 3} and B = {4, 5} are said to be
Select one:
a.
adjoint
b.
disjoint
c.
joint
d.
equal
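The set questions in this quiz (intersection, disjoint sets, Cartesian product) can be checked with a short Python sketch. The variable names are illustrative only and are not part of the quiz.

```python
from itertools import product

# The two sets used throughout this quiz.
A = {2, 3}
B = {4, 5}

# No element is shared, so the intersection is the null (empty) set.
intersection = A & B

# Two sets with an empty intersection are called disjoint.
are_disjoint = A.isdisjoint(B)

# The Cartesian product A x B pairs every element of A with every element of B.
cartesian = set(product(A, B))

print(intersection, are_disjoint, sorted(cartesian))
```

Under these definitions the Cartesian product has 2 x 2 = 4 ordered pairs.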
Question 15
Complete
Flag question
Question text
What is the correct meaning of ADT?
Select one:
a.
Question 16
Complete
Question text
It refers to a data structure that grows and shrinks at execution time.
Select one:
a.
dynamic
b.
nonlinear
c.
linear
d.
static
Question 17
Complete
Flag question
Question text
Matrix B is
Select one:
a.
transpose
b.
inverse
c.
singular
d.
invertible
Question 18
Complete
Flag question
Question text
What is the focus of data science?
Select one:
a.
statistical computation
b.
organization of data
d.
collection of data
Question 19
Complete
Flag question
Question text
Which of the matrices is singular?
Select one:
a.
B
b.
none
c.
C
d.
Question 20
Complete
Flag question
Question text
_______________ is a data structure in which every component has a unique predecessor and
successor.
Select one:
a.
static
b.
dynamic
c.
linear
d.
nonlinear
MIDTERM QUIZ 1
Question 1
Complete
Flag question
Question text
It shows a high correlation between the incidence of flu and searches about flu on
google.
Select one:
a.
Question 2
Complete
Flag question
Question text
It refers to well based theories and sound business judgement.
Select one:
a.
Data Mining
b.
Data Science
c.
Data Analytics
d.
Data visualization
Question 3
Complete
Flag question
Question text
PAW means____________.
Select one:
a.
Question 4
Complete
Question text
He said that “ In mathematics the art of proposing a question must be held of higher
value than solving it”.
Select one:
a.
Eric Schmidt
b.
Francis Galton
c.
William Gibson
d.
Georg Cantor
Question 5
Complete
Flag question
Question text
These are the data skills that a good data scientist need to cultivate EXCEPT
Select one:
a.
Communication
b.
speaking
c.
coding
d.
Question 6
Complete
Flag question
Question text
What is a great example of data product?
Select one:
a.
google drive
b.
google navigation
c.
google navigation
d.
google maps
Question 7
Complete
Flag question
Question text
It expands available data enormously since there is so much more text being generated
than numbers.
Select one:
a.
text analysis
b.
data mining
c.
data ranking
d.
Text mining
Question 8
Complete
Flag question
Question text
The following are the 3V's of big data EXCEPT
Select one:
a.
velocity
b.
veracity
c.
variety
d.
volume
Question 9
Complete
Question text
The developer of FarmVille, a famous game on the internet.
Select one:
a.
Zynga Incorporated
b.
Moontoon
c.
Supercell
d.
Electronic Arts
Question 10
Complete
Flag question
Question text
The explosion of _______data is the main reason why every 2 days 5 exabytes of data are
generated.
Select one:
a.
gargantuan
b.
reaction
c.
transaction
d.
interaction
Question 11
Complete
Flag question
Question text
He pointed out that until 2003 ,all of mankind had generated just 5 exabytes of data
Select one:
a.
Eric Schmidt
b.
Eric Smidth
c.
Eric Smith
d.
Eric Smicht
Question 12
Complete
Flag question
Question text
Which is NOT an example of interaction data?
Select one:
a.
data base
b.
RFID data
c.
geo-location
d.
browser action
Question 13
Complete
Flag question
Question text
A new phenomenon for the explosion of _________data
Select one:
a.
communication
b.
transient
c.
interaction
d.
transaction
Question 14
Complete
Flag question
Question text
The creation of data from varied sources and its qualification into information.
Select one:
a.
datafition
b.
datafitration
c.
datafication
d.
datacation
Question 15
Complete
Flag question
Question text
“ All models are wrong but some are useful “
Select one:
a.
DJ Patil
b.
William Gibson
c.
George E. P. Box
d.
Georg cantor
Question 16
Complete
Question text
The person who said that “ The future is not google-able”.
Select one:
a.
William Gibson
b.
Georg cantor
c.
D J Patil
d.
Eric Schmidth
Question 17
Complete
Flag question
Question text
Exabyte means ________bytes
Select one:
a.
trillion trillion
b.
thousand thousand
c.
million million
d.
billion billion
Question 18
Complete
Flag question
Question text
IOT means
Select one:
a.
Interconnection of things
b.
Internet of time
c.
Interaction of time
d.
Internet of things
Question 19
Complete
Flag question
Question text
How many bytes of data are generated every two days in today's world?
Select one:
a.
5 terabytes
b.
5 exabytes
c.
5 gigabytes
d.
5 megabytes
Question 20
Complete
Flag question
Question text
The creation of data from varied sources and its quantification into information.
Select one:
a.
datology
b.
datalization
c.
Datafication
d.
dataology
PRE-TEST
Question 1
Correct
Question text
The proportion of a well defined positive event is called _________________.
Select one:
a.
probability
b.
sensitivity
c.
anonymity
d.
specificity
Feedback
Your answer is correct.
Question 2
Correct
Flag question
Question text
AUC means___________.
Select one:
a.
Question 3
Correct
Flag question
Question text
It allows you to see which value of the explanatory variable corresponds to a given
probability of success.
Select one:
a.
ogive
b.
probability table
c.
histogram
Feedback
Your answer is correct.
Question 4
Correct
Question text
LR means ________________________.
Select one:
a.
Logistic Regression
b.
Logistic Reinforcement
c.
Linear Regression
d.
Linear Relativity
Feedback
Your answer is correct.
Question 5
Correct
Flag question
Question text
Positive correlation means that_______________.
Select one:
a.
as x decreases y increases
b.
as x increases y decreases
c.
The correct answer is: as x increases y also increases and vice versa
Question 6
Correct
Flag question
Question text
Which of the following belong to the GLM?
Select one:
a.
exponential
b.
quadratic
c.
logistic
d.
multivariate
Feedback
Your answer is correct.
Question 7
Correct
Question text
GLM means_____________.
Select one:
a.
Question 8
Correct
Flag question
Question text
The proportion of well defined negative events is called ________________.
Select one:
a.
regression
b.
probability
c.
specificity
d.
sensitivity
Feedback
Your answer is correct.
Question 9
Correct
Flag question
Question text
The method that does not require the assumption that parameters are normally
distributed.
Select one:
a.
profile likeness
b.
feedback
c.
profile likelihood
d.
parameter range
Feedback
Your answer is correct.
Question 10
Correct
Question text
Data involving two variables are called _________data.
Select one:
a.
dichotomal
b.
multivariate
c.
dichotomy
d.
bivariate
Feedback
Your answer is correct.
Select one:
a.
semantic
b.
surrogate
c.
reasoning
d.
inference
Feedback
The correct answer is: inference
Question 2
Complete
Flag question
Question text
The following are distinct roles that KR plays EXCEPT
Select one:
a.
Surrogate
b.
c.
d.
Feedback
The correct answer is: Medium for pragmatically diligent interpretation
Question 3
Complete
Flag question
Question text
Which is NOT a basic representation technology?
Select one:
a.
frame
b.
graph
c.
logic
d.
semantic net
Feedback
The correct answer is: graph
Question 4
Complete
Flag question
Question text
All representations are ________.
Select one:
a.
unstable
b.
perfect
c.
stable
d.
imperfect
Feedback
The correct answer is: imperfect
Question 5
Complete
Question text
KR means __________________________.
Select one:
a.
Knowledge Request
b.
Knowledge Requisition
c.
Knowledge Representation
d.
Knowledge Replenished
Feedback
The correct answer is: Knowledge Representation
Question 6
Complete
Flag question
Question text
KR is a set of __________commitments.
Select one:
a.
social
b.
anthropological
c.
ontological
d.
psychological
Feedback
The correct answer is: ontological
Question 7
Complete
Remove flag
Question text
A network purporting to describe family memberships.
Select one:
a.
network topology
b.
network adherence
c.
networking
d.
network tautology
Feedback
The correct answer is: network topology
Question 8
Complete
Question text
It sees a set of prototypes in particular prototypical diseases to be matched against the
case at hand.
Select one:
a.
MYCIN
b.
SEMANTIC NETS
c.
INTERNIST
d.
LOGIC
Feedback
The correct answer is: INTERNIST
Question 9
Complete
Flag question
Question text
The following provided inspirations of what constitutes intelligent reasoning EXCEPT
Select one:
a.
Statistics
b.
Psychology
c.
Sociology
d.
Biology
Feedback
The correct answer is: Sociology
Question 10
Complete
Flag question
Question text
It is a variety of formal calculation typically deduction.
Select one:
a.
Intelligent Reasoning
b.
GLM
c.
Artificial Intelligence
d.
KR
Feedback
The correct answer is: Intelligent Reasoning
Question 11
Complete
Question text
Which is NOT a component of KR?
Select one:
a.
b.
fundamental conception
c.
d.
Feedback
The correct answer is: it adheres to the function
Question 12
Complete
Flag question
Question text
The following are abstract notions EXCEPT
Select one:
a.
processes
b.
actions
c.
beliefs
d.
casualty
Feedback
The correct answer is: casualty
Question 13
Complete
Flag question
Question text
It is a process that goes on internally while most things it wishes to reason about exist only
externally.
Select one:
a.
inference
b.
logic
c.
actions
d.
reasoning
Feedback
The correct answer is: reasoning
Question 14
Complete
Question text
Which is NOT a KR technology?
Select one:
a.
frames
b.
logic
c.
semantic nets
d.
roles
Feedback
The correct answer is: roles
Question 15
Complete
Flag question
Question text
It is used to enable an entity to determine consequences by thinking rather than acting.
Select one:
a.
Knowledge Representation
b.
Artificial Intelligence
c.
Intelligent reasoning
d.
Knowledge Channel
Feedback
The correct answer is: Knowledge Representation
Question 16
Complete
Flag question
Question text
It views the world in terms of attribute-object-value triples.
Select one:
a.
frame
b.
semantic net
c.
rule based
d.
logic
Feedback
The correct answer is: rule based
Question 17
Complete
Question text
It views the world in thinking of prototypical objects.
Select one:
a.
logic
b.
rule
c.
semantic net
d.
frame
Feedback
The correct answer is: frame
Question 18
Complete
Flag question
Question text
It involves a commitment in viewing the world in terms of individual entities and
relations.
Select one:
a.
logic
b.
semantic nets
c.
frame
d.
rules
Feedback
The correct answer is: logic
Question 19
Complete
Flag question
Question text
KR as a _________is a substitute for the thing itself.
Select one:
a.
surrogate
b.
semantic
c.
ontological
d.
pragmatic
Feedback
The correct answer is: surrogate
Question 20
Complete
Question text
It is a language in which we say things about the world.
Select one:
a.
b.
c.
d.
Feedback
The correct answer is: Medium of human expression
data visualization
Data mining
collecting data
data analysis
Python
Collecting
Classification
Data Mining
Business Intelligence
It extracts meaningful numerical indices from information and makes it available to statistical
and machine learning algorithms.
Text analytics
Knime
Unstructured
Statistics Analytics
Cluster analysis
Text Analytics
data visualization
R-programming
ANOVA
business intelligence
PRELIM Q2 20/20
What is an organized collection of information and set of information used to manage that
operation?
ADT
Addition and subtraction of matrices are possible only if the two or more matrices:
If A={ 2,3} B={4,5},which of the following is a Cartesian product of the two sets?
Static
Disjoint
static
Dynamic
Matrix B is
Invertible
null set
A + B = B+ A
ML means:
Machine Learning
_______________ is a data structure in which every component has a unique predecessor and
successor.
linear
3A + B =
Datalogy
5x 8
Data mining
3A + B
cluster analysis
In α =babaa β =a^6b^5bb, what is the length of the concatenation of the two strings?
18
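The length of a concatenation is the sum of the two string lengths. A minimal Python sketch, assuming a^6 and b^5 mean six copies of "a" and five copies of "b":

```python
# alpha and beta as given in the question; a^6 is read as "a" repeated six times.
alpha = "babaa"
beta = "a" * 6 + "b" * 5 + "bb"

# Concatenation joins the strings end to end, so the lengths add: 5 + 13.
concatenation = alpha + beta
total_length = len(concatenation)
print(total_length)
```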
Java
data visualization
2x3
A + B = B+ A
Regression
λ
Another term for text analytics.
text mining
The following are software used in data mining EXCEPT
SPSS
Knime
It is used in organization’s strategic and tactical business decision making.
business intelligence
It is a process of finding the computational complexity of algorithms.
analysis of algorithms
It is a powerful tool that shows the network of data.
Knime
Matrix B is
Invertible
The following are large inputs EXCEPT
Big beta notation
classification
It is used for prototyping in RapidMiner.
studio
The process of inspecting,cleansing,transforming and modelling data with the goal of
discovering useful information.
data analysis
Another term for an empty set.
Null
What type of text is processed in text analytics?
Unstructured
A special type of function where the domain is a set of consecutive integers.
Sequence
Given the sets A = {x | x is a distinct letter in the word "MATHEMATICS"} and B = {x | x is a distinct
letter in the word "STATISTICS"}, the two sets are
Joint
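A quick check of why the two letter sets are joint rather than disjoint, sketched in Python (building a set from a string keeps only its distinct letters):

```python
# Distinct letters of each word.
A = set("MATHEMATICS")
B = set("STATISTICS")

# Letters that appear in both words.
common = A & B
print(sorted(common))
```

Because the intersection is not empty, the sets share elements and are therefore joint.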
The goal is to transform raw data into understandable business information.
Data mining
Addition and subtraction of matrices are possible only if the two or more matrices:
Have same sizes.
The function describing the performance of an algorithm is usually an upper bound
determined from ______inputs.
worst case
An example of an abstract computer.
Turing machine
{3,5,6,10,12}
Null strings are indicated by
λ
It relates the length of an algorithm’s input to the number of steps it takes.
time complexity
What is the size of the product of a 5x6 matrix and a 6x8 matrix?
5x 8
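The rule behind this answer: an m x n matrix times an n x p matrix gives an m x p matrix, and the product is undefined when the inner dimensions differ. A small sketch follows; the helper name product_shape is hypothetical and used only for illustration.

```python
def product_shape(shape_a, shape_b):
    """Return the shape of the product A B, or raise if it is undefined."""
    rows_a, cols_a = shape_a
    rows_b, cols_b = shape_b
    # The number of columns of A must equal the number of rows of B.
    if cols_a != rows_b:
        raise ValueError("inner dimensions must match")
    return (rows_a, cols_b)

print(product_shape((5, 6), (6, 8)))
```

Swapping the operands, product_shape((6, 8), (5, 6)), raises an error because 8 does not equal 5, which is one reason AB = BA does not hold in general.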
It offers a way to examine trends from collected data and derive insights from it.
Business Intelligence
He coined the term “analysis of algorithms”.
Donald Knuth
The constant multiplicative factor in which algorithms are related are_______ constants.
Hidden
it is a perfect software for machine learning.
orange
MIDTERM Q1 20/20
It expands available data enormously since there is so much more text being generated than
numbers.
Text mining
A new phenomenon for the explosion of _________data
Interaction
It shows a high correlation between the incidence of flu and searches about flu on google.
Google Flu trends
What is a great example of data product?
google maps
These are the data skills that a good data scientist need to cultivate EXCEPT
Speaking
It refers to well based theories and sound business judgement.
Data Science
The developer of FarmVille, a famous game on the internet.
Zynga Incorporated
The following are the 3V's of big data EXCEPT
Veracity
He pointed out that until 2003 ,all of mankind had generated just 5 exabytes of data
Eric Schmidt
The creation of data from varied sources and its quantification into information.
Datafication
The explosion of _______data is the main reason why every 2 days 5 exabytes of data are
generated.
Interaction
Exabyte means ________bytes
billion billion
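A one-line check of the "billion billion" answer, assuming the decimal (SI) definition of an exabyte as 10^18 bytes and the short-scale billion of 10^9:

```python
BILLION = 10**9    # short-scale billion
EXABYTE = 10**18   # bytes in one exabyte under the SI definition (assumed here)

# A billion times a billion is 10^18.
is_billion_billion = (EXABYTE == BILLION * BILLION)
print(is_billion_billion)
```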
IOT means
Internet of things
PAW means____________.
Predictive Analytics World
He said that “In mathematics the art of proposing a question must be held of higher value
than solving it”.
Georg Cantor
How many bytes of data are generated every two days in today's world?
5 exabytes
“All models are wrong but some are useful”
George E. P. Box
The person who said that “The future is not google-able”.
William Gibson
Which is Not an interaction data?
data base
The creation of data from varied sources and its qualification into information.
Datafication
MIDTERM Q2 20/20
What range of values lies within 3 SD below and above the mean in a normal distribution if the mean
is 10 and the standard deviation is 2?
4-16
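The range questions in this quiz all use the same arithmetic: lower = mean - k x SD and upper = mean + k x SD. A minimal sketch, where the helper name sd_range is hypothetical:

```python
def sd_range(mean, sd, k=3):
    """Return the interval from k standard deviations below to k above the mean."""
    return (mean - k * sd, mean + k * sd)

# Mean 10, SD 2, k = 3 gives 10 - 6 and 10 + 6.
print(sd_range(10, 2))
# Mean 80, SD 3, k = 3 gives 80 - 9 and 80 + 9.
print(sd_range(80, 3))
```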
What is the value of the mean if a score of 110 is 3 standard deviation above the mean?
95
What is the value of the standard deviation in a standard normal distribution?
1
What percent of data will lie within 2 standard deviation of the mean?
95
Empirical rule for a normal distribution that is 3 standard deviations above and below the
mean covers ______% of the data.
99.7
What range of values lie between 3 standard deviations above and below the mean if the
mean is 80 and the standard deviation is 3?
71-89
A bell shaped curve that is symmetric about a vertical line.
normal distribution
A distribution in which large data sets are displayed.
Grouped frequency distribution
The area of the standard normal curve to the right of z=0.82 is _______.
0.206
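The area to the right of z is 1 minus the cumulative area to the left of z. The sketch below computes it with the complementary error function from Python's standard library instead of a printed z-table; the function name upper_tail is illustrative.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    # P(Z > z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))

area = upper_tail(0.82)
print(round(area, 3))
```

A z-table gives the same result: the area to the left of z = 0.82 is about 0.7939, and 1 - 0.7939 is about 0.206.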
The normal distribution with a mean of 0 and standard deviation of 1.
Standard
A score of 50 lies 2 standard deviations above a mean of 30. What is the value of the
standard deviation?
10
A bell-shaped distribution that is symmetric about a vertical line?
Normal
What is the mean for a standard normal distribution?
0
A survey of 100 consumers said that the price charged for a kilo of rice could be
approximated by a normal distribution with a mean of 35 and a standard deviation of 4. How
many of them lie between 27 and 43?
95
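Since 27 and 43 are exactly 2 standard deviations below and above the mean of 35, the empirical rule gives about 95% of the 100 consumers. A sketch using the standard library's NormalDist shows the slightly more precise figure; the variable names are illustrative.

```python
from statistics import NormalDist

# Rice prices modelled as normal with mean 35 and standard deviation 4.
prices = NormalDist(mu=35, sigma=4)

# Share of the distribution between 27 and 43 (mean minus and plus 2 SD).
share = prices.cdf(43) - prices.cdf(27)

# Expected number of consumers out of 100 in that interval.
expected_count = 100 * share
print(round(expected_count, 1))
```

The exact share is a little above 95%, which the empirical rule rounds to 95.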
Empirical rule for a normal distribution that is 2 standard deviations above and below the
mean is ________% of data.
95
Empirical rule for a normal distribution lie ______% of data with 1 standard deviation below
and above the mean.
68
A graph used to indicate intervals in a frequency distribution is referred to as
a______________.
Histogram
0.206
As of 2014, there were _______ million tweets a day.
500
Data is NOT information unless we add_________.
Analytics
A vegetable distributor knows that during the month of August, the weights of tomatoes
are normally distributed with a mean of 0.61 lb and a standard deviation of 0.15 lb. How
many can be expected to weigh more than 0.31 lb in a shipment of 6000 tomatoes?
200
The score NOT easily affected by extreme values.
Median
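A small illustration of why the median is the score least affected by extreme values. The data below are made up for the example and are not from the course.

```python
from statistics import mean, median

# Hypothetical scores, then the same scores with one extreme value added.
scores = [10, 12, 13, 14, 15]
with_outlier = scores + [200]

# The mean is pulled strongly toward the extreme value; the median barely moves.
print(mean(scores), mean(with_outlier))
print(median(scores), median(with_outlier))
```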
A negative correlation exists when___________.
x increases y decreases
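The sign of the Pearson correlation coefficient shows the direction of the relationship. The sketch below computes it by hand on made-up data; the function name pearson_r and the sample values are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

x = [1, 2, 3, 4, 5]
y_down = [10, 8, 7, 4, 1]   # y falls as x rises: negative correlation
y_up = [2, 4, 5, 8, 9]      # y rises as x rises: positive correlation

print(pearson_r(x, y_down), pearson_r(x, y_up))
```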
According to Hilary Mason which is NOT a skill that a good data scientist must cultivate.
critical thinking
He is someone who asks interesting questions on formal and informal theory.
data scientist
Data involving two variables.
Bivariate
It partitions a ranked data into four equal groups.
Quartile
A graph that is used to indicate frequency distribution.
Histogram
Question 1
Correct
What range of values lies within 3 SD below and above the mean in a normal distribution if the mean is 10 and the standard deviation is 2?
Select one:
a. 10-14
b. 5-15
c. 4-16
d. 8-14
Question 2
Correct
Select one:
a. relative frequency distribution
b. grouped frequency distribution
c. ogive
d. histogram
Question 3
Correct
Select one:
a. Mean > Median >Mode
b. Mean=Median=Mode
Question 4
Correct
Select one:
a. normal distribution
b. kurtic
c. standard distribution
d. skewed
Question 5
Correct
Select one:
a. symmetric
b. skewed
c. standard
d. normal
Question 6
Correct
Select one:
b. histogram
d. ogive
Question 7
Correct
Select one:
a. bar graph
b. histogram
c. pie graph
d. ogive
Question 8
Correct
A survey of 100 consumers said that the price charged for a kilo of rice could be approximated by a normal distribution with a mean of 35 and a
standard deviation of 4. How many are less than 39?
Select one:
a. 80
b. 84
c. 82
d. 78
Question 9
Correct
Select one:
a. 5
b. 0
c. 1
d. 2
Question 10
Correct
A survey of 100 consumers said that the price charged for a kilo of rice could be approximated by a normal distribution with a mean of 35 and a
standard deviation of 4. How many of them lie between 27 and 43?
Select one:
a. 92
b. 95
c. 90
d. 88
Question 11
Correct
Select one:
a. 5
b. 0
c. 2
d. 1
Question 12
Correct
Select one:
a. Skewed
b. kurtic
c. Standard
Question 13
Correct
A score of 50 lies 2 standard deviations above a mean of 30. What is the value of the standard deviation?
Select one:
a. 10
b. 25
c. 20
d. 15
Question 14
Correct
Empirical rule for a normal distribution lie ______% of data with 1 standard deviation below and above the mean.
Select one:
a. 68
b. 64
c. 75
d. 79
Question 15
Correct
The area of the standard normal curve to the right of z=0.82 is _______.
Select one:
a. 0.295
b. 209
c. 0.294
d. 0.206
Question 16
Correct
Empirical rule for a normal distribution that is 2 standard deviations above and below the mean is ________% of data.
Select one:
a. 80
b. 90
c. 95
d. 85
Question 17
Correct
What is the value of the mean if a score of 110 is 3 standard deviation above the mean?
Select one:
a. 90
b. 91
c. 95
d. 85
Question 18
Correct
Empirical rule for a normal distribution that is 3 standard deviations above and below the mean covers ______% of the data.
Select one:
a. 95
b. 98
c. 99.7
d. 92
Question 19
Correct
What range of values lie between 3 standard deviations above and below the mean if the mean is 80 and the standard deviation is 3?
Select one:
a. 72-89
b. 71-88
c. 71-89
d. 70-89
Question 20
Correct
What percent of data will lie within 2 standard deviation of the mean?
Select one:
a. 95
b. 99
c. 68
d. 90
Question 2
Correct
Mark 1.00 out of 1.00
Flag question
Question text
Question 3
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. ML
Feedback
Question 4
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. linear
Feedback
Question 5
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. static
Feedback
Question 6
Correct
Mark 1.00 out of 1.00
Flag question
Question text
ML means:
Select one:
a. Machine Landscaping
b. Math Learning
c. Mobile Learning
d. Machine Learning
Feedback
Question 7
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. statistical computation
Feedback
Question 8
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. dynamic
Feedback
Question 9
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. datatology
Feedback
Question 10
Correct
Mark 1.00 out of 1.00
Flag question
Question text
d. linear
Feedback
Methods used:
1. Data Mining - a method of data analysis for discovering patterns in large data sets using methods from
statistics, artificial intelligence, machine learning and databases. The goal is to transform raw data into
understandable business information. This might include identifying groups of data records (known as
cluster analysis) or identifying anomalies and dependencies between data groups.
2. Text Analytics - the process of deriving useful information from text. It is accomplished by processing
unstructured textual information, extracting meaningful numerical indices from the information and making
the information available to statistical and machine learning algorithms for further processing.
3. Business Intelligence - transforms data into actionable intelligence for business purposes and may be
used in an organization's strategic and tactical business decision making. It offers a way for people to
examine trends from collected data and derive insights from them.
4. Data Visualization - refers very simply to the visual representation of data. In the context of data analysis,
it means using the tools of statistics, probability, pivot tables and other artifacts to present data visually.
It makes complex data more understandable and usable.
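To make the text analytics description concrete, the sketch below turns unstructured text into numerical indices (word counts) that a statistical or machine learning algorithm could consume. It is a bare bag-of-words illustration under simple assumptions (lowercasing, letters-only tokens); the sample sentences and the helper name term_counts are made up.

```python
import re
from collections import Counter

def term_counts(text):
    """Count how often each word occurs in a piece of unstructured text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two made-up documents.
docs = [
    "The service was great, great food too.",
    "Slow service and cold food.",
]

# Shared vocabulary across all documents, in a fixed order.
vocabulary = sorted(set().union(*(term_counts(d) for d in docs)))

# One numeric vector per document: the count of each vocabulary word.
vectors = [[term_counts(d)[word] for word in vocabulary] for d in docs]

print(vocabulary)
print(vectors)
```

Each document is now a row of numbers of equal length, which is the kind of input that clustering or classification methods expect.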
Data Mining
7 most important data mining techniques:
1. Tracking pattern
2. Classification (predictive)
3. Association (descriptive)
4. Outlier detection
5. Clustering (descriptive)
6. Regression (predictive)
7. Prediction
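Clustering is descriptive because it groups records without any labels to predict. The sketch below is a deliberately small one-dimensional, two-group version of the k-means idea, not a full implementation; the function name two_means and the sample values are assumptions, and it expects at least two distinct values.

```python
def two_means(values, iterations=10):
    """Split numbers into two groups around two centers (1-D k-means with k = 2)."""
    # Start the two centers at the smallest and largest value.
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to the nearer center.
        low_group = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high_group = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        # Move each center to the mean of its group.
        low_center = sum(low_group) / len(low_group)
        high_center = sum(high_group) / len(high_group)
    return sorted(low_group), sorted(high_group)

# Made-up data with two obvious groups.
groups = two_means([1, 2, 3, 20, 21, 22])
print(groups)
```

Classification and regression differ in that they learn from records with known outcomes and then predict the outcome for new records.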
Question 1
Correct
Mark 1.00 out of 1.00
Flag question
Question text
The person who said that “ The future is not google-able”.
Select one:
a.
Eric Schmidth
b.
D J Patil
c.
William Gibson
d.
Georg cantor
Feedback
Your answer is correct.
Question 2
Correct
Mark 1.00 out of 1.00
Flag question
Question text
The following are the 3V's of big data EXCEPT
Select one:
a.
volume
b.
variety
c.
veracity
d.
velocity
Feedback
Your answer is correct.
Question 3
Correct
Mark 1.00 out of 1.00
Flag question
Question text
These are the data skills that a good data scientist need to cultivate EXCEPT
Select one:
a.
speaking
b.
coding
c.
Communication
d.
Math and Stats
Feedback
Your answer is correct.
Question 4
Correct
Mark 1.00 out of 1.00
Flag question
Question text
He pointed out that until 2003 ,all of mankind had generated just 5 exabytes of
data
Select one:
a.
Eric Smidth
b.
Eric Schmidt
c.
Eric Smith
d.
Eric Smicht
Feedback
Your answer is correct.
Question 5
Correct
Mark 1.00 out of 1.00
Flag question
Question text
It refers to well based theories and sound business judgement.
Select one:
a.
Data visualization
b.
Data Mining
c.
Data Analytics
d.
Data Science
Feedback
Your answer is correct.
Question 6
Correct
Mark 1.00 out of 1.00
Flag question
Question text
How many bytes of data are generated every two days in today's world?
Select one:
a.
5 exabytes
b.
5 terabytes
c.
5 gigabytes
d.
5 megabytes
Feedback
Your answer is correct.
Question 7
Correct
Mark 1.00 out of 1.00
Flag question
Question text
Which is Not an interaction data?
Select one:
a.
browser action
b.
RFID data
c.
data base
d.
geo-location
Feedback
Your answer is correct.
Question 8
Correct
Mark 1.00 out of 1.00
Flag question
Question text
What is a great example of data product?
a.
google maps
b.
google drive
c.
google navigation
d.
google navigation
Feedback
Your answer is correct.
Question 9
Correct
Mark 1.00 out of 1.00
Flag question
Question text
The creation of data from varied sources and its quantification into information.
Select one:
a.
datalization
b.
dataology
c.
Datafication
d.
datology
Feedback
Your answer is correct.
Question 10
Correct
Mark 1.00 out of 1.00
Flag question
Question text
He said that “ In mathematics the art of proposing a question must be held of
higher value than solving it”.
Select one:
a.
Eric Schmidt
b.
William Gibson
c.
Francis Galton
d.
Georg Cantor
Feedback
Your answer is correct.
Question 11
Correct
Mark 1.00 out of 1.00
Flag question
Question text
It shows a high correlation between the incidence of flu and searches about flu
on google.
Select one:
a.
Google Flu trends
b.
c.
Google Flu Searches
d.
Google Flu Viral
Feedback
Your answer is correct.
Question 12
Incorrect
Mark 0.00 out of 1.00
Flag question
Question text
The explosion of _______data is the main reason why every 2 days 5 exabytes of
data are generated.
Select one:
a.
transaction
b.
reaction
c.
gargantuan
d.
interaction
Feedback
Your answer is incorrect.
Question 13
Correct
Mark 1.00 out of 1.00
Flag question
Question text
PAW means____________.
Select one:
a.
Predictive Analytics web
b.
Predictive Analytics World
c.
Preliminary Assumption Web
d.
Predicting Analytics Web
Feedback
Your answer is correct.
Question 14
Correct
Mark 1.00 out of 1.00
Question text
Exabyte means ________ bytes.
Select one:
a.
thousand thousand
b.
billion billion
c.
trillion trillion
d.
million million
Feedback
Your answer is correct.
Question 15
Correct
Mark 1.00 out of 1.00
Question text
It expands available data enormously since there is so much more text being
generated than numbers.
Select one:
a.
data ranking
b.
text analysis
c.
data mining
d.
Text mining
Feedback
Your answer is correct.
Question 16
Correct
Mark 1.00 out of 1.00
Question text
The creation of data from varied sources and its quantification into information.
Select one:
a.
datafitration
b.
datafication
c.
datacation
d.
datafition
Feedback
Your answer is correct.
Question 17
Correct
Mark 1.00 out of 1.00
Question text
IoT means
Select one:
a.
Interconnection of things
b.
Internet of time
c.
Interaction of time
d.
Internet of things
Feedback
Your answer is correct.
Question 18
Correct
Mark 1.00 out of 1.00
Question text
A new phenomenon for the explosion of _________ data
Select one:
a.
interaction
b.
transaction
c.
communication
d.
transient
Feedback
Your answer is correct.
Question 19
Correct
Mark 1.00 out of 1.00
Question text
“All models are wrong but some are useful.”
Select one:
a.
William Gibson
b.
DJ Patil
c.
George E. P. Box
d.
Georg Cantor
Feedback
Your answer is correct.
Question 20
Correct
Mark 1.00 out of 1.00
Question text
The developer of Farmville, a famous game on the internet.
Select one:
a.
Moontoon
b.
Electronic Arts
c.
Supercell
d.
Zynga Incorporated
Feedback
Your answer is correct.
Question 1
Correct
Select one:
a. 7 ✓
b. 3
c. 5
d. 6
Question 2
Correct
Select one:
a. range
b. quartile ✓
c. variance
d. standard deviation
https://trimestral.amaesonline.com/2313A/mod/quiz/review.php?attempt=213520&cmid=21506 1/10
Downloaded by Rythm Quira ([email protected])
lOMoARcPSD|16010511
Question 3
Correct
Select one:
a. Q2=Mean
b. Q2=Mode
c. Q2=median ✓
d. Q2=Range
Question 4
Correct
Select one:
a. Mode ✓
b. median
c. mean
d. range
Question 5
Correct
Select one:
Question 6
Correct
Select one:
a. 5
b. 6
c. 7 ✓
d. 8
Question 7
Correct
A score of 3 in 2,4,4,4,5,5,6,8,9 is
Select one:
Question 8
Correct
Select one:
a. median
b. standard deviation
c. mode ✓
d. mean
Question 9
Correct
Select one:
a. mean
b. mode
c. median
d. standard deviation ✓
Question 10
Correct
Select one:
a. it is normal
c. it is skewed.
d. there is no mode. ✓
Question 11
Correct
Select one:
a. bimodal
b. unimodal
c. trimodal
d. multimodal ✓
Question 12
Correct
Select one:
a. mode
b. Median ✓
c. mean
d. range
Question 13
Correct
On an examination given to 1000 students, Jef’s score of 80 was higher than the score of 480 students who took the exam. What is
the percentile for Jef’s score?
Select one:
a. 48th ✓
b. 65th
c. 50th
d. 60th
Question 14
Correct
Select one:
a. median
b. Mean ✓
c. mode
d. range
Question 15
Correct
If there are 101 scores the median is equal to the _____ranked score.
Select one:
a. 54th
b. 55th
c. 52nd
d. 51st ✓
Question 16
Correct
Select one:
a. 2.16
b. 2.17 ✓
c. 2.15
d. 2.18
Question 17
Correct
Select one:
a. 1.41
b. 9 ✓
c. 6
d. 1.5
Question 18
Correct
Select one:
a. center
b. mean
c. frequent
d. dispersion ✓
Question 19
Correct
Select one:
a. unimodal ✓
b. multimodal
c. bimodal
d. trimodal
Question 20
Correct
Select one:
a. mean
b. median
c. mode
d. quartile ✓
[Data Analysis]
1 [Introduction]
Module 1 Introduction
Data analysis
Data Analysis - the process of inspecting, cleansing, transforming, and modelling data
with the goal of discovering useful information, informing conclusions, and supporting
decision-making. It is the process of evaluating data using analytical and statistical tools
to discover useful information and aid in business decision-making.
Methods used:
1. Data Mining
2. Text Analytics
3. Business Intelligence
4. Data Visualization
Data Mining - a method of data analysis for discovering patterns in large data sets using methods from statistics,
artificial intelligence, machine learning, and databases. The goal is to transform raw data into understandable
business information. This might include identifying groups of data records (known as cluster analysis) or
identifying anomalies and dependencies between data groups.
Text Analytics - the process of deriving useful information from text. It is accomplished by processing
unstructured textual information, extracting meaningful numerical indices from it, and making the
information available to statistical and machine learning algorithms for further processing.
Business Intelligence - transforms data into actionable intelligence for business purposes and may be used in an
organization's strategic and tactical business decision-making. It offers a way for people to examine trends in
collected data and derive insights from them.
Data Visualization - refers very simply to the visual representation of data. In the context of data analysis, it means
using the tools of statistics, probability, pivot tables, and other artifacts to present data visually. It makes complex
data more understandable and usable.
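As a rough illustration of the Text Analytics definition above, the sketch below turns unstructured text into numerical indices (simple word counts) that a statistical or machine learning algorithm could consume. The function names and the two sample sentences are made up for this example; real text analytics pipelines usually add steps such as stop-word removal that are not shown here.

```python
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def to_vector(counts, vocabulary):
    """Turn a word-count mapping into a fixed-order list of numbers."""
    return [counts[word] for word in vocabulary]

# Two made-up documents standing in for unstructured text.
docs = [
    "Data mining finds patterns in data.",
    "Text analytics turns text into numbers.",
]

counts = [term_counts(doc) for doc in docs]
vocabulary = sorted(set().union(*counts))        # every distinct word, sorted
vectors = [to_vector(c, vocabulary) for c in counts]

for doc, vector in zip(docs, vectors):
    print(doc, vector)
```

Each document becomes a list of counts in the same word order, which is one simple form of the "numerical indices" the definition mentions.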
Data Science
Data science is the science of learning from data. The sciences are focusing on answering specific
questions about the world while data science is focusing on how to manipulate data efficiently and
effectively.
The primary focus is not which questions to ask of the data but how we can answer them, whatever they
may be. It is more like computer science and mathematics than it is like natural sciences, in this way. It isn’t
so much about studying the natural world as it is about how to compute data efficiently. Included in data
science is the design of experiments. With the right data, we can address the questions we are interested in.
With a poor design of experiments or a poor choice of which data we gather, this can be difficult. Study
design might be the most important aspect of data science. In this module the focus is on the analysis of
data, once gathered.
Computer science is also mainly the study of computations—as is hinted at in the name—but is a bit
broader in this focus. Although datalogy, an earlier name for data science, was also suggested for
computer science, and for example in Denmark it is the name for computer science, using the name
“computer science” puts the focus on computation while using the name “data science” puts the focus on
data. But of course, the fields overlap.
If you are writing a sorting algorithm, are you then focusing on the computation or the data? Is that even
a meaningful question to ask? There is a huge overlap between computer science and data science and
naturally the skill sets you need overlap as well. To efficiently manipulate data, you need the tools for
doing that, so computer programming skills are a must and some knowledge about algorithms and data
structures usually is as well.
For data science, though, the focus is always on the data. In a data analysis project, the focus is on how
the data flows from its raw form through various manipulations until it is summarized in some useful form.
Although the difference can be subtle, the focus is not about what operations a program does during the
analysis, but about how the data flows and is transformed, what purpose those changes serve, and
how they help us gain knowledge about the data. It is as much about deciding what to do with the data as
it is about how to do it efficiently.
Statistics is of course also closely related to data science. So closely linked, in fact, that many consider data
science just a fancy word for statistics that looks slightly more modern and sexier. I can’t say that I strongly
disagree with this—data science does sound sexier than statistics—but just as data science is slightly
different from computer science, data science is also slightly different from statistics. Just, perhaps,
somewhat less different than computer science is. A large part of doing statistics is building mathematical
models for your data and fitting the models to the data to learn about the data in this way. That is also
what we do in data science. As long as the focus is on the data, I am happy to call statistics data science.
If the focus changes to the models and the mathematics, then we are drifting away from data science into
something else—just as if the focus changes from the data to computations we are drifting from data
science to computer science.
Data science is also related to machine learning and artificial intelligence, and again there are huge
overlaps. Perhaps not surprising since something like machine learning has its home both in computer
science and in statistics; if it is focusing on data analysis, it is also at home in data science. To be honest, it
has never been clear to me when a mathematical model changes from being a plain old statistical model
to becoming machine learning anyway.
- set of data values and associated operations that are precisely specified independent of any
implementation.
- organized collection of information and a set of operations used to manage that information
1. The representation or definition of the type and the operations are contained in a single syntactic unit.
2. The representation of objects of the type is hidden from the program units that use the type, so the only
direct operations possible on those objects are those provided in the type's definition.
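The two conditions above can be sketched with a small stack type. This is only an illustration: the class name `Stack` and its method names are chosen for the example, and Python hides the representation by convention (the leading underscore) rather than by enforcement.

```python
class Stack:
    """A stack type: the representation and its operations live in one unit."""

    def __init__(self):
        # Hidden representation; callers are expected not to touch it directly.
        self._items = []

    def push(self, item):
        """Place an item on top of the stack."""
        self._items.append(item)

    def pop(self):
        """Remove and return the top item; fail on an empty stack."""
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def peek(self):
        """Return the top item without removing it."""
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]

    def is_empty(self):
        """Report whether the stack holds no items."""
        return not self._items


stack = Stack()
stack.push("a")
stack.push("b")
print(stack.pop())    # the most recently pushed item comes out first
print(stack.peek())
```

Because callers use only `push`, `pop`, `peek`, and `is_empty`, the list inside could be swapped for another representation without changing any code that uses the type.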
Data Structures
1. It contains component data items, which may be atomic or another data structure (still a domain)
3. Defines rules as to how components relate to each other and to the structure as a whole (assertions)
Types:
Ex. arrays
2. Dynamic data structure-grows and shrinks at execution time as required by its contents. It is
implemented using links.
3. Linear data structure -every component has a unique predecessor and successor except first and last
elements.
4. Non-linear data structure - there is no such restriction; elements may be arranged in any desired
fashion, limited only by the way we choose to represent such types.
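A singly linked list is one common example that is both dynamic (it grows and shrinks at execution time through links) and linear (each node has one successor, except the last). The sketch below is a minimal illustration; the class and method names are assumptions made for this example.

```python
class Node:
    """One component of the list: a value plus a link to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    """A linear, dynamic structure implemented using links."""

    def __init__(self):
        self.head = None
        self.size = 0

    def push_front(self, value):
        """Grow the list by linking a new node in front of the current head."""
        self.head = Node(value, self.head)
        self.size += 1

    def pop_front(self):
        """Shrink the list by unlinking and returning the first value."""
        if self.head is None:
            raise IndexError("pop from empty list")
        node = self.head
        self.head = node.next
        self.size -= 1
        return node.value

    def to_list(self):
        """Follow the links from the head and collect the values in order."""
        values = []
        node = self.head
        while node is not None:
            values.append(node.value)
            node = node.next
        return values


numbers = LinkedList()
for value in (1, 2, 3):
    numbers.push_front(value)
print(numbers.to_list(), numbers.size)
```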
Module 3-Mathematical Preliminaries
Empty set (also null set or void set) - a set with no elements, denoted by { }
Cartesian product of two sets X and Y - the set of all ordered pairs (x, y) with x in X and y in Y
R= { (2,4),(2,6),(3,3), (3,6),(4,4)}
Domain is {2,3,4}
Range is {3,4,6}
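The domain and range above can be checked mechanically. In this sketch the relation R is taken from the example; the sets X and Y used for the Cartesian product are made up for illustration.

```python
from itertools import product

# Cartesian product of two small, made-up sets.
X = {1, 2}
Y = {"a", "b"}
cartesian = set(product(X, Y))   # every ordered pair (x, y)

# The relation R from the example above.
R = {(2, 4), (2, 6), (3, 3), (3, 6), (4, 4)}
domain = {x for x, _ in R}       # first components of the pairs
range_ = {y for _, y in R}       # second components of the pairs

print(sorted(domain))
print(sorted(range_))
```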
Sequence - a special type of function whose domain is a set of consecutive integers
Course Module
Ex. If X = {a, b, c}, then a string over X may be baac or acab. Order is taken into account.
Repetitions in a string can be specified by superscripts; for example, the string bbaaac may be written
b^2a^3c.
The length of a string α is denoted by |α|, which refers to the number of elements in the string.
Vector Algebra
Matrix - a rectangular array of data, denoted by a capital letter. If A is a matrix with m rows
and n columns, its size is written as m x n. Its entries are enclosed in either parentheses or brackets.
Operations:
Addition and Subtraction of matrices: Possible only if the matrices are of the same
size. Addition and subtraction are done by adding and subtracting corresponding entries.
Multiplication of matrices: To multiply two matrices, the number of columns of the first must
equal the number of rows of the second. Multiplying a 3x2 matrix by a 2x3 matrix yields a 3x3 matrix.
Transpose of a matrix: The matrix obtained when the rows and columns are interchanged, written A^T.
Matrix raised to an exponent p: M^p is the product of p copies of the square matrix M.
Inverse of a matrix: It exists if and only if the matrix is invertible. For a 2x2 matrix with rows (a, b) and
(c, d), this means ad-bc is not equal to 0; if ad-bc = 0, the inverse does not exist.
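The operations above can be tried out with plain nested lists. This is a minimal sketch rather than a substitute for a numerical library; the function names and the sample matrices are chosen for the example, and the inverse is limited to the 2x2 case that the ad-bc condition describes.

```python
def mat_mul(A, B):
    """Multiply A (m x n) by B (n x p); the inner sizes must match."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    """Interchange rows and columns."""
    return [list(row) for row in zip(*A)]

def inverse_2x2(M):
    """Invert a 2x2 matrix; it exists only when ad - bc is not 0."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible (ad - bc = 0)")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
C = mat_mul(A, B)              # 3 x 3 result

print(C)
print(transpose(A))
print(inverse_2x2([[2, 1], [1, 1]]))
```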
Math 6200 / Data Analysis
Algorithm Analysis - refers to the process of deriving estimates for the time and space needed to
execute the algorithm.
Analysis of algorithms
For looking up a given entry in a given ordered list, both the binary and the linear search algorithm (which
ignores ordering) can be used. The analysis of the former and the latter algorithm shows that it takes at
most log2(n) and n check steps, respectively, for a list of length n. In the depicted example list of length 33,
searching for "Morin, Arthur" takes 5 and 28 steps with binary (shown in cyan) and linear (magenta) search,
respectively.
Course Module
Graphs of functions commonly used in the analysis of algorithms, showing the number of
operations N versus input size n for each function
In computer science, the analysis of algorithms is the process of finding the computational complexity of
algorithms – the amount of time, storage, or other resources needed to execute them. Usually, this
involves determining a function that relates the length of an algorithm's input to the number of steps it
takes (its time complexity) or the number of storage locations it uses (its space complexity). An algorithm
is said to be efficient when this function's values are small, or grow slowly compared to a growth in the
size of the input. Different inputs of the same length may cause the algorithm to have different behavior,
so best, worst and average case descriptions might all be of practical interest. When not otherwise
specified, the function describing the performance of an algorithm is usually an upper bound, determined
from the worst case inputs to the algorithm.
The term "analysis of algorithms" was coined by Donald Knuth.[1] Algorithm analysis is an important part of
a broader computational complexity theory, which provides theoretical estimates for the resources needed
by any algorithm which solves a given computational problem. These estimates provide an insight into
reasonable directions of search for efficient algorithms.
In theoretical analysis of algorithms it is common to estimate their complexity in the asymptotic sense, i.e.,
to estimate the complexity function for arbitrarily large input. Big O notation, Big-omega
notation and Big-theta notation are used to this end. For instance, binary search is said to run in a number
of steps proportional to the logarithm of the length of the sorted list being searched, or in O(log(n)),
colloquially "in logarithmic time". Usually asymptotic estimates are used because
different implementations of the same algorithm may differ in efficiency. However the efficiencies of any
two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor
called a hidden constant.
Exact (not asymptotic) measures of efficiency can sometimes be computed but they usually require certain
assumptions concerning the particular implementation of the algorithm, called model of computation. A
model of computation may be defined in terms of an abstract computer, e.g., Turing machine, and/or by
postulating that certain operations are executed in unit time. For example, if the sorted list to which we
apply binary search has n elements, and we can guarantee that each lookup of an element in the list can
be done in unit time, then at most log2(n) + 1 time units are needed to return an answer.
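The bound above can be checked by counting lookups directly. In this sketch each comparison against a list element counts as one step, and the bound is taken as floor(log2(n)) + 1; the function names and the 33-element test list are assumptions made for the example.

```python
import math

def linear_search(items, target):
    """Scan left to right; return (index, steps), with index -1 if absent."""
    steps = 0
    for index, value in enumerate(items):
        steps += 1
        if value == target:
            return index, steps
    return -1, steps

def binary_search(items, target):
    """Halve the sorted search range each step; return (index, steps)."""
    low, high, steps = 0, len(items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps

items = list(range(0, 66, 2))                    # 33 sorted values
bound = math.floor(math.log2(len(items))) + 1    # worst-case lookups for binary search

worst_binary = max(binary_search(items, t)[1] for t in items)
worst_linear = max(linear_search(items, t)[1] for t in items)
print(bound, worst_binary, worst_linear)
```

For this list the binary search never needs more lookups than the bound, while the linear search needs up to n.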
Cost models
Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully
to the actual execution time, the time required to perform a step must be guaranteed to be bounded
above by a constant. One must be careful here; for instance, some analyses count an addition of two
numbers as one step. This assumption may not be warranted in certain contexts. For example, if the
numbers involved in a computation may be arbitrarily large, the time required by a single addition can no
longer be assumed to be constant.
- the uniform cost model, also called uniform-cost measurement (and similar variations), assigns a
constant cost to every machine operation, regardless of the size of the numbers involved
- the logarithmic cost model, also called logarithmic-cost measurement (and similar variations),
assigns a cost to every machine operation proportional to the number of bits involved
The latter is more cumbersome to use, so it's only employed when necessary, for example in the analysis
of arbitrary-precision arithmetic algorithms, like those used in cryptography.
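The difference between the two cost models can be sketched by charging for the additions needed to sum a list. The charging rule used for the logarithmic model here (the bit length of the larger operand per addition) is a simplifying assumption for illustration, and the function names are made up.

```python
def uniform_cost_sum(numbers):
    """Uniform model: every addition costs 1, whatever the operand size."""
    return max(len(numbers) - 1, 0)

def logarithmic_cost_sum(numbers):
    """Logarithmic model: each addition costs the bit length of its larger operand."""
    if not numbers:
        return 0
    total = numbers[0]
    cost = 0
    for n in numbers[1:]:
        cost += max(total.bit_length(), n.bit_length(), 1)
        total += n
    return cost

small = [3, 5, 7]
big = [2 ** 1000, 2 ** 1000, 2 ** 1000]   # arbitrarily large operands

print(uniform_cost_sum(small), uniform_cost_sum(big))
print(logarithmic_cost_sum(small), logarithmic_cost_sum(big))
```

Both lists need two additions, so the uniform model cannot tell them apart; the logarithmic model charges far more for the large operands, which is why it is preferred for arbitrary-precision arithmetic.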
A key point which is often overlooked is that published lower bounds for problems are often given for a
model of computation that is more restricted than the set of operations that you could use in practice and
therefore there are algorithms that are faster than what would naively be thought possible.[7]
Run-time analysis
Run-time analysis is a theoretical classification that estimates and anticipates the increase in running
time (or run-time) of an algorithm as its input size (usually denoted as n) increases. Run-time efficiency is a
topic of great interest in computer science: A program can take seconds, hours, or even years to finish
executing, depending on which algorithm it implements. While software profiling techniques can be used
to measure an algorithm's run-time in practice, they cannot provide timing data for all infinitely many
possible inputs; the latter can only be achieved by the theoretical methods of run-time analysis.
Take as an example a program that looks up a specific entry in a sorted list of size n. Suppose this
program were implemented on Computer A, a state-of-the-art machine, using a linear search algorithm,
and on Computer B, a much slower machine, using a binary search algorithm. Benchmark testing on the
two computers running their respective programs might look something like the following:
References and Supplementary Materials
Books and Journals
1. Sanjiv Ranjan Das; 2016; Data Science: Theories, Models, Algorithms and Analytics; S. R. Das
2. Richard Johnsonbaugh; 2005; Introduction to Discrete Mathematics; Pearson
Education South Asia Pacific
Module 7-Statistical Computations
Ex. 92, 84, 65, 76, 88, 90
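The worked solution for this example is not present in this copy. As a hedged illustration only, and assuming the six values are scores whose center is to be measured, the mean and median can be computed with Python's standard `statistics` module:

```python
import statistics

scores = [92, 84, 65, 76, 88, 90]

mean_score = statistics.mean(scores)       # sum of the scores divided by their count
median_score = statistics.median(scores)   # average of the two middle sorted scores

print(mean_score, median_score)
```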
B. Measures of Dispersion/Spread/Variation
For the above data, variance = 29.5
percentile = (576/900) x 100 = 64
Elaine's score places her at the 64th percentile
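The percentile computation above follows one rule: the number of scores below the given score, divided by the total number of scores, times 100. A minimal sketch, with function names chosen for this example:

```python
def percentile_rank(count_below, total):
    """Percent of all scores that fall below the given score."""
    return 100 * count_below / total

def percentile_rank_from_scores(scores, score):
    """Same rule, counting the scores strictly below `score` first."""
    count_below = sum(1 for s in scores if s < score)
    return percentile_rank(count_below, len(scores))

# Elaine: 576 of 900 scores are below hers.
print(percentile_rank(576, 900))

# A small made-up list: 3 of the 5 scores are below 80.
print(percentile_rank_from_scores([60, 70, 75, 80, 90], 80))
```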
Select one:
a. 4100
b. 4000
c. 4275
d. 4215
Feedback
Question 2
Incorrect
Mark 0.00 out of 1.00
Question text
Select one:
a. 0.9
b. 0.56
c. -0.43
d. 1.2
Question 3
Correct
Mark 1.00 out of 1.00
Question text
A vegetable distributor knows that during the month of August, the weights of tomatoes
are normally distributed with a mean of 0.61 lb and a standard deviation of 0.15 lb. What
percent of the tomatoes weigh less than 0.71 lb?
Select one:
a. 95
b. 97
c. 84
d. 85
Feedback
Question 4
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. google search
b. google games
c. google drive
d. google map
Feedback
Question 5
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. normal
b. kurtic
c. skewed
d. standard
Feedback
Question 6
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. 0
b. -1
c. 1
d. 2
Feedback
Question 7
Correct
Mark 1.00 out of 1.00
Question text
According to Hilary Mason, which is NOT a skill that a good data scientist must cultivate?
Select one:
a. critical thinking
b. communication
c. coding
Question 8
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. ogive
b. bar graph
c. pie graph
d. histogram
Feedback
Question text
Select one:
a. velocity
b. variety
c. vastness
d. viscosity
Feedback
Question 10
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. J Pastor
b. G.Cantor
c. N.R. Drops
d. DJ Patil
Feedback
Your answer is correct.
Question 11
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. 500
b. 200
c. 300
d. 400
Feedback
Question 12
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. analytic models
b. decision support tools
c. interlinked data output
d. graphs
Feedback
Question 13
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. percentile
b. mode
c. median
d. mean
Feedback
Question 14
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. data analyst
b. data expert
c. data scientist
d. data drive
Feedback
Question 15
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. datafication
b. analytics
c. dataology
d. mining
Feedback
Question 16
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. percent distribution
b. relative distribution
c. frequency distribution
Question 17
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. depth
b. velocity
c. volume
d. analytics
Feedback
Question 18
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. mean
b. median
c. percentile
d. quartile
Feedback
Question 19
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. time
b. process
c. technical expertise
d. data
Feedback
Question 20
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. Dennis Grant
b. Roland Patil
c. William Gillason
d. Wiliam Harvey
Feedback
Question 21
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. prediction
b. interpretation
c. analysis
d. critical thinking
Feedback
Question 23
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. variance
b. deviation
c. range
d. mean
Feedback
Question 24
Correct
Mark 1.00 out of 1.00
Question text
Select one:
a. text
b. text mining
c. volume
d. sorting
Feedback
The Art of Data Science
“All models are wrong, but some are useful.”
George E. P. Box and N.R. Draper in “Empirical Model Building and Response
Surfaces,” John Wiley & Sons, New York, 1987.
So you want to be a “data scientist”? There is no widely accepted definition of
who a data scientist is.[1] Several books now attempt to define what data science
is and who a data scientist may be, see Patil (2011), Patil (2012), and Loukides
(2012). This book’s viewpoint is that a data scientist is someone who asks unique,
interesting questions of data based on formal or informal theory, to generate
rigorous and useful insights.[2] It is likely to be an individual with
multi-disciplinary training in computer science, business, economics, statistics,
and armed with the necessary quantity of domain knowledge relevant to the
question at hand. The potential of the field is enormous, for just a few
well-trained data scientists armed with big data have the potential to transform
organizations and societies. In the narrower domain of business life, the role of
the data scientist is to generate applicable business intelligence.
[1] The term “data scientist” was coined by D.J. Patil. He was the Chief Scientist
for LinkedIn. In 2011 Forbes placed him second in their Data Scientist List, just
behind Larry Page of Google.
[2] To quote Georg Cantor - “In mathematics the art of proposing a question must
be held of higher value than solving it.”
Among all the new buzzwords in business – and there are many – “Big Data” is one
of the most often heard. The burgeoning social web, and the growing role of the
internet as the primary information channel of business, has generated more data
than we might imagine. Users upload an hour of video data to YouTube every
second.[3] 87% of the U.S. population has heard of Twitter, and 7% use it.[4]
Forty-nine percent of Twitter users follow some brand or the other, hence the
reach is enormous, and, as of 2014, there are more than 500 million tweets a day.
But data is not information, and until we add analytics, it is just noise. And
more, bigger, data may mean more noise and does not mean better data. In many
cases, less is more, and we need models as well. That is what this book is about,
it’s about theories and models, with or without data, big or small. It’s about
analytics and applications, and a scientific approach to using data based on
well-founded theory and sound business judgment. This book is about the science
and art of data analytics.
[3] Mayer-Schönberger and Cukier (2013), p8. They report that USC’s Martin Hilbert
calculated that more than 300 exabytes of data storage was being used in 2007, an
exabyte being one billion gigabytes, i.e., 10^18 bytes, and 2^60 of binary usage.
[4] In contrast, 88% of the population has heard of Facebook, and 41% use it. See
www.convinceandconvert.com/7-surprising-statistics-about-twitter-in-america/.
Half of Twitter users are white, and of the remaining half, half are black.
Data science is transforming business. Companies are using medical data and
claims data to offer incentivized health programs to employees. Caesar’s
Entertainment Corp. analyzed data for 65,000 employees and found substantial cost
savings. Zynga Inc, famous for its game Farmville, accumulates 25 terabytes of
data every day and analyzes it to make choices about new game features. UPS
installed sensors to collect data on speed and location of its vans, which
combined with GPS information, reduced fuel usage in 2011 by 8.4 million gallons,
and shaved 85 million miles off its routes.[5] McKinsey argues that a successful
data analytics plan contains three elements: interlinked data inputs, analytics
models, and decision-support tools.[6] In a seminal paper, Halevy, Norvig and
Pereira (2009), argue that even simple theories and models, with big data, have
the potential to do better than complex models with less data.
[5] “How Big Data is Changing the Whole Equation for Business,” Wall Street
Journal March 8, 2013.
[6] “Big Data: What’s Your Plan?” McKinsey Quarterly, March 2013.
In a recent talk[7] well-regarded data scientist Hilary Mason emphasized that the
creation of “data products” requires three components: data (of course) plus
technical expertise (machine-learning) plus people and process (talent). Google
Maps is a great example of a data product that epitomizes all these three
qualities. She mentioned three skills that good data scientists need to
cultivate: (a) in math and stats, (b) coding, (c) communication. I would add that
preceding all these is the ability to ask relevant questions, the answers to
which unlock value for companies, consumers, and society. Everything in data
analytics begins with a clear problem statement, and needs to be judged with
clear metrics.
[7] At the h2o world conference in the Bay Area, on 11th November 2015.
Being a data scientist is inherently interdisciplinary. Good questions come from
many disciplines, and the best answers are likely to come from people who are
interested in multiple fields, or at least from teams that co-mingle varied skill
sets. Josh Wills of Cloudera stated it well - “A data scientist is a person who is
better at statistics than any software engineer and better at software
engineering than any statistician.” In contrast, complementing data scientists
are business analytics people, who are more familiar with business models
and paradigms and can ask
1.1 Volume, Velocity, Variety
There are several "V"s of big data: three of these are volume, velocity,
variety.[8] Big data exceeds the storage capacity of conventional databases. This
is its volume aspect. The scale of data generation is mind-boggling. Google’s Eric
Schmidt pointed out that until 2003, all of humankind had generated just 5
exabytes of data (an exabyte is 1000^6 bytes or a billion-billion bytes). Today we
generate 5 exabytes of data every two days. The main reason for this is the
explosion of “interaction” data, a new phenomenon in contrast to mere
“transaction” data. Interaction data comes from recording activities in our
day-to-day ever more digital lives, such as browser activity, geo-location data,
RFID data, sensors, personal digital recorders such as the fitbit and phones,
satellites, etc. We now live in the “internet of things” (or IoT), and it’s
producing a wild quantity of data, all of which we seem to have an endless need
to analyze. In some quarters it is better to speak of 4 Vs of big data, as shown
in Figure 1.1.
[8] This nomenclature was originated by the Gartner group in 2001, and has been in
place more than a decade.
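The size of an exabyte, and what "5 exabytes every two days" means as a rate, can be worked out with a few lines of integer arithmetic. This is only a back-of-the-envelope sketch; the variable names are made up, and the per-second figure is derived here from the quoted numbers rather than taken from the text.

```python
GIGABYTE = 10 ** 9
EXABYTE = 1000 ** 6                 # a billion-billion bytes, i.e. 10^18

# An exabyte is one billion gigabytes.
gigabytes_per_exabyte = EXABYTE // GIGABYTE

# The binary counterpart, 2^60 bytes, is about 15% larger.
binary_ratio = 2 ** 60 / EXABYTE

# Quoted rate: 5 exabytes every two days, expressed per second.
seconds_in_two_days = 2 * 24 * 60 * 60
bytes_per_second = 5 * EXABYTE // seconds_in_two_days

print(gigabytes_per_exabyte)
print(round(binary_ratio, 3))
print(bytes_per_second)
```

At that rate, roughly 29 terabytes are generated every second.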
Figure 1.1: The Four Vs of Big Data.
A good data scientist will be adept at managing volume not just technically in a
database sense, but by building algorithms to make intelligent use of the size of
the data as efficiently as possible. Things change when you have gargantuan data
because almost all correlations become significant, and one might be tempted to
draw spurious conclusions about causality. For many modern business applications
today extraction of correlation is sufficient, but good data science involves
techniques that extract causality from these correlations as well.
In many cases, detecting correlations is useful as is. For example, consider the
classic case of Google Flu Trends, see Figure 1.2. The figure shows the high
correlation between flu incidence and searches about “flu” on Google, see Ginsberg
et. al. (2009). Obviously searches on the key word “flu” do not result in the flu
itself! Of course, the incidence of searches on this key word is influenced by flu
outbreaks. The interesting point here is that even though searches about flu do
not cause flu, they correlate with it, and may at times even be predictive of it,
simply because searches lead the actual reported levels of flu, as those may occur
concurrently but take time to be reported. And whereas searches may be
predictive, the cause of searches is the flu itself, one variable feeding on the
other, in a repeat cycle.[9] Hence, prediction is a major outcome of correlation,
and has led to the recent buzz around the subfield of “predictive analytics.”
There are entire conventions devoted to this facet of correlation, such as the
wildly popular PAW (Predictive Analytics World).[10] Pattern recognition is in,
passe causality is out.
[9] Interwoven time series such as these may be modeled using Vector
AutoRegressions, a technique we will encounter later in this book.
[10] May be a futile collection of people, with non-working crystal balls, as
William Gibson said - “The future is not google-able.”
Figure 1.2: Google Flu Trends. The figure shows the high correlation between flu
incidence and searches about “flu” on Google. The orange line is actual US flu
activity, and the blue line is the Google Flu Trends estimate.
Data velocity is accelerating. Streams of tweets, Facebook entries, financial
information, etc., are being generated by more users at an ever increasing pace.
Whereas velocity increases data volume, often exponentially, it might shorten the
window of data retention or application. For example, high-frequency trading
relies on micro-second information and streams of data, but the relevance of the
data rapidly decays.
Finally, data variety is much greater than ever before. Models that relied on just
a handful of variables can now avail of hundreds of variables, as computing power
has increased. The scale of change in volume, velocity, and variety of the data
that is now available calls for new econometrics, and a range of tools for even
single questions. This book aims to introduce the reader to a variety of modeling
concepts and econometric techniques that are essential for a well-rounded data
scientist.
Data science is more than the mere analysis of large data sets. It is also about
the creation of data. The field of “text-mining” expands available data
enormously, since there is so much more text being generated than numbers. The
creation of data from varied sources, and its quantification into information is
known as “datafication.”
1.2 Machine Learning
Data science is also more than “machine learning