How Big Is The World Wide Web


Problems with the question about the size of the Web
1. Which two reasons make it difficult to answer the question "What is the size of the World Wide Web?"

It is not clear what size means.


It is underspecified what we understand by the World Wide Web.
The dark web makes it hard to index all web pages.
Once the size of something exceeds a petabyte, it cannot be measured precisely.
2. Looking at this picture of the Web, what modelling choices could have been made?
web pages are nodes of a graph and edges are links
domains are nodes of a graph and an edge exists for every link between domains
all of the above
none of the above
3. Which of the following viewpoints are commonly used when modelling the World Wide Web?
software system
a diagram
graph of connected web pages
a mathematical function
a spider web
collection of text documents
4. How could you measure the size of a distributed software system?
count how many computers have the software installed
gather information about the computing time being used
you cannot do so
the lines of code of each installed software component are a good indicator

3 ways to study the Web


1. Which are the fundamental types of models the Web can be seen as?
Descriptive Model
Predictive Model
Software system
Graph Model
Generative Model
Collection of Text documents
2. Which of the following are pros of looking at the Web as a collection of text documents?
Methods from the field of software engineering can be applied
Amount of information on the Web can be quantified via entropy
Methods from natural language processing and information retrieval can be applied
Structural information can easily be analysed
3. Which of the following are cons of looking at the Web as a graph of connected web pages?
Large amount of data to Model
Ignores crucial information
Trust of web pages cannot be determined
Hard to have a good measure for the size of the Web
4. Which of the following statements about modelling the Web are true?
Generative models are created to understand why certain properties arise in the descriptive models of the Web.
Descriptive models are created to understand why certain properties arise in the generative models of the Web.
Descriptive models help to understand what properties the World Wide Web has.
When studying the Web as a graph, one must use generative models.
When studying the Web as a collection of text documents, one must use descriptive models.
Descriptive and generative models can be used to model collections of text documents as well as graphs of web pages.

A simplistic descriptive model


1. How many words does the following sentence consist of: "John F. Kennedy visited New York."
3
4
5
6
The correct answer depends on modelling choices that are not further specified here
all of the above are possible
2. Assume that sentences end with punctuation marks and that everything between two successive whitespace characters is considered a word. How many sentences and words can be counted in the following sequence: "John F. Kennedy visited New York"
1 sentence with 6 words
1 sentence with 5 words
2 sentences with 2 words in the first sentence and 4 words in the second one
2 sentences with 2 words in the first sentence and 3 words in the second one
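The counting rule from question 2 can be sketched in Python. This is only a minimal sketch: the function name is hypothetical, and it assumes that a token ending in ".", "!" or "?" closes a sentence and that leftover words without closing punctuation still form a sentence. Neither assumption is fixed by the question itself.

```python
def sentence_word_counts(text):
    """Return the number of words in each sentence of `text`.

    Assumptions (not given in the quiz): words are whitespace-separated
    tokens, a token ending in '.', '!' or '?' closes a sentence, and
    trailing words without closing punctuation form a final sentence.
    """
    counts = []
    current = 0
    for token in text.split():
        current += 1
        if token[-1] in ".!?":
            counts.append(current)
            current = 0
    if current:
        counts.append(current)
    return counts


print(sentence_word_counts("John F. Kennedy visited New York"))
```

Under these assumptions the abbreviation "F." is treated as the end of a sentence, which illustrates how strongly the result depends on the modelling choice.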
3. You want to measure the size of the Simple English Wikipedia by counting words. Which of the following are strong assumptions that have an impact on the result?
Wikipedia is identical to the crawl of it
Words are separated by whitespace
All pages are reached by the crawler
The size should be measured in bytes

An unrealistic, simplistic generative model
1. Why is formulating a hypothesis so crucial in the process of scientific modelling?
Formulating a hypothesis clears the path towards a clearly defined model
Simplifying assumptions are often built into the hypothesis and the model derived from it
2. Every minute, 0.19305 words are generated on the Simple English Wikipedia
True
False
3. Why should one have several runs of a generative probabilistic model?
in the first run the caches need to warm up
to get statistical stability
because random experiments can produce strong outliers in just one run
there is no cost in making sure the computer did correct calculations by running the experiment twice or more
because every scientific experiment should be repeated more than once to avoid mistakes
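The effect of repeated runs can be illustrated with a toy simulation. The model below is purely hypothetical (one word is generated per minute with a fixed probability `p`); it is not the model from the course and only shows how single runs scatter around the mean.

```python
import random
import statistics


def run_once(rng, n_minutes=1000, p=0.2):
    """One run of a hypothetical toy model: in each minute a word is
    generated with probability p. Returns the number of generated words."""
    return sum(1 for _ in range(n_minutes) if rng.random() < p)


def repeated_runs(n_runs=20, seed=0):
    """Repeat the experiment with a seeded generator so it is reproducible."""
    rng = random.Random(seed)
    return [run_once(rng) for _ in range(n_runs)]


runs = repeated_runs()
print("smallest run:", min(runs))
print("largest run: ", max(runs))
print("mean of runs:", statistics.mean(runs))
print("std deviation:", statistics.stdev(runs))
```

A single run can land noticeably above or below the expected value, while the mean over many runs is more stable.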

Counting Words And Documents


1. How many words are in the following sentence? "It is really really difficult to count words."
6
7
8
9
2. How many unique word tokens are in the following sentence? "It is really really difficult to count words."
6
7
8
9
3. Which of the following command line tools can be used to count words?
cat
ls
wc
tr
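Counting all word tokens versus unique word tokens can be sketched as follows, assuming that words are simply whitespace-separated and that punctuation stays attached to the word. Other tokenisation choices would be equally defensible.

```python
sentence = "It is really really difficult to count words."

# Assumption: a word is anything between whitespace; "words." keeps its dot.
tokens = sentence.split()
unique_tokens = set(tokens)

print("word tokens:       ", len(tokens))
print("unique word tokens:", len(unique_tokens))
```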

Typical length of a document


1. Given the list [5,3,8,2,7], which value is the median?
2
3
5
7
8
2. Given the list [2,3,5,7,8], which value is the median?
2
3
5
7
8

3. Given the list [2,3,5,7,8], which value is the average?


2
3
5
7
8
4. Given the list [1,2,3,4,5,6,7], which statements are true?
The median is exactly the average
The median is smaller than the average
The average is smaller than the median
4 is the average
4 is the median
7 is the median
For any list of elements, the median is smaller than the number of elements.
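Median and average for the lists above can be checked with Python's standard `statistics` module. The helper name `summarize` is only an illustration.

```python
import statistics


def summarize(values):
    """Return (median, mean) of a list of numbers."""
    return statistics.median(values), statistics.mean(values)


for values in ([5, 3, 8, 2, 7], [2, 3, 5, 7, 8], [1, 2, 3, 4, 5, 6, 7]):
    median, mean = summarize(values)
    print(values, "median:", median, "mean:", mean)
```

The median is taken from the sorted list, so the order in which the values are given does not matter.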

How to formulate a research hypothesis


1. Which of the following hypotheses are falsifiable?
No word is longer than 100 characters on the World Wide Web
At some time the Web will reach its maximum size
All web pages in the Google index have well-formatted HTML syntax
Tim Berners-Lee stole HTML from Swiss people who had already created web pages in the 80s, which are now hidden on the dark web.
2. What should you do after you have come up with a research hypothesis?
Convince everyone that this hypothesis will be true
find testable predictions
complete the first two steps of the circle of research to have a good reason why
you came up with that research question
find datasets which can be used to find evidence for the hypothesis
make sure that the hypothesis is falsifiable.

Number of words needed to understand most of Wikipedia

We saw that more than half of the unique word tokens on Simple English Wikipedia occurred only once. Which of the following statements are true?
picking a random word from the Simple English Wikipedia, the chance is higher than 50% that it occurred only once
picking 100 random words from Wikipedia, we expect more than 50 of them to occur only once
picking a random word, the chance of getting a word that occurs only once is less than 10%
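The difference between "share of unique words that occur once" and "chance that a randomly picked word occurs once" can be made concrete with a toy example. The token list below is invented for illustration and says nothing about the real Wikipedia numbers.

```python
from collections import Counter


def once_only_shares(tokens):
    """Return (share among unique words, share among all tokens) of the
    words that occur exactly once."""
    counts = Counter(tokens)
    once_only = [word for word, count in counts.items() if count == 1]
    return len(once_only) / len(counts), len(once_only) / len(tokens)


# Invented toy text: 5 unique words, 9 tokens, 3 words occurring once.
toy_tokens = "a a a a b b c d e".split()
share_of_unique, share_of_tokens = once_only_shares(toy_tokens)
print("share of unique words occurring once:", share_of_unique)
print("chance that a random token occurs once:", share_of_tokens)
```

In the toy text more than half of the unique words occur once, yet a randomly drawn token is much less likely to be one of them, because frequent words dominate the draw.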

Linguists' way of checking the simplicity of text


Why do we prefer the Automated Readability Index over the Flesch-Kincaid readability test?
it is easier to count
it is faster to compute
it has a deeper linguistic insight
it is more accurate
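The Automated Readability Index only needs counts of characters, words and sentences, whereas Flesch-Kincaid additionally needs syllable counts. A minimal sketch of the usual ARI formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43, is shown below. The tokenisation rules in the code are simplifying assumptions.

```python
def automated_readability_index(text):
    """Compute the ARI of `text`.

    Assumptions: words are whitespace-separated, characters are letters
    and digits only, and a word ending in '.', '!' or '?' ends a sentence.
    """
    words = text.split()
    characters = sum(ch.isalnum() for ch in text)
    sentences = max(1, sum(1 for word in words if word[-1] in ".!?"))
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / sentences
            - 21.43)


print(automated_readability_index("The cat sat. The dog ran."))
```

Very simple text can produce a negative index, as in this example with three-letter words and three-word sentences.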

The Zipf law for text


What do you know about Zipf's law?
Plotting the rank of words against their frequency appears as a straight line
the word rank multiplied by its frequency is supposed to be roughly constant
on the Simple English Wikipedia dataset the law only seems to hold for the top-ranked words
Zipf's law has been falsified for many years and is only taught for historical reasons
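The "rank times frequency" reading of Zipf's law can be checked on a word list. The toy data below is constructed so that the product is exactly constant; real text only follows the law approximately.

```python
from collections import Counter


def rank_times_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) for every word,
    with rank 1 assigned to the most frequent word."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Invented toy corpus with frequencies 60, 30, 20, 15, 12.
toy_tokens = (["the"] * 60 + ["of"] * 30 + ["and"] * 20
              + ["to"] * 15 + ["in"] * 12)

for row in rank_times_frequency(toy_tokens):
    print(row)
```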

Visually straight lines on log log plots


1. Which functions appear as straight lines on log-log plots?
linear functions
sine, cosine, tan
power functions
exponential functions
logarithms
square root functions

2. Which functions appear as straight lines on linear plots?


linear functions
sine, cosine, tan
power functions
exponential functions
logarithms
square root functions

3. Which functions appear as straight lines on plots that are logarithmic only on the y-axis?
linear functions
sine, cosine, tan
power functions
exponential functions
logarithms
square root functions
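Whether a curve is straight on a given plot can be tested numerically: a straight line has the same slope between all neighbouring points. The sketch below checks this for one power function on log-log axes and one exponential function on axes that are logarithmic only in y. The constants are arbitrary choices.

```python
import math


def slopes(xs, ys):
    """Slopes between neighbouring points; all equal means a straight line."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]


xs = [1, 2, 4, 8, 16]
power = [3 * x ** -1.5 for x in xs]               # y = 3 * x^(-1.5)
exponential = [2 * math.exp(0.5 * x) for x in xs]  # y = 2 * e^(0.5 x)

# log-log axes: log(y) against log(x)
loglog_slopes = slopes([math.log(x) for x in xs],
                       [math.log(y) for y in power])
# semi-log axes: log(y) against x
semilog_slopes = slopes(xs, [math.log(y) for y in exponential])

print("power function, log-log slopes: ", loglog_slopes)
print("exponential, semi-log slopes:   ", semilog_slopes)
```

The constant slope equals the exponent of the power function in the first case and the growth rate of the exponential in the second.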

Fitting a curve on a log log plot

Zipf law, power law or Pareto law


1. What is true about the power law plot for words?
The x-axis depicts the frequency
The x axis depicts how many words occur exactly y times
The x axis depicts how many words occur exactly x times
The y-axis depicts the frequency
The y-axis depicts how many words occur exactly x times
The y-axis depicts how many words occur exactly y times
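The data behind such a plot is a "frequency of frequencies" table. The sketch below builds it and fits a straight line on log-log axes by least squares. It assumes that x is the number of occurrences and y is the number of words with exactly that many occurrences; the function names are hypothetical, and least squares on logarithms is only one simple way to fit such a curve.

```python
import math
from collections import Counter


def frequency_of_frequencies(tokens):
    """Map x -> number of distinct words that occur exactly x times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x) for (x, y) points."""
    log_x = [math.log(x) for x, _ in points]
    log_y = [math.log(y) for _, y in points]
    mean_x = sum(log_x) / len(points)
    mean_y = sum(log_y) / len(points)
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(log_x, log_y))
    variance = sum((x - mean_x) ** 2 for x in log_x)
    return covariance / variance


print(frequency_of_frequencies("a a a b b c d".split()))

# Synthetic points that follow y = 100 * x^(-2) exactly.
synthetic = [(x, 100 * x ** -2) for x in range(1, 6)]
print("fitted exponent:", loglog_slope(synthetic))
```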

Similarity Measures and their Applications

1. Is there a connection between similarities and distance functions?
No, not at all
Yes by taking the inverse
Yes by taking the negative value
Yes via log and exponential function

2. Which of the following are applications of similarity measures?


Vectorspaces
Recommender systems
Machine Learning
Information Retrieval
Jaccard Coefficient

3. Which of the following statements is true?
Similarity measures can always be normalized
Similarity measures need to be transitive
Similarity measures have to be symmetric
Similarity measures have to have equal self-similarity for all elements
They can have negative values

Jaccard Similarity for Sets


1. Given D1 = a a a b and D2 = b b b a, what is the Jaccard coefficient of the corresponding word sets?
1/1
2/4
2/8
2/6

2. Given D1 = a b c d e and D2 = e f g h, what is the Jaccard coefficient of the corresponding word sets?
0
1/7
1/8
1/9
2/8
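The Jaccard coefficient of two word sets is the size of their intersection divided by the size of their union. A minimal sketch, assuming documents are whitespace-separated word lists and that two empty sets count as identical:

```python
def jaccard(document_a, document_b):
    """Jaccard coefficient of the word sets of two documents."""
    set_a = set(document_a.split())
    set_b = set(document_b.split())
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty documents are treated as identical
    return len(set_a & set_b) / len(union)


print(jaccard("a a a b", "b b b a"))
print(jaccard("a b c d e", "e f g h"))
```

Because the documents are turned into sets first, repeated words have no influence on the result.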

Cosine Similarity For Vectorspaces

Probabilistic Similarity Measures: Kullback-Leibler Divergence


Smoothing is needed
to make sure the probability function will not take the value 0
because this will always yield more accurate results
because otherwise the query likelihood model would have sparse results for many queries
all of the above
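The role of smoothing becomes visible when the Kullback-Leibler divergence is computed between two word distributions: a word with probability 0 in the second distribution would lead to a division by zero. The sketch below uses additive (add-alpha) smoothing over a shared vocabulary, which is one common choice and not necessarily the one used in the course.

```python
import math
from collections import Counter


def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Relative frequencies with additive smoothing, so that every word of
    the vocabulary gets a probability greater than 0."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) using the natural logarithm."""
    return sum(p[word] * math.log(p[word] / q[word])
               for word in p if p[word] > 0)


d1 = "a a a b".split()
d2 = "b b b c".split()
vocabulary = set(d1) | set(d2)

p = smoothed_distribution(d1, vocabulary)
q = smoothed_distribution(d2, vocabulary)
print("D(p || q) =", kl_divergence(p, q))
print("D(q || p) =", kl_divergence(q, p))
```

Without smoothing, the word "a" would have probability 0 in the second document and D(p || q) would not be finite.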

Comparing Results of Similarity Measures


1. Which method is best suited to finding characteristic words of a text?
Jaccard
TF-IDF
TF
Language Model
Smoothed Language Model

2. Which method works well in an information retrieval setting?
Jaccard
TF-IDF
Language Model
Smoothed Language Model

3. Which method should be used when you don't have several occurrences of the same elements?
Jaccard
TF-IDF
Language Model
Smoothed Language Model
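To make the comparison more tangible, TF-IDF can be sketched in a few lines. The weighting used here, raw term frequency times log(N / document frequency), is one of several common variants, and the three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dictionary per tokenised document, using
    score = term frequency * log(number of documents / document frequency)."""
    n_docs = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))
    scores = []
    for tokens in documents:
        term_frequency = Counter(tokens)
        scores.append({word: term_frequency[word]
                       * math.log(n_docs / document_frequency[word])
                       for word in term_frequency})
    return scores


docs = ["the web is a graph".split(),
        "the web is a text collection".split(),
        "the cat sat".split()]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

Words that appear in every document get a score of 0, while words that are specific to one document are ranked highest.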

Introduction to generative modelling


How can a probability distribution be created from frequencies?
by going to relative frequencies
by dividing each frequency by the sum of all frequencies
by applying the Gauss algorithm
this is impossible
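Turning absolute frequencies into relative frequencies can be sketched as follows. The function name and the example counts are only illustrative.

```python
def to_probabilities(frequencies):
    """Divide each frequency by the sum of all frequencies."""
    total = sum(frequencies.values())
    return {outcome: count / total for outcome, count in frequencies.items()}


word_counts = {"web": 3, "graph": 1}
distribution = to_probabilities(word_counts)
print(distribution)
print("sum of probabilities:", sum(distribution.values()))
```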

Sampling from a probability distribution


What does our sampling process ensure?
the sampled values will most likely follow the given probability distribution
the sampled values will certainly follow the given probability distribution
the sampled values are not biased but are uniformly distributed
the value of the Kolmogorov-Smirnov test for the sampled values and the original distribution should be smaller the more values are sampled
the value of the Kolmogorov-Smirnov test for the sampled values and the original distribution should be bigger the more values are sampled
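One way to sample from a discrete distribution is to draw a uniform random number and look it up in the cumulative distribution. The sketch below does this and then measures the largest gap between the empirical and the given cumulative distribution, which is the Kolmogorov-Smirnov statistic for this discrete case. The distribution and the seed are arbitrary assumptions, and the course may use a different sampling procedure.

```python
import bisect
import itertools
import random
from collections import Counter


def sample(distribution, n, rng):
    """Draw n outcomes by inverting the cumulative distribution."""
    outcomes = sorted(distribution)
    cumulative = list(itertools.accumulate(distribution[o] for o in outcomes))
    samples = []
    for _ in range(n):
        index = bisect.bisect_right(cumulative, rng.random())
        # Guard against rounding: the last cumulative value may be below 1.
        samples.append(outcomes[min(index, len(outcomes) - 1)])
    return samples


def ks_statistic(samples, distribution):
    """Largest gap between empirical and given cumulative distribution."""
    counts = Counter(samples)
    empirical = expected = largest_gap = 0.0
    for outcome in sorted(distribution):
        empirical += counts[outcome] / len(samples)
        expected += distribution[outcome]
        largest_gap = max(largest_gap, abs(empirical - expected))
    return largest_gap


distribution = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(42)
for n in (100, 10_000, 100_000):
    print(n, ks_statistic(sample(distribution, n, rng), distribution))
```

The samples only follow the distribution approximately, and the gap tends to shrink as more values are drawn.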

Evaluating a generative model


What is true if the statistics of a generative model match the statistics of the descriptive model?
The generated data can still be very different from the observed data.
The generated model is a perfect match of the observed data
There could be other statistics which we haven't looked at that might not match.
The model parameters of the generative model could give some reason why we can observe something
One should try to decrease the number of model parameters

Pitfalls when increasing the number of model parameters
1. Increasing the number of model parameters often yields more accurate generative models. Why should one be careful about doing so?
More model parameters always lead to a worse complexity class of the algorithm
Once a certain number of parameters is reached, one might not get an interesting insight from the parameter set
We are aiming for simplicity of our models.

2. Which of the following rules of thumb is most likely true?
increasing the number of model parameters has a good chance of creating a generative model whose statistics better match the observed data
decreasing the number of model parameters leads to simpler models
doubling the number of model parameters will decrease the error of the model by 50%
doubling the number of model parameters will increase the error of the model by 50%

Reviewing terms from graph theory

2. What kind of mathematical object is used to describe a graph labeling?


set
element
function
matrix
vector
String

3. Which of the following are types of graphs that you know?


heavy graphs
complex graphs
directed graphs
difficult graphs
bipartite graphs
robust graphs
web graphs
weighted graphs

The standard web graph model


1. In the standard web graph model, vertices correspond to...
web sites
web pages
urls
anchor texts
authors

2. In the standard web graph model, edges correspond to...


web sites
web pages
urls
anchor texts
authors

3. Which of the following properties are used in the standard web graph model?
bipartite
edge labeled
vertex labeled
directed
undirected
weighted

Descriptive statistics of the web graph


1. Given a random web crawl, which of the following statements would you expect to be true?
the highest in-degree would be smaller than the highest out-degree
counting the anchor tags in one HTML document gives the in-degree of the node representing this document
in-degrees can be counted exactly
the in-degree and out-degree distributions will take the same values.

2. Which statements with regard to the Gini coefficient are true?


high values mean that the measured distribution is not very equal
low values mean perfect equality
the gini coefficient can take values between 0 and infinity
the gini coefficient can take values between -1 and 1
the gini coefficient can take values between 0 and 1
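In-degrees, out-degrees and the Gini coefficient of a degree distribution can be computed from an edge list. The tiny graph below is invented, and the Gini formula used is a standard one for sorted non-negative values; how the course computes it is not stated here.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for directed (source, target) edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Invented toy web graph: page "c" receives most of the links.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
nodes = sorted({node for edge in edges for node in edge})
in_degree, out_degree = degrees(edges)

# Include pages with degree 0, which do not show up in the counters.
in_values = [in_degree[node] for node in nodes]
out_values = [out_degree[node] for node in nodes]
print("in-degrees: ", in_values, "Gini:", gini(in_values))
print("out-degrees:", out_values, "Gini:", gini(out_values))
```

Note that the in-degree of a page can only be computed from the links found on other pages, so an incomplete crawl gives only a lower bound.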

Topology of the web graph


1. What is true about the largest strongly connected component of the World Wide Web?
It consists only of the web pages of the World Wide Web Consortium and CERN.
the diameter will be surprisingly small (presumably less than 100)
there is at least one path of links from one URL to any other URL in the strongly connected component
every node inside the strongly connected component has at least 3 incoming edges
every node inside the strongly connected component has at least 3 common nodes with any other node

2. Which of the following statements about the bow tie structure of the Web are true?
the in component of the bow tie model can easily be crawled.
the out component cannot be crawled since search engines cannot find it.
two random nodes from the in component can have a path between them
two random nodes of the out component can have a path between them
if a new link is created from a node of the out component to a node of the in component, both nodes will then be part of the strongly connected component
the strongly connected component is the intersection of the in component with the out component
the strongly connected component is the union of the in component with the out component
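The bow tie decomposition can be sketched with two reachability searches from a page that is known to lie in the core. The function names and the toy graph are hypothetical. The sketch reports the pages that can reach the core and the pages reachable from the core separately from the core itself; whether the in and out components are defined to include the core is a matter of convention.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """All nodes reachable from `start` by following edges in `adjacency`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def bow_tie(edges, seed):
    """Return (core, upstream_only, downstream_only) relative to the
    strongly connected component that contains `seed`."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)
    downstream = reachable(seed, forward)   # pages the seed can reach
    upstream = reachable(seed, backward)    # pages that can reach the seed
    core = downstream & upstream
    return core, upstream - core, downstream - core


# Invented toy graph: a -> b <-> c -> d
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
core, upstream_only, downstream_only = bow_tie(edges, seed="b")
print("core:", sorted(core))
print("can reach the core:", sorted(upstream_only))
print("reachable from the core:", sorted(downstream_only))
```

Adding a link from "d" back to "a" in the toy graph would make all four pages mutually reachable, so the core would then contain every page.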

Modelling graphs with linear algebra
