The following problem appeared as a project in the coursera course Algorithmic Thinking (by RICE university), a part of Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.
In this problem, the mathematical analysis to analyze a real-world problem: “How do scientific papers get cited” will be combined with simulation. The goal is to provide a more realistic simulation of how the concepts are actually used in practice.
Citation graphs
Our task for this application is to analyze the structure of graphs generated by citation patterns from scientific papers. Each scientific paper cites many other papers, say 20-40, and sometimes (e.g., review papers) hundreds of other papers. But, let’s face it: It is often the case that the authors of a paper are superficially familiar with some (many?) of the papers they cite. So, the question is: Are the cited papers chosen randomly (from within the domain of the paper) or is there some “hidden pattern”?
Given that “paper i cites paper j” relationships are going to be examined, it makes sense to represent the citation data as a directed graph (a citation graph) in which the nodes correspond to papers, and there is an edge from node i to node j if the paper corresponding to node i cites the paper corresponding to node j. Since we’re interested in understanding how papers get cited, we shall analyze the in-degree distribution of a specific graph, and contrast it to those of graphs generated by two different random processes.
For this question, the task is to load a provided citation graph for 27,770 high energy physics theory papers. This graph has 352,807 edges. Then we need to compute the in-degree distribution for this citation graph. Once this distribution have been computed, we need to normalize the distribution and then compute a log/log plot of the points in this normalized distribution.
A very small part of the citation graph is shown below:


The following Algorithm ER shows an algorithm for generating random graphs:

Consider the simple modification of the algorithm to generate random directed graphs: For every pair of nodes i and j, the modified algorithm considers the pair twice, once to add an edge (i,j) with probability p, and another to add an edge (j,i) with probability p.
Now, let’s consider the shape of the in-degree distribution for an ER graph and compare its shape to that of the physics citation graph. To determine the shape of the in-degree distribution for an ER graph, we can compute several examples of in-degree distributions (with simulation) or determine the shape mathematically.
The expected in-degree is the same for every node in an ER graph. Since for any given node, incoming edges to it, from each of the n-1 remaining nodes, are chosen by a tossing a coin with the probability of success p, the random variable in-degree follows a Binomial distribution with parameters (n-1,p) and the expected in-degree = (n-1)p. Each node is equivalent in this sense and will have the same expected in-degree.
As can be seen from the next figure (a log-log plot for the in-degree distribution of the ER graph for n = 1000, p = 0.1). It’s centred approximately around in-degree 100. All the nodes have in-degree approximately in between 66 and 132, with steady and sharp increase in probability from in-degree 66 to 100, having the peak probability at around in degree 100, then the probability falling steadily and sharply in between in-degrees 100 and 132.
Also, it can be seen that the shape of the in-degree distribution of the citation graph is different from that of the ER graph. We observe a steady decrease in probability when moving from lower to higher in-degrees in case of citation graph, with pick probabilities (high frequency of nodes) at small values of in degrees.But in case of ER graph there is first a steady increase and then a steady decrease happens, pick probability is attained (high frequency of nodes) at higher in-degree values.






































































































































