HWK 5
This assignment will give the students practice with the basics of statistical testing and is inspired
by homework from Prof. Emilia Gan (https://courses.cs.washington.edu/courses/cse160/19wi/).
In this assignment, you will try to detect fraud in a dataset, in two different ways:
• Part 1: By examining the last digits of the numbers in the dataset. For each
datum such as 21063, you would examine the 3 and the 6.
• Part 2: By examining the first digit of the numbers in the dataset. For each
datum such as 21063, you would examine the 2.
You will write a program in a new file, fraud_detection.py, that you will create. In
some cases, you will also write answers in a file called answers.txt.
Coding style
A portion of your grade depends on the use of a good coding style. Code written in
good style is easier to read and to modify and is less likely to contain errors.
Your documentation should allow us to understand your code. You can get some
ideas on how to document your code from the starter code of previous assignments.
Follow good docstring conventions; you do not have to follow those conventions to the letter.
Different parts of this assignment require similar, but not exactly identical, work.
When you notice such repetition, you should refactor and generalize your code. That
is, rather than writing two similar routines, you should write one routine that takes the
place of most of both.
You should decompose functions into smaller helper functions. One good rule
of thumb is that if you cannot come up with a descriptive name for a function, then it
may be too small or too large. If there is a good name for a portion of a function, then
you might consider abstracting it out into a helper function.
We have not provided tests or exact results to check your program against. We
encourage you to write your own tests and to use assert statements.
Important: Leave yourself some time to go back and refactor your code before you
turn it in. Whenever you make a change to your program, ensure that it produces the
same results as before. It is very possible to get a low grade on this assignment even if
your program correctly executes all of the requested calculations.
Part 1: Detecting fraudulent data from the back
The ones place and the tens place don't affect who wins. They are essentially random
noise, in the sense that in any real election, each value is equally likely. Another way
to say this is that we expect the ones and tens digits to be uniformly distributed
— that is, 10% of the digits should be “0”, 10% should be “1”, and so forth. If
these digits are not uniformly distributed, then it is likely that the numbers were made
up by a person rather than collected from ballot boxes. (People tend to be poor at
making up truly random numbers.)
It is important to note that a non-uniform distribution does not necessarily mean that
the data are fraudulent. A non-uniform distribution is a strong signal of fraudulent
data, but it is possible for a non-uniform distribution to arise naturally.
Getting Started
Create a file called fraud_detection.py for your Python program. As we have given
you no starter code, it is up to you to create this program from scratch.
There are a few specific details that you must adhere to. The first is that your
program's output should exactly match the following formatting, including
capitalization and spacing (except where ___ is replaced by your answers).
2009 Iranian election MSE: ___
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
2008 United States election MSE: ___
Quantity of MSEs larger than or equal to the 2008 United States election MSE: ___
Quantity of MSEs smaller than the 2008 United States election MSE: ___
2008 United States election null hypothesis rejection level p: ___
Some of the problems request that you write functions that create plots and save them
to files. Upon execution, your program should generate and write these files (even
though there are no traces of these files in the printed output).
We do ask for specific functions that take exact parameter formats and return exact
output formats. You must preserve the names, parameters, and output of these
functions. The functions that we ask for are as follows:
• extract_election_vote_counts(filename, column_names)
• ones_and_tens_digit_histogram(numbers)
• plot_iranian_least_digits_histogram(histogram)
• plot_distribution_by_sample_size()
• mean_squared_error(numbers1, numbers2)
• calculate_mse_with_uniform(histogram)
• compare_iranian_mse_to_samples(mse)
Lastly, you should use a main function to organize the execution of code in your
program. You may begin with the following template code, which goes at
the bottom of your program, after the definitions of all of your functions.
# The code in this function is executed when this file is run as a Python program
def main():
    ...

if __name__ == "__main__":
    main()
Your program should not execute any code, other than the main function, when it is
loaded; that is, all statements should be inside a function, never at the top level.
Make sure to add import statements to gain access to tools and functions that are not
included by default, such as matplotlib.pyplot or math. All import statements should
be at the top of the file.
You may want to refer to String Methods and int() from the Python language
documentation. Instead of using split, you should make use of csv.DictReader,
which will make it easier to produce a clean solution.
>>> extract_election_vote_counts("election-iran-2009.csv", ["Ahmadinejad",
"Rezai", "Karrubi", "Mousavi"])
[1131111, 16920, 7246, 837858, 623946, 12199, 21609, 656508, ...
You will notice that the data contains double-quotes and commas. It is common to
receive data that is not formatted quite how you would like it to be, and it is
important to be able to clean data before analysis. It is up to you to handle the input
by removing these symbols from the data before converting the values into numbers.
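As an illustration, here is a minimal sketch of one way extract_election_vote_counts
might look, assuming the comma-and-quote cleaning just described and the
skip-blank-cells behavior required later for the US data; treat it as one possible
approach, not the required implementation:

    import csv

    def extract_election_vote_counts(filename, column_names):
        """Return a list of vote counts (ints), reading the given columns
        from each row of the CSV file in order. Blank cells are skipped."""
        counts = []
        with open(filename) as f:
            for row in csv.DictReader(f):
                for name in column_names:
                    # DictReader handles the quoting; we still strip the
                    # thousands-separator commas, e.g. "1,131,111".
                    cell = row[name].replace(",", "").replace('"', "").strip()
                    if cell:  # ignore missing data; do not turn it into zero
                        counts.append(int(cell))
        return counts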
In a number that is less than 10, such as 3, the tens place is implicitly zero. That is, 3
must be treated as 03. Your code should treat the tens digits of these values as zero.
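For example, here is one possible sketch of ones_and_tens_digit_histogram, assuming
the histogram is a 10-element list of frequencies (so a perfectly uniform histogram
would be 0.1 everywhere):

    def ones_and_tens_digit_histogram(numbers):
        """Return a 10-element list in which element i is the fraction of
        the ones and tens digits (combined) that equal i."""
        counts = [0] * 10
        for n in numbers:
            counts[n % 10] += 1          # ones digit
            counts[(n // 10) % 10] += 1  # tens digit; 0 when n < 10
        total = 2 * len(numbers)         # two digits examined per number
        return [c / total for c in counts]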
The resultant plot should be identical to the following plot. Don't forget the x- and y-
axis labels and the legend. Use pyplot.plot for the line itself. To create the legend at
the top right corner, use pyplot.legend, and don't forget to use the label= optional
argument to plot.
You may wish to reference the pyplot tutorial. As a hint (that is also discussed in the
tutorial) be sure that the call to pyplot.savefig comes before any call to pyplot.show;
if savefig comes after, the graph will be empty.
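Putting those hints together, a plotting function might be sketched as follows. The
legend labels, axis labels, and the final clf() cleanup are assumptions here; match
them to the plot you were shown:

    import matplotlib.pyplot as pyplot

    def plot_iranian_least_digits_histogram(histogram):
        """Plot the ideal uniform line and the Iranian digit histogram,
        then save the figure as iran-digits.png."""
        pyplot.plot([0.1] * 10, label="Ideal")  # assumed label
        pyplot.plot(histogram, label="Iran")    # assumed label
        pyplot.xlabel("Digit")                  # assumed axis label
        pyplot.ylabel("Frequency")              # assumed axis label
        pyplot.legend(loc="upper right")
        pyplot.savefig("iran-digits.png")  # save before any show() call
        pyplot.clf()  # clear the figure so later plots start fresh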
The Iran election data are rather different from the expected flat line at y=.1. Are these
data different enough that we can conclude that the numbers are probably fake? You
can't tell just by looking at the graphs we have created so far. We will show more
principled, statistical ways of making this determination.
Problem 4: Smaller samples have more variation
With a small sample, the vagaries of random choice might lead to results that seem
different than expected. As an example, suppose that you plotted a histogram of 20
randomly-chosen digits (10 random numbers, 2 digits per number):
That looks much worse than the Iran data, even though it is genuinely random! Of
course, it would be incorrect to conclude from this experiment that the data for this
plot is fraudulent and that the data for the Iranian election is genuine. Just because
your observations do not seem to fit a hypothesis does not mean the hypothesis is
false — it is very possible that you didn't examine enough data to see the trend.
You will want to use random.randint to generate numbers in the range [0, 99], inclusive.
>>> plot_distribution_by_sample_size()
Your plot demonstrates that the more datapoints there are, the closer the result is to
the ideal histogram. We must take the sample size into account when deciding
whether a given sample is suspicious.
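For reference, a sketch of plot_distribution_by_sample_size might look like the
following, reusing the ones_and_tens_digit_histogram sketch from earlier. The
particular sample sizes and legend labels are assumptions, since the text above does
not pin them down:

    import random
    import matplotlib.pyplot as pyplot

    def plot_distribution_by_sample_size():
        """Plot digit histograms for several sizes of random samples
        against the ideal uniform line; save as random-digits.png."""
        pyplot.plot([0.1] * 10, label="Ideal")
        for size in [10, 50, 100, 1000, 10000]:  # assumed sample sizes
            numbers = [random.randint(0, 99) for _ in range(size)]
            pyplot.plot(ones_and_tens_digit_histogram(numbers),
                        label="%d random numbers" % size)
        pyplot.xlabel("Digit")
        pyplot.ylabel("Frequency")
        pyplot.legend(loc="upper right")
        pyplot.savefig("random-digits.png")
        pyplot.clf()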
We would like a way to determine how similar two graphs are — and more
specifically, we would like to determine whether the difference between graphs A and
B is larger or smaller than the difference between graphs C and D. For this, we will
define a distance metric. Given two graphs, it returns a number — a distance — that is
0 if the two graphs are identical, and is larger the more different the two graphs are.
One common measure for the difference/distance between two datasets is the mean
squared error. For each corresponding datapoint, compute the difference between
the two points, then square it. The overall distance measure is the sum of these
squares. (Note: despite the name, the measure used in this assignment sums the
squares rather than averaging them, consistent with the example below.)
The use of squares means that one really big difference among corresponding
datapoints yields greater weight than several small differences. It also means that the
distance between A and B is the same as the distance between B and A. That is,
(9 - 4)² is the same as (4 - 9)². For example, suppose that you had the data that
appears in the following table and plot:
The absolute values of the MSE are not interesting; it's only comparisons between
them that are. These numbers show that g and h are the most similar, and f and h are
the most different.
Write a function mean_squared_error that, given two lists of numbers, computes the
mean squared error between the lists.
>>> mean_squared_error([1, 4, 9], [6, 5, 4])
51
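Here is a minimal sketch consistent with that definition; the example checks out,
since (1-6)² + (4-5)² + (9-4)² = 25 + 1 + 25 = 51:

    def mean_squared_error(numbers1, numbers2):
        """Return the sum of squared differences between corresponding
        elements of two equal-length lists of numbers."""
        return sum((a - b) ** 2 for a, b in zip(numbers1, numbers2))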
Statistics background
The 120 datapoints from the 2009 Iranian election are a small sample of some
hypothetical very large dataset. We don't know what that large dataset is, but we want
to answer a question about it: does that dataset have uniformly-distributed ones and
tens digits? If, just from looking at our small sample, we can determine that the large
unknown dataset does not have uniformly-distributed ones and tens digits, then we
can conclude that the observed sample is fraudulent (it came from some other source,
such as some bureaucrat's imagination).
One sample can't conclusively prove anything about the underlying distribution. For
example, there is a very small possibility that, by pure coincidence, a fair election
might produce 120 numbers that all end with “11”. If we saw a sample whose ones
and tens digits were all 1, we would be quite sure, but not 100% sure, that the
data is fraudulent.
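The required calculate_mse_with_uniform function computes the MSE between your
digit histogram and the ideal uniform histogram; applied to the Iranian data, it
produces the number discussed next. A minimal sketch, assuming the 10-element
frequency histogram from Part 1 and the uniform ideal of 0.1 per digit:

    def calculate_mse_with_uniform(histogram):
        """Return the MSE between a digit histogram and the uniform
        histogram (frequency 0.1 for each of the ten digits)."""
        return mean_squared_error(histogram, [0.1] * 10)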
This MSE value on its own does not mean anything — we don't know whether it is
unusually low, or unusually high, or about average. To find out, we need to compare it
to the MSEs of similarly-sized sets of random numbers.
Your function should determine where the passed-in MSE (for our sample of the 2009
Iranian election data, this is ~0.007) falls relative to the computed MSE samples.
In other words, determine how many of the random MSEs are larger than or
equal to the Iran MSE, and how many of the random MSEs are smaller than the
Iran MSE. Print these values. This function should return None. With each run of your
program, you should expect a slightly different outcome from this function call.
>>> compare_iranian_mse_to_samples(0.00739583333333)
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
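A sketch of compare_iranian_mse_to_samples follows. The 10,000 trials and the
120-number sample size are assumptions inferred from the text (the counts in the
worked examples below sum to 10,000, and the Iranian sample has 120 datapoints);
the p calculation matches those examples:

    import random

    def compare_iranian_mse_to_samples(mse):
        """Compare the given MSE to the MSEs of 10,000 random samples of
        120 numbers each, printing the counts and the p level."""
        trials = 10000
        larger_or_equal = 0
        for _ in range(trials):
            sample = [random.randint(0, 99) for _ in range(120)]
            sample_mse = calculate_mse_with_uniform(
                ones_and_tens_digit_histogram(sample))
            if sample_mse >= mse:
                larger_or_equal += 1
        print("Quantity of MSEs larger than or equal to the 2009 Iranian"
              " election MSE:", larger_or_equal)
        print("Quantity of MSEs smaller than the 2009 Iranian election MSE:",
              trials - larger_or_equal)
        print("2009 Iranian election null hypothesis rejection level p:",
              larger_or_equal / trials)

As written this prints the Iranian messages only; the refactoring instructions later
ask you to generalize such code to also produce the 2008 United States output.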
• Suppose that the value 0.007 is larger than 9992 of the random MSEs, and is
smaller than 8 of the random MSEs. If the election results were genuine, then
there would only be a 0.08% chance of this result. This is highly unlikely, and
we say that we are 99.92% confident that the data are fraudulent. More
precisely, we say that “we reject the null hypothesis at the p=.0008 level”.
• Suppose that the value 0.007 is larger than 9607 of the random MSEs, and is
smaller than 393 of the random MSEs. If the election results were genuine, then
there would only be a 3.93% chance of such a lopsided choice. This
is unlikely, and we say that we are 96.07% confident that the data are
fraudulent (we reject the null hypothesis at the p=.0393 level).
• Suppose that the value 0.007 is larger than 8871 of the random MSEs, and is
smaller than 1129 of the random MSEs. If the election results were genuine,
then there would only be an 11.29% chance of such a lopsided choice. This
is somewhat unlikely, but not so very surprising.
• Suppose that the value 0.007 is larger than 4833 of the random MSEs, and is
smaller than 5167 of the random MSEs. If the election results were genuine,
then there would be a 51.67% chance of such a lopsided choice. This is not
surprising at all; it provides no evidence regarding the null hypothesis.
• Suppose that the value 0.007 is larger than 29 of the random MSEs, and is
smaller than 9971 of the random MSEs. If the election results were genuine,
then there would be a 99.71% chance of a result this lopsided or more lopsided.
This is not surprising at all; it provides no evidence regarding the null
hypothesis.
(Actually, the fit from this example is remarkably close to the theoretical ideal
— much closer than one would typically expect a randomly-chosen sample of
that size to be. But, it would come up once in a while, and maybe this is one of
those times. Or maybe the data were fudged to look really, really natural — too
natural, suspiciously natural. In any event, this result does not give grounds for
accusing the election authorities of fraud, given what we were measuring.)
Suppose that you are only testing genuine, non-fraudulent elections, and that you
reject the null hypothesis whenever p < .05. Then, 1 time in 20, the above procedure
will cause you to cry foul and make an incorrect accusation of fraud. This is called a
false positive, false alarm, or Type I error. False positives
are an inevitable risk of statistics. If you run enough different statistical tests, then by
pure chance, some of them will (seem to) yield an interesting result. You can reduce
the chance of such an error by reducing the 5% threshold discussed above. Doing so
makes your test less sensitive: your test will more often suffer a false negative, missed
alarm, or Type II error — a situation where there really was an effect but you missed
it. That is, there really was fraud but it was not detected.
If you are interested, Wikipedia has more on hypothesis testing. Reading this
is optional, however.
Update your program to include calculations for the United States 2008 presidential
election in addition to the 2009 Iranian election. Use the following list of candidates:
us_2008_candidates = ["Obama", "McCain", "Nader", "Barr", "Baldwin", "McKinney"]
Additionally, update your program to include all of the requested output. Make sure to
refactor your code from previous solutions to be general enough to handle the 2008
United States election rather than duplicating your code.
When a datum is missing (that is, an empty space in the .csv file), your calculation
should ignore that datum. Do not transform it into a zero.
You do not need to produce graphs or plots for the US election — just the textual
output.
In answers.txt, state whether you can reject the null hypothesis (that the 2008 United
States election data are genuine), and with what confidence. Briefly justify your answer.
Submit part 1
Submit the following files:
• fraud_detection.py
• answers.txt
• iran-digits.png
• random-digits.png
Part 2: Detecting fraudulent data from the front
In this part of the assignment, you will look for fraud in geographical data (place
populations) and in financial data. You will examine the most significant digit of the
data — that is, the leftmost digit.
For Part 2, please use the same fraud_detection.py file that you used in part 1. Add
new code where necessary, and submit that same file again at the end.
You are allowed to change your code from Part 1. However, your program must still
satisfy all the requirements of Part 1. When you run your program, it must produce all
the output required by Part 1, then all the output required by Part 2. You must abide
by all the requirements of Part 1 regarding the number of parameters and
specification/behavior of each function. One way to generalize is to create a helper
function, copy the body of an existing function to the helper function, and make the
original function's body be little more than a call to the helper function. Since you
defined the helper function, you are allowed to give it any name, any number of
parameters, and any specification that you like.
In this part of the assignment, the structure of your program is entirely up to you.
Your program's external correctness will be graded based on the .png images your
code generates, so you are free to define any functions with any names and parameters
you would like. You will find, however, that a good function decomposition will
make this assignment much easier.
We will still grade your code by hand, so make sure to practice good coding and
commenting style.
Benford's Law states that for natural processes such as these, the probabilities of
seeing each digit in the most significant place are those shown in the table below
(from Wikipedia). Let P(d) be the probability that a given digit d is the first digit of a
measurement. Benford's Law also states the remarkable fact that all of these
processes produce the same histogram!
d    P(d)
1    30.1%
2    17.6%
3    12.5%
4    9.7%
5    7.9%
6    6.7%
7    5.8%
8    5.1%
9    4.6%
Think about this: your measurements were made in some arbitrary units, such as
square miles or dollars, but what if you made the measurements in some other units,
such as acres or euros? Would you expect the histogram to change?
In fact, you would not expect the shape of the histogram to differ just because you
changed the units of measurement. This is because the units are arbitrary and
unrelated to the underlying phenomenon. If the shape of the histogram did change,
that could indicate that the data were fraudulent. This approach has been used to
detect fraud, particularly in accounting and economics, but also in science.
Data are called scale invariant if measuring them in any units yields similar results.
Many natural processes produce scale-invariant data. This means that for such
processes, regardless of the units used, the histogram of first digits will be the same.
Benford's law only holds when the data has no natural limit or cutoff. For example, it
would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0)
nor for people's height in centimeters (where almost every value would start with 1 or
2). If you are interested, Wikipedia has more information about Benford's
law. Reading the Wikipedia article is optional, however.
Plot the values produced by evaluating the Benford's Law formula, P(d) = log10(1 + 1/d),
for each integer d in the interval [1, 10). Your plot should look like this, including
the same x- and y-axis labels and the same legend:
Use pyplot.plot for the line itself. (You may also find the pyplot tutorial useful.) You
will also need to use Python's math.log10 function.
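For example, a sketch along these lines would produce the theoretical curve. The
function name is hypothetical, since Part 2 leaves the structure of your program up
to you:

    import math
    import matplotlib.pyplot as pyplot

    def plot_benford_distribution():  # hypothetical name
        """Plot P(d) = log10(1 + 1/d) for the digits d = 1 through 9."""
        digits = list(range(1, 10))
        pyplot.plot(digits, [math.log10(1 + 1.0 / d) for d in digits],
                    label="Benford")  # assumed legend label
        pyplot.xlabel("First digit")  # assumed axis label
        pyplot.ylabel("Frequency")    # assumed axis label
        pyplot.legend(loc="upper right")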
1. Pick a random number r uniformly in the range [0.0, 30.0). That is, the value is
greater than or equal to 0, it is less than 30, and every value in that range is
equally likely. Hint: use random.random or random.uniform.
2. Compute e^r, where e is the base of the natural logarithms, or approximately
2.71828. Hint: use math.e.
Hint: You may find it helpful to write helper functions, much as you did in Part 1.
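For instance, two hypothetical helpers like the following could generate the samples
and tally their first digits; the names and details here are our own assumptions, not
requirements:

    import math
    import random

    def first_digit(x):
        """Return the most significant digit of a number x >= 1."""
        while x >= 10:
            x /= 10.0
        return int(x)

    def first_digit_histogram(numbers):
        """Return a 9-element list: the frequency of each first digit 1-9."""
        counts = [0] * 9
        for n in numbers:
            counts[first_digit(n) - 1] += 1
        return [c / len(numbers) for c in counts]

    # Inside one of your plotting functions:
    # 1000 samples of e**r, with r uniform in [0.0, 30.0)
    samples = [math.e ** random.uniform(0.0, 30.0) for _ in range(1000)]
    pyplot.plot(range(1, 10), first_digit_histogram(samples),
                label="1000 samples")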
On your graph from Problems 9 and 10, draw another line, where each datapoint is
computed as π × e^r. For the label in the legend, use the string “1000 samples, scaled
by $\pi$”. (The funny “$\pi$” will show up as a nicely-formatted π in the graph.)
Compare this line with the one you drew in Problem 10. There are some differences
due to random choices, but overall it demonstrates the scale-independence of the
distribution you just created. It also demonstrates the scale-independence of the
Benford distribution, since it is so similar to the one you just created. (It is possible to
demonstrate the scale-independence of Benford's Law mathematically as well. You
are welcome to try doing this, but it is not required.)
You now have a single plot with three functions graphed on it (from problems 9-
11). Turn in this file as scale-invariance.png.
Your directory contains a file SUB-EST2009_ALL.csv with United States census data.
The file is in “comma-separated values” format. You can parse this file the same way
as you did in Problem 1.
Create a new plot like the one from Problem 9. It should have only the theoretical
Benford's distribution, calculated as log10(1 + 1/d) for each digit d. Then plot on it a
histogram of the frequency of each first digit in the data from the
"POPCENSUS_2000" column of the file. Label it "US (all)".
Just like in Problem 1, you might run into unclean data. You should handle this data in
the same way you did then.
If any city has population 0 in the 2000 census, you may ignore this city.
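A hypothetical helper for reading the census column might look like the following,
which could then feed the first_digit_histogram sketch above; again, the name and
parameters are our own, not requirements:

    import csv

    def extract_column_as_ints(filename, column_name):  # hypothetical name
        """Return the named CSV column as a list of ints, skipping blank
        cells and zero values."""
        values = []
        with open(filename) as f:
            for row in csv.DictReader(f):
                cell = row[column_name].replace(",", "").strip()
                if cell and int(cell) != 0:  # ignore missing data and 0s
                    values.append(int(cell))
        return values

    # Inside your plotting function:
    # populations = extract_column_as_ints("SUB-EST2009_ALL.csv",
    #                                      "POPCENSUS_2000")
    # pyplot.plot(range(1, 10), first_digit_histogram(populations),
    #             label="US (all)")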
Your graph now has two curves plotted on it. From the similarity of the two curves,
you can see that the population of U.S. cities obeys Benford's Law.
On your graph from Problem 12, plot the data from the file literature-
population.txt. Label this plot "Literature Places." Notice that the data are similar to,
but not the same as, the real dataset and the perfect Benford's-Law histogram.
Are these data far enough from obeying Benford's Law that we can conclude that the
numbers are fake? Or are they close enough to obeying Benford's Law that they are
statistically indistinguishable from real data? You can't tell just by looking at the
graphs we have created so far. We will show more principled, statistical ways of
making this determination.
Your plot now has three lines plotted on it. Turn this plot in as population-data.png.
Create a new graph like the one from Problem 10. Add to it plots for 10, 50, 100,
and 10000 randomly-selected values of r. In other words, where in Problem 10 you
used 1000 samples, here you should additionally use 10, 50, 100, and 10000. Your
final graph will plot six functions. You should label these functions "10 samples", "50
samples", and so on, just as in Problem 10.
Notice that the larger the sample size, the closer the distribution of first digits
approaches the theoretically perfect distribution. This demonstrates that the more
datapoints there are, the closer the sample is to the true distribution.
Statistics background
A distribution is a process that can generate arbitrarily many datapoints. A
distribution obeys Benford's Law if, after choosing infinitely many points, the
frequency of first digits is exactly P(d) = log10(1 + 1/d).
The populations of places from literature are a small sample — just a few dozen
datapoints. If, just from looking at our small sample, we can determine that the
unknown distribution they came from does not obey Benford's Law, then we can
conclude that the observed sample is not a result of a natural process (in this case, it is
a result of the authors' choices, not a natural process). We can conclude that because
place populations from the United States and elsewhere in the real world do obey
Benford's Law.
One sample can't conclusively prove anything about the underlying distribution. For
example, there is a very small possibility that, by pure coincidence, we might
randomly choose 100 numbers that all start with the digit 1. If we saw a sample of 100
datapoints whose first digits were all 1, we would be quite sure, but not 100%
sure, that the underlying distribution does not obey Benford's Law. We might say that
we are more than 99.9% sure that the data is fraudulent.
So, our question is, “What is the probability that the populations from literature are a
sample of an unknown distribution that does not obey Benford's Law?” In other
words, if we had to bet on whether the literature place populations are fraudulent,
what odds would we give? We will determine a quantitative answer to this question.
We take as an assumption that the observed sample (the populations from literature)
is not fraudulent — we call this the “null hypothesis”. Our question is whether we
can reject that assumption. Rejecting the assumption means determining that the
sample is fraudulent. By “fraudulent”, we mean that it was generated by some other
process — such as human imagination — that is different from the natural process that
generates real population data.
Generate 10,000 sets, each of which contains population data from n US towns (n is
the size of the literature dataset). Each datapoint in each set should be chosen at
random from the POPCENSUS_2000 data. For each of these sets, compute its
MSE distance from Benford's distribution.
Now determine how many of the US MSEs are larger than or equal to the literature
MSE, and how many of the US MSEs are smaller than the literature MSE. Your
program should print out these quantities, in the following format:
Comparison of US MSEs to literature MSE:
larger/equal: ___
smaller: ___
Hint: Your program should not open and parse the census file 10,000 or 560,000
times. Your program should only read the file once.
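As a sketch, the comparison might be organized like this; the function name and
parameters are hypothetical, the census populations are read once and passed in (per
the hint above), and it reuses the mean_squared_error and first_digit_histogram
sketches from earlier:

    import math
    import random

    def compare_us_mses_to_literature_mse(populations, literature_mse, n):
        """Draw 10,000 samples of n census populations each, compute each
        sample's MSE against the Benford distribution, and print the
        comparison with the literature MSE."""
        benford = [math.log10(1 + 1.0 / d) for d in range(1, 10)]
        larger_or_equal = 0
        for _ in range(10000):
            sample = [random.choice(populations) for _ in range(n)]
            if mean_squared_error(first_digit_histogram(sample),
                                  benford) >= literature_mse:
                larger_or_equal += 1
        print("Comparison of US MSEs to literature MSE:")
        print("larger/equal:", larger_or_equal)
        print("smaller:", 10000 - larger_or_equal)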
Submit part 2
You are almost done!
Look over your work to find ways to eliminate repetition in it. Then, refactor your
code to eliminate that repetition. This is important when you complete each part, but
especially important when you complete part 2. When turning in part 2, you should
refactor throughout the code, which will probably include more refactoring in part 1.
You will find that there is some similar code within each part that does not need to be
duplicated, and you will find that there are also similarities across the two parts. You
may want to restructure some of your part 1 code to make it easier for you to reuse in
part 2.
Now look over your work and make sure you practiced good coding style.
At the bottom of your answers.txt file, in the “Collaboration” part, state which
students or other people (besides the course staff) helped you with the assignment, or
that no one did.
At the bottom of your answers.txt file, in the “Reflection” part, reflect on this
assignment. What did you learn from this assignment? What do you wish you had
known before you started? What would you do differently? What advice would you
offer to future students?
Submit the following files:
• fraud_detection.py
• answers.txt
• scale-invariance.png
• population-data.png
• Benford-samples.png