HWK 5
This assignment will give the students practice with the basics of statistical testing and is inspired
by homework from Prof. Emilia Gan (https://courses.cs.washington.edu/courses/cse160/19wi/).
In this assignment, you will try to detect fraud in a dataset, in two different ways:
• Part 1: By examining the last digits of the numbers in the dataset. For each
datum such as 21063, you would examine the 3 and the 6.
• Part 2: By examining the first digit of the numbers in the dataset. For each
datum such as 21063, you would examine the 2.
You will write a program in a new file, fraud_detection.py, that you will create. In
some cases, you will also write answers in a file called answers.txt.
Coding style
A portion of your grade depends on the use of a good coding style. Code written in
good style is easier to read and to modify and is less likely to contain errors.
Your documentation should allow us to understand your code. You can get some
ideas on how to document your code from the starter code of previous assignments.
Follow good docstring conventions; you do not have to follow those conventions to the letter.
Different parts of this assignment require similar, but not exactly identical, work.
When you notice such repetition, you should refactor and generalize your code. That
is, rather than writing two similar routines, you should write one routine that takes the
place of most of both.
You should decompose functions into smaller helper functions. One good rule
of thumb is that if you cannot come up with a descriptive name for a function, then it
may be too small or too large. If there is a good name for a portion of a function, then
you might consider abstracting it out into a helper function.
We have not provided tests or exact results to check your program against. We
encourage you to write your own tests and to use assert statements.
Important: Leave yourself some time to go back and refactor your code before you
turn it in. Whenever you make a change to your program, ensure that it produces the
same results as before. It is very possible to get a low grade on this assignment even if
your program correctly executes all of the requested calculations.
Part 1: Detecting fraudulent data from the back
The ones place and the tens place don't affect who wins. They are essentially random
noise, in the sense that in any real election, each value is equally likely. Another way
to say this is that we expect the ones and tens digits to be uniformly distributed
— that is, 10% of the digits should be “0”, 10% should be “1”, and so forth. If
these digits are not uniformly distributed, then it is likely that the numbers were made
up by a person rather than collected from ballot boxes. (People tend to be poor at
making up truly random numbers.)
It is important to note that a non-uniform distribution does not necessarily mean that
the data are fraudulent. A non-uniform distribution is a strong signal of fraudulent
data, but it is possible for a non-uniform distribution to arise naturally.
Getting Started
Create a file called fraud_detection.py for your Python program. As we have given
you no starter code, it is up to you to create this program from scratch.
There are a few specific details that you must adhere to. The first is that your
program's output should exactly match the following formatting, including
capitalization and spacing (except where ___ is replaced by your answers).
2009 Iranian election MSE: ___
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
2008 United States election MSE: ___
Quantity of MSEs larger than or equal to the 2008 United States election MSE: ___
Quantity of MSEs smaller than the 2008 United States election MSE: ___
2008 United States election null hypothesis rejection level p: ___
Some of the problems request that you write functions that create plots and save them
to files. Upon execution, your program should generate and write these files (even
though there are no traces of these files in the printed output).
We do ask for specific functions that take exact parameter formats and return exact
output formats. You must preserve the names, parameters, and output of these
functions. The functions that we ask for are as follows:
• extract_election_vote_counts(filename, column_names)
• ones_and_tens_digit_histogram(numbers)
• plot_iranian_least_digits_histogram(histogram)
• plot_distribution_by_sample_size()
• mean_squared_error(numbers1, numbers2)
• calculate_mse_with_uniform(histogram)
• compare_iranian_mse_to_samples(mse)
Lastly, you should use a main function to organize the execution of code in your
program. You may begin with the following template code, which goes at
the bottom of your program, after the definitions of all of your functions.
# The code in this function is executed when this file is run as a Python program
def main():
    ...

if __name__ == "__main__":
    main()
Your program should not execute any code, other than the main function, when it is
loaded; that is, all statements should be inside a function, never at the top level.
Make sure to add import statements to gain access to tools and functions that are not
included by default, such as matplotlib.pyplot or math. All import statements should
be at the top of the file.
You may want to refer to String Methods and int() from the Python language
documentation. Instead of using split, you should make use of csv.DictReader,
which will make it easier to produce a clean solution.
>>> extract_election_vote_counts("election-iran-2009.csv", ["Ahmadinejad",
"Rezai", "Karrubi", "Mousavi"])
[1131111, 16920, 7246, 837858, 623946, 12199, 21609, 656508, ...
You will notice that the data contains double-quotes and commas. It is common to
receive data that is not formatted quite how you would like it to be, and it is
important to be able to clean data before analysis. It is up to you to handle the input
by removing these symbols from the data before converting the values into numbers.
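As an illustration, here is a minimal sketch of one way extract_election_vote_counts
might look, assuming the comma-and-quote cleaning just described and the
skip-blank-cells behavior required later for the US data; treat it as one possible
approach, not the required implementation:

    import csv

    def extract_election_vote_counts(filename, column_names):
        """Return a list of vote counts (ints), reading the given columns
        from each row of the CSV file in order. Blank cells are skipped."""
        counts = []
        with open(filename) as f:
            for row in csv.DictReader(f):
                for name in column_names:
                    # DictReader handles the quoting; we still strip the
                    # thousands-separator commas, e.g. "1,131,111".
                    cell = row[name].replace(",", "").replace('"', "").strip()
                    if cell:  # ignore missing data; do not turn it into zero
                        counts.append(int(cell))
        return counts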
In a number that is less than 10, such as 3, the tens place is implicitly zero. That is, 3
must be treated as 03. Your code should treat the tens digits of these values as zero.
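For example, here is one possible sketch of ones_and_tens_digit_histogram, assuming
the histogram is a 10-element list of frequencies (so a perfectly uniform histogram
would be 0.1 everywhere):

    def ones_and_tens_digit_histogram(numbers):
        """Return a 10-element list in which element i is the fraction of
        the ones and tens digits (combined) that equal i."""
        counts = [0] * 10
        for n in numbers:
            counts[n % 10] += 1          # ones digit
            counts[(n // 10) % 10] += 1  # tens digit; 0 when n < 10
        total = 2 * len(numbers)         # two digits examined per number
        return [c / total for c in counts]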
The resultant plot should be identical to the following plot. Don't forget the x- and y-
axis labels and the legend. Use pyplot.plot for the line itself. To create the legend at
the top right corner, use pyplot.legend, and don't forget to use the label= optional
argument to plot.
You may wish to reference the pyplot tutorial. As a hint (that is also discussed in the
tutorial) be sure that the call to pyplot.savefig comes before any call to pyplot.show;
if savefig comes after, the graph will be empty.
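Putting those hints together, a plotting function might be sketched as follows. The
legend labels, axis labels, and the final clf() cleanup are assumptions here; match
them to the plot you were shown:

    import matplotlib.pyplot as pyplot

    def plot_iranian_least_digits_histogram(histogram):
        """Plot the ideal uniform line and the Iranian digit histogram,
        then save the figure as iran-digits.png."""
        pyplot.plot([0.1] * 10, label="Ideal")  # assumed label
        pyplot.plot(histogram, label="Iran")    # assumed label
        pyplot.xlabel("Digit")                  # assumed axis label
        pyplot.ylabel("Frequency")              # assumed axis label
        pyplot.legend(loc="upper right")
        pyplot.savefig("iran-digits.png")  # save before any show() call
        pyplot.clf()  # clear the figure so later plots start fresh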
The Iran election data are rather different from the expected flat line at y=.1. Are these
data different enough that we can conclude that the numbers are probably fake? You
can't tell just by looking at the graphs we have created so far. We will show more
principled, statistical ways of making this determination.
Problem 4: Smaller samples have more variation
With a small sample, the vagaries of random choice might lead to results that seem
different than expected. As an example, suppose that you plotted a histogram of 20
randomly-chosen digits (10 random numbers, 2 digits per number):
That looks much worse than the Iran data, even though it is genuinely random! Of
course, it would be incorrect to conclude from this experiment that the data for this
plot is fraudulent and that the data for the Iranian election is genuine. Just because
your observations do not seem to fit a hypothesis does not mean the hypothesis is
false — it is very possible that you didn't examine enough data to see the trend.
You will want to use random.randint to generate numbers in the range [0, 99], inclusive.
>>> plot_distribution_by_sample_size()
Your plot demonstrates that the more datapoints there are, the closer the result is to
the ideal histogram. We must take the sample size into account when deciding
whether a given sample is suspicious.
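For reference, a sketch of plot_distribution_by_sample_size might look like the
following, reusing the ones_and_tens_digit_histogram sketch from earlier. The
particular sample sizes and legend labels are assumptions, since the text above does
not pin them down:

    import random
    import matplotlib.pyplot as pyplot

    def plot_distribution_by_sample_size():
        """Plot digit histograms for several sizes of random samples
        against the ideal uniform line; save as random-digits.png."""
        pyplot.plot([0.1] * 10, label="Ideal")
        for size in [10, 50, 100, 1000, 10000]:  # assumed sample sizes
            numbers = [random.randint(0, 99) for _ in range(size)]
            pyplot.plot(ones_and_tens_digit_histogram(numbers),
                        label="%d random numbers" % size)
        pyplot.xlabel("Digit")
        pyplot.ylabel("Frequency")
        pyplot.legend(loc="upper right")
        pyplot.savefig("random-digits.png")
        pyplot.clf()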
We would like a way to determine how similar two graphs are — and more
specifically, we would like to determine whether the difference between graphs A and
B is larger or smaller than the difference between graphs C and D. For this, we will
define a distance metric. Given two graphs, it returns a number — a distance — that is
0 if the two graphs are identical, and is larger the more different the two graphs are.
One common measure for the difference/distance between two datasets is the mean
squared error. For each corresponding datapoint, compute the difference between
the two points, then square it. The overall distance measure is the sum of these
squares. (Note: despite the name, the measure used in this assignment sums the
squares rather than averaging them, consistent with the example below.)
The use of squares means that one really big difference among corresponding
datapoints yields greater weight than several small differences. It also means that the
distance between A and B is the same as the distance between B and A. That is,
(9 - 4)² is the same as (4 - 9)². For example, suppose that you had the data that
appears in the following table and plot:
The absolute values of the MSE are not interesting; it's only comparisons between
them that are. These numbers show that g and h are the most similar, and f and h are
the most different.
Write a function mean_squared_error that, given two lists of numbers, computes the
mean squared error between the lists.
>>> mean_squared_error([1, 4, 9], [6, 5, 4])
51
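Here is a minimal sketch consistent with that definition; the example checks out,
since (1-6)² + (4-5)² + (9-4)² = 25 + 1 + 25 = 51:

    def mean_squared_error(numbers1, numbers2):
        """Return the sum of squared differences between corresponding
        elements of two equal-length lists of numbers."""
        return sum((a - b) ** 2 for a, b in zip(numbers1, numbers2))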
Statistics background
The 120 datapoints from the 2009 Iranian election are a small sample of some
hypothetical very large dataset. We don't know what that large dataset is, but we want
to answer a question about it: does that dataset have uniformly-distributed ones and
tens digits? If, just from looking at our small sample, we can determine that the large
unknown dataset does not have uniformly-distributed ones and tens digits, then we
can conclude that the observed sample is fraudulent (it came from some other source,
such as some bureaucrat's imagination).
One sample can't conclusively prove anything about the underlying distribution. For
example, there is a very small possibility that, by pure coincidence, a fair election
might produce 120 numbers that all end with “11”. If we saw a sample whose ones
and tens digits were all 1, we would be quite sure, but not 100% sure, that the
data is fraudulent.
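The required calculate_mse_with_uniform function computes the MSE between your
digit histogram and the ideal uniform histogram; applied to the Iranian data, it
produces the number discussed next. A minimal sketch, assuming the 10-element
frequency histogram from Part 1 and the uniform ideal of 0.1 per digit:

    def calculate_mse_with_uniform(histogram):
        """Return the MSE between a digit histogram and the uniform
        histogram (frequency 0.1 for each of the ten digits)."""
        return mean_squared_error(histogram, [0.1] * 10)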
This MSE value on its own does not mean anything — we don't know whether it is
unusually low, or unusually high, or about average. To find out, we need to compare it
to the MSEs of similarly-sized sets of random numbers.
Your function should determine where the passed-in MSE (for our sample of the 2009
Iranian election data, this is ~0.007) falls relative to the computed MSE samples.
In other words, determine how many of the random MSEs are larger than or
equal to the Iran MSE, and how many of the random MSEs are smaller than the
Iran MSE. Print these values. This function should return None. With each run of your
program, you should expect a slightly different outcome from this function call.
>>> compare_iranian_mse_to_samples(0.00739583333333)
Quantity of MSEs larger than or equal to the 2009 Iranian election MSE: ___
Quantity of MSEs smaller than the 2009 Iranian election MSE: ___
2009 Iranian election null hypothesis rejection level p: ___
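A sketch of compare_iranian_mse_to_samples follows. The 10,000 trials and the
120-number sample size are assumptions inferred from the text (the counts in the
worked examples below sum to 10,000, and the Iranian sample has 120 datapoints);
the p calculation matches those examples:

    import random

    def compare_iranian_mse_to_samples(mse):
        """Compare the given MSE to the MSEs of 10,000 random samples of
        120 numbers each, printing the counts and the p level."""
        trials = 10000
        larger_or_equal = 0
        for _ in range(trials):
            sample = [random.randint(0, 99) for _ in range(120)]
            sample_mse = calculate_mse_with_uniform(
                ones_and_tens_digit_histogram(sample))
            if sample_mse >= mse:
                larger_or_equal += 1
        print("Quantity of MSEs larger than or equal to the 2009 Iranian"
              " election MSE:", larger_or_equal)
        print("Quantity of MSEs smaller than the 2009 Iranian election MSE:",
              trials - larger_or_equal)
        print("2009 Iranian election null hypothesis rejection level p:",
              larger_or_equal / trials)

As written this prints the Iranian messages only; the refactoring instructions later
ask you to generalize such code to also produce the 2008 United States output.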
• Suppose that the value 0.007 is larger than 9992 of the random MSEs, and is
smaller than 8 of the random MSEs. If the election results were genuine, then
there would only be a 0.08% chance of this result. This is highly unlikely, and
we say that we are 99.92% confident that the data are fraudulent. More
precisely, we say that “we reject the null hypothesis at the p=.0008 level”.
• Suppose that the value 0.007 is larger than 9607 of the random MSEs, and is
smaller than 393 of the random MSEs. If the election results were genuine, then
there would only be a 3.93% chance of such a lopsided choice. This
is unlikely, and we say that we are 96.07% confident that the data are
fraudulent (we reject the null hypothesis at the p=.0393 level).
• Suppose that the value 0.007 is larger than 8871 of the random MSEs, and is
smaller than 1129 of the random MSEs. If the election results were genuine,
then there would only be an 11.29% chance of such a lopsided choice. This
is somewhat unlikely, but not so very surprising.
• Suppose that the value 0.007 is larger than 4833 of the random MSEs, and is
smaller than 5167 of the random MSEs. If the election results were genuine,
then there would be a 51.67% chance of such a lopsided choice. This is not
surprising at all; it provides no evidence regarding the null hypothesis.
• Suppose that the value 0.007 is larger than 29 of the random MSEs, and is
smaller than 9971 of the random MSEs. If the election results were genuine,
then there would be a 99.71% chance of a result this lopsided or more lopsided.
This is not surprising at all; it provides no evidence regarding the null
hypothesis.
(Actually, the fit from this example is remarkably close to the theoretical ideal
— much closer than one would typically expect a randomly-chosen sample of
that size to be. But, it would come up once in a while, and maybe this is one of
those times. Or maybe the data were fudged to look really, really natural — too
natural, suspiciously natural. In any event, this result does not give grounds for
accusing the election authorities of fraud, given what we were measuring.)
Suppose that you are only testing genuine, non-fraudulent elections, and that you
reject the null hypothesis whenever p < .05. Then, 1 time in 20, the above procedure
will cause you to cry foul and make an incorrect accusation of fraud. This is called a
false positive, false alarm, or Type I error. False positives
are an inevitable risk of statistics. If you run enough different statistical tests, then by
pure chance, some of them will (seem to) yield an interesting result. You can reduce
the chance of such an error by reducing the 5% threshold discussed above. Doing so
makes your test less sensitive: your test will more often suffer a false negative, missed
alarm, or Type II error — a situation where there really was an effect but you missed
it. That is, there really was fraud but it was not detected.
If you are interested, Wikipedia has more on hypothesis testing. Reading this
is optional, however.
Update your program to include calculations for the United States 2008 presidential
election in addition to the 2009 Iranian election. Use the following list of candidates:
us_2008_candidates = ["Obama", "McCain", "Nader", "Barr", "Baldwin", "McKinney"]
Additionally, update your program to include all of the requested output. Make sure to
refactor your code from previous solutions to be general enough to handle the 2008
United States election rather than duplicating your code.
When a datum is missing (that is, an empty space in the .csv file), your calculation
should ignore that datum. Do not transform it into a zero.
You do not need to produce graphs or plots for the US election — just the textual
output.
In answers.txt, state whether you can reject the null hypothesis (that the 2008 United
States election data are genuine), and with what confidence. Briefly justify your answer.
Submit part 1
Submit the following files:
• fraud_detection.py
• answers.txt
• iran-digits.png
• random-digits.png
Part 2: Detecting fraudulent data from the front
In this part of the assignment, you will look for fraud in geographical data (place
populations) and in financial data. You will examine the most significant digit of the
data — that is, the leftmost digit.
For Part 2, please use the same fraud_detection.py file that you used in part 1. Add
new code where necessary, and submit that same file again at the end.
You are allowed to change your code from Part 1. However, your program must still
satisfy all the requirements of Part 1. When you run your program, it must produce all
the output required by Part 1, then all the output required by Part 2. You must abide
by all the requirements of Part 1 regarding the number of parameters and
specification/behavior of each function. One way to generalize is to create a helper
function, copy the body of an existing function to the helper function, and make the
original function's body be little more than a call to the helper function. Since you
defined the helper function, you are allowed to give it any name, any number of
parameters, and any specification that you like.
In this part of the assignment, the structure of your program is entirely up to you.
Your program's external correctness will be graded based on the .png images your
code generates, so you are free to define any functions with any names and parameters
you would like. You will find, however, that a good function decomposition will
make this assignment much easier.
We will still grade your code by hand, so make sure to practice good coding and
commenting style.
Benford's Law states that for natural processes such as these, the probabilities of
seeing each digit in the most significant place are those shown in the table below
(from Wikipedia). Let P(d) be the probability that a given digit d is the first digit of a
measurement. Benford's Law also states the remarkable fact that all of these
processes produce the same histogram!
d    P(d)
1    30.1%
2    17.6%
3    12.5%
4    9.7%
5    7.9%
6    6.7%
7    5.8%
8    5.1%
9    4.6%
Think about this: your measurements were made in some arbitrary units, such as
square miles or dollars, but what if you made the measurements in some other units,
such as acres or euros? Would you expect the histogram to change?
In fact, you would not expect the shape of the histogram to differ just because you
changed the units of measurement. This is because the units are arbitrary and
unrelated to the underlying phenomenon. If the shape of the histogram did change,
that could indicate that the data were fraudulent. This approach has been used to
detect fraud, particularly in accounting and economics, but also in science.
Data are called scale invariant if measuring them in any units yields similar results.
Many natural processes produce scale-invariant data. This means that for such
processes, regardless of the units used, the histogram of first digits will be the same.
Benford's law only holds when the data has no natural limit or cutoff. For example, it
would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0)
nor for people's height in centimeters (where almost every value would start with 1 or
2). If you are interested, Wikipedia has more information about Benford's
law. Reading the Wikipedia article is optional, however.
Plot the values produced by evaluating the Benford's Law formula, P(d) = log10(1 + 1/d),
for each integer d in the interval [1, 10). Your plot should look like this, including
the same x- and y-axis labels and the same legend:
Use pyplot.plot for the line itself. (You may also find the pyplot tutorial useful.) You
will also need to use Python's math.log10 function.
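For example, a sketch along these lines would produce the theoretical curve. The
function name is hypothetical, since Part 2 leaves the structure of your program up
to you:

    import math
    import matplotlib.pyplot as pyplot

    def plot_benford_distribution():  # hypothetical name
        """Plot P(d) = log10(1 + 1/d) for the digits d = 1 through 9."""
        digits = list(range(1, 10))
        pyplot.plot(digits, [math.log10(1 + 1.0 / d) for d in digits],
                    label="Benford")  # assumed legend label
        pyplot.xlabel("First digit")  # assumed axis label
        pyplot.ylabel("Frequency")    # assumed axis label
        pyplot.legend(loc="upper right")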
1. Pick a random number r uniformly in the range [0.0, 30.0). That is, the value is
greater than or equal to 0, it is less than 30, and every value in that range is
equally likely. Hint: use random.random or random.uniform.
2. Compute e^r, where e is the base of the natural logarithms, or approximately
2.71828. Hint: use math.e.
Hint: You may find it helpful to write helper functions, much as you did in Part 1.
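For instance, two hypothetical helpers like the following could generate the samples
and tally their first digits; the names and details here are our own assumptions, not
requirements:

    import math
    import random

    def first_digit(x):
        """Return the most significant digit of a number x >= 1."""
        while x >= 10:
            x /= 10.0
        return int(x)

    def first_digit_histogram(numbers):
        """Return a 9-element list: the frequency of each first digit 1-9."""
        counts = [0] * 9
        for n in numbers:
            counts[first_digit(n) - 1] += 1
        return [c / len(numbers) for c in counts]

    # Inside one of your plotting functions:
    # 1000 samples of e**r, with r uniform in [0.0, 30.0)
    samples = [math.e ** random.uniform(0.0, 30.0) for _ in range(1000)]
    pyplot.plot(range(1, 10), first_digit_histogram(samples),
                label="1000 samples")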
On your graph from Problems 9 and 10, draw another line, where each datapoint is
computed as π × e^r. For the label in the legend, use the string “1000 samples, scaled
by $\pi$”. (The funny “$\pi$” will show up as a nicely-formatted π in the graph.)
Compare this line with the one you drew in Problem 10. There are some differences
due to random choices, but overall it demonstrates the scale-independence of the
distribution you just created. It also demonstrates the scale-independence of the
Benford distribution, since it is so similar to the one you just created. (It is possible to
demonstrate the scale-independence of Benford's Law mathematically as well. You
are welcome to try doing this, but it is not required.)
You now have a single plot with three functions graphed on it (from problems 9-
11). Turn in this file as scale-invariance.png.
Your directory contains a file SUB-EST2009_ALL.csv with United States census data.
The file is in “comma-separated values” format. You can parse this file the same way
as you did in Problem 1.
Create a new plot like the one from Problem 9. It should have only the theoretical
Benford's distribution, calculated as log10(1 + 1/d) for each digit d. Then plot on it a
histogram of the frequency of each first digit in the data from the
"POPCENSUS_2000" column of the file. Label it "US (all)".
Just like in Problem 1, you might run into unclean data. You should handle this data in
the same way you did then.
If any city has population 0 in the 2000 census, you may ignore this city.
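A hypothetical helper for reading the census column might look like the following,
which could then feed the first_digit_histogram sketch above; again, the name and
parameters are our own, not requirements:

    import csv

    def extract_column_as_ints(filename, column_name):  # hypothetical name
        """Return the named CSV column as a list of ints, skipping blank
        cells and zero values."""
        values = []
        with open(filename) as f:
            for row in csv.DictReader(f):
                cell = row[column_name].replace(",", "").strip()
                if cell and int(cell) != 0:  # ignore missing data and 0s
                    values.append(int(cell))
        return values

    # Inside your plotting function:
    # populations = extract_column_as_ints("SUB-EST2009_ALL.csv",
    #                                      "POPCENSUS_2000")
    # pyplot.plot(range(1, 10), first_digit_histogram(populations),
    #             label="US (all)")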
Your graph now has two curves plotted on it. From the similarity of the two curves,
you can see that the population of U.S. cities obeys Benford's Law.
On your graph from Problem 12, plot the data from the file literature-
population.txt. Label this plot "Literature Places." Notice that the data are similar to,
but not the same as, the real dataset and the perfect Benford's-Law histogram.
Are these data far enough from obeying Benford's Law that we can conclude that the
numbers are fake? Or are they close enough to obeying Benford's Law that they are
statistically indistinguishable from real data? You can't tell just by looking at the
graphs we have created so far. We will show more principled, statistical ways of
making this determination.
Your plot now has three lines plotted on it. Turn this plot in as population-data.png.
Create a new graph like the one from Problem 10. Add to it plots for 10, 50, 100,
and 10000 randomly-selected values of r. In other words, where in Problem 10 you
used 1000 samples, here you should additionally use 10, 50, 100, and 10000. Your
final graph will plot six functions. You should label these functions "10 samples", "50
samples", and so on, just as in Problem 10.
Notice that the larger the sample size, the closer the distribution of first digits
approaches the theoretically perfect distribution. This demonstrates that the more
datapoints there are, the closer the sample is to the true distribution.
Statistics background
A distribution is a process that can generate arbitrarily many datapoints. A
distribution obeys Benford's Law if, after choosing infinitely many points, the
frequency of first digits is exactly P(d) = log10(1 + 1/d).
The populations of places from literature are a small sample — just a few dozen
datapoints. If, just from looking at our small sample, we can determine that the
unknown distribution they came from does not obey Benford's Law, then we can
conclude that the observed sample is not a result of a natural process (in this case, it is
a result of the authors' choices, not a natural process). We can conclude that because
place populations from the United States and elsewhere in the real world do obey
Benford's Law.
One sample can't conclusively prove anything about the underlying distribution. For
example, there is a very small possibility that, by pure coincidence, we might
randomly choose 100 numbers that all start with the digit 1. If we saw a sample of 100
datapoints whose first digits were all 1, we would be quite sure, but not 100%
sure, that the underlying distribution does not obey Benford's Law. We might say that
we are more than 99.9% sure that the data is fraudulent.
So, our question is, “What is the probability that the populations from literature are a
sample of an unknown distribution that does not obey Benford's Law?” In other
words, if we had to bet on whether the literature place populations are fraudulent,
what odds would we give? We will determine a quantitative answer to this question.
We take as an assumption that the observed sample (the populations from literature)
is not fraudulent — we call this the “null hypothesis”. Our question is whether we
can reject that assumption. Rejecting the assumption means determining that the
sample is fraudulent. By “fraudulent”, we mean that it was generated by some other
process — such as human imagination — that is different from the natural process that
generates real population data.
Generate 10,000 sets, each of which contains population data from n US towns (n is
the size of the literature dataset). Each datapoint in each set should be chosen at
random from the POPCENSUS_2000 data. For each of these sets, compute its
MSE distance from Benford's distribution.
Now determine how many of the US MSEs are larger than or equal to the literature
MSE, and how many of the US MSEs are smaller than the literature MSE. Your
program should print out these quantities, in the following format:
Comparison of US MSEs to literature MSE:
larger/equal: ___
smaller: ___
Hint: Your program should not open and parse the census file 10,000 or 560,000
times. Your program should only read the file once.
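As a sketch, the comparison might be organized like this; the function name and
parameters are hypothetical, the census populations are read once and passed in (per
the hint above), and it reuses the mean_squared_error and first_digit_histogram
sketches from earlier:

    import math
    import random

    def compare_us_mses_to_literature_mse(populations, literature_mse, n):
        """Draw 10,000 samples of n census populations each, compute each
        sample's MSE against the Benford distribution, and print the
        comparison with the literature MSE."""
        benford = [math.log10(1 + 1.0 / d) for d in range(1, 10)]
        larger_or_equal = 0
        for _ in range(10000):
            sample = [random.choice(populations) for _ in range(n)]
            if mean_squared_error(first_digit_histogram(sample),
                                  benford) >= literature_mse:
                larger_or_equal += 1
        print("Comparison of US MSEs to literature MSE:")
        print("larger/equal:", larger_or_equal)
        print("smaller:", 10000 - larger_or_equal)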
Submit part 2
You are almost done!
Look over your work to find ways to eliminate repetition in it. Then, refactor your
code to eliminate that repetition. This is important when you complete each part, but
especially important when you complete part 2. When turning in part 2, you should
refactor throughout the code, which will probably include more refactoring in part 1.
You will find that there is some similar code within each part that does not need to be
duplicated, and you will find that there are also similarities across the two parts. You
may want to restructure some of your part 1 code to make it easier for you to reuse in
part 2.
Now look over your work and make sure you practiced good coding style.
At the bottom of your answers.txt file, in the “Collaboration” part, state which
students or other people (besides the course staff) helped you with the assignment, or
that no one did.
At the bottom of your answers.txt file, in the “Reflection” part, reflect on this
assignment. What did you learn from this assignment? What do you wish you had
known before you started? What would you do differently? What advice would you
offer to future students?
Submit the following files:
• fraud_detection.py
• answers.txt
• scale-invariance.png
• population-data.png
• Benford-samples.png