Stats Project


MATH 1040 Skittles Data Project

In my Math 1040 statistics course in fall 2019, I was required to complete a data-gathering project. The project involved both individual work and group discussions with fellow students. For the first part of the assignment, each student in the class bought a bag of Skittles and counted the number of candies of each color in the bag. The class data was compiled and used for a number of exercises involving methods a statistician might use in their work. In the second part of the project, I determined the proportion of each color of candy and created charts of the total number of candies of each color in the entire class. These charts compared the class data to my own data and made it easier for me and others to notice differences and similarities between my bag of candy and those of the other students in the class. In the third part of the project, I used the class Skittles data to create statistical summaries: the mean, the standard deviation, and a five-number summary. I also made a frequency histogram of the total number of candies as well as a box plot. Each chart is followed by a description of what it shows. The last part of the project involved confidence intervals. I found three confidence intervals, for the population proportion, mean, and standard deviation. Each confidence interval has a written analysis describing what it means.


Part #1: For this part of the project, I determined the proportion of each color of candy and created charts of the total number of candies of each color in the entire class. These charts compared the class data to my own data and made it easier for me and others to notice differences and similarities between my bag of candy and those of the other students in the class.

Hypothesis: I believe that once the data is collected, there will be significantly more green and yellow Skittles than red, orange, or purple.


This graph shows the colors of Skittles versus the number of candies of each color in a bag of Skittles. The class compiled their data so that each student could compare their own bag to everyone else's in the class. The x-axis represents the different colors of Skittles found in each student's bag. The y-axis represents the relative amount of Skittles of each color that every student in the class had in their own bag.


Skittles Colors Paragraph:

Before I counted what was in my bag of Skittles, I predicted that green and yellow would make up the largest share of the count, but I was wrong. It turns out that in my particular bag, green and red had the smallest counts. The rest of the class's data, however, seemed to stay true to my prediction, with green and yellow having the largest counts in their bags. I was surprised that red had an 18% proportion, the smallest count next to purple. Overall, the data seems very uniform: all the proportions are around 20%.

Group Discussion:

Does the class data represent a random sample?

Yes, the class data does represent a random sample. Although each student was asked to buy their own bag of Skittles, and not every bag of Skittles in the region had an equal chance of being selected, the distribution of Skittles from the central plant or warehouse was most likely random. The Skittles company most likely does not count colors as it loads the bags and simply fills them by weight. Assuming students did not make any biased decisions about which bag to grab off the shelf, every bag produced had an equal chance of being shipped to any location in the country and being selected at random by a student in the class.

What would the population be?

In this study, the sample is the class data. Since not everyone in the class is currently living in the same state, the population would be all 2.17-ounce bags of Skittles in the United States. Because separate manufacturing plants operate overseas, the population can reasonably be extended only as far as the United States distribution network.


Part #2: I used the class Skittles data to create statistical summaries: the mean, the standard deviation, and a five-number summary. I also made a frequency histogram of the total number of candies as well as a box plot. Each chart is followed by a description of what it shows.

This table was made from the class totals using a program called StatCrunch. It shows the mean, standard deviation, and five-number summary for each color of Skittles that the students found in their bags, with all of the class data compiled together. Var2 shows the data for the red Skittles, Var3 for the orange Skittles, Var4 for the yellow Skittles, Var5 for the green Skittles, and Var6 for the purple Skittles. Comparing all the colors, one can see that yellow and green have the highest averages per student's bag of Skittles.

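As a rough illustration of how the summary statistics in such a table can be computed, the following Python sketch finds the mean, sample standard deviation, and five-number summary for one color. The counts are made up for illustration and are not the class data, and the function name `five_number_summary` is hypothetical. Quartile conventions vary between programs, so StatCrunch may report slightly different quartiles for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # "inclusive" treats the data as the whole group of interest;
    # other quartile conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical yellow-candy counts for eight bags (not the class data).
yellow_counts = [10, 12, 13, 13, 14, 15, 11, 12]

mean = statistics.mean(yellow_counts)
sd = statistics.stdev(yellow_counts)  # sample standard deviation (divides by n - 1)
summary = five_number_summary(yellow_counts)

print(f"mean={mean:.2f}, sd={sd:.2f}, five-number summary={summary}")
```
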
This image shows a histogram of the Skittles colors in each student's bag of Skittles. The histogram shows the frequency distribution. One can see that the histogram is roughly bell-shaped but appears somewhat left-skewed.


This box plot represents the frequency of the colors of Skittles in each student's bag. Analyzing this box plot, one can see that the distribution is skewed to the left.

4. Number of Candies:
Using StatCrunch, one was able to analyze the average, the standard deviation, the five-number summary, and the frequency of the colors of Skittles in each student's bag, and also to analyze the box plot. Analyzing the individual images, one can tell that the distribution is left-skewed. One was also able to determine that yellow and green are the most common colors in each student's bag, because the average is highest for those two colors. I was surprised by yellow and green being the most common colors because I thought that red would be the most common. I can also see that the counts of each color don't differ by much per bag. I would have assumed that the factory fills each bag by machine at random, without making sure the colors are evenly distributed, but the analysis shows that the distribution is fairly even throughout each student's bag.

Group Discussion:

Categorical variables are also known as qualitative variables. These variables can be put into different categories, such as model of car, color, or gender. Quantitative data is data that can be ordered and measured. The number of candies in a bag of Skittles is quantitative, whereas the color of the candy is categorical.

Graphing quantitative data is best done with histograms, stem-and-leaf plots, dot plots, bar graphs, and box plots. All of these types of graphs can be used to display the quantity of a certain variable. Categorical data is best graphed using a method that lets you compare the groups to one another. A bar graph can work for both quantitative and categorical data, but a pie chart doesn't make sense for quantitative data because it compares categories to the whole. A pie chart would effectively show the percentage of each color of Skittles in a bag (categorical data), but it cannot effectively show the number of Skittles in a bag (quantitative data).

When it comes to calculations, the mean and median only make sense for quantitative data. The mean is the average quantity of something in an entire sample, so it is a meaningful calculation only when applied to quantitative data. The median represents the middle value of the data and likewise makes sense only when applied to quantitative data. The best measure of central tendency for categorical data is the mode. When looking at the colors of candy in a bag of Skittles, you cannot find the average color or the median color, but you can establish which color occurs most often. Likewise, when looking at the number of candies in a bag of Skittles, the most useful measures of center are the mean and median number of Skittles.
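The distinction can be sketched in a few lines of Python: the mode is computed on the categorical variable (color), while the mean and median are computed on the quantitative variable (candies per bag). Both lists below are hypothetical values chosen for illustration, not the class data.

```python
import statistics

# Hypothetical data for illustration only (not the class data).
candy_colors = ["red", "green", "yellow", "green", "purple", "green", "orange"]
candies_per_bag = [58, 60, 57, 61, 59]

# Categorical data: only the mode is meaningful.
most_common_color = statistics.mode(candy_colors)

# Quantitative data: the mean and median are meaningful.
mean_per_bag = statistics.mean(candies_per_bag)
median_per_bag = statistics.median(candies_per_bag)

print(most_common_color, mean_per_bag, median_per_bag)
```
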

Part #4: The last part of the project involved confidence intervals. I found three confidence intervals, for the population proportion, mean, and standard deviation. Each confidence interval has a written analysis describing what it means.

99% Confidence Interval estimate for the population proportion of yellow candies

X= 410

n= 1874

Z-value for 99%: 2.576


p= 410/1874 = 0.2188

99% Confidence Interval Estimate: (0.194, 0.24228)

Confidence intervals for a population proportion are used to estimate, with the specified degree of confidence, the proportion of a characteristic found within a population. In relation to the Skittles, we are 99% confident that the proportion of yellow candies among all Skittles in the population falls between 0.194 and 0.24228.
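For readers who want to retrace the calculation, here is a minimal Python sketch of a confidence interval for a proportion, assuming the usual normal-approximation formula, p ± z·sqrt(p(1 − p)/n). The function name is hypothetical, and the report does not state which formula the calculator used, so the result should land close to the interval reported above, though small differences are possible depending on rounding and method.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Values taken from the report: 410 yellow candies out of 1874 in total.
lower, upper = proportion_confidence_interval(410, 1874, 0.99)
print(f"99% CI for the proportion of yellow: ({lower:.3f}, {upper:.3f})")
```
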

95% Confidence Interval estimate for the population mean number of skittles per bag

Sample mean= 58.56 (1874/32)

Sx (sample standard deviation, dividing by n − 1)= 2.422

Standard deviation (σx, dividing by n)= 2.384

n= 32

95% Confidence Interval Estimate: (57.876,59.258)

Confidence interval estimates of the population mean use sample data to give an interval that, with the specified degree of confidence, should contain the mean of a characteristic in the population. In this case, we are 95% confident that the population mean number of Skittles per bag is between 57.876 and 59.258.
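The general shape of this calculation can be sketched in Python as mean ± critical value × s/sqrt(n). The report does not say which critical value or which standard deviation the calculator used, so this sketch makes its own assumptions: it uses the sample standard deviation (2.422) and a normal critical value. A t critical value with n − 1 degrees of freedom is the usual textbook choice when the population standard deviation is unknown and would give a slightly wider interval. Because of these assumptions, the output need not reproduce the interval reported above. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence):
    """Approximate confidence interval for a population mean (normal critical value)."""
    # Assumption: a z critical value is used as a simplification for n = 32.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Summary values taken from the report.
lower, upper = mean_confidence_interval(58.56, 2.422, 32, 0.95)
print(f"95% CI for the mean candies per bag: ({lower:.3f}, {upper:.3f})")
```
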

The purpose of taking sample data and calculating statistics from it is to apply those statistics to a larger population. Since a population is larger than a sample, how well a sample statistic estimates a population parameter is an issue. A confidence interval helps address that issue by providing a range of values that the population parameter is likely to fall within. The intervals are constructed with a certain level of confidence, expressed as a percentage such as 95% or 99%. This means that if samples were drawn from the same population on many occasions and a confidence interval calculated each time, about that percentage of the intervals would contain the true parameter.
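This repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a made-up true proportion of 0.2, draws many samples of 500 candies, builds a 95% normal-approximation interval from each, and counts how often the interval contains the true value. All parameter values and the function name are hypothetical, and the random seed is fixed only so that the run is repeatable.

```python
import math
import random
from statistics import NormalDist


def coverage_rate(true_p, n, confidence, trials, seed=0):
    """Fraction of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Simulate one sample of n candies; each is "yellow" with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials


rate = coverage_rate(true_p=0.2, n=500, confidence=0.95, trials=2000)
print(f"Share of intervals containing the true proportion: {rate:.3f}")
```

The printed share should come out near 0.95, which is what the confidence level describes.
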

Conclusion:

From taking this course, I became a lot more familiar not only with my calculator but also with how to collect data and present that data properly. I came to better understand proportions and variables, the value of knowing how to calculate means and standard deviations, how to read graphs, box plots, and histograms, and how to better communicate the data that I was able to find. This project was challenging at times, as was the class. I am grateful for everything that I have learned, and I know it will help reinforce my learning throughout the rest of my college career and my life.
