CS373 Homework 1

Due date: Monday, February 11, 11:59pm (submit pdf on Gradescope)

Any use of late days must be explicitly mentioned at the top of your submission.

Homework must be submitted as a PDF; answers should be typed.

Instructions for submission
Submit a single PDF on Gradescope with all your answers. Make sure you
select the page corresponding to the beginning of each answer, else points
might be deducted. For part I, show the steps you took. For part II, include the R code
you used for analysis, along with its output and any plots required by the question. Please
label all plots with the question number. Your homework must be typed and must contain
your name and Purdue ID.

1 Part I: Basic Probability and Statistics

A. (6 pts) Consider an experiment where a fair die is rolled repeatedly until the first time
a 3 is observed.

(a) What is the sample space for this experiment? What is the probability that the
die turns up a 3 after i rolls?
(b) What is the expected number of times we roll the die?
(c) Let E be the event that the first time a 3 turns up is after an even number of
rolls. What set of outcomes belong to this event? What is the probability that E
occurs?
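
Analytic answers to "repeat until first success" questions can be sanity-checked by simulation. The following is a minimal sketch in R, assuming a hypothetical fair coin (success probability 0.5) rather than the die in the question; the names p, n_sims, and trials are illustrative only.

```r
set.seed(42)

p <- 0.5          # hypothetical success probability (a fair coin, not the die above)
n_sims <- 100000  # number of simulated experiments

# rgeom() returns the number of failures before the first success,
# so adding 1 gives the number of trials up to and including the success.
trials <- rgeom(n_sims, prob = p) + 1

est_mean <- mean(trials)            # estimate of the expected number of trials
est_even <- mean(trials %% 2 == 0)  # estimate of P(first success on an even trial)

print(est_mean)
print(est_even)
```

A simulation like this only approximates the answer; it does not replace showing the steps of the derivation.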

B. (5 pts) Two standard dice are rolled. Let E be the event that at least one of the dice
lands on 5; let F be the event that the sum of the dice is even; and let G be the event
that the sum is 7. Compute the following:

(a) P(E ∩ F)
(b) P(E ∪ F)
(c) P(E ∪ G)
(d) P(E ∩ ¬G)
(e) P(E ∪ F ∪ G)

C. (4 pts) We are given four coins C1, C2, C3, and C4. Coin Ci is chosen randomly with
probability proportional to i for i = 1, 2, 3, 4. Let Hi represent the event that heads is
observed when Ci is tossed; P(H1) = 1/4; P(H2) = 1/3; P(H3) = 1/2; P(H4) = 2/3.

(a) What is the probability of selecting coin C3 ?

(b) If a coin is selected at random and tossed, find the conditional probability that
the coin is C3 given that a tail is observed. (State this as a conditional probability
and show the calculation.)

D. (4 pts) 52% of the students at a particular college are female. 5% of the students in the
college are majoring in computer science. 0.55% of the students are women majoring
in computer science.
(a) If a student is selected at random, find the conditional probability that the student
is female given that they are majoring in computer science. (State this as a
conditional probability and show the calculation.)
(b) If a student is selected at random, find the conditional probability that the student
is majoring in computer science given that they are female. (State this as a
conditional probability and show the calculation.)
(c) Now suppose that the overall proportion of female students increases to 57% and
that the conditional probability from D(a) changes (i.e., increases or decreases) to
15%. Compute the updated conditional probability that a student is majoring in
computer science given that they are female. (Assume that the overall proportion
of students majoring in CS stays the same.)
E. (6 pts) A system is built using 3 disks d1, d2, d3 having probabilities of failure 0.01,
0.03, and 0.05, respectively. Suppose the disks fail independently.
(a) Let W denote the event that the system will work, which happens if at least one
of the disks works. Compute P(W), the probability that the system will work.
(b) Let A denote the event that at least one of the following happens: (i) d1 and d2
work; (ii) d3 works. If the system works when event A occurs, then compute the
probability that the system will work.
(c) Considering the setting of E(b), given that d1 works, what is the conditional
probability that event A will occur and the system works?
F. (6 pts) Let an experiment consist of rolling three standard 6-sided dice.
(a) Compute the expected value of the sum of the rolls.
(b) Compute the variance of the sum of the rolls.
(c) If X represents the maximum value that appears in the three rolls, what is the
expected value of X?

2 Part II: R
In this assignment, you will use the R statistical package to explore, transform, and analyze
data. Based on your analysis you will formulate hypotheses about the data. To get started,
do the following:
• Download and install R from: http://cran.r-project.org/
• Download the Yelp dataset from Piazza.
This data set is part of the Yelp academic dataset and consists of data about 24,813
restaurants. The datafile yelp.csv contains 28 attributes: 6 numeric and 22 discrete.
The first row of the data file is a header row with the names of the attributes where
names are separated by a comma (,).

Use R to analyze the Yelp data and complete the questions below.

3 Data import and summarization

Read the data into R using the read.table() function. Use the argument sep="," to specify
the column delimiter, the argument header=TRUE to read in the column names, the argument
quote="\"" to read in the quoted fields, and the argument comment.char="" to
treat # characters as text rather than comments.
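
The read.table() arguments described above can be tried on a small file first. This is a minimal sketch, assuming a made-up three-column file written to a temporary path; the column names and values are hypothetical and are not taken from yelp.csv.

```r
# Toy stand-in for yelp.csv (hypothetical columns and values).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,stars,categories",
  "\"Joe's #1 Diner\",4.5,\"Diners,Breakfast\"",
  "\"Pizza Place\",3.0,\"Pizza,Italian\""
), path)

d <- read.table(path,
                sep = ",",          # comma-delimited columns
                header = TRUE,      # first row holds the attribute names
                quote = "\"",       # only double quotes delimit fields
                comment.char = "")  # keep '#' as ordinary text

print(names(d))
print(summary(d))
```

With quote="\"" the apostrophe in the first name is not treated as a quote character, and with comment.char="" the text after # is not dropped.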

(a) (2 pts) Print a summary of the data using the summary() function.

(b) (2 pts) Print the names of the columns in the table using the names() function.

4 1D plots
A. (a) (3 pts) Plot a histogram of the ‘reviewCount’ attribute. Use the hist() function
with its default values and make sure to title the plot with the name of the attribute
for clarity.
(b) (3 pts) Compute the logged values for ‘reviewCount’ (you can use log(d$column_name)
to compute the log of all the values in a column). Plot a histogram of the logged
values.
(c) (3 pts) Plot a density plot of the logged values of the ‘reviewCount’ attribute using
the density() function.
(d) (3 pts) Discuss the similarities and differences between the three plots and the
information they convey about the distribution of ‘reviewCount’ values in the data.
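
The three plotting calls named in parts (a) to (c) can be sketched as follows. The data here are synthetic and right-skewed, generated only for illustration; they are an assumption and say nothing about the actual ‘reviewCount’ distribution.

```r
set.seed(1)

# Synthetic, right-skewed stand-in for a count attribute (not the Yelp data).
# Adding 1 keeps every value positive so that log() stays finite.
d <- data.frame(reviewCount = rpois(500, lambda = exp(rnorm(500, 3, 1))) + 1)

# Histogram of the raw values, titled with the attribute name.
hist(d$reviewCount, main = "reviewCount")

# Histogram of the logged values.
logged <- log(d$reviewCount)
hist(logged, main = "log(reviewCount)")

# density() returns a kernel density estimate; plot() draws it.
dens <- density(logged)
plot(dens, main = "Density of log(reviewCount)")
```

If a real column contains zeros, log() returns -Inf for them, and those values need to be handled before plotting.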

B. (2 pts) Plot a barplot of the ‘state’ attribute to show the frequency of each value. Use
the table() function to get the counts for each value and the names() function to get
the names of the values in the table. Use the barplot() function with the names.arg
argument to label the bars with the appropriate value. Again, make sure to title the
plot with the name of the attribute for clarity.
(Note that this will look like a histogram but for nominal values. In small renderings
of this plot, you might not see all the state name labels, but if you stretch the window
you will be able to see all the labels.)
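
The table(), names(), and barplot() steps described above fit together roughly as in this sketch. The six values below are a hypothetical stand-in for the ‘state’ column.

```r
# Hypothetical nominal column standing in for 'state'.
d <- data.frame(state = c("AZ", "NV", "AZ", "WI", "AZ", "NV"))

# table() counts how often each distinct value occurs.
counts <- table(d$state)

# names(counts) holds the distinct values, used here as bar labels.
barplot(counts, names.arg = names(counts), main = "state")
```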

5 Sampling and transforming data

A. (4 pts) The attributes ‘categories’ and ‘recommendedFor’ each contain a comma-separated
list of values associated with each restaurant. Compute two new boolean features,
‘servesPizza’ and ‘goodForBreakfast’, with a value of TRUE if the list contains Pizza
(in ‘categories’) or breakfast (in ‘recommendedFor’), respectively, and FALSE otherwise.
You can use the function grepl(str, f$column_name) to check whether the values in
column_name contain the string str.

Append the two new columns to the original data frame, using cbind() to increase the
number of features by 2.
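
A minimal sketch of the grepl() and cbind() steps, assuming a three-row toy data frame with made-up list values (the real column contents may differ):

```r
# Toy data frame with hypothetical comma-separated list values.
d <- data.frame(
  categories     = c("Pizza,Italian", "Diners,Breakfast", "Sushi"),
  recommendedFor = c("dinner", "breakfast,brunch", ""),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE for each element that contains the pattern.
# Matching is case-sensitive by default.
servesPizza      <- grepl("Pizza", d$categories)
goodForBreakfast <- grepl("breakfast", d$recommendedFor)

# cbind() appends the two logical vectors as new columns.
d2 <- cbind(d, servesPizza = servesPizza, goodForBreakfast = goodForBreakfast)

print(d2)
```

Because matching is case-sensitive, "Breakfast" in ‘categories’ does not affect the ‘goodForBreakfast’ column in this toy example.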

B. (a) (3 pts) Compute the quantiles (using quantile()) for the ‘checkins’ attribute.
(b) (3 pts) Select a subset of the data with ‘checkins’ value ≤ the 1st quartile (25th
percentile). You can use subset() or select from the data frame with [ ] operations.
(c) (3 pts) Print a summary of the above subset for the following attributes only:
‘checkins’, ‘stars’, ‘noiseLevel’, ‘priceRange’, ‘reviewCount’, ‘goodForGroups’, and
compare them to their summary for the full dataset.

Discuss any differences that you find in the distributions of these attributes.
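
The quantile and subsetting steps in parts (a) to (c) can be sketched as below. The eight rows are invented for illustration, and only two of the listed attributes are included.

```r
# Toy data frame with hypothetical values.
d <- data.frame(checkins = c(2, 5, 9, 14, 20, 33, 47, 80),
                stars    = c(3, 3.5, 4, 2.5, 4.5, 4, 3.5, 5))

# quantile() with default arguments returns the 0%, 25%, 50%, 75%, 100% points.
q  <- quantile(d$checkins)
q1 <- q[["25%"]]   # first quartile

# Two equivalent ways to keep rows at or below the first quartile.
low     <- subset(d, checkins <= q1)
low_alt <- d[d$checkins <= q1, ]

# Summaries of the chosen attributes for the subset and for the full data.
print(summary(low[, c("checkins", "stars")]))
print(summary(d[, c("checkins", "stars")]))
```

One difference to keep in mind: if the column contains NA values, subset() drops those rows, while the [ ] form returns rows filled with NA for them.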

6 2D plots and correlations

A. (6 pts) Plot a scatterplot matrix (using plot()) for the five attributes:
‘stars’, ‘reviewCount’, ‘checkins’, ‘longitude’, ‘latitude’.

Identify which pair of attributes exhibits the most association (as you can determine
visually) and discuss if this is interesting or expected, given your domain knowledge.

B. (6 pts) Calculate the pairwise correlation among the above five attributes using the
cor() function.

Identify the pair of attributes with largest positive correlation and the pair with largest
negative correlation. Report the correlations and discuss how they match your
visual assessment in part A.
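
A scatterplot matrix and a correlation matrix can be produced as in the sketch below. The four columns a, b, c, e are synthetic and constructed so that one pair is strongly positively related and another strongly negatively related; the helper pair_at() is a hypothetical convenience function, not part of R.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
z <- rnorm(n)

# Synthetic columns (not the Yelp attributes).
d <- data.frame(
  a = x,
  b = 2 * x + rnorm(n, sd = 0.5),   # built to track a closely
  c = z,
  e = -z + rnorm(n, sd = 0.5)       # built to move against c
)

# plot() on a data frame of numeric columns draws a scatterplot matrix.
plot(d)

# cor() returns the matrix of pairwise Pearson correlations.
r <- cor(d)
print(round(r, 2))

# Keep only the upper triangle so each pair appears once and the
# diagonal (always 1) is ignored.
upper <- r
upper[lower.tri(upper, diag = TRUE)] <- NA

# Hypothetical helper: names of the pair holding a given correlation value.
pair_at <- function(value) {
  idx <- which(upper == value, arr.ind = TRUE)[1, ]
  c(rownames(upper)[idx[["row"]]], colnames(upper)[idx[["col"]]])
}

hi_pair <- pair_at(max(upper, na.rm = TRUE))  # largest positive correlation
lo_pair <- pair_at(min(upper, na.rm = TRUE))  # largest negative correlation
print(hi_pair)
print(lo_pair)
```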

C. (6 pts) Plot a boxplot (using boxplot()) for each of the following four attributes vs.
the ‘attire’ attribute: ‘checkins’, ‘reviewCount’, ‘stars’, ‘latitude’.
Make sure to label both axes of the plot with the appropriate attribute names.

(a) Identify the attribute that exhibits the most association with ‘attire’ (as you can
determine visually) and discuss whether this is interesting or expected, given your
domain knowledge.
(b) For the attribute identified above, calculate its interquartile range for each value
of ‘attire’ (i.e., a separate IQR for the “casual” instances, the “dressy” instances,
the “formal” instances, and the instances with “ ” for ‘attire’). You can do this
with the subset() and quantile() functions. Calculate the overlap between the
four IQRs. Discuss whether these results support the conclusion you made based
on visual inspection.
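
The grouped boxplot and per-group IQR computation might look like the sketch below. The data are invented, only three ‘attire’ values are used, and "overlap" is interpreted here as the length of the interval common to all groups' IQRs; that interpretation is an assumption, and other definitions (for example, pairwise overlap) are equally plausible. The helper iqr_for() is hypothetical.

```r
# Toy data with hypothetical values.
d <- data.frame(
  stars  = c(3, 3.5, 4, 4.5,  3, 4, 4.5, 5,  4, 4.5, 5, 5),
  attire = rep(c("casual", "dressy", "formal"), each = 4),
  stringsAsFactors = FALSE
)

# One box per 'attire' value, with both axes labeled.
boxplot(stars ~ attire, data = d, xlab = "attire", ylab = "stars")

# Hypothetical helper: 25th and 75th percentiles for one attire value.
iqr_for <- function(level) {
  vals <- subset(d, attire == level)$stars
  quantile(vals, probs = c(0.25, 0.75))
}

# Matrix with rows "25%" and "75%" and one column per attire value.
bounds <- sapply(unique(d$attire), iqr_for)
print(bounds)

# Interval shared by all IQRs (assumed definition of overlap);
# a length of 0 means there is no common interval.
overlap_low  <- max(bounds["25%", ])
overlap_high <- min(bounds["75%", ])
overlap <- max(0, overlap_high - overlap_low)
print(overlap)
```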

7 Identifying potential hypotheses (20 pts)
During your exploration above, investigate other aspects of the data. Explore relationships
between variables by assessing plots, computing correlation, or other numerical analysis.

Identify TWO possible relationships in the data (other than the ones specified in earlier
questions) and formulate hypotheses based on the observed data. For each of the two
identified relationships:

(a) Include a plot illustrating the observed relationship (between at least two variables).

(b) State whether the variables are discrete or continuous and what type of plot is relevant
for comparing these two types of variables.

(c) Formulate a hypothesis about the observed relationship as a function of two random
variables (e.g., X is associated with Y).

(d) Write the hypothesis as a claim in English, relating it to the attributes in the data.

(e) Identify the type of hypothesis.
