
Data Science Lab

This document provides an overview of an introductory data science programming course. It discusses functions and homework assignments, including writing predictive models and working through examples from the textbook. It also references following up on additional resources and materials.

Uploaded by

018 Neelima
Copyright
© All Rights Reserved

Welcome (back) to IST 380!

Today: putting the programming in Data Science Programming…

Functions!

I have a sinking
feeling about all
of this fun?

Congratulations, Ravens!
Assignments…
Homework #1 is due tomorrow evening (2/5)
Getting started with R (tutorial + "quiz" + text)
There will be time this evening, if you'd like to use it…
started? finished? thoughts so far?

Homework #2 is due next Tuesday (2/12)

Pr #1: working through the text examples


Pr #2: writing some additional functions + a chance to consider probability problems

Pr #3: writing a predictive model…
Predictive? It was 101 years ago!
Following up…

http://www.propublica.org/article/everything-we-know-so-far-about-obamas-big-data-operation
Palm trees!

R in the
NYT

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
Our path
… R's toolsets, now building larger pieces…

Programming Skills

descriptive
statistics
Subject Expertise
Central Limit Theorem

Functions

predictions I predict we'll get here…


R Reference Card

A security blanket for some of us…


RStudio: an IDE that wraps R
"IDE like to know what this means!"

Editable files/scripts
Live data
Plots and help
Console interactions

You might want to start with Chapter 9's Integrated Development Environment…
Descriptive statistics
summary, hist, quantile, sd, mean

state populations

Chapter 6 reviews statistical descriptions using these data
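
A minimal sketch of these descriptive functions in R. It assumes R's built-in `state.x77` matrix (1977 population estimates, in thousands) as a stand-in for the textbook's state-population data, which may differ.

```r
# Assumption: state.x77 (built into R) stands in for the textbook's data.
pop <- state.x77[, "Population"]   # named numeric vector, one entry per state

summary(pop)                       # min, quartiles, median, mean, max
mean(pop)
median(pop)
sd(pop)
quantile(pop, c(0.25, 0.75))       # any set of quantiles can be requested

# A few very large states form a long right tail,
# which pulls the mean above the median.
mean(pop) > median(pop)

hist(pop, main = "State populations", xlab = "Population (thousands)")
```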


Generative statistics
rgeom, runif, rnorm, …, sample, replicate

distribution of samples of state populations

Chapter 7 reviews repeated sampling and the resulting distribution of means
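
A short sketch of repeated sampling with `sample` and `replicate`. As an assumption, the built-in `state.x77` populations stand in for the textbook's data, and the seed is arbitrary.

```r
# Assumption: state.x77 (built into R) stands in for the textbook's data.
pop <- state.x77[, "Population"]

set.seed(380)                              # arbitrary seed, for repeatability
one_sample <- sample(pop, size = 16)       # 16 states, drawn without replacement
mean(one_sample)

# Repeat the sampling 100 times and keep the mean of each sample.
sample_means <- replicate(100, mean(sample(pop, size = 16)))
hist(sample_means, main = "Means of 100 samples of 16 states",
     xlab = "Sample mean (thousands)")

# The other generators draw from named distributions instead of from data.
runif(3)                     # uniform on [0, 1]
rnorm(3, mean = 0, sd = 1)   # normal
rgeom(3, prob = 0.5)         # number of failures before the first success
```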


Try it!
(1) Load in the states' populations

Is the mean or median greater? Why?


(2) Which state is closest to the median?
Extra: Which is closest to the 42% quantile?

Create a sample of 16 states from the list.


Create a distribution of 100 such samples – histogram it!
(3) Increase the number of samples until you obtain
sample-distribution mean within ~1% of the real mean.
Find 5% and 95% quantiles of that distribution.
(p<.05)
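
One possible way to approach steps (2) and (3), sketched under the assumption that the built-in `state.x77` populations stand in for the course data. The variable names and the choice of 4000 samples are illustrative only.

```r
# Assumption: state.x77 (built into R) stands in for the course data.
pop <- state.x77[, "Population"]

# (2) The state whose population is closest to the median.
closest_to_median <- names(pop)[which.min(abs(pop - median(pop)))]
closest_to_median

# Extra: the state closest to the 42% quantile.
q42 <- quantile(pop, 0.42)
closest_to_q42 <- names(pop)[which.min(abs(pop - q42))]
closest_to_q42

# (3) With many samples, the mean of the sample means approaches the real mean.
set.seed(380)
sample_means <- replicate(4000, mean(sample(pop, size = 16)))
rel_err <- abs(mean(sample_means) - mean(pop)) / mean(pop)
rel_err                                    # relative error of the estimate

# 5% and 95% quantiles of the distribution of sample means.
bounds <- quantile(sample_means, c(0.05, 0.95))
bounds
```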
Big Ideas:

Central Limit Theorem

Law of Large Numbers

Monte Carlo Methods


Central Limit Theorem
The mean of a large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.

[Figures: state populations; means of 4000 samples (size 16) from the states]

Central Limit Theorem
Take N samples from this population and find the mean of each one. For large N, those sample means will form a bell curve around the true mean.

This is one special case of …
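
A hedged sketch of the experiment in the figure captions: 4000 samples of size 16. It assumes the built-in `state.x77` populations as the data and samples with replacement so the 16 draws are independent. Because the populations are strongly skewed, the histogram is only roughly bell-shaped at this sample size.

```r
# Assumption: state.x77 (built into R) stands in for the course data.
pop <- state.x77[, "Population"]
n <- 16

set.seed(380)
# Sampling with replacement keeps the n draws independent of each other.
sample_means <- replicate(4000, mean(sample(pop, size = n, replace = TRUE)))

# The sample means are centered near the true mean...
c(mean_of_sample_means = mean(sample_means), true_mean = mean(pop))

# ...and their spread is close to sigma / sqrt(n),
# where sigma is the population standard deviation (divisor N, not N - 1).
sigma <- sqrt(mean((pop - mean(pop))^2))
observed_se <- sd(sample_means)
expected_se <- sigma / sqrt(n)
c(observed = observed_se, expected = expected_se)

hist(sample_means, breaks = 40,
     main = "Means of 4000 samples (size 16)", xlab = "Sample mean (thousands)")
```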
Law of Large Numbers
in the limit, the average of the results obtained from a large
number of random trials of a process will converge to the
expected value for that process

chances of rolling doubles?


Law of Large Numbers
whose take-home message is: Try it lots of times and
just see what happens!
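
The "chances of rolling doubles" question can be tried lots of times, as a small sketch. Two fair six-sided dice are assumed, so the exact answer is 1/6; the running fraction of doubles should drift toward it as the number of rolls grows.

```r
set.seed(380)                       # arbitrary seed, for repeatability
rolls <- 10000

die1 <- sample(1:6, rolls, replace = TRUE)
die2 <- sample(1:6, rolls, replace = TRUE)
doubles <- die1 == die2             # TRUE whenever the two dice match

# Running fraction of doubles after 1, 2, 3, ... rolls.
running <- cumsum(doubles) / seq_along(doubles)
running[c(10, 100, 1000, 10000)]    # compare with 1/6

plot(running, type = "l", xlab = "Number of rolls", ylab = "Fraction of doubles")
abline(h = 1/6, lty = 2)            # the expected value
```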

Monte Carlo Methods

The two Monte Carlos

Making random numbers work for you!

Monte Carlo casino, Monaco: Bond. (James Bond)
Monte Carlo methods, Math/CS: Stanislaw Ulam (Los Alamos badge)
Monty Hall
Let's make a deal '63-'86
Sept. 1990

inspiring the "Monty Hall paradox"


Hw #2: Monte Carlo Monty Hall

… and a second example:

Both envelopes hold some positive amount of money (in a check or IOU),
but one of these two envelopes holds twice as much money as the other.

Should you switch or stay?
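
One way to explore the envelope question with a simulation, offered as a sketch and not as the homework's intended solution. It assumes one particular model: the two amounts are fixed in advance as `small` and `2 * small`, and the player picks one envelope at random. The function name `envelope_trial` is hypothetical.

```r
# Assumption: the pair (small, 2 * small) is fixed before the player chooses.
envelope_trial <- function(small = 10, sors = "switch") {
  amounts <- c(small, 2 * small)
  pick <- sample(2, 1)                     # index of the envelope picked first
  if (sors == "switch") pick <- 3 - pick   # take the other envelope instead
  amounts[pick]                            # money the player ends up with
}

set.seed(380)
avg_switch <- mean(replicate(10000, envelope_trial(10, "switch")))
avg_stay   <- mean(replicate(10000, envelope_trial(10, "stay")))
c(switch = avg_switch, stay = avg_stay)
```

Under this model the two averages should come out about the same. A different model of how the amounts are chosen can give a different answer, which is part of what makes the problem interesting.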


Functions in R (Chapter 9)
Functions in R

A function to add two inputs…

thoughts? oddities? niceties?
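
The slide's code is not preserved here, so the following is only a minimal sketch of what a two-input adding function can look like in R. The name `add_two` is hypothetical.

```r
# A function to add two inputs.
add_two <- function(a, b) {
  a + b            # the last evaluated expression is the return value
}

add_two(3, 4)              # 7
add_two(c(1, 2, 3), 10)    # 11 12 13, since + is vectorized
```

Two things worth noticing: no explicit `return()` is needed, and the same function works on whole vectors without a loop.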


Functions in R

A "guessing-game" function… Let's fix it!


42 in 72-point font!
Slide credits: thanks to JHU's R. D. Peng
nested conditionals…
What's going on here?
How are these two conditionals different?

cat is better than print
seq_along creates a vector of indices
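
A small illustration of both remarks; the vector `fruits` is made up for the example.

```r
fruits <- c("apple", "banana", "cherry")   # made-up example data

# seq_along(fruits) is 1, 2, 3: one index per element.
for (i in seq_along(fruits)) {
  # cat pastes its arguments together, without the [1] prefix or quotes.
  cat("Item", i, "is", fruits[i], "\n")
}

print(fruits[1])   # print shows the [1] index and the quotes

# seq_along is also safe on empty vectors; 1:length(x) is not.
empty <- character(0)
seq_along(empty)   # integer(0), so a for loop over it runs zero times
1:length(empty)    # 1 0, so a for loop over it would run twice
```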
Thoughts?
Could we write a Monty Hall function?

MH <- function()

… that runs one three-curtain trial?

MHall(chosen_curtain=1, sors="switch", verbose=TRUE)

… another to allow us to turn off printing?

Mhall_N(chosen_curtain=1, sors="switch", N=300)


… another to run it N times?

MystEnv_N(first_env=10, sors="switch", N=10)


… and another to try the envelope-switching?
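
A hedged sketch of how the Monty Hall functions above might be written. The argument names follow the calls on the slide, but the bodies are an assumption and not the course's reference solution; `MHall_N` is spelled here with a capital H.

```r
# One three-curtain trial. Returns TRUE if the player wins the prize.
MHall <- function(chosen_curtain = 1, sors = "switch", verbose = FALSE) {
  prize <- sample(1:3, 1)                        # prize placed uniformly at random
  # Monty opens a curtain that is neither the player's pick nor the prize.
  openable <- setdiff(1:3, c(chosen_curtain, prize))
  opened <- openable[sample(length(openable), 1)]
  if (sors == "switch") {
    final <- setdiff(1:3, c(chosen_curtain, opened))   # the one remaining curtain
  } else {
    final <- chosen_curtain
  }
  won <- (final == prize)
  if (verbose) {
    cat("prize:", prize, " opened:", opened, " final:", final, " won:", won, "\n")
  }
  won
}

# Run N trials with printing off and return the fraction of wins.
MHall_N <- function(chosen_curtain = 1, sors = "switch", N = 300) {
  mean(replicate(N, MHall(chosen_curtain, sors, verbose = FALSE)))
}

set.seed(380)
p_switch <- MHall_N(chosen_curtain = 1, sors = "switch", N = 10000)
p_stay   <- MHall_N(chosen_curtain = 1, sors = "stay",   N = 10000)
c(switch = p_switch, stay = p_stay)   # compare with 2/3 and 1/3
```

Indexing with `sample(length(openable), 1)` avoids a known surprise: `sample(x, 1)` on a single number `x` samples from `1:x` instead of returning `x`.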
So, what is Machine Learning?

The goal of machine learning … or predictive statistics/analytics, is to find a function that yields an output from a previously-unseen input, based on the data available about the process in question.

This week's final problem asks you to write such a function for the Titanic survivor dataset.
The Titanic

April 15, 1912

1502 out of the 2224 passengers and crew died in the sinking.

What characteristics did the survivors share?
The Data

here are the 11 columns

There are 742 rows and 11 columns in the training data.


Our goal

… is to write a function that takes in a row of new data and outputs whether that passenger would survive (1) or not (0).
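
As a sketch of the simplest function with this shape, a baseline can ignore its input and always predict the majority outcome. The name `predict_baseline` is hypothetical, and the slide's own first predictor is not preserved here, so this is not necessarily it.

```r
# Baseline: ignore the row and predict the most common outcome, "did not survive".
predict_baseline <- function(row) {
  0
}

# Using the counts quoted earlier (1502 deaths out of 2224 on board),
# always answering 0 would be right for about this share of everyone on board.
# The training data may have a somewhat different share.
baseline_share <- 1502 / 2224
round(baseline_share, 3)   # 0.675
```

Any more elaborate predictor should beat this number to be worth its extra complexity.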
A first predictor
A second predictor

Does the data match the famous emergency cry?
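
If the cry in question is "women and children first", it can be turned into a rule-based predictor, sketched below. The column names `sex` and `age`, the age cutoff of 16, and the three toy rows are all assumptions made for illustration; they are not taken from the course's data file.

```r
# Predict survival (1) for women and for children, otherwise 0.
# Assumption: each row has columns named `sex` and `age`.
predict_wcf <- function(row, child_age = 16) {
  is_female <- identical(tolower(as.character(row$sex)), "female")
  is_child  <- !is.na(row$age) && row$age < child_age   # missing age: not a child
  if (is_female || is_child) 1 else 0
}

# Made-up rows for illustration only (not real passengers).
toy <- data.frame(sex = c("female", "male", "male"),
                  age = c(30, 8, NA),
                  stringsAsFactors = FALSE)

preds <- sapply(seq_len(nrow(toy)), function(i) predict_wcf(toy[i, ]))
preds   # 1 1 0
```

With real training data, comparing such predictions against the known outcomes, for example with `mean(preds == actual)`, shows how well the rule matches what happened.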
Testing our functions…
Try it!

Help is available either with hw#1 (getting started with R) or hw#2 (writing functions/Titanic) this evening during lab time…

Good luck with everything this week!


Lab !
CS vs. IS and IT ?
greater integration
system-wide issues

smaller details
machine specifics

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
Where will IS go?

Where will IT go?
The bigger picture

Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance

Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" (statistics, machine learning, CS) background?
state reminders…
Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…
This class is truly seminar-style: we're developing expertise in this field together.

be sure to set up your login + profile for the submission site…
