d 01 Introduction
Waitlist
Course Web Site
Wiki
Course Recordings
Prerequisites
This room
Grading
Books
Spark & Related Tools
Data Science
2
Waitlist - How to get into a Class
3
Waitlist - How it works
When a seat in a class becomes available, the top-priority student is added
You cannot be enrolled in two classes that meet at the same time
If the waitlist system adds you to a class, it will drop you from classes that meet at the same time
In the second week of classes, students are added only if the instructor releases the seats
4
Can you add me to the Course?
5
Waitlist FAQ
No Grader
6
Waitlist FAQ
No
Why not?
No grader
7
Waitlist FAQ
Feb 4
8
Waitlist FAQ
What are the odds of that many people dropping the class?
9
Grading
1 exam
4-6 assignments
Project
10
Course Website Demo
11
What are the Tools & Methods?
Programming language - Python
Programming Notebook
Visualization
scatter, box, violin, qq, line, density plots
errorbar, histogram, beeswarms
Statistics
mean, variance, quantiles, distributions
confidence intervals, correlation, covariance
regression, goodness-of-fit, chi-squared test
Bayes' theorem
Machine Learning
k-means, DBSCAN, Decision & Regression trees
Streaming - Kafka
Database - Cassandra
Hadoop, Spark, Pig, Mahout, etc.
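A few of the statistics listed above (mean, variance, quantiles, covariance, correlation) can be sketched with only the Python standard library. This is a minimal illustration, not course material: the sample data and the helper names `covariance` and `correlation` are assumptions made up for the example.

```python
import statistics

# Hypothetical sample data, invented for illustration only
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
weights_kg = [55.0, 60.0, 68.0, 72.0, 80.0]


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


mean_height = statistics.mean(heights_cm)
var_height = statistics.variance(heights_cm)        # sample variance
quartiles = statistics.quantiles(heights_cm, n=4)   # three cut points
cov_hw = covariance(heights_cm, weights_kg)
corr_hw = correlation(heights_cm, weights_kg)

print("mean:", mean_height)
print("variance:", var_height)
print("quartiles:", quartiles)
print("covariance:", cov_hw)
print("correlation:", round(corr_hw, 4))
```

In practice NumPy and SciPy provide these functions directly; the hand-written versions here only show what is being computed.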
12
What will we be doing
Installing programs
Python, Jupyter, Spark, Kafka, Cassandra
Analyzing data
Distributing data
Visualizing Data
Using Spark
13
What will we be doing
~2 weeks
Intro, Python
~5 weeks
Statistics, ML, NumPy, SciPy
Visualization
~3 weeks
Spark
~2 weeks
Kafka & Cassandra
14
Notebooks - Documentation, development
Python, Julia, R
Others supported by the community - Java, Fortran, Haskell, Ruby, Go, Scala, many more
Other notebook systems
Visualization
Python, Julia, R, Matlab
ML
Python (C), Julia, Matlab, R?
Spark - Large Data Sets
Scala, Java, Python, R
Kafka - Streaming Data
Java, JVM languages, Python, Julia (except for offsets), Others - No R Client
Cassandra - Data Storage
Java, Python, R - sort of, Julia
15
Prerequisites
You will be installing software
Python
Jupyter
Spark
Kafka
Cassandra
Plotly
Some of these are more complex on Windows than Unix/Mac OS
16
Tasks - Install the Following
Windows http://wiki.apache.org/hadoop/Hadoop2OnWindows
17
Books
Python Data Science Handbook: Essential Tools for Working with Data
Jake VanderPlas
O'Reilly Media
December 10, 2016
ISBN 9781491912058
18
Books
Course books are available for free online via the SDSU library
19
Spark, Amazon
Sign up for an Amazon Educate account - $100 of compute time for free
20
Data Science & Big Data
Very trendy
21
Data Science
Wikipedia
22
Data Science
23
Data Engineer
A software engineer who deals with data plumbing
Traditional database setup, Hadoop, Spark, etc.
Data analyst
A person who digs into data to surface insights, but lacks the skills to do so at scale
They know how to use Excel, Tableau and SQL but can’t build a web app from scratch
24
Data Science
25
Data Science & Big Data
Big Data
Data Science with large datasets
26
Inconvenient Truth About Data Science
Data is never clean.
You will spend most of your time cleaning and preparing data.
There is no fully automated Data Science. You need to get your hands dirty.
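To make the point about cleaning concrete, here is a small sketch of typical preparation steps: trimming whitespace, normalizing case, mapping missing-value markers to `None`, converting types, and dropping duplicates. The records, the `MISSING` marker set, and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical raw records, invented for illustration only
raw_rows = [
    {"name": " Alice ", "age": "34", "city": "San Diego"},
    {"name": "bob", "age": "", "city": "san diego "},
    {"name": "Carol", "age": "n/a", "city": "La Mesa"},
    {"name": " Alice ", "age": "34", "city": "San Diego"},  # exact duplicate
]

# Assumed set of strings that should be treated as "no value"
MISSING = {"", "n/a", "na", "null"}


def clean_row(row):
    """Trim whitespace, normalize case, and convert age to int or None."""
    name = row["name"].strip().title()
    city = row["city"].strip().title()
    age_text = row["age"].strip().lower()
    age = None if age_text in MISSING else int(age_text)
    return {"name": name, "age": age, "city": city}


def clean(rows):
    """Clean every row and drop rows that are duplicates after cleaning."""
    seen = set()
    cleaned = []
    for row in rows:
        record = clean_row(row)
        key = (record["name"], record["age"], record["city"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

Even this toy case needs decisions that cannot be automated away, such as which strings count as missing and whether two rows are really the same person.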
27
Figure: Tools (languages, data platforms, analytics) by share of respondents, axis 0% to 70%
28
Figure: Tools (languages, data platforms, analytics) by salary median and IQR (US dollars), axis 0K to 200K
Tools shown: SQL, Excel, Python, R, MySQL, Python: numpy, scipy, scikit-learn, ggplot, Microsoft SQL Server, Tableau, JavaScript, Matplotlib (Python), Java, PostgreSQL, Oracle, D3, Homegrown analysis tools, Hive, Spark, Cloudera, Visual Basic/VBA, MongoDB, Apache Hadoop, SAS, C++, PowerPivot
29
If you cannot think of three things that might go wrong with your analysis, there is something wrong with your thinking
31
Data Science Versus Programming Jobs
Data - 23
32
Data Science Programming Languages
33
Features of Languages for Data Science
Interactive
Supports computation
Simple syntax
Fast
34
Python
35
Julia
36
Java, Scala, Hadoop, Spark
Scala
OO & Functional
Type inference
Far less verbose than Java
37