CS 696 Intro to Big Data: Tools and Methods

Spring Semester, 2020


Doc 1 Introduction
Jan 23, 2020
Copyright ©, All rights reserved. 2020 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/openpub/) license defines the copyright on this document.
Course Issues
http://www.eli.sdsu.edu/courses/index.html

Waitlist
Course Web Site
Wiki
Course Recordings
Prerequisites
This room
Grading
Books
Spark & Related Tools
Data Science

2
Waitlist - How to get into a Class

Add yourself to the course waitlist

Instructors cannot:
Add individuals to the class
See who is on the waitlist
Change your priority on the waitlist

3
Waitlist - How it works

Waitlist is a priority queue

When a seat in a class becomes available the top priority student is added

You cannot be enrolled in two classes that meet at the same time

If the waitlist system adds you to a class, it will drop you from classes that meet at the same time

During the first week of classes, as students drop, others are added

During the second week of classes, students are added only if the instructor releases the seats
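
The priority-queue behavior described on this slide can be sketched in a few lines of Python. This is only an illustration, not the university's actual system: the `Waitlist` class, the numeric priorities, and the student names are all hypothetical, and the tie-breaking rule (earlier sign-up wins) is an assumption.

```python
import heapq
from itertools import count


class Waitlist:
    """Toy waitlist: a lower priority number is served first.

    Ties are broken by sign-up order (an assumption made for this sketch).
    """

    def __init__(self):
        self._heap = []
        self._order = count()  # increasing counter used as a tie-breaker

    def add(self, student, priority):
        heapq.heappush(self._heap, (priority, next(self._order), student))

    def seat_opened(self):
        """Return the top-priority student, or None if nobody is waiting."""
        if not self._heap:
            return None
        _, _, student = heapq.heappop(self._heap)
        return student


waitlist = Waitlist()
waitlist.add("open_university_student", priority=2)  # hypothetical names
waitlist.add("sdsu_student_a", priority=1)
waitlist.add("sdsu_student_b", priority=1)

print(waitlist.seat_opened())  # sdsu_student_a
```

Each time a seat opens, the student with the best priority is removed from the queue, regardless of the order in which students joined.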

4
Can you add me to the Course?

Instructors can't select individual students to add to the course

5
Waitlist FAQ

Why not get a bigger room and admit everyone?

No first hard assignment to scare people

No Grader

Do you really want a 600 level class of 100 people?

This is the largest room of its type on campus

6
Waitlist FAQ

Will you be increasing the size of the class?

No

Why not?

No grader

New courses are a lot of work

Technology courses are a lot of work

7
Waitlist FAQ

Feb 4

Last day for regular students to add/drop classes

Open University students have lower priority than SDSU students

8
Waitlist FAQ

So what are my chances of adding this class?

Look up your position on the waitlist

What are the odds of that many people dropping the class?

I cannot see the waitlist

I have no idea how many people will drop

9
Grading
1 exam
4-6 assignments
Project

10
Course Website Demo

11
What are the Tools & Methods?
Programming language - Python
Programming Notebook

Visualization
scatter, box, violin, qq, line, density plots
errorbar, histogram, beeswarms

Statistics
mean, variance, quantiles, distributions
confidence intervals, correlation, covariance
regression, goodness-of-fit, chi-squared test
Bayes theorem

Machine Learning
k-means, DBSCAN, Decision & Regression trees

Streaming - Kafka
Database - Cassandra
Hadoop, Spark, Pig, Mahout, etc.
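
As a small taste of the statistics items listed above, the following sketch computes a mean, variance, quartiles, covariance, and correlation using only the Python standard library. The data values are made up for illustration, and the course itself may use NumPy or SciPy for the same calculations.

```python
import math
import statistics

# Made-up sample data, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)
var_x = statistics.variance(xs)  # sample variance (n - 1 denominator)
var_y = statistics.variance(ys)
quartiles_x = statistics.quantiles(xs, n=4)  # three cut points; the middle one is the median

# Sample covariance and Pearson correlation, written out by hand.
n = len(xs)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
corr_xy = cov_xy / math.sqrt(var_x * var_y)

print("mean:", mean_x, "variance:", var_x)
print("quartiles:", quartiles_x)
print("covariance:", cov_xy, "correlation:", corr_xy)
```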

12
What will we be doing
Installing programs
Python, Jupyter, Spark, Kafka, Cassandra

Writing Python, Java, Scala-Spark programs

Reports using Jupyter Notebooks

Analyzing data

Distributing data

Visualizing Data

Using Spark

Using Amazon Cloud

13
What will we be doing
~2 Weeks
Intro, Python

~5 weeks
Statistics, ML, NumPy, SciPy
Visualization

~3 weeks
Spark

~2 weeks
Kafka & Cassandra

14
Notebooks - Documentation, development
Python, Julia, R
Others supported by community - Java, Fortran, Haskell, Ruby, Go, Scala, many more
Other notebook systems

Visualization
Python, Julia, R, Matlab

ML
Python (C), Julia, Matlab, R?

Spark - Large Data Sets
Scala, Java, Python, R, Julia

Kafka - Streaming Data
Java, JVM languages, Python, Julia (except for offsets), others - no R client

Cassandra - Data Storage
Java, Python, R - sort of

15
Prerequisites
You will be installing software
Python
Jupyter
Spark
Kafka
Cassandra
Plotly

Some of these are more complex on Windows than Unix/Mac OS

We will be doing some
Statistics
Math
Machine learning

16
Tasks - Install the Following

Jupyter via Anaconda & Conda with Python 3
http://jupyter.readthedocs.io/en/latest/install.html

Spark 2.4.4, pre-built for Apache Hadoop 2.7
Unix/Linux/Mac OS http://spark.apache.org/docs/latest/
Windows http://wiki.apache.org/hadoop/Hadoop2OnWindows
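
After installing, a quick sanity check along the following lines may help. It is only a sketch: the helper name `check_install` is hypothetical, and it assumes that the `jupyter` and `spark-submit` commands have been placed on your PATH, which depends on how you installed Anaconda and where you unpacked Spark.

```python
import shutil
import sys


def check_install(commands=("jupyter", "spark-submit")):
    """Report whether Python 3 is running and which commands are on PATH."""
    report = {"python3": sys.version_info.major >= 3}
    for command in commands:
        # shutil.which returns the executable's path, or None if it is not found.
        report[command] = shutil.which(command) is not None
    return report


for name, ok in check_install().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

A "MISSING" line does not necessarily mean the install failed; the tool may simply not be on PATH yet.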

17
Books

Python Data Science Handbook: Essential Tools for Working with Data
Jake VanderPlas
O'Reilly Media
December 10, 2016
ISBN 9781491912058

Spark: The Definitive Guide


Matei Zaharia, Bill Chambers
February 2018
ISBN 9781491912218

18
Books

Course books are available for free on-line via SDSU library

Need SDSU Library account to access books off campus

Some people do not like reading books on-line


But if you need to save money it is available

May add chapters of other books as the semester progresses
But only from books available on-line

19
Spark, Amazon

You will run Spark on Amazon’s cloud

You need to create an Amazon AWS account

Sign up for Amazon Educate account - $100 compute time for free

But you may incur some cost on Amazon

20
Data Science & Big Data

Very trendy

When topics become trendy in CS the terms become very vague

Big Data Analytics with Excel

Is Data Scientist A Useless Job Title?

21
Data Science

Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics,[3] similar to Knowledge Discovery in Databases (KDD)

Wikipedia

22
Data Science

Data Scientist (n.):


Person who is better at statistics than any software engineer and
better at software engineering than any statistician.

— Josh Wills (@josh_wills) May 3, 2012

23
Data Engineer
A software engineer that deals with data plumbing
Traditional database setup, Hadoop, Spark, etc.

Data analyst
A person who digs into data to surface insights,
but lacks the skills to do so at scale
They know how to use
Excel, Tableau and SQL
but can’t build a web app from scratch

24
Data Science

Science of transforming data into useful information by means of
Statistical and
Machine learning techniques

25
Data Science & Big Data

Big Data
Data Science with large datasets

No hard boundary between Big Data and medium data

Requires more data plumbing

26
Inconvenient Truth About Data Science
Data is never clean.

You will spend most of your time cleaning and preparing data.

95% of tasks do not require deep learning.

In 90% of cases generalized linear regression will do the trick.

Big Data is just a tool.

You should embrace the Bayesian approach.

No one cares how you did it.

Academia and business are two different worlds.

Presentation is key - be a master of PowerPoint.

All models are false, but some are useful.

There is no fully automated Data Science. You need to get your hands dirty.
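
To make the regression point above concrete, here is a minimal sketch of the simplest special case, ordinary least squares with a single predictor, in plain Python. The function name `fit_line` and the data are made up for illustration. Generalized linear models cover much more than this, and in practice a library would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```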

27
[Chart: share of respondents (0% to 70%) using each tool, where a tool is a language, data platform, or analytics package. Tools in chart order: SQL, Excel, Python, R, MySQL, Python: numpy, scipy, scikit-learn, ggplot, Microsoft SQL Server, Tableau, JavaScript, Matplotlib (Python), Java, PostgreSQL, Oracle, D3, Homegrown analysis tools, Hive, Spark, Cloudera, Visual Basic/VBA, MongoDB, Apache Hadoop, SAS, C++, PowerPivot, Scala, SQLite, C, Pig, Amazon RedShift, Weka, Hbase, Amazon Elastic MapReduce (EMR), Perl, SPSS, Teradata]

28
[Chart: salary median and IQR (US dollars, 0K to 200K) by tool, where a tool is a language, data platform, or analytics package. The tools are the same, and in the same order, as in the share-of-respondents chart.]

29

30
Rule of Three

If you can not think of three things that might go wrong with your analysis
there is something wrong with your thinking

31
Data Science Versus Programming Jobs

Intuit Job Listing Worldwide Aug 22 2016

Data - 23

Software Engineer - 168

32
Data Science Programming Languages

Python, R, Matlab, JavaScript, SAS, Perl, Ruby

Scala, Julia

Java, C++, C, C#

33
Features of Languages for Data Science

Interactive

Statistical, Machine Learning, Math libraries

Plays well with others

Supports computation

Simple syntax

Fast

34
Python

Widely used
Interactive
Lots of libraries
Plays well with others

Slow
Python 2.x versus Python 3.x
3/2
Threads do not scale
Global Interpreter Lock (GIL)
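
The `3/2` item on this slide presumably refers to one of the best-known differences between the two versions: in Python 2, `3/2` on integers evaluates to `1`, while in Python 3 it evaluates to `1.5`. The short sketch below shows the Python 3 behavior. The variable names are only for illustration.

```python
# Python 3: "/" is true division and "//" is floor division.
true_quotient = 3 / 2
floor_quotient = 3 // 2

print(true_quotient)   # 1.5
print(floor_quotient)  # 1
```

Code ported from Python 2 that relied on integer division needs `//` to keep its old behavior.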

35
Julia

New language from MIT
Interactive & Fast
Untyped & Typed
Designed for computation
f(x) = 2x + 4
Int32, Int64, Int128, BigInt
Statistical and Math libraries
Plays well with others

LLVM
Lisp style macros
Multiple dispatch
Designed for parallelism & distributed computation

36
Java, Scala, Hadoop, Spark

Hadoop written in Java


Spark written in Scala

JVM languages (Java, Scala, Clojure, Groovy, JRuby, Jython)


Much more efficient on Hadoop & Spark
First access to new features

Scala
OO & Functional
Type inference
Far less verbose than Java

37
