Lecture 1- Introduction to Big Data


THE ART OF ANALYZING BIG DATA- THE DATA SCIENTIST’S TOOLBOX - LECTURE 1

DR. MICHAEL FIRE

The Big Data Revolution
Pillars of Science
• Theory
• Experimentation
• Computational Science
• Data-Intensive Science
“There was 5 exabytes of information created
between the dawn of civilization through 2003, but
that much information is now created every 2 days,
and the pace is increasing”
Eric Schmidt, 2010
The Data Tsunami

A Day in Data Infographic

What is Big Data?
• “Big data is a term used to refer to data sets that are too large or complex for traditional data-
processing application software to adequately deal with. Data with many cases (rows) offer greater
statistical power, while data with higher complexity (more attributes or columns) may lead to a
higher false discovery rate. Big data challenges include capturing data, data storage, data
analysis, search, sharing, transfer, visualization, querying, updating, information privacy and
data source. Big data was originally associated with three key concepts: volume, variety, and
velocity. Other concepts later attributed with big data are veracity (i.e., how much noise is in the
data) and value.” (Wikipedia)
• “Big data is high-volume, high-velocity and/or high-variety information assets that demand
cost-effective, innovative forms of information processing that enable enhanced insight, decision
making, and process automation.” (Gartner)
Data 3Vs (or 4Vs)
Example: Big Data at Netflix
Big Data at Netflix:
• 167 million users
• 160+ million hours of video watched each day
• 4,000 different devices
• 700+ billion events a day
• 60 petabytes of data

Some of Netflix’s data-related challenges:

• Building Big Data Infrastructure
• Personal recommendation of movies
• Creating Data Visualization Tools
• Improving Marketing Effectiveness
• Creating video previews
• Minimizing the playback startup time

More can be found on the Netflix Technology Blog

Exciting Times
We are living in exciting times, with a lot of new things to
discover using new datasets, new data analysis tools,
and new data infrastructures

“The next Kinsey, I strongly suspect, will be a data
scientist. The next Foucault will be a data scientist.
The next Freud will be a data scientist. The next Marx
will be a data scientist. The next Salk might very well
be a data scientist”
Seth Stephens-Davidowitz, 2017
OPEN DATASETS
Diverse Datasets
Notable Open Datasets
• Kaggle - over 28,000 datasets
• Microsoft Academic Graph - over 231 million papers
• data.gov - U.S. Government’s open data
• pushshift.io - full Reddit dataset
• Common Crawl - 8 years of web page data
• YouTube-8M Dataset - a large-scale labeled video
dataset that consists of millions of YouTube videos
• Data4Good.io - over 1TB of compressed network data
:-)
“Hiding within those mounds of data is knowledge that could change
the life of a patient, or change the world”
Atul Butte, 2012
DATA SCIENCE TOOLS
Wide Variety of Easy-to-Use Tools
Using Data Science Tools
My personal belief:
Using data science tools is similar to using electricity - we can start
using most of the tools without knowing the details behind the
underlying algorithms
CLOUD INFRASTRUCTURE
Cloud Computing
Increasingly Affordable Computational Power
DEEP LEARNING
Deep Learning
- Deep learning is part of a broader family of machine learning methods
based on artificial neural networks
- Deep learning architectures have been applied to fields including
computer vision, speech recognition, natural language processing,
audio recognition, social network filtering, machine translation,
bioinformatics, drug design, medical image analysis, material
inspection and board game programs
- They have produced results comparable to and in some cases
surpassing human expert performance
OUR ACADEMIC COURSE
Course Goals
During this course, you will learn how to:
• Collect data
• Create data from various sources
• Manipulate data
• Handle massive datasets
• Identify patterns in the data
Course Goals
During this course, you will also learn how to:
• Work with various data analytics tools
• Work with graphs
• Apply some practical text analytics
• Visualize data

We will learn how to transform data into knowledge

Course Assignments
• Weekly, relatively small coding tasks to check that you
understand the material of each lesson (you will get one at the
end of today’s lesson)
• Course Project (in pairs only) - doing something cool with
a real dataset
• Test
WORKING WITH DATA
“Data Scientist: The Sexiest Job of the 21st Century”
Thomas H. Davenport and D.J. Patil, 2012
Some Things to Remember

“If you torture the data long enough, it will confess”

Ronald H. Coase
The Bonferroni Principle
• Even in a completely random dataset, interesting events
may still occur
• If you look hard enough, you will find them
• In big datasets, there are many “interesting” patterns
that occur by chance

For example, suppose we want to identify friends in a large geolocation
dataset according to repeated joint locations over time. We will probably
also match pairs of people who were in the same places purely by chance.
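
The geolocation example can be made concrete with a small simulation. This is only an illustrative sketch: the population size, number of places, number of days, and the "friends" threshold are made-up values, and every simulated person moves completely at random, so any pair that gets flagged is a false positive.

```python
import random
from itertools import combinations

def count_chance_friends(n_people=200, n_places=20, n_days=30,
                         min_meetings=5, seed=0):
    """Count pairs that share a place on at least `min_meetings` days,
    even though everyone picks a place uniformly at random each day."""
    rng = random.Random(seed)
    # visits[p][d] is the place visited by person p on day d.
    visits = [[rng.randrange(n_places) for _ in range(n_days)]
              for _ in range(n_people)]
    flagged = 0
    for a, b in combinations(range(n_people), 2):
        meetings = sum(1 for d in range(n_days) if visits[a][d] == visits[b][d])
        if meetings >= min_meetings:
            flagged += 1
    return flagged

total_pairs = 200 * 199 // 2
print(count_chance_friends(), "of", total_pairs, "random pairs look like friends")
```

Raising `min_meetings` or lowering the number of candidate pairs reduces the number of chance matches, which is the practical lesson of the principle.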
The Look-Elsewhere Effect

• An apparently statistically significant observation may have actually
arisen by chance because of the sheer size of the parameter
space to be searched
• “The Bible Code” - with enough options, something significant will
be discovered
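
A short calculation shows how quickly chance findings accumulate as the search space grows. The sketch below assumes m independent tests, each with false-positive rate alpha (independence and the chosen values of m are illustrative assumptions). It also shows the Bonferroni correction, a standard related technique that tests each hypothesis at alpha / m.

```python
def prob_any_false_positive(m, alpha=0.05):
    """Probability that at least one of m independent tests is a false positive."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(m, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / m

for m in (1, 10, 100, 1000):
    print(m, round(prob_any_false_positive(m), 3), bonferroni_threshold(m))
```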
Underfitting & Overfitting
• We use data and machine learning algorithms to
create prediction models
• The goal of a good machine learning model is to
generalize well from the training data
• Underfitting is when the model is too simple
• Overfitting is when the model is too complex
• A rule of thumb - if your model’s performance looks
too good to be true on the first runs, you are
probably overfitting
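
The difference between the two failure modes shows up when training error is compared with error on held-out data. The following is a minimal sketch using only the standard library; the ground truth (y = 2x plus noise), the sample sizes, and the three toy models are assumptions chosen for illustration. The constant model is too simple, the straight line matches the data, and the nearest-neighbor lookup memorizes the training set.

```python
import random

def mse(model, points):
    """Mean squared error of `model` (a function x -> prediction) on (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def fit_constant(train):
    """Too simple: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Ordinary least squares for y = a * x + b."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_nearest_neighbor(train):
    """Too flexible: memorize the training set and copy the closest point's target."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(42)

def sample(n):
    # Hypothetical ground truth: y = 2x + Gaussian noise.
    return [(x, 2 * x + rng.gauss(0, 1))
            for x in (rng.uniform(0, 10) for _ in range(n))]

train, test = sample(200), sample(200)
for name, fit in [("constant", fit_constant), ("line", fit_line),
                  ("1-nearest-neighbor", fit_nearest_neighbor)]:
    model = fit(train)
    print(f"{name:20s} train MSE={mse(model, train):6.2f}  test MSE={mse(model, test):6.2f}")
```

The memorizing model reaches zero training error, which is exactly the "too good to be true" signal, while its test error is clearly worse than its training error.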
Overfitting according to XKCD

Underfitting according to XKCD

https://xkcd.com/605/
WORKING WITH STRUCTURED DATA
Working with DBMS
• DBMSs have been with us for a long time (the first DBMS
was developed in the 1960s)
• Using Structured Query Language (SQL) is a common
and useful way to analyze/manipulate data
• There are excellent open source DBMSs that can
be easily installed and used
• SQL can also be useful for running queries on
Hadoop, Spark, and BigQuery
Data Science and Databases
From my personal experience:
When to use databases:
• Working with structured/tabular data
• Working with relatively small datasets (up to several million
rows)
• Doing relatively simple analytics
• Needing to work with many subsets of the datasets
When not to use databases:
• Working with unstructured data
• Working with data that contains dictionary/list structures
• Working with relatively large datasets (several hundred
million rows)
• Doing complex analytics
SQL - A Very Quick Review
SELECT <Col_1>, <Col_2>, …, <Col_N>
FROM <Table_1>, <Table_2>, …, <Table_N>
WHERE <RowCondition>
ORDER BY <Col_i>

SELECT FirstName, LastName
FROM Users
WHERE FirstName = 'John' AND LastName LIKE 'Sm%'
ORDER BY Age
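
The query above can be tried from Python with the standard `sqlite3` module. This is a minimal sketch: the `Users` table layout and the sample rows are invented for illustration, and the values are passed as parameters rather than pasted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT, Age INTEGER)")
# Hypothetical sample rows, invented for this illustration.
conn.executemany(
    "INSERT INTO Users VALUES (?, ?, ?)",
    [("John", "Smith", 41), ("John", "Smythe", 29),
     ("John", "Brown", 35), ("Mary", "Smith", 30)],
)

query = """
    SELECT FirstName, LastName
    FROM Users
    WHERE FirstName = ? AND LastName LIKE ?
    ORDER BY Age
"""
rows = conn.execute(query, ("John", "Sm%")).fetchall()
print(rows)
```

Only the two Johns whose last names start with "Sm" are returned, youngest first.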
Data Definition Language (DDL)

Used to Create/Drop/Alter/Truncate tables

CREATE TABLE "flavors_of_cacao" (
    "Company" TEXT,
    "SpecificBeanOriginorBarName" TEXT,
    "REF" INTEGER,
    "Review Date" INTEGER,
    "Cocoa_Percent" TEXT,
    "Company_Location" TEXT,
    "Rating" REAL,
    "Bean_Type" TEXT,
    "BroadBean_Origin" TEXT
);

UPDATE User    -- strictly speaking, UPDATE is a DML statement
SET Country = 'USA'
WHERE Country = 'United States';

ALTER TABLE User
ADD LastPost varchar(255);

ALTER TABLE User
DROP COLUMN LastPost;

TRUNCATE TABLE Users;
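
Since this course uses SQLite, here is a hedged sketch of the same kinds of schema changes through Python's `sqlite3` module. The table contents are invented for illustration. Note that SQLite has no TRUNCATE TABLE statement, so the sketch empties the table with DELETE instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "User" (UserId INTEGER PRIMARY KEY, Country TEXT)')
# Hypothetical sample rows.
conn.executemany('INSERT INTO "User" (Country) VALUES (?)',
                 [("United States",), ("Canada",)])

# ALTER TABLE ... ADD COLUMN changes the schema; existing rows get NULL.
conn.execute('ALTER TABLE "User" ADD COLUMN LastPost varchar(255)')
columns = [row[1] for row in conn.execute('PRAGMA table_info("User")')]

# SQLite has no TRUNCATE TABLE; DELETE without WHERE empties the table instead.
conn.execute('DELETE FROM "User"')
remaining = conn.execute('SELECT COUNT(*) FROM "User"').fetchone()[0]

# DROP TABLE removes the table definition itself.
conn.execute('DROP TABLE "User"')
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(columns, remaining, tables)
```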
Data Manipulation Language (DML)
Used to manipulate data using Select/Insert/Update/Delete

SELECT u1.FirstName, u2.FirstName
FROM Links l, Users u1, Users u2
WHERE l.User1 = u1.UserId AND l.User2 = u2.UserId

SELECT GroupNumber, AVG(JoinYear), MAX(JoinYear)
FROM Users
GROUP BY GroupNumber
ORDER BY AVG(JoinYear)

INSERT INTO Links (User1, User2)
VALUES (5,4);

UPDATE Users
SET GroupNumber = 3
WHERE GroupNumber = 1;

DELETE FROM Users WHERE UserId=4;

Links
User1  User2
1      2
2      3
2      1
3      1
4      1

Users
Userid  FirstName  LastName  JoinYear  GroupNumber
1       John       Smith     2018      1
2       Mary       Perry     2019      1
3       William    Brown     2018      2
4       Daniel     Miler     2017      2
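
The statements above can be run end to end against an in-memory SQLite database. In this sketch the sample rows mirror the Links and Users tables from the slide (first names written as John and Mary), and the ORDER BY on the join is an addition made only so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserId INTEGER PRIMARY KEY, FirstName TEXT,
                        LastName TEXT, JoinYear INTEGER, GroupNumber INTEGER);
    CREATE TABLE Links (User1 INTEGER, User2 INTEGER);
    INSERT INTO Users VALUES (1, 'John', 'Smith', 2018, 1),
                             (2, 'Mary', 'Perry', 2019, 1),
                             (3, 'William', 'Brown', 2018, 2),
                             (4, 'Daniel', 'Miler', 2017, 2);
    INSERT INTO Links VALUES (1, 2), (2, 3), (2, 1), (3, 1), (4, 1);
""")

# Join: resolve both ends of every link to a first name.
pairs = conn.execute("""
    SELECT u1.FirstName, u2.FirstName
    FROM Links l, Users u1, Users u2
    WHERE l.User1 = u1.UserId AND l.User2 = u2.UserId
    ORDER BY l.User1, l.User2
""").fetchall()

# Aggregate: average and latest join year per group.
stats = conn.execute("""
    SELECT GroupNumber, AVG(JoinYear), MAX(JoinYear)
    FROM Users
    GROUP BY GroupNumber
    ORDER BY AVG(JoinYear)
""").fetchall()

# Update and delete change rows in place.
conn.execute("UPDATE Users SET GroupNumber = 3 WHERE GroupNumber = 1")
conn.execute("DELETE FROM Users WHERE UserId = 4")
groups = conn.execute(
    "SELECT UserId, GroupNumber FROM Users ORDER BY UserId").fetchall()
print(pairs, stats, groups, sep="\n")
```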
SQL Joins
SQLite
In this course, we will be working with SQLite
Useful Links:

• SQLite.org
• DB Browser for SQLite
• sqlite3 module
WORKING WITH REAL WORLD DATA
Importing Dataset from CSV
Example 1: Baby Names
There is an open dataset containing the names of babies
born each year.

Let’s use this dataset to discover various trends

• How many movies are in each rating category?
• What is the highest-rated movie in each category?
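
Importing a CSV file into SQLite and answering questions like the two above can be sketched as follows. Everything specific here is an assumption made for illustration: the CSV content, the column names (`title`, `content_rating`, `score`), and the reading of "rating category" as a content rating such as PG or R.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a real file on disk;
# with a real file, open it with open(path, newline="") instead of io.StringIO.
csv_text = """title,content_rating,score
Movie A,PG,7.1
Movie B,PG,8.3
Movie C,R,6.5
Movie D,R,9.0
Movie E,R,7.7
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, content_rating TEXT, score REAL)")
rows = [(r["title"], r["content_rating"], float(r["score"]))
        for r in csv.DictReader(io.StringIO(csv_text))]
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)

# How many movies are in each rating category?
counts = conn.execute("""
    SELECT content_rating, COUNT(*)
    FROM movies
    GROUP BY content_rating
    ORDER BY content_rating
""").fetchall()

# What is the highest-rated movie in each category?
best = conn.execute("""
    SELECT m.content_rating, m.title, m.score
    FROM movies m
    JOIN (SELECT content_rating, MAX(score) AS top_score
          FROM movies GROUP BY content_rating) b
      ON m.content_rating = b.content_rating AND m.score = b.top_score
    ORDER BY m.content_rating
""").fetchall()
print(counts, best, sep="\n")
```

The join against a grouped subquery is used for the second question because it works in most SQL dialects, not only in SQLite.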

Working with Jupyter Notebooks
In this course, we are going to work with Python
• I recommend installing Anaconda and PyCharm
• It is recommended to work in a virtual environment
$ conda create -n venv python=3.7 anaconda
$ source activate venv (on conda 4.4 or later: conda activate venv)
• We will use Jupyter Notebooks
$ jupyter notebook

• I also recommend getting familiar with IPython

Practice SQL Online

- SQLZOO
- SQL Murder Mystery
Let’s move on to reviewing the
course’s first notebook

See also Kaggle Dataset
