Lecture 1- Introduction to Big Data


THE ART OF ANALYZING BIG DATA- THE DATA SCIENTIST’S TOOLBOX - LECTURE 1

DR. MICHAEL FIRE


The Big Data Revolution
Pillars of Science
• Theory
• Experimentation
• Computational Science
• Data-Intensive Science
“There were 5 exabytes of information created
between the dawn of civilization through 2003, but
that much information is now created every 2 days,
and the pace is increasing”
Eric Schmidt, 2010
The Data Tsunami

A Day in Data Infographic


What is Big Data?
• “Big data is a term used to refer to data sets that are too large or complex for traditional data-
processing application software to adequately deal with. Data with many cases (rows) offer greater
statistical power, while data with higher complexity (more attributes or columns) may lead to a
higher false discovery rate. Big data challenges include capturing data, data storage, data
analysis, search, sharing, transfer, visualization, querying, updating, information privacy and
data source. Big data was originally associated with three key concepts: volume, variety, and
velocity. Other concepts later attributed with big data are veracity (i.e., how much noise is in the
data) and value.” (Wikipedia)
• “Big data is high-volume, high-velocity and/or high-variety information assets that demand
cost-effective, innovative forms of information processing that enable enhanced insight, decision
making, and process automation.” (Gartner)
Data 3Vs (or 4Vs)
Example: Big Data at Netflix
Big Data at Netflix:
• 167 million users
• 160+ million hours of video watched each day
• 4000 different devices
• 700+ billion events a day
• 60 petabytes of data

Some of Netflix's data-related challenges:

• Building Big Data Infrastructure
• Personal recommendation of movies
• Creating Data Visualization Tools
• Improving Marketing Effectiveness
• Creating video previews
• Minimizing the playback startup time

More can be found on the Netflix Technology Blog


Exciting Times
We are living in exciting times, with a lot of new things to
discover using new datasets, new data analysis tools,
and new data infrastructures

“The next Kinsey, I strongly suspect, will be a data
scientist. The next Foucault will be a data scientist.
The next Freud will be a data scientist. The next Marx
will be a data scientist. The next Salk might very well
be a data scientist”
Seth Stephens-Davidowitz, 2017
OPEN DATASETS
Diverse Datasets
Notable Open Datasets
• Kaggle - over 28,000 datasets
• Microsoft Academic Graph - over 231 million papers
• data.gov - U.S. Government’s open data
• pushshift.io - full Reddit dataset
• Common Crawl - 8 years of web page data
• YouTube-8M Dataset - a large-scale labeled video
dataset that consists of millions of YouTube videos
• Data4Good.io - over 1TB of compressed network data :-)
"Hiding within those mounds of data is knowledge that could change
the life of a patient, or change the world”
Atul Butte, 2012
DATA SCIENCE TOOLS
Wide Variety of Easy-to-Use Tools
Using Data Science Tools
My personal belief:
Using data science tools is similar to using electricity - we can start
using most of the tools without knowing the details behind the
underlying algorithms
CLOUD INFRASTRUCTURE
Cloud Computing
Increasingly Affordable Computational Power
DEEP LEARNING
Deep Learning
- Deep learning is part of a broader family of machine learning methods
based on artificial neural networks

- Deep learning architectures have been applied to fields including
computer vision, speech recognition, natural language processing,
audio recognition, social network filtering, machine translation,
bioinformatics, drug design, medical image analysis, material
inspection and board game programs

- They have produced results comparable to and in some cases
surpassing human expert performance
OUR ACADEMIC COURSE
Course Goals
During this course, you will learn how to:
• Collect data
• Create data from various sources
• Manipulate data
• Handle massive datasets
• Identify patterns in the data
Course Goals
During this course, you will also learn how to:
• Work with various data analytics tools
• Work with graphs
• Perform some practical text analytics
• Visualize data

We will learn how to transform data into knowledge


Course Assignments
• Weekly, relatively small coding tasks to check that you
understand the material of each lesson (you will get one at the
end of today's lesson)
• Course Project (in pairs only) - doing something cool with
a real dataset
• Test
WORKING WITH DATA
“Data Scientist: The Sexiest Job of the 21st Century”
Thomas H. Davenport and D.J. Patil, 2012
Some Things to Remember

“If you torture the data long enough, it will confess”
Ronald H. Coase
The Bonferroni Principle
• Even in a completely random dataset, interesting
events may still occur
• If you look hard enough, you will find them
• In big datasets, there are many “interesting” patterns
that occur by chance

For example, if we try to identify friends in a large
geolocation dataset by looking for repeated joint locations
over time, we will probably match pairs of people who
were in the same places by chance.
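
The geolocation example can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not code from the course: the function names, the 200 people, 10 locations, 20 days, and the "friends" rule (sharing a location on at least 6 days) are all hypothetical choices. Everyone moves completely at random, so every flagged pair is a false discovery.

```python
import random
from itertools import combinations
from math import factorial


def binomial(n, k):
    """Number of ways to choose k items out of n."""
    return factorial(n) // (factorial(k) * factorial(n - k))


def count_chance_pairs(n_people, n_locations, n_days, min_shared_days, seed=0):
    """Count pairs of unrelated people who share a location on many days."""
    rng = random.Random(seed)
    # Each person visits one uniformly random location per day.
    visits = [
        [rng.randrange(n_locations) for _ in range(n_days)]
        for _ in range(n_people)
    ]
    flagged = 0
    for a, b in combinations(visits, 2):
        shared_days = sum(1 for loc_a, loc_b in zip(a, b) if loc_a == loc_b)
        if shared_days >= min_shared_days:
            flagged += 1
    return flagged


def expected_chance_pairs(n_people, n_locations, n_days, min_shared_days):
    """Expected number of flagged pairs when all movement is random."""
    p = 1 / n_locations  # chance that two people coincide on a given day
    tail = sum(
        binomial(n_days, k) * p**k * (1 - p) ** (n_days - k)
        for k in range(min_shared_days, n_days + 1)
    )
    return binomial(n_people, 2) * tail


observed = count_chance_pairs(200, 10, 20, 6)
expected = expected_chance_pairs(200, 10, 20, 6)
print(f"flagged pairs: {observed}, expected by chance: {expected:.1f}")
```

With these assumed parameters, the expected number of "friend" pairs produced by pure chance is on the order of 200, even though no real friendships exist in the data.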
The Look-Elsewhere Effect

• An apparently statistically significant observation may have actually
arisen by chance because of the sheer size of the parameter
space to be searched
• “The Bible Code” - with enough options, something significant will
be discovered
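
A short calculation shows how quickly this effect grows. The sketch below assumes independent tests on pure noise, each with significance level alpha; the function names are illustrative only. The second function is the standard Bonferroni correction, which divides alpha by the number of tests.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive in n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate at most alpha."""
    return alpha / n_tests


for n_tests in (1, 10, 100):
    uncorrected = familywise_error_rate(n_tests)
    corrected = familywise_error_rate(n_tests, bonferroni_threshold(n_tests))
    print(f"{n_tests:>3} tests: uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
```

Under these assumptions, the chance of at least one "significant" result is 5% for a single test, about 40% for 10 tests, and above 99% for 100 tests, unless the threshold is corrected.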
Underfitting & Overfitting
• We use data and machine learning algorithms to
create prediction models
• The goal of a good machine learning model is to
generalize well from the training data
• Underfitting is when the model is too simple
• Overfitting is when the model is too complex
• A rule of thumb - if your model’s performance on the
first runs looks too good to be true, you are
probably overfitting
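
The difference between the two failure modes can be sketched with three toy models. This is a hypothetical illustration, not course code: the synthetic data (y = 2x plus Gaussian noise), the model choices, and all function names are assumptions. The constant model is too simple, the memorizer (nearest training point) is too complex, and the straight line matches how the data was generated.

```python
import random


def mse(model, points):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def fit_constant(train):
    """Underfitting candidate: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Ordinary least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(train):
    """Overfitting candidate: return the target of the nearest training point."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


def make_data(n, rng):
    """Synthetic points from y = 2x + noise (an assumption for this sketch)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


rng = random.Random(42)
train = make_data(50, rng)
test = make_data(200, rng)

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:>9}: train MSE {results[name][0]:.2f}, test MSE {results[name][1]:.2f}")
```

The memorizer reaches a training error of exactly zero, which is the "too good to be true" signal from the rule of thumb; its error on unseen test data is not zero.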
Overfitting according to XKCD

Underfitting according to XKCD

https://xkcd.com/605/
WORKING WITH STRUCTURED DATA
Working with DBMS
• DBMSs have been with us for a long time (the first DBMS
was developed in the 1960s)
• Using Structured Query Language (SQL) is a common
and useful way to analyze/manipulate data
• There are excellent open source DBMSs that can
be easily installed and used
• SQL can also be used to run queries on
Hadoop, Spark, and BigQuery
Data Science and Databases
From my personal experience:
When to use databases:
• Working with structured/tabular data
• Working with relatively small datasets (up to several million
rows)
• Doing relatively simple analytics
• Needing to work with many subsets of the dataset
When not to use databases:
• Working with unstructured data
• Working with data that contains dictionary/list structures
• Working with relatively large datasets (several hundred
million rows)
• Doing complex analytics
SQL - A Very Quick Review
SELECT <Col_1>, <Col_2>, …, <Col_N>
FROM <Table_1>, <Table_2>, …, <Table_N>
WHERE <RowCondition>
ORDER BY <Col_i>

SELECT FirstName, LastName
FROM Users
WHERE FirstName = 'John' AND LastName LIKE 'Sm%'
ORDER BY Age
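
The query above can be tried from Python with the standard `sqlite3` module. This is a minimal sketch: the in-memory database and the four sample rows are made up for illustration, and the table layout (FirstName, LastName, Age) is assumed from the query itself.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?, ?)",
    [
        ("John", "Smith", 41),
        ("John", "Smythe", 29),
        ("John", "Brown", 35),
        ("Mary", "Smith", 52),
    ],
)

# Same SELECT as in the review; values are passed as parameters
# instead of being pasted into the SQL string.
rows = conn.execute(
    "SELECT FirstName, LastName FROM Users "
    "WHERE FirstName = ? AND LastName LIKE ? "
    "ORDER BY Age",
    ("John", "Sm%"),
).fetchall()
print(rows)
conn.close()
```

Only the two Johns whose last names start with "Sm" are returned, with the younger one first.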
Data Definition Language (DDL)

Used to Create/Drop/Alter/Truncate tables

CREATE TABLE "flavors_of_cacao" (
    "Company" TEXT,
    "SpecificBeanOriginorBarName" TEXT,
    "REF" INTEGER,
    "Review Date" INTEGER,
    "Cocoa_Percent" TEXT,
    "Company_Location" TEXT,
    "Rating" REAL,
    "Bean_Type" TEXT,
    "BroadBean_Origin" TEXT
);

UPDATE Users
SET Country = 'USA'
WHERE Country = 'United States';

ALTER TABLE Users
ADD LastPost varchar(255);

ALTER TABLE Users
DROP COLUMN LastPost;

TRUNCATE TABLE Users;
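
A small sketch of DDL statements run through Python's `sqlite3` module follows. The simplified Users table here is hypothetical. One caveat is worth knowing when trying these statements in SQLite: it has no TRUNCATE statement, so a DELETE without a WHERE clause is used to empty the table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define a small, hypothetical Users table.
conn.execute("CREATE TABLE Users (UserId INTEGER PRIMARY KEY, Country TEXT)")
conn.executemany(
    "INSERT INTO Users (Country) VALUES (?)",
    [("United States",), ("Israel",)],
)

# ALTER: add a new column to the existing table.
conn.execute("ALTER TABLE Users ADD COLUMN LastPost VARCHAR(255)")
columns = [row[1] for row in conn.execute("PRAGMA table_info(Users)")]
print(columns)

# SQLite has no TRUNCATE; DELETE without WHERE removes all rows.
conn.execute("DELETE FROM Users")
remaining = conn.execute("SELECT COUNT(*) FROM Users").fetchone()[0]

# DROP: remove the table definition itself.
conn.execute("DROP TABLE Users")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```

After the last statement the database contains no tables, which is the difference between emptying a table and dropping it.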
Data Manipulation Language (DML)
Used to manipulate data using Select/Insert/Update/Delete
SELECT u1.FirstName, u2.FirstName
FROM Links l, Users u1, Users u2
WHERE l.User1 = u1.Userid AND l.User2 = u2.Userid;

INSERT INTO Links (User1, User2)
VALUES (5, 4);

SELECT GroupNumber, AVG(JoinYear), MAX(JoinYear)
FROM Users
GROUP BY GroupNumber
ORDER BY AVG(JoinYear);

UPDATE Users
SET GroupNumber = 3
WHERE GroupNumber = 1;

DELETE FROM Users WHERE Userid = 4;

Links
User1  User2
1      2
2      3
2      1
3      1
4      1

Users
Userid  FirstName  LastName  JoinYear  GroupNumber
1       John       Smith     2018      1
2       Mary       Perry     2019      1
3       William    Brown     2018      2
4       Daniel     Miler     2017      2
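
The aggregate, UPDATE, and DELETE statements can be checked against the four sample users with a short `sqlite3` script. This is a sketch only: the rows follow the Users table above, and the column types are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Users (Userid INTEGER PRIMARY KEY, FirstName TEXT, "
    "LastName TEXT, JoinYear INTEGER, GroupNumber INTEGER)"
)
conn.executemany(
    "INSERT INTO Users VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Smith", 2018, 1),
        (2, "Mary", "Perry", 2019, 1),
        (3, "William", "Brown", 2018, 2),
        (4, "Daniel", "Miler", 2017, 2),
    ],
)

# Average and latest join year per group, earliest-joining group first.
group_stats = conn.execute(
    "SELECT GroupNumber, AVG(JoinYear), MAX(JoinYear) "
    "FROM Users GROUP BY GroupNumber ORDER BY AVG(JoinYear)"
).fetchall()
print(group_stats)

# Move everyone in group 1 to group 3, then remove user 4.
conn.execute("UPDATE Users SET GroupNumber = 3 WHERE GroupNumber = 1")
conn.execute("DELETE FROM Users WHERE Userid = 4")
groups_after = conn.execute(
    "SELECT Userid, GroupNumber FROM Users ORDER BY Userid"
).fetchall()
print(groups_after)
conn.close()
```

Group 2 comes first in the aggregate result because its average join year (2017.5) is lower than group 1's (2018.5).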
SQL Joins
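
As a hedged illustration of how join types differ, the sketch below counts incoming links per user with an inner join and with a left join. The rows follow the sample Links and Users tables from the DML slide; the queries themselves are illustrative and not taken from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Userid INTEGER PRIMARY KEY, FirstName TEXT)")
conn.execute("CREATE TABLE Links (User1 INTEGER, User2 INTEGER)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [(1, "John"), (2, "Mary"), (3, "William"), (4, "Daniel")],
)
conn.executemany(
    "INSERT INTO Links VALUES (?, ?)",
    [(1, 2), (2, 3), (2, 1), (3, 1), (4, 1)],
)

# INNER JOIN keeps only users that appear as a link target (User2).
inner_rows = conn.execute(
    "SELECT u.FirstName, COUNT(*) "
    "FROM Users u JOIN Links l ON l.User2 = u.Userid "
    "GROUP BY u.Userid ORDER BY u.Userid"
).fetchall()

# LEFT JOIN keeps every user; users without incoming links get a count of 0,
# because COUNT(l.User1) ignores the NULLs produced for unmatched rows.
left_rows = conn.execute(
    "SELECT u.FirstName, COUNT(l.User1) "
    "FROM Users u LEFT JOIN Links l ON l.User2 = u.Userid "
    "GROUP BY u.Userid ORDER BY u.Userid"
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

Daniel has no incoming links, so he disappears from the inner join result but stays in the left join result with a count of zero.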
SQLite
In this course, we will be working with SQLite
Useful Links:

• SQLite.org
• DB Browser for SQLite
• sqlite3 module
WORKING WITH REAL WORLD DATA
Importing Dataset from CSV
Example 1: Baby Names
There is an open dataset containing the names

of babies born each year.

Let’s use this dataset to discover various trends

See also Kaggle Dataset
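
One possible way to import a CSV file into SQLite uses Python's `csv` and `sqlite3` modules. This is a sketch under assumptions: the column names (Name, Year, Gender, Births), the table name BabyNames, and the three sample rows are hypothetical and may not match the actual dataset. An in-memory string stands in for the real file.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for the real CSV file.
csv_text = """Name,Year,Gender,Births
Emma,2001,F,120
Jacob,2001,M,150
Zoe,2001,F,30
"""


def load_baby_names(csv_file, conn):
    """Create the BabyNames table and fill it from an open CSV file."""
    conn.execute(
        "CREATE TABLE BabyNames (Name TEXT, Year INTEGER, Gender TEXT, Births INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = [
        (r["Name"], int(r["Year"]), r["Gender"], int(r["Births"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO BabyNames VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_baby_names(io.StringIO(csv_text), conn)
n_in_table = conn.execute("SELECT COUNT(*) FROM BabyNames").fetchone()[0]
print(n_loaded, n_in_table)
conn.close()
```

For a real file, the `io.StringIO(csv_text)` argument would be replaced by a file object opened with `open(path, newline="")`, and `":memory:"` by a database file name.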


Example 1 - Questions
• How many rows are in the dataset?
• How many distinct names are in the dataset?
• What is the most common name? (male/female)
• How many names start with ‘B’? How common is the name
Beyonce?
• What is the rarest name that starts with ‘Z’ among female
babies born in 2001?
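
Questions of this kind map directly onto SQL queries. The sketch below runs a few of them on a tiny, made-up table; the schema (Name, Year, Gender, Births) and every row are hypothetical, so the printed answers say nothing about the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BabyNames (Name TEXT, Year INTEGER, Gender TEXT, Births INTEGER)"
)
conn.executemany(
    "INSERT INTO BabyNames VALUES (?, ?, ?, ?)",
    [
        ("Emma", 2001, "F", 120), ("Emma", 2002, "F", 130),
        ("Jacob", 2001, "M", 150), ("Ben", 2001, "M", 40),
        ("Bella", 2002, "F", 60), ("Zoe", 2001, "F", 30),
        ("Zelda", 2001, "F", 5), ("Zane", 2001, "M", 7),
    ],
)


def scalar(sql, params=()):
    """Run a query and return the first column of the first row."""
    return conn.execute(sql, params).fetchone()[0]


n_rows = scalar("SELECT COUNT(*) FROM BabyNames")
n_names = scalar("SELECT COUNT(DISTINCT Name) FROM BabyNames")
n_b_names = scalar("SELECT COUNT(DISTINCT Name) FROM BabyNames WHERE Name LIKE 'B%'")

# Most common name per gender: total births over all years, largest first.
top_sql = (
    "SELECT Name FROM BabyNames WHERE Gender = ? "
    "GROUP BY Name ORDER BY SUM(Births) DESC LIMIT 1"
)
top_female = scalar(top_sql, ("F",))
top_male = scalar(top_sql, ("M",))

# Rarest female name starting with 'Z' in 2001.
rarest_z = scalar(
    "SELECT Name FROM BabyNames "
    "WHERE Gender = 'F' AND Year = 2001 AND Name LIKE 'Z%' "
    "ORDER BY Births ASC LIMIT 1"
)
print(n_rows, n_names, n_b_names, top_female, top_male, rarest_z)
conn.close()
```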
Example 2: 1000 Netflix Shows
Example 2: Questions
• How many movies are there?

• How many movies are in each rating category?

• What is the highest rated movie in each category?
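
A per-category count and a "best in each category" query can be sketched as follows. The Shows table, its columns (Title, Rating, Score), and the five rows are invented for illustration and do not come from the actual Netflix dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Shows (Title TEXT, Rating TEXT, Score INTEGER)")
conn.executemany(
    "INSERT INTO Shows VALUES (?, ?, ?)",
    [
        ("Show A", "PG", 80),
        ("Show B", "PG", 95),
        ("Show C", "TV-14", 70),
        ("Show D", "TV-14", 88),
        ("Show E", "R", 91),
    ],
)

total = conn.execute("SELECT COUNT(*) FROM Shows").fetchone()[0]

# Number of titles in each rating category.
per_rating = conn.execute(
    "SELECT Rating, COUNT(*) FROM Shows GROUP BY Rating ORDER BY Rating"
).fetchall()

# Highest scored title per category, using a correlated subquery
# that looks up the maximum score within the same category.
best_per_rating = conn.execute(
    "SELECT Rating, Title, Score FROM Shows s "
    "WHERE Score = (SELECT MAX(Score) FROM Shows WHERE Rating = s.Rating) "
    "ORDER BY Rating"
).fetchall()

print(total)
print(per_rating)
print(best_per_rating)
conn.close()
```

If two titles tie for the top score in a category, this form of the query returns both of them.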


Working with Jupyter Notebooks
In this course, we are going to work with Python
• I recommend installing Anaconda and PyCharm
• It is recommended to work in a virtual environment
$ conda create -n venv python=3.7 anaconda
$ conda activate venv
• We will use Jupyter Notebooks
$ jupyter notebook

• I also recommend getting familiar with IPython


Practice SQL Online

- SQLZOO
- SQL Murder Mystery
Let’s move on to reviewing the
course’s first notebook
