Lecture 2

Uploaded by

Mrawan Taha
Lec. 2

Computational Tools for Data Science

4170201

What is data science?

© Dr. Magda Madbouly

• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data.

• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges of big data.

• Data science principles apply to all data, big and small.

What makes data “Big” Data?

Big data, as its name suggests, is very large. It starts at a size of at least 1 TB.

• Volume: data quantity
• Velocity: data speed
• Variety: data types

1st Characteristic of Big Data
Scale (Volume)
• A typical PC might have had 10 gigabytes of storage in 2000.

• Today, Facebook ingests 500 terabytes of new data every day.

• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.

• Smartphones and the data they create and consume, together with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

• Data volume is increasing exponentially.

Figure: Exponential increase in collected/generated data
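
To get a feel for the scale quoted above, the figures can be turned into simple arithmetic. The sketch below is only an illustration: it assumes decimal (SI) units and a constant daily ingest rate, and the function names are made up for this example.

```python
# Decimal (SI) units are assumed: 1 TB = 1,000 GB and 1 PB = 1,000 TB.
GB_PER_TB = 1_000
TB_PER_PB = 1_000

DAILY_INGEST_TB = 500  # daily ingest figure quoted on the slide
PC_STORAGE_GB = 10     # typical PC storage in 2000, as quoted on the slide


def yearly_volume_pb(daily_tb: float, days: int = 365) -> float:
    """Total volume in petabytes after `days` days of steady ingestion."""
    return daily_tb * days / TB_PER_PB


def pcs_filled_per_day(daily_tb: float, pc_gb: float) -> float:
    """Number of PCs with `pc_gb` of storage that one day of data would fill."""
    return daily_tb * GB_PER_TB / pc_gb


print(yearly_volume_pb(DAILY_INGEST_TB))                   # 182.5
print(pcs_filled_per_day(DAILY_INGEST_TB, PC_STORAGE_GB))  # 50000.0
```

Under these assumptions, one day of ingested data would fill fifty thousand PCs from the year 2000.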

2nd Characteristic of Big Data
Speed (Velocity)
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
• Data is being generated fast and needs to be processed fast.
• Online data analytics
• Late decisions → missing opportunities
• Examples
◦ E-promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you.

◦ Healthcare monitoring: sensors monitor your activities and body → any abnormal measurement requires an immediate reaction.
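
The healthcare example can be sketched as a stream that is checked one reading at a time. This is a minimal illustration, not a real monitoring system: the safe range, the function name `abnormal_readings`, and the sample readings are all assumptions made up for this sketch.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical safe range for a heart-rate sensor, in beats per minute.
LOW_BPM, HIGH_BPM = 50, 120


def abnormal_readings(
    stream: Iterable[Tuple[int, float]]
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, value) as soon as a reading leaves the safe range.

    Each reading is checked when it arrives, so an alert never waits
    for the rest of the stream to be collected.
    """
    for timestamp, bpm in stream:
        if not LOW_BPM <= bpm <= HIGH_BPM:
            yield timestamp, bpm


# Made-up readings: (seconds since start, beats per minute).
readings = [(0, 72.0), (1, 75.0), (2, 131.0), (3, 74.0), (4, 44.0)]
alerts = list(abnormal_readings(readings))
print(alerts)  # [(2, 131.0), (4, 44.0)]
```

Because the function is a generator, the same code works on a finite list or on an endless sensor feed.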

3rd Characteristic of Big Data
Types (Variety)
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, or a predictable, consistent data structure.

• Big Data analysis includes different types of data.

• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together.
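
As a small illustration of linking different types of data, the sketch below joins structured purchase records (CSV) with free-text posts (JSON) on a shared key. The sample data, the `customer_id` field, and the function name `link_by_customer` are assumptions invented for this example.

```python
import csv
import io
import json

# Made-up sample data in two formats that share a customer id.
purchases_csv = """customer_id,item,amount
1,book,12.50
2,phone,300.00
1,pen,1.20
"""
posts_json = (
    '[{"customer_id": 1, "text": "Loved the book!"},'
    ' {"customer_id": 3, "text": "Hello"}]'
)


def link_by_customer(csv_text: str, json_text: str) -> dict:
    """Group structured purchases and free-text posts under one customer id."""
    linked: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = linked.setdefault(
            int(row["customer_id"]), {"purchases": [], "posts": []}
        )
        entry["purchases"].append((row["item"], float(row["amount"])))
    for post in json.loads(json_text):
        entry = linked.setdefault(
            post["customer_id"], {"purchases": [], "posts": []}
        )
        entry["posts"].append(post["text"])
    return linked


linked = link_by_customer(purchases_csv, posts_json)
print(linked[1])
# {'purchases': [('book', 12.5), ('pen', 1.2)], 'posts': ['Loved the book!']}
```

Customers that appear in only one source are kept with an empty list for the other, so no data is silently dropped during linking.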

• “Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

• Complicated (intelligent) analysis of data may make small data “appear” to be “big”.

• Bottom line: any data that exceeds our current capability of processing can be regarded as “big”.
Sources of data
Data from the internet
Data from military organizations
Hospital data
NASA data
And so on…
Where is all this data coming from?
• Relational data (tables/transactions/legacy data)
• Text data (web)
• Semi-structured data (XML)
• Graph data
 ◦ Social network, Semantic Web (RDF), …
• Streaming data
Big Data is any data that is expensive to manage and hard to extract value from.
◦ Volume
 • The size of the data
◦ Velocity
 • The latency of data processing relative to the growing demand for interactivity
◦ Variety and complexity
 • The diversity of sources, formats, quality, and structures
- Data science is the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products.
- Simply put, data science is an umbrella term for several techniques used to extract information and insights from data.
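
The collect, clean, integrate, and analyze steps named above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two sources, their fields, and every function name are hypothetical, and real projects would use dedicated tools for each step.

```python
from statistics import mean

# Made-up raw records from two hypothetical sources; None marks a missing value.
source_a = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
source_b = [{"id": 1, "spend": 120.0}, {"id": 3, "spend": 80.0}]


def collect():
    """Collect: gather raw records from every source."""
    return source_a, source_b


def clean(records, field):
    """Clean: drop records whose `field` is missing."""
    return [rec for rec in records if rec.get(field) is not None]


def integrate(left, right):
    """Integrate: merge records that share the same id."""
    right_by_id = {rec["id"]: rec for rec in right}
    return [
        {**rec, **right_by_id[rec["id"]]}
        for rec in left
        if rec["id"] in right_by_id
    ]


def analyze(records):
    """Analyze: summarize the integrated data."""
    return {
        "mean_age": mean(rec["age"] for rec in records),
        "total_spend": sum(rec["spend"] for rec in records),
    }


a, b = collect()
summary = analyze(integrate(clean(a, "age"), clean(b, "spend")))
print(summary)  # {'mean_age': 42.5, 'total_spend': 200.0}
```

The summary dictionary plays the role of a very small "data product" here; visualization and interaction would build on top of it.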
- Companies learn your secrets, shopping patterns, and preferences.
For example, can we know if a person is diabetic, even if he/she doesn't want us to know?

- Data science and elections (2008, 2012): 1 million people installed the Obama Facebook app, which gave access to info on their “friends”.
• Data Scientist
◦ The most attractive job of the 21st century
• They find stories and extract knowledge. They are not reporters.
• A data scientist is the key person in acquiring, cleaning, representing, and analyzing data for business and research purposes.
• Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions.
• The problem is that with such very large, unsorted data we cannot analyze it or even classify it; data that is stored without ever being used becomes useless.
How to solve this?
The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.

The lifecycle has six phases, and project work can occur in several phases at once.
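
One way to picture a six-phase lifecycle in code is an ordered enumeration plus a set of currently active phases. The phase names below are an assumption: they are the names commonly given for this lifecycle in data science textbooks, and the slide itself only states that there are six phases. The helper `next_phase` is likewise hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Assumed phase names; the slide only says that there are six phases.
    DISCOVERY = 1
    DATA_PREPARATION = 2
    MODEL_PLANNING = 3
    MODEL_BUILDING = 4
    COMMUNICATE_RESULTS = 5
    OPERATIONALIZE = 6


# A project tracks a set of active phases, because work can occur
# in several phases at once.
active = {Phase.DATA_PREPARATION, Phase.MODEL_PLANNING}


def next_phase(phase: Phase) -> Phase:
    """Move forward one phase; stay at the last phase once it is reached."""
    return Phase(min(phase + 1, len(Phase)))


print(next_phase(Phase.DISCOVERY).name)       # DATA_PREPARATION
print(next_phase(Phase.OPERATIONALIZE).name)  # OPERATIONALIZE
```

Using a set for `active` rather than a single "current phase" variable mirrors the point that the phases overlap in practice.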
