0% found this document useful (0 votes)

8 views62 pages

Emerging Tech CH 2

Data science is a multi-disciplinary field that utilizes scientific methods and algorithms to extract insights from various types of data. It involves a data processing cycle consisting of input, processing, and output, and encompasses structured, semi-structured, and unstructured data. The document also discusses big data characteristics, including volume, variety, velocity, veracity, and value, as well as the Hadoop ecosystem for managing large datasets.

Uploaded by

birhanukassahunabuye

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views62 pages

Emerging Tech CH 2

Uploaded by

birhanukassahunabuye

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 62

CHAPTER TWO

DATA SCIENCE
An Overview of Data Science
 Data science is called data-driven science
 It is multi-disciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured, semi-structured
and unstructured data.
o Is a blend of various tools, algorithms, and machine learning principles
 It is much more than simply analyzing data.
 It offers a range of roles and requires a range of skills.
 It is primarily used to make decisions and predictions.
 It is a process of using raw data to explore insight and deliver a data product.

2
What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions
in a formalized manner
• Data is unprocessed facts and figures
• Data is a symbol or any row material (it can be text, number, image
and diagram.)
o Can be represented with:
 alphabets (A-Z, a-z)
 digits (0-9) or
 special characters (+, -, /, *, <,>, =, etc.).

• It can be usable or not

• It does not have meaning by it self . 3
Cont’d…
 Information: is the processed data on which decisions and actions are
based.

 It is a data that has been processed into a form that is meaningful.

 Information is interpreted data;

• Created from organized, structured, and processed data in a particular

context.

4
Data Processing Cycle
 Data processing is the re-structuring or re- ordering of data
by people or machines

o In order to increase their usefulness and add values for a particular

purpose.

 A raw data is fed to computer systems to generate the final output which is
information.

 Information can be presented into the forms of diagrams, chart, graph,

5
etc.
Cont.…
• Data processing consists of the following basic steps:
• Input,
• processing and
• output.
• These three steps constitute the data processing cycle.

•
Data processing cycle 6
Input:
• In this step, the input data is prepared in some convenient form for
processing.
• The form will depend on the processing machine.
• For example, when electronic computers are used, the input data can be recorded
on any one of the several types of storage medium, such as hard disk, CD, flash
disk and so on.
Processing:
• The input data is changed to produce data in a more useful form.
Output:
• The result of the proceeding processing step is collected.
7
DATA SCIENCE APPLICATIONS AND EXAMPLES

• Identifying and predicting disease

• Personalized healthcare recommendations

• Optimizing shipping routes in real-time

• Getting the most value out of soccer rosters

• Automating digital ad placement

• Predicting incarceration/prediction rates

Data types and their representation
 Data types can be described from different perspectives.

1. In computer science and computer programming, for instance,

 A data type is simply an attribute of data that tells the compiler or interpreter how
the programmer intends to use the data.

 A data type makes the values that expression, such as a variable or a function, might
take.

 This data type defines the operations that can be done on the data, the meaning of the
data, and the way values of that type can be stored. 9
Data types from Computer programming perspective
 Integers(int)- is used to store whole numbers, mathematically
known as integers
 Booleans(bool)- is used to represent restricted to one of two values: true
or false
 characters(char)- is used to store a single character
 Floating-point numbers(float)- is used to store real numbers
 Alphanumeric strings(string)- used to store a combination of
characters and numbers

10
Data types from Data Analytics perspective
 From a data analytics point of view,
 It is important to understand that there are three common types of data
types or structures:
1. Structured,
2. Semi-structured, and
3. Unstructured data types.

11
1. Structured Data

 Structured data are those that can be easily organized, stored and
transferred in a defined data model.
 Easily searchable by basic algorithm like spread sheets.
 Easily processed by computers.
 Structured data conforms to a tabular format with a relationship between
the different rows and columns.
 Example:
o Excel files or SQL databases

12
Example -------- Database
ID Name Age Department CGPA

1212/13 Yohannes 20 Accounting 2.8

1213/13 Aster 21 Economics 3.3

1414/13 Tolossa 22 Sociology 3.1

1415/13 Fatuma 20 Management 2.7

13
2. Semi-structured Data
 Their structures are irregular, implicit, flexible and often nested
hierarchically.
 Is a form of structured data that does not conform with the formal
structure of data models associated with relational databases
 It has some organizational properties like tags and other markers to
separate semantic elements that makes it easier to analyze.
 It is also known as a self-describing structure.
o Examples: include JSON and XML

13
.

15
3. Unstructured Data
 Is information that either does not have a predefined data model or is not
organized in a pre-defined manner.
 They are not easily combined or computationally analyzed
 Unstructured information is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in
structured databases.
o Examples: include text documents, audio, video files , or PDFs
16
.

17
Metadata
Metadata – Data about Data
 From technical point of view, this is not a separate data structure, but it
is one of the most important elements for Big Data analysis and
big data solutions.
 Metadata is data about data
 it is meaning of data
 It provides additional information about a specific set of data.
 Metadata is considered as processed data, used by Big data solutions
for initial analysis.
o Example: In a set of photographs, metadata could describe
when and where the photos were taken. 18
Data value Chain

 Describe the process of data creation and use; from first identifying a need
for data to its final use and possible reuse.

 The Data Value Chain is introduced to describe the information flow within a
big data system as a series of steps needed to generate value and useful
insights from data.

 Data chain: is any combination of two or more data element/ data item.

 Data value: is average of set of data value

19
The Big Data Value Chain identifies the following key high-level activities:
for more, clickhere

20
1. Data Acquisition
 It is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage on which data analysis can be
carried out.
 Later used on data analysis
 Data acquisition is one of the major big data challenges in
terms of infrastructure requirements.
 Data acquisition answers the following:
• How do we get the data
• What kind of data do we need
• Who owns the data
21
2. Data Analysis
 Making the raw data acquired amenable to use in decision-making as well
as domain-specific usage.

 Data analysis involves exploring, transforming, and modeling data with the
goal of highlighting relevant data, synthesizing and extracting useful
hidden information with high potential from a business point of view.

 Related areas include data mining, business intelligence, and machine

learning.

22
3. Data Curation
 It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for
its effective usage.

 Is a process of extraction important information from

scientific task

o e.g. research
 Data curation processes can be categorized into different
activities such as content creation, selection, classification,
23
4. Data Storage
 It is the persistence and management of data in a scalable way
that satisfies the needs of applications that require fast access to
the data.
o E.g. Relational Database Management Systems (RDBMS)
 RDBMS: the main, and almost unique, a solution to the storage
paradigm for nearly 40 years.
 The RDBMS: Not used for Big data

2
3
5. Data Usage
 It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the
business activity.

 Data usage in business decision-making can enhance competitiveness through

the reduction of costs, increased added value, or any other parameter that
can be measured against existing performance criteria.

 Interpreting output data’s

25
What Is Big Data?
 Big data is the term for a collection of data sets so large and
complex.
 It can not handle with a single computer
 It becomes difficult to process using on-hand database
management tools or traditional data processing applications.
 “large dataset” means a dataset too large to reasonably process or
store with traditional tooling or on a single computer.
 Big data is characterized by 3V(5V) and more:

26
Big data is characterized b y 5V and more:
 Volume:
• Refer to the vast amount of data generated every second.
• Data generated from emails, social networking sites, photos,
videos, sensor data etc.

• Data generated is in zettabytes. Zettabyte = trillion Gigabytes

• This increasing data sets makes too large to store and analysis
using traditional database technology.

• Now with big data Technology we can store and use data with
help of distributed system.
27
Big data is characterized b y 5V and more:
Variety
 Refer the different type of data we can now use.

 In past we only focused to deal with only structured data which

neatly fitted into tables, relational databases and structure table
databases like MYSQL

 Now 80% of the data is unstructured and can’t be put in table

easily.

 With big data Technology we can now analyze and bring

together data of different types such as messages, social media
conversations, photos, sensor data, video or voice recordings.

• We can handle not only structure data but also accommodate

unstructured data and semi-structure data.

Big data is characterized b y 5V and more:
 Veracity: refers to the trustworthiness/ reliability of the data.
 With many forms of big data, quality and accuracy are less
controllable (just think of Twitter posts with hash tags,
abbreviations, typos and colloquial speech as well as the
reliability and accuracy of content) , but
 Big data and analytics technology now allows us to work with
these type of data.
 The volumes often make up for the lack of quality or accuracy.

29
Big data is characterized b y 5V and more:
Value: most important V.
 Having access to big data is no good unless we can turn
it into value.

 Companies are starting to generate amazing value from

their big data.

 It ensure that huge data is available in the business

context.

 Collects the big data analysis it and make it available in

the test cases required for business.
30
Big data is characterized b y 5V and more:
 Velocity: Refer to the speed at which data is generated
the speed at which data moves around.
o Just think of social media messages going viral in
seconds.
o Technology allows us now to analyze the data while it
is being generated (sometimes referred to as in-
memory analytics), without ever putting it into
databases.
31
Big data is characterized b y 5V (summary)
 Volume: large amounts of data Zeta bytes/Massive
datasets
 Velocity: Data is live streaming or in motion
 Variety: data comes in many different forms from
diverse sources
 Veracity: can we trust the data? How accurate is it?
 Value: a mechanism to bring the correct meaning out
of the data

32
Big data is characterized b y 5V and more:

Characteristics of Big Data

33
Variety
• Variety: data comes in many different forms from diverse
sources

34
Velocity: Data is live streaming or
in motion

35
Value: a mechanism to bring the correct meaning out of the
data

36
Veracity: can we trust the data? How accurate is it?

37
Clustered Computing and Hadoop Ecosystem
Clustered Computing
 Because of the qualities of big data, individual
computers are often inadequate for handling the
data at most stages.
 To better address the high storage and
computational needs of big data, computer clusters
are a better fit.
 Giving different task to different
computers
39
Cont’d…
 Big data clustering software combines the resources
of many smaller machines,
 Seeking to provide a number of benefits:
o Resource Pooling/sharing
o High Availability:
o Easy Scalability:

40
.

41
Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make interaction
with big data easier.
 It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
 Hadoop is a software which manage different computers which are found
on different locations but they are connected each other using computer
network
 It is inspired by a technical document published by Google.

5
7
The four key characteristics of
Hadoop are:
 Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
 Reliable: It is reliable as it stores copiesof the
data on different machines and is resistant to hardware
failure.
 Scalable: It is easily scalable both, horizontally and vertically. A few
extra nodes help in scaling up the framework.
 Flexible: It is flexible and you can store as much structured and
unstructured data as you need to and decide to use them later.

43
Cont’d…
 Hadoop has an ecosystem that has evolved
from its four core components:
o Data management,
o Access,
o Processing, and
o Storage.
 It is continuously growing to meet the needs
of Big Data.
44
It comprises the following components and
many others:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solar, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
45
Hadoop
Ecosystem

46
HDFS
 HDFS: Is specially designed for storing huge
dataset in commodity hardware
 Data is stored in a distributed manner
 Enables fast data transfer among the nodes
 It is all about storing and managing huge dataset
in a cluster
 It is highly fault tolerance and efficient enough
to process huge amount of data
47
• HDFS Has two core
components
2. Data master
1. Name node and
node
• Name node: also called master
• Is the brain of the system
slave slave
• There is only one name node slave
• Maintains and manage the data node and it also store the meta
data
• If this name node crashed the entire system will dead

• Data node : also called slave

• Store a block of data
• There can be multiple data node
• Store the actual data and does reading writing and
processing performs replication as well 4
6
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
2. Processing the data in storage
3. Computing and analyzing data
4. Visualizing the result

6
3
Big Data Life Cycle with Hadoop
1.Ingesting data into the system
 The first stage of Big Data processing is Ingest.
 The data is ingested or transferred to Hadoop from
various sources such as relational databases,
systems, or local files.
 Sqoop transfers data from RDBMS to HDFS,
whereas Flume transfers event data.

4
8
Big Data Life Cycle with Hadoop
2. Processing the data in storage
 The second stage is Processing.
 In this stage, the data is stored and processed.
 The data is stored in the distributed file system,
HDFS, and the NoSQL distributed data, HBase.
 Spark and MapReduce perform data processing

4
9
Big Data Life Cycle with Hadoop
3. Computing analyzing data
 The third stage is to Analyze. Here, the data is
analyzed by processing frameworks such as Pig,
Hive, and Impala.
 Pig converts the data using a map and reduce and
then analyzes it.
 Hive is also based on the map and reduce
programming and is most suitable for structured
data. 5
0
Big Data Life Cycle with Hadoop
4. Visualizing the results
 The fourth stage is Access, which is performed by
tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed
by users.

5
1
Chapter Two Review Questions

1. Define data science; what are the roles of a data scientist?

2. Discuss data and its types from computer programming and data analytics
perspectives?
3. Discuss a series of steps needed to generate value and useful insights from data?
4. What is the principal goal of data science?
5. List out and discuss the characteristics of Big Data?
6. How we ingest streaming data into Hadoop Cluster?

Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
56 pages
Chapter 2 Introduction To Data Science - For Extension
No ratings yet
Chapter 2 Introduction To Data Science - For Extension
51 pages
2 Data-Science PDF
No ratings yet
2 Data-Science PDF
49 pages
Data Science
No ratings yet
Data Science
35 pages
IET - Chapter 2
No ratings yet
IET - Chapter 2
32 pages
IT 106 - Intro To Data Sciences
No ratings yet
IT 106 - Intro To Data Sciences
32 pages
Chapter 2 - Intro To Data Sciences (Updated)
No ratings yet
Chapter 2 - Intro To Data Sciences (Updated)
67 pages
EmTec Chapter 2
No ratings yet
EmTec Chapter 2
32 pages
Emergency Chapter Two
No ratings yet
Emergency Chapter Two
41 pages
Chapter 2 (Data Science)
No ratings yet
Chapter 2 (Data Science)
35 pages
Chapter 2 - EMTE - 240216 - 133452
No ratings yet
Chapter 2 - EMTE - 240216 - 133452
47 pages
Emerging CH2
No ratings yet
Emerging CH2
41 pages
Chapter 2 Data Science1
No ratings yet
Chapter 2 Data Science1
41 pages
Chapter - 2 Data Sciences
No ratings yet
Chapter - 2 Data Sciences
25 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
30 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
27 pages
Chap 2-Data Analysis
No ratings yet
Chap 2-Data Analysis
27 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
33 pages
Chapter 2 - Intro To Data Sciences
No ratings yet
Chapter 2 - Intro To Data Sciences
41 pages
Data Science
No ratings yet
Data Science
32 pages
Ict Ch. 2
No ratings yet
Ict Ch. 2
38 pages
ET Ch-2 Data Science PPT
No ratings yet
ET Ch-2 Data Science PPT
28 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
58 pages
CH-2 Data Science
No ratings yet
CH-2 Data Science
45 pages
(ET) Chapter - 2
No ratings yet
(ET) Chapter - 2
31 pages
PowerQuery PowerPivot DAX PDF
100% (1)
PowerQuery PowerPivot DAX PDF
115 pages
BI MCQs
33% (3)
BI MCQs
20 pages
Introduction To Emerging Technologies Chapter 2
No ratings yet
Introduction To Emerging Technologies Chapter 2
31 pages
Ch2 Emerging
No ratings yet
Ch2 Emerging
24 pages
Course Name: Introduction To Emerging Technologies
No ratings yet
Course Name: Introduction To Emerging Technologies
24 pages
Chapter 2 - Data Science
No ratings yet
Chapter 2 - Data Science
57 pages
Chapter 2
No ratings yet
Chapter 2
31 pages
Data Lifecycle
No ratings yet
Data Lifecycle
55 pages
Chapter 2 Emerging
No ratings yet
Chapter 2 Emerging
31 pages
Chapter 2EMR
No ratings yet
Chapter 2EMR
21 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
55 pages
Chapter 2 EmTe
No ratings yet
Chapter 2 EmTe
37 pages
Chapter 2 - Intro To Data Sciences
No ratings yet
Chapter 2 - Intro To Data Sciences
41 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
28 pages
Chapter 2 EMTE@Kibru 014914
No ratings yet
Chapter 2 EMTE@Kibru 014914
40 pages
CH-2 Data Science Emerging Technology
No ratings yet
CH-2 Data Science Emerging Technology
20 pages
Chapter 2-2
No ratings yet
Chapter 2-2
34 pages
Chapter 2
No ratings yet
Chapter 2
22 pages
CH 2 Data Science
No ratings yet
CH 2 Data Science
28 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
37 pages
The Ultimate Guide To Data Lineage
100% (2)
The Ultimate Guide To Data Lineage
19 pages
ETCh 2
No ratings yet
ETCh 2
36 pages
Chapter 2
No ratings yet
Chapter 2
30 pages
Chapter - 2 - Data Science
No ratings yet
Chapter - 2 - Data Science
33 pages
EmgTech Chapter 02
No ratings yet
EmgTech Chapter 02
52 pages
Chapter 2 - Data Science
No ratings yet
Chapter 2 - Data Science
20 pages
Chapter 2. Introduction To Data Science
No ratings yet
Chapter 2. Introduction To Data Science
41 pages
Chapter-2 Data Science2
No ratings yet
Chapter-2 Data Science2
24 pages
#2 Data Science
No ratings yet
#2 Data Science
32 pages
Chaoter Data Science
No ratings yet
Chaoter Data Science
20 pages
Chapter 2: Data Science
No ratings yet
Chapter 2: Data Science
32 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
37 pages
Chapter 2
No ratings yet
Chapter 2
27 pages
BECE Computer Studies Objective Questions and Answer
0% (1)
BECE Computer Studies Objective Questions and Answer
5 pages
Chapter Two
No ratings yet
Chapter Two
14 pages
Introduction of DBMS
No ratings yet
Introduction of DBMS
83 pages
Emerging Chapter 2
No ratings yet
Emerging Chapter 2
22 pages
Chapter - 2 - Data Science
No ratings yet
Chapter - 2 - Data Science
32 pages
Introduction To Database Management Systems DBMS
100% (2)
Introduction To Database Management Systems DBMS
10 pages
Instruction and Parts List
No ratings yet
Instruction and Parts List
2,812 pages
Pentaho Big Data Analytics
No ratings yet
Pentaho Big Data Analytics
118 pages
Data Science: Chapter Two
No ratings yet
Data Science: Chapter Two
8 pages
Library Assistant Job Description
No ratings yet
Library Assistant Job Description
1 page
Disaster Recovery Using Alwayson Availability Group - Scenario 1
No ratings yet
Disaster Recovery Using Alwayson Availability Group - Scenario 1
34 pages
Aastu Arch. Club
No ratings yet
Aastu Arch. Club
16 pages
Big Data 2.0 Processing Systems 2ed
No ratings yet
Big Data 2.0 Processing Systems 2ed
155 pages
About The Arch. Entrance Exam
No ratings yet
About The Arch. Entrance Exam
10 pages
Pietro Metastasio - Opere VII
No ratings yet
Pietro Metastasio - Opere VII
743 pages
Emerging Tech CH 5
No ratings yet
Emerging Tech CH 5
22 pages
Emerging Tech CH 4
No ratings yet
Emerging Tech CH 4
36 pages
Dice Resume CV BHAGWATI PRAJAPATI
100% (1)
Dice Resume CV BHAGWATI PRAJAPATI
4 pages
Arch. Entrance Exam 3
No ratings yet
Arch. Entrance Exam 3
6 pages
Big Data 2019 IEEE PROJECTS IEEE PAPERS
No ratings yet
Big Data 2019 IEEE PROJECTS IEEE PAPERS
8 pages
Računarske Aplikacije: Univerzitet U Novom Pazaru Departman Za Računarske Nauke Studijski Program Informatika
No ratings yet
Računarske Aplikacije: Univerzitet U Novom Pazaru Departman Za Računarske Nauke Studijski Program Informatika
19 pages
Arch. Entrance Exam 2
No ratings yet
Arch. Entrance Exam 2
5 pages
Database Management System
No ratings yet
Database Management System
37 pages
MCS-221 Repeated Questions
No ratings yet
MCS-221 Repeated Questions
3 pages
ADBMS - Introduction Chapter
No ratings yet
ADBMS - Introduction Chapter
6 pages
Association Rules
No ratings yet
Association Rules
14 pages
Types of DBMS Architecture Lesson Two
No ratings yet
Types of DBMS Architecture Lesson Two
4 pages
2 - Database As A Service - Current Issues and Its Future - Zheng2018
No ratings yet
2 - Database As A Service - Current Issues and Its Future - Zheng2018
5 pages
Wepik Advancements in Diabetes Detection Leveraging Machine Learning Models Including SVM Random Forest 20231103202928mQLf
No ratings yet
Wepik Advancements in Diabetes Detection Leveraging Machine Learning Models Including SVM Random Forest 20231103202928mQLf
12 pages
Win10 Backup Checklist v3 PDF
No ratings yet
Win10 Backup Checklist v3 PDF
1 page
Assignment: Ce Marketing Research & Data Analytics
No ratings yet
Assignment: Ce Marketing Research & Data Analytics
7 pages
Library Staff Profile
No ratings yet
Library Staff Profile
2 pages
Semantic Web Questionniare
No ratings yet
Semantic Web Questionniare
7 pages
340-Article Text-644-1-10-20210531
No ratings yet
340-Article Text-644-1-10-20210531
14 pages
Patent Watch Search
No ratings yet
Patent Watch Search
2 pages
Paras New Resume
No ratings yet
Paras New Resume
1 page
Tutorial 1
No ratings yet
Tutorial 1
2 pages

Emerging Tech CH 2

Uploaded by

Emerging Tech CH 2

Uploaded by

CHAPTER TWO

• It can be usable or not

 It is a data that has been processed into a form that is meaningful.

 Information is interpreted data;

• Created from organized, structured, and processed data in a particular

o In order to increase their usefulness and add values for a particular

 Information can be presented into the forms of diagrams, chart, graph,

• Identifying and predicting disease

• Personalized healthcare recommendations

• Optimizing shipping routes in real-time

• Getting the most value out of soccer rosters

• Automating digital ad placement

• Predicting incarceration/prediction rates

1. In computer science and computer programming, for instance,

1212/13 Yohannes 20 Accounting 2.8

1213/13 Aster 21 Economics 3.3

1414/13 Tolossa 22 Sociology 3.1

1415/13 Fatuma 20 Management 2.7

 Data value: is average of set of data value

 Related areas include data mining, business intelligence, and machine

 Is a process of extraction important information from

 Data usage in business decision-making can enhance competitiveness through

 Interpreting output data’s

• Data generated is in zettabytes. Zettabyte = trillion Gigabytes

 In past we only focused to deal with only structured data which

 Now 80% of the data is unstructured and can’t be put in table

 With big data Technology we can now analyze and bring

• We can handle not only structure data but also accommodate

unstructured data and semi-structure data.

 Companies are starting to generate amazing value from

 It ensure that huge data is available in the business

 Collects the big data analysis it and make it available in

Characteristics of Big Data

• Data node : also called slave

1. Define data science; what are the roles of a data scientist?

You might also like