
Chapter Two

Overview of Data Science


Learning outcomes
After completing this lesson, you should be able to

Differentiate between data and information

Explain data types and their representation

Explain the data value chain

Explain basic concepts of big data


An Overview of Data Science
Data science is a multidisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured,
semi-structured, and unstructured data

Data science continues to evolve as one of the most promising and in-demand
career paths for skilled professionals
An Overview of Data Science
It is multidisciplinary
What is data?
A representation of facts, concepts, or instructions in a formalized manner,
which should be suitable for communication, interpretation, or
processing by humans or electronic machines

Data can be described as unprocessed facts and figures

It can also be defined as groups of non-random symbols in the form of text,
images, and voice that represent quantities, actions, and objects
What is Information?
Organized or classified data, which has some meaningful value for the
receiver

Processed data on which decisions and actions are based. Plain collected
data as raw facts cannot help much in decision-making

Interpreted data created from organized, structured, and processed data in
a particular context
What is Information?
For a decision to be meaningful, the processed data must have the
following characteristics

Timeliness − Information should be available when required

Accuracy − Information should be accurate

Completeness − Information should be complete


Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or
machines to increase its usefulness and add value for a particular
purpose

Data processing consists of the following steps

Input

Processing

Output
Data Processing Cycle
Input

The input data is prepared in some convenient form for processing

The form will depend on the processing machine

For example, when electronic computers are used, the input data can be
recorded on any one of several types of input media, such as flash
disks, hard disks, and so on
Data Processing Cycle
Processing

In this step, the input data is changed to produce data in a more useful
form

For example, a summary of sales for a month can be calculated from the
sales orders data
Data Processing Cycle
Output

At this stage, the result of the preceding processing step is collected

The particular form of the output data depends on the use of the data

For example, the output data can be the total sales for the month

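As a minimal illustration of this cycle, the Python sketch below walks through the monthly-sales example used above; the order records and amounts are made up for the example.

```python
# Minimal illustration of the input -> processing -> output cycle.
# The sales order figures below are hypothetical.

# Input: raw sales orders collected for the month
sales_orders = [
    {"order_id": 1, "amount": 250.0},
    {"order_id": 2, "amount": 125.5},
    {"order_id": 3, "amount": 480.0},
]

# Processing: transform the raw orders into a more useful form
total_sales = sum(order["amount"] for order in sales_orders)

# Output: the result of processing, e.g. total sales for the month
print(f"Total sales for the month: {total_sales}")
```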

Data types and their representation
In computer science and computer programming, a data type or simply type
is an attribute of data which tells the compiler or interpreter how the
programmer intends to use the data

Common data types include

Integers, Boolean, Characters, Floating-Point Numbers, and Alphanumeric Strings

The data type defines the operations that can be done on the data, the
meaning of the data, and the way values of that type can be stored
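A short Python sketch of the common data types listed above (Python has no separate character type, so a one-character string stands in for it):

```python
# The common data types listed above, illustrated with Python literals.
age = 30                # integer
is_active = True        # Boolean
grade = "A"             # character (a one-character string in Python)
price = 19.99           # floating-point number
product_code = "SKU42"  # alphanumeric string

# The type determines which operations are meaningful:
# arithmetic works on numbers, concatenation on strings.
print(age + 1)                 # 31
print(product_code + "-2024")  # 'SKU42-2024'
print(type(price))             # <class 'float'>
```
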
Classifications of Data
Data can be classified into the following categories

Structured

Unstructured

Semi-structured

Meta-data
Structured Data
Data that adheres to a predefined data model and is therefore
straightforward to analyze

Conforms to a tabular format with relationships between different rows and
columns

Common examples

Excel files or SQL databases


Structured Data
Structured data depends on the existence of a data model – a model of how
data can be stored, processed and accessed

Because of a data model, each field is discrete and can be accessed
separately or jointly along with data from other fields

It is possible to quickly aggregate data from various locations in the database

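As a small illustration, the sketch below uses Python's standard sqlite3 module with an invented sales table to show how a tabular data model makes each field discrete and easy to aggregate:

```python
import sqlite3

# A small in-memory relational table; names and figures are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 250.0), ("South", 125.5), ("North", 480.0)],
)

# Because each field is discrete in the data model, aggregation is one query.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
```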

Unstructured Data
Data that either does not have a predefined data model or is not
organized in a predefined manner

It is typically text-heavy, but may contain data such as dates, numbers, and
facts as well

Common examples

audio files, video files, pictures, PDFs, documents in NoSQL stores, ...


Unstructured Data
The ability to store and process unstructured data has grown greatly in
recent years, with many new technologies and tools coming to the market that
are able to store specialized types of unstructured data
Unstructured Data
The ability to analyze unstructured data is especially relevant in the context
of Big Data, since a large part of data in organizations is unstructured

Think about pictures, videos or PDF documents

The ability to extract value from unstructured data is one of the main drivers
behind the rapid growth of Big Data
Semi-structured Data
A form of structured data that does not conform to the formal structure
of data models associated with relational databases or other forms of data
tables

But it contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data

Therefore, it is also known as a self-describing structure


Semi-structured Data
Examples of semi-structured data

JSON and XML


Semi-structured Data
It is considerably easier to analyze than unstructured data

Many Big Data solutions and tools have the ability to ‘read’ and process either
JSON or XML
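A minimal sketch of what such a self-describing record looks like, using Python's standard json module; the record contents are invented for illustration:

```python
import json

# A semi-structured record: the tags (keys) describe the data itself,
# and records can nest without a fixed relational schema.
doc = """
{
  "user": "alice",
  "posts": [
    {"title": "Big Data basics", "likes": 12},
    {"title": "NoSQL stores", "likes": 7}
  ]
}
"""

record = json.loads(doc)
print(record["user"])                         # 'alice'
print([p["title"] for p in record["posts"]])  # post titles
```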
Metadata – Data about Data
It provides additional information about a specific set of data

For example

The metadata of a photo could describe when and where the photo was
taken

The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data

Because of this reason, metadata is frequently used by Big Data solutions for
initial analysis.
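As a small illustration, the hypothetical photo metadata below is itself structured data that can be filtered and queried, even though the photo it describes is not:

```python
from datetime import datetime

# Hypothetical metadata for a single photo: the photo itself is
# unstructured, but its metadata fields are structured and queryable.
photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "taken_at": datetime(2023, 5, 14, 16, 30),
    "location": {"lat": 9.03, "lon": 38.74},
    "camera": "Pixel 6",
}

# Metadata fields can be filtered like any structured data.
if photo_metadata["taken_at"].year == 2023:
    print("Photo taken in 2023 at", photo_metadata["location"])
```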
What Is Big Data?
Large datasets together with the category of computing strategies and
technologies that are used to handle them
Data Value Chain
Describes the information flow within a big data system as a series of steps
needed to generate value and useful insights from data

The Big Data Value Chain identifies the following key high-level activities

Data Acquisition, Data Analysis, Data Curation, Data Storage, and Data Usage
Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a
data warehouse or any other storage solution on which data analysis can be
carried out

Data acquisition is one of the major big data challenges in terms of
infrastructure requirements
Data Acquisition
The infrastructure required for data acquisition must

deliver low, predictable latency in both capturing data and in executing queries

be able to handle very high transaction volumes, often in a distributed environment

support flexible and dynamic data structures


Data Analysis
Involves exploring, transforming, and modelling data with the goal of
highlighting relevant data, synthesising and extracting useful hidden
information with high potential from a business point of view

Related areas include data mining, business intelligence, and machine learning
Data Curation
Active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage

Data curation processes can be categorized into different activities

content creation, selection, classification, transformation, validation, and preservation
Data Curation
Data curators (also known as scientific curators, or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit their purpose

A key trend in the curation of big data is the use of community and
crowdsourcing approaches
Data Storage
It is the persistence and management of data in a scalable way that satisfies
the needs of applications that require fast access to the data

Relational Database Management Systems (RDBMS) have been the main, and
almost unique, solution to the storage paradigm for nearly 40 years
Data Storage
The ACID (Atomicity, Consistency, Isolation, and Durability) properties of
relational databases guarantee database transactions, but they lack flexibility
with regard to schema changes, and their performance and fault tolerance
suffer when data volumes and complexity grow, making them unsuitable for big
data scenarios

NoSQL technologies have been designed with the scalability goal in mind
and present a wide range of solutions based on alternative data models
Data Usage
Covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the
business activity

In business decision-making, it can enhance competitiveness through
reduction of costs, increased added value, or any other parameter that can
be measured against existing performance criteria
Why Are Big Data Systems Different?
The goal of most big data systems is to surface insights and connections
from large volumes of heterogeneous data that would not be possible using
conventional methods

The “three Vs of big data” characterize big data

Volume, Velocity, and Variety


Volume
Large datasets can be orders of magnitude larger than traditional
datasets, which demands more thought at each stage of the processing and
storage life cycle

Often, because the work requirements exceed the capabilities of a single
computer, this becomes a challenge of pooling, allocating, and coordinating
resources from groups of computers

Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important
Velocity
The speed at which information moves through the system

Data is frequently flowing into the system from multiple sources and is
often expected to be processed in real time to gain insights and update the
current understanding of the system

This focus on near instant feedback has driven many big data practitioners
away from a batch-oriented approach and closer to a real-time streaming
system
Velocity
Data is constantly being added, massaged, processed, and analyzed in
order to keep up with the influx of new information and to surface valuable
information early when it is most relevant

Requires robust systems with highly available components to guard against
failures along the data pipeline
Variety
Wide range of data in terms of both the sources being processed and their
relative quality

Data can be ingested from internal systems like application and server
logs, from social media feeds and other external APIs, from physical
device sensors, and from other providers

Big data seeks to handle potentially useful data regardless of where it’s
coming from by consolidating all information into a single system
Variety
The formats and types of media can vary significantly as well

Rich media like images, video files, and audio recordings are ingested
alongside text files, structured logs, etc
Veracity
"Veracity" in the context of data refers to the accuracy, reliability, and
trustworthiness of the data.

The variety of sources and the complexity of the processing can lead to
challenges in evaluating the quality of the data (biases, noise, and
abnormalities in the data)
Variability
Variability in data refers to the extent to which data points in a dataset deviate
or differ from each other.

Variation in the data leads to wide variation in quality

Additional resources may be needed to identify, process, or filter low-quality
data to make it more useful
Value
"Value in data" refers to the usefulness, relevance, and actionable insights that
data can provide when properly analyzed and interpreted.

The ultimate challenge of big data is delivering value

Sometimes, the systems and processes in place are complex enough that
using the data and extracting actual value can become difficult
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages

To better address the high storage and computational needs of big data,
computer clusters are a better fit

Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits

Resource Pooling, High Availability, Easy Scalability


Clustered Computing
Resource Pooling

Combining the available storage space, CPU, and memory

Processing large datasets requires large amounts of all three of these resources
Clustered Computing
High Availability

Clusters can provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting
access to data and processing

This becomes increasingly important as we continue to emphasize the
importance of real-time analytics
Clustered Computing
Easy Scalability

Clusters make it easy to scale horizontally by adding additional
machines to the group

This means the system can react to changes in resource requirements
without expanding the physical resources on a machine
Clustered Computing
Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on individual
nodes

Cluster membership and resource allocation can be handled by software like
Hadoop’s YARN (which stands for Yet Another Resource Negotiator) or
Apache Mesos
Clustered Computing

https://www.geeksforgeeks.org/hadoop-ecosystem/
Life Cycle of Big Data
Ingesting data into the system

Persisting the data in storage

Computing and Analyzing data

Visualizing the results


Ingesting Data into the System
The process of taking raw data and adding it to the system

The complexity of this operation depends heavily on the format and quality of
the data sources and how far the data is from the desired state prior to
processing

During the ingestion process, some level of analysis, sorting, and labelling
usually takes place

This process is sometimes called ETL, which stands for extract, transform,
and load
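A minimal sketch of the extract-transform-load idea in plain Python; the raw records, cleaning rule, and target list are all hypothetical stand-ins for a real ingestion pipeline:

```python
# A minimal extract-transform-load (ETL) sketch; the raw records,
# cleaning rules, and target store are all hypothetical.

raw_records = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "not available"},
    {"name": "Chala", "age": "29"},
]

def transform(record):
    """Clean and label a single raw record; drop it if the age is invalid."""
    try:
        age = int(record["age"])
    except ValueError:
        return None  # some light filtering happens at ingestion time
    return {"name": record["name"].strip(), "age": age, "source": "signup_form"}

warehouse = []  # stand-in for the storage layer that ingestion hands off to
for raw in raw_records:
    cleaned = transform(raw)
    if cleaned is not None:
        warehouse.append(cleaned)

print(warehouse)
```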
Persisting the Data in Storage
The ingestion processes typically hand the data off to the components that
manage storage, so that it can be reliably persisted to disk

The volume of incoming data, the requirements for availability, and the
distributed computing layer make more complex storage systems necessary

Usually a distributed file system for raw data storage is used


Persisting the Data in Storage
Solutions such as Apache Hadoop’s HDFS filesystem, Ceph, and GlusterFS
allow large quantities of data to be written across multiple nodes in the cluster

Distributed databases, especially NoSQL databases, are well-suited for this
role because they are often designed with the same fault-tolerance
considerations and can handle heterogeneous data
Computing and Analyzing Data
Once the data is available, the system can begin processing the data to
surface actual information

Data is often processed repeatedly, either iteratively by a single tool or by
using a number of tools to surface different types of insights
Computing and Analyzing Data
There are two ways of processing big data

Batch Processing

Real-time (Stream) Processing


Batch processing
This process involves breaking work up into smaller pieces, scheduling
each piece on an individual machine, reshuffling the data based on the
intermediate results, and then calculating and assembling the final result

These steps are often referred to individually as splitting, mapping,
shuffling, reducing, and assembling, or collectively as a distributed
map-reduce algorithm

Batch processing is most useful when dealing with very large datasets that
require quite a bit of computation
Batch processing
This is the strategy used by Apache Hadoop’s MapReduce
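The sketch below expresses the classic word-count example as split, map, shuffle, and reduce steps in plain, single-process Python; a framework such as Hadoop MapReduce runs the same steps distributed across a cluster:

```python
from collections import defaultdict

# Word count expressed as split -> map -> shuffle -> reduce, in one process.
# A real batch framework runs the same steps distributed across machines.
documents = ["big data needs big clusters", "data pipelines move data"]

# Map: emit (word, 1) pairs from each split piece of input
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a final, assembled result
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```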
Real-time processing
Demands that data be processed and made ready immediately and
requires the system to react as new data becomes available

One way of achieving this is stream processing, which operates on a
continuous stream of data composed of individual items

Another common characteristic of real-time processors is in-memory
computing, which works with representations of the data in the cluster’s
memory to avoid having to write back to disk

Apache Storm, Apache Flink, and Apache Spark provide different ways of
achieving real-time or near real-time processing
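A minimal, framework-free sketch of the stream-processing idea in Python: each item is handled as it arrives and the running state is kept in memory; the sensor event generator is invented for illustration:

```python
import random
import time

def sensor_stream(n_events=5):
    """Hypothetical continuous source: yields one reading at a time."""
    for _ in range(n_events):
        yield {"sensor": "temp-1", "value": 20 + random.random() * 10}
        time.sleep(0.1)  # simulate data arriving over time

# Stream processing: react to each item immediately, keeping running
# state in memory instead of writing intermediate results to disk.
count, running_sum = 0, 0.0
for event in sensor_stream():
    count += 1
    running_sum += event["value"]
    print(f"event {count}: value={event['value']:.1f}, "
          f"running average={running_sum / count:.1f}")
```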
Visualizing the Results
Visualizing data is one of the most useful ways to spot trends and make
sense of a large number of data points

Real-time processing is frequently used to visualize application and server
metrics

Technologies for visualization

Prometheus, Elastic Stack, Jupyter Notebook, and Apache Zeppelin
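As a small illustration, the sketch below plots a made-up server metric with matplotlib (assuming it is installed); tools such as Prometheus or the Elastic Stack do this kind of visualization at scale and in real time:

```python
import matplotlib.pyplot as plt  # assumes matplotlib is installed

# Hypothetical server metric: requests per minute over ten minutes.
minutes = list(range(10))
requests_per_minute = [120, 135, 150, 170, 160, 180, 210, 205, 230, 250]

plt.plot(minutes, requests_per_minute, marker="o")
plt.xlabel("Minute")
plt.ylabel("Requests per minute")
plt.title("Server request rate")
plt.show()
```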
