Chapter 2 Data Science

Data science is a multi-disciplinary field focused on extracting knowledge and insights from various forms of data using scientific methods and algorithms. It offers significant advantages, such as fraud detection and improved decision-making, but also faces challenges like data variety and a lack of skilled professionals. The document also discusses data types, the data processing cycle, and the concept of big data, highlighting its characteristics and applications.

Uploaded by

daniel
Copyright
© All Rights Reserved

Chapter Two

Data Science
2.1. Overview of Data Science

 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured, semi-structured and unstructured
data.
 Data science is the area of study that involves extracting insights
from vast amounts of data using various scientific methods,
algorithms, and processes. It helps you discover hidden patterns
in raw data.
Overview of Data Science (I)
 Data Science is an interdisciplinary field that allows you to extract
knowledge from structured or unstructured data.
 Data science enables you to translate a business problem into a
research project and then translate it back into a practical solution.
Significant advantages of using Data Science

 Data is the oil of today's world. With the right tools, technologies,
and algorithms, we can convert data into a distinctive business
advantage.
 Data Science can help you to detect fraud using advanced machine
learning algorithms.
 It helps you to prevent any significant monetary losses.
Significant advantages of using Data Science (II)

 It allows you to build intelligence into machines.
 You can perform sentiment analysis to gauge customer
brand loyalty.
 It enables you to make better and faster decisions.
 It helps you recommend the right product to the right
customer to enhance your business.
Challenges of Data Science

 Accurate analysis requires a high variety of information and data
 The available data science talent pool is inadequate
 Management does not provide financial support for a data science
team
 Data is unavailable or difficult to access
Challenges of Data Science (I)
 Data science results are not used effectively by business decision
makers
 Explaining data science to others is difficult
 Privacy issues
 Lack of significant domain expertise
 A very small organization cannot sustain a data science team
What are data and information?

 Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be suitable for
communication, interpretation, or processing by humans or
electronic machines.
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as letters (A-Z,
a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
What are data and information? (I)

 Information is the processed data on which decisions and actions
are based.
 Information is data that has been processed into a form that is
meaningful to the recipient and is of real or perceived value in the
current or prospective action or decision of the recipient.
 Furthermore, information is interpreted data, created from
organized, structured, and processed data in a particular context.
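The difference between data and information can be illustrated with a minimal sketch. The marks and the pass mark of 70 are hypothetical values chosen only for illustration: the list is raw data, and the sentence built from it is information a decision could be based on.

```python
# Raw data: unprocessed facts and figures (hypothetical exam marks).
raw_marks = [72, 85, 90, 64, 78]

# Processing turns the data into something meaningful to the recipient.
average = sum(raw_marks) / len(raw_marks)
passed = [mark for mark in raw_marks if mark >= 70]  # pass mark of 70 is an assumption

# Information: processed data placed in a context.
information = (
    f"Average mark: {average:.1f}; "
    f"{len(passed)} of {len(raw_marks)} students passed."
)
print(information)
```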
Data Processing Cycle

 Data processing is the re-structuring or re-ordering of data by
people or machines to increase its usefulness and add value for
a particular purpose.
 Data processing consists of the following basic steps: Input,
Processing, and Output. These three steps constitute the data
processing cycle.

Fig. 1. Data processing cycle


Data Processing Cycle (I)

 Input: in this step, the input data is prepared in some convenient form for
processing.
 The form will depend on the processing machine.
 For example, when electronic computers are used, the input data can be recorded on
any one of several types of storage media, such as a hard disk, CD, flash disk, and
so on.
 Processing: in this step, the input data is changed to produce data in a more
useful form.
 For example, interest can be calculated on a deposit at a bank, or a summary of
sales for the month can be calculated from the sales orders.
Data Processing Cycle (II)

 Output: at this stage, the result of the preceding processing step is
collected.
 The particular form of the output data depends on the use of the
data.
 For example, output data may be the payroll for employees.
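The three steps of the cycle can be sketched as three small functions. This is only an illustration using the monthly sales summary mentioned in the slides; the function names and the sample orders are hypothetical.

```python
def read_input(lines):
    """Input: put raw sales orders into a convenient form (a list of dicts)."""
    orders = []
    for line in lines:
        month, amount = line.split(",")
        orders.append({"month": month.strip(), "amount": float(amount)})
    return orders


def process(orders):
    """Processing: change the input into a more useful form (totals per month)."""
    totals = {}
    for order in orders:
        totals[order["month"]] = totals.get(order["month"], 0.0) + order["amount"]
    return totals


def write_output(totals):
    """Output: collect the result in the form its user needs (report lines)."""
    return [f"{month}: {total:.2f}" for month, total in sorted(totals.items())]


raw_lines = ["Jan, 120.50", "Feb, 80.00", "Jan, 30.25"]  # hypothetical sales orders
report = write_output(process(read_input(raw_lines)))
print("\n".join(report))
```

Keeping each step in its own function mirrors the cycle: the output form can change (for example, to a payroll sheet) without touching the input or processing steps.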
Data types and their representation


 Data types can be described from diverse perspectives.
 In computer science and computer programming, for instance, a
data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
Data types from Computer programming perspective
 Almost all programming languages explicitly include the notion of
data type, though different languages may use different
terminology. Common data types include:
 Integers (int): used to represent whole numbers, mathematically
known as integers
 Booleans (bool): used to represent values restricted to one of two
values: true or false
 Characters (char): used to represent a single character
 Floating-point numbers (float): used to represent real numbers
 Alphanumeric strings (string): used to represent a combination of
characters and numbers
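As a brief illustration, the common data types listed above look like this in Python. The variable names and values are hypothetical. Note that Python has no separate char type, so a one-character string stands in for it here.

```python
count = 42          # int: a whole number
is_valid = True     # bool: one of two values, True or False
initial = "A"       # no char type in Python; a one-character str plays that role
price = 19.99       # float: a floating-point approximation of a real number
label = "Room 101"  # str: a combination of letters, digits, and other characters

# The interpreter tracks the type of each value and uses it to decide
# which operations are allowed.
for value in (count, is_valid, initial, price, label):
    print(type(value).__name__, repr(value))
```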
Data types from Data Analytics perspective

 From a data analytics point of view, it is important to
understand that there are three common data types or
structures:
 Structured,
 Semi-structured, and
 Unstructured data types.


Structured Data

 Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with a
relationship between the different rows and columns.
 Common examples of structured data are Excel files or SQL
databases.
 Each of these has structured rows and columns that can be
sorted.
Semi-structured Data

 Semi-structured data is a form of structured data that does not
conform to the formal structure of data models associated with
relational databases or other forms of data tables, but nonetheless
contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
 Therefore, it is also known as a self-describing structure.
 JSON and XML are common examples of semi-structured data.
Unstructured Data
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined manner.
 Unstructured information is typically text-heavy but may contain
data such as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in
structured databases.
 Common examples of unstructured data include audio and video files, as
well as free-form content stored in NoSQL databases.
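The three categories can be contrasted with a short sketch that uses only Python's standard library. The sample table, JSON records, and note are hypothetical and serve only to show how much structure each form gives a program to work with.

```python
import csv
import io
import json

# Structured: rows and columns that follow a fixed, pre-defined model.
table_text = "id,name,city\n1,Abebe,Addis Ababa\n2,Sara,Adama\n"
rows = list(csv.DictReader(io.StringIO(table_text)))
cities = sorted(row["city"] for row in rows)  # columns can be sorted directly

# Semi-structured: self-describing keys and nesting, but records may differ.
json_text = (
    '[{"name": "Abebe", "phones": ["0911"]},'
    ' {"name": "Sara", "email": "s@example.com"}]'
)
records = json.loads(json_text)
fields_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text with no data model; any structure must be inferred.
note = "Met Sara on 12 May; she ordered 3 laptops."
numbers_in_note = [
    int(token) for token in note.replace(";", " ").split() if token.isdigit()
]

print(cities, fields_per_record, numbers_in_note)
```

The structured table can be queried by column name, the JSON records describe themselves but do not share one schema, and the note yields numbers only after ad hoc text handling.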
Metadata – Data about Data

 The last category of data type is metadata.
 From a technical point of view, this is not a separate data structure,
but it is one of the most important elements for Big Data analysis
and big data solutions.
 Metadata is data about data.
 It provides additional information about a specific set of data.
 In a set of photographs, for example, metadata could describe
when and where the photos were taken.
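A file system offers a simple, concrete case of metadata. The sketch below writes a small temporary file (a stand-in for a photograph) and then reads information about it, rather than its contents, with `os.stat`. The dictionary keys are hypothetical names chosen for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello data")  # the data itself
    path = handle.name

info = os.stat(path)  # metadata kept by the file system about that data
metadata = {
    "size_bytes": info.st_size,
    "modified_utc": datetime.fromtimestamp(
        info.st_mtime, tz=timezone.utc
    ).isoformat(),
    "extension": os.path.splitext(path)[1],
}
print(metadata)

os.remove(path)  # clean up the temporary file
```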
Data Value Chain

 The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data. The Big Data Value
Chain identifies the following key high-level activities:

Fig. 2. Data Value Chain


1. Data Acquisition

 It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried
out.
 Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
 The infrastructure required to support the acquisition of big data must deliver
low, predictable latency in both capturing data and executing queries, be able
to handle very high transaction volumes, often in a distributed environment, and
support flexible and dynamic data structures.
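The gathering, filtering, and cleaning steps of acquisition can be sketched on a very small scale as follows. The sensor feeds, field names, and the list standing in for a data warehouse are all hypothetical; a real big data pipeline would run these steps on distributed infrastructure.

```python
def gather(sources):
    """Gather raw records from every source into one stream."""
    for source in sources:
        yield from source


def keep(record):
    """Filter: drop records that lack a usable reading."""
    return record.get("temp_c") not in (None, "")


def clean(record):
    """Clean: normalize field formats before the record is stored."""
    return {
        "sensor": record["sensor"].strip().lower(),
        "temp_c": float(record["temp_c"]),
    }


# Hypothetical sensor feeds standing in for real data sources.
feed_a = [{"sensor": " S1 ", "temp_c": "21.5"}, {"sensor": "S2", "temp_c": ""}]
feed_b = [{"sensor": "s3", "temp_c": 19}, {"sensor": "S4", "temp_c": None}]

# A plain list plays the role of the storage solution here.
warehouse = [clean(r) for r in gather([feed_a, feed_b]) if keep(r)]
print(warehouse)
```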
2. Data Analysis

 It is concerned with making the raw data acquired amenable to use
in decision-making as well as domain-specific usage.
 Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a
business point of view.
 Related areas include data mining, business intelligence, and
machine learning.
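As a minimal sketch of exploring, transforming, and modeling, the example below summarizes a handful of observations, centers them, and fits a straight line by ordinary least squares. The figures are hypothetical, and least squares is just one simple modeling choice used here for illustration.

```python
from statistics import mean

# Hypothetical monthly figures: (advertising spend, sales), both in thousands.
observations = [(1.0, 12.0), (2.0, 15.0), (3.0, 18.0), (4.0, 21.0)]

# Explore: summarize the raw values.
spend = [x for x, _ in observations]
sales = [y for _, y in observations]
summary = {"mean_spend": mean(spend), "mean_sales": mean(sales)}

# Transform: center both variables on their means.
dx = [x - summary["mean_spend"] for x in spend]
dy = [y - summary["mean_sales"] for y in sales]

# Model: ordinary least-squares line, sales = intercept + slope * spend.
slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
intercept = summary["mean_sales"] - slope * summary["mean_spend"]

print(summary, slope, intercept)
```

From a business point of view, the fitted slope is the useful hidden information: it estimates how much sales change for each additional unit of spend in this toy dataset.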
3. Data Curation

 It is the active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage.
 Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
 Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.
 Data curators (also known as scientific curators or data annotators)
hold the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit for their purpose.
 A key trend in the curation of big data is the use of community and
crowdsourcing approaches.
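Validation, one of the curation activities listed above, can be sketched as a set of quality rules applied to each record. The catalog entries and the two rules (a non-empty title and a plausible year) are hypothetical; real curation work relies on domain-specific requirements and expert judgment.

```python
def validate(record):
    """Return a list of quality problems found in one record (empty if none)."""
    problems = []
    if not record.get("title"):
        problems.append("missing title")
    year = record.get("year")
    if not isinstance(year, int) or not 1900 <= year <= 2100:
        problems.append("implausible year")
    return problems


# Hypothetical catalog of datasets awaiting curation.
catalog = [
    {"title": "Rainfall survey", "year": 2019},
    {"title": "", "year": 2020},
    {"title": "Crop yields", "year": 20021},
]

accepted = [record for record in catalog if not validate(record)]
rejected = {
    index: validate(record)
    for index, record in enumerate(catalog)
    if validate(record)
}
print(accepted, rejected)
```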
4. Data Storage

 It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the
main, and almost the only, solution to the storage paradigm for nearly
40 years.
 However, the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions lack flexibility with
regard to schema changes, and with regard to performance and fault
tolerance when data volumes and complexity grow, making them
unsuitable for big data scenarios.
 NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
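The atomicity guarantee mentioned above can be seen in a small sketch with Python's built-in `sqlite3` module, used here only as a convenient relational engine. The `accounts` table, the account names, and the `transfer` helper are hypothetical. Either both updates of a transfer are stored, or neither is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return False if the transfer was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Read all balances into a dict for easy inspection."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failed, succeeded, after_success)
```

In the failed transfer, the credit to the second account is undone when the debit violates the CHECK constraint. The fixed table definition also hints at the rigidity the slide describes: adding a new field means changing the schema, whereas many NoSQL stores accept records with differing fields.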
5. Data Usage

 It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
 Data usage in business decision making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be
measured against existing performance criteria.
Basic concepts of big data

What Is Big Data?


 Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
 In this context, a “large dataset” means a dataset too large
to reasonably process or store with traditional tooling or on a
single computer.
 This means that the common scale of big datasets is
constantly shifting and may vary significantly from
organization to organization.
 Big data is characterized by the 4 Vs and more:
 Volume: large amounts of data (zettabytes, massive datasets)
 Velocity: data is live streaming or in motion
 Variety: data comes in many different forms from diverse sources
 Veracity: can we trust the data? How accurate is it?

Fig 3. Characteristics of Big data
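When data arrives as a stream or is too large to hold in memory, one common response is to process it incrementally. The sketch below is a single-machine illustration of that idea only; the simulated readings are hypothetical, and real big data systems typically spread such work across many machines.

```python
def readings(n):
    """Simulate a stream of n sensor values without storing them all."""
    for i in range(n):
        yield float(i % 10)


def running_mean(stream):
    """Compute a mean in one pass, keeping only a count and a total in memory."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
    return total / count if count else 0.0


result = running_mean(readings(1_000_000))
print(result)
```

Because the generator yields one value at a time, memory use stays constant no matter how many readings flow through.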


Sources of Big Data
 Mobile devices (tracking all objects all the time)
Areas of Applications of Big Data

 Health and well-being
 Policy making and public opinions
 Smart cities and more efficient society
 New online educational models: MOOC and
student-teacher modeling
 Robotics and human-robot interaction
Areas of Applications of Big Data (I)
 Smarter healthcare
 Multi-channel sales
 Telecom
 Homeland security
 Trading analytics
 Traffic control
 Search quality
 Manufacturing
Big Data vs Data Science

Factors         Big Data                           Data Science
Concept         Handling large data                Analyzing data
Responsibility  Processing huge volumes of data    Understanding patterns within
                and generating insights            data and making decisions
Industry        E-commerce, security services,     Sales, image recognition,
                telecommunication                  advertisement, risk analytics
Tools           Hadoop                             Python, R