
Adigrat University

College of Engineering and Technology


Department of Computing

Course Title: Introduction to Emerging Technologies


Course Code:

Chapter Two: Data Science


Outline

• An Overview of Data Science


• Data types and their representation
• Data Value Chain
• Basic Concepts of Big Data

2
An Overview of Data Science
• Data science is a multi-disciplinary field.
• It uses scientific methods, processes, algorithms, and systems to
  extract knowledge and insights from structured and unstructured
  data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.

• Let’s consider the data involved in buying a box of cereal (teff, wheat,
  or barley) from a store:
• Prepare for the purchase by writing “cereal” in your notebook.
  This planned purchase is a piece of data.
• In the store, use your data as a reminder to grab the item and put it in
  your cart.

3
An Overview of Data Science…
• The cashier scans the barcode on your container, and the
cash register logs the price.

• If your purchase was one of the last boxes in the store, a computer tells
  the stock manager that it is time to request another order from the
  distributor.

• At the end of the month, a store manager looks at a collection of
  pie charts showing all the different kinds of cereal that were sold
  and decides to offer more varieties of these next month.

• So, the small piece of information that began in your notebook
  ended up on the desk of a manager as an aid to decision making.

4
An Overview of Data Science…
• On the trip from your pencil (notebook) to the manager’s desk, the
  data went through many transformations.
• Pieces of hardware such as the barcode scanner were involved in
  collecting, manipulating, and storing the data.
• Different pieces of software were used to organize, aggregate,
  visualize, and present the data.
• People decided which systems to buy and install, and who should
  get access to what kinds of data.

• As an academic discipline, data science continues to evolve as one of
  the most promising and in-demand career paths for skilled
  professionals.
• Today, successful data professionals understand that they must
  advance beyond the traditional skills of analyzing large amounts of
  data, data mining, and programming.
5
An Overview of Data Science…
What are data and information?
• Data is a representation of facts, concepts, or instructions in a
formalized manner.
• It should be suitable for communication, interpretation, or
processing, by human or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as
 alphabets (A-Z, a-z), digits (0-9) or special characters (+, -, /,
*, <,>, =, etc.).

• Information is processed data on which decisions and actions are
  based; it is created from organized, structured, and processed
  data in a particular context.

6
An Overview of Data Science…
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or
  machines to increase its usefulness and add value for a particular
  purpose.
• It consists of the following basic steps:
 input, processing, and output
• These three steps constitute the data processing cycle.

7
An Overview of Data Science…
Data Processing Cycle...
A. Input: the input data is prepared in some convenient form for
processing, depending on the processing machine.
• For example, for electronic computers, input data can be
recorded on any one of several types of storage media,
such as a hard disk, CD, or flash disk.
B. Processing: the input data is changed to produce data in a more
useful form.
• For example, interest can be calculated on a deposit to a bank, or a
summary of sales for the month can be calculated from the
sales orders.
C. Output: the result of the processing step is collected.
• For example, output data may be the payroll for employees.

8
Data Types and their Representation
• Data types can be described from diverse perspectives.
• For instance, in computer programming, a data type is simply an
attribute of data that tells the compiler how the programmer intends
to use the data.

Data Types from Computer Programming Perspective

• A data type defines the operations that can be done on the data,
though different languages may use different terminology.
• Common data types include:
• Integers (int) - used to store whole numbers
• Floating-point numbers (float) - used to store real numbers
• Characters (char) - used to store a single character
• Booleans (bool) - used to store one of two values: true or false
• Alphanumeric strings (string) - used to store characters and numbers
9
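The common data types listed above can be illustrated with Python's built-in types (a rough sketch; names and behavior differ across languages, and Python has no separate char type):

```python
# The common data types from the list above, in Python.
quantity: int = 3            # integer: whole numbers
price: float = 12.99         # floating-point: real numbers
grade: str = "A"             # character: a single character (a 1-char string in Python)
in_stock: bool = True        # boolean: one of two values, True or False
label: str = "Box-42"        # alphanumeric string: characters and numbers

# The data type tells the language which operations are valid
# and what the result type is:
total = quantity * price     # multiplying an int by a float
print(type(total).__name__)  # -> float
```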
Data types and their representation…
Data types from Data Analytics perspective
• From a data analytics point of view, there are three common types
of data structures:
A. Structured
B. Semi-structured
C. Unstructured data types
The figure below describes the three types of data and metadata.

10
Data types and their representation…
Data types from Data Analytics perspective…
A. Structured Data: is data that adheres to a pre-defined data
model and is therefore straightforward to analyze
• It conforms to a tabular format with a relationship between
the different rows and columns.
• Common examples of structured data are Excel files or
SQL databases. Each of these has structured rows and
columns that can be sorted.
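A small sketch of structured data, using Python's built-in sqlite3 module; the table and its values are invented for illustration. Because the data conforms to a pre-defined model of rows and columns, it can be sorted and queried in a straightforward way:

```python
# Hypothetical structured data: rows and columns in a SQL table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cereal (name TEXT, grain TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cereal VALUES (?, ?, ?)",
    [("Box A", "teff", 4.50), ("Box B", "wheat", 3.25)],
)
# The pre-defined schema makes analysis straightforward:
rows = conn.execute("SELECT name FROM cereal ORDER BY price").fetchall()
print(rows)  # -> [('Box B',), ('Box A',)]
```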
B. Semi-structured Data: is a form of structured data that
• Does not conform to the formal structure of data models
associated with relational databases.
• Contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields
• It is also known as a self-describing structure
• Example: JSON and XML
11
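The JSON example mentioned above shows why semi-structured data is called self-describing: keys act as tags that separate semantic elements and nest records, without any fixed table schema. A minimal sketch (field names invented):

```python
# Hypothetical semi-structured data: a JSON record whose keys (tags)
# label and nest the fields, so the structure describes itself.
import json

record = """
{
  "student": {
    "name": "Alem",
    "courses": ["Emerging Technologies", "Data Science"],
    "year": 2
  }
}
"""
data = json.loads(record)
# No pre-defined table schema, yet fields are labeled and hierarchical.
print(data["student"]["courses"][0])  # -> Emerging Technologies
```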
Data types and their representation…
Data types from Data Analytics perspective…
C. Unstructured Data: is information that either
• Does not have a predefined data model or is not organized
in a pre-defined manner.
• It is typically text-heavy but may contain data such as dates,
numbers, and facts as well.
• This results in irregularities and ambiguities
• For example, audio and video files, or NoSQL databases
D. Metadata (Data about Data): provides additional information
about a specific set of data.
• Frequently used by Big Data solutions for initial analysis
• For example, in a set of photographs, metadata could describe
when and where the photos were taken.

12
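The photograph example above can be sketched as a small metadata dictionary. The field names below mimic EXIF-style tags but are invented for illustration:

```python
# Hypothetical metadata (data about data) for a set of photographs.
photo_metadata = {
    "IMG_001.jpg": {"taken": "2023-05-01T09:30", "location": "Adigrat"},
    "IMG_002.jpg": {"taken": "2023-05-01T10:05", "location": "Mekelle"},
}

# An initial analysis can use the metadata alone, without ever
# opening the image files themselves:
places = {meta["location"] for meta in photo_metadata.values()}
print(sorted(places))  # -> ['Adigrat', 'Mekelle']
```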
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value
and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
  activities: data acquisition, data analysis, data curation, data
  storage, and data usage.
13
Data Value Chain…

Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out.

• It is one of the major big data challenges in terms of infrastructure


requirements.

• The infrastructure required to support the acquisition of big data must


deliver low, predictable latency in both capturing data and in executing
queries.

• It must be able to handle very high transaction volumes, often in a
  distributed environment, and
• Support flexible and dynamic data structures.
14
Data Value Chain…
Data Analysis

• It is concerned with making the raw data acquired amenable to use in


decision-making as well as domain-specific usage.

• Data analysis involves


• Exploring, transforming, and modeling data with the goal of
highlighting relevant data
• Synthesizing and extracting useful hidden information with high
potential from a business point of view.

• Related areas include:


• Data mining
• Business intelligence
• Machine learning
15
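The exploring/transforming/modeling steps above can be illustrated with a toy analysis in plain Python. The cereal sales figures are invented, and the "model" here is just a trend summary, a deliberately minimal stand-in for real data-mining or machine-learning methods:

```python
# Toy sketch of the analysis step: transform raw monthly sales data
# to highlight the relevant, business-useful information.
monthly_sales = {
    "teff":   [120, 135, 150],
    "wheat":  [80, 78, 75],
    "barley": [40, 42, 45],
}

# Transform: summarize each cereal's trend (last month minus first).
trend = {k: v[-1] - v[0] for k, v in monthly_sales.items()}

# Extract the useful hidden information: which products are growing?
growing = [k for k, t in trend.items() if t > 0]
print(growing)  # -> ['teff', 'barley']
```

A store manager could act on this output directly, much like the pie-chart example earlier in the chapter.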
Data Value Chain…
Data Curation

• It is the active management of data over its life cycle to ensure it


meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities
such as:
• Content creation
• Selection and classification
• Transformation, validation, and preservation
• It is performed by expert curators who are responsible for improving
the accessibility and quality of data.

• Data curators hold the responsibility of ensuring that data are


trustworthy, discoverable, accessible, reusable and fit their purpose.

16
Data Value Chain…
Data Storage

• It is the persistence and management of data in a scalable way that


satisfies the needs of applications that require fast access to the data.

• Relational DBMSs have been the main storage solution for nearly

40 years.

• NoSQL technologies have been designed with the scalability goal in


mind and present a wide range of solutions based on alternative data
models.

• However, relational systems that guarantee the ACID (Atomicity,

Consistency, Isolation, and Durability) properties for database
transactions lack flexibility with regard to schema changes, and their
performance and fault tolerance suffer when data volumes and
complexity grow.
17
Data Value Chain…
Data Usage

• It covers the data-driven business activities that need access to data

and its analysis, as well as the tools needed to integrate that analysis
within the business activity.

• Data usage in business decision making can enhance competitiveness


through
• The reduction of costs
• Increased added value
• Any other parameter that can be measured against existing
performance criteria.

18
Basic Concepts of Big Data
• Big data is a term for the non-traditional strategies and technologies
needed to gather, organize, and process large datasets, and to draw
insights from them.
What Is Big Data?
• Big data is the term for a collection of large and complex data sets

• that are difficult to process using on-hand database management

tools or traditional data processing applications.

• Big data is characterized by 3Vs and more:

• Volume: large amounts of data, zettabytes/massive datasets
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?

19
Basic Concepts of Big Data…
The figure below shows the characteristics of big data.

20
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the sheer quantities involved, individual computers are
often inadequate for handling big data at most stages.

• To better address the high storage and computational needs of big


data, computer clusters are a better fit.

• Big data clustering software combines the resources of many smaller


machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold
data is a clear benefit.
• But CPU and memory pooling are also extremely important.
• Processing large datasets requires large amounts of all three of
these resources.
21
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Clustered Computing…

• High Availability: Clusters can provide varying levels of fault


tolerance and availability guarantees
• To prevent hardware or software failures from affecting access
to data and processing.
• This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.

• Easy Scalability: Clusters make it easy to scale horizontally by


adding additional machines to the group.
• This means the system can react to changes in resource
requirements without needing to expand the physical resources
of any one machine.
22
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Clustered Computing…
• Using clusters requires a solution for managing
• Cluster membership
• Coordinating resource sharing
• Scheduling actual work on individual nodes

• Cluster membership and resource allocation can be handled by


software like Hadoop’s YARN.

• The assembled computing cluster often acts as a foundation that other


software interfaces with to process the data.

• The machines involved in the computing cluster are also typically


involved with the management of a distributed storage system.
23
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction
with big data easier.
• It is a framework that allows for the distributed processing of large
datasets across clusters of computers

• The four key characteristics of Hadoop are:


• Economical: highly economical as ordinary computers can be used
for data processing
• Reliable: as it stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable horizontally and vertically
• Flexible: It is flexible and you can store as much structured and
unstructured data as you need.
24
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem…

• Hadoop has an ecosystem that has evolved from its four core
components:
• Data management
• Access
• Processing
• Storage
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• HBase: NoSQL Database
25
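MapReduce is listed above as Hadoop's programming-based data processing component. The sketch below imitates its two phases (map, then shuffle/reduce) on a single machine in plain Python; it is a conceptual illustration, not the Hadoop API:

```python
# Single-machine sketch of the MapReduce idea: a word count.
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big clusters", "data value chain"]
print(reduce_phase(map_phase(lines)))
# -> {'big': 2, 'data': 2, 'clusters': 1, 'value': 1, 'chain': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster's nodes, with the framework handling the shuffle between them.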
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem…
Below figure shows Hadoop Ecosystem

26
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…

Big Data Life Cycle with Hadoop


1. Ingesting data into the system: this is the first stage; data is
ingested or transferred to Hadoop from various sources such as
relational databases and local files.
2. Processing the data in storage: the data is stored in the
distributed file system (HDFS), and NoSQL databases such as
HBase are used for data processing.
3. Computing and analyzing data: the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: this stage is Access, performed
by tools such as Hue and Cloudera Search.
• The analyzed data can be accessed by users.

27
