
Course Title: Introduction to Emerging Technologies

Credit Hour: 3 hrs.


Course Code: EmTe1012.
ECTS: 5 [3 Lecture hours and 0 Lab hours]
Lecture Schedule: Every ____________

Bedasa Wayessa
[email protected]

EmTe1012 1
Classroom Rules
• Latecomers will be tolerated only for the first 5 minutes of every class
• Talk to me, not to each other
• Do not sleep
• Do not use phones
• Failure to obey the classroom rules → a 2 to 3 class ban

EmTe1012 2
Assignment Submission
• Guidelines for submission will be provided with every assignment
• Re-grade requests will ONLY be entertained within one week after the
assignments have been handed back to students or after the assignment
due date
• IMPORTANT: Late submissions are allowed ONLY until 1 day after the
deadline, with a 10% mark deduction.
• IMPORTANT: Late + Copy = ZERO marks

EmTe1012 3
QUIZZES
• Quizzes will NOT be announced
• Re-grade requests will only be entertained within one week after the
marked quizzes have been handed back to students [with a tangible and
acceptable reason only]

EmTe1012 4
Chapter 2

Introduction to Data Science

EmTe1012 5
Outlines
• Introduction to Data Science
– Overview of Data Science
• Definition of data and information
• Data types and representation
– Data Value Chain
• Data Acquisition
• Data Analysis
• Data Curation
• Data Storage
• Data Usage
– Basic concepts of Big data

EmTe1012 6
Objectives
• Describe what data science is and the role of data scientists.
• Differentiate data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of Big Data.
• Describe the purpose of the Hadoop ecosystem components.

EmTe1012 7
Activity
• What is data science?
• Can you describe the role of data in emerging technology?
• What are data and information?
• What is big data?

EmTe1012 8
Definition of Data Science
• Data science is a multi-disciplinary field that uses
– Scientific methods,
– Processes,
– Algorithms, and
– Systems to extract knowledge and insights from
– Structured,
– Semi-structured, and
– Unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.

EmTe1012 9
Data and Information
• Data can be defined as a representation of
– facts,
– concepts, or
– instructions in a formalized manner which should be suitable for
communication, interpretation, or processing by humans or
electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as
– letters of the alphabet (A-Z, a-z),
– digits (0-9), or
– special characters
EmTe1012 10
Data and Information
• Information
– is processed data on which decisions and actions are based.
– It is data that has been processed into a form that is meaningful to
the recipient.
• Information is interpreted data; it is created from
– organized,
– structured, and
– processed data in a particular context.

EmTe1012 11
Data Processing cycle
• Data processing is the re-structuring or re-ordering of data by
people or machines to increase its usefulness and add value for a
particular purpose.
• Data processing consists of three basic steps:
– input,
– processing, and
– output.
• These three steps constitute the data processing cycle.

EmTe1012 12
Data Processing cycle
• Input
– The input data is prepared in some convenient form for processing.
– The form will depend on the processing machine.
– For example, when electronic computers are used, the input data
can be recorded on any one of several types of storage media,
such as a hard disk, CD, flash disk, and so on.
• Processing
– The input data is changed to produce data in a more useful form.
– For example, interest can be calculated on deposits at a bank, or a
summary of the month's sales can be calculated from the sales
orders.

EmTe1012 13
Data Processing cycle
• Output
– The result of the preceding processing step is collected.
– The particular form of the output data depends on the use of the
data.
– For example, the output data may be the payroll for employees.
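
To make the three steps concrete, the following is a minimal Python sketch of the cycle, reusing the interest-on-deposit example from the slide; the deposit amounts and the 7% rate are made-up illustration values.

def read_input():
    # Input: data prepared in a convenient form (here, a hard-coded list;
    # in practice it might be read from a hard disk, CD, or flash disk).
    return [1000.00, 2500.50, 400.00]      # deposit balances

def process(deposits, annual_rate=0.07):
    # Processing: transform the input into a more useful form
    # (the interest earned on each deposit).
    return [round(amount * annual_rate, 2) for amount in deposits]

def write_output(interest):
    # Output: collect and present the result of the processing step.
    for i, value in enumerate(interest, start=1):
        print(f"Account {i}: interest = {value}")

write_output(process(read_input()))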

EmTe1012 14
Data Types and Their Representation
• Data types can be described from diverse perspectives.
1. Data types from a computer programming perspective
– In computer programming, a data type is simply an attribute of data
that tells the compiler or interpreter how the programmer intends
to use the data.
– Data types help ensure the correct use and interpretation of data.
– They prevent errors and improve code readability.
– Different data types have different properties and operations.
• Common data types include:
– Integers (int): used to store whole numbers, mathematically
known as integers

EmTe1012 15
Data Types and Their Representation
• Data types can be described from diverse perspectives.
1. Data types from a computer programming perspective
o Common data types include:
– Booleans (bool): used to represent values restricted to one of two
options: true or false
– Characters (char): used to store a single character
– Floating-point numbers (float): used to store real numbers
– Alphanumeric strings (string): used to store a combination of
characters and numbers

EmTe1012 16
Data Types and Their Representation
• A data type constrains the values that an expression, such as a variable or
a function, might take.
• The data type defines:
– the operations that can be done on the data,
– the meaning of the data, and
– the way values of that type can be stored.
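
As a quick illustration of these common data types and the operations they permit, here is a small Python sketch; note that Python has no separate char type, so a one-character string stands in for it, and the student ID value is invented.

age = 42               # integer (int): whole numbers
is_valid = True        # boolean (bool): restricted to True or False
grade = "A"            # character: Python uses a 1-character string for this
temperature = 36.6     # floating-point (float): real numbers
student_id = "ETS-0913-12"   # alphanumeric string (str), invented example

# The type determines which operations are allowed:
print(age + 1)               # arithmetic is defined for int
print(student_id.upper())    # string operations are defined for str
# print(age + student_id)    # would raise TypeError: int and str cannot be added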

EmTe1012 17
Data Types and Their Representation
2. Data types from a data analytics perspective
• From a data analytics point of view, there are three common types of
data:
– Structured: relational databases, spreadsheets, CSV files.
– Semi-structured: XML, JSON, log files.
– Unstructured: text documents, images, videos, audio
recordings.

EmTe1012 18
Data Types and Their Representation
• Structured data
– Is data that adheres to a pre-defined data model and is therefore
straightforward to analyze.
– Structured data conforms to a tabular format with a relationship
between the different rows and columns.
o Common examples of structured data are Excel files or SQL
databases.
– Each of these has structured rows and columns that can be sorted.
– Easy to store, query, and analyze.
– Well understood and supported by many tools and technologies.
– Provides a clear and organized view of the data.
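
A minimal sketch of structured data, using Python's built-in sqlite3 module; the table name and rows are invented, but they show how a fixed schema makes sorting and querying straightforward.

import sqlite3

# A small table with a pre-defined schema: every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", 10, 0.50), ("notebook", 3, 2.75), ("bag", 1, 15.00)],
)

# Because every row follows the same schema, sorting and analysis are easy:
for row in conn.execute("SELECT item, quantity * price FROM sales ORDER BY item"):
    print(row)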

EmTe1012 19
Data Types and Their Representation
• Semi-structured data
– is a form of structured data that does not conform to the formal
structure of data models associated with relational databases or
other forms of data tables, but contains tags or other markers to
separate semantic elements and enforce hierarchies of records and
fields within the data.
– More flexible than structured data.
– Therefore, it is also known as a self-describing structure.
– Can accommodate a wide variety of data formats and structures.
– Examples: JSON (JavaScript Object Notation), XML (Extensible
Markup Language), HTML (HyperText Markup Language)
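
A small Python sketch of semi-structured data: a JSON document whose tags (keys) describe the data and whose records do not all share the same fields; the two records are invented examples.

import json

raw = """
[
  {"name": "Abebe", "age": 21, "courses": ["EmTe1012", "Math"]},
  {"name": "Sara",  "email": "sara@example.com"}
]
"""

for record in json.loads(raw):
    # Fields vary per record (self-describing), so access them defensively.
    print(record["name"], record.get("age", "age not given"))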

EmTe1012 20
Data Types and Their Representation
• Unstructured data
– Is information that either does not have a predefined data model
or is not organized in a pre-defined manner.
– Unstructured information is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
– This results in irregularities and ambiguities that make it difficult
to understand using traditional programs, as compared to data stored
in structured databases.
– Contains valuable information that is not captured by structured data.
– Can provide insights into customer behavior, sentiment, and trends.
– Examples of unstructured data include:
o text documents, audio and video files, or NoSQL databases.
EmTe1012 21
Data Types and Their Representation
• Metadata – data about data
– Metadata is data that provides additional information about other data.
– It is essentially "data about data."
– In a set of photographs, for example, metadata could describe
when and where the photos were taken.
o The metadata then provides fields for dates and locations
which, by themselves, can be considered structured data.
– It is one of the most important elements for big data analysis and
big data solutions. Metadata is used:
o to describe the characteristics of a dataset,
o to provide context and meaning to the data, and
o to make data more discoverable and reusable.
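
A minimal Python sketch of the photo-metadata example above; the file name and field values are invented, but they show how metadata attaches structured fields (dates, locations) to otherwise unstructured content.

# A photo (unstructured binary data) paired with structured "data about data".
photo_metadata = {
    "file":        "IMG_0042.jpg",
    "captured_at": "2023-11-05T14:32:00",        # when the photo was taken
    "location":    {"lat": 9.03, "lon": 38.74},  # where it was taken
    "camera":      "Phone camera, 12 MP",
}

# Because the metadata fields are structured, they are easy to search:
print(photo_metadata["captured_at"], photo_metadata["location"])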
EmTe1012 22
Activity
• Discuss data types from programming and analytics perspectives.
• Compare metadata with structured, unstructured, and semi-structured data.
• Give at least one example each of structured, unstructured, and semi-
structured data types.

EmTe1012 23
Data Value Chain
• The Big Data Value Chain describes the key steps involved in
transforming raw data into valuable insights and actions.
• It consists of the following high-level activities:
– Acquisition, Analysis, Curation, Storage and Usage.

EmTe1012 24
Data Value Chain
1. Data Acquisition
• It is the process of gathering, filtering, and cleaning data before the
data is put into a data warehouse.
• It is one of the major big data challenges in terms of infrastructure
requirements.
• The infrastructure required to support the acquisition of big data must:
– deliver low, predictable latency both in capturing data and in
executing queries;
– be able to handle very high transaction volumes, often in a
distributed environment; and
– support flexible and dynamic data structures.
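
A rough Python sketch of the acquisition step: gathering, filtering, and cleaning records before they are loaded into storage; the record format and field names are invented for illustration.

raw_records = [
    {"sensor": "s1", "temp": " 23.5 "},
    {"sensor": "s2", "temp": ""},          # missing reading -> filtered out
    {"sensor": "s1", "temp": "24.1"},
]

def acquire(records):
    for r in records:
        value = r["temp"].strip()
        if not value:          # filtering: drop incomplete records
            continue
        # cleaning: normalize the value into a proper numeric type
        yield {"sensor": r["sensor"], "temp": float(value)}

cleaned = list(acquire(raw_records))
print(cleaned)   # ready to be loaded into a data warehouse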

EmTe1012 25
Data Value Chain
2. Data Analysis
• It is concerned with making the raw data acquired amenable to use
in decision-making as well as domain-specific usage.
• Data analysis involves: Exploring, transforming, and modeling data
with the goal of:
• highlighting relevant data,
• synthesizing and
• extracting useful hidden information with high potential from a
business point of view.
• Related areas include data mining, business intelligence, and machine
learning.
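
A minimal Python sketch of the analysis step: exploring and summarizing acquired data to surface useful information; the monthly sales figures are invented.

import statistics

monthly_sales = [1200, 1350, 990, 1875, 1420, 1610]

# Simple exploration/summarization to highlight relevant information:
print("mean:  ", statistics.mean(monthly_sales))
print("median:", statistics.median(monthly_sales))
print("best month:", max(range(len(monthly_sales)), key=monthly_sales.__getitem__) + 1)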

EmTe1012 26
Data Value Chain
3. Data Curation
• It is the active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as:
• content creation, selection, classification, transformation,
validation, and preservation.
• Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.
• Data curators are responsible for ensuring that data are trustworthy,
discoverable, accessible, reusable and fit their purpose.

EmTe1012 27
Data Value Chain
4. Data Storage
• Data storage refers to the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to the data.
• Traditional data storage: RDBMSs have been the main, and almost
only, solution to the storage paradigm for nearly 40 years.
• However, RDBMSs with their ACID (Atomicity, Consistency, Isolation, and
Durability) properties, which guarantee database transactions,
– lack flexibility with regard to schema changes, and
– their performance and fault tolerance suffer when data volumes and
complexity grow, making them unsuitable for big data scenarios.
• NoSQL technologies have been designed with the scalability goal in mind
and present a wide range of solutions based on alternative data models.
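
A rough sketch of the schema flexibility that NoSQL (document-style) storage offers compared with a fixed relational schema; plain Python dicts stand in for documents here, whereas a real system would be something like MongoDB or HBase.

# Each "document" is a dict; records need not share the same fields,
# so the schema can evolve without migrating existing data.
document_store = []
document_store.append({"user": "u1", "name": "Abebe"})
document_store.append({"user": "u2", "name": "Sara", "preferences": {"lang": "am"}})

# A relational table, by contrast, would need a schema change (ALTER TABLE)
# before the new "preferences" field could be stored.
for doc in document_store:
    print(doc)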
EmTe1012 28
Data Value Chain
5. Data Usage
• It covers the data-driven business activities that need
– access to data, its analysis, and the tools needed to integrate the
data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness
through:
– the reduction of costs,
– increased added value,
– Enhanced decision-making or
– any other parameter that can be measured against existing
performance criteria.

EmTe1012 29
Basic concepts of big data
• Big data is a collection of data sets so large and complex that it
becomes difficult to process them using on-hand database management
tools or traditional data processing applications.
– A large dataset means a dataset too large to reasonably process or store
with traditional tools or on a single computer.
• Big data is characterized by 3Vs and more:
– Volume: large amounts of data (zettabytes/massive datasets)
– Velocity: data is live streaming or in motion
– Variety: data comes in many different forms from diverse sources
– Veracity: can we trust the data? How accurate is it?

EmTe1012 30
Basic concepts of big data

Figure: Characteristics of big data

• Variability: Big data can be highly dynamic and constantly changing.


• Complexity: Big data analysis often requires specialized tools and techniques.
• Privacy and security: Ensuring the privacy and security of big data is crucial.
EmTe1012 31
Clustered Computing and Hadoop Ecosystem
• Individual computers are often inadequate for handling big data at
most stages.
• Challenges of Individual Computers for Big Data:
– Limited storage capacity
– Insufficient processing power
– Scalability limitations
– Single point of failure
• Solutions that overcome the limits of individual computers:
– Distributed computing
– Cloud computing
– Clustered Computing

EmTe1012 32
Clustered Computing and Hadoop Ecosystem
• Clustered Computing
• To better address the high storage and computational needs of big data,
computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits.
• Benefits of Clustered Computing for Big Data:
– Resource Pooling: Combining the available storage space to hold
data is a clear benefit, but CPU and memory pooling are also
extremely important.
• Processing large datasets requires large amounts of all three of
these resources.

EmTe1012 33
Clustered Computing and Hadoop Ecosystem
• Clustered Computing
– High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.
– Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• This means the system can react to changes in resource
requirements without expanding the physical resources on a
machine.

EmTe1012 34
Clustered Computing and Hadoop Ecosystem
• Using clusters requires a solution for managing:
– cluster membership,
– coordinating resource sharing and
– scheduling actual work on individual nodes.
• Cluster Management Software:
• Cluster membership and resource allocation can be handled by software like
– Hadoop YARN (Yet Another Resource Negotiator): A popular
resource management framework for Hadoop clusters.
• Clustered computing provides a powerful and flexible solution for addressing
the storage, computational, and scalability challenges of big data.

EmTe1012 35
Activity
• List and discuss the characteristics of big data.
• Describe the big data life cycle.
• Which step do you think is most useful, and why?
• List and describe each technology or tool used in the big data life cycle.
• Discuss the three methods of computing over a large dataset.

EmTe1012 36
Hadoop and its Ecosystem
• Hadoop is an open-source framework designed for distributed
processing of large datasets across clusters of computers.
• The Hadoop Ecosystem is a suite of tools that provides various
services to solve big data problems.
• It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
• It is inspired by a technical document published by Google.
• The Hadoop Ecosystem refers to a collection of open-source
software tools and technologies that work together to provide a
comprehensive solution for big data processing and analysis.

EmTe1012 37
Hadoop and its Ecosystem
• The four key characteristics of Hadoop are:
– Economical: Its systems are highly economical as ordinary computers
can be used for data processing. [Cost-effective, Open-source]
– Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure. [Fault tolerance, Data
redundancy]
– Scalable: It is easily scalable, both horizontally and vertically.
• Horizontally, a few extra nodes help in scaling up the framework;
vertically, more resources (CPU, memory) can be added to existing nodes.
– Flexible: It is flexible and you can store as much structured and
unstructured data as you need to and decide to use them later.
• Data format and Programming language agnostic
EmTe1012 38
Hadoop and its Ecosystem
• There are four major elements of Hadoop i.e.
– HDFS, MapReduce, YARN, and Hadoop Common.
• Hadoop has an ecosystem that has evolved from its four core
components:
– data management,
– access,
– processing, and
– storage.
– It is continuously growing to meet the needs of Big Data.

EmTe1012 39
Hadoop and its Ecosystem
• It comprises the following components and many others:
– HDFS: Hadoop Distributed File System
– YARN: Yet Another Resource Negotiator
– MapReduce: Programming-based data processing
– Spark: In-memory data processing
– Pig, Hive: Query-based processing of data services
– HBase: NoSQL database
– Mahout, Spark MLlib: Machine learning algorithm libraries
– Solr, Lucene: Searching and indexing
– ZooKeeper: Managing the cluster
– Oozie: Job scheduling
EmTe1012 40
Hadoop and its Ecosystem

Figure: Hadoop Ecosystem
EmTe1012 41
Four Major Elements of Hadoop
• HDFS: is responsible for storing large data sets of structured or
unstructured data across various nodes, and thereby maintains the
metadata in the form of log files.
– It breaks down large files into smaller blocks and distributes them
across a cluster of machines.
– HDFS consists of two core components:
the Name Node and the Data Node.
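
A toy Python sketch of the idea behind HDFS block storage: a file is split into blocks, the blocks are spread over data nodes, and a name node keeps the metadata about which block lives where; the block size and node names here are invented (real HDFS blocks default to 128 MB).

BLOCK_SIZE = 4  # bytes, tiny on purpose for the demo
data_nodes = {"node1": [], "node2": [], "node3": []}
name_node = {}   # metadata: file -> list of (block_id, node)

def put(filename, content):
    # Split the file into fixed-size blocks.
    blocks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
    placement = []
    for i, block in enumerate(blocks):
        node = list(data_nodes)[i % len(data_nodes)]   # round-robin placement
        data_nodes[node].append((f"{filename}#{i}", block))
        placement.append((f"{filename}#{i}", node))
    name_node[filename] = placement   # the name node only keeps metadata

put("log.txt", "abcdefghij")
print(name_node["log.txt"])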
• YARN (Yet Another Resource Negotiator):
– YARN is the resource management layer of Hadoop that manages
resources and schedules tasks across the cluster.
– It allows different data processing engines (like MapReduce, Spark,
etc.) to run on the same cluster.
EmTe1012 42
Four Major Elements of Hadoop
 MapReduce:
– A programming model and processing engine used to process and
generate large datasets in parallel across a distributed cluster.
– It consists of two main functions: map, which processes data into key-
value pairs, and reduce, which performs summarization or
aggregation on the output of the map phase.
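
A minimal, single-machine Python sketch of the MapReduce model, the classic word count; real Hadoop MapReduce distributes the map and reduce phases across a cluster, while this only shows the two functions conceptually.

from collections import defaultdict

documents = ["big data needs big clusters", "data science uses big data"]

# Map phase: emit (key, value) pairs -- here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 3, ...}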
• Hadoop Common: refers to the collection of common utilities and
libraries that support other Hadoop modules.
– It is an essential part or module of the Apache Hadoop Framework,
along with the Hadoop Distributed File System (HDFS), Hadoop
YARN and Hadoop MapReduce.

EmTe1012 43
Big Data Life Cycle with Hadoop
• Ingesting data into the system
– The data is ingested or transferred into Hadoop from various sources
such as relational databases, other systems, or local files.
– Sqoop transfers data from an RDBMS to HDFS, whereas Flume
transfers event data.
• Processing the data in storage
– In this stage, the data is stored and processed.
– The data is stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase. Spark and MapReduce perform data
processing.

EmTe1012 44
Big Data Life Cycle with Hadoop
• Computing and analyzing data
– Here, the data is analyzed by processing frameworks such as Pig,
Hive, and Impala.
– Pig converts the data using map and reduce operations and then analyzes it.
– Hive is also based on the map and reduce programming model and is most
suitable for structured data.
• Visualizing the results
– The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
– In this stage, the analyzed data can be accessed by users.

EmTe1012 45
Chapter Two Review Questions

Reading Assignment

EmTe1012 46
End of Chapter 2

Next:

EmTe1012 47
