Lesson 3 Data Science
I. LEARNING OBJECTIVES
Upon completion of this lesson, the students should be able to:
• Describe what data science is and the role of data scientists.
• Differentiate data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of Big Data.
• Describe the purpose of the Hadoop ecosystem components.
II. SUBTOPICS
This lesson engages students in acquiring knowledge of the following
topics:
• Data science definition
• Data and information
• Data value chain
• Big data
III. LESSON PROPER
• Introduction
In this lesson, you are going to learn more about data science, data vs.
information, data types and representation, data value chain, and basic
concepts of big data.
In order to uncover useful intelligence for their organizations, data scientists must
master the full spectrum of the data science life cycle and possess a level of
flexibility and understanding to maximize returns at each phase of the process.
Data scientists need to be curious and result-oriented, with exceptional industry-
specific knowledge and communication skills that allow them to explain highly
technical results to their non-technical counterparts. They possess a strong
quantitative background in statistics and linear algebra as well as programming
knowledge with a focus on data warehousing, mining, and modeling to build and
analyze algorithms. In this lesson, we will cover basic definitions of data and
information, data types and representation, the data value chain, and basic concepts
of big data.
What are data and information?
Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for communication,
interpretation, or processing, by human or electronic machines. It can be
described as unprocessed facts and figures. It is represented with the help of
characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /,
*, <, >, =, etc.).
Information, on the other hand, is processed data on which decisions and actions are
based. It is data that has been processed into a form that is meaningful to the
recipient and is of real or perceived value in the recipient's current or prospective
actions or decisions. Furthermore, information is interpreted data, created
from organized, structured, and processed data in a particular context.
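To make the distinction concrete, here is a minimal Python sketch (the sale amounts are made up for illustration): the raw list of figures is data, while the computed summary that could support a decision is information.

```python
# Data: raw, unprocessed facts - individual sale amounts for one month (illustrative values).
sales = [120.50, 89.99, 230.00, 45.25, 310.75]

# Processing the data into a form that is meaningful to the recipient.
total_revenue = sum(sales)
average_sale = total_revenue / len(sales)

# Information: organized, interpreted data in a particular context (a monthly sales summary).
print(f"Transactions:  {len(sales)}")
print(f"Total revenue: {total_revenue:.2f}")
print(f"Average sale:  {average_sale:.2f}")
```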
• Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or
machines to increase its usefulness and add value for a particular purpose.
Data processing consists of the following basic steps - input, processing, and
output. These three steps constitute the data processing cycle.
Input − in this step, the input data is prepared in some convenient form for
processing. The form will depend on the processing machine. For example,
when electronic computers are used, the input data can be recorded on any
of several types of storage media, such as a hard disk, CD, or flash disk.
Processing − in this step, the input data is changed to produce data in a more
useful form. For example, interest can be calculated on a deposit to a bank, or a
summary of sales for the month can be calculated from the sales orders.
Output − at this stage, the result of the preceding processing step is
collected. The particular form of the output data depends on the use of the
data. For example, the output data may be the payroll for employees.
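As a worked illustration of the three steps, the sketch below uses the bank-interest example with made-up figures: the deposit and rate are the input, the interest calculation is the processing, and the printed statement line is the output.

```python
# Input: raw figures prepared for processing (illustrative values).
deposit = 5000.00     # amount deposited in the bank
annual_rate = 0.03    # 3% annual interest rate

# Processing: change the input data into a more useful form by computing the interest.
interest = deposit * annual_rate
balance = deposit + interest

# Output: the result of the processing step, collected for use (e.g., a statement line).
print(f"Deposit:  {deposit:.2f}")
print(f"Interest: {interest:.2f}")
print(f"Balance:  {balance:.2f}")
```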
• Data Types and Their Representation
Structured Data
Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze. Structured data conforms to a tabular
format with a relationship between the different rows and columns. Common
examples of structured data are Excel files or SQL databases. Each of these has
structured rows and columns that can be sorted.
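As a small, hypothetical example of structured data in Python, the rows below follow one fixed set of columns, so they can be parsed and sorted directly (CSV text is used here only as a stand-in for an Excel sheet or SQL table).

```python
import csv
import io

# Structured data: every row conforms to the same pre-defined columns (hypothetical records).
table = """id,name,department,salary
1,Alice,Finance,52000
2,Bekele,IT,61000
3,Chaltu,HR,48000
"""

# Because the schema is fixed, the rows can be read and sorted straightforwardly.
rows = list(csv.DictReader(io.StringIO(table)))
for row in sorted(rows, key=lambda r: int(r["salary"]), reverse=True):
    print(row["name"], row["department"], row["salary"])
```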
Semi-structured Data
Semi-structured data is a form of structured data that does not conform to
the formal structure of data models associated with relational databases or
other forms of data tables, but nonetheless contains tags or other markers to
separate semantic elements and enforce hierarchies of records and fields within
the data. Therefore, it is also known as a self-describing structure. JSON and
XML are common examples of semi-structured data.
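The hypothetical JSON document below illustrates why semi-structured data is called self-describing: there is no fixed table schema, but the keys act as tags that name the fields and mark the hierarchy.

```python
import json

# Semi-structured data: keys are markers that separate semantic elements (hypothetical record).
document = """
{
  "student": {
    "name": "Abebe",
    "courses": ["Emerging Technologies", "Statistics"],
    "contact": {"email": "abebe@example.com"}
  }
}
"""

record = json.loads(document)
# Values are reached by following the hierarchy the keys describe, not by row/column position.
print(record["student"]["name"])
print(record["student"]["courses"][0])
```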
Unstructured Data
Unstructured data is information that either does not have a predefined data
model or is not organized in a pre-defined manner. Unstructured information is
typically text-heavy but may contain data such as dates, numbers, and facts as
well. This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in structured
databases. Common examples of unstructured data include audio, video files or
No-SQL databases.
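As a rough illustration of why unstructured text is harder to analyze, the sketch below pulls dates and numbers out of a made-up free-text note with regular expressions; unlike reading a named column in a table, this depends on guessing the patterns the text happens to use, and it still picks up ambiguous matches.

```python
import re

# Unstructured data: free text with no pre-defined data model (hypothetical note).
note = "Meeting held on 2023-05-14; 32 attendees discussed the Q2 budget of 1,250,000 birr."

# Extraction relies on pattern matching rather than a schema, so it is brittle and ambiguous:
# the number pattern below also matches the pieces of the date and the "2" in "Q2".
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
numbers = re.findall(r"\d[\d,]*", note)

print("Dates found:  ", dates)
print("Numbers found:", numbers)
```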
Metadata – Data about Data
The last category of data type is metadata. From a technical point of view, this
is not a separate data structure, but it is one of the most important elements
for Big Data analysis and big data solutions. Metadata is data about data. It
provides additional information about a specific set of data.
In a set of photographs, for example, metadata could describe when and where
the photos were taken. The metadata then provides fields for dates and
locations which, by themselves, can be considered structured data. For this
reason, metadata is frequently used by Big Data solutions for initial analysis.
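The sketch below shows, with hypothetical file names and values, how the metadata of a set of photos is itself small, structured data that can be filtered for an initial analysis without opening a single image.

```python
# Metadata: data about data - each record describes a photo without containing the image itself
# (file names, dates, and locations are hypothetical).
photos = [
    {"file": "IMG_001.jpg", "taken": "2022-11-03", "location": "Addis Ababa"},
    {"file": "IMG_002.jpg", "taken": "2023-01-19", "location": "Bahir Dar"},
    {"file": "IMG_003.jpg", "taken": "2023-01-25", "location": "Addis Ababa"},
]

# Because the metadata fields are structured, an initial analysis can filter on them directly.
addis_photos = [p["file"] for p in photos if p["location"] == "Addis Ababa"]
print(addis_photos)
```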
High Availability: Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software failures from
affecting access to data and processing. This becomes increasingly important
as we continue to emphasize the importance of real-time analytics.
Flexible: You can store as much structured and unstructured data as you need
and decide how to use it later.
Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage. It is continuously growing to meet
the needs of Big Data. It comprises the following components and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based Data Processing
Spark: In-Memory data processing
Pig, Hive: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
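To give a feel for the MapReduce programming model listed above, here is a minimal single-machine sketch in plain Python (not Hadoop code): the map step emits (word, 1) pairs and the reduce step sums the counts per word. A real MapReduce job runs the same idea in parallel across the nodes of a Hadoop cluster.

```python
from collections import defaultdict

# Input split into records (lines), roughly as HDFS would hand them to mappers (illustrative text).
lines = [
    "big data needs distributed processing",
    "hadoop processes big data",
]

# Map step: emit a (key, value) pair for every word.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle and reduce step: group the pairs by key and sum the values for each word.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

for word, count in sorted(counts.items()):
    print(word, count)
```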
BIG DATA LIFE CYCLE WITH HADOOP
Ingesting data into the system
The first stage of Big Data processing is Ingest. The data is ingested or transferred
to Hadoop from various sources such as relational databases, other systems, or local
files. Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event
data.
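Sqoop itself runs as a command-line tool against a real cluster, so the sketch below only imitates the idea in plain Python: rows are read from a relational source (an in-memory SQLite table with made-up data) and written out as delimited text, which is roughly the form in which Sqoop lands imported tables in HDFS.

```python
import sqlite3

# Stand-in for the source RDBMS: an in-memory SQLite table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alemu", "Adama"), (2, "Bekele", "Hawassa")],
)

# "Ingest": pull the rows out of the database and write them as comma-delimited text,
# similar in spirit to what Sqoop imports into HDFS.
with open("customers.csv", "w") as out:
    for row in conn.execute("SELECT id, name, city FROM customers"):
        out.write(",".join(str(field) for field in row) + "\n")

print(open("customers.csv").read())
```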
Processing the data in storage
The second stage is Processing. In this stage, the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed
database, HBase. Spark and MapReduce perform the data processing.
Computing and analyzing data
The third stage is Analyze. Here, the data is analyzed by processing frameworks
such as Pig, Hive, and Impala. Pig converts the data using map and reduce steps and
then analyzes it. Hive is also based on map and reduce programming and is
most suitable for structured data.
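Hive expresses this kind of analysis in an SQL dialect (HiveQL). Purely as an illustration, the sketch below runs a comparable GROUP BY aggregation with Python's built-in sqlite3 over a small, hypothetical sales table; Hive computes the same style of summary, but over data stored in HDFS and executed as map and reduce jobs.

```python
import sqlite3

# Hypothetical structured sales data, standing in for a table managed by Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 200.0)],
)

# The same style of aggregation a Hive query would express in HiveQL.
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
for region, total in conn.execute(query):
    print(region, total)
```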
Visualizing the results
The fourth stage is Access, which is performed by tools such as Hue and Cloudera
Search. In this stage, the analyzed data can be accessed by users.
IV. ASSESSMENT
Answer the questions below.
V. REFERENCES
Keteraw, Y. (2019). Introduction to Emerging Technology. Retrieved from https://www.studocu.com/row/document/addis-ababa-university/introduction-to-emerging-technologies/introduction-to-emerging-technology/8812270