
Chapter 2

Data Science
OUTLINE
 After completing this chapter, you will be able to:

 Describe what data science is and the role of data scientists.

 Differentiate between data and information.

 Describe the data processing life cycle.

 Understand different data types from diverse perspectives.

 Describe the data value chain in the emerging era of big data.

 Understand the basics of Big Data.

 Describe the purpose of the Hadoop ecosystem components.


INTRODUCTION TO DATA SCIENCE
 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from data.

 The main purpose of data science is to find patterns within the
data, using several techniques to analyze the data and draw
insights from it.

 Data science is much more than simply analyzing data. It offers a
range of roles and requires a range of skills.

 A data scientist is the person who performs analysis on data and gives
insights to decision makers. In data science, the data scientist has
the responsibility of making predictions from the data.
INTRODUCTION TO DATA SCIENCE
 A data scientist aims to derive conclusions from the whole data.
With the help of these conclusions, the data scientist can support
industries in making smarter business decisions.

 Data Science is all about:

 Asking the correct questions and analyzing the raw data.

 Visualizing the data to get a better perspective.

 Modeling the data by using various complex and efficient algorithms.

 Understanding the data to make better decisions and find the final result.
EXAMPLE OF DATA SCIENCE
 Let us consider choosing a department from the available departments at Dilla
University as an example of a data science problem:

 Before choosing a department, you write your preferences on paper, such as Law,
Anthropology, Economics, Accounting, etc., based on what you see during the semester.
This planned choice is a piece of data that you can read, whether it is written in
pencil or on a mobile device.

 When it is time to choose, you use your data as a reminder to choose a department.

 When you choose a department, the registrar registers you in the department you
want.

 At the registrar, a system may report that a department's capacity is full and that
students must be assigned to another department.

 Finally, at the end of registration, the registrar employees see different graphs of
student data based on sex, age, country, etc. They use this information to prepare
for the next year.
EXAMPLE OF DATA SCIENCE
 The small piece of information that began in your notebook
ended up in many different places, most notably in the registrar's
office as an aid to decision making.

 The data went through many transformations. In addition to the
computers the data may have passed through or been stored on
for the long term, lots of other pieces of hardware were involved
in collecting, manipulating, transmitting, and storing the data.

 In addition, many different pieces of software were used to
organize, aggregate, visualize, and present the data. Finally,
many different human systems were involved in working with
the data.
DATA AND INFORMATION
 Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be suitable
for communication, interpretation, or processing by humans or
electronic machines.

 It can be described as unprocessed facts and figures.

 It is represented with the help of characters such as alphabets
(A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
DATA AND INFORMATION
 Information is the processed data on which decisions and
actions are based. It is data that has been processed into a form
that is meaningful to the recipient.

 Furthermore, information is interpreted data, created from
organized, structured, and processed data in a particular
context.
DATA AND INFORMATION
 Data:
 raw facts
 no context
 just numbers and text

 Information:
 data with context
 processed data
 value added to data
 summarized
 organized
 analyzed

 For example:
 Data: 51012
 Information:
 5/10/12: the date of your final exam.
 51,012 Birr: the starting salary of an accounting major.
 51012: the zip code of Dilla.
DATA PROCESSING CYCLE
 Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose. Three steps constitute the data processing cycle.

 Input: in this step, the input data is prepared in some convenient form
for processing. The input data can be recorded on any one of several
types of storage media, such as a hard disk, CD, flash disk, paper, and so on.

 E.g., when opening an account at a CBE branch, your data is captured on a
form and stored in the bank's system as input.
DATA PROCESSING CYCLE
 Processing: in this step, the input data is changed to produce data
in a more useful form. For example, interest can be calculated on a
deposit at a bank, or a summary of the withdrawals in a month can be
calculated.

 Output: at this stage, the result of the processing step is collected.
The particular form of the output data depends on the use of the
data. For example, the output data may be a summary such as a bank
statement.
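
A minimal Java sketch of the input, processing, and output steps, using the deposit-interest example above; the deposit amount and interest rate are assumed, illustrative values:

public class InterestExample {
    public static void main(String[] args) {
        // Input: raw data captured from a form or storage medium (assumed values)
        double deposit = 10_000.0;   // deposit in Birr
        double annualRate = 0.07;    // 7% annual interest rate

        // Processing: transform the input into a more useful form
        double interest = deposit * annualRate;

        // Output: the result of processing, e.g. a line on a bank statement
        System.out.printf("Interest earned this year: %.2f Birr%n", interest);
    }
}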

 There are various data science tools which are used to
analyze different types of data:

 SAS (Statistical Analysis Software)


DATA PROCESSING TOOLS
 SAS (Statistical Analysis Software): one of the data science
tools specially designed for statistical operations; it is used
by large organizations to analyze data.

 Weka: a tool that allows easier implementation of
machine learning algorithms through an interactive platform. The
user can understand the functioning of machine learning on the
data without having to write a line of code. Weka is ideal
for data scientists who are beginners in machine learning.

 Excel: Microsoft developed Excel for spreadsheet
calculations, but nowadays it is also widely used for data
processing and visualization.
DATA TYPE

 Data types can be described from diverse perspectives.

 In computer science, a data type is simply an attribute
(characteristic) of data that tells the programmer the intended
use of the data. For example:

 int x = 5;              // integer
 String name = "Abebe";  // string (text)
 float p = 3.14f;        // floating-point number (note the f suffix in Java)

 From a data analytics point of view, there are three data
types: structured, semi-structured, and unstructured.
DATA TYPE
 Structured data: data whose elements are organized in a
tabular format. It is easy to retrieve, search, and perform
analysis on this type of data. Examples: Excel files, SQL databases.

 Semi-structured data: a form of structured data that does not
obey the tabular format but has some organizational
properties that make it easier to analyze, i.e. it might contain tags
or markers to separate semantic elements in the data. Therefore it is
also known as a self-describing structure. Examples: JSON and XML.

 Unstructured data: data which is not organized in a pre-defined
manner or does not have a pre-defined data structure.
Examples: images, audio, video, and free-form text.
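
A minimal Java sketch contrasting the three forms; the student record, field names, and values are illustrative assumptions (requires Java 16+ for records and text blocks):

public class DataTypeExamples {
    // Structured: fixed schema, like a row in an SQL table or an Excel sheet
    record Student(int id, String name, String department) {}

    public static void main(String[] args) {
        Student structured = new Student(1, "Abebe", "Economics");

        // Semi-structured: self-describing tags/keys (JSON), no rigid tabular schema
        String semiStructured = """
                { "id": 1, "name": "Abebe", "skills": ["SQL", "Java"] }
                """;

        // Unstructured: free-form text with no pre-defined structure
        String unstructured = "Abebe wrote that he enjoys economics and data analysis.";

        System.out.println(structured);
        System.out.println(semiStructured);
        System.out.println(unstructured);
    }
}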
DATA TYPE
 Metadata is not a separate data type, but it is one of the most
important elements for Big Data analysis and big data solutions.

 Metadata is data about data. It provides additional information
about a specific set of data.

 In a set of photographs, for example, metadata could describe
when and where the photos were taken.

 The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data. For this
reason, metadata is frequently used by Big Data solutions
for initial analysis.
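
A small Java sketch of photo metadata stored as structured fields; the field names and values are illustrative assumptions:

import java.util.Map;

public class PhotoMetadataExample {
    public static void main(String[] args) {
        // The photo itself is unstructured data; this map is metadata describing it
        Map<String, String> photoMetadata = Map.of(
                "fileName", "graduation.jpg",
                "dateTaken", "2023-07-01",
                "location", "Dilla, Ethiopia");

        // Fields such as dates and locations can be analyzed like any structured data
        photoMetadata.forEach((field, value) -> System.out.println(field + ": " + value));
    }
}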
DATA VALUE CHAIN
 Data value chain describes the information flow within a big data
system as a series of steps needed to generate value and useful
insights from data.

 The Big Data Value Chain identifies the following key high-level
activities:
DATA ACQUISITION
 It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse.

 A data warehouse is a subject-oriented, integrated (unified), time-
variant, and nonvolatile collection of data in support of management's
decision-making process. A data warehouse is a decision-support
database that is maintained separately from the organization's
operational database.

 Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.

 The infrastructure required to support the acquisition of big data must
deliver low, predictable latency in capturing data. It must be able to
handle very high transaction volumes, often in a distributed
environment.
DATA ANALYSIS

 Data analysis involves exploring, transforming, and modeling
data with the goal of highlighting relevant data,
synthesizing (combining) and extracting useful hidden
information with high potential from a business point of view.

 The data will be analyzed to obtain the following characteristics
from the data:

 Classification

 Clustering

 Association, etc.

 To perform the analysis, techniques from different computer science
areas are applied, such as machine learning and data mining.


DATA CURATION
 It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective
usage.

 Data curation processes can be categorized into different
activities such as:

 Content creation and selection

 Transformation and validation

 Preservation.

 Data curators (scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit for their purpose.
DATA STORAGE
 In the data storage phase, the collected data is stored and prepared to
be used in the next phase. The collected data needs to be secured.

 In order to guarantee the safety of the collected data, security
measures such as data anonymization, permutation, and data
partitioning (vertical or horizontal) can be used.

 In data storage we must ensure the persistent management of data
in a scalable way that satisfies the needs of applications that require
fast access to the data.

 RDBMSs have been the main, and almost unique, solution to the
storage paradigm for nearly 40 years, but they lack flexibility when data
volumes and complexity grow. For this reason, NoSQL technologies
have been designed.
DATA USAGE
 It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity.

 Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased added
value, or any other parameter that can be measured against
existing performance criteria.

 Data usage is supported by specific tools and in turn by
query and scripting languages that typically depend on the
underlying data stores, their execution engines, APIs, and
programming models.
BIG DATA
 Big data is a blanket term for the non-traditional strategies and
technologies needed to gather, organize, process, and draw insights
from large datasets.

 While the problem of working with data that exceeds the computing
power or storage of a single computer is not new, the pervasiveness,
scale, and value of this type of computing have greatly expanded in
recent years.

 Big Data is a collection of data that is huge in volume, yet growing
exponentially with time. It is data of such large size and complexity
that none of the traditional data management tools can store it or
process it efficiently.

 "Large" means a dataset too large to reasonably process or store with
traditional tooling or on a single computer.


BIG DATA
 Big data is characterized by the 4 Vs:
 Volume: large amounts of data; zettabytes/massive datasets.
 Velocity: the speed of data processing. Data is live streaming or
in motion
 Variety: data comes in many different forms from diverse
sources.
 Veracity: can we trust the data? How accurate/truthful is it?
BIG DATA
 Big Data is used for analysis to get insights that will help with
business decisions.

 Some real-world examples that will explain how big data is used are as
follows:

 Big Data is used to find out consumer shopping habits.

 It can be used to monitor health conditions through data from wearables.

 The transportation industry uses fuel optimization tools where big data is
used.

 It is used for predictive inventory ordering.

 It can help you with real-time data monitoring(like knowing how your
networks, applications, and services are performing.) and cybersecurity
protocols.
BIG DATA
 Big Data technologies can handle, manage, and process different
types of data: structured, semi-structured, and unstructured.

 It is cost-effective in terms of maintaining a large amount of data.
It works on a distributed system.

 We can save large amounts of data for a long time using Big Data
techniques, so it is easy to handle historical data and generate
accurate reports.

 Data processing speed is very fast, and thus social media platforms
use Big Data techniques.

 It allows users to make efficient decisions for their business
based on current and historical data.
WHO’S GENERATING BIG DATA
 There are many sources that are generating big data:

 Social media and networks (all of us are generating data)

 Mobile devices (tracking all objects all the time)

 Scientific instruments (collecting all sorts of data)

 Sensor technology and networks (measuring all kinds of data)
CLUSTERED COMPUTING
 To better address the high storage and computational needs of big
data, we need a fast, secure, and reliable computing environment.

 Cluster computing solves the problems of stand-alone
technology. The objective is to improve on the performance/power efficiency
of a single processor for storing and mining large data sets by using
multiple disks and CPUs.

 Cluster computing means that many computers are connected on a
network and perform like a single entity. Each computer that is
connected to the network is called a node.

 Cluster computing offers solutions to complicated problems by
providing faster computational speed and enhanced data integrity. The
connected computers execute operations all together, thus creating the
impression of a single system (virtual machine).
CLUSTERED COMPUTING
 Clustering software provides a number of benefits, such as:

 Resource Pooling: combining the available storage space to
hold data.

 High Availability: guarantees availability in case of hardware or
software failures.

 Easy Scalability: easy to scale horizontally by adding machines to
the group.

 Clusters require a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.

 Cluster membership and resource allocation can be handled by
software such as Hadoop's YARN (Yet Another Resource Negotiator).


HADOOP ECOSYSTEM
 Hadoop is an open-source framework intended to make interaction
with big data easier. It is a framework that allows for the distributed
processing of large datasets across clusters of computers using simple
programming models.

 The four key characteristics of Hadoop are:

 Economical: its systems are highly economical as ordinary

computers can be used for data processing.

 Reliable: stores copies of the data on different machines and is

resistant to hardware failure.

 Scalable: it is easily scalable, both horizontally and vertically.

 Flexible: it can store as much structured and unstructured data as needed.


HADOOP ECOSYSTEM
 Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.

 In the ecosystem there are two main components for data storage:

 HDFS (Hadoop Distributed File System): the major
component of the Hadoop ecosystem, responsible for
storing large data sets of structured or unstructured data.

 HBase: a NoSQL database that runs on top of HDFS and
stores structured and semi-structured data.
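
A minimal sketch of storing a local file in HDFS through the Java FileSystem API, assuming a running HDFS cluster and the Hadoop client library on the classpath; the namenode address and file paths are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed namenode address

        // Copy a local file into the distributed file system
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path("/tmp/students.csv"),
                                 new Path("/data/students.csv"));
        }
    }
}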


HADOOP ECOSYSTEM
 For processing, the ecosystem has:

 YARN (Yet Another Resource Negotiator): manages the resources
across the clusters. It performs scheduling and resource allocation
for the Hadoop system.

 MapReduce: makes it possible to carry over the processing logic
and helps to write applications which transform big data sets into
manageable ones. It takes away the complexity of distributed
programming by exposing two processing steps that developers
implement: 1) Map and 2) Reduce. In the Map step, data is split
into chunks and converted into intermediate (key, value) pairs; in the
Reduce step, the intermediate outputs are aggregated into the final result.
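
The two steps can be sketched with the classic word-count example; this assumes the Hadoop MapReduce client library is on the classpath, and the job configuration (driver) code is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: split each input line into words and emit (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: sum the counts emitted for each word to produce the final result
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}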


HADOOP ECOSYSTEM
 For data access:

 Hive: performs reading and writing of large data sets.

 Pig: structures the data flow for processing and analyzing huge
data sets.

 Mahout: provides various functionalities such as
collaborative filtering, clustering, and classification. It allows
invoking algorithms with the help of its own libraries.

 For data management, the ecosystem has:

 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
BIG DATA LIFE CYCLE WITH HADOOP
 Ingest: the first stage of Big Data processing is ingesting data into the system.

 The data is ingested or transferred to Hadoop from various sources such
as relational databases, other systems, or local files. Sqoop transfers data from
an RDBMS to HDFS, whereas Flume transfers event data (such as social
media data, clickstreams, etc.).

 Processing: in this stage, the data is stored and processed. The data is
stored in HDFS and/or HBase. Spark and MapReduce perform the data
processing.

 Analyze: the data is analyzed by processing frameworks such as Pig, Hive,
and Impala.

 Access: this stage is performed by tools such as Hue and Cloudera Search. In
this stage, the analyzed data can be accessed by users.


SUMMARY
 Let us understand the Hadoop ecosystem through a real-life example. There is a
company that has established branches in three different cities; let us
assume branches in A.A, Awassa, and Dilla.

 In every branch, the entire customer data is stored in the local database daily.

 On a quarterly, half-yearly, or yearly basis, the organization wants to analyze
this data for business development. To do this, the organization collects all this
data from multiple sources, performs the necessary cleaning steps, and puts it in
a data warehouse.

 Then, we can use it for analytical purposes. For analysis, we can generate
reports from the data available in the data warehouse. Multiple charts and
reports can be generated using business intelligence tools.

 We require this analysis to grow the business and make
appropriate decisions for the organization.
THE END
?
