
Chapter 2

Data Science
OUTLINE
 After completing this chapter, you will be able to:

 Describe what data science is and the role of data scientists.

 Differentiate between data and information.

 Describe the data processing life cycle.

 Understand different data types from diverse perspectives.

 Describe the data value chain in the emerging era of big data.

 Understand the basics of Big Data.

 Describe the purpose of the Hadoop ecosystem components.


INTRODUCTION TO DATA SCIENCE
 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data.

 The main purpose of data science is to find patterns within data; it uses several techniques to analyze data and draw insights from it.

 Data science is much more than simply analyzing data. It offers a range of roles and requires a range of skills.

 A data scientist is a person who performs analysis on data and gives insights to decision makers. In data science, the data scientist has the responsibility of making predictions from the data.
INTRODUCTION TO DATA SCIENCE
 A data scientist aims to derive conclusions from the data as a whole. With the help of these conclusions, the data scientist can support industries in making smarter business decisions.

 Data science is all about:

 Asking correct questions and analyzing the raw data.

 Visualizing the data to get a better perspective.

 Modeling the data using various complex and efficient algorithms.

 Understanding the data to make better decisions and find the final result.
DATA AND INFORMATION
 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.

 It can be described as unprocessed facts and figures.

 It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).


DATA AND INFORMATION
 Information is processed data on which decisions and actions are based. It is data that has been processed into a form that is meaningful to the recipient.

 Furthermore, information is interpreted data: it is created from organized, structured, and processed data in a particular context.
DATA AND INFORMATION
Data:
 raw facts
 no context
 just numbers and text

Information:
 data with context
 processed data
 value added to data (summarized, organized, analyzed)

 For example:

 Data: 51012

 Information:
 5/10/12: the date of your final exam.
 51,012 Birr: the starting salary of an accounting major.
 51012: the zip code of Dilla.
EXAMPLE OF DATA SCIENCE
 Let us consider choosing a department from the available departments at Dilla University as an example data science problem:

 Before choosing a department, you write your preferences on paper as Law, Anthropology, Economics, Accounting, etc., based on what you see in the semester. This planned choice is a piece of data that you can read, whether it is written in pencil or on a mobile device.

 When it is time to choose, you use your data as a reminder to choose a department.

 When you choose a department, the registrar registers you to the department you want.

 At the registrar, a system may indicate that a department's capacity is full and that students should be assigned to other departments.

 Finally, at the end of registration, the registrar employees see different graphs of student data based on sex, age, country, etc. They use this information to prepare for the next year.
EXAMPLE OF DATA SCIENCE
 The small piece of information that began in your notebook ended up in many different places, most notably at the registrar's office as an aid to decision making.

 The data went through many transformations. In addition to the computers where the data might have stopped briefly or stayed for the long term, lots of other pieces of hardware were involved in collecting, manipulating, transmitting, and storing the data.

 In addition, many different pieces of software were used to organize, aggregate, visualize, and present the data. Finally, many different human systems were involved in working with the data.

 All of these processes together constitute data science.


DATA TYPE

 Data types can be described from diverse perspectives.

 In computer science, a data type is simply an attribute (characteristic) of data that tells the programmer the intended use of the data, for example:

 int x = 5;

 String name = "Melaku";

 float p = 3.14f;

 From a data analytics point of view, there are three data types: structured, semi-structured, and unstructured.
DATA TYPE
 Structured data: data whose elements are organized in a tabular format. It is easy to retrieve, search, and perform analysis on this type of data. Example: the data in your list, stored in a table.

 Semi-structured data: a form of structured data that does not obey the tabular format but has some organizational properties that make it easier to analyze. Example: the content of Facebook pages.

 Unstructured data: data that is not organized in a pre-defined manner or does not have a pre-defined data structure. Examples: Word documents, PDFs, text, music.
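
 To make the distinction concrete, below is a minimal Python sketch (the records and field names are made up for illustration) that reads the same kind of data in structured form (CSV) and semi-structured form (JSON):

    import csv
    import io
    import json

    # Structured: every record has the same columns in a fixed tabular layout.
    csv_text = "id,name,dept\n1,Abebe,Law\n2,Sara,Economics\n"
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    print(rows[0]["dept"])  # 'Law': lookup by a fixed column name always works

    # Semi-structured: records share keys/tags, but fields may vary per record.
    json_text = '[{"id": 1, "dept": "Law"}, {"id": 2, "interests": ["Economics"]}]'
    for record in json.loads(json_text):
        # Not every record has the same keys, so missing fields must be handled.
        print(record.get("dept", "no dept field"))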


DATA TYPE
 Metadata is not a separate data type, but it is one of the most important elements
for Big Data analysis and big data solutions.

 Metadata is data about data. It provides additional information about a specific set
of data.

 In a set of photographs, for example, metadata could describe when and where the
photos were taken.

 The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by big data solutions for initial analysis, as sketched below.
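
 As a small hedged illustration in Python (the keys below are made up, not a real EXIF schema), photo metadata can be represented and queried like structured data:

    # Metadata: data about the photos, not the photos themselves.
    photos = [
        {"file": "img_001.jpg", "taken": "2023-05-10", "location": "Dilla"},
        {"file": "img_002.jpg", "taken": "2023-05-11", "location": "Awassa"},
    ]

    # The date and location fields are structured, so an initial analysis
    # can query them directly.
    taken_in_dilla = [p["file"] for p in photos if p["location"] == "Dilla"]
    print(taken_in_dilla)  # ['img_001.jpg']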
DATA PROCESSING CYCLE
 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose. Three steps constitute the data processing cycle.

 Input: in this step, the input data is prepared in some convenient form for processing. The input data can be recorded on any one of several types of storage media, such as a hard disk, CD, flash disk, paper, and so on.

 E.g., when opening an account at a CBE branch, your data will be stored in computers.
DATA PROCESSING CYCLE
 Processing: in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on a deposit to a bank, or a summary of the month's withdrawals can be calculated.

 Output: at this stage, the result of the processing step is collected. The particular form of the output data depends on the use of the data. For example, the output may be a summary, like a bank statement.
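
 As a minimal sketch of the three steps (the deposits and interest rate are made-up numbers), the bank-interest example could look like this in Python:

    # Input: deposits recorded in a convenient form for processing.
    deposits = [1000.0, 2500.0, 400.0]  # hypothetical deposits in Birr
    annual_rate = 0.07                  # assumed 7% annual interest rate

    # Processing: change the input into a more useful form.
    interest = [round(d * annual_rate, 2) for d in deposits]

    # Output: collect the result in the form the user needs, e.g. a summary.
    print("Total interest this year:", sum(interest), "Birr")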

 There are various types of data science tools used to analyze different types of data, for example:

 SAS (Statistical Analysis Software)

 Excel, etc.
DATA PROCESSING TOOLS
 SAS (Statistical Analysis Software): one of the data science tools specially designed for statistical operations; it is used by large organizations to analyze data.

 Weka: software that allows easier implementation of machine learning algorithms through an interactive platform. The user can understand the functioning of machine learning on the data without having to write a line of code. Weka is ideal for data scientists who are beginners in machine learning.

 Excel: Microsoft developed Excel for spreadsheet calculations, but nowadays it is widely used for data processing, visualization, and complex calculations. It is a powerful analytical tool for data science. Excel comes with formulae, tables, and filters, and users can also create their own custom functions.
BIG DATA
 Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and draw insights from large datasets.

 While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.

 Big data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.

 "Large" here means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
BIG DATA
 Big data is characterized by 4 Vs:
 Volume: large amounts of data (zettabytes / massive datasets).
 Velocity: the speed of data processing; data is live-streaming or in motion.
 Variety: data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it?
BIG DATA
 Big data is used for analysis to get insights that will help you with business decisions.

 Some real-world examples of how big data is used are as follows:

 Big Data is used to find out consumer shopping habits.

 It can be used to monitor health conditions through data from wearables.

 The transportation industry uses fuel optimization tools where big data is used.

 It is used for predictive inventory ordering.

 It can help you with real-time data monitoring and cybersecurity protocols.

 It is used for forecasting.


BIG DATA
 Big data is responsible for handling, managing, and processing different types of data: structured, semi-structured, and unstructured.

 It is cost-effective in terms of maintaining a large amount of data, because it works on distributed systems.

 We can save large amounts of data for a long time using big data techniques, so it is easy to handle historical data and generate accurate reports.

 Data processing speed is very fast, which is why social media platforms use big data techniques.

 It allows users to make efficient decisions for their business based on current and historical data.
WHO'S GENERATING BIG DATA?
 There are many companies and devices that are generating big data:

 Social media and networks (all of us are generating data)

 Scientific instruments (collecting all sorts of data)

 Mobile devices (tracking all objects all the time)

 Sensor technology and networks (measuring all kinds of data)
DATA VALUE CHAIN
 The data value chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data. It describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.

 The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage (each discussed in the following slides).
DATA ACQUISITION
 Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse (processed).

 A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. It is a decision-support database that is maintained separately from the organization's operational database.

 Data acquisition is one of the major big data challenges in terms of infrastructure requirements.

 The infrastructure required to support the acquisition of big data must deliver low, predictable latency in capturing data. It must be able to handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
DATA ANALYSIS

 Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.

 The data is analyzed to obtain the following characteristics from the data:

 Classification

 Clustering

 Association, etc.

 To perform analysis, different computer science areas such as data mining, business intelligence, and machine learning are used. A small clustering sketch follows below.
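
 For instance, clustering can be sketched with the scikit-learn library, assuming it is installed; the data points below are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D points, e.g. (age, monthly spend) per customer.
    X = np.array([[20, 100], [22, 120], [45, 600], [47, 650], [21, 90]])

    # Group the points into two clusters to reveal hidden structure.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(model.labels_)           # cluster assignment for each point
    print(model.cluster_centers_)  # the center of each cluster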
DATA CURATION
 Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for effective usage.

 Data curation processes can be categorized into different activities, such as:

 Content creation and selection

 Classification

 Transformation and validation

 Preservation

 Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
DATA STORAGE
 In the data storage phase, the collected data is stored and prepared for use in the next phase. As the collected data may contain sensitive information, it is essential to take sufficient precautions when storing data.

 In data storage, we must ensure the persistent management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

 In order to guarantee the safety of the collected data, some security measures can be used, such as data anonymization, permutation, and data partitioning (vertical or horizontal); a small anonymization sketch follows below.

 RDBMSs have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
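
 As one hedged illustration of anonymization (the salt and field names are made up; a real deployment needs a stronger, reviewed scheme), sensitive identifiers can be replaced with salted hashes before storage:

    import hashlib

    SALT = b"change-me"  # hypothetical secret; keep it out of the stored data

    def anonymize(value: str) -> str:
        """Replace a sensitive identifier with a salted SHA-256 digest."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

    record = {"name": "Abebe Kebede", "balance": 5000}
    stored = {"name": anonymize(record["name"]), "balance": record["balance"]}
    print(stored)  # the name is unreadable, but equal names hash alike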
DATA USAGE
 Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business decision making.

 Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.

 The process of decision-making includes reporting, exploration of data (browsing and lookup), and exploratory search (finding correlations, comparisons, what-if scenarios, etc.).

 Data usage happens through specific tools, and in turn through query and scripting languages that typically depend on the underlying data stores, their execution engines, APIs, and programming models.
SUMMARY OF DATA SCIENCE AND BIG DATA
 Example: take the banking industry to explain the tasks of big data and data science:

 Data science will analyze the work systems, information, procedures, and documents of the bank. Big data analysis will assess the financial and management aspects of the bank based on cost and time for a specific user.

 Data science will help the banking industry with:
 Fraud detection
 Risk management
 Customer data analysis
 Marketing and sales
 AI-driven chatbots & virtual assistants

 Big data will help the banking industry with:
 Providing personalized banking solutions to customers
 Boosting performance
 Performing effective customer feedback analysis
 Effective risk management
CLUSTERED COMPUTING
 To better address the high storage and computational needs of big data, we need a fast, secure, and reliable computing environment.

 Cluster computing solves the problems of stand-alone technology. The objective is to improve the performance/power efficiency over a single processor for storing and mining large data sets, using multiple disks and CPUs.

 Cluster computing means that many computers are connected on a network and perform like a single entity. Each computer connected to the network is called a node.

 Cluster computing offers solutions to complicated problems by providing faster computational speed and enhanced data integrity. The connected computers execute operations all together, thus creating the impression of a single system (a virtual machine).
CLUSTERED COMPUTING
 Clustering software provides a number of benefits, such as:

 Resource pooling: combining the available storage space to hold data.

 High availability: guarantees availability in case of hardware or software failures.

 Easy scalability: easy to scale horizontally by adding machines to the group.

 Clusters require a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.

 Cluster membership and resource allocation can be handled by software like Hadoop's ecosystem.
HADOOP ECOSYSTEM
 Hadoop is an open-source framework intended to make interaction with big data easier. It allows for the distributed processing of large datasets across clusters of computers using simple programming models.

 The four key characteristics of Hadoop are:

 Economical: highly economical, as ordinary computers can be used for data processing.

 Reliable: stores copies of the data on different machines and is resistant to hardware failure.

 Scalable: easily scalable both horizontally and vertically.

 Flexible: can store as much structured and unstructured data as needed.


HADOOP ECOSYSTEM
 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.

 In the ecosystem, there are two main components for data storage:

 HDFS: the major component of the Hadoop ecosystem, responsible for storing large data sets of structured or unstructured data across various nodes.

 HBase: a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database.


HADOOP ECOSYSTEM
 For processing, the ecosystem has:

 Yet Another Resource Negotiator (YARN): manages resources across the clusters. It performs scheduling and resource allocation for the Hadoop system.

 MapReduce: carries the processing logic and helps to write applications that transform big data sets into manageable ones (a pure-Python sketch of the idea follows below).
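
 To show the idea only (pure Python, not Hadoop's actual Java API), a word count can mimic the map, shuffle, and reduce phases:

    from collections import defaultdict

    lines = ["big data is big", "data science uses big data"]

    # Map: emit (word, 1) pairs from each input line.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group pairs by key so each reducer sees one word's counts.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group into a single, manageable result.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}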
HADOOP ECOSYSTEM
 For data access, the ecosystem has:

 Hive: performs reading and writing of large data sets.

 Pig: structures the data flow and processes and analyzes huge data sets.

 Mahout: provides various functionalities such as collaborative filtering, clustering, and classification. It allows invoking algorithms with the help of its own libraries.

 For data management, the ecosystem has:

 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
 Flume: collecting and transferring event/log data
BIG DATA LIFE CYCLE WITH HADOOP
 Ingesting data into the system: the first stage of big data processing is ingest. The data is ingested, or transferred, to Hadoop from various sources such as relational databases, other systems, or local files. Sqoop transfers data from RDBMSs to HDFS, whereas Flume transfers event data.

 Processing: in this stage, the data is stored and processed. The data is stored in HDFS and/or HBase; Spark and MapReduce perform the data processing.

 Analyze: the data is analyzed by processing frameworks such as Pig, Hive, and Impala.

 Access: performed by tools such as Hue and Cloudera Search; in this stage, the analyzed data can be accessed by users. An illustrative end-to-end sketch follows below.
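
 Purely as an illustrative sketch (plain Python functions standing in for Sqoop/Flume, HDFS/HBase with Spark or MapReduce, Pig/Hive/Impala, and Hue/Cloudera Search), the four stages could be outlined as:

    def ingest():
        # Transfer raw records from a source system into the cluster.
        return ["alice,120", "bob,80", "alice,40"]

    def process(raw):
        # Store and reshape the data into a usable form.
        return [(name, int(amount)) for name, amount in (r.split(",") for r in raw)]

    def analyze(records):
        # Summarize, e.g. the total amount per user.
        totals = {}
        for name, amount in records:
            totals[name] = totals.get(name, 0) + amount
        return totals

    def access(result):
        # Expose the analyzed data to end users.
        print(result)

    access(analyze(process(ingest())))  # {'alice': 160, 'bob': 80}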
SUMMARY
 Let us understand the Hadoop ecosystem through a real-life example. Suppose a company has established branches in different locations, say in India, Addis Ababa, Awassa, and Dilla.

 In every branch, the entire customer data is stored in a local database daily.

 On a quarterly, half-yearly, or yearly basis, the organization wants to analyze this data for business development. To do this, the organization collects all this data from multiple sources, performs the necessary cleaning steps, and puts it in a data warehouse.

 Then we can use it for analytical purposes. For analysis, we can generate reports from the data available in the data warehouse; multiple charts and reports can be generated using business intelligence tools.

 We require such analysis to grow the business and make appropriate decisions for the organization.
THE END
Questions?
