Chapter 2 - Introduction To Data Science
OUTLINE
After completing this chapter, you will be able to describe data science, distinguish data from information, explain the data processing cycle, and characterize big data, its value chain, and the Hadoop ecosystem.
The main purpose of data science is to find patterns within data, using several techniques to analyze the data and draw insights from it.
Data science is much more than simply analyzing data. It offers a range of roles
and requires a range of skills.
A data scientist is a person who performs analysis on data and gives insights to decision makers. In data science, the data scientist has the responsibility of making predictions from the data.
INTRODUCTION TO DATA SCIENCE
A data scientist aims to derive conclusions from the data as a whole. With the help of these conclusions, the data scientist can support industries in making smarter business decisions.
Understanding the data makes it possible to take better decisions and find the final result.
DATA AND INFORMATION
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing.
It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
For example:
Data: 51012
Information:
5/10/12 - the date of your final exam.
51,012 Birr - the starting salary of an accounting major.
51012 - the zip code of Dilla.
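To make the distinction concrete, here is a minimal Python sketch (the slicing and formatting choices are just illustrative) showing how the same raw datum yields different information depending on the context applied to it:

    # The same raw datum...
    data = "51012"

    # ...becomes information only once context gives it meaning.
    as_exam_date = f"{data[0]}/{data[1:3]}/{data[3:]}"  # "5/10/12", a date
    as_salary    = f"{int(data):,} Birr"                # "51,012 Birr", a salary
    as_zip_code  = data                                 # "51012", a zip code

    print(as_exam_date, as_salary, as_zip_code, sep="\n")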
EXAMPLE OF DATA SCIENCE
Let us consider choosing a department from the available departments at Dilla University as an example of a data science problem:
Before choosing a department, you write your preferences on paper as Law, Anthropology, Economics, Accounting, etc., based on what you see during the semester. This planned choice is a piece of data: whether it is written in pencil or on a mobile phone, it is something you can read.
When it is time to choose, you use your data as a reminder of which department to pick.
When you choose a department, the registrar registers you in the department you want.
At the registrar, a system may report that the department's capacity is full and that students must be assigned to another department.
Finally, at the end of registration, the registrar's employees view different graphs of student data, grouped by sex, age, country, etc. They use this information to prepare for the next year.
The small piece of information that began in your notebook ended up in many different places, most notably at the registrar's office as an aid to decision making.
The data went through many transformations. In addition to the computers where the data might have stopped by or stayed on for the long term, many other pieces of hardware were involved in collecting, manipulating, transmitting, and storing the data.
DATA TYPE
From a programming perspective, a data type tells the computer how the programmer intends to use the data. For example:
    int x = 5;
    String name = "Melaku";
    float p = 3.14f;
From a data analytics point of view, there are three types of data: structured, semi-structured, and unstructured.
Structured data: data whose elements are organized in a tabular format. It is easy to retrieve, search, and perform analysis on this type of data. Example: a student list stored in rows and columns of a table.
Semi-structured data: a form of structured data that does not obey the tabular format but has some organizational properties that make it easier to analyze. Example: the content of Facebook pages.
Unstructured data: data with no predefined organization, such as free text, images, audio, and video.
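To make the contrast concrete, here is a minimal Python sketch (all records invented for illustration) of the three forms:

    # Structured: tabular rows with a fixed set of columns.
    students = [
        {"id": 1, "name": "Abebe", "dept": "Law"},
        {"id": 2, "name": "Sara",  "dept": "Economics"},
    ]

    # Semi-structured: JSON-like content with organizational tags
    # but no rigid table schema (fields may vary per record).
    post = {"author": "Sara", "likes": 12,
            "comments": [{"by": "Abebe", "text": "Nice!"}]}

    # Unstructured: free text with no predefined organization.
    note = "I plan to choose Accounting if Law is full."

    print(len(students), post["likes"], len(note))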
Metadata is data about data. It provides additional information about a specific set
of data.
In a set of photographs, for example, metadata could describe when and where the
photos were taken.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by big data solutions for initial analysis.
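For instance, a photo's metadata might be represented as a small set of structured fields (a hypothetical example; real photo metadata is typically stored in EXIF tags):

    # Metadata: data about data. The photo bytes are the data;
    # the fields below describe when and where it was taken.
    photo_metadata = {
        "file": "graduation.jpg",
        "taken_on": "2023-07-01",
        "location": "Dilla, Ethiopia",
    }

    # These fields are themselves structured data, so they can
    # be searched and filtered during initial analysis.
    print(photo_metadata["taken_on"], photo_metadata["location"])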
DATA PROCESSING CYCLE
Data processing is the re-structuring or re-ordering of data by people or machines
to increase their usefulness and add values for a particular purpose. There are three
steps constitute the data processing cycle.
Input: in this step, the input data is prepared in some convenient form for processing. The input data can be recorded on any of several types of storage media, such as a hard disk, CD, flash disk, paper, and so on.
E.g., when opening an account at a CBE branch, your data is stored in computers.
Processing: in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on a bank deposit, or a summary of the month's withdrawals can be computed.
Output: at this stage, the result of the processing step is collected. The particular form of the output data depends on its intended use. For example, the output may be a summary such as a bank statement.
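As a minimal Python sketch of the three steps (the account figures and interest rate are invented for illustration):

    # Input: raw transaction data, as it might be read from storage.
    deposits    = [1000.0, 2500.0]      # Birr
    withdrawals = [300.0, 120.0, 80.0]  # Birr
    rate = 0.07                         # assumed annual interest rate

    # Processing: transform the input into a more useful form.
    balance  = sum(deposits) - sum(withdrawals)
    interest = sum(deposits) * rate
    monthly_withdrawals = sum(withdrawals)

    # Output: a summary, like a line on a bank statement.
    print(f"Balance: {balance:.2f} Birr, "
          f"interest: {interest:.2f} Birr, "
          f"withdrawals this month: {monthly_withdrawals:.2f} Birr")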
DATA PROCESSING TOOLS
There are various data science tools that are used to analyze different types of data, such as SAS, Excel, etc.
SAS (Statistical Analysis Software): one of the data science tools specially designed for statistical operations; it is used by large organizations to analyze data.
Excel: Microsoft developed Excel for spreadsheet calculations, but nowadays it is widely used for data processing, visualization, and complex calculations. It is one of the most widely used analytical tools in data science. Excel comes with various formulas, tables, and filters, and users can also create their own custom functions.
BIG DATA
Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and extract insights from large datasets.
While the problem of working with data that exceeds the computing power or
storage of a single computer is not new, the pervasiveness, scale, and value of this
type of computing have greatly expanded in recent years.
Big data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.
Here, "large" means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
Big data is characterized by four V's:
Volume: large amounts of data (zettabytes / massive datasets).
Velocity: the speed at which data is generated and processed; data is live-streaming or in motion.
Variety: data comes in many different forms from diverse sources.
Veracity: can we trust the data? How accurate is it?
Big data is used in analysis to gain insights that help with business decisions.
Some real-world examples of how big data is used are as follows:
The transportation industry uses fuel-optimization tools built on big data.
It can help with real-time data monitoring and cybersecurity protocols.
We can store large amounts of data for a long time using big data techniques, so it is easy to handle historical data and generate accurate reports.
Data processing speed is very fast, which is why social media platforms rely on big data techniques.
It allows users to make efficient business decisions based on current and historical data.
WHO’S GENERATING BIG DATA
There are many companies that are generating big data.
The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
DATA ACQUISITION
It is the process of gathering, filtering, and cleaning data before it is put into a data warehouse (processed).
Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
The infrastructure required to support the acquisition of big data must deliver low, predictable latency in capturing data. It must be able to handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
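A minimal Python sketch of gathering, filtering, and cleaning before loading (the records and field names are invented for illustration):

    # Gather: raw records arriving from multiple sources.
    raw = [
        {"name": " Abebe ", "age": "21"},
        {"name": "",        "age": "19"},  # missing name -> filter out
        {"name": "Sara",    "age": "20"},
    ]

    # Filter and clean: drop incomplete records, normalize fields.
    cleaned = [
        {"name": r["name"].strip(), "age": int(r["age"])}
        for r in raw
        if r["name"].strip()
    ]

    # Load: here we just print; a real pipeline would write to a warehouse.
    print(cleaned)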
DATA ANALYSIS
Data analysis involves exploring, transforming, and modeling data with the goal
of highlighting relevant data, synthesizing and extracting useful hidden
information with high potential from a business point of view.
The data is analyzed to obtain characteristics such as:
Classification
Clustering
Association, etc.
To perform the analysis, techniques from different computer science areas, such as data mining, business intelligence, and machine learning, are used. One such technique is sketched below.
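As one concrete example of these techniques, here is a minimal clustering sketch, assuming scikit-learn is installed (the data points are invented):

    from sklearn.cluster import KMeans

    # Each row: [age, monthly_spend]; invented sample data.
    X = [[18, 200], [19, 220], [20, 210],
         [45, 900], [50, 950], [48, 880]]

    # Group the records into two clusters of similar customers.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [0 0 0 1 1 1] -- which cluster each record joined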
DATA CURATION
It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
Curation activities include, for example, the classification and preservation of data.
To guarantee the safety of the collected data, security measures such as data anonymization, permutation, and data partitioning (vertical or horizontal) can be used.
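A minimal Python sketch of two such measures, using hashing as a simple stand-in for anonymization (the record fields are invented):

    import hashlib

    record = {"name": "Abebe Kebede", "dept": "Law", "gpa": 3.4}

    # Anonymization: replace the identifying field with a one-way hash.
    anonymized = dict(record)
    anonymized["name"] = hashlib.sha256(record["name"].encode()).hexdigest()[:12]

    # Vertical partitioning: split sensitive and non-sensitive columns
    # so they can be stored (and secured) separately.
    sensitive     = {"gpa": record["gpa"]}
    non_sensitive = {"dept": record["dept"]}

    print(anonymized, sensitive, non_sensitive)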
DATA STORAGE
RDBMSs have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
DATA USAGE
It covers the data-driven business activities that need access to data and its analysis, and the tools needed to integrate data analysis into business decision making.
Data usage happens through specific tools and, in turn, through query and scripting languages that typically depend on the underlying data stores, their execution engines, APIs, and programming models.
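As a minimal sketch of data usage through a query language, using Python's built-in sqlite3 module (the table and figures are invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("Dilla", 120.0), ("Awassa", 340.0), ("Dilla", 80.0)])

    # A business question answered with a query: revenue per branch.
    for branch, total in con.execute(
            "SELECT branch, SUM(amount) FROM sales GROUP BY branch"):
        print(branch, total)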
SUMMARY OF DATA SCIENCE AND BIG DATA
Example: take the banking industry to illustrate the tasks of big data and data science:
Data science will analyze the bank's work systems, information, procedures, and documents. Big data analysis will assess the financial and management aspects of the bank based on cost and time for a specific user.
Data science will help the banking industry with:
Fraud detection
Risk management
Customer data analysis
Marketing and sales
AI-driven chatbots & virtual assistants
Big data will help the banking industry with:
Providing personalized banking solutions to their customers
Boosting performance
Performing effective customer feedback analysis
Effective risk management
CLUSTERED COMPUTING
To better address the high storage and computational needs of big data, we need a fast, secure, and reliable computing environment.
Cluster computing came about to solve the problems of stand-alone technology. The objective is to improve the performance/power efficiency beyond that of a single processor for storing and mining large data sets, using multiple disks and CPUs.
Cluster computing means that many computers are connected on a network and perform like a single entity. Each computer that is connected to the network is called a node.
Clusters also provide high availability: when one node suffers a hardware failure, the remaining nodes keep the system running.
HADOOP ECOSYSTEM
In the Hadoop ecosystem, there are two main components for data storage:
HDFS (Hadoop Distributed File System): responsible for storing large data sets of structured or unstructured data across various nodes.
HBase: a NoSQL database that supports all kinds of data; it runs on top of the Hadoop Distributed File System.
For data processing:
MapReduce: makes it possible to carry over the processing logic and helps to write applications that transform big data sets into a manageable one.
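To illustrate the MapReduce idea (this is not Hadoop's actual Java API, just the map-shuffle-reduce pattern in plain Python):

    from itertools import groupby

    docs = ["big data big insights", "data science uses big data"]

    # Map: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle: group the pairs by key (word).
    mapped.sort(key=lambda kv: kv[0])
    grouped = groupby(mapped, key=lambda kv: kv[0])

    # Reduce: sum the counts for each word.
    counts = {word: sum(v for _, v in pairs) for word, pairs in grouped}
    print(counts)  # e.g. {'big': 3, 'data': 3, ...}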
For data access:
PIG: structures the data flow, and processes and analyzes huge data sets.
Stages in big data processing include:
Processing: in this stage, the data is stored and processed. The data is stored in HDFS and/or HBase; Spark and MapReduce perform the data processing.
Access: in this stage, the analyzed data can be accessed by users, using tools such as Hue and Cloudera Search.
SUMMARY
Let us understand the Hadoop ecosystem through a real-life example. There is a company that has established branches in different cities; let us assume branches in A.A. (Addis Ababa), Awassa, and Dilla.
In every branch, the entire customer data is stored in a local database daily.
On a quarterly, half-yearly, or yearly basis, the organization wants to analyze this data for business development. To do this, the organization collects all of this data from the multiple sources, performs the necessary cleaning steps, and puts it into a data warehouse.
Then we can use it for analytical purposes: from the data available in the data warehouse, we can generate reports, and multiple charts and reports can be produced using business intelligence tools.
This analysis is required to grow the business and make appropriate decisions for the organization.
THE END
Questions?