Chapter 2 - Introduction To Data Science
Data Science
OUTLINE
After completing this chapter:
The main purpose of data science is to find patterns within data.
It uses several techniques to analyze the data and draw insights
from it.
A data scientist is the person who analyzes data and gives
insights to decision makers. In data science, the data scientist is
responsible for making predictions from the data.
INTRODUCTION TO DATA SCIENCE
A data scientist aims to derive conclusions from the data as a whole.
With the help of these conclusions, the data scientist can support
industries in making smarter business decisions.
Before choosing a department, you write your preferences on paper, such as Law,
Anthropology, Economics, Accounting, etc., based on what you have seen during the
semester. This planned choice is a piece of data, whether it is written in pencil
or on a mobile phone where you can read it.
When it is time to choose, you use your data as a reminder of which department to
choose. When you choose a department, the registrar registers you in the
department you want.
At the registrar, a system reports that the department's capacity is full and that
students should be assigned to another department.
Finally, at the end of registration, the registrar's employees see different graphs
of student data based on sex, age, country, etc. They use this information to
prepare for the next year.
EXAMPLE OF DATA SCIENCE
The small piece of information that began in your notebook
ended up in many different places, most notably at the registrar's
office as an aid to decision making.
DATA AND INFORMATION
Information is the processed data on which decisions and
actions are based. It is data that has been processed into a form
that is meaningful to the recipient.
For example:
Data: 51012
Information:
5/10/12: the date of your final exam.
51,012 Birr: the starting salary of an accounting major.
51012: the zip code of Dilla.
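The idea that the same raw value only becomes information once it is given a
context can be sketched in a few lines of Java. This is a minimal illustration,
not part of the original material: the class and method names are hypothetical,
and the three interpretations are the ones from the example above.

```java
import java.util.Locale;

// Hypothetical illustration: one raw value, three interpretations.
class DataVsInformation {

    // Read a five-digit value as day/month/year: "51012" becomes "5/10/12".
    static String asDate(String raw) {
        return raw.substring(0, 1) + "/" + raw.substring(1, 3) + "/" + raw.substring(3);
    }

    // Read the value as an amount of money with a thousands separator.
    static String asSalary(String raw) {
        return String.format(Locale.US, "%,d Birr", Integer.parseInt(raw));
    }

    // Read the value as a postal code: the digits are unchanged, only labeled.
    static String asZipCode(String raw) {
        return raw + " (zip code)";
    }

    public static void main(String[] args) {
        String data = "51012"; // raw data, meaningless without context
        System.out.println(asDate(data));
        System.out.println(asSalary(data));
        System.out.println(asZipCode(data));
    }
}
```

The input never changes; only the context applied to it does, and that context
is what makes it meaningful to the recipient.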
DATA PROCESSING CYCLE
Data processing is the restructuring or reordering of data by people
or machines to increase its usefulness and add value for a particular
purpose. Three steps constitute the data processing cycle: input,
processing, and output.
Input: in this step, the input data is prepared in some convenient form
for processing. The input data can be recorded on any one of
several types of storage media, such as hard disk, CD, flash disk,
paper, and so on.
E.g., when you open an account at a CBE branch, your data will be stored in
Processing: in this step, the input data is changed to produce data
in a more useful form. For example, interest can be calculated on
a bank deposit, or a summary of the month's withdrawals can be
calculated.
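The two processing examples above can be sketched as follows. This is a minimal
sketch, assuming simple interest for a single period; the class name, method
names, interest rate, and amounts are all made up for illustration.

```java
import java.util.List;

// Hypothetical sketch of the processing step: raw input in, more useful data out.
class ProcessingStep {

    // Simple interest for one period: deposit multiplied by the rate.
    static double interest(double deposit, double rate) {
        return deposit * rate;
    }

    // Summary of the month's withdrawals: here, just their total.
    static double totalWithdrawals(List<Double> withdrawals) {
        double total = 0.0;
        for (double w : withdrawals) {
            total += w;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample input values (assumed, not from a real account).
        System.out.println("Interest: " + interest(1000.0, 0.07));
        System.out.println("Withdrawn: " + totalWithdrawals(List.of(200.0, 150.0, 50.0)));
    }
}
```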
Various data science tools are used to analyze different types of
data:
DATA PROCESSING TOOLS
SAS (Statistical Analysis System): one of the data science
tools, specifically designed for statistical operations and used
by large organizations to analyze data.
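int x = 5;               // integer
String name = "Abebe";   // character string
float p = 3.14f;         // floating-point number (a float literal needs the f suffix in Java)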
From a data analytics point of view, there are three data types:
structured, semi-structured, and unstructured.
DATA TYPE
Structured data: data whose elements are organized in a tabular,
structured format. It is easy to retrieve, search, and analyze
this type of data. Examples: Excel files, SQL databases.
The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data. For this
reason, metadata is frequently used by big data solutions
for initial analysis.
DATA VALUE CHAIN
The data value chain describes the information flow within a big data
system as a series of steps needed to generate value and useful
insights from data.
The Big Data Value Chain identifies the following key high-level
activities:
DATA ACQUISITION
It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse.
Classification
Clustering
Association, etc.
Preservation.
Relational database management systems (RDBMS) have been the main, and
almost the only, solution to the storage paradigm for nearly 40 years,
but they lack flexibility when data volumes and complexity grow. For
this reason, NoSQL technologies have been designed.
DATA USAGE
It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
While the problem of working with data that exceeds the computing
power or storage of a single computer is not new, the pervasiveness,
scale, and value of this type of computing have greatly expanded in
recent years.
Some real-world examples of how big data is used are as follows:
The transportation industry uses big data in fuel optimization tools.
Big data can help with real-time data monitoring (for example, knowing
how your networks, applications, and services are performing) and with
cybersecurity protocols.
BIG DATA
Big data techniques handle, manage, and process different
types of data: structured, semi-structured, and unstructured.
We can save large amounts of data for a long time using Big Data
techniques. So it is easy to handle historical data and generate
accurate reports.
Because data processing is very fast, social media platforms use
big data techniques.
and helps to write applications which transform big data sets into a
Processing: in this stage, the data is stored and processed. The data is
stored in HDFS and/or HBase. Spark and MapReduce perform the data
processing.
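The map and reduce idea behind MapReduce can be sketched on a single machine.
The following is a plain-Java illustration using ordinary collections; it does
not use the Hadoop or Spark APIs, and the class and method names are
hypothetical. It counts words, a common introductory example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of the map and reduce phases.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run the map phase on every line, then reduce all emitted pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "data science")));
    }
}
```

In a real cluster the map calls would run in parallel on different nodes, close
to where the data is stored; here they run one after another for clarity.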
In every branch, all customer data is stored in the local database daily.
We can then use it for analytical purposes. For analysis, we can generate
reports from the data available in the data warehouse. Multiple charts and
reports can be generated using business intelligence tools.
We need this analysis to grow the business and make appropriate decisions
for the organization.
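The kind of summary that a report or chart is built from can be sketched as a
simple aggregation. This is a minimal sketch under stated assumptions: the
Transaction fields, the class names, and the sample values are hypothetical, and
an in-memory list stands in for data loaded from the data warehouse.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: aggregate branch records into per-branch totals.
class BranchReport {

    // One row as it might arrive from a branch's local database (assumed fields).
    static class Transaction {
        final String branch;
        final double amount;

        Transaction(String branch, double amount) {
            this.branch = branch;
            this.amount = amount;
        }
    }

    // Total amount per branch, sorted by branch name.
    static Map<String, Double> totalByBranch(List<Transaction> rows) {
        Map<String, Double> totals = new TreeMap<>();
        for (Transaction t : rows) {
            totals.merge(t.branch, t.amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample rows for illustration only.
        List<Transaction> rows = List.of(
            new Transaction("Dilla", 100.0),
            new Transaction("Dilla", 50.0),
            new Transaction("Hawassa", 70.0));
        System.out.println(totalByBranch(rows));
    }
}
```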