CHAPTER 1 - Introduction To Data Science
UNIT 1
INTRODUCTION TO DATA SCIENCE
Data Science is a concept used to tackle big data and
includes data cleansing, preparation and analysis.
Let's start by defining what data is. It is very important
to understand data.
Whenever we use the word “data”, we refer to a
collection of information in either an organized or
unorganized format:
• Organized data: This refers to data that is sorted into a
row/column structure, where every row represents a
single observation and the columns represent the
characteristics of that observation.
• Unorganized data: This is data that is in
free form, usually text or raw audio/signals, and must be
parsed further to become organized.
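The difference between the two formats can be sketched in a few lines of Python. This is only an illustration: the names, ages and cities are invented sample values, and `parse_people` is a hypothetical helper that works only for sentences written in exactly this pattern.

```python
import re

# Organized data: each row is one observation, each key (column) one characteristic.
organized = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "Ravi", "age": 27, "city": "Delhi"},
]

# Unorganized data: free-form text that must be parsed before analysis.
unorganized = "Asha is 34 and lives in Pune. Ravi is 27 and lives in Delhi."


def parse_people(text):
    """Turn free-form sentences into row/column records."""
    pattern = re.compile(r"(\w+) is (\d+) and lives in (\w+)\.")
    return [
        {"name": name, "age": int(age), "city": city}
        for name, age, city in pattern.findall(text)
    ]


parsed = parse_people(unorganized)
```

After parsing, `parsed` holds the same rows as `organized`, so the free-form text has become organized data.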
Data science is the art and science of acquiring
knowledge through data.
Data science is all about how we take data, use it to
acquire knowledge, and then use that knowledge to do
the following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
Types of Data
Data is defined as a collection of facts and details such as
text, figures, observations, symbols or simply
descriptions of things, events or entities, gathered with a
view to drawing inferences.
It is the raw fact that must be processed to yield
information: the unprocessed numbers, statements and
characters as they exist before the researcher refines
them.
The term data is derived from the Latin term ‘datum’, which
refers to ‘something given’.
Data is closely connected with scientific research and
is collected by various organizations, government
departments, institutions and non-government agencies
for a variety of reasons.
We can classify data in two main ways – based on
its type and on its measurement level.
2. Ordinal Data:
In ordinal data, values are assigned to categories that have
an intrinsic rank or order.
For example, age group – Young, Adult, Senior Citizen.
3. Binary Data:
Binary data can take only two possible values.
For example, Yes/No or True/False.
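As a rough sketch, ordinal and binary data could be handled in Python as follows. The age groups come from the example above; the `people` and `answers` lists are invented sample values, and `rank` is a hypothetical helper.

```python
# Ordinal data: categories with an intrinsic order, listed from lowest to highest rank.
AGE_GROUPS = ["Young", "Adult", "Senior Citizen"]


def rank(group):
    """Return the position of an age group in the ordering."""
    return AGE_GROUPS.index(group)


people = ["Adult", "Senior Citizen", "Young", "Adult"]  # invented sample
sorted_people = sorted(people, key=rank)  # sorts by rank, not alphabetically

# Binary data: exactly two possible values, mapped here onto Python booleans.
answers = ["Yes", "No", "Yes"]  # invented sample
as_bool = [answer == "Yes" for answer in answers]
```

Sorting by `rank` respects the order of the categories, which plain alphabetical sorting would not.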
2. Quantitative Data OR Numerical Data
Data that is in numerical format is considered
quantitative data.
Quantitative data can be collected through various methods,
such as surveys, online polls and telephone interviews.
Examples of quantitative data are height, weight,
temperature, etc.
Quantitative data is further divided into two types:
i. Discrete Data
ii. Continuous Data
i. Discrete Data:
Discrete data is based on counts and can take only distinct,
separate values. Typically it involves integers.
A good example would be the number of cars that you want to
buy. Even if you don’t know exactly how many, you are
absolutely sure that the value will be an integer such as 0, 1, 2,
or even 10.
ii. Continuous Data:
Continuous data represents measurements, so its possible values
cannot be counted; they can only be measured and described
using intervals on the real number line. An example is the
height of a person.
Other examples are weight, temperature, etc.
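The distinction can be illustrated with a small Python sketch. The sample values are invented, and `is_discrete` and `bucket` are hypothetical helpers; checking for Python integers is only a rough stand-in for "based on counts".

```python
cars_owned = [0, 1, 2, 1, 0]  # discrete: counts (invented sample)
heights_cm = [162.5, 170.2, 158.0, 181.7]  # continuous: measurements (invented sample)


def is_discrete(values):
    """Rough check: treat a column as discrete when every value is an integer."""
    return all(isinstance(v, int) and not isinstance(v, bool) for v in values)


def bucket(value, width):
    """Return the [low, high) interval of the given width that contains value."""
    low = (value // width) * width
    return (low, low + width)


height_interval = bucket(170.2, 10)  # describes a height by an interval
```

A count such as `cars_owned` can be listed value by value, while a height is better described by the interval it falls into.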
Programming Languages
A programming language defines a set of instructions that are
combined so that the CPU (Central Processing Unit) can
perform a specific task.
Programming languages can be classified into two categories:
• Low-level language
• High-level language
1) Low-level language
The languages that come under this category are
the Machine level language and Assembly
language.
i) Machine-level language
• A computer’s native language is called machine
language. It is the most primitive or basic
programming language and takes instructions
in the form of raw binary code.
• So, to give a computer an instruction in its
native machine language, you have to manually
enter the instruction as binary code.
ii) Assembly Language
Assembly language contains some human-readable
instructions. The problems faced in machine-level
language are reduced to some extent by using this
extended form of machine-level language. Since
assembly language instructions are written as
English-like words such as mov, add and sub, they are
easier to write and understand.
Before the computer can run it, another program called the
assembler translates the assembly language into machine code.
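The job of an assembler can be sketched with a toy example in Python. Everything here is an assumption made for illustration: the 8-bit instruction format, the opcode numbers in `OPCODES`, and the single-operand instructions do not correspond to any real CPU.

```python
# Hypothetical 8-bit instruction format: 4-bit opcode followed by a 4-bit operand.
OPCODES = {"mov": 0b0001, "add": 0b0010, "sub": 0b0011}  # invented encoding


def assemble(line):
    """Translate one line such as 'add 5' into an 8-bit binary string."""
    mnemonic, operand = line.split()
    value = int(operand)
    if not 0 <= value <= 15:
        raise ValueError("operand must fit in 4 bits")
    return format((OPCODES[mnemonic] << 4) | value, "08b")


program = ["mov 7", "add 5", "sub 2"]  # human-readable assembly-style lines
machine_code = [assemble(line) for line in program]  # raw binary, as strings
```

Writing `add 5` is clearly easier than typing `00100101` by hand, which is the advantage assembly language offers over machine language.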
2) High-Level Language
A high-level language is a programming language that
allows a programmer to write programs that are
independent of a particular type of computer. Such
languages are considered high-level because they are closer
to human languages than to machine-level languages.
Advantages
i) Readability
High-level languages are closer to natural language, so they
are easier to learn and understand.
ii) Machine independent
High-level language programs have the advantage of being
portable between machines.
iii) Easy debugging
Integrated Development Environment (IDE)
An IDE, or Integrated Development Environment, enables
programmers to consolidate the different aspects of writing a
computer program.
IDEs increase programmer productivity by combining
common activities of writing software into a single
application: editing source code, building executables, and
debugging.
Without an IDE, a developer must select, deploy, integrate
and manage all of these tools separately. An IDE brings many
of those development-related tools together as a single
framework, application or service. The integrated toolset is
designed to simplify software development and can identify
and minimize coding mistakes and typos.
Benefits of using IDEs
• Productivity
• Syntax highlighting
• Autocomplete
• Debugging
EDA (EXPLORATORY DATA ANALYSIS)
AND
DATA VISUALIZATION
EDA (EXPLORATORY DATA ANALYSIS)
Exploratory Data Analysis refers to the critical process
of performing initial investigations on data so as to
discover patterns, to spot anomalies, to test hypotheses
and to check assumptions with the help of summary
statistics and graphical representations.
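A minimal sketch of this first step, using only Python's standard `statistics` module, might look like the following. The `values` list is an invented sample with one planted anomaly, and the 1.5 × IQR rule is one common rule of thumb for spotting outliers, not the only possible choice.

```python
import statistics

values = [12, 15, 14, 13, 16, 15, 14, 95]  # invented sample; 95 is a planted anomaly

# Summary statistics give a first overview of the data.
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}

# Flag values lying more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
anomalies = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

Here the mean is pulled far above the median by a single value, which is itself a hint that an anomaly is present.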
The process of exploring data has no single, simple definition. It
involves the ability to recognize the different types of
data, transform data types, and use code to
systematically improve the quality of the entire dataset
to prepare it for the modeling stage.
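Transforming data types with code could be sketched as below. The rows in `raw_rows` are invented, and `to_number` and `clean` are hypothetical helpers; treating unparseable text as a missing value (`None`) is just one possible design choice.

```python
# Raw input often arrives as text, with stray spaces and missing entries.
raw_rows = [
    {"age": "34", "height_cm": "162.5", "smoker": "Yes"},
    {"age": " 27 ", "height_cm": "", "smoker": "no"},
    {"age": "n/a", "height_cm": "181.7", "smoker": "YES"},
]


def to_number(text, kind):
    """Convert text to int or float; return None when it cannot be parsed."""
    try:
        return kind(text.strip())
    except ValueError:
        return None


def clean(row):
    """Give every column a proper type: int, float and bool."""
    return {
        "age": to_number(row["age"], int),
        "height_cm": to_number(row["height_cm"], float),
        "smoker": row["smoker"].strip().lower() == "yes",
    }


cleaned = [clean(row) for row in raw_rows]
missing_ages = sum(1 for row in cleaned if row["age"] is None)
```

Counting the missing values after cleaning, as `missing_ages` does, shows how much of the dataset still needs attention before modeling.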
Data Visualization
Data visualization is the graphical representation of
information and data. By using visual elements like
charts, graphs, and maps, data visualization tools
provide an accessible way to see and understand
trends, outliers, and patterns.
Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message.
When we see a chart, we quickly see trends and
outliers. If we can see something, we internalize it
quickly.
Because of the way the human brain processes
information, using charts or graphs to visualize large
amounts of complex data is easier than poring over
spreadsheets or reports. Data visualization is a quick,
easy way to convey concepts in a universal manner –
and you can experiment with different scenarios by
making slight adjustments.
1. HISTOGRAM
2. BOXPLOT
3. SCATTERPLOT
4. BARPLOT
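The numbers behind three of these charts can be sketched without any plotting library. The sample data is invented, `histogram_bins` and `five_number_summary` are hypothetical helpers, and the quartiles use the default method of `statistics.quantiles`, so other tools may give slightly different values. A scatterplot needs pairs of measurements and is not covered here.

```python
from collections import Counter
import statistics

heights = [158, 162, 165, 165, 170, 171, 174, 181]  # invented sample
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune"]  # invented sample


def histogram_bins(values, width):
    """Count how many values fall into each [low, low + width) bin."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a box plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


bins = histogram_bins(heights, 10)  # histogram: counts per numeric bin
box = five_number_summary(heights)  # boxplot: five-number summary
bar_heights = Counter(cities)  # barplot: one bar per category

# A simple text rendering of the histogram.
for low, count in bins.items():
    print(f"{low}-{low + 10}: {'#' * count}")
```

A histogram groups a numerical variable into bins, while a bar plot counts a categorical one; the two look similar but answer different questions.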
Key Purposes of Data Visualization in Data Science