Unit 1 Introduction To Data Science
Syllabus
Data Science : Benefits and uses - Facets of data - Defining research goals - Retrieving data - Data preparation - Exploratory data analysis - Build the model - Presenting findings and building applications - Warehousing - Basic statistical descriptions of data.
1.2 Facets of Data
• Big data and data science generate very large amounts of data. These data come in various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a row and column format, which makes it easy for applications to retrieve and process the data. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in
a structure. The most common form of structured data or records is a database where
specific information is stored based on a methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
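• As a small illustration (not part of the original text), the sketch below builds a table of rows and named columns in Python with the pandas library, assuming pandas is installed; the customer records are invented.

```python
# A minimal sketch of structured data: rows and named columns, like a small Excel table.
# Assumes the pandas package is installed; the records are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "city": ["Chennai", "Mumbai", "Delhi"],
    "amount_spent": [2500.0, 1200.5, 980.0],
})

# Because the data is structured, it can be filtered and processed by column.
print(customers[customers["amount_spent"] > 1000])
```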
Unstructured Data
• Unstructured data is data that does not follow a specified format. Because it is not organized into rows and columns, it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
• Even today, more than 80 % of the data in most organizations is in unstructured form. It carries a great deal of information, but extracting that information from such varied sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps
machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in
many modern real-world applications. The natural language processing community
has had success in entity recognition, topic recognition, summarization, text
completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process composed of several layers of text analysis.
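• As a rough, purely illustrative sketch (the word lists and scoring rule below are invented for the example and are not a real NLP model), even a few lines of Python can tokenize text and assign a crude sentiment label:

```python
# A toy sketch of text analysis on natural language data.
# The word lists and scoring rule are illustrative assumptions, not a real NLP model.
positive_words = {"good", "great", "excellent", "happy"}
negative_words = {"bad", "poor", "terrible", "unhappy"}

def toy_sentiment(text: str) -> str:
    # Very naive tokenization: lowercase and split on whitespace.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(toy_sentiment("The service was great and the staff were excellent!"))  # positive
```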
Machine - Generated Data
• Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every processor-
based system, as well as many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
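• To make this concrete, the hedged sketch below parses a single made-up web server log line in the common Apache access-log layout, using only Python's standard library:

```python
# A small sketch: extracting fields from a (made-up) web server log line.
# The regular expression targets the common Apache access-log layout.
import re

log_line = '192.168.1.10 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(log_line)
if match:
    print(match.groupdict())
    # {'ip': '192.168.1.10', 'timestamp': '10/Oct/2023:13:55:36 +0530', ...}
```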
Graph-based or Network Data
•Graphs are data structures to describe relationships and interactions between
entities in complex systems. In general, a graph contains a collection of entities called
nodes and another collection of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just like we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of thinking
about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
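• As an illustrative sketch (the entities and relationships below are invented), graph-based data can be represented as nodes and edges with the third-party networkx library, assuming it is installed:

```python
# A minimal sketch of graph-based data: entities as nodes, relationships as edges.
# Assumes the networkx package is installed; names and relations are illustrative.
import networkx as nx

g = nx.Graph()
g.add_edge("Alice", "Bob", relation="follows")
g.add_edge("Alice", "Carol", relation="friend")
g.add_edge("Bob", "Carol", relation="shares_email")

# Simple queries over the graph: who is connected to Alice, and how?
for neighbor in g.neighbors("Alice"):
    print("Alice --", g.edges["Alice", neighbor]["relation"], "--", neighbor)
```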
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions in
near-real time. With fast graph queries, we are able to detect that, for example, a
potential purchaser is using the same email address and credit card as included in a
known fraud case.
• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with one personal email address or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories such
as customer interests, friends and purchase history. We can use a highly available
graph database to make product recommendations to a user based on which products
are purchased by others who follow the same sport and have similar purchase history.
• Graph theory was probably the main method in social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as the nodes and links (for example, influencers and followers).
• Influencers on a social network are identified as users that have an impact on the activities or opinions of other users, by way of followership or influence on decisions made by other users on the network, as shown in Fig. 1.2.1.
• Graph theory has proved to be very effective on large-scale datasets such as social network data, because it can bypass building an actual visual representation of the data and run directly on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use case.
• It is important to remark that multimedia data is one of the most important sources
of information and knowledge; the integration, transformation and indexing of
multimedia data bring significant challenges in data management and analysis. Many
challenges have to be addressed including big data, multidisciplinary nature of Data
Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia
data. Multimedia data usually contains various forms of media, such as text, image,
video, geographic coordinates and even pulse waveforms, which come from multiple
sources. Data Science can be a key instrument covering big data, machine learning
and data mining solutions to store, handle and analyze such heterogeneous data.
Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or geospatial
services and telemetry from connected devices or instrumentation in data centers.
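• As a toy sketch only, a Python generator can imitate a source that continuously emits small records; the field names and values are invented:

```python
# A toy sketch of streaming data: a source that continuously yields small records.
# Field names and values are illustrative, not from any real system.
import random
import time
from typing import Iterator

def click_stream(n_events: int = 5) -> Iterator[dict]:
    for event_id in range(n_events):
        yield {
            "event_id": event_id,
            "user": random.choice(["u1", "u2", "u3"]),
            "action": random.choice(["view", "click", "purchase"]),
            "ts": time.time(),
        }
        time.sleep(0.1)  # simulate records arriving over time

# A consumer processes each small record as it arrives, instead of loading a full dataset.
for record in click_stream():
    print(record)
```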
Difference between Structured and Unstructured Data
1.3 Data Science Process
Data science process consists of six stages :
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
• Fig. 1.3.1 shows data science design process.
2. Appending tables
• Appending tables is also called stacking tables. It effectively adds the observations from one table to another table. Fig. 1.6.3 shows appending tables.
• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is a larger one with the observations from Table 1 as well as Table 2. The equivalent operation in set theory would be the union, and this is also the command in SQL, the common language of relational databases. Other set operators, such as set difference and intersection, are also used in data science.
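• A brief sketch of the append operation using pandas (assuming it is installed); the x3 values mirror the example above, while the other columns are invented:

```python
# Appending (stacking) two tables: the result contains the observations of both,
# analogous to UNION in SQL. Assumes pandas is installed.
import pandas as pd

table1 = pd.DataFrame({"x1": [1, 2], "x2": [10, 20], "x3": [3, 3]})
table2 = pd.DataFrame({"x1": [5, 6], "x2": [50, 60], "x3": [33, 33]})

appended = pd.concat([table1, table2], ignore_index=True)
print(appended)  # four rows: observations of Table 1 followed by those of Table 2
```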
3. Using views to simulate data joins and appends
• Using views avoids the duplication of data. An appended table requires more storage space, and if table sizes run into terabytes of data, duplicating the data becomes problematic. For this reason, the concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually
into a yearly sales table instead of duplicating the data.
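• As a sketch of the same idea in SQL, run here through Python's built-in sqlite3 module, a view combines the monthly tables virtually instead of copying their rows; the table and column names are invented for the example:

```python
# A view that virtually combines monthly sales tables without duplicating data.
# Uses the standard-library sqlite3 module; table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (product TEXT, amount REAL);
    CREATE TABLE sales_feb (product TEXT, amount REAL);
    INSERT INTO sales_jan VALUES ('pen', 120.0), ('book', 340.0);
    INSERT INTO sales_feb VALUES ('pen', 150.0), ('book', 310.0);

    -- The view stores only the query, not a copy of the rows.
    CREATE VIEW sales_year AS
        SELECT * FROM sales_jan
        UNION ALL
        SELECT * FROM sales_feb;
""")

print(conn.execute("SELECT * FROM sales_year").fetchall())
conn.close()
```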
Transforming Data
• In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Relationships between an input variable and an output
variable aren't always linear.
• Reducing the number of variables : Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when they are overloaded with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10
variables. Data scientists use special methods to reduce the number of variables but
retain the maximum amount of data.
Euclidean distance :
• Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the coordinates of the two points :
Euclidean distance = √((X1 − X2)² + (Y1 − Y2)²)
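• A short sketch of this calculation in Python, using only the standard library; the two sample points are arbitrary:

```python
# Euclidean distance between two observations (X, Y): the square root of the
# sum of squared differences of their coordinates.
import math

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```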
Turning variables into dummies :
• Variables can be turned into dummy variables. Dummy variables can only take two values : true (1) or false (0). They're used to indicate the presence or absence of a categorical effect that may explain the observation.
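• A hedged illustration using pandas (assuming it is installed); the column names and values are invented:

```python
# Turning a categorical variable into dummy (0/1) variables, one column per category.
# Assumes pandas is installed; the data values are illustrative.
import pandas as pd

df = pd.DataFrame({"customer": ["a", "b", "c"], "channel": ["web", "store", "web"]})
dummies = pd.get_dummies(df, columns=["channel"])
print(dummies)  # channel_store and channel_web now hold 1/0 (True/False) indicators
```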
1.7 Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by
means of simple summary statistics and graphic visualizations in order to gain a
deeper understanding of data.
• EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers they need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis or check assumptions.
• EDA is an approach/philosophy for data analysis that employs a variety of
techniques to:
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
• With EDA, the following functions are performed :
1. Describe the user data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
• Box plots are an excellent tool for conveying location and variation information in
data sets, particularly for detecting and illustrating location and variation changes
between different groups of data.
• Exploratory data analysis is mainly performed using the following methods :
1. Univariate analysis : Provides summary statistics for each field in the raw data set (or a summary of only one variable). Ex. : CDF, PDF, box plot.
2. Bivariate analysis : Performed to find the relationship between each variable in the dataset and the target variable of interest (or using two variables and finding the relationship between them). Ex. : box plot, violin plot.
3. Multivariate analysis : Performed to understand interactions between different fields in the dataset (or finding interactions between more than two variables).
• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
1. Minimum score : The lowest score, excluding outliers.
2. Lower quartile : 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the line that
divides the box into two parts.
4. Upper quartile : 75 % of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the middle 50%.
7. The interquartile range : This is the box of the box plot, showing the middle 50 % of scores.
• Box plots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.
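• A minimal sketch of such a comparison with matplotlib and numpy (assuming both are installed); the four groups of scores are simulated for illustration:

```python
# Comparing score distributions across four teaching methods with box plots.
# Assumes matplotlib and numpy are installed; the scores are simulated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
scores = [rng.normal(loc=mean, scale=8, size=30) for mean in (60, 65, 70, 75)]

plt.boxplot(scores)
plt.xticks([1, 2, 3, 4], ["Method A", "Method B", "Method C", "Method D"])
plt.ylabel("Score")
plt.title("Scores by teaching method")
plt.show()
```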
• The middle tier is the application layer giving an abstracted view of the database. It
arranges the data to make it more suitable for analysis. This is done with an OLAP
server, implemented using the ROLAP or MOLAP model.
• OLAP servers can interact with both relational databases and multidimensional databases, which lets them collect data better based on broader parameters.
• The top tier is the front-end of an organization's overall business intelligence suite.
The top-tier is where the user accesses and interacts with data via queries, data
visualizations and data analytics tools.
• The top tier represents the front-end client layer. This client level includes the tools and Application Programming Interfaces (APIs) used for high-level data analysis, querying and reporting. Users can use reporting, query, analysis or data mining tools.
Needs of Data Warehouse
1) Business user: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to
them in an elementary form.
2) Store historical data: Data warehouse is required to store the time variable data
from the past. This input is made to be used for various purposes.
3) Make strategic decisions : Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4) For data consistency and quality : By bringing data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) Fast response time : A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
a) Understand business trends and make better forecasting decisions.
b) Data warehouses are designed to perform well with enormous amounts of data.
c) The structure of data warehouses is more accessible for end-users to navigate,
understand and query.
d) Queries that would be complex in many normalized databases could be easier to
build and maintain in data warehouses.
e) Data warehousing is an efficient method to manage demand for lots of information
from lots of users.
f) Data warehousing provides the capability to analyze large amounts of historical data.
Difference between ODS and Data Warehouse
Metadata
• Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. In data warehousing, metadata is one of the
essential aspects.
• We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the warehouse objects.
c) Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
• In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Why is metadata necessary in a data warehouse ?
a) First, it acts as the glue that links all parts of the data warehouses.
b) Next, it provides information about the contents and structures to the developers.
c) Finally, it opens the doors to the end-users and makes the contents recognizable in
their terms.
• Fig. 1.11.2 shows warehouse metadata.
Standard Deviation :
• The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it more easily interpreted than the variance.
• The standard deviation is computed as follows:
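• For a sample of n observations x1, x2, ..., xn with mean x̄, the usual sample form is :
s = √( Σ (xi − x̄)² / (n − 1) )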
• As well as time series data, line graphs can also be appropriate for displaying data
that are measured over other continuous variables such as distance.
• For example, a line graph could be used to show how pollution levels vary with
increasing distance from a source or how the level of a chemical varies with depth of
soil.
• In a line graph, the x-axis represents the continuous variable (for example, year or distance from the initial measurement) whilst the y-axis has a scale and indicates the measurement.
• Several data series can be plotted on the same line chart and this is particularly
useful for analysing and comparing the trends in different datasets.
• Line graph is often used to visualize rate of change of a quantity. It is more useful
when the given data has peaks and valleys. Line graphs are very simple to draw and
quite convenient to interpret.
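• A brief sketch with matplotlib (assuming it is installed), plotting how a quantity varies over a continuous variable; the pollution values are invented:

```python
# A simple line graph: pollution level versus distance from the source.
# Assumes matplotlib is installed; the data values are illustrative.
import matplotlib.pyplot as plt

distance_km = [0, 1, 2, 3, 4, 5]
pollution_level = [95, 80, 62, 45, 30, 22]

plt.plot(distance_km, pollution_level, marker="o")
plt.xlabel("Distance from source (km)")
plt.ylabel("Pollution level")
plt.title("Pollution level vs. distance")
plt.show()
```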
4. Pie charts
• A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole. Each sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie slice"
format with varying slice sizes telling how much of one data element exists.
• A pie chart is also known as a circle graph. The bigger the slice, the more of that particular data was gathered. The main use of a pie chart is to show comparisons. Fig. 1.12.2 shows a pie chart.
• Various applications of pie charts can be found in business, at school and at home. For business, pie charts can be used to show the success or failure of certain products or services.
• At school, pie chart applications include showing how much time is allotted to each
subject. At home pie charts can be useful to see expenditure of monthly income in
different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest.
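• A short sketch with matplotlib (assuming it is installed); the expenditure categories and amounts are invented:

```python
# A pie chart showing how monthly income is spent; each slice is one category.
# Assumes matplotlib is installed; the figures are illustrative.
import matplotlib.pyplot as plt

categories = ["Rent", "Food", "Transport", "Savings", "Other"]
amounts = [12000, 8000, 3000, 5000, 2000]

plt.pie(amounts, labels=categories, autopct="%1.1f%%")
plt.title("Monthly expenditure")
plt.show()
```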
Limitations of pie chart :
• It is difficult to tell the difference between estimates of similar size.
• Error bars or confidence limits cannot be shown on a pie graph.
• Legends and labels on pie graphs are hard to align and read.
• The human visual system is more efficient at perceiving and discriminating between lines and line lengths than between two-dimensional areas and angles.
• Pie graphs simply don't work well when comparing data.
Two Marks Questions with Answers
Q.1 What is data science?
Ans. :
• Data science is an interdisciplinary field that seeks to extract knowledge or insights
from various forms of data.
• At its core, data science aims to discover and extract actionable knowledge from data
that can be used to make sound business decisions and predictions.
• Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future.
Q.2 Define structured data.
Ans. : Structured data is arranged in a row and column format, which makes it easy for applications to retrieve and process the data. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure.
Q.3 What is a data set ?
Ans. : A data set is a collection of related records or information. The information may be on some entity or some subject area.
Q.4 What is unstructured data ?
Ans. : Unstructured data is data that does not follow a specified format. Because it is not organized into rows and columns, it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
Q.5 What is machine - generated data ?
Ans. : Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not considered machine-generated.
Q.6 Define streaming data.
Ans. : Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).
Q.7 List the stages of data science process.
Ans.: Stages of data science process are as follows:
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
Q.8 What are the advantages of data repositories?
Ans.: Advantages are as follows:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data reporting.
iii. Database administrators have easier time tracking problems.
iv. There is value to storing and analyzing data.
Q.9 What is data cleaning?
Ans. : Data cleaning means removing inconsistent data or noise and collecting the necessary information from a collection of interrelated data.
Q.10 What is outlier detection?
Ans. : Outlier detection is the process of detecting and subsequently excluding
outliers from a given set of data. The easiest way to find outliers is to use a plot or a
table with the minimum and maximum values.
Q.11 Explain exploratory data analysis.
Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring datasets by
means of simple summary statistics and graphic visualizations in order to gain a
deeper understanding of data. EDA is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often employing data
visualization methods.
Q.12 Define data mining.
Ans. : Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of discovering interesting patterns or knowledge from a large amount of data stored in databases, data warehouses or other information repositories.
Q.13 What are the three challenges to data mining regarding data mining
methodology?
Ans. Challenges to data mining regarding data mining methodology include the
following:
1. Mining different kinds of knowledge in databases,
2. Interactive mining of knowledge at multiple levels of abstraction,
3. Incorporation of background knowledge.
Q.14 What is predictive mining?
Ans. : Predictive mining tasks perform inference on the current data in order to make predictions. Predictive analysis provides answers to future queries, using historical data as the chief basis for decisions.
Q.15 What is data cleaning?
Ans. Data cleaning means removing the inconsistent data or noise and collecting
necessary information of a collection of interrelated data.
Q.16 List the five primitives for specifying a data mining task.
Ans. :
1. The set of task-relevant data to be mined
2. The kind of knowledge to be mined
3. The background knowledge to be used in the discovery process
4. The interestingness measures and thresholds for pattern evaluation
5. The expected representation for visualizing the discovered pattern.
Q.17 List the stages of data science process.
Ans. Data science process consists of six stages:
1. Discovery or Setting the research goal 2. Retrieving data 3. Data preparation
4. Data exploration 5. Data modeling 6. Presentation and automation
Q.18 What is data repository?
Ans. : A data repository is also known as a data library or data archive. This is a general term that refers to a data set isolated to be mined for data reporting and analysis. A data repository is a large database infrastructure, made up of several databases that collect, manage and store data sets for data analysis, sharing and reporting.
Q.19 List the data cleaning tasks.
Ans. : Data cleaning tasks are as follows :
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
Q.20 What is Euclidean distance ?
Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the coordinates of the two points.