
ARULMURUGAN COLLEGE OF ENGINEERING, THENNILAI, KARUR

Department of Computer Science and Engineering

CS3352 - Foundation of Data Science
UNIT I : Introduction
Data Science: Benefits and uses - facets of data -
Defining research goals - Retrieving data - Data
preparation - Exploratory Data analysis - Build the
models - Presenting findings and building applications -
Data Warehousing - Basic Statistical descriptions of Data.
Data Science
• Data is measurable units of information
gathered or captured from the activity of people,
places and things.
Life cycle of data science:

• 1. Capture: Data acquisition, data entry, signal reception
and data extraction.
• 2. Maintain: Data warehousing, data cleansing, data
staging, data processing and data architecture.
• 3. Process: Data mining, clustering and classification, data
modeling and data summarization.
• 4. Analyze: Exploratory and confirmatory analysis,
predictive analysis, regression, text mining and qualitative
analysis.
• 5. Communicate: Data reporting, data visualization, business
intelligence and decision making.
Difference between Data Science and Big Data
Benefits and Uses of Data Science
Data science examples and applications:

a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as
"important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition: Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines
can refer users to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on
amenities
h) Optimization: Scheduling ride-share pickups and package deliveries


Facets of Data
Very large amounts of data are generated in big
data and data science. This data is of various types;
the main categories of data are as follows:

a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data

• Structured data is arranged in a row and
column format. It helps applications to
retrieve and process data easily. A database
management system is used for storing
structured data.
• An Excel table is an example of structured
data.
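As an illustrative sketch (not part of the original slides), a small pandas DataFrame shows how structured data keeps every record in named columns, which makes retrieval and processing straightforward. The table contents below are assumed examples.

```python
# A minimal sketch (assumes pandas is installed) of structured,
# row-and-column data similar to an Excel table.
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "department": ["CSE", "ECE", "CSE"],
    "salary": [52000, 48000, 61000],
})

# Because the structure is known, filtering and aggregation are simple.
cse_staff = employees[employees["department"] == "CSE"]
print(cse_staff)
print("Average salary:", employees["salary"].mean())
```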
Unstructured Data

Unstructured data is data that does not follow a
specified format. Rows and columns are not used
for unstructured data. Therefore it is difficult to
retrieve the required information. Unstructured
data has no identifiable structure.

Unstructured data can be in the form of text
(documents, email messages, customer feedback),
audio, video and images. Email is an example of
unstructured data.
Characteristics of unstructured data:

1. There is no structural restriction or binding for
the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural
rules.
4. There are no predefined formats, restrictions or
sequences for unstructured data.
5. Since there is no structural binding for
unstructured data, it is unpredictable in nature.
Natural Language

• Natural language is a special type of
unstructured data.
• Natural language processing enables
machines to recognize characters, words and
sentences, then apply meaning and
understanding to that information. This helps
machines to understand language as humans
do.
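A minimal sketch (plain Python, added for illustration; real NLP pipelines use libraries such as NLTK or spaCy) of the first step such systems typically perform, splitting raw text into word tokens and counting them:

```python
# A minimal sketch of basic processing on natural-language data:
# tokenise a sentence and count the words.
from collections import Counter

feedback = "The delivery was late. The food was cold but the staff was polite."

# Normalise case, strip punctuation, and split into word tokens.
tokens = [w.strip(".,!?").lower() for w in feedback.split()]
word_counts = Counter(tokens)

print(tokens[:5])          # ['the', 'delivery', 'was', 'late', 'the']
print(word_counts["the"])  # 3
```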
Machine-Generated Data

Machine-generated data is information that is
created without human interaction as a result of
a computer process or application activity.

Both Machine-to-Machine (M2M) and Human-to-
Machine (H2M) interactions generate machine
data.
Graph-based or Network Data

Graphs are data structures used to describe
relationships and interactions between entities
in complex systems.
In general, a graph contains a collection of
entities called nodes and another collection of
interactions between pairs of nodes called
edges.
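A minimal sketch (assuming the networkx package, which the slides do not mention) of graph-based data, where people are the nodes and friendships are the edges:

```python
# A minimal sketch of graph-based data using networkx (assumed installed);
# nodes are entities and edges are their interactions.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Asha", "Ravi"),    # an interaction between two entities
    ("Asha", "Meena"),
    ("Ravi", "Karthik"),
])

print(G.number_of_nodes())          # 4
print(G.number_of_edges())          # 3
print(list(G.neighbors("Asha")))    # ['Ravi', 'Meena']
```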
Audio, Image and Video

Audio, image and video are data types that pose
specific challenges to a data scientist. Tasks that
are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging
for computers.
Streaming Data

Streaming data is data that is generated
continuously by thousands of data sources,
which typically send in the data records
simultaneously and in small sizes (order of
kilobytes).
Difference between Structured and Unstructured Data
Data Science Process
Data science process consists of six stages :
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
Step 1: Discovery or Defining research
Goal
• This step involves understanding the business
problem, defining the goals of the project and
identifying the internal and external data sources
that can help to answer the business question.
Step 2: Retrieving data
• It is the collection of the data required for the
project. This is the process of gaining a business
understanding of the data the user has and
deciphering what each piece of data means.
Step 3: Data preparation
• Data can have many inconsistencies, like
missing values, blank columns and incorrect
data formats, which need to be cleaned. We
need to process, explore and condition data
before modeling. Clean data gives
better predictions.
Step 4: Data exploration
• Data exploration is related to a deeper
understanding of the data. Try to understand how
variables interact with each other, the
distribution of the data and whether there are
outliers. To achieve this, use descriptive
statistics, visual techniques and simple
modeling. This step is also called
Exploratory Data Analysis.
Step 5: Data modeling
The actual model-building process starts here.
The data scientist splits the dataset into training
and testing sets. Techniques like association,
classification and clustering are applied to the
training data set. The model, once prepared, is
tested against the "testing" dataset.
Step 6: Presentation and automation
• Deliver the final baselined model with
reports, code and technical documents in this
stage. The model is deployed into a real-time
production environment after thorough
testing.
Defining Research Goals
To understand the project, three concepts must be understood: what, why
and how.

a) What does the company or organization expect?
b) Why does the company's higher authority see value in such research?
c) How is it part of a bigger strategic picture?

The goal of the first phase is to answer these three questions.

In this phase, the data science team must learn and investigate the
problem, develop context and understanding, and learn about the data
sources needed and available for the project.
Defining Research Goals
1. Learning the business domain :
2. Resources :
3. Frame the problem :
4. Identifying key stakeholders:
5. Interviewing the analytics sponsor:
6. Developing initial hypotheses:
7. Identifying potential data sources:
Learning the business domain :
Understanding the domain area of the problem
is essential
Resources :
As part of the discovery phase, the team needs
to assess the resources available to support the
project.
Resources include technology, tools, systems,
data and people.
Frame the problem :
Framing is the process of stating the analytics
problem to be solved. At this point, it is a best
practice to write down the problem statement
and share it with the key stakeholders.
Identifying key stakeholders:
The team can identify the success criteria, key
risks and stakeholders, which should include
anyone who will benefit from the project or will
be significantly impacted by the project.
Interviewing the analytics sponsor:
The team should plan to collaborate with the
stakeholders to clarify and frame the analytics
problem.
This person understands the problem and
usually has an idea of a potential working
solution.
Developing initial hypotheses:
These initial hypotheses form the basis of the
analytical tests the team will use in later phases and
serve as the foundation for the project findings.
Identifying potential data sources:

Consider the volume, type and time span of the
data needed to test the hypotheses. Ensure that
the team can access more than simply
aggregated data. In most cases, the team will
need the raw data to avoid introducing bias into
the downstream analysis.
Retrieving Data
• Much high-quality data is freely available
for public and commercial use. Data can be
stored in various formats, such as text files
and tables in a database. Data may be internal
or external.
• Many companies will have already collected
and stored the data.
The term data repository describes several
ways to collect and store data:

• Data warehouse
• Data lake
• Data marts
• Metadata
• Data cubes
Advantages of data repositories:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data
reporting.
iii. Database administrators have easier time
tracking problems.
iv. There is value to storing and analyzing data.
Disadvantages of data repositories :
• i. Growing data sets could slow down systems.
• ii. A system crash could affect all the data.
• iii. Unauthorized users can access all sensitive
data more easily than if it was distributed
across several locations.
Data Preparation
Data preparation means cleansing, integrating
and transforming data.
Data Cleaning
Data is cleansed through processes such as
filling in missing values, smoothing the noisy
data or resolving the inconsistencies in the data.
Data cleaning tasks are as follows (a short pandas
sketch follows the list):
• 1. Data acquisition and metadata
• 2. Fill in missing values
• 3. Unified date format
• 4. Converting nominal to numeric
• 5. Identify outliers and smooth out noisy data
• 6. Correct inconsistent data
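A minimal, illustrative pandas sketch (the column names and cleaning rules are assumptions, not from the slides) covering three of these tasks: filling missing values, correcting inconsistent labels and unifying the date format.

```python
# A minimal data-cleaning sketch with pandas (assumed installed).
# Column names and cleaning rules are illustrative assumptions.
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Meena", "Karthik"],
    "city": ["Karur", "karur", "Trichy", None],
    "amount": [250.0, np.nan, 410.0, 180.0],
    "order_date": ["05-01-2024", "06-01-2024", "07-01-2024", "09-01-2024"],
})

# Fill in missing values (mean for numeric, a placeholder for text).
raw["amount"] = raw["amount"].fillna(raw["amount"].mean())
raw["city"] = raw["city"].fillna("Unknown")

# Correct inconsistent data (mixed capitalisation of city names).
raw["city"] = raw["city"].str.title()

# Unified date format: parse the day-month-year strings into datetimes.
raw["order_date"] = pd.to_datetime(raw["order_date"], format="%d-%m-%Y")

print(raw)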
Combining Data from Different Data
Sources
• Joining tables
• Appending tables
• Using views to simulate data joins and
appends
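A minimal pandas sketch (table and column names are assumed for illustration) of the first two options: joining tables on a key and appending tables that share the same columns.

```python
# A minimal sketch of joining and appending tables with pandas.
# The tables and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount": [250, 410, 180],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["Karur", "Trichy"],
})

# Joining tables: enrich each order with customer information via a key.
joined = orders.merge(customers, on="customer_id", how="left")

# Appending tables: stack two tables that have the same columns.
new_orders = pd.DataFrame({
    "order_id": [4],
    "customer_id": [11],
    "amount": [95],
})
appended = pd.concat([orders, new_orders], ignore_index=True)

print(joined)
print(appended)
```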
Transforming Data
Euclidean distance :
Euclidean distance = √((x₁ − x₂)² + (y₁ − y₂)²)
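A short sketch (plain Python, added for illustration) of this distance calculation between two points:

```python
# A minimal sketch computing the Euclidean distance between two points.
import math

def euclidean_distance(p1, p2):
    """Distance between points p1 = (x1, y1) and p2 = (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0, the classic 3-4-5 triangle
```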
Turning variables into dummies :
• Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0).
They're used to indicate the absence of a categorical effect that may explain the observation.
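A minimal pandas sketch (the column is an assumed example) of turning a categorical variable into dummy variables:

```python
# A minimal sketch of creating dummy (0/1) variables with pandas.
import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Sun"]})

# Each category becomes its own 0/1 column.
dummies = pd.get_dummies(df["weekday"], prefix="day", dtype=int)
print(dummies)  # columns day_Mon, day_Sun, day_Tue with 0/1 entries
```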
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a general
approach to exploring datasets by means of
simple summary statistics and graphic
visualizations in order to gain a deeper
understanding of data.
EDA is an approach/philosophy for data
analysis that employs a variety of
techniques to:

1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
With EDA, following functions are
performed:
1. Describe the user's data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
Box plots are an excellent tool for conveying
location and variation information in data sets.
• Exploratory data analysis is majorly performed using
the following methods:
• 1. Univariate analysis: Provides summary statistics for
each field in the raw data set (or) summary only on
one variable. Ex : CDF,PDF,Box plot
• 2. Bivariate analysis is performed to find the
relationship between each variable in the dataset and
the target variable of interest (or) using two variables
and finding relationship between them. Ex: Boxplot,
Violin plot.
• 3. Multivariate analysis is performed to understand
interactions between different fields in the dataset (or)
finding interactions between more than two variables.
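A minimal sketch (assuming pandas, matplotlib and seaborn are installed, and using seaborn's bundled "tips" example dataset, which is downloaded on first use) showing one univariate, one bivariate and one multivariate view:

```python
# A minimal EDA sketch: univariate, bivariate and multivariate views.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Univariate: summary statistics and a box plot of a single variable.
print(tips["total_bill"].describe())
sns.boxplot(y=tips["total_bill"])
plt.show()

# Bivariate: relationship between one variable and a target of interest.
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

# Multivariate: pairwise interactions between several numeric fields.
sns.pairplot(tips[["total_bill", "tip", "size"]])
plt.show()
```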
1. Minimum score: The lowest score, excluding outliers.
2. Lower quartile: 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the
line that divides the box into two parts.
4. Upper quartile: 75% of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the
middle 50%.
7. The interquartile range: This is the box in the box plot, showing the middle
50% of scores.
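A small numpy sketch (added for illustration; the scores are made up) computing the quartiles and interquartile range that a box plot displays:

```python
# A minimal sketch computing the five-number summary behind a box plot.
import numpy as np

scores = np.array([42, 47, 51, 55, 58, 60, 63, 67, 71, 88])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1                      # the interquartile range (the "box")
lower_whisker = q1 - 1.5 * iqr     # a common rule: points beyond these
upper_whisker = q3 + 1.5 * iqr     # limits are treated as outliers

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Whisker limits:", lower_whisker, upper_whisker)
```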
Build the Models
To build the model, the data should be clean and its content
properly understood. The components of model building are as follows:
a) Selection of model and variables
b) Execution of the model
c) Model diagnostics and model comparison

• Building a model is an iterative process. Most models consist of the
following main steps:
1. Selection of a modeling technique and variables to enter in the
model
2. Execution of the model
3. Diagnosis and model comparison
Model and Variable Selection

• For this phase, consider model performance and whether the
project meets all the requirements to use the model, as well
as other factors:

1. Must the model be moved to a production environment and,
if so, would it be easy to implement?
2. How difficult is the maintenance on the model: how long will it
remain relevant if left untouched?
3. Does the model need to be easy to explain?


Model Execution
Various programming languages are used for implementing the
model. For model execution, Python provides libraries like
StatsModels or Scikit-learn. These packages use several of the
most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process. Following are the
remarks on output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is
easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not
enough evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to
classify something, classification models are used. The k-nearest
neighbors method is one of the best-known methods.
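A minimal StatsModels sketch (the data is randomly generated purely for illustration) showing where the three remarks above appear in a fitted linear model: model fit (R-squared), predictor coefficients and predictor significance (p-values).

```python
# A minimal sketch of model execution with StatsModels (assumed installed).
# The data is synthetic; in practice the cleaned project data is used.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0, 2, size=100)   # known linear relationship

X = sm.add_constant(x)          # adds the intercept term
model = sm.OLS(y, X).fit()

print(model.rsquared)           # a) model fit: R-squared
print(model.params)             # b) coefficients: intercept and slope (~5 and ~3)
print(model.pvalues)            # c) predictor significance: p-values
print(model.summary())          # full report combining all of the above
```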
Following commercial tools are used :
• 1. SAS Enterprise Miner: This tool allows users to run
predictive and descriptive models based on large
volumes of data from across the enterprise.
• 2. SPSS Modeler: It offers methods to explore and
analyze data through a GUI.
• 3. Matlab: Provides a high-level language for performing
a variety of data analytics, algorithms and data
exploration.
• 4. Alpine Miner: This tool provides a GUI front end for
users to develop analytic workflows and interact with Big
Data tools and platforms on the back end.
Open Source tools:
• 1. R and PL/R: PL/R is a procedural language for
PostgreSQL with R.
• 2. Octave: A free software programming language for
computational modeling that has some of the functionality
of Matlab.
• 3. WEKA: It is a free data mining software package with
an analytic workbench. The functions created in WEKA
can be executed within Java code.
• 4. Python is a programming language that provides
toolkits for machine learning and analysis.
• 5. SQL in-database implementations, such as MADlib,
provide an alternative to in-memory desktop analytical
tools.
Model Diagnostics and Model Comparison

• Try to build multiple models and then select the best
one based on multiple criteria. Working with a
holdout sample helps the user pick the best-
performing model.
• The holdout method has two basic drawbacks :
• 1. It requires an extra dataset.
• 2. Because it is a single train-and-test experiment, the
holdout estimate of the error rate will be misleading
if we happen to get an "unfortunate" split.
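A minimal scikit-learn sketch (using the library's bundled iris dataset for illustration) of the holdout method: keep part of the data aside, train candidate models on the rest and compare them on the held-out part.

```python
# A minimal holdout-evaluation sketch with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: set aside 30% of the data purely for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train two candidate models and compare them on the holdout sample.
for k in (3, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}, holdout accuracy = {model.score(X_test, y_test):.3f}")
```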
Presenting Findings and Building
Applications
• The team delivers final reports, briefings, code
and technical documents.
• In addition, the team may run a pilot project to
implement the models in a production
environment.
• The last stage of the data science process is
where the user's soft skills will be most useful.
Data Mining
• Data mining refers to extracting or mining knowledge
from large amounts of data. It is a process of
discovering interesting patterns or knowledge from a
large amount of data stored either in databases, data
warehouses or other information repositories.
• Reasons for using data mining:
• 1. Knowledge discovery: To identify the invisible
correlation, patterns in the database.
• 2. Data visualization: To find sensible way of
displaying data.
• 3. Data correction: To identify and correct incomplete
and inconsistent data.
Functions of Data Mining
Different functions of data mining are:
• Characterization - General characteristics
• Association and correlation analysis - Rules
showing attribute-value conditions
• Classification and prediction - To predict some
missing data values
• Clustering analysis - Supports taxonomy
formation
• Evolution analysis - It may include characterization,
discrimination, association, classification or clustering
of time-related data
Data mining tasks can be classified into two
categories: Descriptive and Predictive.
Predictive Mining Tasks
To make predictions, predictive mining tasks perform inference
on the current data. Predictive analysis answers questions
about the future, using historical data as the chief
basis for decisions.
Descriptive Mining Task
Descriptive analytics, the conventional form of business
intelligence and data analysis, seeks to provide a depiction or
"summary view" of facts and figures in an understandable
format, to either inform or prepare data for further analysis.
Architecture of a Typical Data Mining System
• Components of data mining system
are
• Data source,
• Data warehouse server,
• Data mining engine,
• Pattern evaluation module,
• Graphical user interface and
• Knowledge base.
• The data source can be a database, data warehouse,
the WWW or another information repository.
Classification of DM System
Multi-dimensional View of Data Mining Classification
Data Warehousing
• Data warehousing is the process of constructing
and using a data warehouse. A data warehouse
is constructed by integrating data from multiple
heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries and
decision making.
• A data warehouse is typically loaded through an
extraction, transformation and loading (ETL)
process from multiple data sources.
• Databases and data warehouses are related but not the same.
Goals of data warehousing:

1. To help reporting as well as analysis.
2. Maintain the organization's historical
information.
3. Be the foundation for decision making.
Characteristics of Data Warehouse

• Subject oriented
• Integrated
• Non-volatile
• Time variant
Key characteristics of a Data Warehouse

1. Data is structured for simplicity of access and high-
speed query performance.
2. End users are time-sensitive and desire speed-of-
thought response times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps
many thousands of rows.
5. Both predefined and ad hoc queries are common.
6. The data load involves multiple sources and
transformations.
Multitier Architecture of Data Warehouse

• Data warehouse architecture is the design of an
organization's data storage framework.
• A data warehouse system can be constructed in three
ways. These approaches are classified by the
number of tiers in the architecture.
• a) Single-tier architecture.
• b) Two-tier architecture.
• c) Three-tier architecture (Multi-tier
architecture).
Three tier (Multi-tier) architecture:
• This is the most widely used architecture for data
warehouse systems.
• Three-tier architecture is sometimes called multi-
tier architecture.
• The middle tier is an Online Analytical Processing (OLAP)
server, implemented using the ROLAP or MOLAP model.
• ROLAP: Relational Online Analytical Processing.
• MOLAP: Multidimensional Online Analytical Processing.
• The top tier represents the front-end client
layer. The client level includes the tools
and Application Programming Interfaces (APIs)
used for high-level data analysis, querying and
reporting. Users can use reporting, query,
analysis or data mining tools.
Needs of Data Warehouse

• Business user:
• Store historical data:
• Make strategic decisions:
• For data consistency
• High response time:
Difference between ODS and Data Warehouse
Metadata
Metadata is simply defined as data about data.
The data that is used to represent other data is known
as metadata. In data warehousing, metadata is one
of the essential aspects.
• We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the
warehouse objects.
c) Metadata acts as a directory. This directory helps
the decision support system to locate the contents
of a data warehouse.
Why is metadata necessary in a data
warehouse ?
a) First, it acts as the glue that links all parts of
the data warehouse.
b) Next, it provides information about the
contents and structures to the developers.
c) Finally, it opens the doors to the end-users
and makes the contents recognizable in their
terms.
Basic Statistical Descriptions of Data
