
ARULMURUGAN COLLEGE OF ENGINEERING, THENNILAI, KARUR

Department of Computer Science and Engineering

CS3352 - Foundation of Data Science
UNIT I : Introduction
Data Science: Benefits and uses - facets of data -
Defining research goals - Retrieving data - Data
preparation - Exploratory Data analysis - Build the
models - Presenting findings and building applications -
Data Warehousing - Basic Statistical descriptions of Data.
Data Science
• Data is measurable units of information
gathered or captured from the activity of people,
places and things.
Life cycle of data science:

• 1. Capture: Data acquisition, data entry, signal reception
and data extraction.
• 2. Maintain: Data warehousing, data cleansing, data
staging, data processing and data architecture.
• 3. Process: Data mining, clustering and classification, data
modeling and data summarization.
• 4. Analyze: Exploratory and confirmatory analysis,
predictive analysis, regression, text mining and qualitative
analysis.
• 5. Communicate: Data reporting, data visualization, business
intelligence and decision making.
Difference between Data Science and Big Data
Benefits and Uses of Data Science
Data science examples and applications:

a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as
"important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition: Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines
can refer users to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on
amenities
h) Optimization: Scheduling ride-share pickups and package deliveries


Facets of Data
Very large amounts of data are generated in big
data and data science. This data is of various types;
the main categories of data are as follows:

a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data

• Structured data is arranged in a row and
column format. It helps applications to
retrieve and process data easily. A database
management system is used for storing
structured data.
• An Excel table is an example of structured
data.
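As an illustrative sketch (not part of the original slides), a small pandas DataFrame shows how structured data keeps every record in named columns, which makes retrieval and processing straightforward. The table contents below are assumed examples.

```python
# A minimal sketch (assumes pandas is installed) of structured,
# row-and-column data similar to an Excel table.
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "department": ["CSE", "ECE", "CSE"],
    "salary": [52000, 48000, 61000],
})

# Because the structure is known, filtering and aggregation are simple.
cse_staff = employees[employees["department"] == "CSE"]
print(cse_staff)
print("Average salary:", employees["salary"].mean())
```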
Unstructured Data

Unstructured data is data that does not follow a
specified format. Rows and columns are not used
for unstructured data. Therefore it is difficult to
retrieve the required information. Unstructured
data has no identifiable structure.

Unstructured data can be in the form of text
(documents, email messages, customer feedback),
audio, video and images. Email is an example of
unstructured data.
Characteristics of unstructured data:

1. There is no structural restriction or binding for
the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural
rules.
4. There are no predefined formats, restrictions or
sequences for unstructured data.
5. Since there is no structural binding for
unstructured data, it is unpredictable in nature.
Natural Language

• Natural language is a special type of
unstructured data.
• Natural language processing enables
machines to recognize characters, words and
sentences, then apply meaning and
understanding to that information. This helps
machines to understand language as humans
do.
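A minimal sketch (plain Python, added for illustration; real NLP pipelines use libraries such as NLTK or spaCy) of the first step such systems typically perform, splitting raw text into word tokens and counting them:

```python
# A minimal sketch of basic processing on natural-language data:
# tokenise a sentence and count the words.
from collections import Counter

feedback = "The delivery was late. The food was cold but the staff was polite."

# Normalise case, strip punctuation, and split into word tokens.
tokens = [w.strip(".,!?").lower() for w in feedback.split()]
word_counts = Counter(tokens)

print(tokens[:5])          # ['the', 'delivery', 'was', 'late', 'the']
print(word_counts["the"])  # 3
```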
Machine-Generated Data

Machine-generated data is information that is
created without human interaction as a result of
a computer process or application activity.

Both Machine-to-Machine (M2M) and Human-to-
Machine (H2M) interactions generate machine
data.
Graph-based or Network Data

Graphs are data structures used to describe
relationships and interactions between entities
in complex systems.
In general, a graph contains a collection of
entities called nodes and another collection of
interactions between pairs of nodes called
edges.
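A minimal sketch (assuming the networkx package, which the slides do not mention) of graph-based data, where people are the nodes and friendships are the edges:

```python
# A minimal sketch of graph-based data using networkx (assumed installed);
# nodes are entities and edges are their interactions.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Asha", "Ravi"),    # an interaction between two entities
    ("Asha", "Meena"),
    ("Ravi", "Karthik"),
])

print(G.number_of_nodes())          # 4
print(G.number_of_edges())          # 3
print(list(G.neighbors("Asha")))    # ['Ravi', 'Meena']
```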
Audio, Image and Video

Audio, image and video are data types that pose
specific challenges to a data scientist. Tasks that
are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging
for computers.
Streaming Data

Streaming data is data that is generated
continuously by thousands of data sources,
which typically send in the data records
simultaneously and in small sizes (order of
kilobytes).
Difference between Structured and Unstructured Data
Data Science Process
Data science process consists of six stages :
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
Step 1: Discovery or Defining research
Goal
• This step involves understanding the business
problem, defining the goals of the project and
identifying the internal and external data sources
that can help to answer the business question.
Step 2: Retrieving data
• It is the collection of the data required for the
project. This is the process of gaining a business
understanding of the data the user has and
deciphering what each piece of data means.
Step 3: Data preparation
• Data can have many inconsistencies, like
missing values, blank columns and incorrect
data formats, which need to be cleaned. We
need to process, explore and condition data
before modeling. Clean data gives
better predictions.
Step 4: Data exploration
• Data exploration is related to a deeper
understanding of the data. Try to understand how
variables interact with each other, the
distribution of the data and whether there are
outliers. To achieve this, use descriptive
statistics, visual techniques and simple
modeling. This step is also called
Exploratory Data Analysis.
Step 5: Data modeling
The actual model-building process starts here.
The data scientist splits the dataset into training
and testing sets. Techniques like association,
classification and clustering are applied to the
training data set. The model, once prepared, is
tested against the "testing" dataset.
Step 6: Presentation and automation
• Deliver the final baselined model with
reports, code and technical documents in this
stage. The model is deployed into a real-time
production environment after thorough
testing.
Defining Research Goals
To understand the project, three concepts must be understood: what, why
and how.

a) What does the company or organization expect?
b) Why does the company's higher authority see value in such research?
c) How is it part of a bigger strategic picture?

The goal of the first phase is to answer these three questions.

In this phase, the data science team must learn and investigate the
problem, develop context and understanding, and learn about the data
sources needed and available for the project.
Defining Research Goals
1. Learning the business domain :
2. Resources :
3. Frame the problem :
4. Identifying key stakeholders:
5. Interviewing the analytics sponsor:
6. Developing initial hypotheses:
7. Identifying potential data sources:
Learning the business domain :
Understanding the domain area of the problem
is essential
Resources :
As part of the discovery phase, the team needs
to assess the resources available to support the
project.
Resources include technology, tools, systems,
data and people.
Frame the problem :
Framing is the process of stating the analytics
problem to be solved. At this point, it is a best
practice to write down the problem statement
and share it with the key stakeholders.
Identifying key stakeholders:
The team can identify the success criteria, key
risks and stakeholders, which should include
anyone who will benefit from the project or will
be significantly impacted by the project.
Interviewing the analytics sponsor:
The team should plan to collaborate with the
stakeholders to clarify and frame the analytics
problem.
This person understands the problem and
usually has an idea of a potential working
solution.
Developing initial hypotheses:
These initial hypotheses form the basis of the
analytical tests the team will use in later phases and
serve as the foundation for the project findings.
Identifying potential data sources:

Consider the volume, type and time span of the
data needed to test the hypotheses. Ensure that
the team can access more than simply
aggregated data. In most cases, the team will
need the raw data to avoid introducing bias into
the downstream analysis.
Retrieving Data
• Much high-quality data is freely available
for public and commercial use. Data can be
stored in various formats, such as text files
and tables in a database. Data may be internal
or external.
• Many companies will have already collected
and stored the data.
The term data repository describes several
ways to collect and store data:

• Data warehouse
• Data lake
• Data marts
• Metadata
• Data cubes
Advantages of data repositories:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data
reporting.
iii. Database administrators have easier time
tracking problems.
iv. There is value to storing and analyzing data.
Disadvantages of data repositories :
• i. Growing data sets could slow down systems.
• ii. A system crash could affect all the data.
• iii. Unauthorized users can access all sensitive
data more easily than if it was distributed
across several locations.
Data Preparation
Data preparation means cleansing, integrating
and transforming data.
Data Cleaning
Data is cleansed through processes such as
filling in missing values, smoothing the noisy
data or resolving the inconsistencies in the data.
Data cleaning tasks are as follows (a short pandas
sketch follows the list):
• 1. Data acquisition and metadata
• 2. Fill in missing values
• 3. Unified date format
• 4. Converting nominal to numeric
• 5. Identify outliers and smooth out noisy data
• 6. Correct inconsistent data
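A minimal, illustrative pandas sketch (the column names and cleaning rules are assumptions, not from the slides) covering three of these tasks: filling missing values, correcting inconsistent labels and unifying the date format.

```python
# A minimal data-cleaning sketch with pandas (assumed installed).
# Column names and cleaning rules are illustrative assumptions.
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Meena", "Karthik"],
    "city": ["Karur", "karur", "Trichy", None],
    "amount": [250.0, np.nan, 410.0, 180.0],
    "order_date": ["05-01-2024", "06-01-2024", "07-01-2024", "09-01-2024"],
})

# Fill in missing values (mean for numeric, a placeholder for text).
raw["amount"] = raw["amount"].fillna(raw["amount"].mean())
raw["city"] = raw["city"].fillna("Unknown")

# Correct inconsistent data (mixed capitalisation of city names).
raw["city"] = raw["city"].str.title()

# Unified date format: parse the day-month-year strings into datetimes.
raw["order_date"] = pd.to_datetime(raw["order_date"], format="%d-%m-%Y")

print(raw)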
Combining Data from Different Data
Sources
• Joining tables
• Appending tables
• Using views to simulate data joins and
appends
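A minimal pandas sketch (table and column names are assumed for illustration) of the first two options: joining tables on a key and appending tables that share the same columns.

```python
# A minimal sketch of joining and appending tables with pandas.
# The tables and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount": [250, 410, 180],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["Karur", "Trichy"],
})

# Joining tables: enrich each order with customer information via a key.
joined = orders.merge(customers, on="customer_id", how="left")

# Appending tables: stack two tables that have the same columns.
new_orders = pd.DataFrame({
    "order_id": [4],
    "customer_id": [11],
    "amount": [95],
})
appended = pd.concat([orders, new_orders], ignore_index=True)

print(joined)
print(appended)
```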
Transforming Data
Euclidean distance :
Euclidean distance = √((x₁ − x₂)² + (y₁ − y₂)²)
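A short sketch (plain Python, added for illustration) of this distance calculation between two points:

```python
# A minimal sketch computing the Euclidean distance between two points.
import math

def euclidean_distance(p1, p2):
    """Distance between points p1 = (x1, y1) and p2 = (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0, the classic 3-4-5 triangle
```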
Turning variables into dummies :
• Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0).
They're used to indicate the absence of a categorical effect that may explain the observation.
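A minimal pandas sketch (the column is an assumed example) of turning a categorical variable into dummy variables:

```python
# A minimal sketch of creating dummy (0/1) variables with pandas.
import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Sun"]})

# Each category becomes its own 0/1 column.
dummies = pd.get_dummies(df["weekday"], prefix="day", dtype=int)
print(dummies)  # columns day_Mon, day_Sun, day_Tue with 0/1 entries
```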
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a general
approach to exploring datasets by means of
simple summary statistics and graphic
visualizations in order to gain a deeper
understanding of data.
EDA is an approach/philosophy for data
analysis that employs a variety of
techniques to:

1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
With EDA, following functions are
performed:
1. Describe the user's data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
Box plots are an excellent tool for conveying
location and variation information in data sets.
• Exploratory data analysis is majorly performed using
the following methods:
• 1. Univariate analysis: Provides summary statistics for
each field in the raw data set (or) summary only on
one variable. Ex : CDF,PDF,Box plot
• 2. Bivariate analysis is performed to find the
relationship between each variable in the dataset and
the target variable of interest (or) using two variables
and finding relationship between them. Ex: Boxplot,
Violin plot.
• 3. Multivariate analysis is performed to understand
interactions between different fields in the dataset (or)
finding interactions between more than two variables.
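A minimal sketch (assuming pandas, matplotlib and seaborn are installed, and using seaborn's bundled "tips" example dataset, which is downloaded on first use) showing one univariate, one bivariate and one multivariate view:

```python
# A minimal EDA sketch: univariate, bivariate and multivariate views.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Univariate: summary statistics and a box plot of a single variable.
print(tips["total_bill"].describe())
sns.boxplot(y=tips["total_bill"])
plt.show()

# Bivariate: relationship between one variable and a target of interest.
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

# Multivariate: pairwise interactions between several numeric fields.
sns.pairplot(tips[["total_bill", "tip", "size"]])
plt.show()
```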
1. Minimum score: The lowest score, excluding outliers.
2. Lower quartile: 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the
line that divides the box into two parts.
4. Upper quartile: 75% of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the
middle 50%.
7. The interquartile range: This is the box in the box plot, showing the middle
50% of scores.
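A small numpy sketch (added for illustration; the scores are made up) computing the quartiles and interquartile range that a box plot displays:

```python
# A minimal sketch computing the five-number summary behind a box plot.
import numpy as np

scores = np.array([42, 47, 51, 55, 58, 60, 63, 67, 71, 88])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1                      # the interquartile range (the "box")
lower_whisker = q1 - 1.5 * iqr     # a common rule: points beyond these
upper_whisker = q3 + 1.5 * iqr     # limits are treated as outliers

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Whisker limits:", lower_whisker, upper_whisker)
```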
Build the Models
To build the model, the data should be clean and its content
properly understood. The components of model building are as follows:
a) Selection of model and variables
b) Execution of the model
c) Model diagnostics and model comparison

• Building a model is an iterative process. Most models consist of the
following main steps:
1. Selection of a modeling technique and variables to enter in the
model
2. Execution of the model
3. Diagnosis and model comparison
Model and Variable Selection

• For this phase, consider model performance and whether the
project meets all the requirements to use the model, as well
as other factors:

1. Must the model be moved to a production environment and,
if so, would it be easy to implement?
2. How difficult is the maintenance on the model: how long will it
remain relevant if left untouched?
3. Does the model need to be easy to explain?


Model Execution
Various programming languages are used for implementing the
model. For model execution, Python provides libraries like
StatsModels or Scikit-learn. These packages use several of the
most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process. Following are the
remarks on output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is
easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not
enough evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to
classify something, classification models are used. The k-nearest
neighbors method is one of the best-known methods.
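A minimal StatsModels sketch (the data is randomly generated purely for illustration) showing where the three remarks above appear in a fitted linear model: model fit (R-squared), predictor coefficients and predictor significance (p-values).

```python
# A minimal sketch of model execution with StatsModels (assumed installed).
# The data is synthetic; in practice the cleaned project data is used.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0, 2, size=100)   # known linear relationship

X = sm.add_constant(x)          # adds the intercept term
model = sm.OLS(y, X).fit()

print(model.rsquared)           # a) model fit: R-squared
print(model.params)             # b) coefficients: intercept and slope (~5 and ~3)
print(model.pvalues)            # c) predictor significance: p-values
print(model.summary())          # full report combining all of the above
```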
Following commercial tools are used :
• 1. SAS Enterprise Miner: This tool allows users to run
predictive and descriptive models based on large
volumes of data from across the enterprise.
• 2. SPSS Modeler: It offers methods to explore and
analyze data through a GUI.
• 3. Matlab: Provides a high-level language for performing
a variety of data analytics, algorithms and data
exploration.
• 4. Alpine Miner: This tool provides a GUI front end for
users to develop analytic workflows and interact with Big
Data tools and platforms on the back end.
Open Source tools:
• 1. R and PL/R: PL/R is a procedural language for
PostgreSQL with R.
• 2. Octave: A free software programming language for
computational modeling that has some of the functionality
of Matlab.
• 3. WEKA: It is a free data mining software package with
an analytic workbench. The functions created in WEKA
can be executed within Java code.
• 4. Python is a programming language that provides
toolkits for machine learning and analysis.
• 5. SQL in-database implementations, such as MADlib,
provide an alternative to in-memory desktop analytical
tools.
Model Diagnostics and Model Comparison

• Try to build multiple models and then select the best
one based on multiple criteria. Working with a
holdout sample helps the user pick the best-
performing model.
• The holdout method has two basic drawbacks :
• 1. It requires an extra dataset.
• 2. Because it is a single train-and-test experiment, the
holdout estimate of the error rate will be misleading
if we happen to get an "unfortunate" split.
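A minimal scikit-learn sketch (using the library's bundled iris dataset for illustration) of the holdout method: keep part of the data aside, train candidate models on the rest and compare them on the held-out part.

```python
# A minimal holdout-evaluation sketch with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: set aside 30% of the data purely for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train two candidate models and compare them on the holdout sample.
for k in (3, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}, holdout accuracy = {model.score(X_test, y_test):.3f}")
```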
Presenting Findings and Building
Applications
• The team delivers final reports, briefings, code
and technical documents.
• In addition, the team may run a pilot project to
implement the models in a production
environment.
• The last stage of the data science process is
where the user's soft skills will be most useful.
Data Mining
• Data mining refers to extracting or mining knowledge
from large amounts of data. It is a process of
discovering interesting patterns or knowledge from a
large amount of data stored either in databases, data
warehouses or other information repositories.
• Reasons for using data mining:
• 1. Knowledge discovery: To identify the invisible
correlation, patterns in the database.
• 2. Data visualization: To find sensible way of
displaying data.
• 3. Data correction: To identify and correct incomplete
and inconsistent data.
Functions of Data Mining
Different functions of data mining are:
• Characterization - General characteristics
• Association and correlation analysis - Rules
showing attribute-value conditions
• Classification and prediction - To predict some
missing data values
• Clustering analysis - Supports taxonomy
formation
• Evolution analysis - It may include characterization,
discrimination, association, classification or clustering
of time-related data
Data mining tasks can be classified into two
categories: Descriptive and Predictive.
Predictive Mining Tasks
To make predictions, predictive mining tasks perform inference
on the current data. Predictive analysis answers questions
about the future, using historical data as the chief
basis for decisions.
Descriptive Mining Task
Descriptive analytics, the conventional form of business
intelligence and data analysis, seeks to provide a depiction or
"summary view" of facts and figures in an understandable
format, to either inform or prepare data for further analysis.
Architecture of a Typical Data Mining System
• Components of data mining system
are
• Data source,
• Data warehouse server,
• Data mining engine,
• Pattern evaluation module,
• Graphical user interface and
• Knowledge base.
• The data source can be a database, data warehouse,
the WWW or another information repository.
Classification of DM System
Multi-dimensional View of Data Mining Classification
Data Warehousing
• Data warehousing is the process of constructing
and using a data warehouse. A data warehouse
is constructed by integrating data from multiple
heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries and
decision making.
• A data warehouse is typically loaded through an
extraction, transformation and loading (ETL)
process from multiple data sources.
• Databases and data warehouses are related but not the same.
Goals of data warehousing:

1. To help reporting as well as analysis.
2. Maintain the organization's historical
information.
3. Be the foundation for decision making.
Characteristics of Data Warehouse

• Subject oriented
• Integrated
• Non-volatile
• Time variant
Key characteristics of a Data Warehouse

1. Data is structured for simplicity of access and high-
speed query performance.
2. End users are time-sensitive and desire speed-of-
thought response times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps
many thousands of rows.
5. Both predefined and ad hoc queries are common.
6. The data load involves multiple sources and
transformations.
Multitier Architecture of Data Warehouse

• Data warehouse architecture is the design of an
organization's data storage framework.
• A data warehouse system can be constructed in three
ways. These approaches are classified by the
number of tiers in the architecture.
• a) Single-tier architecture.
• b) Two-tier architecture.
• c) Three-tier architecture (Multi-tier
architecture).
Three tier (Multi-tier) architecture:
• This is the most widely used architecture for data
warehouse systems.
• Three-tier architecture is sometimes called multi-
tier architecture.
• The middle tier is an Online Analytical Processing (OLAP)
server, implemented using the ROLAP or MOLAP model.
• ROLAP: Relational Online Analytical Processing.
• MOLAP: Multidimensional Online Analytical Processing.
• The top tier represents the front-end client
layer. The client level includes the tools
and Application Programming Interfaces (APIs)
used for high-level data analysis, querying and
reporting. Users can use reporting, query,
analysis or data mining tools.
Needs of Data Warehouse

• Business user:
• Store historical data:
• Make strategic decisions:
• For data consistency
• High response time:
Difference between ODS and Data Warehouse
Metadata
Metadata is simply defined as data about data.
The data that is used to represent other data is known
as metadata. In data warehousing, metadata is one
of the essential aspects.
• We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the
warehouse objects.
c) Metadata acts as a directory. This directory helps
the decision support system to locate the contents
of a data warehouse.
Why is metadata necessary in a data
warehouse ?
a) First, it acts as the glue that links all parts of
the data warehouse.
b) Next, it provides information about the
contents and structures to the developers.
c) Finally, it opens the doors to the end-users
and makes the contents recognizable in their
terms.
Basic Statistical Descriptions of Data
