FDSA UNIT 1 - AIDS Sem 4

Fundamentals of Data Science and Analytics (Mailam Engineering College)


AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 1

UNIT I – INTRODUCTION TO DATA SCIENCE

SYLLABUS:
Need for data science – benefits and uses – facets of data – data
science process – setting the research goal – retrieving data –
cleansing, integrating, and transforming data – exploratory data
analysis – build the models – presenting and building applications.

PART A
1. What is Big Data?
 Big data is data of such huge volume, high velocity and wide variety that it
cannot be processed by traditional data processing systems.
 It is characterized by the 7 Vs: velocity, variety, volume, variability,
visualization, value and veracity.

2. What are the Characteristics of Big Data?

 Velocity - refers to the speed at which data is generated and processed.
 Volume - refers to the amount of data.
 Value - refers to the benefits that the organization derives from the data.
 Variety - refers to the different types of big data.
 Veracity - refers to the accuracy of your data.
 Validity – refers to the relevance of data for the intended purpose.
 Volatility – refers to data that is constantly changing.
 Visualization - refers to showing the insights generated from your big data
through visual representations such as charts and graphs.

3. Define Data Science.

 Data science is the field of study of data, using modern scientific techniques,
statistical methods and algorithms to derive insights from huge volumes of
data and to create business and IT strategies.
 It deals with where the data comes from, what it represents, and the ways
in which it can be transformed into valuable inputs and resources.

4. What are the benefits and uses of Big Data?

Big data and data science are used by:

 Commercial Companies
 Human Resource professionals
 Financial institutions
 Governmental organizations
 Nongovernmental organizations (NGOs)
 Universities

PREPARED BY: Dr. S. ARTHEESWARI, Prof. & HEAD/AI&DS


5. List out the Facets of data.

The facets of data are categorized below,
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming

6. Define Structured data.

 Structured data is data that depends on a data model and resides in a
fixed field within a record.
 It’s easy to store structured data in tables within databases or Excel files.
 SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
 Example: Excel files

7. Define unstructured data

 Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying.
 Example: Email

8. What is Machine Generated Data?

 Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
 The analysis of machine data relies on highly scalable tools, due to its high
volume and speed.
 Examples: web server logs, call detail records, network event logs, and
telemetry

9. What is Streaming Data?

 The data flows into the system in a continuous manner when an event
happens instead of being loaded into a data store in a batch.
 Examples - “What’s trending” on Twitter, live sporting or music events, and
the stock market.

10. Define Graph based or Network data

 “Graph” points to mathematical graph theory.


 In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
 Graph or network data is data that focuses on the relationship or
adjacency of objects.
 Graph structures use nodes, edges, and properties to represent and store
graph data.
 Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
 Example: social media websites
o For instance, on LinkedIn you can see who you know at which
company.
o Your follower list on Twitter is another example of graph-based data.

11. List out the steps in Data Science Process

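The data science process typically consists of six steps:
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation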

12. What is meant by Project Charter?

 All the information related to the research goal is best collected in a
project charter.
 A project charter requires teamwork, and its input covers at least the following:
o A clear research goal
o The project mission and context
o How to perform the analysis
o What resources to use
o Proof that it’s an achievable project, or a proof of concept
o Deliverables and a measure of success
o A timeline


13. How is data retrieved in the data science process?

 The second step is to collect data by finding suitable data and getting
access to the data from the data owner.
 Data can also be delivered by third-party companies and take many
forms ranging from Excel spreadsheets to different types of databases.
 The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.

14. What is a Data Repository?

 A data repository is also known as a data library or data archive.
 A data repository is a large database infrastructure, often made up of several
databases, that collects, manages, and stores data sets for data analysis,
sharing and reporting.
 Examples: database, data warehouse, data mart, data lake.

15. Difference between Data Mart and Data Warehouse.

Data Warehouse | Data Mart
Data Warehouse stores a large amount of data which is collected from different sources | Data Mart contains only the specific data from the data warehouse which is required by the company for analysis
Data Warehouse is focused on all departments in an organization | Data Mart focuses on a specific group
Data Warehouse designing process is complicated | Data Mart process is easy to design
Data Warehouse takes a long time for data handling | Data Mart takes a short time for data handling
Data Warehouse size range is 100 GB to 1 TB+ | Data Mart size is less than 100 GB

16. Define Data Lake.

 A data lake is a large data repository that stores data in its natural or raw
format, including unstructured data, classified and tagged with metadata.

17. What is Exploratory Data Analysis (EDA)?

 Data exploration is concerned with building a deeper understanding of the
data to know how variables interact with each other, the distribution of
the data, and whether there are outliers.


18. Define Data Modeling.

 Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
 Models consist of the following main steps:
o Selection of a modeling technique and variables to enter in the model
o Execution of the model
o Diagnosis and model comparison

19. Define linking and brushing technique.

 Brushing and linking lets you combine and link different graphs and
tables so that changes in one graph are automatically transferred to the other
graphs.

20. What is Histogram and Boxplot?

 In a histogram, a variable is cut into discrete categories (bins) and the number of
occurrences in each category is counted and shown in the graph.
 The boxplot doesn’t show how many observations are present but does offer
an impression of the distribution within categories.
 It can show the maximum, minimum, median, and other characterizing
measures at the same time.

21. Define Presentation and automation steps in Data Science process.

 Finally, the results are presented to the business.
 These results can take many forms, ranging from presentations to research
reports.
 Sometimes you need to automate the execution of the process because the
business will want to use the insights gained in another project or enable an
operational process to use the outcome from the model.

22. Discuss the three sub-phases of Data preparation.

 This includes transforming the data from a raw form into data that’s directly
usable in your models.


 This phase consists of three sub-phases:

i) Data cleansing removes false values from a data source and
inconsistencies across data sources,
ii) Data integration enriches data sources by combining information from
multiple data sources, and
iii) Data transformation ensures that the data is in a suitable format for
use in your models.

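23. Define common errors that occur during cleansing data.

 Errors pointing to false values within one data set: data entry errors,
redundant whitespace, impossible values, outliers, and missing values.
 Errors pointing to inconsistencies between data sets: deviations from a
code book, different units of measurement, and different levels of aggregation.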

24. Define outlier.

 An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a different
logic or generative process than the other observations.
 The easiest way to find outliers is to use a plot or a table with the
minimum and maximum values.


PART B
1. Describe data science and its applications, and
discuss the benefits and uses of Data Science and Big Data.

Contents
 Big Data
 Data Science
 Benefits and Uses:
1. Commercial Companies
2. Human Resource Professionals
3. Financial Institutions
4. Government Organizations
5. Non-governmental organizations (NGOs)
6. Universities
 Data Science Tools
 Real Time Applications of Data Science

Data
 Data is a collection of discrete values that convey information,
describing quantities, qualities, facts and statistics.

Big data
 Big data is data of such huge volume, high velocity and wide variety that it
cannot be processed by traditional data processing systems.
 It is characterized by the 7 Vs: velocity, variety, volume,
variability, visualization, value and veracity.

Data science
 Data science is the field of study of data, using modern scientific
techniques, statistical methods and algorithms to derive insights
from huge volumes of data and to create business and IT strategies.
 It deals with where the data comes from, what it represents, and
the ways in which it can be transformed into valuable inputs and
resources.

Benefits and uses of data science

1. Commercial Companies
 Commercial companies use data science to gain insights into their
customers, processes, staff, competition, and products.


 Many companies use data science to offer customers a better user
experience, cross-sell, up-sell, and personalize their offerings.
 Example:
o Google AdSense - collects data from internet users so relevant
commercial messages can be matched to the person browsing the
internet.
o MaxPoint - example of real-time personalized advertising.
2. Human Resource Professionals
 Human resource professionals use people analytics and text mining
to screen candidates, monitor the mood of employees, and study
informal networks among co-workers.
3. Financial Institutions
 Financial institutions use data science to predict stock markets,
determine the risk of lending money, and learn how to attract new
clients for their services.
4. Government Organizations
 Governmental organizations are also aware of data’s value.
 Example:
o Data.gov is the home of the US Government’s open data.
5. Non-governmental organizations (NGOs)
 Non-governmental organizations (NGOs) use data science to raise money and
defend their causes.
 Example:
o The World Wildlife Fund (WWF) employs data scientists to
increase the effectiveness of its fundraising efforts.
o DataKind is a data scientist group that devotes its time to the
benefit of mankind.
6. Universities
 Universities use data science in their research and also to enhance the study
experience of their students.
 Example:
o The rise of massive open online courses (MOOCs) produces a lot of
data, which allows universities to study how this type of learning
can complement traditional classes.
o Examples of MOOC platforms are Coursera, Udacity, and edX.

Data Science Tools

1. SAS - statistical analysis
2. Apache Spark - batch processing and stream processing
3. BigML - machine learning platform
4. MATLAB - numerical and mathematical computing
5. Tableau - data visualization software


6. Jupyter - interactive notebooks for writing and running code, for example in Python
7. Matplotlib - library for plotting and visualization in Python
8. NLTK - natural language processing
9. TensorFlow - machine learning and deep learning library
10. NumPy - numerical Python for array computing and data analysis
11. SciPy - scientific Python for scientific and technical
computations
12. pandas - used for data analysis

Real Time Applications of Data Science

 Fraud and Risk Detection
 Healthcare
o Medical Image Analysis
o Medical Drug Development
o Virtual Assistance for patients and customer support
 Internet Search
 Targeted Advertising
 Website Recommendation
 Speech Recognition
 Gaming
 Augmented Reality
 Robotics

2. List and explain the facets of data or different types of data or categories of
data.

Contents
1. Structured
2. Unstructured
3. Natural Language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming

 Categories of data:
1. Structured data
 Structured data is data that depends on a data model and resides in a
fixed field within a record.
 It’s easy to store structured data in tables within databases or Excel files.


 SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
Example: Refer to Figure 1.1
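As a minimal sketch of why structured data is easy to query, the snippet below uses Python's built-in sqlite3 module. The table name people, its columns and its rows are hypothetical and are not taken from Figure 1.1.

```python
import sqlite3

# Hypothetical table: every record has the same fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Asha", 31, "Chennai"), ("Ravi", 24, "Madurai"), ("Meena", 45, "Chennai")],
)

# Because the data follows a data model, a declarative SQL query is enough.
rows = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY age", ("Chennai",)
).fetchall()
print(rows)
conn.close()
```

The same question asked of free text such as email would first require extracting the fields, which is what makes unstructured data harder to work with.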

2. Unstructured data
 Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying.
 Example - regular email. (Figure 1.2).

 As Figure 1.2 shows, although email contains structured elements such as the
sender, title, and body text, it’s a challenge to find the number of people who
have written an email complaint about a specific employee, because so
many ways exist to refer to a person.


3. Natural language
 Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific data
science techniques and linguistics.
 The natural language processing community had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize
well to other domains.
4. Machine-generated data
 Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
 The analysis of machine data relies on highly scalable tools, due to its
high volume and speed.
 Examples - web server logs, call detail records, network event logs, and
telemetry (Figure 1.3).

 The machine data in figure 1.3 would fit nicely in a classic table-
structured database.
 This isn’t the best approach for highly interconnected or “networked”
data, where the relationships between entities have a valuable role to
play.
5. Graph-based or network data
 “Graph” points to mathematical graph theory.
 In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
 Graph or network data is data that focuses on the relationship or
adjacency of objects.
 Graph structures use nodes, edges, and properties to represent and
store graph data.


 Graph-based data is a natural way to represent social networks, and
its structure allows you to calculate specific metrics such as the influence of a
person and the shortest path between two people.
 Example: graph-based data can be found on many social media websites,
such as the follower list on Twitter (figure 1.4).

 Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
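The shortest path between two people mentioned above can be sketched without a graph database. The example below stores a small, hypothetical "follows" network as an adjacency list and uses breadth-first search, one common way to find a shortest path in an unweighted graph. The names and edges are invented for illustration.

```python
from collections import deque

# Hypothetical "follows" relationships stored as an adjacency list:
# each node (person) maps to the nodes it has an edge to.
follows = {
    "ann": ["bob", "cara"],
    "bob": ["dev"],
    "cara": ["dev"],
    "dev": ["eli"],
    "eli": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

path = shortest_path(follows, "ann", "eli")
print(path)
```

Edges here are directed, so a path from ann to eli does not imply a path back from eli to ann.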
6. Audio, image, and video
 Audio, image, and video are data types that pose specific challenges to
a data scientist.
 Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
 High-speed cameras at stadiums will capture ball and athlete movements
to calculate in real time, for example, the path taken by a defender
relative to two baselines.
 Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games.
 This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
 This prompted Google to buy the company for their own Artificial
Intelligence (AI) development plans.
7. Streaming data
 The data flows into the system in a continuous manner when an event
happens instead of being loaded into a data store in a batch.
 Examples - “What’s trending” on Twitter, live sporting or music events,
and the stock market.
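To illustrate the difference from batch loading, the sketch below processes values one at a time as they arrive and keeps only a running summary. The price list stands in for a real event source and is hypothetical.

```python
def running_average(events):
    """Consume events one at a time and yield the average seen so far."""
    total = 0.0
    count = 0
    for value in events:
        total += value
        count += 1
        yield total / count

# Hypothetical stream of stock prices arriving one by one.
price_stream = iter([100.0, 102.0, 101.0, 105.0])
averages = list(running_average(price_stream))
print(averages)
```

The generator never needs the whole data set in memory, which is the main design point when data keeps flowing in.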


3. Explain in detail the data science process with examples.

Content:

 The data science process – An Overview

Figure 1.5: Steps of Data Science Process


 The data science process typically consists of six steps, as shown in
figure 1.5.
1. Setting the research goal

 The first step of this process is defining a research goal by creating a
project charter.
 A project charter requires teamwork, and input covers at least the
following:
o A clear research goal
o The project mission and context
o How to perform analysis
o What data and resources to use
o Proof that it’s an achievable project, or a proof of concept
o Deliverables and a measure of success
o A timeline
2. Retrieving data

 The second step is to collect data by finding suitable data and getting
access to the data from the data owner.
 Start with data stored within the company
o The data can be stored in official data repositories such as
databases, data marts, data warehouses, and data lakes
maintained by a team of IT professionals.
o The primary goal of a database is data storage, while a data
warehouse is designed for reading and analyzing that data.
o A data mart is a subset of the data warehouse and geared toward
serving a specific business unit.
o While data warehouses and data marts are home to preprocessed
data, data lakes contain data in its natural or raw format, which
probably needs polishing and transformation before it becomes
usable.
 Don’t be afraid to shop around
o Many companies specialize in collecting valuable information.
o Data can also be delivered by third-party companies and take
many forms ranging from Excel spreadsheets to different types of
databases. Refer Table 1.2


Table 1.2 – Open Data Sites

 Do data quality checks to prevent problems later

o Expect to spend a good portion of your project time doing data
correction and cleansing, sometimes up to 80%.

3. Data preparation

 Data collection is an error-prone process; this phase enhances the
quality of the data and prepares it for use in subsequent steps.
 This phase consists of three sub-phases:
1. Data cleansing - Data cleansing is a sub-process of the data science
process that removes false values from a data source and inconsistencies
across data sources.
Types of errors
 Interpretation error – taking a value in the data for granted.
Example: accepting that a person’s age is greater than 300 years
 Inconsistencies – between data sources or against standardized values.
Example: putting “Female” in one table and “F” in another


Common Errors
Table 1.3 – Common Errors

1. Data Entry Errors

 Data collection and data entry are error-prone processes.
 They often require human intervention, and because humans
are only human, they make typos or lose their concentration
for a second and introduce an error into the chain.
 Example: correcting known typos with simple rules
x = "Godo"  # example of a mistyped entry
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
2. Redundant Whitespace
Redundant whitespace tends to be hard to detect but causes errors, for
example when a key with a trailing space fails to match the same key
without it. Luckily, fixing it is easy in most programming languages:
they provide string functions that remove leading and trailing whitespace.
Example:
In Python, the strip() method removes leading and trailing
whitespace.
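A minimal sketch of the whitespace problem and its fix. The raw values are hypothetical.

```python
# Hypothetical raw values with redundant whitespace.
raw_values = ["  FR", "FR ", " FR  ", "FR"]

# Without cleaning, the same key looks like four different values.
print(len(set(raw_values)))

# strip() removes leading and trailing whitespace;
# lstrip() and rstrip() handle one side only.
cleaned = [value.strip() for value in raw_values]
print(set(cleaned))
```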
3. Impossible Values And Sanity Checks
Sanity checks are another valuable type of data check.
Example:
Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
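The rule above can be applied to a whole column to list the suspicious observations. This is a small sketch; the ages and the helper name is_plausible_age are hypothetical.

```python
def is_plausible_age(age):
    """Rule-based sanity check: age must lie between 0 and 120 inclusive."""
    return 0 <= age <= 120

ages = [25, -3, 47, 300, 120]  # hypothetical observations
suspicious = [age for age in ages if not is_plausible_age(age)]
print(suspicious)
```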
4. Outliers
An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a
different logic or generative process than the other observations.
The easiest way to find outliers is to use a plot or a table with
the minimum and maximum values.
An example is shown in figure 1.6.

Figure 1.6 Distribution plots are helpful in detecting outliers and helping you understand the variable.
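As a rough sketch of the "table with the minimum and maximum values" idea, the code below prints the extremes and then flags values outside 1.5 times the interquartile range. The IQR rule is a common convention chosen here for illustration, not something prescribed by these notes, and the measurements are hypothetical.

```python
import statistics

values = [12, 14, 15, 15, 16, 17, 18, 95]  # hypothetical measurements

# A quick look at the extremes often reveals a suspicious value.
print(min(values), max(values))

# IQR rule: flag values far below the first or far above the third quartile.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still has to be judged from the context.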

5. Dealing With Missing Values

Missing values aren’t necessarily wrong, but you still need to handle
them separately.


Table 1.4 An overview of techniques to handle missing data
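The contents of Table 1.4 are not reproduced in this copy of the notes, so the sketch below only illustrates two widely used options, omitting the observation and imputing a static value such as the mean. The readings are hypothetical and None is assumed to mark a missing value.

```python
import statistics

# Hypothetical readings; None marks a missing value.
readings = [4.0, None, 6.0, 5.0, None, 9.0]

# Option 1: omit observations with a missing value.
omitted = [r for r in readings if r is not None]

# Option 2: impute a static value, here the mean of the observed values.
mean_value = statistics.mean(omitted)
imputed = [mean_value if r is None else r for r in readings]

print(omitted)
print(imputed)
```

Omitting loses information from the dropped observations, while imputing keeps the rows but can distort the distribution, so the choice depends on the model and the data.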

6. Deviations From A Code Book

 Detecting errors in larger data sets against a code book or against
standardized values can be done with the help of set operations.
 A code book is a description of your data, a form of metadata.
 It contains things such as the number of variables per observation,
the number of observations, and what each encoding within a
variable means.
 (For instance “0” equals “negative”, “5” stands for “very positive”.)
 A code book also tells you the type of data you’re looking at: is it hierarchical,
graph, or something else?
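The set operations mentioned above can be sketched as follows. The code book holds only the two encodings quoted in the notes, and the observed values are hypothetical.

```python
# Code book for one variable (only the two encodings quoted above are shown).
code_book = {"0": "negative", "5": "very positive"}

# Hypothetical observed values for that variable.
observed = ["0", "5", "5", "7", "0", "x"]

# Set difference: values in the data that the code book does not define.
unknown_codes = set(observed) - set(code_book)
print(sorted(unknown_codes))
```

Working on the set of distinct values keeps the check cheap even when the data set has many rows.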
7. Different Units Of Measurement
 When integrating two data sets, you should pay attention to their
respective units of measurement.
 An example of this would be studying the prices of gasoline around
the world using data gathered from different data providers.
 Some data sets can contain prices per gallon and others can contain
prices per liter.
 A simple conversion will do the trick in this case.
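A minimal sketch of such a conversion, assuming the per-gallon prices refer to the US liquid gallon (the imperial gallon is larger) and that both providers quote the same currency. The prices are hypothetical.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact by definition of the US liquid gallon

def price_per_liter(price_per_gallon):
    """Convert a price quoted per US gallon into a price per liter."""
    return price_per_gallon / LITERS_PER_US_GALLON

# Hypothetical prices from two providers, in the same currency.
provider_a_per_gallon = [3.50, 3.80]
provider_b_per_liter = [0.95, 1.02]

# After conversion, both sources use the same unit and can be combined.
combined_per_liter = [price_per_liter(p) for p in provider_a_per_gallon] + provider_b_per_liter
print([round(p, 3) for p in combined_per_liter])
```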
8. Different Levels Of Aggregation
 Having different levels of aggregation is similar to having different
types of measurement.


 An example of this would be a data set containing data per week
versus one containing data per work week.
 This type of error is generally easy to detect, and summarizing (or
the inverse, expanding) the data sets will fix it.
 After cleaning the data errors, combine information from different
data sources.
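Summarizing a finer-grained data set up to a coarser level can be sketched as below. The daily sales figures are hypothetical, and grouping by ISO week is one possible definition of "per week".

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, amount).
daily_sales = [
    (date(2024, 1, 1), 10),  # Monday of ISO week 1
    (date(2024, 1, 3), 5),
    (date(2024, 1, 8), 7),   # Monday of ISO week 2
    (date(2024, 1, 9), 3),
]

# Summarize the daily data up to (ISO year, ISO week).
weekly_sales = defaultdict(int)
for day, amount in daily_sales:
    iso_year, iso_week, _ = day.isocalendar()
    weekly_sales[(iso_year, iso_week)] += amount

print(dict(weekly_sales))
```

Once both data sets are at the weekly level, they can be combined row by row.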

Correct errors as early as possible

 Data should be cleansed when acquired for many reasons:
o Decision-makers may make costly mistakes on information based
on incorrect data from applications that fail to correct for the
faulty data.
o If errors are not corrected early on in the process, the cleansing
will have to be done for every project that uses that data.
o Data errors may point to a business process that isn’t working as
designed.
o Data errors may point to defective equipment, such as broken
transmission lines and defective sensors.
o Data errors can point to bugs in software or in the integration of
software that may be critical to the company.

2. Data integration - combining data from different data sources

The different ways of combining data
 The first operation is joining: enriching an observation from one table
with information from another table.
 The second operation is appending or stacking: adding the
observations of one table to those of another table.
1. Joining Tables
 Joining tables allows you to combine the information of one
observation found in one table with the information found in
another table.
 To join tables, use variables that represent the same object in
both tables, such as a date or a country name.
 These common fields are known as keys.
 When these keys also uniquely define the records in the table,
they are called primary keys.


Example:

Figure 1.6: Joining two tables on the Item and Region keys
In figure 1.6, both tables contain the client name, and this
makes it easy to enrich the client expenditures with the region
of the client.
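A join of this kind can be sketched with Python's built-in sqlite3 module. The table names, column names and rows below are hypothetical stand-ins for the tables in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables that share the key column "client".
conn.execute("CREATE TABLE expenses (client TEXT, item TEXT, amount REAL)")
conn.execute("CREATE TABLE regions (client TEXT PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [("Client A", "Cola", 100.0), ("Client B", "Juice", 200.0)],
)
conn.executemany(
    "INSERT INTO regions VALUES (?, ?)",
    [("Client A", "North"), ("Client B", "South")],
)

# Enrich each expense row with the region of its client.
joined = conn.execute(
    "SELECT e.client, e.item, e.amount, r.region "
    "FROM expenses AS e JOIN regions AS r ON e.client = r.client "
    "ORDER BY e.amount"
).fetchall()
print(joined)
conn.close()
```

In the regions table the client column is a primary key, so each expense row matches at most one region and the join cannot duplicate observations.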

2. Appending or stacking:
 Appending or stacking tables is effectively adding observations
from one table to another table.
 The equivalent operation in set theory would be the union, and
this is also the command in SQL, the common language of
relational databases.
 Other set operators are also used in data science, such as set
difference and intersection.
Example:

Figure 1.7: Appending tables

As figure 1.7 shows, appending data from tables is a common
operation but requires an equal structure in the tables being
appended.
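Appending with the SQL union can be sketched as follows, again using sqlite3. The two monthly tables are hypothetical and deliberately have the same structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical tables with the same structure, one per month.
for table in ("sales_jan", "sales_feb"):
    conn.execute(f"CREATE TABLE {table} (client TEXT, amount REAL)")
conn.execute("INSERT INTO sales_jan VALUES ('A', 100.0)")
conn.execute("INSERT INTO sales_feb VALUES ('B', 250.0)")

# UNION ALL stacks the rows and keeps duplicates; plain UNION would drop them.
stacked = conn.execute(
    "SELECT client, amount FROM sales_jan "
    "UNION ALL "
    "SELECT client, amount FROM sales_feb "
    "ORDER BY client"
).fetchall()
print(stacked)
conn.close()
```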


3. View
 Views are a kind of virtual table.
 You can create a view by selecting fields from one or more tables
present in the database.
 A view can contain either all the rows of a table or only specific rows
that satisfy a certain condition.

Figure 1.8: Views
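The "virtual table" idea can be sketched with sqlite3: the view stores a query rather than a copy of the data. The table, view and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (client TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", "North", 100.0), ("B", "South", 250.0), ("C", "North", 75.0)],
)

# The view selects specific fields and only the rows that meet a condition.
conn.execute(
    "CREATE VIEW north_sales AS "
    "SELECT client, amount FROM sales WHERE region = 'North'"
)

# A row added to the base table later shows up in the view automatically.
conn.execute("INSERT INTO sales VALUES ('D', 'North', 20.0)")
view_rows = conn.execute(
    "SELECT client, amount FROM north_sales ORDER BY client"
).fetchall()
print(view_rows)
conn.close()
```

Because no rows are duplicated, a view avoids the extra storage that physically joining or appending tables would need.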

3. Data transformation
 Certain models require their data to be in a certain shape.
 Data transformation ensures that the data is in a suitable format for use in data
models.
 Relationships between an input variable and an output variable aren’t always
linear. Taking the log of the independent variable can simplify the
estimation problem dramatically.
Example – Refer Figure 1.9

Figure 1.9: Transformation


Figure 1.9 Transforming x to log x makes the relationship between x and y linear (right), compared with the non-log x (left).
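The effect in the figure can be checked numerically. The sketch below generates hypothetical data in which y depends linearly on log x and compares the Pearson correlation before and after the transformation. The helper pearson is written out here only to keep the example self-contained.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data generated so that y is linear in log(x).
xs = [1, 2, 4, 8, 16, 32, 64]
ys = [3 + 2 * math.log(x) for x in xs]

log_xs = [math.log(x) for x in xs]
r_raw = pearson(xs, ys)      # correlation of y with the raw input
r_log = pearson(log_xs, ys)  # correlation of y with the transformed input
print(round(r_raw, 3), round(r_log, 3))
```

A correlation close to 1 after the transformation means a simple linear model on log x is enough, which is the simplification the notes refer to.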

4. Data exploration or EDA (Exploratory Data Analysis)

o Data exploration is concerned with building a deeper understanding of
the data to know how variables interact with each other, the distribution
of the data, and whether there are outliers.
o The visualization techniques used in this phase range from simple line
graphs or histograms, to more complex diagrams such as Sankey and
network graphs.

 Graphs: - Simple and Combined Graphs

In figure 1.11, from top to bottom, a bar chart, a line plot, and a
distribution plot are some of the graphs used in exploratory analysis.

 Brushing and linking.

Brushing and linking lets you combine and link different graphs
and tables so that changes in one graph are automatically transferred to
the other graphs.


Figure 1.11 - Graphs used in exploratory analysis

Histogram
 In a histogram, a variable is cut into discrete categories (bins) and the
number of occurrences in each category is counted and shown in
the graph.

Figure 1.12 - Example Histogram

 Example – Figure 1.12 shows the number of people in the age groups
of 5-year intervals
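The counting behind a histogram with 5-year age groups can be sketched in a few lines. The ages are hypothetical and the bars are drawn as text so the example needs no plotting library.

```python
from collections import Counter

ages = [3, 7, 12, 14, 21, 22, 23, 38, 41, 44]  # hypothetical ages
BIN_WIDTH = 5

# Map every age to the lower edge of its 5-year bin, then count per bin.
counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# Print one text bar per non-empty bin.
for lower in sorted(counts):
    print(f"{lower:2d}-{lower + BIN_WIDTH - 1:2d} | {'#' * counts[lower]}")
```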


The boxplot
 The boxplot offers an impression of the distribution within categories.
 It can show the maximum, minimum, median, and other
characterizing measures at the same time.
Example:

Figure 1.13 - Boxplot

In figure 1.13 each user category has a distribution of the appreciation
each has for a certain picture on a photography website.
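The measures a boxplot draws can be computed directly. The sketch below derives the minimum, quartiles, median and maximum per category. The categories and scores are hypothetical, and the quartiles use the default method of statistics.quantiles, which is one of several conventions.

```python
import statistics

# Hypothetical appreciation scores per user category.
scores = {
    "beginner": [2, 3, 3, 4, 5, 6, 9],
    "expert": [5, 6, 7, 7, 8, 8, 9],
}

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

summaries = {category: five_number_summary(vals) for category, vals in scores.items()}
for category, summary in summaries.items():
    print(category, summary)
```

Comparing the summaries side by side gives the same impression as placing the boxes next to each other in the plot.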

5. Data modeling or model building

 Building a model is an iterative process that involves selecting the
variables for the model, executing the model, and model
diagnostics.
 Model building consists of the following main steps:
1. Selection of a modeling technique and variables to enter in the
model
2. Execution of the model
3. Diagnosis and model comparison
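The three steps can be sketched with a deliberately small example: two candidate input variables, a least-squares line as the modeling technique, and R squared as the diagnostic used for comparison. The data, variable names and the choice of technique are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: two candidate input variables for one target.
target = [10, 12, 14, 16, 18]
candidates = {"size": [1, 2, 3, 4, 5], "noise": [3, 1, 4, 1, 5]}

# 1. select a variable, 2. execute (fit) the model, 3. diagnose and compare.
scores = {}
for name, xs in candidates.items():
    intercept, slope = fit_line(xs, target)
    scores[name] = r_squared(xs, target, intercept, slope)

best = max(scores, key=scores.get)
print(best, scores)
```

In practice the comparison would be made on data the model has not seen, and the loop would be repeated as variables and techniques are revised, which is what makes the process iterative.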

6. Presentation and automation

 Finally, the results are presented to the business.


 These results can take many forms, ranging from presentations to
research reports.
 Sometimes you need to automate the execution of the process because the
business will want to use the insights gained in another project or enable an
operational process to use the outcome from the model.

17 pages
Yogananda Reddy Nusi: Sensitivity: Internal & Restricted
No ratings yet
Yogananda Reddy Nusi: Sensitivity: Internal & Restricted
7 pages
Fdsa Unit 1
No ratings yet
Fdsa Unit 1
25 pages
Chapter Two
No ratings yet
Chapter Two
14 pages
Anomaly Detection Firewalls Capabilities and Limitations ICCSE1.2018.8374204
No ratings yet
Anomaly Detection Firewalls Capabilities and Limitations ICCSE1.2018.8374204
5 pages
For Business. For Growth. For Life.: Apps Launch
No ratings yet
For Business. For Growth. For Life.: Apps Launch
20 pages
Challenges For Mapreduce in Big Data: Scholarship@Western
No ratings yet
Challenges For Mapreduce in Big Data: Scholarship@Western
10 pages
Chapter - 2 - Data Science
No ratings yet
Chapter - 2 - Data Science
33 pages
Valoración Y Negociación de Tecnología Step 1 - Identify Intellectual Property As An Asset
No ratings yet
Valoración Y Negociación de Tecnología Step 1 - Identify Intellectual Property As An Asset
9 pages
Fods QB
No ratings yet
Fods QB
35 pages
Universality of Preference Behaviors in Online Music-Listener Bipartite Networks: A Big Data Analysis
No ratings yet
Universality of Preference Behaviors in Online Music-Listener Bipartite Networks: A Big Data Analysis
23 pages
Unit I 2 Marks With Ans
No ratings yet
Unit I 2 Marks With Ans
7 pages
12 2marks With Ans
No ratings yet
12 2marks With Ans
21 pages
Big Data Mining Literature Review
100% (2)
Big Data Mining Literature Review
7 pages
01.ad3491 Fdsa QB
No ratings yet
01.ad3491 Fdsa QB
16 pages
FDS Notes
No ratings yet
FDS Notes
148 pages
II CSE CS3352 FDS QB Unit1
No ratings yet
II CSE CS3352 FDS QB Unit1
4 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
55 pages
Unit I 2 Marks
No ratings yet
Unit I 2 Marks
5 pages
Master Thesis Topics in Aviation
100% (3)
Master Thesis Topics in Aviation
7 pages
Big Data NOTES
No ratings yet
Big Data NOTES
14 pages
Digital Supply Chain Chalanges and Future Directions
No ratings yet
Digital Supply Chain Chalanges and Future Directions
3 pages
CS3352-QB Fds
No ratings yet
CS3352-QB Fds
12 pages
CH-2 Data Science Emerging Technology
No ratings yet
CH-2 Data Science Emerging Technology
20 pages
II Cse Cs3352 Fds QB Unit1
No ratings yet
II Cse Cs3352 Fds QB Unit1
5 pages
Fdsa 12 - 2M
No ratings yet
Fdsa 12 - 2M
15 pages
Usa Batch 3
No ratings yet
Usa Batch 3
56 pages
The Application of Big Data and Artificial Intelligence Technology in Enterprise Information Security Management and Risk Assessment
No ratings yet
The Application of Big Data and Artificial Intelligence Technology in Enterprise Information Security Management and Risk Assessment
15 pages
2marks Unit 1 2marks Unit 1: Foundations of Datascience (Anna University) Foundations of Datascience (Anna University)
No ratings yet
2marks Unit 1 2marks Unit 1: Foundations of Datascience (Anna University) Foundations of Datascience (Anna University)
8 pages
Data Science Fundamentals QB
No ratings yet
Data Science Fundamentals QB
23 pages
FDS Unit1
No ratings yet
FDS Unit1
30 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
37 pages
EmgTech Chapter 02
No ratings yet
EmgTech Chapter 02
52 pages
3.question Bank
No ratings yet
3.question Bank
7 pages
FDS - Unit 1
No ratings yet
FDS - Unit 1
233 pages
AD3491-Unit 1
No ratings yet
AD3491-Unit 1
32 pages
Ocs353 Data Science Fundamentals Notes
No ratings yet
Ocs353 Data Science Fundamentals Notes
145 pages
Chapter 2. Introduction To Data Science
No ratings yet
Chapter 2. Introduction To Data Science
41 pages
Chapter 5 - Foundations of Business Intelligence Database and Information Management
No ratings yet
Chapter 5 - Foundations of Business Intelligence Database and Information Management
30 pages
FDS Notes
No ratings yet
FDS Notes
5 pages
PDS Question Bank
No ratings yet
PDS Question Bank
19 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
33 pages
ETCh 2
No ratings yet
ETCh 2
36 pages
FDS Unit 1 QB
No ratings yet
FDS Unit 1 QB
7 pages
AI at War
No ratings yet
AI at War
4 pages
1972 6968 1 PB
No ratings yet
1972 6968 1 PB
6 pages
II CSE - A&B (96) DS-int 1 QP ANS-set1
No ratings yet
II CSE - A&B (96) DS-int 1 QP ANS-set1
7 pages
Stock Management System Project Proposal
No ratings yet
Stock Management System Project Proposal
20 pages
IV AI-DS AD3491 FDSA QB Unit1
No ratings yet
IV AI-DS AD3491 FDSA QB Unit1
5 pages
Unit-1 IDS
No ratings yet
Unit-1 IDS
26 pages
AD3491 - Unit 1 - Introduction To Data Science Important Questions 2 Marks With Answer - 3-8
No ratings yet
AD3491 - Unit 1 - Introduction To Data Science Important Questions 2 Marks With Answer - 3-8
6 pages
DTS 201 Lecture Note
No ratings yet
DTS 201 Lecture Note
24 pages
FDS - Aids Complete Notes
No ratings yet
FDS - Aids Complete Notes
138 pages
ACG WORLD - Renewed Shortlist Final
No ratings yet
ACG WORLD - Renewed Shortlist Final
169 pages
Class VII Data Analytics
No ratings yet
Class VII Data Analytics
2 pages
FDS Notes PDF
No ratings yet
FDS Notes PDF
140 pages
Question Bank With Answers
No ratings yet
Question Bank With Answers
103 pages
Ad3491-FDA Unit 1 Question Bank
No ratings yet
Ad3491-FDA Unit 1 Question Bank
8 pages
DS 3-Marks Semeseter Suggestion
No ratings yet
DS 3-Marks Semeseter Suggestion
54 pages
2 Marks With Answers
No ratings yet
2 Marks With Answers
39 pages
Foundation of Data Science (BSC)
No ratings yet
Foundation of Data Science (BSC)
64 pages
Fds Question Bank
No ratings yet
Fds Question Bank
116 pages
2 Marks Foundations of Data Science
No ratings yet
2 Marks Foundations of Data Science
13 pages
FDS - Unit 1
No ratings yet
FDS - Unit 1
233 pages
Q1. Explain Data Science Process Along With Detailed Diagram
No ratings yet
Q1. Explain Data Science Process Along With Detailed Diagram
7 pages
Data Science Unit 01
No ratings yet
Data Science Unit 01
19 pages
The Power of Big Data: Transforming Industries and Shaping the Future
From Everand
The Power of Big Data: Transforming Industries and Shaping the Future
Tom Henricksen
No ratings yet
"Big Data Science" Basic Concepts and Applications
From Everand
"Big Data Science" Basic Concepts and Applications
Sukanta Bhattacharya
No ratings yet