AD3491 - Fundamentals of Data Science and Analytics
Important Questions with Answers

UNIT I – INTRODUCTION TO DATA SCIENCE


SYLLABUS:
Need for data science – benefits and uses – facets of data – data
science process – setting the research goal – retrieving data –
cleansing, integrating, and transforming data – exploratory data
analysis – build the models – presenting and building applications.

PART A
1. What is Big Data?
 Big data is data of such huge volume, high velocity, and wide variety that it cannot be processed by traditional processing systems.
 It is characterized by the 7 Vs: velocity, variety, volume, variability, visualization, value and veracity.

2. What are the characteristics of Big Data?


 Velocity - refers to the speed of data processing
 Volume - refers to the amount of data
 Value - refers to the benefits that the organization derives from the data.
 Variety - refers to the different types of big data.
 Veracity - refers to the accuracy of your data.
 Validity – refers to the relevance of data for the intended purpose.
 Volatility – refers to how quickly the data changes and how long it remains valid.
 Visualization – refers to presenting big-data-generated insights through visual representations such as charts and graphs.

3. Define Data Science.


 Data science is the field of study of data, using modern scientific techniques,
statistical methods and algorithms to derive insights from huge volume of
data and to create business and IT strategies.
 It deals with where the data comes from, what it represents, and the ways in which it can be transformed into valuable inputs and resources.

4. What are the benefits and uses of data science and big data?


 Commercial Companies
 Human Resource professionals
 Financial institutions
 Governmental organizations
 Nongovernmental organizations (NGOs)
 Universities


5. List out the Facets of data.


The facets of data are categorized as follows:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming

6. Define Structured data.


 Structured data is data that depends on a data model and resides in a
fixed field within a record.
 It’s easy to store structured data in tables within databases or Excel files.
 SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
 Example: Excel files

7. Define unstructured data


 Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying.
 Example: Email

8. What is Machine Generated Data?


 Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
 The analysis of machine data relies on highly scalable tools, due to its high
volume and speed.
 Examples: web server logs, call detail records, network event logs, and
telemetry

9. What is Streaming Data?


 The data flows into the system in a continuous manner when an event
happens instead of being loaded into a data store in a batch.
 Examples - “What’s trending” on Twitter, live sporting or music events, and
the stock market.

10. Define Graph based or Network data


 “Graph” points to mathematical graph theory.


 In graph theory, a graph is a mathematical structure to model pair-wise


relationships between objects.
 Graph or network data is data that focuses on the relationships or adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and store
graphical data.
 Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
 Example: social media websites
o For instance, on LinkedIn you can see who you know at which
company.
o Your follower list on Twitter is another example of graph-based data.

11. List out the steps in Data Science Process


The data science process typically consists of six steps:
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling (building the models)
6. Presentation and automation

12. What is meant by Project Charter?


 All information related to the research goal is best collected in a project charter.
 A project charter requires teamwork, and input covers at least the following:
o A clear research goal
o The project mission and context
o How to perform analysis
o What resources to use
o Proof that it’s an achievable project, or proof of concepts
o Deliverables and a measure of success
o A timeline


13. How is data retrieved in the data science process?


 The second step is to collect data by finding suitable data and getting
access to the data from the data owner.
 Data can also be delivered by third-party companies and take many
forms ranging from Excel spreadsheets to different types of databases.
 The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.

14. What is a data repository?


 A data repository is also known as a data library or data archive.
 A data repository is a large database infrastructure (often several databases) that collects, manages, and stores data sets for data analysis, sharing, and reporting.
 Example: Database, Data Warehouse, Data mart, Data Lake.

15. Difference between Data Warehouse and Data Mart.


Data Warehouse:
 Stores a large amount of data collected from different sources.
 Is focused on all departments in an organization.
 Its design process is complicated.
 Takes a long time for data handling.
 Size ranges from 100 GB to 1 TB+.

Data Mart:
 Contains only the specific data from the data warehouse that the company requires for analysis.
 Focuses on a specific group or business unit.
 Is easy to design.
 Takes a short time for data handling.
 Size is less than 100 GB.

16. Define Data Lake.


 A data lake is a large data repository that stores data in its natural or raw format, often unstructured, classified and tagged with metadata.

17. What is Exploratory Data Analysis (EDA)?


 Data exploration is concerned with building a deeper understanding of the
data to know how variables interact with each other, the distribution of
the data, and whether there are outliers.


18. Define Data Modeling.


 Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
 Models consist of the following main steps:
o Selection of a modeling technique and variables to enter in the model
o Execution of the model
o Diagnosis and model comparison

19. Define linking and brushing technique.


 With brushing and linking, you can combine and link different graphs and tables so that changes in one graph are automatically transferred to the other graphs.

20. What is Histogram and Boxplot?


 In a histogram, a variable is cut into discrete categories and the number of occurrences in each category is summed up and shown in the graph.
 The boxplot doesn't show how many observations are present but offers an impression of the distribution within categories.
 It can show the maximum, minimum, median, and other characterizing measures at the same time.

21. Define Presentation and automation steps in Data Science process.


 The final step is presenting the results to the business.
 These results can take many forms, ranging from presentations to research reports.
 Sometimes the execution of the process needs to be automated, because the business will use the insights gained in another project or enable an operational process to use the outcome from the model.

22. Discuss the three sub-phases of Data preparation.


 This includes transforming the data from a raw form into data that’s directly
usable in your models.


 This phase consists of three sub-phases:


i) Data cleansing removes false values from a data source and inconsistencies across data sources;
ii) Data integration enriches data sources by combining information from multiple data sources; and
iii) Data transformation ensures that the data is in a suitable format for use in your models.

23. Define common errors that occur during data cleansing.

 Common errors include data entry errors (typos), redundant whitespace, impossible values, outliers, missing values, deviations from a code book, different units of measurement, and different levels of aggregation.

24. Define outlier.


 An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a different
logic or generative process than the other observations.
 The easiest way to find outliers is to use a plot or a table with the
minimum and maximum values.


PART B
1. Describe data science and its applications, and discuss the benefits and uses of data science and big data.

Contents
 Big Data
 Data Science
 Benefits and Uses:
1. Commercial Companies
2. Human Resource Professionals
3. Financial Institutions
4. Government Organizations
5. Non-governmental organizations (NGOs)
6. Universities
 Data Science Tools
 Real Time Applications of Data Science

Data
 Data is a collection of discrete values that convey information, describing quantity, quality, facts, and statistics.

Big data
 Big data is data of such huge volume, high velocity, and wide variety that it cannot be processed by traditional processing systems.
 It is characterized by the 7 Vs: velocity, variety, volume, variability, visualization, value and veracity.

Data science
 Data science is the field of study of data, using modern scientific
techniques, statistical methods and algorithms to derive insights
from huge volume of data and to create business and IT strategies.
 It deals with where the data comes from, what it represents, and the ways in which it can be transformed into valuable inputs and resources.

Benefits and uses of data science


1. Commercial Companies
 Commercial companies use data science to gain insights into their customers, processes, staff, competition, and products.


 Many companies use data science to offer customers a better user


experience, cross-sell, up-sell, and personalize their offerings.
 Example:
o Google AdSense - collects data from internet users so relevant
commercial messages can be matched to the person browsing the
internet.
o MaxPoint - example of real-time personalized advertising.
2. Human Resource Professionals
 Human resource professionals use people analytics and text mining
to screen candidates, monitor the mood of employees, and study
informal networks among co-workers.
3. Financial Institutions
 Financial institutions use data science to predict stock markets,
determine the risk of lending money, and learn how to attract new
clients for their services.
4. Government Organizations
 Governmental organizations are also aware of data’s value.
 Example:
o Data.gov is the home of the US Government’s open data.
5. Non-governmental organizations (NGOs)
 Non-governmental organizations (NGOs) use it to raise money and
defend their causes.
 Example:
o The World Wildlife Fund (WWF) employs data scientists to increase the effectiveness of their fundraising efforts.
o DataKind is a data scientist group that devotes its time to the benefit of mankind.
6. Universities
 Universities use data science in their research to enhance the study
experience of their students.
 Example:
o The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities to study how students learn.
o Examples of MOOC providers are Coursera, Udacity, and edX.

Data Science Tools


1. SAS - statistical operations and processing
2. Apache Spark - handles batch processing and stream processing
3. BigML - runs machine learning algorithms
4. MATLAB - mathematical and numerical computing
5. Tableau - data visualization software


6. Jupyter - used for writing and running code in Python
7. Matplotlib - library for plotting and visualization in Python
8. NLTK - natural language processing
9. TensorFlow - machine learning framework
10. NumPy - numerical Python for data analysis
11. SciPy - scientific Python for scientific and technical computations
12. Pandas - used for data analysis

Real Time Applications of Data Science


 Fraud and Risk Detection
 Healthcare
o Medical Image Analysis
o Medical Drug Development
o Virtual Assistance for patients and customer support
 Internet Search
 Target Advertising
 Website Recommendation
 Speech Recognition
 Gaming
 Augmented Reality
 Robotics

2. List and explain the facets of data or different types of data or categories of
data.

Contents
1. Structured
2. Unstructured
3. Natural Language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming

 Categories of data:
1. Structured data
 Structured data is data that depends on a data model and resides in a
fixed field within a record.
 It’s easy to store structured data in tables within databases or Excel files.


 SQL, or Structured Query Language, is the preferred way to manage and


query data that resides in databases.
Example: Refer Figure 1.1
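
A minimal illustrative sketch (made-up table and values, not from Figure 1.1) of how structured data sits in fixed fields and is queried with SQL, here through Python's built-in sqlite3 module:

import sqlite3

# In-memory database with one structured table (hypothetical example data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Sales", 50000), ("Bob", "HR", 45000), ("Carol", "Sales", 52000)],
)

# SQL query over the fixed fields of the table
for row in conn.execute("SELECT name, salary FROM employees WHERE department = 'Sales'"):
    print(row)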

2. Unstructured data
 Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying.
 Example - regular email. (Figure 1.2).

 In Figure 1.2, the email contains structured elements such as the sender, title, and body text; even so, it is a challenge to find the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person.


3. Natural language
 Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific data
science techniques and linguistics.
 The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don't generalize well to other domains.
4. Machine-generated data
 Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
 The analysis of machine data relies on highly scalable tools, due to its
high volume and speed.
 Examples - web server logs, call detail records, network event logs, and
telemetry (Figure 1.3).

 The machine data in figure 1.3 would fit nicely in a classic table-
structured database.
 This isn’t the best approach for highly interconnected or “networked”
data, where the relationships between entities have a valuable role to
play.
5 Graph-based or network data
 “Graph” points to mathematical graph theory.
 In graph theory, a graph is a mathematical structure to model pair-
wise relationships between objects.
 Graph or network data is, a data that focuses on the relationship or
adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and
store graphical data.


 Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
 Example: graph-based data can be found on many social media websites, such as the follower list on Twitter (figure 1.4).

 Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
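
As a small hedged sketch (hypothetical people, using the third-party networkx library, which is not part of these notes), a follower network can be modelled as a graph and queried for the metrics mentioned above:

import networkx as nx

# A tiny made-up follower graph: an edge A -> B means A follows B
g = nx.DiGraph()
g.add_edges_from([("Ann", "Bob"), ("Bob", "Carl"), ("Ann", "Dina"), ("Dina", "Carl")])

# Influence approximated here by the number of followers, plus a shortest path between two people
print(g.in_degree("Carl"))                 # 2 followers
print(nx.shortest_path(g, "Ann", "Carl"))  # e.g. ['Ann', 'Bob', 'Carl']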
6. Audio, image, and video
 Audio, image, and video are data types that pose specific challenges to
a data scientist.
 Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
 High-speed cameras at stadiums will capture ball and athlete movements
to calculate in real time, for example, the path taken by a defender
relative to two baselines.
 Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games.
 This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
 This prompted Google to buy the company for their own Artificial
Intelligence (AI) development plans.
7. Streaming data
 The data flows into the system in a continuous manner when an event
happens instead of being loaded into a data store in a batch.
 Examples - “What’s trending” on Twitter, live sporting or music events,
and the stock market.


3. Explain in detail the data science process with examples.


Content:

 The data science process – An Overview

Figure 1.5: Steps of Data Science Process


 The data science process typically consists of six steps, as shown in


figure 1.5
1. Setting the research goal

 The first step of this process is defining a research goal by creating a


project charter.
 A project charter requires teamwork, and input covers at least the
following:
o A clear research goal
o The project mission and context
o How to perform analysis
o What data and resources to use
o Proof that it’s an achievable project, or proof of concepts
o Deliverables and a measure of success
o A timeline
2 Retrieving data

 The second step is to collect data by finding suitable data and getting
access to the data from the data owner.
 Start with data stored within the company
o The data can be stored in official data repositories such as
databases, data marts, data warehouses, and data lakes
maintained by a team of IT professionals.
o The primary goal of a database is data storage, while a data
warehouse is designed for reading and analyzing that data.
o A data mart is a subset of the data warehouse and geared toward
serving a specific business unit.
o While data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or raw format, which probably needs polishing and transformation before it becomes usable.
 Don’t be afraid to shop around
o Many companies specialize in collecting valuable information.
o Data can also be delivered by third-party companies and take
many forms ranging from Excel spreadsheets to different types of
databases. Refer Table 1.2


Table 1.2 – Open Data Sites

 Do data quality checks to prevent problems later


o Expect to spend a good portion of your project time doing data
correction and cleansing, sometimes up to 80%.

3. Data preparation

 Data collection is an error-prone process; this phase enhances the quality of the data and prepares it for use in the subsequent steps.
 This phase consists of three sub-phases:
1. Data cleansing - a subprocess of data science that removes false values from a data source and inconsistencies across data sources.
Types of errors
 Interpretation error – taking a value for granted. Example: a person's age recorded as greater than 300 years.
 Inconsistencies – between data sources or against standardized values. Example: putting "Female" in one table and "F" in another.


Common Errors
Table 1.3 – Common Errors

1. Data Entry Errors


 Data collection and data entry are error-prone processes.
 They often require human intervention, and because humans
are only human, they make typos or lose their concentration
for a second and introduce an error into the chain.
 Example:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
2. Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors. Fixing
redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that
will remove the leading and trailing whitespaces.
Example:
In Python, the strip() function is used to remove leading and trailing spaces.
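
A minimal sketch (made-up values) of stripping redundant whitespace from text fields before comparing or joining on them:

# Hypothetical country codes read from two sources; stray spaces would break a comparison
raw_values = ["FR ", " FR", "FR"]
cleaned = [value.strip() for value in raw_values]
print(len(set(raw_values)))  # 3 distinct values before cleaning
print(len(set(cleaned)))     # 1 distinct value after cleaning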
3. Impossible Values And Sanity Checks
Sanity checks are another valuable type of data check.
Example:
Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
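
A short hedged sketch (made-up ages) applying the same sanity-check rule to a whole column to flag impossible values:

ages = [25, 42, 310, -1, 67]                  # hypothetical values; 310 and -1 are impossible
suspect = [age for age in ages if not (0 <= age <= 120)]
print(suspect)                                # [310, -1]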
4. Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
An example is shown in figure 1.6.

Figure 1.6: Distribution plots are helpful in detecting outliers and help you understand the variable.
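
A brief sketch (made-up incomes) of spotting an outlier by checking the minimum and maximum values and by plotting the distribution, using the Matplotlib library from the tools list:

import matplotlib.pyplot as plt

incomes = [21000, 23500, 25000, 26000, 990000]  # hypothetical values; the last one is suspect
print(min(incomes), max(incomes))               # the maximum immediately stands out

plt.hist(incomes, bins=20)                      # a distribution plot makes the outlier visible
plt.xlabel("income")
plt.ylabel("frequency")
plt.show()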

5. Dealing With Missing Values


Missing values aren’t necessarily wrong, but still need to handle
them separately;


Table 1.4 An overview of techniques to handle missing data
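
A minimal sketch (hypothetical data, using the Pandas library from the tools list) of two common techniques from such an overview: omitting the observations with missing values, or imputing a simple estimate such as the column mean:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35], "income": [30000, 42000, None, 38000]})

dropped = df.dropna()                             # omit observations with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute the column mean instead
print(dropped)
print(imputed)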

6. Deviations From A Code Book


 Detecting errors in larger data sets against a code book or against
standardized values can be done with the help of set operations.
 A code book is a description of your data, a form of metadata.
 It contains things such as the number of variables per observation,
the number of observations, and what each encoding within a
variable means.
 (For instance “0” equals “negative”, “5” stands for “very positive”.)
 A code book also tells you what type of data you're looking at: is it hierarchical, graph-based, or something else?
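
A short illustrative sketch (made-up codes) of using set operations to detect values that deviate from a code book:

allowed_codes = {"0", "1", "2", "3", "4", "5"}   # encodings defined in the code book
observed_codes = {"0", "3", "5", "7", "x"}       # encodings actually found in the data

invalid = observed_codes - allowed_codes          # set difference reveals the deviations
print(invalid)                                    # {'7', 'x'}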
7. Different Units Of Measurement
 When integrating two data sets, you should pay attention to their respective units of measurement.
 An example of this would be studying the prices of gasoline around the world, gathering data from different data providers.
 Some data sets can contain prices per gallon and others can contain prices per liter.
 A simple conversion will do the trick in this case.
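
A tiny sketch (made-up prices) of that conversion before the two sources are combined:

LITERS_PER_GALLON = 3.78541

prices_per_gallon = [3.40, 3.55]                              # hypothetical prices from one provider
prices_per_liter = [p / LITERS_PER_GALLON for p in prices_per_gallon]
print([round(p, 3) for p in prices_per_liter])                # now comparable with per-liter data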
8. Different Levels Of Aggregation
 Having different levels of aggregation is similar to having different
types of measurement.


 An example of this would be a data set containing data per week


versus one containing data per work week.
 This type of error is generally easy to detect, and summarizing (or
the inverse, expanding) the data sets will fix it.
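
A minimal hedged sketch (hypothetical daily figures, using Pandas) of summarizing data to a coarser level of aggregation, here daily values rolled up to weekly totals, so that two sources can be compared:

import pandas as pd

daily = pd.Series(
    [10, 12, 9, 11, 13, 8, 7, 14, 15, 12, 11, 10, 9, 8],
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)
weekly = daily.resample("W").sum()   # summarize daily data to weekly totals
print(weekly)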
 After cleaning the data errors, combine information from different
data sources.

Correct errors as early as possible


 Data should be cleansed when acquired for many reasons:
o Decision-makers may make costly mistakes when they rely on information from applications that fail to correct for the faulty data.
o If errors are not corrected early on in the process, the cleansing
will have to be done for every project that uses that data.
o Data errors may point to a business process that isn’t working as
designed.
o Data errors may point to defective equipment, such as broken
transmission lines and defective sensors.
o Data errors can point to bugs in software or in the integration of
software that may be critical to the company.

Combining data from different data sources


The different ways of combining data
 The first operation is joining: enriching an observation from one table
with information from another table.
 The second operation is appending or stacking: adding the
observations of one table to those of another table.
1. Joining Tables
 Joining tables allows you to combine the information of one observation found in one table with the information found in another table.
 To join tables, use variables that represent the same object in both tables, such as a date or a country name.
 These common fields are known as keys.
 When these keys also uniquely define the records in the table, they are called primary keys.


Example:

Figure 1.6: Joining two tables on the Item and Region keys
In figure 1.6, both tables contain the client name, and this makes it easy to enrich the client expenditures with the region of the client.
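
A small sketch (hypothetical tables and column names, using Pandas) of the join described above, enriching one table with a column from another on a shared key:

import pandas as pd

expenditures = pd.DataFrame({"client": ["Ann", "Bob"], "spent": [120, 95]})
regions = pd.DataFrame({"client": ["Ann", "Bob"], "region": ["North", "South"]})

# Join on the shared key 'client' to enrich expenditures with each client's region
enriched = expenditures.merge(regions, on="client", how="left")
print(enriched)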

2. Appending or stacking:
 Appending or stacking tables is effectively adding observations
from one table to another table.
 The equivalent operation in set theory would be the union, and
this is also the command in SQL, the common language of
relational databases.
 Other set operators are also used in data science, such as set
difference and intersection.
Example:

Figure 1.7: Appending tables


In figure 1.7, appending data from tables is a common operation, but it requires an equal structure in the tables being appended.
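
A brief sketch (made-up rows, using Pandas) of appending or stacking two tables with the same structure, the equivalent of a UNION in SQL:

import pandas as pd

january = pd.DataFrame({"client": ["Ann", "Bob"], "spent": [120, 95]})
february = pd.DataFrame({"client": ["Ann", "Carl"], "spent": [80, 60]})

# Stack the observations of one table underneath the other (both tables share the same columns)
combined = pd.concat([january, february], ignore_index=True)
print(combined)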


3. Views
 A view is a kind of virtual table.
 You can create a view by selecting fields from one or more tables present in the database.
 A view can either have all the rows of a table or specific rows based on a certain condition.

Figure 1.8: Views

4. Data transformation
 Certain models require their data to be in a certain shape.
 Data transformation ensures that the data is in a suitable format for use in data models.
 Relationships between an input variable and an output variable aren't always linear; taking the log of the independent variable can simplify the estimation problem dramatically.
 Example – Refer Figure 1.9.

Figure 1.9: Transformation


Figure 1.9: Transforming x to log x makes the relationship between x and y linear (right), compared with the non-log x (left).
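
A minimal sketch (synthetic values, using NumPy) of this transformation, where taking the log of the input makes a simple linear fit work well:

import numpy as np

x = np.array([1, 10, 100, 1000, 10000], dtype=float)   # synthetic input variable
y = 2.0 * np.log(x) + 1.0                               # y is linear in log(x), not in x

x_log = np.log(x)                                       # transform x to log x
slope, intercept = np.polyfit(x_log, y, 1)              # ordinary linear fit on the transformed data
print(round(slope, 2), round(intercept, 2))             # ~2.0 and ~1.0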

5. Data exploration or EDA (Exploratory Data Analysis)


o Data exploration is concerned with building a deeper understanding of
the data to know how variables interact with each other, the distribution
of the data, and whether there are outliers.
o The visualization techniques used in this phase range from simple line
graphs or histograms, to more complex diagrams such as Sankey and
network graphs.

 Graphs – simple and combined graphs
In figure 1.11, from top to bottom, a bar chart, a line plot, and a distribution plot are some of the graphs used in exploratory analysis.

 Brushing and linking
With brushing and linking, you can combine and link different graphs and tables so that changes in one graph are automatically transferred to the other graphs.


Figure 1.11 - Graphs used in exploratory analysis

Histogram
 In a histogram, a variable is cut into discrete categories and the number of occurrences in each category is summed up and shown in the graph.

Figure 1.12 - Example Histogram


 Example – Figure 1.12 shows the number of people in the age groups
of 5-year intervals
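
A short sketch (random synthetic ages, using NumPy and Matplotlib) of producing such a histogram with 5-year bins:

import numpy as np
import matplotlib.pyplot as plt

ages = np.random.randint(0, 90, size=500)   # synthetic ages, not real survey data
plt.hist(ages, bins=range(0, 95, 5))        # 5-year intervals, as in figure 1.12
plt.xlabel("age group")
plt.ylabel("number of people")
plt.show()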


Data modeling (building the models)
 Building a model is an iterative process that involves selecting a modeling technique and the variables to enter in the model, executing the model, and performing diagnosis and model comparison.

Presentation and automation
 The last step is presenting the results to the business.
 These results can take many forms, ranging from presentations to research reports.
 Sometimes the execution of the process needs to be automated, because the business will use the insights gained in another project or enable an operational process to use the outcome from the model.
