AD3491 UNIT 1 NOTES EduEngg
WEBSITE: www.eduengineering.in
TELEGRAM: @eduengineering
1. What is Data Science?
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains. Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.
Data Science: Data Science is a field or domain that involves working with huge amounts of data and using it to build descriptive, predictive, and prescriptive analytical models.
It is about digging into and capturing the data, building the model, analyzing (validating) the model, and utilizing the data by deploying the best model.
It is an intersection of data and computing, and a blend of the fields of Computer Science, Business Management, and Statistics.
Big Data: Big Data refers to huge, voluminous collections of data, information, or statistics acquired by large organizations and ventures. Specialized software and data storage systems are created and prepared for it, because it is difficult to process big data manually.
It is used to discover patterns and trends and to make decisions related to human behavior and interaction with technology.
Example Applications:
Fraud and Risk Detection
Healthcare
Internet Search
Targeted Advertising
Website Recommendations
Advanced Image Recognition
Speech Recognition
Airline Route Planning
Gaming
Augmented Reality
2. What are the differences between Big Data and Data Science?
Data Science:
• Data Science is an area (a field of study).
• Tools mainly used in Data Science include SAS, R, Python, etc.
• Uses mathematics and statistics extensively, along with programming skills, to develop a model, test the hypothesis, and make decisions in the business.
Big Data:
• Big Data is a technique to collect, maintain and process huge volumes of information.
• Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
• Used by businesses to track their presence in the market, which helps them develop agility and gain a competitive advantage over others.
v. Sharing - Data sharing is the practice of making data used for scholarly research
available to other investigators
vi. Transfer - Data transfer refers to the secure exchange of large files between systems
or organizations.
vii. Visualization - Data visualization is the graphical representation of information and
data.
5. What are the benefits and uses/advantages of Data Science and Big Data Analytics?
There are a number of benefits/advantages to using Data Science and Big Data Analytics. Some of them are listed below.
• Commercial companies in almost every business wish to analyze and gain insights into their customers, processes, staff, competition, and products. Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
• Human resource professionals use people analytics and text mining to screen candidates,
monitor the mood of employees, and study informal networks among coworkers.
• Financial institutions use data science to predict stock markets, determine the risk of
lending money, and learn how to attract new clients for their services.
• Many governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public. You can use this data to gain
insights or build data-driven applications.
• Nongovernmental organizations (NGOs) can use it as a source for getting funding. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources
to collect data and employ data scientists.
• Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOC) produces a lot of data,
which allows universities to study how this type of learning can complement traditional
classes.
• Data accumulation from multiple sources, including the Internet, social media platforms,
online shopping sites, company databases, external third-party sources, etc.
• Real-time forecasting and monitoring of business as well as the market.
• Identify crucial points hidden within large datasets to influence business decisions.
• Promptly mitigate risks by optimizing complex decisions for unforeseen events and
potential threats.
6. List the different types of Data used in Big Data Analytics and Data Science.
The major categories of data types used in Big Data Analytics and Data Science are as follows.
■ Structured - Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases (a small query sketch appears after this list).
■ Unstructured - Unstructured data is data that does not fit easily into a data model, because its content is context-specific or varying. Email is a good example of unstructured data; it also contains natural language data.
■ Natural language: Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don’t generalize well to other domains.
■ Machine-generated: Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
■ Graph-based or network data: Graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people (a small sketch appears after this list).
■ Audio, video, and images: Audio, image, and video are data types that pose specific challenges
to a data scientist. Recently a company called DeepMind succeeded at creating an algorithm that’s
capable of learning how to play video games.
■ Streaming: While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded into a data
store in a batch.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
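As a small illustration of the structured-data point above, the following sketch (using Python’s built-in sqlite3 module; the table, columns, and values are made up for the example) stores a few records in a fixed schema and queries them with SQL:

import sqlite3

# Structured data: records with a fixed schema, stored in a table and queried with SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", 34), (2, "Ravi", 29), (3, "Meena", 41)])
conn.commit()

# SQL is the preferred way to query data that resides in databases.
for row in cur.execute("SELECT name, age FROM customers WHERE age > 30"):
    print(row)
conn.close()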
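To illustrate the graph-based data point above, here is a minimal sketch using the networkx library (an assumption; the notes do not name a tool) that computes the shortest path between two people and a simple influence-style metric in a tiny social network:

import networkx as nx

# A tiny social network: nodes are people, edges are "knows" relationships.
G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Dave"), ("Alice", "Eve"), ("Eve", "Dave")])

# Shortest chain of acquaintances between two people.
print(nx.shortest_path(G, "Alice", "Dave"))   # e.g. ['Alice', 'Eve', 'Dave']

# A simple influence-style metric: degree centrality of each person.
print(nx.degree_centrality(G))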
Data preparation: In this phase you enhance the quality of the data and prepare it for use in
subsequent steps. This phase consists of three sub-phases: data cleansing removes false values
from a data source and inconsistencies across data sources, data integration enriches data sources
by combining information from multiple data sources, and data transformation ensures that the
data is in a suitable format for use in your models.
Data exploration: Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the data, and
whether there are outliers. To achieve this you mainly use descriptive statistics, visual techniques,
and simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data
Analysis.
Data modeling or model building: In this phase you use models, domain knowledge, and insights
about the data you found in the previous steps to answer the research question. You select a
technique from the fields of statistics, machine learning, operations research, and so on. Building
a model is an iterative process that involves selecting the variables for the model, executing the
model, and model diagnostics.
Presentation and automation: Finally, you present the results to your business. These results can
take many forms, ranging from presentations to research reports. Sometimes you’ll need to
automate the execution of the process because the business will want to use the insights you gained
in another project or enable an operational process to use the outcome from your model.
8. Explain the Data Science Process in detail.
1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. The third step is data preparation. It includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5. The fifth step is model building (often referred to as “data modeling”). It is now that you attempt to gain the insights or make the predictions stated in your project charter. If you’ve done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may
still need to convince the business that your findings will indeed change the business process
as expected. This is where you can shine in your influencer role. The importance of this step
is more apparent in projects on a strategic and tactical level. Certain projects require you to
perform the business process over and over again, so automating the project will save time.
Step 1: Defining research goals
• A project starts by understanding the what, the why, and the how of your project.
• What does the company expect you to do?
• Why does management place such a value on your research?
• The outcome should be a good understanding of the context, well-defined deliverables, and
a plan of action with a timetable.
• Spend time understanding the goals and context of your research.
• Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your
research is going to change the business, and understand how they’ll use your results.
Create a project charter
After understanding the problems and goals, try to get a formal agreement on the
deliverables.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
CLEANSING DATA
• Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
• The first type is the interpretation error, where a value is taken at face value even though it cannot be correct. The common error types are described below.
• The second type of error points to inconsistencies between data sources.
• Sometimes a single observation has too much influence; this can point to an error in the data, but it can also be a valid data point.
Data Entry Errors
Data collection and data entry are error-prone processes.
Redundant Whitespace: Leading, trailing, or repeated whitespace in string values makes otherwise identical strings compare as different, which complicates string matching. Such whitespace should be stripped.
Capital Letter Mismatches: The same value may be entered with inconsistent capitalization, and most tools treat, for example, “Brazil” and “brazil” as different strings, so capitalization should be normalized.
Impossible Values and Sanity Checks: Sanity checks verify that values lie within limits that are possible in the real world. For example, a person’s age cannot be negative or exceed 120:
check = 0 <= age <= 120
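A minimal cleaning sketch in Python with pandas (an assumption; the notes do not prescribe a tool), on made-up customer data, showing how these data entry errors can be corrected:

import pandas as pd

# Hypothetical customer data containing typical data entry errors.
df = pd.DataFrame({
    "country": ["Brazil ", " brazil", "INDIA", "India"],
    "age": [34, -2, 150, 41],
})

# Strip redundant whitespace and normalize capitalization.
df["country"] = df["country"].str.strip().str.title()

# Sanity check: keep only ages within real-world limits (0 <= age <= 120).
valid = df["age"].between(0, 120)
print(df[~valid])   # rows that fail the sanity check
df = df[valid]
print(df)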
Outliers
An observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations.
The easiest way to find outliers is to use a plot or a table with the minimum and maximum
values.
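As a rough sketch (Python with pandas, hypothetical values), the minimum/maximum summary and the common 1.5 x IQR rule of thumb can be used to flag outliers:

import pandas as pd

# Hypothetical measurements with one suspiciously large value.
values = pd.Series([5.1, 4.8, 5.3, 5.0, 4.9, 25.0])

print(values.describe())   # min and max give a quick first check

# A common rule of thumb: flag points outside 1.5 * IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)            # -> the value 25.0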
Dealing with Missing Values: Missing values are not necessarily wrong, but they still need to be treated. Common treatments include omitting the observation, setting the value to null, or imputing a static or estimated value; each has its own advantages and disadvantages.
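A minimal sketch (Python with pandas, made-up data) of two of these treatments, omission and mean imputation:

import numpy as np
import pandas as pd

# Hypothetical data set with missing income values.
df = pd.DataFrame({"age": [25, 40, 31, 58],
                   "income": [30000, np.nan, 42000, np.nan]})

# Option 1: omit observations with missing values (simple, but loses data).
dropped = df.dropna()

# Option 2: impute a static value such as the column mean
# (keeps the rows, but can bias the model).
imputed = df.fillna({"income": df["income"].mean()})
print(dropped)
print(imputed)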
Deviations from a code book: A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. One can use such a code book to detect and correct data that is missing or that deviates from the documented encodings.
Different Units of Measurement: A data set may be combined from different sources, and each source may use its own units of measurement. Therefore we need to look at the values carefully and convert them to a single standard measure (see the sketch below).
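A small sketch (Python with pandas; the distance/unit columns are made up) converting mixed units to one standard unit:

import pandas as pd

# Hypothetical data set in which distances were recorded in mixed units.
df = pd.DataFrame({"distance": [5.0, 3.1, 12.0],
                   "unit": ["km", "mi", "km"]})

# Convert everything to a single standard unit (kilometres).
to_km = {"km": 1.0, "mi": 1.60934}
df["distance_km"] = df["distance"] * df["unit"].map(to_km)
print(df)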
Different Levels of Aggregation: An example of this would be one data set containing data per week versus another containing data per work week. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it (see the resampling sketch below). After cleaning the data errors, you combine information from different data sources.
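A minimal resampling sketch (Python with pandas, hypothetical daily sales) that summarizes data to a common weekly level:

import pandas as pd

# Hypothetical daily sales that need to be summarized to the weekly level
# so they can be compared with a data set reported per week.
daily = pd.DataFrame(
    {"sales": [10, 12, 9, 11, 14, 8, 7, 13, 12, 10]},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
weekly = daily.resample("W").sum()
print(weekly)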
CORRECT ERRORS AS EARLY AS POSSIBLE
Data should be cleansed when acquired for many reasons:
• Decision makers take important decisions based on the data, and uncorrected errors can lead to costly mistakes.
• If errors are not corrected early on in the process, the cleansing will have to be done
for every project that uses that data.
• Data errors may point to a business process that isn’t working as designed.
• Data errors may point to defective equipment, such as broken transmission lines
and defective sensors.
• Data errors can point to bugs in software or in the integration of software that may
be critical to the company.
COMBINING DATA FROM DIFFERENT DATA SOURCES.
• Data is acquired from different sources and hence varies in size, type, and structure, ranging from databases and Excel files to text documents.
• The different ways of combining data:
o Joining data from different tables.
o Appending or stacking data from different tables.
• In a join, rows from different tables are matched on a key that uniquely identifies each record, called the primary key; it is used for joining the data and for eliminating redundancy (see the joining sketch after this list).
o Appending or stacking places the rows of one table underneath those of another table with the same columns. In general, an SQL query (for example, a UNION) is used for appending or stacking the tables (see the stacking sketch after this list).
o Sometimes a physical join or append is undesirable because of disk space restrictions. To avoid this problem, database views are used to join or append the tables: a view only creates a logical, query-based combination of the data, without physically creating a new copy of it.
o Enriching with aggregated measures: extra derived measures can add perspective. For example, once we have an aggregated data set (such as total sales per category), it can in turn be used to calculate the participation (share) of each product within its category. This could be useful during data exploration but more so when creating data models (see the enrichment sketch after this list).
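A minimal sketch of joining and stacking with Python and pandas (an assumption; the notes mention SQL, where the same operations exist as JOIN and UNION). The table and column names are made up:

import pandas as pd

# Hypothetical tables: client details and client orders share the key
# "client_id", which acts as the primary key for the join.
clients = pd.DataFrame({"client_id": [1, 2, 3],
                        "region": ["North", "South", "East"]})
orders = pd.DataFrame({"client_id": [1, 1, 3],
                       "amount": [120, 80, 200]})

# Joining: enrich each order with the client's region.
joined = orders.merge(clients, on="client_id", how="left")

# Appending / stacking: place one month's orders underneath another's.
orders_jan = pd.DataFrame({"client_id": [1, 2], "amount": [50, 75]})
stacked = pd.concat([orders, orders_jan], ignore_index=True)
print(joined)
print(stacked)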
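A minimal enrichment sketch (Python with pandas, hypothetical sales figures) computing each product’s share within its category from an aggregated measure:

import pandas as pd

# Hypothetical aggregated sales per product, used to derive each product's
# share (participation) within its category.
sales = pd.DataFrame({"category": ["A", "A", "B", "B"],
                      "product": ["p1", "p2", "p3", "p4"],
                      "sales": [40, 60, 30, 70]})
sales["category_total"] = sales.groupby("category")["sales"].transform("sum")
sales["share_in_category"] = sales["sales"] / sales["category_total"]
print(sales)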
TRANSFORMING DATA
o Transforming the data into a form suitable for the model is essential; it helps to reveal the relationships within the data set.
o For example, data following the relationship y = a*e^(b*x) can be transformed into a linear model by taking the logarithm of y, giving log(y) = log(a) + b*x. This is essential in some cases (see the sketch after this list).
o Reducing the number of variables: in some cases it is essential to identify the most important attributes and select only those for analysis. PCA (Principal Component Analysis) is used for this purpose; it condenses the information and avoids carrying unessential variables (see the sketch after this list).
o Turning Variables into Dummies: sometimes it is essential to transform a categorical variable into a set of binary (0/1) indicator columns so that models can process it; these indicator columns are known as dummy variables (see the sketch after this list).
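A minimal sketch (Python with numpy, synthetic data) of the log transform above, fitting log(y) against x with an ordinary least-squares line:

import numpy as np

# Hypothetical data generated from y = a * exp(b * x); taking log(y) turns it
# into the linear relationship log(y) = log(a) + b * x, which can be fitted
# with a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.05, size=x.size)

b, log_a = np.polyfit(x, np.log(y), 1)   # slope = b, intercept = log(a)
print("estimated a:", np.exp(log_a), "estimated b:", b)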
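A minimal sketch of variable reduction with PCA (using scikit-learn, an assumption) and of turning a categorical variable into dummies with pandas; the data is made up:

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data set: two correlated numeric variables and one categorical
# variable that will be turned into dummy (0/1) columns.
df = pd.DataFrame({"height": [1.60, 1.75, 1.82, 1.68],
                   "weight": [55, 72, 85, 63],
                   "gender": ["F", "M", "M", "F"]})

# Reducing the number of variables with PCA: keep one principal component.
pca = PCA(n_components=1)
component = pca.fit_transform(df[["height", "weight"]])
print(component)

# Turning a categorical variable into dummies.
dummies = pd.get_dummies(df["gender"], prefix="gender")
print(dummies)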
Step 4: Exploratory Data Analysis
Exploratory analysis consists of the methods used to understand the trends in the data and the relationships among the variables. It can be performed with several different techniques, such as summary statistics, histograms, boxplots, bar charts, and scatter plots. Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
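A minimal EDA sketch (Python with pandas and matplotlib, hypothetical data) combining summary statistics with two of these plots:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data set used to illustrate basic exploratory data analysis.
df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 29, 41, 38],
                   "income": [28, 45, 40, 80, 66, 35, 58, 50]})

print(df.describe())   # summary statistics per variable

df["income"].plot(kind="hist", title="Income distribution")
plt.show()

df.plot(kind="scatter", x="age", y="income", title="Age vs income")
plt.show()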
Step 5: Build the models
Building a model is an iterative process that involves selecting the variables for the model,
executing the model, and model diagnostics.
Model and variable selection
To select the variables, ask yourself the following questions:
o Must the model be moved to a production environment and, if so, would it be easy to implement?
o How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
o Does the model need to be easy to explain?
Model Execution
There are number mechanisms are available from both statistical and machine learning domain.
First one is the linear Regression: This is the normal line fitting mechanism to extrapolate the value
or doing the prediction. Normally, we solve the equation y = mx+b and try to find the value of m
and b with respect to the given dataset.
Model fit: It is essential to check how well the model fits the given data. One indication is that the difference between the R-squared value and the adjusted R-squared value should be minimal; if it is, the model is considered a suitable fit for the given dataset.
The predictor variables have coefficients; these can change when more samples are added to the system.
The p-value indicates the significance of a predictor for the target variable; it should typically be less than 0.05. It is another factor used to study the contribution of each predictor to the prediction.
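A minimal sketch (Python with statsmodels, an assumption, on synthetic data) fitting y = mx + b and inspecting R-squared, adjusted R-squared, and predictor p-values:

import numpy as np
import statsmodels.api as sm

# Hypothetical data roughly following y = 2x + 1 with noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

X = sm.add_constant(x)            # adds the intercept term b
model = sm.OLS(y, X).fit()

print(model.params)                          # estimated b (intercept) and m (slope)
print(model.rsquared, model.rsquared_adj)    # should be close to each other
print(model.pvalues)                         # predictor significance (want p < 0.05)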
The second one is for classification: K-Nearest Neighbor (KNN) is one of the supervised classification mechanisms. In this classifier, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
o The confusion matrix is an error matrix used to represent the accuracy of the predictions: it tabulates how many observations of each actual class were assigned to each predicted class.
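A minimal sketch (Python with scikit-learn, an assumption, on synthetic two-class data) that trains a KNN classifier and prints its confusion (error) matrix:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-class data: each observation has two features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Confusion (error) matrix: rows are actual classes, columns are predictions.
print(confusion_matrix(y_test, y_pred))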
Model diagnostics and model comparison
o Mean squared error (MSE) is a standard measure used to assess the prediction accuracy of a model: MSE = (1/n) * sum((y_i - y_hat_i)^2). If it is small, the predictions are close to the actual values and the model’s accuracy is high, which makes MSE useful for comparing competing models.
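A minimal sketch (Python with numpy, made-up numbers) comparing two models by their mean squared error:

import numpy as np

# Hypothetical actual values and predictions from two competing models.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.8, 5.1, 7.3, 8.9])
pred_b = np.array([2.0, 6.0, 6.0, 10.0])

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# The model with the smaller mean squared error fits the data better.
print("model A:", mse(y_true, pred_a))
print("model B:", mse(y_true, pred_b))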
Step 6: Presenting findings and building applications on top of them.
o Present the visualizations as per your requirements.
o The visualization techniques are discussed in detail under exploratory data analysis.
o Where possible, make the visualizations update automatically when the dataset changes or is modified.
CONNECT WITH US
WEBSITE: www.eduengineering.in
TELEGRAM: @eduengineering