AD3491 UNIT 1 NOTES EduEngg
WEBSITE: www.eduengineering.in
TELEGRAM: @eduengineering
1. What is Data Science?
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains. Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.
Data Science: Data Science is a field or domain that involves working with huge amounts of data and using it to build descriptive, predictive, and prescriptive analytical models.
It is about digging into and capturing the data, building the model, analyzing (validating) the model, and utilizing the data by deploying the best model.
It is an intersection of data and computing, and a blend of the fields of Computer Science, Business Management, and Statistics.
Big Data: Big Data refers to huge, voluminous collections of data, information, or statistics acquired by large organizations and ventures. Specialized software and data storage systems are created and prepared for it, because it is difficult to process big data manually.
It is used to discover patterns and trends and to make decisions related to human behavior and interaction with technology.
Example Applications:
Fraud and Risk Detection
Healthcare
Internet Search
Targeted Advertising
Website Recommendations
Advanced Image Recognition
Speech Recognition
Airline Route Planning
Gaming
Augmented Reality
2. What are the differences between Big Data and Data Science?
Data Science:
• Data Science is an area (a field of study).
• Tools mainly used in Data Science include SAS, R, Python, etc.
• Uses mathematics and statistics extensively, along with programming skills, to develop a model, test the hypothesis, and make decisions in the business.
Big Data:
• Big Data is a technique to collect, maintain and process huge volumes of information.
• Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
• Used by businesses to track their presence in the market, which helps them develop agility and gain a competitive advantage over others.
v. Sharing - Data sharing is the practice of making data used for scholarly research
available to other investigators
vi. Transfer - Data transfer refers to the secure exchange of large files between systems
or organizations.
vii. Visualization - Data visualization is the graphical representation of information and
data.
5. What are the benefits and uses/advantages of Data Science and Big Data Analytics?
There are a number of benefits/advantages to using Data Science and Big Data Analytics. Some of them are listed below.
• Commercial companies in almost every business wish to analyze and gain insights into their customers, processes, staff, competition, and products. Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
• Human resource professionals use people analytics and text mining to screen candidates,
monitor the mood of employees, and study informal networks among coworkers.
• Financial institutions use data science to predict stock markets, determine the risk of
lending money, and learn how to attract new clients for their services.
• Many governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public. You can use this data to gain
insights or build data-driven applications.
• Nongovernmental organizations (NGOs) can use it as a source for getting funding. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources
to collect data and employ data scientists.
• Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOC) produces a lot of data,
which allows universities to study how this type of learning can complement traditional
classes.
• Data accumulation from multiple sources, including the Internet, social media platforms,
online shopping sites, company databases, external third-party sources, etc.
• Real-time forecasting and monitoring of business as well as the market.
• Identify crucial points hidden within large datasets to influence business decisions.
• Promptly mitigate risks by optimizing complex decisions for unforeseen events and
potential threats.
6. List the different types of Data used in Big Data Analytics and Data Science.
The major categories of data types used in Big Data Analytics and Data Science are as follows.
■ Structured - Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases (a small query sketch appears after this list).
■ Unstructured - Unstructured data is data that does not fit easily into a data model, because its content is context-specific or varying. Email is a good example of unstructured data; it also contains natural language data.
■ Natural language: Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don’t generalize well to other domains.
■ Machine-generated: Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
■ Graph-based or network data: Graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people (a small sketch appears after this list).
■ Audio, video, and images: Audio, image, and video are data types that pose specific challenges
to a data scientist. Recently a company called DeepMind succeeded at creating an algorithm that’s
capable of learning how to play video games.
■ Streaming: While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded into a data
store in a batch.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
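As a small illustration of the structured-data point above, the following sketch (using Python’s built-in sqlite3 module; the table, columns, and values are made up for the example) stores a few records in a fixed schema and queries them with SQL:

import sqlite3

# Structured data: records with a fixed schema, stored in a table and queried with SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", 34), (2, "Ravi", 29), (3, "Meena", 41)])
conn.commit()

# SQL is the preferred way to query data that resides in databases.
for row in cur.execute("SELECT name, age FROM customers WHERE age > 30"):
    print(row)
conn.close()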
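To illustrate the graph-based data point above, here is a minimal sketch using the networkx library (an assumption; the notes do not name a tool) that computes the shortest path between two people and a simple influence-style metric in a tiny social network:

import networkx as nx

# A tiny social network: nodes are people, edges are "knows" relationships.
G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Dave"), ("Alice", "Eve"), ("Eve", "Dave")])

# Shortest chain of acquaintances between two people.
print(nx.shortest_path(G, "Alice", "Dave"))   # e.g. ['Alice', 'Eve', 'Dave']

# A simple influence-style metric: degree centrality of each person.
print(nx.degree_centrality(G))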
Data preparation: In this phase you enhance the quality of the data and prepare it for use in
subsequent steps. This phase consists of three sub-phases: data cleansing removes false values
from a data source and inconsistencies across data sources, data integration enriches data sources
by combining information from multiple data sources, and data transformation ensures that the
data is in a suitable format for use in your models.
Data exploration: Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the data, and
whether there are outliers. To achieve this you mainly use descriptive statistics, visual techniques,
and simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data
Analysis.
Data modeling or model building: In this phase you use models, domain knowledge, and insights
about the data you found in the previous steps to answer the research question. You select a
technique from the fields of statistics, machine learning, operations research, and so on. Building
a model is an iterative process that involves selecting the variables for the model, executing the
model, and model diagnostics.
Presentation and automation: Finally, you present the results to your business. These results can
take many forms, ranging from presentations to research reports. Sometimes you’ll need to
automate the execution of the process because the business will want to use the insights you gained
in another project or enable an operational process to use the outcome from your model.
8. Explain the Data Science Process in detail.
1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. The third step is data preparation. It includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5. The fifth step is model building (often referred to as “data modeling”). It is now that you attempt to gain the insights or make the predictions stated in your project charter. If you’ve done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may
still need to convince the business that your findings will indeed change the business process
as expected. This is where you can shine in your influencer role. The importance of this step
is more apparent in projects on a strategic and tactical level. Certain projects require you to
perform the business process over and over again, so automating the project will save time.
Step 1: Defining research goals
• A project starts by understanding the what, the why, and the how of your project.
• What does the company expect you to do?
• Why does management place such a value on your research?
• The outcome should be a good understanding of the context, well-defined deliverables, and
a plan of action with a timetable.
• Spend time understanding the goals and context of your research.
• Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your
research is going to change the business, and understand how they’ll use your results.
Create a project charter
After understanding the problems and goals, try to get a formal agreement on the
deliverables.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
CLEANSING DATA
• Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
• The first type is the interpretation error, where a value is taken at face value even though it cannot be correct. The common error types are described below.
• The second type of error points to inconsistencies between data sources.
• Sometimes a single observation has too much influence; this can point to an error in the data, but it can also be a valid data point.
Data Entry Errors
Data collection and data entry are error-prone processes.
Redundant Whitespace: Leading, trailing, or repeated whitespace in string values makes otherwise identical strings compare as different, which complicates string matching. Such whitespace should be stripped.
Capital Letter Mismatches: The same value may be entered with inconsistent capitalization, and most tools treat, for example, “Brazil” and “brazil” as different strings, so capitalization should be normalized.
Impossible Values and Sanity Checks: Sanity checks verify that values lie within limits that are possible in the real world. For example, a person’s age cannot be negative or exceed 120:
check = 0 <= age <= 120
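A minimal cleaning sketch in Python with pandas (an assumption; the notes do not prescribe a tool), on made-up customer data, showing how these data entry errors can be corrected:

import pandas as pd

# Hypothetical customer data containing typical data entry errors.
df = pd.DataFrame({
    "country": ["Brazil ", " brazil", "INDIA", "India"],
    "age": [34, -2, 150, 41],
})

# Strip redundant whitespace and normalize capitalization.
df["country"] = df["country"].str.strip().str.title()

# Sanity check: keep only ages within real-world limits (0 <= age <= 120).
valid = df["age"].between(0, 120)
print(df[~valid])   # rows that fail the sanity check
df = df[valid]
print(df)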
Outliers
An observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations.
The easiest way to find outliers is to use a plot or a table with the minimum and maximum
values.
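As a rough sketch (Python with pandas, hypothetical values), the minimum/maximum summary and the common 1.5 x IQR rule of thumb can be used to flag outliers:

import pandas as pd

# Hypothetical measurements with one suspiciously large value.
values = pd.Series([5.1, 4.8, 5.3, 5.0, 4.9, 25.0])

print(values.describe())   # min and max give a quick first check

# A common rule of thumb: flag points outside 1.5 * IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)            # -> the value 25.0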
Dealing with Missing Values: Missing values are not necessarily wrong, but they still need to be treated. Common treatments include omitting the observation, setting the value to null, or imputing a static or estimated value; each has its own advantages and disadvantages.
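A minimal sketch (Python with pandas, made-up data) of two of these treatments, omission and mean imputation:

import numpy as np
import pandas as pd

# Hypothetical data set with missing income values.
df = pd.DataFrame({"age": [25, 40, 31, 58],
                   "income": [30000, np.nan, 42000, np.nan]})

# Option 1: omit observations with missing values (simple, but loses data).
dropped = df.dropna()

# Option 2: impute a static value such as the column mean
# (keeps the rows, but can bias the model).
imputed = df.fillna({"income": df["income"].mean()})
print(dropped)
print(imputed)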
Deviations from a code book: A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. One can use such a code book to detect and correct data that is missing or that deviates from the documented encodings.
Different Units of Measurement: A data set may be combined from different sources, and each source may use its own units of measurement. Therefore we need to look at the values carefully and convert them to a single standard measure (see the sketch below).
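A small sketch (Python with pandas; the distance/unit columns are made up) converting mixed units to one standard unit:

import pandas as pd

# Hypothetical data set in which distances were recorded in mixed units.
df = pd.DataFrame({"distance": [5.0, 3.1, 12.0],
                   "unit": ["km", "mi", "km"]})

# Convert everything to a single standard unit (kilometres).
to_km = {"km": 1.0, "mi": 1.60934}
df["distance_km"] = df["distance"] * df["unit"].map(to_km)
print(df)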
Different Levels of Aggregation: An example of this would be one data set containing data per week versus another containing data per work week. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it (see the resampling sketch below). After cleaning the data errors, you combine information from different data sources.
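A minimal resampling sketch (Python with pandas, hypothetical daily sales) that summarizes data to a common weekly level:

import pandas as pd

# Hypothetical daily sales that need to be summarized to the weekly level
# so they can be compared with a data set reported per week.
daily = pd.DataFrame(
    {"sales": [10, 12, 9, 11, 14, 8, 7, 13, 12, 10]},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
weekly = daily.resample("W").sum()
print(weekly)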
CORRECT ERRORS AS EARLY AS POSSIBLE
Data should be cleansed when acquired for many reasons:
• Decision makers take important decisions based on the data, and uncorrected errors can lead to costly mistakes.
• If errors are not corrected early on in the process, the cleansing will have to be done
for every project that uses that data.
• Data errors may point to a business process that isn’t working as designed.
• Data errors may point to defective equipment, such as broken transmission lines
and defective sensors.
• Data errors can point to bugs in software or in the integration of software that may
be critical to the company.
COMBINING DATA FROM DIFFERENT DATA SOURCES.
• Data is acquired from different sources and hence varies in size, type, and structure, ranging from databases and Excel files to text documents.
• The different ways of combining data:
o Joining data from different tables.
o Appending or stacking data from different tables.
• In a join, rows from different tables are matched on a key that uniquely identifies each record, called the primary key; it is used for joining the data and for eliminating redundancy (see the joining sketch after this list).
o Appending or stacking places the rows of one table underneath those of another table with the same columns. In general, an SQL query (for example, a UNION) is used for appending or stacking the tables (see the stacking sketch after this list).
o Sometimes a physical join or append is undesirable because of disk space restrictions. To avoid this problem, database views are used to join or append the tables: a view only creates a logical, query-based combination of the data, without physically creating a new copy of it.
o Enriching with aggregated measures: extra derived measures can add perspective. For example, once we have an aggregated data set (such as total sales per category), it can in turn be used to calculate the participation (share) of each product within its category. This could be useful during data exploration but more so when creating data models (see the enrichment sketch after this list).
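A minimal sketch of joining and stacking with Python and pandas (an assumption; the notes mention SQL, where the same operations exist as JOIN and UNION). The table and column names are made up:

import pandas as pd

# Hypothetical tables: client details and client orders share the key
# "client_id", which acts as the primary key for the join.
clients = pd.DataFrame({"client_id": [1, 2, 3],
                        "region": ["North", "South", "East"]})
orders = pd.DataFrame({"client_id": [1, 1, 3],
                       "amount": [120, 80, 200]})

# Joining: enrich each order with the client's region.
joined = orders.merge(clients, on="client_id", how="left")

# Appending / stacking: place one month's orders underneath another's.
orders_jan = pd.DataFrame({"client_id": [1, 2], "amount": [50, 75]})
stacked = pd.concat([orders, orders_jan], ignore_index=True)
print(joined)
print(stacked)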
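A minimal enrichment sketch (Python with pandas, hypothetical sales figures) computing each product’s share within its category from an aggregated measure:

import pandas as pd

# Hypothetical aggregated sales per product, used to derive each product's
# share (participation) within its category.
sales = pd.DataFrame({"category": ["A", "A", "B", "B"],
                      "product": ["p1", "p2", "p3", "p4"],
                      "sales": [40, 60, 30, 70]})
sales["category_total"] = sales.groupby("category")["sales"].transform("sum")
sales["share_in_category"] = sales["sales"] / sales["category_total"]
print(sales)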
TRANSFORMING DATA
o Transforming the data into a form suitable for the model is essential; it helps to reveal the relationships within the data set.
o For example, data following the relationship y = a*e^(b*x) can be transformed into a linear model by taking the logarithm of y, giving log(y) = log(a) + b*x. This is essential in some cases (see the sketch after this list).
o Reducing the number of variables: in some cases it is essential to identify the most important attributes and select only those for analysis. PCA (Principal Component Analysis) is used for this purpose; it condenses the information and avoids carrying unessential variables (see the sketch after this list).
o Turning Variables into Dummies: sometimes it is essential to transform a categorical variable into a set of binary (0/1) indicator columns so that models can process it; these indicator columns are known as dummy variables (see the sketch after this list).
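A minimal sketch (Python with numpy, synthetic data) of the log transform above, fitting log(y) against x with an ordinary least-squares line:

import numpy as np

# Hypothetical data generated from y = a * exp(b * x); taking log(y) turns it
# into the linear relationship log(y) = log(a) + b * x, which can be fitted
# with a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.05, size=x.size)

b, log_a = np.polyfit(x, np.log(y), 1)   # slope = b, intercept = log(a)
print("estimated a:", np.exp(log_a), "estimated b:", b)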
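A minimal sketch of variable reduction with PCA (using scikit-learn, an assumption) and of turning a categorical variable into dummies with pandas; the data is made up:

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data set: two correlated numeric variables and one categorical
# variable that will be turned into dummy (0/1) columns.
df = pd.DataFrame({"height": [1.60, 1.75, 1.82, 1.68],
                   "weight": [55, 72, 85, 63],
                   "gender": ["F", "M", "M", "F"]})

# Reducing the number of variables with PCA: keep one principal component.
pca = PCA(n_components=1)
component = pca.fit_transform(df[["height", "weight"]])
print(component)

# Turning a categorical variable into dummies.
dummies = pd.get_dummies(df["gender"], prefix="gender")
print(dummies)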
Step 4: Exploratory Data Analysis
Exploratory analysis consists of the methods used to understand the trends in the data and the relationships among the variables. It can be performed with several different techniques, such as summary statistics, histograms, boxplots, bar charts, and scatter plots. Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
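A minimal EDA sketch (Python with pandas and matplotlib, hypothetical data) combining summary statistics with two of these plots:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data set used to illustrate basic exploratory data analysis.
df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 29, 41, 38],
                   "income": [28, 45, 40, 80, 66, 35, 58, 50]})

print(df.describe())   # summary statistics per variable

df["income"].plot(kind="hist", title="Income distribution")
plt.show()

df.plot(kind="scatter", x="age", y="income", title="Age vs income")
plt.show()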
Step 5: Build the models
Building a model is an iterative process that involves selecting the variables for the model,
executing the model, and model diagnostics.
Model and variable selection
To select the variables, ask yourself the following questions:
o Must the model be moved to a production environment and, if so, would it be easy to implement?
o How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
o Does the model need to be easy to explain?
Model Execution
There are number mechanisms are available from both statistical and machine learning domain.
First one is the linear Regression: This is the normal line fitting mechanism to extrapolate the value
or doing the prediction. Normally, we solve the equation y = mx+b and try to find the value of m
and b with respect to the given dataset.
Model fit: It is essential to check how well the model fits the given data. One indication is that the difference between the R-squared value and the adjusted R-squared value should be minimal; if it is, the model is considered a suitable fit for the given dataset.
The predictor variables have coefficients; these can change when more samples are added to the system.
The p-value indicates the significance of a predictor for the target variable; it should typically be less than 0.05. It is another factor used to study the contribution of each predictor to the prediction.
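A minimal sketch (Python with statsmodels, an assumption, on synthetic data) fitting y = mx + b and inspecting R-squared, adjusted R-squared, and predictor p-values:

import numpy as np
import statsmodels.api as sm

# Hypothetical data roughly following y = 2x + 1 with noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

X = sm.add_constant(x)            # adds the intercept term b
model = sm.OLS(y, X).fit()

print(model.params)                          # estimated b (intercept) and m (slope)
print(model.rsquared, model.rsquared_adj)    # should be close to each other
print(model.pvalues)                         # predictor significance (want p < 0.05)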
The second one is for classification: K-Nearest Neighbor (KNN) is one of the supervised classification mechanisms. In this classifier, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
o The confusion matrix is an error matrix used to represent the accuracy of the predictions: it tabulates how many observations of each actual class were assigned to each predicted class.
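A minimal sketch (Python with scikit-learn, an assumption, on synthetic two-class data) that trains a KNN classifier and prints its confusion (error) matrix:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-class data: each observation has two features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Confusion (error) matrix: rows are actual classes, columns are predictions.
print(confusion_matrix(y_test, y_pred))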
Model diagnostics and model comparison
o Mean squared error (MSE) is a standard measure used to assess the prediction accuracy of a model: MSE = (1/n) * sum((y_i - y_hat_i)^2). If it is small, the predictions are close to the actual values and the model’s accuracy is high, which makes MSE useful for comparing competing models.
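A minimal sketch (Python with numpy, made-up numbers) comparing two models by their mean squared error:

import numpy as np

# Hypothetical actual values and predictions from two competing models.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.8, 5.1, 7.3, 8.9])
pred_b = np.array([2.0, 6.0, 6.0, 10.0])

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# The model with the smaller mean squared error fits the data better.
print("model A:", mse(y_true, pred_a))
print("model B:", mse(y_true, pred_b))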
Step 6: Presenting findings and building applications on top of them.
o Present the visualizations as per your requirements.
o The visualization techniques are discussed in detail under exploratory data analysis.
o Where possible, make the visualizations update automatically when the dataset changes or is modified.
CONNECT WITH US
WEBSITE: www.eduengineering.in
TELEGRAM: @eduengineering