Data Science Introduction
Data Science
Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data. It enables you to translate a business problem into a research project and then
translate the results back into a practical solution. Data science is the process of deriving knowledge and
insights from large and diverse sets of data by organizing, processing, and analyzing the data. It
involves many different disciplines, such as mathematical and statistical modeling, extracting data from its
source, and applying data visualization techniques. Data science is closely related to statistics, data mining,
deep learning, and big data. The fundamental goal of data science is to help companies make quicker and
better decisions, which can take them to the top of their market or, in the most fiercely competitive markets,
be a matter of long-term survival.
Statistics:
Statistics is the most critical component of data science. It is the science of collecting and
analyzing numerical data in large quantities to obtain useful insights.
Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand,
digestible visuals.
Machine Learning:
Machine learning is the building and study of algorithms that learn to make predictions
about unseen or future data.
Deep Learning:
Deep learning is a newer area of machine learning research in which algorithms are applied to
handle huge amounts of data.
Is it a Data Science problem?
Identify patterns
Identify anomalies
Show correlations
Predict outcomes
Applications:
1. Healthcare
The healthcare industry benefits greatly from data science, which is making huge strides across a
variety of sectors in health care.
Drug Development
From the first screening of medicinal compounds through the prediction of the success rate based on
biological variables, data science applications and machine learning algorithms simplify and shorten this
process, bringing a new viewpoint to each stage. Instead of "lab tests," these algorithms can predict how
the chemical will behave in the body using extensive mathematical modelling and simulations. The goal
of computational drug discovery is to construct computer model simulations in the form of a
physiologically appropriate network, which makes it easier to anticipate future events with high accuracy.
Basic healthcare help may be provided via AI-powered smartphone apps, which are often chatbots. You
just explain your symptoms or ask questions, and you'll get vital information about your health condition
gleaned from a vast network of symptoms and effects. Apps can help you remember to take your
medication on time and, if required, schedule a doctor's visit.
2. Targeted Advertising
If you thought search was the most important data science use, consider the whole digital marketing
spectrum. Data science algorithms are used to decide virtually everything, from display banners on
various websites to digital billboards at airports. This is why digital advertisements achieve a far higher
CTR (click-through rate) than conventional marketing: they can be tailored to a user's previous actions.
That is why you may see advertisements for data science training programs while someone else
sees an advertisement for apparel in the same space at the same time.
3. Website Recommendations
Many businesses use recommendation engines to promote their products based on user interests
and the relevance of information. Internet companies such as Amazon, Twitter, Google Play,
Netflix, LinkedIn, IMDb, and many more use this method to improve the user experience.
4. E-Commerce
The e-commerce sector benefits greatly from data science techniques and machine learning ideas such as
natural language processing (NLP) and recommendation systems. E-commerce platforms can use such
approaches to analyze consumer purchases and comments in order to gain valuable information
for their business development.
5. Transport
In the field of transportation, the most significant breakthrough or evolution that data science has brought
us is the introduction of self-driving automobiles. Through a comprehensive study of fuel usage trends,
driver behavior, and vehicle tracking, data science has established a foothold in transportation.
It is creating a reputation for itself by making driving situations safer for drivers, improving car
performance, giving drivers more autonomy, and much more. Vehicle makers can build smarter vehicles
and improve logistical routes by using reinforcement learning and introducing autonomy.
Airline Route Planning
The airline industry has a reputation for persevering in the face of adversity, yet only a few
airline service providers manage to maintain their occupancy ratios and operating profits.
Skyrocketing air-fuel costs and the pressure to offer heavy discounts to customers have made
this even harder. It wasn't long before airlines began employing data science to identify the most
important areas for improvement. Airlines can use data science to make strategic changes such as
anticipating flight delays, selecting which aircraft to buy, planning routes and layovers, and developing
marketing tactics such as customer loyalty programmes.
6. Customer Insights
Data on your clients can offer a lot of information about their behaviors, demographics, interests,
aspirations, and more. With so many possible sources of consumer data, a basic grasp of data science
helps in making sense of it. For example, you may collect information on a customer every time they visit
your website or physical store, add an item to their basket, make a purchase, read an email, or interact
with a social media post. After you've double-checked that the data from each source is correct, you'll
need to integrate it in a process known as data wrangling.
7. Augmented Reality
This is the last of the data science applications, and it appears to have great potential in the
future. Augmented reality is one of the most exciting uses of technology. Because a VR headset
combines computing expertise, algorithms, and data to provide the best viewing experience,
data science and virtual reality are closely connected. The popular game Pokemon GO
is a modest step in this direction: players can wander about and see Pokemon superimposed on walls,
streets, and other objects that are not really there. To determine the locations of the Pokemon and gyms,
the game's designers used data from Ingress, the company's previous application. Data science in this
area will make even more sense once this hardware becomes more affordable and consumers begin to
use it the way they use other applications.
SPONSOR
The sponsor is responsible for deciding whether the project is a success or failure.
CLIENT
The client is the role that represents the model’s end users’ interests.
Generally the client belongs to a different group in the organization and has other responsibilities
beyond your project.
The client is more hands-on than the sponsor; they’re the interface between the technical details
of building a good model and the day-to-day work process into which the model will be
deployed. They aren’t necessarily mathematically or statistically sophisticated, but are familiar
with the relevant business processes and serve as the domain expert on the team.
DATA SCIENTIST
The data scientist is responsible for taking all necessary steps to make the project succeed.
They set the project strategy, design the project steps, pick the data sources, and pick the tools
and techniques to be used.
They are also responsible for project planning and tracking.
The data scientist also looks at the data, performs statistical tests and procedures, applies machine
learning models, and evaluates results: the science portion of data science.
DATA ARCHITECT
The data architect is responsible for all of the data and its storage, and sometimes manages data collection.
Data architects often manage data warehouses for many different projects.
OPERATIONS
The operations role is critical both in acquiring data and delivering the final results.
The person filling this role usually has operational responsibilities outside of the data science
group. For example, if you're deploying a data science result that affects how products are sorted
on an online shopping site, then the person responsible for running the site will have a lot to say
about how such a thing can be deployed.
The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.
The lifecycle has six phases, and project work can occur in several phases at once. For most phases in the
lifecycle, the movement can be either forward or backward. This iterative depiction of the lifecycle is
intended to more closely portray a real project, in which aspects of the project move forward and may
return to earlier stages as new information is uncovered and team members learn more about various
stages of the project. This enables participants to move iteratively through the process and drive toward
operationalizing the project work.
What are they doing to solve the problem now, and why isn’t that good enough?
What resources will you need: what kind of data and how much staff?
Will you have domain experts to collaborate with, and what are the computational resources?
How do the project sponsors plan to deploy your results? What are the constraints that have to be
met for successful deployment?
Example: The ultimate business goal is to reduce the bank’s losses due to bad loans.
Credit.history
Savings.Account.or.bonds (balance/amount)
Present.employment.since
Installment.rate.in.percentage.of.disposable.income
Personal.status.and.sex
Present.residence.since
Age.in.years
Other.installment.plans (other loans/lines of credit—the type)
Number.of.dependents
The data science task is to extract useful insights from this data in order to achieve that goal.
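The sketch below shows how a team might begin exploring such a loan dataset in Python. It is a minimal sketch only: the file name credit_data.csv, the outcome column Good.Loan, and the choice of a random forest classifier are illustrative assumptions, not part of the example above.

```python
# A minimal sketch, assuming the loan attributes listed above are available in a
# CSV file named "credit_data.csv" with a binary "Good.Loan" outcome column.
# The file name, outcome column, and model choice are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_data.csv")

# Encode categorical attributes such as Credit.history and Personal.status.and.sex
X = pd.get_dummies(df.drop(columns=["Good.Loan"]))
y = df["Good.Loan"]

# Hold out part of the data so the model is judged on loans it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```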
Phase 1- Discovery: In Phase 1, the team learns the business domain, including relevant history
such as whether the organization or business unit has attempted similar projects in the past from
which they can learn. The team assesses the resources available to support the project in terms of
people, technology, time, and data. Important activities in this phase include framing the business
problem as an analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.
• Phase 2-Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the
team can work with data and perform analytics for the duration of the project. The team needs to
execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into
the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed
in the ETLT process so the team can work with it and analyze it. In this phase, the team also
needs to familiarize itself with the data thoroughly and take steps to condition the data.
• Phase 3-Model planning: Phase 3 is model planning, where the team determines the methods,
techniques, and workflow it intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.
• Phase 4-Model building: In Phase 4, the team develops data sets for testing, training, and
production purposes. In addition, in this phase the team builds and executes models based on the
work done in the model planning phase. The team also considers whether its existing tools will
suffice for running the models, or if it will need a more robust environment for executing models
and work flows (for example, fast hardware and parallel processing, if applicable).
• Phase 5-Communicate results: In Phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria developed in
Phase 1. The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey findings to stakeholders.
• Phase 6-Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a production
environment.
Phase 1: Discovery
The data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project.
Resources:
As part of the discovery phase, the team needs to assess the resources available to support the
project.
In this context, resources include technology, tools, systems, data, and people.
Does the requisite level of expertise exist within the organization today, or will it need to be
cultivated?
The team will need to determine whether it must collect additional data, purchase it from outside
sources, or transform existing data.
Ensure the project team has the right mix of domain experts, customers, analytic talent, and
project management to be effective.
Framing the Problem :
It is crucial to state the analytics problem, as well as why and to whom it is important
It is important to identify the main objectives of the project, what needs to be achieved in
business terms, and what needs to be done to meet those needs.
It is best practice to share the statement of goals and success criteria with the team and confirm
alignment with the project sponsor's expectations.
Establishing criteria for both success and failure helps the participants to avoid unproductive
effort and remain aligned with the project sponsors
Identifying Key Stakeholders:
Important step is to identify the key stakeholders and their interests in the project.
During these discussions, the team can identify the success criteria, key risks, and stakeholders
When interviewing stakeholders, learn about the domain area and any relevant history from
similar analytics projects.
Depending on the number of stakeholders and participants, the team may consider outlining the
type of activity and participation expected from each stakeholder and participant.
This will set clear expectations with the participants and avoid delays later
Interviewing the Analytics Sponsor:
The team should plan to collaborate with the stakeholders to clarify and frame the analytics
problem.
Sponsors may have a predetermined solution that does not necessarily realize the desired
outcome.
In these cases, the team must use its knowledge and expertise to identify the true underlying problem.
Prepare for the interview; draft questions, and review with colleagues.
Document what the team heard, and review it with the sponsors.
Developing Initial Hypotheses:
This step involves forming ideas that the team can test with data.
In this way, the team can compare its answers with the outcome of an experiment or test to
generate additional possible solutions to problems
Another part of this process involves gathering and assessing hypotheses from stakeholders and
domain experts who may have their own perspective on what the problem is, what the solution
should be, and how to arrive at a solution.
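A minimal sketch of testing one initial hypothesis against data follows. The hypothetical customer file, the channel and spend columns, and the two-sample t-test are all illustrative choices rather than anything prescribed above.

```python
# A minimal sketch of testing an initial hypothesis with data, e.g.
# "customers acquired through channel A spend more than those from channel B".
# The file name and the "channel"/"spend" columns are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("customers.csv")  # hypothetical source file

spend_a = df.loc[df["channel"] == "A", "spend"]
spend_b = df.loc[df["channel"] == "B", "spend"]

# Two-sample t-test: a small p-value suggests the observed difference in
# mean spend is unlikely if there were no real difference between the groups.
t_stat, p_value = stats.ttest_ind(spend_a, spend_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```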
Phase 2: Data Preparation
To get the data into the sandbox, the team needs to perform ETLT, a combination of
extracting, transforming, and loading data into the sandbox. Once the data is in the
sandbox, the team needs to learn about the data and become familiar with it.
The team may perform data visualizations to help team members understand the data,
including its trends, outliers, and relationships among data variables. This step involves:
o Preparing the Analytic Sandbox
o Performing ETLT
o Learning About the Data
o Data Conditioning
o Survey and Visualize
o Common Tools for the Data Preparation Phase
Preparing the Analytic Sandbox
When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as
team members need access to high volumes and varieties of data for a Big Data analytics project.
This can include everything from summary-level aggregated data, structured data, and raw
data feeds to unstructured text data from call logs or web logs, depending on the kind
of analysis the team plans to undertake.
Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that
are less commonly used in organizations.
Sandbox size can vary greatly depending on the project. A good rule is to plan for the sandbox to
be at least 5-10 times the size of the original data sets, partly because copies of the data may be
created that serve as specific tables or data stores for specific kinds of analysis in the project.
Performing ETLT
In ETL, users perform extract, transform, load processes to extract data from a datastore, perform
data transformations, and load the data back into the datastore.
In ELT, the data is extracted in its raw form and loaded into the datastore, where analysts
can choose to transform the data into a new state or leave it in its original, raw condition.
Using ETL, outliers may be filtered out or transformed and cleaned before being loaded into the
datastore.
Moving large amounts of data is sometimes referred to as Big ETL. The data movement can be
parallelized by technologies such as Hadoop or MapReduce
As part of the ETLT step, it is advisable to make an inventory of the data and compare the data
currently available with datasets the team needs (Gap Analysis).
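A minimal sketch of the ELT flavor of ETLT described above follows, using pandas with SQLite standing in for the analytic sandbox; the file, database, table, and column names are illustrative assumptions.

```python
# A minimal ELT sketch: extract raw data, load it unmodified into a sandbox
# store (SQLite stands in for the analytic sandbox), then transform it there.
# File, database, table, and column names are illustrative assumptions.
import sqlite3
import pandas as pd

# Extract: read the raw feed as-is
raw = pd.read_csv("raw_weblogs.csv")

# Load: land the untransformed data in the sandbox
conn = sqlite3.connect("sandbox.db")
raw.to_sql("weblogs_raw", conn, if_exists="replace", index=False)

# Transform (inside the sandbox): analysts can now clean or reshape copies
# of the raw table without touching the original feed
clean = pd.read_sql("SELECT * FROM weblogs_raw", conn)
clean = clean.dropna(subset=["user_id"])  # drop rows missing a key field
clean.to_sql("weblogs_clean", conn, if_exists="replace", index=False)
conn.close()
```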
Learning About the Data:
A critical aspect of a data science project is to become familiar with the data itself
Clarifies the data that the data science team has access to at the start of the project
Highlights gaps by identifying datasets within an organization that the team may find useful
Identifies datasets outside the organization that may be useful to obtain, through open APIs, data
sharing, or purchasing data to supplement already existing datasets
Data Conditioning:
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing
transformations on the data.
Data conditioning can involve many complex steps to join or merge data sets or otherwise get
datasets into a state that enables analysis in further phases.
What are the data sources? What are the target fields (for example, columns of the tables)?
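A minimal data-conditioning sketch with pandas follows, covering cleaning, normalizing, and merging; the files, columns, and join key are illustrative assumptions.

```python
# A minimal data-conditioning sketch: clean, normalize, and merge two datasets.
# File names, column names, and the join key are illustrative assumptions.
import pandas as pd

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Clean: drop duplicate rows and fill obviously missing numeric values
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(0)

# Normalize: rescale amounts to the 0-1 range so later analysis compares fields fairly
orders["amount_norm"] = (orders["amount"] - orders["amount"].min()) / (
    orders["amount"].max() - orders["amount"].min()
)

# Merge: join the two sources on a shared key so they can be analyzed together
conditioned = orders.merge(customers, on="customer_id", how="left")
print(conditioned.head())
```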
Survey and Visualize:
After the team has collected and obtained at least some of the datasets needed for the subsequent
analysis, a useful step is to leverage data visualization tools to gain an overview of the data.
Seeing high-level patterns in the data enables one to understand characteristics about the data
very quickly
Review data to ensure that calculations remained consistent within columns or across tables for a
given data field.
Does the data distribution stay consistent over all the data? If not, what kinds of actions should be
taken to address this problem?
Assess the granularity of the data, the range of values, and the level of aggregation of the data.
For time-related variables, are the measurements daily, weekly, monthly?
For geospatial datasets, are state or country abbreviations consistent across the data?
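A minimal sketch of surveying and visualizing a dataset to answer questions like those above follows; the file name and the date and amount columns are illustrative assumptions.

```python
# A minimal sketch of surveying a dataset visually before modeling.
# The file and column names are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("conditioned_data.csv")

# Quick summary: range of values, level of aggregation, missing data
print(df.describe(include="all"))
print(df.isna().sum())

# Histograms reveal distributions and outliers at a glance
df.hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()

# For time-related fields, plotting over time shows whether the
# distribution stays consistent across the whole dataset
df.groupby("date")["amount"].mean().plot(title="Mean amount per day")
plt.show()
```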
Common Tools for the Data Preparation Phase:
Hadoop: can perform massively parallel and custom analysis for web traffic parsing, GPS
location analytics, and genomic analysis.
Alpine Miner: provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as data-mining techniques.
OpenRefine (formerly called Google Refine): "a free, open source, powerful tool for working
with messy data." It is a popular GUI-based tool.
Data Wrangler: an interactive tool for data cleaning and transformation. Wrangler was
developed at Stanford University and can be used to perform many transformations on a given
dataset.
MODEL PLANNING
Data science team identifies candidate models to apply to the data for clustering, classifying, or
finding relationships in the data depending on the goal of the project
o Ensure that the analytical techniques enable the team to meet the business objectives and
accept or reject the working hypotheses.
Data Exploration and Variable Selection
The objective of the data exploration is to understand the relationships among the variables,
to inform selection of the variables and methods, and to understand the problem domain.
The key to this approach is to aim for capturing the most essential predictors and variables
rather than considering every possible variable that people think may influence the
outcome.
Model Selection
The team's main goal is to choose an analytical technique, or a short list of candidate techniques,
based on the end goal of the project.
In the case of machine learning and data mining, these rules and conditions are grouped into
several general sets of techniques, such as classification, association rules, and clustering.
Teams create the initial models using a statistical software package such as R, SAS, or Matlab.
Although these tools are designed for data mining and machine learning algorithms, they may
have limitations when applying the models to very large datasets, as is common with Big Data.
Does the model appear valid and accurate on the test data?
Does the model output/behavior make sense to the domain experts? That is, does it appear as if
the model is giving answers that make sense in this context?
Do the parameter values of the fitted model make sense in the context of the domain?
Is a different form of the model required to address the business problem? If so, go back to the
model planning phase and revise the modeling approach.
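A minimal sketch of how a team might compare a short list of candidate techniques during model planning follows, using cross-validation in scikit-learn; synthetic data stands in for the project's prepared dataset, and the three candidate models are illustrative choices.

```python
# A minimal sketch of comparing candidate techniques with cross-validation.
# Synthetic data stands in for the prepared dataset from the sandbox.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # stand-in data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "naive_bayes": GaussianNB(),
}

# Five-fold cross-validation gives a rough, comparable accuracy per technique
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```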
COMMON TOOLS FOR THE MODEL PLANNING PHASE
R has a complete set of modeling capabilities and provides a good environment for building
interpretive models with high-quality code.
SQL Analysis Services can perform in-database analytics of common data mining functions,
involved aggregations, and basic predictive models.
SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data
connectors such as ODBC, JDBC, and OLE DB.
MODEL BUILDING
The data science team needs to develop data sets for training, testing, and production purposes.
These data sets enable the data scientist to develop the analytical model and train it while holding
aside some of the data for testing the model.
During this phase, users run models from analytical software packages, such as R or SAS, on file
extracts and small data sets for testing purposes. On a small scale, assess the validity of the model
and its results.
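A minimal sketch of the three kinds of data sets mentioned above (training, testing, and a production-like scoring set) follows; synthetic data and the logistic regression model are illustrative stand-ins for the project's real data and chosen model.

```python
# A minimal sketch: a training set to fit the model, a test set held aside to
# judge it, and an unlabeled "production-like" scoring set standing in for the
# data the deployed model will actually see. Synthetic data replaces real data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1200, n_features=12, random_state=0)

# Reserve 200 rows (labels withheld) to mimic future production traffic
X_model, X_production = X[:1000], X[1000:]
y_model = y[:1000]

# Split the remaining labeled data into training and testing portions
X_train, X_test, y_train, y_test = train_test_split(
    X_model, y_model, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # build the model
print("Test accuracy:", model.score(X_test, y_test))   # assess on held-out data
print("Sample production scores:", model.predict(X_production[:5]))
```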
Common Tools for the Model Building Phase
SAS Enterprise Miner allows users to run predictive and descriptive models based on large
volumes of data from across the enterprise.
SPSS Modeler offers methods to explore and analyze data through a GUI.
Matlab provides a high-level language for performing a variety of data analytics, algorithms, and
data exploration
Alpine Miner provides a GUI front end for users to develop analytic work flows and interact
with Big Data tools and platforms on the back end.
STATISTICA and Mathematica are popular and well-regarded data mining and analytics
tools.
Open Source tools:
Octave: a free software programming language for computational modeling, has some of the
functionality of Matlab.
Python is a programming language that provides toolkits for machine learning and analysis, such
as scikit-learn, numpy, scipy, pandas, and related data visualization using matplotlib
COMMUNICATE RESULTS
After executing the model, the team needs to compare the outcomes of the modeling to the
criteria established for success and failure.
When conducting this assessment, determine if the results are statistically significant and valid
The best practice in this phase is to record all the findings and then select the three most
significant ones that can be shared with the stakeholders.
The team will have documented the key findings and major insights derived from the analysis.
OPERATIONALIZE
The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem
of users.
This allows the team to learn from the deployment and make any needed adjustments before
launching the model across the enterprise.
The presentation needs to include supporting information about analytical methodology and data
sources
Creating a mechanism for performing ongoing monitoring of model accuracy and, if accuracy
degrades, finding ways to retrain the model.
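A minimal sketch of such a monitoring mechanism follows; the accuracy threshold and the idea of retraining on accumulated history are illustrative assumptions, not a prescribed procedure.

```python
# A minimal monitoring sketch: score recent labeled data and retrain when
# accuracy falls below an agreed threshold. The threshold value and the
# retraining-on-history strategy are illustrative assumptions.
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # assumed value agreed with stakeholders in Phase 1

def monitor_and_maybe_retrain(model, X_recent, y_recent, X_history, y_history):
    """Check current accuracy on recent data; retrain on history if it degrades."""
    current_accuracy = accuracy_score(y_recent, model.predict(X_recent))
    print(f"Current accuracy: {current_accuracy:.3f}")

    if current_accuracy < ACCURACY_THRESHOLD:
        print("Accuracy degraded below threshold - retraining the model")
        model.fit(X_history, y_history)  # refit on accumulated labeled data
    return model
```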
Expectation of different stakeholders at the conclusion of a project
Business User typically tries to determine the benefits and implications of the findings to the
business.
Project Sponsor typically asks questions related to the business impact of the project, the risks
and return on investment (ROI)
Project Manager needs to determine if the project was completed on time and within budget and
whether the goals were met.
Data Engineer and Database Administrator (DBA) typically need to share their code from the
analytics project and create a technical document on how to implement it.
Data Scientist needs to share the code and explain the model to managers, and other stakeholders.
ANALYTICAL PLAN
Initial Hypotheses
Business impact
KEY DELIVERABLES OF ANALYTICS PROJECT
Developing Core Material for Multiple Audiences
Project Goals
Main Findings
Approach
Model Description
Model Details
Keep it short.
Your project sponsor might use your presentation to help sell the project or its results to the
rest of the organization.
Introduce your results early in the presentation, rather than building up to them.
The team should anticipate questions from IT related to how computationally expensive it
will be to run the model in the production environment.
Teams should approach writing technical documentation for their code as if it were an
application programming interface (API).
Model evaluation and critique
Once you have a model, you need to determine if it meets your goals:
Is it accurate enough for your needs? Does it generalize well?
Does it perform better than “the obvious guess”?
Better than whatever estimate you currently use?
Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem
domain?
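A minimal sketch of the "better than the obvious guess" check follows, comparing a fitted model against a majority-class baseline on the same held-out data; synthetic data and the specific models are illustrative stand-ins.

```python
# A minimal sketch: does the model beat the obvious guess (always predict the
# most frequent class)? Synthetic data stands in for the project's dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=10, weights=[0.7, 0.3], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Obvious-guess accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:        ", model.score(X_test, y_test))
```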
Confusion Matrix:
The confusion matrix is a technique used to measure the performance of classification models.
It is a tabular layout in which each row represents an actual class and each column represents a
predicted class.
Accuracy
Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples.
Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
Precision
Precision is defined as the ratio of true positives to all predicted positives.
Precision = True Positive / (True Positive + False Positive)
Recall or True Positive Rate is defined as the ratio of true positives to the total number of true positives
and false negatives.
Recall = True Positive / (True Positive + False Negative)
The F1 score is a single measure that combines precision and recall.
F1 Score = (2 * Precision * Recall) / (Precision + Recall)
Misclassification Rate or Error Rate:
It is defined as the ratio of the total number of misclassified samples to the total number of samples.
Error Rate = (False Positive + False Negative) / (True Positive + True Negative + False Positive + False Negative)
Related ratios:
Sensitivity (Recall or True Positive Rate) = TP / (TP + FN)
Specificity (True Negative Rate) = TN / (TN + FP)
False Positive Rate = FP / (FP + TN)
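A minimal sketch computing the metrics above from a confusion matrix follows; the actual and predicted labels are made-up illustrative values.

```python
# A minimal sketch computing the metrics above from a confusion matrix.
# The actual and predicted labels are made-up illustrative values.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # true positive rate / sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (fp + fn) / (tp + tn + fp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} f1={f1:.2f} error_rate={error_rate:.2f}")
```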
To better understand the market drivers related to Big Data, it is helpful to first understand some past
history of data stores and the kinds of repositories and tools to manage these data stores.
The data now comes from multiple sources, such as these:
Medical information, such as genomic sequencing and diagnostic imaging
• Photos and video footage uploaded to the World Wide Web
• Video surveillance, such as the thousands of video cameras spread across a city
• Mobile devices, which provide geospatial location data of the users, as well as metadata about text
messages, phone calls, and application usage on smart phones
• Smart devices, which provide sensor-based collection of information from smart electric grids,
smart buildings, and many other public and industry infrastructures
• Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS
navigation systems, and seismic processing
Prepared By
D.Deva Hema