
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DATA SCIENCE INTRODUCTION AND APPLICATION

Data Science

Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data. Data science enables you to translate a business problem into a research project and
then translate it back into a practical solution. Data science is the process of deriving knowledge and
insights from a huge and diverse set of data through organizing, processing, and analyzing the data. It
involves many different disciplines, such as mathematical and statistical modeling, extracting data from
its source, and applying data visualization techniques. Data science is related to statistics, data mining,
deep learning, and big data. The fundamental goal of data science is to help companies make quicker and
better decisions, which can take them to the top of their market or, especially in the toughest red oceans,
be a matter of long-term survival.

     Example:  Good or bad customer identification in banking

Data Science involves the following:

Statistics:

 Statistics is the most critical component of data science. It is the science of collecting and
analyzing numerical data in large quantities to get useful insights.

Visualization:

 Visualization techniques help you present huge amounts of data as easy-to-understand,
digestible visuals.
Machine Learning:

 Machine learning covers the building and study of algorithms that learn to make predictions
about unseen/future data.

Deep Learning:

 Deep learning is a newer area of machine learning research in which the algorithms are applied
to handle huge amounts of data.
Is it a Data Science problem?

A true data science problem may:

 Categorize or group data

 Identify patterns

 Identify anomalies

 Show correlations

 Predict outcomes
Applications:

1.Healthcare 

The healthcare industry, in particular, benefits greatly from data science applications. Data science is
making huge strides in the healthcare business. Data science is used in a variety of sectors in health care.

 Image Analysis in Medicine 

 Genetics and Genomics 

 Drug Development

 Virtual Assistants and Health bots

Medical Image Analysis


 Procedures such as detecting malignancies, artery stenosis, and organ delineation use a variety of
methodologies and frameworks, such as MapReduce, to discover optimal parameters for tasks like lung
texture classification. For solid texture classification, machine learning techniques such as support
vector machines (SVM), content-based medical image indexing, and wavelet analysis are used.
 
Genetics & Genomics
 
Through genetics and genomics research, Data Science applications also offer a higher level of therapy
customization. The objective is to discover specific biological links between genetics, illnesses, and
medication response in order to better understand the influence of DNA on human health. 
Drug Development 

From the first screening of medicinal compounds through the prediction of the success rate based on
biological variables, data science applications and machine learning algorithms simplify and shorten this
process, bringing a new viewpoint to each stage. Instead of "lab tests," these algorithms can predict how
the chemical will behave in the body using extensive mathematical modelling and simulations. The goal
of computational drug discovery is to construct computer model simulations in the form of a
physiologically appropriate network, which makes it easier to anticipate future events with high accuracy.

Virtual Assistants and Health bots 

Basic healthcare help may be provided via AI-powered smartphone apps, which are often chatbots. You
just explain your symptoms or ask questions, and you'll get vital information about your health condition
gleaned from a vast network of symptoms and effects. Apps can help you remember to take your
medication on time and, if required, schedule a doctor's visit.

2.Targeted Advertising
If you thought search was the most important data science use, consider the whole digital marketing
spectrum. Data science algorithms are used to determine virtually everything, from display banners on
various websites to digital billboards at airports. This is why digital advertisements achieve a far higher
CTR (click-through rate) than conventional marketing: they can be tailored to a user's previous actions.
That's the reason why you may see advertisements for Data Science Training Programs while someone
else would see an advertisement for apparel in the same place at the same time.
 
3.Website Recommendations
 
Many businesses use recommendation engines to promote their products based on user interest
and information relevance. Internet companies such as Amazon, Twitter, Google Play, Netflix, LinkedIn,
IMDb, and many more use this method to improve the user experience.

4.E-Commerce
 The e-commerce sector benefits greatly from data science techniques and machine learning ideas such as
natural language processing (NLP) and recommendation systems. E-commerce platforms may use such
approaches to analyze consumer purchases and comments in order to gain valuable information for their
business development.
5. Transport
 
In the field of transportation, the most significant breakthrough or evolution that data science has brought
us is the introduction of self-driving automobiles. Through a comprehensive study of fuel usage trends,
driver behavior, and vehicle tracking, data science has established a foothold in transportation. 
 
It is creating a reputation for itself by making driving situations safer for drivers, improving car
performance, giving drivers more autonomy, and much more. Vehicle makers can build smarter vehicles
and improve logistical routes by using reinforcement learning and introducing autonomy. 
 Airline Route Planning
 
The airline industry has a reputation for persevering in the face of adversity, yet only a few
airline service providers manage to maintain their occupancy ratios and operating profits.
Skyrocketing air-fuel costs and the pressure to offer discounts to customers have made this
even harder. It wasn't long before airlines began employing data science to identify the most
important areas for improvement. Airlines can use data science to make strategic decisions
such as anticipating flight delays, selecting which aircraft to buy, planning routes and layovers,
and developing marketing tactics such as a customer loyalty programme.

6.Text and Advanced Image Recognition


 Speech and image recognition are dominated by data science algorithms, and we can see their work in
our daily lives. Have you ever used a virtual speech assistant like Google Assistant, Alexa, or Siri?
Behind the scenes, speech recognition technology works to comprehend and evaluate your words and
deliver useful results. Image recognition may be found on Facebook, Instagram, and Twitter, among
other social media platforms. When you post a photo of yourself with someone on your profile, these
applications offer to identify them and tag them.
7.Gaming
 
Machine learning algorithms are increasingly used to create games that grow and upgrade as the player
progresses through the levels. In motion gaming, your opponent (computer) also studies your past actions
and adjusts its game appropriately. EA Sports, Zynga, Sony, Nintendo, and Activision-Blizzard have all
used data science to take gaming to the next level.
 
8.Security
 
Data science may be utilized to improve your company's security and secure critical data. Banks, for
example, utilize sophisticated machine-learning algorithms to detect fraud based on a user's usual
financial activity. Because of the massive amount of data created every day, these algorithms can detect
fraud faster and more accurately than people. Even if you don't work at a financial institution, such
algorithms can be used to secure confidential material. 

9.Customer Insights
 
Data on your clients may offer a lot of information about their behaviors, demographics, interests,
aspirations, and more. With so many possible sources of consumer data, a basic grasp of data science may
assist in making sense of it. For example, you may collect information on a customer every time they visit
your website or physical store, add an item to their basket, make a purchase, read an email, or interact
with a social network post. After you've double-checked that the data from each source is correct, you'll
need to integrate it in a process known as data wrangling.

10.Augmented Reality
 
This is the last of the data science applications discussed here, and one that appears to have great
potential in the future. Augmented reality is one of the most exciting uses of technology. Data science
and virtual reality are connected because a VR headset incorporates computing expertise, algorithms, and
data to provide you with the best viewing experience. The popular game Pokemon GO is a modest step in
this direction: it lets you wander about and gaze at Pokemon superimposed on walls, streets, and other
real-world objects. To determine the locations of the Pokemon and gyms, the game's designers used data
from Ingress, the company's previous application. Data science will make even more sense once VR
hardware becomes more affordable and consumers begin to use it in the same way they do other
applications.

The roles in a data science project


 Project sponsor : represents the business interests; champions the project

 Client : represents the end user


 Data scientist : sets and executes analytic strategy; communicates with sponsor and client

 Data architect : manages data and data storage; sometimes manages data collection

 Operations: manages infrastructure; deploys final project results


PROJECT SPONSOR
 The sponsor is the person who wants the data science result.

 The sponsor is responsible for deciding whether the project is a success or failure.

CLIENT
 The client is the role that represents the model’s end users’ interests.

 Generally the client belongs to a different group in the organization and has other responsibilities
beyond your project.
 The client is more hands-on than the sponsor; they're the interface between the technical details
of building a good model and the day-to-day work process into which the model will be
deployed.

 They aren't necessarily mathematically or statistically sophisticated, but they are familiar with the
relevant business processes and serve as the domain expert on the team.
DATA SCIENTIST
 Data scientist is responsible for taking all necessary steps to make the project succeed.

 Responsible for setting the project strategy: they design the project steps, pick the data sources,
and pick the tools and techniques to be used.

 They're also responsible for project planning and tracking.

 The data scientist also looks at the data, performs statistical tests and procedures, applies machine
learning models, and evaluates results—the science portion of data science.
DATA ARCHITECT
 The data architect is responsible for all of the data and its storage. 
 Data architects often manage data warehouses for many different projects
OPERATIONS
 The operations role is critical both in acquiring data and delivering the final results.

 The person filling this role usually has operational responsibilities outside of the data science
group. For example, if you're deploying a data science result that affects how products are sorted
on an online shopping site, then the person responsible for running the site will have a lot to say
about how such a thing can be deployed.

Data Analytics Lifecycle Overview

The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.
The lifecycle has six phases, and project work can occur in several phases at once. For most phases in the
lifecycle, the movement can be either forward or backward. This iterative depiction of the lifecycle is
intended to more closely portray a real project, in which aspects of the project move forward and may
return to earlier stages as new information is uncovered and team members learn more about various
stages of the project. This enables participants to move iteratively through the process and drive toward
operationalizing the project work.

Stages of Data Science project

Defining the goal


 The first task in a data science project is to define a measurable and quantifiable goal.

 Why do the sponsors want the project in the first place? 


 What do they lack, and what do they need?

 What are they doing to solve the problem now, and why isn’t that good enough?
 What resources will you need: what kind of data and how much staff? 

 Will you have domain experts to collaborate with, and what are the computational resources? 

 How do the project sponsors plan to deploy your results? What are the constraints that have to be
met for successful deployment?

 Example: The ultimate business goal is to reduce the bank’s losses due to bad loans.

Data collection and management


This step encompasses identifying the data you need, exploring it, and conditioning it to be suitable for
analysis. This stage is often the most time-consuming step in the process. It’s also one of the most
important: 
What data is available to me?
Will it help me solve the problem?
Is it enough?
Is the data quality good enough?
Loan data attributes:

 Status.of.existing.checking.account (at time of application)

 Duration.in.month (loan length)

 Credit.history

 Purpose (car loan, student loan, etc.)

 Credit.amount (loan amount)

 Savings.Account.or.bonds (balance/amount)

 Present.employment.since

 Installment.rate.in.percentage.of.disposable.income

 Personal.status.and.sex

 Present.residence.since

 Collateral (car, property, etc.)

 Age.in.years
 Other.installment.plans (other loans/lines of credit—the type)

 Housing (own, rent, etc.)

 Job (employment type)

 Number.of.dependents

 Telephone (do they have one)

 Good.Loan (dependent variable)


Modeling
 Finalize the statistics and machine learning during the modeling, or analysis, stage.

 Extracting useful insights from the data in order to achieve your goals.

 The loan application problem is a classification problem (a brief sketch follows the list of modeling tasks below)


The most common data science modeling tasks are these:

 Classification—Deciding if something belongs to one category or another

 Scoring—Predicting or estimating a numeric value, such as a price or probability

 Ranking—Learning to order items by preferences

 Clustering—Grouping items into most-similar groups

 Finding relations—Finding correlations or potential causes of effects seen in the data

 Characterization—Very general plotting and report generation from data
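
The sketch below illustrates the loan classification task in Python, one of the tools listed later in this
material. It assumes a hypothetical file loan_data.csv containing the attributes listed earlier, including
the Good.Loan label; the file name, column names, and choice of model are illustrative, not prescribed
by this material.

```python
# Minimal sketch of the loan classification task.
# Assumes a hypothetical loan_data.csv with the attributes listed earlier,
# including the dependent variable Good.Loan.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("loan_data.csv")                   # hypothetical file name
X = pd.get_dummies(df.drop(columns=["Good.Loan"]))  # one-hot encode categorical attributes
y = df["Good.Loan"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)           # a simple classifier for illustration
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```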

Overview of Data Analytics Lifecycle

Phase 1- Discovery: In Phase 1, the team learns the business domain, including relevant history
such as whether the organization or business unit has attempted similar projects in the past from
which they can learn. The team assesses the resources available to support the project in terms of
people, technology, time, and data. Important activities in this phase include framing the business
problem as an analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.

• Phase 2-Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the
team can work with data and perform analytics for the duration of the project. The team needs to
execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into
the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed
in the ETLT process so the team can work with it and analyze it. In this phase, the team also
needs to familiarize itself with the data thoroughly and take steps to condition the data.

• Phase 3-Model planning: Phase 3 is model planning, where the team determines the methods,
techniques, and workflow it intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.

• Phase 4-Model building: In Phase 4, the team develops data sets for testing, training, and
production purposes. In addition, in this phase the team builds and executes models based on the
work done in the model planning phase. The team also considers whether its existing tools will
suffice for running the models, or if it will need a more robust environment for executing models
and work flows (for example, fast hardware and parallel processing, if applicable).

• Phase 5-Communicate results: In Phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria developed in
Phase 1. The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey findings to stakeholders.

• Phase 6-Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a production
environment.
Phase 1: Discovery
 The data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project.

o Learning the Business Domain

o Resources

o  Framing the Problem

o  Identifying Key Stakeholders

o  Interviewing the Analytics Sponsor

o Developing Initial Hypotheses

o Identifying Potential Data Sources

Learning the Business Domain:


 Understanding the domain area of the problem is essential.
 Data scientists have deep knowledge of the methods, techniques, and ways for applying heuristics
to a variety of business and conceptual problems

Resources:

 As part of the discovery phase, the team needs to assess the resources available to support the
project. 

 In this context, resources include technology, tools, systems, data, and people.

 Does the requisite level of expertise exist within the organization today, or will it need to be
cultivated? 

 The team will need to determine whether it must collect additional data, purchase it from outside
sources, or transform existing data.

 Ensure the project team has the right mix of domain experts, customers, analytic talent, and
project management to be effective.
Framing the Problem :

 Framing is the process of stating the analytics problem to be solved.

 It is crucial to state the analytics problem, as well as why and to whom it is important.

 It is important to identify the main objectives of the project, identify what needs to be achieved in
business terms, and identify what needs to be done to meet those needs.
 It is best practice to share the statement of goals and success criteria with the team and confirm
alignment with the project sponsor's expectations.

 Establishing criteria for both success and failure helps the participants avoid unproductive
effort and remain aligned with the project sponsors.
Identifying Key Stakeholders:
 Important step is to identify the key stakeholders and their interests in the project.
 During these discussions, the team can identify the success criteria, key risks, and stakeholders
 When interviewing stakeholders, learn about the domain area and any relevant history from
similar analytics projects.
 Depending on the number of stakeholders and participants, the team may consider outlining the
type of activity and participation expected from each stakeholder and participant.
 This will set clear expectations with the participants and avoid delays later
Interviewing the Analytics Sponsor:

 The team should plan to collaborate with the stakeholders to clarify and frame the analytics
problem. 

 Sponsors may have a predetermined solution that may not necessarily realize the desired outcome.

 In these cases, the team must use its knowledge and expertise to identify the true underlying problem
and the appropriate solution.


 The data science team typically has a more objective understanding of the problem set than the
stakeholders, who may be suggesting solutions.

Some tips for interviewing project sponsors:

 Prepare for the interview; draft questions, and review with colleagues.

 Use open-ended questions; avoid asking leading questions.

 Probe for details and pose follow-up questions.

 Document what the team heard, and review it with the sponsors. 
Developing Initial Hypotheses:

 This step involves forming ideas that the team can test with data.

 In this way, the team can compare its answers with the outcome of an experiment or test to
generate additional possible solutions to problems

 Another part of this process involves gathering and assessing hypotheses from stakeholders and
domain  experts who may have their own perspective on what the problem is, what the solution
should be, and how to arrive at a solution.

Phase 2: Data Preparation


Identifying Potential Data Sources:
 The team should perform five main activities during this step of the discovery phase:
o Identify data sources
o Capture aggregated data sources
o Review the raw data
o Evaluate the data structures and tools needed
o Determine the sort of data infrastructure needed for this type of problem
 The second phase of the Data Analytics Lifecycle involves data preparation, which
includes the steps to explore, preprocess, and condition data prior to modeling and
analysis.

 To get the data into the sandbox, the team needs to perform ETLT, by a combination of
extracting, transforming, and loading data into the sandbox. Once the data is in the
sandbox, the team needs to learn about the data and become familiar with it.

 The team may perform data visualizations to help team members understand the data,
including its trends, outliers, and relationship among data variables. The step involves
o Preparing the Analytic Sandbox
o Performing ETLT
o Learning About the Data
o Data Conditioning
o Survey and Visualize
o Common Tools for the Data Preparation Phase
Preparing the Analytic Sandbox

 When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as
team members need access to high volumes and varieties of data for a Big Data analytics project.

 The sandbox can include everything from summary-level aggregated data, structured data, raw data
feeds, and unstructured text data from call logs or web logs, depending on the kind of analysis the team
plans to undertake.
 Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that
are less commonly used in organizations.

 Sandbox size can vary greatly depending on the project. A good rule is to plan for the sandbox to
be at least 5-10 times the size of the original data sets, partly because copies of the data may be
created that serve as specific tables or data stores for specific kinds of analysis in the project.
Performing ETLT

 In ETL, users perform extract, transform, load processes to extract data from a datastore, perform
data transformations, and load the data back into the datastore.

 In ELT, by contrast, the data is extracted in its raw form and loaded into the datastore, where analysts
can choose to transform the data into a new state or leave it in its original, raw condition.
 Using ETL,   outliers may be filtered out or transformed and cleaned before being loaded into the
datastore.

 Moving large amounts of data is sometimes referred to as Big ETL. The data movement can be
parallelized by technologies such as Hadoop   or MapReduce

 As part of the ETLT step, it is advisable to make an inventory of the data and compare the data
currently available with datasets the team needs (Gap Analysis).
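
The sketch below shows one possible ELT-style load into an analytic sandbox, here a local SQLite
database; the file, table, and column names are hypothetical and only illustrate the idea of loading raw
data first and transforming it afterwards inside the sandbox.

```python
# Sketch of an ELT-style load into an analytic sandbox (a local SQLite database).
# File, table, and column names are hypothetical.
import sqlite3

import pandas as pd

sandbox = sqlite3.connect("analytic_sandbox.db")

# Extract + Load: bring the raw extract into the sandbox unchanged
raw = pd.read_csv("loans_raw_extract.csv")                # hypothetical raw extract
raw.to_sql("loans_raw", sandbox, if_exists="replace", index=False)

# Transform (after loading): derive a cleaned table inside the sandbox
cleaned = pd.read_sql("SELECT * FROM loans_raw", sandbox)
cleaned = cleaned.dropna(subset=["Credit.amount"])        # hypothetical column
cleaned.to_sql("loans_clean", sandbox, if_exists="replace", index=False)
```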
Learning About the Data:

 A critical aspect of a data science project is to become familiar with the data itself

 Clarifies the data that the data science team has access to at the start of the project

 Highlights gaps by identifying datasets within an organization that the team may find useful

 Identifies datasets outside the organization that may be useful to obtain, through open APIs, data
sharing, or purchasing data to supplement already existing datasets
Data Conditioning:

 Data conditioning refers to the process of cleaning data, normalizing datasets, and performing
transformations on the data.

 Data conditioning can involve many complex steps to join or merge data sets or otherwise get
datasets into a state that enables analysis in further phases.

 What are the data sources? What are the target fields (for example, columns of the tables)?

  How clean is the data?

 How consistent are the contents and files?

 Review the content of data columns or other inputs

 Look for any evidence of systematic error.
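
A small data-conditioning sketch follows, assuming two hypothetical extracts that must be merged,
cleaned, and normalized before analysis; the file names, join key, and columns are illustrative only.

```python
# Sketch of data conditioning: merging two extracts, cleaning, and normalizing
# a numeric field. File, key, and column names are hypothetical.
import pandas as pd

loans = pd.read_csv("loans_clean.csv")         # hypothetical extract
customers = pd.read_csv("customers.csv")       # hypothetical extract

# Join the datasets on a shared key
df = loans.merge(customers, on="customer_id", how="inner")

# Basic cleaning: drop duplicate rows, fill a missing categorical field
df = df.drop_duplicates()
df["Housing"] = df["Housing"].fillna("unknown")

# Normalize a numeric column to the [0, 1] range
amt = df["Credit.amount"]
df["credit_amount_norm"] = (amt - amt.min()) / (amt.max() - amt.min())
```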

Survey and Visualize:

 After the team has collected and obtained at least some of the datasets needed for the subsequent
analysis, a useful step is to leverage data visualization tools to gain an overview of the data. 

 Seeing high-level patterns in the data enables one to understand characteristics about the data
very quickly
 Review data to ensure that calculations remained consistent within columns or across tables for a
given data field.

 Does the data distribution stay consistent over all the data? If not, what kinds of actions should be
taken to address this problem?

 Assess the granularity of the data, the range of values, and the level of aggregation of the data.
 For time-related variables, are the measurements daily, weekly, monthly?

 Is the data standardized/normalized? Are the scales consistent?

 For geospatial datasets, are state or country abbreviations consistent across the data?
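
A quick survey-and-visualize sketch under assumed names: summary statistics to check ranges,
counts, and categorical consistency, plus a histogram of one numeric field; the file and columns are
hypothetical.

```python
# Sketch: a quick survey of the data plus one visualization.
# File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loans_clean.csv")            # hypothetical dataset

print(df.describe(include="all"))              # ranges, counts, missing values
print(df["Purpose"].value_counts())            # consistency of categorical codes

df["Credit.amount"].hist(bins=30)              # distribution of loan amounts
plt.xlabel("Credit amount")
plt.ylabel("Frequency")
plt.show()
```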
Common Tools for the Data Preparation Phase:

 Hadoop: can perform massively parallel and custom analysis for web traffic parsing, GPS
location analytics, and genomic analysis.

 Alpine Miner: provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as data-mining techniques.
 OpenRefine (formerly called Google Refine): "a free, open source, powerful tool for working
with messy data." It is a popular GUI-based tool.

 Data Wrangler: an interactive tool for data cleaning and transformation. Wrangler was
developed at Stanford University and can be used to perform many transformations on a given
dataset.
MODEL PLANNING
 The data science team identifies candidate models to apply to the data for clustering, classifying, or
finding relationships in the data, depending on the goal of the project.

o Assess the structure of the datasets.

o Ensure that the analytical techniques enable the team to meet the business objectives and
accept or reject the working hypotheses.

o Determine if the situation warrants a single model or a series of techniques as part of a larger
analytic workflow.
 Data Exploration and Variable Selection

 Model Selection

 Common Tools for the Model Planning Phase


Data Exploration and Variable Selection:
 Although data exploration also takes place in the data preparation phase, those activities focus mainly
on data hygiene and on assessing the quality of the data itself.

 In model planning, the objective of data exploration is to understand the relationships among the
variables to inform selection of the variables and methods and to understand the problem domain.

 The key to this approach is to aim for capturing the most essential predictors and variables
rather than considering every possible variable that people think may influence the
outcome.
Model Selection

 The team's main goal is to choose an analytical technique, or a short list of candidate techniques,
based on the end goal of the project.

 In the case of machine learning and data mining, these rules and conditions are grouped into
several general sets of techniques, such as classification, association rules, and clustering.

 Teams create the initial models using a statistical software package such as R, SAS, or Matlab.

 Although these tools are designed for data mining and machine learning algorithms, they may
have limitations when applying the models to very large datasets, as is common with Big Data.
 Does the model appear valid and accurate on the test data?

 Does the model output/behavior make sense to the domain experts? That is, does it appear as if
the model is giving answers that make sense in this context?

 Do the parameter values of the fitted model make sense in the context of the domain?

 Is the model sufficiently accurate to meet the goal?

 Does the model avoid intolerable mistakes?


 Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?

 Will the kind of model chosen support the runtime requirements?

  Is a different form of the model required to address the business problem? If so, go back to the
model planning phase and revise the modeling approach.
COMMON TOOLS FOR THE MODEL PLANNING PHASE
 R  has a complete set of modeling capabilities and provides a good environment for building
interpretive models with high-quality code. 

 SQL Analysis services  can perform in-database analytics of common data mining functions,
involved aggregations, and basic predictive models.
 SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data
connectors such as ODBC, JDBC, and OLE DB.
MODEL BUILDING
 The data science team needs to develop data sets for training, testing, and production purposes.

 These data sets enable the data scientist to develop the analytical model and train it while holding
aside some of the data for testing the model.

 During this phase, users run models from analytical software packages, such as R or SAS, on file
extracts and small data sets for testing purposes. On a small scale, assess the validity of the model
and its results.
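
One possible sketch of carving a single dataset into training, testing, and hold-out sets, together with a
small extract for quick model runs on a small scale; the file name and split sizes are hypothetical.

```python
# Sketch: splitting one dataset into training, testing, and hold-out sets,
# plus a small extract for quick model runs. Names and sizes are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_data.csv")                        # hypothetical file

train, rest = train_test_split(df, test_size=0.4, random_state=1)
test, holdout = train_test_split(rest, test_size=0.5, random_state=1)

# A small extract for fast iteration while developing the model
small_extract = train.sample(n=min(1000, len(train)), random_state=1)

print(len(train), len(test), len(holdout), len(small_extract))
```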
Common Tools for the Model Building Phase

 SAS Enterprise Miner  allows users to run predictive and descriptive models based on large
volumes of data from across the enterprise.

 SPSS Modeler offers methods to explore and analyze data through a GUI.

 Matlab provides a high-level language for performing a variety of data analytics, algorithms, and
data exploration

 Alpine Miner  provides a GUI front end for users to develop analytic work flows and interact
with Big Data tools and platforms on the back end.

  STATISTICA  and Mathematica  are popular and well-regarded data mining and analytics
tools.
Open Source tools:

 R and PL/R :PL/R is a procedural language for PostgreSQL with R.

 Octave: a free software programming language for computational modeling, has some of the
functionality of Matlab.

 WEKA  is a free data mining software package with an analytic workbench

 Python is a programming language that provides toolkits for machine learning and analysis, such
as scikit-learn, numpy, scipy, pandas, and related data visualization using matplotlib

COMMUNICATE RESULTS
 After executing the model, the team needs to compare the outcomes of the modeling to the
criteria established for success and failure.

 When conducting this assessment, determine if the results are statistically significant and valid
 The best practice in this phase is to record all the findings and then select the three most
significant ones that can be shared with the stakeholders.

 The team will have documented the key findings and major insights derived from the analysis.

OPERATIONALIZE
 The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem
of users.
 This allows the team to learn from the deployment and make any needed adjustments before
launching the model across the enterprise.
 The presentation needs to include supporting information about analytical methodology and data
sources
 Creating a mechanism for performing ongoing monitoring of model accuracy and, if accuracy
degrades, finding ways to retrain the model.
Expectation of different stakeholders at the conclusion of a project

 Business User typically tries to determine the benefits and implications of the findings to the
business.

 Project Sponsor typically asks questions related to the business impact of the project, the risks
and return on investment (ROI)

 Project Manager needs to determine if the project was completed on time and within budget and
how well the goals were met.


 Business Intelligence Analyst needs to know if the reports and dashboards he manages will be
impacted and need to change.

 Data Engineer and Database Administrator (DBA) typically need to share their code from the
analytics project and create a technical document on how to implement it. 

 Data Scientist needs to share the code and explain the model to managers, and other stakeholders.
ANALYTICAL PLAN

 Discovery: business problem framed

 Initial Hypotheses

 Data and Scope


 Model planning-Analytic Techniques

 Result and Key finding

 Business impact
KEY DELIVERABLES OF ANALYTICS PROJECT
 Developing Core Material for Multiple Audiences

 Project Goals

 Main Findings

 Approach

 Model Description

 Model Details

 Providing Technical Specifications and Code

PRESENTING YOUR RESULTS TO THE PROJECT SPONSOR


 The project sponsor is the person who wants the data science result—generally for the business need
that it will fill.

   1.Summarize the motivation behind the project, and its goals.

   2.State the project’s results.

   3.Back up the results with details (Code), as needed.

   4.Discuss recommendations, outstanding issues, and possible future work.


Project sponsor presentation takeaways

 Keep it short.

  Keep it focused on the business issues, not the technical ones.

 Your project sponsor might use your presentation to help sell the project or its results to the
rest of the organization.

  Introduce your results early in the presentation, rather than building up to them.

PROVIDING TECHNICAL SPECIFICATIONS AND CODE

 The team should anticipate questions from IT related to how computationally expensive it
will be to run the model in the production environment.
 Teams should approach writing technical documentation for their code as if it were an
application programming interface (API).
Model evaluation and critique

 Once you have a model, you need to determine if it meets your goals:
 Is it accurate enough for your needs? Does it generalize well?
 Does it perform better than “the obvious guess”?
 Better than whatever estimate you currently use?
 Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem
domain?
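
One way to check the "better than the obvious guess" question above is to compare the model against a
majority-class baseline. The sketch below does this with scikit-learn; the file name, columns, and model
choice are hypothetical, continuing the earlier loan example.

```python
# Sketch: comparing a model against the "obvious guess" (a majority-class baseline).
# File, column names, and model choice are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("loan_data.csv")
X = pd.get_dummies(df.drop(columns=["Good.Loan"]))
y = df["Good.Loan"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:   ", model.score(X_test, y_test))
```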

Confusion Matrix:
The confusion matrix is the technique we use to measure the performance of classification models.
It is presented in a tabular form where each row represents the actual classes and each column represents
the predicted classes.
Accuracy
Accuracy is defined as the ratio of total number of correct predictions to the total number of samples.
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives +
False Negatives)
Precision
Precision measures how many of the predicted positives are actually positive.

                   Precision = True Positives / (True positives + False positives)


Recall  

Recall or True Positive Rate is defined as the ratio of true positives to total number of true positives
and false negatives. 

               Recall = True Positives / (True positives + False Negatives)


F1 Score

 The F1 score (F-measure) is a single measure that combines precision and recall.
    F1 = (2 * Precision * Recall) / (Precision + Recall)
Misclassification Rate or Error Rate: 
 It is defined as the ratio between the total number of misclassified samples and the total number of samples.

Error Rate = (False Positives + False Negatives) / Total Number of Samples

Sensitivity (True positive rate)

TP/(TP+FN).

Specificity (True negative rate) 

TN/(TN+FP).

False positive rate:

          FP/(FP+TN)

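
A small sketch computing the metrics defined above directly from confusion-matrix counts; the counts
themselves are made-up numbers used only for illustration.

```python
# Sketch: computing the metrics defined above from confusion-matrix counts.
# The counts themselves are made-up numbers used only for illustration.
TP, TN, FP, FN = 70, 80, 20, 30

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # sensitivity / true positive rate
f1_score    = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)            # true negative rate
fpr         = FP / (FP + TN)            # false positive rate

print(accuracy, precision, recall, f1_score, error_rate, specificity, fpr)
```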

Presentation and documentation


Once you have a model that meets your success criteria, you'll present your results to your project
sponsor and other stakeholders. You must also document the model for those in the organization who
are responsible for using, running, and maintaining the model once it has been deployed.

Model deployment and maintenance


Finally, the model is put into operation. In many organizations this means the data scientist no longer
has primary responsibility for the day-to-day operation of the model.

Drivers of Big Data

To better understand the market drivers related to Big Data, it is helpful to first understand some past
history of data stores and the kinds of repositories and tools to manage these data stores.
The data now comes from multiple sources, such as these:
• Medical information, such as genomic sequencing and diagnostic imaging
• Photos and video footage uploaded to the World Wide Web
• Video surveillance, such as the thousands of video cameras spread across a city
• Mobile devices, which provide geospatial location data of the users, as well as metadata about text
messages, phone calls, and application usage on smart phones
• Smart devices, which provide sensor-based collection of information from smart electric grids,
smart buildings, and many other public and industry infrastructures
• Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS
navigation systems, and seismic processing

Prepared By
D.Deva Hema
