Analytics and Data Science - Self Notes
The Predictive Analytics Data Scientist provides support in making strategic data-related decisions by analyzing,
manipulating, tracking, internally managing and reporting data. This position functions both as a consultant and
as a high-level data analyst, working with clients to develop the right set of questions/hypotheses and using
the appropriate tools/skills to derive the best information to address customer needs.
Preferred
Statistical modeling (including use of panel data) and predictive modeling experience (including logistic
regression, decision tree, random forest, neural network, clustering analysis)
The Data Analyst position will practice and leverage concepts of data science, using research methods, predictive
analytics, data mining, machine learning and various statistical techniques to help solve business problems.
Active participation in a range of market strategy projects essential to KP's membership and margin goals. Market
Strategy & Analysis Consultants provide analytical/strategic-thinking and leadership skills that enable project
teams to:
In this role, you will report to the Manager of Predictive Modeling and Analytics. Team members of the Predictive
Modeling and Analytics team are responsible for analyzing large data sets to develop and train custom models and
algorithms that drive business solutions. The predictive analytics team works on classification models, causal
inference, clustering approaches, and time series methods to predict and understand which drivers impact
membership growth.
2) design and execute analytics for studying business issues (market research, scenario planning, forecasting,
market share, profitability, etc);
3) bring technical/content expertise (competitive intelligence, utilization, financial analysis, deep data analysis &
programming);
5) create documents (strategic segment plans, utilization reports) that inform critical strategic issues.
Essential Functions
- Collect and organize data from mainframe files, data warehouse reports, vendor extracts, departmental
spreadsheets and databases, and internet/intranet sites for easy use by internal business and analytical clients.
- Create and maintain databases as a tool for delivering data to internal clients.
- Use data to investigate identified business issue and to address hypotheses.
- Create more complex analytical views of data, identifying major assumptions and gaps.
- Develop preliminary conclusions – tell the story
Y = mX + C
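The formula above is simple linear regression (slope m, intercept C). A minimal sketch of fitting it by ordinary least squares; the data points are made up for illustration:

```python
# Fitting Y = mX + C by ordinary least squares (invented data).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope m = covariance(x, y) / variance(x); intercept C = mean_y - m * mean_x
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
C = mean_y - m * mean_x

print(m, C)  # 2.0 1.0 for this data
```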
https://playground.tensorflow.org
2
Choice of features matters.
Throwing in too many features may in fact cause overfitting.
In particular, the implicit weight each feature carries has a real impact (e.g. a number-of-legs column taking
values 0 vs 4 carries more weight than other binary features – so better to use Manhattan distance, or to convert
the legs column to a binary column, e.g. has legs or not).
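A rough sketch of the point above, with invented animal features: a raw legs column (0 vs 4) dominates the distance, while recoding it to binary puts every feature on the same scale:

```python
# Invented features: [legs, has_fur, lays_eggs]
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

snake = [0, 0, 1]
dog   = [4, 1, 0]

# With raw leg counts, the legs column dominates the distance:
print(euclidean(snake, dog))  # ~4.24, mostly from the legs difference

# Recoding legs to a binary has_legs column puts features on one scale:
snake_bin = [0, 0, 1]  # has_legs = 0
dog_bin   = [1, 1, 0]  # has_legs = 1
print(manhattan(snake_bin, dog_bin))  # 3
```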
Regression
Supervised ML – features (attributes) and labels
Use meaningful features to do pattern recognition – don't need columns which are related (high correlation)
Unsupervised learning
Correlation coefficient – determines whether two datasets are related (the closer to 1, the tighter the relation)
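A minimal sketch of the correlation coefficient (Pearson's r), implemented directly; the datasets are made up:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0  (tight positive relation)
print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))   # ~-1.0 (perfect inverse relation)
```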
Clustering – scatter chart
Python libraries: scikit-learn, TensorFlow
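A toy sketch of clustering – the kind of grouping you would eyeball on a scatter chart. Real projects would typically use scikit-learn's KMeans; the 1-D points and k = 2 here are invented:

```python
# Tiny k-means on invented 1-D points, k = 2.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [points[0], points[-1]]  # naive initialisation

for _ in range(10):  # a few refinement rounds suffice for this data
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # move each centroid to the mean of its assigned points
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 10.5] -- the two visible groups
```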
- Consult with researchers on the feasibility, design and methods of proposed research projects.
- Perform advanced statistical analyses independently, such as logistic regression, survival analysis, hierarchical
modeling.
- Provide data extractions and develop analytic datasets for individual studies.
- Provide high-level analytic programming and statistical consultation projects with minimal supervision.
- MPH in Epi/Biostat.
- Based primarily in chronic disease epidemiology (obesity, cardiovascular disease, and diabetes in women).
As data continue to grow at a faster rate than either population or economic activity, so do organizations' efforts to
deal with the data deluge and use it to capture value. And so do the methods used to analyze data, which creates
an expanding set of terms (including some buzzwords) used to describe these methods.
Predictive modeling:
Used when you seek to predict a target (outcome) variable (feature) using records (cases) where the target is
known. Statistical or machine learning models are "trained" using the known data, then applied to data where the
outcome variable is unknown. Includes both classification (where the outcome is categorical, often binary) and
prediction (where the outcome is continuous).
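The train-then-apply workflow described above can be sketched with a toy 1-nearest-neighbour classifier; all records, features and labels are invented:

```python
# "Training" data: records where the target label is known.
training = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.8], "A"),
    ([5.0, 5.0], "B"),
    ([5.5, 4.5], "B"),
]

def classify(record):
    """Apply the trained data to a record whose label is unknown:
    return the label of the nearest known record."""
    def dist(row):
        features, _ = row
        return sum((f - r) ** 2 for f, r in zip(features, record))
    _, label = min(training, key=dist)
    return label

print(classify([4.8, 5.2]))  # B -- nearest known record is a B
```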
Predictive analytics:
Basically, the same thing as predictive modeling, but less specific and technical. Often used to describe the field
more generally.
Supervised Learning:
Another synonym for predictive modeling.
Unsupervised Learning:
Data mining methods not involving the prediction of an outcome based on training models on data where the
outcome is known. Unsupervised methods include cluster analysis, association rules, outlier detection, dimension
reduction and more.
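One of the unsupervised methods listed above, outlier detection, can be sketched with a simple z-score rule; no outcome variable is involved. The data and the two-standard-deviation threshold are illustrative choices:

```python
import statistics

values = [10, 11, 9, 10, 12, 10, 11, 50]
mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Flag anything more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * sd]
print(outliers)  # [50]
```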
Business intelligence:
An older term that has come to mean the extraction of useful information from business data without benefit of
statistical or machine learning models (e.g. dashboards to visualize key indicators, queries to databases).
Data mining:
This term means different things in different contexts. To a lay person, it might mean the automated searching of
large databases. To an analyst, it may refer to the collection of statistical and machine learning methods used with
those databases (predictive modeling, clustering, recommendation systems, ...).
Text mining:
The application of data mining methods to text.
Text analytics:
A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such
as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language
processing), then quantification (e.g. identifying the presence or absence of key terms).
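The quantification step described above can be sketched as presence/absence indicators for a list of key terms; the terms and the sample sentence are invented:

```python
# Quantify a text as presence (1) / absence (0) of key terms.
key_terms = ["refund", "delay", "excellent"]
text = "The delivery delay was frustrating but the refund was fast."

tokens = set(text.lower().replace(".", "").split())
features = {term: int(term in tokens) for term in key_terms}
print(features)  # {'refund': 1, 'delay': 1, 'excellent': 0}
```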
Big Data:
Refers to the huge amounts of data that large businesses and other organizations collect and store. It might be
unstructured text (streams of tweets) or structured quantitative data (transaction databases). In the 1990s
organizations began making efforts to extract useful information from this data. The challenges of big data lie
mainly in the pre-analysis stage, in the IT domain.
Gregory Piatetsky-Shapiro, Editor and Analytics/Data Mining Expert at KDnuggets, conducted a poll on the usage
of these terms.
Machine Learning:
Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other
similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision
trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of
these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling
methods that have highly structured model forms, such as linear regression, logistic regression and discriminant
analysis are generally not considered part of machine learning. Unsupervised learning methods such as
association rules and clustering are also considered part of machine learning.
Network Analytics:
The science of describing and, especially, visualizing the connections among objects. The objects might be human,
biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977
network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split
into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines
representing connections).
(Wayne Zachary. An information flow model for conflict and fission in small groups, Journal of Anthropological
Research, 33(4):452–473, 1977; cited in D. Easley & J. Kleinberg, Networks, Crowds, and Markets: Reasoning about
a Highly Connected World, Cambridge University Press, 2010, available also at
http://www.cs.cornell.edu/home/kleinber/networks-book/ where this figure is drawn from.)
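The nodes-and-edges idea can be sketched by computing degree centrality on a tiny invented graph (not Zachary's actual karate-club data): counting edges per node reveals which node is most connected:

```python
from collections import Counter

# Invented graph: each tuple is an edge (link) between two nodes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

# Degree centrality: number of edges touching each node.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print(degree.most_common(1))  # [('A', 3)] -- A is the central node
```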
Web Analytics:
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions
(sales), generally with a view to learning what web presentations are most effective in achieving the organizational
goal (usually sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to
purchase advertising on other sites, or to collect contact information. Key challenges in web analytics are the
volume and constant flow of data, and the navigational complexity and sometimes lengthy gaps that precede users'
relevant web decisions.
Uplift or Persuasion Modeling:
A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another
group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which
treatments. Here are the steps, in conceptual terms, for a typical uplift model:
3. Divide the data into a number of segments, each having roughly similar numbers of subjects who got treatment
A and control. Tree-based methods are typically used for this.
4. The segments should be drawn such that, within each segment, the response to treatment A is substantially
different from the response to control.
5. Considering each segment as the modeling unit, build a model that predicts whether a subject will respond
positively to treatment A.
The challenge (and the novelty) is to recognize that the model cannot operate on individual cases, since subjects
get either treatment A, OR control, but not both, so the "uplift" from treatment A compared to control cannot be
observed at the individual level, but only at the group level. Hence the need for the segments described in steps 3
and 4.
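The group-level comparison in the steps above can be sketched as response rate under treatment A minus response rate under control, computed per segment (never per individual); all counts are invented:

```python
# Per-segment (n subjects, n responders) under each condition -- invented.
segments = {
    "segment_1": {"treated": (100, 30), "control": (100, 10)},
    "segment_2": {"treated": (100, 12), "control": (100, 11)},
}

uplifts = {}
for name, counts in segments.items():
    t_n, t_r = counts["treated"]
    c_n, c_r = counts["control"]
    # uplift = response rate under treatment A minus rate under control
    uplifts[name] = t_r / t_n - c_r / c_n

print(uplifts)  # segment_1 ~0.20 (real uplift), segment_2 ~0.01 (negligible)
```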
Note: Traditional A-B testing would stop at step 1, and apply the more successful treatment to all subjects.
Reference: "Real World Uplift Modelling with Significance-Based Uplift Trees," by N. J. Radcliffe and P. D. Surry,
available as a white paper at stochasticsolutions.com/
Tableau Public (free) vs. Tableau Server (for organizations)
Natural Language Processing (NLP)
One of the biggest challenges for a data scientist is to sort through this unstructured data and pre-process it so that
data mining and analytics tools can take over to extract the knowledge they are seeking. Luckily for data scientists,
there are already well-developed NLP tools available in programming languages such as Python. Some of these
tools are also built into operating systems such as Unix or Linux.
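A minimal sketch of NLP-style pre-processing using only the standard library (real pipelines would use dedicated tools such as NLTK or spaCy); the stop-word list and sentence are invented:

```python
import re

# Invented mini stop-word list for illustration.
stop_words = {"the", "a", "of", "and", "to"}
raw = "The scientist, luckily, has a set of well-developed NLP tools!"

# Lowercase, strip punctuation, then drop stop words.
tokens = re.findall(r"[a-z']+", raw.lower())
cleaned = [t for t in tokens if t not in stop_words]
print(cleaned)
# ['scientist', 'luckily', 'has', 'set', 'well', 'developed', 'nlp', 'tools']
```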
At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables,
distributions, regression, null hypothesis significance tests, confidence intervals, t-tests, ANOVA and chi-square. You
also need to know how to use common statistical analysis tools, including R, Excel and SAS. At a more advanced level,
a data scientist needs to be familiar with concepts and algorithms like logistic regression, support vector machines
(SVMs), and Bayesian methods.
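One of the listed concepts, a confidence interval for a mean, can be sketched with the normal approximation (a t distribution would be more exact for a sample this small); the sample values are invented:

```python
import math
import statistics

sample = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
se = statistics.stdev(sample) / math.sqrt(len(sample))

# ~95% interval under the normal approximation (z = 1.96).
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(low, 2), round(high, 2))  # interval around the mean of 4.05
```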
Visualization
- To overcome the challenge of effectively communicating the results of data analytics to a lay audience, data
scientists frequently rely on visualization.
Tableau offers one of the most popular and comprehensive visualization tools for data scientists. It supports a variety of
visualization elements such as different types of charts, graphs, maps, and other more advanced options.
There are job titles such as data scientist, data engineer, business intelligence architect, machine learning specialist,
data analytics specialist, and data visualization developer.