Analytics and Data Science - Self Notes

What kind of analytics?

The Predictive Analytics Data Scientist provides support in making strategic data-related decisions by analyzing,
manipulating, tracking, internally managing and reporting data. These positions function both as a consultant and
as a high-level data analyst. The position works with clients to develop the right set of questions/hypotheses and
uses the appropriate tools/skills to derive the best information to address customer needs.

Preferred

Statistical modeling (including use of panel data) and predictive modeling experience (including logistic
regression, decision tree, random forest, neural network, clustering analysis)

Extensive experience with Excel, SAS, R and/or SQL

Interest in statistical methods, root cause analysis


In this role, you will report to the Manager of Predictive Modeling and Analytics. Team members of the Predictive
Modeling and Analytics team are responsible for analyzing large data sets to develop and train custom models and
algorithms that drive business solutions. The predictive analytics team works on classification models, causal
inference, clustering approaches, and time series methods to predict and understand which drivers impact
membership growth.

The Data Analyst position will practice and leverage concepts of data science, using research methods, predictive
analytics, data mining, machine learning and various statistical techniques to help solve business problems.

Active participation in a range of market strategy projects essential to KP's membership and margin goals. Market
Strategy & Analysis Consultants provide analytical/strategic-thinking and leadership skills that enable project
teams to:

1) isolate business issues;

2) design and execute analytics for studying business issues (market research, scenario planning, forecasting,
market share, profitability, etc);

3) bring technical/content expertise (competitive intelligence, utilization, financial analysis, deep data analysis &
programming);

4) vet findings and make formal recommendations to senior levels of KP leadership;

5) create documents (strategic segment plans, utilization reports) that inform critical strategic issues.
Essential Functions

- Collect and organize data from mainframe files, data warehouse reports, vendor extracts, departmental
spreadsheets and databases, and internet/intranet sites for easy use by internal business and analytical clients.
- Create and maintain databases as a tool for delivering data to internal clients.
- Use data to investigate identified business issues and to address hypotheses.
- Create more complex analytical views of data, identifying major assumptions and gaps.
- Develop preliminary conclusions (tell the story).

7 Steps of Machine Learning


Gathering data
Preparing that data
Choosing a model
Training
Evaluation
Hyperparameter tuning
Prediction
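
A minimal sketch of these seven steps in Python with scikit-learn; the dataset (iris) and model choice (k-nearest neighbors) are my own illustrative assumptions, not from the notes:

# Illustrative walk-through of the 7 steps using scikit-learn's built-in iris data.
from sklearn.datasets import load_iris                      # 1. gather data
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler            # 2. prepare data
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier          # 3. choose a model

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)   # 6. tune hyperparameters
search.fit(X_train, y_train)                                 # 4. train
print("test accuracy:", search.score(X_test, y_test))        # 5. evaluate
print("prediction:", search.predict(X_test[:1]))             # 7. predict on new data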

Y = mX + C

In machine learning there can be many m's (i.e. many factors/features), each with its own weight.

Training Data => Model (matrix of Weights and Biases) => Prediction => Test and Update Weights and Biases
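
A minimal sketch of that "predict, test, update weights and biases" loop, using plain NumPy gradient descent on a linear model y = Xw + b; the synthetic data and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 rows, 3 features (three m's)
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    pred = X @ w + b                           # prediction step
    err = pred - y                             # test against known labels
    w -= lr * (X.T @ err) / len(y)             # update weights
    b -= lr * err.mean()                       # update bias
print(w, b)                                    # should approach true_w and true_b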

https://playground.tensorflow.org
Choice of features matters.
Throwing too many features in may in fact give us some overfitting.
In particular, the implicit weight of each feature has a real impact. For example, a number-of-legs column ranging
from 0 to 4 carries more weight than other binary features, so it is better to use Manhattan distance on comparable
scales or to convert the legs column into a binary column (e.g. has legs: yes or no).
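
A small illustration (my own made-up animals, not from the notes) of how an unscaled "number of legs" feature dominates a Manhattan-distance comparison, and how a binary conversion evens things out:

import numpy as np

# rows: [number_of_legs, has_fur, lays_eggs]
dog   = np.array([4, 1, 0])
snake = np.array([0, 0, 1])
bird  = np.array([2, 0, 1])

print(np.abs(dog - snake).sum(), np.abs(bird - snake).sum())   # Manhattan: 6 vs 2, legs dominate
# Converting legs to a binary "has_legs" flag puts all features on the same 0/1 scale.
dog_b, snake_b, bird_b = np.array([1, 1, 0]), np.array([0, 0, 1]), np.array([1, 0, 1])
print(np.abs(dog_b - snake_b).sum(), np.abs(bird_b - snake_b).sum())   # 3 vs 1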

Regression
Supervised ML – features (attributes) and labels
Use meaningful features for pattern recognition – you don't need columns that are highly correlated with each other
Unsupervised learning

Correlation coefficient – used to determine whether two datasets are related (the closer to 1 or -1, the tighter the relationship); see the small example after this list
Clustering – scatter chart
Python libraries: scikit-learn, TensorFlow
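
A quick sketch of computing a correlation coefficient with NumPy; the two series are made-up illustration data:

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 64, 70, 74])
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))   # close to 1, so the two series are tightly related
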
- Consult with researchers on the feasibility, design and methods of proposed research projects.

- Perform advanced statistical analyses independently, such as logistic regression, survival analysis, hierarchical
modeling.

- Provide data extractions and develop analytic datasets for individual studies.

- Provide high-level analytic programming and statistical consultation projects with minimal supervision.

- Perform other programming, analytic and consulting duties as required.

- Experience working with very large databases.

- Ability to work on and manage multiple small projects simultaneously.

- MPH in Epi/Biostat.

- Strong statistical analysis and consulting background.

- Based primarily in chronic disease epidemiology (obesity, cardiovascular disease, and diabetes in women).

Terminology in Data Analytics

As data continues to grow at a faster rate than either population or economic activity, so do organizations' efforts to
deal with the data deluge and use it to capture value. So, too, do the methods used to analyze data, which creates
an expanding set of terms (including some buzzwords) used to describe these methods.

Predictive modeling:
Used when you seek to predict a target (outcome) variable (feature) using records (cases) where the target is
known. Statistical or machine learning models are "trained" using the known data, then applied to data where the
outcome variable is unknown. Includes both classification (where the outcome is categorical, often binary) and
prediction (where the outcome is continuous).
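
A minimal sketch of the train-then-apply idea: fit on records where the target is known, then score records where it is unknown. Classification (categorical target) and prediction (continuous target) differ only in the target and the model; the data and column meanings here are made up:

from sklearn.linear_model import LogisticRegression, LinearRegression

X_known = [[25, 1], [40, 0], [33, 1], [58, 0]]   # records where the target is known
churned = [0, 1, 0, 1]                           # categorical target -> classification
spend   = [120.0, 80.5, 95.0, 60.0]              # continuous target -> prediction

clf = LogisticRegression().fit(X_known, churned)
reg = LinearRegression().fit(X_known, spend)

X_new = [[45, 1]]                                # record where the target is unknown
print(clf.predict(X_new), reg.predict(X_new))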

Predictive analytics:
Basically, the same thing as predictive modeling, but less specific and technical. Often used to describe the field
more generally.
Supervised Learning:
Another synonym for predictive modeling.

Unsupervised Learning:
Data mining methods not involving the prediction of an outcome based on training models on data where the
outcome is known. Unsupervised methods include cluster analysis, association rules, outlier detection, dimension
reduction and more.
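
A short sketch of one unsupervised method (k-means clustering): there is no outcome label, the algorithm groups similar records on its own. The points are made up:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
                   [8, 8], [8.3, 7.9], [7.8, 8.2]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)   # two clusters discovered without any training labels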

Business intelligence:
An older term that has come to mean the extraction of useful information from business data without benefit of
statistical or machine learning models (e.g. dashboards to visualize key indicators, queries to databases).

Data mining:
This term means different things in different contexts. To a lay person, it might mean the automated searching of
large databases. To an analyst. it may refer to the collection of statistical and machine learning methods used with
those databases (predictive modeling, clustering, recommendation systems, ...)

Text mining:
The application of data mining methods to text.

Text analytics:
A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such
as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language
processing), then quantification (e.g. identifying the presence or absence of key terms).
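
A tiny sketch of the quantification step mentioned above: flagging the presence or absence of key terms in each document (plain Python; the documents and terms are illustrative):

docs = ["The service was great and the staff friendly",
        "Terrible delay, very poor service"]
key_terms = ["great", "poor", "delay"]

matrix = [[int(term in doc.lower()) for term in key_terms] for doc in docs]
print(matrix)   # [[1, 0, 0], [0, 1, 1]] - one presence/absence row per document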

Data science, data analytics, analytics:


These terms cover all of the concepts described on this page. "Data science" is often used to define a (new) profession whose
practitioners are capable in many or all of the above areas; one often sees the term "data scientist" in job postings.
While "statistician" typically implies familiarity with research methods and the collection of data for studies, "data
scientist" implies the ability to work with large volumes of data generated not by studies, but by ongoing
organizational processes. Due to the complexity of dealing with large datasets and data flows, most of the day-to-
day work of a data scientist lies in data pipeline challenges - storing relevant data, getting it into appropriate form
for analysis, and managing the real-time implementation of models. "Data analytics" and "analytics," by contrast,
are general terms used to describe the field and a comprehensive collection of associated methods. All these terms
tend to be used for the application of analytic methods to data that large organizations generate or have available
("big data").
Statistics:
Covers nearly all of the above methods, and also carries the mantle of a well-established profession dating back to
the mid-1800s. Although statisticians work on "big data" problems, the field of statistics has traditionally
concentrated on focused research studies (e.g. drug trials).

Big Data:
Refers to the huge amounts of data that large businesses and other organizations collect and store. It might be
unstructured text (streams of tweets) or structured quantitative data (transaction databases). In the 1990s,
organizations began making efforts to extract useful information from this data. The challenges of big data lie
mainly in the pre-analysis stage, in the IT domain.

Our friend, Gregory Piatetsky-Shapiro, Editor and Analytics/Data Mining Expert at KDnuggets conducted the
following poll:

What will replace "Big Data" as a hot buzzword? [262 voters]


Smart Data (76) 29%
Big Analytics (73) 28%
Data+ (26) 9.9%
Linked Data (25) 9.5%
Internet of Things (23) 8.8%
Power Data (9) 3.4%
Good Data (5) 1.9%
Other (28) 11%

For the full report, go to http://www.kdnuggets.com/polls/2012/what-will-replace-big-data.html

Machine Learning:
Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other
similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision
trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of
these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling
methods that have highly structured model forms, such as linear regression, logistic regression and discriminant
analysis are generally not considered part of machine learning. Unsupervised learning methods such as
association rules and clustering are also considered part of machine learning.

Network Analytics:
The science of describing and, especially, visualizing the connections among objects. The objects might be human,
biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977
network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split
into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines
representing connections).

(Wayne Zachary, "An Information Flow Model for Conflict and Fission in Small Groups," Journal of Anthropological
Research, 33(4):452–473, 1977; cited in D. Easley & J. Kleinberg, Networks, Crowds, and Markets: Reasoning about
a Highly Connected World, Cambridge University Press, 2010, also available at
http://www.cs.cornell.edu/home/kleinber/networks-book/, where the original network diagram can be found.)
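
A brief sketch of the karate-club example using the networkx library, which ships Zachary's 1977 network as a built-in dataset; the choice of degree centrality as the measure is my own illustration:

import networkx as nx

G = nx.karate_club_graph()                    # 34 nodes (members), 78 edges (ties)
centrality = nx.degree_centrality(G)
top = sorted(centrality, key=centrality.get, reverse=True)[:2]
print(top)                                    # nodes 33 and 0: the two central individuals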

Social Network Analytics:


Network analytics applied to connections among humans. Recently it has also come to encompass the analysis of
websites and internet services such as Facebook.

Web Analytics:
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions
(sales), generally with a view to learning what web presentations are most effective in achieving the organizational
goal (usually sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to
purchase advertising on other sites, or to collect contact information. Key challenges in web analytics are the
volume and constant flow of data, and the navigational complexity and sometimes lengthy gaps that precede users'
relevant web decisions.
Uplift or Persuasion Modeling:
A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another
group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which
treatments. Here are the steps, in conceptual terms, for a typical uplift model:

1. Conduct A-B test, where B is control

2. Combine all the data from both groups

3. Divide the data into a number of segments, each having roughly similar numbers of subjects who got treatment
A and control. Tree-based methods are typically used for this.

4. The segments should be drawn such that, within each segment, the response to treatment A is substantially
different from the response to control.

5. Considering each segment as the modeling unit, build a model that predicts whether a subject will respond
positively to treatment A.

The challenge (and the novelty) is to recognize that the model cannot operate on individual cases, since subjects
get either treatment A or control, but not both, so the "uplift" from treatment A compared to control cannot be
observed at the individual level, but only at the group level. Hence the need for the segments described in steps 3
and 4.

Note: Traditional A-B testing would stop at step 1, and apply the more successful treatment to all subjects.

Reference: "Real World Uplift Modelling with Significance-Based Uplift Trees," by N. J. Radcliffe and P. D. Surry,
available as a white paper at stochasticsolutions.com/
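
A rough sketch of the segment-based idea in steps 3 and 4 above (my own simplified illustration, not the significance-based uplift trees of the referenced paper): form segments with a shallow tree fitted on the combined A/B data, then compare treatment vs control response rates within each segment. All column names and data are synthetic assumptions:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 4000
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "prior_purchases": rng.integers(0, 10, n),
    "treated": rng.integers(0, 2, n),           # A/B assignment from step 1
})
# synthetic response: the treatment mainly helps the high-purchase subjects
base = 0.10 + 0.02 * (df["prior_purchases"] > 5)
lift = 0.10 * df["treated"] * (df["prior_purchases"] > 5)
df["responded"] = rng.random(n) < base + lift

# Step 3: use the leaves of a shallow tree on the combined data as segments
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=200, random_state=0)
tree.fit(df[["age", "prior_purchases"]], df["responded"])
df["segment"] = tree.apply(df[["age", "prior_purchases"]])

# Step 4: group-level uplift = response rate under A minus response rate under control
rates = df.groupby(["segment", "treated"])["responded"].mean().unstack("treated")
rates["uplift"] = rates[1] - rates[0]
print(rates)
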
Tableau Public (free) Vs Tableau Server (for organizations)

Data Mining and Analytics: Data Manipulation Techniques


Difference between classification and clustering: classification starts with pre-defined labels; in clustering, labels are
created after the fact.

Machine Learning:
Natural Language Processing (NLP)
One of the biggest challenges for a data scientist is to sort through unstructured data and pre-process it so that data
mining and analytics tools can take over to extract the knowledge being sought. Luckily for data scientists, there are
already well-developed NLP tools available for programming languages such as Python. Some of these tools are also
built into operating systems such as Unix or Linux.
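
A bare-bones sketch of NLP-style pre-processing using only the Python standard library (lowercase, tokenize, drop stop words, count terms); the sentence and stop-word list are illustrative:

import re
from collections import Counter

stop_words = {"the", "a", "and", "to", "of"}
text = "The data scientist cleans the text and counts the terms"
tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]
print(Counter(tokens).most_common(3))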

At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables,
distributions, regression, null hypothesis significance tests, confidence intervals, t-tests, ANOVA and chi-square. You also
need to know how to use common statistical analysis tools, including R, Excel and SAS. At a more advanced level, a data
scientist needs to be familiar with concepts and algorithms such as logistic regression, support vector machines (SVMs),
and Bayesian methods.
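
A quick sketch of one of the listed methods, an independent-samples t-test with scipy.stats; the two samples are made-up numbers:

from scipy import stats

group_a = [23.1, 25.3, 24.8, 26.0, 22.9, 25.5]
group_b = [21.0, 22.4, 20.8, 23.1, 21.7, 22.0]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(round(t_stat, 2), round(p_value, 4))   # small p-value -> reject the null hypothesis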

Visualization

- To overcome the challenge of effectively communicating the results of data analytics to a lay audience, data
scientists frequently rely on visualization.

Tableau offers one of the most popular and comprehensive visualization tools for data scientists. It supports a variety of
visualization elements such as different types of charts, graphs, maps, and other more advanced options.

There are job titles such as data scientist, data engineer, business intelligence architect, machine learning specialist, data
analytics specialist, and data visualization developer.
