Analytics and Data Science - Self Notes
The Predictive Analytics Data Scientist provides support in making strategic data-related decisions by analyzing,
manipulating, tracking, internally managing and reporting data. This position functions both as a consultant and
as a high-level data analyst, working with clients to develop the right set of questions/hypotheses and using
the appropriate tools/skills to derive the best information to address customer needs.
Preferred
Statistical modeling (including use of panel data) and predictive modeling experience (including logistic
regression, decision tree, random forest, neural network, clustering analysis)
The Data Analyst position will practice and leverage concepts of data science, using research methods, predictive
analytics, data mining, machine learning and various statistical techniques to help solve business problems.
Active participation in a range of market strategy projects essential to KP's membership and margin goals. Market
Strategy & Analysis Consultants provide analytical/strategic-thinking and leadership skills that enable project
teams to:
In this role, you will report to the Manager of Predictive Modeling and Analytics. Team members of the Predictive
Modeling and Analytics team are responsible for analyzing large data sets to develop and train custom models and
algorithms that drive business solutions. The predictive analytics team works on classification models, causal
inference, clustering approaches, and time series methods to predict and understand which drivers impact
membership growth.
2) design and execute analytics for studying business issues (market research, scenario planning, forecasting,
market share, profitability, etc);
3) bring technical/content expertise (competitive intelligence, utilization, financial analysis, deep data analysis &
programming);
5) create documents (strategic segment plans, utilization reports) that inform critical strategic issues.
Essential Functions
- Collect and organize data from mainframe files, data warehouse reports, vendor extracts, departmental
spreadsheets and databases, and internet/intranet sites for easy use by internal business and analytical clients.
- Create and maintain databases as a tool for delivering data to internal clients.
- Use data to investigate identified business issue and to address hypotheses.
- Create more complex analytical views of data, identifying major assumptions and gaps.
- Develop preliminary conclusions – tell the story
Y = mX + C
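The formula above is simple linear regression (slope m, intercept C). A minimal sketch of fitting it by ordinary least squares; the data points are made up for illustration:

```python
# Fitting Y = mX + C by ordinary least squares (invented data).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope m = covariance(x, y) / variance(x); intercept C = mean_y - m * mean_x
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
C = mean_y - m * mean_x

print(m, C)  # 2.0 1.0 for this data
```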
https://playground.tensorflow.org
2
Choice of features matters.
Throwing in too many features may in fact cause overfitting.
In particular, the implicit weight each feature carries has a real impact (e.g. a number-of-legs column taking
values 0 vs 4 carries more weight than other binary features – so better to use Manhattan distance, or to convert
the legs column to a binary column, e.g. has legs or not).
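A rough sketch of the point above, with invented animal features: a raw legs column (0 vs 4) dominates the distance, while recoding it to binary puts every feature on the same scale:

```python
# Invented features: [legs, has_fur, lays_eggs]
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

snake = [0, 0, 1]
dog   = [4, 1, 0]

# With raw leg counts, the legs column dominates the distance:
print(euclidean(snake, dog))  # ~4.24, mostly from the legs difference

# Recoding legs to a binary has_legs column puts features on one scale:
snake_bin = [0, 0, 1]  # has_legs = 0
dog_bin   = [1, 1, 0]  # has_legs = 1
print(manhattan(snake_bin, dog_bin))  # 3
```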
Regression
Supervised ML – features (attributes) and labels
Use meaningful features to do pattern recognition – don't need columns which are related (high correlation)
Unsupervised learning
Correlation coefficient – determines whether two datasets are related (the closer to 1, the tighter the relation)
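A minimal sketch of the correlation coefficient (Pearson's r), implemented directly; the datasets are made up:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0  (tight positive relation)
print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))   # ~-1.0 (perfect inverse relation)
```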
Clustering – scatter chart
Python libraries: scikit-learn, TensorFlow
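A toy sketch of clustering – the kind of grouping you would eyeball on a scatter chart. Real projects would typically use scikit-learn's KMeans; the 1-D points and k = 2 here are invented:

```python
# Tiny k-means on invented 1-D points, k = 2.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [points[0], points[-1]]  # naive initialisation

for _ in range(10):  # a few refinement rounds suffice for this data
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # move each centroid to the mean of its assigned points
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 10.5] -- the two visible groups
```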
- Consult with researchers on the feasibility, design and methods of proposed research projects.
- Perform advanced statistical analyses independently, such as logistic regression, survival analysis, hierarchical
modeling.
- Provide data extractions and develop analytic datasets for individual studies.
- Provide high-level analytic programming and statistical consultation projects with minimal supervision.
- MPH in Epi/Biostat.
- Based primarily in chronic disease epidemiology (obesity, cardiovascular disease, and diabetes in women).
As data continue to grow at a faster rate than either population or economic activity, so do organizations' efforts to
deal with the data deluge and use it to capture value. And so do the methods used to analyze data, which creates
an expanding set of terms (including some buzzwords) used to describe these methods.
Predictive modeling:
Used when you seek to predict a target (outcome) variable (feature) using records (cases) where the target is
known. Statistical or machine learning models are "trained" using the known data, then applied to data where the
outcome variable is unknown. Includes both classification (where the outcome is categorical, often binary) and
prediction (where the outcome is continuous).
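The train-then-apply workflow described above can be sketched with a toy 1-nearest-neighbour classifier; all records, features and labels are invented:

```python
# "Training" data: records where the target label is known.
training = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.8], "A"),
    ([5.0, 5.0], "B"),
    ([5.5, 4.5], "B"),
]

def classify(record):
    """Apply the trained data to a record whose label is unknown:
    return the label of the nearest known record."""
    def dist(row):
        features, _ = row
        return sum((f - r) ** 2 for f, r in zip(features, record))
    _, label = min(training, key=dist)
    return label

print(classify([4.8, 5.2]))  # B -- nearest known record is a B
```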
Predictive analytics:
Basically, the same thing as predictive modeling, but less specific and technical. Often used to describe the field
more generally.
Supervised Learning:
Another synonym for predictive modeling.
Unsupervised Learning:
Data mining methods not involving the prediction of an outcome based on training models on data where the
outcome is known. Unsupervised methods include cluster analysis, association rules, outlier detection, dimension
reduction and more.
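One of the unsupervised methods listed above, outlier detection, can be sketched with a simple z-score rule; no outcome variable is involved. The data and the two-standard-deviation threshold are illustrative choices:

```python
import statistics

values = [10, 11, 9, 10, 12, 10, 11, 50]
mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Flag anything more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * sd]
print(outliers)  # [50]
```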
Business intelligence:
An older term that has come to mean the extraction of useful information from business data without benefit of
statistical or machine learning models (e.g. dashboards to visualize key indicators, queries to databases).
Data mining:
This term means different things in different contexts. To a lay person, it might mean the automated searching of
large databases. To an analyst, it may refer to the collection of statistical and machine learning methods used with
those databases (predictive modeling, clustering, recommendation systems, ...).
Text mining:
The application of data mining methods to text.
Text analytics:
A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such
as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language
processing), then quantification (e.g. identifying the presence or absence of key terms).
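The quantification step described above can be sketched as presence/absence indicators for a list of key terms; the terms and the sample sentence are invented:

```python
# Quantify a text as presence (1) / absence (0) of key terms.
key_terms = ["refund", "delay", "excellent"]
text = "The delivery delay was frustrating but the refund was fast."

tokens = set(text.lower().replace(".", "").split())
features = {term: int(term in tokens) for term in key_terms}
print(features)  # {'refund': 1, 'delay': 1, 'excellent': 0}
```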
Big Data:
Refers to the huge amounts of data that large businesses and other organizations collect and store. It might be
unstructured text (streams of tweets) or structured quantitative data (transaction databases). In the 1990s
organizations began making efforts to extract useful information from this data. The challenges of big data lie
mainly in the pre-analysis stage, in the IT domain.
Gregory Piatetsky-Shapiro, Editor and Analytics/Data Mining Expert at KDnuggets, conducted a poll on the usage
of these terms.
Machine Learning:
Analytics in which computers "learn" from data to produce models or rules that apply to those data and to other
similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision
trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of
these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling
methods that have highly structured model forms, such as linear regression, logistic regression and discriminant
analysis are generally not considered part of machine learning. Unsupervised learning methods such as
association rules and clustering are also considered part of machine learning.
Network Analytics:
The science of describing and, especially, visualizing the connections among objects. The objects might be human,
biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary's classic 1977
network diagram of a karate club reveals the centrality of two individuals, and presages the club's subsequent split
into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines
representing connections).
(Wayne Zachary. An information flow model for conflict and fission in small groups, Journal of Anthropological
Research, 33(4):452–473, 1977; cited in D. Easley & J. Kleinberg, Networks, Crowds, and Markets: Reasoning about
a Highly Connected World, Cambridge University Press, 2010, available also at
http://www.cs.cornell.edu/home/kleinber/networks-book/ where this figure is drawn from.)
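The nodes-and-edges idea can be sketched by computing degree centrality on a tiny invented graph (not Zachary's actual karate-club data): counting edges per node reveals which node is most connected:

```python
from collections import Counter

# Invented graph: each tuple is an edge (link) between two nodes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

# Degree centrality: number of edges touching each node.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print(degree.most_common(1))  # [('A', 3)] -- A is the central node
```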
Web Analytics:
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions
(sales), generally with a view to learning what web presentations are most effective in achieving the organizational
goal (usually sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to
purchase advertising on other sites, or to collect contact information. Key challenges in web analytics are the
volume and constant flow of data, and the navigational complexity and sometimes lengthy gaps that precede users'
relevant web decisions.
Uplift or Persuasion Modeling:
A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another
group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which
treatments. Here are the steps, in conceptual terms, for a typical uplift model:
3. Divide the data into a number of segments, each having roughly similar numbers of subjects who got treatment
A and control. Tree-based methods are typically used for this.
4. The segments should be drawn such that, within each segment, the response to treatment A is substantially
different from the response to control.
5. Considering each segment as the modeling unit, build a model that predicts whether a subject will respond
positively to treatment A.
The challenge (and the novelty) is to recognize that the model cannot operate on individual cases, since subjects
get either treatment A, OR control, but not both, so the "uplift" from treatment A compared to control cannot be
observed at the individual level, but only at the group level. Hence the need for the segments described in steps 3
and 4.
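The group-level comparison in the steps above can be sketched as response rate under treatment A minus response rate under control, computed per segment (never per individual); all counts are invented:

```python
# Per-segment (n subjects, n responders) under each condition -- invented.
segments = {
    "segment_1": {"treated": (100, 30), "control": (100, 10)},
    "segment_2": {"treated": (100, 12), "control": (100, 11)},
}

uplifts = {}
for name, counts in segments.items():
    t_n, t_r = counts["treated"]
    c_n, c_r = counts["control"]
    # uplift = response rate under treatment A minus rate under control
    uplifts[name] = t_r / t_n - c_r / c_n

print(uplifts)  # segment_1 ~0.20 (real uplift), segment_2 ~0.01 (negligible)
```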
Note: Traditional A-B testing would stop at step 1, and apply the more successful treatment to all subjects.
Reference: "Real World Uplift Modelling with Significance-Based Uplift Trees," by N. J. Radcliffe and P. D. Surry,
available as a white paper at stochasticsolutions.com/
Tableau Public (free) vs. Tableau Server (for organizations)
Natural Language Processing (NLP)
One of the biggest challenges for a data scientist is to sort through this unstructured data and pre-process it so that
data mining and analytics tools can take over to extract the knowledge they are seeking. Luckily for data scientists,
there are already well-developed NLP tools available in programming languages such as Python. Some of these
tools are also built into operating systems such as Unix or Linux.
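A minimal sketch of NLP-style pre-processing using only the standard library (real pipelines would use dedicated tools such as NLTK or spaCy); the stop-word list and sentence are invented:

```python
import re

# Invented mini stop-word list for illustration.
stop_words = {"the", "a", "of", "and", "to"}
raw = "The scientist, luckily, has a set of well-developed NLP tools!"

# Lowercase, strip punctuation, then drop stop words.
tokens = re.findall(r"[a-z']+", raw.lower())
cleaned = [t for t in tokens if t not in stop_words]
print(cleaned)
# ['scientist', 'luckily', 'has', 'set', 'well', 'developed', 'nlp', 'tools']
```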
At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables,
distributions, regression, null hypothesis significance tests, confidence intervals, t-tests, ANOVA and chi-square. You
also need to know how to use common statistical analysis tools, including R, Excel and SAS. At a more advanced level,
a data scientist needs to be familiar with concepts and algorithms like logistic regression, support vector machines
(SVMs), and Bayesian methods.
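One of the listed concepts, a confidence interval for a mean, can be sketched with the normal approximation (a t distribution would be more exact for a sample this small); the sample values are invented:

```python
import math
import statistics

sample = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
se = statistics.stdev(sample) / math.sqrt(len(sample))

# ~95% interval under the normal approximation (z = 1.96).
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(low, 2), round(high, 2))  # interval around the mean of 4.05
```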
Visualization
- To overcome the challenge of effectively communicating the results of data analytics to a lay audience, data
scientists frequently rely on visualization.
Tableau offers one of the most popular and comprehensive visualization tools for data scientists. It supports a variety of
visualization elements such as different types of charts, graphs, maps, and other more advanced options.
There are job titles such as data scientist, data engineer, business intelligence architect, machine learning specialist,
data analytics specialist, and data visualization developer.