Data Science Notes
Data Structures
Big data can come in multiple forms,
including structured and unstructured
data such as financial data, text files,
multimedia files, and genetic mappings.
The following shows four types of data
structures, with 80–90% of future data
growth coming from unstructured
data types.
Quasi-structured data: Textual data
with erratic data formats that can be
formatted with effort, tools, and time
(for instance, web clickstream data that
may contain inconsistencies in data
values and formats).
Current Analytical Architecture
The typical data architectures just
described are designed for storing
and processing mission-critical data,
supporting enterprise applications,
and enabling corporate reporting
activities.
Unstructured Data
Data that has no inherent structure,
which may include text documents,
PDFs, images, and video.
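To make the quasi-structured case concrete, the sketch below normalizes a few clickstream-style records whose timestamps and paths are formatted inconsistently. The records, the "|" separator, and the list of accepted formats are all assumptions made up for illustration; real clickstream logs vary widely.

```python
from datetime import datetime

# Hypothetical clickstream records: the same kind of event logged with
# inconsistent timestamp formats, path casing, and stray whitespace.
RAW_CLICKS = [
    "2023-01-05 14:03:22|/Home",
    "05/01/2023 14:04|/products ",
    "2023-01-05T14:05:10|/CART",
]

# Timestamp formats assumed to occur; anything else is left unparsed.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S"]


def parse_timestamp(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def normalize(record):
    """Return (ISO timestamp or None, lowercased and trimmed path)."""
    raw_time, raw_path = record.split("|", 1)
    parsed = parse_timestamp(raw_time.strip())
    iso = parsed.isoformat() if parsed else None
    return iso, raw_path.strip().lower()


cleaned = [normalize(r) for r in RAW_CLICKS]
for row in cleaned:
    print(row)
```

Once every record shares one timestamp format and one path convention, the data can be loaded into a table like structured data. That cleanup is the "effort, tools, and time" the definition refers to.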
Data Repositories
Spreadsheets and data marts – used for
recordkeeping. Analysts depend on data
extracts.
Data Warehouse – Centralized data
containers in a purpose-built space.
Supports BI (Business Intelligence) and
reporting.
Analytic Sandbox – Data assets
gathered from multiple sources and
technologies.
State of Practice in Analytics
Business Intelligence – BI tends to provide
reports, dashboards, and queries on business
questions about the current period or the past. BI
systems make it easy to answer such questions.
Review of Descriptive and Inferential Statistics
Data Processing and Visualization with R
Statistics Refresher
Statistics – Descriptive statistics cover the
collection, organization, and presentation of
data. Inferential statistics draw conclusions
about a larger group from a sample, determine
relationships, and make predictions.
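The difference between the two branches can be shown with a small sketch. The sample values are hypothetical, and the interval uses a normal approximation for simplicity; for a sample this small a t-distribution would usually be the more appropriate choice.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample: daily order counts observed over ten days.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# Descriptive statistics: summarize the data that was actually collected.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential statistics: estimate the mean of the larger population
# that the sample came from, with an approximate 95% interval.
# Assumption: a normal approximation is acceptable for illustration.
z = NormalDist().inv_cdf(0.975)
std_error = stdev / len(sample) ** 0.5
ci_low = mean - z * std_error
ci_high = mean + z * std_error

print("mean:", mean, "median:", median, "stdev:", round(stdev, 3))
print("approx. 95% interval:", round(ci_low, 2), "to", round(ci_high, 2))
```

The first three values describe only the ten observed days. The interval is a statement about days that were not observed, which is what makes it inferential.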
Regression Analysis – Frequently used to
analyze the relationship between two or more
variables.
- At least two variables need to be
continuous.
Response Variable – Y must be a continuous
variable.
Predictor Variable – X1, X2,…,Xp can be
continuous, discrete or categorical variables.
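As a minimal sketch of the response/predictor setup above, the following fits a regression line with a single continuous predictor using ordinary least squares. The function name fit_simple_regression and the data (hours studied versus exam score) are hypothetical and chosen only for illustration; with several predictors X1, …, Xp a library routine would normally be used instead.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: predictor X (hours studied), response Y (exam score).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_simple_regression(hours, score)
predicted = intercept + slope * 6.0  # prediction for a new X value
print(intercept, slope, predicted)
```

Here the response Y is continuous, as required. A categorical predictor would first need to be encoded as numbers (for example as 0/1 indicator variables) before it could be used in this kind of fit.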
Logistic Regression