Data Science: Chapter 1: Introduction To Big Data
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data only once
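The single-scan constraint on streaming data can be illustrated with a one-pass computation. The following is a minimal sketch, assuming a numeric stream and using Welford's online update for mean and variance; the function name `streaming_mean_variance` and the sample readings are hypothetical, not taken from the slides.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over the stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Hypothetical readings; a generator can only be consumed once, like a stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = streaming_mean_variance(readings)
print(count, mean, variance)
```

Each value is read once and then discarded, so memory use stays constant no matter how long the stream runs.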
What To Do With These Data?
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more…
Three Recurring Data Scientist Activities
variables
If the team plans to run regression analysis, identify the
candidate predictors and outcome variables of the model
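One possible way to shortlist candidate predictors is to rank them by the strength of their linear association with the outcome variable. The sketch below assumes small in-memory numeric columns and uses the Pearson correlation coefficient; the column names (`ad_spend`, `day_of_month`, `sales`) and values are hypothetical illustrations, not data from the course.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_predictors(
    candidates: Dict[str, List[float]], outcome: List[float]
) -> List[Tuple[str, float]]:
    """Sort candidate predictors by absolute correlation with the outcome, strongest first."""
    scores = {name: pearson(values, outcome) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical outcome variable and candidate predictors.
sales = [10.0, 20.0, 30.0, 40.0, 50.0]
candidates = {
    "day_of_month": [3.0, 1.0, 4.0, 1.0, 5.0],
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
}
ranked = rank_predictors(candidates, sales)
print(ranked)
```

Correlation captures only linear, one-variable-at-a-time relationships, so a ranking like this is a starting point for discussion with domain experts rather than a final variable selection.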
Model Selection
The main goal is to choose an analytical technique, or several
candidates, based on the end goal of the project
We observe events in the real world and attempt to construct models
that emulate this behavior with a set of rules and conditions
A model is simply an abstraction from reality
Determine whether to use techniques best suited for structured data,
unstructured data, or a hybrid approach
Teams often create initial models using statistical software packages
such as R, SAS, or MATLAB, which may have limitations when applied
to very large datasets
The team moves to the model building phase once it has a good idea
about the type of model to try
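The idea of a model as an abstraction from reality can be made concrete with a very small example: a straight line fitted to observed points by ordinary least squares. This is only a sketch under the assumption of one numeric predictor and one numeric outcome; the function name `fit_line` and the observations are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of some real-world process.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# The fitted rule summarizes the observations; it does not reproduce them exactly.
predicted = [slope * x + intercept for x in xs]
print(slope, intercept, predicted)
```

The two fitted numbers stand in for five observations, which is the sense in which the model abstracts the data: it keeps the trend and discards the noise.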
Common Tools for the Model Planning Phase