Facets of Data: Self-Describing Structure
1. Structured data
2. Semi-structured data
3. Unstructured data
Structured data
Predefined fields.
Can be arranged in tables or relational databases.
Quantitative and highly organized.
Lends itself to effective analysis.
Easy to export, store, and organize in a database.
Definite format.
Easy to search (see the sketch below).
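A minimal sketch of what structured data looks like in practice, shown here with pandas (a tool choice of these notes, not of the slide); the table, column names, and values are hypothetical.

    import pandas as pd

    # Structured data: every record has the same predefined fields,
    # so it fits naturally into a table or a relational database.
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],            # hypothetical identifiers
        "name": ["Alice", "Bob", "Carol"],
        "signup_date": pd.to_datetime(["2021-01-05", "2021-02-17", "2021-03-02"]),
        "monthly_spend": [49.99, 19.99, 99.00],    # quantitative, easy to analyze
    })

    # A definite format makes search and aggregation easy.
    print(customers[customers["monthly_spend"] > 30.0])
    print(customers["monthly_spend"].mean())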
Unstructured data
Qualitative data. Examples: text, images, videos, sound.
Difficult to analyze in its raw form.
Can be given structure with machine learning techniques to extract insights (see the sketch below).
Examples: language (parsing), images (segmentation).
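As one illustration of giving raw text some structure, the sketch below uses a bag-of-words representation via scikit-learn's CountVectorizer; the example sentences and the choice of technique are assumptions of these notes, not of the slide.

    from sklearn.feature_extraction.text import CountVectorizer

    # Unstructured text, e.g. customer reviews (made-up examples).
    documents = [
        "The delivery was fast and the product works great",
        "Terrible support, the product stopped working",
        "Great product, fast delivery, friendly support",
    ]

    # A bag-of-words model imposes structure: each document becomes a row
    # of word counts over a fixed vocabulary (the derived "fields").
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())
    print(counts.toarray())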
Big Data:
Very large data sets that are a mixture of structured and unstructured data.
V’s of Big Data:
Volume
Velocity
Value
Veracity
Variety
Volume
Amount of data
Velocity
The speed at which data is generated and processed. As velocity increases, data volume grows, often exponentially, which may shorten the window of data retention or application.
Value
usefulness of gathered data for your business.
Regardless of its volume, bulk data usually isn’t very useful — to be valuable, it
needs to be converted into insights or information, and that is where data
science and analytics step in.
Veracity
assurance of quality or credibility of the collected data.
Variety
Structured/unstructured/semi-structured
Types of analytics:
Descriptive
Diagnostic
Predictive
Prescriptive
2ND SLIDE
Data Science Process
Understand the Business Problem
1. Data Acquisition/Collection
2. Data Preparation
3. Data Exploration
4. Data Modeling (in-depth analysis)
5. Visualization
1. Data Acquisition/Collection
• This part of the process involves finding suitable data and getting access to the data from the data owner.
• The result of this step is data in its raw form, which probably needs polishing and transformation before it becomes usable (a small sketch follows).
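A minimal sketch of the acquisition step, assuming the data owner hands over a CSV export; the file name and the use of pandas are hypothetical.

    import pandas as pd

    # Acquisition: pull the raw data from wherever the owner exposes it
    # (a CSV export here; it could equally be a database or an API).
    raw = pd.read_csv("sales_export_2021.csv")  # hypothetical file

    # The result is data in its raw form; a first look usually reveals
    # missing values and columns that will need transformation.
    print(raw.shape)
    print(raw.head())
    print(raw.isna().sum())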
2. Data Preparation
• Transforming data from a raw form into data that is directly usable in your model.
• Typical tasks: data splitting, data integration, feature selection (see the sketch below).
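A small sketch of the preparation step, assuming two hypothetical raw tables and scikit-learn for the split; it touches data integration, cleaning, a very simple form of feature selection, and data splitting.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical raw tables from the acquisition step.
    orders = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "amount": [20.0, 35.5, None, 12.0],
    })
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "region": ["north", "south", "north", "east"],
        "churned": [0, 0, 1, 0],
    })

    # Data integration: combine sources on a shared key.
    data = orders.merge(customers, on="customer_id")

    # Cleaning / transformation: fill missing values, encode categories.
    data["amount"] = data["amount"].fillna(data["amount"].median())
    data = pd.get_dummies(data, columns=["region"])

    # Feature selection (kept deliberately simple): drop the identifier.
    X = data.drop(columns=["customer_id", "churned"])
    y = data["churned"]

    # Data splitting: hold out part of the data for later evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    print(X_train.shape, X_test.shape)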
3. Data Exploration
• The goal of this step is to gain a deep understanding of the information
contained in the data.
• The goal is to look for patterns, correlations, and deviations based on visual and
descriptive techniques.
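A brief sketch of the exploration step on a small synthetic data set (a stand-in for the prepared data); pandas and matplotlib are assumed.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Synthetic stand-in for the prepared data from the previous step.
    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        "amount": rng.normal(50, 15, 200),
        "visits": rng.poisson(5, 200),
    })
    data["churned"] = (data["visits"] < 3).astype(int)

    # Descriptive techniques: summary statistics and correlations.
    print(data.describe())
    print(data.corr())

    # Visual techniques: distributions and pairwise relationships.
    data.hist(figsize=(8, 6))
    data.plot.scatter(x="visits", y="amount")
    plt.show()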
4. Data Modeling
• It is now that you attempt to gain the insights or make the predictions stated in
your project charter (business problem).
• This step of the process is where you will have to apply your statistical,
mathematical and technological knowledge and leverage all of the data science
tools at your disposal to crunch the data and find every insight you can.
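A minimal modeling sketch with scikit-learn; the synthetic data set and the choice of logistic regression are assumptions for illustration, since the slides do not prescribe a model.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Stand-in data; in practice X and y come from the preparation step.
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a simple model and check how well it generalizes to held-out data.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))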
5. Visualization
All the analysis and technical results that you come up with are of little value unless
you can explain to your stakeholders what the results mean, in a way that’s
comprehensible and compelling. Data storytelling is a critical and underrated skill
that you will build and use here.
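A short sketch of presenting a result to stakeholders with matplotlib; the churn-by-region numbers are illustrative, not real results.

    import matplotlib.pyplot as plt

    # Hypothetical outcome of the analysis: churn rate per region.
    regions = ["north", "south", "east", "west"]
    churn_rate = [0.12, 0.08, 0.21, 0.10]

    # One clear, labeled chart usually tells the story better than a table.
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(regions, churn_rate)
    ax.set_title("Churn rate by region (illustrative numbers)")
    ax.set_ylabel("Share of customers lost")
    plt.tight_layout()
    plt.show()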
3RD SLIDE:
Data Modeling:
• The process of sorting and storing data is called "data modeling".
• More precisely, it is the process of creating a data model for the data to be stored in a database.
• Data modeling helps in the visual representation of data and enforces rules (constraints) on the data.
Data Model:
• A data model is a method by which we can organize and store data.
• A data model is a conceptual representation of data objects, the associations between different data objects, and the rules that govern them.
• Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.
• The Data Model is defined as an abstract model that organizes data description,
data semantics, and consistency constraints of data.
• The data model emphasizes what data is needed and how it should be organized
instead of what operations will be performed on data.
• Data Model is like an architect's building plan, which helps to build conceptual
models and set a relationship between data items.
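A toy sketch of a data model expressed as Python dataclasses; the entities, fields, and the rule between them are hypothetical, chosen only to show what data is needed and how it is organized.

    from dataclasses import dataclass
    from datetime import date

    # Two data objects and the association between them.
    @dataclass
    class Customer:
        customer_id: int
        name: str

    @dataclass
    class Order:
        order_id: int
        customer_id: int   # rule: every order references an existing customer
        order_date: date
        amount: float

    # The model says what data is needed and how it is organized,
    # not which operations will later be performed on it.
    alice = Customer(101, "Alice")
    first_order = Order(1, alice.customer_id, date(2021, 3, 2), 49.99)
    print(first_order)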
Schema on Read
• In schema-on-read, we upload data as it arrives, without any changes or transformations.
• Schema-on-read gives fast data ingestion because the data does not have to follow any internal schema; you are just copying or moving files, and structure is applied only when the data is read (see the sketch below).
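A hedged sketch of schema-on-read: raw JSON records are stored exactly as they arrive, and pandas infers a schema only at read time; the file name and fields are hypothetical.

    import json
    import pandas as pd

    # Ingestion: store the records as-is, with no table or column definitions.
    raw_records = [
        {"user": "alice", "event": "click", "ts": "2021-03-02T10:00:00"},
        {"user": "bob", "event": "purchase", "ts": "2021-03-02T10:05:00",
         "amount": 19.99},
    ]
    with open("events.json", "w") as f:
        json.dump(raw_records, f)

    # The schema is applied only when the data is read: columns are inferred
    # from whatever fields are present (missing ones become NaN).
    events = pd.read_json("events.json")
    print(events.dtypes)
    print(events)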
Schema on Write
• Schema on write means creating a schema for the data before writing it into the database.
• With schema-on-write, we define the columns, data formats, relationships between columns, etc. before the actual data upload (see the sketch below).
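A matching sketch of schema-on-write using Python's built-in sqlite3: the table structure is declared before any data is loaded; table and column names are hypothetical.

    import sqlite3

    # Schema on write: columns, types, and constraints are defined up front,
    # and every insert must conform to them.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE events (
            user   TEXT NOT NULL,
            event  TEXT NOT NULL,
            ts     TEXT NOT NULL,
            amount REAL              -- optional value, but its type is fixed
        )
    """)

    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                 ("alice", "click", "2021-03-02T10:00:00", None))
    conn.commit()

    for row in conn.execute("SELECT * FROM events"):
        print(row)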