Cs3352 Foundation of Data Science
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
In this phase, the data science team must learn and investigate the
problem, develop context and understanding and learn about the data
sources needed and available for the project.
Defining Research Goals
1. Learning the business domain
2. Resources
3. Frame the problem
4. Identifying key stakeholders
5. Interviewing the analytics sponsor
6. Developing initial hypotheses
7. Identifying potential data sources
Learning the business domain:
Understanding the domain area of the problem
is essential.
Resources:
As part of the discovery phase, the team needs
to assess the resources available to support the
project.
Resources include technology, tools, systems,
data and people.
Frame the problem:
Framing is the process of stating the analytics
problem to be solved. At this point, it is a best
practice to write down the problem statement
and share it with the key stakeholders.
Identifying key stakeholders:
The team can identify the success criteria, key
risks and stakeholders, which should include
anyone who will benefit from the project or will
be significantly impacted by the project.
Interviewing the analytics sponsor:
The team should plan to collaborate with the
stakeholders to clarify and frame the analytics
problem.
The sponsor understands the problem and
usually has an idea of a potential working
solution.
Developing initial hypotheses:
These initial hypotheses form the basis of the
analytical tests the team will use in later phases and
serve as the foundation for the findings in phase.
Identifying potential data sources:
• Data warehouse
• Data lake
• Data marts
• Metadata
• Data cubes
Advantages of data repositories:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data
reporting.
iii. Database administrators have an easier time
tracking problems.
iv. There is value to storing and analyzing data.
Disadvantages of data repositories:
i. Growing data sets could slow down systems.
2. Lower quartile: 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the
line that divides the box into two parts.
4. Upper quartile: 75% of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the
middle 50%.
7. Interquartile range: This is the box of the box plot, spanning the middle 50% of
scores.
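The box plot components listed above can be computed directly from a list of scores. The following is a minimal sketch, not a definitive method: the function name box_plot_summary and the sample scores are made up for illustration, and it assumes the common convention that a score is an outlier when it lies more than 1.5 times the interquartile range beyond a quartile (the notes do not say which outlier rule is intended). It also reports the lowest score excluding outliers, the counterpart of the maximum score.

```python
import statistics


def box_plot_summary(scores):
    """Return the box plot components for a list of numeric scores."""
    # Quartiles computed with the "inclusive" method of the statistics module.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the middle 50% of scores

    # Assumption: scores beyond 1.5 * IQR from the quartiles are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [s for s in scores if lower_fence <= s <= upper_fence]
    outliers = sorted(s for s in scores if s < lower_fence or s > upper_fence)

    return {
        "minimum": min(inliers),   # lowest score, excluding outliers
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "maximum": max(inliers),   # highest score, excluding outliers
        "iqr": iqr,
        "outliers": outliers,
    }


# Made-up sample scores; 30 is deliberately far from the rest.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(scores)
for name, value in summary.items():
    print(name, value)
```

With these sample scores the whiskers would end at the reported minimum and maximum, and the score 30 would be drawn as a separate outlier point.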
Build the Models
To build the model, the data should be clean and its content
properly understood. The components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostic and model comparison
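The three components above can be illustrated with a small sketch. It is only one possible reading of the steps: the two candidate models (a mean-only baseline and a straight-line fit), the mean squared error diagnostic, and the toy numbers are all assumptions chosen for illustration and do not come from the course notes.

```python
def fit_mean(xs, ys):
    """Candidate model 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate model 2: least-squares straight line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    """Diagnostic: average squared difference between prediction and target."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data: one input variable, one target.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x = [7, 8]
test_y = [14.1, 15.9]

# a) Selection of model and variable: two candidate models on the single variable x.
candidates = {"mean_baseline": fit_mean, "linear": fit_line}

# b) Execution of model: fit each candidate on the training data.
fitted = {name: fit(train_x, train_y) for name, fit in candidates.items()}

# c) Model diagnostic and model comparison: score on held-out data, keep the best.
errors = {name: mean_squared_error(m, test_x, test_y) for name, m in fitted.items()}
best = min(errors, key=errors.get)
print(best, errors)
```

Scoring on data held out from fitting is a design choice in this sketch; it keeps the comparison from simply rewarding the model that memorises the training data best.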
• Business user:
• Store historical data:
• Make strategic decisions:
• For data consistency
• High response time:
Difference between ODS and Data Warehouse
Metadata
Metadata is simply defined as data about data.
The data that is used to represent other data is known
as metadata. In data warehousing, metadata is one
of the essential aspects.
We can define metadata as follows:
a) Metadata is the road-map to a data warehouse.
b) Metadata in a data warehouse defines the
warehouse objects.
c) Metadata acts as a directory. This directory helps
the decision support system to locate the contents
of a data warehouse.
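The idea of metadata as a directory can be sketched with a small in-memory catalog. Everything here is hypothetical: the table names, locations, column descriptions and the locate helper are invented for illustration and do not describe any particular warehouse product.

```python
# Hypothetical metadata catalog: data about the warehouse objects,
# not the warehouse data itself.
catalog = {
    "sales_fact": {
        "location": "warehouse.sales.sales_fact",
        "description": "Daily revenue per store and product",
        "columns": {"store_id": "integer", "product_id": "integer", "revenue": "decimal"},
    },
    "customer_dim": {
        "location": "warehouse.crm.customer_dim",
        "description": "One row per customer with region and segment",
        "columns": {"customer_id": "integer", "region": "text", "segment": "text"},
    },
}


def locate(catalog, term):
    """Return the names of warehouse objects whose metadata mentions the term."""
    term = term.lower()
    matches = []
    for name, meta in catalog.items():
        searchable = [meta["description"].lower()] + [c.lower() for c in meta["columns"]]
        if any(term in text for text in searchable):
            matches.append(name)
    return sorted(matches)


# A decision support tool could ask the directory where revenue data lives.
print(locate(catalog, "revenue"))
print(catalog["sales_fact"]["location"])
```

The lookup only touches the catalog, which is the point of the directory: contents can be found without scanning the warehouse tables themselves.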
Why is metadata necessary in a data
warehouse?
a) First, it acts as the glue that links all parts of
the data warehouse.
b) Next, it provides information about the
contents and structures to the developers.
c) Finally, it opens the doors to the end-users
and makes the contents recognizable in their
terms.
Basic Statistical Descriptions of Data