152 Watson Analytics Intro To Data V 2 0
Contents
IBM Watson Analytics needs your data to start helping you drive insights!
Data loading and file characteristics
Loading data files
Data file sizes and types
Data file structure
Microsoft Excel file restrictions
CSV file restrictions
Data quality
Data quality improvements
Review the Data Quality Report
Add to the breadth and depth of the data
Use your domain knowledge to determine if the results are making sense
Viewing and changing the properties of a field in Predict
Changing the role of a field in Predict
Changing the measurement level of a field in Predict
Browsers currently supported in Watson Analytics
3. In the Add your data area, add your data set. You can add .csv and Microsoft Excel
spreadsheet files.
Important: If your data is filtered in a Microsoft Excel spreadsheet, the filtered rows are only
hidden in the spreadsheet, and the full original data set is imported into Watson Analytics.
You can use the filtering options available in the Explore and Assemble capabilities to
filter your data set. Even if you filter the data in an exploration or view, the full data set is
still available if you create a new exploration or view.
After your file loads, it appears on the Welcome page as a data set. Choose a data set to create a
prediction or exploration based on it.
Headers: Because IBM Watson Analytics relies on natural language and matches elements from
the question you ask to elements in the data, files with descriptive column headers are preferred.
Watson Analytics assumes that the first row of your file contains headers.
List Files: List files work best. List files are tabular data, with columns and rows. In Watson
Analytics, we refer to columns as fields and to rows as records. The first row is a header. Watson
Analytics does not currently work with nested headings or row headings.
The following example of a list file works well in Watson Analytics:
The following example of a nested file does not work in Watson Analytics because it contains
row headings and nested headings.
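The list-file requirement can be checked before loading. The following is a minimal sketch, not part of Watson Analytics: the function name check_headers and the sample data are hypothetical, and the checks (no blank header cells, no duplicate header names, every record as wide as the header) are only assumed symptoms of nested headings or row headings.

```python
import csv
import io


def check_headers(csv_text):
    """Return a list of problems found in a CSV that should be a flat list file."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = [name.strip() for name in rows[0]]
    problems = []
    # A blank cell in the first row often means a heading spans several columns.
    if any(name == "" for name in header):
        problems.append("blank header cell (possible nested heading)")
    if len(set(header)) != len(header):
        problems.append("duplicate header names")
    # Every record should have exactly one value per field.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {line_no} has {len(row)} values, expected {len(header)}"
            )
    return problems


# Hypothetical sample data: one flat list file, one file with nested headings.
good = "Region,Product,Revenue\nEast,Widget,100\nWest,Gadget,250\n"
nested = "Region,2014,,2015,\n,Q1,Q2,Q1,Q2\nEast,1,2,3,4\n"

print(check_headers(good))
print(check_headers(nested))
```

An empty result means only that these simple checks passed; it does not guarantee that the headers are descriptive.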
Only the first sheet in a Microsoft Excel file is imported, and remaining sheets are
ignored.
Data quality
When a data set is loaded, Watson Analytics creates a data quality report, which includes an
overall average data quality score. The data quality score indicates how ready the data is for
analysis and does not necessarily indicate whether Watson Analytics will provide good predictive
or explorative results. In other words, a low data quality score indicates only that your data is
less ready for analysis; Watson Analytics might still provide useful insights and answers about
your data. The most problematic fields that cause the average data quality score to be low are
usually excluded from analysis. Additionally, some data preparation steps are taken when Watson
Analytics creates a prediction.
Watson Analytics computes the data quality score based on the original data, before any
cleansing or transformation has occurred. The score is an average of the data quality score for
every field in the data set, as determined by missing values, constant values, imbalance,
influential categories, outliers, and skewness. Skewness is a measure of the asymmetry of a
distribution. Symmetry describes how values are distributed on either side of the central value.
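To make the idea of skewness concrete, the following sketch computes the standard moment coefficient of skewness for a list of numbers. It is an illustration under assumptions: this document does not state which skewness formula Watson Analytics uses, and the function name and sample values are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # a constant field has no asymmetry to measure
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]       # evenly spread around the central value
right_tailed = [1, 1, 1, 1, 10]   # one large value stretches the right side

print(skewness(symmetric))     # zero: both sides mirror each other
print(skewness(right_tailed))  # positive: long tail to the right
```

A value near zero indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.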
You can help improve the data quality score by preparing your data before you load it and
the score is calculated.
Before loading your data set, clean your data as much as possible, in the following ways:
You can see the score associated with each data set in the list of assets on the Welcome page. In
the following example, 68 is the score assigned to the IBM Sales Sample data set and represents
the data's readiness for analysis.
A score of 68 indicates a data set of medium quality. The score is an
average of the data quality score for every field in the data set, as
determined by missing values, constant values, imbalance, influential
categories, outliers, and skewness. The lower the score, the more
outliers, missing values, and other issues are associated with some
of the fields in the data set. Again, a low data quality score
indicates only how ready the data is for analysis; it does not
indicate the quality of the answers you will get to your queries.
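Because the overall score is an average of per-field scores, a single weak field can pull it down noticeably. The sketch below shows only the arithmetic. The field names and per-field scores are invented for illustration and are not taken from the IBM Sales Sample data set, and how Watson Analytics derives each per-field score is not described here.

```python
from statistics import mean

# Hypothetical per-field scores, invented for illustration only.
field_scores = {"Revenue": 90, "Region": 80, "Comments": 34}

# Overall score: plain average of the per-field scores.
overall = mean(list(field_scores.values()))

# The lowest-scoring field is the first candidate for cleanup or exclusion.
weakest = min(field_scores, key=field_scores.get)

print(overall)
print(weakest)
```

In this made-up case the average is 68 even though two of the three fields score 80 or higher, which is why reviewing the individual fields is more informative than the overall number alone.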
You can access the Data Quality Report in the menu on the Main Insight page in the Predict
capability.
For example, while looking at the Analysis Details of your prediction, you may see that some
input fields are omitted. Use the Data Quality Report to determine why they were removed and,
perhaps more importantly, whether you should include them.
Watson Analytics might exclude a field from use for various reasons. Use your domain
knowledge to determine whether an excluded field should be included.
Too many categories in the field: If a field contains 50 or more categories, Watson
Analytics ignores it and does not include it in subsequent analyses, even if you
set the field role to Input.
Constant or near-constant fields: If a single value accounts for more than 95% of a
field's valid values, Watson Analytics sets its field role to None.
However, if you set the field role to Input or Target, Watson Analytics uses it in
subsequent analyses.
For example, let's say that you have a Churn field that is extremely unbalanced in
that only 4% of the people in the data fall into the churn category. In this case,
Watson Analytics excludes Churn from analysis. However, you know it is an important
target field, so you set its role to Target.
Missing values: Watson Analytics ignores a field when more than 25% of its values
are missing. However, it uses the field if you set its role to Input or
Target. Currently, Watson Analytics does not impute missing values for such a field,
so records with missing values for the field are excluded from subsequent analyses.
Alternatively, you can change the default threshold from 25% to another value in the
dropdown box in the Data Quality Report. Watson Analytics then imputes missing
values for fields whose percentage of missing values is below the threshold
and uses the imputed values in subsequent analyses.
For example, let's say that you have an Age field with 30% of the values missing.
By default, Watson Analytics excludes it because more than 25% of the values are
missing. However, you know that the Age field is an interesting input field that
might explain the new program preference in a viewer survey. So, you might decide
to include it to see how it affects the predictive results.
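The three exclusion rules above can be approximated before loading, so that you know in advance which fields are likely to be set aside. This is a rough sketch, not Watson Analytics code. The thresholds come from the text, but the function names, the tokens treated as missing, the sample data, and the decision to apply the category rule only to non-numeric fields are all assumptions.

```python
import csv
import io
from collections import Counter

MISSING_LIMIT = 0.25    # field ignored when more than 25% of values are missing
CONSTANT_LIMIT = 0.95   # field ignored when one value exceeds 95% of valid values
CATEGORY_LIMIT = 50     # field ignored when it has 50 or more categories


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def screen_fields(csv_text, missing_tokens=("", "NA")):
    """Map each field name to the list of rules it would trip."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    flags = {}
    for field in reader.fieldnames or []:
        values = [(row[field] or "").strip() for row in rows]
        valid = [v for v in values if v not in missing_tokens]
        reasons = []
        if values and (len(values) - len(valid)) / len(values) > MISSING_LIMIT:
            reasons.append("missing values")
        if valid:
            counts = Counter(valid)
            if counts.most_common(1)[0][1] / len(valid) > CONSTANT_LIMIT:
                reasons.append("constant or near-constant")
            # Assumption: only non-numeric fields are treated as categorical.
            numeric = all(is_number(v) for v in valid)
            if not numeric and len(counts) >= CATEGORY_LIMIT:
                reasons.append("too many categories")
        flags[field] = reasons
    return flags


# Hypothetical sample mirroring the examples: 4% churn, 30% of ages missing.
lines = ["Customer,Churn,Age,Region"]
for i in range(100):
    churn = "Yes" if i < 4 else "No"
    age = "" if i < 30 else str(20 + i % 40)
    region = ["East", "West", "North", "South"][i % 4]
    lines.append(f"C{i:03d},{churn},{age},{region}")

result = screen_fields("\n".join(lines))
for name, reasons in result.items():
    print(name, reasons)
```

A flagged field is not necessarily one to drop. As the Churn and Age examples show, domain knowledge may justify setting its role to Input or Target anyway.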
Use your domain knowledge to determine if the results are making sense
You will always need to bring your domain knowledge with you to the analysis part of your
prediction or exploration. IBM Watson Analytics provides you with recommended analytical
starting points and predictive models based on the data you provide it. However, you must
determine what to do with the analysis and recommendations in order to create an appropriate
response.
For example, let's say you are an HR professional trying to analyze employee attrition. In this
case, Watson Analytics may initially determine that whether an employee had an exit interview is
a near-perfect predictor of whether they have left the company. However, with your domain
knowledge, you know that exit interviews are not a useful predictor of future attrition. In this
situation, you could choose to change the role of the Exit Interview input field from Input to None
and exclude it completely from the analysis.
Similarly, while Watson Analytics does its best to determine what questions you want to answer
with your data, there is no substitute for your own expertise. For example, if you are examining
payments received from customer accounts, Watson Analytics may initially determine that you
want to be able to predict the amount on the invoice. However, in fact you want to predict
whether a customer will pay the invoice by the due date. You can change the Targets identified
by Watson Analytics in order to influence how it interprets the data.
Input: Most fields are input fields. Input fields are fields whose values might
influence another field. For example, if you were conducting a study to analyze the
effect of salary on overall happiness, salary is an input field.
Target: Although input fields are the most common, target fields are the most
important. Target fields are the fields whose outcome you are interested in predicting.
Target fields are influenced by input fields. You cannot have more than five targets in
a workbook.
Record ID: Record ID fields are not used in the analysis. These fields are used for
labeling but do not provide any analytical substance.
None: Fields with a role of None are those fields that are not used in a prediction.
These fields might have too much missing data or might be fields that you choose to
exclude because they contain constant data, such as counts that are identical throughout
the column. A field might have a role of None because it was excluded
automatically by Watson Analytics. Alternatively, you might decide to exclude a
field because of your domain knowledge that it is not important for the analysis that
you want to perform.
Nominal: A nominal field is a field with a limited number of distinct values that have
no inherent order or ranking. Examples of nominal fields include department, region,
postal code, and religious affiliation.
Ordinal: An ordinal field is a field with a limited number of distinct values that have
an inherent order or ranking. Examples of ordinal fields include attitude scores that
represent the degree of satisfaction or confidence and preference rating scores. Like
continuous fields, ordinal fields can be measured numerically. However, unlike
continuous fields, distance comparisons between values are not appropriate.
Continuous: A continuous field is measured numerically so that distance
comparisons between values are appropriate. Examples of continuous fields include
age in years and income in thousands of dollars.