Chapter 1 Notes
Plan (1–2)
1. Define the problem.
2. Collect and/or find data and identify the variables.
Do (3–6)
3. Prepare and wrangle data.
4. Characterize the data.
5. Explore the data.
   - Summarize
   - Visualize
6. Model (if appropriate).
   - Check conditions and assumptions for modeling.
   - Fit the model and make the necessary calculations.
Report (7)
7. Communicate and present.
- When companies try to obtain actionable information from data that may have been collected in the course
of doing business (such as records of transactions or a customer database), the process is usually called data mining.
When the analysis focuses on future performance, it is sometimes called predictive analytics.
- Newspaper journalists know that the lead paragraph of a good story should establish the “Five W’s”: who,
what, when, where, and (if possible) why. Often, we add how to the list as well. Answering these questions
connects the data to the business problem at hand.
- In a data table, the rows are the cases and the columns are called variables. You’ll usually find the name of the variable at the top of the column, as in
Table 1.1. We call cases by different names, depending on the situation. Individuals who answer a survey
are referred to as respondents. People on whom we experiment are subjects or (in an attempt to
acknowledge the importance of their role in the experiment) participants, but animals, plants, websites, and
other nonhuman subjects are often called experimental units. Often we call cases just what they are: for
example, customers, economic quarters, or companies. When referring to a transaction, rows are often
called records. In Table 1.1, the rows are the individual orders, or purchase records. A common place to
find the who of the table is the leftmost column. It’s often an identifying variable for the cases; in this
example, it is the order number.
- A general term for a data table like the one shown in Table 1.1 is a spreadsheet, a name that comes from
bookkeeping ledgers of financial information. The data were typically spread across facing pages of a
bound ledger, the book used by an accountant for keeping records of expenditures and sources of income.
- When the values of a variable are simply the names of categories, we call it a categorical, or qualitative,
variable. When the values of a variable are measured numerical quantities, we call it a quantitative variable.
Descriptive responses to questions are often categories.
- Identifier variables are categorical variables whose only purpose is to assign a unique identifier code to
each individual in the data set. Your student ID number, Social Security number, and phone number are all
identifiers. Identifier variables are crucial in this era of Big Data because, by uniquely identifying the cases,
they make it possible to combine data from different sources and provide unique labels.
- The identifiers in Table 1.2 are the Customer Number, Product ID, and Transaction Number. Variables like
UPS Tracking Number and Social Security Number are other examples of identifiers.
- When the values of a categorical variable have an intrinsic order, we can say that the variable is ordinal. By
contrast, a categorical variable with unordered categories is sometimes called nominal. Values can be
individually ordered (e.g., the ranks of employees based on the number of days they’ve worked for the
company) or ordered in classes (e.g., Freshman, Sophomore, Junior, Senior).
- The quantitative variable Total Revenue in Table 1.4 is an example of a time series. A time series is an
ordered sequence of values of a single quantitative variable measured at regular intervals over time. Time
series are common in business. Typical measuring points are months, quarters, or years, but virtually any
consistently spaced time interval is possible.
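The note above on identifier variables says that uniquely identifying the cases makes it possible to combine data from different sources. A minimal sketch of that idea follows, in plain Python. The two tables, their column names (order_number, customer_id, price, state), and the helper join_on_identifier are all hypothetical and are not taken from Table 1.1 or Table 1.2.

```python
# Hypothetical data: two small data tables that share an identifier variable.
# Each dict is one case (row); each key is one variable (column).
orders = [
    {"order_number": "A100", "customer_id": "C1", "price": 12.50},
    {"order_number": "A101", "customer_id": "C2", "price": 8.00},
    {"order_number": "A102", "customer_id": "C1", "price": 20.00},
]
customers = [
    {"customer_id": "C1", "state": "MA"},
    {"customer_id": "C2", "state": "NY"},
]


def join_on_identifier(left, right, key):
    """Attach to each row of `left` the variables of the `right` row with the same key."""
    lookup = {row[key]: row for row in right}  # identifier value -> case
    if len(lookup) != len(right):
        # The key only works as an identifier if it is unique for every case.
        raise ValueError(f"{key} does not uniquely identify the cases")
    joined = []
    for row in left:
        match = lookup.get(row[key], {})  # no match: keep the left row as it is
        joined.append({**match, **row})
    return joined


combined = join_on_identifier(orders, customers, "customer_id")
for case in combined:
    print(case["order_number"], case["customer_id"], case["state"], case["price"])
```

The uniqueness check is the important design choice here: if two customers shared a customer_id, the lookup would silently keep only one of them, so the sketch refuses to join rather than guess.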
To make successful decisions from data, understand the business context of the data and the problem you are
trying to solve.
• Who, what, why, where, when (and how)—the W’s—help nail down the context of the data.
• We must know who, what, and why to be able to say anything useful based on the data. The who are the cases
(or records or rows). The what are the variables. A variable gives information about each of the cases. The why
helps us decide which way to treat the variables.
• Stop and identify the W’s whenever you have data, and be sure you can identify the cases and the variables.
Big Data - The collection and analysis of data sets so large and complex that traditional methods typically brought
to bear on the problem would be overwhelmed.
Business analytics - The process of using statistical analysis and modeling to drive business decisions.
Categorical (or qualitative) variable - A variable that names categories (whether with words or numerals) is called categorical
or qualitative.
Context - The context ideally tells who was measured, what was measured, how the data were collected, where the
data were collected, and when and why the study was performed.
Cross-sectional data - Data taken from situations that vary over time but measured at a single time instant are said to be a
cross-section of the time series.
Data - Recorded values, whether numbers or labels, together with their context.
Case - A case is an individual about whom or which we have data. Also called a record or row.
Data mining (or predictive analytics) - The process of using a variety of statistical tools to analyze large databases or data
warehouses.
Data table - An arrangement of data in which each row represents a case and each column represents a variable.
Data warehouse - A large database of information collected by a company or other organization usually to record transactions
that the organization makes, but also used for analysis via data mining.
Experimental unit - An individual in a study for which or for whom data values are recorded. Human experimental units are
usually called subjects or participants.
Identifier variable - A categorical variable that records a unique value for each case, used to name or identify it.
Metadata - Auxiliary information about variables in a database, typically including how, when, and where (and possibly
why) the data were collected; who each case represents; and the definitions of all the variables.
Nominal variable - The term “nominal” can be applied to a variable whose values are used only to name categories.
Ordinal variable - The term “ordinal” can be applied to a variable whose categorical values possess some kind of order.
Participant - A human experimental unit. Also called a subject.
Quantitative variable - A variable in which the numbers are values of measured quantities with units.
Record - Information about an individual in a database.
Relational database - A relational database stores and retrieves information. Within the database, information is kept in data
tables that can be “related” to each other.
Spreadsheet - A spreadsheet is a layout designed for accounting that is often used to store and manage data tables. Excel is a
common example of a spreadsheet program.
Time series - Data measured over time. Usually the time intervals are equally spaced or regularly spaced (e.g., every week,
every quarter, or every year).
Units - A quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams.
Variable - A variable holds information about the same characteristic for many cases.
Quantitative data are data about numeric variables (e.g., how many, how much, or
how often). Qualitative data are measures of “types” and may be represented by a
name, symbol, or number code; they are data about categorical variables
(e.g., what type).
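Because a categorical variable can be recorded as a number code, the type of a variable has to come from its meaning, not from how the values look. The sketch below is one way to illustrate that, under stated assumptions: the rows, the variable names (price, area_code), and the variable_types dictionary are hypothetical stand-ins for real data and metadata. A quantitative variable is summarized with a mean, while a categorical one is summarized with counts per category, even though its values are numerals.

```python
from collections import Counter
from statistics import mean

# Hypothetical metadata: the analyst declares how each variable should be treated.
variable_types = {"price": "quantitative", "area_code": "categorical"}

# Hypothetical cases. area_code is stored as a number but only names a category.
rows = [
    {"price": 12.5, "area_code": 413},
    {"price": 8.0, "area_code": 212},
    {"price": 20.0, "area_code": 413},
]


def summarize(rows, name, kind):
    """Summarize one variable according to its declared type."""
    values = [row[name] for row in rows]
    if kind == "quantitative":
        return {"mean": mean(values)}  # averaging is meaningful for measured quantities
    return dict(Counter(values))  # categories are counted, never averaged


summaries = {name: summarize(rows, name, kind) for name, kind in variable_types.items()}
print(summaries)
```

Averaging the area codes would run without error, which is exactly why the declared type, and not the stored data type, drives the choice of summary.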