Data Analyst Cheat Sheet FROM Parth Roy

THIS IS A CHEAT SHEET SOLUTION FOR ALL DATA ANALYST FRIENDS, AND YOU'RE WELCOME

Uploaded by

Aditya Roy
English version - October 2019

Power BI Cheat Sheet
https://bit.ly/cheatsheetpbi (PDF download)
https://bit.ly/PBICheatSheetGitHub (Contribute to the cheat sheet)

Power Query

Tips & Tricks
1. Give every step an explanatory name and merge steps of the same type, for better manageability. Some people prefer not to use spaces in the name of a step, for a better M coding experience.
2. Give queries and columns user-friendly names, as they will become part of your data model.
3. Make sure that each column has the correct data type. This makes the model smaller and faster.
4. Remove columns you are not going to use in your report. Prefer ‘Remove Other Columns’ over the ‘Remove Columns’ option, for lower risk that structural changes in your data source break the query.
5. Maximize the use of Query Folding for faster and more efficient queries. With Query Folding, multiple transformations are merged as one query and then sent to the source. If ‘View Native Query’ is not available, Query Folding has stopped before that step.
6. In general, prefer “Import” over “DirectQuery”, unless the amount of data is too large to import or there are other requirements (like real-time insights).
7. Use Date.From instead of DateTime.Date to extract a date from another field, and to make sure query folding won’t break. More info on this blog post: http://bit.ly/DateFrom.
8. Turn off ‘Enable Load’ for queries/tables that you don’t need in the Data Model.
9. Re-use Power Query code and lower impact on your data source by using Power BI dataflows.
10. Turn on the Formula Bar so you get familiar with Power Query (M) code.
11. Automatically beautify all column names in a query, e.g. “CustomerName” → “Customer Name”, by using the Power Query function Alex Powers shared on his GitHub repo: http://bit.ly/PQSplitByCase. Note: he also has a function to replace underscores in all column names automatically.

Code examples (don’t forget that Power Query / M is case-sensitive!)
• if T > 0 then A else B
• try A/B otherwise 0
• #table( { "X", "Y" }, { { 1, 2 }, { 3, 4 } } )
• DateTime.LocalNow()
• Date.From( DateTime.LocalNow() )
• Excel.Workbook(Web.Contents("[url]/[filename].xlsx"), null, true)
• #shared to list all functions and get PQ documentation

Resources
• Power Query M Formula Reference: http://bit.ly/PQMReference.
• Repo by Imke Feldmann with a lot of custom Power Query functions: https://github.com/ImkeF/M/.

Data Model

Tips & Tricks
1. Always use a separate Date table in your data model. Mark it as a Date Table.
2. Only use DAX Calculated Columns when it’s not possible to create the column using Power Query. This improves clarity and manageability of your report as transformations are located where you expect them. It also improves query speed of the model and reduces refresh duration.
3. Give measures a prefix (%, #, €).
4. Use abbreviations like YTD, LY, PY, PP as a suffix, to keep the base fields together in the sort order.
5. Hide columns that are needed but are irrelevant for the user.
6. Hide the key at the many side of a many-to-one relation (e.g. [OrderDate] in the ‘Revenue’ table).
7. For each measure column in your data model, make a DAX Calculated Measure instead of using the ‘Default Summarization’, then hide the original column. This way all measures will have the same icon, and you can easily change the calculation in the future (e.g. adding a filter condition). Also, it is easier to reference this measure in other DAX calculations.
8. Always use the table name when you refer to a column, for example: ‘Product’[Category].
9. Use DIVIDE() to prevent division by 0, and to improve the speed of your divisions.
10. Use IsInScope to get the right hierarchy level in DAX (read all about it in Kasper de Jonge’s blog: https://bit.ly/KasperOnBIInScope).
11. In DAX: (un)comment DAX lines by pressing Alt + Shift + A or CTRL + /, and Shift + Enter for line breaks.
12. Use aggregations to keep your model small and performant, and still have all detailed data available.
13. Use Tabular Editor to make changes to your Power BI file (currently unsupported by Microsoft). Also, make sure to check out its best-practices analyzer.
14. Avoid bi-directional cross filtering and make use of measure filters: http://bit.ly/MeasureFilters.
15. For very large models, group measures or fields in display folders for better usability.
16. Use DAX Studio to capture all DAX queries executed on your Premium Capacity.
17. Keep your PBI desktop file fast and small by using TOP N (http://bit.ly/ImproveReportBuilding) and switch the underlying data source in the PBI service after publishing (http://bit.ly/ParameterizeDatasource).

Resources
• Increase the readability of your DAX calculations: https://www.daxformatter.com.
• Use DAX Studio to analyze and tune your calculations: http://daxstudio.org.
• Find all about DAX expressions: https://dax.guide.
• Use Tabular Editor to easily build and manage your models: https://tabulareditor.github.io/.
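Power Query tip 11 describes turning names such as “CustomerName” into “Customer Name”. The referenced M function is not reproduced here; as a rough illustration of the same idea outside Power Query, the Python sketch below splits on lowercase-to-uppercase boundaries and replaces underscores. The function name `beautify_column_name` and the sample column names are hypothetical.

```python
import re


def beautify_column_name(name: str) -> str:
    """Turn 'CustomerName' or 'customer_name' into a space-separated label."""
    # Replace underscores with spaces first.
    spaced = name.replace("_", " ")
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    # Collapse any repeated whitespace.
    return " ".join(spaced.split())


# Hypothetical column names, for illustration only.
columns = ["CustomerName", "order_date", "UnitPriceUSD"]
renamed = {old: beautify_column_name(old) for old in columns}
print(renamed)
```

Runs of capitals such as "USD" are kept together because the split only happens after a lowercase letter or digit.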

I’VE GOT THE POWER BI


Data Visualization

Tips & Tricks
Themes & Colors
1. Use a theme file (.json) with branding of your organization (visit https://powerbi.tips/ for a generator).
2. Always pick colors from your theme file instead of recent colors. Otherwise you will break the relationship with your theme file for this object and changes in your theme file will not be applied.
3. Use a Power BI Template file (.pbit) to bring consistency in the look & feel of reports in your organization (e.g. add a default background, add common data sources).
4. Don’t use noisy images as a background. They distract from the data.
5. Apply colors with a purpose and not because it looks fancy. Consider the interpretation of colors (for example red is usually associated with a negative situation, green with a positive situation).
6. Think about colorblindness (8% of men and 0.6% of women have red-green color blindness).
7. Use HEX color codes in your dimensional tables or DAX expressions to apply color formatting. Read all about it on this page: https://bit.ly/DataMarcAdvancedControls.

Report layout
8. Try out Figma for wireframing/prototyping of your report pages (https://bit.ly/Frigma).
9. Put slicers in a consistent place on each page for user-friendliness and recognition.
10. Give all relevant visuals a clear and explanatory title. It will show up in the selection pane and will be used as the filename when using the export option of a visual (you can hide the title after renaming).
11. Use the new grouping functionality to group objects together and easily maintain your report.
12. Use the ▲ and ▼ buttons on the Selection Pane to change the display order of visuals and groups.
13. Use drillthrough and tooltip pages to add extra context. Hide drillthrough and tooltip pages from your navigation bar. Give the ‘back button’ extra accent for clear report navigation.
14. Disable visual header for a clean look-and-feel and drill up/down using right click.
15. Use Canva's drag-and-drop feature to design one or more-page layouts (https://bit.ly/CanvaLayouts).

Graphs & Visuals
16. Prefer graphs over tables for better insights. When you do use tables, apply conditional formatting.
17. Think about using relative date filters, such as “Last Month” or “Rolling Year”.
18. Need a special visual? Check out https://charticulator.com/ and create bespoke chart designs without programming. Free and Open Source from Microsoft Research.
19. When using custom visuals, test the impact on the performance of your report. Also, check when it was last updated by the developer. Strongly consider using only certified custom visuals.
20. Check and improve all ‘visual interactions’. Prefer cross-filtering over cross-highlighting.

Resources
• OK VIZ Visual reference: https://sqlbi.com/ref/power-bi-visuals-reference.
• SQL Jason Financial Times Visual Vocabulary: https://bit.ly/SQLJasonVisualVocabulary.

Miscellaneous

Tips & Tricks
1. Edit your report in Power BI Desktop and not in the Service, so that there is one golden version.
2. The best way to share your Power BI solution with others is using Apps. Prefer to use Security Groups.
3. Create the same data source in a gateway multiple times, with different credentials based on the required security context for each one.
4. Make sure you have versioning for your Power BI Desktop files (local OneDrive sync works well).
5. Use dataflows to host generic Power Queries and reuse the table output across multiple data models.
6. Optimize usage of the dataflows compute engine. Read details here: http://bit.ly/DataflowsCompute.
7. Want to test-drive Power BI Premium? The Azure Power BI Embedded A SKU will give you the Premium experience, without the long-term commitment and up-front costs.
8. Only use Publish to Web when your data can be out in the open. Browse the Gallery of Public Reports and more information about the risks of this feature on https://bit.ly/ModernDataPublishToWeb.
9. Be aware that users can also access your data model through the Q&A, Quick Insights and Analyze in Excel functions. That’s great, but think about data protection using (dynamic) row level security.
10. Use the Power BI admin API to get an overview of Power BI content within your tenant and monitor content usage by using the Power BI Audit logs.
11. Getting started with Power BI? Visit: https://docs.microsoft.com/en-us/power-bi/guided-learning/.
12. Thinking about embedding Power BI visuals in your custom application? See it in action and try out all the options in this live demo environment: https://bit.ly/PBIEmbedded.

Resources
• Follow these Power BI product team members: Amanda Cofsky, Arun Ulag, Chris Webb, GuyInACube (Adam & Patrick), Christian Wade, Josh Kaplan, Justyna Lucznik, Kasper de Jonge, Kay Unkroth, Kelly Kaye, Kim Manis, Lukasz Pawlowski, Matt Mason, Matthew Roche, Miguel Martinez, Nikhil Gaekwad, Nimrod Shalit, Phil Seamark, Will Thompson.
• Follow these bloggers and influencers for tips, examples and news: Alberto Ferrari, Brett Powell, David Eldersveld, Devin Knight, Gil Raviv, Imke Feldmann, Ivan Bond, Jason Thomas, Ken Puls, Ken Russel, Leila Etaati, Maegon Longoria, Matt Allington, Marco Russo, Maxim Zelensky, Melissa Coates, Miguel Eskobar, Paul Turley, Prathy Kamasani, Reid Havens, Reza Rad, Rob Collie, Rob Farley, Ruth Pozuelo (Curbal), Nicky van Vroenhoven.

Contribute!
Do you have suggestions or questions regarding the Power BI Cheat Sheet? We’d love to hear from you! Contact us on LinkedIn or Twitter, or send an email to [email protected]. You are welcome to submit your changes via GitHub: https://bit.ly/PBICheatSheetGitHub.

Dave Ruijter: linkedin.com/in/daveruijter, twitter.com/daveruijter, https://moderndata.ai/
Marc Lelijveld: linkedin.com/in/marclelijveld, twitter.com/marclelijveld, https://data-marc.com/
TABLEAU CHEAT SHEET
Relevant videos are linked throughout the document. You must be signed in to your Tableau account in order to view the videos.

Workbook Components
Sheet: A sheet is a singular chart or map in Tableau. A sheet is represented in Tableau with this symbol:

Dashboard: A dashboard is a canvas for displaying multiple sheets at a time and allowing them to interact with each
other. A dashboard is represented in Tableau with this symbol:

Container: A container is a layout frame on a dashboard that can house sheets, images, filters/parameters, and text
boxes. Containers can be horizontal (objects placed go side-by-side) or vertical (objects placed are on top of one
another). Double-click any sheet on a dashboard by the center “grip” marks to select the container that the sheet sits in.

Story: A story is a viewing portal that contains a sequence of worksheets or dashboards that work together to convey
information. Each individual sheet in a story is called a story point. A story is represented in Tableau with this symbol:

Workbook: A workbook is the entire Tableau file containing your sheets and dashboards.

Packaged Workbook: A single zip file with a .twbx extension that contains a workbook along with any supporting
local file data sources and background images. Use this format to package your work for sharing with others who don’t
have access to the data.

Videos: Getting Started with Dashboards and Stories (6 min); Building a Dashboard (4 min); Dashboard Layouts and Formatting (6 min); Story Points (4 min)

Tableau Interface
Data Pane: The default left pane that lists your open data sources and the dimensions and measures contained in the
selected data sources. Sets and Parameters are also listed here.

Analytics Pane: Clicking the Analytics tab on the left pane will display available analyses for the data displayed on
your sheet. Inapplicable analyses will be grayed out. Analyses include adding constant lines, box plots, trend lines,
forecasts, and reference bands.

Marks Card: The Marks card controls most of the visual elements in a sheet.
Using the Marks card, you can switch between different chart types (bar, line, symbol, filled map, and so on), change
colors and sizes, add labels, change the level of detail, and edit the tool tips.

Rows and Columns Shelves: The Rows shelf and the Columns shelf are where you determine which variables will go
on which axis. Put data you want displayed along the X-axis on the Columns shelf and data you want displayed on the
Y-axis on the Rows shelf.

Video: The Tableau Interface (4 min)
Data
Dimension: A categorical variable from the dataset that is used to slice and dice the data into different categories.
Dimensions are often discrete data. Examples include country, gender, student ID, and name. When a dimension is
pulled into your sheet, it takes the form of a blue pill.

Measure: A variable from the dataset that is meant to be aggregated. (This means it should be a number that it makes
sense to do math with: sum, average, and so on.) Measures are often continuous data. Examples include GPA, sales,
quantity, quota, height, and salary. When a measure is pulled into your sheet, it takes the form of a green pill.

Pill: The visual representation of a data item brought into your sheet. Pills can sit on the rows and columns shelves, the
marks card, and the filters card.

Data Types: Data fields have an icon beside them to indicate what type of data field they are: String, Integer, Geographic Location, Date, Group, Set, Hierarchy, Bin, and Calculated Field.

Video: Getting Started (4:50 – 7:00)

Filters/Parameters
Filters: A filter is used to limit what data is being displayed on the sheet. Visible controls for a filter on a sheet or
dashboard are called Quick Filters. Each filter is for an individual data field. Both dimensions and measures can be used
as filters.

Parameters: While filters limit the data shown in the view, parameters act as a variable in an equation that can be
controlled by the end user. Parameters only work in conjunction with either filters, sets, reference lines, or calculated
fields. Parameters are workbook-wide and can be used in multiple places (i.e. a single parameter can influence multiple
filters and calculated fields across different data sources in the workbook). Parameters are located separate from
Dimensions and Measures on the data pane.

NOTE: Filters, when layered appropriately, can affect the values displayed in other filters to show only relevant values
(e.g. selecting the ENGR division will cause the major filter selection to show only ENGR majors). Parameters cannot influence
filters in this manner (e.g. selecting “Undergraduate” through a parameter will not limit the major filter selection to only
undergraduate majors).

Videos: Ways to Filter (2 min); Using Filter Shelf (7 min); Interactive Filters (4 min); Where Tableau Filters (4 min); Additional Filtering Topics (7 min); Parameters (5 min)
Data Groupings and Relationships
Groups: Simplifies large numbers of dimension members by combining them into higher level sub-categories. When a
field is grouped, a new dimension node is created and replaces the original dimension in the view. Groups can be made
by:
• Highlighting multiple header names or data points then right clicking will allow you to form on-the-fly groupings
of dimension levels in the view.
• Clicking on the dimension you want to group in the data pane, then selecting Create > Group… will give access
to greater control over your groups.

Sets: A subset of your data that meets certain conditions based on existing dimensions. Unlike a group, sets only have
two values: IN and OUT. A member is either IN your set, or not (OUT). Like parameters, sets can be used throughout a
workbook on multiple sheets. Also like parameters, sets are located separate from Dimensions and Measures on the
data pane. Sets can be created by:
• Highlighting multiple header names or data points then right clicking will give you the option to put those
dimension fields in a set.
• Clicking on the dimension you want to group in the data pane, then selecting Create > Set… will give access to
greater control over your sets and the ability to create computed sets based on conditions.

Bins: Bins are buckets based on a range of values. While groups and sets are used for grouping dimensions, bins are
used for grouping measures. The created bin will sit in the Dimensions shelf. Bins can be created by right-clicking on a
measure, then selecting Create > Bins…

Hierarchies: Often data sources have related dimensions that have an inherent hierarchy. For example, a data source
may have fields for Country, State, and City. These fields could be grouped into a hierarchy called Location. In this
example, a user can expand country and break down the data by state and city. Hierarchies can be created by:
• Using the CTRL key, select the dimensions you want to be in your hierarchy, right click and Create Hierarchy.
Once the hierarchy is created it’s simple to put into the correct order, just drag and drop the dimensions in the
hierarchy into the correct position.
• Clicking a field and dragging it on-top of another field will also create a hierarchy.

Videos: Grouping (4 min); Additional Ways to Group (4 min); Creating Sets (6 min); Working with Sets (4 min); Drill Down and Hierarchies (5 min)

Other Terminology
Action: An interaction that you can add to your views. There are three types of action: Filter, Highlight, and URL.

Aggregation: A result of a mathematical operation applied to a measure. Predefined aggregations include summation
and average. You can convert dimensions to measures by aggregating them as a count. For relational data sources, all
measures must be either aggregated or disaggregated (unless they appear on the Filters shelf). Tableau aggregates
measures, usually as a summation, when you place them on a shelf.
Aliases: an alternative name assigned to a dimension member, or to a field name. Aliases can be created by:
• Right-clicking on an individual dimension header and selecting Edit alias…
• Right-clicking on a dimension in the data pane and selecting Aliases…
• Opening Data from the top toolbar, going to your data sources, and selecting Edit Aliases…

Calculated Field: A new field that you create by using a formula to modify the existing fields in your data source.

Caption: A description of the current view on the active worksheet. For example, “Sum of Sales for each Market”. You
can automatically generate captions or create your own custom captions. Show and hide the caption by selecting
Worksheet > Show Caption.

Crosstab: A text table view. Use text tables to display the numbers associated with dimension members.

Dropdown Caret: The small triangle to the right of a selected field.

Sheet Description: A thorough summary of the data used in a worksheet including all dimensions and measures
used, a written description of the view (marks, rows, columns), the formulas for all calculated fields used on the sheet,
and the data source details.

Tooltip: Tooltips are text boxes that appear when hovering over a mark on a sheet in order to give more information.
The text and text formatting in them are easily edited through the Marks card.
Shortcuts
Description                                                       Windows                    Mac
New worksheet                                                     Ctrl+M                     Command-T
New workbook                                                      Ctrl+N                     Command-N
Undo                                                              Ctrl+Z                     Command-Z
Redo                                                              Ctrl+Y                     Command-Shift-Z
Clear the current worksheet                                       Alt+Shift+Backspace        Option-Shift-Delete
Describe sheet                                                    Ctrl+E                     Command-E
Adds a field to the view                                          Double-click               Double-click
Place selected field on Columns shelf                             Alt+Shift+C                Option-Shift-C
Place selected field on Rows shelf                                Alt+Shift+R                Option-Shift-R
Opens the Drop Field menu                                         Right-click+Drag to shelf  Option-Drag to shelf
Copies a field in the view to be placed on another shelf or card  Ctrl+Drag                  Command-Drag
Swap rows and columns                                             Ctrl+W                     Control-Command-W
Open Show Me                                                      Ctrl+1, Ctrl+Shift+1       Command-1
Connect to data source                                            Ctrl+D                     Command-D
Refreshes the data source                                         F5                         Command-R
Clears the selection                                              Esc                        Esc
Selects the mark                                                  Click                      Click
Selects a group of marks                                          Drag                       Drag
Adds individual marks to the selection                            Ctrl+Click                 Command-Click
Adds a group of marks to the selection                            Ctrl+Drag                  Command-Drag

Sources:
http://www.dummies.com/programming/big-data/big-data-visualization/tableau-for-dummies-cheat-sheet/
http://onlinehelp.tableau.com/current/pro/desktop/en-us/glossary.html
https://www.tableau.com/learn/training
Population - entire collection of objects or individuals about which information is desired.
➔ easier to take a sample
◆ Sample - part of the population that is selected for analysis
◆ Watch out for:
● Limited sample size that might not be representative of population
◆ Simple Random Sampling - Every possible sample of a certain size has the same chance of being selected

Observational Study - there can always be lurking variables affecting results
➔ i.e., strong positive association between shoe size and intelligence for boys
➔ **should never show causation

Experimental Study - lurking variables can be controlled; can give good evidence for causation

Descriptive Statistics Part I
➔ Summary Measures
➔ Mean - arithmetic average of data values
◆ **Highly susceptible to extreme values (outliers). Goes towards extreme values
◆ Mean could never be larger or smaller than max/min value but could be the max/min value
➔ Median - in an ordered array, the median is the middle number
◆ **Not affected by extreme values
➔ Quartiles - split the ranked data into 4 equal groups
◆ Box and Whisker Plot
➔ Range = $X_{\max} - X_{\min}$
◆ Disadvantages: Ignores the way in which data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile - 1st quartile
◆ Not used that much
◆ Not affected by outliers
➔ Variance - the average distance squared
$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
◆ $s_x^2$ gets rid of the negative values
◆ units are squared
➔ Standard Deviation - shows variation about the mean
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
◆ highly affected by outliers
◆ has same units as original data
◆ finance = horrible measure of risk (trampoline example)

Descriptive Statistics Part II
Linear Transformations
➔ Linear transformations change the center and spread of data
➔ $\mathrm{Var}(a + bX) = b^2\,\mathrm{Var}(X)$
➔ $\mathrm{Average}(a + bX) = a + b\,[\mathrm{Average}(X)]$
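The summary measures and the linear-transformation rules can be checked numerically. The sketch below uses Python's standard `statistics` module on a small made-up sample (the data and the constants `a` and `b` are assumptions for illustration). Note that `statistics.quantiles` uses one particular quartile convention, so other tools may report slightly different quartiles.

```python
import math
import statistics

# Hypothetical sample; 25 is an extreme value.
data = [2, 4, 4, 5, 7, 9, 25]

mean = statistics.mean(data)        # pulled toward the extreme value
median = statistics.median(data)    # middle value, not affected by it
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles ("exclusive" method)
iqr = q3 - q1
var = statistics.variance(data)     # sample variance, divides by n - 1
sd = statistics.stdev(data)         # square root of the sample variance

# Linear transformation a + bX applied to every observation.
a, b = 10, -3
transformed = [a + b * x for x in data]

# Var(a + bX) = b^2 Var(X) and Average(a + bX) = a + b * Average(X)
var_rule_holds = math.isclose(statistics.variance(transformed), b**2 * var)
mean_rule_holds = math.isclose(statistics.mean(transformed), a + b * mean)

print(mean, median, data_range, iqr, var, sd)
print(var_rule_holds, mean_rule_holds)
```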
➔ Effects of Linear Transformations:
◆ $\mathrm{mean}_{\text{new}} = a + b \cdot \mathrm{mean}$
◆ $\mathrm{median}_{\text{new}} = a + b \cdot \mathrm{median}$
◆ $\mathrm{stdev}_{\text{new}} = |b| \cdot \mathrm{stdev}$
◆ $\mathrm{IQR}_{\text{new}} = |b| \cdot \mathrm{IQR}$
➔ Z-score - new data set will have mean 0 and variance 1
$z = \frac{X - \bar{X}}{S}$

Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of data is in the interval: $(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$
➔ only use if you just have mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least $1 - \frac{1}{k^2}$ of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2 (2 standard deviations), at least 75% of data falls within 2 standard deviations

Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ $|z| = \left| \frac{X - \bar{X}}{S} \right| \geq 2$
➔ The Boxplot Rule
◆ Value X is an outlier if: $X < Q_1 - 1.5(Q_3 - Q_1)$ or $X > Q_3 + 1.5(Q_3 - Q_1)$

Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8 = don't need to transform data

Measurements of Association
➔ Covariance
◆ Covariance > 0 = larger x, larger y
◆ Covariance < 0 = larger x, smaller y
◆ $s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
◆ Units = Units of x × Units of y
◆ Only the sign of the covariance (+, -, or 0) is easy to interpret (its size can be any number)
➔ Correlation - measures strength of a linear relationship between two variables
◆ $r_{xy} = \frac{\mathrm{covariance}_{xy}}{(\mathrm{std.dev.}_x)(\mathrm{std.dev.}_y)}$
◆ correlation is between -1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (-0.6 is stronger relationship than +0.4)
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one

Combining Data Sets
➔ Mean(Z) = $\bar{Z} = a\bar{X} + b\bar{Y}$
➔ Var(Z) = $s_z^2 = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$

Portfolios
➔ Return on a portfolio: $R_p = w_A R_A + w_B R_B$
◆ weights add up to 1
◆ return = mean
◆ risk = std. deviation
➔ Variance of return of portfolio
$s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B\, s_{A,B}$
◆ Risk (variance) is reduced when stocks are negatively correlated (when there's a negative covariance)

Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
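The covariance, correlation, and portfolio-variance formulas can be tied together with a short numerical sketch. The return series and the weights below are made up for illustration; the helper `sample_cov` is a hypothetical name that implements the $n - 1$ covariance formula directly.

```python
import math
import statistics

# Hypothetical yearly returns (in %) for two stocks, A and B.
returns_a = [4.0, 6.0, 8.0, 10.0, 12.0]
returns_b = [9.0, 7.0, 8.0, 4.0, 2.0]


def sample_cov(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


cov_ab = sample_cov(returns_a, returns_b)
corr_ab = cov_ab / (statistics.stdev(returns_a) * statistics.stdev(returns_b))

# Portfolio with weights that add up to 1.
w_a, w_b = 0.6, 0.4
portfolio_return = w_a * statistics.mean(returns_a) + w_b * statistics.mean(returns_b)
portfolio_var = (
    w_a**2 * statistics.variance(returns_a)
    + w_b**2 * statistics.variance(returns_b)
    + 2 * w_a * w_b * cov_ab
)

# Cross-check: variance of the combined return series computed directly.
portfolio_series = [w_a * ra + w_b * rb for ra, rb in zip(returns_a, returns_b)]
direct_var = statistics.variance(portfolio_series)

print(cov_ab, corr_ab, portfolio_return, portfolio_var, direct_var)
```

Because the covariance here is negative, the portfolio variance comes out far below either stock's own variance, which is the diversification effect the Portfolios bullets describe.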
Probability Rules
1. Probabilities range from $0 \leq \mathrm{Prob}(A) \leq 1$
2. The probabilities of all outcomes must add up to 1
3. The complement rule = A happens or A doesn't happen
$P(\bar{A}) = 1 - P(A)$
$P(A) + P(\bar{A}) = 1$
4. Addition Rule:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Contingency/Joint Table
➔ To go from contingency to joint table, divide by total # of counts
➔ everything inside table adds up to 1

Conditional Probability
➔ $P(A|B)$
➔ $P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$
➔ Given event B has happened, what is the probability event A will happen?
➔ Look out for: "given", "if"

Independence
➔ Independent if: $P(A|B) = P(A)$ or $P(B|A) = P(B)$
➔ If probabilities change, then A and B are dependent
➔ **hard to prove independence, need to check every value

Multiplication Rules
➔ If A and B are INDEPENDENT: $P(A \text{ and } B) = P(A)\,P(B)$
➔ Another way to find joint probability:
$P(A \text{ and } B) = P(A|B)\,P(B)$
$P(A \text{ and } B) = P(B|A)\,P(A)$

2 x 2 Table

Decision Analysis
➔ Maximax solution = optimistic approach. Always think the best is going to happen
➔ Maximin solution = pessimistic approach.
➔ Expected Value Solution = $EMV = X_1(P_1) + X_2(P_2) + \dots + X_n(P_n)$

Decision Tree Analysis
➔ square = your choice
➔ circle = uncertain events

Discrete Random Variables
➔ $P_X(x) = P(X = x)$

Expectation
➔ $\mu_x = E(x) = \sum x_i\, P(X = x_i)$
➔ Example: (2)(0.1) + (3)(0.5) = 1.7

Variance
➔ $\sigma^2 = E(x^2) - \mu_x^2$
➔ Example: $(2)^2(0.1) + (3)^2(0.5) - (1.7)^2 = 2.01$

Rules for Expectation and Variance
➔ $\mu_s = E(s) = a + b\mu_x$
➔ $\mathrm{Var}(s) = b^2\sigma^2$

Jointly Distributed Discrete Random Variables
➔ Independent if: $P_{x,y}(X = x \text{ and } Y = y) = P_x(x)\,P_y(y)$
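The expectation and variance examples (1.7 and 2.01) can be reproduced in a few lines. The probabilities listed in the example add up to only 0.6, so this sketch assumes the remaining 0.4 sits on the value 0; that assumption, the dictionary `pmf`, and the constants `a` and `b` are illustrative and not part of the original example.

```python
import math

# Value -> probability. The 0: 0.4 entry is an assumption made so that
# the probabilities add up to 1; it does not change E(x) or E(x^2).
pmf = {0: 0.4, 2: 0.1, 3: 0.5}

expectation = sum(x * p for x, p in pmf.items())          # E(x)
second_moment = sum(x**2 * p for x, p in pmf.items())     # E(x^2)
variance = second_moment - expectation**2                 # E(x^2) - mu^2

# Rules for a linear function s = a + b*x (a and b are hypothetical).
a, b = 5, 2
pmf_s = {a + b * x: p for x, p in pmf.items()}
expectation_s = sum(s * p for s, p in pmf_s.items())
variance_s = sum(s**2 * p for s, p in pmf_s.items()) - expectation_s**2

print(expectation, variance)       # approximately 1.7 and 2.01
print(expectation_s, variance_s)   # approximately a + b*1.7 and b^2 * 2.01
```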
➔ Combining Random Variables
◆ If X and Y are independent:
$E(X + Y) = E(X) + E(Y)$
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
◆ If X and Y are dependent:
$E(X + Y) = E(X) + E(Y)$
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
➔ Covariance: $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$
➔ If X and Y are independent, Cov(X,Y) = 0

Binomial Distribution
➔ doing something n times
➔ only 2 outcomes: success or failure
➔ trials are independent of each other
➔ probability remains constant

1.) All Failures
$P(\text{all failures}) = (1 - p)^n$
2.) All Successes
$P(\text{all successes}) = p^n$
3.) At least one success
$P(\text{at least 1 success}) = 1 - (1 - p)^n$
4.) At least one failure
$P(\text{at least 1 failure}) = 1 - p^n$
5.) Binomial Distribution Formula for x = exact value
$P(X = x) = \binom{n}{x} p^x q^{n - x}$
6.) Mean (Expectation)
$\mu = E(x) = np$
7.) Variance and Standard Dev.
$\sigma^2 = npq$
$\sigma = \sqrt{npq}$
$q = 1 - p$

Continuous Probability Distributions
➔ the probability that a continuous random variable X will assume any particular value is 0
➔ Density Curves
◆ Area under the curve is the probability that any range of values will occur.
◆ Total area = 1

Uniform Distribution
◆ $X \sim \mathrm{Unif}(a, b)$
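The binomial shortcuts above can be checked against the general formula for an exact number of successes. This is a minimal sketch: the values of `n` and `p` are hypothetical, and `binomial_pmf` is an illustrative helper built on `math.comb` from the standard library.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical setting: 10 trials, 20% chance of success on each.
n, p = 10, 0.2
q = 1 - p

p_all_failures = q**n
p_all_successes = p**n
p_at_least_one_success = 1 - q**n
p_at_least_one_failure = 1 - p**n

mean = n * p
variance = n * p * q
std_dev = math.sqrt(variance)

# The probabilities of 0..n successes should add up to 1, and the mean
# computed from the distribution should equal n * p.
total_probability = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))

print(p_all_failures, p_at_least_one_success, mean, variance, std_dev)
```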
➔ Mean for uniform distribution:
$E(X) = \frac{a + b}{2}$
➔ Variance for uniform distribution:
$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ $X \sim N(\mu, \sigma^2)$

Standardize Normal Distribution:
$Z = \frac{X - \mu}{\sigma}$
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **Z < some value, will just be the probability found on table
➔ **Z > some value, will be (1 - probability) found on table

Sums of Normals
➔ Cov(X,Y) = 0 b/c they're independent

Central Limit Theorem
➔ as n increases,
➔ $\bar{x}$ should get closer to μ (population mean)
➔ mean($\bar{x}$) = μ
➔ variance($\bar{x}$) = $\sigma^2 / n$
➔ $\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$
◆ if population is normally distributed, n can be any value
◆ any population, n needs to be ≥ 30
➔ $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Confidence Intervals = tells us how good our estimate is
**Want high confidence, narrow interval
**As confidence increases, interval also increases

A. One Sample Proportion
➔ $\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large, $n\hat{p} > 5$, and our sample size is less than 10% of the population size.
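The standardization step and the Central Limit Theorem version of the Z formula can be tried out with `statistics.NormalDist`, which stands in for the printed Z table here. All numbers below (population mean, standard deviation, observed values, sample size) are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical population: mean 100, standard deviation 15.
mu, sigma = 100.0, 15.0

# Standardize a single observation: Z = (X - mu) / sigma
x = 130.0
z = (x - mu) / sigma

std_normal = NormalDist()           # mean 0, standard deviation 1
p_below = std_normal.cdf(z)         # P(Z < z), the "table" probability
p_above = 1 - p_below               # P(Z > z) = 1 - table probability

# Central Limit Theorem: the sample mean has standard deviation sigma / sqrt(n).
n = 36
x_bar = 104.0
standard_error = sigma / math.sqrt(n)
z_mean = (x_bar - mu) / standard_error
p_mean_above = 1 - std_normal.cdf(z_mean)

print(z, p_below, p_above)
print(standard_error, z_mean, p_mean_above)
```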
Standard Error and Margin of Error

Example of Sample Proportion Problem

Determining Sample Size
$n = \frac{(1.96)^2 \hat{p}(1 - \hat{p})}{e^2}$
➔ If given a confidence interval, $\hat{p}$ is the middle number of the interval
➔ No confidence interval; use worst case scenario
◆ $\hat{p}$ = 0.5

B. One Sample Mean
For samples n > 30
Confidence Interval:
➔ If n > 30, we can substitute s for σ so that we get:

For samples n < 30
T Distribution used when:
➔ σ is not known, n < 30, and data is normally distributed

*Stata always uses the t-distribution when computing confidence intervals

Hypothesis Testing
➔ Null Hypothesis: H0, a statement of no change and is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: Ha is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I and P-values always safe to do because don't need to worry about size of n (can be bigger or smaller than 30)
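The sample-size formula under Determining Sample Size can be sketched as a small helper. The function name and the margins of error used below are illustrative assumptions; rounding up to the next whole observation is a common convention rather than something the sheet states.

```python
import math

def sample_size(p_hat, margin, z=1.96):
    """Smallest n with z^2 * p_hat * (1 - p_hat) / margin^2 <= n.

    p_hat  : anticipated proportion (0.5 is the worst case)
    margin : desired margin of error e
    z      : critical value, 1.96 for 95% confidence
    """
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

# Worst case (p_hat = 0.5) for two hypothetical margins of error.
n_three_points = sample_size(0.5, 0.03)
n_five_points = sample_size(0.5, 0.05)
```

Using 0.5 when no interval is given maximizes p(1 - p), so the resulting n is large enough whatever the true proportion turns out to be.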
One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), H0 must go - reject the null hypothesis
Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Calculate Confidence Interval
➔ Test Statistic for Two Proportions
2. Comparing Two Means (large independent samples n>30)
➔ Calculating Confidence Interval
➔ Test Statistic for Two Means

Matched Pairs
➔ Two samples are DEPENDENT
Example:
Simple Linear Regression
➔ used to predict the value of one variable (dependent variable) on the basis of other variables (independent variables)
➔ $\hat{Y} = b_0 + b_1 X$
➔ Residual: $e = Y - \hat{Y}_{fitted}$
➔ Fitting error: $e_i = Y_i - \hat{Y}_i = Y_i - b_0 - b_1 X_i$
◆ e is the part of Y not related to X
➔ Values of b0 and b1 which minimize the residual sum of squares are:
(slope) $b_1 = r \frac{s_y}{s_x}$
$b_0 = \bar{Y} - b_1 \bar{X}$
➔ Interpretation of slope - for each additional x value (e.g. mile on odometer), the y value decreases/increases by an average of b1 value
➔ Interpretation of y-intercept - plug in 0 for x and the value you get for $\hat{y}$ is the y-intercept (e.g. $\hat{y}$ = 3.25 - 0.0614 × SkippedClass, a student who skips no classes has a GPA of 3.25.)
➔ **danger of extrapolation - if an x value is outside of our data set, we can't confidently predict the fitted y value

Properties of the Residuals and Fitted Values
1. Mean of the residuals = 0; Sum of the residuals = 0
2. Mean of original values is the same as mean of fitted values: $\bar{Y} = \bar{\hat{Y}}$
3.
4. Correlation Matrix
➔ corr($\hat{Y}$, e) = 0

A Measure of Fit: R²
➔ Good fit: if SSR is big, SSE is small
➔ SST=SSR, perfect fit
➔ R²: coefficient of determination
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
➔ R² is between 0 and 1, the closer R² is to 1, the better the fit
➔ Interpretation of R²: (e.g. 65% of the variation in the selling price is explained by the variation in odometer reading. The rest 35% remains unexplained by this model)
➔ **R² doesn't indicate whether model is adequate**
➔ As you add more X's to model, R² goes up
➔ Guide to finding SSR, SSE, SST
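The least-squares formulas for the slope and intercept, the residual properties, and R² can be checked numerically. The sketch below uses only the standard library; the five (x, y) pairs are made-up illustration data and the function names are not from the sheet.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (b0, b1) with b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Sample covariance, then the correlation coefficient r.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """R^2 = 1 - SSE / SST."""
    y_bar = mean(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
fit_quality = r_squared(xs, ys, b0, b1)
```

The residuals sum to zero and the fitted values have the same mean as the original y values, matching properties 1 and 2 above.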
Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than something itself
2.

Estimating Se
➔ $S_e^2 = \frac{SSE}{n - 2}$
➔ $S_e^2$ is our estimate of σ²
➔ $S_e = \sqrt{S_e^2}$ is our estimate of σ
➔ 95% of the Y values should lie within the interval $b_0 + b_1 X \pm 1.96 S_e$

Example of Prediction Intervals:

Standard Errors for b1 and b0
➔ standard errors increase when noise increases
➔ $s_{b_0}$: amount of uncertainty in our estimate of β0 (small s good, large s bad)
➔ $s_{b_1}$: amount of uncertainty in our estimate of β1
◆ As ε (noise) gets bigger, it's harder to find the line
➔ n small → bad
$s_e$ big → bad
$s_x^2$ small → bad (wants x's spread out for better guess)

Confidence Intervals for b1 and b0

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether slope (β1) is needed in our model
➔ H0: β1 = 0 (don't need x)
Ha: β1 ≠ 0 (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n>30
➔ if n < 30, use p-values
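The Estimating Se formulas translate directly into code. This is a sketch under stated assumptions: the observed and fitted values are hypothetical (they correspond to a line with b0 = 2.2 and b1 = 0.6), and the helper names are illustrative.

```python
import math

def residual_std_error(ys, fitted):
    """S_e = sqrt(SSE / (n - 2)), the estimate of sigma."""
    n = len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return math.sqrt(sse / (n - 2))

def approx_95_band(b0, b1, x, s_e):
    """Interval b0 + b1 * x +/- 1.96 * S_e around the fitted value at x."""
    centre = b0 + b1 * x
    return centre - 1.96 * s_e, centre + 1.96 * s_e

# Hypothetical observed values and their fitted values from y = 2.2 + 0.6x.
ys = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
s_e = residual_std_error(ys, fitted)
low, high = approx_95_band(2.2, 0.6, x=3, s_e=s_e)
```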
Multiple Regression

➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)

Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will decrease as you add junk x variables
➔ Adj. R-squared will only increase if the x you add in is very useful
➔ **want Adj. R-squared to go up and Se low for better model

The Overall F Test
➔ Always want to reject F test (reject null hypothesis)
➔ Look at p-value (if < 0.05, reject null)
➔ H0: β1 = β2 = β3 ... = βk = 0 (don't need any X's)
Ha: at least one βj ≠ 0 (need at least 1 X)
➔ If no x variables needed, then SSR=0 and SST=SSE

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

Modeling Regression
Backward Stepwise Regression
1. Start with all variables in the model
2. at each step, delete the least important variable based on largest p-value above 0.05
3. stop when you can't delete anymore
➔ Will see Adj. R-squared go up and Se go down

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1, allow intercepts to change

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C-1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation)
hockey = 100 + 10Wtr - 20Spr + 30Fall
Write equation for how many hockey sticks sold in the winter
hockey = 110 + 20Fall - 30Spr - 10Summer
➔ **always need to get same exact values from the original equation
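The requirement that a recoded equation reproduce the original predictions can be verified for the hockey example. The sketch below reads the coefficients from the two equations with their minus signs restored (-20 for spring with summer as baseline; -30 for spring and -10 for summer with winter as baseline); the dictionary layout and function names are illustrative assumptions.

```python
SEASONS = ["Summer", "Winter", "Spring", "Fall"]

def sticks_summer_baseline(season):
    """Original equation: summer is the baseline held in the intercept."""
    effect = {"Summer": 0, "Winter": 10, "Spring": -20, "Fall": 30}
    return 100 + effect[season]

def sticks_winter_baseline(season):
    """Recoded equation: winter is the baseline held in the intercept."""
    effect = {"Winter": 0, "Fall": 20, "Spring": -30, "Summer": -10}
    return 110 + effect[season]

# Both codings should predict the same sales in every season.
predictions = {s: (sticks_summer_baseline(s), sticks_winter_baseline(s))
               for s in SEASONS}
```

The new intercept is the old prediction for the new baseline season (100 + 10 = 110), and every other coefficient is the old prediction for that season minus 110.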
Regression Diagnostics
Standardize Residuals

Check Model Assumptions
➔ Plot residuals versus Yhat
➔ Outliers
◆ Regression likes to move towards outliers (shows up as R² being really high)
◆ want to remove outlier that is extreme in both x and y
➔ Nonlinearity (ovtest)
◆ Plotting residuals vs. fitted values will show a relationship if data is nonlinear (R² also high)
◆ Log transformation - accommodates non-linearity, reduces right skewness in the Y, eliminates heteroskedasticity
◆ **Only take log of X variable so that we can compare models. Can't compare models if you take log of Y.
◆ Transformations cheatsheet
◆ ovtest: a significant test statistic indicates that polynomial terms should be added
◆ H0: data = no transformation
Ha: data ≠ no transformation
➔ Normality (sktest)
◆ H0: data = normality
Ha: data ≠ normality
◆ don't want to reject the null hypothesis. P-value should be big
➔ Homoskedasticity (hettest)
◆ H0: data = homoskedasticity
◆ Ha: data ≠ homoskedasticity
◆ Homoskedastic: band around the values
◆ Heteroskedastic: as x goes up, the noise goes up (no more band, fan-shaped)
◆ If heteroskedastic, fix it by logging the Y variable
◆ If heteroskedastic, fix it by making standard errors robust
➔ Multicollinearity
◆ when x variables are highly correlated with each other.
◆ R² > 0.9
◆ pairwise correlation > 0.9
◆ correlate all x variables, include y variable, drop the x variable that is less correlated to y

Summary of Regression Output
    

 
  

  
       #   #$
        

 
     
     #   ( 
Population variance:
$\sigma^2 = \frac{\sum (x - \mu)^2}{n} = \frac{\sum x^2}{n} - \mu^2$

Sample variance:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$

Standard deviation:
$s = \sqrt{s^2}$

Standard error of the mean:
$\frac{s}{\sqrt{n}}$

Standard error of the difference between two means (pooled variance $s^2$, or separate variances $s_1^2$ and $s_2^2$):
$\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ or $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

One-sample t-test
$H_0: \mu = \mu_0$
$H_1$: $\mu > \mu_0$, $\mu < \mu_0$, or $\mu \ne \mu_0$
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, with $n - 1$ degrees of freedom
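Example: seven measurements, testing $H_0: \mu = 0.05$ against $H_1: \mu > 0.05$.

x: 0.051, 0.0505, 0.049, 0.0516, 0.052, 0.0508, 0.0506

$\bar{x} = 0.0508$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(0.051 - 0.0508)^2 + (0.0505 - 0.0508)^2 + \dots}{6} = 9.15 \times 10^{-7}$
$s = \sqrt{s^2} = 9.56 \times 10^{-4}$

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.0508 - 0.05}{9.56 \times 10^{-4} / \sqrt{7}} = 2.17$, with $7 - 1 = 6$ degrees of freedom

For a one-tailed test, $0.05 > P > 0.025$, so $H_0$ is rejected at the 0.05 level.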

  
 !"!%" "!"%#"! !"' " " 
  
$H_0: \mu_1 = \mu_2$
$H_1$: $\mu_1 \ne \mu_2$, $\mu_1 > \mu_2$, or $\mu_1 < \mu_2$

Test statistic, for equal variances (pooled $s^2$) or unequal variances:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ or $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Degrees of freedom $= (n_1 - 1) + (n_2 - 1)$
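Example ($\alpha = 0.05$): $\bar{x}_1 = 700$, $s_1 = 21$, $n_1 = 12$; $\bar{x}_2 = 668$, $s_2 = 30$, $n_2 = 6$
$H_0: \mu_1 = \mu_2$

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{700 - 668}{\sqrt{\frac{21^2}{12} + \frac{30^2}{6}}} = 2.342$
Degrees of freedom $= (n_1 - 1) + (n_2 - 1) = (12 - 1) + (6 - 1) = 16$
$P < 0.05$, so $H_0$ is rejected.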
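The two-sample statistic in the example above needs only the summary figures (means, standard deviations, sample sizes). A minimal sketch, assuming the unequal-variance form of the standard error and the degrees of freedom used in the example; the function name is illustrative.

```python
from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, degrees of freedom) from summary statistics.

    Uses the separate-variance standard error sqrt(s1^2/n1 + s2^2/n2)
    and df = (n1 - 1) + (n2 - 1), as in the worked example.
    """
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error, (n1 - 1) + (n2 - 1)

# Summary figures from the worked example.
t_stat, df = two_sample_t(700, 21, 12, 668, 30, 6)
```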

   
     
 $!! #!!!!  # !"!
$H_0$: the observed frequencies follow the expected distribution
$H_1$: the observed frequencies do not follow the expected distribution
$\chi^2 = \sum \frac{(O - E)^2}{E}$ (O = observed frequency, E = expected frequency)
Degrees of freedom = number of categories - 1
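Example: observed counts in four categories, tested against an expected 9:3:3:1 ratio.

Observed: 143, 60, 55, 18
Expected: 155.25, 51.75, 51.75, 17.25

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(143 - 155.25)^2}{155.25} + \frac{(60 - 51.75)^2}{51.75} + \frac{(55 - 51.75)^2}{51.75} + \frac{(18 - 17.25)^2}{17.25} = 2.519$
Degrees of freedom = 4 - 1 = 3
This is below the 5% critical value for 3 degrees of freedom (7.815), so $H_0$ is not rejected.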
 
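Example: 25 observations spread over 9 categories, tested against equal expected frequencies.

Expected frequency in each category: 25 / 9 = 2.778
Observed frequencies: 2, 4, 4, 3, 3, 3, 2, 3, 1

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(1 - 2.778)^2}{2.778} + 2 \times \frac{(2 - 2.778)^2}{2.778} + 4 \times \frac{(3 - 2.778)^2}{2.778} + 2 \times \frac{(4 - 2.778)^2}{2.778} = 2.72$
Degrees of freedom = 9 - 1 = 8
This is below the 5% critical value for 8 degrees of freedom (15.507), so $H_0$ is not rejected.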
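Both goodness-of-fit examples above follow the same recipe: build the expected counts, then sum (O - E)²/E. A minimal sketch using the counts as read from the examples; the helper names are illustrative and no p-value lookup is attempted.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_from_ratio(total, ratio):
    """Split a total count according to a ratio such as 9:3:3:1."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]

# First example: four categories against a 9:3:3:1 ratio.
observed_a = [143, 60, 55, 18]
expected_a = expected_from_ratio(sum(observed_a), [9, 3, 3, 1])
stat_a = chi_square(observed_a, expected_a)
df_a = len(observed_a) - 1

# Second example: 25 observations against equal expected frequencies.
observed_b = [2, 4, 4, 3, 3, 3, 2, 3, 1]
expected_b = expected_from_ratio(sum(observed_b), [1] * len(observed_b))
stat_b = chi_square(observed_b, expected_b)
df_b = len(observed_b) - 1
```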

   
  
 "&#!#&#!%!"!!#!#./
$H_0$: the two variables are independent
$H_1$: the two variables are not independent

Observed frequencies $O_1$ to $O_4$ are laid out in a table with row and column totals:

              Column 1       Column 2       Total
Row 1         O1             O2             O1 + O2
Row 2         O3             O4             O3 + O4
Total         O1 + O3        O2 + O4        O1 + O2 + O3 + O4

The expected frequency for a cell is its row total times its column total, divided by the grand total. For the first cell:
$E = \frac{(O_1 + O_3)(O_1 + O_2)}{O_1 + O_2 + O_3 + O_4} = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$

Degrees of freedom = (r - 1)(c - 1), where r is the number of rows and c the number of columns
$\chi^2 = \sum \frac{(O - E)^2}{E}$
 
Example ($\alpha = 0.05$): observed frequencies in a 4 × 4 table.

Observed    Col 1    Col 2    Col 3    Col 4    Total
Row 1       113      113      110      159      495
Row 2       119      135      172      190      616
Row 3       77       91       86       65       319
Row 4       181      152      124      73       530
Total       490      491      492      487      1960

Expected frequency for row 1, column 1:
$E = \frac{(\text{row total})(\text{column total})}{\text{grand total}} = \frac{(495)(490)}{1960} = 123.75$

Expected    Col 1    Col 2    Col 3    Col 4
Row 1       123.75   124      124.26   122.99
Row 2       154      154.31   154.63   153.06
Row 3       79.75    79.91    80.08    79.26
Row 4       132.5    132.77   133.04   131.69

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(113 - 123.75)^2}{123.75} + \frac{(113 - 124)^2}{124} + \frac{(110 - 124.26)^2}{124.26} + \dots = 87.27$
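The expected counts and the chi-squared statistic for a contingency table can be computed for any number of rows and columns. The sketch below uses the 4 × 4 observed counts as read from the example (the extraction is damaged, so the cell values are a best reading that agrees with the row, column and grand totals); the function names are illustrative.

```python
def expected_counts(table):
    """E for each cell = (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_table(table):
    """Return (chi-squared statistic, degrees of freedom) for the table."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Observed counts as read from the worked example.
observed = [
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
]
expected = expected_counts(observed)
stat, df = chi_square_table(observed)
```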
Degrees of freedom = (r - 1)(c - 1) = (4 - 1)(4 - 1) = 9
$P < 0.001$, so $H_0$ is rejected.

  
 $   $
 

  
 $    $


   

  
  $!  
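Multiplication rule (independent events):
$P(A \text{ and } B) = P(A) \times P(B)$

Addition rule:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Conditional probability and Bayes' theorem:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$, $P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$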
Example (conditional probability):
$P(A \text{ and } B) = P(\text{2 children} = bb \text{ and 1st child } bb) = 0.5 \times 0.5 \times 0.5 = 0.125$
$P(B) = P(\text{1st child} = bb) = 0.5$
$P(\text{2 children} = bb \mid \text{1st child } bb) = P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{0.125}{0.5} = 0.25$
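Example (Bayes' theorem), with A: W = Zz and B: S = Z:
$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$
$P(S = Z \mid W = Zz) = 0.5$
$P(W = Zz) = 0.7$
$P(S = Z) = (0.7 \times 0.5) + (0.3 \times 1) = 0.65$
$P(W = Zz \mid S = Z) = \frac{0.5 \times 0.7}{0.65} = 0.538$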
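The Bayes' theorem example above has two steps: build P(B) from the two ways B can happen, then divide. A minimal sketch with the probabilities taken from the example; the variable and function names are illustrative.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Values from the worked example: A is "W = Zz", B is "S = Z".
p_a = 0.7              # P(W = Zz)
p_b_given_a = 0.5      # P(S = Z | W = Zz)
p_b_given_not_a = 1.0  # P(S = Z | W is not Zz)

# Total probability of B: (0.7 * 0.5) + (0.3 * 1).
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
posterior = bayes(p_b_given_a, p_a, p_b)  # P(W = Zz | S = Z)
```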

      

     
 "  $" "
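Binomial distribution:
$P(X = m) = \frac{n! \, p^m (1 - p)^{n - m}}{m! \, (n - m)!}$

Example:
$P(\text{1 boy of 5 children}) = \frac{5! \times 0.5^1 \times 0.5^4}{1! \, (4)!} = 0.15625$

Poisson distribution:
$P(X = m) = \frac{e^{-np} (np)^m}{m!}$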
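The binomial and Poisson formulas above can be written directly with the standard library. This is a sketch: the function names are illustrative, and the Poisson call at the end uses hypothetical values of n and p chosen only to show the approximation with mean np.

```python
from math import comb, exp, factorial

def binomial_pmf(m, n, p):
    """P(X = m) = n! / (m! (n - m)!) * p^m * (1 - p)^(n - m)."""
    return comb(n, m) * p ** m * (1 - p) ** (n - m)

def poisson_pmf(m, n, p):
    """P(X = m) = e^(-np) * (np)^m / m!, with mean np."""
    rate = n * p
    return exp(-rate) * rate ** m / factorial(m)

# Example from the sheet: exactly 1 boy among 5 children, p = 0.5.
one_boy_of_five = binomial_pmf(1, 5, 0.5)

# Hypothetical rare-event case: n = 100 trials, p = 0.02, no events.
no_events = poisson_pmf(0, 100, 0.02)
```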

PIVOT TABLE CHEAT SHEET

INSERTING A PIVOT TABLE
Click anywhere in your data source or Table and choose Insert > PivotTable
Alt N V T

REFRESH A PIVOT TABLE
Right click anywhere in the Pivot Table and select Refresh
Alt F5

DRILL DOWN TO AUDIT
Double click with your mouse in a Pivot Table value

PIVOT TABLE STYLES
Click anywhere in your Pivot Table, then choose Design > Pivot Table Styles

INSERT/REMOVE SUBTOTALS & GRAND TOTALS
Click anywhere in your Pivot Table, then choose Design > Subtotals/Grand Totals

NUMBER FORMATTING
Right click on any values within the Pivot Table and select Number Format

PIVOT TABLE OPTIONS
Right click anywhere in the Pivot Table and select Pivot Table Options

SUMMARIZE VALUES BY / SHOW VALUES AS
Right click anywhere in the Pivot Table and select Summarize Values By or Show Values As
…Or in the Field List Values Area, choose the drop down arrow and select Value Field Settings > Summarize Values By or Show Values As

GROUPING
Right click anywhere in the Pivot Table and select Group

SORTING
Right click anywhere in the Pivot Table and select Sort

FILTERING
Click on the Row/Column drop down arrow to access the Filter

ACTIVATE/DEACTIVATE GETPIVOTDATA
Click anywhere in your Pivot Table, then choose Options/Analyze > Options drop down box > Generate GetPivotData

SLICERS
Click anywhere in your Pivot Table, then choose Options/Analyze > Slicers
Click on the Slicer to activate the Slicer Tools/Options tab

CALCULATED FIELDS
Click in the values area of your Pivot Table, choose Options/Analyze > Fields, Items & Sets > Calculated Field

PIVOT CHARTS
Click anywhere in your Pivot Table, then choose Options/Analyze > PivotChart

CONDITIONAL FORMATTING PIVOT TABLES
Highlight the values in your Pivot Table, then choose Home > Conditional Formatting

Feel free to share with your friends & colleagues!
