
Unit-IV

Percentage tables and cross tabulations


• Percentage Tables and Cross Tabulations (or Crosstabs) are
both tools used in data analysis to present and interpret
relationships between variables.
• 1. Percentage Tables:
• A percentage table is used to show how a variable is distributed
across categories or groups, expressed as percentages. It typically
involves calculating the percentage of total observations that fall
into each category. This is useful when you want to compare the
relative frequency or proportion of categories within a dataset.
• Example: If you have a dataset showing the number of students
by gender and grade level, a percentage table could show the
proportion of male and female students within each grade as a
percentage of the total students in that grade.
Example Layout:

Grade Level   Male (%)   Female (%)   Total (%)
9th           40%        60%          100%
10th          45%        55%          100%
Total         42.5%      57.5%        100%
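A sketch of how such a percentage table can be computed with pandas; the counts are hypothetical, chosen to match the layout above (100 students per grade):

import pandas as pd

# Hypothetical counts consistent with the layout above
# (100 students per grade level).
counts = pd.DataFrame({'Male': [40, 45], 'Female': [60, 55]},
                      index=['9th', '10th'])

# Convert each row to percentages of that grade's total.
percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(percent)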

2. Cross Tabulations (Crosstabs):

Crosstabs summarize data by displaying the relationship between
two or more categorical variables in a matrix format. They help
identify patterns, correlations, and interactions between variables.
• Example: A crosstab could show the relationship between gender
and whether a student passed or failed a test. It helps in
understanding how the outcome of one variable (e.g.,
passing/failing) varies with the levels of another variable (e.g.,
gender).
Example Layout:

Gender   Passed   Failed   Total
Male     80       20       100
Female   70       30       100
Total    150      50       200


Batch processing

• Batch Processing is a method of running a series of jobs or tasks without
manual intervention. It involves processing large volumes of data or performing
a sequence of operations at scheduled times or in groups (batches), rather than in
real time.
• Key Characteristics of Batch Processing:
1. Non-Interactive: Batch jobs are executed without user interaction. They run in
the background, and users don’t need to monitor or engage with them actively.
2. Scheduled or Triggered: Batch jobs can be scheduled to run at specific times
(e.g., nightly, weekly) or triggered by a particular event (e.g., when a certain file
arrives).
3. Efficient for Large Data Sets: Since it processes data in chunks, batch
processing is particularly suited for handling large datasets or jobs that do not
need immediate results.
4. Automated: Once set up, batch processing tasks run automatically without
human intervention. This reduces errors and the need for manual processing.
5. Time-Consuming: While efficient for large data sets, batch processing can be
slow because it processes tasks sequentially or in bulk. However, it can be
optimized for specific tasks.
• Examples of Batch Processing:
• Payroll Systems: A company might process employee payroll in batches at the end of
each month, calculating wages, deductions, and taxes for all employees at once.
• Banking Systems: Banks often use batch processing to update account balances
overnight, process large numbers of transactions, or generate monthly statements.
• Data ETL (Extract, Transform, Load): In data warehouses, batch processing is used
to extract data from various sources, transform it (e.g., clean, aggregate), and load it
into a centralized system for analysis.
• Benefits:
• Efficiency: Large amounts of data can be processed quickly and efficiently in batches.
• Cost-Effective: Batch jobs can be scheduled during off-peak hours (e.g., overnight) to
optimize system resources and reduce operational costs.
• Scalability: Batch processing can handle large volumes of data without requiring real-
time processing power or resources.
• Drawbacks:
• Delay in Results: Since batch jobs run at scheduled intervals, results may not be
available immediately (i.e., it's not suitable for real-time or immediate processing
needs).
• Lack of Flexibility: If something goes wrong in the batch process, the whole batch
may need to be rerun, which can lead to delays.
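As a sketch of the ETL-style batch job described above, the following Python script processes a large file in fixed-size chunks rather than all at once. The file name and column names (transactions.csv, account_id, amount) are hypothetical, chosen only for illustration:

import pandas as pd

INPUT_FILE = "transactions.csv"   # hypothetical input
OUTPUT_FILE = "daily_totals.csv"  # hypothetical output

def run_batch():
    totals = []
    # Read the file in fixed-size chunks (batches) instead of all at once.
    for chunk in pd.read_csv(INPUT_FILE, chunksize=100_000):
        # Transform step: aggregate each batch by account.
        totals.append(chunk.groupby("account_id")["amount"].sum())
    # Combine the per-batch results and load them into one output file.
    result = pd.concat(totals).groupby(level=0).sum()
    result.to_csv(OUTPUT_FILE)

if __name__ == "__main__":
    run_batch()

A scheduler (e.g., cron) would typically trigger such a script during off-peak hours, matching the scheduled, non-interactive nature of batch jobs described above.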
Contingency Table
• A contingency table (also called a cross-tabulation or crosstab) is
a tabular representation of data that helps in analyzing the
relationship between two or more categorical variables. In software
development, contingency tables are used in domains such as database
management, business intelligence, and statistical analysis to
understand data patterns and dependencies.
• Example of a Contingency Table in Development
• Scenario: Bug Tracking System
• Suppose you are analyzing the relationship between the severity
of bugs and their status in a software development project. The
two variables are:
• Bug Severity: High, Medium, Low.
• Bug Status: Open, Resolved, Closed.
Key Components of a Contingency Table:
• Rows and Columns: Represent categories of the variables being analyzed.
• Cell Values: Count or frequency of occurrences for each combination of categories.
• Marginal Totals: Row and column totals that show the overall frequency for each category.
• Grand Total: The total number of observations across all categories.

Applications in Development:
• Bug Analysis: Understand patterns in bug occurrence, such as which severity levels remain
unresolved.
• Feature Usage: Analyze how frequently features are used across different user demographics or
time periods.
• Error Reporting: Categorize errors by type and frequency across modules or teams.
• A/B Testing: Compare user behavior under different conditions.
• Example Analysis:
• From such a contingency table of bug counts, we might see that:
– High-severity bugs are less frequent but are more often unresolved (10/17
= 58.8% remain open).
– Medium-severity bugs have a larger share of resolved cases (10/30 = 33.3%).
– Low-severity bugs are mostly resolved or closed, indicating they are easier to handle.
import pandas as pd

# Example data
data = {
    'Severity': ['High', 'High', 'Medium', 'Low', 'Low', 'Medium', 'High'],
    'Status': ['Open', 'Resolved', 'Open', 'Closed', 'Resolved', 'Open', 'Closed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Create the contingency table
contingency_table = pd.crosstab(df['Severity'], df['Status'])

print(contingency_table)
Scatter Plots and Resistant Lines

• Scatter Plot
• A scatter plot is a graphical representation of the relationship between two
variables. Each point on the plot represents an observation, with its position
determined by the values of the two variables.
• X-axis: Represents the independent variable.
• Y-axis: Represents the dependent variable.
• Points: Represent observations, with coordinates (x, y).
• Scatter plots are often used to:
• Visualize relationships: Identify patterns, correlations, or clusters.
• Detect outliers: Spot points that deviate significantly from the general trend.
• Assess trends: Help determine if a relationship is linear, quadratic, or non-
linear.
• Example:
• If you are studying the relationship between hours studied (X) and exam
scores (Y), a scatter plot could help determine if more study hours generally
lead to better scores.
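A minimal matplotlib sketch of this example, using made-up study data:

import matplotlib.pyplot as plt

# Hypothetical data: hours studied (x) vs. exam score (y).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 58, 70, 74, 78, 85]

plt.scatter(hours, scores)
plt.xlabel("Hours studied (independent variable)")
plt.ylabel("Exam score (dependent variable)")
plt.title("Hours Studied vs. Exam Score")
plt.show()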
Resistant Line
• A resistant line is a robust statistical line fitted to a scatter plot that is less
affected by outliers compared to traditional regression lines. It is used to
summarize the central trend in the data.
• Characteristics:
• Resistant to Outliers: Unlike the least squares regression line (which
minimizes the squared deviations of points), resistant lines are not overly
influenced by extreme points.
• Approximation of Trends: Offers a more realistic summary of data when
outliers or non-uniform variance is present.
• Simpler Computation: Often calculated using medians or other resistant
measures.
• Construction:
• A common approach to constructing a resistant line is Median-Median Line
Fitting:
• Divide the data into three groups based on the x-values (low, middle, and
high).
• Compute the median of x-values and y-values for each group.
• Use the medians to compute a slope and intercept, forming the resistant line.
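A NumPy sketch of the three-group median-median procedure above, assuming at least three (x, y) observations; this is an illustration of the method, not a library implementation:

import numpy as np

def median_median_line(x, y):
    """Fit a resistant (median-median) line; assumes len(x) >= 3."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)            # sort observations by x-value
    x, y = x[order], y[order]
    n = len(x)
    third = n // 3
    # Split into low, middle, and high groups by x-value.
    groups = [(x[:third], y[:third]),
              (x[third:n - third], y[third:n - third]),
              (x[n - third:], y[n - third:])]
    # Summary point (median x, median y) for each group.
    mx = [np.median(g[0]) for g in groups]
    my = [np.median(g[1]) for g in groups]
    # Slope from the outer two summary points.
    slope = (my[2] - my[0]) / (mx[2] - mx[0])
    # Intercept: average the intercepts implied by all three points.
    intercept = np.mean([my[i] - slope * mx[i] for i in range(3)])
    return slope, intercept

# Example with one large outlier in y; the medians resist its pull.
slope, intercept = median_median_line(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [2.1, 3.9, 6.2, 8.0, 30.0, 12.1, 14.0, 15.8, 18.2])
print(f"y = {slope:.2f}x + {intercept:.2f}")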
Transformations in Bivariate Analysis

• In bivariate analysis, transformations are applied to one or both variables
to simplify relationships, improve interpretability, or meet assumptions for
statistical modeling (e.g., linearity, normality). These transformations can
make non-linear relationships linear, stabilize variances, or normalize data
distributions.

• Why Transformations are Used


• Linearizing Relationships: Some relationships between variables may be
non-linear. Transformations can make them linear for easier analysis and
modeling.
• Stabilizing Variance: Transformations reduce heteroscedasticity (unequal
variance in residuals).
• Normalizing Data: Ensures variables follow a normal distribution, which
is required for many statistical tests.
• Improving Correlation: Transformations can strengthen or reveal hidden
relationships.
Example in Bivariate Analysis
• Scenario: Analyzing the relationship between advertising
expenditure and sales revenue.
• Raw Data: A scatter plot might show a curved, non-linear
relationship.
• Log Transformation: Applying log(x) to advertising
expenditure might linearize the relationship,
making it suitable for regression analysis.
• Application in Visualization
• Transformations can also improve data visualization:
• Before Transformation: A scatter plot might show a
skewed or curved relationship.
• After Transformation: The scatter plot may show a more
linear or homoscedastic relationship.
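A small NumPy sketch, on made-up data, of how a log transformation can strengthen a relationship that is multiplicative on the raw scale:

import numpy as np

# Hypothetical data: sales grow multiplicatively with advertising spend.
ad_spend = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
sales = np.array([10, 14, 21, 29, 42, 58, 83], dtype=float)

# Correlation on the raw scale vs. after log-transforming both variables.
raw_corr = np.corrcoef(ad_spend, sales)[0, 1]
log_corr = np.corrcoef(np.log(ad_spend), np.log(sales))[0, 1]

print(f"Correlation (raw):     {raw_corr:.3f}")
print(f"Correlation (log-log): {log_corr:.3f}")

Because sales rise by a roughly constant factor each time spend doubles, the log-log correlation comes out closer to 1 than the raw correlation, which is the linearizing effect described above.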
Time Series Analysis

• Time series analysis is a statistical method for analyzing data
points collected or recorded at specific time intervals. It is used
to identify patterns, trends, and other characteristics in the data
over time, enabling predictions and informed decision-making.
• Key Components of Time Series Data
• Trend:
– The long-term movement or direction in the data (upward,
downward, or flat).
– Example: Increase in annual sales over a decade.
• Seasonality:
– Repeating patterns or fluctuations in data over a fixed period, such
as daily, monthly, or yearly.
– Example: Higher ice cream sales in summer.
• Cyclic Patterns:
– Fluctuations in data over a longer period (not fixed or periodic), often driven by
economic or business cycles.
– Example: Recessions in financial markets.
• Irregular or Random Variation:
– Unpredictable, non-repeating variations caused by external or random factors.
– Example: Sudden spikes in demand due to a one-time event.

• Methods of Time Series Analysis


• Smoothing Techniques:
– Used to reduce noise and highlight patterns.
– Examples (see the pandas sketch after this list):
• Moving Average: Computes the average of observations over a specific
window.
• Exponential Smoothing: Applies exponentially decreasing weights to past
data.
• Decomposition:
– Breaking the series into components: Trend, Seasonality, and Residual (random
noise).
• Autoregressive and Moving Average Models (AR, MA,
ARMA, ARIMA):
– AR (Autoregressive): Uses past values to predict future values.
– MA (Moving Average): Uses past forecast errors for
predictions.
– ARIMA (AutoRegressive Integrated Moving Average):
Combines AR, MA, and differencing for non-stationary data.
• Seasonal Decomposition of Time Series (STL):
– Separates data into trend, seasonality, and residual
components.
• Spectral Analysis:
– Identifies cyclical patterns by analyzing frequencies.
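A minimal pandas sketch of the two smoothing techniques listed above (moving average and exponential smoothing), using a made-up monthly series:

import pandas as pd

# Hypothetical monthly series for illustration.
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
              index=pd.date_range("2023-01-01", periods=12, freq="MS"))

# Moving average: mean of observations over a 3-month window.
ma = s.rolling(window=3).mean()

# Exponential smoothing: exponentially decreasing weights on past data.
es = s.ewm(alpha=0.3).mean()

print(pd.DataFrame({"raw": s, "moving_avg": ma, "exp_smooth": es}))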
Steps in Time Series Analysis
• Plot the Data:
– Visualize the series to detect patterns, trends, or anomalies.
• Check for Stationarity:
– A stationary series has constant mean and variance over time.
– Methods: Plotting, rolling statistics, Augmented Dickey-Fuller (ADF)
test.
• Transform Data (if needed):
– Apply log, square root, or differencing transformations to make the
series stationary.
• Model the Data:
– Fit appropriate models like ARIMA, exponential smoothing, etc.
• Validate the Model:
– Assess model performance using metrics like Mean Absolute Error
(MAE) or Root Mean Square Error (RMSE).
• Forecast and Interpret:
– Generate forecasts and interpret them in the context of the problem.
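A sketch of steps 2, 4, and 6 using statsmodels (one common choice for the ADF test and ARIMA models); the series here is made up for illustration:

import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical series; in practice this would be loaded from real data.
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
               115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
              index=pd.date_range("2022-01-01", periods=24, freq="MS"))

# Step 2: check stationarity with the Augmented Dickey-Fuller test.
adf_stat, p_value = adfuller(s)[:2]
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# Step 4: ARIMA(1, 1, 1) applies first differencing (the 'I' term)
# and fits AR and MA components on the differenced series.
model = ARIMA(s, order=(1, 1, 1)).fit()

# Step 6: forecast the next 3 periods.
print(model.forecast(steps=3))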
Cross Tabulation in Python

• Cross Tabulation, or crosstab, is a statistical tool used to analyze the
relationship between two or more categorical variables. In Python, the pandas
library provides the pd.crosstab() function, which makes it easy to generate
cross-tabulated data.
Key Features of pd.crosstab()
• Categorical Variable Analysis: Summarizes the frequency
distribution of categorical variables.
• Multi-dimensional Tables: Handles multiple variables on
both rows and columns.
• Aggregation: Allows summing or other aggregation of
values for numeric data.
• Normalization: Converts counts into proportions or
percentages.
Syntax
pandas.crosstab(index, columns, values=None, aggfunc=None,
margins=False, normalize=False)
• index: Rows of the crosstab (categorical variable).
• columns: Columns of the crosstab (categorical variable).
• values: Optional; numeric data for aggregation.
• aggfunc: Aggregation function (e.g., sum, mean).
• margins: Adds totals for rows and columns (default False).
• normalize: Normalizes the table (e.g., row-wise, column-wise,
or overall).
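A short example of pd.crosstab() with the margins and normalize options, using made-up data:

import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Result': ['Passed', 'Failed', 'Passed', 'Passed', 'Passed', 'Failed']
})

# Counts with row and column totals (margins).
print(pd.crosstab(df['Gender'], df['Result'], margins=True))

# Row-wise proportions: each row sums to 1.
print(pd.crosstab(df['Gender'], df['Result'], normalize='index'))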
