
Unit-IV

Percentage tables and cross tabulations


• Percentage Tables and Cross Tabulations (or Crosstabs) are
both tools used in data analysis to present and interpret
relationships between variables.
• 1. Percentage Tables:
• A percentage table is used to show how a variable is distributed
across categories or groups, expressed as percentages. It typically
involves calculating the percentage of total observations that fall
into each category. This is useful when you want to compare the
relative frequency or proportion of categories within a dataset.
• Example: If you have a dataset showing the number of students
by gender and grade level, a percentage table could show the
proportion of male and female students within each grade as a
percentage of the total students in that grade.
Example Layout:

Grade Level   Male (%)   Female (%)   Total (%)
9th           40%        60%          100%
10th          45%        55%          100%
Total         42.5%      57.5%        100%
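A sketch of how such a percentage table can be computed with pandas; the counts are hypothetical, chosen to match the layout above (100 students per grade):

import pandas as pd

# Hypothetical counts consistent with the layout above
# (100 students per grade level).
counts = pd.DataFrame({'Male': [40, 45], 'Female': [60, 55]},
                      index=['9th', '10th'])

# Convert each row to percentages of that grade's total.
percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(percent)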

2. Cross Tabulations (Crosstabs):

Crosstabs summarize data by displaying the relationship between
two or more categorical variables in a matrix format. They help
identify patterns, correlations, and interactions between variables.
• Example: A crosstab could show the relationship between gender
and whether a student passed or failed a test. It helps in
understanding how the outcome of one variable (e.g.,
passing/failing) varies with the levels of another variable (e.g.,
gender).
Example Layout:

Gender   Passed   Failed   Total
Male     80       20       100
Female   70       30       100
Total    150      50       200


Batch processing

• Batch Processing is a method of running a series of jobs or tasks without
manual intervention. It involves processing large volumes of data or performing
a sequence of operations at scheduled times or in groups (batches), rather than in
real time.
• Key Characteristics of Batch Processing:
1. Non-Interactive: Batch jobs are executed without user interaction. They run in
the background, and users don’t need to monitor or engage with them actively.
2. Scheduled or Triggered: Batch jobs can be scheduled to run at specific times
(e.g., nightly, weekly) or triggered by a particular event (e.g., when a certain file
arrives).
3. Efficient for Large Data Sets: Since it processes data in chunks, batch
processing is particularly suited for handling large datasets or jobs that do not
need immediate results.
4. Automated: Once set up, batch processing tasks run automatically without
human intervention. This reduces errors and the need for manual processing.
5. Time-Consuming: While efficient for large data sets, batch processing can be
slow because it processes tasks sequentially or in bulk. However, it can be
optimized for specific tasks.
• Examples of Batch Processing:
• Payroll Systems: A company might process employee payroll in batches at the end of
each month, calculating wages, deductions, and taxes for all employees at once.
• Banking Systems: Banks often use batch processing to update account balances
overnight, process large numbers of transactions, or generate monthly statements.
• Data ETL (Extract, Transform, Load): In data warehouses, batch processing is used
to extract data from various sources, transform it (e.g., clean, aggregate), and load it
into a centralized system for analysis.
• Benefits:
• Efficiency: Large amounts of data can be processed quickly and efficiently in batches.
• Cost-Effective: Batch jobs can be scheduled during off-peak hours (e.g., overnight) to
optimize system resources and reduce operational costs.
• Scalability: Batch processing can handle large volumes of data without requiring real-
time processing power or resources.
• Drawbacks:
• Delay in Results: Since batch jobs run at scheduled intervals, results may not be
available immediately (i.e., it's not suitable for real-time or immediate processing
needs).
• Lack of Flexibility: If something goes wrong in the batch process, the whole batch
may need to be rerun, which can lead to delays.
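As a sketch of the ETL-style batch job described above, the following Python script processes a large file in fixed-size chunks rather than all at once. The file name and column names (transactions.csv, account_id, amount) are hypothetical, chosen only for illustration:

import pandas as pd

INPUT_FILE = "transactions.csv"   # hypothetical input
OUTPUT_FILE = "daily_totals.csv"  # hypothetical output

def run_batch():
    totals = []
    # Read the file in fixed-size chunks (batches) instead of all at once.
    for chunk in pd.read_csv(INPUT_FILE, chunksize=100_000):
        # Transform step: aggregate each batch by account.
        totals.append(chunk.groupby("account_id")["amount"].sum())
    # Combine the per-batch results and load them into one output file.
    result = pd.concat(totals).groupby(level=0).sum()
    result.to_csv(OUTPUT_FILE)

if __name__ == "__main__":
    run_batch()

A scheduler (e.g., cron) would typically trigger such a script during off-peak hours, matching the scheduled, non-interactive nature of batch jobs described above.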
Contingency Table
• A contingency table (also called a cross-tabulation or crosstab) is
a tabular representation of data that helps in analyzing the
relationship between two or more categorical variables. In software
development, contingency tables are used in domains such as database
management, business intelligence, and statistical analysis to
understand data patterns and dependencies.
• Example of a Contingency Table in Development
• Scenario: Bug Tracking System
• Suppose you are analyzing the relationship between the severity
of bugs and their status in a software development project. The
two variables are:
• Bug Severity: High, Medium, Low.
• Bug Status: Open, Resolved, Closed.
Key Components of a Contingency Table:
• Rows and Columns: Represent categories of the variables being analyzed.
• Cell Values: Count or frequency of occurrences for each combination of categories.
• Marginal Totals: Row and column totals that show the overall frequency for each category.
• Grand Total: The total number of observations across all categories.

Applications in Development:
• Bug Analysis: Understand patterns in bug occurrence, such as which severity levels remain
unresolved.
• Feature Usage: Analyze how frequently features are used across different user demographics or
time periods.
• Error Reporting: Categorize errors by type and frequency across modules or teams.
• A/B Testing: Compare user behavior under different conditions.
• Example Analysis:
• From such a contingency table of bug counts, we might see that:
– High-severity bugs are less frequent but are more often unresolved (10/17
= 58.8% remain open).
– Medium-severity bugs have a larger share of resolved cases (10/30 = 33.3%).
– Low-severity bugs are mostly resolved or closed, indicating they are easier to handle.
import pandas as pd

# Example data
data = {
    'Severity': ['High', 'High', 'Medium', 'Low', 'Low', 'Medium', 'High'],
    'Status': ['Open', 'Resolved', 'Open', 'Closed', 'Resolved', 'Open', 'Closed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Create the contingency table
contingency_table = pd.crosstab(df['Severity'], df['Status'])

print(contingency_table)
Scatter Plots and Resistant Lines

• Scatter Plot
• A scatter plot is a graphical representation of the relationship between two
variables. Each point on the plot represents an observation, with its position
determined by the values of the two variables.
• X-axis: Represents the independent variable.
• Y-axis: Represents the dependent variable.
• Points: Represent observations, with coordinates (x, y).
• Scatter plots are often used to:
• Visualize relationships: Identify patterns, correlations, or clusters.
• Detect outliers: Spot points that deviate significantly from the general trend.
• Assess trends: Help determine if a relationship is linear, quadratic, or non-
linear.
• Example:
• If you are studying the relationship between hours studied (X) and exam
scores (Y), a scatter plot could help determine if more study hours generally
lead to better scores.
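A minimal matplotlib sketch of this example, using made-up study data:

import matplotlib.pyplot as plt

# Hypothetical data: hours studied (x) vs. exam score (y).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 58, 70, 74, 78, 85]

plt.scatter(hours, scores)
plt.xlabel("Hours studied (independent variable)")
plt.ylabel("Exam score (dependent variable)")
plt.title("Hours Studied vs. Exam Score")
plt.show()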
Resistant Line
• A resistant line is a robust statistical line fitted to a scatter plot that is less
affected by outliers compared to traditional regression lines. It is used to
summarize the central trend in the data.
• Characteristics:
• Resistant to Outliers: Unlike the least squares regression line (which
minimizes the squared deviations of points), resistant lines are not overly
influenced by extreme points.
• Approximation of Trends: Offers a more realistic summary of data when
outliers or non-uniform variance is present.
• Simpler Computation: Often calculated using medians or other resistant
measures.
• Construction:
• A common approach to constructing a resistant line is Median-Median Line
Fitting:
• Divide the data into three groups based on the x-values (low, middle, and
high).
• Compute the median of x-values and y-values for each group.
• Use the medians to compute a slope and intercept, forming the resistant line.
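A NumPy sketch of the three-group median-median procedure above, assuming at least three (x, y) observations; this is an illustration of the method, not a library implementation:

import numpy as np

def median_median_line(x, y):
    """Fit a resistant (median-median) line; assumes len(x) >= 3."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)            # sort observations by x-value
    x, y = x[order], y[order]
    n = len(x)
    third = n // 3
    # Split into low, middle, and high groups by x-value.
    groups = [(x[:third], y[:third]),
              (x[third:n - third], y[third:n - third]),
              (x[n - third:], y[n - third:])]
    # Summary point (median x, median y) for each group.
    mx = [np.median(g[0]) for g in groups]
    my = [np.median(g[1]) for g in groups]
    # Slope from the outer two summary points.
    slope = (my[2] - my[0]) / (mx[2] - mx[0])
    # Intercept: average the intercepts implied by all three points.
    intercept = np.mean([my[i] - slope * mx[i] for i in range(3)])
    return slope, intercept

# Example with one large outlier in y; the medians resist its pull.
slope, intercept = median_median_line(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [2.1, 3.9, 6.2, 8.0, 30.0, 12.1, 14.0, 15.8, 18.2])
print(f"y = {slope:.2f}x + {intercept:.2f}")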
Transformations in Bivariate Analysis

• In bivariate analysis, transformations are applied to one or both variables
to simplify relationships, improve interpretability, or meet assumptions for
statistical modeling (e.g., linearity, normality). These transformations can
make non-linear relationships linear, stabilize variances, or normalize data
distributions.

• Why Transformations are Used


• Linearizing Relationships: Some relationships between variables may be
non-linear. Transformations can make them linear for easier analysis and
modeling.
• Stabilizing Variance: Transformations reduce heteroscedasticity (unequal
variance in residuals).
• Normalizing Data: Ensures variables follow a normal distribution, which
is required for many statistical tests.
• Improving Correlation: Transformations can strengthen or reveal hidden
relationships.
Example in Bivariate Analysis
• Scenario: Analyzing the relationship between advertising
expenditure and sales revenue.
• Raw Data: A scatter plot might show a curved, non-linear
relationship.
• Log Transformation: Applying log(x) to advertising
expenditure might linearize the relationship,
making it suitable for regression analysis.
• Application in Visualization
• Transformations can also improve data visualization:
• Before Transformation: A scatter plot might show a
skewed or curved relationship.
• After Transformation: The scatter plot may show a more
linear or homoscedastic relationship.
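A small NumPy sketch, on made-up data, of how a log transformation can strengthen a relationship that is multiplicative on the raw scale:

import numpy as np

# Hypothetical data: sales grow multiplicatively with advertising spend.
ad_spend = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
sales = np.array([10, 14, 21, 29, 42, 58, 83], dtype=float)

# Correlation on the raw scale vs. after log-transforming both variables.
raw_corr = np.corrcoef(ad_spend, sales)[0, 1]
log_corr = np.corrcoef(np.log(ad_spend), np.log(sales))[0, 1]

print(f"Correlation (raw):     {raw_corr:.3f}")
print(f"Correlation (log-log): {log_corr:.3f}")

Because sales rise by a roughly constant factor each time spend doubles, the log-log correlation comes out closer to 1 than the raw correlation, which is the linearizing effect described above.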
Time Series Analysis

• Time series analysis is a statistical method for analyzing data
points collected or recorded at specific time intervals. It is used
to identify patterns, trends, and other characteristics in the data
over time, enabling predictions and informed decision-making.
• Key Components of Time Series Data
• Trend:
– The long-term movement or direction in the data (upward,
downward, or flat).
– Example: Increase in annual sales over a decade.
• Seasonality:
– Repeating patterns or fluctuations in data over a fixed period, such
as daily, monthly, or yearly.
– Example: Higher ice cream sales in summer.
• Cyclic Patterns:
– Fluctuations in data over a longer period (not fixed or periodic), often driven by
economic or business cycles.
– Example: Recessions in financial markets.
• Irregular or Random Variation:
– Unpredictable, non-repeating variations caused by external or random factors.
– Example: Sudden spikes in demand due to a one-time event.

• Methods of Time Series Analysis


• Smoothing Techniques:
– Used to reduce noise and highlight patterns.
– Examples (see the pandas sketch after this list):
• Moving Average: Computes the average of observations over a specific
window.
• Exponential Smoothing: Applies exponentially decreasing weights to past
data.
• Decomposition:
– Breaking the series into components: Trend, Seasonality, and Residual (random
noise).
• Autoregressive and Moving Average Models (AR, MA,
ARMA, ARIMA):
– AR (Autoregressive): Uses past values to predict future values.
– MA (Moving Average): Uses past forecast errors for
predictions.
– ARIMA (AutoRegressive Integrated Moving Average):
Combines AR, MA, and differencing for non-stationary data.
• Seasonal Decomposition of Time Series (STL):
– Separates data into trend, seasonality, and residual
components.
• Spectral Analysis:
– Identifies cyclical patterns by analyzing frequencies.
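A minimal pandas sketch of the two smoothing techniques listed above (moving average and exponential smoothing), using a made-up monthly series:

import pandas as pd

# Hypothetical monthly series for illustration.
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
              index=pd.date_range("2023-01-01", periods=12, freq="MS"))

# Moving average: mean of observations over a 3-month window.
ma = s.rolling(window=3).mean()

# Exponential smoothing: exponentially decreasing weights on past data.
es = s.ewm(alpha=0.3).mean()

print(pd.DataFrame({"raw": s, "moving_avg": ma, "exp_smooth": es}))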
Steps in Time Series Analysis
• Plot the Data:
– Visualize the series to detect patterns, trends, or anomalies.
• Check for Stationarity:
– A stationary series has constant mean and variance over time.
– Methods: Plotting, rolling statistics, Augmented Dickey-Fuller (ADF)
test.
• Transform Data (if needed):
– Apply log, square root, or differencing transformations to make the
series stationary.
• Model the Data:
– Fit appropriate models like ARIMA, exponential smoothing, etc.
• Validate the Model:
– Assess model performance using metrics like Mean Absolute Error
(MAE) or Root Mean Square Error (RMSE).
• Forecast and Interpret:
– Generate forecasts and interpret them in the context of the problem.
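A sketch of steps 2, 4, and 6 using statsmodels (one common choice for the ADF test and ARIMA models); the series here is made up for illustration:

import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical series; in practice this would be loaded from real data.
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
               115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
              index=pd.date_range("2022-01-01", periods=24, freq="MS"))

# Step 2: check stationarity with the Augmented Dickey-Fuller test.
adf_stat, p_value = adfuller(s)[:2]
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# Step 4: ARIMA(1, 1, 1) applies first differencing (the 'I' term)
# and fits AR and MA components on the differenced series.
model = ARIMA(s, order=(1, 1, 1)).fit()

# Step 6: forecast the next 3 periods.
print(model.forecast(steps=3))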
Cross Tabulation in Python

• Cross Tabulation, or crosstab, is a statistical tool used to analyze the
relationship between two or more categorical variables. In Python, the pandas
library provides the pd.crosstab() function, which makes it easy to generate
cross-tabulated data.
Key Features of pd.crosstab()
• Categorical Variable Analysis: Summarizes the frequency
distribution of categorical variables.
• Multi-dimensional Tables: Handles multiple variables on
both rows and columns.
• Aggregation: Allows summing or other aggregation of
values for numeric data.
• Normalization: Converts counts into proportions or
percentages.
Syntax
pandas.crosstab(index, columns, values=None, aggfunc=None,
margins=False, normalize=False)
• index: Rows of the crosstab (categorical variable).
• columns: Columns of the crosstab (categorical variable).
• values: Optional; numeric data for aggregation.
• aggfunc: Aggregation function (e.g., sum, mean).
• margins: Adds totals for rows and columns (default False).
• normalize: Normalizes the table (e.g., row-wise, column-wise,
or overall).
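A short example of pd.crosstab() with the margins and normalize options, using made-up data:

import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Result': ['Passed', 'Failed', 'Passed', 'Passed', 'Passed', 'Failed']
})

# Counts with row and column totals (margins).
print(pd.crosstab(df['Gender'], df['Result'], margins=True))

# Row-wise proportions: each row sums to 1.
print(pd.crosstab(df['Gender'], df['Result'], normalize='index'))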
