Data Analysis3

Uploaded by

ericgasper008

DATA ANALYSIS

• In today’s data-driven world, organizations rely on data analysis to
uncover patterns, trends, and relationships within their data.
• The term data analysis refers to the systematic application of statistical
and logical techniques to describe, summarize, and evaluate data.
• This process can involve transforming raw data into a more
understandable format, identifying significant patterns, and drawing
conclusions based on the findings.
• It essentially refers to the practice of examining datasets to draw
conclusions about the information they contain.
• Put another way, it is the process of inspecting, cleaning, transforming,
and modeling data to discover useful information, draw conclusions, and
support decision-making.
Methods of data analysis

• Descriptive analytics answers: ‘What is the current prevalence of HIV across the
country?’

• Diagnostic analytics answers: ‘Why are HIV patients stopping their medication?’

• Predictive analytics answers: ‘Which HIV patients are at risk of stopping their
medication in the near future?’

• Prescriptive analytics answers: ‘What actions can be taken to reduce the number of
HIV patients who stop taking their medication?’
• 1. Descriptive analytics
• Descriptive analytics focuses on answering the question, ‘What is
happening?’ or ‘What has happened?’ by analyzing past data.
• Of all the types of data analytics, this is the most straightforward
approach as it summarizes and simplifies the main features and
characteristics of complex datasets through interactive visualizations.
• 2. Predictive analytics
• Predictive analytics uses historical data to answer the question, ‘What
may happen next?’ This model is used to predict future outcomes, find
patterns, and identify risks or growth opportunities.
• While descriptive analytics serves as a reflective mirror, showing us a
holistic picture of our past activities, predictive analytics acts as a
crystal ball, providing a sneak peek into the future.
• 3. Prescriptive analytics
• Unlike predictive analytics, which focuses on future outcomes,
prescriptive analytics helps decision-makers identify the best course
of action for achieving their business goals.
• The primary goal of this model is to answer the question: ‘What
should we do?’
• 4. Diagnostic analytics
• Diagnostic analytics examines past data to identify the root causes
behind a particular outcome. This type of analytics aims to answer the
question, ‘Why did this happen?’
• It focuses on uncovering insights into historical data patterns,
anomalies, and correlations to facilitate a deeper understanding of a
particular business problem.
DATA ENTRY
• Data entry is the process of digitizing data by entering it into a
computer system for organization and management purposes.
• Data entry is often done with a keyboard and at times also using a
mouse,[7] although a manually-fed scanner may be involved.[8]
• Although most data entered into a computer are stored in a database,
a significant amount is stored in a spreadsheet.[17] The use of
spreadsheets instead of databases for data entry can be traced to
1979.
TYPES OF DATA ENTRY
• Manual data entry
• This method involves individuals manually entering data using
keyboards or keypads. It is suitable for small data entry tasks or situations
where data is received in physical formats such as paper documents.
• Online data entry
• This type of data entry involves inputting data directly into online
forms or systems. It is commonly used for tasks such as online surveys
or customer registration.
DATA CLEANING
• Data cleaning refers to the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a data set.
• There is no single way to prescribe the exact steps of the data
cleaning process, because they vary from data set to data set.
Ways of cleaning data
• 1. Remove duplicate data
• 2. Fix structural errors, for example incorrect naming
• 3. Handle missing data
• 4. Filter unwanted outliers
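The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field layout (patient id, clinic name, age), the function name `clean`, and the plausible age range of 0 to 120 are all hypothetical, and filling a missing value with the median is just one of several reasonable choices.

```python
from statistics import median

# Hypothetical raw records: (patient_id, clinic, age). None marks a missing age.
raw = [
    (1, "Clinic A", 34),
    (1, "Clinic A", 34),      # exact duplicate
    (2, "clinic a ", 41),     # structural error: inconsistent naming
    (3, "Clinic B", None),    # missing value
    (4, "Clinic B", 29),
    (5, "Clinic B", 430),     # implausible outlier (likely a typing error)
]

def clean(records, min_age=0, max_age=120):
    # 1. Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(records))

    # 2. Fix structural errors: trim spaces and standardize capitalization.
    fixed = [(pid, " ".join(clinic.split()).title(), age)
             for pid, clinic, age in unique]

    # 3. Handle missing data: fill a missing age with the median of known ages.
    known = [age for _, _, age in fixed if age is not None]
    fill = median(known)
    filled = [(pid, clinic, fill if age is None else age)
              for pid, clinic, age in fixed]

    # 4. Filter unwanted outliers: keep only ages inside a plausible range.
    return [row for row in filled if min_age <= row[2] <= max_age]

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Because the right steps vary from data set to data set, the order and the rules used here (for instance, dropping rather than filling incomplete rows) would need to be adapted to the data at hand.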
Characteristics of quality data
• Accuracy
• Completeness
• Consistency
• Uniformity
DATA SUMMARIZATION
• Data summarization refers to presenting a compact description of a
dataset. In other words, data summarization is the presentation of a
dataset in an easy, informative, and comprehensive manner.
• A data summary is derived carefully from the entire data set and
reveals significant patterns and trends in a clear manner.
Types of data summarization
1. Based on Centrality
• A dataset can be summarised on the basis of its centrality. The centrality of a
dataset describes the centre or middle value of the data set. In other
words, it identifies one central value around which all other values of
the dataset cluster. Another name for centrality is ‘average.’
• There are several ways to find the centrality of a dataset. However, the most
popular ones are the mean, mode, and median. These three summarise
the distribution of the dataset.
• Mean
• Mean is used to calculate the numerical average of a dataset.
The arithmetic mean is calculated by adding all the values of the given
dataset and dividing the sum by the number of items therein. The
mathematical formula is as follows:
x̄ = ∑x / n
• Here, ‘x̄’ represents the ‘mean’,
‘∑’ represents ‘summation’, and
‘n’ represents ‘number of items’.
• For example: consider the following heights of 10 men in centimeters
(cm): 165, 167, 169, 169, 171, 173, 175, 176, 176, 169

• The mean height is calculated by adding the heights of the ten men
and dividing the sum by 10.
Arithmetic mean = (165 + 167 + 169 + 169 + 171 + 173 + 175 + 176 + 176
+ 169) / 10

x̄ = 1710/10 = 171 cm
• Mode
• Mode refers to the most frequently recurring value in the sample. In other
words, it is the most frequent number in the given dataset.
Mode is used less often than the mean and median in statistical analysis.
• Although it can be calculated for any type of sample, it is mostly
used where the sample size is large or the given values are integers.
• Note that it is possible to have more than one mode. For example: in
the following set of numbers (8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7) the mode is
both 8 and 6, since each occurs in the dataset three times.
• This dataset is referred to as bimodal because it has two modes.
• It is also possible not to have a mode in a set of numbers.
For example: in the following set of numbers (5, 4, 9, 7, 6, 3, 8) there is
no number which occurs more frequently than any other; therefore,
there is no mode.
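Both mode examples can be checked with a short Python sketch. The helper name `modes` is hypothetical; it assumes a non-empty list and treats a dataset in which no value repeats as having no mode, which is the convention used in the example above.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

bimodal = [8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7]
no_mode = [5, 4, 9, 7, 6, 3, 8]

print(modes(bimodal))  # [6, 8]
print(modes(no_mode))  # []
```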
• Median
• Median refers to the middle value of the series when arranged in
ascending or descending order. When the distribution is normal, the
mean and median tend to coincide.
For example, below is a series of durations (in days) of absence from
classes due to sickness: 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10,
10, 38, 80.
• The series has 21 values, so the median is the 11th value: the median
duration is 5 days.
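The absence example can be checked with Python's standard `statistics` module. The sketch also computes the mean of the same series, to show how far a few extreme values can pull the mean away from the median when the distribution is not normal. The variable name is illustrative.

```python
from statistics import mean, median

absences = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80]

# median() sorts the data itself and returns the middle value (11th of 21).
print(median(absences))  # 5

# The mean is pulled upward by the two long absences (38 and 80 days).
print(mean(absences))    # 10
```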
2. Based on Dispersion
• The term ‘dispersion’ means ‘spread.’ To elaborate, dispersion describes
how scattered the sample values are around the mean. It shows the
variability present within the given data.
• The Range
• The range is defined as the difference between the maximum value and the
minimum value. For example: if the lowest and highest values in a series of
diastolic blood pressure readings are 65 mm Hg and 95 mm Hg, then the range =
95 - 65 = 30 mm Hg.
• The range is seldom used in statistical analysis because:
• It wastes information, since it uses only the two extreme
values.
• The two extreme values are the ones most likely to be faulty.
• The range tends to increase as the number of observations increases.
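A small Python sketch makes the dependence on extreme values concrete. It reuses the sickness-absence series from the median example; the function name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

absences = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80]

print(value_range(absences))       # 79

# Dropping only the two longest absences changes the range drastically.
print(value_range(absences[:-2]))  # 9
```

Nineteen of the 21 values lie within 9 days of each other, yet the range of the full series is 79 days, because it is determined entirely by the smallest and largest observations.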
Standard Deviation (SD)

• Standard deviation is the most widely used measure of dispersion. It is best
suited to normally distributed data and shows how spread out the values are
from the mean.
• To rephrase, it reflects how far unusually small or unusually large values lie
from the mean, and thus gives an understanding of how scattered the data are. It
is loosely described as the ‘average deviation’ from the mean.
• The formula for SD is
SD = √( ∑(x − x̄)² / (n − 1) )
for a sample of n values; for a whole population, divide by n instead of n − 1.
Variance
• The variance represents the amount of spread or variability around
the mean of a set of data. It is the average of the squared deviations
from the mean.
• Because the variance is in squared units, we take its square root, the
standard deviation, to describe our data in the original units.
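As a worked illustration, the sketch below computes the variance and standard deviation of the ten heights from the mean example, using Python's standard `statistics` module. Which divisor is intended (n − 1 for a sample, n for a whole population) is an assumption here, so both are computed.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

heights = [165, 167, 169, 169, 171, 173, 175, 176, 176, 169]  # cm

# Sum of squared deviations from the mean (x̄ = 171 cm).
x_bar = mean(heights)
squared_deviations = [(x - x_bar) ** 2 for x in heights]
print(sum(squared_deviations))  # 134

# Sample versions divide by n - 1; population versions divide by n.
print(variance(heights), stdev(heights))    # sample variance (cm^2) and SD (cm)
print(pvariance(heights), pstdev(heights))  # population variance (cm^2) and SD (cm)
```

The variance comes out in cm², while the standard deviation is back in cm, the same unit as the heights themselves.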
Tools used in data analysis
• Numerous statistical software systems are currently available. The
commonly used software systems are:
• Statistical Package for the Social Sciences (SPSS, from IBM
Corporation),
• Statistical Analysis System (SAS, developed by SAS Institute, North
Carolina, United States of America),
• R (designed by Ross Ihaka and Robert Gentleman of the R Core Team),
• Minitab (developed by Minitab Inc),
• Stata (developed by StataCorp), and
• MS Excel (developed by Microsoft).
• Briefly explain the methods of data analysis, with examples.

• With regard to data summarization, explain the types and calculate the
SD and variance of the following:

sn      1   2   3   4   5   6   7   8
Value   24  34  38  46  47  53  53  61
