Big Data Analytics
Vardha Teena,
AAO/e-HRMS,
RTI, HYDERABAD
Big Data is a term that refers to the data gathered by businesses and organizations and stored digitally.
26-05-2023
Opportunities for
Indian Audit and Accounts Department
Policy Framework
In order to build on the specified opportunities, the policy framework
addresses the following issues:
1. Identification of data sources
2. Establishing Data Management Protocols
3. Digital auditing, Data analytics and Visualization strategy
4. Infrastructure, capacity building and change management.
• Audited entities’ data, which includes financial and non-financial data of audited entities, programme-specific data including beneficiary databases, and other data pertaining to audited entities
• Third party data
• Data published by Government and statutory authorities: Census data, NSSO data, data published by the various Ministries/Departments, data available in data.gov.in, reports of various commissions, and other reports and data pertaining to Union Government/States
• Other data available in public domain: surveys and information published by NGOs, industry-specific information published by CII, FICCI, NASSCOM etc., sector-specific information published by various organizations, social media etc.
1. Characteristics
2. Process
3. Results
[Figure: data analytics process with the stages Data Preparation, Data Analysis & Creation of Model, and Model Deployment]
Data types
• Structured
• Unstructured
Data acquisition
• Since IA&AD is not the owner of several data sources required for data analytics, data availability would remain a challenge. Continuous persuasion and monitoring with the audited entities, taking support from relevant provisions of the CAG’s (Duties, Powers and Conditions of Service) Act, 1971 and the Regulations on Audit and Accounts, 2007, will be the way to address this issue.
Access to data
Collection of data
• Data collection is the systematic approach of gathering and measuring information from a variety of sources to get a complete and accurate picture of an area of interest.
• It involves identification and requisition of relevant data, which can be complete databases, selected tables out of the databases, selected data fields of tables in the databases, or data pertaining to specific criteria/conditions for a particular period, location, class etc.
• While collecting data, the authenticity, integrity, relevance, usability and security of the data sets should be ensured.
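As an illustration of requisitioning only selected data fields for a particular period and location, here is a minimal Python sketch. The records, field names and the `requisition` helper are hypothetical and only stand in for an audited entity's data.

```python
from datetime import date

# Hypothetical payment records; real data would come from the audited entity.
records = [
    {"id": 1, "district": "Hyderabad", "paid_on": date(2022, 4, 10), "amount": 1200.0, "remarks": "ok"},
    {"id": 2, "district": "Warangal", "paid_on": date(2022, 6, 2), "amount": 800.0, "remarks": "ok"},
    {"id": 3, "district": "Hyderabad", "paid_on": date(2023, 1, 15), "amount": 450.0, "remarks": "late"},
    {"id": 4, "district": "Hyderabad", "paid_on": date(2021, 12, 30), "amount": 999.0, "remarks": "ok"},
]


def requisition(rows, fields, district, start, end):
    """Return only the requested fields for rows matching the location and period."""
    return [
        {field: row[field] for field in fields}
        for row in rows
        if row["district"] == district and start <= row["paid_on"] <= end
    ]


# Selected fields, one location, one financial year (assumed to run 1 April to 31 March).
subset = requisition(
    records,
    fields=["id", "paid_on", "amount"],
    district="Hyderabad",
    start=date(2022, 4, 1),
    end=date(2023, 3, 31),
)
for row in subset:
    print(row)
```

Requesting only the fields and period actually needed also keeps the volume of raw data held by the auditor small.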
Ownership of data
• The ownership of the data sets remains with the audited entity/third party data sources, and IA&AD holds this data only in a fiduciary capacity.
• Once the data sets are obtained from the data sources, the HoDs should assume the ownership of the data sets and should exercise such controls on security and confidentiality of the data as envisaged for the data owner in the audited entity.
Data security
While handling data, the basic approach should be to:
• limit, to the bare necessity, the number of personnel with access to the raw data;
• establish a trail of personnel who have accessed the data;
• maintain a complete and chronological record of all data shared between the data source owner and the auditor;
• ensure that computers used for data analytics are not connected to the internet.
Data reliability
• Data is said to be reliable when it accurately captures the parameter it represents.
• Data reliability is a function of the authenticity, integrity, relevance and usability of data.
• Data reliability can be affected by the methods of generation/capture of data.
• Generally, if the manual and IT systems are operating in parallel, the chances of errors in data are higher. Similarly, an MIS involving manual data entry is likely to be less reliable than systems where MIS data is generated directly through an IT system.
• The reliability requirement is based on the intended purpose: consideration of data reliability would be significantly higher for data sets planned for use as audit evidence than for data sets planned for drawing broad insights while planning.
Data preparation
• Data preparation is the process of organizing data for analytic purposes.
• It involves various activities such as restoration, importing of data, selection of database/table/record/field, joining datasets, appending datasets, cleansing, aggregation and treatment of missing values, invalid values, outliers etc.
• Data preparation is a project-specific phase.
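Several of the activities listed above (appending, joining, treatment of missing and invalid values, flagging outliers, aggregation) can be sketched in a few lines of Python. The data, the field names and the outlier threshold of 5 median absolute deviations are illustrative assumptions, not prescribed rules.

```python
from statistics import median

# Hypothetical extracts for two periods and a vendor lookup table.
payments_q1 = [{"vendor_id": "V1", "amount": 100.0}, {"vendor_id": "V2", "amount": None}]
payments_q2 = [
    {"vendor_id": "V1", "amount": 120.0},
    {"vendor_id": "V3", "amount": 110.0},
    {"vendor_id": "V2", "amount": 5000.0},
    {"vendor_id": "V3", "amount": -5.0},
]
vendors = {"V1": "Alpha Traders", "V2": "Beta Supplies"}

# Appending: stack the two extracts into one data set.
payments = payments_q1 + payments_q2

# Joining: attach the vendor name; None marks a vendor missing from the lookup.
for row in payments:
    row["vendor_name"] = vendors.get(row["vendor_id"])


def is_valid(row):
    """Assumed rule: an amount must be present and non-negative."""
    return row["amount"] is not None and row["amount"] >= 0


# Missing and invalid values: set them aside for follow-up instead of silently dropping them.
valid = [row for row in payments if is_valid(row)]
rejected = [row for row in payments if not is_valid(row)]

# Outliers: flag amounts far from the median (threshold of 5 MADs is an arbitrary choice).
amounts = [row["amount"] for row in valid]
centre = median(amounts)
mad = median(abs(a - centre) for a in amounts)
for row in valid:
    row["outlier"] = mad > 0 and abs(row["amount"] - centre) / mad > 5

# Aggregation: total amount per vendor.
totals = {}
for row in valid:
    totals[row["vendor_id"]] = totals.get(row["vendor_id"], 0.0) + row["amount"]

print(totals)
print([row for row in valid if row["outlier"]])
print(rejected)
```

Keeping the rejected rows in a separate list preserves a record of what was excluded and why, which matters if the data set is later used as audit evidence.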
Data restoration
• The data from the data source should be copied and restored on the auditor’s computer for further analysis.
• While using data in dump/backup format, it will be necessary to bring the data tables to their original format through a data restoration process.
• Before restoring a database backup/dump file, some basic information such as database software version, operating system and database size is required.
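The restoration step can be illustrated with SQLite, which ships with Python. This is only a sketch: the slides do not name a database engine, so SQLite, the `vouchers` table and its rows are assumptions, and a real restoration would use the tools of the audited entity's own database software.

```python
import sqlite3

# Stand-in for the audited entity's database (hypothetical table and rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE vouchers (voucher_no INTEGER PRIMARY KEY, amount REAL)")
source.executemany(
    "INSERT INTO vouchers VALUES (?, ?)",
    [(1, 250.0), (2, 975.5), (3, 40.0)],
)
source.commit()

# The dump handed over to the auditor: a plain SQL script.
dump_sql = "\n".join(source.iterdump())

# Restoration on the auditor's machine: replay the dump into a fresh database.
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)


def profile(conn):
    """Row count and control total, used to compare the restored copy with the source."""
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM vouchers").fetchone()


source_profile = profile(source)
restored_profile = profile(restored)
print("SQLite library version:", sqlite3.sqlite_version)
print("source:", source_profile, "restored:", restored_profile)
```

Comparing record counts and control totals between the source and the restored copy is one simple way to confirm that the restoration was complete.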
Data enhancement
• Data enhancement is also a data cleansing process, in which data is made more complete by adding related information.
• It involves activities such as harmonization and standardization of data.
• For example, appending the name of a bank to each bank code enhances the quality of data. Similarly, short codes (st, rd, etc.) could be harmonized to the actual words (street, road, etc.).
• Standardization of data is a means of changing a reference data set to a new standard, e.g., use of standard codes.
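The two examples above, appending a bank name to a bank code and expanding short codes in addresses, could look roughly like this in Python. The bank codes, bank names and abbreviation list are hypothetical placeholders; real reference data would come from an authoritative master list.

```python
import re

# Hypothetical reference data for illustration only.
BANK_NAMES = {"B001": "Example Bank A", "B002": "Example Bank B"}
ABBREVIATIONS = {"st": "street", "rd": "road"}


def harmonize_address(address):
    """Expand short codes such as 'St.' or 'Rd' when they appear as whole words."""

    def expand(match):
        word = match.group(0)
        full = ABBREVIATIONS.get(word.rstrip(".").lower())
        if full is None:
            return word  # not a known short code, leave untouched
        return full.capitalize() if word[0].isupper() else full

    # \b keeps ordinals like '3rd' intact: there is no word boundary between '3' and 'rd'.
    return re.sub(r"\b[A-Za-z]+\.?", expand, address)


def enhance(record):
    """Return a copy of the record with the bank name appended and the address harmonized."""
    enriched = dict(record)
    enriched["bank_name"] = BANK_NAMES.get(record["bank_code"], "UNKNOWN")
    enriched["address"] = harmonize_address(record["address"])
    return enriched


result = enhance({"bank_code": "B001", "address": "12 Park St., 3rd Cross Rd"})
print(result)
```

Codes that are not found in the lookup are marked `UNKNOWN` rather than guessed, so that they can be taken up with the data owner.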
Statistical techniques
• Correlation is used to measure the strength of association between two variables; the correlation coefficient ranges from -1 to +1.
• Regression analysis gives a numerical explanation of how variables relate and enables prediction of the dependent variable (y) given the independent variable (x).
• Principal Component Analysis aims to reduce a number of inter-correlated variables to a smaller set which explains the overall variability.
• Factor Analysis aims to group together and summarise variables which are correlated, thereby enabling data reduction.
• Cluster analysis is a multivariate technique used to group individuals/variables based on common characteristics.
• Classification is the process of arranging data into homogeneous groups or classes according to some common characteristics present in the data.
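To make the first two techniques concrete, the sketch below computes the Pearson correlation coefficient and fits a simple least-squares regression line in plain Python. The figures for `sanctioned` and `spent` are made-up illustrative values, and the function names are my own.

```python
from math import sqrt


def _centred_sums(xs, ys):
    """Sums of squares and cross-products around the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient; always lies between -1 and +1."""
    _, _, sxy, sxx, syy = _centred_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x, mean_y, sxy, sxx, _ = _centred_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical figures: amount sanctioned (x) and amount spent (y) for five schemes.
sanctioned = [10.0, 20.0, 30.0, 40.0, 50.0]
spent = [12.0, 19.0, 33.0, 38.0, 52.0]

r = pearson_r(sanctioned, spent)
slope, intercept = fit_line(sanctioned, spent)
predicted = intercept + slope * 60.0  # predicted spend for a sanction of 60

print(f"r = {r:.3f}, spent = {intercept:.2f} + {slope:.2f} * sanctioned")
print(f"predicted spend for 60: {predicted:.2f}")
```

A coefficient close to +1 or -1 indicates a strong linear association, but it does not by itself establish that one variable causes the other.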
Data Visualization
Data Visualization serves the following two distinct purposes:
• Exploratory Data Analysis (EDA): an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, EDA is undertaken to see what the data can tell us beyond the statistical analysis and modelling.
• Communication of findings/reporting: insights derived from data can be communicated to users such as higher management or the readers of audit reports.
The IA&AD Practitioner’s Guide for use of Data Visualization and Infographics should be referred to for principles of data visualization.
Data visualization aims at achieving one or more of the following objectives:
• Comprehensibility: makes information and relationships easily understandable.
• Comprehensiveness: presents features/information for the entire selected data
set/sample size as against selective reporting.
• Focused communication: facilitates concise and ‘to the point’ communication.
• Reducing complexity: simplifying the presentation of large amounts of data.
• Establishing patterns and relationships: enables identification of patterns and relationships in the data.
• Analysis: promotes thinking on ‘substance’ rather than on ‘methodology’. It focuses on the essence of the finding being communicated rather than on the procedure for communication.
Data Repository
Use of Data Analytics in Audit