
26-05-2023

Big Data Analytics
Vardha Teena,
AAO/e-HRMS,
RTI, HYDERABAD

Big Data is a term that refers to the data gathered by businesses and organizations which is stored digitally.


Big Data Management Policy for Indian Audit and Accounts Department
• Formulated in September 2015
• Developed a broad outline of a Data Analytic Framework for the department
• Creating the Centre for Data Management and Analytics was the first step towards establishing this framework
• The Centre for Data Management and Analytics (CDMA) is the nodal body for steering data analytic activities in IA&AD. CDMA provides guidance to the field offices on data analytics and pioneers research and development activities as well

Opportunities for Indian Audit and Accounts Department
• TECHNOLOGY EXPLOSION: Cost-effective tools, technology platforms and solutions are now available to handle and analyse big data.
• TRANSFORMATIONAL IMPACT FOR AUDIT: Big data analytics enhances risk assessment by discovering red flags, outliers and abnormal behaviour, and by providing deeper insights. It facilitates predictive analysis and the use of advanced statistics for transformation of data into actionable information. It thus contributes to a greater level of assurance in audits.
• AID TO GOVERNANCE: It enables the Comptroller and Auditor General to aid governance by providing insights to the executive for evidence-based decision making.


Policy Framework
In order to build on the specified opportunities, the policy framework
addresses the following issues:
1. Identification of data sources
2. Establishing Data Management Protocols
3. Digital auditing, Data analytics and Visualization strategy
4. Infrastructure, capacity building and change management.

1. Identification of Data Sources

Internal:
• Combined Finance and Revenue Accounts
• VLC database
• GPF and Pension data in A&E offices
• Data generated through the audit process
• Any other data available in the department

External:
• Audited entities' data, which includes financial and non-financial data of audited entities, programme-specific data including beneficiary databases, and other data pertaining to audited entities
• Third-party data
• Data published by Government and statutory authorities: Census data, NSSO data, data published by the various Ministries/Departments, data available on data.gov.in, reports of various commissions, and other reports and data pertaining to the Union Government/States
• Other data available in the public domain: surveys and information published by NGOs, industry-specific information published by CII, FICCI, NASSCOM etc., sector-specific information published by various organizations, social media etc.


2. Establishing Data Management Protocols


Data Management protocols have to ensure that data satisfies the
following characteristics:
Authenticity: Data is created through the process it claims.
Integrity: Data is complete, accurate and trustworthy.
Relevance: Data is appropriate and relevant for the identified purpose.
Usability: Data is readily accessible in a convenient manner.
Security: Data is secure and accessible only to authorised parties.
It would also address the following:
• data access arrangements including agreements with external sources
• data sensitivities associated with access and usage of various sources of data
• criteria for assessing veracity of data involving an assessment of strengths and weaknesses of various
sources and their application at various stages of audit (risk assessment, sample selection, benchmarking,
reporting).
• privacy and confidentiality issues covering procedures of aggregation and anonymisation
• compliance with legislative and regulatory requirements.

3. Digital auditing, Data analytics & Visualization strategy

Digital Auditing involves analysing 100% of transactions to detect anomalies, indicators of control deficiencies and emerging risks, enhancing efficiency and focusing on high-risk areas.

Data Analytics & Visualization is the process of integrating and synthesizing data to provide deep insights, discover patterns, predict and plan audits, and support audit analysis. It leverages the evidence-based approach and requires knowing what data is needed to answer questions.
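The idea of analysing 100% of transactions to surface anomalies can be sketched as a simple screening rule. This is an illustrative example only, not a prescribed IA&AD method; the payment figures and the z-score threshold are invented for demonstration.

```python
# Screen every transaction and flag ones that deviate sharply from the norm.
from statistics import mean, stdev

def flag_outliers(amounts, threshold=3.0):
    """Return indices of transactions whose amount deviates more than
    `threshold` standard deviations from the mean."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if sigma > 0 and abs(a - mu) / sigma > threshold]

payments = [1200, 1180, 1230, 1210, 1190, 98000]  # one abnormal payment
print(flag_outliers(payments, threshold=2.0))  # → [5]
```

In practice the flagged transactions would be taken up as potential red flags for deeper audit scrutiny rather than treated as conclusions in themselves.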


4. Infrastructure, capacity building and change management

• The Nodal Authority at the International Centre for Information Systems and Audit will be responsible for creating infrastructure, selecting analytical tools, facilitating data analytics, and sharing experiences and learnings from data usage.
• The Training wing at headquarters will train key officials,
• the Information Systems wing will upgrade technology, and
• the Professional Practices Group (PPG) will review the framework.

There are three dimensions of big data which are to be considered while designing a management framework for big data:

1. Characteristics
2. Process
3. Results


1. Characteristics of Big Data

2. Big Data Process Cycle

Data Identification → Data Collection → Data Restoration → Data Preparation → Data Analysis & Creation of Model → Model Deployment

Data Types

• Unstructured data: Text, Image, Video, Audio
• Structured data:
  – Categorical: Nominal, Ordinal
  – Numerical: Interval, Ratio

Based on the number of variables, a data set can be classified as univariate, bivariate or multivariate.


Data acquisition
• Since IA&AD is not the owner of several data sources required for data analytics, data availability would remain a challenge. Continuous persuasion and monitoring with the audited entities, taking support from relevant provisions of the CAG's Duties, Powers and Conditions of Service Act, 1971 and the Regulations on Audit and Accounts, 2007, will be the way to address this issue.

Access to data


Collection of data
• Data collection is the systematic approach of gathering and measuring
information from a variety of sources to get a complete and accurate
picture of an area of interest
• It involves identification and requisition of relevant data: this can be complete databases, selected tables out of the databases, selected data fields of tables in the databases, or data pertaining to specific criteria/conditions for a particular period, location, class etc.
• While collecting data, the authenticity, integrity, relevance, usability
and security of the data sets should be ensured.


Ownership of data
• The ownership of the data sets remains with the audited entity/third-party data sources, and IA&AD holds this data only in a fiduciary capacity.
• Once the data sets are obtained from the data sources, the HoDs should assume the ownership of the data sets and should exercise such controls on security and confidentiality of the data as envisaged for the data owner in the audited entity.

Data security
While handling data, the basic approach should be:
• to limit, to the bare necessity, the number of personnel with access to the raw data
• to establish a trail of personnel who have accessed data
• to maintain a complete and chronological record of all data shared between the data source owner and the auditor
• to ensure that computers which are used for data analytics are not connected to the internet
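The "trail of personnel who have accessed data" can be pictured as a minimal append-only access log. This is a sketch only; the user names, dataset name and record structure are hypothetical, and a real implementation would write to tamper-evident storage.

```python
# Minimal append-only access trail: every access is recorded, never edited.
from datetime import datetime, timezone

access_log = []

def record_access(user: str, dataset: str) -> None:
    """Append one immutable access entry with a UTC timestamp."""
    access_log.append({"user": user, "dataset": dataset,
                       "at": datetime.now(timezone.utc).isoformat()})

record_access("auditor1", "vlc_dump_2023")   # hypothetical user and dataset
record_access("auditor2", "vlc_dump_2023")
print([e["user"] for e in access_log])  # → ['auditor1', 'auditor2']
```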


Data reliability
• Data is said to be reliable when the data accurately captures the parameter it is
representing.
• Data reliability is a function of authenticity, integrity, relevance and usability of data.
• Data reliability can be affected because of the methods of generation /capture of data.
• Generally, if the manual and IT systems are operating in parallel, the chances of errors in data are higher. Similarly, an MIS involving manual data entry is likely to be less reliable than systems where MIS data is directly generated through an IT system.
• Reliability requirement is based on intended purpose : Consideration of data reliability
would be significantly higher for data sets planned for usage as audit evidence as compared
to data sets planned for drawing broad insights while planning.

Data preparation
• Data preparation is the process of organizing data for analytic
purposes.
• It involves various activities such as restoration, importing of data,
selection of database/ table/ record /field, joining datasets, appending
datasets, cleansing, aggregation and treatment of missing values,
invalid values, outliers etc
• Data preparation is a project specific phase.
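The preparation activities listed above (cleansing, treatment of missing and invalid values, removing duplicates) can be sketched with pandas. The column names and values here are invented for illustration, and the specific treatments (fill missing with zero, drop negatives) are project-specific choices, not a fixed rule.

```python
# Sketch of a data-preparation pass: deduplicate, fill missing, drop invalid.
import pandas as pd

raw = pd.DataFrame({
    "voucher_no": [101, 102, 102, 104],        # voucher 102 is duplicated
    "amount":     [500.0, None, None, -20.0],  # missing and invalid values
})

prepared = (raw
            .drop_duplicates(subset="voucher_no")               # remove duplicates
            .assign(amount=lambda d: d["amount"].fillna(0.0)))  # treat missing values
prepared = prepared[prepared["amount"] >= 0]                    # drop invalid negatives
print(len(prepared))  # → 2
```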


Data restoration
• The data from the data source should be copied and restored in the
auditor’s computer for further analysis.
• While using data in dump/backup format, it will be necessary to bring the data tables to their original format through a data restoration process.
• Before restoring a database backup/dump file, some basic information
such as database software version, operating system, database size is
required.

Importing into the analytical tool
• Merging and splitting data files
• Data cleaning: Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting or removing corrupt or inaccurate records from a record set, table, or database.
• It refers to identifying incomplete, incorrect, inaccurate or irrelevant
parts of the data and then replacing, modifying, or filtering out the
inaccurate or corrupt data.
• The process of data cleaning may involve removing typographical
errors or validating and correcting values against a known list of
entities or by cross checking with a validated data set.
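The step of validating and correcting values against a known list of entities can be sketched as follows. The state names, typo corrections and records are made up for illustration.

```python
# Validate records against a known list, correcting recognisable errors first.
valid_states = {"Telangana", "Andhra Pradesh", "Karnataka"}   # known entity list
corrections = {"Telanagana": "Telangana", "AP": "Andhra Pradesh"}  # known fixes

records = ["Telangana", "Telanagana", "AP", "Unknownland"]

cleaned, rejected = [], []
for value in records:
    value = corrections.get(value, value)   # correct known typos / short forms
    (cleaned if value in valid_states else rejected).append(value)

print(cleaned)   # → ['Telangana', 'Telangana', 'Andhra Pradesh']
print(rejected)  # → ['Unknownland']
```

Values that cannot be matched or corrected are set aside for manual review rather than silently dropped.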


Data enhancement
• Data enhancement is also a data cleansing process where data is made
more complete by adding related information.
• It involves activities such as harmonization of data and standardization of data.
• For example, appending the name of a bank to its bank code enhances the quality of data. Similarly, harmonization of short codes (st, rd, etc.) to actual words (street, road, etc.) could be done.
• Standardization of data is a means of changing a reference data set to a
new standard, e.g., use of standard codes.
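The harmonization example above (st → street, rd → road) can be sketched as a simple word-by-word replacement. The code table and addresses are illustrative only; real harmonization would need a richer mapping and context checks.

```python
# Harmonize short codes in address text to their full words.
SHORT_CODES = {"st": "street", "rd": "road", "ln": "lane"}  # illustrative table

def harmonize(address: str) -> str:
    """Replace known short codes with full words, token by token."""
    return " ".join(SHORT_CODES.get(tok.lower().strip("."), tok)
                    for tok in address.split())

print(harmonize("12 MG rd"))        # → '12 MG road'
print(harmonize("4 Tank Bund st"))  # → '4 Tank Bund street'
```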

Data Integration: Linking Multiple Databases
• Data integration is the process whereby the data collected from various data sources or different tables within the same data source are combined to obtain the final dataset for analysis.
• Data from different sources can be integrated based on any common field such as unique customer ID, bill number or village name etc.
• Understanding the metadata of different data sources will aid the process of data integration.
• Metadata is data about other data sets. It contains information on the data sets in a manner that makes them easier to identify.
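Joining two tables on a common field, as described above, can be sketched with a pandas merge. The two tables and their contents (village-wise beneficiary counts and payments) are invented for illustration.

```python
# Integrate two hypothetical tables on the common field "village".
import pandas as pd

beneficiaries = pd.DataFrame({"village": ["Alair", "Bhongir"],
                              "beneficiary_count": [120, 80]})
payments = pd.DataFrame({"village": ["Alair", "Bhongir"],
                         "amount_paid": [250000, 160000]})

# Inner join keeps only villages present in both sources.
combined = beneficiaries.merge(payments, on="village", how="inner")
print(combined.shape)  # → (2, 3)
```

An outer join would instead retain unmatched villages, which is itself a useful audit check for records present in one source but not the other.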


Data Analysis and Modelling
• Descriptive analytics
  – tries to answer "what has happened".
  – involves aggregation of individual transactions and thus provides meaning and context to the individual transactions in a larger perspective.
  – involves summarization of data through numerical or visual descriptions.
• Diagnostic analytics
  – tries to answer the question "why did it happen" or "how did it happen".
  – involves an understanding of the relationship between relatable data sets and identification of specific transactions/transaction sets along with their behaviour and underlying reasons.
  – Drill-down and statistical techniques like correlation assist in this endeavour.
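The descriptive and diagnostic steps above can be sketched in pandas: an aggregation answers "what has happened", and a correlation probes a relationship between two fields. The district labels and amounts are invented.

```python
# Descriptive and diagnostic analytics on a toy transactions table.
import pandas as pd

tx = pd.DataFrame({
    "district":   ["A", "A", "B", "B"],
    "sanctioned": [100, 200, 150, 250],
    "spent":      [90, 210, 140, 260],
})

# Descriptive: aggregate individual transactions into a district-level summary.
summary = tx.groupby("district")["spent"].sum()
print(summary.to_dict())  # → {'A': 300, 'B': 400}

# Diagnostic: how strongly do sanctioned and spent amounts move together?
corr = tx["sanctioned"].corr(tx["spent"])
print(round(corr, 3))
```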

Data Analysis and Modelling
• Predictive analytics
  – tries to predict "what will happen", "when will it happen", "where will it happen", based on past data.
  – Various forecasting and estimation techniques can be used to predict, to a certain extent, the future outcome of an activity.
• Prescriptive analytics
  – takes over from predictive analytics and allows the auditor to 'prescribe' a range of possible actions as inputs such that outputs in future can be altered to the desired solution.
  – Multiple future scenarios can be identified based on different input interventions.
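One of the simplest forecasting techniques alluded to above is fitting a linear trend to past data and extrapolating it one period ahead. The expenditure figures are illustrative, and a straight-line fit is only a sketch of the idea, not a recommended forecasting method for real audit data.

```python
# Predictive sketch: fit a linear trend to past expenditure, extrapolate 2023.
import numpy as np

years = np.array([2019, 2020, 2021, 2022])
expenditure = np.array([100.0, 110.0, 121.0, 130.0])  # illustrative figures

slope, intercept = np.polyfit(years, expenditure, 1)  # least-squares line
forecast_2023 = slope * 2023 + intercept
print(round(float(forecast_2023), 1))  # → 140.5
```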


Data analytic techniques
• Statistical techniques: use of statistical measures to derive insights about the dataset.
• Visualisation techniques: the use of visuals, graphs and charts to derive an understanding and insight into the dataset.
  – Zoom out – zoom in – filter approach: The data is first understood at a bird's eye view, followed by a drill-down to understand the data at a deeper level. Subsequently, a filter is applied or a query is run to extract results or exceptions, if necessary.

Statistical techniques
• Correlation is used to measure the strength of association between two variables and ranges from -1 to +1.
• Regression analysis gives a numerical explanation of how variables relate and enables prediction of the dependent variable (y) given the independent variable (x).
• Principal Component Analysis aims to reduce the number of inter-correlated variables to a smaller set which explains the overall variability.
• Factor Analysis aims to group together and summarise variables which are correlated, thereby enabling data reduction.
• Cluster analysis is a multivariate technique used to group individuals/variables based on common characteristics.
• The process of arranging data into homogeneous groups or classes according to some common characteristics present in the data is called classification.
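Two techniques from this list, correlation and Principal Component Analysis, can be sketched compactly with numpy. The data is synthetic (two strongly related variables), and the PCA here is the textbook eigendecomposition of the covariance matrix rather than any particular library's implementation.

```python
# Correlation and PCA on two synthetic, strongly inter-correlated variables.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.1, size=200)   # y is almost determined by x
data = np.column_stack([x, y])

# Correlation: strength of association, between -1 and +1.
r = np.corrcoef(x, y)[0, 1]

# PCA: eigendecomposition of the covariance matrix. With inter-correlated
# variables, the first component explains almost all the variability.
cov = np.cov(data, rowvar=False)
eigvals, _ = np.linalg.eigh(cov)
explained = eigvals.max() / eigvals.sum()
print(round(float(r), 3), round(float(explained), 3))
```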


Data Visualization
• Data visualization serves the following two distinct purposes:
  – Exploratory Data Analysis (EDA): an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, EDA is undertaken to see what the data can tell us beyond the statistical analysis and modelling.
  – Communication of findings/reporting: insights derived from data can be communicated to users such as higher management or the readers of audit reports.
• The IA&AD Practitioner's Guide for use of Data Visualization and Infographics should be referred to for principles of data visualization.

Data Visualization
It aims at achieving one or more of the following objectives:
• Comprehensibility: makes information and relationships easily understandable.
• Comprehensiveness: presents features/information for the entire selected data
set/sample size as against selective reporting.
• Focused communication: facilitates concise and ‘to the point’ communication.
• Reducing complexity: simplifying the presentation of large amounts of data.
• Establishing patterns and relationships: enables identification of patterns and
relationships in the data
• Analysis: promotes thinking on ‘substance’ rather than on ‘methodology’. It focuses
on the essence of the finding being communicated rather than on the procedure for
communication
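A minimal visualization along these lines, covering the comprehensiveness objective by charting the entire dataset rather than selected rows, can be sketched with matplotlib. District labels and amounts are illustrative, and the non-interactive Agg backend is used so the chart renders without a display.

```python
# Bar chart summarising the whole (toy) dataset, saved to a PNG file.
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt

districts = ["A", "B", "C", "D"]
spent = [300, 400, 250, 520]   # illustrative figures

fig, ax = plt.subplots()
ax.bar(districts, spent)
ax.set_xlabel("District")
ax.set_ylabel("Expenditure")
ax.set_title("Expenditure by district")
fig.savefig("expenditure.png")
```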


Data Analytic tools
• There are many powerful open source and proprietary software tools available for the purpose.
• Open source tools: KNIME (www.knime.org), R (www.r-project.org), Python (www.python.org), Weka, RapidMiner, SPAGO
• Proprietary tools: SAS, Tableau, MS Power BI, TIBCO Spotfire, Informatica, IBM Analytics, SPSS, D3.js, Qlik etc.

Data Analytic tools


• When adopting a new analytic tool, the HoD should consider the issues of
sustainability of the tool in terms of financial and human resources.
• The scalability of the tool also needs to be kept in mind apart from the
availability of the tool in future.
• HoD should also ensure that the audited entities’ dataset or any other
sensitive dataset does not get shared in the server/cloud environment of
the data analytic software with unauthorized persons /entities.
• By way of abundant caution, whenever usage of a new tool is being formalised in an office, approval for the same may be obtained from CDMA.


Results of Data Analytics
• The results can be in the form of:
  – Audit insights
  – Audit evidence
• Data analytic model: the set of analytic tests leading to analytic results, which can be used repetitively by updating/changing data.

Process flow of a Data Analytic Model


Data Analytic Model Process

Data Analytic Model
• Once a model has been prepared, it should be submitted to CDMA for review and approval.
• Data models could be developed on centralised or decentralised data sources:
  – Centralised data sources: If the data of the auditable entity/sources is centralised, i.e., is available through a central database, a model can be built directly on the restored database.
  – Decentralised data sources: If the data of the audited entity/sources is decentralised (i.e., data from each audited entity sub-unit is at different locations which are not connected seamlessly), then the model may be used at a sub-unit level.
• An important feature of the model is its reusability.


Documentation of Data Analytic Process
• All documentation should be signed by the auditor and countersigned by the supervising audit officer.
• Documentation of the data analytic work should include the following aspects:
  – Data identification
  – Data collection
  – Importing data into analytic software
  – Analytic technique used
  – Results of analysis
  – Data analytic model
  – Feedback from use in audit

Data Repository at Field Offices
• Data identification: All field offices should identify data sources available within their jurisdiction; this is a continuous process.
• Data mapping: Once the data sources have been identified, the data should be mapped on a sectoral basis.
• Data preparation
• Data updation
• Data storage
• Metadata: Proper metadata of the data sources, tables etc. needs to be maintained.
Data Analytic Groups in field offices will be primarily responsible for all the stages mentioned above in developing and maintaining the Data Repository.


Central Data Repository
• CDMA will establish a data repository for data which has applicability across multiple IA&AD offices.
• Continuity of data analytic activities in an office should be ensured by adhering to the business continuity management principles enunciated in the Information Systems Security Handbook for Indian Audit & Accounts Department (December 2003).

Data Repository


Use of Data Analytics in Audit

It doesn't matter how much data you have; it's whether you use it successfully that counts. Our goal is to turn data into information and information into insight which would aid Governance.
