
III B.TECH - I SEMESTER - CSE (R16)

DATA ANALYTICS (CS513PE)

UNIT - I

DATA MANAGEMENT

DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS


Data architecture is composed of models, policies, rules or standards that govern which
data is collected, and how it is stored, arranged, integrated, and put to use in data systems and
in organizations. Data architecture is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture.


Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.

• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such features as
data warehouses is also a common organizational requirement, since this enables managerial
decision making and other organizational processes. One of the architecture techniques is the
split between managing transaction data and (master) reference data. Another one is splitting
data capture systems from data retrieval systems (as done in a data warehouse).

• Technology drivers
These are usually suggested by the completed data architecture and database architecture
designs. In addition, some technology drivers will derive from existing organizational
integration frameworks and standards, organizational economics, and existing site resources
(e.g. previously purchased software licensing).

• Economics
These are also important factors that must be considered during the data architecture
phase. It is possible that some solutions, while optimal in principle, may not be viable
candidates due to their cost. External factors such as the business cycle, interest rates, market
conditions, and legal considerations could all have an effect on decisions relevant to data
architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental laws
that can vary by applicable agency. These policies and rules help describe the manner in
which the enterprise wishes to process its data.

• Data processing needs


These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data mining),
repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives
as required (e.g., annual budgets, new product development).

The general approach is based on designing the architecture at three levels of specification:
➢ The Logical Level
➢ The Physical Level
➢ The Implementation Level
Fig 1: Three-level architecture in data analytics.

The logical level, or user's view, of a data analytics system represents data in a format that
is meaningful to a user and to the programs that process those data. That is, the logical view tells
the user, in user terms, what is in the database. The logical level consists of data requirements and
process models, which are processed using data modeling techniques to produce a logical data
model.
The physical level is created when we translate the top-level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers, or the database administrator. The input to this level comes from the logical level, and
various data modeling techniques are used here with input from software developers or the
database administrator. These data modeling techniques are different formats of representing data,
such as the relational data model, network model, hierarchical model, object-oriented model, and
entity-relationship model.
The implementation level contains details about the modification and presentation of data
through the use of various data mining tools such as RStudio, WEKA, and Orange. Each tool has
its own specific features and its own way of presenting the same data. These tools are very helpful
to the user since they are user friendly and do not require much programming knowledge.
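
As a minimal illustration of the kind of data inspection such tools expose through their interfaces, here is a short Python sketch (the file name sales.csv is a hypothetical example):

import pandas as pd

# Load a dataset and take a first look at it, the kind of task that
# tools such as RStudio, WEKA, or Orange expose through their GUIs.
df = pd.read_csv("sales.csv")   # hypothetical sample file

print(df.head())       # first few rows: a quick view of the data
print(df.describe())   # summary statistics for the numeric columns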
UNDERSTAND VARIOUS SOURCES OF THE DATA
Data can be generated from two types of sources, namely primary and secondary.

Sources of Primary Data

The sources of generating primary data are:


➢ Observation Method
➢ Survey Method
➢ Experimental Method

Observation Method:

Fig 2: Data collection methods.


An observation is a data collection method by which you gather knowledge of the
researched phenomenon by making observations of the phenomenon as and when it occurs.
The main aim is to focus on observations of human behavior, the use of the phenomenon, and
human interactions related to the phenomenon. We can also make observations of verbal and
nonverbal expressions. In making and documenting observations, we need to clearly differentiate
our own observations from the observations provided to us by other people. The range of data
storage genres found in archives and collections is suitable for documenting observations, e.g.,
audio, visual, textual, and digital, including sub-genres such as note taking, audio recording, and
video recording.

There exist various observation practices, and our role as an observer may vary according
to the research approach. We make observations from either the outsider or insider point of view
in relation to the researched phenomenon, and the observation technique can be structured or
unstructured. The degree of the outsider or insider point of view can be seen as a movable point
on a continuum between the extremes of outsider and insider. If you decide to take the insider
point of view, you will be a participant observer in situ and actively participate in the observed
situation or community. The activity of a participant observer in situ is called field work. This
observation technique has traditionally belonged to the data collection methods of ethnology and
anthropology. If you decide to take the outsider point of view, you try to distance yourself
from your own cultural ties and observe the researched community as an outside observer.

Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used three experimental designs most frequently.
These are:
CRD - Completely Randomized Design
LSD - Latin Square Design
FD - Factorial Designs

CRD - Completely Randomized Design


A completely randomized design (CRD) is one where the treatments are assigned
completely at random, so that each experimental unit has the same chance of receiving any one
treatment. In the CRD, any difference among experimental units receiving the same
treatment is considered experimental error. Hence, CRD is appropriate only for experiments
with homogeneous experimental units, such as laboratory experiments, where environmental
effects are relatively easy to control. For field experiments, where there is generally large
variation among experimental plots in environmental factors such as soil, the CRD is rarely
used. In the agricultural field, CRD is mainly used for experiments under controlled conditions.
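
A minimal Python sketch of CRD assignment, assuming illustrative treatment labels and an equal number of units per treatment:

import random

# Completely randomized design: replicate each treatment equally,
# then shuffle so every unit has the same chance of any treatment.
treatments = ["A", "B", "C"]           # illustrative treatment labels
n_units = 12                           # a multiple of len(treatments)

assignment = treatments * (n_units // len(treatments))
random.shuffle(assignment)

for unit, treatment in enumerate(assignment, start=1):
    print(f"unit {unit:2d} -> treatment {treatment}")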
LSD - Latin Square Design
A Latin square is an experimental design with a balanced two-way classification
scheme, say, for example, a 4 x 4 arrangement. In this scheme each letter from A
to D occurs only once in each row and only once in each column. It may be noted that the
balanced arrangement is not disturbed if any row is interchanged with another.
A B C D
B C D A
C D A B
D A B C

The balanced arrangement achieved in a Latin square is its main strength. In this design,
comparisons among treatments are free from both row and column differences.
Thus the magnitude of error will be smaller than in designs that do not control for these two
sources of variation.
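
A short Python sketch that generates a Latin square of this cyclic form, matching the 4 x 4 example above:

# Generate an n x n Latin square by cyclically shifting the first row;
# each letter then appears exactly once per row and per column.
def latin_square(n):
    letters = [chr(ord("A") + i) for i in range(n)]
    return [[letters[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(4):
    print(" ".join(row))
# Output:
# A B C D
# B C D A
# C D A B
# D A B C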

FD - Factorial Designs
This design allows the experimenter to test two or more variables simultaneously. It also
measures the interaction effects of the variables and analyzes the impact of each variable.
In a true experiment, randomization is essential so that the experimenter can infer cause
and effect without any bias.
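
A minimal Python sketch of enumerating the runs of a factorial design; the factor names and levels are illustrative assumptions:

from itertools import product

# A 2 x 3 factorial design: every combination of factor levels is a run.
price = ["low", "high"]                 # illustrative factor 1
packaging = ["red", "green", "blue"]    # illustrative factor 2

for run, (p, pack) in enumerate(product(price, packaging), start=1):
    print(f"run {run}: price={p}, packaging={pack}")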

Sources of Secondary Data


While primary data can be collected through questionnaires, depth interviews, focus group
interviews, case studies, experimentation, and observation, secondary data can be obtained
through:
➢ Internal Sources
➢ External Sources

Internal Sources of Data
If available, internal secondary data may be obtained with less time, effort, and money than
external secondary data. In addition, they may also be more pertinent to the situation at hand
since they come from within the organization.

The internal sources include


Accounting resources - These give a great deal of information which can be used by the marketing
researcher. They give information about internal factors.
Sales force reports - These give information about the sale of a product. The information provided
comes from outside the organization, i.e., from the field.
Internal experts - These are the people heading the various departments. They can give an
idea of how a particular thing is working.
Miscellaneous reports - This is the information obtained from operational reports.
If the data available within the organization are unsuitable or inadequate, the marketer should
extend the search to external secondary data sources.

External Sources of Data


External Sources are sources which are outside the company in a larger environment.
Collection of external data is more difficult because the data have much greater variety and the
sources are much more numerous.
External data can be divided into the following classes:
Government Publications
Planning Commission
Reserve Bank of India
Labour Bureau
Department of Economic Affairs
Non-Government Publications

Government Publications-
Government sources provide an extremely rich pool of data for researchers. In
addition, many of these data are available free of cost on internet websites. There are a number
of government agencies generating data.
These are:
Registrar General of India-
It is an office which generates demographic data. It includes details of gender, age,
occupation etc.
Central Statistical Organization-
This organization publishes the national accounts statistics. These contain estimates of
national income for several years, growth rates, and the rates of major economic activities. The
Annual Survey of Industries is also published by the CSO. It gives information about the total
number of workers employed, production units, materials used, and value added by the
manufacturer.
Director General of Commercial Intelligence-
This office operates from Kolkata. It gives information about foreign trade i.e. import
and export. These figures are provided region-wise and country-wise.
Ministry of Commerce and Industries-
This ministry, through the office of the economic advisor, provides information on the
wholesale price index. These indices may relate to a number of sectors like food, fuel,
power, food grains, etc. It also generates the All India Consumer Price Index numbers for
industrial workers, urban non-manual employees, and agricultural labourers.
Planning Commission-
It provides the basic statistics of Indian Economy.
Reserve Bank of India-
This provides information on Banking Savings and investment. RBI also prepares
currency and finance reports.
Labour Bureau-
It provides information on skilled, unskilled, white collared jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
Department of Economic Affairs-
It conducts economic survey and it also generates information on income, consumption,
expenditure, investment, savings and foreign trade.
State Statistical Abstract-
This gives information on various types of activities related to the state like -
commercial activities, education, occupation etc.
Non-Government Publications-
These include publications of various industrial and trade associations, such as:
The Indian Cotton Mill Association
Various chambers of commerce
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability figures, and other relevant matter)
Various associations of press media
Export Promotion Council
Confederation of Indian Industries (CII)
Small Industries Development Board of India

Different mills, like woollen mills, textile mills, etc.
The only disadvantage of the above sources is that the data may be biased, as these bodies
are likely to gloss over their negative points.

Syndicate Services-

These services are provided by certain organizations which collect and tabulate
marketing information on a regular basis for a number of clients who subscribe to these
services. The services are designed in such a way that the information suits the subscriber.
These services are useful for studying television viewing, the movement of consumer goods,
etc. These syndicate services provide data from both households and institutions.

In collecting data from households they use three approaches:
Survey - They conduct surveys regarding lifestyle, sociographics, and general topics.
Mail Diary Panel - It may relate to two fields: purchase and media.
Electronic Scanner Services - These are used to generate data on volume.

They collect data for institutions from:
Wholesalers
Retailers
Industrial firms

Various syndicate services are the Operations Research Group (ORG) and the Indian Market
Research Bureau (IMRB).

Importance of Syndicate Services

Syndicate services are becoming popular since the constraints of decision making are changing
and more specific decision making is needed in the light of a changing environment. Also,
syndicate services are able to provide information to industries at a low unit cost.

Disadvantages of Syndicate Services

The information provided is not exclusive. A number of research agencies provide customized
services which suit the requirements of each individual organization.

International Organizations- These include:

The International Labor Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages, and consumer prices.

The Organization for Economic Co-operation and development (OECD) - It publishes data
on foreign trade, industry, food, transport, and science and technology.
The International Monetary Fund (IMF) - It publishes reports on national and international
foreign exchange regulations.

Comparison of sources of data

Based on various features (cost, data, process, source, time, etc.), the sources of data
can be compared as per Table 1.

Table 1: Difference between primary data and secondary data.

Comparison Feature        Primary data                        Secondary data
Meaning                   Data collected by the researcher    Data collected by other people
Data                      Real-time data                      Past data
Process                   Very involved                       Quick and easy
Source                    Surveys, interviews, experiments,   Books, journals, publications,
                          questionnaires, etc.                etc.
Cost effectiveness        Expensive                           Economical
Collection time           Long                                Short
Specificity               Specific to the researcher's need   May not be specific to the
                                                              researcher's need
Available in              Crude form                          Refined form
Accuracy and reliability  More                                Less
Understanding Sources of Data from Sensor
Sensor data is the output of a device that detects and responds to some type of input from
the physical environment. The output may be used to provide information or input to another
system or to guide a process.
Examples are as follows
• A photo sensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
• Lidar, a laser-based method of detection, range finding and mapping, typically uses a low-
power, eye-safe pulsing laser working in conjunction with a camera.
• A charge-coupled device (CCD) stores and displays the data for an image in such a way that
each pixel is converted into an electrical charge, the intensity of which is related to a color in
the color spectrum.
• Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms (see the sketch after this list).
• Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, power line voltage, chemical
concentrations, pollutant levels and vital body functions.
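
As a small illustration of sensor data feeding a downstream system, here is a Python sketch of a smart grid load stream triggering an alarm; the timestamps, readings, and threshold are illustrative assumptions:

# A stream of (time, load %) readings from a hypothetical smart grid sensor.
readings = [("10:00", 72.5), ("10:01", 74.0), ("10:02", 98.3)]

LOAD_ALARM = 90.0   # assumed alarm threshold, in percent of capacity
for timestamp, load in readings:
    if load > LOAD_ALARM:
        print(f"{timestamp}: ALARM - load at {load}%")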

Understanding Sources of Data from Signal


The simplest form of signal is a direct current (DC) that is switched on and off; this is the
principle by which the early telegraph worked. More complex signals consist of an alternating-
current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a
network. Data and signals can be either analog or digital. A signal is periodic if it consists of a
continuously repeating pattern.
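
A brief numpy sketch of the periodicity idea, assuming illustrative frequencies and an illustrative sample rate:

import numpy as np

# A 5 Hz sine carrier sampled at 1 kHz for one second; the pattern
# repeats every 0.2 s, so the signal is periodic.
sample_rate = 1000                       # samples per second (assumed)
t = np.arange(0, 1, 1 / sample_rate)
carrier = np.sin(2 * np.pi * 5 * t)      # analog-style AC carrier

# A crude two-level digital version of the same signal.
digital = (carrier >= 0).astype(int)
print(carrier[:5])
print(digital[:5])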
Understanding Sources of Data from GPS
The Global Positioning System (GPS) is a space-based navigation system that provides
location and time information in all weather conditions, anywhere on or near the Earth where
there is an unobstructed line of sight to four or more GPS satellites. The system provides critical
capabilities to military, civil, and commercial users around the world. The United States
government created the system, maintains it, and makes it freely accessible to anyone with a
GPS receiver.
DATA MANAGEMENT
Data management is the development and execution of architectures, policies, practices
and procedures in order to manage the information lifecycle needs of an enterprise in an effective
manner.

DATA QUALITY
Data quality refers to the state of qualitative or quantitative pieces of information. There
are many definitions of data quality, but data is generally considered high quality if it is "fit
for [its] intended uses in operations, decision making and planning".

The seven characteristics that define data quality are:

1. Accuracy and Precision


2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision:


This characteristic refers to the exactness of the data. It cannot have any erroneous
elements and must convey the correct message without being misleading. This accuracy and
precision have a component that relates to its intended use. Without understanding how the data
will be consumed, ensuring accuracy and precision could be off-target or more costly than
necessary. For example, accuracy in healthcare might be more important than in another industry
(which is to say, inaccurate data in healthcare could have more serious consequences) and,
therefore, justifiably worth higher levels of investment.

Legitimacy and Validity:


Requirements governing data set the boundaries of this characteristic. For example, on
surveys, items such as gender, ethnicity, and nationality are typically limited to a set of options
and open answers are not permitted. Any answers other than these would not be considered valid
or legitimate based on the survey’s requirement. This is the case for most data and must be
carefully considered when determining its quality. The people in each department in an
organization understand what data is valid or not to them, so the requirements must be leveraged
when evaluating data quality.
Reliability and Consistency:
Many systems in today’s environments use and/or collect the same source data.
Regardless of what source collected the data or where it resides, it cannot contradict a value
residing in a different source or collected by a different system. There must be a stable and
steady mechanism that collects and stores the data without contradiction or unwarranted
variance.
Timeliness and Relevance:
There must be a valid reason to collect the data to justify the effort required, which also
means it has to be collected at the right moment in time. Data collected too soon or too late could
misrepresent a situation and drive inaccurate decisions.
Completeness and Comprehensiveness:
Incomplete data is as dangerous as inaccurate data. Gaps in data collection lead to a
partial view of the overall picture to be displayed. Without a complete picture of how operations
are running, uninformed actions will occur. It’s important to understand the complete set of
requirements that constitute a comprehensive set of data to determine whether or not the
requirements are being fulfilled.
Availability and Accessibility:
This characteristic can be tricky at times due to legal and regulatory constraints.
Regardless of the challenge, though, individuals need the right level of access to the data in order
to perform their jobs. This presumes that the data exists and is available for access to be granted.
Granularity and Uniqueness:
The level of detail at which data is collected is important, because confusion and
inaccurate decisions can otherwise occur. Aggregated, summarized and manipulated collections
of data could offer a different meaning than the data implied at a lower level. An appropriate
level of granularity must be defined to provide sufficient uniqueness and distinctive properties to
become visible. This is a requirement for operations to function effectively.
NOISY DATA
Noisy data is meaningless data. The term has often been used as a synonym for corrupt
data. However, its meaning has expanded to include any data that cannot be understood and
interpreted correctly by machines, such as unstructured text.

OUTLIER
An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus
process) to decide what will be considered abnormal.
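
A minimal numpy sketch of one common convention for flagging such observations, the 1.5 x IQR rule used by box plots; the data values are illustrative:

import numpy as np

# Flag values lying outside the Tukey fences as outliers; where the
# fences sit is ultimately the analyst's decision.
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])   # illustrative sample

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)   # [25.] lies far outside the fences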

MISSING DATA
In statistics, missing data, or missing values, occur when no data value is stored for a
variable in an observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data. Missing values can be handled by
the following techniques (a small imputation sketch follows the list):
• Ignore the record with missing values.
• Replace the missing value with a constant.
• Fill in the missing value manually based on domain knowledge.
• Replace missing values with the mean (if the data is numeric) or the most frequent
value (if the data is categorical).
• Use modeling techniques such as decision trees, Bayesian methods, nearest
neighbor algorithms, etc.
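
A minimal pandas sketch of mean and mode imputation; the column names and values are illustrative assumptions:

import pandas as pd

# A toy table with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age":  [25, None, 31, 40, None],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

df["age"] = df["age"].fillna(df["age"].mean())         # numeric: mean
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: most frequent
# df.dropna() would instead ignore the records with missing values.
print(df)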

DATA DEDUPLICATION
In computing, data deduplication is a specialized data compression technique for
eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are
intelligent (data) compression and single-instance (data) storage.
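
A small Python sketch of the single-instance idea: identical chunks are stored once and referenced by a content hash (the chunking scheme and data are illustrative assumptions):

import hashlib

# Store each distinct chunk once; every original chunk becomes a
# hash reference into the store.
def deduplicate(chunks):
    store = {}        # digest -> chunk, kept once
    references = []   # one digest per original chunk
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        references.append(digest)
    return store, references

chunks = [b"hello", b"world", b"hello", b"hello"]
store, refs = deduplicate(chunks)
print(len(chunks), "chunks stored as", len(store), "unique blocks")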
DATA PREPROCESSING:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Data goes through a series of steps during preprocessing:

• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.

• Data Discretization: This involves reducing the number of values of a continuous attribute
by dividing its range into intervals, as sketched below.
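
A minimal pandas sketch of two of these steps, transformation (min-max normalization) and discretization (binning); the column and bin labels are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({"income": [12000, 25000, 47000, 83000, 150000]})

# Transformation: min-max normalization to the [0, 1] range.
rng = df["income"].max() - df["income"].min()
df["income_norm"] = (df["income"] - df["income"].min()) / rng

# Discretization: divide the continuous range into three intervals.
df["income_band"] = pd.cut(df["income"], bins=3,
                           labels=["low", "medium", "high"])
print(df)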
