TECH-I SEMESTER-CSE-(R16)
UNIT - I
DATA MANAGEMENT
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such features as
data warehouses is also a common organizational requirement, since this enables managerial
decision making and other organizational processes. One of the architecture techniques is the
split between managing transaction data and (master) reference data. Another one is splitting
data capture systems from data retrieval systems (as done in a data warehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database architecture
designs. In addition, some technology drivers will derive from existing organizational
integration frameworks and standards, organizational economics, and existing site resources
(e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture
phase. It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates, market
conditions, and legal considerations could all have an effect on decisions relevant to data
architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental laws
that can vary by applicable agency. These policies and rules help describe the manner in
which the enterprise wishes to process its data.
The General Approach is based on designing the Architecture at three Levels of Specification:
➢ The Logical Level
➢ The Physical Level
➢ The Implementation Level
Fig 1: Three-level architecture in data analytics.
The logical level, or user's view, of a data analytics system represents data in a format that
is meaningful to a user and to the programs that process those data. That is, the logical view tells
the user, in user terms, what is in the database. The logical level consists of data requirements and
process models, which are processed using data modeling techniques to produce a logical data
model.
The physical level is created when we translate the top-level design into physical tables in the
database. This model is created by the database architect, software architects, software
developers, or the database administrator. The input to this level comes from the logical level,
and various data modeling techniques are used here with input from software developers or the
database administrator. These data modeling techniques are different formats for representing
data, such as the relational model, network model, hierarchical model, object-oriented model,
and entity-relationship model.
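As a small illustration (not part of the original notes), the sketch below assumes a hypothetical Customer entity from a logical data model and shows how it might be translated into a physical table using Python's built-in sqlite3 module; the table and column names are illustrative only.

import sqlite3

# Hypothetical logical entity: Customer(customer_id, name, email).
# Physical level: translate the entity into a concrete table with
# storage-specific details such as data types and a primary key.
conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- surrogate key chosen at the physical level
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    )
    """
)
conn.execute(
    "INSERT INTO customer (name, email) VALUES (?, ?)",
    ("Asha", "asha@example.com"),
)
print(conn.execute("SELECT * FROM customer").fetchall())
conn.close()

At this level, storage-specific decisions such as data types, keys, and constraints are made explicit, which the logical model leaves open.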
The implementation level contains details about the modification and presentation of data
through the use of various data mining tools such as R-Studio, WEKA, and Orange. Each tool
has its own specific features and its own way of representing and viewing the same data. These
tools are very helpful to users since they are user friendly and do not require much programming
knowledge.
UNDERSTAND VARIOUS SOURCES OF THE DATA
Data can be generated from two types of sources, namely primary and secondary.
Sources of Primary Data:
Observation Method:
There exist various observation practices, and our role as an observer may vary according
to the research approach. We make observations from either the outsider or the insider point of
view in relation to the researched phenomenon, and the observation technique can be structured
or unstructured. The degree of the outsider or insider point of view can be seen as a movable
point on a continuum between the extremes of outsider and insider. If you decide to take the
insider point of view, you will be a participant observer in situ and actively participate in the
observed situation or community. The activity of a participant observer in situ is called
fieldwork. This observation technique has traditionally belonged to the data collection methods
of ethnology and anthropology. If you decide to take the outsider point of view, you try to
distance yourself from your own cultural ties and observe the researched community as an
outside observer.
Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used three experimental designs most frequently.
These are:
CRD - Completely Randomized Design
LSD - Latin Square Design
FD - Factorial Designs
LSD - Latin Square Design
The balanced arrangement achieved in a Latin square is its main strength. In this design,
the comparisons among treatments will be free from both differences between rows and columns.
Thus the magnitude of error will be smaller than in any other design.
FD - Factorial Designs
This design allows the experimenter to test two or more variables simultaneously. It also
measures interaction effects of the variables and analyzes the impacts of each of the variables.
In a true experiment, randomization is essential so that the experimenter can infer cause
and effect without any bias.
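For illustration only (this example is not from the notes), the following sketch builds a simple 2 x 2 factorial design and randomly assigns hypothetical subjects to the treatment combinations using Python's standard library; the factor names and the number of subjects are assumptions.

import itertools
import random

# Hypothetical factors for a 2 x 2 factorial design.
factors = {
    "price": ["low", "high"],
    "packaging": ["plain", "premium"],
}

# All treatment combinations (the full factorial).
treatments = list(itertools.product(*factors.values()))
print("Treatment combinations:", treatments)

# Randomly assign 8 hypothetical subjects to treatments,
# reflecting the randomization needed to infer cause and effect.
subjects = [f"subject_{i}" for i in range(1, 9)]
random.shuffle(subjects)
assignment = {s: treatments[i % len(treatments)] for i, s in enumerate(subjects)}
for subject, treatment in assignment.items():
    print(subject, "->", dict(zip(factors.keys(), treatment)))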
Government Publications-
Government sources provide an extremely rich pool of data for the researchers. In
addition, many of these data are available free of cost on internet websites. There are a number of
government agencies generating data.
These are:
Registrar General of India-
It is an office which generates demographic data. It includes details of gender, age,
occupation etc.
Central Statistical Organization-
This organization publishes the national accounts statistics. It contains estimates of
national income for several years, growth rates, and rates of major economic activities. The
Annual Survey of Industries is also published by the CSO. It gives information about the total
number of workers employed, production units, materials used, and value added by
manufacturers.
Director General of Commercial Intelligence-
This office operates from Kolkata. It gives information about foreign trade, i.e., imports
and exports. These figures are provided region-wise and country-wise.
Ministry of Commerce and Industries-
This ministry, through the office of the Economic Adviser, provides information on the
wholesale price index. These indices may be related to a number of sectors like food, fuel,
power, food grains, etc. It also generates All India Consumer Price Index numbers for industrial
workers, urban non-manual employees, and agricultural labourers.
Planning Commission-
It provides the basic statistics of Indian Economy.
Reserve Bank of India-
This provides information on banking, savings, and investment. The RBI also prepares
currency and finance reports.
Labour Bureau-
It provides information on skilled, unskilled, white-collar jobs, etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
Department of Economic Affairs-
It conducts the economic survey and also generates information on income, consumption,
expenditure, investment, savings, and foreign trade.
State Statistical Abstract-
This gives information on various types of activities related to the state like -
commercial activities, education, occupation etc.
Non-Government Publications-
These include publications of various industrial and trade associations, such as:
The Indian Cotton Mill Association
Various chambers of commerce
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media
Export Promotion Council
Confederation of Indian Industries (CII)
Small Industries Development Board of India
Syndicate Services-
These services are provided by certain organizations which collect and tabulate marketing
information on a regular basis for a number of clients who subscribe to these services. The
services are therefore designed in such a way that the information suits the subscriber. These
services are useful in areas such as television viewing and the movement of consumer goods.
These syndicate services provide information from both households as well as institutions.
In collecting data from households they use three approaches:
Survey - They conduct surveys regarding lifestyle, sociographics, and general topics.
Mail Diary Panel - It may be related to two fields: purchase and media.
Electronic Scanner Services - These are used to generate data on volume.
They collect data for institutions from wholesalers, retailers, and industrial firms.
Various syndicate services are the Operations Research Group (ORG) and the Indian Marketing
Research Bureau (IMRB).
Syndicate services are becoming popular since the constraints of decision making are changing
and more specific decision making is needed in the light of a changing environment. Also,
syndicate services are able to provide information to industries at a low unit cost.
The information provided is not exclusive; a number of research agencies provide customized
services which suit the requirements of each individual organization.
The International Labour Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages, and consumer prices.
The Organization for Economic Co-operation and Development (OECD)- It publishes data
on foreign trade, industry, food, transport, and science and technology.
The International Monetary Fund (IMF)- It publishes reports on national and international
foreign exchange regulations.
Based on various features (cost, data, process, source, time, etc.), the various sources of data
can be compared as per Table 1.
DATA QUALITY
Data quality refers to the state of qualitative or quantitative pieces of information. There
are many definitions of data quality, but data is generally considered high quality if it is "fit for
[its] intended uses in operations, decision making and planning".
OUTLIER
An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus
process) to decide what will be considered abnormal.
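As a minimal sketch (added here for illustration, not taken from the notes), the code below flags values lying an abnormal distance from the rest of a small made-up sample using the common 1.5 x IQR rule of thumb; both the data and the cutoff are choices left to the analyst.

import statistics

sample = [10, 12, 11, 13, 12, 11, 95]  # made-up data; 95 is the suspect value

q1, q2, q3 = statistics.quantiles(sample, n=4)  # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Any value outside the [lower, upper] band is treated as an outlier.
outliers = [x for x in sample if x < lower or x > upper]
print("Bounds:", (lower, upper))
print("Outliers:", outliers)  # -> [95]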
MISSING DATA
In statistics, missing data, or missing values, occur when no data value is stored for the
variable in an observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data. Missing values can be replaced using
the following techniques (a short sketch follows the list):
• Ignore the record with missing values.
• Replace the missing value with a constant.
• Fill in the missing value manually based on domain knowledge.
• Replace them with the mean (if the data is numeric) or the most frequent value (if the
data is categorical).
• Use modeling techniques such as decision trees, Bayes' algorithm, the nearest
neighbor algorithm, etc.
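The sketch below is an added illustration (assuming the pandas library and a made-up table); it shows two of the simpler replacement strategies from the list above: mean imputation for a numeric column and most-frequent-value imputation for a categorical one.

import pandas as pd

# Made-up records with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25, None, 31, 29, None],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Numeric column: replace missing values with the mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: replace missing values with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)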
DATA DUPLICATION
In computing, data deduplication is a specialized data compression technique for
eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are
intelligent (data) compression and single-instance (data) storage.
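As an illustrative sketch (not from the notes, assuming pandas is available), the code below removes exact duplicate records from a made-up table so that only a single instance of each is stored.

import pandas as pd

# Made-up records containing an exact duplicate row.
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
})

# Keep a single instance of each repeated record.
deduplicated = records.drop_duplicates()
print(deduplicated)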
DATA PROCESSING & PREPROCESSING:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Data goes through a series of steps during preprocessing (a small end-to-end sketch follows the list):
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
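To make the four steps above concrete, here is a small end-to-end sketch (added for illustration, assuming pandas and a made-up sales table); each commented block corresponds to one of the steps listed above.

import pandas as pd

# Made-up raw data from two hypothetical sources.
sales_a = pd.DataFrame({"item": ["pen", "book", None], "price": [10.0, 250.0, 15.0]})
sales_b = pd.DataFrame({"item": ["pen", "pencil"], "price": [10.0, 5.0]})

# Data Cleaning: fill in missing values.
sales_a["item"] = sales_a["item"].fillna("unknown")

# Data Integration: put data from different sources together and resolve duplicates.
sales = pd.concat([sales_a, sales_b], ignore_index=True).drop_duplicates()

# Data Transformation: normalize the price column to the range [0, 1].
sales["price_norm"] = (sales["price"] - sales["price"].min()) / (
    sales["price"].max() - sales["price"].min()
)

# Data Reduction: keep a reduced, aggregated representation.
reduced = sales.groupby("item", as_index=False)["price"].mean()
print(reduced)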