MODULE-1
Data Management
Design Data Architecture
Data architecture is the process of standardizing how organizations collect, store, transform,
distribute, and use data. The goal is to deliver relevant data to people who need it, when they
need it, and help them make sense of it. Data architecture design is a set of standards composed
of certain policies, rules and models.
Data is usually one of several architecture domains that form the pillars of an enterprise
architecture or solution architecture. Data architecture is divided into three essential models:
• Conceptual model
• Logical model
• Physical model
• Conceptual model:
It is a business-level model which uses the Entity Relationship (ER) model to represent the
relationships between entities and their attributes.
• Logical model:
It is a model where problems are represented in the form of logic, such as rows and columns of
data, classes, XML tags and other DBMS techniques.
• Physical model:
Physical models hold the database design, such as which type of database technology will be
suitable for the architecture.
The factors that influence data architecture design include:
• Business requirements
• Business policies
• Technology in use
• Business economics
Business requirements –
These include factors such as the expansion of the business, system access performance,
data management, transaction management, and making use of raw data by converting it into
image files and records and then storing it in data warehouses. Data warehouses are the main
means of storing business transactions.
Business policies –
The policies are rules that describe the way data is processed. These policies are made by
internal organizational bodies and by government agencies.
Technology in use –
This includes using examples of previously completed data architecture designs, as well as
existing licensed software purchases and database technology.
Business economics –
Economic factors such as business growth and loss, interest rates, loans, market conditions,
and the overall cost will also have an effect on the design of the architecture.
Data management
• Data management is an administrative process that includes acquiring, validating, storing,
protecting, and processing required data to ensure the accessibility, reliability, and
timeliness of the data for its users.
• Data management is the practice of managing data as a valuable resource to unlock its
potential for an organization.
• Managing data effectively requires having a data strategy and reliable methods to access,
integrate, cleanse, govern, store and prepare data for analytics.
• In our digital world, data pours into organizations from many sources – operational and
transactional systems, scanners, sensors, smart devices, social media, video and text.
• But the value of data is not based on its source, quality or format alone; its value depends
on what is done with it.
• The information is stored in computer files. When files are properly arranged and
maintained, users can easily access and retrieve the information when they need it.
• If the files are not properly managed, they can lead to chaos in information processing.
• Even if the hardware and software are excellent, the information system can be very
inefficient because of poor file management.
• Data Modeling: This is first creating a structure for the data that you collect and use, and
then organizing this data in a way that is easily accessible and efficient to store and retrieve
for reports and analysis.
• Data warehousing: This is storing data effectively so that it can be accessed and used
efficiently in the future.
• Data Movement: is the ability to move data from one place to another. For instance, data
needs to be moved from where it is collected to a database and then to an end user.
• Methods of data collection are essential for anyone who wishes to collect data.
• Data collection is a fundamental aspect and, as a result, there are different methods of
collecting data which, when used on one particular data set, will result in different kinds
of data.
• Collection of data refers to the purposeful gathering of information relevant to the subject
matter of the study from the units under investigation.
• The method of data collection depends mainly on the nature, purpose and scope of the
inquiry on one hand, and on the availability of resources and time on the other.
Survey:
The survey method is one of the primary sources of data; it is used to collect quantitative
information about the items in a population.
• Surveys are used in many different areas for collecting data, in both the public and private
sectors.
• This method takes a lot of time, effort and money, but the data collected are of high
accuracy, current and relevant to the topic.
• When the questions are administered by a researcher, the survey is called a structured
interview or a researcher-administered survey.
Observations:
Observation is another primary source of data, in which information is gathered by directly
observing the subject under study rather than by questioning respondents.
Interview:
Interviewing is a technique that is primarily used to gain an understanding of the
underlying reasons and motivations for people’s attitudes, preferences or behavior.
• However, market researchers have most frequently used a small number of experimental
designs, among them the Randomized Block Design (RBD), the Latin Square Design (LSD)
and the Factorial Design (FD).
RBD - Randomized Block Design
The term Randomized Block Design originated in agricultural research.
• In this design several treatments of variables are applied to different blocks of land to
ascertain their effect on the yield of the crop.
• Blocks are formed in such a manner that each block contains as many plots as the number
of treatments, so that one plot from each block is selected at random for each treatment.
• These data are then interpreted and inferences are drawn by using the Analysis of
Variance technique, so as to know the effect of various treatments like different doses of
fertilizers, different types of irrigation etc.
LSD - Latin Square Design
• A Latin square is an experimental design with a balanced two-way classification scheme,
for example a 4 x 4 arrangement.
• In this scheme each letter from A to D occurs only once in each row and also only once
in each column.
• It may be noted that this balanced arrangement is not disturbed if any row is interchanged
with another.
ABCD
BCDA
CDAB
DABC
• In this design, the comparisons among treatments will be free from both differences
between rows and columns.
• Thus the magnitude of error will be smaller than in any other design.
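The balanced arrangement can also be checked programmatically. Below is a minimal Python sketch (not part of the original notes) that builds the 4 x 4 Latin square shown above by cyclically shifting the treatment list and verifies that every treatment occurs exactly once in each row and each column.

```python
# Minimal sketch: build the 4 x 4 Latin square shown above by cyclic shifts
# and verify the balanced two-way classification.

treatments = ["A", "B", "C", "D"]
n = len(treatments)

# Row i is the treatment list rotated left by i positions: ABCD, BCDA, CDAB, DABC.
square = [treatments[i:] + treatments[:i] for i in range(n)]

for row in square:
    print(" ".join(row))

# Each letter must occur exactly once in every row and in every column.
rows_ok = all(len(set(row)) == n for row in square)
cols_ok = all(len({square[r][c] for r in range(n)}) == n for c in range(n))
print("Balanced Latin square:", rows_ok and cols_ok)
```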
FD - Factorial Designs
This design allows the experimenter to test two or more variables simultaneously.
• It also measures interaction effects of the variables and analyzes the impacts of each of
the variables.
• In a true experiment, randomization is essential so that the experimenter can infer cause
and effect without any bias.
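As an illustration of these ideas, here is a minimal Python sketch (the factors "price" and "packaging", their levels and the store names are hypothetical, not from the notes) of a 2 x 2 factorial design in which two variables are tested simultaneously and the treatment combinations are assigned to experimental units at random.

```python
# Minimal sketch of a 2 x 2 factorial design with random assignment.
import itertools
import random

price_levels = ["low", "high"]            # factor 1 (hypothetical levels)
packaging_levels = ["plain", "premium"]   # factor 2 (hypothetical levels)

# All factor-level combinations: 2 x 2 = 4 treatments.
treatments = list(itertools.product(price_levels, packaging_levels))

# Randomly assign 8 stores to the 4 treatments (two replicates per treatment),
# so that cause and effect can be inferred without assignment bias.
stores = [f"store_{i}" for i in range(1, 9)]
assignments = treatments * 2
random.shuffle(assignments)

for store, (price, packaging) in zip(stores, assignments):
    print(store, "-> price:", price, "| packaging:", packaging)
```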
Secondary Data:
• Secondary data are data that were collected by a party not related to the current research
study, for some other purpose and at a different time in the past.
• If the researcher uses these data, they become secondary data for the current user.
• Sources of secondary data include government publications, websites, books, journal
articles and internal records.
Internal Sources:
• If available, internal secondary data may be obtained with less time, effort and money
than the external secondary data.
• In addition, they may also be more pertinent to the situation at hand since they are from
within the organization.
• Accounting resources- These provide a great deal of information that can be used by the
marketing researcher; they give information about internal factors.
• Sales Force Report- It gives information about the sale of a product. The information
provided comes from outside the organization.
• Internal Experts- These are the people heading the various departments. They can
give an idea of how a particular thing is working.
• Miscellaneous Reports- These cover the information obtained from operational reports.
External Sources:
• If the data available within the organization are unsuitable or inadequate, the marketer
should extend the search to external secondary data sources.
• Collection of external data is more difficult because the data have much greater variety
and the sources are much more numerous.
• In addition, many of these data are available free of cost on internet websites.
Government Publications-
• These contain estimates of national income for several years, growth rates, and the rates
of major economic activities.
• They also give information about the total number of workers employed, production
units, materials used and the value added by the manufacturer.
• The Ministry of Commerce and Industry, through the Office of the Economic Adviser,
provides information on the wholesale price index. These indices may relate to a number
of sectors like food, fuel, power, food grains etc.
• The Labour Bureau generates the All India Consumer Price Index numbers for industrial
workers, urban non-manual employees and agricultural labourers.
• The Planning Commission and the Ministry of Planning provide social, economic,
demographic, industrial and agricultural statistics.
• State-level publications give information on various types of activities related to the state,
such as commercial activities, education, occupation etc.
Non-Government Publications-
• These include publications of various industrial and trade associations, such as the
Indian Cotton Mill Association and various chambers of commerce.
Understand various sources of Data like Sensors/signal/GPS etc
Sensor data:
• Sensor data is the output of a device that detects and responds to some type of input from
the physical environment.
• The output may be used to provide information or input to another system or to guide a
process.
• Here are a few examples of sensors, just to give an idea of the number and diversity of
their applications:
• A photo sensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
• Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms.
Signal:
• A signal is an electric current or electromagnetic field used to convey data from one place
to another.
• The simplest form of signal is a direct current (DC) that is switched on and off; this is the
principle by which the early telegraph worked.
GPS:
• The Global Positioning System (GPS) is a space-based navigation system that provides
location and time information in all weather conditions, anywhere on or near the Earth
where there is an unobstructed line of sight to four or more GPS satellites.
• The system provides critical capabilities to military, civil, and commercial users around
the world.
• The United States government created the system, maintains it, and makes it freely
accessible to anyone with a GPS receiver.
Quality of Data
Data quality is the ability of your data to serve its intended purpose, based on factors such as
– accuracy,
– completeness,
– consistency,
– reliability.
These factors play a major role in determining data quality.
Accuracy:
• Accuracy means that the recorded data values correctly reflect the real-world values they
are meant to represent.
Completeness:
• Completeness means that all required data are present; unavailability of data makes a
dataset incomplete.
Consistency:
• Inconsistent means a data source contains discrepancies between different data items.
• Some attributes representing a given concept may have different names in different
databases, causing inconsistencies and redundancies.
Reliability:
• Reliability means that data are reasonably complete and accurate, meet the intended
purposes, and are not subject to inappropriate alteration.
• Some other features that also affect data quality include timeliness (the data remain
incomplete until all relevant information has been submitted within the expected time
period), believability (how much the data are trusted by the user) and interpretability
(how easily the data are understood by all stakeholders).
• To make the process easier, data preprocessing is divided into four stages:
– data cleaning,
– data integration,
– data transformation,
– data reduction.
• The common data quality issues that need to be handled during preprocessing include:
• Outliers
• Missing Values
• Noisy Data
• Duplicate Values
Outliers
• Outliers are extreme values that deviate from the other observations in the data; they may
indicate variability in a measurement, experimental errors or a novelty.
• Most of the ways to deal with outliers are similar to the methods for missing values, such
as deleting observations, transforming them, binning them, treating them as a separate
group, imputing values and other statistical methods.
• Here, we will discuss the common techniques used to deal with outliers:
• Deleting observations: We delete outlier values if they are due to data entry errors or data
processing errors, or if the outlier observations are very few in number.
• Transforming and binning values: Transforming variables can also eliminate outliers.
• The decision tree algorithm deals with outliers well because of the binning of variables.
• We can also use a statistical model to predict the values of outlier observations and then
impute them with the predicted values.
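A minimal Python sketch of one common treatment, assuming a small made-up sample: outliers are flagged with the interquartile-range (IQR) rule and then either deleted or capped at the IQR fences.

```python
# Minimal sketch: detect outliers with the IQR rule, then delete or cap them.
import numpy as np

values = np.array([22, 24, 25, 23, 26, 24, 25, 140])   # assumed data; 140 is extreme

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (values < lower) | (values > upper)
print("Outliers:", values[outlier_mask])

deleted = values[~outlier_mask]          # option 1: delete outlier observations
capped = np.clip(values, lower, upper)   # option 2: cap them at the fences
print("After deletion:", deleted)
print("After capping: ", capped)
```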
Missing data
• Missing data in the training data set can reduce the power / fit of a model or can lead to a
biased model because we have not analysed the behavior and relationship with other
variables correctly.
• Now, let’s identify the reasons for occurrence of these missing values.
1. Data Extraction:
• It is possible that there are problems with the extraction process. In such cases, we should
double-check for correct data with the data guardians.
• Some hashing procedures can also be used to make sure data extraction is correct.
• Errors at data extraction stage are typically easy to find and can be corrected easily as
well.
2. Data collection:
• These errors occur at the time of data collection and are harder to correct. They can be
categorized into four types:
Missing completely at random:
• This is a case when the probability of a missing value is the same for all observations.
• For example: respondents of a data collection process decide that they will declare their
earnings after tossing a fair coin. If a head occurs, the respondent declares his/her earnings,
otherwise not. Here each observation has an equal chance of having a missing value.
Missing at random:
• This is a case when a variable is missing at random and the missing ratio varies for
different values/levels of other input variables.
• For example: we are collecting data for age, and females have a higher rate of missing
values compared to males.
Missing that depends on unobserved predictors:
• This is a case when the missing values are not random and are related to the unobserved
input variable.
• For example: In a medical study, if a particular diagnostic causes discomfort, then there is
higher chance of drop out from the study. This missing value is not at random unless we
have included “discomfort” as an input variable for all patients.
Missing that depends on the missing value itself:
• This is a case when the probability of a missing value is directly correlated with the
missing value itself.
• For example: People with higher or lower income are likely to provide non-response to
their earning.
Deletion:
• In list-wise deletion, we delete observations where any of the variables is missing.
• Simplicity is one of the major advantages of this method, but it reduces the power of the
model because it reduces the sample size.
• In pair-wise deletion, we perform analysis with all cases in which the variables of
interest are present.
• The advantage of this method is that it keeps as many cases as possible available for each
analysis.
• One disadvantage of this method is that it uses different sample sizes for different variables.
• Deletion methods are used when the nature of the missing data is “missing completely at
random”; otherwise, non-random missing values can bias the model output.
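A minimal pandas sketch with assumed data (the column names are hypothetical), contrasting list-wise deletion, which drops any row with a missing value, with pair-wise deletion, where each analysis uses only the rows in which its own variables are present.

```python
# Minimal sketch: list-wise vs. pair-wise deletion of missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":      [25, 30, np.nan, 45, 50],
    "income":   [40, np.nan, 55, 60, 65],
    "expenses": [30, 35, 40, np.nan, 50],
})

# List-wise deletion: drop every observation that has any missing value.
listwise = df.dropna()
print("List-wise sample size:", len(listwise))

# Pair-wise deletion: each pair of variables keeps its own complete cases,
# so different analyses end up with different sample sizes.
print("age-income pairs:  ", len(df[["age", "income"]].dropna()))
print("age-expenses pairs:", len(df[["age", "expenses"]].dropna()))

# pandas' correlation matrix already uses pair-wise complete observations.
print(df.corr())
```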
Imputation:
• Imputation is a method of filling in the missing values with estimated ones. The objective
is to employ known relationships that can be identified in the valid values of the data set
to assist in estimating the missing values.
• Mean/Mode/Median imputation is one of the most frequently used methods.
• It consists of replacing the missing data for a given attribute with the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
Generalized Imputation:
• In this case, we calculate the mean or median of all non-missing values of that variable
and then replace the missing values with it.
• For example, if the variable “Manpower” has missing values, we take the average of all
non-missing values of “Manpower” (28.33) and replace the missing values with it.
Similar Case Imputation:
• In this case, we calculate the averages for the genders “Male” (29.75) and “Female” (25)
individually over the non-missing values, and then replace each missing value based on
gender.
• For “Male”, we replace missing values of manpower with 29.75, and for “Female”
with 25.
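The Manpower/gender figures above refer to a table that is not reproduced in these notes, so the sketch below uses assumed data. It shows generalized imputation (overall mean) and similar-case imputation (gender-wise mean) with pandas.

```python
# Minimal sketch: generalized vs. similar-case mean imputation.
import numpy as np
import pandas as pd

df = pd.DataFrame({                       # assumed data, not the table from the notes
    "Gender":   ["Male", "Male", "Female", "Female", "Male"],
    "Manpower": [30.0, np.nan, 25.0, np.nan, 28.0],
})

# Generalized imputation: replace missing values with the overall mean.
generalized = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: replace missing values with the mean of the same gender.
similar_case = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)

print(pd.DataFrame({"generalized": generalized, "similar_case": similar_case}))
```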
Noisy Data
• Noisy data is meaningless data that cannot be interpreted by machines.
• It can be generated due to faulty data collection, data entry errors etc.
Binning Method:
• This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
• Each segment is handled separately.
• One can replace all the data in a segment by its mean, or boundary values can be used to
complete the task.
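A minimal Python sketch with an assumed sorted sample, showing smoothing by bin means and by bin boundaries on equal-size segments.

```python
# Minimal sketch: smoothing noisy data by equal-size bins.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # assumed sample
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value in a segment is replaced by the segment mean.
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the nearer boundary.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print("Bins:          ", bins)
print("Bin means:     ", by_means)
print("Bin boundaries:", by_boundaries)
```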
Regression:
• Here data can be made smooth by fitting them to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
Clustering:
• This approach groups similar data into clusters. Outliers may go undetected, or they will
fall outside the clusters.
Duplicate values:
• A dataset may include data objects which are duplicates of one another.
• This may happen when, say, the same person submits a form more than once.
• The term deduplication is often used to refer to the process of dealing with duplicates.
• In most cases, the duplicates are removed so as to not give that particular data object an
advantage or bias, when running machine learning algorithms.
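A minimal pandas sketch with made-up records (the column names are hypothetical): duplicate submissions by the same person are dropped before the data is used for machine learning.

```python
# Minimal sketch: deduplication of repeated form submissions.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
    "city":  ["Pune", "Delhi", "Pune", "Chennai"],
})

# Keep only the first submission for each unique (name, email) pair.
deduplicated = df.drop_duplicates(subset=["name", "email"], keep="first")
print(deduplicated)
```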
Data Pre-processing
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done. It involves the handling of missing data, noisy data etc.
Missing Data: This situation arises when some values are missing from the data. It can be
handled in various ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is
quite large and multiple values are missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can choose
to fill the missing values manually, by attribute mean or the most probable value.
Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in the
following ways:
o Binning Method: This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size and then various methods are
performed to complete the task. Each segment is handled separately. One can
replace all data in a segment by its mean or boundary values can be used to
complete the task.
o Regression: Here data can be made smooth by fitting them to a regression
function. The regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
o Clustering: This approach groups the similar data into clusters. The outliers may
go undetected, or they will fall outside the clusters.
2. Data Integration:
The process of combining data from multiple sources (databases, spreadsheets, text files) into a
single dataset. A single, consistent view of the data is created in this process. Major problems
during data integration are schema integration (integrating data collected from various
sources), entity identification (identifying the same entities across different databases) and
detecting and resolving data value conflicts.
3. Data Transformation:
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a
minimal sketch is given after this list.
Attribute Selection:
• In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
Discretization:
• This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
Concept Hierarchy Generation:
• Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be converted to “country”.
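The sketch referred to above, with assumed values: min-max normalization rescales an attribute into the 0.0 to 1.0 range, and a further linear shift maps it into -1.0 to 1.0.

```python
# Minimal sketch: min-max normalization into [0, 1] and [-1, 1].
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # assumed attribute values

v_min, v_max = values.min(), values.max()
scaled_0_1 = (values - v_min) / (v_max - v_min)    # range 0.0 to 1.0
scaled_m1_1 = 2 * scaled_0_1 - 1                   # range -1.0 to 1.0

print(scaled_0_1)
print(scaled_m1_1)
```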
4. Data Reduction:
• Since data mining is a technique that is used to handle huge amounts of data, analysis
becomes harder while working with such huge volumes.
• Data reduction aims to increase storage efficiency and reduce data storage and analysis
costs.
Data Cube Aggregation:
• The aggregation operation is applied to the data for the construction of the data cube
(redundant and noisy data are removed).
Attribute Subset Selection:
• Only the highly relevant attributes should be used; the rest can be discarded.
• For performing attribute selection, one can use the level of significance and the p-value of
the attribute: an attribute having a p-value greater than the significance level can be
discarded (see the sketch below).
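The sketch below (synthetic data; the attribute names "relevant" and "irrelevant" are hypothetical) fits an ordinary least squares model with statsmodels and keeps only the attributes whose p-value is at or below the 0.05 significance level.

```python
# Minimal sketch: attribute selection using p-values from an OLS fit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "relevant":   rng.normal(size=100),
    "irrelevant": rng.normal(size=100),
})
y = 3 * X["relevant"] + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.pvalues)

# Discard attributes whose p-value exceeds the significance level.
alpha = 0.05
pvals = model.pvalues.drop("const")
selected = pvals[pvals <= alpha].index.tolist()
print("Selected attributes:", selected)
```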
Numerosity Reduction:
• In this method, the data are replaced or estimated by a smaller representation, such as a
parametric model or a data sample, which reduces the volume of the data.
Dimensionality Reduction:
• If, after reconstruction from the compressed data, the original data can be retrieved, such a
reduction is called lossless reduction; otherwise it is called lossy reduction.
• Two effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
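A minimal scikit-learn sketch with synthetic data: four correlated attributes are projected onto two principal components, which is a lossy form of dimensionality reduction.

```python
# Minimal sketch: dimensionality reduction with PCA (lossy).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Four attributes built from two underlying factors plus a little noise.
X = np.hstack([base, base + rng.normal(scale=0.1, size=(100, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (100, 4) -> (100, 2)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```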
5. Data Discretization: Involves reducing the number of values of a continuous attribute by
dividing the range of the attribute into intervals.
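A minimal pandas sketch with an assumed "age" attribute: the continuous values are replaced first by equal-width interval levels and then by higher-level conceptual labels.

```python
# Minimal sketch: discretization of a continuous attribute.
import pandas as pd

ages = pd.Series([5, 13, 22, 29, 41, 55, 67, 80])   # assumed values

# Interval levels: divide the attribute range into four equal-width intervals.
intervals = pd.cut(ages, bins=4)

# Conceptual levels: map chosen intervals to higher-level labels.
labels = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": labels}))
```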