Cs655 Unit V
Similar terms:
knowledge mining from data,
knowledge extraction,
data/pattern analysis,
data archaeology,
data dredging.
Knowledge Discovery from Data, or KDD.
Knowledge base
This is the domain knowledge that is used to guide the search or to evaluate the interestingness of the resulting patterns.
It can include concept hierarchies, which are used to organize attributes or attribute values into different
levels of abstraction, and user beliefs.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
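A concept hierarchy can be pictured as a mapping from each value to its parent at the next level of abstraction. The sketch below is only an illustration: the village, district and state names, the counts, and the helper names generalize and roll_up are all assumptions, not taken from the referenced texts.

```python
# Hypothetical concept hierarchy for a "location" attribute:
# village -> district -> state. All names and counts are invented.
PARENT = {
    "Village A": "District X",
    "Village B": "District X",
    "Village C": "District Y",
    "District X": "State S",
    "District Y": "State S",
}

def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the top."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

def roll_up(counts, levels=1):
    """Re-aggregate per-value counts at a higher abstraction level."""
    out = {}
    for value, n in counts.items():
        key = generalize(value, levels)
        out[key] = out.get(key, 0) + n
    return out

village_counts = {"Village A": 120, "Village B": 80, "Village C": 50}
by_district = roll_up(village_counts, levels=1)
by_state = roll_up(village_counts, levels=2)
print(by_district)
print(by_state)
```

The same counts are reported at village, district or state level simply by changing how far each value is generalized.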
Data mining engine
This is essential to the data mining system and consists of a set of functional modules for tasks such as
characterization,
association and correlation analysis,
classification,
prediction,
cluster analysis,
outlier analysis,
evolution analysis.
Pattern evaluation module:
This component typically employs interestingness measures and interacts with the data mining modules so as
to focus the search toward interesting patterns.
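The notes do not name specific interestingness measures. As an assumed example, the sketch below uses support and confidence, two measures commonly applied to association rules, with made-up transactions and thresholds, to show how an evaluation step might filter patterns.

```python
# Hypothetical transactions; each is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if antecedent never occurs."""
    base = support(antecedent)
    if base == 0:
        return 0.0
    return support(antecedent | consequent) / base

# Assumed thresholds; in a real system these would come from the user
# or from the knowledge base.
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.6

def is_interesting(antecedent, consequent):
    """Keep a rule only if it meets both thresholds."""
    return (support(antecedent | consequent) >= MIN_SUPPORT
            and confidence(antecedent, consequent) >= MIN_CONFIDENCE)

print(is_interesting({"bread"}, {"milk"}))
print(is_interesting({"milk"}, {"butter"}))
```

Applying such a test inside the mining loop, rather than after it, is what lets the search concentrate on interesting patterns.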
Distinguishing feature: its visual interface, which allows users to wire components together to create self-documenting programs
OLTP | OLAP
Up to date and consistent at all times | Data is consistent only up to the last update
Queries touch small amounts of data | Queries touch large amounts of data
Concurrency is the biggest performance concern | Each report or query requires a lot of resources
OLTP targets a specific process, like ordering from an online store | OLAP integrates data from different processes (ordering, processing, inventory, sales, etc.)
Database size is usually around 100 MB to 100 GB | Database size is usually around 100 GB to a few TB
Only current data available (old data is replaced by current data by updating) | Both current and historic data available (current is appended to historic data)
Concurrency control and transaction recovery | No concurrent transactions and therefore no recovery upon failures required
Largely online ad hoc queries, requiring low level of indexing | Largely pre-determined queries requiring high level of indexing
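The difference in how much data each kind of query touches can be sketched with Python's built-in sqlite3 module. The orders table, its columns and its rows are invented for illustration. In practice the two workloads usually run on separate systems, and a single in-memory table is used here only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, year INTEGER, "
    "amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount, status) VALUES (?, ?, ?, ?)",
    [
        ("North", 2001, 100.0, "open"),
        ("North", 2002, 150.0, "open"),
        ("South", 2001, 200.0, "open"),
        ("South", 2002, 50.0, "open"),
    ],
)

# OLTP-style work: a short transaction that touches a single row by key.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))
one_row = conn.execute(
    "SELECT status FROM orders WHERE id = ?", (1,)
).fetchone()

# OLAP-style work: a read-only query that scans all rows and summarizes them.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_row)
print(totals)
```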
Introduction
Data warehousing and data mining are important means of preparing the government to face the
challenges of the new millennium.
These technologies have extensive potential applications in the government, for example in Central Government
sectors such as Agriculture, Rural Development, Health and Energy.
These technologies can and should therefore be implemented.
Similarly, large opportunities exist for applying these techniques in State Government activities.
Almost none of these opportunities have been exploited so far.
Census Data
The Registrar General and Census Commissioner of India compiles information on all
individuals, villages, population groups, etc. once every ten years.
This information is wide ranging. It includes the individual slip, a compilation of information on individual
households, of which a 5% sample database is maintained for analysis.
A data warehouse can be built from this database upon which OLAP techniques can be applied.
Data mining can also be performed for analysis and knowledge discovery.
General Information Services Terminal of National Informatics Centre (GISTNIC)
A village-level database was originally developed by the National Informatics Centre at Hyderabad under the
General Information Services Terminal of National Informatics Centre (GISTNIC) for the 1991 Census.
This consists of two parts: primary census abstract and village amenities.
Subsequently, a data warehouse was also developed for village amenities for Tamil Nadu.
This enables multidimensional analysis of the village-level data in such sectors as education, health and
infrastructure.
The fact data pertains to the individual village data compiled under the 1991 Census.
As the Census compilation is performed once in 10 years, the data is quasi-static and, therefore, no
periodic refreshing of the warehouse is needed.
New data only needs to be appended to the data warehouse; alternatively, a new data
warehouse can be built.
There exist many other subject areas (e.g. migration tables) within the census purview which may be
amenable and appropriate for data warehouse development, OLAP and data mining applications on which
work can be taken up in future.
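The append-only loading and multidimensional analysis described for the village amenities warehouse can be sketched as below. This is a minimal illustration under assumptions: the fact rows, district names, counts, and the helper names append_census and aggregate are hypothetical and do not reflect the actual GISTNIC schema.

```python
from collections import defaultdict

# Hypothetical fact rows: (census_year, district, sector, amenity_count).
DIMENSIONS = ("census_year", "district", "sector")
warehouse = [
    (1991, "District X", "education", 3),
    (1991, "District X", "health", 1),
    (1991, "District Y", "education", 3),
]

def append_census(rows, new_rows):
    """Append a new census batch; rows already loaded are never updated."""
    loaded_years = {row[0] for row in rows}
    for row in new_rows:
        if row[0] in loaded_years:
            raise ValueError(f"census year {row[0]} is already loaded")
    rows.extend(new_rows)

def aggregate(rows, by):
    """Sum the measure over every dimension not listed in `by`."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

# A later census arrives as a separate batch and is simply appended.
append_census(warehouse, [
    (2001, "District X", "education", 4),
    (2001, "District Y", "education", 5),
])

by_year_sector = aggregate(warehouse, by=("census_year", "sector"))
by_district = aggregate(warehouse, by=("district",))
print(by_year_sector)
print(by_district)
```

Because existing rows are never modified, earlier census figures stay available for comparison with later ones.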
Other Areas
Other possible areas for data warehousing and data mining in Central Government sectors are discussed in
detail in the following sections.
Agriculture
The Agricultural Census performed by the Ministry of Agriculture, Government of India, compiles a large
number of agricultural parameters at the national level.
District-wise agricultural production, area and yield of crops are compiled;
these can be built into a data warehouse for analysis, mining and forecasting.
Statistics on consumption of fertilizers can also be turned into a data mart.
Data on agricultural inputs such as seeds and fertilizers can also be effectively analysed.
Data from the livestock census can be turned into a data warehouse.
Land-use pattern statistics can also be analysed in a warehousing environment.
Other data such as watershed details and agricultural credit data can be effectively used for analysis
by applying the technologies of OLAP and data mining.
There is substantial scope for application of data warehousing and data mining techniques in the agricultural sector.
Rural Development
Data on individuals below the poverty line (BPL survey) can be built into a data warehouse.
Drinking water census data (from the Drinking Water Mission)
and data for monitoring and analysing the progress made on implementation of rural development programmes
can be effectively utilized by OLAP and data mining technologies.
Health
Community needs assessment data,
immunization data,
and data from national programmes on controlling blindness, leprosy and malaria
can all be used for data warehousing implementation, OLAP and data mining applications.
Planning
At the Planning Commission, data warehouses can be built for state plan data on all sectors: labor, energy,
education, trade and industry, five year plan, etc.
Education
The Sixth All India Educational Survey data has been converted into a data warehouse (3 GB of data).
Various types of analytical queries can be answered and reports generated.
Other sectors
A number of other potential application areas exist for data warehousing and data mining, as follows:
Tourism
Tourist arrival behavior and preferences; tourism products data; foreign exchange earnings data; and Hotels,
Travel and Transportation data
Programme implementation
Central projects data (for monitoring).
Revenue
Customs data; central excise data; and commercial taxes data (state government)
Economic affairs
Budget and expenditure data; and annual economic survey.
Audit and accounts
Government accounts data
Paradigm Shift
All government departments are deeply involved in generating and processing a large amount of data.
Much of the analysis work was done manually by the Department of Statistics in the Central Government
or in any State Government.
The techniques used were conventional statistical techniques, applied largely in batch-mode processing.
With the advent and prominence of data warehousing and data mining technology, there is a paradigm shift.
This may finally result in improved governance and better planning through better utilization of data.
Government departments can rely on data warehousing and data mining technologies for their day-to-day decision-making.
Different data marts for separate departments can be integrated into one data warehouse for the
government.
Thus data warehouses can be built at Central level, State level and also at District level.
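One way to picture the integration of departmental data marts is to join them on a shared dimension. The sketch below assumes that both marts are keyed by district. The mart contents, measure names and the integrate helper are hypothetical.

```python
# Hypothetical departmental data marts, both keyed by district.
agriculture_mart = {
    "District X": {"crop_yield_tonnes": 1200},
    "District Y": {"crop_yield_tonnes": 900},
}
health_mart = {
    "District X": {"immunized_children": 5000},
    "District Z": {"immunized_children": 700},
}

def integrate(**marts):
    """Merge marts on the shared district key.

    Each measure is prefixed with its department name so that
    measures from different marts cannot collide.
    """
    warehouse = {}
    for department, mart in marts.items():
        for district, measures in mart.items():
            row = warehouse.setdefault(district, {})
            for name, value in measures.items():
                row[f"{department}.{name}"] = value
    return warehouse

warehouse = integrate(agriculture=agriculture_mart, health=health_mart)
for district in sorted(warehouse):
    print(district, warehouse[district])
```

Districts that appear in only one mart still get a row, so gaps in departmental coverage remain visible in the integrated warehouse.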
Conclusion
In the government, the individual data marts are required to be maintained by the individual departments (or
public sector organizations) and a central data warehouse is required to be maintained by the ministry
concerned for the concerned sector.
A generic inter-sectoral data warehouse is maintained by a central body (such as the Planning Commission).
Similarly, at the State level, a generic inter-departmental data warehouse can be built and maintained by a nodal
agency, and detailed data warehouses can also be built and maintained at the district level by an
appropriate agency.
The National Informatics Centre may possibly play the role of the nodal agency at Central, State and District
levels for developing and maintaining data warehouses in various sectors.
Rainfall
This data mart has information on daily levels of rainfall across various weather stations in Tamil Nadu.
This will help end-users to plan the water supply to various districts in Tamil Nadu and to use various models to
forecast rainfall levels.
Applications
Time-based rainfall analysis
Geography/time-based rainfall analysis
A line plot screen tracks rainfall levels at various weather stations in Tamil Nadu.
On selecting one of the weather stations in the list box on the screen, the line plot changes to reflect the
rainfall level for the selected weather station.
On clicking any point on the line plot, the graph displays the date and rainfall level for that data point.
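The time-based analysis and the per-station line plot described above rest on two simple operations: aggregating daily readings over time, and extracting the ordered series for one selected station. A minimal sketch, with invented station names and readings and hypothetical helper names:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily readings: (station, day, rainfall in mm).
readings = [
    ("Station 1", date(2000, 6, 1), 12.0),
    ("Station 1", date(2000, 6, 2), 0.0),
    ("Station 1", date(2000, 7, 1), 30.5),
    ("Station 2", date(2000, 6, 1), 5.0),
    ("Station 2", date(2000, 7, 3), 8.5),
]

def monthly_totals(rows):
    """Roll daily rainfall up to (station, year, month) totals."""
    totals = defaultdict(float)
    for station, day, mm in rows:
        totals[(station, day.year, day.month)] += mm
    return dict(totals)

def series_for(station, rows):
    """Date-ordered (day, mm) points for one station, as a line plot needs."""
    return sorted((day, mm) for s, day, mm in rows if s == station)

totals = monthly_totals(readings)
selected = series_for("Station 2", readings)
print(totals)
print(selected)
```

Selecting a different station only changes the argument to series_for. Each point in the returned series carries the date and rainfall level that the screen shows when a point is clicked.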
Malaria Statistics
This data mart has information on various health camps conducted across Tamil Nadu to detect and treat
malaria patients.
This has vital information like number of people suffering from malaria, deaths caused by malaria,
source of malaria infection, demographic information of malaria patients, etc.
Using the data warehouse, the end-users will be able to plan various precautionary measures to reduce the
number of people suffering from malaria in Tamil Nadu.
Applications
MDR census on samples collected and tested
MDR source of malarial parasites
MDR age- and sex-wise malarial census
Graph: age- and sex-wise malarial census
School Health
This data mart has information about various health check-up camps conducted in various schools across Tamil Nadu.
It has information about students suffering from various diseases, defects, immunization programmes, etc.
Applications
Disease analysis - MODS report
Disease analysis - graph
Immunization analysis - MODS
Immunization analysis - graph
Intended for educational purposes only. Not intended for any sort of commercial use.
Purely created to help students with limited preparation time. No guarantee on the adequacy or accuracy.
Text and pictures used were taken from the reference items.
Reference
DATA WAREHOUSING: Concepts, Techniques, Products and Applications, 3rd Ed, CSR Prabhu
Data Mining: Concepts and Techniques, 2nd Ed, Jiawei Han, Micheline Kamber