The Need of Data Analysis
A data warehouse (DW) is a database used for reporting. The data is offloaded from the operational
systems and may pass through an operational data store, where additional operations are applied,
before it is used in the DW for reporting.
A data warehouse is organized in three layers: staging, integration, and access. The staging layer stores
raw data for use by developers (analysis and support). The integration layer integrates the data and
provides a level of abstraction from users. The access layer makes the data available to users.
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned,
transformed, catalogued and made available for use by managers and other business professionals for
data mining, online analytical processing, market research and decision support (Marakas & O'Brien
2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to
manage the data dictionary are also considered essential components of a data warehousing system.
Many references to data warehousing use this broader context. Thus, an expanded definition for data
warehousing includes business intelligence tools, tools to extract, transform and load data into the
repository, and tools to manage and retrieve metadata.
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of
highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis
has multiple facets and approaches, encompassing diverse techniques under a variety of names, in
different business, science, and social science domains.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for
predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies
heavily on aggregation, focusing on business information. In statistical applications, some people divide
data analysis into descriptive statistics, exploratory data analysis, and confirmatory data analysis. EDA
focuses on discovering new features in the data and CDA on confirming or falsifying existing
hypotheses. Predictive analytics focuses on application of statistical or structural models for predictive
forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to
extract and classify information from textual sources, a species of unstructured data. All are varieties of
data analysis.
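The split between descriptive statistics and exploratory analysis can be illustrated with a few lines of standard-library Python. This is only a sketch: the sales figures and the two-standard-deviation outlier rule are invented for illustration, not a formal test.

```python
# Descriptive vs. exploratory analysis on a tiny invented data set.
import statistics

daily_sales = [120, 135, 128, 410, 122, 131, 119]

# Descriptive statistics: summarize the data as-is.
mean = statistics.mean(daily_sales)
median = statistics.median(daily_sales)

# Exploratory step: look for surprising features, e.g. values far
# from the median (a crude rule of thumb, not a hypothesis test).
spread = statistics.stdev(daily_sales)
outliers = [x for x in daily_sales if abs(x - median) > 2 * spread]

print(mean, median, outliers)
```

A confirmatory analysis would then take a feature discovered this way (the 410 spike) and test a specific hypothesis about it on fresh data.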
Data integration is a precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used as a synonym for data
modeling, which is unrelated to the subject of this discussion.
Typical information that a decision support application might gather and present includes:
inventories of information assets (including legacy and relational data sources, cubes, data
warehouses, and data marts),
comparative sales figures between one period and the next,
projected revenue figures based on product sales assumptions.
Databases configured for OLAP use a multidimensional data model, allowing for complex analytical
and ad-hoc queries with a rapid execution time. They borrow aspects of navigational and hierarchical
databases, which allow faster access than relational databases for these workloads.
The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the
rows and columns of the matrix; the measures form the values.
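The pivot format can be sketched in a few lines of Python: one dimension becomes the rows, another the columns, and the measure fills the cells. The quarter/region sales rows below are invented for illustration.

```python
# Pivot a flat OLAP-style result set into a matrix keyed by two dimensions.
from collections import defaultdict

rows = [
    ("2024-Q1", "North", 100),
    ("2024-Q1", "South", 150),
    ("2024-Q2", "North", 120),
    ("2024-Q2", "South", 90),
]

pivot = defaultdict(dict)
for quarter, region, units in rows:
    pivot[quarter][region] = units  # measure value at (row, column)

print(pivot["2024-Q1"]["South"])  # 150
```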
STAR SCHEMAS:
EXAMPLE:
Consider a database of sales, perhaps from a store chain, classified by date, store and product. The
schema described here is a star schema version of the sample schema provided in the snowflake
schema article.
Fact.Sales is the fact table, and there are three dimension
tables: Dim.Date, Dim.Store and Dim.Product.
Each dimension table has a primary key on its PK column, relating to one of the columns of
the Fact.Sales table's three-column (compound) primary key
(Date_FK, Store_FK, Product_FK). The non-primary-key [Units Sold] column of the fact table in
this example represents a measure or metric that can be used in calculations and analysis. The non-
primary-key columns of the dimension tables represent additional attributes of the dimensions (such as
the Year of the Dim.Date dimension).
Using schema descriptors with dot-notation, combined with simple suffix decorations for column
differentiation, makes it easier to write the SQL for Star Schema queries. This is because fewer
underscores are required and table aliasing is minimized.
Most SQL database engines allow schema descriptors, and also permit decoration suffixes on
surrogate key columns. Square brackets, which are physically easier to type on the keyboard (no
shift key needed), are unobtrusive and make the code easier to read.
For example, the following query extracts how many TV sets have been sold, for each brand and country,
in 1997:
SELECT Brand, Country, SUM ([Units Sold])
FROM Fact.Sales
JOIN Dim.Date
ON Date_FK = Date_PK
JOIN Dim.Store
ON Store_FK = Store_PK
JOIN Dim.Product
ON Product_FK = Product_PK
WHERE [Year] = 1997
AND Product_Category = 'TV'
GROUP BY Brand, Country
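The star-schema query can be run end to end with sqlite3 from the Python standard library. This sketch adds the WHERE filter and GROUP BY that the stated intent (TV sets, per brand and country, in 1997) requires; SQLite has no schema descriptors, so Fact.Sales becomes Fact_Sales, and the sample rows and the Year/Product_Category attribute columns are illustrative assumptions.

```python
# Build a tiny star schema in memory and run the brand/country rollup.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Date    (Date_PK INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Store_PK INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Product_PK INTEGER PRIMARY KEY,
                          Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales  (Date_FK INTEGER, Store_FK INTEGER,
                          Product_FK INTEGER, Units_Sold INTEGER,
                          PRIMARY KEY (Date_FK, Store_FK, Product_FK));
INSERT INTO Dim_Date    VALUES (1, 1997), (2, 1998);
INSERT INTO Dim_Store   VALUES (1, 'US'), (2, 'DE');
INSERT INTO Dim_Product VALUES (1, 'Acme', 'TV'), (2, 'Acme', 'Radio');
INSERT INTO Fact_Sales  VALUES (1, 1, 1, 10), (1, 2, 1, 5),
                               (2, 1, 1, 7),  (1, 1, 2, 3);
""")

result = con.execute("""
    SELECT Brand, Country, SUM(Units_Sold)
    FROM Fact_Sales
    JOIN Dim_Date    ON Date_FK    = Date_PK
    JOIN Dim_Store   ON Store_FK   = Store_PK
    JOIN Dim_Product ON Product_FK = Product_PK
    WHERE Year = 1997 AND Product_Category = 'TV'
    GROUP BY Brand, Country
""").fetchall()
print(sorted(result))  # [('Acme', 'DE', 5), ('Acme', 'US', 10)]
```

Note how each join simply matches a fact-table foreign key against its dimension's surrogate key, while the non-key dimension attributes (Year, Country, Brand) do the filtering and labeling.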
DATA MINING:
Data mining, a branch of computer science and artificial intelligence, is the process of extracting patterns
from data. Data mining is seen as an increasingly important tool by modern business to transform data
into business intelligence giving an informational advantage. It is currently used in a wide range
of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.
The related terms data dredging, data fishing and data snooping refer to the use of data mining
techniques to sample portions of the larger population data set that are (or may be) too small for reliable
statistical inferences to be made about the validity of any patterns discovered (see also data-snooping
bias). These techniques can, however, be used in the creation of new hypotheses to test against the
larger data populations.
Clustering - is the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.
Classification - is the task of generalizing known structure to apply to new data. For example, an
email program might attempt to classify an email as legitimate or spam.
Common algorithms include decision tree learning, nearest neighbor, naive Bayesian
classification, neural networks and support vector machines.
Regression - Attempts to find a function which models the data with the least error.
Association rule learning - Searches for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought together and use this information
for marketing purposes. This is sometimes referred to as market basket analysis.
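The market-basket idea behind association rule learning can be sketched by counting how often pairs of products appear in the same basket. The transactions below are invented; a real association-rule miner (e.g. Apriori) would also apply support and confidence thresholds.

```python
# Count co-occurring product pairs across invented shopping baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so each pair has one canonical ordering before counting.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Pairs with high counts relative to the number of baskets ("bread" and "butter" here) are candidates for rules like "customers who buy bread also buy butter".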
DATABASE ADMINISTRATION:
DBA Responsibilities
Installation, configuration and upgrading of Database server software and related products.
Evaluate Database features and Database related products.
Establish and maintain sound backup and recovery policies and procedures.
Take care of the Database design and implementation.
Implement and maintain database security (create and maintain users and roles, assign
privileges).
Database tuning and performance monitoring.
Application tuning and performance monitoring.
Setup and maintain documentation and standards.
Plan growth and changes (capacity planning).
Work as part of a team and provide 24x7 support when required.
Do general technical troubleshooting and give consultation to development teams.
Interface with Oracle / Microsoft Corporation for technical support.
The needs of organizations and management are changeable, diverse and often ill-defined, yet they must
be met. Added to these are outside pressures from federal taxing authorities, federal securities agencies
and legislators making privacy laws. Both internal and external forces demand that organizations exercise
control over their data resources.
Decisions and actions in the organization are based upon the image contained in the corporate database.
Managerial decisions direct the actions at the operational level and produce plans and expectations which
are formally captured and stored in the corporate database. Transactions record actual results of
organizational activities and environmental changes, and update the database to maintain a current image.
To avoid data inaccuracies and the potential for disasters, there must be a corporate-wide awareness of
data quality and a recognition of the importance of corporate data.
There are three critical success factors that each company needs to identify before moving
forward with the issue of data quality:
Senior management commitment to maintaining the quality of corporate data can be achieved by
instituting a data administration department that oversees data management standards, policies,
procedures, and guidelines.
Data quality is defined as being data that is complete, timely, accurate, valid, and consistent. The
definition of data quality must describe the degree of quality required for each element loaded
into the data warehouse.
The quality assurance of data refers to the verification of the accuracy and correction of the data,
if necessary, and this may involve cleansing of existing data. Since no company is able to rectify
all of its unclean data, procedures have to be put in place to ensure data quality at the source.
This task can only be achieved by modifying business processes and designing data quality into
the system. In identifying every data item and its usefulness to the ultimate users of this data,
data quality requirements can be established. Increasing the quality of data as an after-the-fact
task is five to ten times more costly than capturing it correctly at the source.
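Designing quality in at the source, as described above, can be sketched as a validation step applied before a record is accepted, rather than a cleansing pass applied afterwards. The field rules below are invented examples, not a standard.

```python
# Validate each field against its quality requirement at capture time.
REQUIREMENTS = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country":     lambda v: v in {"US", "DE", "ID"},
    "amount":      lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(record):
    """Return the list of fields that fail their quality rule."""
    return [field for field, ok in REQUIREMENTS.items()
            if field not in record or not ok(record[field])]

print(validate({"customer_id": 7, "country": "US", "amount": 12.5}))   # []
print(validate({"customer_id": -1, "country": "XX", "amount": 5}))
```

A record with a non-empty failure list is rejected at entry, so the warehouse never has to repair it later.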
If companies want to use a data warehouse for competitive advantage and reap its benefits, data
quality is extremely important. Only when data quality is recognized as a corporate asset by
every member of the organization will the benefits of data warehousing and CRM initiatives be
realized.
Unreliable and inaccurate data in the data warehouse causes numerous problems. First and
foremost, the confidence of the users in the validity and reliability of this technology will be
seriously impaired. Furthermore, if the data is used for strategic decision making, unreliable data
affects the entire organization, and will affect senior management’s view of the data warehousing
project.
Erroneous Data
An excellent example of the damage that can be caused by erroneous data occurred in the early
1980s, when banks held incorrect risk-exposure data on Texas-based businesses. When the oil
market slumped, those banks that had many Texas accounts suffered major losses. In other
cases, manufacturing firms scaled down their operations and took actions to eliminate excess
inventory; because their market data was inaccurate, they overestimated the inventory and
sold off critical business equipment.
Erroneous data should be captured and corrected before it enters the warehouse. Capture and
correction are handled programmatically in the process of transforming data from one system to
the data warehouse. An example might be a field that was in lowercase that needs to be stored in
uppercase. A final means of handling errors is to replace erroneous data with a default value. If,
for example, the date February 29 of a non-leap year is defaulted to February 28, there is no loss
in data integrity.
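The programmatic capture-and-correct step described above can be sketched as a small transform: uppercase a lowercase field, and replace an invalid February 29 with the February 28 default. The field names are invented for illustration.

```python
# Correct known error patterns while transforming a record for the warehouse.
import datetime

def clean_record(record):
    record = dict(record)
    # Correction: the region code must be stored in uppercase.
    record["region"] = record["region"].upper()
    # Default value: Feb 29 in a non-leap year becomes Feb 28.
    y, m, d = record["sale_date"]
    try:
        record["sale_date"] = datetime.date(y, m, d)
    except ValueError:
        if (m, d) == (2, 29):
            record["sale_date"] = datetime.date(y, 2, 28)
        else:
            raise  # anything else is rejected, not silently fixed
    return record

cleaned = clean_record({"region": "tx", "sale_date": (2023, 2, 29)})
print(cleaned)
```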
Consistency of Data
In analyzing the characteristics of data required for data warehousing and data mining
applications, the quality of the data is of extreme importance in a data warehousing project, and
the challenge for data managers is to ensure the consistency of data entering the system. In some
organizations, data is stored in so-called 'flat' files and a variety of relational databases. In
addition, different systems designed for different functions contain the same terms but with
different meanings.
If care is not taken to clean up this terminology during data warehouse construction, misleading
management information results. The logical consequence of this requirement is that
management has to agree on the data definitions for elements in the warehouse.
A database is a collection of information stored in a computer in a systematic way, so that a
computer program can query it to retrieve information. The software used to manage and query
the database is called a database management system (DBMS). Database systems are studied in
information science.
The term database originated in computer science, although its meaning has since broadened to
include things outside the field of electronics. Note that collections similar to databases existed
before the industrial revolution, in the form of ledgers, receipts and collections of data related
to business.
The basic concept of a database is a collection of records, or pieces of knowledge. A structured
database includes a description of the types of facts stored in it; this description is called a
schema. The schema describes the objects represented in the database and the relationships
between those objects. There are many ways to organize a schema; a way of modeling the database
structure is known as a database model or data model.
The database plays a crucial role in an organization and is used for a number of purposes that
support the organization's main objectives. In an enterprise, the role of the database is
essential: information can be obtained quickly because the underlying data is stored in a
database. For example, the mechanism for withdrawing money at an ATM (Automated Teller Machine)
is actually driven by decisions based on the database. First, the system validates the
legitimacy of the cardholder by checking the password (PIN) that person provides: the value you
enter is matched against the password stored in the database.
If they match, the next step is carried out, namely checking the balance recorded in the
database against the amount of money requested. If there are sufficient funds, the machine
dispenses the money. The database also makes possible applications such as online KRS (course
registration), which enables students to enter the courses they are taking via the Internet.
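The two database checks described above (validate the PIN, then check the balance before dispensing) can be sketched with an in-memory table standing in for the real account database; the card number, PIN and balance are invented.

```python
# Toy ATM flow: PIN check against the stored record, then balance check.
accounts = {
    "4417-1234": {"pin": "0420", "balance": 500},
}

def withdraw(card, pin, amount):
    acct = accounts.get(card)
    if acct is None or acct["pin"] != pin:
        return "rejected: invalid card or PIN"
    if amount > acct["balance"]:
        return "rejected: insufficient funds"
    acct["balance"] -= amount
    return f"dispensed {amount}, balance {acct['balance']}"

print(withdraw("4417-1234", "0000", 100))  # rejected: invalid card or PIN
print(withdraw("4417-1234", "0420", 100))  # dispensed 100, balance 400
```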
In my view, the database is useful not only to an organization or company, but also for personal
purposes. Using database software such as Microsoft Access, one can manage personal data, such
as telephone numbers and monthly spending, and, when necessary, retrieve all the information
easily and quickly.
Technically, database systems have generally been applied in the functional areas of
organizations and companies for the sake of efficiency, security, accuracy, and the speed and
ease of data management. Organizations in many sectors have implemented databases in their
information systems and improved their performance as a result, among others:
• Banking
• Insurance
• Education / schools
• Supermarkets
• Hospitals
• Travel agencies
• Industrial / manufacturing
• Telecommunications
• etc.
Those are some of my observations and opinions about the important role of databases in
organizations and companies.
Aqua Data Studio provides database administration tools for the Oracle database. To access the Oracle
DBA Tools, you may use the Application menu under DBA Tools->Oracle->[Tool]. You may also access
the Oracle DBA Tool by selecting the context popup menu on the schema browser and select DBA Tools-
>[Name].
The suite consists of eight tools that cover every aspect of an Oracle database:
Instance Manager: Provides manageability of the Oracle instance, allowing the user to view and
modify server parameters, including monitoring and backing up the Oracle control file.
Storage Manager: Provides manageability of Oracle tablespaces and datafiles, allowing the user
to visualize and maintain storage, including object and file I/O statistics.
Rollback Manager: Provides monitoring and maintenance of rollback segments, including
current statements, transactions and execution plans.
Log Manager: Provides manageability of Redo Logs and Archive Logs. Allows users to create and
manage redo logs including monitoring archive logs.
Security Manager: Provides manageability of users, roles and profiles, allowing the user to
manage permissions, roles and security of the Oracle database.
Session Manager: Provides manageability of database sessions, including user and system locks,
allowing the user to kill/disconnect sessions, start traces, and monitor open cursors and user
queries with execution plans.
SGA Manager: Provides manageability of the Oracle SGA area, including the SQL Area, Lib Cache,
Lib Cache Stats and a summary of the SGA, and allows users to pin and unpin code.
Server Statistics: Provides a summary of statistics for the Oracle instance, waits and latches.