Business Intelligence, Data Warehousing, Data Mining, Data Visualization
VTUPulse.com
FIGURE 2.1 BIDM cycle
their environment and predicting the future
for their own benefit and growth.
4. Healthcare and Wellness
5. Education
6. Banking
7. Financial Services
8. Insurance
9. Manufacturing
10. Public Sector
For Video Lectures subscribe to
https://www.youtube.com/c/maheshhuddar
Retail
• Retail organizations grow by meeting customer needs
with quality products in a convenient, timely, and
cost-effective manner.
• Understanding emerging customer shopping
patterns can help retailers organize their products,
inventory, store layout, and web presence in order to
delight their customers, which in turn would help
increase revenue and profits.
• Retailers generate a lot of transaction and logistics
data that can be used to diagnose and solve
problems.
Retail
Optimize Inventory Levels
• Retailers need to manage their inventories carefully at different locations.
Carrying too much inventory imposes carrying costs, while carrying too little
inventory can cause stock-outs and lost sales opportunities. Predicting sales
trends dynamically helps retailers move inventory to where it is most in demand.
Retail organizations can provide their suppliers with real-time information
about sales of their items, so the suppliers can deliver their products to the right
locations and minimize stock-outs.
Improve Store Layout and Sales Promotions
• A market basket analysis can develop predictive models of the products often
sold together. This knowledge of affinities between products can help retailers
co-locate those products. Alternatively, those affinity products could be located
farther apart to make the customer walk the length and breadth of the store,
and thus be exposed to other products. Promotional discounted product
bundles can be created to push a nonselling item along with a set of products
that sell well together.
Retail
Optimize Logistics for Seasonal Effects
• Seasonal products offer tremendously profitable short-term sales
opportunities, yet they also offer the risk of unsold inventories at
the end of the season. Understanding the products that are in
season in which market can help retailers dynamically manage
prices to ensure their inventory is sold during the season. If it is
raining in a certain area, then the inventory of umbrellas and
ponchos could be rapidly moved there from non-rainy areas to
help increase sales.
Minimize Losses due to Limited Shelf Life
• Perishable goods offer challenges in terms of disposing of the
inventory in time. By tracking sales trends, the perishable products
at risk of not selling before the sell-by date can be suitably
discounted and promoted.
Telecom
BI in telecom can help with both the customer side and the network side of
operations. Key BI applications include churn management,
marketing/customer profiling, network failure management, and fraud detection.
• In addition to customer data, telecom companies also store call
detail records (CDRs), which can be analyzed to precisely describe
the calling behavior of each customer. This unique data can be used
to profile customers and then can be used for creating new
product/service bundles for marketing purposes. An American
telecom company, MCI, created a program called Friends & Family
that allowed free calls with one's friends and family on that
network, and thus, effectively locked many people into their
network.
Telecom
Churn Management
• Telecom customers have shown a tendency to switch their
providers in search of better deals. Telecom companies tend to
respond with many incentives and discounts to hold on to
customers. However, they need to determine which customers are
at a real risk of switching and which others are just negotiating for a
better deal. The level of risk should be factored into the kind of
deals and discounts that should be given. Millions of such customer
calls happen every month. The telecom companies need to provide
a consistent and data-based way to predict the risk of the customer
switching, and then make an operational decision in real time while
the customer call is taking place. A decision-tree or a neural
network-based system can be used to guide the customer service
call operator to make the right decisions for the company, in a
consistent manner.
Telecom
Network Failure Management
• Failure of telecom networks due to technical faults or malicious
attacks can have devastating impacts on people, businesses, and
society. In telecom infrastructure, some equipment will likely fail
with a certain mean time between failures. Modeling the failure
pattern of various components of the network can help with
preventive maintenance and capacity planning.
Fraud Management
• There are many kinds of fraud in consumer transactions.
Subscription fraud occurs when a customer opens an account with
the intention of never paying for the services. Superimposition fraud
involves illegitimate activity by a person other than the legitimate
account holder. Decision rules can be developed to analyze each CDR
in real time to identify chances of fraud and take effective action.
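The idea of applying decision rules to each CDR can be sketched in a few lines of Python. This is only an illustration: the record fields, the thresholds (30 days, 200 currency units, 60 minutes), and the names `CallRecord` and `fraud_flags` are all assumptions made for the example, not part of any telecom system described in the slides.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """A simplified call detail record; all field names are hypothetical."""
    account_id: str
    minutes: float
    destination_country: str
    account_age_days: int
    unpaid_balance: float


def fraud_flags(cdr, usual_countries=frozenset({"US"})):
    """Apply hand-written decision rules to one CDR and return the rules that fired."""
    flags = []
    # Rule 1 (assumed thresholds): a very new account that already carries a
    # large unpaid balance may point to subscription fraud.
    if cdr.account_age_days < 30 and cdr.unpaid_balance > 200:
        flags.append("possible subscription fraud")
    # Rule 2 (assumed thresholds): a long call to a country the account holder
    # does not usually call may point to superimposition fraud.
    if cdr.destination_country not in usual_countries and cdr.minutes > 60:
        flags.append("possible superimposition fraud")
    return flags


normal_call = CallRecord("A1", 5.0, "US", 400, 0.0)
new_account_call = CallRecord("B2", 10.0, "US", 7, 350.0)
unusual_call = CallRecord("C3", 90.0, "XX", 900, 0.0)

for record in (normal_call, new_account_call, unusual_call):
    print(record.account_id, fraud_flags(record))
```

In practice such rules would be derived from historical CDRs rather than written by hand, and a flagged record would trigger a review rather than an automatic block.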
Customer Relationship Management
• A business exists to serve a customer.
• A happy customer becomes a repeat customer.
• A business should understand the needs and
sentiments of its customers to sell more of its
offerings to the existing customers, and also expand
the pool of customers it serves.
• BI applications can impact many aspects of
marketing.
existing customers. Scoring each customer on their likelihood to quit can
help the business design effective interventions, such as discounts or free
services to retain profitable customers in a cost-effective manner.
Maximize Customer Value
• Every contact with the customer should be seen as an opportunity to
gauge their current needs. Offering a customer new products and
solutions based on those imputed needs can help increase revenue per
customer. Even a customer complaint can be seen as an opportunity to
wow the customer. Using the knowledge of the customer's history and
value, the business can choose to sell a premium service to the customer.
Customer Relationship Management
Identify and Delight Highly-Valued Customers
• By segmenting the customers, the best customers can be identified. They
can be proactively contacted, and delighted, with greater attention and
better service. Loyalty programs can be managed more effectively.
Manage Brand Image
• A business can create a listening post to listen to social media chatter
about itself. It can then do sentiment analysis of the text to understand
the nature of comments, and respond appropriately to the prospects and
customers.
• Decision models using decision trees can be created to assess the impact of events
on changes in market volume and prices. Monetary policy changes (such as a
Federal Reserve interest rate change) or geopolitical changes (such as war in a part
of the world) can be factored into the predictive model to help take action with
greater confidence and less risk.
Identify and Prevent Fraudulent Activities in Trading
• There have unfortunately been many cases of insider trading, leading to many
prominent financial industry stalwarts going to jail. Fraud detection models seek
out-of-the-ordinary activities, and help identify and flag fraudulent activity
patterns.
Insurance
This industry is a prolific user of prediction models in pricing
insurance proposals and managing losses from claims against
insured assets.
Forecast Claim Costs for Better Business Planning
• When natural disasters, such as hurricanes and earthquakes
strike, loss of life and property occurs. By using the best
available data to model the likelihood (or risk) of such events
happening, the insurer can plan for losses and manage
resources and profits effectively.
Determine Optimal Rate Plans
• Pricing an insurance rate plan requires covering the potential
losses and making a profit. Insurers use actuarial tables to
project life spans and disease tables to project mortality rates,
and thus price themselves competitively yet profitably.
Insurance
Optimize Marketing to Specific Customers
• By micro-segmenting potential customers, a data-savvy
insurer can cherry-pick the best customers and leave the less
profitable customers to its competitors. Progressive Insurance
is a US-based company that is known to actively use data
mining to cherry-pick customers and increase its profitability.
Identify and Prevent Fraudulent Claim Activities
• Patterns can be identified as to where and what kinds of fraud
are more likely to occur. Decision-tree-based models can be
used to identify and flag fraudulent claims.
diagnosis as much of an art form as it is science. Systems, such as IBM Watson,
absorb all the medical research to date and make probabilistic diagnoses.
Treatment Effectiveness
• The prescription of medication and treatment is also a difficult choice out of so
many possibilities. For example, there are more than 100 medications for
hypertension (high blood pressure) alone. There are also interactions in terms of
which drugs work well with others and which drugs do not.
Wellness Management
• This includes keeping track of patients' health records, analyzing customer health
trends, and proactively advising them to take any needed precautions.
Education
As higher education becomes more expensive and
competitive, it becomes a great user of data-based
decision-making. There is a strong need for efficiency,
increasing revenue, and improving the quality of
student experience at all levels of education.
• Billions of financial transactions happen around the world every day.
Exception-seeking models can identify patterns of fraudulent transactions.
For example, if money is being transferred to an unrelated account for the
first time, it could be a fraudulent transaction.
Maximize Customer Value
• Selling more products and services to existing customers is often the
easiest way to increase revenue. A checking account customer in good
standing could be offered home, auto, or educational loans on more
favorable terms than other customers, and thus the value generated from
that customer could be increased.
Public Sector
• Governments gather a large amount of data
by virtue of their regulatory function.
• That data could be analyzed for developing
models of effective functioning.
• There are innumerable applications that can
benefit from mining that data.
• A couple of sample applications are shown
here.
Public Sector
Law Enforcement
• Social behavior is a lot more patterned and predictable than one
would imagine.
Scientific Research
• Any large collection of research data is amenable to being mined for
patterns and insights. Protein folding (microbiology), nuclear
reaction analysis (sub-atomic physics), and disease control (public
health) are some examples where data mining can yield powerful
new insights.
facilitate distributed access to up-to-date business knowledge for
departments and functions, thus improving business efficiency and
customer service.
2. DW can present a competitive advantage by facilitating decision making
and helping reform business processes.
3. DW enables a consolidated view of corporate data, all cleaned and
organized. Thus, the entire organization can see an integrated view of
itself.
4. DW thus provides better and timely information. It simplifies data
access and allows end users to perform extensive analysis.
Design Considerations for DW
The objective of DW is to provide business knowledge to support decision
making. For DW to serve its objective, it should be aligned around those
decisions. It should be comprehensive, easy to access, and up-to-date.
Here are some requirements for a good DW:
1. Subject-oriented: To be effective, DW should be designed around a
subject domain, that is, to help solve a certain category of problems.
2. Integrated: DW should include data from many functions that can
shed light on a particular subject area. Thus, the organization can
benefit from a comprehensive view of the subject area.
3. Time-variant (time series): The data in DW should grow at daily or
other chosen intervals. That allows up-to-date comparisons over time.
4. Nonvolatile: DW should be persistent, that is, it should not be
created on the fly from the operational databases. Thus, DW is
consistently available for analysis, across the organization and over
time.
Design Considerations for DW
5. Summarized: DW contains rolled-up data at the right level for queries and
analysis. The rolling up helps create consistent granularity for effective
comparisons. It helps reduce the number of variables or dimensions of the
data to make them more meaningful for the decision makers.
6. Not normalized: DW often uses a star schema, which is a rectangular central
table, surrounded by some lookup tables. The single-table view significantly
enhances speed of queries.
7. Metadata: Many of the variables in the database are computed from other
variables in the operational database. For example, total daily sales may be a
computed field. The method of its calculation for each variable should be
effectively documented. Every element in DW should be sufficiently well-
defined.
8. Near real-time and/or right-time (active): DWs should be updated in near
real-time in many high-transaction volume industries, such as airlines. The
cost of implementing and updating DW in real time could discourage others.
Another downside of real-time DW is the possibility of inconsistencies in
reports drawn just a few minutes apart.
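The star schema and the summarized, computed fields described in the requirements above can be illustrated with a small sketch using Python's built-in `sqlite3` module. The table names (`fact_daily_sales`, `dim_store`, `dim_product`), the columns, and the sample rows are all hypothetical; a real warehouse would be far larger and loaded from operational systems.

```python
import sqlite3

# In-memory database holding a toy star schema: one central fact table
# surrounded by two lookup (dimension) tables. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_daily_sales (
    sale_date TEXT,
    store_id INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    total_sales REAL  -- computed field: sales already rolled up to one row per day
);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(1, "Austin"), (2, "Boston")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(10, "Umbrellas"), (11, "Ponchos")])
conn.executemany("INSERT INTO fact_daily_sales VALUES (?, ?, ?, ?)", [
    ("2024-01-01", 1, 10, 120.0),
    ("2024-01-01", 2, 10, 80.0),
    ("2024-01-02", 1, 11, 50.0),
    ("2024-01-02", 2, 10, 30.0),
])

# A typical analytical query: join the fact table to one lookup table
# and roll the daily figures up to a total per city.
rows = conn.execute("""
    SELECT s.city, SUM(f.total_sales)
    FROM fact_daily_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(rows)
```

Keeping the measures in one wide fact table means most queries need only one or two simple joins, which is the speed benefit the "not normalized" requirement refers to.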
DW Architecture
• DW has four key elements, shown in the figure
below.
• Similarly, “people with blood pressure greater than
160 and an age greater than 65 were at a high risk of
dying from a heart stroke” is of great diagnostic value
for doctors, who can then focus on treating such
patients with urgent care and great sensitivity.
• There are also streams of machine-generated data from connected
machines, RFID tags, the internet of things, and so on. The data
should be put in rectangular data shapes with clear columns and
rows before submitting it to data mining.
• Knowledge of the business domain helps select the right streams of
data for pursuing new insights. Data that suits the nature of the
problem being solved should be gathered. The data elements
should be relevant, and suitably address the problem being solved.
Data Cleansing and Preparation
• The quality of data is critical to the success and value of the data
mining project.
• Otherwise, the situation will be of the kind of garbage in and
garbage out (GIGO).
• The quality of incoming data varies by the source and nature of
data. Data from internal operations is likely to be of higher quality,
as it will be accurate and consistent.
• Data from social media and other public sources is less under the
control of business, and is less likely to be reliable.
• Data almost certainly needs to be cleansed and transformed before
it can be used for data mining.
• There are many ways in which data may need to be cleansed: filling
missing values, reining in the effects of outliers, transforming fields,
and many more. Data cleansing and preparation is a labor-intensive
or semi-automated activity that can take up to 60 to 70 percent of
the time needed for a data mining project.
Data Cleansing and Preparation
1. Duplicate data needs to be removed.
2. Missing values need to be filled in, or those rows should be removed
from analysis. Missing values can be filled in with average or modal or
default values.
3. Data elements may need to be transformed from one unit to another.
For example, total costs of health care and the total number of
patients may need to be reduced to cost/patient to allow
comparability of that value.
4. Continuous values may need to be binned into a few buckets to help
with some analyses. For example, work experience could be binned as
low, medium, and high.
5. Data elements may need to be adjusted to make them comparable
over time. For example, currency values may need to be adjusted for
inflation; they would need to be converted to the same base year for
comparability. They may need to be converted to a common currency.
Data Cleansing and Preparation
6. Outlier data elements need to be removed after careful review, to avoid
the skewing of results. For example, one big donor could skew the analysis
of alumni donors in an educational setting.
7. Any biases in the selection of data should be corrected to ensure the data
is representative of the phenomena under analysis. If the data includes
many more members of one gender than is typical of the population of
interest, then adjustments need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability.
Sales data may be available daily, but the salesperson compensation data
may only be available monthly. To relate these variables, the data must be
brought to the lowest common denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data
may not show much variability, because it was not properly recorded or
for any other reasons. This data may dull the effects of other differences in
the data and should be removed to improve the information density of
the data.
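A few of the preparation steps listed above (removing duplicates, filling missing values with an average, converting totals to a per-unit ratio, and binning a continuous value) can be sketched in plain Python. The field names, the sample rows, and the bin boundaries of 3 and 10 years are assumptions chosen only for illustration.

```python
from statistics import mean

# Hypothetical raw records: one exact duplicate and one missing value.
raw = [
    {"id": 1, "experience_years": 2.0, "total_cost": 1000.0, "patients": 10},
    {"id": 1, "experience_years": 2.0, "total_cost": 1000.0, "patients": 10},
    {"id": 2, "experience_years": None, "total_cost": 3000.0, "patients": 20},
    {"id": 3, "experience_years": 12.0, "total_cost": 500.0, "patients": 5},
]


def remove_duplicates(rows):
    """Step 1: keep only the first copy of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, field):
    """Step 2: replace missing values of a field with the average of the known values."""
    default = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]


def bin_experience(years):
    """Step 4: bin a continuous value into low / medium / high (assumed cut-offs)."""
    if years < 3:
        return "low"
    if years < 10:
        return "medium"
    return "high"


def prepare(rows):
    rows = fill_missing(remove_duplicates(rows), "experience_years")
    return [
        {
            **r,
            # Step 3: reduce two totals to a comparable cost-per-patient ratio.
            "cost_per_patient": r["total_cost"] / r["patients"],
            "experience_level": bin_experience(r["experience_years"]),
        }
        for r in rows
    ]


clean = prepare(raw)
for row in clean:
    print(row)
```

Filling with the average is only one of the options the list mentions; a modal or default value, or dropping the row, may suit other data better.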
Outputs of Data Mining
• Data mining techniques can serve different types of objectives. The
outputs of data mining will reflect the objective being served. There are
many representations of the outputs of data mining.
• One popular form of data mining output is a decision tree. It is a
hierarchically branched structure that helps visually follow the steps to
make a model-based decision. The tree may have certain attributes,
such as probabilities assigned to each branch. A related format is a set of
business rules, which are if-then statements that show causality. A
decision tree can be mapped to business rules. If the objective function
is prediction, then a decision tree or business rules are the most
appropriate mode of representing the output.
• The output can be in the form of a regression equation or mathematical
function that represents the best fitting curve to represent the data. This
equation may include linear and nonlinear terms. Regression equations
are a good way of representing the output of classification exercises.
These are also a good representation of forecasting formulae.
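The statement that a decision tree can be mapped to business rules can be made concrete with a short sketch. The tree below is a made-up churn example, and the nested-tuple representation and the function names `tree_to_rules` and `classify` are assumptions for illustration, not the output format of any particular tool.

```python
# A leaf is a string (the decision); an internal node is a pair of
# (attribute name, {attribute value: subtree}). The tree is hypothetical.
tree = ("contract", {
    "monthly": ("complaints", {"many": "high risk", "few": "medium risk"}),
    "annual": "low risk",
})


def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one if-then rule per path."""
    if isinstance(node, str):
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


def classify(node, record):
    """Follow the branches that match the record until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
print(classify(tree, {"contract": "monthly", "complaints": "few"}))
```

Each rule corresponds to exactly one path through the tree, so the rule set and the tree make the same decisions.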
Evaluating Data Mining Results
• There are two primary kinds of data mining processes:
supervised learning and unsupervised learning. In supervised
learning, a decision model can be created using past data, and
the model can then be used to predict the correct answer for
future data instances. Classification is the main category of
supervised learning activity.
• There are many techniques for classification, decision trees
being the most popular one. Each of these techniques can be
implemented with many algorithms.
• A common metric for all classification techniques is
predictive accuracy.
Predictive Accuracy = (Correct Predictions) / (Total Predictions)
Evaluating Data Mining Results
• Suppose a data mining project has been initiated to develop a
predictive model for cancer patients using a decision tree. Using a
relevant set of variables and data instances, a decision tree model has
been created.
• The model is then used to predict other data instances.
• When a true-positive data point is classified as positive, that is a correct
prediction, called a true positive (TP).
• When a true-negative data point is classified as negative, that is a true
negative (TN).
• When a true-positive data point is classified by the model as negative,
that is an incorrect prediction, called a false negative (FN).
• When a true-negative data point is classified as positive, that is an
incorrect prediction, called a false positive (FP).
• The table of these four counts is called the confusion matrix.
Evaluating Data Mining Results
• Thus, the predictive accuracy can be specified by the following formula.
Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN).
• All classification techniques have a predictive accuracy associated with a
predictive model. The highest value can be 100 percent. In practice,
predictive models with more than 70 percent accuracy can be considered
usable in business domains, depending upon the nature of the business.
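The predictive accuracy formula can be worked through on a tiny example. The labels below are invented for illustration, and the function names `confusion_counts` and `predictive_accuracy` are assumptions of this sketch.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for boolean labels (True = positive)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # positive case correctly predicted as positive
        elif not a and not p:
            tn += 1  # negative case correctly predicted as negative
        elif a and not p:
            fn += 1  # positive case missed by the model
        else:
            fp += 1  # negative case wrongly flagged as positive
    return tp, tn, fp, fn


def predictive_accuracy(tp, tn, fp, fn):
    """Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical ground truth and model output for eight data instances.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = predictive_accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy)
```

Here six of the eight predictions are correct, so the accuracy is 75 percent, which would clear the 70 percent threshold mentioned above.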
Data Mining Techniques
• Data may be mined to help make more efficient decisions in the future. Or
it may be used to explore the data to find interesting associative patterns.
The right technique depends upon the kind of problem being solved.
well as executives. They also show a high predictive accuracy.
2. They select the most relevant variables automatically out of all the
available variables for decision-making.
3. Decision trees are tolerant of data quality issues and do not require
much data preparation from the users.
4. Even nonlinear relationships can be handled well by decision trees.
(or far away) from each other are categorized into separate clusters.
• There can be any number of clusters that could be produced by the data.
The K-means technique is a popular technique and allows the user
guidance in selecting the right number (K) of clusters from the data.
• Clustering is also known as the segmentation technique. The technique
shows the clusters of things from past data. The output is the centroids
for each cluster and the allocation of data points to their cluster. The
centroid definition is used to assign new data instances to their
cluster homes.
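A minimal sketch of the K-means idea follows: it produces centroids and cluster allocations, then uses the centroids to assign a new data instance. The sample points are invented, and taking the first K points as starting centroids is a simplifying assumption of this sketch; practical implementations usually choose the starting centroids more carefully.

```python
from math import dist


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def k_means(points, k, iterations=10):
    """Alternate between allocating points to centroids and recomputing centroids."""
    centroids = list(points[:k])  # naive initialisation (assumption of this sketch)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        centroids = [
            # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)

# A new data instance is assigned to the cluster with the nearest centroid.
print(nearest_centroid((0.0, 0.0), centroids))
```

The user still has to supply K; trying several values and comparing the resulting clusters is a common way to choose it.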
Cluster analysis
commerce sites like Amazon.com and streaming movie sites like
Netflix.com.
• The technique helps find interesting relationships (affinities)
between variables (items or events). These are represented as
rules of the form X ⇒ Y, where X and Y are sets of data items.
• A form of unsupervised learning, it has no dependent variable;
and there are no right or wrong answers. There are just stronger
and weaker affinities. Thus, each rule has a confidence level
assigned to it.
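The confidence level of a rule X ⇒ Y can be computed directly from transaction data. In the sketch below, support (the fraction of transactions containing an itemset) is the standard companion measure, introduced here because confidence is defined in terms of it; the baskets and the function names are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of X => Y: support(X and Y together) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
]

milk_to_bread = confidence({"milk"}, {"bread"}, transactions)
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
print(milk_to_bread, bread_to_milk)
```

Every basket with milk also contains bread, so milk ⇒ bread has a confidence of 1.0, while bread ⇒ milk reaches only 0.5: the same pair of items yields a stronger affinity in one direction than in the other.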
Association rules
1. There are simple end-user data mining tools, such as MS Excel, and there
are more sophisticated tools, such as IBM SPSS Modeler.
2. There are stand-alone tools, and there are tools embedded in an existing
transaction processing or data warehousing or ERP system.
3. There are open-source and freely available tools, such as Weka, and
there are commercial products.
4. There are text-based tools that require some programming skills, and
there are Graphical User Interface (GUI)-based drag-and-drop format
tools.
5. There are tools that work only on proprietary data formats, and there
are those that directly accept data from a host of popular data management
tool formats.
Comparison of popular data mining platforms
Ownership | Commercial | Commercial, expensive | Open-source, free
Data mining features | Limited, extensible with add-on modules | Extensive features, unlimited data sizes | Extensive, performance issues with large data
Stand-alone or embedded | Stand-alone | Embedded in BI software suites | Stand-alone
User skills needed | End users | Skilled BI analysts | Skilled BI analysts
User interface | Select and click, easy | Drag-and-drop use, colorful, beautiful GUI | GUI, mostly b&w text output
Data formats | Industry standard | Variety of data sources accepted | Proprietary
Data Mining Best Practices
• Effective and successful use of data mining activity requires both
business and technology skills. The business aspects help one
understand the domain and the key questions. They also help one
imagine possible relationships in the data and create hypotheses to
test them. The IT aspects help fetch the data from many sources, clean
up the data, assemble it to meet the needs of the business
problem, and then run the data mining techniques on the platform.
• An important element is to go after the problem iteratively. It is
better to divide and conquer the problem with smaller amounts of
data, and get closer to the heart of the solution in an iterative
sequence of steps. There are several best practices learned from
the use of data mining techniques over a long period of time.
• The data mining industry has proposed a Cross-Industry Standard
Process for Data Mining (CRISP-DM).
• It has six essential steps:
Data Mining Best Practices
• CRISP-DM data mining cycle
available and required.
3. The data should be clean and of high quality. It is important to assemble a team that has a mix of
technical and business skills, who understand the domain and the data. Data cleaning can take 60 to 70
percent of the time in a data mining project. It may be desirable to add new data elements from external
sources of data that could help improve predictive accuracy.
4. Patience is required in continuously engaging with the data until the data yields some good insights. A
host of modeling tools and algorithms should be used. A tool could be tried with different options, such
as running different decision tree algorithms.
5. One should not accept what the data says at first. It is better to triangulate the analysis by applying
multiple data mining techniques and conducting many what-if scenarios, to build confidence in the
solution. Evaluate the model’s predictive accuracy with more test data.
6. The dissemination and rollout of the solution is the key to project success. Otherwise the project will be
a waste of time and will be a setback for establishing and supporting a data-based decision-process
culture in the organization. The model should be embedded in the organization’s business processes.
Myths about Data Mining
There are many myths about this area, scaring away many business executives
from using data mining.
Myth #1: Data mining is about algorithms: Data mining is used by business to
answer important and practical business questions. Formulating the problem
statement correctly and identifying imaginative solutions for testing are far more
important before the data mining algorithms get called in.
Myth #2: Data mining is about predictive accuracy: While important, predictive
accuracy is a feature of the algorithm. As in myth #1, the quality of output is a
strong function of the right problem, right hypothesis, and the right data.
Myth #3: Data mining requires a data warehouse: While the presence of a data
warehouse assists in the gathering of information, sometimes the creation of the
data warehouse itself can benefit from some exploratory data mining.
Myth #4: Data mining requires large quantities of data: Many interesting data
mining exercises are done using small- or medium-sized data sets.
Myth #5: Data mining requires a technology expert: Many interesting data mining
exercises are done by end users and executives using simple everyday tools like
spreadsheets.
Data Mining Mistakes
or having no goals, data mining leads to a waste of time. Getting the right answer
to an irrelevant question could be interesting, but it would be pointless.
Mistake #2: Buried under mountains of data without clear metadata: It is more
important to be engaged with the data, than to have lots of data. The relevant
data required may be much less than initially thought. There may be insufficient
knowledge about the data or metadata.
Mistake #3: Disorganized data mining: Without clear goals, much time is wasted.
Doing the same tests using the same mining algorithms repeatedly and blindly,
without thinking about the next stage, without a plan, would lead to wasted time
and energy. This can come from being sloppy about keeping track of the data
mining procedure and results.
Data Mining Mistakes
Mistake #4: Insufficient business knowledge: Without a deep understanding of the business
domain, the results would be gibberish and meaningless. Do not make erroneous assumptions,
courtesy of experts. Do not rule out anything when observing data analysis results. Do not ignore
suspicious (good or bad) findings and quickly move on. Be open to surprises. Even when insights
emerge at one level, it is important to slice and dice the data at other levels to see if more
powerful insights can be extracted.
Mistake #5: Incompatibility of data mining tools: All the tools from data gathering, preparation,
mining, and visualization should work together.
Mistake #6: Locked in the data jailhouse: Use tools that can work with data from multiple sources
in multiple industry standard formats.
Mistake #7: Looking only at aggregated results and not at individual records/predictions. It is
possible that the right results at the aggregate level provide absurd conclusions at an individual
record level.
Mistake #8: Running out of time: Not leaving sufficient time for data acquisition, selection, and
preparation can lead to data quality issues and GIGO. Similarly not providing enough time for
testing the model, training the users and deploying the system can make the project a failure.
Mistake #9: Measuring your results differently from the way your sponsor measures them: This
comes from losing a sense of business objectives and beginning to mine data for its own sake.
Mistake #10: Naively believing everything you are told about the data: Also naively believing
everything you are told about your own data mining analysis.
Data Visualization
Objectives for graphical excellence (Data Visualization)
1. Show, and even reveal, the data: The data should tell a story, especially a story
hidden in large masses of data. However, reveal the data in context, so the
story is correctly told.
2. Induce the viewer to think of the substance of the data: The format of the
graph should be so natural to the data, that it hides itself and lets data shine.
3. Avoid distorting what the data have to say: Statistics can be used to lie. In
the name of simplifying, some crucial context could be removed leading to
distorted communication.
4. Make large data sets coherent: By giving shape to data, visualizations can
help bring the data together to tell a comprehensive story.
5. Encourage the eyes to compare different pieces of data: Organize the chart in
ways the eyes would naturally move to derive insights from the graph.
6. Reveal the data at several levels of detail: Graphs lead to insights, which
raise further curiosity, and thus presentations should help get to the root
cause.
Types of Charts
• Line graph. This is the most basic and most popular way of displaying information.
It shows data as a series of points connected by straight line segments. When
working with time-series data, time is usually shown on the x-axis. Multiple
variables can be represented on the same scale on the y-axis to compare
the line graphs of all the variables.