OrangeIT PDF
1
Conceptual Basis
2
Data Mining
3
Phases of Data Mining
4
Data Mining Techniques
● Supervised
● Unsupervised
5
Supervised
These techniques usually rely on an attribute called the class, which records
whether or not an instance belongs to a certain concept.
● Prediction: regression, prediction trees and kernel estimators.
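As a minimal sketch of what "guided by a class" means, the snippet below labels a new row with the class of its closest labelled row. The 1-nearest-neighbour rule, the data values and the class names "low" and "high" are assumptions chosen for illustration; they do not come from the slides.

```python
import math

# Toy labelled data: each row is (attributes, class). Values are made up.
training = [
    ((1.0, 1.2), "low"),
    ((0.8, 1.0), "low"),
    ((3.1, 2.9), "high"),
    ((3.4, 3.2), "high"),
]


def predict(attributes):
    """Return the class of the closest training row (1-nearest neighbour)."""
    nearest = min(training, key=lambda row: math.dist(row[0], attributes))
    return nearest[1]


print(predict((0.9, 1.1)))
print(predict((3.0, 3.0)))
```

The class column is what makes this supervised: without it there would be nothing for `predict` to return.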
6
Unsupervised
These techniques do not supervise the process: the instances are organized without the
guidance of a class attribute.
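To illustrate grouping without a class attribute, here is a small k-means sketch. K-means, the starting centroids and the data values are assumptions picked for the example; the slides do not name a specific unsupervised algorithm at this point.

```python
import math

# Unlabelled toy data: no class column. Values are made up.
points = [(1.0, 1.2), (0.8, 1.0), (3.1, 2.9), (3.4, 3.2)]


def kmeans(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its group mean."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return groups


# Fixed starting centroids keep the run deterministic.
groups = kmeans(points, centroids=[points[0], points[2]])
print(groups)
```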
7
Some Data Mining Tools...
8
Product Details
License: Open Source Software
Price: Free
URL: https://www.r-project.org/
● Latest R-language engine for statistical computing
● Open source, R-Enterprise, R-Cloud (paid version)
● Data visualization and analysis up to 16 TB
● Extended capabilities with reproducible R Tool Kits
● Windows, Mac OS and variants of Linux
9
Useful Functions
● Graphics Visualization
● Spatial Data Analysis
● Clustering
● Text Mining
● Social Network Analysis and Graph mining
● Statistics
● Data Manipulation
Success Stories
● Bank of America
● Bing
● Facebook
● Ford
● Google
10
Product Details
License: Open Source Software
Price: Free
URL: https://www.cs.waikato.ac.nz/ml/weka/index.html
● Open source
● A collection of machine learning algorithms
● Data visualization and analysis
● Java-based platform
● Users: mostly researchers and practitioners
11
Features
● GNU General Public License
● GUI for interaction
● Core tasks including data pre-processing, classification, regression,
clustering, association rules and visualization
● Works with data files on multiple platforms
Success Stories
● The Weka mailing list has over 1100
subscribers in 50 countries, including
subscribers from many major companies
such as Rechtsportal
12
Product Details
License: Proprietary
Price: Contact for pricing
Free Trial
URL: https://rapidminer.com
● Open source
● Data visualization and analysis
● Machine learning
● Data mining, text mining
● Business intelligence
● Works on the Java runtime
● Available on all major operating systems and platforms
Company Details
● Started as YALE in 2001
● Renamed RapidMiner in 2006
● Licensed under the AGPL
13
Features
● A visual, code-free environment, so no programming is needed
● Design of analysis processes
● Pre-made templates for predictive analysis
● Data loading
● Data modeling
● Data visualization
● Allows you to work with different types and sizes of data
sources
● Modular operator concept
Success Stories
● CISCO
● PAYPAL
● EBAY
● VOLKSWAGEN
14
Product Details
License: Open Source Software
Price: Free
URL: https://orange.biolab.si/
● Open source
● Data visualization and analysis
● For novices and experts
● Scriptable through Python
● Available for all popular platforms, including Windows,
Mac OS X and variants of Linux
Company Details
● Founded in 1996
● Orange is distributed free under the GPL
● Maintained and developed at the Bioinformatics Laboratory of the Faculty of
Computer and Information Science, University of
Ljubljana, Slovenia
15
Features
● Visual Programming
● Visualization
● Interaction and Data Analytics
● Large Toolbox
● Scripting interface
● Extendable
● Documentation
● Open source
● Platform independence
16
What's Orange?
It is an open source tool for data analysis and visualization where data mining is done through
visual programming or Python code. It can be used through a nice and intuitive user interface or,
for more advanced users, as a module for the Python programming language.
17
Predictive Analysis
18
Why Orange?
19
Some Features...
● Interactive Data Visualization
● Visual Programming
20
Visual Programming
The graphical user interface lets you focus on exploratory data analysis instead of coding. Place
widgets on the canvas, connect them, load your datasets and harvest the insight!
21
Interactive Data Visualization
Perform simple data analysis with clever data visualization. Explore statistical distributions, box
plots, scatter plots and more.
22
Orange in the cloud
Use Orange remotely by running it on a remote server in a Docker container.
23
Add-ons Extend Functionality
Use the various add-ons available within Orange to mine data from external data sources, perform
natural language processing and text mining, conduct network analysis, infer frequent itemsets
and mine association rules. Additionally, bioinformaticians and molecular biologists can use
Orange to rank genes by their differential expression and perform enrichment analysis.
24
Workflows in Orange
Orange workflows consist of components, called widgets, that read, process and visualize data.
Widgets communicate by sending information along a communication channel: the output
of one widget is used as the input to another.
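The idea of one widget's output feeding another's input can be sketched in plain Python. The `Widget` class below is a hypothetical stand-in written for this example and is not Orange's own API; only the labels "File", "Select Rows" and "Data Table" borrow the names of Orange widgets, and the rows are made-up data.

```python
class Widget:
    """Hypothetical stand-in for a workflow widget: one input, one output."""

    def __init__(self, name, process):
        self.name = name
        self.process = process   # function applied to the incoming data
        self.outputs = []        # downstream widgets
        self.result = None

    def connect(self, downstream):
        """Link this widget's output to another widget's input."""
        self.outputs.append(downstream)
        return downstream        # returning it allows chained connections

    def receive(self, data):
        """Process the input and pass the result along every connection."""
        self.result = self.process(data)
        for widget in self.outputs:
            widget.receive(self.result)


rows = [{"name": "a", "age": 25}, {"name": "b", "age": 41}, {"name": "c", "age": 37}]

load = Widget("File", lambda _: rows)
select = Widget("Select Rows", lambda data: [r for r in data if r["age"] > 30])
table = Widget("Data Table", lambda data: data)

load.connect(select).connect(table)
load.receive(None)               # trigger the workflow from its source widget
print([r["name"] for r in table.result])
```

Changing the filter in the middle widget and triggering the source again updates everything downstream, which is roughly how a canvas of connected widgets behaves.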
25
Widgets
33
Benefits and Limitations
Benefits
● Unsupervised
● Free and cross-platform software.
● Allows visual and versatile programming for data analysis.
● It is user-friendly and intuitive.
● It is open to all types of users, whether apprentice or advanced.
Limitations
● Workflows: Orange works through widgets, grouped according to their functionality.
● Connections:
○ Mysql-python
○ PostgreSQL
● Python dependency
34
Business Organizations using Orange
Data Mining and Orange
35
Wärtsilä (1/5)
Wärtsilä is a global leader in smart
technologies and complete lifecycle solutions
for the marine and energy markets. It
emphasises sustainable innovation, total
efficiency and data analytics.
36
Wärtsilä (2/5)
The company wanted to know how to identify dissatisfied employees and provide incentives for them to stay.
Their dataset has 1470 instances (employees) and 18 features describing them. The target variable is Attrition, where Yes means
the person left the company and No means they stayed.
37
Wärtsilä (3/5)
Using Orange, they built a predictive model that estimates
the likelihood of a person leaving.
38
Wärtsilä (4/5)
They displayed the top ten features, ranked by their
contribution to the final probability of a class.
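One way to read "contribution to the final probability" is through a linear model on the log-odds scale, where each feature adds weight × value. The sketch below assumes such a model; the weights, the bias and the feature name JobSatisfaction are hypothetical values invented for the example and are not results from the Wärtsilä study.

```python
import math

# Hypothetical log-odds weights of a linear attrition model (made-up values).
weights = {"OverTime": 1.5, "YearsAtCompany": -0.2, "JobSatisfaction": -0.4}
bias = -1.0


def explain(employee):
    """Return P(Attrition = Yes) and the features ranked by absolute contribution."""
    contributions = {name: weights[name] * employee[name] for name in weights}
    log_odds = bias + sum(contributions.values())
    probability = 1 / (1 + math.exp(-log_odds))
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return probability, ranked


# Hypothetical employee: works overtime, one year at the company.
employee = {"OverTime": 1, "YearsAtCompany": 1, "JobSatisfaction": 2}
probability, ranked = explain(employee)

print(round(probability, 3))
for name, value in ranked:
    print(name, round(value, 2))
```

Positive contributions push the probability of leaving up and negative ones pull it down, so the ranked list shows which features matter most for one specific person.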
39
Wärtsilä (5/5)
They determined that they need to optimize their decisions to increase employee satisfaction
while keeping costs low.
The conclusion was: it seems that John (one of the best workers) is the most likely to leave. He has been at
the company for only a year and he works overtime. This is something the HR department can work with to
design proper policies and keep the best talent.
RESULTS.
40
SMIS SOLEIL (1/2)
SOLEIL is both an electromagnetic radiation
source covering a wide range of energies
(from the infrared to the x-rays) and a
research laboratory at the cutting edge of
experimental techniques dedicated to matter
analysis down to the atomic scale.
41
SMIS SOLEIL (2/2)
They have been using Orange for many scientific projects in their research areas.
42
Telekom Slovenije (1/4)
The Telekom Slovenije Group is the national
telecommunications operator in Slovenia.
Its main activities include fixed and mobile
communications services, cloud computing
services, and the construction and maintenance of
telecommunications networks.
43
Telekom Slovenije (2/4)
The company needed to know
how to segment its
customers. The data used was
the Telecom customer churn
dataset.
44
Telekom Slovenije (3/4)
Then comes the design of a standard
hierarchical clustering workflow.
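For readers unfamiliar with the technique, hierarchical (agglomerative) clustering can be sketched as repeatedly merging the two closest clusters. The single-linkage distance, the two customer attributes and the data values below are assumptions made for illustration; the slides do not state which linkage or features the workflow used.

```python
import math
from itertools import combinations

# Hypothetical customers described by (monthly call minutes, data use in GB).
customers = [(1, 0), (2, 0), (10, 5), (11, 5), (20, 20)]


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of their members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(j)     # j > i, so index i stays valid after the pop
        clusters[i].extend(merged)
    return clusters


clusters = agglomerate(customers, n_clusters=3)
print(clusters)
```

Recording the order of the merges instead of stopping at a fixed count would give the dendrogram that hierarchical clustering tools typically display.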
45
Telekom Slovenije (4/4)
They displayed the top ten features, ranked by their contribution to the final probability of a class. Based on
the results, they determined that cluster C1 does not use internet (or at least does not buy internet from them).
C3, on the other hand, normally has the whole package. They will now focus marketing on cluster C1 and offer
discounted internet packages. Marketing to C3 would be essentially useless.
46
Examples...
❖ Interactive Data Visualization
❖ Data Processing
❖ Test and Evaluation
❖ Data modeling and Prediction
❖ Image Analytics
❖ Python
47