Lecture 1 - Syllabus, Master's Program, Data Mining I 2016/17 v2
SYLLABUS
2016-2017
1
INSTRUCTOR FERNANDO LUCAS BAÇÃO
INFORMATION 2nd floor, room 10
Phone: 21 3870413 (ext. 222)
[email protected]
http://www.novaims.unl.pt/fbacao/
FREDERICO JESUS, VASCO JESUS AND JOÃO SANTOS
[email protected]; [email protected]
SCHEDULE Tuesdays 18:30h – 20:15h and 20:30h – 22:15h
OFFICE HOURS Tuesdays, 17:00h – 18:00h (schedule an appointment by email)
2nd Floor, Room 10
CONTACT The course has its own email address [email protected], which
should be used by the student to contact the teachers as well as to
submit any homework and projects.
DESCRIPTION The Data Mining course aims to study the main methods and tools
available in data mining (knowledge discovery in databases), in
particular descriptive models. The course does not assume that the
student is familiar with the subject, but it is highly recommended
that the student have knowledge of inferential statistics, as well as
basic computer skills.
The course seeks a middle ground between courses dedicated to in-depth
analysis of the algorithms and courses for managers, where the goal
is to raise awareness of the importance of the tools. This is
a technical course for all who work or seek to work on developing
descriptive models and exploring large databases. As such, during the
course, students will carry out the activities of a typical data analyst,
so practice is a central component of the course.
The main concern in this course is to present the algorithms in a clear
and comprehensible way to a wide audience with different academic
backgrounds. Students are expected to understand the
fundamentals of the inner workings of the different
methods, because only then will they be able to apply them
judiciously.
The course program covers the main methodological aspects as well
as the most used tools, including visualization tools, algorithms for
clustering, association rules and link analysis, among others. The course
also gives students the opportunity to use the Enterprise
Miner software from SAS Institute, so that they can develop the
practical skills related to the use of these tools.
OBJECTIVES At the end of the course, students should be able to:
• Discuss the most relevant ideas and concepts associated with
data mining;
• Execute basic and intermediate data preparation and pre-processing tasks;
Fernando Lucas Bação
DATA MINING I 2016/2017
• Describe the principles and execute an RFM analysis;
• Describe in detail the hierarchical, k-means and self-organizing map
algorithms;
• Create a segmentation, explaining the options chosen and the
alternatives, whenever available;
• Describe the Apriori algorithm and how association rules are
generated;
• Calculate and explain the most relevant performance
measures of association rules.
COURSE SUCCESS In this course success depends on a number of factors:
• Basic knowledge of statistics;
• Attend classes;
• Work during the semester and not only when exams are
about to start;
• Develop the course project during the semester, making the
most of the practical classes;
• Read the suggested references.
CONTENTS 1. The context for analytics
a. The data deluge
b. Information as a strategic resource
c. Data-driven decision making
d. The relation between analytics and company
performance
e. Data analytic thinking
f. Data mining and data science
2. Business problems and analytical solutions
a. From business problems to data mining tasks
b. Supervised versus unsupervised methods
i. Knowledge discovery (Clustering and Summary)
ii. Predictive Modeling (Classification and Regression)
c. The data mining process
i. Business understanding
ii. Data understanding
iii. Data preparation
iv. Modeling
v. Evaluation
vi. Deployment
3. Data visualization
a. Motivation
b. Guidelines for presenting information
c. Graphics for presentation
d. Graphics for analysis
4. Data preparation and preprocessing
a. Motivation
b. Types of measurements
c. Noise vs signal
d. Descriptive statistics
e. Variable Distribution
f. Outliers
g. Missing data
h. Data discretization
i. Standardization
j. Transformations
k. Dimensionality reduction
i. Feature extraction and selection
ii. Business transformations
5. Cluster analysis
a. Motivation
b. Components of a Clustering Task
c. The User’s Dilemma and the Role of Expertise
d. History
e. Similarity Measures
6. Clustering techniques
a. Hierarchical Clustering Algorithms
b. Partitional Algorithms (k-means)
c. Fuzzy Clustering
d. Artificial Neural Networks (Self-Organizing Maps)
7. Analysis and validation of clustering solutions
a. The number of clusters
b. Analysis and profiling of the clustering solution
c. Classification trees
d. Validity of the solution
e. Supervised classification through k-nearest
neighbors
8. Association rules
a. Motivation
b. Apriori algorithm
c. Interpretation measures
d. Types of rules
e. Temporal extension
9. Introduction to network analysis
a. Structural importance
b. Degree centrality
c. Geometric centrality
10. Introduction to text mining
a. Distance functions
b. Clustering algorithms
c. Visualization techniques
BIBLIOGRAPHY References:
• Provost, F. and Fawcett, T. (2013) Data Science for Business.
O’Reilly Media, New York.
• Berry, M.J.A. and Linoff, G.S. (2004) Data Mining Techniques: For
Marketing, Sales, and Customer Relationship Management
(2nd Edition). Wiley. Chapters 1, 2, 3, 4, 5, 8 and 10.
• Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) Data
Clustering: A Review. ACM Computing Surveys.
• Course Notes, Enterprise Miner™: Applying Data Mining
Techniques.
Additional References:
• Mitchell, T. (1997) Machine Learning. McGraw Hill.
• Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of Data
Mining (Adaptive Computation and Machine Learning).
MIT Press.
• Kohonen, T. (1988) Self-Organization and Associative
Memory (2nd Edition). Springer-Verlag, New York.
Note: all references are available at the ISEGI-NOVA library or are
provided by the teacher.
STUDENT EVALUATION 1st Session – Exam (65%), Project (35%)
2nd Session – Exam (65%), Project (35%)
CALENDAR Lec. 1 13 Sep. Course presentation (Syllabus)
Evaluation
Course project
The context for analytics
The data deluge
Information as a strategic resource
Data-driven decision making
The relation between analytics and
company performance
Data analytic thinking
Data mining and data science
Lec. 2 20 Sep. Business problems and analytical solutions
From business problems to data mining
tasks
Supervised versus unsupervised methods
Knowledge discovery
(Clustering and Summary)
Predictive Modeling
(Classification and Regression)
The data mining process
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Lec. 3 27 Sep. Data visualization
Motivation
Guidelines for presenting information
Graphics for presentation
Graphics for analysis
Lec. 4 4 Oct. Data preparation and preprocessing
Motivation
Types of measurements
Noise vs signal
Descriptive statistics
Variable Distribution
Outliers
Missing data
Data discretization
Standardization
Transformations
Dimensionality reduction
Feature extraction and
selection
Business transformations
Lec. 5 11 Oct. Practical Class Enterprise Miner
Lec. 6 18 Oct. Cluster analysis
Motivation
Components of a Clustering Task
The User’s Dilemma and the Role of Expertise
History
Similarity Measures
Clustering techniques
Hierarchical Clustering Algorithms
Partitional Algorithms (k-means)
Lec. 7 25 Oct. Clustering techniques
Fuzzy Clustering
Artificial Neural Networks (Self-Organizing Maps)
Lec. 8 8 Nov. Practical Class Enterprise Miner
Lec. 9 15 Nov. Analysis and validation of clustering solutions
The number of clusters
Analysis and profiling of the clustering
solution
Classification trees
Validity of the solution
Supervised classification through k-nearest neighbors
Lec. 10 22 Nov. Practical Class Enterprise Miner
Lec. 11 29 Nov. Association rules
Motivation
Apriori algorithm
Interpretation measures
Types of rules
Temporal extension
Lec. 12 6 Dec. Introduction to network analysis
Structural importance
Degree centrality
Geometric centrality
Introduction to text mining
Distance functions
Clustering algorithms
Visualization techniques
Lec. 13 13 Dec. Practical Class Enterprise Miner
Lec. 14 19 Dec. Practical Class Enterprise Miner
Course Projects
The course project is a practical project using SAS Enterprise Miner. In this project, students will
carry out the segmentation of a customer database, following all the usual steps of a real-world
project. For this, students will receive the data and a set of specific guidelines that they should
follow. The guidelines describe the type of tasks students should perform and
the general results they should achieve. The end product of the project should be a report about
the database and the different customer segments of the company. With this project, students should
develop their analytical skills, as well as their proficiency in working with large datasets, extract,
transform and load (ETL) tasks, visualization, and reporting conclusions.
Tasks. In both practical and theoretical classes, students will frequently be assigned homework,
which will consist of simple tasks related to the course material. Students are expected to
complete these tasks.
Final Exam. The exam will be a one-hour in-class exam covering all the course material. The
exam will consist of 15 multiple-choice questions, 5 true-or-false questions and a short essay.
Grading
Project: 35%
Exam: 65%
Both components of the evaluation are mandatory. There are two opportunities to take the exam.
Late delivery of the project is subject to a penalty of 10% of the grade for each day of
delay. Please note that the project will be developed in groups of no more
than 3 members. To pass the course, the student must score at least 8 (40%) on
the exam.
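As an illustration only, the grading rules above can be sketched in a few lines of Python. The sketch rests on assumptions that the syllabus does not state explicitly: grades are on a 0-20 scale (inferred from "8 (40%)"), the late penalty is linear (10% of the project grade per day, never going below zero), and the function names are hypothetical. The overall pass mark is not given in the syllabus, so it is not modelled.

```python
# Hypothetical helper names; weights and thresholds are taken from the syllabus.
EXAM_WEIGHT = 0.65          # Exam: 65%
PROJECT_WEIGHT = 0.35       # Project: 35%
LATE_PENALTY_PER_DAY = 0.10 # 10% of the project grade per day of delay
EXAM_MINIMUM = 8.0          # assumed 0-20 scale, since 8 is stated as 40%


def penalised_project_grade(project, days_late=0):
    """Project grade after the late penalty.

    Assumption: the penalty is linear (not compounding) and the
    resulting grade cannot drop below zero.
    """
    factor = max(0.0, 1.0 - LATE_PENALTY_PER_DAY * days_late)
    return project * factor


def final_grade(exam, project, days_late=0):
    """Weighted final grade: 65% exam plus 35% (penalised) project."""
    return (EXAM_WEIGHT * exam
            + PROJECT_WEIGHT * penalised_project_grade(project, days_late))


def meets_exam_minimum(exam):
    """True if the exam grade reaches the minimum of 8 (40%)."""
    return exam >= EXAM_MINIMUM
```

For example, `final_grade(12, 16, days_late=2)` weights an exam grade of 12 against a project grade of 16 reduced by 20%, while `meets_exam_minimum(7.5)` shows that a student below 8 on the exam does not meet the exam requirement, whatever the project grade.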