
2.1 Intro Statistical Learning 1

This document provides an overview of a data mining course. It introduces the instructor, Joseph Rebehmed, and provides information about grading, textbooks, and course description. The course will cover fundamental data mining techniques including machine learning, statistics, classification, and clustering. Active learning approaches will be used, including in-class activities, discussions, and applications. Reading the textbook and meeting deadlines are emphasized.


9/1/2021

Data Mining
BIF 524 - CSC 498

Data is the sword of the 21st century, those who wield it well,
the Samurai. – Jonathan Rosenberg


Before we start

• Instructor: Joseph Rebehmed
• Contact: [email protected]
• Office hours: TR, 9:00 – 11:00 AM; W, 5:30 – 7:30 PM; and by appointment (online)
• Lecture: MWF, 9:00 – 9:50 AM, AKSOB 1003; online via the Collaborate platform
• Grading (subject to 5% variation):
  • Midterm: 30%
  • Project: 25%
  • Final Exam: 35%
  • Participation: 10%

Textbook

https://www.statlearning.com/


Course Description
This course covers the fundamental techniques and applications
of data mining. Topics include concepts from:

• Machine learning
• Statistics
• Techniques and algorithms for parametric and non-parametric
classification, clustering, and classifier assessment
• Supervised vs. unsupervised learning
• Expert systems
• Graphical models

Course Description (2)

This course aims to provide a very applied overview of:

• Modern non-linear methods such as:
  • Generalized Additive Models
  • Decision Trees
  • Boosting and Bagging
  • Support Vector Machines

• More classical linear approaches such as:
  • Logistic Regression
  • Linear Discriminant Analysis
  • K-Means Clustering
  • Nearest Neighbors

• Many cases/data sets covered throughout the course, plus some additional
interesting applications and lab sessions


Teaching/Learning methods

• Active learning approaches; no more passive learners.

• The most important kind of learning comes from doing, not
from standing on the sidelines.

• In parallel to “Lectures”, this course makes extensive use of:
  • In-class group activities
  • Dialogues, discussions, and sharing of ideas
  • Reading the provided materials before class as lecture preparation

• Plenty of applications

Tips for success

• Actively participate in class.

• Don’t wait until the last minute to start your assignments or to
study for an exam.

• Please communicate with me if you have any
questions/difficulties/challenges.


Additional Remarks

• Reading the textbook is a must.
• Deadlines must be respected.
• Make-ups and incompletes: students are not automatically
entitled to make-ups; an F will be given until reasons (in writing and
within one week of the absence) are presented and approved.
• Some of the exam questions will be based on class discussion
and assignments.
• No mobile phones in the classroom.

Introduction


Introduction (2)

• Statistical learning refers to a set of tools for modelling and
understanding complex datasets.

• With the explosion of “Big Data” problems, statistical learning
has become a very hot field in many scientific areas (marketing,
finance, CS, biology, etc.).

• People with statistical learning skills are in high demand.

• Many companies are using Machine Learning in different and/or
cool ways.


Pinterest – Improved Content Discovery


Twitter – Curated Timelines


IBM – Better Healthcare


Statistical Learning Problems

• Identify the risk factors for prostate cancer.

• Predict whether someone will have a heart attack based
on demographic, diet and clinical measurements.

• Customize an email spam detection system.

• Classify a tissue sample into one of several cancer
classes, based on its gene expression profile.


Notation
• We use n to represent the number of distinct data points, or
observations, in our sample, and p to represent the number of variables.
• $x_{ij}$ represents the value of the jth variable for the ith observation,
where i = 1, 2, ..., n and j = 1, 2, ..., p.
• X denotes an n×p matrix whose (i, j)th element is $x_{ij}$.
• The input variables are typically denoted using the symbol X,
with a subscript to distinguish them. The inputs go by different
names, such as predictors, independent variables, features or
sometimes just variables.
• The output variable is often called the response or dependent
variable and is typically denoted using the symbol Y.
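
As a rough illustration of this notation, the sketch below stores a small data set as a list of rows. The numbers, the variable meanings, and the helper names `x` and `column` are all made up for the example; they do not come from the course material.

```python
# Hypothetical toy data: n = 4 observations (rows), p = 3 variables (columns).
X = [
    [63.0, 1.0, 120.0],
    [45.0, 0.0, 135.0],
    [52.0, 1.0, 110.0],
    [38.0, 0.0, 128.0],
]
Y = [1, 0, 0, 1]  # one response value per observation (also made up)

n = len(X)     # number of observations
p = len(X[0])  # number of variables


def x(i, j):
    """Return x_ij using the 1-based indices of the slides."""
    return X[i - 1][j - 1]


def column(j):
    """Return the n observed values of the j-th input variable, X_j."""
    return [row[j - 1] for row in X]


print(n, p)       # shape of the n x p matrix
print(x(2, 3))    # third variable of the second observation
print(column(1))  # all values of the first variable
```

Python lists are 0-indexed, so the helpers subtract 1 to match the i = 1, ..., n and j = 1, ..., p convention used above.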


What is Statistical Learning?

• In ML, we have a large set of inputs X and corresponding
outputs Y, but not the function f(X).
• We believe that there is a relationship between Y and at least
one of the X’s.
• The goal is to find/model the relationship as:

$Y_i = f(X_i) + \varepsilon_i$

• Where f is some fixed but unknown function and ε is a random
error term, which is independent of X with mean zero.
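
To make the model concrete, here is a minimal simulation sketch. It assumes a particular f (a scaled sine curve) and Gaussian errors, neither of which is given in the slides; in a real problem f is exactly the thing we do not know. The function names `f` and `simulate` are hypothetical.

```python
import math
import random
import statistics


def f(x):
    """Hypothetical 'true' function, chosen only for illustration."""
    return 0.1 * math.sin(2 * math.pi * x)


def simulate(n, sd, seed=0):
    """Draw n inputs and responses from Y_i = f(X_i) + eps_i."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]         # inputs X_i in [0, 1)
    eps = [rng.gauss(0.0, sd) for _ in range(n)]  # mean-zero errors, drawn independently of X
    ys = [f(x_i) + e_i for x_i, e_i in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(n=2000, sd=0.01)
mean_eps = statistics.fmean(eps)  # should be close to 0
sd_eps = statistics.pstdev(eps)   # should be close to the chosen sd
print(f"mean of errors: {mean_eps:.5f}, sd of errors: {sd_eps:.5f}")
```

An analyst would only ever see `xs` and `ys`; `f` and `eps` are visible here solely because the data are simulated.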


Simple Example

The function f that connects the input variable to the output variable is
in general unknown. In this situation one must estimate f based on the
observed points.
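
One possible way to estimate f from observed points is sketched below. It uses k-nearest-neighbour averaging, one of the methods listed in the course description, purely as an illustration; the slide does not say which estimator produced its figure. The true function `f_true`, the noise level, and the choice k = 15 are assumptions made for the example.

```python
import math
import random


def f_true(x):
    """Hypothetical true function; unknown in a real problem."""
    return 0.1 * math.sin(2 * math.pi * x)


# Simulated observations (x_i, y_i) with noise sd = 0.01.
rng = random.Random(1)
xs = [rng.random() for _ in range(500)]
ys = [f_true(x) + rng.gauss(0.0, 0.01) for x in xs]


def knn_estimate(x0, xs, ys, k=15):
    """Estimate f(x0) as the mean response of the k observations nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


grid = [0.1, 0.25, 0.5, 0.75, 0.9]
for x0 in grid:
    print(f"x0={x0}: estimate={knn_estimate(x0, xs, ys):+.4f}, true={f_true(x0):+.4f}")

max_abs_error = max(abs(knn_estimate(x0, xs, ys) - f_true(x0)) for x0 in grid)
```

Comparing the estimate against `f_true` is only possible because the data are simulated; with real data the quality of the estimate has to be assessed in other ways.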

Different Standard Deviations


The difficulty of estimating f will depend on the standard deviation of
the ε’s.

[Figure: four scatter plots of y against x, one per noise level: sd=0.001, sd=0.005, sd=0.01 and sd=0.03. In each panel x runs from 0.0 to 1.0 and y from about -0.10 to 0.10.]
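
The effect of the noise level can be explored with a small experiment. The sketch below reuses the four standard deviations from the figure, but everything else is an assumption: the true function, the sample size, and the k-nearest-neighbour estimator are stand-ins chosen for illustration, not the setup behind the original plots.

```python
import math
import random


def f_true(x):
    """Hypothetical true function, an assumption for this sketch."""
    return 0.1 * math.sin(2 * math.pi * x)


def knn_estimate(x0, xs, ys, k=20):
    """Estimate f(x0) as the mean response of the k observations nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def rmse_for_sd(sd, n=1000, seed=0):
    """Simulate data with error sd, estimate f on a grid, return the RMSE."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [f_true(x) + rng.gauss(0.0, sd) for x in xs]
    grid = [i / 50 + 0.01 for i in range(50)]  # 0.01, 0.03, ..., 0.99
    sq_errors = [(knn_estimate(x0, xs, ys) - f_true(x0)) ** 2 for x0 in grid]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


rmse = {sd: rmse_for_sd(sd) for sd in (0.001, 0.005, 0.01, 0.03)}
for sd, err in rmse.items():
    print(f"sd={sd}: RMSE of the estimate = {err:.5f}")
```

With the same inputs and the same estimator, one would expect the error of the estimate to grow as the standard deviation of the ε’s grows, which is the point the slide makes.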
