GE 461 Introduction To Data Science: Spring 2021

This document provides information about an introduction to data science course offered in spring 2021. The course will be taught by multiple instructors and TAs from three departments. It will take place on Tuesdays and Thursdays in room EA-502 or online. The course covers various data science fundamentals, techniques, and applications. Students will complete a final exam, projects, and assignments. Grading will be based on the final exam (40%) and projects (60%). The document also defines data science and compares it to related fields. It discusses why data science has become prominent now due to advances in machine learning, computing power, and data collection and storage. Examples of data science applications are provided.

GE 461: Introduction to Data Science
Spring 2021
Course Website
All course-related material will be provided on the course website:
http://www.cs.bilkent.edu.tr/~korpe/courses/ge461/
Check regularly for announcements!
Weekly topics and their instructors are listed there.
Slides will be provided there.
Assignments will be released on Moodle.
The website also has external links to similar courses and online textbooks.
Instructors
Cross-department Course with Multiple Instructors.

Coordinator: İbrahim Körpeoğlu

Eray Tüzün
Selim Aksoy
Cem Tekin
Savaş Dayanık
Ercüment Çiçek
Can Alkan
Hamdi Dibeklioğlu
Tolga Çukur
Shervin Rahimzadeh Arashloo
Fazlı Can

TAs will be announced on the Course Website. They will be from all
3 departments.
Location & Time
When: T 10:30 – 12:20 and Th 15:30 – 17:20.
Where: EA-502 (Online).
What: A lot! Introduction to data science fundamentals, techniques
and applications; data collection, preparation, storage and querying;
parametric models for data; models and methods for fitting, analysis,
evaluation, and validation; dimensionality reduction, visualization;
various learning methods, classifiers, clustering, data and text mining;
applications in diverse domains such as business, medicine, social
networks, computer vision; breadth knowledge on topics and hands-on
experience through projects and computer assignments.
Grading Policy
Final: 40%
Project: 60%
Multiple computer/programming/exercise assignments of various sizes.
A project can be assigned earlier than the indicated date on the weekly plan.
Projects can be individual or group based. Instructors will decide.
Projects will be uploaded to Moodle.
Piazza will be used as the forum to discuss.
Attendance:
A student who misses more than 9 hours will fail the course.
What is Data Science?
The field of study that uses various methods to extract useful insights
and knowledge from the data to make data-driven decisions.

Methods can include or require domain expertise, programming skills
(e.g., scripting to process data), statistical modeling (e.g., machine
learning algorithms), and visualization techniques.

Usually performed on big data.

Recommended readings:
http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
What is NOT Data Science?

Data Science makes use of AI, ML, DL

https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

Image source: Rob Tibshirani, Stanford Stats 101

What is NOT Data Science? Example
An AI breakthrough in 2011, now empowers Data Science.
Data Science vs Other Related Terms
Many terms are used interchangeably; vague definitions.

Data Science aims at finding the right questions and leans towards predictive analysis. It involves
some creativity.
Business Intelligence, on the other hand, aims at supporting the decision making of a business based
on past data.

Data mining is a technique that searches for patterns in the data and can be considered as a tool of
Data Science.
For example: Baby diapers and beer are frequently bought together.

Data analytics aims at analyzing data to find answers to concrete questions.

For instance, optimizing the teller processes at the bank to serve more customers.
It is a tool for Business Intelligence.
Why Now? Some advances
Better machine learning algorithms
e.g., deep architectures, Adam optimizer etc.

Faster computers
GPU power to crunch large datasets

Better ways (NoSQL) to manage data (Hadoop, Hive, HBase)

Data is ubiquitous (big data)
Cheap to produce and store

Python and R vs SAS and SPSS to process data
Advanced data visualization tools like Tableau

Image source: Rob Tibshirani, Stanford Stats 101

Big Data
Data is easy to produce, cheap to store. One example from genomics.
Big Data – cont’d
Database (old) vs Data Science (new)
                Databases                    Data Science
Data Value      "Precious"                   "Cheap"
Data Volume     Modest                       Massive
Examples        Bank records,                Online clicks,
                personnel records,           GPS logs,
                census,                      tweets,
                medical records              building sensor readings
Priorities      Consistency,                 Speed,
                error recovery,              availability,
                auditability                 query richness
Structured      Strongly (schema)            Weakly or none (text)
Properties      Transactions, ACID*          CAP* theorem (2/3),
                                             eventual consistency
Realizations    SQL                          NoSQL:
                                             Riak, Memcached,
                                             Apache River,
                                             MongoDB, CouchDB,
                                             HBase, Cassandra, …
* ACID = Atomicity, Consistency, Isolation and Durability; CAP = Consistency, Availability, Partition Tolerance.
Slide by John Canny
Modelling vs Data-Driven Solutions
Scientific modelling
Background knowledge, set of rules, principles, representations etc.
Example: Weather forecasting.
Image credit: CIMMS, U of Wisconsin

Data-Driven Solutions
Little or no a priori model; it is replaced by an inference algorithm
(e.g., neural network, SVM).
Example: Image classification.

AlexNet/VGG-F visualization from Brown CSCI1430

Some examples - Search
Google PageRank Algorithm

Original Paper

Used in many applications to obtain data-driven answers to various problems

Image source: Wikipedia
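
The idea behind PageRank can be sketched as a power iteration over a link graph. This is only a minimal illustration, not Google's implementation: the toy graph `toy_web`, the damping factor of 0.85 and the fixed iteration count are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: most links point to C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

The ranks form a probability distribution over pages, and the page with the most incoming link weight ends up on top.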

Some examples – Recommendation Systems
Some examples – Flu Trends
Google Flu Trends

AlexNet/VGG-F visualization from Brown CSCI1430

Some examples – Comp. Biology
Data Science for Gene Risk Prediction
It is not enough to collect the data.
What does the data tell us?
Use methods to analyze it.

Satterstrom et al., CELL 2020

[Figure legend: ASD risk genes, Steiner tree genes, coexpression and gene interaction edges, same gene in consecutive windows, cross time/brain region]

Some examples – Comp. Biology

Machine Learning for Gene Risk Prediction

Build algorithms to predict the risk

[Figure: Spatio-Temporal Window 1 and Spatio-Temporal Window 2]

Spatio-temporal Network-based Analysis. Norman and Cicek, Bioinformatics 2019.

Satterstrom et al., CELL 2020

Multi-Task Learning for Autism Gene Risk Prediction. Karakahya et al., in prep.
Some examples – Comp. Biology
Data Science for Online Feedback to Surgeons
Use Multiple Multivariate Regression to predict the result of a test that
is infeasible to perform during surgery due to time requirement.

Karakaslar et al., IEEE/ACM TCBB 2019, in press.

Some examples – Comp. Biology
Machine Learning for Online Feedback to Surgeons
Design a neural network that learns which parts of the input are important
to classify tumors.

BiLSTM

Attention mechanism

Cakmakci et al.,
PLoS Computational Biology 2020.
Data Science Pipeline

Image credit Wolfram Research

Data Science Pipeline - Data Collection
Many data types, many ways
Sensors
Crowdsourcing, putting humans at work once computers fail:
Mechanical Turk
Crawling
Questionnaires

The Turk
Data Science Pipeline - Data Wrangling
After you obtain the raw data, convert it into a more useful format.
Gather multiple files into a single, standardized format.
For example: unite multiple crawled files into one, get rid of HTML
tags etc.
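
The wrangling step above can be sketched with Python's standard `html.parser` module. The page names and contents in `crawled_pages` are made up for illustration, and the pages are held in memory rather than read from files.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical crawled pages (assumption: normally one file per page).
crawled_pages = {
    "page1.html": "<html><body><h1>Flu report</h1><p>Cases are <b>rising</b>.</p></body></html>",
    "page2.html": "<p>Vaccines available</p>",
}

# One standardized record per source page, gathered into a single list.
records = [
    {"source": name, "text": html_to_text(html)}
    for name, html in sorted(crawled_pages.items())
]
```

Each record now has the same two fields, so later steps do not need to know where a page came from.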
Data Science Pipeline - Data Cleaning
Dig deeper into the data after standardization and detect problems.
Inconsistencies
Outliers
Missing values
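
A small sketch of the three problems above on made-up temperature readings. The rows, the choice to drop missing values, and the outlier rule (more than 5 median absolute deviations from the median) are assumptions for illustration; other rules may suit real data better.

```python
from statistics import median

# Hypothetical raw readings with typical problems.
raw_rows = [
    {"city": "Ankara", "temp_c": "21.5"},
    {"city": "ankara ", "temp_c": "22.0"},   # inconsistent spelling
    {"city": "Izmir", "temp_c": ""},         # missing value
    {"city": "Izmir", "temp_c": "250"},      # implausible value
    {"city": "Izmir", "temp_c": "24.0"},
    {"city": "Izmir", "temp_c": "23.0"},
]


def clean(rows, threshold=5.0):
    # Inconsistencies: normalize spelling variants of the same city.
    normalized = [dict(r, city=r["city"].strip().title()) for r in rows]
    # Missing values: here they are simply set aside.
    missing = [r for r in normalized if r["temp_c"] == ""]
    present = [dict(r, temp_c=float(r["temp_c"])) for r in normalized if r["temp_c"] != ""]
    # Outliers: flag values far from the median, measured in MAD units.
    temps = [r["temp_c"] for r in present]
    med = median(temps)
    mad = median([abs(t - med) for t in temps])
    outliers = [r for r in present if abs(r["temp_c"] - med) > threshold * mad]
    kept = [r for r in present if abs(r["temp_c"] - med) <= threshold * mad]
    return kept, missing, outliers


kept, missing, outliers = clean(raw_rows)
```

Whether to drop, impute or keep a flagged row is a judgment call that depends on the domain.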
Data Science Pipeline
Explore – Preprocess – Model Cycle
1. Explore the structure of the data and decide on the appropriate
model to analyze.
For instance: sequence data, maybe LSTM?
image data, maybe Convolutional Neural Networks.
2. Preprocess the data to be fit into the model
For instance, RGB -> Grayscale
3. Apply the model and analyze results
4. Go to 1.
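
As an illustration of step 2, an RGB image can be converted to grayscale with a weighted sum of the channels. The weights 0.299, 0.587 and 0.114 are the common luma coefficients, which is an assumption here since the slide does not name a formula, and the 2x2 image is made up.

```python
def rgb_to_gray(image):
    """image: rows of (r, g, b) tuples with values 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


# Hypothetical 2x2 image: white, black, pure red, pure green.
tiny_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]
gray = rgb_to_gray(tiny_image)
```

The result has one value per pixel instead of three, which is the shape a single-channel model would expect.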
Data Science Pipeline - Validation
After you have fine-tuned your model in the previous cycle, validate it
on data that the model has not seen.

Validate that your claim is not just a random finding.

Multiple hypothesis correction
Correlation is not causation.
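
One simple form of multiple hypothesis correction is the Bonferroni correction, sketched below. It is only one of several correction methods, and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, per test, whether it stays significant after correction."""
    m = len(p_values)
    # With m tests, each one must pass the stricter level alpha / m.
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values from four separate tests.
p_values = [0.001, 0.02, 0.04, 0.3]
uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni(p_values)
```

Three of the four tests look significant on their own, but only the first survives the correction.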
Data Science Pipeline – Story Telling
A data scientist also needs to communicate well.
Infographics and how you convey the story are important.

vs
Data Storage and Cloud
Database Systems
Relational databases, organized around tables, SQL
NoSQL databases for online distributed databases, eventual
consistency: Cassandra, HBase
Cloud Storage
Ubiquitous computing, data access from everywhere
No worries on losing data
Cloud Computing
Distributed computing on large scale data
Map Reduce, Hadoop
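
The MapReduce idea can be sketched on a single machine with a word count. This is not the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` and the two documents are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine the values of one key into a single result.
    return key, sum(values)


documents = ["big data is big", "data is cheap"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce calls run on different machines, and the framework performs the shuffle over the network.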
Statistical Modeling
Parametric Models
Family of probability distributions with a finite number of
parameters
For example: the Binomial distribution has 2 parameters (n, p)
Non-parametric Models
The parameter set is infinite dimensional, i.e., it grows with the data size.
For example: k nearest neighbors classification.
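
The contrast can be sketched in a few lines. The binomial model is fully described by (n, p), while the k nearest neighbors classifier has to keep the whole training set. The training points below are made up for illustration.

```python
from collections import Counter
from math import comb


def binomial_pmf(k, n, p):
    """Parametric: the whole distribution is fixed by the two numbers n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def knn_predict(train, query, k=3):
    """Non-parametric: the 'model' is the training data itself."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    # Majority vote among the k closest training points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labeled 2-D points.
train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
prob_two_heads = binomial_pmf(2, 4, 0.5)
label_near_origin = knn_predict(train, (0.5, 0.5))
label_far = knn_predict(train, (5.5, 5.5))
```

Adding more training points leaves `binomial_pmf` with the same two parameters, but makes the k nearest neighbors model larger.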

Applications to customer choice problems and market simulation.

Model Validation
Experimental Design
Cross Validation
Statistical Tests for validation
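
Cross validation can be sketched as splitting the sample indices into k folds, each fold serving once as the held-out test set. The helper name `k_fold_indices` is an assumption; in practice the indices are usually shuffled first, which is left out here to keep the example deterministic.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


splits = k_fold_indices(10, 3)
```

A model would be trained on each `train` part and scored on the matching `test` part, and the k scores averaged.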
Unsupervised Learning
Feature extraction: Principal Component Analysis, t-SNE etc.

[Figure: data projected onto the principal components PC1 and PC2]
Unsupervised Learning – cont’d
Clustering: Finding groups of data points
which are similar to each other.

John Snow, a London physician, plotted the location of cholera deaths
on a map during an outbreak in the 1850s.

The locations indicated that cases were clustered around certain
intersections where there were polluted wells, thus exposing both the
problem and the solution.
Unsupervised Learning – cont’d
Clustering: Finding groups of data
points which are similar to each other.

Given a sample of breast cancer patients and their gene activity level
measurements, can you find subgroups (e.g., aggressive, mild etc.)?

So many other applications:

Targeted advertising
LinkedIn contact suggestion
Unsupervised Learning – cont’d
Winner take all rule, competitive learning
Several algorithm examples
k-means
k cluster centers, each the mean of its assigned data points
Gaussian Mixture Models (GMM)
Assume that the data are generated by k Gaussian components
Spectral Clustering
Compute the eigenvalues/eigenvectors of the Laplacian of the
similarity matrix
Use the eigenvectors corresponding to the smallest eigenvalues
for dimension reduction

[Figure: GMM example]
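
Of the algorithms above, k-means is the simplest to sketch. The version below alternates between assigning points to the nearest center and moving each center to the mean of its points. The points, the initial centers and the fixed iteration count are assumptions for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points with given initial centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
```

Results depend on the initial centers, so in practice the algorithm is usually run from several starting points.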
Supervised Learning
When the data has labels learn a predictive model
using features.
Neural Network Architectures
Perceptron
Multi Layer Perceptron
Convolutional Networks
Recurrent Neural Networks
Neural Network Training
Backpropagation
Optimizers
Support Vector Machines
Decision Trees
Ensemble Learning
Random Forest
XGBoost, AdaBoost
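
The simplest model in this list, the perceptron, can be sketched on a tiny labeled dataset. The task (the logical AND of two inputs), the learning rate and the number of epochs are assumptions for illustration.

```python
def predict(weights, bias, x):
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """samples: list of ((x1, x2), label) with labels 0 or 1."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # Perceptron rule: shift the boundary only when the prediction is wrong.
            weights[0] += lr * error * x[0]
            weights[1] += lr * error * x[1]
            bias += lr * error
    return weights, bias


# Labeled data: features are the two inputs, the label is their logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
```

A single perceptron can only learn linearly separable labels, which is one motivation for the multi-layer architectures listed above.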

SVM example – image source Cornell cs4780

Reinforcement Learning
Learning a policy by experience, reward, penalty like humans.
Q-Learning
Deep Q-Network
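
Tabular Q-learning can be sketched on a toy corridor where the agent starts at the left end and is rewarded for reaching the right end. The environment, the hyperparameters and the purely random exploration policy are assumptions for illustration; Q-learning is off-policy, so it can learn the greedy policy while acting randomly.

```python
import random

N_STATES = 5          # corridor states 0..4; state 4 is the goal
ACTIONS = (-1, +1)    # move left or right


def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate towards reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
```

The learned greedy policy moves right in every state. A Deep Q-Network replaces the table `q_table` with a neural network that estimates the same values.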

AlphaGo beats a 9-dan professional 4-1 and is awarded 9-dan. Later, AlphaZero is developed for Go, Shogi and Chess.
AlphaStar beats a top professional player, a first in an RTS game. Again, by DeepMind.
Next: Applications in Computer Vision
