
GE 461 Introduction To Data Science: Spring 2021

This document provides information about an introduction to data science course offered in spring 2021. The course will be taught by multiple instructors and TAs from three departments. It will take place on Tuesdays and Thursdays in room EA-502 or online. The course covers various data science fundamentals, techniques, and applications. Students will complete a final exam, projects, and assignments. Grading will be based on the final exam (40%) and projects (60%). The document also defines data science and compares it to related fields. It discusses why data science has become prominent now due to advances in machine learning, computing power, and data collection and storage. Examples of data science applications are provided.

Uploaded by

Muhammed Naci

GE 461
Introduction to Data Science
Spring 2021
Course Website
All course-related material will be provided on the course website:
http://www.cs.bilkent.edu.tr/~korpe/courses/ge461/
Check regularly for announcements!
Weekly topics and their instructors are listed there.
Slides will be provided here.
Assignments released on Moodle.
Various external links to other similar courses and online textbooks.
Instructors
Cross-department Course with Multiple Instructors.

Coordinator: İbrahim Körpeoğlu

Selim Aksoy
Savaş Dayanık
Can Alkan
Tolga Çukur
Fazlı Can
Eray Tüzün
Cem Tekin
Ercüment Çiçek
Hamdi Dibeklioğlu
Shervin Rahimzadeh Arashloo

TAs will be announced on the Course Website. They will be from all
3 departments.
Location & Time
When: T 10:30 – 12:20 and Th 15:30 – 17:20.
Where: EA-502 (Online).
What: A lot! Introduction to data science fundamentals, techniques
and applications; data collection, preparation, storage and querying;
parametric models for data; models and methods for fitting, analysis,
evaluation, and validation; dimensionality reduction, visualization;
various learning methods, classifiers, clustering, data and text mining;
applications in diverse domains such as business, medicine, social
networks, computer vision; breadth knowledge on topics and hands-on
experience through projects and computer assignments.
Grading Policy
Final: 40%
Project: 60%
Multiple computer/programming/exercise assignments of various sizes.
A project can be assigned earlier than the indicated date on the weekly plan.
Projects can be individual or group based. Instructors will decide.
Projects will be uploaded to Moodle.
Piazza will be used as the forum to discuss.
Attendance:
A student who misses more than 9 hours will fail the course.
What is Data Science?
The field of study that uses various methods to extract useful insights
and knowledge from the data to make data-driven decisions.

Methods can include or require domain expertise, programming skills
(e.g., scripting to process data), statistical modeling (e.g., machine
learning algorithms), and visualization techniques.

Usually performed on big data.



Recommended readings:
http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
What is NOT Data Science?

Data Science makes use of AI, ML, DL

https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

Image source: Rob Tibshirani, Stanford Stats 101


What is NOT Data Science? Example
An AI breakthrough from 2011 that now empowers Data Science.
Data Science vs Other Related Terms
Many terms are used interchangeably; vague definitions.

Data Science aims at finding the right questions and leans toward predictive analysis. It involves
some creativity.
Business Intelligence, on the other hand, aims at supporting the decision making of a business based
on past data.

Data mining is a technique that searches for patterns in the data and can be considered as a tool of
Data Science.
For example: Baby diapers and beer are frequently bought together.
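The "bought together" pattern above can be sketched as a count of how often item pairs share a basket. This is a minimal illustration, not a full association-rule miner; the baskets and the function name `pair_support` are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the items are invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
    {"milk", "bread"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("beer", "diapers") and ("diapers", "beer") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
most_common_pair = max(support, key=support.get)
print(most_common_pair, support[most_common_pair])
```

The fraction computed here is usually called the support of the pair; real data mining tools add further measures such as confidence and lift.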

Data analytics aims at analyzing data to find answers to concrete questions.


For instance, optimizing the teller processes at the bank to serve more customers.
It is a tool for Business Intelligence.
Why Now? Some advances
Better machine learning algorithms
e.g., deep architectures, ADAM optimizer etc.

Faster computers
GPU power to crunch large datasets

Better ways (NoSQL) to manage data (Hadoop, Hive, HBase)

Data is ubiquitous
Cheap to produce and store
Python and R vs SAS and SPSS to process data
Advanced data visualization tools like Tableau

+ big data

Image source: Rob Tibshirani, Stanford Stats 101


Big Data
Data is easy to produce, cheap to store. One example from genomics.
Big Data – cont’d
Database (old) vs Data Science (new)
               Databases                          Data Science
Data Value     “Precious”                         “Cheap”
Data Volume    Modest                             Massive
Examples       Bank records, personnel records,   Online clicks, GPS logs, tweets,
               census, medical records            building sensor readings
Priorities     Consistency, error recovery,       Speed, availability,
               auditability                       query richness
Structured     Strongly (schema)                  Weakly or none (text)
Properties     Transactions, ACID*                CAP* theorem (2/3),
                                                  eventual consistency
Realizations   SQL                                NoSQL: Riak, Memcached,
                                                  Apache River, MongoDB, CouchDB,
                                                  HBase, Cassandra,…
* ACID = Atomicity, Consistency, Isolation and Durability
* CAP = Consistency, Availability, Partition Tolerance
Slide by John Canny
Modelling vs Data-Driven Solutions
Scientific modelling
Background knowledge, set of rules,
principles, representations etc.
Example: Weather forecasting.

Data-Driven Solutions
No or little a priori model, which is
replaced by an inference algorithm
(e.g., Neural Network, SVM etc.).

CIMMS, U of Wisconsin


Example: Image classification.

AlexNet/VGG-F visualization from Brown CSCI1430


Some examples - Search
Google PageRank Algorithm

Original Paper

Used in many applications to
get data-driven answers to various problems
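As a rough illustration of the idea behind PageRank, the sketch below runs the textbook power iteration on a tiny link graph. It is a simplified classroom version, not Google's production algorithm; the damping factor 0.85, the iteration count, and the four-page graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping page -> list of outgoing links.

    Every link target must also appear as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C collects links from three pages and ends up ranked first, while D, which nobody links to, keeps only the baseline share.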

Image source: Wikipedia


Some examples – Recommendation Systems
Some examples – Flu Trends
Google Flu Trends



Some examples – Comp. Biology
Data Science for Gene Risk Prediction
It is not enough to collect the data.
What does the data tell us?
Use methods to analyze it.

Satterstrom et al., CELL 2020


[Figure legend, partly recoverable: coexpression, gene interaction, Steiner tree edge, Steiner tree gene, ASD risk gene, same gene in consecutive time/brain region windows]

Some examples – Comp. Biology


Machine Learning for Gene Risk Prediction
Build algorithms to predict the risk

[Figure: Spatio-Temporal Window 1 and Spatio-Temporal Window 2, with axes x, y, z]

Spatio-temporal Network-based Analysis. Norman and Cicek, Bioinformatics 2019.

Satterstrom et al., CELL 2020

Multi-Task Learning for Autism Gene Risk Prediction. Karakahya et al., in prep.
Some examples – Comp. Biology
Data Science for Online Feedback to Surgeons
Use Multiple Multivariate Regression to predict the result of a test that
is infeasible to perform during surgery due to time requirement.

Karakaslar et al., IEEE/ACM TCBB 2019, in press.


Some examples – Comp. Biology
Machine Learning for Online Feedback to Surgeons
Design a neural network that learns the important parts of the input to classify
tumors.

BILSTM

Attention mechanism

Cakmakci et al.,
PLoS Computational Biology 2020.
Data Science Pipeline

Image credit Wolfram Research


Data Science Pipeline - Data Collection
Many data types, many ways
Sensors
Crowdsourcing, putting humans to work where computers fail:
Mechanical Turk
Crawling
Questionnaires

The Turk
Data Science Pipeline - Data Wrangling
After you obtain the raw data, convert it into a more useful format.
Gather multiple files into a single, standardized format.
For example: Unite multiple crawled files into one, get rid of HTML
tags etc.
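The crawled-files example can be sketched with the standard library alone. The page contents and file names below are hypothetical and held in memory rather than read from disk; `TextExtractor` and `html_to_text` are names introduced only for this illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page and drop the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical crawled pages.
crawled = {
    "page1.html": "<html><body><h1>Flu</h1><p>Cases rise in <b>winter</b></p></body></html>",
    "page2.html": "<div>Cholera <i>outbreak</i> in London</div>",
}

# One standardized record per source file, gathered into a single list.
records = [{"source": name, "text": html_to_text(html)}
           for name, html in sorted(crawled.items())]
print(records)
```

A real crawl would also need decisions about encodings, scripts, and duplicated pages, which this sketch leaves out.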
Data Science Pipeline - Data Cleaning
Dig deeper into the data after standardization and detect problems.
Inconsistencies
Outliers
Missing values
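A minimal sketch of the three checks above on a toy table. The readings, the column names, and the outlier rule (more than 5 median absolute deviations from the median) are assumptions for illustration; the right threshold depends on the data.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"city": "Ankara", "temp_c": 21.0},
    {"city": "ankara ", "temp_c": 22.5},
    {"city": "Ankara", "temp_c": None},
    {"city": "Ankara", "temp_c": 20.5},
    {"city": "Ankara", "temp_c": 21.5},
    {"city": "Ankara", "temp_c": 95.0},
]

# Inconsistencies: normalize spelling variants of the same category.
for row in readings:
    row["city"] = row["city"].strip().title()

# Missing values: report their positions instead of silently dropping them.
missing = [i for i, row in enumerate(readings) if row["temp_c"] is None]

# Outliers: flag values far from the median, measured in units of the
# median absolute deviation (MAD). The factor 5 is an assumed threshold.
values = [row["temp_c"] for row in readings if row["temp_c"] is not None]
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)
outliers = [i for i, row in enumerate(readings)
            if row["temp_c"] is not None and abs(row["temp_c"] - median) > 5 * mad]

print(missing, outliers)
```

Whether a flagged value is an error or a genuine extreme still needs a human decision or domain knowledge.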
Data Science Pipeline
Explore – Preprocess – Model Cycle
1. Explore the structure of the data and decide on an appropriate
model for the analysis.
For instance: sequence data, maybe LSTM?
image data, maybe Convolutional Neural Networks.
2. Preprocess the data so that it fits the model.
For instance, RGB -> Grayscale
3. Apply the model and analyze results
4. Go to 1.
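The RGB -> Grayscale preprocessing step mentioned in the cycle can be sketched as a weighted sum of the three channels. The weights below are the widely used ITU-R BT.601 luma coefficients, which is one common choice and an assumption here; the 2x2 image is made up.

```python
def rgb_to_grayscale(image):
    """Convert an image given as rows of (R, G, B) tuples to rows of gray levels.

    Uses the BT.601 luma weights 0.299, 0.587 and 0.114.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]

# Hypothetical 2x2 image with 8-bit channels: red, green, blue, white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray = rgb_to_grayscale(image)
print(gray)
```

Green contributes most to the gray level because the weights follow perceived brightness rather than a plain average.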
Data Science Pipeline - Validation
After you have fine-tuned your model in the previous cycle, validate it on
data that the model has not seen.

Validate that your claim is not just a random finding.


Multiple hypothesis correction
Correlation is not causation.
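Multiple hypothesis correction can be illustrated with the Bonferroni rule, the simplest of several available corrections: with m tests, each p-value is compared against alpha / m instead of alpha. The p-values below are invented for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Return True for each test that stays significant after the
    Bonferroni correction, i.e. where p < alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from five tests run on the same data set.
p_values = [0.001, 0.02, 0.04, 0.30, 0.008]
uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni(p_values)
print(uncorrected)
print(corrected)
```

Four tests look significant without correction, but only two survive it. Bonferroni is conservative; less strict procedures exist when many tests are run.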
Data Science Pipeline – Story Telling
A data scientist also needs to communicate well.
Infographics and the way you convey the story are important.

Data Storage and Cloud
Database Systems
Relational databases, organized around tables, SQL
NoSQL databases for online distributed databases, eventual
consistency: Cassandra, HBase
Cloud Storage
Ubiquitous computing, data access from everywhere
No worries about losing data
Cloud Computing
Distributed computing on large scale data
Map Reduce, Hadoop
Statistical Modeling
Parametric Models
Family of probability distributions with a finite number of
parameters
For example: the binomial distribution has 2 parameters (n, p)
Non-parametric Models
Parameter set is infinite dimensional i.e., grows with the data
size. For example: k nearest neighbors classification.
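The contrast between the two model families can be sketched side by side: the binomial distribution is fully described by n and p, while a k nearest neighbors classifier has to keep the whole training set. The toy points, the labels, and k = 3 are assumptions for illustration.

```python
from collections import Counter
from math import comb, dist

def binomial_pmf(k, n, p):
    """Parametric: the whole distribution is fixed by two numbers, n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def knn_predict(train, query, k=3):
    """Non-parametric: the 'model' is the training set itself.

    train is a list of (point, label) pairs; points are coordinate tuples.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points forming two well-separated groups.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(binomial_pmf(2, 4, 0.5))          # probability of 2 successes in 4 fair trials
print(knn_predict(train, (0.5, 0.5)))   # majority label among the 3 nearest points
```

Adding more training points leaves the binomial model at two parameters, whereas the k nearest neighbors model grows with every point it stores.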

Applications to customer choice problems and market simulation.


Model Validation
Experimental Design
Cross Validation
Statistical Tests for validation
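Cross validation can be sketched as splitting the sample indices into k folds and holding out one fold at a time. The helper name `k_fold_indices` is introduced for this illustration, and the folds are contiguous for simplicity; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample is used for testing exactly once, and the k test scores are typically averaged to estimate how the model performs on unseen data.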
Unsupervised Learning
Feature extraction: Principal Component Analysis, t-SNE etc.
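To make Principal Component Analysis concrete, the sketch below finds the first principal component (PC1) of 2-D points: it centers the data, builds the covariance matrix, and extracts the leading eigenvector by power iteration. Restricting to two dimensions and using power iteration are simplifications chosen for the example; libraries normally use a full eigen or singular value decomposition. The points are made up.

```python
from math import sqrt

def first_principal_component(points, iterations=100):
    """Return the unit direction of largest variance for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    vx, vy = 1.0, 0.0  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = sqrt(vx * vx + vy * vy)
        vx, vy = vx / norm, vy / norm
    return vx, vy

# Hypothetical points scattered around the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
pc1 = first_principal_component(points)
print(pc1)
```

Projecting each centered point onto this direction gives a one-dimensional summary that keeps most of the variance in this data set.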

PC1

PC2
Unsupervised Learning – cont’d
Clustering: Finding groups of data points
which are similar to each other.

John Snow, a London physician, plotted
the location of cholera deaths on a map
during an outbreak in the 1850s.

The locations indicated that cases were
clustered around certain intersections
where there were polluted wells – thus
exposing both the problem and the
solution.
Unsupervised Learning – cont’d
Clustering: Finding groups of data
points which are similar to each other.

Given a sample of breast cancer
patients and their gene activity level
measurements, can you find
subgroups (e.g., aggressive, mild
etc.)?

So many other applications:
Targeted advertising
LinkedIn contact suggestion
Unsupervised Learning – cont’d
Winner take all rule, competitive learning
Several algorithm examples
k-means
k cluster centers as means of assigned data points
Gaussian Mixture Models
assume that k Gaussian components generate the data
Spectral Clustering
Generate eigenvalues/eigenvectors of the Laplacian of the
similarity matrix
Use the eigenvectors of the smallest eigenvalues
for dimension reduction

Figure: GMM example
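Of the algorithms listed, k-means is the shortest to sketch: it alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The points and the fixed initial centers below are assumptions that keep the run deterministic; real implementations usually pick the initial centers at random and run several restarts.

```python
from math import dist

def k_means(points, centers, iterations=20):
    """Lloyd's algorithm: alternate assignment and mean-update steps.

    points and centers are lists of coordinate tuples.
    """
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical 2-D points forming two groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each point goes entirely to one cluster, which is the winner-take-all behavior; a Gaussian Mixture Model would instead give every point a probability of belonging to each cluster.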
Supervised Learning
When the data has labels, learn a predictive model
from the features.
Neural Network Architectures
Perceptron
Multi Layer Perceptron
Convolutional Networks
Recurrent Neural Networks
Training Neural Networks
Backpropagation
Optimizers
Support Vector Machines
Decision Trees
Ensemble Learning
Random Forest
XGBoost, AdaBoost
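The perceptron, the first architecture in the list above, is small enough to train by hand. The sketch below applies the classic perceptron learning rule to the logical AND function; the toy data set, the epoch count, and the learning rate are assumptions for illustration.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule for labels in {-1, +1}.

    samples is a list of (features, label) pairs; returns (weights, bias).
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            if label * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
    return weights, bias

def predict(weights, bias, features):
    """Return +1 or -1 depending on the side of the learned hyperplane."""
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else -1

# Logical AND as a linearly separable toy problem with labels -1 / +1.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
print(weights, bias)
```

The rule only converges when the classes are linearly separable, which is one reason multi-layer networks trained with backpropagation are needed for harder problems.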

SVM example – image source Cornell cs4780


Reinforcement Learning
Learning a policy from experience through rewards and penalties, as humans do.
Q-Learning
Deep Q-Network
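Tabular Q-Learning can be sketched on a toy problem: an agent on a line of five cells earns a reward only when it reaches the rightmost cell. The environment, the purely random exploration, and the values of alpha, gamma, and the episode count are all assumptions made for this illustration. A Deep Q-Network replaces the table with a neural network.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Deterministic toy environment: reward 1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # purely random exploration
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = q_learning()
greedy_policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(greedy_policy)
```

Although the agent moves at random while learning, the greedy policy read from the table heads straight for the goal, because Q-learning estimates the value of the best action rather than of the behavior it actually followed.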

AlphaGo beats a 9-dan (professional) 4-1, gets 9-dan. Later AlphaZero is developed for Go, Shogi and Chess.
AlphaStar beats a top professional player. First time in an RTS game. Again, by DeepMind.
Next: Applications in Computer Vision
