GE 461 Introduction To Data Science: Spring 2021
Course Website
All course-related material will be provided on the course website:
http://www.cs.bilkent.edu.tr/~korpe/courses/ge461/
Check regularly for announcements!
Weekly topics and their instructors are listed.
Slides will be provided here.
Assignments released on Moodle.
Various external links to other similar courses and online textbooks.
Instructors
Cross-department Course with Multiple Instructors.
TAs will be announced on the Course Website. They will come from all 3 departments.
Location & Time
When: T 10:30 – 12:20 and Th 15:30 – 17:20.
Where: EA-502 (Online).
What: A lot! Introduction to data science fundamentals, techniques
and applications; data collection, preparation, storage and querying;
parametric models for data; models and methods for fitting, analysis,
evaluation, and validation; dimensionality reduction, visualization;
various learning methods, classifiers, clustering, data and text mining;
applications in diverse domains such as business, medicine, social
networks, computer vision; broad knowledge of topics and hands-on
experience through projects and computer assignments.
Grading Policy
Final: 40%
Project: 60%
Multiple computer/programming/exercise assignments of various sizes.
A project can be assigned earlier than the indicated date on the weekly plan.
Projects can be individual or group based. Instructors will decide.
Projects will be uploaded to Moodle.
Piazza will be used as the discussion forum.
Attendance:
A student who misses more than 9 hours will fail the course.
What is Data Science?
The field of study that uses various methods to extract useful insights
and knowledge from the data to make data-driven decisions.
Recommended readings:
http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
What is NOT Data Science?
Data Science aims at finding the right questions and leans toward predictive analysis. It involves some
creativity.
Business Intelligence, on the other hand, aims to support the decision making of a business based
on past data.
Data mining is a technique that searches for patterns in the data and can be considered a tool of
Data Science.
For example: baby diapers and beer are frequently bought together.
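The diapers-and-beer pattern can be illustrated by counting how often pairs of items share a basket. This is a minimal sketch, not a full association-rule miner; the baskets and the helper name `pair_support` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
    {"milk", "bread", "butter"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

In this toy data the pair with the highest support is beer and diapers, which appears in 3 of the 5 baskets.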
Newer methods, e.g., deep architectures, the Adam optimizer, etc.
Faster computers: GPU power to crunch large datasets
Better ways (NoSQL) to manage big data (Hadoop, Hive, HBase)
Data is ubiquitous: cheap to produce and store
Python and R vs SAS and SPSS to process data
Advanced data visualization tools like Tableau
Data-Driven Solutions
Little or no a priori model; the model is
replaced by an inference algorithm.
(CIMMS, U of Wisconsin)
[Figure labels: Original Paper; Spatio-Temporal Window 1]
Multi-Task Learning for Autism Gene Risk Prediction. Karakahya et al., in prep.
Some examples – Comp. Biology
Data Science for Online Feedback to Surgeons
Use multiple multivariate regression to predict the result of a test that
cannot feasibly be performed during surgery because of the time it requires.
BiLSTM
Attention mechanism
Cakmakci et al.,
PLoS Computational Biology 2020.
Data Science Pipeline
The Turk
Data Science Pipeline - Data Wrangling
After obtaining the raw data, convert it into a more useful format.
Gather multiple files into a single, standardized format.
For example: unite multiple crawled files into one, strip HTML
tags, etc.
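A minimal sketch of this wrangling step, assuming the crawled pages are already in memory as strings (the file names and contents below are hypothetical). It uses Python's standard `html.parser` to drop tags and gathers everything into one list of uniform records.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    # Join the text pieces and collapse the whitespace left by removed tags.
    return " ".join(" ".join(parser.chunks).split())

# Hypothetical crawled pages, standing in for files on disk.
crawled = {
    "page1.html": "<html><body><h1>Title</h1><p>First  page.</p></body></html>",
    "page2.html": "<div>Second <b>page</b> here</div>",
}

# One standardized record per source file.
records = [
    {"source": name, "text": strip_tags(raw)}
    for name, raw in sorted(crawled.items())
]
print(records)
```

Real crawled data usually needs more care (scripts, styles, encodings); this only shows the shape of the step.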
Data Science Pipeline - Data Cleaning
Dig deeper into the data after standardization and detect problems.
Inconsistencies
Outliers
Missing values
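The three kinds of problems listed above can each be detected with a few lines of code. The sketch below runs on a small invented table; the column names, the 5-MAD outlier threshold, and the case-only inconsistency check are illustrative assumptions, not a general recipe.

```python
import statistics

# Hypothetical records: None marks a missing value, and the country
# column mixes spellings of the same value.
rows = [
    {"age": 34, "country": "TR"},
    {"age": None, "country": "tr"},
    {"age": 29, "country": "Turkey"},
    {"age": 31, "country": "TR"},
    {"age": 290, "country": "TR"},
    {"age": 36, "country": "TR"},
]

# Missing values: row indices with no age recorded.
missing = [i for i, r in enumerate(rows) if r["age"] is None]

# Outliers: the median absolute deviation (MAD) is less distorted by the
# outlier itself than the standard deviation. 5 MADs is an arbitrary cut-off.
ages = [r["age"] for r in rows if r["age"] is not None]
median = statistics.median(ages)
mad = statistics.median(abs(a - median) for a in ages)
outliers = [
    i for i, r in enumerate(rows)
    if r["age"] is not None and abs(r["age"] - median) > 5 * mad
]

# Inconsistencies: values that differ only by letter case.
variants = {}
for r in rows:
    variants.setdefault(r["country"].lower(), set()).add(r["country"])
inconsistent = {k: v for k, v in variants.items() if len(v) > 1}

print(missing, outliers, inconsistent)
```

Note that the case-only check does not catch "Turkey" versus "TR"; inconsistencies like that need a mapping supplied by someone who knows the data.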
Data Science Pipeline
Explore – Preprocess – Model Cycle
1. Explore the structure of the data and decide on an appropriate
model for analyzing it.
For instance: for sequence data, maybe an LSTM;
for image data, maybe a convolutional neural network.
2. Preprocess the data so that it fits the model's expected input.
For instance: RGB -> grayscale.
3. Apply the model and analyze the results.
4. Go to 1.
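As an illustration of the preprocessing example in step 2, the sketch below converts a tiny RGB image to grayscale. It assumes 0-255 channel values and uses the ITU-R BT.601 luma weights, which is one common choice among several; the function name and the 2x2 image are made up.

```python
def rgb_to_gray(pixel):
    """Convert one (r, g, b) pixel with 0-255 channels to a gray level."""
    r, g, b = pixel
    # ITU-R BT.601 luma weights; other weightings exist.
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# Hypothetical 2x2 image: red, green, blue, and white pixels.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

gray = [[rgb_to_gray(p) for p in row] for row in image]
print(gray)
```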
Data Science Pipeline - Validation
After fine-tuning your model in the previous cycle, validate the
model on data that it has not seen.
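A minimal sketch of hold-out validation: part of the data is set aside, the model is fitted on the rest, and accuracy is measured only on the held-out part. The data, the toy threshold model, and all function names here are assumptions made for illustration.

```python
import random

def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Toy model: a cut-off halfway between the two class means."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, data):
    """Fraction of points whose predicted class matches the label."""
    correct = sum((x > threshold) == (label == 1) for x, label in data)
    return correct / len(data)

# Hypothetical, cleanly separable 1-D data: (feature, label) pairs.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(20, 30)]

train, test = train_test_split(data)
threshold = fit_threshold(train)           # fitted on training data only
test_accuracy = accuracy(threshold, test)  # measured on unseen data
print(len(train), len(test), test_accuracy)
```

Because the toy classes do not overlap, the held-out accuracy is perfect here; on real data the gap between training and held-out accuracy is what this step is meant to reveal.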
Data Storage and Cloud
Database Systems
Relational databases, organized around tables, SQL
NoSQL databases for online distributed storage, often with eventual
consistency: e.g., Cassandra, HBase
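To make the relational side concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# A relational table: fixed columns, one row per student.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO students (name, dept) VALUES (?, ?)",
    [("Ada", "CS"), ("Bora", "EE"), ("Cem", "CS")],
)

# SQL query: number of students per department.
rows = conn.execute(
    "SELECT dept, COUNT(*) FROM students GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()
print(rows)
```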
Cloud Storage
Ubiquitous computing, data access from everywhere
No worries about losing data
Cloud Computing
Distributed computing on large scale data
MapReduce, Hadoop
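The MapReduce idea can be sketched on a single machine with the classic word-count example. This is only an illustration of the map, shuffle, and reduce phases in plain Python; it does not use Hadoop or any distributed runtime, and the function names and documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input; in a real cluster each document could be
# mapped on a different machine.
documents = ["data science data", "science of data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```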
Statistical Modeling
Parametric Models
A family of probability distributions with a finite number of
parameters.
For example: the binomial distribution has 2 parameters (n, p).
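As a small worked example, the binomial distribution is fully determined by its two parameters. The sketch below evaluates its probability mass function; the values n = 10 and p = 0.5 are arbitrary illustrative choices.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The whole distribution is fixed once n and p are chosen.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(pmf[5], sum(pmf))
```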
Non-parametric Models
The parameter set is infinite-dimensional, i.e., the effective number of
parameters grows with the data size. For example: k-nearest-neighbors classification.
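A minimal k-nearest-neighbors sketch shows why the method is non-parametric: the "model" is the training set itself, so it grows with the data. The points, labels, and the choice k = 3 are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with class labels.
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.0), "b"),
]

print(knn_predict(train, (1.2, 1.5)), knn_predict(train, (6.4, 6.2)))
```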
[Figure axes: PC1, PC2]
Unsupervised Learning – cont’d
Clustering: Finding groups of data points
which are similar to each other.
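One common way to find such groups is k-means, sketched below in plain Python. k-means is used here only as an example of a clustering method; the points, the starting centers, and the fixed iteration count are illustrative assumptions.

```python
import math

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(
                range(len(centers)), key=lambda i: math.dist(p, centers[i])
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep the old
        # center if the cluster happens to be empty.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

centers, clusters = kmeans(points, centers=[points[0], points[-1]])
print(centers)
print(clusters)
```

The result depends on the starting centers; practical implementations usually try several initializations.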
AlphaGo beats a 9-dan professional 4-1 and is awarded an honorary 9-dan rank.
Later, AlphaZero is developed for Go, shogi, and chess.
AlphaStar beats a top professional player, for the first time in an RTS game. Again, by DeepMind.
Next: Applications in Computer Vision