Data Science From A Research Perspective
Dr. Vani Vasudevan, Professor – CSE,
11/12/22
Some Quotes!!!
• The purpose of computing is insight, not numbers. – Richard W. Hamming (Data Science)
• A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. – Josh Blumenstock (Mathematics)
• On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” . . . I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. – Charles Babbage (Data Wrangling)
• Money is a scoreboard where you can rank how you’re doing against other people. – Mark Cuban (Measures)
• It is easy to lie with statistics, but easier to lie without them. (Statistical Analysis)
• At their best, graphics are instruments for reasoning. – Edward Tufte (Data Visualization)
Some More Quotes!!!
• All models are wrong, but some models are useful. – George Box (Mathematical Models)
• A change in quantity also entails a change in quality. – Friedrich Engels (Big Data)
https://www.internetlivestats.com/
Data Science…
Data science is an emerging field that
(2) deals with enormous amounts of complex, incongruent, and dynamic data from
multiple sources; and
Source: Ivo D. Dinov, “Data Science and Predictive Analytics: Biomedical and Health Applications using R”, Springer 2018.
Data Science
(3) aims to develop algorithms, methods, tools, and services capable of ingesting such datasets
and generating semiautomated decision support systems.
The latter can mine the data for patterns or motifs, predict expected outcomes, suggest clustering
or labeling of retrospective or prospective observations, compute data signatures or fingerprints,
extract valuable information, and offer evidence-based actionable knowledge.
Data science techniques often involve data manipulation (wrangling), data harmonization and
aggregation, exploratory or confirmatory data analyses, predictive analytics, validation, and fine-
tuning.
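As a rough illustration of the wrangling, harmonization, and aggregation steps named above, here is a minimal sketch using only the Python standard library. The two sources, their field names (`weight_kg`, `weight_lb`, `group`), and all values are made up for illustration and are not taken from the cited book.

```python
from collections import defaultdict
from statistics import mean

LB_PER_KG = 2.20462  # conversion factor used to harmonize units

# Two hypothetical sources that report the same measurement in different units.
source_a = [
    {"id": "p1", "group": "x", "weight_kg": "70.0"},
    {"id": "p2", "group": "y", "weight_kg": "80.0"},
]
source_b = [
    {"id": "p3", "group": "x", "weight_lb": "176.3696"},
    {"id": "p4", "group": "y", "weight_lb": ""},  # missing value
]

def harmonize(record):
    """Wrangling + harmonization: return (group, weight in kg), or None if missing."""
    if record.get("weight_kg"):
        return record["group"], float(record["weight_kg"])
    if record.get("weight_lb"):
        return record["group"], float(record["weight_lb"]) / LB_PER_KG
    return None

# Aggregation: collect harmonized weights per group, then average them.
weights_by_group = defaultdict(list)
for record in source_a + source_b:
    row = harmonize(record)
    if row is not None:
        group, weight_kg = row
        weights_by_group[group].append(weight_kg)

mean_weight = {group: mean(values) for group, values in weights_by_group.items()}
print(mean_weight)
```

Dropping the record with the missing value is only one possible choice; imputing it would be an equally valid design decision depending on the question being asked.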
What is Data Science?
Like any emerging field, it isn’t yet well defined, but incorporates elements of:
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
SKILL SETS FOR DATA SCIENCE
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Some more resources:
https://towardsdatascience.com/introduction-to-statistics-e9d72d818745
https://deepai.org/machine-learning-glossary-and-terms/data-science
https://www.devopsschool.com/blog/what-is-data-science-advantages-and-disadvantages-of-data-science/
Appreciating Data
• Computer scientists do not naturally appreciate data: it’s just stuff to run through a program.
• The usual way to test algorithm performance is to run the implementation on “random data”.
• But interesting data sets are a scarce resource, which requires hard work and imagination to obtain.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Computer Vs. Real Scientists
• Scientists strive to understand the complicated and messy natural world, while computer
scientists build their own clean and organized virtual worlds. Thus:
• Nothing is ever completely true or false in science, while everything is either true or false in Computer Science / Mathematics.
• Scientists obsess about discovering things, while computer scientists invent rather than discover.
• Scientists are comfortable with the idea that data has errors; computer scientists are not.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Genius Vs. Wisdom
• Software developers are hired to produce code.
Data science (like most things) benefits more from wisdom than from genius.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Developing Wisdom
• Wisdom comes from experience.
• Wisdom comes from humility, observing how often you have been wrong and
why/how.
I seek to pass on wisdom through my experience of how difficult it is to make good predictions.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Developing Curiosity
• The good data scientist develops a curiosity about the domain/application they are working in.
• They talk shop with the people whose data they are working on.
• They read the newspaper every day, to get a broader perspective on the world.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Asking GOOD QUESTIONS:
Software developers are not encouraged to ask questions, but data scientists are:
• What exciting things might you be able to learn from a given data set?
• What things do you/your people really want to know?
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
LET’S PRACTICE ASKING QUESTIONS!
• Who, What, Where, When, and Why on the following datasets:
IMDb: Movie Data
IMDb: Actor Data
Movie Questions
• Can we predict how well people will like a movie? What about its gross?
• Do stars live longer or shorter lives than the bit players or public?
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
NYC Taxi Trip Data
• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data
NYC Taxi Trip Questions
• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Properties Of Data
• Structured vs. Unstructured Data
• Do not blindly aspire to analyze large data sets. Seek the right data to answer a given
question, not necessarily the biggest thing you can get your hands on.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Classification And Regression
• Two types of problems arise repeatedly in traditional data science and pattern
recognition applications, the challenges of classification and regression.
Classification: Often, we seek to assign a label to an item from a discrete set of possibilities. Such problems as predicting the winner of a particular sporting contest (team A or team B?) or deciding the genre of a given movie (comedy, drama, or animation?) are classification problems, since each entails selecting a label from the possible choices.
Regression: Another common task is to forecast a given numerical quantity. Predicting a person's weight or how much rain we will get this year is a regression problem, where we forecast the future value of a numerical function in terms of previous values and other relevant features.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
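To make the distinction concrete, the following is a minimal sketch in plain Python, not a method taken from the cited book. It assumes a nearest-neighbor rule for the classification task and a least-squares line for the regression task; the movie feature vectors and rainfall figures are invented for illustration.

```python
from statistics import mean

def nearest_neighbor_label(train, query):
    """Classification: return the label of the closest training point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda item: squared_distance(item[0], query))
    return label

def fit_line(xs, ys):
    """Regression: least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical movies described by two made-up features, each with a genre label.
movies = [((0.9, 0.1), "comedy"), ((0.2, 0.8), "drama"), ((0.8, 0.3), "comedy")]
genre = nearest_neighbor_label(movies, (0.3, 0.7))  # output is a discrete label

# Hypothetical yearly rainfall; the output is a number, not a label.
years = [1, 2, 3, 4]
rainfall = [10.0, 12.0, 14.0, 16.0]
intercept, slope = fit_line(years, rainfall)
forecast = intercept + slope * 5

print(genre, forecast)
```

The key difference is in the return type: the classifier picks one item from a fixed set of labels, while the regression produces a value on a continuous scale.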
Classification And Regression
• The best way to see the intended distinction is to look at a variety of data science
problems and label (classify) them as regression or classification.
• Different algorithmic methods are used to solve these two types of problems.
1. Will the price of a particular stock be higher or lower tomorrow?
3.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
PRACTICE QUESTION
Identifying Data Sets
1. Identify where interesting data sets relevant to the following domains can be found on the web:
(a) Books.
(c) Stock prices.
(f) Crime rates.
For each of these data sources, explain what you must do to turn this data into a usable format on your computer for analysis.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
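As one possible answer to the last part of the exercise, here is a minimal sketch of turning raw downloaded text into typed records. It assumes the source provides CSV; the column names (`date`, `ticker`, `close`), the ticker `ABC`, and the prices are hypothetical and do not come from any real data source.

```python
import csv
import io
from datetime import datetime

# Stand-in for a downloaded file; a real file would be opened with open(path).
raw_text = """date,ticker,close
2022-12-09,ABC,101.50
2022-12-12,ABC,
2022-12-13,ABC,103.25
"""

def parse_rows(text):
    """Convert CSV text into a list of dicts with proper Python types."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if not record["close"]:
            continue  # skip rows whose price is missing
        rows.append({
            "date": datetime.strptime(record["date"], "%Y-%m-%d").date(),
            "ticker": record["ticker"],
            "close": float(record["close"]),
        })
    return rows

rows = parse_rows(raw_text)
print(rows)
```

Other sources may instead need HTML scraping or an API client; the common step is the same, namely converting strings into dates and numbers and deciding how to treat missing entries.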
The Data Science Pipeline
1. Get or collect data
Some Useful Web Resources to Kick Start Your Learning And Research!
• https://cognitiveclass.ai/ - Data Science and Cognitive Computing Courses
• https://www.kdnuggets.com/ - Site on AI, Analytics, Big Data, Data Mining, Data Science, and ML
• https://www.kaggle.com/ - ML & DS community
• http://archive.ics.uci.edu/ml/index.php - ML Repository
• https://homepages.ecs.vuw.ac.nz/~marslast/MLbook.html - Stephen Marsland homepage
• Rohit Singh, Tommi Jaakkola, and Ali Mohammad. 6.867 Machine Learning. Fall 2006. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.
• Leslie Kaelbling, Tomás Lozano-Pérez, Isaac Chuang, and Duane Boning. 6.036 Introduction to Machine Learning. Fall 2020. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.
• https://ocw.mit.edu/courses/15-062-data-mining-spring-2003/
Data Science Tools
1. Python (most widely known)
• Python is one of the most dominant languages in the field of data science today because of its flexibility, ease of use in terms of syntax, open-source nature, and ability to handle, clean, manipulate, visualize, and analyze data.
both programmers and data scientists alike. Moreover, there are various other tools
connected to and built with the help of Python, such as Dask, SciPy, Cython, Matplotlib,
and High-Performance Analytics Toolkit (HPAT).
Java Vs Python
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Python Libraries
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
The Python Ecosystem
• NumPy: N-dimensional arrays, matrices, and linear algebra
• SciPy: Algorithms from linear algebra, optimization, statistics, and signal processing
• Scikit-learn: Machine Learning
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Anaconda
• A bundle of data science, machine learning and visualization libraries.
• Installation
1. Go to: https://www.anaconda.com/distribution/
2. Download the installer for your OS and Python version of choice and follow
instructions
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Jupyter Notebook
• A browser-based notebook with support for code, text, mathematical expressions,
inline plots and other rich media
COLAB
• Colaboratory is a free Jupyter notebook environment that requires no setup and runs
entirely in the cloud.
• With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
PANDAS
• It is a library for data manipulation and analysis
• Data is loaded in memory, which makes it very fast (but not ideal for datasets larger than available memory)
source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Data Science with Python
1. source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
2. Data Acquisition: Beautiful Soup, lxml, Scrapy, Tweepy (obtaining data by spidering the web, etc.), PySpark (for large data), MySQL client, MongoDB
3. Pre-processing: domain-specific pre-processing techniques (text: NLTK; images: scikit-image, etc.)
4. Analysis/Modeling
    1. Exploratory Data Analysis: pandas
    2. Visualization: pylab, matplotlib, seaborn
    3. Modeling: numpy, scipy, sympy
    4. Hypothesis Testing: scipy, statsmodels
    5. Machine Learning: sklearn
5. Evaluation/Interpretation/Communication
    1. LaTeX in IPython
    2. Bokeh
    3. Flask
Other Data Science Tools
• WEKA
• R (RStudio)
• MATLAB
• Statistical Analysis System (SAS)
• Apache Hadoop
• Tableau
• QlikView
• RapidMiner
• Excel
• Power BI
• Google Analytics
and much more!
Source: Machine Learning Mastery: https://machinelearningmastery.com/wp-content/uploads/2021/03/MachineLearningAlgorithms.jpg?__s=hbkixgpvleicleslspeo&utm_source=drip&utm_medium=email&utm_campaign=MMLA+Mini-Course&utm_content=Machine+Learning+Algorithms+Mind-Map+and+Mini-Course
https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
Parametric Approaches:
Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks
Support Vector Machines
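Taking the perceptron from the list above as one example, the sketch below shows what "parametric" means in practice: whatever the size of the training set, the learned model is summarized by a fixed number of parameters (here two weights and a bias). This is a minimal illustration in plain Python; the toy AND data set, the learning rate, and the epoch limit are assumptions chosen for the example, not values from the cited source.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of inputs plus bias is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=20):
    """Classic perceptron rule: adjust the parameters only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:
            break  # every training sample is classified correctly
    return weights, bias

# Toy, linearly separable data: the logical AND of two binary inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(weights, bias, predictions)
```

After training, only `weights` and `bias` are needed to classify new inputs; the training data itself can be discarded, which is the defining trait of a parametric approach.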
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA
Research problems at the intersection of big data and data science:
• Approaches to make the models learn with a smaller number of data samples
• Neural Machine Translation to Local languages
• Handling Data and Model drift for real-world applications
https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA
Research problems related to data engineering aspects:
• Lightweight Big Data analytics as a Service
https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA
Problems related to the core big data area of handling scale:
• Scalable architectures for parallel data processing
• Handling real-time video analytics in a distributed cloud
• Efficient graph processing at scale
https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA
Research problems in handling noise and uncertainty in data:
• Identify fake news in near real-time
• Dimensionality reduction approaches for large-scale data
• Training / Inference in noisy environments and incomplete data
https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA
Research problems in the security and privacy area:
• Secure federated learning with real-world applications
https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136
Ten Research Challenge Areas In Data Science
1. Scientific Understanding of Learning, Especially Deep Learning Algorithms.
2. Causal Reasoning
3. Precious Data
4. Multiple, Heterogeneous Data Sources
8. Automating Front-End Stages of the Data Life Cycle
9. Privacy
10. Ethics
https://hdsr.mitpress.mit.edu/pub/d9j96ne4/release/3
ANY QUESTIONS?
Reach me @ [email protected]
LinkedIn: https://in.linkedin.com/in/dr-vani-vasudevan-0b89b713