Data Science From A Research Perspective

Dr. Vani Vasudevan, Professor – CSE,
Nitte Meenakshi Institute of Technology, Bangalore.
11/12/22

Some Quotes!!!

• The purpose of computing is insight, not numbers. – Richard W. Hamming (Data Science)

• A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. – Josh Blumenstock (Mathematics)

• On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” . . . I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. – Charles Babbage (Data Wrangling)

• Money is a scoreboard where you can rank how you’re doing against other people. – Mark Cuban (Measures)

• It is easy to lie with statistics, but easier to lie without them. – Frederick Mosteller (Statistical Analysis)

• At their best, graphics are instruments for reasoning. – Edward Tufte (Data Visualization)

Some More Quotes!!!

• All models are wrong, but some models are useful. – George Box (Mathematical Models)

• Any sufficiently advanced form of cheating is indistinguishable from learning. – Jan Schaumann (Machine Learning)

• A change in quantity also entails a change in quality. – Friedrich Engels (Big Data)
https://www.internetlivestats.com/

Data Science…

Data science is an emerging field that

(1) is extremely transdisciplinary, bridging between the theoretical, computational, experimental, and biosocial areas;

(2) deals with enormous amounts of complex, incongruent, and dynamic data from multiple sources; and

Source: Ivo D. Dinov, “Data Science and Predictive Analytics: Biomedical and Health Applications using R”, Springer, 2018.

Data Science

(3) aims to develop algorithms, methods, tools, and services capable of ingesting such datasets and generating semiautomated decision support systems.

The latter can mine the data for patterns or motifs, predict expected outcomes, suggest clustering or labeling of retrospective or prospective observations, compute data signatures or fingerprints, extract valuable information, and offer evidence-based actionable knowledge.

Data science techniques often involve data manipulation (wrangling), data harmonization and aggregation, exploratory or confirmatory data analyses, predictive analytics, validation, and fine-tuning.

Source: Ivo D. Dinov, “Data Science and Predictive Analytics: Biomedical and Health Applications using R”, Springer, 2018.

What is Data Science?

Like any emerging field, it isn’t yet well defined, but incorporates elements of:

• Exploratory Data Analysis and Visualization

• Machine Learning and Statistics

• High-Performance Computing technologies for dealing with scale.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

SKILL SETS FOR DATA SCIENCE

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
Some more resources:
https://towardsdatascience.com/introduction-to-statistics-e9d72d818745
https://deepai.org/machine-learning-glossary-and-terms/data-science
https://www.devopsschool.com/blog/what-is-data-science-advantages-and-disadvantages-of-data-science/

Appreciating Data

• Computer Scientists do not naturally appreciate data: it’s just stuff to run through a program.

• The usual way to test algorithm performance is to run the implementation on “random data”.

• But interesting data sets are a scarce resource, which requires hard work and imagination to obtain.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Computer Vs. Real Scientists

• Scientists strive to understand the complicated and messy natural world, while computer scientists build their own clean and organized virtual worlds. Thus:

• Nothing is ever completely true or false in science, while everything is either true or false in Computer Science / Mathematics.

• Scientists are data-driven, while computer scientists are algorithm-driven.

• Scientists obsess about discovering things, while computer scientists invent rather than discover.

• Scientists are comfortable with the idea that data has errors; computer scientists are not.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Genius Vs. Wisdom

• Software developers are hired to produce code.

• Data Scientists are hired to produce insights.

• Genius shows in finding the right answer!!!

• Wisdom shows in avoiding the wrong answers.

Data science (like most things) benefits more from wisdom than from genius.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Developing Wisdom

• Wisdom comes from experience.

• Wisdom comes from general knowledge.

• Wisdom comes from listening to others.

• Wisdom comes from humility, observing how often you have been wrong and why/how.

I seek to pass on wisdom through my experience of the difficulty of making good predictions.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Developing Curiosity

• The good data scientist develops a curiosity about the domain/application they are working in.

• They talk shop with the people whose data they are working on.

• They read the newspaper every day, to get a broader perspective on the world.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Asking GOOD QUESTIONS:

Software developers are not encouraged to ask questions, but data scientists are:

• What exciting things might you be able to learn from a given data set?

• What things do you/your people really want to know?

• What data sets might get you there?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

LET’S PRACTICE ASKING QUESTIONS!

• Who, What, Where, When, and Why on the following datasets:

1. Internet Movie Database (IMDb)

2. New York City Taxi Trip Duration

IMDb: Movie Data

IMDb: Actor Data

Movie Questions

• Can we predict how well people will like a movie? What about its gross?

• What does the social network of actors look like?

• What is the age distribution of actors and actresses in film?

• Do stars live longer or shorter lives than bit players or the general public?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
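
The question about the social network of actors can be explored with a co-star graph. The following is a minimal sketch, assuming a mapping from movie title to cast list; the names `CASTS` and `build_costar_graph` and all titles and actors are hypothetical placeholders, not IMDb data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cast lists; real IMDb data would be loaded from a file instead.
CASTS = {
    "Movie A": ["Actor 1", "Actor 2", "Actor 3"],
    "Movie B": ["Actor 2", "Actor 4"],
    "Movie C": ["Actor 5"],
}


def build_costar_graph(casts):
    """Return an adjacency map: actor -> set of actors they appeared with."""
    graph = defaultdict(set)
    for cast in casts.values():
        for actor in cast:
            graph[actor]  # touch the key so actors without co-stars still appear
        for a, b in combinations(cast, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


graph = build_costar_graph(CASTS)

# Degree = number of distinct co-stars, a first look at the network's shape.
degrees = {actor: len(neighbors) for actor, neighbors in graph.items()}
print(sorted(degrees.items(), key=lambda item: (-item[1], item[0])))
```

From such a graph one could go on to ask about connected components or shortest paths between actors, depending on which question matters.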

NYC Taxi Trip Data

• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data

NYC Taxi Trip Questions

• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data

• How far do they travel?

• How much slower is traffic during rush hour?

• Where are people traveling to/from at different times of the day?

• Where should drivers go to pick up their next fare?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
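
Two of these questions (how far trips go, and how much slower traffic is at rush hour) can be approached from pickup and dropoff coordinates. The sketch below assumes each record carries a pickup timestamp, pickup and dropoff latitude/longitude, and a duration in seconds; the rows in `TRIPS` are made up for illustration, and straight-line (haversine) distance is used as a stand-in for the true route length.

```python
import math
from collections import defaultdict
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Made-up trips: (pickup time, pickup lat, pickup lon,
#                 dropoff lat, dropoff lon, duration in seconds)
TRIPS = [
    ("2016-03-14 08:30:00", 40.7580, -73.9855, 40.7128, -74.0060, 1500),
    ("2016-03-14 08:45:00", 40.7306, -73.9866, 40.7580, -73.9855, 900),
    ("2016-03-14 23:10:00", 40.7580, -73.9855, 40.7128, -74.0060, 700),
]


def mean_speed_by_hour(trips):
    """Average straight-line speed (km/h) grouped by pickup hour."""
    speeds = defaultdict(list)
    for pickup, lat1, lon1, lat2, lon2, seconds in trips:
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        km = haversine_km(lat1, lon1, lat2, lon2)
        speeds[hour].append(km / (seconds / 3600))
    return {hour: sum(v) / len(v) for hour, v in speeds.items()}


speed = mean_speed_by_hour(TRIPS)
print({hour: round(kmh, 1) for hour, kmh in sorted(speed.items())})
```

Comparing the per-hour averages gives a rough answer to the rush-hour question; a real analysis would need many more trips and would have to deal with bad coordinates and extreme durations.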

Properties Of Data

• Structured vs. Unstructured Data

• Quantitative vs. Categorical Data

• Big Data vs. Little Data

• Do not blindly aspire to analyze large data sets. Seek the right data to answer a given question, not necessarily the biggest thing you can get your hands on.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Classification And Regression

• Two types of problems arise repeatedly in traditional data science and pattern recognition applications, the challenges of classification and regression.

Classification: Often, we seek to assign a label to an item from a discrete set of possibilities. Such problems as predicting the winner of a particular sporting contest (team A or team B?) or deciding the genre of a given movie (comedy, drama, or animation?) are classification problems, since each entails selecting a label from the possible choices.

Classification And Regression

Regression: Another common task is to forecast a given numerical quantity. Predicting a person's weight or how much rain we will get this year is a regression problem, where we forecast the future value of a numerical function in terms of previous values and other relevant features.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Classification And Regression

• The best way to see the intended distinction is to look at a variety of data science problems and label (classify) them as regression or classification.

• Different algorithmic methods are used to solve these two types of problems.

1. Will the price of a particular stock be higher or lower tomorrow?

2. What will the price of a particular stock be tomorrow?

3. Is this person a good risk to sell an insurance policy to?

4. How long do we expect this person to live?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
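
Questions 1 and 2 use the same data but ask for different kinds of answers. The sketch below illustrates this with a made-up price series: the classification target is a label ("higher" or "lower"), the regression target is a number. The baselines (`naive_regressor`, `majority_classifier`) are deliberately trivial and are illustrative names, not methods from the cited book.

```python
# Hypothetical closing prices for one stock on consecutive days.
prices = [101.0, 102.5, 101.8, 103.2, 104.0, 103.5]

# Feature: today's price. The same inputs serve both problems.
features = prices[:-1]

# Regression target: tomorrow's price, a numerical quantity.
regression_targets = prices[1:]

# Classification target: a label from a discrete set of possibilities.
classification_targets = [
    "higher" if nxt > cur else "lower"
    for cur, nxt in zip(prices[:-1], prices[1:])
]


def naive_regressor(today_price):
    """Baseline forecast: tomorrow's price equals today's."""
    return today_price


def majority_classifier(labels):
    """Baseline classifier: always predict the most common label."""
    return max(set(labels), key=labels.count)


# Regression is scored with an error magnitude (mean absolute error).
mae = sum(
    abs(naive_regressor(x) - t) for x, t in zip(features, regression_targets)
) / len(features)

# Classification is scored with the fraction of correct labels (accuracy).
guess = majority_classifier(classification_targets)
accuracy = sum(
    label == guess for label in classification_targets
) / len(classification_targets)

print(f"regression MAE: {mae:.2f}")
print(f"classification accuracy of always guessing {guess!r}: {accuracy:.2f}")
```

The different scoring rules are one reason different algorithmic methods are used for the two problem types.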

PRACTICE QUESTION

Identifying Data Sets
1. Identify where interesting data sets relevant to the following domains can be found on the web:

(a) Books.

(c) Stock prices.

(d) Risks of diseases.

(e) Colleges and universities.

(f) Crime rates.

For each of these data sources, explain what you must do to turn this data into a usable format on your computer for analysis.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

The Data Science Pipeline

1. Get or collect data

2. Manipulate and process data

3. Modeling and analysis

4. Visualize, evaluate, present, and communicate
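
The four pipeline stages can be sketched end to end with the standard library alone. This is a toy illustration under stated assumptions: the inline CSV text `RAW` stands in for collected data, its values are made up, and the "model" is only a per-city mean. The function names are hypothetical.

```python
import csv
import io
import statistics

# Stage 1: get or collect data. An inline CSV stands in for a downloaded file.
RAW = """city,temp_c
Bangalore,24.5
Bangalore,n/a
Mysore,26.0
Mysore,27.0
"""


def collect(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Stage 2: manipulate and process. Drop rows with unparseable values."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["city"], float(row["temp_c"])))
        except ValueError:
            continue  # e.g. "n/a" is not a number
    return cleaned


def model(records):
    """Stage 3: modeling and analysis. Here, just a mean per city."""
    groups = {}
    for city, temp in records:
        groups.setdefault(city, []).append(temp)
    return {city: statistics.mean(temps) for city, temps in groups.items()}


def report(summary):
    """Stage 4: present and communicate the result as plain text."""
    return "\n".join(
        f"{city}: {mean:.1f} C" for city, mean in sorted(summary.items())
    )


summary = model(clean(collect(RAW)))
print(report(summary))
```

Keeping each stage in its own function makes it easy to swap in a real data source or a real model later without touching the other stages.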

Some Useful Web Resources to Kick Start Your Learning And Research!

• https://cognitiveclass.ai/ - Data Science and Cognitive Computing Courses

• https://www.kdnuggets.com/ - Site on AI, Analytics, Big Data, Data Mining, Data Science, and ML

• https://www.kaggle.com/ - ML & DS community

• https://data.gov/ - US government data

• http://archive.ics.uci.edu/ml/index.php - ML Repository

• https://homepages.ecs.vuw.ac.nz/~marslast/MLbook.html - Stephen Marsland homepage

• https://www.cs.waikato.ac.nz/ml/weka/courses.html - Waikato University - Weka MOOC

• https://nptel.ac.in/courses/106/106/106106202/ - NPTEL - Machine Learning

• Rohit Singh, Tommi Jaakkola, and Ali Mohammad. 6.867 Machine Learning. Fall 2006. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.

• Leslie Kaelbling, Tomás Lozano-Pérez, Isaac Chuang, and Duane Boning. 6.036 Introduction to Machine Learning. Fall 2020. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.

• https://ocw.mit.edu/courses/hst-953-collaborative-data-science-for-healthcare-fall-2020/ - Collaborative Data Science for Healthcare

• https://ocw.mit.edu/courses/15-062-data-mining-spring-2003/

Data Science Tools

1. Python (most widely known)

• Python is one of the most dominant languages in data science today because of its flexibility, easy syntax, open-source nature, and ability to handle, clean, manipulate, visualize, and analyze data.

• Python was developed as a general-purpose programming language. However, it offers a wide range of libraries, such as TensorFlow, Keras, PyTorch, and Seaborn, that are attractive to programmers and data scientists alike. Moreover, various other tools are connected to or built with the help of Python, such as Dask, SciPy, Cython, Matplotlib, and the High-Performance Analytics Toolkit (HPAT).

Java Vs Python

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Python Libraries

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

The Python Ecosystem

• NumPy: N-dimensional arrays, Matrices and Linear Algebra

• SciPy: Algorithms from linear algebra, optimization, statistics and signal processing

• Pandas: Data Manipulation and Analysis

• Matplotlib: Data Visualization

• IPython: Interactive shell for Python

• Scikit-learn: Machine Learning

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Anaconda

• A bundle of data science, machine learning and visualization libraries.

• Contains every library you'd need in this course.

• Easiest way to avoid inter-dependency issues.

• Installation

1. Go to: https://www.anaconda.com/distribution/

2. Download the installer for your OS and Python version of choice and follow the instructions.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Jupyter Notebook

• A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media

COLAB

• Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

• With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

PANDAS

• It is a library for data manipulation and analysis

• Data structures: Series and DataFrame (tabular data)

• Data is loaded in-memory, hence very fast (but not ideal for datasets larger than memory)

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.
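
The two data structures named above can be seen in a few lines. This is a minimal sketch that assumes pandas is installed; the column names are borrowed from the taxi-trip example for illustration and every value is made up.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
durations = pd.Series(
    [455, 663, 2124],
    index=["trip_a", "trip_b", "trip_c"],
    name="trip_duration",
)

# A DataFrame is a table whose columns are Series sharing one index.
trips = pd.DataFrame(
    {
        "vendor_id": [1, 2, 1, 2],
        "passenger_count": [1, 1, 2, 4],
        "trip_duration": [455, 663, 2124, 429],
    }
)

# Typical manipulations: group-wise aggregation and boolean filtering.
mean_by_vendor = trips.groupby("vendor_id")["trip_duration"].mean()
long_trips = trips[trips["trip_duration"] > 600]

print(durations.max())
print(mean_by_vendor)
print(len(long_trips))
```

Both operations run on data held in memory, which is why they are fast and also why the approach stops working once the table no longer fits in RAM.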

Data Science with Python

1. Data Acquisition: Beautiful Soup, lxml, Scrapy, Tweepy (obtaining data by spidering the web, etc.), PySpark (for large data), MySQL client, MongoDB
2. Pre-processing: domain-specific pre-processing techniques (text: NLTK; images: scikit-image, etc.)
3. Analysis/Modeling
    1. Exploratory Data Analysis: pandas
    2. Visualization: pylab, matplotlib, seaborn
    3. Modeling: numpy, scipy, sympy
    4. Hypothesis Testing: scipy, statsmodels
    5. Machine Learning: scikit-learn (sklearn)
4. Evaluation/Interpretation/Communication
    1. LaTeX in IPython
    2. Bokeh
    3. Flask

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Other Data Science Tools

• WEKA

• R (RStudio)

• MATLAB

• Statistical Analysis System (SAS)

• Apache Hadoop

• Tableau

• QlikView

• RapidMiner

• Excel

• PowerBI

• Google Analytics

and much more!

Source: Machine Learning Mastery: https://machinelearningmastery.com/wp-content/uploads/2021/03/MachineLearningAlgorithms.jpg?__s=hbkixgpvleicleslspeo&utm_source=drip&utm_medium=email&utm_campaign=MMLA+Mini-Course&utm_content=Machine+Learning+Algorithms+Mind-Map+and+Mini-Course

https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/

Parametric Approaches:

Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks

Non-Parametric Approaches:

k-Nearest Neighbors
Decision Trees like CART and C4.5
Support Vector Machines
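
The distinction between the two lists can be made concrete with one method from each. The sketch below is an illustration on a made-up, linearly separable 2-D data set: the perceptron (parametric) reduces the data to two weights and a bias, while k-nearest neighbors (non-parametric) keeps every training point and consults them at prediction time. Function names, hyperparameters, and data are all illustrative assumptions.

```python
# Tiny made-up 2-D training set with labels 0 or 1.
X = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
y = [0, 0, 0, 1, 1, 1]


def train_perceptron(X, y, epochs=20, lr=0.1):
    """Parametric: the model is two weights and a bias, whatever the data size."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in zip(X, y):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def perceptron_predict(model, point):
    """Classify a point using only the learned weights and bias."""
    w, b = model
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else 0


def knn_predict(X, y, point, k=3):
    """Non-parametric: keeps all training points and votes among the k closest."""
    by_distance = sorted(
        range(len(X)),
        key=lambda i: (X[i][0] - point[0]) ** 2 + (X[i][1] - point[1]) ** 2,
    )
    votes = [y[i] for i in by_distance[:k]]
    return max(set(votes), key=votes.count)


model = train_perceptron(X, y)
for point in [(1.2, 1.4), (5.8, 6.0)]:
    print(point, perceptron_predict(model, point), knn_predict(X, y, point))
```

After training, the perceptron no longer needs `X` and `y`, whereas `knn_predict` must be handed the full training set on every call; that storage difference is the practical meaning of parametric versus non-parametric here.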

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

Research problems at the intersection of big data and data science:

• Approaches to make the models learn with a smaller number of data samples

• Neural Machine Translation to Local languages

• Handling Data and Model drift for real-world applications

• Handling interpretability of deep learning models in real-time applications

• Building large scale generative based conversational systems

• Building context-sensitive large-scale systems

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems related to data engineering aspects:

• Lightweight Big Data analytics as a Service

• Auto conversion of algorithms to MapReduce problems

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The problems related to the core big data area of handling scale:

• Scalable architectures for parallel data processing

• Handling real-time video analytics in a distributed cloud

• Efficient graph processing at scale

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems to handle noise and uncertainty in the data:

• Identify fake news in near real-time

• Dimensionality reduction approaches for large scale data

• Training / Inference in noisy environments and incomplete data

• Handling uncertainty in big data processing

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems in the security and privacy area:

• Anomaly Detection in Very Large-Scale Systems

• Effective anonymization of sensitive fields in large-scale systems

• Secure federated learning with real-world applications

• Scalable privacy preservation on big data

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

Ten Research Challenge Areas In Data Science

1. Scientific Understanding of Learning, Especially Deep Learning Algorithms
2. Causal Reasoning
3. Precious Data
4. Multiple, Heterogeneous Data Sources
5. Inferring From Noisy and/or Incomplete Data
6. Trustworthy AI
7. Computing Systems for Data-Intensive Applications
8. Automating Front-End Stages of the Data Life Cycle
9. Privacy
10. Ethics

https://hdsr.mitpress.mit.edu/pub/d9j96ne4/release/3

ANY QUESTIONS?

Reach me @ [email protected]
LinkedIn: https://in.linkedin.com/in/dr-vani-vasudevan-0b89b713
