Data Science From A Research Perspective


Dr. Vani Vasudevan, Professor – CSE,
Nitte Meenakshi Institute of Technology, Bangalore.
11/12/22

Some Quotes!!!

• The purpose of computing is insight, not numbers. – Richard W. Hamming (Data Science)

• A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. – Josh Blumenstock (Mathematics)

• On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” . . . I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. – Charles Babbage (Data Wrangling)

• Money is a scoreboard where you can rank how you’re doing against other people. – Mark Cuban (Measures)

• It is easy to lie with statistics, but easier to lie without them. (Statistical Analysis)

• At their best, graphics are instruments for reasoning. – Edward Tufte (Data Visualization)

Some More Quotes!!!

• All models are wrong, but some models are useful. – George Box (Mathematical Models)

• Any sufficiently advanced form of cheating is indistinguishable from learning. – Jan Schaumann (Machine Learning)

• A change in quantity also entails a change in quality. – Friedrich Engels (Big Data)
https://www.internetlivestats.com/

Data Science…

Data science is an emerging field that

(1) is extremely transdisciplinary – bridging between the theoretical, computational, experimental, and biosocial areas;

(2) deals with enormous amounts of complex, incongruent, and dynamic data from multiple sources; and

Source: Ivo D. Dinov, “Data Science and Predictive Analytics: Biomedical and Health Applications using R”, Springer 2018.

Data Science

(3) aims to develop algorithms, methods, tools, and services capable of ingesting such datasets and generating semiautomated decision support systems.

The latter can mine the data for patterns or motifs, predict expected outcomes, suggest clustering or labeling of retrospective or prospective observations, compute data signatures or fingerprints, extract valuable information, and offer evidence-based actionable knowledge.

Data science techniques often involve data manipulation (wrangling), data harmonization and aggregation, exploratory or confirmatory data analyses, predictive analytics, validation, and fine-tuning.

Source: Ivo D. Dinov, “Data Science and Predictive Analytics: Biomedical and Health Applications using R”, Springer 2018.

What is Data Science?

Like any emerging field, it isn’t yet well defined, but incorporates elements of:

• Exploratory Data Analysis and Visualization

• Machine Learning and Statistics

• High-Performance Computing technologies for dealing with scale.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

SKILL SETS FOR DATA SCIENCE

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Some more resources:
https://towardsdatascience.com/introduction-to-statistics-e9d72d818745
https://deepai.org/machine-learning-glossary-and-terms/data-science
https://www.devopsschool.com/blog/what-is-data-science-advantages-and-disadvantages-of-data-science/

Appreciating Data

• Computer Scientists do not naturally appreciate data: it’s just stuff to run through a program.

• The usual way to test algorithm performance is to run the implementation on “random data”.

• But interesting data sets are a scarce resource, which requires hard work and imagination to obtain.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Computer Vs. Real Scientists

• Scientists strive to understand the complicated and messy natural world, while computer scientists build their own clean and organized virtual worlds. Thus:

• Nothing is ever completely true or false in science, while everything is either true or false in Computer Science / Mathematics.

• Scientists are data-driven, while computer scientists are algorithm-driven.

• Scientists obsess about discovering things, while computer scientists invent rather than discover.

• Scientists are comfortable with the idea that data has errors; computer scientists are not.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Genius Vs. Wisdom

• Software developers are hired to produce code.

• Data Scientists are hired to produce insights.

• Genius shows in finding the right answer!!!

• Wisdom shows in avoiding the wrong answers.

Data science (like most things) benefits more from wisdom than from genius.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Developing Wisdom

• Wisdom comes from experience.

• Wisdom comes from general knowledge.

• Wisdom comes from listening to others.

• Wisdom comes from humility, observing how often you have been wrong and why/how.

I seek to pass on wisdom through my experience of how difficult it is to make good predictions.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Developing Curiosity

• The good data scientist develops a curiosity about the domain/application they are working in.

• They talk shop with the people whose data they are working on.

• They read the newspaper every day, to get a broader perspective on the world.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Asking GOOD QUESTIONS:

Software developers are not encouraged to ask questions, but data scientists are:

• What exciting things might you be able to learn from a given data set?

• What things do you/your people really want to know?

• What data sets might get you there?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

LET’S PRACTICE ASKING QUESTIONS!

• Ask Who, What, Where, When, and Why of the following datasets:

1. Internet Movie Database (IMDb)

2. New York City Taxi Trip Duration

IMDb: Movie Data

IMDb: Actor Data

Movie Questions

• Can we predict how well people will like a movie? What about its gross?

• What does the social network of actors look like?

• What is the age distribution of actors and actresses in film?

• Do stars live longer or shorter lives than bit players or the general public?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

NYC Taxi Trip Data

• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data

NYC Taxi Trip Questions

• https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data

• How far do taxis travel?

• How much slower is traffic during rush hour?

• Where are people traveling to/from at different times of the day?

• Where should drivers go to pick up their next fare?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

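The first two taxi questions can be explored with very little code. The sketch below is a minimal illustration, assuming trip records with pickup/dropoff coordinates, a pickup timestamp, and a duration in seconds. The field names are assumed to mirror the Kaggle columns, the three sample trips are invented, and the rush-hour window is an arbitrary choice. Distance is the straight-line (haversine) distance, which underestimates the distance actually driven.

```python
import math
from collections import defaultdict
from datetime import datetime

# Invented sample trips; field names are assumed to match the Kaggle columns.
TRIPS = [
    {"pickup_datetime": "2016-03-14 08:15:00", "trip_duration": 600,
     "pickup_latitude": 40.7680, "pickup_longitude": -73.9822,
     "dropoff_latitude": 40.7656, "dropoff_longitude": -73.9646},
    {"pickup_datetime": "2016-03-14 17:40:00", "trip_duration": 600,
     "pickup_latitude": 40.7680, "pickup_longitude": -73.9822,
     "dropoff_latitude": 40.7656, "dropoff_longitude": -73.9646},
    {"pickup_datetime": "2016-03-14 23:05:00", "trip_duration": 300,
     "pickup_latitude": 40.7680, "pickup_longitude": -73.9822,
     "dropoff_latitude": 40.7656, "dropoff_longitude": -73.9646},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def trip_speed_kmh(trip):
    """Average straight-line speed of one trip in km/h."""
    dist = haversine_km(trip["pickup_latitude"], trip["pickup_longitude"],
                        trip["dropoff_latitude"], trip["dropoff_longitude"])
    return dist / (trip["trip_duration"] / 3600)


def mean_speed_by_period(trips, rush_hours=frozenset({7, 8, 9, 16, 17, 18})):
    """Mean speed for trips starting in rush hours vs. all other hours."""
    speeds = defaultdict(list)
    for trip in trips:
        hour = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S").hour
        period = "rush" if hour in rush_hours else "off-peak"
        speeds[period].append(trip_speed_kmh(trip))
    return {period: sum(v) / len(v) for period, v in speeds.items()}


print(mean_speed_by_period(TRIPS))
```

On the real data set the same two functions would be applied to every row, and the ratio of the two mean speeds gives a first rough answer to "how much slower is traffic during rush hour?".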
Properties Of Data

• Structured vs. Unstructured Data

• Quantitative vs. Categorical Data

• Big Data vs. Little Data

• Do not blindly aspire to analyze large data sets. Seek the right data to answer a given question, not necessarily the biggest thing you can get your hands on.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

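The quantitative vs. categorical distinction matters in practice because it decides which summaries make sense. A minimal sketch, assuming two invented columns of movie data (`runtimes_min` and `genres` are hypothetical names): numbers get a mean and median, labels get frequency counts.

```python
from collections import Counter
from statistics import mean, median


def summarize(values):
    """Numeric summary for quantitative data, frequency counts for categorical data."""
    is_numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    if is_numeric:
        return {"kind": "quantitative", "mean": mean(values), "median": median(values)}
    return {"kind": "categorical", "counts": dict(Counter(values))}


# Invented example columns.
runtimes_min = [95, 120, 142, 88]
genres = ["comedy", "drama", "comedy", "animation"]

print(summarize(runtimes_min))
print(summarize(genres))
```

Averaging genre labels would be meaningless, which is why the check on the value type comes first.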
Classification And Regression

• Two types of problems arise repeatedly in traditional data science and pattern recognition applications: the challenges of classification and regression.

Classification: Often, we seek to assign a label to an item from a discrete set of possibilities. Such problems as predicting the winner of a particular sporting contest (team A or team B?) or deciding the genre of a given movie (comedy, drama, or animation?) are classification problems, since each entails selecting a label from the possible choices.

Classification And Regression

Regression: Another common task is to forecast a given numerical quantity. Predicting a person's weight or how much rain we will get this year is a regression problem, where we forecast the future value of a numerical function in terms of previous values and other relevant features.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Classification And Regression

• The best way to see the intended distinction is to look at a variety of data science problems and label (classify) them as regression or classification.

• Different algorithmic methods are used to solve these two types of problems.

1. Will the price of a particular stock be higher or lower tomorrow?

2. What will the price of a particular stock be tomorrow?

3. Is this person a good risk to sell an insurance policy to?

4. How long do we expect this person to live?

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

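The two stock questions show the difference in the kind of output: question 2 asks for a number (regression), question 1 for a label from a discrete set (classification). The sketch below is a toy illustration on an invented price series. It fits a least-squares line to forecast the next price, and then, purely to contrast the two output types, derives the label by thresholding that forecast. In practice a dedicated classification method would normally be used for the label.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_price(prices):
    """Regression: forecast the next numeric value of the series."""
    days = list(range(len(prices)))
    slope, intercept = fit_line(days, prices)
    return slope * len(prices) + intercept


def predict_direction(prices):
    """Classification: return one label from the set {"higher", "lower"}."""
    return "higher" if predict_price(prices) > prices[-1] else "lower"


prices = [100.0, 101.0, 102.0, 103.0]  # invented daily closing prices
print(predict_price(prices))      # a number
print(predict_direction(prices))  # a label
```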
PRACTICE QUESTION

Identifying Data Sets
1. Identify where interesting data sets relevant to the following domains can be found on the web:

(a) Books.

(c) Stock prices.

(d) Risks of diseases.

(e) Colleges and universities.

(f) Crime rates.

For each of these data sources, explain what you must do to turn this data into a usable format on your computer for analysis.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

The Data Science Pipeline

1. Get or collect data

2. Manipulate and process data

3. Modeling and analysis

4. Visualize, evaluate, present, and communicate

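The four pipeline stages can be sketched as four small functions chained together. This is only an illustrative outline using the Python standard library. The CSV content, the column names, and the choice of "mean rating per year" as the analysis are all invented for the example.

```python
import csv
import io
from statistics import mean

# Invented raw data; in practice this would be downloaded or read from a file.
RAW_CSV = """title,year,rating
Movie A,2001,7.5
Movie B,1999,
Movie C,2010,8.1
Movie D,2010,6.4
"""


def get_data(text):
    """Stage 1: get or collect data (here: parse CSV text into dict rows)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Stage 2: manipulate and process (drop rows with no rating, convert types)."""
    cleaned = []
    for row in rows:
        if not row["rating"]:
            continue
        cleaned.append({"title": row["title"],
                        "year": int(row["year"]),
                        "rating": float(row["rating"])})
    return cleaned


def analyze(rows):
    """Stage 3: modeling and analysis (here: mean rating per year)."""
    by_year = {}
    for row in rows:
        by_year.setdefault(row["year"], []).append(row["rating"])
    return {year: mean(ratings) for year, ratings in by_year.items()}


def report(summary):
    """Stage 4: present and communicate (here: a plain-text table)."""
    return "\n".join(f"{year}: mean rating {avg:.2f}"
                     for year, avg in sorted(summary.items()))


summary = analyze(clean(get_data(RAW_CSV)))
print(report(summary))
```

Keeping each stage as a separate function makes it easy to swap in a real data source or a real model later without touching the other stages.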
Some Useful Web Resources to Kick Start Your Learning And Research!

• https://cognitiveclass.ai/ - Data Science and Cognitive Computing Courses

• https://www.kdnuggets.com/ - Site on AI, Analytics, Big Data, Data Mining, Data Science, and ML

• https://www.kaggle.com/ - ML & DS community

• https://data.gov/ - US government data

• http://archive.ics.uci.edu/ml/index.php - ML Repository

• https://homepages.ecs.vuw.ac.nz/~marslast/MLbook.html - Stephen Marsland homepage

• https://www.cs.waikato.ac.nz/ml/weka/courses.html - Waikato University - Weka MOOC

• https://nptel.ac.in/courses/106/106/106106202/ - NPTEL - Machine Learning

• Rohit Singh, Tommi Jaakkola, and Ali Mohammad. 6.867 Machine Learning. Fall 2006. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.

• Leslie Kaelbling, Tomás Lozano-Pérez, Isaac Chuang, and Duane Boning. 6.036 Introduction to Machine Learning. Fall 2020. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu.

• https://ocw.mit.edu/courses/hst-953-collaborative-data-science-for-healthcare-fall-2020/ - Collaborative Data Science for Healthcare

• https://ocw.mit.edu/courses/15-062-data-mining-spring-2003/

Data Science Tools

1. Python (most widely known)

• Python is one of the most dominant languages in data science today because of its flexibility, easy-to-read syntax, open-source nature, and ability to handle, clean, manipulate, visualize, and analyze data.

• Python was developed as a general-purpose programming language. However, it offers a wide range of libraries, such as TensorFlow, Keras, PyTorch, and Seaborn, that are attractive to programmers and data scientists alike. Moreover, various other tools are connected to or built with Python, such as Dask, SciPy, Cython, Matplotlib, and the High-Performance Analytics Toolkit (HPAT).

Java Vs Python

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Python Libraries

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

The Python Ecosystem

• NumPy: N-dimensional arrays, matrices, and linear algebra

• SciPy: Algorithms from linear algebra, optimization, statistics, and signal processing

• Pandas: Data manipulation and analysis

• Matplotlib: Data visualization

• IPython: Interactive shell for Python

• Scikit-learn: Machine learning

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Anaconda

• A bundle of data science, machine learning, and visualization libraries.

• Contains every library you'd need in this course.

• Easiest way to avoid inter-dependency issues.

• Installation

1. Go to: https://www.anaconda.com/distribution/

2. Download the installer for your OS and Python version of choice and follow the instructions.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Jupyter Notebook

• A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media

COLAB

• Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

• With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

PANDAS

• It is a library for data manipulation and analysis

• Data structures: Series and DataFrame (tabular data)

• Data is loaded into memory, which makes operations fast (but it is not ideal for data sets larger than the available memory)

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Data Science with Python

1. Data Acquisition: Beautiful Soup, lxml, Scrapy, Tweepy (obtaining data by spidering the web, etc.), PySpark (for large data), MySQL client, MongoDB
2. Pre-processing: Domain-specific pre-processing techniques (text: NLTK; images: scikit-image, etc.)
3. Analysis/Modeling
   1. Exploratory Data Analysis: Pandas
   2. Visualization: pylab, matplotlib, seaborn
   3. Modeling: numpy, scipy, sympy
   4. Hypothesis Testing: scipy, statsmodels
   5. Machine Learning: sklearn
4. Evaluation/Interpretation/Communication
   1. LaTeX in IPython
   2. Bokeh
   3. Flask

source: Steven S. Skiena, “The Data Science Design Manual”, Springer 2017.

Other Data Science Tools

• WEKA

• R (RStudio)

• MATLAB

• Statistical Analysis System (SAS)

• Apache Hadoop

• Tableau

• QlikView

• RapidMiner

• Excel

• Power BI

• Google Analytics

and much more!

Source: Machine Learning Mastery: https://machinelearningmastery.com/wp-content/uploads/2021/03/MachineLearningAlgorithms.jpg

https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/

Parametric Approaches:

Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks

Nonparametric Approaches:

k-Nearest Neighbors
Decision Trees like CART and C4.5
Support Vector Machines

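One way to see the parametric/nonparametric split: a parametric model compresses the training data into a fixed number of parameters, while a nonparametric model keeps the training data and consults it at prediction time. The sketch below contrasts the two on invented 2-D points. The nearest-centroid classifier is used here only as a very simple parametric stand-in (it is not one of the methods listed above); k-Nearest Neighbors is taken from the nonparametric list.

```python
import math
from collections import Counter

# Invented 2-D training points with class labels.
TRAIN = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
         ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b")]


def fit_centroids(train):
    """Parametric: reduce each class to one mean point, whatever the data size."""
    sums, counts = {}, Counter()
    for point, label in train:
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] += 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict_centroid(centroids, point):
    """Predict the class whose centroid is closest; training data is no longer needed."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))


def predict_knn(train, point, k=3):
    """Nonparametric: keep all training data and vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


centroids = fit_centroids(TRAIN)
print(predict_centroid(centroids, (2.0, 2.0)))
print(predict_knn(TRAIN, (6.0, 7.0)))
```

Note that `predict_centroid` only needs the two stored centroids, whereas `predict_knn` needs the whole of `TRAIN`, so its memory and prediction cost grow with the data.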
RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

Research problems at the intersection of big data and data science:

• Approaches to make models learn from a smaller number of data samples

• Neural machine translation to local languages

• Handling data and model drift for real-world applications

• Handling interpretability of deep learning models in real-time applications

• Building large-scale generative conversational systems

• Building context-sensitive large-scale systems

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems related to data engineering aspects:

• Lightweight Big Data analytics as a Service

• Auto conversion of algorithms to MapReduce problems

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The problems related to the core big data area of handling scale:

• Scalable architectures for parallel data processing

• Handling real-time video analytics in a distributed cloud

• Efficient graph processing at scale

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems to handle noise and uncertainty in the data:

• Identifying fake news in near real-time

• Dimensionality reduction approaches for large-scale data

• Training / inference in noisy environments and with incomplete data

• Handling uncertainty in big data processing

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

RESEARCH PROBLEMS IN DATA SCIENCE AND BIG DATA

The research problems in the security and privacy area:

• Anomaly detection in very large-scale systems

• Effective anonymization of sensitive fields in large-scale systems

• Secure federated learning with real-world applications

• Scalable privacy preservation on big data

https://towardsdatascience.com/top-20-latest-research-problems-in-big-data-and-data-science-c6fb51e03136

Ten Research Challenge Areas In Data Science

1. Scientific Understanding of Learning, Especially Deep Learning Algorithms
2. Causal Reasoning
3. Precious Data
4. Multiple, Heterogeneous Data Sources
5. Inferring From Noisy and/or Incomplete Data
6. Trustworthy AI
7. Computing Systems for Data-Intensive Applications
8. Automating Front-End Stages of the Data Life Cycle
9. Privacy
10. Ethics

https://hdsr.mitpress.mit.edu/pub/d9j96ne4/release/3

ANY QUESTIONS?

Reach me @ [email protected]
LinkedIn: https://in.linkedin.com/in/dr-vani-vasudevan-0b89b713
