
Data Science: October 2021

This document provides an overview of data science and its key components. It discusses how data science evolved from earlier fields like data mining and statistics. The document outlines the different types of data, characteristics of big data, and components of data science like statistics, domain expertise, data engineering, visualization, and advanced computing. It also gives examples of applications of data science in areas like image recognition, gaming, and internet search. The document is intended to provide foundational knowledge about data science.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/355170843

Data Science

Presentation · October 2021


DOI: 10.13140/RG.2.2.17701.22240


1 author:

Chitra G Desai
National Defence Academy

All content following this page was uploaded by Chitra G Desai on 11 October 2021.



Foundations of Data Science
Dr Chitra Desai
Professor and Head
Faculty of Computational Science
Introduction
• Data Analysis
• Data Mining
• Statistical Learning
• Knowledge Discovery
• Pattern Discovery
• Big Data

All of these fall under the same umbrella: learning from data.

• Data Analysis is the process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making.
• Statistical Learning theory deals with the problem of finding a predictive function based on data.
• Data Mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.
• Knowledge Discovery in Databases (KDD) refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods. The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.
• Pattern Discovery is uncovering patterns from massive data sets.
• Big Data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software cannot manage them.
Data…
• For example, Google processes 24 petabytes of data per day.
• Facebook processes ten million photos every hour.
• On YouTube, about one hour of video is uploaded every second.
• Twitter handles about 500 million tweets per day.
• In astronomy, satellite data runs to hundreds of petabytes.
• It is estimated that by 2021 the digital universe will reach 74 zettabytes of data.
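To compare these figures, it helps to convert them to a common time unit. The following is a minimal sketch that uses only the numbers quoted above; it assumes decimal units (1 petabyte = 10^15 bytes), which the slide does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figures quoted in the list above.
google_bytes_per_day = 24 * 10**15   # 24 petabytes; assumes 1 PB = 10**15 bytes
tweets_per_day = 500_000_000
video_hours_uploaded_per_second = 1

# Convert each figure to the other time unit.
google_gb_per_second = google_bytes_per_day / SECONDS_PER_DAY / 10**9
tweets_per_second = tweets_per_day / SECONDS_PER_DAY
video_hours_per_day = video_hours_uploaded_per_second * SECONDS_PER_DAY

print(f"Google: about {google_gb_per_second:.0f} GB per second")
print(f"Twitter: about {tweets_per_second:.0f} tweets per second")
print(f"YouTube: {video_hours_per_day} hours of video per day")
```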

Big Data
• Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as, for example, the RDBMS.
• Data science involves using methods to analyze massive amounts of
data and extract the knowledge it contains.
• The relationship between big data and data science is like the relationship between crude oil and an oil refinery.
• Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.

Vs of Big Data
• The characteristics of big data are often referred to as the five Vs:
• Volume: How much data is there?
• Variety: How diverse are the different types of data?
• Velocity: At what speed is new data generated?
• Veracity: How accurate is the data?
• Value: What is the value of the huge amount of data collected?

• The challenges they bring can be felt in almost every aspect: data capture, curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized techniques to extract insights.

Data Trail

• The fact is that today we are “datafied”.

• Wherever we go, we leave a trail of data.

• Smartphones for example, are tracking our locations.

• We leave a data trail in our web browsing.

• We also interact a lot today with social networks, leaving behind photos, comments, and so on.

Types of Data
• The raw ingredient for data science comes in the form of data.
• Data comes in different types and flavours. It could be:
• text,
• numbers,
• click streams,
• graphs,
• tables,
• images,
• transactions,
• videos,
• and sometimes all of the above.

Types of Data
• Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record. It is typically stored in tables within databases or Excel files.
• Semi-structured data
Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze.
Examples: CSV, XML and JSON documents are semi-structured, and NoSQL databases are considered semi-structured.
• Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.

Examples: regular email, a paragraph from a book with relevant information, and social media comments and posts that need to be analyzed.
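The three categories differ mainly in how much work is needed before the data can be queried. The sketch below illustrates this with Python's standard library only; the table name, field names and sample values are all hypothetical and exist purely for illustration.

```python
import json
import sqlite3

# Structured: a database table with fixed fields (hypothetical sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Asha", 34), (2, "Ravi", 29)],
)
average_age = conn.execute("SELECT AVG(age) FROM people").fetchone()[0]
conn.close()

# Semi-structured: keys describe the data, but records may differ in shape.
json_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "email": "r@example.com"}]'
records = json.loads(json_text)
keys_per_record = [sorted(record) for record in records]

# Unstructured: free text with no data model; any structure must be inferred.
email_body = "Hi team, the report is late. Please send the report by Friday."
word_count = len(email_body.split())
mentions_report = email_body.lower().count("report")

print(average_age)
print(keys_per_record)
print(word_count, mentions_report)
```

The structured table can be queried directly, the JSON records must first be inspected for which keys they carry, and the email text only yields counts after ad hoc processing.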
Types of Data
• Graph-based or network data

Examples of graph-based data can be found on many social media websites. For instance, on
LinkedIn you can see who you know at which company. Your follower list on Twitter is another
example of graph-based data.

• Streaming data

Streaming data is data that is continuously generated by different sources. Such data should be
processed incrementally using stream processing techniques without having access to all of the
data.

Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
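Processing a stream incrementally means updating a result as each value arrives, without storing the whole history. The following is a minimal sketch of a running mean; the `price_ticks` generator is a hypothetical stand-in for a live feed, with made-up values.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one value at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no earlier values need to be kept in memory.
        mean += (value - mean) / count
        yield mean


def price_ticks() -> Iterator[float]:
    """Hypothetical stand-in for a live feed such as stock prices."""
    yield from [100.0, 102.0, 101.0, 105.0]


means = list(running_mean(price_ticks()))
print(means)
```

Only the count and the current mean are held in memory, so the same function would work on a feed that never ends.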

Data Science
• The term data mining, coined in the mid-1990s, is tending to disappear today.

• Today we talk more about data science.

• All this terminology has been unified under the term data science, which asks:

How to do science with data?

Data Mining … Data Science
• 1996: data mining, obtaining useful information from data
• 2001: William S. Cleveland took data mining to another level

Computer Science + Data Mining = Data Science

• Solve real company problems using data
• Talk about what industry wants
• Improve products based on the data input

Web 2.0
• Big data (a world of possible insights using data)
• Sophisticated data handling infrastructure
• Parallel computing technology
• MapReduce
• Spark
• Hadoop
• Massive unstructured data sets
• 2010: train machines using a data-driven approach rather than a knowledge-driven approach
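The MapReduce idea named in the list can be illustrated with a word count. This sketch runs on a single machine in plain Python and does not use the Hadoop or Spark APIs; the input documents are hypothetical. Frameworks such as Hadoop apply the same map, shuffle and reduce pattern across many machines.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data big insight", "data science uses data"]  # hypothetical input


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine the values for one key into a single count."""
    return reduce(lambda total, value: total + value, values, 0)


mapped = [pair for document in documents for pair in map_phase(document)]
word_counts = {word: reduce_phase(values) for word, values in shuffle(mapped).items()}
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be run in parallel.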

Data Science
• Data Science is the area of study which involves extracting insights
from vast amounts of data by the use of various scientific methods,
algorithms, and processes.
• It helps you to discover hidden patterns from the raw data.
• The term Data Science has emerged because of the evolution of
mathematical statistics, data analysis, and big data.
• Data Science is an interdisciplinary field that allows you to extract
knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a
research project and then translate it back into a practical solution.

Components of Data Science

Components of Data Science
• Statistics:
• Statistics is one of the most important components of data science.
• Statistics is a way to collect and analyze large amounts of numerical data and to find meaningful insights in it.
• Domain Expertise:
• Domain expertise binds data science together.
• Domain expertise means specialized knowledge or skills in a particular area.
• In data science, there are various areas for which we need domain experts.

Components of Data Science
• Data engineering:
• Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming the data.
• Data engineering also includes adding metadata (data about data) to the data.
• Visualization:
• Data visualization means representing data in a visual context so that people can easily understand its significance.
• Data visualization makes huge amounts of data easy to access in visual form.
• Advanced computing:
• Advanced computing does the heavy lifting of data science.
• Advanced computing involves designing, writing, debugging, and maintaining the source code of computer programs.

Components of Data Science

Applications of Data Science:

• Image recognition and speech recognition:
• Photo tagging – Facebook
• Device response – Siri, Cortana
• Gaming world:
• EA Sports, Sony and Nintendo widely use data science to enhance the user experience.
• Internet Search:
• Search engines use data science to improve the search experience and return results in a fraction of a second.

Applications of Data Science:
• Transport:
• Self-driving cars
• Healthcare:
• Tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
• Recommendation systems:
• Amazon, Netflix, Google Play, etc. use data science to improve the user experience with personalized recommendations.
• Risk detection:
• Finance industries face fraud and the risk of losses; data science helps to detect and reduce them.

Tools for Data Science

• Following are some tools required for data science:

• Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
• Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML Studio.

Challenges of Data Science Technology

• A high variety of information and data is required for accurate analysis
• An adequate data science talent pool is not available
• Management does not provide financial support for a data science team
• Unavailability of, or difficult access to, data
• Data science results are not effectively used by business decision makers
• Explaining data science to others is difficult
• Privacy issues
• Lack of significant domain expertise
• A very small organization cannot have a data science team
About Data
• The data covers nuclear explosions during a specific time period. The file nw was obtained from the data.world repository https://data.world/datasets/military.
• The data is analyzed here to apply the concepts of Exploratory Data Analysis and to apply a machine learning algorithm using Python.

Data Information

Data Description

Missing Data
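A check for missing data can be sketched as a count of empty cells per column. The example below uses only the standard library; the sample rows stand in for the nw file, the column names other than datetime are assumptions, and the values are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the nw file; column names other than
# datetime are assumptions, and the values are made up for illustration.
sample_csv = """datetime,country,yield_kt
1945-07-16 11:29:21,USA,21
1949-08-29 01:00:00,USSR,
1952-10-03 00:00:00,,25
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count empty cells per column.
missing = Counter()
for row in rows:
    for column, value in row.items():
        if value is None or value.strip() == "":
            missing[column] += 1

for column in rows[0]:
    print(column, missing[column])
```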

Redundant Data
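Redundant data can be detected by looking for rows in which every field repeats an earlier row. The following is a minimal sketch under the assumption that a duplicate means an exact match on all columns; the sample rows and the country column name are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the third row repeats the first exactly.
sample_csv = """datetime,country
1945-07-16 11:29:21,USA
1949-08-29 01:00:00,USSR
1945-07-16 11:29:21,USA
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

seen = set()
unique_rows = []
duplicate_count = 0
for row in rows:
    key = tuple(row.items())  # a row is redundant if every field matches
    if key in seen:
        duplicate_count += 1
    else:
        seen.add(key)
        unique_rows.append(row)

print(duplicate_count, len(unique_rows))
```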

Data Format

Data Visualization

EDA: Exploratory Data Analysis
• 1. Our first objective is to identify from which year to which year data is available.
• For that, let us focus on the field datetime.

We see that the datetime field consists of year, month, date and time. Our objective for now is to focus only on the year. Let us extract the year from the field datetime into a new column, year.
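Extracting the year can be sketched as follows. The records and the timestamp format (`%Y-%m-%d %H:%M:%S`) are assumptions made for illustration; the actual file may store datetime differently.

```python
from datetime import datetime

# Hypothetical records; the timestamp format is an assumption.
records = [
    {"datetime": "1945-07-16 11:29:21"},
    {"datetime": "1998-05-30 06:54:57"},
]

for record in records:
    parsed = datetime.strptime(record["datetime"], "%Y-%m-%d %H:%M:%S")
    record["year"] = parsed.year  # new column holding only the year

years = [record["year"] for record in records]
print(min(years), max(years))
```

The minimum and maximum of the new year column give the period the data covers.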
Observation 1
• The data covers the period 1945 to 1998.
• No explosions were carried out during 1947, 1950, 1959 and 1997.
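Both observations can be derived from the year column: the covered period is its minimum and maximum, and the gaps are the years in that range that never appear. A minimal sketch, using a short hypothetical list of years rather than the real data:

```python
# Hypothetical year column; a real run would use the years extracted from datetime.
years = [1945, 1945, 1946, 1948, 1949, 1951]

observed = set(years)
first_year, last_year = min(observed), max(observed)

# Every year in the covered range that never appears in the data.
years_without_records = [
    year for year in range(first_year, last_year + 1) if year not in observed
]
print(first_year, last_year, years_without_records)
```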

2. Year-wise Explosion Data
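Year-wise counts can be sketched by tallying the year column. The years below are hypothetical sample values, and the text bars are only a stand-in for a proper chart.

```python
from collections import Counter

years = [1945, 1945, 1945, 1946, 1946, 1948]  # hypothetical year column

explosions_per_year = Counter(years)

# Print one line per year, with a simple text bar as a rough visual.
for year in sorted(explosions_per_year):
    count = explosions_per_year[year]
    print(year, count, "#" * count)
```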

Data Visualization

3. Identify the number of countries that conducted explosions from 1945 to 1998. Also find the country that conducted the most explosions.
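Both questions can be answered from a single tally per country. This is a minimal sketch; the country column name and the sample values are assumptions, not taken from the actual file.

```python
from collections import Counter

# Hypothetical country column; one entry per recorded explosion.
countries = ["USA", "USSR", "USA", "UK", "USA", "USSR"]

explosions_per_country = Counter(countries)
number_of_countries = len(explosions_per_country)

# most_common(1) returns a list with the single highest (country, count) pair.
top_country, top_count = explosions_per_country.most_common(1)[0]
print(number_of_countries, top_country, top_count)
```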

Data products
• At the H2O World conference in the Bay Area on 11 November 2015, Hilary Mason emphasized that the creation of “data products” requires three components:
• data (of course)
• plus technical expertise (machine learning)
• plus people and process (talent).
• Google Maps is a great example of a data product that epitomizes all three of these components.

• Hilary Mason is an American data scientist and the founder of technology startup Fast Forward Labs, as well as Data Scientist in Residence at Accel Partners.

Fact
• Data is meaningless without context
• People are natural analysts

SPF 15 Vs SPF 30

Ways to represent…

Learning

Data scientist – Interpret and Guide


Data Science Hierarchy of Needs

• AI and Deep Learning
• Learn / Optimize: A/B Testing, Experimentation, Simple ML problems
• Aggregate / Label: Analytics, Metric, Segments, Aggregates, Features, Training Data
• Explore / Transform: Cleaning, Anomaly Detection, Prep
• Move / Store: Reliable data flow, Infrastructure, ETL, Structured and unstructured data storage
• Collect: Instrumentation, Logging, Sensors, External data, User Generated Contents

Questions?
Thank You
