Data Science: October 2021
Data Science
1 author:
Chitra G Desai
National Defence Academy
All content following this page was uploaded by Chitra G Desai on 11 October 2021.
• Data Analysis is the process of inspecting, cleaning, transforming and modelling
data with the goal of discovering useful information, informing conclusions and
supporting decision-making.
• Statistical Learning theory deals with the problem of finding a predictive function
based on data.
• Data Mining is the process of discovering patterns in large data sets using
methods at the intersection of machine learning, statistics, and database
systems.
• Knowledge Discovery in Databases, or KDD for short, refers to the broad process
of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. The unifying goal of the KDD process is to extract
knowledge from data in the context of large databases.
• Pattern Discovery is the uncovering of patterns in massive data sets.
• Big Data refers to larger, more complex data sets, especially from new data sources.
These data sets are so voluminous that traditional data processing software
can't manage them.
Data…
• For example, Google processes 24 petabytes of data per day.
• Facebook processes ten million photos every hour.
• On YouTube, about one hour of video is uploaded every second.
• On Twitter, about 500 million tweets are posted per day.
• In astronomy, for example, satellite data runs to hundreds of
petabytes.
• It is estimated that by 2021, the digital universe will reach 74
zettabytes of data.
Big Data
• Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as, for example, the RDBMS.
• Data science involves using methods to analyze massive amounts of
data and extract the knowledge it contains.
• The relationship between big data and data science is like the
relationship between crude oil and an oil refinery.
• Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.
Vs of Big Data
• The characteristics of big data are often referred to as the five Vs:
• Volume: How much data is there?
• Variety: How diverse are the different types of data?
• Velocity: At what speed is new data generated?
• Veracity: How accurate is the data?
• Value: What is the value of the huge amount of data collected?
• The challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition,
big data calls for specialized techniques to extract insights.
Data Trail
Types of Data
• The raw ingredient for data science comes in the form of data.
• Data comes in different types and flavours.
• It could be numbers, text, click streams, graphs, tables, images,
transactions, videos, and sometimes all of the above.
Types of Data
• Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record.
Structured data is stored in tables within databases or Excel files.
• Semi-structured data
Semi-structured data is information that doesn't reside in a relational database but that does have
some organizational properties that make it easier to analyze.
Examples: CSV, XML and JSON documents are semi-structured, and NoSQL databases are
considered semi-structured.
• Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
Examples: regular email, a paragraph from a book with relevant
information, and social media comments and posts that need to be analyzed.
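As a rough illustration of the three categories above, the following Python sketch pulls an age value out of each kind of source. The table name `person`, the field names `name` and `age`, and all sample values are made up for illustration; they do not come from the slides.

```python
import json
import re
import sqlite3

# Structured: a database table with a fixed schema (every row has the same fields).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [("Asha", 31), ("Ravi", 45)])
structured_ages = [age for (age,) in conn.execute("SELECT age FROM person ORDER BY name")]

# Semi-structured: self-describing keys, but records may differ in shape.
json_text = '[{"name": "Asha", "age": 31}, {"name": "Ravi", "skills": ["sql"]}]'
records = json.loads(json_text)
semi_structured_ages = [record.get("age") for record in records]  # missing key gives None

# Unstructured: free text; any structure has to be inferred, here with a pattern.
email_text = "Hi team, Asha (31) will join the call with Ravi tomorrow."
ages_in_text = [int(match) for match in re.findall(r"\((\d+)\)", email_text)]

print(structured_ages)
print(semi_structured_ages)
print(ages_in_text)
```

The structured query can rely on every row having an `age` column, the JSON records have to be checked key by key, and the free text only yields a value because a pattern happened to match.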
Types of Data
• Graph-based or network data
Examples of graph-based data can be found on many social media websites. For instance, on
LinkedIn you can see who you know at which company. Your follower list on Twitter is another
example of graph-based data.
• Streaming data
Streaming data is data that is continuously generated by different sources. Such data should be
processed incrementally using stream processing techniques without having access to all of the
data.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
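To make "processed incrementally, without having access to all of the data" concrete, here is a minimal Python sketch that keeps a running mean over a stream. The `price_ticks` generator and its values are hypothetical stand-ins for a live feed such as stock prices; a real system would more likely use a dedicated stream processing framework.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one value at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values never need to be stored.
        mean += (value - mean) / count
        yield mean


def price_ticks() -> Iterator[float]:
    # Hypothetical source standing in for a live feed.
    yield from (10.0, 12.0, 11.0, 15.0)


means = list(running_mean(price_ticks()))
print(means)
```

Only the current count and mean are held in memory, so the same function works whether the stream has four values or never ends.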
Data Science
• The term data mining, coined in the mid-1990s, is tending to disappear
today.
• All this terminology has been unified under the data science
terminology, which is
Data Mining … Data Science
• 1996: data mining, obtaining useful information from data
• 2001: William S. Cleveland took data mining to another level
Web 2.0
• Big data (a world of possibilities: insight using data)
• Sophisticated data handling infrastructure
• Parallel computing technology
• MapReduce
• Spark
• Hadoop
• Massive unstructured data sets
• 2010: training machines using a data-driven approach rather than a
knowledge-driven approach
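The MapReduce idea listed above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using a word count; it is not the Hadoop or Spark API, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine the values of each key into one result.
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data", "data science", "big data science"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on different machines; the structure of the computation is the same.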
Data Science
• Data Science is the area of study which involves extracting insights
from vast amounts of data by the use of various scientific methods,
algorithms, and processes.
• It helps you to discover hidden patterns from the raw data.
• The term Data Science has emerged because of the evolution of
mathematical statistics, data analysis, and big data.
• Data Science is an interdisciplinary field that allows you to extract
knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a
research project and then translate it back into a practical solution.
Components of Data Science
• Statistics:
• Statistics is one of the most important components of data science.
• Statistics is a way to collect and analyze large amounts of numerical data
and to find meaningful insights in it.
• Domain Expertise:
• In data science, domain expertise binds data science together.
• Domain expertise means specialized knowledge or skills of a particular area.
• In data science, there are various areas for which we need domain experts.
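As a small illustration of the statistics component, the sketch below summarizes a list of numbers with Python's standard `statistics` module. The values in `observations` are arbitrary and chosen only for the example.

```python
import statistics

# Arbitrary sample values, used only to show the calls.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(observations),
    "median": statistics.median(observations),
    "population_stdev": statistics.pstdev(observations),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A summary like this is usually the first step before any modelling: it shows where the data is centred and how widely it spreads.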
Components of Data Science
• Data engineering:
• Data engineering is a part of data science, which involves acquiring, storing,
retrieving, and transforming the data.
• Data engineering also includes adding metadata (data about data) to the data.
• Visualization:
• Data visualization means representing data in a visual context so that people
can easily understand its significance.
• Data visualization makes large amounts of data easy to access in visual form.
• Advanced computing:
• Advanced computing does the heavy lifting of data science.
• Advanced computing involves designing, writing, debugging, and maintaining the
source code of computer programs.
Applications of Data Science:
• Transport:
• Self-driving cars
• Healthcare:
• Tumor detection, drug discovery, medical image analysis, virtual medical bots,
etc.
• Recommendation systems:
• Amazon, Netflix, Google Play, etc., are using data science technology for
making a better user experience with personalized recommendations.
• Risk detection:
• The finance industry faces fraud and the risk of losses; data
science can help detect and reduce these.
Tools for Data Science
Challenges of Data Science Technology
Data Information
Data Description
Missing Data
Redundant Data
Data Format
Data Visualization
EDA: Exploratory Data Analysis
• 1. Our first objective is to identify the range of years for which
data is available.
• For that, let us focus on the field datatime.
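One possible way to carry out step 1 is sketched below in Python. Only the field name `datatime` is taken from the slide; the timestamp format and the three rows are placeholder assumptions, not real records from the data set.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to whatever the real field contains.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Placeholder rows, not real records.
rows = [
    {"datatime": "1962-05-01 00:00:00"},
    {"datatime": "1945-01-01 00:00:00"},
    {"datatime": "1998-12-31 00:00:00"},
]

# Parse each timestamp and keep only the year.
years = [datetime.strptime(row["datatime"], TIMESTAMP_FORMAT).year for row in rows]
first_year, last_year = min(years), max(years)

print(f"Data covers {first_year} to {last_year}")
```

Parsing the field into a proper date before taking the year avoids surprises from string comparison when formats are inconsistent.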
2. Year-wise Explosion Data
Data Visualization
3. Identify the number of countries that conducted explosions from 1945 to 1998.
Also find the country that conducted the most explosions.
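Steps 2 and 3 both come down to counting rows per group. The Python sketch below shows one way to do that with `collections.Counter`. The column names `year` and `country` and every row are placeholders invented for the example; they do not reflect the actual data set.

```python
from collections import Counter

# Placeholder rows with assumed column names, not real records.
rows = [
    {"year": 1945, "country": "Country A"},
    {"year": 1962, "country": "Country A"},
    {"year": 1962, "country": "Country B"},
    {"year": 1998, "country": "Country C"},
    {"year": 1998, "country": "Country A"},
]

# Step 2: number of events per year.
per_year = Counter(row["year"] for row in rows)

# Step 3: number of distinct countries, and the country with the most events.
per_country = Counter(row["country"] for row in rows)
n_countries = len(per_country)
top_country, top_count = per_country.most_common(1)[0]

print(dict(per_year))
print(n_countries, top_country, top_count)
```

The per-year counts are also the natural input for a bar chart in the visualization step.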
Data products
• At the H2O World conference in the Bay Area on 11 November 2015,
Hilary Mason emphasized that the creation of “data products”
requires three components:
• data (of course)
• plus technical expertise (machine-learning)
• plus people and process (talent).
• Google Maps is a great example of a data product that epitomizes all
three of these qualities.
• Hilary Mason is an American data scientist and the founder of technology startup Fast Forward Labs as
well as Data Scientist in Residence at Accel Partners.
Fact
• Data is meaningless without context.
• People are natural analysts.
SPF 15 vs. SPF 30
Ways to represent…
Learning
AI and Deep
Learning