Introduction To Data Science

Data science involves extracting meaningful insights from raw data using scientific methods, technologies, and algorithms. It involves asking the right questions, modeling data using complex algorithms, visualizing data, and understanding data to make better decisions. Tools used for data science include Python, R, SQL, and machine learning tools. Data science has applications in transportation, finance, e-commerce, healthcare, gaming, and logistics to optimize processes, detect patterns, and make predictions.

Uploaded by Manak Jain
Introduction to Data Science

What is Data Science?


• Data science is the in-depth study of massive amounts of data: extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
• “Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.”
What is Data Science?
• In short, we can say that data science is all about:
• Asking the right questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and find the final result.
Data Science Components
Tools for Data Science

Following are some tools required for data science:


Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
Data visualization tools: R, Jupyter, Tableau, Cognos.
Machine learning tools: Spark, Mahout, Azure ML Studio.
Applications of Data Science
• In Transport - Data science has also entered the transport field, for example with driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.

• In Finance - Data science plays a key role in the financial industry, which constantly faces the problem of fraud and the risk of losses. Financial firms therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company.

• In E-Commerce - E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience with personalized recommendations.
Applications of Data Science
• In Health Care - In the healthcare industry, data science acts as a boon. It is used for:
• Tumor detection.
• Drug discovery.
• Medical image analysis.
• Virtual medical bots.
• Genetics and genomics.
• Predictive modeling for diagnosis, etc.

• Image Recognition
• Targeted Recommendations - Data science helps companies that are paying for advertisements for their mobile apps.
Applications of Data Science
• Data Science in Gaming
• Medicine and Drug Development
• In Delivery Logistics - Logistics companies like DHL, FedEx, etc. make use of data science to find the best route for shipping their products, the best time for delivery, the best mode of transport to reach the destination, etc.
• Autocomplete - The autocomplete feature is an important application of data science: the user types just a few letters or words, and the rest of the line is completed automatically.
The Relationship between Data Science and Information Science
• Data science is the discovery of knowledge or actionable information in
data.
• Information science is the design of practices for storing and retrieving
information.

For example,
• The number “480,000” is a data point. But when we add the explanation that it represents the number of deaths per year in the USA from cigarette smoking, it becomes information. In many real-world scenarios, however, the distinction between a meaningful and a meaningless data point is not clear enough for us to differentiate data from information.
Data science
• Data science is used in business functions such as strategy formation, decision making, and operational processes. It touches on practices such as artificial intelligence, analytics, predictive analytics, and algorithm design. At its core is the discovery of knowledge and actionable information in data. Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Information Science
• The field of information science, which often stems from computing, computational science, informatics, information technology, or library science, often represents and serves such application areas. The core idea is to cover people studying, accessing, using, and producing information in various contexts.
Business Intelligence versus Data Science

1. Concept - Data science is a field that uses mathematics, statistics, and various other tools to discover hidden patterns in the data. Business intelligence is basically a set of technologies, applications, and processes used by enterprises for business data analysis.
2. Focus - Data science focuses on the future. Business intelligence focuses on the past and present.
3. Data - Data science deals with both structured and unstructured data. Business intelligence mainly deals only with structured data.
4. Flexibility - Data science is much more flexible, as data sources can be added as per requirement. Business intelligence is less flexible, as data sources need to be pre-planned.
5. Method - Data science makes use of the scientific method. Business intelligence makes use of the analytic method.
6. Complexity - Data science has a higher complexity in comparison to business intelligence. Business intelligence is much simpler when compared to data science.
7. Expertise - The expert in data science is the data scientist. The expert in business intelligence is the business user.
8. Questions - Data science deals with the questions of what will happen and what if. Business intelligence deals with the question of what happened.
9. Storage - In data science, the data to be used is disseminated in real-time clusters. In business intelligence, a data warehouse is utilized to hold data.
10. Integration of data - The ELT (Extract-Load-Transform) process is generally used to integrate data for data science applications. The ETL (Extract-Transform-Load) process is generally used to integrate data for business intelligence applications.
11. Tools - Data science tools are SAS, BigML, MATLAB, Excel, etc. Business intelligence tools are InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.
12. Usage - With data science, companies can harness their potential by anticipating future scenarios in order to reduce risk and increase income. Business intelligence helps in performing root-cause analysis on a failure or in understanding the current status.
Data: Data Types
• “Just as trees are the raw material from which paper is produced,
so too, can data be viewed as the raw material from which
information is obtained.”

• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
• Data which is to the point, factual, and highly organized is referred to as structured data. It is quantitative in nature, i.e., it relates to quantities, meaning it contains measurable numerical values like numbers, dates, and times.
• Structured data is highly organized and understandable by machine language.
• Structured data is easy to search and analyze, and it exists in a predefined format.
Unstructured Data
• Unstructured data is data without labels.
• The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task.
• Text files, log files, audio files, and image files are all included in unstructured data.
• Examples of human-generated unstructured data are text files, email, social media, mobile data, business applications, and others. Machine-generated unstructured data includes satellite images, scientific data, sensor data, digital surveillance, and many more.
Data Collections
• Open Data
• Social Media Data
• Multimodal Data
• Data Storage and Presentation
Open Data
• Open data is data that is freely available in the public domain, which anyone can use as they wish, without restrictions from copyright, patents, or other mechanisms of control. Its commonly cited qualities are:
• Public
• Described
• Reusable
• Complete
• Timely
• Managed Post-Release
Social Media Data
• Social media has become a gold mine for collecting data to
analyze for research or marketing purposes.
• This is facilitated by the Application Programming Interface
(API) that social media companies provide to researchers and
developers
Multimodal Data
• Multimodal data is data that combines several modes or modalities, such as text, images, audio, and video.
• Analyzing such data requires handling each modality as well as the relationships between them.
Data Storage and Presentation
• Depending on its nature, data is stored in various formats.
• The most commonly used formats that store data as simple text are comma-separated values (CSV) and tab-separated values (TSV).
1. CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases.
For example, Depression.csv is a dataset that is available at UF Health, UF Biostatistics:

treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
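As a quick illustration, the rows above can be loaded with Python's built-in csv module. The data is embedded as a string here so the sketch is self-contained; in practice you would open the Depression.csv file instead:

```python
import csv
import io

# The Depression.csv contents from the example above, embedded as a string.
raw = """treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3"""

# DictReader maps each data row to {column name: value} using the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(rows))         # 12
print(rows[0]["treat"])  # No Treatment
```

Note that the csv module returns every field as a string; numeric columns such as diff must be converted explicitly, e.g. `int(rows[3]["diff"])`.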
2. TSV (Tab-Separated Values) files are used for raw data and can be imported into and exported from spreadsheet software. Tab-separated values files are essentially text files, and the raw data can be viewed by text editors, though such files are often used when moving raw data between spreadsheets.

Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St

where <TAB> denotes a TAB character.
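The same csv module parses tab-separated values once the delimiter is changed. In this sketch the records above are embedded as a string, with "\t" standing in for <TAB>:

```python
import csv
import io

# The tab-separated example from the slide, with \t in place of <TAB>.
raw = ("Name\tAge\tAddress\n"
       "Ryan\t33\t1115 W Franklin\n"
       "Paul\t25\tBig Farm Way\n"
       "Jim\t45\tW Main St\n"
       "Samantha\t32\t28 George St")

# Switching the delimiter to a tab turns the CSV reader into a TSV reader.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

print(records[0])  # {'Name': 'Ryan', 'Age': '33', 'Address': '1115 W Franklin'}
```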
Data Storage and Presentation
3. XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can thus be used to store and transport data. In the real world, computer systems and databases contain data in incompatible formats. As XML data is stored in plain text format, it provides a software- and hardware-independent way of storing data.
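A minimal sketch of reading XML with Python's standard xml.etree.ElementTree module; the <people> document here is a hypothetical fragment (echoing the TSV records), not taken from the slides:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML fragment: tags describe the data, making it
# self-describing and readable by both humans and machines.
doc = """
<people>
  <person><name>Ryan</name><age>33</age></person>
  <person><name>Paul</name><age>25</age></person>
</people>
"""

root = ET.fromstring(doc)
# Walk the tree: both the tag structure and the text content are accessible.
names = [p.findtext("name") for p in root.findall("person")]
print(names)  # ['Ryan', 'Paul']
```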
Data Wrangling
• Data wrangling is also referred to as data munging.
• It is the process of transforming and mapping data from one "raw" data
form into another format to make it more appropriate and valuable for
various downstream purposes such as analytics.
• The goal of data wrangling is to assure quality and useful data.
• Data wrangling acts as a preparation stage for the data mining process,
which involves gathering data and making sense of it.
• Data wrangling is the process of removing errors and combining
complex data sets to make them more accessible and easier to
analyze.
Data Pre-processing
• Incomplete- When some of the attribute values are lacking,
certain attributes of interest are lacking, or attributes contain only
aggregate data.
• Noisy- When data contains errors or outliers. For example, some
of the data points in a dataset may contain extreme values that can
severely affect the dataset’s range.
• Inconsistent- Data contains discrepancies in codes or names. For
example, if the “Name” column for registration records of
employees contains values other than alphabetical letters, or if
records do not start with a capital letter, discrepancies are present.
Data Cleaning
A. Data Munging
Consider the following text recipe: “Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.” Munging converts such unstructured text into a structured form that a program can process.

B. Handling Missing Data
Consider a table containing customer data in which some of the home phone numbers are absent.

C. Smoothing Noisy Data
For humans, a 99.4°F temperature means you are fine, while 99.8°F means you have a fever; if our storage system represents both of them as 99°F, then it fails to differentiate between healthy and sick persons!
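The slides do not prescribe a specific remedy for missing values, but two common options, dropping incomplete records or imputing a fill value such as the mean, can be sketched with the standard library. The ages list is a hypothetical example, with None marking a missing entry:

```python
from statistics import mean

# Hypothetical customer ages; None marks a missing value.
ages = [34, None, 41, None, 29, 36]

# Option 1: drop records with missing values.
dropped = [a for a in ages if a is not None]

# Option 2: impute missing values with the mean of the observed ones.
fill = mean(dropped)
imputed = [a if a is not None else fill for a in ages]

print(dropped)  # [34, 41, 29, 36]
print(imputed)  # [34, 35, 41, 35, 29, 36]
```

Dropping loses data but keeps every remaining value genuine; imputation keeps the dataset's size but introduces estimated values, so the right choice depends on the downstream analysis.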
Data Integration
The following steps describe how to integrate multiple databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a
single file or a database).
2. Engage in schema integration, or the combining of metadata from different
sources.
3. Detect and resolve data value conflicts. For example:
a. A conflict may arise, such as the presence of different attributes and values from various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for example, metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual revenue.
c. Correlation analysis may detect instances of redundant data.
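Point 4c can be illustrated with a hand-rolled Pearson correlation: a derived attribute (here a hypothetical annual_revenue equal to 12 times monthly_revenue) is flagged as redundant because the two columns are perfectly correlated:

```python
from statistics import mean

# Two attributes from hypothetically merged databases; one is derived
# from the other, so the pair carries redundant information.
monthly = [10, 12, 9, 15, 11]
annual = [12 * m for m in monthly]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

r = pearson(monthly, annual)
# r close to +1 or -1 suggests one of the two attributes can be dropped.
print(round(r, 6))
```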
Data Transformation
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Data is scaled to fall within a small, specified range. Some of the techniques used for accomplishing normalization (not covered in detail here) are:
a. Min-max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction:
a. New attributes are constructed from the given ones.
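Min-max and z-score normalization (points 4a and 4b above) can be sketched as follows; the values list is an arbitrary example:

```python
from statistics import mean, pstdev

values = [200, 300, 400, 600, 1000]

# Min-max normalization: rescale into the range [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: shift to zero mean, scale by the
# population standard deviation.
mu, sigma = mean(values), pstdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```

Min-max preserves the shape of the original distribution within a fixed range, while z-scores express each value as a number of standard deviations from the mean.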
Data Reduction
• Data reduction is a key process in which a reduced representation of a
dataset that produces the same or similar analytical results is obtained.
• The most common techniques used for data reduction :
1. Data Cube Aggregation - The lowest level of a data cube is the
aggregated data for an individual entity of interest. To do this, use the
smallest representation that is sufficient to address the given task. In
other words, we reduce the data to its more meaningful size and structure
for the task at hand.
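As a sketch of rolling data up a cube, hypothetical quarterly sales records can be aggregated to yearly totals, the smallest representation sufficient for a per-year analysis:

```python
from collections import defaultdict

# Hypothetical low-level cube cells: (year, quarter, sales amount).
sales = [
    (2022, 1, 100), (2022, 2, 120), (2022, 3, 90), (2022, 4, 140),
    (2023, 1, 110), (2023, 2, 130), (2023, 3, 95), (2023, 4, 150),
]

# Roll up: collapse the quarter dimension, keeping only yearly totals.
by_year = defaultdict(int)
for year, _quarter, amount in sales:
    by_year[year] += amount

print(dict(by_year))  # {2022: 450, 2023: 485}
```

Eight cells reduce to two, yet every per-year question the task requires can still be answered.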
Data Discretization
• The data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the attribute
into intervals.

• Discretization replaces the many values of a continuous attribute with a small number of interval labels, thereby reducing and simplifying the original data.

• There are three types of attributes involved in discretization:

a. Nominal: Values from an unordered set


b. Ordinal: Values from an ordered set
c. Continuous: Real numbers
Data Discretization

• To achieve discretization, divide the range of the continuous attribute into intervals.
• For instance, we could decide to split the range of temperature values into cold, moderate, and hot, or the price of a company's stock into above or below its market valuation.
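The temperature example above can be sketched as a simple binning function; the cut points (15°C and 25°C) are assumptions for illustration, not values from the slides:

```python
def discretize(temp_c):
    """Map a continuous temperature (°C) to one of three interval labels.

    Assumed cut points: below 15 is cold, 15-25 is moderate, above 25 is hot.
    """
    if temp_c < 15:
        return "cold"
    if temp_c <= 25:
        return "moderate"
    return "hot"

readings = [3, 18, 31, 22, 9]
labels = [discretize(t) for t in readings]
print(labels)  # ['cold', 'moderate', 'hot', 'moderate', 'cold']
```

The continuous attribute with unboundedly many possible values is reduced to just three labels, simplifying any downstream analysis that only needs the category.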
Thank you
