Introduction To Data Science

Data science involves extracting meaningful insights from raw data using scientific methods, technologies, and algorithms. It involves asking the right questions, modeling data using complex algorithms, visualizing data, and understanding data to make better decisions. Tools used for data science include Python, R, SQL, and machine learning tools. Data science has applications in transportation, finance, e-commerce, healthcare, gaming, and logistics to optimize processes, detect patterns, and make predictions.

Uploaded by Manak Jain
Introduction to Data Science

What is Data Science?


• Data science is the in-depth study of massive amounts of data: extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
• “Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.”
What is Data Science?
• In short, we can say that data science is all about:
• Asking the right questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and find the final result.
Data Science Components
Tools for Data Science

Following are some tools required for data science:


Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
Data visualization tools: R, Jupyter, Tableau, Cognos.
Machine learning tools: Spark, Mahout, Azure ML Studio.
Applications of Data Science
• In Transport - Data science has also entered the transport field, for example with driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.

• In Finance - Data science plays a key role in the financial industry, which constantly faces the problem of fraud and the risk of losses. Financial firms therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company.

• In E-Commerce - E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience with personalized recommendations.
Applications of Data Science
• In Health Care - In the healthcare industry, data science acts as a boon. It is used for:
• Tumor detection.
• Drug discovery.
• Medical image analysis.
• Virtual medical bots.
• Genetics and genomics.
• Predictive modeling for diagnosis, etc.

• Image Recognition
• Targeted Recommendations - Data science helps companies that are paying for advertisements for their mobile apps.
Applications of Data Science
• Data Science in Gaming
• Medicine and Drug Development
• In Delivery Logistics - Logistics companies like DHL, FedEx, etc. make use of data science to find the best route for shipping their products, the best time for delivery, the best mode of transport to reach the destination, etc.
• Autocomplete - The autocomplete feature is an important application of data science: the user types just a few letters or words, and the rest of the line is completed automatically.
The Relationship between Data Science and Information Science
• Data science is the discovery of knowledge or actionable information in
data.
• Information science is the design of practices for storing and retrieving
information.

For example,
• The number “480,000” is a data point. But when we add the explanation that it represents the number of deaths per year in the USA from cigarette smoking, it becomes information. In many real-world scenarios, however, the distinction between a meaningful and a meaningless data point is not clear enough for us to differentiate data from information.
Data science
• Data science is used in business functions such as strategy formation, decision making, and operational processes. It touches on practices such as artificial intelligence, analytics, predictive analytics, and algorithm design. At its core is the discovery of knowledge and actionable information in data. Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Information Science
• The field of information science, which often stems from computing, computational science, informatics, information technology, or library science, often represents and serves such application areas. The core idea is to cover people studying, accessing, using, and producing information in various contexts.
Business Intelligence versus Data Science

1. Concept - Data science is a field that uses mathematics, statistics, and various other tools to discover hidden patterns in the data. Business intelligence is basically a set of technologies, applications, and processes used by enterprises for business data analysis.
2. Focus - Data science focuses on the future. Business intelligence focuses on the past and present.
3. Data - Data science deals with both structured and unstructured data. Business intelligence mainly deals only with structured data.
4. Flexibility - Data science is much more flexible, as data sources can be added as per requirement. Business intelligence is less flexible, as data sources need to be pre-planned.
5. Method - Data science makes use of the scientific method. Business intelligence makes use of the analytic method.
6. Complexity - Data science has a higher complexity in comparison to business intelligence. Business intelligence is much simpler when compared to data science.
7. Expertise - The expert in data science is the data scientist. The expert in business intelligence is the business user.
8. Questions - Data science deals with the questions of what will happen and what if. Business intelligence deals with the question of what happened.
9. Storage - In data science, the data to be used is disseminated in real-time clusters. In business intelligence, a data warehouse is utilized to hold data.
10. Integration of data - The ELT (Extract-Load-Transform) process is generally used to integrate data for data science applications. The ETL (Extract-Transform-Load) process is generally used to integrate data for business intelligence applications.
11. Tools - Data science tools are SAS, BigML, MATLAB, Excel, etc. Business intelligence tools are InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.
12. Usage - With data science, companies can harness their potential by anticipating future scenarios in order to reduce risk and increase income. Business intelligence helps in performing root-cause analysis on a failure or in understanding the current status.
Data: Data Types
• “Just as trees are the raw material from which paper is produced,
so too, can data be viewed as the raw material from which
information is obtained.”

• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
• Data which is to the point, factual, and highly organized is referred to as structured data. It is quantitative in nature, i.e., it relates to quantities, meaning it contains measurable numerical values like numbers, dates, and times.
• Structured data is highly organized and understandable by machine language.
• Structured data is easy to search and analyze, and it exists in a predefined format.
Unstructured Data
• Unstructured data is data without labels.
• The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task.
• Text files, log files, audio files, and image files are all included in unstructured data.
• Examples of human-generated unstructured data are text files, email, social media, mobile data, business applications, and others. Machine-generated unstructured data includes satellite images, scientific data, sensor data, digital surveillance, and many more.
Data Collections
• Open Data
• Social Media Data
• Multimodal Data
• Data Storage and Presentation
Open Data
• Open data is data that is freely available in the public domain, which anyone can use as they wish, without restrictions from copyright, patents, or other mechanisms of control. Its commonly cited qualities are:
• Public
• Described
• Reusable
• Complete
• Timely
• Managed Post-Release
Social Media Data
• Social media has become a gold mine for collecting data to
analyze for research or marketing purposes.
• This is facilitated by the Application Programming Interface
(API) that social media companies provide to researchers and
developers
Multimodal Data
• Multimodal data is data that combines several modes or modalities, such as text, images, audio, and video.
• Analyzing such data requires handling each modality as well as the relationships between them.
Data Storage and Presentation
• Depending on its nature, data is stored in various formats.
• The most commonly used formats that store data as simple text are comma-separated values (CSV) and tab-separated values (TSV).
1. CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases.
For example, Depression.csv is a dataset that is available at UF Health, UF Biostatistics:

treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
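As a quick illustration, the rows above can be loaded with Python's built-in csv module. The data is embedded as a string here so the sketch is self-contained; in practice you would open the Depression.csv file instead:

```python
import csv
import io

# The Depression.csv contents from the example above, embedded as a string.
raw = """treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3"""

# DictReader maps each data row to {column name: value} using the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(rows))         # 12
print(rows[0]["treat"])  # No Treatment
```

Note that the csv module returns every field as a string; numeric columns such as diff must be converted explicitly, e.g. `int(rows[3]["diff"])`.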
2. TSV (Tab-Separated Values) files are used for raw data and can be imported into and exported from spreadsheet software. Tab-separated values files are essentially text files, and the raw data can be viewed by text editors, though such files are often used when moving raw data between spreadsheets.

Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St

where <TAB> denotes a TAB character.
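The same csv module parses tab-separated values once the delimiter is changed. In this sketch the records above are embedded as a string, with "\t" standing in for <TAB>:

```python
import csv
import io

# The tab-separated example from the slide, with \t in place of <TAB>.
raw = ("Name\tAge\tAddress\n"
       "Ryan\t33\t1115 W Franklin\n"
       "Paul\t25\tBig Farm Way\n"
       "Jim\t45\tW Main St\n"
       "Samantha\t32\t28 George St")

# Switching the delimiter to a tab turns the CSV reader into a TSV reader.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

print(records[0])  # {'Name': 'Ryan', 'Age': '33', 'Address': '1115 W Franklin'}
```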
Data Storage and Presentation
3. XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can thus be used to store and transport data. In the real world, computer systems and databases contain data in incompatible formats. As XML data is stored in plain text format, it provides a software- and hardware-independent way of storing data.
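A minimal sketch of reading XML with Python's standard xml.etree.ElementTree module; the <people> document here is a hypothetical fragment (echoing the TSV records), not taken from the slides:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML fragment: tags describe the data, making it
# self-describing and readable by both humans and machines.
doc = """
<people>
  <person><name>Ryan</name><age>33</age></person>
  <person><name>Paul</name><age>25</age></person>
</people>
"""

root = ET.fromstring(doc)
# Walk the tree: both the tag structure and the text content are accessible.
names = [p.findtext("name") for p in root.findall("person")]
print(names)  # ['Ryan', 'Paul']
```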
Data Wrangling
• Data wrangling is also referred to as data munging.
• It is the process of transforming and mapping data from one "raw" data
form into another format to make it more appropriate and valuable for
various downstream purposes such as analytics.
• The goal of data wrangling is to assure quality and useful data.
• Data wrangling acts as a preparation stage for the data mining process,
which involves gathering data and making sense of it.
• Data wrangling is the process of removing errors and combining
complex data sets to make them more accessible and easier to
analyze.
Data Pre-processing
• Incomplete- When some of the attribute values are lacking,
certain attributes of interest are lacking, or attributes contain only
aggregate data.
• Noisy- When data contains errors or outliers. For example, some
of the data points in a dataset may contain extreme values that can
severely affect the dataset’s range.
• Inconsistent- Data contains discrepancies in codes or names. For
example, if the “Name” column for registration records of
employees contains values other than alphabetical letters, or if
records do not start with a capital letter, discrepancies are present.
Data Cleaning
A. Data Munging
Consider the following text recipe: “Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.” Munging converts such unstructured text into a structured form that a program can process.

B. Handling Missing Data
Consider a table containing customer data in which some of the home phone numbers are absent.

C. Smoothing Noisy Data
For humans, a 99.4°F temperature means you are fine, while 99.8°F means you have a fever; if our storage system represents both of them as 99°F, then it fails to differentiate between healthy and sick persons!
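The slides do not prescribe a specific remedy for missing values, but two common options, dropping incomplete records or imputing a fill value such as the mean, can be sketched with the standard library. The ages list is a hypothetical example, with None marking a missing entry:

```python
from statistics import mean

# Hypothetical customer ages; None marks a missing value.
ages = [34, None, 41, None, 29, 36]

# Option 1: drop records with missing values.
dropped = [a for a in ages if a is not None]

# Option 2: impute missing values with the mean of the observed ones.
fill = mean(dropped)
imputed = [a if a is not None else fill for a in ages]

print(dropped)  # [34, 41, 29, 36]
print(imputed)  # [34, 35, 41, 35, 29, 36]
```

Dropping loses data but keeps every remaining value genuine; imputation keeps the dataset's size but introduces estimated values, so the right choice depends on the downstream analysis.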
Data Integration
The following steps describe how to integrate multiple databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a
single file or a database).
2. Engage in schema integration, or the combining of metadata from different
sources.
3. Detect and resolve data value conflicts. For example:
a. A conflict may arise, such as the presence of different attributes and values from various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for example, metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual revenue.
c. Correlation analysis may detect instances of redundant data.
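Point 4c can be illustrated with a hand-rolled Pearson correlation: a derived attribute (here a hypothetical annual_revenue equal to 12 times monthly_revenue) is flagged as redundant because the two columns are perfectly correlated:

```python
from statistics import mean

# Two attributes from hypothetically merged databases; one is derived
# from the other, so the pair carries redundant information.
monthly = [10, 12, 9, 15, 11]
annual = [12 * m for m in monthly]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

r = pearson(monthly, annual)
# r close to +1 or -1 suggests one of the two attributes can be dropped.
print(round(r, 6))
```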
Data Transformation
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Data is scaled to fall within a small, specified range. Some of the techniques used for accomplishing normalization (not covered in detail here) are:
a. Min-max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction:
a. New attributes are constructed from the given ones.
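Min-max and z-score normalization (points 4a and 4b above) can be sketched as follows; the values list is an arbitrary example:

```python
from statistics import mean, pstdev

values = [200, 300, 400, 600, 1000]

# Min-max normalization: rescale into the range [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: shift to zero mean, scale by the
# population standard deviation.
mu, sigma = mean(values), pstdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```

Min-max preserves the shape of the original distribution within a fixed range, while z-scores express each value as a number of standard deviations from the mean.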
Data Reduction
• Data reduction is a key process in which a reduced representation of a
dataset that produces the same or similar analytical results is obtained.
• The most common techniques used for data reduction :
1. Data Cube Aggregation - The lowest level of a data cube is the
aggregated data for an individual entity of interest. To do this, use the
smallest representation that is sufficient to address the given task. In
other words, we reduce the data to its more meaningful size and structure
for the task at hand.
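As a sketch of rolling data up a cube, hypothetical quarterly sales records can be aggregated to yearly totals, the smallest representation sufficient for a per-year analysis:

```python
from collections import defaultdict

# Hypothetical low-level cube cells: (year, quarter, sales amount).
sales = [
    (2022, 1, 100), (2022, 2, 120), (2022, 3, 90), (2022, 4, 140),
    (2023, 1, 110), (2023, 2, 130), (2023, 3, 95), (2023, 4, 150),
]

# Roll up: collapse the quarter dimension, keeping only yearly totals.
by_year = defaultdict(int)
for year, _quarter, amount in sales:
    by_year[year] += amount

print(dict(by_year))  # {2022: 450, 2023: 485}
```

Eight cells reduce to two, yet every per-year question the task requires can still be answered.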
Data Discretization
• The data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the attribute
into intervals.

• Discretization replaces the many values of a continuous attribute with a small number of interval labels, thereby reducing and simplifying the original data.

• There are three types of attributes involved in discretization:

a. Nominal: Values from an unordered set


b. Ordinal: Values from an ordered set
c. Continuous: Real numbers
Data Discretization

• To achieve discretization, divide the range of the continuous attribute into intervals.
• For instance, we could decide to split the range of temperature values into cold, moderate, and hot, or the price of a company's stock into above or below its market valuation.
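The temperature example above can be sketched as a simple binning function; the cut points (15°C and 25°C) are assumptions for illustration, not values from the slides:

```python
def discretize(temp_c):
    """Map a continuous temperature (°C) to one of three interval labels.

    Assumed cut points: below 15 is cold, 15-25 is moderate, above 25 is hot.
    """
    if temp_c < 15:
        return "cold"
    if temp_c <= 25:
        return "moderate"
    return "hot"

readings = [3, 18, 31, 22, 9]
labels = [discretize(t) for t in readings]
print(labels)  # ['cold', 'moderate', 'hot', 'moderate', 'cold']
```

The continuous attribute with unboundedly many possible values is reduced to just three labels, simplifying any downstream analysis that only needs the category.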
Thank you
