Introduction To Data Science
Applications of Data Science
• Image Recognition
• Targeting Recommendation - Data science helps companies that are
paying for advertisements for their mobile
• Data Science in Gaming
• Medicine and Drug Development
• Delivery Logistics - Logistics companies such as DHL and FedEx make
use of data science. It helps these companies find the best route for
shipping their products, the best time for delivery, and the best mode of
transport to reach the destination.
• Autocomplete - The autocomplete feature is an important application of
data science: the user types only a few letters or words, and the rest of
the line is completed automatically.
The Relationship between Data Science and Information Science
• Data science is the discovery of knowledge or actionable information in
data.
• Information science is the design of practices for storing and retrieving
information.
For example,
• The number “480,000” is a data point. When we add the explanation
that it represents the number of deaths per year in the USA from cigarette
smoking [32], it becomes information. In many real-world scenarios,
however, the distinction between a meaningful and a meaningless data point
is not clear enough for us to differentiate data from information.
Data science
• Data science is the discovery of knowledge and actionable
information in data. It is an interdisciplinary field that uses
scientific methods, processes, and systems to extract knowledge or
insights from data in various forms, either structured or
unstructured. It is used in business functions such as strategy
formation, decision making, and operational processes, and it
touches on practices such as artificial intelligence, analytics,
predictive analytics, and algorithm design.
Information Science
• The field of information science, which often stems from
computing, computational science, informatics, information
technology, or library science, often represents and serves such
application areas. The core idea is to cover people studying,
accessing, using, and producing information in various contexts.
Business Intelligence versus Data Science
S. No. | Factor | Data Science | Business Intelligence
10. | Integration of data | The ELT (Extract-Load-Transform) process is generally used for the integration of data for data science applications. | The ETL (Extract-Transform-Load) process is generally used for the integration of data for business intelligence applications.
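The difference between ETL and ELT is the order of the last two steps. The sketch below is only an illustration: the function names, the sample rows, and the use of plain dictionaries as stand-ins for a warehouse or data lake are all assumptions, not part of any real tool.

```python
def extract():
    # Hypothetical raw rows as they might arrive from a source system
    return [{"name": " ryan ", "age": "33"}, {"name": "PAUL", "age": "25"}]


def transform(rows):
    # Normalize names and convert ages from text to integers
    return [{"name": r["name"].strip().title(), "age": int(r["age"])} for r in rows]


def run_etl(target):
    # ETL: transform first, then load only the cleaned rows
    target["clean"] = transform(extract())


def run_elt(target):
    # ELT: load the raw rows first, then transform inside the target
    target["raw"] = extract()
    target["clean"] = transform(target["raw"])


etl_target, elt_target = {}, {}
run_etl(etl_target)
run_elt(elt_target)
```

Both targets end up with the same cleaned rows; only the ELT target also keeps the raw rows, which can then be transformed again in other ways later.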
• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
• Structured data is data that is to the point, factual, and highly
organized. It is quantitative in nature, i.e., it contains
measurable values such as numbers, dates, and times.
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
where <TAB> denotes a TAB character [1].
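A tab-separated table like the one above can be read with Python's standard csv module by setting the delimiter to a tab. This is a minimal sketch: the variable names are illustrative, and the table is embedded as a string only to keep the example self-contained.

```python
import csv
import io

# The table above, with "\t" standing for the TAB character
TSV_TEXT = (
    "Name\tAge\tAddress\n"
    "Ryan\t33\t1115 W Franklin\n"
    "Paul\t25\tBig Farm Way\n"
    "Jim\t45\tW Main St\n"
    "Samantha\t32\t28 George St\n"
)

# DictReader uses the first row as column names
rows = list(csv.DictReader(io.StringIO(TSV_TEXT), delimiter="\t"))

# Values are read as text, so numeric columns must be converted explicitly
ages = [int(row["Age"]) for row in rows]
average_age = sum(ages) / len(ages)
```

For a file on disk, the same reader can be given an open file object instead of the in-memory string.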
Multimodal Data
3. XML (eXtensible Markup Language) was designed to be both
human- and machine-readable, and can thus be used to store and
transport data. In the real world, computer systems and databases
contain data in incompatible formats. As the XML data is stored in
plain text format, it provides a software- and hardware-independent
way of storing data.
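As an illustration, the sketch below reads a small XML document with Python's standard xml.etree.ElementTree module. The element names (people, person, name, age) are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document; the tag names are illustrative only
XML_TEXT = """<people>
  <person><name>Ryan</name><age>33</age></person>
  <person><name>Paul</name><age>25</age></person>
</people>"""

root = ET.fromstring(XML_TEXT)

# Convert each <person> element into a plain dictionary
people = [
    {"name": person.findtext("name"), "age": int(person.findtext("age"))}
    for person in root.findall("person")
]
```

Because the document is plain text, the same string could be produced and consumed by different systems regardless of their software or hardware.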
Data Wrangling
• Data wrangling is also referred to as data munging.
• It is the process of transforming and mapping data from one "raw" data
form into another format to make it more appropriate and valuable for
various downstream purposes such as analytics.
• The goal of data wrangling is to ensure that the data is useful and of high quality.
• Data wrangling acts as a preparation stage for the data mining process,
which involves gathering data and making sense of it.
• Data wrangling is the process of removing errors and combining
complex data sets to make them more accessible and easier to
analyze.
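The points above can be sketched in a few lines of Python. The two raw sources, their formats, and the field names below are hypothetical; the sketch maps both into one common format, drops records with errors, and combines them.

```python
# Two hypothetical raw sources in different formats
raw_crm = ["Ryan;33", "Paul;25", "Jim;n/a"]
raw_web = [{"user": "samantha", "years": "32"}, {"user": "Paul", "years": "25"}]


def wrangle(crm_lines, web_rows):
    cleaned = {}
    for line in crm_lines:
        name, age = line.split(";")
        if age.isdigit():  # drop records whose age is not a number
            cleaned[name.strip().title()] = int(age)
    for row in web_rows:
        if row["years"].isdigit():
            cleaned[row["user"].strip().title()] = int(row["years"])
    # One record per person, in a single common format
    return [{"name": n, "age": a} for n, a in sorted(cleaned.items())]


people = wrangle(raw_crm, raw_web)
```

Here the record for Jim is removed because its age is not numeric, and Paul, who appears in both sources, is kept only once.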
Data Pre-processing
• Incomplete - When some of the attribute values are lacking,
certain attributes of interest are lacking, or attributes contain only
aggregate data.
• Noisy - When data contains errors or outliers. For example, some
of the data points in a dataset may contain extreme values that can
severely affect the dataset’s range.
• Inconsistent - When data contains discrepancies in codes or names. For
example, if the “Name” column for registration records of
employees contains values other than alphabetical letters, or if
records do not start with a capital letter, discrepancies are present.
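A rough way to check a dataset for these three problems is sketched below. The sample records, the plausible age range of 0 to 120, and the rule for a valid name are assumptions chosen for illustration.

```python
import re

# Hypothetical employee records with deliberate problems
records = [
    {"name": "Ryan", "age": 33},
    {"name": "paul", "age": 25},        # inconsistent: no capital letter
    {"name": "J1m", "age": 45},         # inconsistent: contains a digit
    {"name": "Samantha", "age": None},  # incomplete: age is missing
    {"name": "Alex", "age": 450},       # noisy: implausible extreme value
]

# Assumed rule: a capital letter followed by letters only
NAME_PATTERN = re.compile(r"^[A-Z][a-zA-Z]*$")


def audit(rows, min_age=0, max_age=120):
    issues = {"incomplete": [], "noisy": [], "inconsistent": []}
    for i, row in enumerate(rows):
        if row["age"] is None or not row["name"]:
            issues["incomplete"].append(i)
        elif not (min_age <= row["age"] <= max_age):
            issues["noisy"].append(i)
        if row["name"] and not NAME_PATTERN.match(row["name"]):
            issues["inconsistent"].append(i)
    return issues


report = audit(records)
```

The report lists the positions of the affected records, so each kind of problem can then be handled separately.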
Data Cleaning
A. Data Munging
Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the
mix.”
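One possible way to munge this sentence into a structured table is sketched below. The column names, the small number-word lookup, and the parsing rules are assumptions that fit only this example sentence, not a general recipe parser.

```python
import re

RECIPE = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."

# Assumed lookup covering only the quantity words in this sentence
NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3}


def munge_recipe(text):
    # Strip the leading verb and the trailing "in the mix." phrase
    body = re.sub(r"^Add\s+", "", text)
    body = re.sub(r"\s+in the mix\.?$", "", body)
    rows = []
    # Split the ingredient list on commas, skipping an optional "and"
    for part in re.split(r",\s*(?:and\s+)?", body):
        words = part.split()
        quantity = NUMBER_WORDS[words[0].lower()]
        rest = words[1:]
        if "of" in rest:
            i = rest.index("of")  # e.g. "cloves of garlic"
            unit, ingredient = " ".join(rest[:i]), " ".join(rest[i + 1:])
        else:  # e.g. "diced tomatoes"
            unit, ingredient = " ".join(rest[:-1]), rest[-1]
        rows.append({"ingredient": ingredient, "quantity": quantity, "unit_or_size": unit})
    return rows


table = munge_recipe(RECIPE)
```

Each ingredient becomes one row with a quantity and a unit or size, which is far easier to filter or aggregate than the original sentence.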