Chapter-1 DS

Data science is a multidisciplinary field focused on extracting insights from structured and unstructured data using various technologies and algorithms. It has applications across multiple domains, including healthcare, finance, and transportation, and is often compared to business intelligence, machine learning, and artificial intelligence. Data warehousing and data mining are integral components of data science, facilitating the storage, analysis, and discovery of patterns in large datasets.

Uploaded by

trexwarrior92
Copyright
© © All Rights Reserved

Chapter-1

What is Data Science? Definition and scope of Data Science, Applications and
domains of Data Science, Comparison with other fields like Business
Intelligence (BI), Artificial Intelligence (AI), Machine Learning (ML), and
Data Warehousing/Data Mining (DW-DM)

Data Science:

Data science is the in-depth study of massive amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data, which is
processed using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so
that you can find something new and meaningful.

Applications of Data Science:

o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends. This automatic tagging suggestion uses an image recognition
algorithm, which is part of data science.
When you speak to an assistant such as "Ok Google", Siri, or Cortana, the
device responds to your voice, which is made possible by speech
recognition algorithms.
o Gaming:
In the gaming world, the use of machine learning algorithms is increasing day
by day. Companies such as EA Sports, Sony, and Nintendo use data science widely
to enhance the user experience.
o Internet:
When we want to search for something on the internet, we use search engines
such as Google, Yahoo, Bing, and Ask. All these search engines use data
science technology to improve the search experience, so you can get a
search result within a fraction of a second.
o Transport:
Transport industries are also using data science technology to create
self-driving cars. Self-driving cars are expected to help reduce the number of
road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science
is being used for tumor detection, drug discovery, medical image analysis,
virtual medical bots, etc.
o Recommendation systems:
Many companies, such as Amazon, Netflix, and Google Play, use data science
technology to create a better user experience with personalized
recommendations. For example, when you search for something on Amazon and
start getting suggestions for similar products, this is because of data
science technology.
o Risk detection:
Finance industries have always had problems with fraud and the risk of losses,
but with the help of data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk and
losses while increasing customer satisfaction.

BI stands for business intelligence, which is also used for the analysis of business
information.

Differences between BI and data science:

| Criterion | Business intelligence | Data science |
| --- | --- | --- |
| Data source | Business intelligence deals with structured data, e.g., a data warehouse. | Data science deals with structured and unstructured data, e.g., weblogs, feedback, etc. |
| Method | Analytical (historical data) | Scientific (goes deeper to know the reason for the data report) |
| Skills | Statistics and visualization are the two skills required for business intelligence. | Statistics, visualization, and machine learning are the required skills for data science. |
| Focus | Business intelligence focuses on both past and present data. | Data science focuses on past data, present data, and also future predictions. |
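
The "Focus" row can be illustrated with a small sketch. The sales figures and function names below are hypothetical and are not taken from any particular tool: the first function summarizes past and present data in the way a BI report might, while the second fits a simple least-squares trend line and extrapolates one month ahead, which is one very basic way to make a future prediction.

```python
# Hypothetical monthly sales figures (illustrative data, not from the text).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def bi_summary(sales):
    """BI-style view: describe past and present data."""
    return {
        "total": sum(sales),
        "average": sum(sales) / len(sales),
        "latest": sales[-1],
    }


def fit_linear_trend(sales):
    """Least-squares line y = slope * x + intercept over month indices 0..n-1."""
    n = len(sales)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sales) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_next(sales):
    """Data-science-style view: extrapolate the fitted trend one month ahead."""
    slope, intercept = fit_linear_trend(sales)
    return slope * len(sales) + intercept


summary = bi_summary(monthly_sales)
next_month = forecast_next(monthly_sales)
print(summary)
print(round(next_month, 2))
```

A straight-line trend is only a stand-in here; real forecasting work would normally validate the model against held-out data before trusting its predictions.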

Difference between Data Science and Machine Learning:

| Data Science | Machine Learning |
| --- | --- |
| It deals with understanding and finding hidden patterns or useful insights from the data, which helps to make smarter business decisions. | It is a discipline used within data science (and usually classed as a subfield of AI) that enables the machine to learn from past data and experience automatically. |
| It is used for discovering insights from the data. | It is used for making predictions and classifying the result for new data points. |
| It is a broad term that includes various steps to create a model for a given problem and deploy the model. | It is used in the data modeling step of data science as a complete process. |
| A data scientist needs skills in big data tools like Hadoop, Hive, and Pig, in statistics, and in programming in Python, R, or Scala. | A machine learning engineer needs skills such as computer science fundamentals, programming in Python or R, and statistics and probability concepts. |
| It can work with raw, structured, and unstructured data. | It mostly requires structured data to work on. |
| Data scientists spend a lot of time handling the data, cleansing the data, and understanding its patterns. | ML engineers spend a lot of time managing the complexities that occur when implementing algorithms and the mathematical concepts behind them. |
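
To make "learning from past data" and "classifying the result for new data points" concrete, here is a minimal sketch of a 1-nearest-neighbour classifier. The training data, feature meanings, and segment labels are invented for illustration, and nearest-neighbour is just one simple technique chosen for brevity, not the method the table refers to.

```python
import math

# Hypothetical labelled past data: (hours_active, purchases) -> customer segment.
training_data = [
    ((1.0, 0.0), "casual"),
    ((2.0, 1.0), "casual"),
    ((8.0, 6.0), "loyal"),
    ((9.0, 7.0), "loyal"),
]


def predict(point, examples):
    """Return the label of the training example closest to `point`."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(point, example[0])
    )
    return nearest_label


# Classify two new, unseen data points.
print(predict((1.5, 0.5), training_data))
print(predict((7.5, 6.5), training_data))
```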

Difference between Data Science and AI

| Aspect | Data Science | Artificial Intelligence |
| --- | --- | --- |
| Basics | Data Science is a detailed process that mainly involves pre-processing, analysis, visualization, and prediction. | AI (short for artificial intelligence) is the implementation of a predictive model to forecast future events and trends. |
| Goals | Identifying the patterns that are concealed in the data is the main objective of data science. | Automation of the process and the granting of autonomy to the data model are the main goals of artificial intelligence. |
| Types of data | Data Science works with a variety of data types, including structured, semi-structured, and unstructured data. | AI uses standardized data in the form of vectors and embeddings. |
| Scientific processing | It has a high degree of scientific processing. | It has a lot of high levels of complex processing. |
| Tools used | The tools utilized in Data Science are far more extensive than those used in AI. This is because Data Science entails a number of procedures for analyzing data and developing insights from it. | The tools used in AI are less extensive compared to Data Science. |
| Build | By using the concept of data science, we can build complex models about statistics and facts about data. | By using this, we emulate cognition and human understanding to a certain level. |
| Technique used | It uses the technique of data analysis and data analytics. | It uses a lot of machine learning techniques. |
| Use | Data science makes use of graphical representation. | Artificial intelligence makes use of algorithms and network node representation. |
| Knowledge | Its knowledge was established to find hidden patterns and trends in the data. | Its knowledge is all about imparting some autonomy to a data model. |

Data Warehousing
A Data Warehouse (DW) is a database, typically relational, that is designed for query
and analysis rather than transaction processing. It includes historical data derived
from transaction data from one or more sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on supporting decision-makers with data modeling and analysis.
A Data Warehouse is a collection of data for the entire organization, not only for
a particular group of users.
It is not used for daily operations and transaction processing but for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Characteristics:

Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the
organization's overall ongoing operations. This is done by excluding data that is not
useful for the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even further back from a data
warehouse. This contrasts with a transaction system, where often only the most
current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, populated with data transformed
from the source operational RDBMS. Operational updates of data do not occur in the
data warehouse, i.e., record-level update, insert, and delete operations are not
performed on it. It usually requires only two procedures for data access: initial
loading of data and access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for a substantial
speedup of data retrieval.
Non-volatile means that once data has entered the warehouse, it should not change.
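
The time-variant and non-volatile characteristics can be sketched together with a toy append-only store. Everything here (the class name `AppendOnlyWarehouse`, its methods, and the sample balances) is hypothetical and only meant to show the idea: loading adds dated snapshot rows, existing rows are never modified, and queries can ask what a value was as of an earlier date.

```python
from datetime import date


class AppendOnlyWarehouse:
    """Toy store: rows are only ever appended, each stamped with a load date."""

    def __init__(self):
        self._rows = []  # list of (load_date, customer_id, balance)

    def load(self, load_date, customer_id, balance):
        # Loading adds a new snapshot row; earlier rows are never modified.
        self._rows.append((load_date, customer_id, balance))

    def balance_as_of(self, customer_id, as_of):
        """Return the most recent balance loaded on or before `as_of`, or None."""
        candidates = [
            (load_date, balance)
            for load_date, cid, balance in self._rows
            if cid == customer_id and load_date <= as_of
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda item: item[0])[1]


warehouse = AppendOnlyWarehouse()
warehouse.load(date(2024, 1, 31), "C1", 500)
warehouse.load(date(2024, 4, 30), "C1", 650)
warehouse.load(date(2024, 7, 31), "C1", 400)

print(warehouse.balance_as_of("C1", date(2024, 5, 15)))
print(warehouse.balance_as_of("C1", date(2024, 12, 31)))
```

An operational system would typically keep only the latest balance; keeping every dated snapshot is what makes the historical question answerable.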
Goals of Data Warehousing

o Support reporting as well as analysis.
o Maintain the organization's historical information.
o Be the foundation for decision making.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to store enormous amounts of data.
3. The structure of data warehouses is easier for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases could be easier
to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of
historical data.

Difference between a database and a data warehouse:

| No. | Database | Data Warehouse |
| --- | --- | --- |
| 1 | It is used for Online Transactional Processing (OLTP) but can be used for other objectives such as data warehousing. It records the data from the clients for history. | It is used for Online Analytical Processing (OLAP). It reads the historical information about customers for business decisions. |
| 2 | The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space. | The tables and joins are simple since they are denormalized. This is done to minimize the response time for analytical queries. |
| 3 | Data is dynamic. | Data is largely static. |
| 4 | Entity-relationship modeling techniques are used for RDBMS database design. | Data modeling approaches are used for data warehouse design. |
| 5 | Optimized for write operations. | Optimized for read operations. |
| 6 | Performance is low for analysis queries. | High performance for analytical queries. |
| 7 | The database is the place where data is stored and managed to provide fast and efficient access. | The data warehouse is the place where application data is handled for analysis and reporting objectives. |
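
Row 2 of the comparison can be sketched with Python's built-in `sqlite3` module. The table names, columns, and rows are made up for illustration, and SQLite is only standing in for both kinds of system: the same analytical question is answered once with a join over normalized tables and once from a single denormalized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized, OLTP-style tables: each fact is stored once.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);

    -- Denormalized, warehouse-style table: region is repeated on every row.
    CREATE TABLE sales_flat AS
        SELECT o.id AS order_id, c.region AS region, o.amount AS amount
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id;
    """
)

# Against the normalized schema, the analytical question needs a join.
joined = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region ORDER BY c.region
    """
).fetchall()

# The denormalized table answers the same question from a single table.
flat = conn.execute(
    "SELECT region, SUM(amount) FROM sales_flat GROUP BY region ORDER BY region"
).fetchall()

print(joined)
print(flat)
conn.close()
```

The trade-off is the one the table describes: the flat table repeats the region on every row, spending storage to keep analytical queries simple.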

ETL (Extract, Transform, and Load) Process


The mechanism of extracting information from source systems and bringing it into
the data warehouse is commonly called ETL, which stands for Extraction,
Transformation, and Loading.
The ETL process is technically challenging and requires active input from various
stakeholders, including developers, analysts, testers, and top executives.
To maintain its value as a tool for decision-makers, a data warehouse system needs
to change as the business changes. ETL is a recurring activity (daily, weekly, or
monthly) of a data warehouse system and needs to be agile, automated, and well
documented.

Extraction

o Extraction is the operation of extracting information from a source system for
further use in a data warehouse environment. This is the first stage of the ETL
process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all
the changed data to the warehouse and keep it up-to-date.

Cleansing
The cleansing stage is crucial in a data warehouse process because it is supposed
to improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.
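
A minimal sketch of dictionary-based and rule-based cleansing follows. The dictionaries, the record layout, and the `cleanse` function are all hypothetical; real ETL tools provide their own, far richer mechanisms. The sketch only shows the three ideas named above: rectifying typing mistakes, mapping synonyms to one form, and enforcing a rule that ties one value to another.

```python
# Hypothetical dictionaries; real ETL tools ship or let you configure their own.
TYPO_FIXES = {"nwe york": "new york", "lodnon": "london"}
SYNONYMS = {"nyc": "new york", "ny": "new york"}
COUNTRY_BY_CITY = {"new york": "US", "london": "GB"}


def cleanse(record):
    """Return a cleaned copy of a {'city': ..., 'country': ...} record."""
    city = record["city"].strip().lower()
    city = TYPO_FIXES.get(city, city)  # rectification: fix known typing mistakes
    city = SYNONYMS.get(city, city)    # homogenization: map synonyms to one form
    # Rule-based cleansing: for known cities, the country must agree with the city.
    country = COUNTRY_BY_CITY.get(city, record["country"])
    return {"city": city, "country": country}


raw = [
    {"city": " NYC ", "country": "US"},
    {"city": "Nwe York", "country": "USA"},
    {"city": "Lodnon", "country": "GB"},
]
cleaned = [cleanse(r) for r in raw]
print(cleaned)
```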
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement
a three-layer architecture, this phase outputs our reconciled data layer.
Loading
Loading is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
Loading can be carried out in two ways:

1. Refresh: Data Warehouse data is completely rewritten. This means that older
files are replaced. Refresh is usually used in combination with static extraction
to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying pre-existing data. This method is used in combination with
incremental extraction to update data warehouses regularly.
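
The two loading modes can be sketched as follows. The list used as a warehouse table and the row layout are hypothetical stand-ins; the point is only the difference in behavior between rewriting everything and appending changes.

```python
def refresh(warehouse, extracted_rows):
    """Refresh: discard existing contents and rewrite the table from a full extract."""
    warehouse.clear()
    warehouse.extend(extracted_rows)


def update(warehouse, changed_rows):
    """Update: append only the changes, leaving pre-existing rows untouched."""
    warehouse.extend(changed_rows)


warehouse = []  # hypothetical in-memory stand-in for a warehouse table

# Initial population: static (full) extraction plus refresh.
refresh(warehouse, [("2024-01", "A", 10), ("2024-01", "B", 20)])

# Regular maintenance: incremental extraction plus update.
update(warehouse, [("2024-02", "A", 12)])

print(warehouse)
```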
Data Mining:
Data Mining is the process of extracting information from huge sets of data to
identify patterns, trends, and useful data that allow the business to make
data-driven decisions.
We can also say that Data Mining is the process of investigating hidden patterns of
information from various perspectives and categorizing them into useful data. This
data is collected and assembled in particular areas such as data warehouses, where
efficient analysis and data mining algorithms help decision making and other data
requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to
find trends and patterns that go beyond simple analysis procedures. Data mining
utilizes complex mathematical algorithms to segment the data and evaluate the
probability of future events. Data Mining is also called Knowledge Discovery in
Databases (KDD).
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized by
tables, records, and columns from which data can be accessed in various ways
without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects data from various sources
within the organization to provide meaningful business insights. The huge amount
of data comes from multiple places such as Marketing and Finance. The extracted
data is utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
A Data Repository generally refers to a destination for data storage. However,
many IT professionals use the term more specifically to refer to a particular kind of
setup within an IT structure, for example, a group of databases where an organization
has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model
is called an object-relational model. It supports Classes, Objects, Inheritance, etc.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has
the ability to undo a database transaction if it is not performed appropriately.
Although this was a unique capability a long time ago, today most
relational database systems support transactional database activities.
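
The ability to undo a transaction can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical; the sketch relies on the module's default transaction handling, in which the two updates belong to one open transaction until `commit()` or `rollback()` is called.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, sender, receiver, amount):
    """Move `amount` between accounts; undo everything if the sender would go negative."""
    try:
        connection.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        connection.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )
        (balance,) = connection.execute(
            "SELECT balance FROM accounts WHERE name = ?", (sender,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        connection.commit()
        return True
    except ValueError:
        connection.rollback()  # undo both updates
        return False


ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Because the failed transfer is rolled back as a unit, neither account is left with a half-applied change.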

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based
data.
o Data mining enables organizations to make profitable modifications in
operation and production.
o Compared with other statistical data applications, data mining is
cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.

Disadvantages of Data Mining

o There is a possibility that organizations may sell useful customer data
to other organizations for money. For example, it has been reported that American
Express sold its customers' credit card purchase data to other organizations.
o Many data mining analytics tools are difficult to operate and need advanced
training to work with.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data
mining tools is a very challenging task.
o Data mining techniques are not perfectly precise, so they may lead to severe
consequences in certain conditions.

Data Mining Applications

Data Mining is primarily used by organizations with intense consumer demands, such
as retail, communication, financial, and marketing companies, to determine prices,
consumer preferences, product positioning, and the impact on sales, customer
satisfaction, and corporate profits. Data mining enables a retailer to use
point-of-sale records of customer purchases to develop products and promotions that
help the organization attract customers.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can
incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining incorporates
analysis and prediction.
Drawing on various methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data mining have
devoted their careers to better understanding how to process and draw conclusions
from huge amounts of data. But what methods do they use to make it happen?
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
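
As one illustration of the association technique, the sketch below computes the support and confidence of a single rule over a handful of shopping baskets. The baskets and the function names are invented for this example; production association mining would normally use an algorithm such as Apriori over far larger data, which is not shown here.

```python
# Hypothetical point-of-sale baskets (illustrative data only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with `antecedent` that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Rule under test: customers who buy bread also buy butter.
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Support says how often bread and butter appear together at all, while confidence says how often butter appears given that bread was bought; a retailer might use rules with high values of both when planning promotions.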

Chapter Ends…
