Chapter-1 DS
Data Science:
Data science is the in-depth study of massive amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data, which is
processed using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so
that you can find something new and meaningful.
BI stands for business intelligence, which is also used for data analysis of business
information. Business intelligence works with both past and present data, whereas
data science works with past data, present data, and also future predictions.
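Data Science vs. Machine Learning
Data Science: ... finding hidden patterns or useful ... from the data.
Machine Learning: ... the machine to learn from the past data and ... classifying the
result for new data points.
Data Science: It is a broad term that includes ... model.
Machine Learning: It is used in the data modeling step of data ...
Data Science: ... to use big data tools like Hadoop, ...
Machine Learning: ... have skills such as computer science ... concepts, etc.
Data Science: It can work with raw, structured, and ...
Machine Learning: It mostly requires structured data to work ...
Data Science: ... handling the data, cleansing the data, ...
Machine Learning: ... the complexities that occur during the ...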
Data Science vs. Artificial Intelligence
Goals
Data Science: Identifying the patterns that are concealed in the data is the main
objective of data science.
Artificial Intelligence: Automation of the process and the granting of autonomy to
the data model are the main goals of artificial intelligence.
Types of data
Data Science: Data science works with a variety of data types, including structured,
semi-structured, and unstructured data.
Artificial Intelligence: AI uses standardized data in the form of vectors and
embeddings.
Data Warehousing
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single or multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but is used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the
organization's ongoing global operations. This is done by excluding data that is not
useful for the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records. Data cleaning and integration must be
performed during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among different data sources.
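The integration step can be illustrated with a minimal Python sketch. It assumes two hypothetical sources (an RDBMS export and a flat file) whose field names (`cust_id`, `CustomerID`, `amount`, `Amt`) and types differ; the mapping tables are invented for illustration and do not come from any particular tool.

```python
# Hypothetical records from two heterogeneous sources with different
# naming conventions and attribute types.
rdbms_rows = [{"cust_id": 1, "amount": 250.0}]
flat_file_rows = [{"CustomerID": "2", "Amt": "99.50"}]

# Per-source mapping: source field name -> (warehouse field name, type).
RDBMS_MAP = {"cust_id": ("customer_id", int), "amount": ("amount", float)}
FLAT_FILE_MAP = {"CustomerID": ("customer_id", int), "Amt": ("amount", float)}


def integrate(rows, mapping):
    """Rename fields and cast values so every source shares one schema."""
    unified = []
    for row in rows:
        unified.append(
            {new_name: cast(row[old_name])
             for old_name, (new_name, cast) in mapping.items()}
        )
    return unified


warehouse_rows = integrate(rdbms_rows, RDBMS_MAP) + integrate(flat_file_rows, FLAT_FILE_MAP)
print(warehouse_rows)
```

After integration, both records use the same field names and numeric types, regardless of which source they came from.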
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, or 12 months ago, or even older data, from a data
warehouse. This contrasts with a transaction system, where often only the most
current data is kept.
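As a rough sketch of what time-variant storage makes possible, the Python example below keeps every historical price in an in-memory SQLite table and looks up the value that was in effect on a given date. The table `price_history`, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_history (product TEXT, valid_from TEXT, price REAL)")
# Every change is stored as a new row, so older values stay available.
conn.executemany(
    "INSERT INTO price_history VALUES (?, ?, ?)",
    [
        ("widget", "2023-01-01", 10.0),
        ("widget", "2023-04-01", 11.0),
        ("widget", "2023-07-01", 12.5),
    ],
)


def price_as_of(product, date):
    """Return the price that was in effect on the given ISO date, or None."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE product = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (product, date),
    ).fetchone()
    return row[0] if row else None


print(price_as_of("widget", "2023-05-15"))  # price valid in mid-May
print(price_as_of("widget", "2023-08-01"))  # most recent price
```

A transaction system that overwrote the price in place could answer only the second question.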
Non-Volatile
The data warehouse is a physically separate data store, which is transformed from
the source operational RDBMS. Operational updates of data do not occur in the
data warehouse, i.e., update, insert, and delete operations are not performed. Data
access usually requires only two procedures: initial loading of data and access to
data. Therefore, the DW does not require transaction processing, recovery, and
concurrency capabilities, which allows for substantial speedup of data retrieval.
Non-volatile means that once data has entered the warehouse, it should not change.
Goals of Data Warehousing
Database vs. Data Warehouse
1. Database: ... Processing (OLTP) but can be used for ... This records the data
   from the clients for history.
   Data Warehouse: ... Processing (OLAP). This reads the ... customers for business
   decisions.
2. Database: The tables and joins are complicated since they are normalized for
   RDBMS. This is done to reduce redundant files and ...
   Data Warehouse: The tables and joins are accessible since they are denormalized.
   This is done to minimize the response time ...
   Database: ... are used for RDBMS database design.
   Data Warehouse: ... used for the Data Warehouse design.
   ... queries.
7. Database: The database is the place where the data ... available fast and
   efficient access.
   Data Warehouse: Data Warehouse is the place ... handled for analysis and
   reporting objectives.
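The difference between normalized and denormalized tables can be sketched in Python with an in-memory SQLite database. The table names (`products`, `sales`, `sales_wide`) and the sample rows are hypothetical; the point is only that the same report needs a join in the normalized design and none in the denormalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized (operational) design: product names are stored once.
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, qty INTEGER);
    INSERT INTO products VALUES (1, 'pen'), (2, 'book');
    INSERT INTO sales VALUES (1, 3), (2, 1), (1, 2);

    -- Denormalized (warehouse) design: the name is repeated on every row.
    CREATE TABLE sales_wide AS
        SELECT p.name AS product_name, s.qty AS qty
        FROM sales s JOIN products p ON p.id = s.product_id;
    """
)

# Report on the normalized tables: requires a join.
with_join = conn.execute(
    "SELECT p.name, SUM(s.qty) FROM sales s "
    "JOIN products p ON p.id = s.product_id "
    "GROUP BY p.name ORDER BY p.name"
).fetchall()

# Same report on the denormalized table: a single-table scan.
without_join = conn.execute(
    "SELECT product_name, SUM(qty) FROM sales_wide "
    "GROUP BY product_name ORDER BY product_name"
).fetchall()

print(with_join == without_join, without_join)
```

The denormalized table trades redundant storage of the product name for a simpler query.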
Extraction
Cleansing
The cleansing stage is crucial in data warehousing because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.
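A minimal Python sketch of the two cleansing techniques mentioned above, dictionary-based rectification and rule-based checks, might look as follows. The dictionary entries, field names, and rules are invented for illustration and are not taken from any real ETL tool.

```python
# Hypothetical dictionary mapping typos and synonyms to a canonical value.
CITY_DICTIONARY = {
    "new yrok": "New York",
    "nyc": "New York",
    "new york": "New York",
    "bombay": "Mumbai",
    "mumbai": "Mumbai",
}


def rectify_city(raw):
    """Dictionary-based rectification: fix typos and unify synonyms."""
    key = raw.strip().lower()
    # Unknown values are kept, only trimmed of surrounding whitespace.
    return CITY_DICTIONARY.get(key, raw.strip())


def check_rules(record):
    """Rule-based cleansing: return the domain rules a record violates."""
    problems = []
    if record["age"] < 0 or record["age"] > 120:
        problems.append("age out of range")
    if record["country"] == "India" and record["city"] == "New York":
        problems.append("city does not belong to country")
    return problems


record = {"city": " NYC ", "country": "USA", "age": 34}
record["city"] = rectify_city(record["city"])
print(record, check_rules(record))
```

Here the dictionary handles homogenization of values, while the rules express associations between values (a city must fit its country) that a dictionary alone cannot capture.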
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement
a three-layer architecture, this phase outputs our reconciled data layer.
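To make the idea of converting a record from its source format into a warehouse format concrete, here is a small Python sketch. The source layout (string fields, day/month/year dates) and the target layout (typed fields, ISO dates, amounts in cents) are assumptions chosen for illustration.

```python
from datetime import datetime


def transform(order):
    """Convert one operational order record into the warehouse format."""
    ordered_at = datetime.strptime(order["order_date"], "%d/%m/%Y")
    return {
        "order_id": int(order["id"]),
        "order_date": ordered_at.date().isoformat(),  # ISO 8601 date
        "year": ordered_at.year,    # derived fields for time-based analysis
        "month": ordered_at.month,
        "amount_cents": round(float(order["amount"]) * 100),
    }


# Hypothetical record as it might arrive from an operational system.
source_order = {"id": "1007", "order_date": "05/03/2024", "amount": "19.99"}
print(transform(source_order))
```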
Loading
Loading is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data Warehouse data is completely rewritten. This means that older
files are replaced. Refresh is usually used in combination with static extraction
to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying pre-existing data. This method is used in combination with
incremental extraction to update data warehouses regularly.
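The two loading modes can be contrasted in a short Python sketch. The warehouse is modeled as a plain list and the function names `refresh` and `update` simply mirror the terms above; a real loader would write to a database instead.

```python
def refresh(warehouse, source_rows):
    """Refresh: completely rewrite the warehouse contents."""
    warehouse.clear()
    warehouse.extend(source_rows)


def update(warehouse, changed_rows):
    """Update: append only the changes, leaving existing rows untouched."""
    warehouse.extend(changed_rows)


warehouse = []
# Initial population from a static extraction.
refresh(warehouse, [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}])
# Regular load from an incremental extraction.
update(warehouse, [{"id": 3, "qty": 7}])
print(len(warehouse))
```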
Data Mining:
Data mining is the process of extracting information from huge sets of data to
identify patterns, trends, and useful data that allow the business to make
data-driven decisions.
We can say that data mining is the process of investigating hidden patterns of
information from various perspectives and categorizing them into useful data. This
data is collected and assembled in particular areas such as data warehouses, where
efficient analysis and data mining algorithms help decision making and other data
requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to
find trends and patterns that go beyond simple analysis procedures. Data mining
utilizes complex mathematical algorithms to segment the data and evaluate the
probability of future events. Data mining is also called Knowledge Discovery in
Databases (KDD).
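As one very small illustration of pattern finding, the Python sketch below counts which items appear together in shopping baskets and estimates how likely one item is given another. The baskets are made-up data, and simple co-occurrence counting is only one of many data mining techniques, chosen here because it fits in a few lines.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(basket), 2))


def confidence(antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(pair_counts.most_common(2))
print(confidence("butter", "bread"))
```

In this sample every basket containing butter also contains bread, which is the kind of hidden pattern a business could act on.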
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into
tables, records, and columns, from which data can be accessed in various ways
without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
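The following Python sketch, using the standard `sqlite3` module, shows the same two tables being queried in different ways without any change to the tables themselves. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 40.0), (11, 1, 15.0), (12, 2, 30.0);
    """
)

# Same tables, two different views of the data; the schema is untouched.
orders_per_customer = conn.execute(
    "SELECT c.name, COUNT(*) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
large_orders = conn.execute(
    "SELECT id FROM orders WHERE total >= 30 ORDER BY id"
).fetchall()

print(orders_per_customer, large_orders)
```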
Data warehouses:
A data warehouse is the technology that collects data from various sources
within the organization to provide meaningful business insights. The huge amount
of data comes from multiple places such as Marketing and Finance. The extracted
data is utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However,
many IT professionals use the term more specifically to refer to a particular kind of
setup within an IT structure, for example, a group of databases where an
organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model
is called an object-relational model. It supports Classes, Objects, Inheritance, etc.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not performed appropriately. Although this was
a unique capability a long time ago, today most relational database systems support
transactional database activities.
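Undoing a transaction that was not performed appropriately can be demonstrated with Python's standard `sqlite3` module. In this sketch the `accounts` table, the `transfer` helper, and the rule that balances may not go negative are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 20.0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; undo everything if any step fails."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False


ok = transfer("bob", "alice", 50.0)  # bob cannot afford this
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Because both updates belong to one transaction, the failed debit also undoes the credit that had already been applied, and the balances stay as they were.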
o Organizations may sell useful customer data to other organizations for money.
American Express has reportedly sold its customers' credit card purchase data
to other organizations.
o Many data mining analytics tools are difficult to operate and need advanced
training to work with.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, selecting the right data mining
tool is a very challenging task.
o Data mining techniques are not perfectly precise, so they may lead to severe
consequences in certain conditions.
Chapter Ends…