(DSBDA) Unit 1 Introduction To Data Science
Syllabus
Basics and need of Data Science and Big Data, Applications of Data Science, Data explosion, 5 V’s of Big Data,
Relationship between Data Science and Information Science, Business intelligence versus Data Science, Data Science
Life Cycle, Data: Data Types, Data Collection. Need of Data wrangling, Methods: Data Cleaning, Data Integration,
Data Reduction, Data Transformation, Data Discretization.
Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of
data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to
make sound business decisions and predictions.
Data science uses advanced analytical theory and methods such as time series analysis to predict the future.
Rather than merely reporting how many products were sold in the previous quarter, data science uses historical
data to forecast future product sales and revenue more accurately.
Data science is the domain of study that deals with vast volumes of data, using modern tools and techniques to
find unseen patterns, derive meaningful information, and make business decisions. It uses complex machine
learning algorithms to build predictive models.
It also enables businesses to process huge amounts of structured and unstructured big data to detect patterns.
Big data can be defined as very large volumes of data available at various data sources, in varying degrees of
complexity and generated at different speeds/velocities and also at varying degrees of ambiguity. This data
cannot be processed using traditional technologies, processing methods, algorithms or any commercial off-
the-shelf solutions.
‘Big data’ is a term used to describe collection of data that is huge in size and yet growing exponentially with
time.
The processing of big data begins with the raw data that isn’t aggregated or organized and is most often
impossible to store in the memory of a single computer.
The value of big data isn't solely determined by the amount of data available. Its worth is determined by how you use
it. You can get answers that 1) streamline resource management, 2) increase operational efficiencies, 3) optimize
product development, 4) drive new revenue and growth prospects, and 5) enable smart decision making by evaluating
data from any source.
When big data and high-performance analytics are combined, you can do business-related tasks such as:
Determining the root causes of failures, difficulties, and flaws in near-real time.
Detecting anomalies faster and more accurately than the naked eye.
Improving patient outcomes by transforming medical image data into insights as quickly as possible.
Recalculating entire risk portfolios in minutes.
Increasing the ability of deep learning models to effectively categorize and respond to changing variables.
Detecting fraudulent activity before it has a negative impact on your company.
Applications of Data Science
Data Science applications haven't taken on a new role overnight. Thanks to faster computing and cheaper storage, we
can now process massive amounts of data in minutes, work that used to take many human hours, and derive meaningful
insights from it.
The following applications build on Data Science concepts and explore a variety of domains, including:
Finance
Finance was one of the first industries to use data science. Every year, businesses were fed up with bad loans
and losses. They did, however, have a lot of data that was acquired during the initial filing for loan approval,
so they decided to hire data scientists to help them recover from their losses.
Banking businesses have learned to divide and conquer data over time using consumer profiling, historical
spending, and other critical indicators to assess risk and default probabilities. Furthermore, this has helped them
promote their banking products based on the purchasing power of their customers.
Healthcare
Healthcare companies are using data science to build sophisticated medical instruments and detect and cure diseases.
Data science applications are very beneficial to the healthcare industry:
1. Medical Image Analysis: Procedures such as detecting malignancies, artery stenosis, and organ delineation
use a variety of approaches and frameworks, like MapReduce, to identify appropriate parameters for tasks like
lung texture classification. Machine learning techniques such as support vector machines (SVM), content-based
medical image indexing, and wavelet analysis are applied for solid texture classification.
2. Drug Development: The drug discovery process is quite complex and involves a wide range of professions. The
best ideas are frequently constrained by billions of dollars in testing and by significant money and time
commitments. On average, it takes twelve years to make a formal submission.
3. Genetics & Genomics: Through genetics and genomics research, Data Science applications also provide a
higher level of therapy customization. The goal is to discover specific biological linkages between genetics,
illnesses, and treatment response in order to better understand the impact of DNA on our health. Data science
tools enable the integration of various types of data with genomic data in illness research, allowing for a better
understanding of genetic concerns in medication and disease reactions.
Internet Search
When you think about Data Science Applications, this is usually the first thing that comes to mind.
When we think of search, we immediately think of Google. Right? However, there are other search engines, such as
Yahoo, Bing, Ask, AOL, and others. Data science techniques are used by all of these search engines (including
Google) to offer the best result for our searched query in a matter of seconds, which is remarkable given that
Google alone processes over 20 petabytes of data per day.
Targeted Advertising
If you thought Search was the most important data science application, consider the full digital marketing spectrum.
Data science algorithms are used to determine practically everything, from display banners on various websites to
digital billboards at airports.
This is why digital advertisements have a far higher CTR (Click-Through Rate) than traditional advertisements: they
can be tailored to a user's previous actions.
This is why you may see adverts for data science training programs while I see an advertisement for apparel in the
same spot at the same time.
Website Recommendations
Aren't we all used to Amazon's suggestions for similar products? They not only assist you in locating suitable products
from the billions of products accessible, but they also enhance the user experience.
Many businesses have aggressively employed this engine to market their products based on user interest and
information relevance. This technique is used by internet companies such as Amazon, Twitter, Google Play, Netflix,
LinkedIn, IMDB, and many others to improve the user experience. The recommendations are based on a user's
previous search results.
Advanced Image Recognition
When you share a photograph on Facebook with your friends, you start receiving suggestions to tag them. This
automatic tag suggestion feature uses a face recognition algorithm.
Facebook's recent posts detail the additional progress they have made in this area, highlighting their improvements
in image recognition accuracy and capacity.
Speech Recognition
Google Voice, Siri, Cortana, and other speech recognition products are some of the best examples. Even if you are
unable to compose a message, your life will not come to a halt if you use the speech-recognition option. Simply say
the message out loud, and it will be transformed to text. However, you will notice that voice recognition is not always
correct.
Airline Route Planning
The airline industry has been known to suffer significant losses all over the world. With the exception of a few
aviation service providers, companies are struggling to maintain their occupancy ratios and operating profits. The
huge rise in air-fuel prices and the need to offer heavy discounts to customers have made the situation worse. It
wasn't long before airlines began to use data science to pinpoint important areas for improvement, such as route
planning and predicting flight delays.
Data explosion
Parallel to the expansion in IT companies' service offerings, there is growth in another environment: the data
environment. The volume of data is practically exploding by the day. Not only this, the data that is available
now is becoming increasingly unstructured. Statistics from IDC projected that global data would grow by up to
44 times, amounting to a massive 35.2 zettabytes (ZB, a billion terabytes).
These factors, coupled with the need for real-time data, constitute the "Big Data" environment. How can
organizations stay afloat in the big data environment? How can they manage this copious amount of data?
I believe a three-tier approach to managing big data would be the key: the first tier to handle structured data,
the second involving appliances for real-time processing, and the third for analyzing unstructured content. Can
this structure be tailored for your organization?
No matter what the approach might be, organizations need to create a cost effective method that provides a
structure to big data. According to a report by McKinsey & Company, accurate interpretation of Big Data can
improve retail operating margins by as much as 60%. This is where information management comes in.
Information management is vital to be able to summarise the data into a manageable and understandable form.
It is also needed to extract useful and relevant data from the large pool that is available and to standardize the
data. With information management, data can be standardized in a fixed form. Standardized data can be used
to find underlying patterns and trends.
All trends indicate that organizations have caught on to the importance of navigating the big data
environment. They are maturing and modernizing their existing technologies to accommodate those that will help
manage the flux of data. One worrying trend, though, is the lack of the talent pool necessary to capitalize on
Big Data.
Statistics say that the United States alone could face a shortage of 140,000 to 190,000 people with the requisite
analytical and decision-making skills by 2018. Organizations are now looking for partners for effective
information management, forming mutually beneficial, long-sighted arrangements.
The challenge before the armed forces is to develop tools that enable extraction of relevant information from
the data for mission planning and intelligence gathering. And for that, armed forces require data scientists like
never before.
Big Data describes a massive volume of both structured and unstructured data. This data is so large that it is
difficult to process using traditional database and software techniques. While the term refers to the volume of
data, it includes technology, tools and processes required to handle the large amounts of data and storage
facilities.
The phenomenon of exponential multiplication of the data that gets stored is termed 'Data Explosion'. A
continuous flow of real-time data from various processes, machinery, and manual inputs keeps flooding the
storage servers every second.
Reasons for data explosion in innovation:
o Business model transformation
o Globalization
o Personalization of services
o New sources of data
5 V's of Big Data
In recent years, the "3 Vs" of Big Data have been extended to the "5 Vs," which are also known as the characteristics
of Big Data and are as follows:
Volume
Volume refers to the amount of data generated through websites, portals and online applications. Especially
for B2C companies, Volume encompasses the available data that are out there and need to be assessed for
relevance.
Volume defines the data infrastructure capability of an organization’s storage, management and delivery of
data to end users and applications. Volume focuses on planning current and future storage capacity -
particularly as it relates to velocity - but also in reaping the optimal benefits of effectively utilizing a current
storage infrastructure.
Volume is the V most associated with big data because, well, volume can be big. What we’re talking about
here is quantities of data that reach almost incomprehensible proportions.
Facebook, for example, stores photographs. That statement doesn't begin to boggle the mind until you realize that
Facebook has more users than China has people, and each of those users has stored a whole lot of photographs.
Facebook is storing roughly 250 billion images.
Try to wrap your head around 250 billion images. Or try this one: as far back as 2016, Facebook had 2.5 trillion
posts. Seriously, that's a number so big it's pretty much impossible to picture.
So, in the world of big data, when we start talking about volume, we're talking about insanely large amounts
of data. As we move forward, we're going to have more and more huge collections. For example, as we add
connected sensors to pretty much everything, all that telemetry data will add up.
How much will it add up? Consider this. Gartner, Cisco, and Intel estimate there will be between 20 and 200 billion
(no, they don't agree, surprise!) connected IoT devices; the numbers are huge no matter what. But it's not just
the quantity of devices.
Consider how much data is coming off each one. I have a temperature sensor in my garage. Even with a
one-minute level of granularity (one measurement a minute), that's still roughly 525,600 data points in a year,
and that's just one sensor. If you have a factory with a thousand sensors, you're looking at over half a billion
data points per year, for temperature alone.
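A quick back-of-the-envelope check of that figure, as a minimal sketch; the per-minute sampling rate and the thousand-sensor factory are the numbers quoted above:

# Rough estimate of telemetry volume for minute-level sampling.
readings_per_year = 60 * 24 * 365        # one reading per minute for a year
print(readings_per_year)                 # 525600 data points per sensor per year

sensors = 1000                           # a factory with a thousand sensors
print(readings_per_year * sensors)       # 525,600,000 -> over half a billion points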
Velocity
With Velocity we refer to the speed with which data are being generated. Staying with our social media
example, every day 900 million photos are uploaded on Facebook, 500 million tweets are posted on Twitter,
0.4 million hours of video are uploaded on YouTube and 3.5 billion searches are performed in Google.
This is like a nuclear data explosion. Big Data technologies help companies handle this explosion, absorb the
incoming flow of data, and at the same time process it fast enough that it does not create bottlenecks.
250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users
upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a
drop in the bucket in a few months.
Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs
every day. It has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.
Here's another example. Let's say you're running a marketing campaign and you want to know how the folks
"out there" are feeling about your brand right now. How would you do it? One way would be to license some
Twitter data from Gnip (acquired by Twitter) to grab a constant stream of tweets and subject them to
sentiment analysis.
That feed of Twitter data is often called "the firehose" because so much data is being produced that it feels
like being at the business end of a fire hose.
Here’s another velocity example: packet analysis for cyber security. The Internet sends a vast amount of
information across the world every second. For an enterprise IT team, a portion of that flood has to travel
through firewalls into a corporate network.
Variety
Variety refers to the different forms that data can take: structured data such as database tables, semi-structured
data such as XML or JSON, and unstructured data such as text, images, audio, and video. Big data systems must be
able to store and process all of these formats together.
Veracity
It refers to data inconsistencies and uncertainty: available data can become messy, and quality and accuracy
are difficult to control.
Because of the numerous data dimensions originating from multiple distinct data types and sources, Big Data
is also volatile.
For example, a large amount of data can cause confusion, while a smaller amount may convey only partial or
incomplete information.
Value
After considering the four V's, there is one more V to consider: Value. Data with no value is useless to the
organization until it is converted into something beneficial.
Data is of no utility or relevance in and of itself; it must be turned into something useful in order to extract
information. As a result, Value can be considered the most essential of the five V's.
Relationship between Data Science and Information Science
The finding of knowledge or actionable information in data is what data science is all about.
The design of procedures for storing and retrieving information is known as information science.
Data science and information science are two separate but related fields.
Data science draws primarily on computer science and mathematics, whereas information science draws on library
science, cognitive science, and communications.
Business tasks such as strategy formation, decision making, and operational processes all require data science. It
discusses Artificial Intelligence, analytics, predictive analytics, and algorithm design, among other topics.
Knowledge management, data management, and interaction design are all domains where information science is
employed.
Definitions:
Data science: The finding of knowledge or actionable information in data is what data science is all about.
Information science: The design of procedures for storing and retrieving information is known as information science.
Business Intelligence versus Data Science
Data science
Data science is a field in which data is mined for information and knowledge using a variety of scientific methods,
algorithms, and processes. It can thus be characterised as a collection of mathematical tools, algorithms, statistics,
and machine learning techniques that are used to uncover hidden patterns and insights in data to aid decision-making.
Data science deals with both structured and unstructured data, and is closely related to data mining and big data.
Data science studies historical trends and then applies the findings to reshape current trends and forecast future
ones.
Business intelligence
Business intelligence (BI) is a combination of technology, tools, and processes that businesses utilise to analyse
business data. It is mostly used to transform raw data into useful information that can then be used to make business
decisions and take profitable actions. It is concerned with the analysis of organised and unstructured data in order to
open up new and profitable business opportunities. It favours fact-based decision-making over assumption-based
decision-making. As a result, it has a direct impact on a company's business decisions. Business intelligence tools
improve a company's prospects of entering a new market and aid in the analysis of marketing activities.
The following table compares and contrasts Data Science with Business Intelligence:
Data Science Life Cycle
A data science life cycle is the series of data science steps you go through to complete a project or analysis.
Because each data science project and team is unique, each data science life cycle is also unique; most data science
projects, however, follow a similar generic life cycle.
Some data science life cycles concentrate just on the data, modelling, and evaluation stages. Others are more
complete, beginning with an understanding of the company and ending with deployment. And the one we'll go over is
considerably bigger, as it includes operations. It also places a greater emphasis on agility than other life cycles.
There are six stages to this life cycle:
Understanding Problem
Data Collection
Data Cleaning and Processing
Exploratory Data Analysis (EDA)
Model building and evaluation
Model Deployment
These steps do not follow a straight line. Step one is completed first, followed by step two, but after that you
should flow naturally between the steps as needed. It is preferable to take many small incremental steps rather
than a few big comprehensive ones.
Data Types
Data consists of quantities, letters, or symbols on which a computer performs operations, and which can be stored
and communicated as electrical signals and recorded on magnetic, optical, or mechanical media.
Structured data
Structured data is any data that can be processed, accessed, and stored in a fixed format. Over time, software
engineering expertise has made significant progress in developing strategies for working with this type of data and
inferring a benefit from it. Nonetheless, we are predicting challenges in the future when the size of such data grows to
huge proportions, with average quantities approaching zettabytes.
Structured data is the most straightforward type of big data to work with. It is closely linked to measurements
defined by fixed parameters, such as:
● Address
● Age
● Expenses
● Contact
● Billing
Example
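A minimal illustrative sketch in Python with pandas (the column names and values below are hypothetical) showing how structured data fits a fixed, tabular schema:

import pandas as pd

# Hypothetical customer records in a fixed, tabular schema (structured data).
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 41, 29],
    "address": ["Pune", "Mumbai", "Nashik"],
    "monthly_expenses": [12000.0, 18500.0, 9500.0],
})
print(customers.dtypes)   # every column has a well-defined type
print(customers)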
Unstructured data
This type of big data incorporates a large number of unstructured files, such as image files, audio files, log
files, and video files. Unstructured data refers to data with an unknown structure or model. Because of its
magnitude, unstructured data presents unique challenges when it comes to preparing it and extracting value from it.
A complex data source containing a mix of photos, videos, and text files is an example of this. Many organisations
have a wealth of such information at their disposal but, because the data is in its raw form, they are unable to
derive value from it.
Semi-structured data
Semi-structured data is a type of big data that includes both unstructured and structured data formats. More
precisely, it refers to data that carries tags or markers that separate individual elements within the data, even
though it has not been organised under a formal database schema. This brings us to the end of the big data types.
Examples
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
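As a minimal sketch, assuming records shaped like the <rec> elements above, Python's standard xml.etree module can pull the tagged fields of such semi-structured data into a structured table:

import xml.etree.ElementTree as ET

# Semi-structured records: tags separate the fields, but no fixed schema is enforced.
xml_data = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_data)
rows = [
    {"name": rec.findtext("name"), "sex": rec.findtext("sex"), "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]
print(rows)   # now a structured list of records ready for analysis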
Data collection
It is the systematic approach to gathering and measuring information from a variety of sources to get a
complete and accurate picture of an area of interest. The big data includes information produced by humans
and devices.
In the Big Data cycle, data collection is the most crucial step. The Internet provides nearly limitless data
sources on a wide range of topics.
The relevance of this area varies by industry, but conventional sectors can obtain a variety of external data
sources and combine them with transactional data.
Consider the case where we want to create a system that recommends eateries. The initial stage would be to
collect data from various websites, in this case restaurant reviews, and store it in a database (a minimal
scraping sketch follows). Because we are only interested in raw text and would use it for analytics, it does not
matter much where the data for building the model is stored. This may seem counterintuitive given the core
technologies of big data, but the key requirement for deploying a big data application is to make it work in
real time.
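A minimal collection sketch under stated assumptions: the URL is a placeholder, the page is assumed to mark each review with a "review" CSS class, and the requests and beautifulsoup4 packages are assumed to be installed.

import requests
from bs4 import BeautifulSoup

# Hypothetical review page; the URL and the 'review' CSS class are placeholders.
url = "https://example.com/restaurants/some-restaurant/reviews"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
reviews = [tag.get_text(strip=True) for tag in soup.select(".review")]

# Store the raw text for later analytics; here just a list, in practice a database.
print(f"Collected {len(reviews)} reviews")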
Data Wrangling
It is the process of transforming raw data into a more processed shape by recognizing, cleaning and enriching
it.
Or it can be said to be a process of getting data from its raw format into something suitable for more
conventional analytics.
Data wrangling can be defined as the process of cleaning, organizing and transforming raw data into the
desired format for analysts to use for prompt decision making.
Data wrangling is critical; it is regarded as the backbone of the entire analysis process. Data wrangling's
fundamental goal is to make raw data usable, in other words, to put the data into a usable shape. Data scientists,
on average, spend around 75% of their time wrangling data, which comes as no surprise. The following are some of
the most important reasons why data wrangling is needed:
When working with data, your analysis and insights are only as good as the data you use. If you do data analysis on
stale data, your company will be unable to make efficient and productive judgments.
Data Cleaning
Data cleaning is an important element of data management since it allows you to ensure that your data is of
high quality.
Data cleaning entails more than just correcting spelling and grammatical mistakes. It's a key machine learning
approach and a vital part of data science analytics. Today, we'll learn more about data cleaning, its
advantages, data concerns that can develop, and next steps in your learning.
Data cleaning, also known as data cleansing, is the act of eliminating or correcting inaccurate, incomplete, or
duplicate data from a dataset. The initial stage in your workflow should be data cleaning.
There's a good chance you'll duplicate or mislabel data while working with large datasets and merging several
data sources. If your data is faulty or incorrect, it loses its value, and your algorithms and results become
untrustworthy.
Data cleaning differs from data transformation: cleaning involves deleting data from your dataset that does not
belong there, whereas data transformation is the process of converting data into a different format or structure.
Data wrangling and data munging are terms used to describe data transformation procedures.
Although data cleansing may appear to be a tedious and monotonous activity, it is one of the most crucial tasks a data
scientist must complete. Data that is inaccurate or of poor quality can sabotage your operations and analyses. A
brilliant algorithm can be ruined by bad data.
High-quality data, on the other hand, can make even a simple algorithm provide excellent results. You should become
familiar with a variety of data cleaning techniques in order to improve the quality of your data. Not every piece of
information is valuable, so another important factor affecting the quality of your data is where it comes from.
While data cleaning processes differ depending on the sorts of data your firm stores, you can use the following
fundamental steps to create a foundation for your company.
Step 1: Remove unwanted observations
Remove any undesirable observations, such as duplicates or irrelevant observations, from your dataset. Duplicate
observations most often arise during data collection: duplicates can be created when you integrate data sets from
numerous sources, scrape data, or receive data from clients or multiple departments, so deduplication is one of the
most important parts of this step. Irrelevant observations are observations that are not relevant to the problem
you are trying to solve. For example, if you want to study data about millennial customers but your dataset
includes observations from older generations, you may wish to eliminate those observations.
Step 2: Fix structural errors
When you measure or transfer data and find unusual naming conventions, typos, or wrong capitalization, you have
structural issues. Mislabeled categories or classes can result from these inconsistencies. "N/A" and "Not Applicable,"
for example, may both exist, but they should be examined as one category.
Step 3: Filter unwanted outliers
There will frequently be one-off observations that, at first sight, do not appear to fit the data you are studying.
If you have a good reason to delete an outlier, such as an incorrect data entry, doing so will make the data you
are working with perform better. On the other hand, the appearance of an outlier can sometimes support a theory you
are working on. It is important to remember that just because an outlier exists does not mean it is wrong; this
step is needed to determine the value's legitimacy. Consider deleting an outlier only if it appears to be
irrelevant to the analysis or is a mistake.
Step 4: Handle missing data
Many algorithms will not accept missing values, so you cannot simply ignore them. There are several options for
dealing with missing data; none is ideal, but all can be considered (a minimal pandas sketch covering these
cleaning steps follows this list).
As a first option, you can drop observations with missing values, but be aware that this causes you to lose
information.
As a second option, you can fill in missing values based on other observations; however, you risk losing data
integrity because you are working with assumptions rather than actual observations.
As a third option, you may change the way the data is used so that it navigates null values more effectively.
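A minimal sketch of these cleaning steps using pandas (the DataFrame, column names, and category labels are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical raw customer data with duplicates, structural errors,
# an impossible outlier, and a missing value.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "status": ["N/A", "N/A", "Not Applicable", "active", "Active"],
    "age": [34, 34, np.nan, 29, 410],
})

clean = raw.drop_duplicates().copy()                             # Step 1: remove duplicate observations
clean["status"] = (clean["status"].str.strip().str.lower()
                   .replace({"not applicable": "n/a"}))          # Step 2: fix structural errors
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()   # Step 3: drop impossible outliers
clean["age"] = clean["age"].fillna(clean["age"].median())        # Step 4: impute missing values
print(clean)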
Data Integration
The technical and business methods used to combine data from many sources into a unified, single view of the
data are referred to as data integration.
Data integration is the process of combining data from various sources into a single dataset, with the goal of
providing users with consistent data access and delivery across a wide range of subjects and structure types,
and of meeting the information requirements of all applications and business processes.
The data integration process is one of the most important parts of the overall data management process, and it
is becoming more common as big data integration and the need to share existing data grow in importance.
Data integration architects provide data integration tools and platforms that allow for an automated data integration
process that connects and routes data from source systems to target systems. This can be accomplished via a variety of
data integration methods, such as:
Extract, Transform, and Load (ETL) - copies of datasets from various sources are combined, harmonised,
and loaded into a data warehouse or database (a minimal ETL-style sketch in Python follows this list).
Extract, Load, and Transform (ELT) - data is loaded into a big data system in its raw form and then transformed
later for specific analytical purposes.
Change Data Capture (CDC) - real-time data changes in databases are identified and applied to a data warehouse or
other repositories.
Data Replication - data from one database is duplicated to other databases in order to keep the information
synchronized for operational and backup purposes.
Data Virtualization - instead of importing data into a new repository, data from disparate systems is virtualized
and combined to create a unified view.
Streaming Data Integration - a real-time data integration method that involves continuously integrating and
feeding multiple streams of data into analytics systems and data repositories.
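A minimal ETL-style sketch using pandas and the standard sqlite3 module (the source data, column names, conversion rate, and target table are hypothetical):

import sqlite3
import pandas as pd

# Extract: read copies of two hypothetical source datasets.
crm = pd.DataFrame({"customer_id": [1, 2], "country": ["IN", "US"]})
billing = pd.DataFrame({"customer_id": [1, 2], "amount": [1200.0, 650.0]})

# Transform: harmonise and combine them into a single view.
combined = crm.merge(billing, on="customer_id", how="inner")
combined["amount_inr"] = combined["amount"] * 83.0   # assumed conversion rate, illustrative only

# Load: write the unified dataset into a local warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("customer_billing", conn, if_exists="replace", index=False)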
Big data, with all of its benefits and challenges, is being embraced by businesses that want to stay competitive and
relevant. Data integration enables searches in these massive databases, with benefits ranging from corporate
intelligence and consumer data analytics to data enrichment and real-time data delivery.
The management of company and customer data is one of the most common use cases for data integration services
and solutions. To provide corporate reporting, business intelligence (BI data integration), and advanced analytics,
enterprise data integration feeds integrated data into data warehouses or virtual data integration architecture.
Customer data integration gives a holistic picture of key performance indicators (KPIs), financial risks, customers,
manufacturing and supply chain operations, regulatory compliance activities, and other areas of business processes to
business managers and data analysts.
In the healthcare industry, data integration is extremely vital. By arranging data from several systems into a single
perspective of relevant information from which helpful insights can be gained, integrated data from various patient
records and clinics aids clinicians in identifying medical ailments and diseases. Medical insurers benefit from effective
data gathering and integration because it assures a consistent and accurate record of patient names and contact
information. Interoperability is the term used to describe the exchange of data across various systems.
Common issues that arise during data integration:
Entity identification problem: Since data is unified from heterogeneous sources, how do we match real-world
entities across the data?
o For example, in customer data an entity from one data source has customer_id while an entity from another
data source has customer_no. Here, schema integration can be achieved by using the metadata of
each attribute.
Redundancy: Redundant data is any data which is unimportant or no longer needed. Redundancy
arises when attributes can be derived from other attributes in the dataset.
o For example, if one dataset has the customer's age and another dataset has the customer's date of birth, then
age is a redundant attribute, as it can be derived from the date of birth.
o Redundancy can be discovered using correlation analysis. The attributes are analysed to detect their
interdependency and thereby the correlation between them (a minimal sketch follows this list).
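A minimal sketch of detecting a redundant attribute with correlation analysis in pandas (the birth years, spend values, and reference year are hypothetical):

import pandas as pd

# Hypothetical integrated customer data: 'age' is derivable from 'birth_year'.
data = pd.DataFrame({"birth_year": [1990, 1985, 2000, 1978]})
data["age"] = 2024 - data["birth_year"]        # assumed reference year, illustrative only
data["monthly_spend"] = [320.0, 410.0, 150.0, 500.0]

# A correlation of -1 between birth_year and age flags one of them as redundant.
print(data.corr(numeric_only=True).round(2))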
Data Reduction
The vast gathering of large data streams introduces the 'curse of dimensionality', with millions of
features (variables and dimensions), which raises the storage and computational complexity of big data
systems. Big data reduction is therefore primarily thought of as a dimensionality reduction problem.
"Data reduction is the transition of numerical or alphabetical digital information derived empirically or
experimentally into a rectified, ordered, and simpler form," according to a formal definition.
Simply said, it means that enormous amounts of data are cleansed, sorted, and categorized based on
predetermined criteria to aid in the making of business choices.
Therefore, it means obtaining a reduced representation of the dataset which is much smaller in volume yet
produces (almost) the same analytical results as the non-reduced data.
Data reduction strategies:
o Data cube aggregation involves applying summary or aggregation operations to the data in order to
construct a data cube.
o Dimensionality reduction involves detection and removal of redundant attributes in order to reduce
the data size.
o Data compression involves encoding mechanisms to reduce the size of dataset.
o Numerosity reduction involves replacement or estimation of data by an alternative.
o Discretization and concept hierarchy generation involves replacing raw data values for attributes with
ranges or higher conceptual levels.
Dimensionality Reduction
Dimensionality Reduction and Numerosity Reduction are the two main approaches of data reduction.
The technique of reducing the number of dimensions (random variables) across which data is spread is known as
dimensionality reduction (hence obtaining a set of principal variables).
If the original data can be reconstructed from the compressed data without any major loss of information, then the data
reduction is called lossless. If instead we can reconstruct only an approximation of the original data then the data
reduction is called lossy.
As the number of dimensions grows, so does the sparsity of the properties or features that the data set contains;
clustering, outlier analysis, and other methods are affected by this sparsity. Data is easier to visualise and
handle when the number of dimensions is reduced. Dimensionality reduction can be divided into three categories.
Wavelet Transform: The Wavelet Transform is a lossy dimensionality reduction approach in which a data
vector X is transformed into another vector X' while maintaining the same length for both X and X'. Unlike its
original, the wavelet transform result can be truncated, resulting in dimensionality reduction. Wavelet
transforms work well with data cubes, sparse data, and severely skewed data. The wavelet transform is
frequently used in image compression.
Principal Component Analysis (PCA): This strategy entails identifying a small number of orthogonal vectors
(principal components) of the n attributes that can represent the complete data set. It can be applied to data
that is skewed or sparse, and it is a lossy reduction; a minimal sketch is shown after these three categories.
Attribute Subset Selection: A core attribute subset excludes attributes that aren't useful to data mining or are
redundant. The selection of the core attribute subset decreases the data volume and dimensionality.
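A minimal PCA sketch with scikit-learn (the randomly generated data and the choice of two components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 samples with 10 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

# Reduce 10 dimensions to 2 principal components (lossy reduction).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # most of the variance is retained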
Numerosity Reduction: This strategy reduces data volume by using alternate, compact forms of data
representation. Parametric and Non-Parametric Numerosity Reduction are the two forms of Numerosity
Reduction.
Parametric: This method assumes that the data fits into a model. The parameters of the data model are
estimated, and only those parameters are saved, with the remainder of the data being destroyed. If the data fits
the Linear Regression model, for example, a regression model can be utilised to achieve parametric reduction.
A linear relationship between two data set features is modelled using linear regression. Let's imagine we need
to fit a linear regression model between two variables, x and y, where y is the dependent variable and x is the
independent variable. The model can be expressed by the equation y = wx + b, where w and b are the regression
coefficients. We can define the variable y in terms of several predictor attributes using a multiple linear
regression model (a minimal sketch of this parametric approach appears after the non-parametric item below).
The Log-Linear model is another way for determining the relationship between two or more discrete
characteristics. Assume we have a collection of tuples in n-dimensional space; the log-linear model may be
used to calculate the probability of each tuple in this space.
Non-Parametric: There is no model in a non-parametric numerosity reduction strategy. The non-Parametric
technique produces a more uniform reduction regardless of data size, but it does not accomplish the same
large volume of data reduction as the Parametric technique. Non-parametric data reduction techniques include
Histogram, Clustering, Sampling, Data Cube Aggregation, and Data Compression.
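A minimal sketch of parametric numerosity reduction with a linear model (the generated data is hypothetical; only the fitted coefficients w and b would be stored in place of the raw points):

import numpy as np

# Hypothetical data that roughly follows y = wx + b.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=1000)

# Fit the model; keep only the two parameters instead of the 1000 raw points.
w, b = np.polyfit(x, y, deg=1)
print(round(w, 2), round(b, 2))      # approximately 3.0 and 2.0

# Any y can later be approximated from x using the stored parameters.
print(round(w * 5.0 + b, 2))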
Data Transformation
The process of modifying the format, structure, or values of data is known as data transformation. Data can be
modified at two phases of the data pipeline for data analytics initiatives.
On-premises data warehouses often employ an ETL (extract, transform, load) method, with data
transformation serving as the middle phase. The majority of businesses now employ cloud-based data
warehouses, which can increase computation and storage resources in seconds or minutes.
Because of the cloud platform's scalability, enterprises can forego preload transformations and instead load
raw data into the data warehouse, where it is transformed at query time, a paradigm known as ELT
(extract, load, transform).
Data transformation can be used in a variety of processes, including data integration, data migration, data
warehousing, and data wrangling.
Data transformation can be constructive (adding, copying, and replicating data), destructive (deleting fields and
records), aesthetic (standardising salutations or street names), or structural (renaming, moving, and combining
columns in a database).
An organisation can choose from a number of ETL technologies to automate the data transformation process.
Data analysts, data engineers, and data scientists use scripting languages like Python or domain-specific
languages like SQL to alter data.
Common data transformation strategies include the following (a minimal normalization sketch follows this list):
1. Smoothing
2. Aggregation
3. Generalization
4. Normalization
5. Attribute construction
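A minimal sketch of two of these strategies, min-max normalization and attribute construction, in pandas (the column names and values are hypothetical):

import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({"units": [10, 25, 40], "unit_price": [200.0, 180.0, 150.0]})

# Normalization: rescale 'units' to the [0, 1] range (min-max scaling).
df["units_scaled"] = (df["units"] - df["units"].min()) / (df["units"].max() - df["units"].min())

# Attribute construction: derive a new attribute from existing ones.
df["revenue"] = df["units"] * df["unit_price"]
print(df)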
Data is transformed to make it better organised. Transformed data may be easier for both humans and computers
to use.
Null values, unexpected duplicates, wrong indexing, and incompatible formats can all be avoided with
properly structured and verified data, which enhances data quality and protects programmes from potential
landmines.
Data transformation makes it easier for applications, systems, and different types of data to work together.
Data that is utilised for several purposes may require different transformations.
It is possible that data transformation will be costly. The price is determined by the infrastructure, software,
and tools that are utilised to process data. Licensing, computing resources, and recruiting appropriate
employees are all possible expenses.
Data transformations can be time-consuming and resource-intensive. Performing transformations after loading
data into an on-premises data warehouse, or altering data before feeding it into apps, might impose a strain on
other operations. Because the platform can scale up to meet demand, you can conduct the changes after
loading if you employ a cloud-based data warehouse.
During transformation, a lack of expertise or carelessness can cause issues. Data analysts without appropriate
subject matter expertise are less likely to spot typos or incorrect data because they are unfamiliar with the
range of valid and permissible values. Someone working with medical data who is unfamiliar with the relevant
terminology, for example, may misspell disease names or fail to flag disease names that should be mapped to a
single value.
Enterprises have the ability to undertake conversions that do not meet their requirements. For one application,
a company may alter information to a specific format, only to restore the information to its previous format
for another.
For a variety of reasons, you might want to modify your data. Businesses typically seek to convert data in order to
make it compatible with other data, move it to another system, combine it with other data, or aggregate information in
the data.
Consider the following scenario: your company has acquired a smaller company, and you need to merge the Human
Resources departments' records. Because the purchased company's database differs from the parent company's, you'll
have to perform some legwork to guarantee that the records match. Each new employee has been given an employee
ID number, which can be used as a key. However, you'll need to standardise the date formatting, delete any duplicate
rows, and make sure that the Employee ID field doesn't contain any null values so that all employees are counted.
You perform all of these crucial activities in a staging area before loading the data to the final target (a
minimal sketch of these steps appears after the scenarios below).
You're migrating your data to a new data store, such as a cloud data warehouse, and you need to modify the
data types.
You'd like to combine unstructured or streaming data with structured data in order to examine the data as a
whole.
You wish to enrich your data by doing lookups, adding geographical data, or adding timestamps, for example.
You want to compare sales statistics from different regions or total sales from multiple regions.
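A minimal sketch of the staging-area work described in the HR scenario above, in pandas (the record contents, column names, and date formats are hypothetical):

import pandas as pd

# Hypothetical HR records from the parent company and the acquired company.
parent = pd.DataFrame({"employee_id": [1, 2], "hire_date": ["2021-03-01", "2022-07-15"]})
acquired = pd.DataFrame({"employee_id": [3, 3, None],
                         "hire_date": ["01/05/2020", "01/05/2020", "12/11/2019"]})

# Standardise the date formatting in the acquired records (DD/MM/YYYY -> ISO).
acquired["hire_date"] = pd.to_datetime(acquired["hire_date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")

# Delete duplicate rows, drop rows with a null Employee ID, then combine.
acquired = acquired.drop_duplicates().dropna(subset=["employee_id"])
merged = pd.concat([parent, acquired], ignore_index=True)
print(merged)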
Data Discretization
The technique of transforming continuous data into discrete buckets by grouping it is known as data
discretization. Discretization is also known for making data easier to maintain.
When a model is trained using discrete data, it is faster and more effective than when it is trained with
continuous data. Despite the fact that continuous-valued data includes more information, large volumes of
data can cause the model to slow down.
Discretization can assist us in striking a balance between the two. Binning and histogram analysis are two
well-known data discretization techniques; a minimal binning sketch follows.
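A minimal binning sketch with pandas (the age values, bin edges, and labels are hypothetical):

import pandas as pd

# Hypothetical continuous attribute: customer ages.
ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width style binning into labelled buckets (discretization).
buckets = pd.cut(ages, bins=[18, 30, 45, 60, 75], labels=["young", "adult", "middle-aged", "senior"])
print(buckets.value_counts().sort_index())

# Equal-frequency binning: each bin holds roughly the same number of values.
print(pd.qcut(ages, q=4).value_counts().sort_index())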
Although data discretization is beneficial, we must carefully select the range of each bucket, which is a
difficult task.
The most difficult part of discretization is deciding on the number of intervals or bins to use and how to
determine their width.
In recent years, the discretization procedure has piqued public interest and has shown to be one of the most effective
data pre-processing approaches in DM.
Discretization, to put it another way, converts quantitative data into qualitative data, resulting in a non-overlapping
partition of a continuous domain. It also ensures that each numerical value is associated with a certain interval.
Because it reduces data from a vast domain of numeric values to a subset of categorical values, discretization is
considered a data reduction process.
Many DM methods that can only deal with discrete attributes require the use of discretized data. Three of the
approaches listed among the top ten algorithms in DM, for example, need data discretization in some form. One of
discretization's key benefits is that it yields significant gains in learning speed and accuracy. Furthermore,
when discrete values are used, some decision tree-based algorithms give shorter, more compact, and more accurate
outputs.
A large number of discretization proposals can be found in the specialised literature; in fact, numerous surveys
have been created in an attempt to systematise the available strategies. When working with a new real-world
problem or data set, it is critical to figure out which discretizer is the best fit, since this will determine the
success and applicability of the subsequent learning phase in terms of the accuracy and simplicity of the solution
obtained. Despite the effort put into categorising the entire family of discretizers, the most well-known and most
effective ones have been gathered into newer taxonomies in the literature.
References:
1. David Dietrich, Barry Heller, "Data Science and Big Data Analytics", EMC Education Services, Wiley
publication, 2012, ISBN 0-07-120413-X
2. EMC Education Services, “Data Science and Big Data Analytics- Discovering, analyzing Visualizing and
Presenting Data”
3. DT Editorial Services, “Big Data, Black Book”, DT Editorial Services, ISBN: 9789351197577, 2016 Edition
4. Chirag Shah, "A Hands-On Introduction To Data Science", Cambridge University Press, 2020, ISBN
978-1-108-47244-9