Data Structure & Data Mining

Data Structures in Data Science

What is a Data Structure?

A data structure is used to store data in an organized fashion in order to make data manipulation and other data operations more efficient.

Types of Data Structures:

1. Vector- It is one of the basic data structures and is homogeneous in nature. This means it only contains elements of the same data type. Data types can be numeric, integer, character, complex or logical.

How to Create a Vector in R:

In R programming, the c() function is used to create a vector.

Example:
x <- c(44, 25, 64, 96, 30)

How to Create a Vector in Python:

In Python, use the NumPy np.array() function to create a vector.

import numpy as np

# Vector as row
vector_row = np.array([1, 2, 3])
vector_row
# Vector as column
vector_column = np.array([[1], [2], [3]])
vector_column

2. Matrix- A matrix is a 2-dimensional data structure that is homogeneous in nature. This means that it only accepts elements of the same data type. Coercion takes place if elements of different data types are passed.

How to Create a Matrix in R:

In R, it is created using the matrix() function. The basic syntax to create a matrix is:
matrix(data, nrow, ncol, byrow, dimnames)
where:

 data- input element, given as a vector.
 nrow- number of rows to be created.
 ncol- number of columns to be created.
 byrow- row-wise arrangement of the elements instead of column-wise.
 dimnames- names of columns/rows to be created.

Example:
M <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = TRUE)
M

How to Create a Matrix in Python:

In Python, use the np.mat() function (or, more commonly, np.array()) to create a matrix.

Example:
import numpy as np

matrix = np.mat([[1, 2],
                 [1, 2],
                 [1, 2]])
matrix

3. Array- Arrays are multi-dimensional data structures. In an array, data is stored as a collection of matrices arranged in rows and columns. We can use the matrix level, row index, and column index to access the array elements.
How to Create an Array in R:

In R, an array is created using the array() function. We will use vectors as the input for this example.
vector1 <- c(10,12,40)
vector2 <- c(15,17,27)
output <- array(c(vector1,vector2),dim = c(2,2,2))
output

How to Create an Array in Python:

In Python, square brackets create a list, which is commonly used as an array.

cars = ["Ford", "Volvo", "BMW"]
cars

4. Series- It is exclusive to Python, specifically the Pandas library. It is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the 'index'.

How to Create a Series in Python:


First create an array using the np.array() function. Then feed the array as input to the pd.Series() function.

import numpy as np
import pandas as pd

a = np.array(['g', 'e', 'e', 'k', 's'])
s = pd.Series(a)
s

5. Data Frame- A data frame is a 2-dimensional structure that resembles a table. Each column contains values of one variable and each row contains one set of values from each column. The data stored in a data frame can be numeric, factors or characters. Each column should contain the same number of data items.

How to Create a Data Frame in R:

First create a set of vectors. Then use the data.frame() function to create the data frame.

Example:
df1 <- c(1:4)
df2 <- c("Sam","Rob","Max","John")
df3 <- c("Google","Apple","Microsoft","Amazon")
df.data <- data.frame(df1,df2,df3)

print(df.data)
How to Create a Data Frame in Python:

In Python, a collection of series is called a data frame. We use the pandas library and its DataFrame() function to create the data frame.

Example:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'],
        'Price': [22000, 25000, 27000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'])

print(df)

6. Table- It creates tabular summaries (counts) of categorical variables. It is commonly used in R for presentation purposes.

How to create a table in R:

We use the table() function. The example below uses the built-in iris dataset.

output <- table(iris$Species, iris$Sepal.Length)
print(output)
7. Factor- Factors are used in data analysis for statistical
modeling. They are used to categorize categorical variables in
columns, such as “TRUE”, “FALSE” etc., and store them as levels.
They can store both strings and integers. Factors are exclusive to
R.

How to Create a Factor in R:


In R, the factor() function is used to create a factor; it takes a vector as input.

Example:
# Create a vector
x <- c("East","West","East","North","North","East","West")
# Apply the factor function
factor_data <- factor(x)
print(x)
print(is.factor(factor_data))
print(factor_data)

8. List - Lists contain elements of different types, such as numbers, strings, vectors and even another list inside it. A list can also contain a matrix or a function as its elements. It is a collection which is ordered and mutable (can be changed).

Note: It can contain duplicates.

How to Create a List in R:

In R, a list is created using the list() function.


list1 <- list("Red", "Rita", c(21,32,11), TRUE, 51.23)
print(list1)

How to Create a List in Python:

It is as simple as creating a variable, opening square brackets and entering the desired values.

n = ["Red", "Rita", [21, 32, 11], True, 51.23]

9. Dictionary- It is also called a hash map and supports arbitrary keys as well as values. Keys can be numbers, numeric vectors, strings, string vectors. It is unordered, mutable and indexed.

Note: It does not contain duplicate members.


How to Create a Dictionary in R:

You need to use a library such as hash to create a dictionary. Then use the hash() function.

# import library
library(hash)
# create empty dictionary
h <- hash()
# set values
h[["1"]] <- 42
h[["foo"]] <- "bar"
h[["4"]] <- list(a=1, b=2)

How to Create a Dictionary in Python:

Just open curly brackets, define the keys and enter the values.

d = {1: [1, 2, 3, 4], 'Name': 'Bill'}
d

10. Tuple- It is exclusive to Python. It contains elements which are ordered and immutable. A tuple can have any number of items and they may be of different types (integer, float, list, string, etc.).

Note: It can contain duplicate members.


How to Create a Tuple in Python:

Just create a variable, open parentheses and enter the values.


tuple1 = ("apple", 1, False)
print(tuple1)
1. What is Data Mining?
Mining, in casual terms, refers to the extraction of valuable minerals. In the
21st century, data is the most valuable "mineral". To extract usable data from a
given set of raw data, we use Data Mining.

Through Data Mining, we extract useful information from a given dataset to
discover patterns and identify relationships.

The process of data mining is a complex process that involves intensive data
warehousing as well as powerful computational technologies.

Furthermore, data mining is not only limited to the extraction of data but is
also used for transformation, cleaning, data integration, and pattern analysis.
Another terminology for Data Mining is Knowledge Discovery.
There are various important parameters in Data Mining, such as association
rules, classification, clustering, and forecasting. Some of the key features of
Data Mining are –
 Prediction of patterns based on trends in the data.
 Calculating predictions for the outcomes.
 Creating actionable information in response to the analysis.
 Focusing on large databases.
 Clustering visual data.

2. Data Mining Steps


Knowledge discovery is an essential part of Data Mining. The important steps
involved in Data Mining are –
Step 1: Data Cleaning – In this step, data is cleaned such that there is no noise
or irregularity present within the data.
Step 2: Data Integration – In the process of Data Integration, we combine
multiple data sources into one.
Step 3: Data Selection – In this step, we extract our data from the database.
Step 4: Data Transformation – In this step, we transform the data to perform
summary analysis as well as aggregation operations.
Step 5: Data Mining – In this step, we extract useful data from the pool of
existing data.
Step 6: Pattern Evaluation – We analyze several patterns that are present in the
data.
Step 7: Knowledge Representation – In the final step, we represent the
knowledge to the user in the form of trees, tables, graphs, and matrices.
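
To make these steps concrete, here is a minimal, hypothetical sketch in Python using pandas; the table names, column names, and values are invented purely for illustration and do not come from the text above.

import pandas as pd

# Hypothetical source tables (Data Integration: combine multiple sources)
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                      "amount": [120.0, None, 75.5, 300.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["East", "West", "East"]})

# Data Cleaning: remove rows with missing values (noise / irregularities)
sales = sales.dropna(subset=["amount"])

# Data Integration + Selection: join the sources and keep relevant columns
data = sales.merge(customers, on="customer_id")[["region", "amount"]]

# Data Transformation: aggregate into a summary suitable for mining
summary = data.groupby("region")["amount"].agg(["count", "mean"])

# Pattern Evaluation / Knowledge Representation: inspect the result as a table
print(summary)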
3. Data Mining Applications
There are various applications of Data Mining such as –

 Market and Stock Analysis


 Fraud Detection
 Risk Management and Corporate Analysis
 Analyzing the Customer Lifetime Value

4. Data Mining Tools


Some of the popular tools used for Data Mining are –

4.1 RapidMiner
It is one of the most popular tools for data mining. It is written in Java but
requires no coding to operate it. Furthermore, it provides various data mining
functionalities like data-preprocessing, data representation, filtering,
clustering, etc.
4.2 Weka
Weka is open-source data mining software developed at the University of
Waikato, New Zealand. Like RapidMiner, it has a simple, no-coding GUI.

Using Weka, you can either call the machine learning algorithms directly or
import them with your Java code. It provides a variety of tools like
visualization, pre-processing, classification, clustering, etc.

4.3 KNIME
KNIME is a robust data mining suite that is primarily used for data
preprocessing, that is, ETL: Extraction, Transformation & Loading.
Furthermore, it integrates various components of Machine Learning and Data
Mining to provide an inclusive platform for all suitable operations.
4.4 Apache Mahout
Apache Mahout is an extension of the Hadoop Big Data Platform. The
developers at Apache developed Mahout to address the growing need for data
mining and analytical operations in Hadoop. As a result, it contains various
machine learning functionalities like classification, regression, clustering, etc.
4.5 Oracle Data Mining
Oracle Data mining is an excellent tool for classifying, analyzing and
predicting data. It allows its users to perform data-mining on its SQL
databases to extract views and schemas.

4.6 Teradata
For data mining, warehousing is a necessary requirement. Teradata, also
known as the Teradata Database, provides warehouse services that include
data mining tools.

It can store data based on usage, that is, it stores less-frequently used
data in its 'slow' section and gives fast access to frequently used data.

4.7 Orange
Orange software is most famous for integrating machine learning and data
mining tools. It is written in Python and offers interactive and aesthetic
visualizations to its users.

What is Machine Learning


Machine learning is an application of artificial intelligence
(AI) {Artificial intelligence leverages computers and
machines to mimic the problem-solving and decision-
making capabilities of the human mind.} that provides
systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine
learning focuses on the development of computer programs
that can access data and use it to learn for themselves.
The process of learning begins with observations or data,
such as examples, direct experience, or instruction, in
order to look for patterns in data and make better decisions
in the future based on the examples that we provide. The
primary aim is to allow computers to learn automatically
without human intervention or assistance and adjust
actions accordingly.
Types of Machine Learning Algorithms
There are some variations in how the types of Machine
Learning algorithms are defined, but they are commonly
divided into categories according to their purpose. The
main categories are the following:
Supervised learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning

1. Supervised Learning
The majority of practical machine learning uses supervised
learning.

Supervised learning is where you have input variables (x)


and an output variable (Y) and you use an algorithm to
learn the mapping function from the input to the output.

Y = f (X)

The goal is to approximate the mapping function so well
that, when you have new input data (x), you can predict
the output variables (Y) for that data.
It is called supervised learning because the process of an
algorithm learning from the training dataset can be
thought of as a teacher supervising the learning process.
We know the correct answers, the algorithm iteratively
makes predictions on the training data and is corrected by
the teacher. Learning stops when the algorithm achieves an
acceptable level of performance.

Supervised learning problems can be further grouped into


regression and classification problems.

 Classification: A classification problem is when the output


variable is a category, such as “red” or “blue” or “disease”
and “no disease”.
 Regression: A regression problem is when the output
variable is a real value, such as “dollars” or “weight”.
Some common types of problems built on top
of classification and regression include recommendation
and time series prediction respectively.

Some popular examples of supervised machine learning


algorithms are:

 Linear regression for regression problems.


 Random forest for classification and regression problems.
 Support vector machines for classification problems.

The main types of supervised learning problems include
regression and classification problems.

List of Common Algorithms
Nearest Neighbor
Naive Bayes
Decision Trees
Linear Regression
Support Vector Machines (SVM)
Neural Networks
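
As a rough illustration of the Y = f(X) idea, here is a minimal sketch using scikit-learn's LinearRegression on made-up data; the numbers are invented for illustration only and are not from the text above.

from sklearn.linear_model import LinearRegression

# Made-up training data: X are inputs, y are the known ("supervised") outputs
X = [[1], [2], [3], [4], [5]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x

# The algorithm learns an approximation of the mapping f: X -> y
model = LinearRegression()
model.fit(X, y)

# Predict the output for new, unseen input data
print(model.predict([[6]]))      # expected to be close to 12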

2. Unsupervised Learning
Unsupervised learning is where you only have input data
(X) and no corresponding output variables.

The goal for unsupervised learning is to model the


underlying structure or distribution in the data in order to
learn more about the data.

This is called unsupervised learning because, unlike
supervised learning above, there are no correct answers and
there is no teacher. Algorithms are left to their own devices
to discover and present the interesting structure in the
data.

Unsupervised learning problems can be further grouped


into clustering and association problems.

 Clustering: A clustering problem is where you want to


discover the inherent groupings in the data, such as
grouping customers by purchasing behavior.
 Association: An association rule learning problem is
where you want to discover rules that describe large
portions of your data, such as people that buy X also tend
to buy Y.
Some popular examples of unsupervised learning
algorithms are:
 k-means for clustering problems.
 Apriori algorithm for association rule learning problems.

List of Common Algorithms


k-means clustering, Association Rules
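
As a rough sketch of clustering, here is a minimal k-means example with scikit-learn; the data points and the choice of two clusters are invented for illustration only.

from sklearn.cluster import KMeans

# Made-up, unlabeled 2-D data points (e.g. customers described by two features)
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Ask k-means to discover 2 groupings in the data (no labels are provided)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # discovered group centers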
3. Semi-supervised Learning
In the previous two types, either labels are present for all the
observations in the dataset or there are no labels at all.
Semi-supervised learning falls in between these two. In many
practical situations, the cost of labeling is quite high, since it
requires skilled human experts to do so. So, when labels are
absent for the majority of the observations but present for a
few, semi-supervised algorithms are the best candidates for
model building.
These methods exploit the idea that even though the group
memberships of the unlabeled data are unknown, this data
carries important information about the group parameters.
Problems where you have a large amount of input data (X)
and only some of the data is labeled (Y) are called semi-
supervised learning problems.

These problems sit in between both supervised and


unsupervised learning.

A good example is a photo archive where only some of the


images are labeled, (e.g. dog, cat, person) and the majority
are unlabeled.

Many real-world machine learning problems fall into this
area. This is because it can be expensive or time-consuming
to label data, as it may require access to domain experts,
whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover


and learn the structure in the input variables.

You can also use supervised learning techniques to make


best guess predictions for the unlabeled data, feed that
data back into the supervised learning algorithm as
training data and use the model to make predictions on
new unseen data.

 Supervised: All data is labeled and the algorithms learn to


predict the output from the input data.
 Unsupervised: All data is unlabeled and the algorithms
learn the inherent structure from the input data.
 Semi-supervised: Some data is labeled but most of it is
unlabeled and a mixture of supervised and unsupervised
techniques can be used.
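
A minimal sketch of one common semi-supervised strategy, self-training (pseudo-labeling), assuming scikit-learn and made-up data; the data points and the 0.8 confidence threshold are invented for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: a few labeled points and some unlabeled ones
X_labeled = np.array([[0.1], [0.2], [0.9], [1.0]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[0.15], [0.3], [0.85], [0.95]])

# Step 1: train a supervised model on the small labeled set
model = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: predict "pseudo-labels" for the unlabeled data,
# keeping only confident predictions (hypothetical 0.8 threshold)
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.8
pseudo_labels = proba.argmax(axis=1)

# Step 3: retrain on labeled + confidently pseudo-labeled data
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels[confident]])
model = LogisticRegression().fit(X_combined, y_combined)

print(model.predict([[0.5]]))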

4. Reinforcement Learning
Reinforcement Learning method aims at using observations
gathered from the interaction with the environment to take
actions that would maximize the reward or minimize the
risk. Reinforcement learning algorithm (called the agent)
continuously learns from the environment in an iterative
fashion. In the process, the agent learns from its
experiences of the environment until it explores the full
range of possible states.
Reinforcement Learning is a type of Machine Learning, and
thereby also a branch of Artificial Intelligence. It allows
machines and software agents to automatically determine
the ideal behavior within a specific context, in order to
maximize their performance. Simple reward feedback is
required for the agent to learn its behavior; this is known
as the reinforcement signal.
There are many different algorithms that tackle this issue.
As a matter of fact, Reinforcement Learning is defined by a
specific type of problem, and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an
agent is supposed to decide the best action to select based
on its current state. When this step is repeated, the problem
is known as a Markov Decision Process.
In order to produce intelligent programs (also called
agents), reinforcement learning goes through the following
steps:
Input state is observed by the agent.
Decision making function is used to make the agent
perform an action.
After the action is performed, the agent receives reward or
reinforcement from the environment.
The state-action pair information about the reward is
stored.
List of Common Algorithms
Q-Learning
Temporal Difference (TD)
Deep Adversarial Networks
Use cases:
Some applications of the reinforcement learning algorithms
are computer played board games (Chess, Go), robotic
hands, and self-driving cars.
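
A minimal sketch of tabular Q-Learning on a made-up three-state chain environment; the states, rewards, and hyperparameters are invented purely for illustration and are not from the text above.

import numpy as np

# Hypothetical toy environment: states 0, 1, 2; reaching state 2 gives reward 1.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))        # the agent's state-action value table
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward = step(state, action)
        # Q-Learning update: move Q(s, a) toward reward + discounted future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # learned values; "move right" should dominate in each state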
Data acquisition

Data acquisition is the process of sampling signals that measure
real-world physical conditions and converting the resulting samples
into digital numeric values that a computer can manipulate.

Data acquisition systems (DAS or DAQ) convert physical conditions


of analog waveforms into digital values for further storage, analysis,
and processing.

In simple words, Data Acquisition is composed of two words: Data
and Acquisition, where data refers to raw facts and figures, which
could be structured or unstructured, and acquisition means
acquiring data for the given task at hand.

Data acquisition meaning is to collect data from relevant sources


before it can be stored, cleaned, preprocessed, and used for further
mechanisms. It is the process of retrieving relevant business
information, transforming the data into the required business form,
and loading it into the designated system.

A data scientist spends 80 percent of the time searching, cleaning,


and processing data. With Machine Learning becoming more widely
used, some applications do not have enough labeled data. Even the
best Machine Learning algorithms cannot function properly without
good data and cleaning of the data. Also, Deep learning techniques
require vast amounts of data, as, unlike Machine Learning, these
techniques automatically generate features. Otherwise, we would
have garbage in and garbage out. Hence, data acquisition or
collection is a very critical aspect.

The data acquisition in machine learning involves:

 Collection and Integration of the data: The data is extracted
from various sources, and since it is usually available in
different places, the multiple sources need to be combined
before use. The acquired data is typically in raw format and not
suitable for immediate consumption and analysis. This calls for
further processes such as:
o Formatting: Prepare or organize the datasets as per the
analysis requirements.
o Labeling: After gathering data, it is often required to label it.
One such instance is a factory application, where one would
want to label images of components as defective or not. In
another case, when constructing a knowledge base by extracting
information from the web, the extracted facts would need to be
labeled as implicitly assumed to be true. At times, the data
must be labeled manually.
This acquired data is what is ingested for the data preprocessing
steps. Let's move on to the data acquisition process.

The Data Acquisition Process


The process of data acquisition involves searching for datasets
that can be used to train Machine Learning models. Having said
that, it is not simple. There are various approaches to acquiring
data, which can be bucketed into three main segments:

1. Data Discovery
2. Data Augmentation
3. Data Generation
Each of these has further sub-processes depending upon its
functionality. We'll dive deeper into each of these.
1. Data Discovery:
The first approach to acquiring data is data discovery. It is a key step
when indexing, sharing, and searching for new datasets available
on the web and in data lakes. It can be broken into two steps:
sharing and searching. First, the data must be labeled or indexed
and published for sharing using one of the many collaborative
systems available for this purpose, after which it can be searched for.

2. Data Augmentation:
The next approach for data acquisition is Data augmentation.
Augment means to make something greater by adding to it, so here
in the context of data acquisition, we are essentially enriching the
existing data by adding more external data. In deep learning and
machine learning, it is common to use pre-trained models and
embeddings to increase the number of features to train on.

3. Data Generation:
As the name suggests, the data is generated. If we do not have
enough data and no external data is available, the option is to
generate the datasets manually or automatically. Crowdsourcing is
the standard technique for manual construction of data, where
people are assigned tasks to collect the required data to form the
generated dataset. There are also automatic techniques available
for generating synthetic datasets. The data generation method can
also be seen as data augmentation when data is available but has
missing values that need to be imputed.
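
As a small sketch of the imputation case mentioned above, here is a hypothetical pandas example; the column names and the mean-fill strategy are invented for illustration only.

import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "income": [48000, 52000, None, 61000, 45000]})

# Impute missing values with each column's mean (one simple strategy)
df_filled = df.fillna(df.mean(numeric_only=True))

print(df_filled)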

Data Acquisition Techniques and Tools


The major tools and techniques for data acquisition are:
1. Data Warehouses and ETL
2. Data Lakes and ELT
3. Cloud Data Warehouse providers

Data Warehouses and ETL


The first option to acquire data is via a data warehouse. Data
warehousing is the process of constructing and using a data
warehouse to offer meaningful business insights.

A data warehouse is a centralized repository, which is constructed


by combining data from various heterogeneous sources. It is
primarily created and used for data reporting and analysis rather
than transaction processing, and it supports structured and ad
hoc queries as part of the decision-making process. The focus of
the data warehouse is on the business processes.

A data warehouse is typically constructed to store structured


records having tabular formats. Employees’ data, sales records,
payrolls, student records, CRM all come under this bucket. In a
data warehouse, usually, we transform the data before loading, and
hence, it falls under the approach of ETL (Extract, Transform and
Load).

As we saw above, data acquisition here involves the extraction of
data, the transformation of data, and the loading of the data. This
is performed by two kinds of ETL (Extract, Transform and Load)
applications:
1. Code-based ETL: These ETL applications are developed using
programming languages such as SQL and PL/SQL (a combination
of SQL and procedural programming features). Examples: Base
SAS, SAS/ACCESS.
2. Graphical User Interface (GUI)-based ETL: These ETL
applications are developed using a graphical user interface with
point-and-click techniques. Examples: DataStage, Data Manager,
Ab Initio, Informatica, ODI (Oracle Data Integrator), Data
Services, SSIS (SQL Server Integration Services).
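
As a rough, code-based ETL sketch (using Python with pandas and SQLite rather than the tools listed above), assuming a hypothetical in-memory CSV source and an invented table name:

import sqlite3
from io import StringIO

import pandas as pd

# Extract: read raw data from a (hypothetical, in-memory) CSV source
raw_csv = StringIO("order_id,amount,region\n1,120.5,East\n2,75.0,West\n3,210.0,East\n")
orders = pd.read_csv(raw_csv)

# Transform: clean and aggregate before loading (the "T" in ETL)
summary = orders.groupby("region", as_index=False)["amount"].sum()

# Load: write the transformed result into a warehouse-like SQLite table
conn = sqlite3.connect("warehouse.db")
summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)
conn.close()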
Data Lakes and ELT
A data lake is a storage repository having the capacity to store large
amounts of data, including structured, semi-structured, and
unstructured data. It can store images, videos, audio, sound
records, PDF files. It helps for faster ingestion of new data.

Unlike data warehouses, data lakes store everything, are more
flexible, and follow the Extract, Load, and Transform (ELT)
approach. The data is loaded first and is not transformed until
transformation is required; it is processed later as per the
requirements.

Data lakes provide an "unrefined view of data" to data scientists.
Open-source tools such as Hadoop and MapReduce are commonly
used to build and process data lakes.

Cloud Data Warehouse providers


A cloud data warehouse is another service that collects, organizes,
and stores data. Unlike the traditional data warehouse, cloud data
warehouses are quicker and cheaper to set up as no physical
hardware needs to be procured.

Additionally, these architectures use massively parallel processing
(MPP), i.e., they employ a large number of computer processors
(200 or more) to perform a set of coordinated computations in
parallel and can therefore run complex analytical queries much
faster.

Some of the prominent cloud data warehouse services are:


 Amazon Redshift
 Snowflake
 Google BigQuery
 IBM Db2 Warehouse
 Microsoft Azure Synapse
 Oracle Autonomous Data Warehouse
 SAP Data Warehouse Cloud
 Yellowbrick Data
 Teradata Integrated Data Warehouse
