Data Structure & Data Mining
In R, a matrix is created with the matrix() function, whose general form is:
X <- matrix(data, nrow, ncol, byrow)
Example:
M <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
M
Example:
import numpy as np

# Create a 3 x 2 matrix (np.array is preferred over the deprecated np.mat)
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])
print(matrix)
Example:
df1 <- c(1:4)
df2 <- c("Sam","Rob","Max","John")
df3 <- c("Google","Apple","Microsoft","Amazon")
df.data <- data.frame(df1,df2,df3)
print(df.data)
How to Create a Data Frame in Python:
Example:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'],
        'Price': [22000, 25000, 27000]}
df = pd.DataFrame(cars)
print(df)
Example:
# Create a vector
x <- c("East","West","East","North","North","East","West")
# Apply the factor function
factor_data <- factor(x)
print(x)
print(is.factor(x))
print(factor_data)
To create a dictionary in Python, just open a curly bracket, define each key, and enter its values:
{1: [1, 2, 3, 4], 'Name': 'Bill'}
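As a minimal runnable sketch of the same idea (the variable name d is our own choice, purely for illustration), keys may be of mixed types and values are retrieved by key:
# Create a dictionary with an integer key and a string key
d = {1: [1, 2, 3, 4], 'Name': 'Bill'}

# Look up values by key
print(d[1])        # [1, 2, 3, 4]
print(d['Name'])   # Bill

# Add a new key-value pair
d['Age'] = 30
print(d)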
Data mining is a complex process that involves intensive data
warehousing as well as powerful computational technologies.
Furthermore, data mining is not only limited to the extraction of data but is
also used for transformation, cleaning, data integration, and pattern analysis.
Another term for Data Mining is Knowledge Discovery.
There are various important parameters in Data Mining, such as association
rules, classification, clustering, and forecasting. Some of the key features of
Data Mining are:
Predicting patterns based on trends in the data.
Calculating predictions for outcomes.
Creating information in response to the analysis.
Focusing on larger databases.
Clustering visual data.
4.1 RapidMiner
RapidMiner is one of the most popular tools for data mining. It is written in
Java but requires no coding to operate. It provides various data mining
functionalities like data preprocessing, data representation, filtering,
clustering, etc.
4.2 Weka
Weka is an open-source data mining software developed at the University of
Waikato. Like RapidMiner, it has a no-code, simple-to-use GUI.
Using Weka, you can either call the machine learning algorithms directly or
import them with your Java code. It provides a variety of tools like
visualization, pre-processing, classification, clustering, etc.
4.3 KNIME
KNIME is a robust data mining suite that is primarily used for data
preprocessing, that is, ETL: Extraction, Transformation & Loading.
Furthermore, it integrates various components of Machine Learning and Data
Mining to provide an inclusive platform for all suitable operations.
4.4 Apache Mahout
Apache Mahout is an extension of the Hadoop Big Data Platform. The
Apache community developed Mahout to address the growing need for data
mining and analytical operations in Hadoop. As a result, it contains various
machine learning functionalities like classification, regression, clustering, etc.
4.5 Oracle Data Mining
Oracle Data Mining is an excellent tool for classifying, analyzing, and
predicting data. It allows its users to perform data mining on SQL
databases to extract views and schemas.
4.6 Teradata
For data mining, warehousing is a necessary requirement. Teradata, also
known as the Teradata Database, provides warehouse services that include
data mining tools.
It can store data based on usage, that is, it stores less-frequently used
data in its 'slow' section and gives fast access to frequently used data.
4.7 Orange
Orange software is most famous for integrating machine learning and data
mining tools. It is written in Python and offers interactive and aesthetic
visualizations to its users.
1. Supervised Learning
The majority of practical machine learning uses supervised
learning. In supervised learning, you have input variables (X) and
an output variable (Y), and an algorithm learns the mapping
function from the input to the output:
Y = f(X)
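A minimal sketch of this mapping in Python, assuming scikit-learn is installed; the toy data and the choice of LinearRegression are ours, purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled examples: inputs X with known outputs Y (roughly Y = 2x + 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression()
model.fit(X, Y)                # learn the mapping f from the examples

print(model.predict([[6.0]]))  # estimate f(6); close to 13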
2. Unsupervised Learning
Unsupervised learning is where you only have input data
(X) and no corresponding output variables.
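A matching minimal sketch for unsupervised learning, again assuming scikit-learn; the points and the choice of k-means clustering are ours, purely for illustration:
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled inputs X: two loose groups, but no output variable
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment found for each point
print(kmeans.cluster_centers_)  # centers of the discovered clusters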
3. Reinforcement Learning
The Reinforcement Learning method aims at using observations
gathered from interaction with the environment to take
actions that would maximize the reward or minimize the
risk. The reinforcement learning algorithm (called the agent)
continuously learns from the environment in an iterative
fashion. In the process, the agent learns from its
experiences of the environment until it explores the full
range of possible states.
Reinforcement Learning is a type of Machine Learning, and
thereby also a branch of Artificial Intelligence. It allows
machines and software agents to automatically determine
the ideal behavior within a specific context, in order to
maximize their performance. Simple reward feedback is
required for the agent to learn its behavior; this is known
as the reinforcement signal.
There are many different algorithms that tackle this issue.
As a matter of fact, Reinforcement Learning is defined by a
specific type of problem, and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an
agent is supposed to decide the best action to select based on
its current state. When this step is repeated, the problem
is known as a Markov Decision Process.
In order to produce intelligent programs (also called
agents), reinforcement learning goes through the following
steps:
The input state is observed by the agent.
A decision-making function is used to make the agent
perform an action.
After the action is performed, the agent receives a reward or
reinforcement from the environment.
Information about the reward for the state-action pair is
stored.
List of Common Algorithms
Q-Learning
Temporal Difference (TD)
Deep Adversarial Networks
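The loop described above maps directly onto tabular Q-Learning. A minimal sketch in Python; the chain environment, the reward of 1, and the hyperparameters are invented purely for illustration:
import random

# Minimal tabular Q-Learning on an invented 5-state chain:
# states 0..4, actions 0 (left) and 1 (right); reaching state 4 gives reward 1.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Environment: move left or right along the chain."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Observe the state and use an epsilon-greedy decision function
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        # Perform the action and receive a reward from the environment
        next_state, reward = step(state, action)
        # Store what was learned about this state-action pair
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # Q-values now favor moving right toward the rewarding state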
Use cases:
Some applications of reinforcement learning algorithms
are computer-played board games (Chess, Go), robotic
hands, and self-driving cars.
Data acquisition
1. Data Discovery
2. Data Augmentation
3. Data Generation
Each of these has further sub-processes depending upon its
functionality. Together, these three approaches map out the research
landscape of data collection for machine learning. We'll dive deep
into each of them below.
1. Data Discovery:
The first approach to acquiring data is data discovery. It is a key step
when indexing, sharing, and searching for new datasets available
on the web and in corporate data lakes. It can be broken into two
steps: sharing and searching. First, the data must be labeled or
indexed and published for sharing, using one of the many
collaborative systems available for this purpose.
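As one hedged illustration of the searching side, scikit-learn's fetch_openml helper can pull a dataset that someone else indexed and published on the OpenML repository ('iris' is just an example dataset name; scikit-learn and network access are assumed):
from sklearn.datasets import fetch_openml

# Fetch a dataset that was indexed and shared on OpenML
dataset = fetch_openml(name="iris", version=1, as_frame=True)
print(dataset.frame.head())  # first rows of the discovered dataset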
2. Data Augmentation:
The next approach to data acquisition is data augmentation.
To augment means to make something greater by adding to it; here,
in the context of data acquisition, we are essentially enriching the
existing data by adding more external data. In deep learning and
machine learning, it is common to use pre-trained models and
embeddings to increase the number of features to train on.
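A minimal sketch of one simple augmentation strategy in Python: adding small Gaussian noise to numeric examples to enlarge a training set (the data values and noise scale are invented purely for illustration):
import numpy as np

# A tiny numeric training set (invented for illustration)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

rng = np.random.default_rng(seed=0)

# Augment: create jittered copies by adding small Gaussian noise,
# then stack them with the originals to enlarge the training set.
noise = rng.normal(loc=0.0, scale=0.05, size=X.shape)
X_augmented = np.vstack([X, X + noise])

print(X_augmented.shape)  # (6, 2): twice as many examples as before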
3. Data Generation:
As the name suggests, here the data is generated. If we do not have
enough data and no external data is available, the option is to
generate the datasets manually or automatically. Crowdsourcing is
the standard technique for manual construction of the data, where
people are assigned tasks to collect the required data to form the
generated dataset. Automatic techniques are also available for
generating synthetic datasets. Data generation can also be seen as
data augmentation when data is available but has missing values
that need to be imputed.
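A minimal sketch of both ideas in Python, assuming scikit-learn: generating a synthetic labeled dataset automatically, then imputing missing values in existing data (all parameter values are invented purely for illustration):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer

# Automatic generation: build a synthetic labeled dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
print(X.shape, y.shape)           # (100, 5) (100,)

# Imputation as generation: fill in missing values in existing data
X_missing = X.copy()
X_missing[0, 0] = np.nan          # simulate a missing value
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)
print(np.isnan(X_filled).any())   # False: the missing value was imputed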