R Programming Unit-2
The Data Science Process: Overview of the Data Science Process - Setting the research
goal, Retrieving data, Data preparation, Exploration, Modeling, Data presentation and
automation. Getting data in and out of R, using the readr package, interfaces to the
outside world.
This step is concerned with how the data are generated and
collected. In general, there are two distinct possibilities. The first
is when the data-generation process is under the control of an
expert (modeler): this approach is known as a designed
experiment. The second possibility is when the expert cannot
influence the data-generation process: this is known as the
observational approach. An observational setting, namely, random
data generation, is assumed in most data-mining applications.
Typically, the sampling
distribution is completely unknown after data are collected, or it is
partially and implicitly given in the data-collection procedure. It is
very important, however, to understand how data collection affects
its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for
estimating a model and the data used later for testing and applying
a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully
used in a final application of the results.
Data preprocessing
1. Data Cleaning:
Raw data often contains irrelevant and missing parts. Data cleaning is done to handle
these problems; it involves handling missing data, noisy data, etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided
into segments of equal size and each segment is handled separately: one can replace
all the data in a segment by its mean, or boundary values can be used to complete
the task (a short R sketch follows this list).
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
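As an illustration of the binning method mentioned above, here is a minimal R sketch of
smoothing by bin means; the values and the number of bins are made up for the example:

# hypothetical sorted values to be smoothed
x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34))
# divide the sorted data into equal-sized segments (3 bins of 4 values)
bins <- split(x, rep(1:3, each = 4))
# smoothing by bin means: replace every value in a bin by the bin's mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smoothed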
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0
or 0.0 to 1.0 (see the R sketch after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels.
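A minimal R sketch of normalization and discretization, using made-up income values
(the break points and labels are arbitrary choices for the example):

# hypothetical raw attribute values
income <- c(12000, 35000, 47000, 58000, 99000)
# min-max normalization: rescale the values to the range 0.0 to 1.0
income_norm <- (income - min(income)) / (max(income) - min(income))
# discretization: replace the raw values by interval levels
income_level <- cut(income, breaks = 3, labels = c("low", "medium", "high"))
income_norm
income_level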
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction
are:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove irrelevant
or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional
and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization
(NMF).
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving
the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling (see the R sketch after this list).
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving the
important information. Compression is often used to reduce the size of the
dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip
compression.
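The R sketch below illustrates three of these data reduction ideas (sampling, feature
extraction with PCA, and clustering) on the built-in iris data set; the sample size,
number of components, and number of clusters are arbitrary choices for the example:

data(iris)
num <- iris[, 1:4]                           # keep only the numeric columns
# random sampling: keep a 30-row subset of the data
set.seed(1)
iris_sample <- num[sample(nrow(num), 30), ]
# feature extraction with PCA: keep the first two principal components
pca <- prcomp(num, scale. = TRUE)
reduced <- pca$x[, 1:2]
# clustering: summarize similar rows by 3 k-means centroids
km <- kmeans(num, centers = 3)
km$centers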
In general, data cleaning lowers errors and raises the quality of the data. Although it
can be a time-consuming and laborious operation, fixing data mistakes and removing
incorrect information must be done. Data mining, a method for finding useful
information in data, is itself a crucial aid in cleaning data: data quality mining is a
methodology that uses data mining methods to find and fix data quality issues in large
databases, and data mining automatically extracts intrinsic, hidden information from
large data sets. Data cleansing can therefore be accomplished using a variety of data
mining approaches.
You can follow these fundamental stages to clean your data, even though the
techniques employed may vary depending on the kinds of data your organization
stores:
1. Remove irrelevant or duplicate observations
For instance, if you wish to analyze data on millennial clients but your dataset also
includes observations from earlier generations, you might eliminate those irrelevant
observations. This can improve the efficiency of the analysis, reduce deviation from
your main objective, and produce a dataset that is easier to maintain and use.
2. Fix structural errors
Structural errors are odd naming conventions, typos, or incorrect capitalization that
you notice when you measure or transfer data. These inconsistencies can result in
mislabelled categories or classes. For instance, "N/A" and "Not Applicable" might both
be present on a given sheet, but they ought to be analyzed under the same heading.
3. Filter unwanted outliers
There will frequently be isolated observations that, at first glance, do not seem to fit
the data you are analyzing. If you have a good reason to remove an outlier, such as
incorrect data entry, doing so will improve the performance of the data you are
working with.
4. Handle missing data
Because many algorithms won't tolerate missing values, you can't overlook missing
data. There are a few options for handling it; none of them is ideal, but all can be
considered, for example:
Although you can remove observations with missing values, doing so results in the
loss of information, so proceed with caution.
You can instead fill in missing values based on other observations, but there is again a
chance of undermining the integrity of the data, since you are working from
assumptions rather than actual observations.
Finally, you may need to change the way the data is used so that null values can be
handled efficiently.
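A small R illustration of the first two options, using a made-up data frame with
missing values:

# hypothetical data frame with missing values
df <- data.frame(age = c(23, NA, 31, 40),
                 income = c(50000, 62000, NA, 71000))
# option 1: remove observations with missing values (information is lost)
complete_rows <- na.omit(df)
# option 2: impute missing values with the column mean
# (this rests on an assumption, not on actual observations)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)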
5. Validate and QA
1. Monitoring the errors: Keep track of the areas where errors seem to
occur most frequently. This makes it simpler to identify and correct
inaccurate or corrupt information, which is particularly important when
integrating a new solution with your current management software.
2. Standardize the mining process: Standardize the point of data entry to
help lower the likelihood of duplication.
3. Validate data accuracy: Analyse the data and invest in data cleaning
software. Artificial-intelligence-based tools can be used to thoroughly
check for accuracy.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing
data. Repeatedly processing the same records can be avoided by investing
in separate data-cleansing tools that can analyze imperfect data in bulk and
automate the operation (a small R sketch of duplicate detection follows this list).
5. Research on data: Before this step, our data needs to be vetted,
standardized, and checked for duplicates. There are numerous third-party
sources, and these vetted and approved sources can extract data straight
from our databases. They help us gather the data and clean it up so that it is
reliable, accurate, and comprehensive for use in business decisions.
6. Communicate with the team: Keeping the team informed helps with
client development and strengthening, as well as giving more focused
information to potential clients.
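For the duplicate check in step 4, base R is often enough; a minimal sketch with
made-up customer records:

# hypothetical customer records, one of which is repeated
customers <- data.frame(id = c(1, 2, 2, 3),
                        name = c("Asha", "Ravi", "Ravi", "Meena"))
duplicated(customers)            # flags the repeated row
customers <- unique(customers)   # keep only the distinct rows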
In practice, much of the data cleaning used in data mining is carried out with dedicated
tools. Data cleansing tools can be very helpful if you are not confident about cleaning
the data yourself or do not have time to clean all of your data sets. You might need to
invest in such tools, but it is worth the expenditure. There are many data cleaning
tools on the market; here are some of the top-ranked ones:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
When you have clean data, you can make decisions using the highest-quality
information and eventually boost productivity; these are among the most important
advantages of data cleaning in data mining.
Data science is about extracting knowledge and insights from data. The tools
and techniques of data science are used to drive business and process
decisions.
Data Science Processes:
1. Setting the Research Goal
2. Retrieving Data
3. Data Preparation
4. Data Exploration
5. Data Modeling
6. Presentation and Automation
1. Setting the research goal:
Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a project
charter. This charter contains information such as what you’re going to research,
how the company benefits from that, what data and resources you need, a
timetable, and deliverables.
2. Retrieving data:
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can use
the data in your program, which means checking the existence of, quality, and
access to the data. Data can also be delivered by third-party companies and takes
many forms ranging from Excel spreadsheets to different types of databases.
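For example, data delivered as a CSV file, an Excel spreadsheet, or a database table can
all be pulled into R. In the sketch below the file names, sheet number, and query are
placeholders, and the readxl, DBI, and RSQLite packages are assumed to be installed:

# CSV file (base R)
sales <- read.csv("sales.csv")
# Excel spreadsheet (readxl package)
library(readxl)
budget <- read_excel("budget.xlsx", sheet = 1)
# database table (DBI with an SQLite driver)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "company.db")
orders <- dbGetQuery(con, "SELECT * FROM orders")
dbDisconnect(con)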
3. Data preparation:
Data collection is an error-prone process; in this phase you enhance the quality
of the data and prepare it for use in subsequent steps. This phase consists of three
subphases: data cleansing removes false values from a data source and
inconsistencies across data sources, data integration enriches data sources by
combining information from multiple data sources, and data transformation ensures
that the data is in a suitable format for use in your models.
4. Data exploration:
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. To achieve this, you mainly use descriptive
statistics, visual techniques, and simple modeling. This step often goes by the
abbreviation EDA, for Exploratory Data Analysis.
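A few base R functions already cover much of this step; for instance, on the built-in
iris data set:

data(iris)
summary(iris)                                  # descriptive statistics for every variable
hist(iris$Sepal.Length)                        # distribution of a single variable
boxplot(Sepal.Length ~ Species, data = iris)   # outliers and group differences
plot(iris$Sepal.Length, iris$Petal.Length)     # relationship between two variables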
5. Data modeling (machine learning):
In this step you build models, typically with machine learning: the process of making
machines intelligent, with the power to learn from data, analyze it, and make decisions.
By building precise machine learning models, an organization has a better chance of
identifying profitable opportunities or avoiding unknown risks. You should have good
hands-on knowledge of the various supervised and unsupervised algorithms.
6. Presentation and automation:
Finally, you present your results to the stakeholders and, if the analysis has to be
repeated, automate it so that it can run with little or no manual intervention.
Getting data in and out of R:
When reading data with read.table(), telling R these things about your data directly
(through the arguments described in the hints below) makes R run faster and more
efficiently. The read.csv() function is identical to read.table() except that some of the
defaults (like the sep argument) are set differently.
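For example, the two calls below read the same comma-separated file ("mydata.csv" is
a placeholder name):

dat1 <- read.csv("mydata.csv")
dat2 <- read.table("mydata.csv", header = TRUE, sep = ",")
identical(dat1, dat2)   # TRUE for a simple, clean CSV file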
Read the help page for read.table, which contains many hints
Make a rough calculation of the memory required to store your
dataset (see the next section for an example of how to do this). If
the dataset is larger than the amount of RAM on your computer,
you can probably stop right here.
Set comment.char = "" if there are no commented lines in your file.
Use the colClasses argument. Specifying this option instead of
using the default can make read.table() run MUCH faster, often
twice as fast. In order to use this option, you have to know the
class of each column in your data frame. If all of the columns are
"numeric", for example, then you can just set colClasses = "numeric".
A quick and dirty way to figure out the classes of each column is
the following:
> initial <- read.table("datatable.txt", nrows = 100)            # read only the first 100 rows
> classes <- sapply(initial, class)                               # record the class of each column
> tabAll <- read.table("datatable.txt", colClasses = classes)     # reread the full table with the known classes
Set nrows. This doesn’t make R run faster but it helps with memory
usage. A mild overestimate is okay. You can use the Unix tool wc to
calculate the number of lines in a file.
For example, suppose I have a data frame with 1,500,000 rows and
120 columns, all of which are numeric data. Roughly, how much
memory is required to store this data frame? Well, on most modern
computers double precision floating point numbers are stored
using 64 bits of memory, or 8 bytes. Given that information, you
can do the following calculation:
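1,500,000 rows × 120 columns = 180,000,000 numeric values
180,000,000 values × 8 bytes each = 1,440,000,000 bytes
1,440,000,000 bytes / 2^20 bytes per MB ≈ 1,373.29 MB, or about 1.34 GB

So the raw data alone needs roughly 1.34 GB of RAM, and in practice you should
expect to need somewhat more than that (a common rule of thumb is about twice the
raw size) to read the table in comfortably.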
Reading in a large dataset for which you do not have enough RAM
is one easy way to freeze up your computer (or at least your R
session). This is an unpleasant experience that usually requires
you to kill the R process, in the best-case scenario, or reboot your
computer, in the worst case. So make sure to do a rough calculation
of memory requirements before reading in a large dataset. You'll
thank me later.