Data Analytics-34-41
Data Manipulation
1. Data manipulation is the process of arranging a set of data to make it more
organized and easier to interpret.
3. The steps of effective data manipulation include extracting data, cleaning the
data, constructing a database, filtering information based on your requirements,
and analyzing the data.
Slicing
Slicing is the process of extracting a subset of data from a larger dataset. In
Pandas, we can slice data using the iloc and loc indexers. iloc slices data by
the integer position of the rows and columns (zero-based, with the end of a range
excluded), while loc slices data by the labels of the rows and columns (with the
end label included).
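The same distinction between position-based and label-based slicing exists in R, the language used in the rest of these notes. The following is a minimal sketch using base R indexing; the data frame `sales` and its columns are made up for illustration and are not part of the original notes.

```r
# Illustrative data frame with row labels (hypothetical data)
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  units  = c(120, 95, 143, 80),
  price  = c(9.5, 10.0, 8.75, 11.2),
  row.names = c("r1", "r2", "r3", "r4")
)

# Position-based slice: rows 1 to 2, columns 1 to 2.
# Similar in spirit to iloc, but R positions start at 1
# and the end of the range is included.
by_position <- sales[1:2, 1:2]

# Label-based slice: rows and columns picked by name.
# Similar in spirit to loc.
by_label <- sales[c("r1", "r2"), c("region", "units")]

print(by_position)
print(by_label)
```

Both slices pick the same two rows and two columns here; they differ only in how the selection is expressed.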
Time Series Data: Subscripts are used to represent different time periods within
a time series. For example, in a dataset representing monthly sales, Sₜ might
represent sales in month t.
Data subset
In data analytics, a "data subset" refers to a portion or segment of a larger
dataset. Creating subsets is a fundamental step in data analysis and is used for
various purposes:
Filtering and Sampling: Subsetting allows you to keep only the rows or
observations that meet certain criteria. For example, you might want to analyze
only the data related to a specific region, time period, or customer segment.
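As a rough illustration of filtering and sampling, the sketch below builds subsets of a small data frame with base R. The `orders` data and its column names are hypothetical and chosen only to mirror the "specific region" example above.

```r
# Hypothetical order data (not from the original notes)
orders <- data.frame(
  order_id = 1:6,
  region   = c("East", "West", "East", "North", "West", "East"),
  amount   = c(250, 90, 40, 310, 120, 75)
)

# Filtering: keep only the rows for one region
east_orders <- orders[orders$region == "East", ]

# Filtering on two criteria with subset()
large_east <- subset(orders, region == "East" & amount > 50)

# Sampling: draw 3 rows at random (seed fixed so the draw is repeatable)
set.seed(1)
sampled <- orders[sample(nrow(orders), 3), ]

print(east_orders)
print(large_east)
print(sampled)
```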
Dplyr Package
The dplyr package in the R programming language is a grammar of data
manipulation: it provides a consistent set of verbs that help solve the most
common data manipulation problems.
dplyr makes these tasks quicker and easier in three ways:
● By limiting your options, it lets you focus on the data manipulation
problem itself.
● It provides simple “verbs”, functions that correspond to the most common
data manipulation tasks, so thoughts can be translated into code
faster.
● It uses efficient backends, so less time is spent waiting for the
computer.
Select Function
select() is a function from the dplyr R package that selects data frame
variables by name or by index. It can also rename variables while selecting and
drop variables by name. Typical uses include selecting specific variables by
name, by position, or from a list of names.
Syntax
# Syntax of select()
select(x, variables_to_select)
Program
# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  gender = c('M','M','F','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  state = c('CA','NY','DE',NA),
  row.names=c('r1','r2','r3','r4')
)
# Print the data frame
df
Output
   id    name gender        dob state
r1 10     sai      M 1990-10-02    CA
r2 11     ram      M 1981-03-24    NY
r3 12 deepika      F 1987-06-14    DE
r4 13 sahithi      F 1985-08-16  <NA>
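The uses of select() described in this section can be sketched as follows. This is an illustrative example rather than the original program: the data frame repeats the one above without the dob column, and the new name `student_id` is made up. It assumes dplyr 1.0.0 or later, which provides all_of().

```r
library(dplyr)

# Same data as above, without the dob column, so the block is self-contained
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  state = c('CA', 'NY', 'DE', NA),
  row.names = c('r1', 'r2', 'r3', 'r4')
)

# Select variables by name
by_name <- select(df, id, name)

# Select variables by position (2nd and 3rd columns)
by_position <- select(df, 2, 3)

# Select variables from a character vector of names
from_list <- select(df, all_of(c("id", "state")))

# Rename while selecting (student_id is a hypothetical new name)
renamed <- select(df, student_id = id, name)

# Drop a variable by name
dropped <- select(df, -gender)

print(names(by_name))
print(names(by_position))
print(names(from_list))
print(names(renamed))
print(names(dropped))
```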
Filter Function
The filter() function from the dplyr package is used to subset the rows of a data
frame in R. Note that filter() does not remove the rows that match; it retains
all rows that satisfy the specified condition and drops the rest.
Syntax
# Syntax of filter()
filter(x, condition,...)
Program
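library(dplyr)
# sample data
df <- data.frame(x=c(12,31,4,66,78),
                 y=c(22.1,44.5,6.1,43.1,99),
                 z=c(TRUE,TRUE,FALSE,TRUE,TRUE))
# condition (the original call is missing here; x < 50 & z == TRUE is one
# condition that is consistent with the output shown)
filter(df, x < 50 & z == TRUE)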
Output
x y z
1 12 22.1 TRUE
2 31 44.5 TRUE
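One detail worth seeing in code is how filter() treats missing values: rows for which the condition evaluates to NA are dropped, whereas base R logical indexing returns an all-NA row for them. The sketch below uses a small made-up data frame to show the difference; it is an illustration, not part of the original program.

```r
library(dplyr)

# Hypothetical data with a missing value in x
df <- data.frame(
  x = c(12, 31, NA, 66),
  z = c(TRUE, TRUE, TRUE, FALSE)
)

# Conditions separated by commas are combined with AND.
# The row where x is NA is dropped because the condition is NA there.
kept <- filter(df, x < 50, z)

# Base R indexing with the same condition keeps an all-NA row instead
base_kept <- df[df$x < 50 & df$z, ]

print(kept)
print(base_kept)
```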
Mutate Function
The mutate() function in R adds new variables to the specified data frame. The
new variables are computed by performing operations on existing variables.
mutate() comes from the dplyr package, so dplyr must be installed and loaded
before it is used. It is a simple way to derive new columns, including on
large datasets.
Syntax
mutate(x, expr)
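A minimal sketch of mutate() is given below. The data frame reuses the x and y values from the filter() example, and the new column names `total` and `share_x` are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Sample data (x and y values reused from the filter() example)
df <- data.frame(
  x = c(12, 31, 4, 66, 78),
  y = c(22.1, 44.5, 6.1, 43.1, 99)
)

# Add two new variables computed from existing ones.
# A column created earlier in the same call (total) can be used
# by a later expression (share_x).
df2 <- mutate(df, total = x + y, share_x = x / total)

print(df2)
```

mutate() returns a new data frame; `df` itself is left unchanged unless the result is assigned back to it.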
Program
Output
Arrange Function
The arrange() function from the dplyr package orders (sorts) the rows of a data
frame in ascending or descending order based on column values.
Syntax
# Syntax of arrange()
arrange(.data, ..., .by_group = FALSE)
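The sketch below illustrates ascending and descending sorts with arrange(). It uses the id, name, and price values from the program that follows; the choice of sort columns is an assumption made for illustration.

```r
library(dplyr)

# id, name and price values taken from the program in this section
df <- data.frame(
  id = c(11, 22, 33, 44, 55),
  name = c("spark", "python", "R", "jsp", "java"),
  price = c(144, NA, 321, 567, 567)
)

# Ascending order by price (the default); NA values are placed last
asc <- arrange(df, price)

# Descending order by price using desc(); NA values are still placed last
desc_price <- arrange(df, desc(price))

# Sort by price descending, then break ties by name ascending
tie_break <- arrange(df, desc(price), name)

print(asc)
print(desc_price)
print(tie_break)
```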
Program
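df <- data.frame(id=c(11,22,33,44,55),
                 name=c("spark","python","R","jsp","java"),
                 price=c(144,NA,321,567,567),
                 publish_date= as.Date(
                   c("2007-06-22", "2004-02-13","2006-05-18","2010-09-02","2007-07-20"))
)
library(dplyr)
# The original sorting call is missing here; sorting by price in ascending
# order is an assumption made so that df2 is defined
df2 <- arrange(df, price)
df2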
Output