Data Analytics-34-41

Uploaded by

Bhuvaneshwari M

UNIT 3

Data Manipulation
1. Data manipulation is the process of arranging a set of data to make it more
organized and easier to interpret.

2. Data manipulation is used in various industries including accounting, finance,
computer programming, banking, sales, marketing and real estate.

3. The steps of effective data manipulation include extracting data, cleaning the
data, constructing a database, filtering information based on your requirements
and analyzing the data.

Slicing
Slicing is the process of extracting a subset of data from a larger dataset. In
Pandas, we can slice data using the iloc and loc indexers. iloc slices data based
on the integer position of the rows and columns (the end position is excluded),
while loc slices data based on the labels of the rows and columns (the end label
is included).

Subscripts and indices

In data analytics, subscripts and indices are used to represent different elements
within datasets or matrices. They play a crucial role in various mathematical and
statistical operations. Here's how they are commonly used:

Matrix Notation: Subscripts denote specific elements within a matrix. For
example, in a 2x2 matrix A, the element in the first row and second column is
denoted as A₁₂.

Time Series Data: Subscripts represent different time periods within a time
series. For example, in a dataset representing monthly sales, Sₜ might
represent sales in month t.

Multi-dimensional Data: For datasets with multiple dimensions or attributes,
subscripts identify the observation and the variable. For instance, in a dataset
with variables like age, income, and education level, Aᵢⱼ might represent the
element at the i-th row (observation) and j-th column (variable).

Statistical Notation: Indices distinguish parameters, groups, or categories in
statistical analyses. For instance, in a regression model, β₀ represents the
intercept, β₁ represents the coefficient for the first predictor variable, and so on.

Summation and Aggregation: Indices are crucial in summation operations, such
as the sigma notation (∑). They indicate which elements are being summed or
aggregated in a dataset.

Array and Dataframe Access: In programming languages used for data
analytics, like Python or R, indices are used to access specific elements within
arrays, lists, or dataframes. For example, in Python, you might access the first
element of a list as list[0]; in R, where indexing starts at 1, the first element
of a vector x is x[1].
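
The notations above map closely onto R's indexing operators. The following is a minimal sketch with made-up values (the objects A and S are hypothetical, chosen only to mirror the symbols used above). Note that R counts positions from 1.

```r
# Hypothetical 2x2 matrix A, filled row by row
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
a12 <- A[1, 2]   # element in row 1, column 2 (A_12)

# Hypothetical monthly sales series S_t for t = 1..4
S <- c(100, 120, 90, 150)
s3 <- S[3]       # sales in month t = 3

# Summation over an index: the sum of S_t for t = 1..4
total <- sum(S)

# First element of a vector (R indices start at 1, not 0)
first <- S[1]
```

Here A[1, 2] plays the role of A₁₂ and S[3] the role of Sₜ with t = 3.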

Data subset
In data analytics, a "data subset" refers to a portion or segment of a larger
dataset. Creating subsets is a fundamental step in data analysis and is used for
various purposes:

Focus on Specific Variables: You might create a subset to focus on a specific
set of variables of interest, ignoring irrelevant or redundant ones.

Filtering and Sampling: Subsetting allows you to keep only the rows or
observations that meet certain criteria, or to draw a sample of them. For
example, you might want to analyze only the data related to a specific region,
time period, or customer segment.
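
As an illustration of these uses, here is a minimal base R sketch. The sales data frame and its column names are hypothetical and serve only to show the subsetting idioms.

```r
# Hypothetical sales data; column names are illustrative only
sales <- data.frame(
  region   = c("North", "South", "South", "East"),
  customer = c("A", "B", "C", "D"),
  amount   = c(250, 400, 150, 300)
)

# Focus on specific variables: keep only two columns
by_vars <- sales[, c("region", "amount")]

# Filtering: keep only the rows for one region
south <- sales[sales$region == "South", ]

# Rows and columns in one step with base R's subset()
south_amounts <- subset(sales, region == "South", select = c(region, amount))

# Sampling: draw 2 rows at random (seed set so the draw is repeatable)
set.seed(1)
sampled <- sales[sample(nrow(sales), 2), ]
```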

Dplyr Package
The dplyr package in the R programming language is a grammar of data
manipulation that provides a consistent set of verbs, helping to solve the most
common data manipulation problems.

dplyr makes data manipulation quicker and easier in the following ways:

● By limiting the available choices, it lets you focus on the data
manipulation problem itself.
● It provides simple “verbs”, functions that cover every common data
manipulation task, so that thoughts can be translated into code
faster.
● It uses efficient backends, so less time is spent waiting for the
computer.
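
To give a feel for how the verbs combine, here is a small sketch that chains four of them with the pipe operator %>%. The scores data frame, its column names, and the threshold of 130 are made up for illustration.

```r
library(dplyr)

# Hypothetical exam scores; names and values are illustrative only
scores <- data.frame(
  student = c("Asha", "Ben", "Chitra", "Dev"),
  math    = c(75, 58, 93, 66),
  eng     = c(44, 89, 89, 70)
)

result <- scores %>%
  mutate(total = math + eng) %>%   # add a derived column
  filter(total >= 130) %>%         # keep rows that meet a condition
  arrange(desc(total)) %>%         # sort from highest to lowest total
  select(student, total)           # keep only the columns of interest
```

Each step reads as one verb applied to the result of the previous step, which is what makes the code follow the order of the analyst's thinking.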

Select Function

select() is a function from the dplyr R package that selects data frame
variables by name or by index. It can also rename variables while selecting
them and drop variables by name. This section gives the syntax of select()
and a sample data frame on which variables can be selected by name, by
position, or from a list of names.

Syntax

Following is the syntax of the select() function of the dplyr package in R. It
returns an object of the same class as x (the input object).

# Syntax of select()

select(x, variables_to_select)

Program

# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  gender = c('M','M','F','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  state = c('CA','NY','DE',NA),
  row.names=c('r1','r2','r3','r4')
)

# Print the data frame
df

Output

   id    name gender        dob state
r1 10     sai      M 1990-10-02    CA
r2 11     ram      M 1981-03-24    NY
r3 12 deepika      F 1987-06-14    DE
r4 13 sahithi      F 1985-08-16  <NA>
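
The program above only builds and prints the data frame. As a sketch of the select() usages described in this section, the following rebuilds the same data frame and applies select() in several ways. The result names (by_name, by_index, and so on) and the new column name student_id are chosen here for illustration.

```r
library(dplyr)

# Same data frame as in the program above
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  dob = as.Date(c('1990-10-02', '1981-3-24', '1987-6-14', '1985-8-16')),
  state = c('CA', 'NY', 'DE', NA),
  row.names = c('r1', 'r2', 'r3', 'r4')
)

by_name  <- select(df, id, name)                   # select variables by name
by_index <- select(df, 2, 3)                       # select variables by position
from_vec <- select(df, all_of(c("id", "state")))   # select from a vector of names
renamed  <- select(df, student_id = id, name)      # rename while selecting
dropped  <- select(df, -dob, -state)               # drop variables by name
```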

Filter Function

The filter() function from the dplyr package is used to filter the rows of a
data frame in R. Note that filter() does not remove the rows that match the
condition; it retains all rows that satisfy the specified condition and drops
the rows for which the condition is FALSE or NA.

Syntax

# Syntax of filter()

filter(x, condition,...)

Program

library(dplyr)

# sample data
df <- data.frame(x = c(12,31,4,66,78),
                 y = c(22.1,44.5,6.1,43.1,99),
                 z = c(TRUE,TRUE,FALSE,TRUE,TRUE))

# condition: keep rows where x is below 50 and z is TRUE
filter(df, x<50 & z==TRUE)

Output

   x    y    z
1 12 22.1 TRUE
2 31 44.5 TRUE
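
Two details of filter() are easy to miss: conditions separated by commas are combined with AND, and rows where the condition evaluates to NA are dropped. The sketch below uses a small made-up data frame to show both.

```r
library(dplyr)

# Hypothetical data with a missing value in x
df <- data.frame(x = c(12, NA, 78), z = c(TRUE, TRUE, FALSE))

# NA < 50 evaluates to NA, so the second row is not retained
kept <- filter(df, x < 50)

# Comma-separated conditions behave like conditions joined with &
both_comma <- filter(df, x < 50, z)
both_amp   <- filter(df, x < 50 & z)
```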

Mutate Function

We can use the mutate() function in R to add new variables to a data frame.
The new variables are computed by performing operations on existing
variables.

Before using mutate(), you need to install the dplyr package and load it with
library(dplyr). mutate() is simple to use and is also suitable for
manipulating big datasets.

Syntax

mutate(x, expr)

Program

library(dplyr) # load the library

# Creating data frame
df <- data.frame(
  studentname = c("Student1", "Student2", "Student3", "Student4"),
  Math = c(75, 58, 93, 66),
  Eng = c(44, 89, 89, NA)
)

# Calculate the total marks (totalMarks):
# sum of marks in Maths (Math) & English (Eng)
mutate(df, totalMarks = Math + Eng)

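Output

  studentname Math Eng totalMarks
1    Student1   75  44        119
2    Student2   58  89        147
3    Student3   93  89        182
4    Student4   66  NA         NA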

Arrange Function

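The arrange() function from the dplyr package is used to order (sort) the rows
of a data frame in either ascending or descending order based on the values of
one or more columns. The default order is ascending; wrap a column in desc()
to sort it in descending order.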

Syntax

# Syntax of arrange()

arrange(.data, ..., .by_group = FALSE)

Program

# Create Data Frame
df <- data.frame(id = c(11,22,33,44,55),
                 name = c("spark","python","R","jsp","java"),
                 price = c(144,NA,321,567,567),
                 publish_date = as.Date(
                   c("2007-06-22", "2004-02-13","2006-05-18","2010-09-02","2007-07-20")))

# Load dplyr library
library(dplyr)

# Using arrange in ascending order of price
df2 <- df %>% arrange(price)

df2

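Output

  id   name price publish_date
1 11  spark   144   2007-06-22
2 33      R   321   2006-05-18
3 44    jsp   567   2010-09-02
4 55   java   567   2007-07-20
5 22 python    NA   2004-02-13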
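
The program above sorts in ascending order. As a sketch of the descending case, the following rebuilds the same data frame and uses desc(), first on price alone and then with publish_date as a tie-breaker. The result names are chosen here for illustration.

```r
library(dplyr)

# Same data frame as in the program above
df <- data.frame(
  id = c(11, 22, 33, 44, 55),
  name = c("spark", "python", "R", "jsp", "java"),
  price = c(144, NA, 321, 567, 567),
  publish_date = as.Date(
    c("2007-06-22", "2004-02-13", "2006-05-18", "2010-09-02", "2007-07-20"))
)

# Descending order of price; the row with a missing price is still placed last
by_price_desc <- df %>% arrange(desc(price))

# Descending price, with ties broken by publish_date in ascending order
by_price_then_date <- df %>% arrange(desc(price), publish_date)
```

With arrange(), rows with missing values in the sort column go to the end in both ascending and descending order.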
