Data Science Short Notes

2021

a) Linear regression helps in understanding the linear relationship between the dependent and the independent variables.

b) Logistic regression is a classification algorithm that can be used when the dependent variable is binary.

c) The logistic regression algorithm actually produces an S-shaped curve.

d) The confusion matrix is a table that is used to estimate the performance of a model.

e) R is a programming language and software environment for statistical analysis.

f) A data frame is a two-dimensional rectangular data set.

g) A loop statement allows us to execute a statement or group of statements multiple times.

h) A function is a set of statements organized together to perform a specific task.

2022
a) Linear regression helps in understanding the linear relationship between the dependent and independent variables.

b) Logistic regression is a classification algorithm that can be used when the dependent variable is binary.

c) The logistic regression algorithm actually produces an S-shaped curve.

d) The confusion matrix is a table that is used to estimate the performance of a model.

e) R is a programming language and software environment for statistical analysis.

f) A data frame is a two-dimensional rectangular data set.

g) A loop statement allows us to execute a statement or group of statements multiple times.

h) A function is a set of statements organized together to perform a specific task.

2023
a) GitHub is an online software development platform.

b) A class of systems responsible for managing changes to computer programs and other documents is referred to as version control systems.

c) R programming uses the # symbol to write comments.

d) The process of correcting errors is referred to as debugging.

e) A database is an organized collection of related data.

f) The process of converting data from one form to another is known as data transformation.

g) There are two types of hypothesis: the null hypothesis and the alternative hypothesis.

h) Visualization is an effective way to communicate.

2021

2)
a) What is Markdown?
Markdown is a lightweight markup language with plain text formatting syntax. Its design allows people to write using an easy-to-read, easy-to-write plain text format, which can be converted into structurally valid HTML. It's commonly used for documentation, readme files, and creating rich text without complex formatting.
b) What is Git?
Git is a distributed version control system that helps track changes to files, typically source code, and allows multiple developers to work on the same project simultaneously. It enables features like branching, merging, and versioning, making it easier to collaborate on software development projects.

c) What is GitHub?
GitHub is a platform that hosts Git repositories. It provides tools for version control and collaboration, allowing multiple developers to contribute to projects, track issues, and review code. GitHub is widely used for open-source projects and private repositories.

d) What is Keyword?
A keyword is a reserved word in programming that has a predefined meaning and cannot be used for any other purpose, like naming variables or functions. For example, in R, if, else, for, and while are keywords that define control structures.

e) What is Operator?
An operator is a symbol or keyword used to perform operations on variables and values. In programming, operators include arithmetic operators (like +, -, *, /), logical operators (like &&, ||), and relational operators (like ==, !=, <, >).

f) What is While Loop?


A while loop is a control flow statement in programming that repeatedly executes a block of code as long as a specified condition is true. For example, in R:

while (condition) {
  # code to be executed
}

The loop continues to run as long as the condition evaluates to TRUE.

g) What is RStudio?
RStudio is an integrated development environment (IDE) for the R programming language. It provides tools to help users write, debug, and visualize R code. RStudio includes features like a console, syntax highlighting, code completion, and integrated plotting tools.

h) What is Bug?
A bug is an error or flaw in a software program that causes it to behave incorrectly or produce incorrect results. Bugs can occur due to mistakes in the code or unexpected interactions between parts of the program.

i) What is Debugging?
Debugging is the process of identifying, analyzing, and fixing bugs or errors in software code. It involves using debugging tools, adding print statements, or using breakpoints to inspect the program's behavior and correct issues in the code.
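
A minimal sketch of how this looks in practice in R, using only base functions (the divide function and its inputs are illustrative, not from the notes):

divide <- function(a, b) {
  if (b == 0) stop("division by zero")   # signal an error for a bad input
  a / b
}

# tryCatch() lets us inspect and handle the error instead of stopping the script
result <- tryCatch(
  divide(10, 0),
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA                                   # fall back to a safe value
  }
)
print(result)        # [1] NA

# Interactive helpers (run in an R session):
# traceback()        # show the call stack after an uncaught error
# debug(divide)      # step through the function line by line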

j) Define R Programming?
R programming is a programming language and environment designed for statistical computing, data analysis, and graphical visualization. It is widely used in fields like data science, bioinformatics, and academic research for tasks such as data manipulation, statistical modeling, and producing high-quality plots and graphics.
3)

a) What is Raw Data Type in R Programming?

The raw data type in R is used to represent raw bytes of data. It is particularly useful when dealing with binary data, like files in a non-text format (e.g., images, audio files, or encrypted data). The raw data type stores each byte as an individual element in a vector, and the elements are represented by integers between 0 and 255, which correspond to byte values.

For example:

raw_data <- charToRaw("Hello") # Convert the string "Hello" to raw data

print(raw_data)

This will return:

[1] 48 65 6c 6c 6f

Each value is the hexadecimal ASCII code of one of the characters 'H', 'e', 'l', 'l', 'o'.

b) Define Logical Data Type

The logical data type in R is used to represent Boolean values: TRUE or FALSE. This data type is commonly used for conditional operations, comparisons, and controlling the flow of execution in control structures (like if statements, loops, etc.).

For example:

is_true <- TRUE

is_false <- FALSE

print(is_true)

print(is_false)

This will output:

[1] TRUE

[1] FALSE

Logical values in R are used for making comparisons between variables, such as checking if one
number is greater than another:

result <- 5 > 3 # result is TRUE


c) What is an if statement? Write its syntax.

An if statement is a conditional control flow statement used to execute a block of code only if a specified condition evaluates to TRUE. If the condition is FALSE, the block of code is skipped.

Syntax:

if (condition) {
  # Code to be executed if condition is TRUE
}

Optionally, you can include an else statement to execute code if the condition is FALSE:

if (condition) {
  # Code to be executed if condition is TRUE
} else {
  # Code to be executed if condition is FALSE
}

Example:

x <- 5

if (x > 3) {
  print("x is greater than 3")
} else {
  print("x is not greater than 3")
}
d) What is code profiling?

Code profiling is the process of analyzing a program to identify performance bottlenecks or inefficient code. Profiling helps to measure how much time is spent in various parts of the code, which can guide optimization efforts. Profiling tools provide metrics like execution time, memory usage, and CPU usage to help developers improve the performance of their code.
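
A brief profiling sketch using base R's system.time() and Rprof() (the slow_sum function below is illustrative):

slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)   # deliberately loop-heavy
  total
}

print(system.time(slow_sum(1e6)))            # quick timing of a single call

Rprof("profile.out")                         # start the sampling profiler
invisible(slow_sum(1e6))
Rprof(NULL)                                  # stop profiling
print(summaryRprof("profile.out")$by.self)   # time spent per function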

e) What is Data Cleaning?

Data cleaning refers to the process of identifying and rectifying errors or inconsistencies in the dataset. It involves tasks like handling missing data, correcting incorrect entries, removing duplicates, and ensuring the data is in a consistent format. Data cleaning is essential for preparing data for analysis or modeling.

Examples of data cleaning tasks:

• Imputing missing values

• Removing outliers

• Correcting data entry errors

• Converting data to the correct format

f) What is Tidy Data?

Tidy data is a concept from data science where data is structured in a consistent and organized manner. In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

For example, a tidy dataset might look like this:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

Each column represents a variable, and each row represents an observation.
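
As a small sketch, the same tidy table can be built as an R data frame (values taken from the example above):

tidy_df <- data.frame(
  Name   = c("Alice", "Bob", "Carol"),
  Age    = c(23, 30, 22),
  Gender = c("Female", "Male", "Female")
)
print(tidy_df)   # each column is a variable; each row is one observation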

g) What is EDA?

EDA (Exploratory Data Analysis) is the process of analyzing and summarizing the main characteristics of a dataset, often with the help of visual methods. EDA helps to uncover patterns, detect anomalies, check assumptions, and test hypotheses. It usually involves:

• Visualizations (e.g., histograms, box plots)

• Summary statistics (e.g., mean, median, variance)

• Data transformations
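
A minimal EDA sketch in base R, using the built-in mtcars dataset (the dataset choice is an assumption; any data frame works the same way):

data(mtcars)
str(mtcars)          # variable names and types
summary(mtcars$mpg)  # min, quartiles, mean, max

hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by number of cylinders")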

h) What is Data Mining?

Data mining refers to the process of discovering patterns, trends, and useful information from large datasets using statistical, mathematical, and computational techniques. It combines methods from machine learning, statistics, and database systems to identify meaningful insights from raw data.

Common tasks in data mining include:

• Classification

• Clustering

• Association rule mining

• Regression analysis
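
As an illustration of one of these tasks, a small clustering sketch using base R's kmeans() on the built-in iris measurements (choosing 3 clusters is an assumption based on the known number of species):

data(iris)
features <- scale(iris[, 1:4])        # numeric columns, standardized

set.seed(42)                          # reproducible cluster assignment
fit <- kmeans(features, centers = 3)

table(Cluster = fit$cluster, Species = iris$Species)   # compare clusters to labels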

i) What is Data Engineering?


Data engineering is the practice of designing, building, and maintaining the infrastructure and tools needed to collect, store, process, and analyze data. Data engineers work with large datasets and ensure that the data pipeline is efficient, reliable, and scalable. They typically work with technologies like databases, big data systems (e.g., Hadoop), and cloud platforms.

Key tasks in data engineering:

• Data extraction, transformation, and loading (ETL)

• Designing databases and data architectures

• Building and managing data pipelines

j) What is Data Visualization?

Data visualization is the graphical representation of data to help people understand and interpret the information more easily. It involves creating charts, graphs, maps, and other visual tools to make patterns, trends, and insights in the data more accessible.

Common types of data visualizations include:

• Bar charts

• Line graphs

• Scatter plots

• Heatmaps

• Pie charts

Data visualization is often used in reporting, dashboards, and business intelligence to help stakeholders make informed decisions based on data.
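
A brief sketch of a few of these chart types in base R, using the built-in mtcars dataset (the dataset choice is an assumption):

data(mtcars)

barplot(table(mtcars$cyl), main = "Cars by cylinder count", xlab = "Cylinders")   # bar chart

plot(mtcars$wt, mtcars$mpg, main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")                       # scatter plot

pie(table(mtcars$gear), main = "Cars by gear count")                              # pie chart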

2022
2)

(a) What is Markdown?

Markdown is a lightweight markup language used to format plain text. It is designed to be easy to write and read, and it can be converted into HTML. Markdown is commonly used for documentation, readme files, and writing content for web platforms like GitHub. It allows users to format text (such as making text bold or italic), create lists, and add links and images using simple syntax.

(b) What is Git?

Git is a distributed version control system that tracks changes to files, particularly source code, allowing multiple users to collaborate on a project. It enables versioning, branching, and merging, which helps in managing code changes and preventing conflicts. Git allows users to keep a history of all changes, making it easier to roll back to previous versions if needed.

(c) What is GitHub?

GitHub is a platform that hosts Git repositories, providing version control and collaboration features. It allows developers to store, share, and manage their code online. GitHub also includes tools for tracking issues, reviewing code, and managing team collaborations, making it a central hub for open-source and private software projects.

(d) What is a Keyword?

A keyword in programming is a reserved word that has a predefined meaning and cannot be used for any other purpose, like naming variables or functions. Keywords are part of the programming language syntax and are essential for defining the structure of the code. For example, in R, if, else, for, and while are keywords used to control the flow of execution.

(e) Define Operator.

An operator in programming is a symbol or keyword used to perform operations on variables and values. Operators include:

• Arithmetic operators (e.g., +, -, *, /)

• Relational operators (e.g., ==, !=, <, >)

• Logical operators (e.g., &&, ||)

• Assignment operators (e.g., =, <- in R)

Operators are essential for manipulating data and controlling the flow of a program.

(f) What is a While Loop?

A while loop is a control structure in programming that repeatedly executes a block of code as long as a given condition evaluates to true. The loop checks the condition before each iteration, and if the condition is true, it continues executing the code block.

Syntax:

while (condition) {
  # Code to be executed while the condition is TRUE
}

Example in R:

x <- 1
while (x <= 5) {
  print(x)
  x <- x + 1
}
(g) Write something about RStudio?

RStudio is an integrated development environment (IDE) for the R programming language. It provides an interface for writing, debugging, and visualizing R code. RStudio includes features like a console for running code, a script editor with syntax highlighting, and tools for managing packages, plots, and data. It is widely used by data analysts, statisticians, and data scientists for working with R.

(h) What is a Bug?

A bug is an error, flaw, or unintended behavior in a program that causes it to function incorrectly or produce incorrect results. Bugs can occur due to mistakes in coding, logic errors, or unexpected interactions between parts of a program.

(i) State Debugging.

Debugging is the process of identifying, isolating, and fixing bugs or errors in a program. It involves running the program with various inputs, using debugging tools (such as breakpoints, print statements, or debuggers), and inspecting the code to locate and correct problems. Debugging is an essential skill for software development, as it ensures that the code works as intended.

(j) Define R Programming.

R programming is a language and environment primarily used for statistical computing, data analysis, and graphical visualization. It is widely used by statisticians, data scientists, and researchers to analyze data, build statistical models, and produce high-quality plots and reports. R provides a rich ecosystem of packages and functions for data manipulation, machine learning, and data visualization.

3)

(a) What is raw data type in R Programming?

The raw data type in R represents raw bytes of data. It is used to store binary data or data in byte
format (e.g., images, files, or encrypted data). Each element of a raw vector in R is an individual byte,
which can take integer values between 0 and 255.

Example:
raw_data <- charToRaw("Hello") # Converts a string into raw bytes

print(raw_data)

Output:

[1] 48 65 6c 6c 6f

Here, the ASCII values of the characters in "Hello" are stored as raw bytes.

(b) Define logical data type.

The logical data type in R represents Boolean values: TRUE or FALSE. Logical values are used to evaluate conditions and control the flow of the program (e.g., in if statements, loops). Logical values are essential in comparisons, filtering, and conditional operations.

Example:

x <- TRUE

y <- FALSE

print(x)

print(y)

(c) What is a statement? Write its syntax.

A statement in programming is a single instruction or operation that performs a specific task. In R, a statement could be an expression or a function call that is executed. For example, a simple assignment statement is:

x <- 5 # Assigns the value 5 to variable x

Another example of a statement is calling a function:

print("Hello, World!") # Prints a message to the console

(d) What is code profiling?

Code profiling is the process of analyzing a program's performance, typically to identify performance bottlenecks. Profiling measures aspects like execution time, memory usage, and CPU utilization, which helps developers optimize their code. Profiling tools give insights into which functions or parts of the code consume the most resources.

(e) What is data learning?

"Data learning" is not a standard term in data science; the question most likely refers to either data cleaning or machine learning.

• Data cleaning is the process of preparing raw data for analysis by correcting errors, handling missing values, and ensuring consistency.

• Machine learning refers to the use of algorithms and statistical models that allow a computer system to improve its performance on a task through experience (data).

(f) What is tidy data?

Tidy data refers to a structured way of organizing data, where:

• Each variable is in a separate column.

• Each observation forms a row.

• Each type of observational unit forms a table.

In tidy data, the structure allows for easy analysis and visualization. For example:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

Each column represents a variable, and each row represents an observation.

(g) What is EDA?

EDA (Exploratory Data Analysis) is the process of analyzing a dataset to summarize its main characteristics, often using visual methods. EDA is important for understanding the data, spotting anomalies, identifying patterns, and forming hypotheses. It typically involves:

• Visualizations (e.g., histograms, scatter plots)

• Summary statistics (e.g., mean, median)

• Data cleaning and transformation

(h) What is data mining?

Data mining is the process of discovering patterns, correlations, trends, or useful information from large datasets using algorithms and statistical methods. Data mining involves tasks such as:

• Classification: Assigning labels to data.

• Clustering: Grouping similar data points.

• Association: Finding relationships between variables.

It is commonly used in applications like customer segmentation, fraud detection, and recommendation systems.

(i) What is data engineering?

Data engineering refers to the design, creation, and management of systems and infrastructure that allow the collection, storage, processing, and analysis of data. Data engineers build and maintain data pipelines, work with databases, and ensure that the data infrastructure is scalable, reliable, and efficient. They work closely with data scientists and analysts to ensure that data is accessible and usable.

(j) What do you mean by data visualization?

Data visualization is the process of creating graphical representations of data to help people understand complex data more easily. It uses charts, graphs, and maps to present patterns, trends, and insights clearly. Common types of data visualizations include:

• Bar charts

• Line graphs

• Pie charts

• Heatmaps

• Scatter plots

Data visualization is crucial for communicating insights from data effectively to a wide audience, often helping in decision-making processes.

2023
2)

a) Name the data types supported by R programming.

R programming supports several data types, including:

1. Numeric: For representing both integers and real numbers (e.g., 5, 3.14).

2. Character: For representing text or strings (e.g., "hello", "data").

3. Logical: For representing Boolean values (TRUE or FALSE).

4. Integer: For whole numbers (e.g., 1L, 45L).

5. Complex: For complex numbers (e.g., 2 + 3i).

6. Raw: For storing raw bytes of data (e.g., binary data).

7. List: For storing ordered collections of objects, which can be of different types (e.g., list(1, "hello", TRUE)).

8. Data Frame: A two-dimensional table-like structure, often used for datasets.

9. Matrix: A two-dimensional array-like object where all elements are of the same data type.

10. Factor: Used for categorical data, storing data as levels (e.g., "low", "medium", "high").
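
Quick examples of creating several of these types in R (all values are illustrative):

num   <- 3.14                     # numeric
txt   <- "hello"                  # character
flag  <- TRUE                     # logical
whole <- 45L                      # integer
comp  <- 2 + 3i                   # complex
bytes <- charToRaw("A")           # raw
lst   <- list(1, "hello", TRUE)   # list (mixed types)
df    <- data.frame(x = 1:3, y = c("a", "b", "c"))                       # data frame
mat   <- matrix(1:6, nrow = 2)    # matrix
lvl   <- factor(c("low", "high"), levels = c("low", "medium", "high"))   # factor

sapply(list(num, txt, flag, whole, comp, bytes), class)   # inspect the classes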

b) Define data transformation.

Data transformation is the process of converting data from its original format into a format that is more suitable for analysis or processing. This may involve operations such as:

• Normalizing or scaling values

• Aggregating data (e.g., summing, averaging)

• Changing the structure of the data (e.g., converting a wide format to a long format)

• Handling missing values

• Encoding categorical variables

Data transformation is an essential step in data cleaning and preparation before analysis.
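
A small transformation sketch in base R (the sales data frame is illustrative):

sales <- data.frame(
  region = c("North", "North", "South", "South"),
  month  = c("Jan", "Feb", "Jan", "Feb"),
  amount = c(120, 150, 90, 110)
)

sales$amount_scaled <- as.numeric(scale(sales$amount))        # scaling (mean 0, sd 1)
totals <- aggregate(amount ~ region, data = sales, FUN = sum) # aggregating per region
sales$region <- factor(sales$region)                          # encoding a categorical variable
print(totals)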

c) What is the significance of indentation in R programming?

Indentation in R programming is important for improving the readability and organization of code. While R does not require specific indentation for the code to run (unlike Python), consistent indentation helps developers understand the structure of the code, especially when working with nested loops, conditional statements, or functions. Proper indentation also helps with debugging and maintenance by clearly indicating the scope of loops, conditions, or function bodies.

d) Define function.

A function in R is a block of reusable code that performs a specific task. Functions take input (arguments), perform an operation, and return an output. Functions are essential for code modularity and reusability.

Syntax:

function_name <- function(arg1, arg2) {
  # Code to perform the task
  result <- arg1 + arg2 # Example operation
  return(result) # Return the result
}

Example:

add_numbers <- function(a, b) {
  return(a + b)
}

print(add_numbers(3, 5)) # Output: 8

e) Name two operators with their use in R programming.

1. Arithmetic Operators: These are used for mathematical calculations.

   o Example: +, -, *, /, ^

   o Usage:

     x <- 5
     y <- 2
     z <- x + y # Adds x and y
     print(z) # Output: 7

2. Logical Operators: These are used to perform logical operations, often in control flow (e.g., if statements).

   o Example: & (AND), | (OR), ! (NOT)

   o Usage:

     a <- TRUE
     b <- FALSE
     result <- a & b # AND operation
     print(result) # Output: FALSE

f) What do you mean by high dimensionality data?

High dimensionality data refers to datasets that have a large number of features (variables) compared to the number of observations (data points). In such datasets, each data point is represented by many attributes (dimensions). High-dimensional data is common in fields like genomics, text analysis, and image processing.

Challenges with high-dimensional data:

• Curse of dimensionality: As the number of features increases, the space becomes sparse, making it harder to analyze.

• Overfitting: More features can lead to models that fit noise in the data rather than the true patterns.

g) What is the use of API?

An API (Application Programming Interface) is a set of protocols, routines, and tools that allow different software applications to communicate with each other. APIs allow developers to access the functionality or data of another application, service, or platform without needing to understand its internal workings.

Common uses of APIs:

• Accessing data from remote servers (e.g., weather, stock prices).

• Integrating with third-party services (e.g., payment gateways, social media platforms).

• Automating tasks and workflows.
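
A hedged sketch of calling a web API from R with the jsonlite package (assumes jsonlite is installed; the URL below is a placeholder, not a real endpoint):

library(jsonlite)

url <- "https://api.example.com/v1/weather?city=Delhi"   # hypothetical endpoint
# fromJSON() downloads the response and parses the JSON into R objects:
# result <- fromJSON(url)
# str(result)   # inspect the parsed list or data frame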

h) Define knowledge.

Knowledge refers to the understanding, awareness, or familiarity gained through experience, learning, or study. In the context of data science or machine learning:

• Knowledge can be inferred from data, often after analyzing it through various techniques (e.g., statistical analysis, pattern recognition).

• In machine learning, knowledge is typically stored in models that have learned patterns from training data.

i) What is the significance of EDA?

Exploratory Data Analysis (EDA) is the process of analyzing data sets to summarize their main characteristics, often using visual methods. The significance of EDA includes:

• Understanding data distribution: Helps to understand the central tendencies, spread, and shape of the data.

• Identifying outliers: Detects anomalies that may affect the analysis.

• Uncovering relationships: Visualizes correlations and dependencies between variables.

• Data cleaning: Identifies issues like missing values, incorrect entries, and duplicates.

EDA helps in forming hypotheses and making informed decisions about the next steps in analysis or modeling.

j) Define tidy data.

Tidy data is a way of organizing data to make it easy to analyze. In tidy data:

• Each variable is in a separate column.

• Each observation is in a separate row.

• Each type of observational unit forms a table.

Tidy data follows these principles, making it easier to manipulate, analyze, and visualize the data.

Example of tidy data:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

3)

(a) Differentiate between Raw data and Final data.

Raw Data:

• Definition: Raw data refers to data that is collected directly from a source without any processing or cleaning. It is often unorganized, messy, and may contain errors, inconsistencies, or irrelevant information.

• Characteristics:

  o Unstructured and unrefined.

  o May contain missing values, duplicates, or outliers.

  o Needs processing, cleaning, and transformation before it can be analyzed.

  o Example: A survey dataset with incomplete responses or raw logs from a web server.

Final Data:

• Definition: Final data refers to the cleaned, processed, and transformed version of the raw data, ready for analysis or reporting. It has been refined to ensure accuracy, consistency, and completeness.

• Characteristics:

  o Structured, clean, and organized.

  o Missing values are handled, outliers are addressed, and data is transformed for specific analytical purposes.

  o Ready for statistical analysis, machine learning, or visualization.

  o Example: A dataset where errors have been corrected, and missing values have been imputed.

In summary, raw data is the unprocessed, raw form, while final data is the result of cleaning and transformation that makes it suitable for analysis.

(b) Write down three benefits of visualization.

1. Simplifies Complex Data: Visualization helps in representing large datasets in a simple and understandable format, making complex relationships, trends, and patterns more accessible. For example, a scatter plot can reveal correlations between two variables, which might be difficult to comprehend from a table of numbers.

2. Identifies Trends and Patterns: Through various visualization techniques (e.g., line graphs, bar charts, heatmaps), trends and patterns that might be hidden in raw data become more apparent. This can help in making informed decisions or formulating hypotheses.

3. Improves Data Interpretation: Visualizations are often more intuitive than raw data. They make it easier to communicate findings, helping stakeholders, analysts, and decision-makers quickly grasp the key points of a dataset without needing to dive into the numbers. This is particularly useful in presentations and reports.

(c) What is Exploratory Data Analysis (EDA) in data analysis?

Exploratory Data Analysis (EDA) is an approach used to analyze and summarize the main characteristics of a dataset, often with the help of graphical and statistical methods. EDA is used to understand the underlying structure of the data, detect outliers, identify patterns, test hypotheses, and check assumptions. It helps in making decisions on the next steps in data processing, feature selection, or statistical modeling.

Key objectives of EDA:

• To explore data through visualization techniques (e.g., histograms, box plots, scatter plots).

• To check for data quality issues such as missing values, duplicates, or inconsistencies.

• To generate hypotheses about the data that can be tested later.

• To guide the preparation of data for more advanced modeling or analysis.

(d) What is a data warehouse?

A data warehouse is a centralized repository designed to store large volumes of structured data from multiple sources for analysis and reporting purposes. It is typically used in business intelligence (BI) to support decision-making processes by consolidating historical data, enabling users to query and analyze data efficiently.

Key features of a data warehouse:

• Data integration: It collects data from different operational systems (e.g., databases, external sources).

• Historical storage: It stores historical data that can be used for trend analysis and reporting.

• Optimization for queries: Data is structured in a way that makes querying fast and efficient, often using techniques like indexing and partitioning.

(e) Define simulation.

Simulation is the process of creating a model or a virtual representation of a real-world system or process to study its behavior under different conditions. Simulations are commonly used in areas such as engineering, finance, and science to predict outcomes, assess risks, or optimize performance.

In data analysis, simulations might involve generating synthetic data based on certain assumptions or running repeated trials to understand variability and uncertainty. For example, Monte Carlo simulations use random sampling to model complex systems and estimate probabilities.
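
A minimal Monte Carlo sketch in base R, estimating the probability that the sum of two dice exceeds 9 (the scenario is illustrative):

set.seed(123)                       # reproducible random draws
n_trials <- 100000

die1 <- sample(1:6, n_trials, replace = TRUE)
die2 <- sample(1:6, n_trials, replace = TRUE)

estimate <- mean(die1 + die2 > 9)   # proportion of simulated trials
print(estimate)                     # close to the exact value 6/36 ≈ 0.167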

(f) Write down the control structures used in R programming.

R programming includes several control structures that dictate the flow of execution in a program. The key control structures in R are:

1. Conditional Statements:

   o if statement: Executes a block of code if the condition is true.

   o else statement: Executes a block of code if the if condition is false.

   o else if statement: Checks additional conditions if the initial if condition is false.

   Example:

   if (x > 10) {
     print("x is greater than 10")
   } else {
     print("x is less than or equal to 10")
   }

2. Loops:

   o for loop: Iterates over a sequence (e.g., vector, list).

   o while loop: Repeats a block of code as long as a condition is true.

   o repeat loop: Repeats a block of code indefinitely until a condition breaks it.

   Example:

   for (i in 1:5) {
     print(i)
   }

3. Break and Next:

   o break: Exits the loop immediately.

   o next: Skips the current iteration of the loop and moves to the next one.
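
A short sketch combining repeat, break, and next (the values are illustrative):

i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers
  print(i)                # prints 1, 3, 5, 7, 9
  if (i >= 9) break       # stop once i reaches 9
}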

(g) Name the common multivariate statistical techniques.

Some common multivariate statistical techniques include:

1. Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a new set of variables (principal components) while retaining as much variance as possible.

2. Factor Analysis: Identifies underlying factors that explain the correlations between observed variables.

3. Cluster Analysis (Clustering): Groups similar observations into clusters based on multivariate data.

4. Multivariate Analysis of Variance (MANOVA): Extends ANOVA to analyze the differences in multiple dependent variables across different groups.

5. Canonical Correlation Analysis (CCA): Measures the relationship between two sets of variables.

6. Discriminant Analysis: Classifies observations into predefined groups based on multiple predictor variables.
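
A brief PCA sketch using base R's prcomp() on the built-in iris measurements (the dataset choice is an assumption):

data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # standardize, then rotate

summary(pca)           # proportion of variance explained by each component
head(pca$x[, 1:2])     # scores on the first two principal components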

(h) Write down the steps used for data cleaning.

The common steps for data cleaning include:

1. Remove or Handle Missing Data:

o Check for missing values.

o Decide whether to remove, impute, or fill missing values based on the context.

2. Remove Duplicates:

o Identify and remove duplicate records that can distort analysis.

3. Correct Data Types:

o Ensure that columns have the correct data types (e.g., numeric, date).

4. Handle Outliers:

o Identify and deal with outliers by capping, removing, or transforming them.

5. Standardize Data:

o Ensure consistent formatting (e.g., dates in the same format, consistent units).

6. Normalize/Scale Data:

o Standardize or scale numerical values to bring them to the same range or scale for better analysis.

7. Remove Irrelevant Data:


o Drop columns or features that are irrelevant to the analysis.

8. Validate Data:

o Check for logical inconsistencies, such as negative values where they don't make sense or contradictory entries.
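
A compact sketch of a few of these steps in base R (the data frame is illustrative):

raw <- data.frame(
  id   = c(1, 2, 2, 3, 4),
  age  = c(25, NA, NA, 130, 31),            # NA = missing, 130 = implausible value
  city = c("Pune", "pune", "pune", "Delhi", "Delhi")
)

clean <- raw[!duplicated(raw$id), ]                            # remove duplicate records
clean$age[which(clean$age > 100)] <- NA                        # treat implausible ages as missing
clean$age[is.na(clean$age)] <- mean(clean$age, na.rm = TRUE)   # impute missing values
clean$city <- tolower(clean$city)                              # standardize formatting
print(clean)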

(i) What are the different data models used in case of databases?

In databases, several types of data models are used to structure and organize data. Some common
data models are:

1. Relational Model:

o Organizes data into tables (relations), with rows as records and columns as attributes. It uses SQL for querying.

2. Hierarchical Model:

o Organizes data in a tree-like structure where each child record has only one parent, creating a hierarchy.

3. Network Model:

o Similar to the hierarchical model but allows each child to have multiple parents, forming a graph structure.

4. Object-Oriented Model:

o Organizes data as objects, similar to objects in programming languages, combining both data and the methods to manipulate them.

5. Document Model:

o Used in NoSQL databases, where data is stored as documents (e.g., JSON or XML)
and can vary in structure.

6. Key-Value Model:

o Used in NoSQL databases, stores data as key-value pairs, providing high flexibility
and scalability.

(j) What are the different sources to get the data?

There are several sources from which data can be obtained:

1. Internal Data:

o Data generated within the organization, such as sales records, customer databases, and operational logs.

2. External Data:
o Data obtained from outside sources, such as government datasets, industry reports,
and public APIs (e.g., weather data, social media data).

3. Web Scraping:

o Extracting data from websites using web scraping tools or libraries.

4. Surveys and Questionnaires:

o Data collected directly from individuals via surveys, interviews, or questionnaires.

5. Social Media:

o Data collected from platforms like Twitter, Facebook, or Instagram using APIs or scraping tools.

6. IoT Devices:

o Data generated by sensors and smart devices, such as temperature readings, heart
rate, or GPS data.

7. Open Data Repositories:

o Publicly available datasets shared by organizations or institutions for research and analysis (e.g., Kaggle, UCI Machine Learning Repository).

8. Business Partners:

o Data shared by business partners, such as joint venture partners, suppliers, or collaborators.
