Data Science Short Notes

2021

a) Linear regression helps in understanding the linear relationship between the dependent and the independent variables.

b) Logistic regression is a classification algorithm that can be used when the dependent variable is binary.

c) The logistic regression algorithm actually produces an S-shaped curve.

d) The confusion matrix is a table that is used to estimate the performance of a model.

e) R is a programming language and software environment for statistical analysis.

f) A data frame is a two-dimensional rectangular data set.

g) A loop statement allows us to execute a statement or group of statements multiple times.

h) A function is a set of statements organized together to perform a specific task.

2022
a) Linear regression helps in understanding the linear relationship between the dependent and independent variables.

b) Logistic regression is a classification algorithm that can be used when the dependent variable is binary.

c) The logistic regression algorithm actually produces an S-shaped curve.

d) The confusion matrix is a table that is used to estimate the performance of a model.

e) R is a programming language and software environment for statistical analysis.

f) A data frame is a two-dimensional rectangular data set.

g) A loop statement allows us to execute a statement or group of statements multiple times.

h) A function is a set of statements organized together to perform a specific task.

2023
a) GitHub is an online software development platform.

b) A class of systems responsible for managing changes to computer programs and other documents is referred to as version control systems.

c) R programming uses the # symbol to write comments.

d) The process of correcting errors is referred to as debugging.

e) A database is an organized collection of related data.

f) The process of converting data from one form to another is known as data transformation.

g) There are two types of hypothesis: the null hypothesis and the alternative hypothesis.

h) Visualization is an effective way to communicate.

2021

2)
a) What is Markdown?
Markdown is a lightweight markup language with plain text formatting syntax. Its design allows people to write using an easy-to-read, easy-to-write plain text format, which can be converted into structurally valid HTML. It's commonly used for documentation, readme files, and creating rich text without complex formatting.
b) What is Git?
Git is a distributed version control system that helps track changes to files, typically source code, and allows multiple developers to work on the same project simultaneously. It enables features like branching, merging, and versioning, making it easier to collaborate on software development projects.

c) What is GitHub?
GitHub is a platform that hosts Git repositories. It provides tools for version control and collaboration, allowing multiple developers to contribute to projects, track issues, and review code. GitHub is widely used for open-source projects and private repositories.

d) What is Keyword?
A keyword is a reserved word in programming that has a predefined meaning and cannot be used for any other purpose, like naming variables or functions. For example, in R, if, else, for, and while are keywords that define control structures.

e) What is Operator?
An operator is a symbol or keyword used to perform operations on variables and values. In programming, operators include arithmetic operators (like +, -, *, /), logical operators (like &&, ||), and relational operators (like ==, !=, <, >).

f) What is While Loop?


A while loop is a control flow statement in programming that repeatedly executes a block of code as long as a specified condition is true. For example, in R:

while (condition) {
  # code to be executed
}

The loop continues to run as long as the condition evaluates to TRUE.

g) What is RStudio?
RStudio is an integrated development environment (IDE) for the R programming language. It provides tools to help users write, debug, and visualize R code. RStudio includes features like a console, syntax highlighting, code completion, and integrated plotting tools.

h) What is Bug?
A bug is an error or flaw in a software program that causes it to behave incorrectly or produce incorrect results. Bugs can occur due to mistakes in the code or unexpected interactions between parts of the program.

i) What is Debugging?
Debugging is the process of identifying, analyzing, and fixing bugs or errors in software code. It involves using debugging tools, adding print statements, or using breakpoints to inspect the program's behavior and correct issues in the code.
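
A minimal sketch of how this looks in practice in R, using only base functions (the divide function and its inputs are illustrative, not from the notes):

divide <- function(a, b) {
  if (b == 0) stop("division by zero")   # signal an error for a bad input
  a / b
}

# tryCatch() lets us inspect and handle the error instead of stopping the script
result <- tryCatch(
  divide(10, 0),
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA                                   # fall back to a safe value
  }
)
print(result)        # [1] NA

# Interactive helpers (run in an R session):
# traceback()        # show the call stack after an uncaught error
# debug(divide)      # step through the function line by line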

j) Define R Programming?
R programming is a programming language and environment designed for statistical computing, data analysis, and graphical visualization. It is widely used in fields like data science, bioinformatics, and academic research for tasks such as data manipulation, statistical modeling, and producing high-quality plots and graphics.
3)

a) What is Raw Data Type in R Programming?

The raw data type in R is used to represent raw bytes of data. It is particularly useful when dealing with binary data, like files in a non-text format (e.g., images, audio files, or encrypted data). The raw data type stores each byte as an individual element in a vector, and the elements are represented by integers between 0 and 255, which correspond to byte values.

For example:

raw_data <- charToRaw("Hello") # Convert the string "Hello" to raw data

print(raw_data)

This will return:

[1] 48 65 6c 6c 6f

Each value is the hexadecimal ASCII code of one of the characters 'H', 'e', 'l', 'l', 'o'.

b) Define Logical Data Type

The logical data type in R is used to represent Boolean values: TRUE or FALSE. This data type is commonly used for conditional operations, comparisons, and controlling the flow of execution in control structures (like if statements, loops, etc.).

For example:

is_true <- TRUE

is_false <- FALSE

print(is_true)

print(is_false)

This will output:

[1] TRUE

[1] FALSE

Logical values in R are used for making comparisons between variables, such as checking if one
number is greater than another:

result <- 5 > 3 # result is TRUE


c) What is an if statement? Write its syntax.

An if statement is a conditional control flow statement used to execute a block of code only if a specified condition evaluates to TRUE. If the condition is FALSE, the block of code is skipped.

Syntax:

if (condition) {
  # Code to be executed if condition is TRUE
}

Optionally, you can include an else statement to execute code if the condition is FALSE:

if (condition) {
  # Code to be executed if condition is TRUE
} else {
  # Code to be executed if condition is FALSE
}

Example:

x <- 5

if (x > 3) {
  print("x is greater than 3")
} else {
  print("x is not greater than 3")
}
d) What is code profiling?

Code profiling is the process of analyzing a program to identify performance bottlenecks or inefficient code. Profiling helps to measure how much time is spent in various parts of the code, which can guide optimization efforts. Profiling tools provide metrics like execution time, memory usage, and CPU usage to help developers improve the performance of their code.
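
A brief profiling sketch using base R's system.time() and Rprof() (the slow_sum function below is illustrative):

slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)   # deliberately loop-heavy
  total
}

print(system.time(slow_sum(1e6)))            # quick timing of a single call

Rprof("profile.out")                         # start the sampling profiler
invisible(slow_sum(1e6))
Rprof(NULL)                                  # stop profiling
print(summaryRprof("profile.out")$by.self)   # time spent per function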

e) What is Data Cleaning?

Data cleaning refers to the process of identifying and rectifying errors or inconsistencies in the dataset. It involves tasks like handling missing data, correcting incorrect entries, removing duplicates, and ensuring the data is in a consistent format. Data cleaning is essential for preparing data for analysis or modeling.

Examples of data cleaning tasks:

• Imputing missing values

• Removing outliers

• Correcting data entry errors

• Converting data to the correct format

f) What is Tidy Data?

Tidy data is a concept from data science where data is structured in a consistent and organized manner. In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

For example, a tidy dataset might look like this:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

Each column represents a variable, and each row represents an observation.
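
As a small sketch, the same tidy table can be built as an R data frame (values taken from the example above):

tidy_df <- data.frame(
  Name   = c("Alice", "Bob", "Carol"),
  Age    = c(23, 30, 22),
  Gender = c("Female", "Male", "Female")
)
print(tidy_df)   # each column is a variable; each row is one observation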

g) What is EDA?

EDA (Exploratory Data Analysis) is the process of analyzing and summarizing the main characteristics of a dataset, often with the help of visual methods. EDA helps to uncover patterns, detect anomalies, check assumptions, and test hypotheses. It usually involves:

• Visualizations (e.g., histograms, box plots)

• Summary statistics (e.g., mean, median, variance)

• Data transformations
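
A minimal EDA sketch in base R, using the built-in mtcars dataset (the dataset choice is an assumption; any data frame works the same way):

data(mtcars)
str(mtcars)          # variable names and types
summary(mtcars$mpg)  # min, quartiles, mean, max

hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by number of cylinders")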

h) What is Data Mining?

Data mining refers to the process of discovering patterns, trends, and useful information from large datasets using statistical, mathematical, and computational techniques. It combines methods from machine learning, statistics, and database systems to identify meaningful insights from raw data.

Common tasks in data mining include:

• Classification

• Clustering

• Association rule mining

• Regression analysis
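
As an illustration of one of these tasks, a small clustering sketch using base R's kmeans() on the built-in iris measurements (choosing 3 clusters is an assumption based on the known number of species):

data(iris)
features <- scale(iris[, 1:4])        # numeric columns, standardized

set.seed(42)                          # reproducible cluster assignment
fit <- kmeans(features, centers = 3)

table(Cluster = fit$cluster, Species = iris$Species)   # compare clusters to labels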

i) What is Data Engineering?


Data engineering is the practice of designing, building, and maintaining the infrastructure and tools needed to collect, store, process, and analyze data. Data engineers work with large datasets and ensure that the data pipeline is efficient, reliable, and scalable. They typically work with technologies like databases, big data systems (e.g., Hadoop), and cloud platforms.

Key tasks in data engineering:

• Data extraction, transformation, and loading (ETL)

• Designing databases and data architectures

• Building and managing data pipelines

j) What is Data Visualization?

Data visualization is the graphical representation of data to help people understand and interpret the information more easily. It involves creating charts, graphs, maps, and other visual tools to make patterns, trends, and insights in the data more accessible.

Common types of data visualizations include:

• Bar charts

• Line graphs

• Scatter plots

• Heatmaps

• Pie charts

Data visualization is often used in reporting, dashboards, and business intelligence to help stakeholders make informed decisions based on data.
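
A brief sketch of a few of these chart types in base R, using the built-in mtcars dataset (the dataset choice is an assumption):

data(mtcars)

barplot(table(mtcars$cyl), main = "Cars by cylinder count", xlab = "Cylinders")   # bar chart

plot(mtcars$wt, mtcars$mpg, main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")                       # scatter plot

pie(table(mtcars$gear), main = "Cars by gear count")                              # pie chart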

2022
2)

(a) What is Markdown?

Markdown is a lightweight markup language used to format plain text. It is designed to be easy to write and read, and it can be converted into HTML. Markdown is commonly used for documentation, readme files, and writing content for web platforms like GitHub. It allows users to format text (such as making text bold or italic), create lists, and add links and images using simple syntax.

(b) What is Git?

Git is a distributed version control system that tracks changes to files, particularly source code, allowing multiple users to collaborate on a project. It enables versioning, branching, and merging, which helps in managing code changes and preventing conflicts. Git allows users to keep a history of all changes, making it easier to roll back to previous versions if needed.

(c) What is GitHub?

GitHub is a platform that hosts Git repositories, providing version control and collaboration features. It allows developers to store, share, and manage their code online. GitHub also includes tools for tracking issues, reviewing code, and managing team collaborations, making it a central hub for open-source and private software projects.

(d) What is a Keyword?

A keyword in programming is a reserved word that has a predefined meaning and cannot be used for any other purpose, like naming variables or functions. Keywords are part of the programming language syntax and are essential for defining the structure of the code. For example, in R, if, else, for, and while are keywords used to control the flow of execution.

(e) Define Operator.

An operator in programming is a symbol or keyword used to perform operations on variables and values. Operators include:

• Arithmetic operators (e.g., +, -, *, /)

• Relational operators (e.g., ==, !=, <, >)

• Logical operators (e.g., &&, ||)

• Assignment operators (e.g., =, <- in R)

Operators are essential for manipulating data and controlling the flow of a program.

(f) What is a While Loop?

A while loop is a control structure in programming that repeatedly executes a block of code as long as a given condition evaluates to true. The loop checks the condition before each iteration, and if the condition is true, it continues executing the code block.

Syntax:

while (condition) {
  # Code to be executed while the condition is TRUE
}

Example in R:

x <- 1
while (x <= 5) {
  print(x)
  x <- x + 1
}
(g) Write something about RStudio?

RStudio is an integrated development environment (IDE) for the R programming language. It provides an interface for writing, debugging, and visualizing R code. RStudio includes features like a console for running code, a script editor with syntax highlighting, and tools for managing packages, plots, and data. It is widely used by data analysts, statisticians, and data scientists for working with R.

(h) What is a Bug?

A bug is an error, flaw, or unintended behavior in a program that causes it to function incorrectly or produce incorrect results. Bugs can occur due to mistakes in coding, logic errors, or unexpected interactions between parts of a program.

(i) State Debugging.

Debugging is the process of identifying, isolating, and fixing bugs or errors in a program. It involves running the program with various inputs, using debugging tools (such as breakpoints, print statements, or debuggers), and inspecting the code to locate and correct problems. Debugging is an essential skill for software development, as it ensures that the code works as intended.

(j) Define R Programming.

R programming is a language and environment primarily used for statistical computing, data analysis, and graphical visualization. It is widely used by statisticians, data scientists, and researchers to analyze data, build statistical models, and produce high-quality plots and reports. R provides a rich ecosystem of packages and functions for data manipulation, machine learning, and data visualization.

3)

(a) What is raw data type in R Programming?

The raw data type in R represents raw bytes of data. It is used to store binary data or data in byte
format (e.g., images, files, or encrypted data). Each element of a raw vector in R is an individual byte,
which can take integer values between 0 and 255.

Example:
raw_data <- charToRaw("Hello") # Converts a string into raw bytes

print(raw_data)

Output:

[1] 48 65 6c 6c 6f

Here, the ASCII values of the characters in "Hello" are stored as raw bytes.

(b) Define logical data type.

The logical data type in R represents Boolean values: TRUE or FALSE. Logical values are used to evaluate conditions and control the flow of the program (e.g., in if statements, loops). Logical values are essential in comparisons, filtering, and conditional operations.

Example:

x <- TRUE

y <- FALSE

print(x)

print(y)

(c) What is a statement? Write its syntax.

A statement in programming is a single instruction or operation that performs a specific task. In R, a statement could be an expression or a function call that is executed. For example, a simple assignment statement is:

x <- 5 # Assigns the value 5 to variable x

Another example of a statement is calling a function:

print("Hello, World!") # Prints a message to the console

(d) What is code profiling?

Code profiling is the process of analyzing a program's performance, typically to identify performance bottlenecks. Profiling measures aspects like execution time, memory usage, and CPU utilization, which helps developers optimize their code. Profiling tools give insights into which functions or parts of the code consume the most resources.

(e) What is data learning?

"Data learning" is not a standard term in data science; the question most likely refers to either data cleaning or machine learning.

• Data cleaning is the process of preparing raw data for analysis by correcting errors, handling missing values, and ensuring consistency.

• Machine learning refers to the use of algorithms and statistical models that allow a computer system to improve its performance on a task through experience (data).

(f) What is tidy data?

Tidy data refers to a structured way of organizing data, where:

• Each variable is in a separate column.

• Each observation forms a row.

• Each type of observational unit forms a table.

In tidy data, the structure allows for easy analysis and visualization. For example:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

Each column represents a variable, and each row represents an observation.

(g) What is EDA?

EDA (Exploratory Data Analysis) is the process of analyzing a dataset to summarize its main characteristics, often using visual methods. EDA is important for understanding the data, spotting anomalies, identifying patterns, and forming hypotheses. It typically involves:

• Visualizations (e.g., histograms, scatter plots)

• Summary statistics (e.g., mean, median)

• Data cleaning and transformation

(h) What is data mining?

Data mining is the process of discovering patterns, correlations, trends, or useful information from large datasets using algorithms and statistical methods. Data mining involves tasks such as:

• Classification: Assigning labels to data.

• Clustering: Grouping similar data points.

• Association: Finding relationships between variables.

It is commonly used in applications like customer segmentation, fraud detection, and recommendation systems.

(i) What is data engineering?

Data engineering refers to the design, creation, and management of systems and infrastructure that allow the collection, storage, processing, and analysis of data. Data engineers build and maintain data pipelines, work with databases, and ensure that the data infrastructure is scalable, reliable, and efficient. They work closely with data scientists and analysts to ensure that data is accessible and usable.

(j) What do you mean by data visualization?

Data visualization is the process of creating graphical representations of data to help people understand complex data more easily. It uses charts, graphs, and maps to present patterns, trends, and insights clearly. Common types of data visualizations include:

• Bar charts

• Line graphs

• Pie charts

• Heatmaps

• Scatter plots

Data visualization is crucial for communicating insights from data effectively to a wide audience, often helping in decision-making processes.

2023
2)

a) Name the data types supported by R programming.

R programming supports several data types, including:

1. Numeric: For representing both integers and real numbers (e.g., 5, 3.14).

2. Character: For representing text or strings (e.g., "hello", "data").

3. Logical: For representing Boolean values (TRUE or FALSE).

4. Integer: For whole numbers (e.g., 1L, 45L).

5. Complex: For complex numbers (e.g., 2 + 3i).

6. Raw: For storing raw bytes of data (e.g., binary data).

7. List: For storing ordered collections of objects, which can be of different types (e.g., list(1, "hello", TRUE)).

8. Data Frame: A two-dimensional table-like structure, often used for datasets.

9. Matrix: A two-dimensional array-like object where all elements are of the same data type.

10. Factor: Used for categorical data, storing data as levels (e.g., "low", "medium", "high").
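
Quick examples of creating several of these types in R (all values are illustrative):

num   <- 3.14                     # numeric
txt   <- "hello"                  # character
flag  <- TRUE                     # logical
whole <- 45L                      # integer
comp  <- 2 + 3i                   # complex
bytes <- charToRaw("A")           # raw
lst   <- list(1, "hello", TRUE)   # list (mixed types)
df    <- data.frame(x = 1:3, y = c("a", "b", "c"))                       # data frame
mat   <- matrix(1:6, nrow = 2)    # matrix
lvl   <- factor(c("low", "high"), levels = c("low", "medium", "high"))   # factor

sapply(list(num, txt, flag, whole, comp, bytes), class)   # inspect the classes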

b) Define data transformation.

Data transformation is the process of converting data from its original format into a format that is more suitable for analysis or processing. This may involve operations such as:

• Normalizing or scaling values

• Aggregating data (e.g., summing, averaging)

• Changing the structure of the data (e.g., converting a wide format to a long format)

• Handling missing values

• Encoding categorical variables

Data transformation is an essential step in data cleaning and preparation before analysis.
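
A small transformation sketch in base R (the sales data frame is illustrative):

sales <- data.frame(
  region = c("North", "North", "South", "South"),
  month  = c("Jan", "Feb", "Jan", "Feb"),
  amount = c(120, 150, 90, 110)
)

sales$amount_scaled <- as.numeric(scale(sales$amount))        # scaling (mean 0, sd 1)
totals <- aggregate(amount ~ region, data = sales, FUN = sum) # aggregating per region
sales$region <- factor(sales$region)                          # encoding a categorical variable
print(totals)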

c) What is the significance of indentation in R programming?

Indentation in R programming is important for improving the readability and organization of code. While R does not require specific indentation for the code to run (unlike Python), consistent indentation helps developers understand the structure of the code, especially when working with nested loops, conditional statements, or functions. Proper indentation also helps with debugging and maintenance by clearly indicating the scope of loops, conditions, or function bodies.

d) Define function.

A function in R is a block of reusable code that performs a specific task. Functions take input (arguments), perform an operation, and return an output. Functions are essential for code modularity and reusability.

Syntax:

function_name <- function(arg1, arg2) {
  # Code to perform the task
  result <- arg1 + arg2 # Example operation
  return(result) # Return the result
}

Example:

add_numbers <- function(a, b) {
  return(a + b)
}

print(add_numbers(3, 5)) # Output: 8

e) Name two operators with their use in R programming.

1. Arithmetic Operators: These are used for mathematical calculations.

   o Example: +, -, *, /, ^

   o Usage:

     x <- 5
     y <- 2
     z <- x + y # Adds x and y
     print(z) # Output: 7

2. Logical Operators: These are used to perform logical operations, often in control flow (e.g., if statements).

   o Example: & (AND), | (OR), ! (NOT)

   o Usage:

     a <- TRUE
     b <- FALSE
     result <- a & b # AND operation
     print(result) # Output: FALSE

f) What do you mean by high dimensionality data?

High dimensionality data refers to datasets that have a large number of features (variables) compared to the number of observations (data points). In such datasets, each data point is represented by many attributes (dimensions). High-dimensional data is common in fields like genomics, text analysis, and image processing.

Challenges with high-dimensional data:

• Curse of dimensionality: As the number of features increases, the space becomes sparse, making it harder to analyze.

• Overfitting: More features can lead to models that fit noise in the data rather than the true patterns.

g) What is the use of API?

An API (Application Programming Interface) is a set of protocols, routines, and tools that allow different software applications to communicate with each other. APIs allow developers to access the functionality or data of another application, service, or platform without needing to understand its internal workings.

Common uses of APIs:

• Accessing data from remote servers (e.g., weather, stock prices).

• Integrating with third-party services (e.g., payment gateways, social media platforms).

• Automating tasks and workflows.
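
A hedged sketch of calling a web API from R with the jsonlite package (assumes jsonlite is installed; the URL below is a placeholder, not a real endpoint):

library(jsonlite)

url <- "https://api.example.com/v1/weather?city=Delhi"   # hypothetical endpoint
# fromJSON() downloads the response and parses the JSON into R objects:
# result <- fromJSON(url)
# str(result)   # inspect the parsed list or data frame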

h) Define knowledge.

Knowledge refers to the understanding, awareness, or familiarity gained through experience, learning, or study. In the context of data science or machine learning:

• Knowledge can be inferred from data, often after analyzing it through various techniques (e.g., statistical analysis, pattern recognition).

• In machine learning, knowledge is typically stored in models that have learned patterns from training data.

i) What is the significance of EDA?

Exploratory Data Analysis (EDA) is the process of analyzing data sets to summarize their main characteristics, often using visual methods. The significance of EDA includes:

• Understanding data distribution: Helps to understand the central tendencies, spread, and shape of the data.

• Identifying outliers: Detects anomalies that may affect the analysis.

• Uncovering relationships: Visualizes correlations and dependencies between variables.

• Data cleaning: Identifies issues like missing values, incorrect entries, and duplicates.

EDA helps in forming hypotheses and making informed decisions about the next steps in analysis or modeling.

j) Define tidy data.

Tidy data is a way of organizing data to make it easy to analyze. In tidy data:

• Each variable is in a separate column.

• Each observation is in a separate row.

• Each type of observational unit forms a table.

Tidy data follows these principles, making it easier to manipulate, analyze, and visualize the data.

Example of tidy data:

Name Age Gender

Alice 23 Female

Bob 30 Male

Carol 22 Female

3)

(a) Differentiate between Raw data and Final data.

Raw Data:

• Definition: Raw data refers to data that is collected directly from a source without any processing or cleaning. It is often unorganized, messy, and may contain errors, inconsistencies, or irrelevant information.

• Characteristics:

  o Unstructured and unrefined.

  o May contain missing values, duplicates, or outliers.

  o Needs processing, cleaning, and transformation before it can be analyzed.

  o Example: A survey dataset with incomplete responses or raw logs from a web server.

Final Data:

• Definition: Final data refers to the cleaned, processed, and transformed version of the raw data, ready for analysis or reporting. It has been refined to ensure accuracy, consistency, and completeness.

• Characteristics:

  o Structured, clean, and organized.

  o Missing values are handled, outliers are addressed, and data is transformed for specific analytical purposes.

  o Ready for statistical analysis, machine learning, or visualization.

  o Example: A dataset where errors have been corrected, and missing values have been imputed.

In summary, raw data is the unprocessed, raw form, while final data is the result of cleaning and transformation that makes it suitable for analysis.

(b) Write down three benefits of visualization.

1. Simplifies Complex Data: Visualization helps in representing large datasets in a simple and understandable format, making complex relationships, trends, and patterns more accessible. For example, a scatter plot can reveal correlations between two variables, which might be difficult to comprehend from a table of numbers.

2. Identifies Trends and Patterns: Through various visualization techniques (e.g., line graphs, bar charts, heatmaps), trends and patterns that might be hidden in raw data become more apparent. This can help in making informed decisions or formulating hypotheses.

3. Improves Data Interpretation: Visualizations are often more intuitive than raw data. They make it easier to communicate findings, helping stakeholders, analysts, and decision-makers quickly grasp the key points of a dataset without needing to dive into the numbers. This is particularly useful in presentations and reports.

(c) What is Exploratory Data Analysis (EDA) in data analysis?

Exploratory Data Analysis (EDA) is an approach used to analyze and summarize the main characteristics of a dataset, often with the help of graphical and statistical methods. EDA is used to understand the underlying structure of the data, detect outliers, identify patterns, test hypotheses, and check assumptions. It helps in making decisions on the next steps in data processing, feature selection, or statistical modeling.

Key objectives of EDA:

• To explore data through visualization techniques (e.g., histograms, box plots, scatter plots).

• To check for data quality issues such as missing values, duplicates, or inconsistencies.

• To generate hypotheses about the data that can be tested later.

• To guide the preparation of data for more advanced modeling or analysis.

(d) What is a data warehouse?

A data warehouse is a centralized repository designed to store large volumes of structured data from multiple sources for analysis and reporting purposes. It is typically used in business intelligence (BI) to support decision-making processes by consolidating historical data, enabling users to query and analyze data efficiently.

Key features of a data warehouse:

• Data integration: It collects data from different operational systems (e.g., databases, external sources).

• Historical storage: It stores historical data that can be used for trend analysis and reporting.

• Optimization for queries: Data is structured in a way that makes querying fast and efficient, often using techniques like indexing and partitioning.

(e) Define simulation.

Simulation is the process of creating a model or a virtual representation of a real-world system or process to study its behavior under different conditions. Simulations are commonly used in areas such as engineering, finance, and science to predict outcomes, assess risks, or optimize performance.

In data analysis, simulations might involve generating synthetic data based on certain assumptions or running repeated trials to understand variability and uncertainty. For example, Monte Carlo simulations use random sampling to model complex systems and estimate probabilities.
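
A minimal Monte Carlo sketch in base R, estimating the probability that the sum of two dice exceeds 9 (the scenario is illustrative):

set.seed(123)                       # reproducible random draws
n_trials <- 100000

die1 <- sample(1:6, n_trials, replace = TRUE)
die2 <- sample(1:6, n_trials, replace = TRUE)

estimate <- mean(die1 + die2 > 9)   # proportion of simulated trials
print(estimate)                     # close to the exact value 6/36 ≈ 0.167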

(f) Write down the control structures used in R programming.

R programming includes several control structures that dictate the flow of execution in a program. The key control structures in R are:

1. Conditional Statements:

   o if statement: Executes a block of code if the condition is true.

   o else statement: Executes a block of code if the if condition is false.

   o else if statement: Checks additional conditions if the initial if condition is false.

   Example:

   if (x > 10) {
     print("x is greater than 10")
   } else {
     print("x is less than or equal to 10")
   }

2. Loops:

   o for loop: Iterates over a sequence (e.g., vector, list).

   o while loop: Repeats a block of code as long as a condition is true.

   o repeat loop: Repeats a block of code indefinitely until a condition breaks it.

   Example:

   for (i in 1:5) {
     print(i)
   }

3. Break and Next:

   o break: Exits the loop immediately.

   o next: Skips the current iteration of the loop and moves to the next one.
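
A short sketch combining repeat, break, and next (the values are illustrative):

i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers
  print(i)                # prints 1, 3, 5, 7, 9
  if (i >= 9) break       # stop once i reaches 9
}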

(g) Name the common multivariate statistical techniques.

Some common multivariate statistical techniques include:

1. Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a new set of variables (principal components) while retaining as much variance as possible.

2. Factor Analysis: Identifies underlying factors that explain the correlations between observed variables.

3. Cluster Analysis (Clustering): Groups similar observations into clusters based on multivariate data.

4. Multivariate Analysis of Variance (MANOVA): Extends ANOVA to analyze the differences in multiple dependent variables across different groups.

5. Canonical Correlation Analysis (CCA): Measures the relationship between two sets of variables.

6. Discriminant Analysis: Classifies observations into predefined groups based on multiple predictor variables.
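
A brief PCA sketch using base R's prcomp() on the built-in iris measurements (the dataset choice is an assumption):

data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # standardize, then rotate

summary(pca)           # proportion of variance explained by each component
head(pca$x[, 1:2])     # scores on the first two principal components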

(h) Write down the steps used for data cleaning.

The common steps for data cleaning include:

1. Remove or Handle Missing Data:

o Check for missing values.

o Decide whether to remove, impute, or fill missing values based on the context.

2. Remove Duplicates:

o Identify and remove duplicate records that can distort analysis.

3. Correct Data Types:

o Ensure that columns have the correct data types (e.g., numeric, date).

4. Handle Outliers:

o Identify and deal with outliers by capping, removing, or transforming them.

5. Standardize Data:

o Ensure consistent formatting (e.g., dates in the same format, consistent units).

6. Normalize/Scale Data:

o Standardize or scale numerical values to bring them to the same range or scale for better analysis.

7. Remove Irrelevant Data:


o Drop columns or features that are irrelevant to the analysis.

8. Validate Data:

o Check for logical inconsistencies, such as negative values where they don't make sense or contradictory entries.
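
A compact sketch of a few of these steps in base R (the data frame is illustrative):

raw <- data.frame(
  id   = c(1, 2, 2, 3, 4),
  age  = c(25, NA, NA, 130, 31),            # NA = missing, 130 = implausible value
  city = c("Pune", "pune", "pune", "Delhi", "Delhi")
)

clean <- raw[!duplicated(raw$id), ]                            # remove duplicate records
clean$age[which(clean$age > 100)] <- NA                        # treat implausible ages as missing
clean$age[is.na(clean$age)] <- mean(clean$age, na.rm = TRUE)   # impute missing values
clean$city <- tolower(clean$city)                              # standardize formatting
print(clean)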

(i) What are the different data models used in case of databases?

In databases, several types of data models are used to structure and organize data. Some common
data models are:

1. Relational Model:

o Organizes data into tables (relations), with rows as records and columns as attributes. It uses SQL for querying.

2. Hierarchical Model:

o Organizes data in a tree-like structure where each child record has only one parent, creating a hierarchy.

3. Network Model:

o Similar to the hierarchical model but allows each child to have multiple parents, forming a graph structure.

4. Object-Oriented Model:

o Organizes data as objects, similar to objects in programming languages, combining both data and the methods to manipulate them.

5. Document Model:

o Used in NoSQL databases, where data is stored as documents (e.g., JSON or XML)
and can vary in structure.

6. Key-Value Model:

o Used in NoSQL databases, stores data as key-value pairs, providing high flexibility
and scalability.

(j) What are the different sources to get the data?

There are several sources from which data can be obtained:

1. Internal Data:

o Data generated within the organization, such as sales records, customer databases, and operational logs.

2. External Data:
o Data obtained from outside sources, such as government datasets, industry reports,
and public APIs (e.g., weather data, social media data).

3. Web Scraping:

o Extracting data from websites using web scraping tools or libraries.

4. Surveys and Questionnaires:

o Data collected directly from individuals via surveys, interviews, or questionnaires.

5. Social Media:

o Data collected from platforms like Twitter, Facebook, or Instagram using APIs or scraping tools.

6. IoT Devices:

o Data generated by sensors and smart devices, such as temperature readings, heart
rate, or GPS data.

7. Open Data Repositories:

o Publicly available datasets shared by organizations or institutions for research and analysis (e.g., Kaggle, UCI Machine Learning Repository).

8. Business Partners:

o Data shared by business partners, such as joint venture partners, suppliers, or collaborators.
