What Is a Data Analyst?
A data analyst is a person who uses statistical methods, programming, and visualization tools to analyze
and interpret data, helping organizations make informed decisions. Data analysts clean, process, and
organize data to identify trends, patterns, and anomalies, contributing crucial insights that drive strategic
and operational decision-making within businesses and other sectors.
Below are the questions most likely to be asked during the interview process for both experienced and
beginner data analyst job profiles.
1. What is data analysis?
Data analysis is a multidisciplinary field of data science in which data is analyzed using mathematical,
statistical, and computer science techniques together with domain expertise to discover useful information
or patterns in the data. It involves gathering, cleaning, transforming, and organizing data to draw
conclusions, make forecasts, and support informed decisions. The purpose of data analysis is to turn raw
data into actionable knowledge that can be used to guide decisions, solve problems, or reveal hidden trends.
2. What is the difference between a data analyst and a data scientist?
Data analysts and data scientists can be distinguished by their responsibilities, skill sets, and areas of
expertise, although in practice the two roles sometimes overlap or are not clearly separated.
Data analysts are responsible for collecting, cleaning, and analyzing data to help businesses make better
decisions. They typically use statistical analysis and visualization tools to identify trends and patterns in
data. Data analysts may also develop reports and dashboards to communicate their findings to
stakeholders.
Data scientists are responsible for creating and implementing machine learning and statistical models on
data. These models are used to make predictions, automate jobs, and enhance business processes. Data
scientists are also well-versed in programming languages and software engineering.
3. What is the difference between data analysis and business intelligence?
Data analysis and business intelligence (BI) are closely related fields: both use data and analysis to
support better and more effective decisions. However, there are some key differences between the two.
• Data analysis involves gathering, inspecting, cleaning, and transforming data and finding relevant
information so that it can be used in the decision-making process.
• Business Intelligence (BI) also analyzes data to find insights as per the business requirements. It
generally uses statistical and data visualization tools, popularly known as BI tools, to present the
data in user-friendly views such as reports, dashboards, charts, and graphs.
In short, both rely on data to inform decisions, but data analysis focuses on examining the data itself to
extract insights, while business intelligence focuses on presenting those insights through reports and
dashboards that support business decisions.
4. What are the different tools mainly used for data analysis?
There are many different tools used for data analysis, each with its own strengths and weaknesses. Some
of the most commonly used tools are as follows:
• Spreadsheet Software: Spreadsheet Software is used for a variety of data analysis tasks, such as
sorting, filtering, and summarizing data. It also has several built-in functions for performing
statistical analysis. The top 3 mostly used Spreadsheet Software are as follows:
o Microsoft Excel
o Google Sheets
o LibreOffice Calc
• Database Management Systems (DBMS): DBMSs are crucial resources for data analysis. They offer
a secure and efficient way to manage, store, and organize massive amounts of data. Widely used
DBMSs include:
o MySQL
o PostgreSQL
o Oracle Database
• Statistical Software: There are many statistical software packages used for data analysis, each
with its own strengths and weaknesses. Some of the most popular are as follows:
o SAS: Widely used in various industries for statistical analysis and data management.
o SPSS: A software suite used for statistical analysis in social science research.
o Stata: A tool commonly used for managing, analyzing, and graphing data in various
fields.
• Programming Language: In data analysis, programming languages are used for deep and
customized analysis according to mathematical and statistical concepts. For Data analysis, two
programming languages are highly popular:
o R: R is a free and open-source programming language widely popular for data analysis. It
has good visualizations and environments mainly designed for statistical analysis and
data visualization. It has a wide variety of packages for performing different data analysis
tasks.
o Python: Python is also a free and open-source programming language used for Data
analysis. Nowadays, It is becoming widely popular among researchers. Along with data
analysis, It is used for Machine Learning, Artificial Intelligence, and web development.
5. What is data wrangling?
Data wrangling, also known as data munging, is closely related to data preprocessing. It is the process of
cleaning, transforming, and organizing raw, messy, or unstructured data into a usable format. The main
goal of data wrangling is to improve the quality and structure of the dataset so that it can be used for
analysis, model building, and other data-driven tasks.
Data wrangling can be a complicated and time-consuming process, but it is critical for businesses that
want to make data-driven choices. Businesses can obtain significant insights about their products,
services, and bottom line by taking the effort to wrangle their data.
Some of the most common tasks involved in data wrangling are as follows (a short pandas sketch follows
this list):
• Data Cleaning: Identify and remove errors, inconsistencies, and missing values from the dataset.
• Data Transformation: Transform the structure, format, or values of the data as required by the
analysis; this may include scaling and normalization or encoding categorical values.
• Data Integration: Combine two or more datasets when the data is scattered across multiple
sources and a consolidated analysis is needed.
• Data Restructuring: Reorganize the data to make it more suitable for analysis, for example by
reshaping it into a different format or creating new variables by aggregating features at different
levels.
• Data Enrichment: Enrich the data by adding additional relevant information, which may come
from external data or from combining or aggregating two or more features.
• Quality Assurance: Ensure that the data meets certain quality standards and is fit for analysis.
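For illustration, here is a minimal pandas sketch of these wrangling tasks on a small hypothetical sales
table (the column names and values are invented for the example); it is a sketch of the idea rather than a
complete pipeline.
import pandas as pd
import numpy as np

# Hypothetical raw data with typical wrangling problems:
# inconsistent text, duplicates, and missing values.
raw = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Bob", None],
    "amount": [120.0, 120.0, np.nan, 80.0, 50.0],
    "date": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-02-10", "2024-03-01"],
})

# Cleaning: standardize text, drop duplicates, handle missing values.
clean = raw.copy()
clean["customer"] = clean["customer"].str.strip().str.title()
clean = clean.drop_duplicates()
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
clean = clean.dropna(subset=["customer"])

# Transformation and restructuring: parse dates, then aggregate per month.
clean["date"] = pd.to_datetime(clean["date"])
monthly = clean.groupby(clean["date"].dt.to_period("M"))["amount"].sum()
print(monthly)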
6. What is the difference between descriptive and predictive analysis?
Descriptive and predictive analysis are two different ways to analyze data.
• Descriptive Analysis: Descriptive analysis is used to describe questions like “What has happened
in the past?” and “What are the key characteristics of the data?”. Its main goal is to identify the
patterns, trends, and relationships within the data. It uses statistical measures, visualizations,
and exploratory data analysis techniques to gain insight into the dataset.
The key characteristics of descriptive analysis are as follows:
o Summary Statistics: It often involves calculating basic statistical measures like mean,
median, mode, standard deviation, and percentiles.
o Visualizations: Graphs, charts, histograms, and other visual representations are used to
illustrate data patterns.
o Patterns and Trends: Descriptive analysis helps identify recurring patterns and trends
within the data.
o Exploration: It’s used for initial data exploration and hypothesis generation.
• Predictive Analysis: Predictive Analysis, on the other hand, uses past data and applies statistical
and machine learning models to identify patterns and relationships and make predictions about
future events. Its primary goal is to predict or forecast what is likely to happen in future.
The key characteristics of predictive analysis are as follows:
o Future Projection: Predictive analysis is used to forecast and predict future events.
o Model Building: It involves developing and training models using historical data to
predict outcomes.
o Validation and Testing: Predictive models are validated and tested using unseen data to
assess their accuracy.
o Feature Selection: Identifying relevant features (variables) that influence the predicted
outcome is crucial.
7. What are univariate, bivariate, and multivariate analysis?
Univariate, bivariate, and multivariate analysis are three different levels of data analysis used to
understand the data (a short pandas sketch follows this list).
1. Univariate analysis: Univariate analysis analyzes one variable at a time. Its main purpose is to
understand the variable's distribution, its measures of central tendency (mean, median, and mode)
and dispersion (range, variance, and standard deviation), often using graphical methods such as
histograms and box plots. It does not deal with causes or relationships involving the other
variables of the dataset.
Common techniques used in univariate analysis include histograms, bar charts, pie charts, box
plots, and summary statistics.
2. Bivariate analysis: Bivariate analysis involves the analysis of the relationship between two
variables. Its primary goal is to understand how one variable is related to the other: whether there
is any correlation between the two variables and, if so, how strong that correlation is. It can also
be used to predict the value of one variable from the value of the other based on the relationship
found between the two.
Common techniques used in bivariate analysis include scatter plots, correlation analysis,
contingency tables, and cross-tabulations.
3. Multivariate analysis: Multivariate analysis is used to analyze the relationship between three or
more variables simultaneously. Its primary goal is to understand the relationship among the
multiple variables. It is used to identify the patterns, clusters, and dependencies among the
several variables.
Common techniques used in multivariate analysis include principal component analysis (PCA),
factor analysis, cluster analysis, and regression analysis involving multiple predictor variables.
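As a rough illustration, the sketch below runs a univariate, a bivariate, and a multivariate look at a small
synthetic dataset using pandas and matplotlib; the columns (age, income, spend) are invented for the
example.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(35, 8, 200),
    "income": rng.normal(50_000, 12_000, 200),
})
df["spend"] = 0.3 * df["income"] + rng.normal(0, 2_000, 200)

# Univariate: distribution and summary statistics of a single variable.
print(df["age"].describe())
df["age"].plot(kind="hist", bins=20, title="Age distribution")

# Bivariate: relationship between two variables (scatter plot + correlation).
print("corr(income, spend):", round(df["income"].corr(df["spend"]), 2))
df.plot(kind="scatter", x="income", y="spend")

# Multivariate: correlation matrix across all numeric variables at once.
print(df.corr().round(2))
plt.show()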
8. Name some of the most popular data analysis and visualization tools used for data analysis.
Some of the most popular data analysis and visualization tools are as follows:
• Tableau: Tableau is a powerful data visualization application that enables users to generate
interactive dashboards and visualizations from a wide range of data sources. It is a popular
choice for businesses of all sizes since it is simple to use and can be adjusted to match any
organization’s demands.
• Power BI: Microsoft’s Power BI is another well-known data visualization tool. Power BI’s
versatility and connectivity with other Microsoft products make it a popular data analysis and
visualization tool in both individual and enterprise contexts.
• Qlik Sense: Qlik Sense is a data visualization tool that is well-known for its speed and
performance. It enables users to generate interactive dashboards and visualizations from several
data sources, and it can be used to examine enormous datasets.
• SAS: A software suite used for advanced analytics, multivariate analysis, and business
intelligence.
• Google Data Studio: Google Data Studio is a free web-based data visualization application that
allows users to create customized dashboards and simple reports. It aggregates data from up to
12 different sources, including Google Analytics, into an easy-to-modify, easy-to-share, and easy-
to-read report.
9. What are the usual steps involved in a data analysis project?
Data analysis involves a series of steps that transform raw data into relevant insights, conclusions, and
actionable suggestions. While the specific approach varies with the context and aims of the study, here is
an approximate outline of the processes commonly followed in data analysis:
• Problem Definition or Objective: Make sure that the problem or question you’re attempting to
answer is stated clearly. Understand the analysis’s aims and objectives to direct your strategy.
• Data Collection: Collate relevant data from various sources. This might include surveys, tests,
databases, web scraping, and other techniques. Make sure the data is representative and
accurate.
• Data Preprocessing or Data Cleaning: Raw data often has errors, missing values, and
inconsistencies. In Data Preprocessing and Cleaning, we redefine the column’s names or values,
standardize the formats, and deal with the missing values.
• Exploratory Data Analysis (EDA): EDA is a crucial step in Data analysis. In EDA, we apply various
graphical and statistical approaches to systematically analyze and summarize the main
characteristics, patterns, and relationships within a dataset. The primary objective behind the
EDA is to get a better knowledge of the data’s structure, identify probable abnormalities or
outliers, and offer initial insights that can guide further analysis.
• Data Visualizations: Data visualizations play a very important role in data analysis. It provides
visual representation of complicated information and patterns in the data which enhances the
understanding of data and helps in identifying the trends or patterns within a data. It enables
effective communication of insights to various stakeholders.
10. What is data cleaning?
Data cleaning is the process of identifying the removing misleading or inaccurate records from the
datasets. The primary objective of Data cleaning is to improve the quality of the data so that it can be
used for analysis and predictive model-building tasks. It is the next process after the data collection and
loading.
2. Duplicate entries: Duplicate records can bias analysis results, producing exaggerated counts or
incorrect statistical summaries, so they are removed.
3. Missing Values: Some data points may be missing. Before going further, we either remove the
affected rows or columns or fill the missing values with probable values.
4. Outliers: Outliers are data points that differ drastically from the rest of the data, sometimes as a
result of measurement or collection errors. If not handled properly they can bias results, even
though they can also offer useful insights, so we first detect outliers and then decide how to treat
them.
11. What is the importance of exploratory data analysis (EDA) in data analysis?
Exploratory data analysis (EDA) is the process of investigating and understanding the data through
graphical and statistical techniques. It is one of the crucial parts of data analysis that helps to identify the
patterns and trends in the data as well as help in understanding the relationship between variables.
EDA is a non-parametric approach to data analysis, which means it does not make strong assumptions
about the dataset. EDA is important for a number of reasons:
1. With EDA we can get a deep understanding of patterns, distributions, nature of data and
relationship with another variable in the dataset.
2. With EDA we can assess the quality of the dataset through univariate analyses such as the mean,
median, mode, interquartile range, and distribution plots, and identify the patterns and trends of
individual variables in the dataset.
3. With EDA we can also get the relationship between the two or more variables by making
bivariate or multivariate analyses like regression, correlations, covariance, scatter plot, line plot
etc.
4. With EDA we can find out the most influential feature of the dataset using correlations,
covariance, and various bivariate or multivariate plotting.
5. With EDA we can also identify the outliers using Box plots and remove them further using a
statistical approach.
EDA provides the groundwork for the entire data analysis process. It enables analysts to make more
informed judgments about data processing, hypothesis testing, modelling, and interpretation, resulting
in more accurate and relevant insights.
12. What is time series analysis?
Time series analysis is a statistical technique used to analyze and interpret data points collected at
specific time intervals. Time series data consists of data points recorded sequentially over time; the data
points can be numerical, categorical, or both. The objective of time series analysis is to understand the
underlying patterns, trends, and behaviours in the data, as well as to make forecasts about future values.
The main components of a time series are:
• Trend: The data’s long-term movement or direction over time. Trends can be upward,
downward, or flat.
• Seasonality: Patterns that repeat at regular intervals, such as daily, monthly, or yearly cycles.
• Cyclical Patterns: Longer-term trends that are not as regular as seasonality, and are frequently
associated with economic or business cycles.
• Irregular Fluctuations: Unpredictable and random data fluctuations that cannot be explained by
trends, seasonality, or cycles.
• Auto-correlations: The link between a data point and its prior values. It quantifies the degree of
dependence between observations at different time points.
Time series analysis includes a variety of techniques: descriptive analysis to identify trends, patterns, and
irregularities; smoothing techniques such as moving averages or exponential smoothing to reduce noise
and highlight underlying trends; decomposition to separate the series into its individual components; and
forecasting techniques such as ARIMA, SARIMA, and regression to predict future values based on those
trends.
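The sketch below illustrates two of these techniques, a moving average and an additive decomposition, on
a synthetic monthly series; it assumes the statsmodels package is installed, and the numbers are invented
for the example.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with an upward trend and yearly seasonality.
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
rng = np.random.default_rng(1)
values = np.linspace(100, 160, 60) + 10 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(0, 2, 60)
ts = pd.Series(values, index=idx)

# Smoothing: a 12-month centered moving average highlights the trend.
trend = ts.rolling(window=12, center=True).mean()

# Decomposition: split the series into trend, seasonal, and residual parts.
result = seasonal_decompose(ts, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))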
13. What is feature engineering?
Feature engineering is the process of selecting, transforming, and creating features from raw data in
order to build more effective and accurate machine learning models. Its primary goal is to identify the
most relevant features, or to create new ones by combining two or more existing features with
mathematical operations, so that the data can be used effectively for predictive analysis by a machine
learning model. Common feature engineering steps include the following (a short pandas sketch follows
this list):
• Feature Selection: Identify the most relevant features in the dataset based on their correlation
with the target variable.
• Feature Creation: Generate new features by aggregating or transforming existing ones in a way
that captures patterns or trends not revealed by the original features.
• Transformation: Modify or scale features so that they are more useful for building the machine
learning model. Common transformation methods are min-max scaling, Z-score normalization,
and log transformation.
• Feature Encoding: Most ML algorithms only process numerical data, so categorical features need
to be encoded into numerical vectors. Popular encoding techniques include one-hot encoding and
ordinal (label) encoding.
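Here is a minimal pandas sketch of these steps on a hypothetical orders table (the columns price, quantity,
and city are made up); it shows feature creation, a min-max transformation, one-hot encoding, and a
simple correlation check for feature selection.
import pandas as pd

df = pd.DataFrame({
    "price": [250, 320, 180, 400],
    "quantity": [3, 2, 5, 1],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# Create a new feature by combining existing ones.
df["revenue"] = df["price"] * df["quantity"]

# Transformation: min-max scale a numeric feature into the 0-1 range.
df["price_scaled"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# Feature encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Feature selection: inspect correlation of numeric features with the target.
print(df[["price", "quantity", "price_scaled", "revenue"]].corr()["revenue"].sort_values(ascending=False))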
14. What is data normalization?
Data normalization is the process of transforming numerical data into a standardized range. The objective
is to scale the different features (variables) of a dataset onto a common scale, which makes it easier to
compare, analyze, and model the data. This is particularly important when features have different units,
scales, or ranges; without normalization, each feature carries a different weight, which can affect the
performance of various machine learning algorithms and statistical analyses. Common normalization
techniques include the following (a short NumPy sketch follows this list):
• Min-Max Scaling: Scales the data to a range between 0 and 1 using the formula:
(x – min) / (max – min)
• Robust Scaling: Scales data by removing the median and scaling to the interquartile range(IQR)
to handle outliers using the formula:
(X – Median) / IQR
• Unit Vector Scaling: Scales each data point to have a Euclidean norm (length) (||X||) of 1 using
the formula:
X / ||X||
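The NumPy sketch below applies the three formulas above to a small made-up array containing one
outlier, so the effect of each scaling method can be compared.
import numpy as np

x = np.array([12.0, 15.0, 20.0, 28.0, 95.0])  # 95 is an outlier

# Min-Max scaling: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: (x - median) / IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Unit vector scaling: x / ||x||
unit = x / np.linalg.norm(x)

print(min_max.round(3))
print(robust.round(3))
print(unit.round(3))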
15. What are the main libraries you would use for data analysis in Python?
For data analysis in Python, many great libraries are used due to their versatility, functionality, and ease
of use. Some of the most common libraries are as follows:
• NumPy: A core Python library for numerical computations. It supports arrays, matrices, and a
variety of mathematical functions, making it a building block for many other data analysis
libraries.
• Pandas: A well-known data manipulation and analysis library. It provides data structures (such as
DataFrames) that make it easy to manipulate, filter, aggregate, and transform data. Pandas is
essential when working with structured data.
• SciPy: SciPy is a scientific computing library. It offers a wide range of statistical, mathematical,
and scientific computing functions.
• Matplotlib: Matplotlib is a library for plotting and visualization. It provides a wide range of
plotting functions, making it easy to create beautiful and informative visualizations.
• Seaborn: Seaborn is a library for statistical data visualization. It builds on top of Matplotlib and
provides a more user-friendly interface for creating statistical plots.
• Statsmodels: A statistical model estimation and interpretation library. It covers a wide range of
statistical models, such as linear models and time series analysis.
16. What is the difference between structured and unstructured data?
The distinction between structured and unstructured data depends on the format in which the data is
stored. Structured data is information that has been organized in a defined format, such as a table or
spreadsheet, which makes it easy to search, sort, and analyze. Unstructured data is information that is not
arranged in a predefined format, which makes searching, sorting, and analyzing it more complex.
17. How can pandas be used for data analysis?
Pandas supports most steps of the analysis workflow (a short sketch follows this list):
1. Data loading functions: Pandas provides functions to read datasets from many different formats;
for example, read_csv, read_excel, and read_sql are used to read data from CSV files, Excel files,
and SQL databases respectively into a pandas DataFrame.
2. Data Exploration: Pandas provides functions like head, tail, and sample to rapidly inspect the
data after it has been imported. In order to learn more about the different data types, missing
values, and summary statistics, use pandas .info and .describe functions.
3. Data Cleaning: Pandas offers functions for dealing with missing values (fillna), duplicate rows
(drop_duplicates), and incorrect data types (astype) before analysis.
4. Data Transformation: Pandas may be used to modify and transform data. It is simple to do
actions like selecting columns, filtering rows (loc, iloc), and adding new ones. Custom
transformations are feasible using the apply and map functions.
5. Data Aggregation: With pandas we can group data using the groupby function and apply
aggregations such as sum, mean, and count on specified columns.
6. Time Series Analysis: Pandas offers robust support for time series data. We can easily conduct
date-based computations using functions like resample, shift etc.
7. Merging and Joining: Data from different sources can be combined using Pandas merge and join
functions.
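The sketch below strings several of these pandas functions together on a tiny in-memory CSV (the order
data is invented); read_csv would normally be pointed at a file path instead of a StringIO buffer.
import io
import pandas as pd

csv_data = io.StringIO(
    "order_id,region,amount\n1,North,200\n2,South,\n3,North,150\n3,North,150\n"
)

# Loading: read_csv also accepts file paths or URLs.
orders = pd.read_csv(csv_data)

# Exploration
print(orders.head())
orders.info()

# Cleaning: drop duplicates, fill missing values, fix data types.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(0).astype(int)

# Transformation and aggregation
orders["amount_with_tax"] = orders["amount"] * 1.18
print(orders.groupby("region")["amount"].agg(["sum", "mean", "count"]))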
18. What is the difference between pandas Series and pandas DataFrames?
In pandas, Both Series and Dataframes are the fundamental data structures for handling and analyzing
tabular data. However, they have distinct characteristics and use cases.
A series in pandas is a one-dimensional labelled array that can hold data of various types like integer,
float, string etc. It is similar to a NumPy array, except it has an index that may be used to access the data.
The index can be any type of object, such as a string, a number, or a datetime.
The key differences between a pandas Series and a pandas DataFrame are as follows:
• pandas Series: similar to a single vector or column in a spreadsheet.
• pandas DataFrame: similar to a whole spreadsheet, which can contain multiple vectors or columns.
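A small sketch of the distinction (the names and salaries are invented): a Series behaves like one labelled
column, while a DataFrame bundles several Series that share an index.
import pandas as pd

# A Series: one-dimensional labelled data (a single column).
salaries = pd.Series([50000, 62000, 58000], index=["Asha", "Ravi", "Meera"], name="salary")
print(salaries["Ravi"])        # access a value by label

# A DataFrame: two-dimensional, a collection of Series sharing one index.
teams = pd.Series(["Data", "BI", "Data"], index=salaries.index, name="team")
employees = pd.DataFrame({"salary": salaries, "team": teams})
print(employees.loc["Asha"])   # a row, returned as a Series
print(employees["team"])       # a single column is itself a Series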
19. What is one-hot encoding?
One-hot encoding is a technique used for converting categorical data into a format that machine learning
algorithms can understand. Categorical data is data that falls into distinct groups, such as colors, countries,
or zip codes. Because machine learning algorithms often require numerical input, categorical data is
represented as a set of binary values using one-hot encoding.
To one-hot encode a categorical variable, we generate a new binary variable for each potential value of
that variable. For example, if the categorical variable is “color” and the potential values are “red,”
“green,” and “blue,” then three additional binary variables are created: “color_red,” “color_green,” and
“color_blue.” Each of these binary variables has a value of 1 if the matching category value is present
and 0 if it is not.
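In pandas, the same idea is available through get_dummies; the sketch below one-hot encodes a made-up
color column (depending on the pandas version the new columns appear as True/False or 1/0).
import pandas as pd

df = pd.DataFrame({"item": ["pen", "book", "pen"], "color": ["red", "green", "blue"]})

# One binary column is created per category of 'color'.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(encoded)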
20. What is a boxplot and how is it used?
A boxplot is a graphical representation of the distribution of a dataset. It is a standardized way of
displaying the distribution based on the five-number summary of the data: the minimum, first quartile
(Q1), median, third quartile (Q3), and maximum.
Boxplot
Boxplots are commonly used to detect outliers in a dataset by visualizing the distribution of the data.
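A minimal matplotlib sketch: it draws a boxplot of synthetic data containing two obvious outliers and
prints the five-number summary the plot is based on.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 5, 100), [95, 110]])  # two obvious outliers

plt.boxplot(data)
plt.title("Boxplot of a sample with outliers")
plt.show()

# The five-number summary behind the plot.
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(data.min(), q1, median, q3, data.max())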
21. What is the difference between descriptive and inferential statistics?
Descriptive statistics and inferential statistics are the two main branches of statistics.
• Descriptive Statistics: Descriptive statistics is the branch of statistics used to summarize and
describe the main characteristics of a dataset. It provides a clear and concise summary of the
data's central tendency, variability, and distribution, helping us understand the basic properties of
the data and identify patterns and structure without making generalizations beyond the observed
data. Descriptive statistics computes measures of central tendency and dispersion and also uses
graphical representations of data, such as histograms, bar charts, and pie charts, to gain insight
into a dataset.
• Inferential Statistics: Inferential statistics is the branch of statistics, that is used to conclude,
make predictions, and generalize findings from a sample to a larger population. It makes
inferences and hypotheses about the entire population based on the information gained from a
representative sample. Inferential statistics use hypothesis testing, confidence intervals, and
regression analysis to make inferences about a population.
Inferential statistics is used to answer the following questions:
o Is there any difference in the monthly income of the Data analyst and the Data Scientist?
22. What are the measures of central tendency?
Measures of central tendency are statistical measures that represent the centre of a data set; they reveal
where the majority of the data points generally cluster. The three most common measures of central
tendency are as follows (a short pandas example follows the list):
• Mean: The mean, also known as the average, is calculated by adding up all the values in a
dataset and then dividing by the total number of values. It is sensitive to outliers since a single
extreme number can have a large impact on the mean.
Mean = (Sum of all values) / (Total number of values)
• Median: The median is the middle value in a data set when it is arranged in ascending or
descending order. If there is an even number of values, the median is the average of the two
middle values.
• Mode: The mode is the value that appears most frequently in a dataset. A dataset can have no
mode (if all values are unique) or multiple modes (if multiple values have the same highest
frequency). The mode is useful for categorical data and discrete distributions.
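The pandas example below computes all three measures on a small made-up sample that includes one
extreme value, which shows why the median and mode are more robust than the mean.
import pandas as pd

values = pd.Series([12, 15, 15, 18, 21, 21, 21, 95])  # 95 is an extreme value

print("mean:", values.mean())            # pulled upward by the extreme value
print("median:", values.median())        # robust to the outlier
print("mode:", values.mode().tolist())   # most frequent value(s)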
23. What are the measures of dispersion?
Measures of dispersion, also known as measures of variability or spread, indicate how much the values
in a dataset deviate from the central tendency. They quantify how far the data points vary from the
average value (a short NumPy example follows the list).
• Range: The range is the difference between the highest and lowest values in a data set. It gives
an idea of how much the data spreads from the minimum to the maximum.
• Variance: The variance is the average of the squared deviations of each data point from the
mean. It is a measure of how spread out the data is around the mean.
Variance (σ²) = Σ(X − μ)² / N
• Standard Deviation: The standard deviation is the square root of the variance. It is a measure of
how spread out the data is around the mean, but it is expressed in the same units as the data
itself.
• Mean Absolute Deviation (MAD): MAD is the average of the absolute differences between each
data point and the mean. Unlike variance, it doesn’t involve squaring the differences, making it
less sensitive to extreme values. it is less sensitive to outliers than the variance or standard
deviation.
• Percentiles: Percentiles are statistical values that measure the relative positions of values within
a dataset. They are computed by arranging the dataset in ascending order, from smallest to
largest, and then dividing it into 100 equal parts. In other words, a percentile tells you what
percentage of data points are below or equal to a specific value. Percentiles are often used to
understand the distribution of data and to identify values that are above or below a certain
threshold within a dataset.
• Interquartile Range (IQR): The interquartile range (IQR) is the range of values ranging from the
25th percentile (first quartile) to the 75th percentile (third quartile). It measures the spread of
the middle 50% of the data and is less affected by outliers.
• Coefficient of Variation (CV): The coefficient of variation (CV) is a measure of relative variability,
It is the ratio of the standard deviation to the mean, expressed as a percentage. It’s used to
compare the relative variability between datasets with different units or scales.
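The NumPy sketch below computes these dispersion measures for a small made-up sample with one
outlier; it uses the population variance, matching the formula above.
import numpy as np

x = np.array([4, 8, 6, 5, 3, 9, 7, 30])  # 30 is an outlier

print("range:", x.max() - x.min())
print("variance:", x.var())                    # population variance
print("std dev:", x.std())
print("MAD:", np.mean(np.abs(x - x.mean())))   # mean absolute deviation

q1, q3 = np.percentile(x, [25, 75])
print("IQR:", q3 - q1)
print("CV (%):", 100 * x.std() / x.mean())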
24. What is a probability distribution?
A probability distribution is a mathematical function that gives the probability of the different possible
outcomes or events of a random experiment or process. It is a mathematical representation of random
phenomena in terms of sample space and event probabilities, and it helps us understand the relative
likelihood of each outcome occurring. Probability distributions fall into two broad types:
1. Discrete Probability Distribution: In a discrete probability distribution, the random variable can
only take on distinct, separate values, each associated with a probability. Examples of discrete
probability distributions include the binomial distribution, the Poisson distribution, and the
hypergeometric distribution.
2. Continuous Probability Distribution: In a continuous probability distribution, the random variable
can take any value within a range, and probabilities are described by a probability density
function. Examples include the normal, uniform, and exponential distributions.
25. What is a normal distribution?
A normal distribution, also known as a Gaussian distribution, is a specific type of probability distribution
with a symmetric, bell-shaped curve. The data in a normal distribution cluster around a central value, the
mean, and the majority of the data falls within one standard deviation of the mean. The curve gradually
tapers off towards both tails, showing that extreme values become increasingly unlikely. A normal
distribution with a mean equal to 0 and a standard deviation equal to 1 is known as the standard normal
distribution, and Z-scores measure how many standard deviations a particular data point is from the
mean in the standard normal distribution.
Normal distributions are a fundamental concept that supports many statistical approaches and helps
researchers understand the behaviour of data and variables in a variety of scenarios.
26. What is the Central Limit Theorem (CLT)?
The Central Limit Theorem (CLT) is a fundamental concept in statistics which states that, under certain
conditions, the distribution of sample means approaches a normal distribution as the sample size rises,
regardless of the original population distribution. In other words, even if the population distribution is
not normal, when the sample size is large enough the distribution of sample means will tend to be
normal. The main conditions for the CLT to apply are as follows (a short simulation sketch follows):
1. The samples must be independent of one another.
2. The samples must be random, meaning each sample is drawn from the population in a way that
gives all members of the population an equal chance of being selected.
3. The sample size must be large enough; the CLT typically applies when the sample size is greater
than 30.
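A quick simulation sketch of the CLT: samples are drawn from a clearly non-normal (exponential)
population, yet the means of those samples cluster tightly and roughly normally around the population
mean; the population and sample sizes are arbitrary choices for the example.
import numpy as np

rng = np.random.default_rng(0)

# A right-skewed, non-normal population.
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record each sample mean.
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2_000)])

print("population mean:", round(population.mean(), 3))
print("mean of sample means:", round(sample_means.mean(), 3))
print("std of sample means:", round(sample_means.std(), 3))  # roughly sigma / sqrt(n)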
27. What are the null hypothesis and the alternative hypothesis?
In statistics, the null and alternative hypotheses are two mutually exclusive statements regarding a
population parameter. A hypothesis test analyzes sample data to determine whether to reject the null
hypothesis. The two hypotheses represent opposing statements or claims about a population or a
phenomenon under investigation.
• Null Hypothesis (H0): The null hypothesis is a statement of the status quo, representing no
difference or no effect, which is retained unless there is strong evidence to the contrary.
• Alternative Hypothesis (Ha): The alternative hypothesis is the opposing claim that there is a real
difference or effect, which is supported only when the evidence against the null hypothesis is
strong enough.
28. What is a p-value?
A p-value, which stands for “probability value,” is a statistical metric used in hypothesis testing to
measure the strength of evidence against the null hypothesis. It is the probability of obtaining the
observed results (or more extreme results) if the null hypothesis is true. In layman's terms, the p-value
indicates whether the findings of a study or experiment are statistically significant or whether they might
have happened by chance.
The p-value is a number between 0 and 1, which is frequently stated as a decimal or percentage. If the
null hypothesis is true, it indicates the probability of observing the data (or more extreme data).
29. What is the significance level (α)?
The significance level, often denoted as α (alpha), is a critical parameter in hypothesis testing and
statistical analysis. It defines the threshold for determining whether the results of a statistical test are
statistically significant; in other words, it sets the standard for deciding when to reject the null
hypothesis (H0) in favor of the alternative hypothesis (Ha).
If the p-value is less than the significance level, we reject the null hypothesis and conclude that there is a
statistically significant difference between the groups.
• If p-value ≤ α: Reject the null hypothesis. This indicates that the results are statistically
significant, and there is evidence to support the alternative hypothesis.
• If p-value > α: Fail to reject the null hypothesis. This means that the results are not statistically
significant, and there is insufficient evidence to support the alternative hypothesis.
The choice of a significance level involves a trade-off between Type I and Type II errors. A lower
significance level (e.g., α = 0.01) decreases the risk of Type I errors while increasing the chance of Type II
errors (failing to identify a real effect). A higher significance level (e.g., α = 0.10), on the other hand,
increases the probability of Type I errors while decreasing the chance of Type II errors.
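The sketch below ties the p-value and significance level together using SciPy's independent two-sample
t-test on two made-up groups; the group values and α = 0.05 are arbitrary choices for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(100, 10, 40)
group_b = rng.normal(106, 10, 40)   # slightly higher true mean

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print("p-value:", round(p_value, 4))
if p_value <= alpha:
    print("Reject H0: the difference in means is statistically significant.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")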
30. What are Type I and Type II errors?
In hypothesis testing, when deciding between the null hypothesis (H0) and the alternative hypothesis
(Ha), two types of errors may occur. These errors are known as Type I and Type II errors, and they are
important considerations in statistical analysis.
• Type I Error (False Positive, α): A Type I error occurs when the null hypothesis is rejected even
though it is true. This is also referred to as a false positive. The probability of committing a Type I
error is denoted by α (alpha) and is also known as the significance level. A lower significance level
(e.g., α = 0.05) reduces the chance of Type I errors while increasing the risk of Type II errors.
For example, a Type I error would occur if we concluded that a new medicine was effective when
it was not.
• Type II Error (False Negative, β): A Type II error occurs when a researcher fails to reject the null
hypothesis when it is actually false. This is also referred to as a false negative. The probability of
committing a Type II error is denoted by β (beta).
For example, a Type II error would occur if we concluded that a new medicine was not effective
when it actually was.
In short, a Type I error is rejecting a true null hypothesis, while a Type II error is failing to reject a false
null hypothesis.
31. What is a confidence interval, and how is it related to point estimates?
A confidence interval is a statistical concept used to quantify the uncertainty associated with estimating
a population parameter (such as a population mean or proportion) from a sample. It is a range of values
that is likely to contain the true value of the population parameter, together with a stated level of
confidence in that statement.
• Point estimate: A point estimate is a single value used to estimate a population parameter based
on a sample. For example, the sample mean (x̄) is a point estimate of the population mean (μ).
The point estimate is typically the sample mean or the sample proportion.
• Confidence interval: A confidence interval, on the other hand, is a range of values built around a
point estimate to account for the uncertainty in the estimate. It is typically expressed as an
interval with an associated confidence level (e.g., 95% confidence interval). The degree of
confidence or confidence level shows the probability that the interval contains the true
population parameter.
The relationship between point estimates and confidence intervals can be summarized as follows:
• A point estimate provides a single value as the best guess for a population parameter based on
sample data.
• A confidence interval provides a range of values around the point estimate, indicating the range
of likely values for the population parameter.
• The confidence level associated with the interval reflects the level of confidence that the true
parameter value falls within the interval.
For example, a 95% confidence interval indicates that you are 95% confident that the real population
parameter falls inside the interval. A 95% confidence interval for the population mean (μ) can be
expressed as:
x̄ ± margin of error
where x̄ is the point estimate (sample mean), and the margin of error is calculated using the standard
deviation of the sample and the confidence level.
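As an illustration, the sketch below computes a point estimate and a 95% confidence interval for the
mean of a made-up sample, using the t-distribution from SciPy because the population standard deviation
is unknown.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=8, size=40)

x_bar = sample.mean()          # point estimate of the population mean
sem = stats.sem(sample)        # standard error of the mean

# 95% CI: x_bar +/- t * SEM, with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, len(sample) - 1, loc=x_bar, scale=sem)
print("point estimate:", round(x_bar, 2))
print("95% CI:", (round(low, 2), round(high, 2)))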
32. What is ANOVA?
ANOVA, or Analysis of Variance, is a statistical technique used to compare the means of two or more
groups or populations and determine whether there are statistically significant differences between
them. It is a parametric statistical test, which means it assumes that the data is normally distributed and
that the variances of the groups are equal. It helps researchers determine the impact of one or more
categorical independent variables (factors) on a continuous dependent variable.
ANOVA works by partitioning the total variance in the data into two components:
• Between-group variance: It analyzes the difference in means between the different groups or
treatment levels being compared.
• Within-group variance: It analyzes the variance within each individual group or treatment level.
Depending on the investigation’s design and the number of independent variables, ANOVA has
numerous varieties:
• One-Way ANOVA: Compares the means of three or more independent groups or levels of a
single categorical variable. For Example: One-way ANOVA can be used to compare the average
age of employees among the three different teams in a company.
• Two-Way ANOVA: Compares the means of groups while taking into account the impact of two
independent categorical variables (factors). For example, two-way ANOVA can be used to compare
the average age of employees among the three different teams in a company while also taking
into account the gender of the employees.
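A minimal one-way ANOVA sketch with SciPy, reusing the employee-age example; the team ages are
invented numbers, and a small p-value would suggest that at least one team's mean age differs from the
others.
from scipy import stats

# Illustrative ages sampled from three teams.
team_a = [28, 31, 29, 35, 30]
team_b = [33, 36, 34, 38, 32]
team_c = [29, 27, 30, 31, 28]

f_stat, p_value = stats.f_oneway(team_a, team_b, team_c)
print("F =", round(f_stat, 2), ", p =", round(p_value, 4))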
33. What is correlation?
Correlation is a statistical measure of the degree of linear relationship between two or more variables. It
estimates how effectively changes in one variable predict or explain changes in another. Correlation is
often used to assess the strength and direction of associations between variables in various fields,
including statistics and economics.
The correlation between two variables is represented by correlation coefficient, denoted as “r”. The
value of “r” can range between -1 and +1, reflecting the strength of the relationship:
• Positive correlation (r > 0): As one variable increases, the other tends to increase. The greater
the positive correlation, the closer “r” is to +1.
• Negative correlation (r < 0): As one variable rises, the other tends to fall. The closer “r” is to -1,
the greater the negative correlation.
34. What are the differences between Z-test, T-test and F-test?
The Z-test, t-test, and F-test are statistical hypothesis tests that are employed in a variety of contexts and
for a variety of objectives.
• Z-test: The Z-test is performed when the population standard deviation is known. It is a
parametric test, which means that it makes certain assumptions about the data, such as that the
data is normally distributed. The Z-test is most accurate when the sample size is large.
• T-test: The T-test is performed when the population standard deviation is unknown. It is also a
parametric test but, unlike the Z-test, it is less sensitive to violations of the normality assumption.
The T-test is typically used when the sample size is small (commonly N < 30).
• F-test: The F-test is performed to compare the variances of two or more groups. It assumes that
the populations being compared follow a normal distribution. The F-test is most accurate when
the sample sizes of the groups are equal.
The key differences between the Z-test, T-test, and F-test are as follows:
1. Population
1. The variances of
follows a
1. Population the populations
normal
follows a from which the
distribution or
normal samples are
the sample
distribution. drawn should be
size is large
equal
2. Population enough for the
(homoscedastic).
standard Central Limit
deviation is Theorem to 2. Populations
known apply. being compared
have normal
2. Also applied
Assumptions distributions and
when the
Z-Test T-Test F-Test
N<30 or population
Used to test the
N>30 standard deviation is
variances
Data unknown.
35. What is linear regression, and how do you interpret its coefficients?
Linear regression is a statistical approach that fits a linear equation to observed data to represent the
connection between a dependent variable (also known as the target or response variable) and one or
more independent variables (also known as predictor variables or features). It is one of the most basic
and extensively used regression analysis techniques in statistics and machine learning. Linear regression
presupposes that the independent variables and the dependent variable have a linear relationship.
Y = β0 + β1X + ϵ
Where:
• Y: The dependent (target) variable.
• X: The independent variable.
• β0: The intercept, i.e. the expected value of Y when X is 0.
• β1: The coefficient of the independent variable X, representing the change in Y for a one-unit
change in X.
• ϵ: The error term, i.e. the difference between the actual value and the value predicted by the
linear relationship.
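The sketch below fits a simple linear regression with SciPy on invented experience-versus-salary data and
prints the intercept (β0), the slope (β1), and R², matching the interpretation above.
import numpy as np
from scipy import stats

# X: years of experience, Y: salary in thousands (illustrative numbers).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([35, 40, 47, 52, 58, 62, 70, 75])

result = stats.linregress(x, y)
print("beta0 (intercept):", round(result.intercept, 2))  # expected Y when X = 0
print("beta1 (slope):", round(result.slope, 2))          # change in Y per one-unit change in X
print("R-squared:", round(result.rvalue ** 2, 3))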
36. What is a DBMS?
DBMS stands for Database Management System. It is software designed to manage, store, retrieve, and
organize data in a structured manner. It provides an interface or a tool for performing CRUD operations
on a database. It serves as an intermediary between the user and the database, allowing users or
applications to interact with the database without the need to understand the underlying complexities
of data storage and retrieval.
37. What are the CRUD operations in SQL?
SQL CRUD stands for the CREATE (INSERT), READ (SELECT), UPDATE, and DELETE statements. CRUD
operations correspond to the Data Manipulation Language (DML) statements. The CREATE operation is
used to insert new data or create new records in a database table, the READ operation is used to retrieve
data from one or more tables, the UPDATE operation is used to modify existing records, and DELETE is
used to remove records from a table based on specified conditions. Basic syntax examples of each
operation follow:
CREATE
It is used to create a table and insert values into the database, typically with the CREATE TABLE and
INSERT INTO statements.
READ
UPDATE
UPDATE employees
SET salary = 55000
WHERE last_name = 'Gunjan';
DELETE
38. What is the SQL statement used to insert new records into a table?
We use the ‘INSERT‘ statement to insert new records into a table. The ‘INSERT INTO’ statement in SQL is
used to add new records (rows) to a table.
Syntax
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
Example
39. How do you filter records using the WHERE clause in SQL?
We can filter records using the ‘WHERE‘ clause by including ‘WHERE’ clause in ‘SELECT’ statement,
specifying the conditions that records must meet to be included.
Syntax
Example : In this example, we are fetching the records of employee where job title is Developer.
40. How can you sort records in ascending or descending order using SQL?
We can sort records in ascending or descending order by using the ‘ORDER BY’ clause with the ‘SELECT’
statement. The ‘ORDER BY’ clause allows us to specify one or more columns by which to sort the result
set, along with the desired sorting order, i.e. ascending or descending.
Example: This statement selects all customers from the ‘Customers’ table, sorted ascending by the
‘Country’
Example: This statement selects all customers from the ‘Customers’ table, sorted descending by the
‘Country’ column
41. What is the purpose of the GROUP BY clause in SQL?
The purpose of the GROUP BY clause in SQL is to group rows that have the same values in specified
columns. It is typically combined with aggregate functions so that one summary row is produced for each
group.
Syntax
Example: This SQL query groups the ‘CUSTOMER’ table based on age by using GROUP BY
42. How do you perform aggregate functions like SUM, COUNT, AVG, and MAX/MIN in SQL?
An aggregate function groups together the values of multiple rows as input to form a single value of
more significant meaning. It is also used to perform calculations on a set of values and then returns a
single result. Some examples of aggregate functions are SUM, COUNT, AVG, and MIN/MAX.
SUM: It returns the total of all values in a numeric column.
Example: In this example, we are calculating the sum of the Cost column in the Products table.
SELECT SUM(Cost)
FROM Products;
COUNT: It counts the number of rows in a result set or the number of non-null values in a column.
Example: In this example, we are counting the total number of orders in the Orders table.
SELECT COUNT(*)
FROM Orders;
AVG: It calculates the average of the values in a numeric column.
Example: In this example, we are finding the average price of products in the Products table.
SELECT AVG(Price)
FROM Products;
MAX: It returns the largest value in a column.
Example: In this example, we are finding the maximum price in the Orders table.
SELECT MAX(Price)
FROM Orders;
MIN: It returns the smallest value in a column.
Example: In this example, we are finding the minimum price of a product in the Products table.
SELECT MIN(Price)
FROM Products;
43. What is an SQL join operation? Explain different types of joins (INNER, LEFT, RIGHT, FULL).
SQL Join operation is used to combine data or rows from two or more tables based on a common field
between them. The primary purpose of a join is to retrieve data from multiple tables by linking records
that have a related value in a specified column. There are different types of join i.e, INNER, LEFT, RIGHT,
FULL. These are as follows:
INNER JOIN: The INNER JOIN keyword selects all rows from both tables as long as the condition is
satisfied. This keyword will create the result-set by combining all rows from both the tables where the
condition satisfies i.e the value of the common field will be the same.
Example:
LEFT JOIN: A LEFT JOIN returns all rows from the left table and the matching rows from the right table.
Example:
SELECT departments.department_name, employees.first_name
FROM departments
LEFT JOIN employees
ON departments.department_id = employees.department_id;
RIGHT JOIN: RIGHT JOIN is similar to LEFT JOIN. This join returns all the rows of the table on the right
side of the join and matching rows for the table on the left side of the join.
Example:
FULL JOIN: FULL JOIN creates the result set by combining the results of both LEFT JOIN and RIGHT JOIN.
The result set will contain all the rows from both tables.
Example:
44. How can you write an SQL query to retrieve data from multiple related tables?
To retrieve data from multiple related tables, we generally use a ‘SELECT’ statement together with a
‘JOIN‘ operation, which lets us fetch records from several tables at once. JOINs are used when the tables
share a common column. The different types of joins (INNER, LEFT, RIGHT, FULL) are explained in detail
in the previous question.
45. What is a subquery in SQL? How can you use it to retrieve specific data?
A subquery is a query nested within another query. A subquery is most often embedded in the WHERE
clause of another SQL query, but it can also be placed in the HAVING or FROM clause. Subqueries are
used with SELECT, INSERT, DELETE, and UPDATE statements, together with comparison or equality
operators such as >=, =, <= and the LIKE operator.
Example 1: Subquery in the SELECT clause
SELECT customer_name,
(SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.customer_id) AS order_count
FROM customers;
Example 2: Subquery in the WHERE Clause
We can use subquery in combination with IN or EXISTS condition. Example of using a subquery in
combination with IN is given below. In this example, we will try to find out the geek’s data from table
geeks_data, those who are from the computer science department with the help of geeks_dept table
using sub-query.
47. What is the purpose of the HAVING clause in SQL? How is it different from the WHERE clause?
In SQL, the HAVING clause is used to filter the results of a GROUP BY query depending on aggregate
functions applied to grouped columns. It allows you to filter groups of rows that meet specific conditions
after grouping has been performed. The HAVING clause is typically used with aggregate functions like
SUM, COUNT, AVG, MAX, or MIN.
The main difference between HAVING and WHERE is that the WHERE clause filters individual rows before
any grouping or aggregation takes place, whereas the HAVING clause filters groups after GROUP BY and
the aggregate functions have been applied.
48. How do you use the UNION and UNION ALL operators in SQL?
In SQL, the UNION and UNION ALL operators are used to combine the result sets of multiple SELECT
statements into a single result set. These operators allow you to retrieve data from multiple tables or
queries and present it as a unified result. However, there are differences between the two operators:
1. UNION Operator:
The UNION operator returns only distinct rows from the combined result sets. It removes duplicate rows
and returns a unique set of rows. It is used when you want to combine result sets and eliminate
duplicate rows.
Syntax:
Example:
2. UNION ALL Operator:
The UNION ALL operator returns all rows from the combined result sets, including duplicates. It does not
remove duplicate rows and returns all rows as they are. It is used when you want to combine result sets
and keep duplicate rows.
Syntax:
49. What is database normalization and why is it important?
Database normalization is the process of reducing data redundancy in tables and improving data
integrity. It is a way of organizing data in a database that involves structuring the columns and tables so
that their dependencies are correctly enforced using database constraints.
• Normalization is important because it reduces redundant data, which also allows the database to
take up less disk space.
50. Can you list and briefly describe the normal forms (1NF, 2NF, 3NF) in SQL?
Normalization can take numerous forms, the most frequent of which are 1NF (First Normal Form), 2NF
(Second Normal Form), and 3NF (Third Normal Form). Here’s a quick rundown of each:
• First Normal Form (1NF): In 1NF, each table cell should contain only a single value, and each
column should have a unique name. 1NF helps in eliminating duplicate data and simplifies the
queries. It is the fundamental requirement for a well-structured relational database. 1NF
eliminates all the repeating groups of the data and also ensures that the data is organized at its
most basic granularity.
• Second Normal Form (2NF): 2NF eliminates partial dependencies, ensuring that each non-key
attribute in the table depends on the entire primary key rather than on only part of it. This further
reduces data redundancy and anomalies: each non-key column should be directly related to the
whole primary key, not to other columns.
• Third Normal Form (3NF): Third Normal Form (3NF) builds on the Second Normal Form (2NF) by
requiring that all non-key attributes are independent of each other. This means that each column
should be directly related to the primary key, and not to any other columns in the same table.
51. Explain window functions in SQL. How do they differ from regular aggregate functions?
In SQL, window functions perform calculations across a set of rows related to the current row while still
returning one result per row. They provide a way to perform complex calculations and analysis without
the need for self-joins or subqueries.
SELECT col_name1,
window_function(col_name2)
OVER([PARTITION BY col_name1] [ORDER BY col_name3]) AS new_col
FROM table_name;
Example:
SELECT
department,
AVG(salary) OVER(PARTITION BY department ORDER BY employee_id) AS avg_salary
FROM
employees;
The key differences between window functions and regular aggregate functions are as follows:
• Window functions return a result for each row in the result set based on its specific window, so
each row can have a different result; aggregate functions return a single result for the entire
dataset (or group), and each row receives the same value.
• Window functions provide an aggregate result while retaining the details of the individual rows
within the defined window; regular aggregates provide a summary of the entire dataset, often
losing detail about individual rows.
• Window functions require the OVER() clause to specify the window's characteristics, such as the
partitioning and ordering of rows; regular aggregate functions do not use the OVER() clause
because they have no notion of windows.
52. What are primary keys and foreign keys in SQL? Why are they important?
Primary keys and foreign keys are two fundamental concepts in SQL that are used to build and enforce
connections between tables in a relational database management system (RDBMS).
• Primary key: A primary key ensures that the values in a specific column (or combination of
columns) are always unique and never NULL. The primary key is either an existing table column or
a column specifically generated by the database itself, for example from a sequence.
Importance of Primary Keys:
o Uniqueness
o Query Optimization
o Data Integrity
o Relationships
o Data Retrieval
• Foreign key: A foreign key is a column or group of columns in a database table that provides a
link between the data in two tables; the column references a column (usually the primary key) of
another table.
Importance of Foreign Keys:
o Relationships
o Data Consistency
o Query Efficiency
o Referential Integrity
o Cascade Actions
53. Describe the concept of a database transaction. Why is it important to maintain data integrity?
A database transaction is a set of operations that is treated as a single logical unit of work: either all of
its operations are applied to the database or none of them are. Transactions are one of the major
mechanisms a DBMS provides to protect the user's data from system failure, ensuring that the data is
restored to a consistent state when the system restarts. A transaction corresponds to one execution of a
user program and contains a finite number of steps.
They are important to maintain data integrity because they ensure that the database always remains in a
valid and consistent state, even in the presence of multiple users or several operations. Database
transactions are essential for maintaining data integrity because they enforce ACID properties i.e,
atomicity, consistency, isolation, and durability properties. Transactions provide a solid and robust
mechanism to ensure that the data remains accurate, consistent, and reliable in complex and concurrent
database environments. It would be challenging to guarantee data integrity in relational database
systems without database transactions.
54. Explain how NULL values are handled in SQL queries, and how you can use functions like IS NULL
and IS NOT NULL.
In SQL, NULL is a special value that usually represents that the value is not present or absence of the
value in a database column. For accurate and meaningful data retrieval and manipulation, handling NULL
becomes crucial. SQL provides IS NULL and IS NOT NULL operators to work with NULL values.
IS NULL: IS NULL operator is used to check whether an expression or column contains a NULL value.
Syntax:
Example: In the example below, the query retrieves all rows from the employee table where the
first_name column contains NULL values.
55. What is the difference between normalization and denormalization in database design?
Normalization is used in a database to reduce data redundancy and inconsistency in the tables.
Denormalization deliberately adds redundant data so that queries can be executed as quickly as possible.
Usage: the first are used for grouping and segmenting data, creating hierarchies and the structure for
visualizations; the second are used for performing calculations and creating the numerical representation
of the data, such as sums and averages.