Data Analyst Question-Answers
1) Data analysis is the structured process of working with data, performing activities such as ingestion, cleaning, transformation, and modeling, to discover useful information that can drive business decisions. The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
2) To begin with, data is collected from varied sources. Since the data is raw, it has to be cleaned and processed to fill in missing values and to remove anything that is outside the scope of use.
3) After pre-processing, the data is analyzed with the help of models that are run against it to extract insights.
4) The last step involves reporting and ensuring that the output is converted to a format that caters to a non-technical audience as well as to the analysts.
Collect Data: The data gets collected from various sources and is stored so that it can
be cleaned and prepared. In this step, all the missing values and outliers are removed.
Analyse Data: Once the data is ready, the next step is to analyze it. A model is run repeatedly and refined. The model is then validated to check whether it meets the business requirements.
Create Reports: Finally, the model is implemented, and the reports thus generated are passed on to the stakeholders.
Presence of duplicate entries and spelling mistakes. These errors can hamper data
quality.
Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will have to spend a significant amount of time cleansing the data.
Data extracted from multiple sources may vary in representation. Once the collected
data is combined after being cleansed and organized, the variations in data
representation may cause a delay in the analysis process.
Incomplete data is another major challenge in the data analysis process. It would
inevitably lead to erroneous or faulty results.
Here are some key skills usually required for a data analyst:
Analytical skills: Ability to collect, organize, and dissect data to make it meaningful.
Mathematical and Statistical skills: Proficiency in applying the right statistical
methods or algorithms on data to get the insights needed.
Problem-solving skills: Ability to identify issues, obstacles, and opportunities in data
and come up with effective solutions.
Attention to Detail: Ensuring precision in data collection, analysis, and interpretation.
Knowledge of Machine Learning: In some roles, a basic understanding of machine
learning concepts can be beneficial.
6.What are the different types of sampling techniques used by data analysts?
Sampling is a statistical method of selecting a subset of data from an entire dataset (the population) to estimate the characteristics of the whole population. Commonly used techniques include simple random sampling, systematic sampling, cluster sampling, stratified sampling, and judgmental (purposive) sampling; a short sketch of two of these follows.
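A minimal pandas sketch of two of these techniques, simple random sampling and stratified sampling; the DataFrame, the segment column, and the 50% fraction are illustrative assumptions, not from the original text.

```python
import pandas as pd

# Illustrative dataset (hypothetical values)
df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "spend":   [10, 12, 11, 40, 42, 5, 6, 7, 5, 6],
})

# Simple random sampling: every row has the same chance of selection
simple_sample = df.sample(frac=0.5, random_state=42)

# Stratified sampling: sample the same fraction from each segment
stratified_sample = (
    df.groupby("segment", group_keys=False)
      .apply(lambda g: g.sample(frac=0.5, random_state=42))
)

print(simple_sample)
print(stratified_sample)
```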
1) Univariate analysis is the simplest and easiest form of data analysis where the data being
analyzed contains only one variable.
Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar
charts, Histograms, Pie charts, and Frequency distribution tables.
2) The bivariate analysis involves the analysis of two variables to find causes, relationships,
and correlations between the variables.
Example – Analyzing the sale of ice creams based on the temperature outside.
The bivariate analysis can be explained using Correlation coefficients, Linear regression,
Logistic regression, Scatter plots, and Box plots.
3) The multivariate analysis involves the analysis of three or more variables to understand the
relationship of each variable with the other variables.
Example – the breakdown of population growth in a specific city by gender, income, employment type, etc.
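A minimal sketch, reusing the ice-cream example with made-up numbers, of a univariate summary and a bivariate correlation in pandas.

```python
import pandas as pd

# Hypothetical ice-cream sales data (numbers are illustrative only)
data = pd.DataFrame({
    "temperature": [20, 22, 25, 28, 30, 33, 35],
    "sales":       [120, 135, 150, 180, 200, 230, 245],
})

# Univariate analysis: central tendency and dispersion of a single variable
print(data["sales"].describe())                 # count, mean, std, quartiles

# Bivariate analysis: relationship between two variables
print(data["temperature"].corr(data["sales"]))  # Pearson correlation coefficient
```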
The answer to this question may vary from case to case. However, some general
strengths of a data analyst may include strong analytical skills, attention to detail, proficiency
in data manipulation and visualization, and the ability to derive insights from complex
datasets. Weaknesses could include limited domain knowledge, lack of experience with
certain data analysis tools or techniques, or challenges in effectively communicating technical
findings to non-technical stakeholders.
9.What are the common problems that data analysts encounter during analysis?
Handling duplicate data
Collecting meaningful and accurate data at the right time
Handling data purging and storage problems
Making data secure and dealing with compliance issues
10. What are the various steps involved in any analytics project?
Collecting Data
Gather the right data from various sources and other information based on your priorities.
Cleaning Data
Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.
Analysing Data
Analyze the cleaned data with the help of models, iterating until the results meet the business requirements.
Creating Reports
Interpret the results and pass the generated reports on to the stakeholders.
12.Which are the technical tools that you have used for analysis and presentation purposes?
As a data analyst, you are expected to know the tools mentioned below for analysis and
presentation purposes. Some of the popular tools you should know are:
MS Excel, Tableau – for creating reports and dashboards
Python, R, SPSS – for statistical analysis, data modeling, and exploratory analysis
MS PowerPoint – for presentations, displaying the final results and important conclusions
Create a data cleaning plan by understanding where the common errors take place and keep
all the communications open.
Before working with the data, identify and remove the duplicates. This will lead to an easy
and effective data analysis process.
Focus on the accuracy of the data. Set cross-field validation, maintain the value types of
data, and provide mandatory constraints.
Normalize the data at the entry point so that it is less chaotic. You will be able to ensure
that all information is standardized, leading to fewer errors on entry.
Removing a data block entirely
Finding ways to fill blank data in, without causing redundancies
Replacing data with its mean or median values
Making use of placeholders for empty spaces (a short sketch of these options follows this list)
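A minimal pandas sketch of the options listed above; the column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values
df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "city": ["Pune", None, "Delhi", "Delhi", "Mumbai"]})

dropped = df.dropna()                                                 # remove rows with missing values
mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))       # replace with the mean
median_filled = df.assign(age=df["age"].fillna(df["age"].median()))   # replace with the median
placeholder = df.fillna({"city": "UNKNOWN"})                          # placeholder for empty spaces

print(dropped, mean_filled, median_filled, placeholder, sep="\n\n")
```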
Segregating data according to their respective attributes.
Breaking large chunks of data into small datasets and then cleaning them.
Analyzing the statistics of each data column.
Creating a set of utility functions or scripts for dealing with common cleaning tasks.
Keeping track of all the data cleansing operations to facilitate easy addition or
removal from the datasets, if required.
Removal of unwanted observations that are not relevant to the field of study being carried out.
Quality Check
Data standardisation
Data normalisation
Deduplication
Data Analysis
Exporting of data
i) Memory-based approach
A good example of collaborative filtering is when you see a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
Field Level Validation – In this method, data validation is done in each field as and
when a user enters the data. It helps to correct the errors as you go.
Form Level Validation – In this method, the data is validated after the user
completes the form and submits it. It checks the entire data entry form at once,
validates all the fields in it, and highlights the errors (if any) so that the user can correct them.
Data Saving Validation – This data validation technique is used during the process
of saving an actual file or database record. Usually, it is done when multiple data
entry forms must be validated.
Search Criteria Validation – This validation technique is used to offer the user
accurate and related matches for their searched keywords or phrases. The main
purpose of this validation method is to ensure that the user’s search queries can return
the most relevant results.
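A minimal sketch of field-level validation, the first method above; the field names and validation rules are hypothetical.

```python
import re

# Hypothetical per-field validation rules
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age":   lambda v: v.isdigit() and 0 < int(v) < 120,
}

def validate_field(field, value):
    """Check a single field as soon as the user enters it."""
    ok = RULES[field](value)
    if not ok:
        print(f"Invalid value for {field!r}: {value!r}")
    return ok

validate_field("email", "analyst@example.com")  # passes
validate_field("age", "-3")                     # prints an error and returns False
```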
Logistic regression is a statistical method for examining a dataset in which there are one or more independent variables that determine an outcome.
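A minimal scikit-learn sketch, assuming a single independent variable and a binary outcome; the hours-studied data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[4.5]]))   # class probabilities for 4.5 hours of study
```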
18.Mention what are the missing patterns that are generally observed?
The missing patterns that are generally observed are:
Missing completely at random
Missing at random
Missing that depends on the missing value itself
Missing that depends on unobserved input variables
In KNN imputation, the missing attribute values are imputed using the attribute values that are most similar to the attribute whose values are missing. The similarity of two attributes is determined by using a distance function.
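A minimal sketch using scikit-learn's KNNImputer, which fills each missing value from the most similar rows using a distance function; the array values are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows with missing values (np.nan); values are illustrative
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)   # impute from the 2 nearest rows (Euclidean distance)
print(imputer.fit_transform(X))
```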
20.Mention what are the data validation methods used by data analyst?
Data screening
Data verification
Prepare a validation report that gives information about all suspected data. It should include the validation criteria that the data failed and the date and time of occurrence.
Experienced personnel should examine the suspicious data to determine its acceptability.
Invalid data should be assigned a validation code and replaced.
To work on missing data, use the best analysis strategy, such as the deletion method, single imputation methods, model-based methods, etc.
22.Mention how to deal with multi-source problems?
To deal with multi-source problems, restructure the schemas to accomplish schema integration, and identify similar records and merge them into a single record containing all relevant attributes without redundancy.
Outlier is a term commonly used by analysts to refer to a value that appears far away from and diverges from an overall pattern in a sample. There are two types of outliers:
Univariate
Multivariate
In the K-means algorithm, the following assumptions are made (see the sketch below):
The clusters are spherical: the data points in a cluster are centered around that cluster's mean.
The variance/spread of the clusters is similar: each data point is assigned to its closest cluster.
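A minimal scikit-learn sketch of K-means on two roughly spherical, similarly spread groups of points; the generated data and k = 2 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two roughly spherical, similarly spread groups of points (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # each point is assigned to its closest center
```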
KPI: It stands for Key Performance Indicator, a metric that consists of any combination of spreadsheets, reports, or charts about a business process.
Design of experiments: It is the initial process used to split your data, and to sample and set up a dataset for statistical analysis.
80/20 rule: It means that 80 percent of your income comes from 20 percent of your clients.
27.Mention what is the difference between data mining and data profiling?
Data profiling focuses on analyzing individual attributes of the data, providing information such as the data type, value range, frequency, and occurrence of null values, whereas data mining focuses on discovering patterns, dependencies, clusters, and unusual records in the data.
Map-reduce is a framework for processing large data sets by splitting them into subsets, processing each subset on a different server, and then blending the results obtained from each.
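A minimal single-machine sketch of the map/reduce idea using a word count; a real framework such as Hadoop would run the map and reduce steps on different servers.

```python
from collections import Counter
from functools import reduce

documents = ["data analysts analyze data", "analysts report insights"]

# Map: each document is turned into partial word counts independently
mapped = [Counter(doc.split()) for doc in documents]

# Reduce: the partial results are blended into one overall count
total = reduce(lambda a, b: a + b, mapped, Counter())
print(total)
```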
29.What Is Linear Regression?
Linear regression is a statistical method used to find out how two variables are related to each
other. One of the variables is the dependent variable and the other one is the explanatory
variable. The process used to establish this relationship involves fitting a linear equation to
the dataset.
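A minimal NumPy sketch of fitting a linear equation to a small made-up dataset.

```python
import numpy as np

# Explanatory variable x and dependent variable y (illustrative values)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)   # fit y = slope * x + intercept
print(slope, intercept)
print(slope * 6 + intercept)                 # predicted y for x = 6
```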
30.Explain what is Clustering? What are the properties of clustering algorithms?
Clustering is the technique of identifying groups or categories within a dataset and placing
data values into those groups, thus creating clusters.
Iterative
Hard or soft
Disjunctive
Flat or hierarchical
A data warehouse is a data storage system that collects data from various disparate sources
and stores them in a way that makes it easy to produce important business insights. Data
warehousing is the process of identifying heterogeneous data sources, sourcing data, cleaning
it, and transforming it into a manageable form for storage in a data warehouse.
There are two main ways to deal with missing data in data analysis.
Imputation is a technique of creating an informed guess about what the missing data point
could be. It is used when the amount of missing data is low and there appears to be natural
variation within the available data.
The other option is to remove the data. This is usually done if data is missing at random and there is no way to make reasonable conclusions about what those missing values might be.
33.What are some of the statistical methods that are useful for data analysts?
Bayesian method
Markov process
Spatial and cluster processes
Rank statistics, percentile, outliers detection
Imputation techniques, etc.
Simplex algorithm
Mathematical optimization
34.What is time series analysis?
Time series analysis can be done in two domains, the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with the help of various methods such as exponential smoothing, the log-linear regression method, etc.
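A minimal pandas sketch of simple exponential smoothing; the sales series and the smoothing factor alpha = 0.5 are illustrative assumptions.

```python
import pandas as pd

# Monthly sales (illustrative values)
sales = pd.Series([100, 102, 101, 105, 110, 108, 115])

# Exponential smoothing: recent observations get more weight
smoothed = sales.ewm(alpha=0.5, adjust=False).mean()

# A naive one-step-ahead forecast is the last smoothed value
print(smoothed.iloc[-1])
```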
36.Explain Outlier.
An outlier is a data point that significantly differs from other similar points. It’s an
observation that lies an abnormal distance from other values in a random sample from a
population. In other words, an outlier is very much different from the “usual” data.
Depending on the context, outliers can have a significant impact on your data analysis. In
statistical analysis, outliers can distort the interpretation of the data by skewing averages and
inflating the standard deviation.
37.What are the ways to detect outliers? Explain different ways to deal with it.
Outliers can be detected in several ways, including visual methods and statistical techniques:
Box Plots: A box plot (or box-and-whisker plot) can help you visually identify
outliers. Points that are located outside the whiskers of the box plot are often
considered outliers.
Scatter Plots: These can be useful for spotting outliers in multivariate data.
Z-Scores: Z-scores measure how many standard deviations a data point is from the
mean. A common rule of thumb is that a data point is considered an outlier if its z-
score is greater than 3 or less than -3.
IQR Method: The interquartile range (IQR) method identifies as outliers any points
that fall below the first quartile minus 1.5 times the IQR or above the third quartile
plus 1.5 times the IQR.
DBSCAN Clustering: Density-Based Spatial Clustering of Applications with Noise
(DBSCAN) is a density-based clustering algorithm, which can be used to detect
outliers in the data.
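A minimal NumPy sketch of the z-score and IQR rules described above; the data is made up, with 95 as the obvious outlier.

```python
import numpy as np

data = np.array([12, 11, 13, 12, 14, 11, 12, 13, 12, 11,
                 13, 12, 14, 11, 12, 13, 12, 11, 13, 12, 95])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])
```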
Predictability: The data model should work in ways that are predictable so that its
performance outcomes are always dependable.
Scalability: The data model’s performance shouldn’t become hampered when it is fed
increasingly large datasets.
Adaptability: It should be easy for the data model to respond to changing business
scenarios and goals.
Results-oriented: The organization that you work for or its clients should be able to
derive profitable insights using the model.
As the name suggests Data Validation is the process of validating data. This step mainly has
two processes involved in it. These are Data Screening and Data Verification.
Data Screening: Different kinds of algorithms are used in this step to screen the
entire data to find out any inaccurate values.
Data Verification: Each and every suspected value is evaluated on various use-cases,
and then a final decision is taken on whether the value has to be included in the data
or not.
42. What do you think are the criteria to say whether a developed data model is good or
not?
A model developed for the dataset should have predictable performance. This is
required to predict the future.
A model is said to be a good model if it can easily adapt to changes according to
business requirements.
If the data gets changed, the model should be able to scale according to the data.
The model developed should also be easily consumed by the clients for actionable and profitable results.
43. When do you think you should retrain a model? Is it dependent on the data?
Business data keeps changing on a day-to-day basis, but the format doesn't change. As and when a business enters a new market, faces a sudden rise in competition, or sees its own position rising or falling, it is recommended to retrain the model. So, as and when the business dynamics change, the model should be retrained with the changing behaviors of customers.
44. Can you mention a few problems that data analyst usually encounter while
performing the analysis?
The following are a few problems that are usually encountered while performing data analysis: handling duplicate data, collecting meaningful data at the right time, handling data purging and storage problems, and making data secure while dealing with compliance issues.
In this method, missing attribute values are imputed using the attribute values that are most similar to the attribute whose values are missing. The similarity of the two attributes is determined by using distance functions.
46. Mention the name of the framework developed by Apache for processing large datasets for an application in a distributed computing environment?
The complete Hadoop ecosystem was developed for processing large datasets for an application in a distributed computing environment. The Hadoop ecosystem consists of the following components.
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory Data Processing
PIG, HIVE-> Data Processing Services using Query (SQL-like)
HBase -> NoSQL Database
Mahout, Spark MLlib -> Machine Learning
Apache Drill -> SQL on Hadoop
Zookeeper -> Managing Cluster
Oozie -> Job Scheduling
Flume, Sqoop -> Data Ingesting Services
Solr & Lucene -> Searching & Indexing
Ambari -> Provision, Monitor and Maintain cluster
A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot of the array.
There are many techniques to avoid hash table collisions; here we list two:
Separate Chaining:
It uses a data structure (such as a linked list) to store multiple items that hash to the same slot.
Open Addressing:
It searches for other slots using a second function and stores the item in the first empty slot that is found.
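A minimal sketch of separate chaining: each slot holds a list, so two keys that hash to the same slot can coexist. The class name and the tiny table size are illustrative; the small size simply makes collisions likely.

```python
class ChainedHashTable:
    """Tiny hash table that resolves collisions with separate chaining."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]   # one list ("chain") per slot

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key or collision: append to the chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)         # small size makes collisions likely
table.put("apple", 1)
table.put("grape", 2)                    # may land in the same slot as "apple"
print(table.get("apple"), table.get("grape"))
```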
During imputation we replace missing data with substituted values. The main types of imputation techniques are:
Single Imputation
Multiple Imputation
Unlike single imputation, multiple imputation estimates the values multiple times.
Although single imputation is widely used, it does not reflect the uncertainty created by data missing at random. So, multiple imputation is more favorable than single imputation when data is missing at random.
N-gram: An n-gram is a contiguous sequence of n items (such as words or characters) from a given text or speech. It is commonly used as a probabilistic language model for predicting the next item in such a sequence.
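A minimal sketch of extracting word n-grams from a sentence; the helper function and the example sentence are illustrative.

```python
def ngrams(text, n):
    """Return all contiguous sequences of n words from the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("data analysts clean and model data", 2))
# [('data', 'analysts'), ('analysts', 'clean'), ('clean', 'and'), ('and', 'model'), ('model', 'data')]
```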
Example:
“In general, data analysts collect, run and crunch data for insight that helps their company
make good decisions. They look for correlations and must communicate their results well.
Part of the job is also using the data to spot opportunities for preventative measures. That
requires critical thinking and creativity.”
Example:
“I have a breadth of software experience. For example, at my current employer, I do a lot of
ELKI data management and data mining algorithms. I can also create databases in Access
and make tables in Excel.”
Example:
“My most difficult project was on endangered animals. I had to predict how many of [animal]
would survive to 2020, 2050 and 2100. Before this, I’d dealt with data that was already there,
with events that had already happened. So, I researched the various habitats, the animal’s
predators and other factors, and did my predictions. I have high confidence in the results.”
Example:
A data analyst’s job is to take data and use it to help companies make better business
decisions. I’m good with numbers, collecting data, and market research. I chose this
role because it encompasses the skills I’m good at, and I find data and marketing
research interesting.
A data lake is a large volume of raw data that is unstructured and unformatted. A data
warehouse is a data storage structure that contains data that has been cleaned and processed
into a form where it can be used to easily generate valuable insights.
Overfitting occurs when a model begins to describe the noise or errors in a dataset instead of
the important relationships between data points. Underfitting occurs when a model isn’t able
to find any trends in a given dataset at all because an inappropriate model has been applied to
it.
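A minimal NumPy sketch of the idea using polynomial fits: degree 1 underfits the curved data, while a very high degree starts chasing the noise; the signal, noise level, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # curved signal plus noise

for degree in (1, 3, 9):                  # underfit, reasonable fit, likely overfit
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    rmse = np.sqrt(np.mean((y - fitted) ** 2))
    print(degree, round(rmse, 3))         # training error alone keeps shrinking as degree grows
```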
1) Variance is the measure of how far each value in a dataset is from the mean. The higher the variance, the more spread out the dataset is. This measures magnitude.
2) Covariance is the measure of how two random variables in a dataset will change
together. If the covariance of two variables is positive, they move in the same
direction, else, they move in opposite directions. This measures direction.
3) Correlation is the degree to which two random variables in a dataset will change together. This measures both magnitude and direction. The covariance tells you whether the two variables move together; the correlation coefficient tells you to what degree they move together.
A normal distribution, also called Gaussian distribution, is one that is symmetric about the
mean. This means that half the data is on one side of the mean and half the data on the other.
Normal distributions are seen to occur in many natural situations, like in the height of a
population, which is why it has gained prominence in the world of data analysis.
61.Can a Data Analyst Highlight Cells Containing Negative Values in an Excel Sheet?
Yes, it is possible to highlight cells with negative values in Excel. Here’s how to do that:
1. Go to the Home option in the Excel menu and click on Conditional Formatting.
2. Within the Highlight Cells Rules option, click on Less Than.
3. In the dialog box that opens, select a value below which you want to highlight cells.
You can choose the highlight color in the dropdown menu.
4. Hit OK.