Data Analyst Role Tasks Skills
Q.1. List and explain the different key roles for successful data analytics.
Ans: -
1. Data Analyst
• Tasks: Perform data cleaning, run statistical analyses, and create reports and dashboards.
2. Data Scientist
• Role: Builds models and algorithms to predict outcomes and uncover patterns.
• Tasks: Develop machine learning models, perform deep statistical analysis, experimental design.
• Skills: Machine learning, deep learning, advanced statistics, Python/R, big data tools.
3. Data Engineer
• Tasks: Build data pipelines, ensure data quality, integrate data from multiple sources.
• Skills: SQL, ETL tools, Hadoop, Spark, cloud platforms (AWS, Azure, GCP).
4. Business Analyst
• Tasks: Gather business requirements, translate them into data tasks, and interpret analytical
results in a business context.
5. Data Architect
• Role: Designs the overall data structure and ensures it supports business needs.
• Tasks: Create database designs, oversee data storage and retrieval, ensure data security and
scalability.
6. Machine Learning Engineer
• Tasks: Optimize and scale models, deploy models, monitor model performance.
7. Chief Data Officer (CDO)
• Tasks: Define data governance policies, align data initiatives with business goals, ensure regulatory compliance.
• Skills: Leadership, data governance, regulatory knowledge (GDPR, HIPAA), strategic planning.
8. Data Visualization Specialist
• Skills: Visualization tools (Power BI, Tableau, D3.js), design thinking, user experience.
9. Data Steward
• Tasks: Manage data policies, ensure ethical use of data, maintain data integrity and privacy.
Summary:
Successful data analytics is not just about technical skills — it requires a team effort blending technology,
business understanding, and clear communication.
Q.2. What is stepwise regression? Explain its types.
Stepwise Regression is a method of building a regression model by automatically selecting which independent
variables (predictors) should be included.
Instead of manually trying combinations, the algorithm adds or removes predictors based on criteria such as:
• p-values (statistical significance of each predictor)
• AIC / BIC
• Adjusted R²
Goal:
Find a model that is both accurate and simple (no unnecessary variables).
1. Forward Selection
• How it works:
1. Start with no variables in the model.
2. Add the most significant variable (the one with the lowest p-value).
3. Keep adding variables one by one, each time picking the next most significant one.
4. Stop when no remaining variable meets the inclusion criterion (e.g., p-value < 0.05).
• Used when:
You have many predictors and want to build up the model gradually (see the code sketch at the end of this answer).
2. Backward Elimination
• How it works:
1. Start with all candidate variables in the model.
2. Remove the least significant variable (the one with the highest p-value).
3. Keep removing variables until all remaining ones are statistically significant.
• Used when:
You suspect most variables matter, but you want to prune unnecessary ones.
3. Bidirectional Elimination (Stepwise Selection)
• How it works:
1. Start either with no variables (like forward) or all variables (like backward).
2. After adding a new variable, check whether any previously included variable has become non-significant (and remove it).
3. Continue adding and removing variables until no more changes improve the model.
• Used when:
You want a more flexible approach — allowing additions and removals at each step.
Bonus:
• Advantages:
Automates model selection, saves time.
Reduces model complexity (removes irrelevant variables).
• Disadvantages:
Can overfit if not careful.
May miss the best model because it's based on greedy choices at each step (not looking at all
combinations).
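Below is a minimal sketch of stepwise-style selection in Python using scikit-learn's SequentialFeatureSelector (the library choice is an assumption, since the notes do not name one). Note that this selector adds or drops predictors greedily based on cross-validated score rather than p-values, but the add-one / drop-one logic mirrors forward selection and backward elimination. The synthetic data is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only 4 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

model = LinearRegression()

# Forward selection: start empty, add the feature that improves the CV score most.
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward", cv=5)
forward.fit(X, y)
print("Forward-selected features:", forward.get_support(indices=True))

# Backward elimination: start with all features, drop the least useful one at a time.
backward = SequentialFeatureSelector(model, n_features_to_select=4,
                                     direction="backward", cv=5)
backward.fit(X, y)
print("Backward-selected features:", backward.get_support(indices=True))
```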
TF-IDF (Term Frequency - Inverse Document Frequency)
It is a numerical statistic used to measure how important a word is in a document relative to a collection of documents (corpus). It is widely used in:
• Text Mining
• Information Retrieval (e.g., search engines)
Why TF-IDF?
• Some words (like "the", "is", "and") appear very often but don't tell us much about the topic.
• TF-IDF down-weights common words and up-weights rare but important words.
In short:
Frequent in one document + rare across others = Important.
TF-IDF Formula:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
IDF(t) = log( N / number of documents containing t ), where N is the total number of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)
Simple Example:
Suppose the word "data" appears 3 times in a 100-word document, so TF = 3/100 = 0.03. It appears in 10 out of 1,000 documents in the corpus, so IDF = log10(1000/10) = 2.
Thus:
TF-IDF = 0.03 × 2 = 0.06, a relatively high score, meaning "data" is important for that document.
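As a rough illustration of the formula above, here is a minimal sketch in plain Python (the tiny three-document corpus is invented purely for illustration, and the base-10 logarithm matches the worked example):

```python
import math

# A tiny corpus, used only to illustrate the TF-IDF computation.
docs = [
    "data science uses data and statistics",
    "machine learning is part of data science",
    "the cat sat on the mat",
]

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by the document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # Inverse document frequency: log10(N / number of documents containing the term).
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log10(len(tokenized_docs) / df) if df else 0.0

tokenized = [d.lower().split() for d in docs]
term = "data"
for i, tokens in enumerate(tokenized):
    score = tf(term, tokens) * idf(term, tokenized)
    print(f"TF-IDF of '{term}' in doc {i}: {score:.4f}")
```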
Ans: - Time series data is everywhere: stock prices, weather data, website traffic, etc. To analyse it effectively, we break it down into four main components:
1. Trend (T)
• Definition: The long-term increase or decrease in the data over time.
• Why it matters: Helps understand overall growth or decline beyond short-term fluctuations.
2. Seasonality (S)
• Definition: Regular, repeating patterns or cycles at fixed time intervals (hourly, daily, weekly,
monthly, yearly).
• Why it matters: Shows predictable patterns due to calendar effects (holidays, seasons).
3. Cyclical (C)
• Definition: Fluctuations that occur over longer periods but not at fixed intervals (e.g., business cycles).
• Difference from seasonality: Not tied to a calendar; more irregular and long-term.
4. Irregular / Random (I)
• Definition: The random noise or unpredictable variation left over after removing trend, seasonality, and cycles.
• Example: A sudden sales drop due to a one-time event like a natural disaster.
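A minimal sketch of separating these components in Python with statsmodels' seasonal_decompose (the monthly series here is synthetic and built only for illustration). Note that this classical decomposition returns trend, seasonal, and residual parts; it does not extract a separate cyclical component.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + random noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)
seasonality = 10 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
noise = rng.normal(0, 3, 48)
series = pd.Series(trend + seasonality + noise, index=idx)

# Additive decomposition into trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))   # the repeating seasonal pattern
print(result.resid.dropna().head())
```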
Q.6. Explain ARIMA in detail and state its pros and cons.
Ans: - ARIMA is one of the most popular models in time series forecasting.
What is ARIMA?
ARIMA (AutoRegressive Integrated Moving Average) combines three components:
1. AR (AutoRegressive) [p]: uses past values of the series to predict the current value.
2. I (Integrated) [d]: the number of times the series is differenced to make it stationary.
3. MA (Moving Average) [q]: uses past forecast errors to predict the current value.
ARIMA Notation:
ARIMA(p, d, q)
Steps to build an ARIMA model:
1. Check for stationarity (mean and variance don't change over time).
2. If the series is not stationary, difference it until it becomes stationary (this determines d).
3. Identify p and q using ACF (Autocorrelation Function) and PACF (Partial ACF) plots.
4. Fit the model and check it using residual diagnostics and AIC/BIC.
Example:
Let's say we use ARIMA(1, 1, 1): one AR term, first-order differencing, and one MA term.
Advantages:
• Handles Trend: Works well with data that shows a consistent upward or downward trend.
• Flexible: Can model many real-world time series with just (p, d, q).
• Clear Diagnostics: Easy to analyze residuals and check model fit using ACF/PACF, AIC/BIC, etc.
Disadvantages:
• Univariate Only: Can't directly handle multiple variables (unless extended to ARIMAX).
Variants of ARIMA:
• SARIMA: adds seasonal terms for data with a seasonal pattern.
• ARIMAX: adds external (exogenous) predictor variables.
When to use ARIMA:
• You want a statistical, interpretable model (not black-box like some ML methods).
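Below is a minimal sketch of how ARIMA can be fitted in Python using statsmodels (the random-walk series here is synthetic and used only for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic non-stationary series (random walk with drift), illustrative only.
rng = np.random.default_rng(42)
values = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120)) + 50
series = pd.Series(values,
                   index=pd.date_range("2015-01-01", periods=120, freq="MS"))

# Fit ARIMA(1, 1, 1): one AR term, first-order differencing, one MA term.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

print(fitted.summary())            # coefficients, AIC/BIC, diagnostics
print(fitted.forecast(steps=12))   # forecast the next 12 periods
```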
Ans: - Text analytics involves extracting meaningful information from unstructured text data. It is widely used in business, healthcare, social media, and more. Below are the seven key practice areas of text analytics explained in detail:
1. Information Extraction
• Tasks Include:
o Named Entity Recognition: Identifying people, organizations, places, dates, etc.
o Event Extraction: Detecting events and actions (e.g., "Company A acquired Company B").
Use Case: Extracting company names and acquisition dates from news articles.
2. Text Classification (Categorization)
• Types:
o Binary classification (e.g., spam vs. not spam)
o Multi-class classification (e.g., positive, neutral, negative)
Use Case: Email spam detection, sentiment analysis (positive, neutral, negative).
3. Clustering
• Description: Grouping similar documents together without predefined labels.
4. Topic Modelling
• Description: Discovering hidden themes (topics) across a collection of documents.
Use Case: Automatically identifying topics people are discussing in online forums.
5. Summarization
• Types:
o Extractive: Selects the most important sentences directly from the original text.
o Abstractive: Generates new sentences to summarize (like how humans write summaries).
6. Sentiment Analysis
• Description: Determines the emotional tone (positive, negative, neutral) expressed in text.
7. Natural Language Processing (NLP)
• Outputs / Applications:
o Auto-complete
o Chatbots
o Machine Translation
Use Case: Suggesting next words while typing in Google search or writing an email.
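As a rough illustration of practice areas 2 and 6, here is a minimal sketch using scikit-learn (an assumption, since the notes name no library); the tiny labeled dataset is invented purely for illustration, and a real sentiment system would need far more data or a sentiment lexicon.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled dataset (illustrative only): 1 = positive, 0 = negative.
texts = [
    "I love this product, it works great",
    "Absolutely fantastic service and quality",
    "Terrible experience, would not recommend",
    "The product broke after one day, very disappointed",
    "Great value and fast delivery",
    "Awful support, waste of money",
]
labels = [1, 1, 0, 0, 1, 0]

# TF-IDF features followed by a logistic regression classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["fantastic quality, I love it",
                          "broke immediately, terrible"]))
```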
Ans:- An analytical sandbox is a secure, isolated environment where data scientists and analysts can explore,
analyse, and experiment with data without affecting the live production systems or databases.
Key Features:
• Isolated environment: Separate from production, so mistakes don’t impact business operations.
• Access to curated data: Usually includes cleansed, structured, and often anonymized data.
• Flexible tools: Supports various analytics tools, programming languages (like Python, R, SQL), and
libraries.
Why it is important:
1. Safe experimentation:
Analysts can test hypotheses, models, and scripts freely without the risk of disrupting live systems.
2. Accelerates innovation:
Speeds up development of data-driven insights, models, and dashboards by providing a flexible
and independent space.
3. Supports reproducibility:
Workflows and experiments in a sandbox are easier to document and repeat, supporting
transparency and auditing.
4. Improves collaboration:
Multiple users can share the same controlled environment, making it easier to collaborate on
analytics projects.
Ans :- The main difference between ARMA and ARIMA models lies in their ability to handle non-stationary data. ARMA models are suitable for stationary time series, while ARIMA models can also handle non-stationary data by incorporating an "integration" component that transforms the data into stationarity through differencing.
ARMA Model:
• Assumes stationarity: ARMA models assume that the time series data has a constant mean and variance over time.
• Suitable for stationary data: When the data is already stationary, ARMA models can effectively capture the relationships between past values and past errors to predict future values.
• Notation: ARMA(p, q), where 'p' is the order of the autoregressive (AR) component and 'q' is the order of the moving average (MA) component.
ARIMA Model:
• Handles non-stationarity: ARIMA models can handle time series data that is not stationary by applying a differencing process to transform it into stationarity.
• Integration component: The 'integrated' part in ARIMA refers to the order of differencing (denoted by 'd' in ARIMA(p, d, q)) required to make the data stationary.
• Suitable for non-stationary data: When the data has trends, seasonality, or other non-stationary characteristics, ARIMA models can effectively capture these patterns by first removing the non-stationarity through differencing and then modeling the remaining stationary data.
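A minimal sketch of the step that separates ARMA from ARIMA: checking stationarity with the Augmented Dickey-Fuller test from statsmodels and seeing how first differencing makes the series stationary (the random-walk series is synthetic and only illustrative).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic random walk: non-stationary, so ARMA alone would be inappropriate.
rng = np.random.default_rng(1)
series = pd.Series(np.cumsum(rng.normal(size=200)))

def adf_report(x, label):
    # ADF null hypothesis: the series has a unit root (is non-stationary).
    stat, pvalue, *_ = adfuller(x)
    print(f"{label}: ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")

adf_report(series, "Original series")                          # high p-value: non-stationary
adf_report(series.diff().dropna(), "After 1st differencing")   # low p-value: stationary
```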
The Box-Jenkins Methodology is a structured approach to modeling time series data, particularly using ARIMA
models (AutoRegressive Integrated Moving Average). Developed by George Box and Gwilym Jenkins, it focuses
on identifying, estimating, and validating models for forecasting.
1. Identification
o Determine if the series is stationary (constant mean and variance over time).
o Use plots and the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) to decide the ARIMA model order: the PACF suggests the AR order (p) and the ACF suggests the MA order (q).
2. Estimation
o Estimate the model parameters (typically by maximum likelihood).
3. Diagnostic Checking
o Check residuals (errors) to ensure they resemble white noise (no autocorrelation).
4. Forecasting
o Once the model passes diagnostic checks, use it to predict future values.
Applications
• Sales forecasting
Advantages
• Provides a systematic, step-by-step procedure for identifying, estimating, and validating a model.
• Works well for short- and medium-term forecasting.
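A minimal sketch of the identification step in Python using statsmodels and matplotlib (the AR(1) series is simulated only so the plots have a known answer): the ACF decays gradually while the PACF cuts off after lag 1, pointing to p = 1.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Simulated AR(1) process with coefficient 0.7, so the PACF should cut off at lag 1.
rng = np.random.default_rng(7)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()
series = pd.Series(x)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(series, ax=axes[0], lags=20)    # gradual decay: AR-type behaviour
plot_pacf(series, ax=axes[1], lags=20)   # sharp cut-off at lag 1: suggests p = 1
plt.tight_layout()
plt.show()
```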
Ans: - In today's world, a lot of data is generated on a daily basis, and analysing it for trends and patterns can be difficult when it is in its raw format. Data visualization overcomes this by providing a well-organized pictorial representation of the data, which makes it easier to understand, observe, and analyse. Here we will discuss how to visualize data using Python.
Python provides various libraries for visualizing data, each with different features and support for different types of graphs. We will discuss four such libraries:
• Matplotlib
• Seaborn
• Bokeh
• Plotly
Matplotlib
• Matplotlib is an easy-to-use, low-level data visualization library
that is built on NumPy arrays. It consists of various plots like
scatter plot, line plot, histogram, etc. Matplotlib provides a lot of
flexibility.
Scatter Plot
• Scatter plots are used to observe relationships between variables, using dots to represent the relationship between them. The scatter() method in the matplotlib library is used to draw a scatter plot.
• This graph can be made more meaningful by adding colors and changing the size of the points, using the c and s parameters of the scatter function, respectively. We can also show a color bar using the colorbar() method.
Line Chart
• A line chart displays data points connected by straight line segments and is useful for showing how values change over a sequence (for example, time). It can be created using the plot() method.
Bar Chart
• A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths and heights are proportional to the values they represent. It can be created using the bar() method.
Histogram
• A histogram shows the distribution of numeric data by grouping values into bins and drawing a bar for each bin. It can be created using the hist() method.
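A minimal sketch that draws the four plot types mentioned above, including the c, s, and colorbar() options for the scatter plot (assuming matplotlib and NumPy; the randomly generated data is used only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 2, 50)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plot: colour (c) and size (s) encode extra information.
sc = axes[0, 0].scatter(x, y, c=y, s=40 * rng.uniform(0.5, 2, 50), cmap="viridis")
fig.colorbar(sc, ax=axes[0, 0])
axes[0, 0].set_title("Scatter plot")

# Line chart: values connected by straight line segments.
axes[0, 1].plot(sorted(x), 2 * np.array(sorted(x)))
axes[0, 1].set_title("Line chart")

# Bar chart: bar heights proportional to category values.
axes[1, 0].bar(["A", "B", "C", "D"], [5, 7, 3, 8])
axes[1, 0].set_title("Bar chart")

# Histogram: distribution of values grouped into bins.
axes[1, 1].hist(y, bins=10)
axes[1, 1].set_title("Histogram")

plt.tight_layout()
plt.show()
```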
Ans:- https://www.geeksforgeeks.org/data-visualization-in-r/
Ans:-
Exploratory Data Analysis or EDA is a statistical approach or technique for analyzing data sets to summarize
their important and main characteristics generally by using some visual aids. The EDA approach can be used to
gather knowledge about the following aspects of data.
• Finding out the important variables that can be used in our problem.
• Searching for the answers by using visualization, transformation, and modeling of our data.
• Using the lessons that we learn to refine our set of questions or to generate a new set of
questions.
In R Programming Language, we are going to perform EDA under two broad classifications:
• Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on.
• Graphical Methods, which includes histogram, density estimation, box plots, and so on.
Before we start working with EDA, we must perform the data inspection properly. Here in our analysis, we will be
using the loafercreek from the soilDB package in R. We are going to inspect our data in order to find all the typos
and blatant errors. Further EDA can be used to determine and identify the outliers and perform the required
statistical analysis. For performing the EDA, we will have to install and load the following packages:
• “aqp” package
• “ggplot2” package
• “soilDB” package
We can install these packages from the R console using the install.packages() command and load them into our
R Script by using the library() command. We will now see how to inspect our data and remove the typos and
blatant errors.
To ensure that we are dealing with the right information, we need a clear view of our data at every stage of the
transformation process. Data Inspection is the act of viewing data for verification and debugging purposes,
before, during, or after a translation. Now let’s see how to inspect and remove the errors and typos from the
data.
For Descriptive Statistics in order to perform EDA in R, we will divide all the functions into the following
categories:
• Measures of central tendency
• Measures of dispersion
• Correlation
We will try to determine the mid-point values using the functions under the Measures of Central tendency.
Under this section, we will be calculating the mean, median, mode, and frequencies.
Since we have already checked our data for missing values, blatant errors, and typos, we can now examine our
data graphically in order to perform EDA. We will see the graphical representation under the following
categories:
• Distributions
Under the Distribution, we shall examine our data using the bar plot, Histogram, Density curve, box plots, and
QQplot.
1. Text Collection
• Description: Gather textual data from sources like documents, websites, tweets, customer reviews, emails, etc.
• Tools: Manual entry, web scraping (e.g., using rvest in R or BeautifulSoup in Python), APIs (Twitter
API, etc.).
2. Text Preprocessing
• Description: Clean and normalize text data to make it suitable for analysis.
• Common Steps:
o Lowercasing text
o Removing punctuation
o Removing numbers
o Removing stop words (e.g., "the", "is", "and")
o Tokenization, stemming, or lemmatization
3. Text Representation
• Description: Convert the cleaned text into numerical features that algorithms can work with.
• Common Methods:
o Bag of Words (word counts)
o TF-IDF
o Word embeddings (e.g., Word2Vec, GloVe)
5. Sentiment Analysis
• Description: Analyze the emotional tone (positive, negative, neutral) of the text.
• Methods:
6. Topic Modeling
• Description: Discover hidden topics or themes across a collection of documents (e.g., using LDA, Latent Dirichlet Allocation).
7. Visualization and Reporting
• Description: Visualize insights from the text using charts, word clouds, network diagrams, etc.
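A minimal sketch of steps 3 and 6 together, using scikit-learn (an assumption, since the notes name no library): a bag-of-words representation followed by a two-topic LDA model. The six-document corpus is invented purely for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny corpus (illustrative only) mixing two rough themes: sports and finance.
docs = [
    "the team won the football match last night",
    "the striker scored two goals in the game",
    "stock prices fell as the market reacted to inflation",
    "investors sold shares after the interest rate rise",
    "the coach praised the players after the match",
    "the bank raised interest rates again this quarter",
]

# Bag-of-words counts, then fit a 2-topic LDA model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```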
Ans :-