Data Science


Hi, my name is Shubham Kr.

Currently, I am working as a Senior Associate Analyst at Genpact. Regarding my educational background, I did a bachelor's with honors in Statistics from St. Xavier's College and a master's in Statistics from the Central University of Odisha. I have also done some online courses to upgrade myself in data analytics. At Genpact I've worked on projects that involved advanced statistical techniques, including regression analysis, ARIMA, and machine learning models. In addition, I have hands-on experience with data manipulation tools like SQL, Excel, and Tableau, which help me in both data exploration and visualization. Before Genpact, I interned as a Data Science Intern at Ai Variant. I'm passionate about solving real-world problems through data-driven insights, and I'm excited about opportunities where I can apply my skills to drive impactful solutions.

"I am working on a project at Genpact focused on automating the


categorization of support tickets to enhance efficiency and accuracy.
Initially, I perform exploratory data analysis on ticket dumps from
ServiceNow to uncover patterns and prepare the data. I then develop
predictive models using techniques like SVM, Random Forest, and
Naive Bayes to categorize tickets accurately. By iteratively refining
these models, I’ve helped streamline the support process, reduce
response times, and improve accuracy, ultimately adding value to
our support operations."

Objective: The goal of this project was to automate ticket categorization in a ServiceNow support system, which involved analyzing ticket data, extracting meaningful information, and building predictive models. The objective was to enhance efficiency in managing support tickets by accurately classifying them based on their content.
Scope: Use machine learning models to classify support tickets based on historical data, reducing manual effort and ensuring quick, accurate responses.

Data Collection and Exploration


Data Source: The data came from ServiceNow, containing multiple fields like ticket ID and description.
Exploratory Data Analysis (EDA):
Data Cleaning: Cleaned the data by handling missing values and standardizing formats.
Text Analysis: Focused on the text fields (e.g., ticket descriptions)
to understand common keywords, phrases, and patterns.
Algorithm of Logistic Regression
Logistic regression is a statistical method used for binary classification
problems, where the goal is to predict the probability that a given input
point belongs to a certain class. Unlike linear regression, which predicts
continuous output, logistic regression predicts a probability that falls
between 0 and 1.

1. Sigmoid Function: The core of logistic regression is the sigmoid function, which maps any real-valued number to a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))

2. Linear Combination of Input Features: Calculate the linear combination of input features (weighted sum):
z = β0 + β1x1 + β2x2 + … + βnxn
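As a minimal illustration of these two steps (a sketch with made-up coefficient values, assuming NumPy), the predicted probability can be computed like this:

import numpy as np

def sigmoid(z):
    # Maps any real value to the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and a single input row (not from the project data)
beta0 = -0.3                        # intercept
beta = np.array([0.5, 1.2, -0.7])   # beta1..beta3
x = np.array([1.0, 0.0, 2.0])

z = beta0 + np.dot(beta, x)         # linear combination of features
p = sigmoid(z)                      # probability of the positive class
print(round(p, 3))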



Named Entity Recognition (NER): Used spaCy’s NER to extract entities, identifying
software names, error types, locations, and other relevant details from ticket
descriptions. This added contextual information specific to the type of issue reported in
each ticket.
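A minimal sketch of entity extraction with spaCy (assumes the small English model en_core_web_sm is installed; the ticket text is made up):

import spacy

# one-time download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

ticket = "Outlook crashes with error 0x80040115 on the Mumbai VPN since Monday."
doc = nlp(ticket)

# Each entity comes with a label such as ORG, GPE, DATE, etc.
for ent in doc.ents:
    print(ent.text, ent.label_)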
Vectorization (Converting Text into Numerical Format)
Machine learning models work with numerical data, so we need to convert the cleaned
text into a numerical representation.

Model Selection and Development


Modeling Approach: Implemented and compared several machine learning algorithms,
including:
Support Vector Machine (SVM): To classify tickets based on non-linear
relationships.
Random Forest: For its interpretability and ability to handle complex data relationships.
Naive Bayes: As a baseline model due to its efficiency and performance in text
classification.
Training: Trained these models on labeled data to understand each ticket’s category.
Used libraries like scikit-learn in Python within Jupyter Notebook.
Evaluation: Evaluated models using metrics such as accuracy, precision, recall, and F1
score. This helped refine the models and identify the best-performing one.
"The system is not working",
"Error in the login page",
"Need help with the software installation",
"Issue with network connectivity"
In my current role at Genpact, I worked on a project where I
developed predictive models to categorize support tickets using
various machine learning techniques such as Support Vector
Machine and Random Forest.
Regarding model deployment, while I didn’t directly handle
deployment in production, I gained valuable insights into the
process. For instance, I collaborated closely with the engineering
team to ensure the models were production-ready. We discussed
the importance of using frameworks like Flask for creating APIs to
serve our models.

Model Tuning and Iterative Refinement


Hyperparameter Tuning: Performed a grid search for the optimal hyperparameters of the random forest model, focusing on metrics like F1 score to balance precision and recall.
Cross-Validation: Used cross-validation techniques to ensure the
model's robustness and prevent overfitting.
Iterative Improvements: Based on model performance, refined
the models by adjusting features, reprocessing the text, and
tuning parameters. This iterative process improved classification
accuracy and response times in ticket handling.
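A sketch of the grid search step with scikit-learn (synthetic data stands in for the vectorized ticket features; the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the TF-IDF feature matrix and labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # F1 balances precision and recall (binary labels here)
    cv=5,           # 5-fold cross-validation for robustness
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))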

Outcomes and Business Impact


Efficiency: Automating the ticket categorization reduces
manual work, speeds up response times, and ensures
consistency in categorization.
Scalability: The solution can handle large volumes of
tickets and adapt as new data is provided.
Accuracy: The model improves categorization accuracy,
helping agents to address tickets faster and more
accurately, improving customer satisfaction.

Business Impact: The implementation of the predictive models significantly reduced the average resolution time for support
tickets, as tickets were categorized accurately and routed to the
right teams faster. This enhancement in efficiency led to improved
customer satisfaction and reduced operational costs for the client.
Tokenization involves breaking a piece of text into
smaller units, called tokens. Tokens are typically
words, but they can also be characters or subwords.
The tokenizer splits the text using delimiters like
spaces, punctuation, or special characters.
Regular expressions (regex) are often used to define
these delimiters. For ex: "This is a ticket." can be
tokenized into ["This", "is", "a", "ticket"].
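A minimal regex-based tokenizer matching the example (a sketch; NLTK or spaCy tokenizers handle punctuation and edge cases more carefully):

import re

text = "This is a ticket."
tokens = re.findall(r"\w+", text)   # keep runs of word characters, drop punctuation
print(tokens)                       # ['This', 'is', 'a', 'ticket']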

Stemming: Reduces a word to its base or root form by chopping off prefixes/suffixes.
Example: "running" → "run," "flies" → "fli."
Lemmatization: Reduces a word to its canonical or
dictionary form using linguistic rules.
Example: "running" → "run," "flies" → "fly."

Stemming:
Removes prefixes or suffixes based on predefined
patterns.
Example: The Porter Stemmer uses rules like:
remove the "ing" suffix, so "running" → "run".
Lemmatization:
Looks up the root form of a word in a dictionary.
Example: "ran" → "run" (based on part of speech).
Stopwords are common words like "the," "and," "in," etc., that
don’t add significant meaning to text analysis. Removing them
reduces noise in the data.
A predefined list of stopwords (e.g., from NLTK or spaCy) is used.
Tokens are compared against this list.
Words present in the list are excluded.
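A sketch of stopword removal using NLTK's predefined list (assumes the stopwords corpus is downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # one-time download
stop_words = set(stopwords.words("english"))

tokens = ["the", "login", "page", "is", "throwing", "an", "error"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['login', 'page', 'throwing', 'error']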
TF-IDF (Term Frequency-Inverse Document Frequency) is widely used in text classification tasks like ticket categorization for several reasons.
Advantages of TF-IDF:
Feature Importance: TF-IDF assigns higher importance to words that are unique to a document and less importance to common words that appear frequently across all documents.
For example, words like "urgent" in a ticket description may carry significant meaning compared to generic words like "is" or "the".

How TF-IDF Works:
Term Frequency (TF): Measures how frequently a word occurs in a document.
TF(t,d) = Frequency of t in d / Total words in d
Inverse Document Frequency (IDF): IDF(t) = log(Total documents / Number of documents containing t)
TF-IDF Score: TF-IDF(t,d) = TF(t,d) × IDF(t)
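In scikit-learn this weighting is available as TfidfVectorizer (note that scikit-learn uses a smoothed variant of the IDF formula above); a minimal sketch on made-up ticket descriptions:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "urgent outage in the billing system",
    "password reset request",
    "urgent login error on the portal",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.shape)                             # (3, number_of_terms)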


Drawbacks of TF-IDF
High Dimensionality: TF-IDF creates a large feature space proportional to the vocabulary size, leading to memory inefficiency on large datasets.
Context Ignorance: TF-IDF doesn't consider the context or semantic meaning of words. For instance, synonyms ("error" vs. "issue") are treated as separate features.
Fixed Vocabulary: If a new document contains words not present in the training data, those words are ignored.
Alternatives to TF-IDF
1. Bag of Words (BoW)
Represents text as a count of words without weighting.
Advantage: Simple to implement and understand.
Disadvantage: Ignores word importance and context, leading to a less meaningful representation.

Word2Vec represents words as vectors in a continuous vector space where semantically similar words are close to each other.
It comes in two main architectures:
CBOW (Continuous Bag of Words): Predicts a word based on its surrounding context.
Skip-gram: Predicts the surrounding context words for a given word.
Why Use Word2Vec?
Context Awareness: Captures relationships between words (e.g., "error" and "issue" are close in vector space).
Dimensionality Reduction: Provides dense embeddings, unlike sparse representations like TF-IDF.
Domain-Specific Vocabulary: Custom-trained Word2Vec learns domain-specific word associations, making it more relevant for ticket descriptions.
Improved Accuracy: Word2Vec captures the semantic meaning of ticket descriptions, improving model performance.
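A sketch of training a custom Word2Vec model with gensim (gensim is an assumption here, not mentioned in the project; the toy sentences are far too small for meaningful embeddings):

from gensim.models import Word2Vec

# Pre-tokenized ticket descriptions (toy data)
sentences = [
    ["login", "error", "on", "portal"],
    ["network", "issue", "reported", "again"],
    ["error", "while", "saving", "ticket"],
    ["network", "error", "and", "login", "issue"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram

print(model.wv["error"].shape)                 # dense 50-dimensional embedding
print(model.wv.most_similar("error", topn=2))  # nearest words in the vector space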
Question: Why did you choose these particular models, and what are their pros and cons?
Answer: I selected Support Vector Machine (SVM), Random Forest, and Naive Bayes due to their respective strengths:
SVM: SVM performs well on text data and can handle non-linear
separations in data. It’s robust with high-dimensional spaces, making
it ideal for processing text-based descriptions after TF-IDF
vectorization. However, it’s computationally intensive and might be
slower with large datasets.
Random Forest: This model is versatile and less prone to overfitting
than simpler tree models due to ensemble learning. It provided
interpretability and helped me understand feature importance. Its
drawback is that it requires tuning for optimal performance.
Naive Bayes: This is a simple yet effective model for text
classification, especially when handling large datasets with limited
computation time. Its main limitation is the strong independence
assumption, which may not hold in real-world data.

Can you give examples of the features you engineered, and how they improved model performance?
I created features from text-based data, such as urgency level from
keywords like “urgent” and “ASAP,” and extracted specific product
names and error codes. Additionally, I included derived features like
ticket submission time (hour of the day, weekday/weekend) to capture
service patterns. These features provided additional context, helping
the model differentiate between urgent and routine tickets more
accurately. Including these features improved the recall rate for
priority tickets by 10%, which directly impacted operational efficiency.
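A hedged pandas sketch of the kind of derived features described (column names like "description" and "created_at" are hypothetical, not the actual schema):

import pandas as pd

df = pd.DataFrame({
    "description": ["URGENT: server down", "Please update my email id"],
    "created_at": pd.to_datetime(["2024-03-04 22:15", "2024-03-09 10:05"]),
})

urgent_words = ["urgent", "asap", "immediately"]
df["is_urgent"] = df["description"].str.lower().str.contains("|".join(urgent_words))
df["hour"] = df["created_at"].dt.hour                    # hour of submission
df["is_weekend"] = df["created_at"].dt.dayofweek >= 5    # 5 = Saturday, 6 = Sunday

print(df[["is_urgent", "hour", "is_weekend"]])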
What text processing techniques did you apply, and why?
I applied several NLP techniques to preprocess ticket descriptions:
Tokenization: Breaking down text into individual words or tokens for
analysis.
Stopword Removal: Removed common words (like “the” or “and”)
that don’t add meaningful information.
Stemming/Lemmatization: Reduced words to their root forms
(e.g., “running” to “run”) for uniformity.
TF-IDF Vectorization: Represented text numerically by calculating
term frequency-inverse document frequency, helping the model
focus on important words and ignore less relevant ones.
What were some of the key findings from your exploratory
data analysis?
During EDA, I found patterns in ticket submissions, such as peak
times for issues, common categories of problems, and correlations
between ticket types and resolution times. I also identified missing
values and outliers that needed to be addressed before modeling,
which helped improve the overall data quality.

How do you evaluate model performance, and what metrics do you use?
Answer: I primarily use accuracy, precision, recall, F1 score, and AUC
to evaluate classification models, depending on the business need.
For example, in predicting ticket categorization, I prioritized
precision to ensure that each ticket was assigned accurately. For
cross-validation and tuning, I typically split the data into train-test
sets and use grid search to optimize hyperparameters, ensuring that
models are both robust and generalizable.
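A sketch of computing these metrics with scikit-learn on placeholder predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder true and predicted categories
y_true = ["incident", "request", "incident", "incident", "request"]
y_pred = ["incident", "incident", "incident", "request", "request"]

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred, pos_label="incident"))
print("recall   ", recall_score(y_true, y_pred, pos_label="incident"))
print("f1       ", f1_score(y_true, y_pred, pos_label="incident"))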
What tools or libraries did you use for entity extraction, and
how did extracted entities improve the classification
process?
For entity extraction, I used Named Entity Recognition (NER) in the
spaCy library, which identifies named entities (e.g., software names,
locations) in text data. I also used regular expressions for simple
pattern recognition to identify specific keywords associated with
recurring issues.
Extracted entities helped the model differentiate between tickets
more effectively by adding contextual information specific to certain
ticket types. For example, identifying software or hardware names
within a ticket allowed the model to better categorize technical
support requests versus general inquiries, improving classification
precision.

Can you describe a specific use case where your predictive model provided significant business value?
Answer: One significant use case was where our predictive model
identified recurring issues in ticket categories. By analyzing
historical data, the model predicted which types of tickets were
likely to increase during specific times of the year. This allowed the
support team to proactively allocate resources during peak times,
significantly reducing response times and improving overall
customer satisfaction.
What metrics did you prioritize, and why? How
did you handle cases where the model
misclassified tickets?
Metrics: I prioritized precision and recall due to the
importance of accurate classification in support
contexts. For example, high precision was critical to
ensure that tickets flagged as high-priority were
genuinely high-priority, and high recall ensured that no
major ticket types were missed.
Handling Misclassifications: I analyzed the confusion
matrix to understand where the model was
underperforming. Misclassifications often pointed to
ambiguous or overlapping categories, so I refined
feature engineering by adding specific keywords and
rebalancing class weights. I also reviewed and
retrained with additional data to improve the model’s
generalization.
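A sketch of the confusion-matrix review and class-weight rebalancing mentioned above (labels and predictions are placeholders):

from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

y_true = ["incident", "request", "incident", "request", "incident", "incident"]
y_pred = ["incident", "incident", "incident", "request", "request", "incident"]

# Rows = actual classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred, labels=["incident", "request"]))

# Rebalancing class weights is one lever for systematic misclassifications
clf = SVC(class_weight="balanced")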

What is A/B Testing?


A/B testing is a method of comparing two versions of a
product or feature (A and B) to determine which one
performs better. It’s commonly used in marketing,
website optimization, and product development. One
version (A) serves as the control group, while the other
version (B) is the experimental group. The goal is to
test a hypothesis to see if the change in version B
leads to statistically significant improvements.
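A minimal sketch of checking significance for an A/B test on conversion counts (statsmodels is an assumption; the numbers are made up):

from statsmodels.stats.proportion import proportions_ztest

# Conversions and sample sizes for control (A) and variant (B)
conversions = [120, 150]
visitors = [2400, 2380]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(round(stat, 3), round(p_value, 4))

# A p-value below the chosen significance level (e.g., 0.05)
# suggests the difference between A and B is statistically significant.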
Random Forest is an ensemble of decision trees,
where each tree is trained on a random subset of
data, and the final prediction is the average (for
regression) or majority vote (for classification). It’s
great for reducing overfitting and is easy to train.

Random Forest starts by creating multiple samples from the original dataset using bootstrap sampling
(sampling with replacement). This means each
sample, also known as a bootstrapped dataset, may
contain duplicate data points. For each
bootstrapped sample, a decision tree is trained
independently.

While building each tree, at each split, the Random Forest algorithm doesn’t consider all features.
Instead, it randomly selects a subset of features
and finds the best split among those. This process
is called feature randomness. The depth of each
tree is allowed to grow to the maximum, meaning
that each tree can potentially be a deep tree.

For Classification: Each decision tree in the forest provides a "vote" for the class it predicts. The final
classification is made by taking the majority vote
across all trees.
For Regression: The prediction from each tree is
averaged to get the final prediction.
PCA is a dimensionality reduction technique that transforms high-
dimensional data into a lower-dimensional space by finding the
directions (principal components) that maximize variance. It is useful
for reducing the complexity of datasets while retaining the most
important features, which can help in visualizing data, speeding up
model training, and mitigating the curse of dimensionality.
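A minimal scikit-learn PCA sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=2)                      # keep the 2 directions of maximum variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component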

Support Vector Machines are supervised learning models used for classification and regression. The main idea behind SVM is to find the
optimal hyperplane that maximizes the margin between two classes in
the feature space. The points closest to the hyperplane are called
support vectors, and they influence the position and orientation of the
hyperplane. SVM can handle non-linear data by using kernel functions
(e.g., polynomial, radial basis function) to transform the input space
into a higher-dimensional space where a linear separation is possible.

Hyperplane: In SVM, the hyperplane is a line (in 2D space) or a plane (in 3D space) that separates the data points into two categories.
The goal is to find the hyperplane that maximizes the margin
between the classes.
Support Vectors:These are the closest data points from
each class to the hyperplane. These points "support" the hyperplane,
meaning they are the critical points that define the margin.

The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin to ensure that the hyperplane generalizes well to new data. The larger the margin, the better the model is likely to perform on unseen data. The decision boundary (or hyperplane) is determined by the support vectors. SVM tries to create a hyperplane with the largest margin between the two classes, making it robust to outliers.

Linear SVM: This is used when the data can be separated by a straight
line (hyperplane). It works well for linearly separable data.
Non-linear SVM: Often, data cannot be separated by a straight line. To
handle this, SVM uses the kernel trick.
Objective of SVM:
Hard Margin SVM: Assumes that the data is perfectly linearly
separable. It finds the largest margin hyperplane without allowing any
misclassification. This approach is not robust to noise or outliers
because no misclassifications are allowed.
Soft Margin SVM: Allows some misclassification by introducing a
regularization parameter C. The trade-off is between maximizing the
margin and minimizing classification errors. A small C gives a larger
margin but allows some misclassifications, while a large C tries to
minimize misclassification at the cost of a smaller margin.

Polynomial Kernel: For non-linear relationships with polynomial terms.
Radial Basis Function (RBF) Kernel: This is commonly used for non-
linear classification tasks. It maps the input data into a higher-
dimensional space where a linear hyperplane can be used to classify
the data.
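A sketch of a non-linear SVM with the RBF kernel on a toy, non-linearly-separable dataset (synthetic data; C and gamma are the knobs discussed in this section):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVM; C controls the soft margin, gamma the RBF kernel width
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

print(round(model.score(X_test, y_test), 3))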
Parameters:
n_estimators: More trees improve performance but take more computational time.
max_depth: Deeper trees capture more patterns but may overfit; shallow trees generalize better but may
underfit.
max_features: Lower values increase randomness, improving generalization but may miss important features.
min_samples_split: Higher values make the tree simpler and less prone to overfitting.
min_samples_leaf: Larger values prevent overfitting by ensuring leaf nodes contain enough samples.
bootstrap: Increases randomness in the trees and reduces overfitting.
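A sketch showing how these parameters map onto scikit-learn's RandomForestClassifier (synthetic data; the values are illustrative, not tuned):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=1)

rf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_depth=None,          # let trees grow fully (watch for overfitting)
    max_features="sqrt",     # random subset of features considered at each split
    min_samples_split=4,
    min_samples_leaf=2,
    bootstrap=True,          # sample rows with replacement for each tree
    oob_score=True,          # out-of-bag estimate of generalization error
    random_state=1,
)
rf.fit(X, y)
print(round(rf.oob_score_, 3))
print(rf.feature_importances_[:5])   # relative importance of the first 5 features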
Decision tree works by recursively splitting data based on features to create branches leading to a decision.
The final output depends on the decisions at each node, which are driven by measures like Gini Impurity or
Information Gain. While decision trees are powerful and interpretable, they can overfit easily and may require
pruning or ensemble methods to improve performance.Decision trees can be visualized, where each node
represents a feature, the branches represent conditions, and the leaves represent the final outcomes. This
visualization helps understand the decision-making process of the model.

Gini Impurity measures the likelihood that a randomly chosen element from the dataset would be incorrectly
classified if it were randomly labeled according to the class distribution in that dataset.
Entropy measures the disorder or impurity in a dataset. High entropy means the classes are well
mixed, while low entropy means most data points belong to a single class.
Information Gain (IG) measures the reduction in entropy or impurity after a dataset is split on an attribute. It
represents how much information a feature contributes to making a decision.
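A small sketch computing these impurity measures for a node's class distribution (the proportions are made up, and the two child nodes are assumed to be equal in size):

import math

def gini(proportions):
    # Probability of misclassifying a randomly drawn, randomly labeled sample
    return 1 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Disorder of the class distribution (in bits)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

parent = [0.5, 0.5]                   # 50/50 class mix before the split
left, right = [0.9, 0.1], [0.2, 0.8]  # class mixes in the two child nodes

# Information gain = parent entropy - weighted average of child entropies
ig = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))

print(round(gini(parent), 3), round(entropy(parent), 3), round(ig, 3))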
What is Gradient Boosting?
Gradient Boosting is an ensemble machine learning technique used for regression and classification tasks, which builds models in a stage-wise fashion and generalizes them by optimizing a differentiable loss function. The idea is to combine the predictions of several weak estimators (typically decision trees) to improve robustness and accuracy.
1. Start with an initial model, usually a simple one such as the mean of the target values for regression or the log-odds for classification.
2. Compute the residuals (errors) of the current model, which represent the difference between the actual target values and the predicted values by the model.
3. Fit a weak learner (e.g., a decision tree with limited depth) to the residuals. This learner is trained to predict the residuals from the previous step.
4. Update the model by adding the weighted prediction of the decision tree.
5. Calculate the residuals again using the updated model.
After many iterations, you end up with a series of small trees. Each tree corrects the errors of the previous one. Your final model is the sum of all these small trees' predictions.
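A short scikit-learn sketch of the boosting loop described above (synthetic data; learning_rate weights each tree's contribution):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=3)

gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of small trees, each fit to the current residuals
    learning_rate=0.05,  # weight applied to every tree's prediction
    max_depth=3,         # shallow trees act as weak learners
    random_state=3,
)
gbr.fit(X, y)
print(round(gbr.score(X, y), 3))   # R^2 on the training data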

Naive Bayes is based on Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions related to the event.
P(A∣B) = P(B∣A)·P(A) / P(B)
Assumption: Naive Bayes assumes that all features in X (e.g., the words in an email) are independent of each other given the class C. This allows us to calculate the likelihood P(X∣C) by multiplying the individual probabilities of each feature. This independence assumption simplifies the computation and makes Naive Bayes scalable to large datasets. Although independence rarely holds true in real-world data, the algorithm still often performs well.
XGBoost is optimized for performance and
speed through parallel processing, memory
optimization, and tree pruning. It also includes
regularization techniques, which help prevent
overfitting, making it more robust on large
datasets.

High Bias, Low Variance: Leads to underfitting, where the model is too simple and fails to capture the data's complexity.
capture the data's complexity.
Low Bias, High Variance: Leads to overfitting,
where the model learns too much from the
training data, including noise, and fails to
generalize well.

Neural Networks are computational models inspired by the human brain, designed to
recognize patterns and make decisions. They
consist of interconnected nodes, called neurons,
that process information in layers. Each neuron
receives inputs, processes them, and passes
them on to other neurons, allowing neural
networks to "learn" from data through training.

Input Layer: The first layer receives the input data. Each node (neuron) represents one feature
of the input data. For example, in an image
recognition task, each pixel could be a feature,
so each node would represent a pixel’s value.
Hidden Layers: These are the intermediate
layers where processing takes place. Hidden
layers perform transformations on the input
data, learning features that help make
predictions. A network can have multiple hidden
layers, which enables it to learn complex
relationships.
Output Layer: The final layer provides the output
of the network. For classification, each neuron in
this layer might represent a possible class, and
for regression tasks, it might output a
continuous value.
Hyperparameter tuning is the process of finding the best set of hyperparameters for
a machine learning model to maximize its performance. Unlike model parameters,
which are learned during training (e.g., weights in a neural network), hyperparameters
are set before training and control aspects of the learning process, like the complexity
of the model and the way it learns. Tuning involves systematically searching for the
most optimal hyperparameter values to improve model performance.

Learning Rate: Controls how much the model updates weights with each step.
Number of Layers and Neurons: Determines the architecture of a neural network.
Batch Size: The number of training samples processed before updating the model.
Number of Trees in Random Forest: In ensemble methods, the number of trees
determines the ensemble’s size.
Max Depth: The maximum depth of each tree in decision tree-based models

API (Application Programming Interface) development is the process of creating a set of rules and protocols that allow different software applications to communicate
with each other. APIs define the methods and data formats that applications can use to
request and exchange information, enabling them to work together seamlessly.
After developing your machine learning model for ticket categorization, you can
deploy it as an API. This allows other applications (like a customer support system) to
send ticket data and receive categorized responses in real-time.

Model Training and Serialization


Train Your Model: Ensure your ticket categorization model (e.g., SVM, Random Forest) is
trained and evaluated.
Serialize the Model: Save the trained model using libraries like joblib or pickle
Create an API for Model Inference
Choose a Framework: Select a web framework (Flask, FastAPI, Django).
Set Up the API:
Install necessary packages: pip install flask joblib pandas.
Write API Code: Create a script to handle API requests and load the model (see the sketch after these steps).
Testing the API
Run Locally: Start the Flask server to run your API.
Test the API: Use tools like Postman or cURL to verify functionality.
Deployment
Choose a Platform: Select a cloud service (AWS, Google Cloud, etc.).
Monitoring and Maintenance
Logging: Implement logging to track requests and errors.
Monitoring: Use tools like New Relic or Prometheus for performance tracking.
Model Retraining: Plan for periodic retraining with new data to improve accuracy.
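A minimal Flask sketch for serving the serialized model (the file name "ticket_model.joblib" and the request format are assumptions, not the actual project code):

# app.py -- run with: python app.py
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("ticket_model.joblib")   # assumed path to the serialized pipeline

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                       # e.g. {"description": "VPN not connecting"}
    category = model.predict([payload["description"]])[0]
    return jsonify({"category": str(category)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Once running locally, it can be tested with Postman or cURL, for example:
curl -X POST -H "Content-Type: application/json" -d '{"description": "VPN not connecting"}' http://localhost:5000/predict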
Describe your experience with model deployment and the tools you used.
I deployed models in my Telecom Churn Prediction project using Streamlit to create a user-
friendly application. I am also familiar with cloud-based deployment using frameworks like Flask
and Django, which enable API integration for real-time model predictions. My experience with
GCP complements this, where I have deployed models and worked with cloud-based analytics,
which aligns well with Phenom’s use of platforms like AWS Lambda and SageMaker.
How SVM Works for Ticket Classification:

Goal: SVM aims to find a hyperplane that separates different classes of tickets with
the largest possible margin.
Data Transformation: For cases where the data isn’t linearly separable, SVM uses
kernel functions (e.g., RBF, polynomial) to map ticket data into higher-dimensional
space, making it easier to draw a separation boundary.
Decision Boundary: The support vectors (data points near the decision boundary)
influence the position of the hyperplane. SVM tries to maximize the margin between
classes (e.g., ticket types like incident or request).
Optimization: It minimizes misclassification by solving a convex optimization
problem, balancing accuracy with generalization.
Backend Functionality:

Kernel Trick: If using non-linear kernels, SVM transforms the input features (e.g.,
ticket description text, priority) into a higher-dimensional space. For text data, it
often combines with techniques like TF-IDF or word embeddings to represent textual
information in numerical form.
Feature Scaling: SVM requires normalized or standardized features to function
effectively, so each ticket attribute is scaled to enhance model performance.

By using SVM with a kernel trick and proper feature scaling, ticket classification can
effectively handle complex, non-linear relationships in ticket data. This approach
enables the SVM model to classify tickets into categories like "incident" or "request"
accurately, improving the efficiency of ticket sorting and response in service
management systems.

Key Hyperparameters in SVM:


C (Regularization parameter):
Controls the trade-off between maximizing the margin and allowing some
misclassifications.
A small C allows a larger margin but more misclassifications, while a larger C
creates a smaller margin but tries to minimize misclassifications.
Gamma (for RBF Kernel):
Determines the influence of a single training example. A high gamma leads to a
more complex model that fits closely to the training data (risk of overfitting), while a
low gamma leads to a simpler model.
How Random Forest Works for Ticket Classification:
Goal: Random Forest is an ensemble of decision trees that works to capture
complex relationships in the ticket data, helping it make robust classifications.
Tree Construction: Each tree in the forest is trained on a random subset of
tickets and a random subset of features. This process, called bagging, helps
reduce overfitting by introducing diversity in training.
Voting Mechanism: Each tree independently classifies a ticket, and the final
classification is determined by majority voting among the trees.
Backend Functionality:
Feature Importance: Random Forest can rank the importance of different
ticket features. For example, it might reveal that priority and ticket description
significantly influence the ticket’s classification.
Out-of-Bag Error (OOB): To assess model accuracy, it uses OOB data (tickets
not included in the training sample for a particular tree) for internal cross-
validation, providing an estimate of the model’s generalization error.
Decision Paths: Each tree generates decision rules based on the feature
values. For example, rules could include splitting tickets based on keywords in
the descriptions or urgency level.

E.g., one tree might have a path like:
If "priority" is high (greater than a certain value),
And if "description" contains specific keywords (like "urgent" or "outage"),
Then classify the ticket as an "incident."
Different paths and trees make different decisions, creating a variety of
possible classifications that ultimately contribute to the final decision through
majority voting.
How Naive Bayes Works for Ticket Classification:
Goal: Naive Bayes assumes that features (like keywords or categories) are conditionally
independent given the class and calculates the probability of each ticket belonging to
each class.
Probabilistic Model: It calculates the posterior probability for each class based on prior
probabilities and the likelihood of the features (e.g., words in ticket description or other
metadata).
Class Assignment: The ticket is classified into the category with the highest posterior
probability, which maximizes the probability of the ticket belonging to that class.
Backend Functionality:
Feature Independence: Naive Bayes assumes that the features are independent. For
instance, it might treat each word in the ticket description independently when
calculating probabilities, which can be an advantage with text-based tickets.
Likelihood Calculation: For text classification, Naive Bayes typically uses frequency-based
feature extraction (like TF-IDF) and then calculates probabilities for each word being
associated with a class.
Laplace Smoothing: To handle words or phrases that may not appear in the training set,
Naive Bayes uses smoothing techniques to assign small non-zero probabilities to unseen
words, enhancing the model’s ability to generalize.

Laplace Smoothing:
Laplace smoothing is used to handle cases where certain words or phrases in new tickets
do not appear in the training set, which would otherwise result in zero probabilities.

Naive Bayes offers a powerful, efficient way to classify text-based tickets by calculating
probabilities for each class. With assumptions of feature independence and use of
Laplace smoothing, it effectively handles diverse ticket descriptions, providing fast,
interpretable classifications. This model’s efficiency and simplicity make it ideal for
support ticket categorization tasks.
How Laplace Smoothing Works:

Laplace smoothing adds a small, non-zero value (usually 1) to the count of each word in
the vocabulary, ensuring that no probability is ever zero, even if the word is not present
in the training data.
For instance, if a word like "system" does not appear in any "incident" tickets during
training, the probability of "incident" given "system" would theoretically be zero. With
Laplace smoothing, we add a small value, say 1, to the word counts so the probability remains small but non-zero:

P(system ∣ incident) = (count of 'system' in incident tickets + 1) / (total words in incident tickets + total unique words)

This adjustment allows Naive Bayes to generalize better to new tickets with unseen
words, making it more robust for real-world applications.
SQL Clauses Execution Order:
FROM: Specifies the source table(s) to retrieve data from.
WHERE: Filters rows based on conditions.
GROUP BY: Groups rows that have the same values in
specified columns into summary rows.
HAVING: Filters groups based on conditions (similar to
WHERE but for grouped rows).
SELECT: Specifies the columns to retrieve and performs any
calculations.
ORDER BY: Sorts the result set based on specified columns.
LIMIT: Limits the number of rows returned.

To resolve the issue of subquery X increasing in cost while Y is not, follow a structured approach by analyzing the query
execution plan, optimizing indexes, rewriting the subquery
(if possible), and ensuring the database is efficiently
utilizing resources. Depending on the exact situation, one or
a combination of these strategies should help improve the
performance of your SQL queries.
A stored procedure in SQL is a pre-written set of SQL statements that we can save and reuse. It acts like a function in programming languages, allowing us to encapsulate and execute a series of SQL commands as a single unit.
Key Features of Stored Procedures:
Encapsulation: Combines multiple SQL statements into one reusable procedure.
Reuse: You can call it from different parts of your application or database.
CREATE PROCEDURE procedure_name AS BEGIN -- SQL statements END;

Write a SQL query to get the highest orders based on Category:
SELECT category, order_id, order_count
FROM (
    SELECT category, order_id, order_count,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY order_count DESC) AS row_num
    FROM orders
) AS ranked_orders
WHERE row_num = 1;

The LEAD and LAG functions in SQL are window functions that allow us to access data from subsequent (LEAD) or previous (LAG) rows in the result set without needing a self-join. (These functions are especially useful in analyzing trends, changes, and relationships between rows in ordered datasets, particularly in time-series data or sequences where row context is important.)

Time-Series Analysis (Stock Prices): Consider we have a dataset of daily stock prices where we want to compare the current day's price with the previous day's price to detect price changes. The LAG function fetches the price from the previous day, allowing us to calculate the difference in price from the previous day.

Analyzing Future Data


For e-commerce orders, we may want to
compare the current order’s delivery
time with the next order’s delivery time
to optimize processes or predict delays.
The LEAD function fetches the delivery
date of the next order, enabling us to
analyze the gap between deliveries.
How is the HAVING clause different from the WHERE clause?
The HAVING clause filters data after the GROUP BY operation, while WHERE is used before grouping. HAVING is applied to the result of the aggregation.

When does the SELECT clause get executed, and why is it last in the process?
The SELECT clause is executed after the GROUP BY and HAVING clauses because it defines the final columns to return in the result. Aggregate functions like COUNT() or SUM() are calculated at this step.

NTILE(): Divides rows into a specified number of roughly equal groups (or buckets) and assigns a bucket number to each row.

RANK Use Case (for Competition Rankings): When ranking participants in a competition where ties exist, RANK can be used. For example, if two participants tie for 2nd place, the next participant is ranked 4th (not 3rd), leaving a gap in the ranking.

DENSE_RANK Use Case (Salary Bands or Employee Ratings): When ranking employees or salaries without gaps in the ranks, DENSE_RANK is useful. For example, when calculating salary bands, if two employees share the same salary, the next salary band should follow sequentially without gaps.

Views in SQL are a kind of virtual table. A view also has rows and columns as they are on a real table in the database.
Window functions perform calculations
across a set of table rows related to the
current row. They’re useful for cumulative
calculations, running totals, ranking, and
moving averages without altering the
dataset’s structure.

Index is a database object which is applied on one or more columns of a table. When a
column (or list of columns) from the table
is Indexed, database creates a pointer to
each value stored in that column. This
significantly improves the query execution
time since the database will have a more
efficient way to find a particular value from
the column based on its index.
