Data Science
Named Entity Recognition (NER): Used spaCy’s NER to extract entities, identifying
software names, error types, locations, and other relevant details from ticket
descriptions. This added contextual information specific to the type of issue reported in
each ticket.
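A minimal sketch of what that entity extraction could look like with spaCy, assuming the en_core_web_sm model is installed and the ticket text shown is a made-up example:

import spacy

# Assumes the small English model has been downloaded once:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

ticket = "Outlook crashes with error 0x80040115 for users in the Chicago office."
doc = nlp(ticket)

# Each entity carries the matched text and a label such as ORG, GPE, or PRODUCT.
# Domain-specific labels (e.g., error codes) would need a custom-trained component.
for ent in doc.ents:
    print(ent.text, ent.label_)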
Vectorization (Converting Text into Numerical Format)
Machine learning models work with numerical data, so we need to convert the cleaned
text into a numerical representation.
Stemming:
Removes prefixes or suffixes based on predefined
patterns.
Example: The Porter Stemmer applies rules such as removing the "ing" suffix, so "running" → "run".
Lemmatization:
Looks up the root form of a word in a dictionary.
Example: "ran" → "run" (based on part of speech).
Stopwords are common words like "the," "and," "in," etc., that
don’t add significant meaning to text analysis. Removing them
reduces noise in the data.
A predefined list of stopwords (e.g., from NLTK or spaCy) is used.
Tokens are compared against this list.
Words present in the list are excluded.
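A sketch of stopword removal against NLTK's predefined list, assuming the stopwords and punkt resources have been downloaded; the sentence is illustrative:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download("stopwords") and nltk.download("punkt") are assumed.

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("The printer in the main office is not responding")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # e.g. ['printer', 'main', 'office', 'responding']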
TF-IDF (Term Frequency-Inverse Document Frequency) is widely used in text classification tasks like ticket categorization for several reasons.
Advantages of TF-IDF:
Feature Importance: TF-IDF assigns higher importance to words that are unique to a document and less importance to common words that appear frequently across all documents.
For example, words like "urgent" in a ticket description may carry significant meaning compared to generic words like "is" or "the."
How TF-IDF Works:
Term Frequency (TF): Measures how frequently a word occurs in a document.
TF(t, d) = (frequency of t in d) / (total words in d)
Inverse Document Frequency (IDF): Measures how rare a word is across the corpus.
IDF(t) = log(total documents / number of documents containing t)
TF-IDF Score:
TF-IDF(t, d) = TF(t, d) × IDF(t)
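A minimal sketch with scikit-learn's TfidfVectorizer; the ticket descriptions are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

tickets = [
    "urgent server error after update",
    "request new laptop for employee",
    "server not responding urgent issue",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tickets)   # sparse matrix: rows = tickets, columns = vocabulary terms

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))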
Drawbacks of TF-IDF
High Dimensionality: TF-IDF creates a large feature space proportional to the vocabulary size, leading to memory inefficiency on large datasets.
Context Ignorance: TF-IDF doesn’t consider the context or semantic meaning of words. For instance, synonyms ("error" vs. "issue") are treated as separate features.
Fixed Vocabulary: If a new document contains words not present in the training data, those words are ignored.
Alternatives to TF-IDF
1. Bag of Words (BoW)
Represents text as a count of words without weighting.
Advantage: Simple to implement and understand.
Disadvantage: Ignores word importance and context, leading to a less meaningful representation.
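For comparison, a Bag of Words representation of the same kind of data using scikit-learn's CountVectorizer (raw counts, no weighting); the tickets are again toy examples:

from sklearn.feature_extraction.text import CountVectorizer

tickets = ["urgent server error", "new laptop request", "server error again"]

bow = CountVectorizer()
X_counts = bow.fit_transform(tickets)   # each cell is simply a word count

print(bow.get_feature_names_out())
print(X_counts.toarray())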
Word2Vec represents words as vectors in a continuous vector space where semantically similar words are close to each other.
It comes in two main architectures:
CBOW (Continuous Bag of Words): Predicts a word based on its surrounding context.
Skip-gram: Predicts the surrounding context words for a given word.
Why Use Word2Vec?
Context Awareness: Captures relationships between words (e.g., "error" and "issue" are close in vector space).
Dimensionality Reduction: Provides dense embeddings, unlike sparse representations like TF-IDF.
Domain-Specific Vocabulary: Custom-trained Word2Vec learns domain-specific word associations, making it more relevant for ticket descriptions.
Improved Accuracy: Word2Vec captures the semantic meaning of ticket descriptions, improving model performance.
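A sketch of training a domain-specific Word2Vec model with gensim; the tokenized corpus here is a hypothetical toy example, not real ticket data:

from gensim.models import Word2Vec

# Each ticket is assumed to be pre-tokenized into a list of words.
tokenized_tickets = [
    ["server", "error", "after", "update"],
    ["application", "issue", "login", "error"],
    ["request", "new", "laptop"],
]

model = Word2Vec(
    sentences=tokenized_tickets,
    vector_size=100,   # dense embedding size (vs. sparse TF-IDF)
    window=5,          # context window
    min_count=1,
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

print(model.wv["error"][:5])            # first few dimensions of the learned embedding
print(model.wv.most_similar("error"))   # nearest words in the vector space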
Question: Why did you choose these particular models, and
what are their pros and cons? - I selected Support Vector Machine
(SVM), Random Forest, and Naive Bayes due to their respective
strengths:
SVM: SVM performs well on text data and can handle non-linear separations. It is robust in high-dimensional spaces, making it well suited to text-based descriptions after TF-IDF vectorization. However, it is computationally intensive and can be slow on large datasets.
Random Forest: This model is versatile and less prone to overfitting
than simpler tree models due to ensemble learning. It provided
interpretability and helped me understand feature importance. Its
drawback is that it requires tuning for optimal performance.
Naive Bayes: This is a simple yet effective model for text
classification, especially when handling large datasets with limited
computation time. Its main limitation is the strong independence
assumption, which may not hold in real-world data.
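A hedged sketch of how the three models might be compared side by side on TF-IDF features; the tickets, labels, and parameter values are illustrative placeholders, not the project's actual data or settings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy ticket data, purely illustrative.
texts = ["server is down urgent", "please reset my password",
         "application error on login", "request access to shared drive"]
labels = ["incident", "request", "incident", "request"]

X = TfidfVectorizer().fit_transform(texts)

for name, model in [("SVM", SVC(kernel="linear")),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                    ("Naive Bayes", MultinomialNB())]:
    model.fit(X, labels)
    print(name, model.predict(X))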
The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin so that the hyperplane generalizes well to new data: the larger the margin, the better the model is likely to perform on unseen data. The decision boundary (or hyperplane) is determined by the support vectors. SVM tries to create the hyperplane with the largest margin between the two classes, making it more robust to outliers.
Linear SVM: This is used when the data can be separated by a straight
line (hyperplane). It works well for linearly separable data.
Non-linear SVM: Often, data cannot be separated by a straight line. To
handle this, SVM uses the kernel trick.
Objective of SVM:
Hard Margin SVM: Assumes that the data is perfectly linearly
separable. It finds the largest margin hyperplane without allowing any
misclassification. This approach is not robust to noise or outliers
because no misclassifications are allowed.
Soft Margin SVM: Allows some misclassification by introducing a
regularization parameter C. The trade-off is between maximizing the
margin and minimizing classification errors. A small C gives a larger
margin but allows some misclassifications, while a large C tries to
minimize misclassification at the cost of a smaller margin.
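A small sketch of how the regularization parameter C trades margin width against training errors, using scikit-learn's SVC on toy, slightly noisy data (values are illustrative):

from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Toy two-class data with some label noise.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

soft = SVC(kernel="linear", C=0.1).fit(X, y)      # wide margin, tolerates misclassifications
hard_ish = SVC(kernel="linear", C=100).fit(X, y)  # narrow margin, penalizes errors heavily

# A smaller C typically keeps more support vectors (a "softer" boundary).
print(len(soft.support_), len(hard_ish.support_))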
Gini Impurity measures the likelihood that a randomly chosen element from the dataset would be incorrectly
classified if it were randomly labeled according to the class distribution in that dataset.
Entropy measures the disorder or impurity in a dataset. High entropy means the classes are well
mixed, while low entropy means most data points belong to a single class.
Information Gain (IG) measures the reduction in entropy or impurity after a dataset is split on an attribute. It
represents how much information a feature contributes to making a decision.
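A minimal sketch of these three quantities computed directly from class counts; the counts are made up:

from math import log2

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Parent node: 8 "incident" vs 8 "request" tickets, split into two child nodes.
parent = [8, 8]
left, right = [7, 1], [1, 7]

n = sum(parent)
info_gain = entropy(parent) - (sum(left) / n) * entropy(left) - (sum(right) / n) * entropy(right)

print(round(gini(parent), 3))     # 0.5 (maximally mixed)
print(round(entropy(parent), 3))  # 1.0 (high disorder)
print(round(info_gain, 3))        # reduction in entropy achieved by the split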
What is Gradient Boosting?
Gradient Boosting is an ensemble machine learning technique used for regression and classification tasks. It builds models in a stage-wise fashion and generalizes them by optimizing a differentiable loss function. The idea is to combine the predictions of several weak estimators (typically decision trees) to improve robustness and accuracy.
1. Start with an initial model, usually a simple baseline such as the mean of the target values for regression or the log-odds for classification.
2. Compute the residuals (errors) of the current model, which represent the difference between the actual target values and the values predicted by the model.
3. Fit a weak learner (e.g., a decision tree with limited depth) to the residuals. This learner is trained to predict the residuals from the previous step.
4. Update the model by adding the weighted prediction of the decision tree.
5. Calculate the residuals again using the updated model and repeat.
After many iterations, you end up with a series of small trees. Each tree corrects the errors of the previous one. Your final model combines all these small trees’ predictions.
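A bare-bones sketch of those steps for regression, using shallow scikit-learn trees as the weak learners (synthetic data, illustrative only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # step 1: start from the mean
trees = []

for _ in range(100):
    residuals = y - prediction                                   # step 2: current errors
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step 3: fit a weak learner to the residuals
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)                # steps 4-5: update the model and repeat

print(np.mean((y - prediction) ** 2))   # training error shrinks as trees are added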
Naive Bayes is based on Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions related to the event.
P(A|B) = P(B|A) · P(A) / P(B)
Assumption: Naive Bayes assumes that all features in X (e.g., the words in an email) are independent of each other given the class C. This allows us to calculate the likelihood P(X|C) by multiplying the individual probabilities of each feature. This independence assumption simplifies the computation and makes Naive Bayes scalable to large datasets. Although independence rarely holds in real-world data, the algorithm still often performs well.
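A short sketch with scikit-learn's MultinomialNB on word-count features; the texts and labels are toy placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["server crash error", "password reset request",
         "disk error on server", "new account request"]
labels = ["incident", "request", "incident", "request"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# alpha=1.0 is add-one (Laplace) smoothing, discussed further below.
clf = MultinomialNB(alpha=1.0).fit(X, labels)
print(clf.predict(vec.transform(["error on the mail server"])))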
XGBoost is optimized for performance and
speed through parallel processing, memory
optimization, and tree pruning. It also includes
regularization techniques, which help prevent
overfitting, making it more robust on large
datasets.
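A hedged sketch with the xgboost package (assumed installed); parameter values are illustrative, not tuned settings:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,   # L2 regularization, part of what curbs overfitting
    n_jobs=-1,        # parallel tree construction
)
model.fit(X, y)
print(model.predict(X[:5]))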
Learning Rate: Controls how much the model updates weights with each step.
Number of Layers and Neurons: Determines the architecture of a neural network.
Batch Size: The number of training samples processed before updating the model.
Number of Trees in Random Forest: In ensemble methods, the number of trees
determines the ensemble’s size.
Max Depth: The maximum depth of each tree in decision tree-based models.
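A sketch of tuning two of these hyperparameters for a Random Forest with a grid search; the grid values and dataset are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "n_estimators": [100, 300],   # number of trees in the ensemble
    "max_depth": [None, 5, 10],   # maximum depth of each tree
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)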
Goal: SVM aims to find a hyperplane that separates different classes of tickets with
the largest possible margin.
Data Transformation: For cases where the data isn’t linearly separable, SVM uses
kernel functions (e.g., RBF, polynomial) to map ticket data into higher-dimensional
space, making it easier to draw a separation boundary.
Decision Boundary: The support vectors (data points near the decision boundary)
influence the position of the hyperplane. SVM tries to maximize the margin between
classes (e.g., ticket types like incident or request).
Optimization: It minimizes misclassification by solving a convex optimization
problem, balancing accuracy with generalization.
Backend Functionality:
Kernel Trick: If using non-linear kernels, SVM transforms the input features (e.g.,
ticket description text, priority) into a higher-dimensional space. For text data, it
often combines with techniques like TF-IDF or word embeddings to represent textual
information in numerical form.
Feature Scaling: SVM requires normalized or standardized features to function
effectively, so each ticket attribute is scaled to enhance model performance.
By using SVM with a kernel trick and proper feature scaling, ticket classification can
effectively handle complex, non-linear relationships in ticket data. This approach
enables the SVM model to classify tickets into categories like "incident" or "request"
accurately, improving the efficiency of ticket sorting and response in service
management systems.
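A hedged end-to-end sketch of that idea: TF-IDF features feeding an RBF-kernel SVM in a single scikit-learn pipeline. The tickets are toy examples, and since TfidfVectorizer L2-normalizes each row, no separate scaler is added here:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["vpn connection keeps dropping", "please install new software",
         "email server outage reported", "request additional monitor"]
labels = ["incident", "request", "incident", "request"]

# The TF-IDF step doubles as feature scaling: each ticket vector has unit length.
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", C=1.0))
clf.fit(texts, labels)

print(clf.predict(["the ticketing server is down again"]))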
Laplace Smoothing:
Laplace smoothing is used to handle cases where certain words or phrases in new tickets
do not appear in the training set, which would otherwise result in zero probabilities.
Naive Bayes offers a powerful, efficient way to classify text-based tickets by calculating
probabilities for each class. With assumptions of feature independence and use of
Laplace smoothing, it effectively handles diverse ticket descriptions, providing fast,
interpretable classifications. This model’s efficiency and simplicity make it ideal for
support ticket categorization tasks.
How Laplace Smoothing Works:
Laplace smoothing adds a small, non-zero value (usually 1) to the count of each word in
the vocabulary, ensuring that no probability is ever zero, even if the word is not present
in the training data.
For instance, if a word like "system" does not appear in any "incident" tickets during training, the likelihood P(system | incident) would theoretically be zero, which would force the posterior for "incident" to zero for any ticket containing "system." With Laplace smoothing, we add a small value, say 1, to the word counts so the probability remains small but non-zero:
P(system | incident) = (count of ’system’ in incident tickets + 1) / (total words in incident tickets + total unique words)
This adjustment allows Naive Bayes to generalize better to new tickets with unseen
words, making it more robust for real-world applications.
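A tiny worked version of that formula, with made-up counts:

# Hypothetical counts for the "incident" class.
count_system_in_incident = 0     # "system" never appeared in incident tickets
total_words_in_incident = 500
vocabulary_size = 2000           # total unique words in the training data

# Without smoothing the estimate would be 0/500 = 0, zeroing out the whole posterior.
p_unsmoothed = count_system_in_incident / total_words_in_incident

# With add-one (Laplace) smoothing the estimate stays small but non-zero.
p_smoothed = (count_system_in_incident + 1) / (total_words_in_incident + vocabulary_size)

print(p_unsmoothed)          # 0.0
print(round(p_smoothed, 5))  # 0.0004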
SQL Clauses Execution Order:
FROM: Specifies the source table(s) to retrieve data from.
WHERE: Filters rows based on conditions.
GROUP BY: Groups rows that have the same values in
specified columns into summary rows.
HAVING: Filters groups based on conditions (similar to
WHERE but for grouped rows).
SELECT: Specifies the columns to retrieve and performs any
calculations.
ORDER BY: Sorts the result set based on specified columns.
LIMIT: Limits the number of rows returned.
Write a SQL query to get the highest order count per category:
SELECT category, order_id, order_count
FROM (
    SELECT category, order_id, order_count,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY order_count DESC) AS row_num
    FROM orders
) AS ranked_orders
WHERE row_num = 1;
Time-Series Analysis (Stock Prices): Consider a dataset of daily stock prices where we want to compare the current day’s price with the previous day’s price to detect price changes. The LAG function fetches the price from the previous day, allowing us to calculate the difference in price from the previous day.
RANK Use Case (for Competition Rankings): When ranking participants in a competition where ties exist, RANK can be used. For example, if two participants tie for 2nd place, the next participant is ranked 4th (not 3rd), leaving a gap in the ranking.
DENSE_RANK Use Case (Salary Bands or Employee Ratings): When ranking employees or salaries without gaps in the ranks, DENSE_RANK is useful. For example, when calculating salary bands, if two employees share the same salary, the next salary band should follow sequentially without gaps.
Views in SQL are a kind of virtual table. A view also has rows and columns, just as on a real table in the database.
Window functions perform calculations
across a set of table rows related to the
current row. They’re useful for cumulative
calculations, running totals, ranking, and
moving averages without altering the
dataset’s structure.
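For readers working in Python, a pandas analogue of these window calculations (shift for LAG, rank for RANK and DENSE_RANK); this is a parallel illustration with made-up numbers, not a replacement for the SQL above:

import pandas as pd

prices = pd.DataFrame({"day": [1, 2, 3, 4], "price": [100.0, 103.0, 101.0, 101.0]})
prices["prev_price"] = prices["price"].shift(1)            # like LAG(price, 1)
prices["change"] = prices["price"] - prices["prev_price"]  # day-over-day difference

salaries = pd.Series([90, 80, 80, 70])
print(salaries.rank(method="min", ascending=False))    # RANK: ties leave gaps (1, 2, 2, 4)
print(salaries.rank(method="dense", ascending=False))  # DENSE_RANK: no gaps (1, 2, 2, 3)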