Data Catalogue Management for Enterprise Data Warehouse – ML and GEN AI Based Solution
DISSERTATION
By
Bhujith Kumar B
2022DA04009
September 2024
Acknowledgments:
I would like to express my sincere gratitude to everyone who supported and guided me throughout
the completion of my final year project, "Data Catalogue Management for Enterprise Data Warehouse – ML and GEN AI Based Solution".
First and foremost, I would like to thank Mr. Arun Kumar Shanmugam, Principal Data Scientist at
Verizon, for his invaluable guidance, continuous support, and expert advice throughout the duration
of this project. His insights and mentorship were critical in shaping the technical and analytical aspects of this work.
I am also deeply grateful to my examiner, Prof. T. Selvaraj, BITS Pilani, Rajasthan, for his
constructive feedback and encouragement during the review process. His valuable suggestions helped improve the quality of this work.
Furthermore, I would like to extend my appreciation to the faculty and staff of BITS Pilani, who
provided me with the necessary resources and a conducive environment for academic and personal growth.
Finally, I would like to thank my family and friends for their constant support, motivation, and
patience throughout this journey. This project would not have been possible without their
encouragement.
CERTIFICATE
This is to certify that the Dissertation entitled "Data Catalogue Management for Enterprise Data Warehouse – ML and GEN AI Based Solution", submitted by Mr. B. Bhujith Kumar, ID No. 2022DA04009, in partial fulfilment of the requirements of DSECLZG628T Dissertation, embodies the work done by him under my supervision.
Date: 08/09/2024
Dissertation Abstract
Data Catalog, Data Classification and Metadata Management are essential
components of any modern large-scale Enterprise Data Management and Data
Governance framework. They enable organizations to identify the kind of data they
are storing and processing in order to meet regulatory compliance. As organizations
grow in size, it is imperative that the right governance policies are in place, so that
the right users have access to the right kind of data. Efficient metadata management
also reduces the time employees spend on data discovery, which in turn optimizes
storage and cost. At present, it is becoming increasingly difficult for large
organizations to catalog, classify and store data due to the huge volumes involved
and the manual effort needed from SMEs and Data Stewards. Ensuring the data
catalog is continually updated with the latest data can be challenging, especially in
dynamic environments with frequent data changes. Over time, data can become
outdated or obsolete, and keeping the catalog updated with current and relevant
information requires ongoing effort.
Present-day organizations face huge costs due to non-compliance with data
protection acts like GDPR, CCPA, HIPAA, SOX, PCI DSS, COPPA, PDPA, etc.
For example, Equifax, a credit reporting agency, experienced a data breach that
exposed the personal information of 147 million people, including social security
numbers, birth dates, and addresses, which violated various state data protection
laws and potentially the FCRA (Fair Credit Reporting Act). Equifax ultimately agreed
to a settlement of up to $700 million with the FTC, the Consumer Financial
Protection Bureau (CFPB), and U.S. states.
The goal of this project is to use ML models to catalog metadata into various
pre-identified categories, which helps in masking sensitive and personal data.
In addition, generating metadata descriptions using Gen AI models will help users
understand the data.
Making use of ML and Gen AI models will save significant time and cost (by reducing
the manual effort otherwise needed). This will in turn improve data regulatory
compliance and save costs.
By automating data classification, organizations can ensure that decision-makers
have access to the most relevant and critical information, supporting more
informed and strategic decisions. It enhances business intelligence efforts by
ensuring that high-quality, well-organized data is available for analysis and
reporting. Classifying data can streamline various business processes, such as data
retrieval, processing, and reporting, by making it easier to locate and use the
required data. It also helps in managing the data lifecycle by identifying which data
can be archived, retained, or deleted, ensuring that storage resources are used
efficiently.
1. Introduction
In the rapidly evolving digital landscape, managing vast amounts of data efficiently and securely is
crucial for organizations. Two essential components of modern data management practices are Data
Catalogs and Data Classification. Both play pivotal roles in ensuring data is accessible, usable, and
protected, thereby driving better decision-making, regulatory compliance, and overall data governance. Here
in this project, we are trying to reduce the manual effort in Cataloguing and Classification by using ML
Models and Generative AI to generate metadata descriptions.
2. Objectives
3. Scope of Work
4. Detailed Plan of Work (for 16 weeks)
5. List of Symbols & Abbreviations used
CI Confidential Information
BI Business Information
SI Sensitive Information
DI Data Information
RI Restricted Information
FI Financial Information
II Internal Information
NDA Non-Disclosure Agreement
IP Intellectual Property
6. List of Tables
Tb.1 Considerations for Choosing Various Algorithms
Tb.2 Evaluation Metrics for LLMs
7. List of Figures
Fig.1 System Diagram
Fig.2 Data Flow diagram
8. Table of Contents
Title    Page No.
3.2 Performance Evaluation of Hugging Face Models (BERT, BART, T5, LLAMA) 20
4.0 Conclusion 26
5.0 Bibliography 26
Chapter 1
1.0 Project Overview and Previous Work Referenced
In the rapidly evolving digital landscape, managing vast amounts of data efficiently
and securely is crucial for organizations. Two essential components of modern data
management practices are Data Catalogs and Data Classification. Both play
pivotal roles in ensuring data is accessible, usable, and protected, thereby driving
better decision-making, regulatory compliance, and overall data governance. Here
in this project, we are trying to reduce the manual effort in Cataloguing and
Classification by using ML Models and Generative AI to generate metadata
descriptions.
Objectives:
Fig.1 System diagram
Fig.2 Data Flow diagram
1. Data Identification:
o The project began with a thorough identification and collection of the dataset
required for classification. This step involved understanding the problem
domain, defining the scope of the data needed, and gathering relevant
datasets from various sources. The data collected was then reviewed to
ensure it was appropriate for the intended classification tasks, covering a
wide range of features and outcomes to facilitate robust model training and
evaluation.
2. Data Ingestion:
o Once the data was identified, it was processed and ingested into the working
environment. This stage involved converting the raw data into a structured
format that could be easily manipulated and analyzed. The data was then
stored in a database or a suitable file format, ensuring it was accessible for
further analysis. This step also included initial data exploration to
understand the underlying structure, distributions, and any potential
anomalies present in the dataset.
3. Data Cleaning:
o Data cleaning was a critical step to ensure the quality and reliability of the
dataset. This process involved handling missing values by either imputing
them with appropriate statistics or removing incomplete records. Outliers
were identified and treated to prevent them from skewing the analysis. Data
inconsistencies were resolved to maintain uniformity, and features were
normalized or standardized to bring them onto a common scale. These steps
were crucial to preparing a clean dataset that would yield accurate and
meaningful results during model training.
4. Model Evaluation:
o With the cleaned data, various machine learning algorithms were evaluated
to determine their suitability for the classification task. This phase involved
experimenting with different models such as Decision Trees, K-Nearest
Neighbors, and Logistic Regression. Each model's performance was assessed
using relevant metrics like accuracy, precision, recall, and F1 score. These
metrics provided a comprehensive view of how well each model could
classify the data, highlighting their strengths and weaknesses.
5. Algorithm Selection:
o Based on the performance metrics obtained during evaluation, the top three
algorithms were selected for further analysis. These algorithms
demonstrated the highest potential in terms of classification accuracy and
other relevant criteria. The chosen models included Logistic Regression,
which stood out due to its simplicity, interpretability, and strong
performance metrics, making it a prime candidate for the final model.
6. Model Application:
o The selected algorithms were then applied to classify the dataset. This
involved training each model on the training data and testing them on a
separate validation set to evaluate their real-world performance. The
accuracy metrics for each model were generated and compared to identify
the best performer. Logistic Regression performed strongly, delivering high
accuracy and demonstrating robustness across different validation sets.
o Ultimately, Random Forest was identified as the best model for this
classification task. Its performance was superior in terms of accuracy and
other key metrics, confirming its effectiveness for the given dataset. The
relative simplicity and interpretability of Random Forest also made it a
favorable choice, providing clear insight into the relationship between the
features and the target variable. (A minimal Python sketch of this evaluation
and comparison appears after this list.)
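The listing below is a minimal sketch of the evaluation and comparison described above. It assumes a pandas DataFrame loaded from a hypothetical file metadata_samples.csv with numeric feature columns and a categorical "label" column; the file name, column names and hyperparameters are illustrative assumptions, not the exact setup used in this project.

Example (Python):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

df = pd.read_csv("metadata_samples.csv")          # hypothetical input file
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)                   # train on the training split
    preds = model.predict(X_test)                 # evaluate on the held-out split
    acc = accuracy_score(y_test, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, preds, average="weighted", zero_division=0)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

Logistic Regression and KNN are wrapped in a scaling pipeline because both are sensitive to feature scale, while the tree-based models are not.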
Descriptions for the column names were generated using the Hugging Face T5
model. The model produced concise explanations for each column, mapping raw
column names to their meanings or functions, which streamlined the process of
turning raw column names into readable and useful metadata (a minimal sketch of
this step follows).
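As a rough illustration of this step, the sketch below uses the Hugging Face transformers pipeline with the t5-base checkpoint to turn column names into short descriptions. The prompt wording and the column names are illustrative assumptions only.

Example (Python):

from transformers import pipeline

# text2text-generation wraps T5's encoder-decoder generate() call
generator = pipeline("text2text-generation", model="t5-base")

columns = ["cust_acct_id", "txn_amt", "last_login_ts"]   # hypothetical column names
for col in columns:
    prompt = f"explain the meaning of the database column name: {col}"
    out = generator(prompt, max_new_tokens=40)
    print(col, "->", out[0]["generated_text"])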
Large Language Models (LLMs) are a class of artificial intelligence models trained on large
amounts of text data to understand and generate human-like text. They have exhibited
non-trivial capabilities in various natural language processing (NLP) tasks and have
applications in many domains. These are neural network models designed to process and
generate text. They are predominantly based on architectures such as the Transformer,
which allows them to handle complex language patterns at scale. Key
characteristics of LLMs include:
• Scale: They are trained on extensive datasets, often comprising billions or even trillions
of words.
• Pre-training and Fine-tuning: They are initially pre-trained on large corpora and
can be fine-tuned as needed.
• Context Understanding: They have the ability to understand and generate text
appropriate to the situation or context, and are therefore helpful in handling language-related tasks.
Applications of LLMs:
1. Text Generation:
Use Case: Content creation, chatbot responses.
Examples: Generating articles, stories, and dialogue in conversational
agents.
2. Text Summarization:
Use Case: Simplifying long documents into shorter summaries.
Examples: Summarizing news articles, research papers, or legal documents.
3. Question Answering:
Use Case: Answering questions based on input text or general
knowledge.
Examples: Customer support, educational tools.
4. Translation:
Use Case: Translating text from one language to another.
Examples: Website localization, cross-lingual communication.
5. Sentiment Analysis:
Use Case: Determining the sentiment expressed in a piece of text.
Examples: Analysing customer feedback, social media monitoring.
6. Text Classification:
Use Case: Categorizing text into predefined classes.
Examples: Spam detection, topic classification.
7. Paraphrasing:
Use Case: Rewriting text in different ways while preserving the meaning.
Examples: Generating alternative phrasings for content creation.
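As a concrete illustration of the text-classification use case above, the sketch below applies a zero-shot classification pipeline to tag a piece of column-level metadata with one of the sensitivity categories used in this project; the model choice (facebook/bart-large-mnli) and the example text are assumptions made for illustration.

Example (Python):

from transformers import pipeline

# Zero-shot classification scores the text against arbitrary candidate labels
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["Confidential Information", "Financial Information", "Internal Information"]
text = "Customer social security number captured during account creation"
result = classifier(text, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))   # highest-scoring category first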
1.2. RNNs (Recurrent Neural Networks)
Overview:
Pros:
2. Basic Memory: They can capture sequential dependencies in data, although only to a
limited extent.
Cons:
2. Limited Long-term Dependencies: They are not proficient at maintaining long-term
context, especially over longer sequences.
1.2.1. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
Overview:
LSTMs and GRUs emerged to address the limitations of traditional RNNs. They introduce
gating mechanisms that regulate the flow of information, which enhances their ability to
capture long-term dependencies.
Pros:
Cons:
1. Complexity: They are more complex and computationally intensive than simpler RNNs.
2. Training Time: They take longer to train because of the additional parameters.
1.2.2. Attention Mechanisms
Overview:
Pros:
1. Selective Focus: Attention can weigh the importance of different parts of the input,
improving the model's understanding of context.
Cons:
1.2.3. Transformers
Overview:
Transformers were introduced in the paper "Attention Is All You Need" (2017). These
models are built on self-attention mechanisms and discard the recurrence mechanism of
RNNs and LSTMs.
Pros:
1. Parallel Processing: Data is processed in parallel rather than sequentially, which greatly
speeds up training.
3. Scalability: The architecture scales effectively with both model size and the amount of
training data.
Cons:
1.2.4. Pre-trained Transformer Models
Overview:
Transformers marked a real shift in the field when the paper "Attention Is All You
Need" was published in 2017. They operate through self-attention and completely
drop the recurrence mechanism of RNNs and LSTMs. Transformers use a network
architecture based on self-attention, which takes into account the relationships
between all relevant words in a given sentence. This mechanism breaks away from
the traditional recurrent architecture, in which the output of a unit is fed back into
itself. As a result, transformer networks are capable of capturing long-distance
dependencies in sequence data better than their recurrent counterparts.
Pros:
2. Versatility: These models can be adapted to a variety of tasks with only fine-tuning.
Cons:
1. Training Cost: Training these models from scratch is costly and resource-intensive.
2. Over-fitting: Fine-tuning on small datasets can result in overfitting if not handled
properly.
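The listing below is a toy NumPy implementation of the scaled dot-product self-attention described above (single head, no masking), intended only to make the mechanism concrete; the dimensions and random weights are illustrative.

Example (Python):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every token with every other
    weights = softmax(scores, axis=-1)          # attention weights per token
    return weights @ V                          # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 tokens, 16-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 16)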
Evolution Summary:
1. Early RNNs: They offer basic sequential processing but struggle with long-term
dependencies.
2. LSTMs and GRUs: They manage long-term dependencies better, but at the cost of added
complexity.
4. Transformers: A breakthrough in NLP, relying on self-attention and parallelization to
provide functionally superior models.
5. Pre-trained Transformers: Large-scale pre-training allows them to dominate across a
wide range of NLP tasks.
Chapter 2
Commercial LLM Services:
1. OpenAI's GPT-3/GPT-4: These are considered more advanced due to their ability to
generate human-like text and perform a wide range of NLP tasks.
2. Google's AI models: Includes models like BERT and T5
3. Microsoft Azure's Cognitive Services: Provides a combination of tools and APIs for
various NLP tasks, including text analysis and language understanding.
4. Typically offered through cloud-based APIs.
5. Pricing is usually pay-per-use, which is scalable, but costs may vary depending on
usage and complexity.
Open-Source Libraries:
1. Hugging Face's Transformers library: This includes models like BERT, GPT-2, and
T5. These models are widely used in the NLP community.
2. Available for local deployment.
3. Can be fine-tuned for specific tasks, providing flexibility and control over the
models.
Pre-trained Models from Model Hubs:
1. Models available for download from platforms like Hugging Face's Model Hub.
2. Can be used locally for specific NLP tasks.
3. Pre-trained models are ready for use and can be further fine-tuned as needed.
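A minimal sketch of this local-deployment path is shown below: a pre-trained checkpoint is downloaded from the Hugging Face Model Hub and run locally. The checkpoint name (distilbert-base-uncased-finetuned-sst-2-english) is just a small public example, not the model used in this project.

Example (Python):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)            # downloads from the Model Hub
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("The quarterly revenue column holds billing totals.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                               # run locally, no API call
print(model.config.id2label[int(logits.argmax(dim=-1))])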
Considerations when Choosing an LLM:

1. Integration
• APIs:
o Easy to integrate into applications through RESTful APIs.
o Suitable for applications that need to scale quickly without worrying about
infrastructure.
• Libraries:
o Tools like Hugging Face Transformers provide easy-to-use libraries for
various models.
o Allow for greater customization and flexibility, but may require more setup.
2. Cost
• Commercial APIs:
o Usage-based pricing can be costly for high volume or complex queries.
o Provides ease of use and reliability but at a potentially higher cost.
• Open Source:
o Free to use but may require infrastructure for deployment.
o Cost-effective for organizations with the capability to manage their own
servers and deployments.
3. Customization
• Fine-tuning:
o Models can be fine-tuned on specific datasets to improve performance for
particular tasks.
o Requires some expertise in machine learning and access to relevant data.
• Pre-trained:
o Models can be used out-of-the-box for general tasks without additional
training.
o Useful for quick deployment and for tasks where general language
understanding is sufficient.
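For the fine-tuning option above, a minimal sketch using the Hugging Face Trainer is given below. The CSV file name, the text/label column names and the number of categories are illustrative assumptions; the sketch only outlines the typical flow of fine-tuning a pre-trained checkpoint on a small labelled dataset.

Example (Python):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# hypothetical CSV with a "text" column and an integer "label" column
dataset = load_dataset("csv", data_files="labelled_metadata.csv")["train"].train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=7)   # assumed number of sensitivity categories

args = TrainingArguments(output_dir="metadata-classifier",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()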
4. Performance
• Accuracy:
o The performance of LLMs can vary based on the model and task.
o Generally, models like GPT-3 and BERT perform very well on a wide range of
tasks.
• Latency:
o Considerations for response time, especially for real-time applications.
o Cloud-based solutions may introduce some latency compared to locally
deployed models.
5. Ease of Use
• APIs:
o Commercial APIs usually come with user-friendly documentation and
support.
o Ideal for developers looking to quickly add NLP capabilities to their
applications.
• Libraries:
o Open-source libraries may require more setup but offer flexibility and
control.
o Suitable for developers with more technical expertise who need custom
solutions.
Summary
Large Language Models (LLMs) are powerful tools in natural language processing with a
range of applications from text generation to question answering. They are available
through commercial APIs, open-source libraries, and hosted services. Choosing the right
LLM involves evaluating factors such as accuracy, cost, ease of integration, and
performance based on your specific use case.
Chapter 3
• Architecture: Treats every NLP task as a text-to-text problem, using the same
model, loss function, and hyperparameters.
• Use Cases: Highly versatile, can be used for translation, summarization,
classification, and QA.
• Strengths: Unified approach to NLP tasks, highly flexible.
• Limitations: Requires large amounts of data for fine-tuning specific tasks, which
can be computationally expensive.
Hugging Face provides an easy-to-use interface for these models through the transformers
library, which supports a wide range of pre-trained models. This makes it straightforward
to experiment with and deploy these models for our use case.
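To illustrate the text-to-text framing described above, the short sketch below sends two different tasks to the same t5-base checkpoint, distinguished only by the task prefix in the input text; the example sentences are arbitrary.

Example (Python):

from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-base")

# same model, different task, selected purely by the input prefix
print(t5("translate English to German: The data catalog is updated daily.")[0]["generated_text"])
print(t5("summarize: The data catalog stores table names, column names, owners, refresh "
         "schedules and sensitivity tags so that analysts can discover data quickly.",
         max_new_tokens=30)[0]["generated_text"])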
Summary
Tb.1 Considerations before choosing LLM
Since all four shortlisted models have been compared above, LLaMA and T5 were chosen
for our use case.
I have decided to use Google Cloud's Vertex AI API service, which runs LLaMA and T5 in the
background, to generate the metadata descriptions.
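A heavily hedged sketch of calling such a Vertex AI deployment is shown below. The project id, region, endpoint id and request/response schema are all illustrative assumptions and depend entirely on how the model is actually deployed; they are not taken from this project's configuration.

Example (Python):

from google.cloud import aiplatform

# hypothetical project, region and endpoint id
aiplatform.init(project="my-gcp-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-gcp-project/locations/us-central1/endpoints/1234567890")

# request schema assumed: one simple prompt per instance
response = endpoint.predict(instances=[
    {"prompt": "Generate a one-line description for the column cust_acct_id"}])
print(response.predictions[0])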
10. Conclusions:
1. Random Forest was identified as the best model, with 70% prediction accuracy.
2. The T5 Base model was used to generate descriptions for the column names.
11. Bibliography / References
The following journals and papers were referred to during the preliminary literature review.
5. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault,
T., Louf, R., Funtowicz, M., & Brew, J. (2020). "Transformers: State-of-the-Art
Natural Language Processing". Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations.
6. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W.,
& Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-
to-Text Transformer". Journal of Machine Learning Research.
7. See, A., Liu, P. J., & Manning, C. D. (2017). "Get To The Point: Summarization with
Pointer-Generator Networks". Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers).
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł.,
& Polosukhin, I. (2017). "Attention Is All You Need". Advances in Neural Information
Processing Systems (NeurIPS). Retrieved from https://arxiv.org/abs/1706.03762
9. Hugging Face. (2023). Transformers Library. Hugging Face Documentation.
Retrieved from https://huggingface.co/transformers/
12. Appendices:
AI Artificial Intelligence
ML Machine Learning
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
Work Integrated Learning Programmes Division
II SEMESTER 23-24
DSECLZG628T DISSERTATION
(Final Evaluation Sheet)
ID NO. : 2022DA04009
EVALUATION DETAILS
Organizational Mentor
Name Arun Kumar Shanmugam
Qualification M.Tech
Designation & Address Principal Data Scientist
Email Address [email protected]
Signature
Date 08/09/2024
NB: Kindly ensure that recommended final grade is duly indicated in the above evaluation sheet.
The Final Evaluation Form should be submitted separately in the viva portal.