
Data Catalog Management for Enterprise Data Warehouse- ML and

Gen AI Based Solution

DISSERTATION

Submitted in partial fulfillment of the requirements of the

Degree: MTech in Data Science and Engineering

By

(Bhujith Kumar B)
(2022DA04009)

Under the supervision of

(Arun Kumar Shanmugam, Principal Data Scientist)

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE


Pilani (Rajasthan) INDIA

(September, 2024)
Acknowledgments:

I would like to express my sincere gratitude to everyone who supported and guided me throughout

the completion of my final year project, "Data Catalog Management for Enterprise Data

Warehouse - ML and Gen AI Based Solution."

First and foremost, I would like to thank Mr. Arun Kumar Shanmugam, Principal Data Scientist at

Verizon, for his invaluable guidance, continuous support, and expert advice throughout the duration

of this project. His insights and mentorship were critical in shaping the technical and analytical

aspects of this research.

I am also deeply grateful to my examiner, Prof. T. Selvaraj, BITS Pilani, Rajasthan, for his

constructive feedback and encouragement during the review process. His valuable suggestions

helped me refine my work and stay on course.

Furthermore, I would like to extend my appreciation to the faculty and staff of BITS Pilani, who

provided me with the necessary resources and a conducive environment for academic and personal

growth during my studies.

Finally, I would like to thank my family and friends for their constant support, motivation, and

patience throughout this journey. This project would not have been possible without their

encouragement.

Thank you all for making this accomplishment possible.


BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI

CERTIFICATE

This is to certify that the Dissertation entitled Data Catalogue Management for Enterprise

Data Warehouse – ML and GEN AI Based Solution and submitted by Mr. B.Bhujith Kumar ID

No. 2022DA04009 in partial fulfilment of the requirements of DSECLZG628T Dissertation,

embodies the work done by him/her under my supervision.

Signature of the Supervisor


Name: Arun Kumar Shanmugam

Place: Chennai Designation : Principal Data Scientist

Date: 08/09/2024
Dissertation Abstract
Data Catalog, Data Classification and Metadata Management are essential
components of any modern large-scale Enterprise Data Management and Data
Governance framework. They enable organizations to identify the kind of data they
are storing and processing in order to meet regulatory compliance. As organizations grow in size,
it is imperative that the right governance policies are in place, so that the right users have access
to the right kind of data. Efficient metadata management also reduces the time employees spend on
data discovery,
which in turn optimizes storage and cost. It is becoming increasingly
difficult for large organizations to catalog, classify and store data due to the
huge volumes involved and the manual effort needed from SMEs and Data Stewards. Ensuring
the data catalog is continually updated with the latest data can be challenging,
especially in dynamic environments with frequent data changes. Over time, data can
become outdated or obsolete, and keeping the catalog current and
relevant requires ongoing effort.

Present-day organizations face huge costs due to non-compliance with data
protection regulations such as GDPR, CCPA, HIPAA, SOX, PCI DSS, COPPA and PDPA.
For example, Equifax, a credit reporting agency, experienced a data breach that exposed the
personal information of 147 million people, including social security numbers, birth
dates, and addresses, which violated various state data protection laws and
potentially the FCRA (Fair Credit Reporting Act). Equifax ultimately agreed to a settlement of
up to $700 million with the FTC, the Consumer Financial Protection Bureau (CFPB),
and U.S. states.

The goal of this project is to use ML models to catalog the metadata into various
pre-identified categories, which helps in masking sensitive and personal data.
In addition, generating metadata descriptions using Gen AI models will help users
understand the data.
Making use of ML and Gen AI models will save considerable time and cost (by reducing the
manual effort) that would otherwise be needed. This in turn improves data regulatory
compliance and reduces costs.

By automating data classification, organizations can ensure that decision-makers
have access to the most relevant and critical information, supporting more
informed and strategic decisions. It enhances business intelligence efforts by
ensuring that high-quality, well-organized data is available for analysis and
reporting. Classifying data can streamline various business processes, such as data
retrieval, processing, and reporting, by making it easier to locate and use the
required data. It also helps in managing the data lifecycle by identifying which data
can be archived, retained, or deleted, ensuring that storage resources are used
efficiently.

Key Words: Data Catalog, Data Management, Data Classification, Metadata
Management, Data Governance, Scalable Solution, Policy Term Classification

1.Broad Area of Work

In the rapidly evolving digital landscape, managing vast amounts of data efficiently and securely is
crucial for organizations. Two essential components of modern data management practices are Data
Catalogs and Data Classification. Both play pivotal roles in ensuring data is accessible, usable, and
protected, thereby driving better decision-making, regulatory compliance, and overall data governance. Here
in this project, we are trying to reduce the manual effort in Cataloguing and Classification by using ML
Models and Generative AI to generate metadata descriptions.

2.Objectives

The objectives of my project are as follows:



• Improve Data Accessibility and Usability: Enhance the ability of users to
discover, understand, and utilize data efficiently across the organization.
• Ensure Data Security and Compliance: Strengthen data protection
measures and ensure adherence to relevant regulatory standards and policies.
• Optimize Data Management Practices: Streamline data storage, retrieval,
and lifecycle management to maximize efficiency and reduce costs.
• Develop Comprehensive Metadata Management: Implement robust
metadata management practices that include technical, business, and operational
metadata to provide full context for data assets.
• Enhance Data Discovery Capabilities: Create advanced search and data
profiling tools that enable users to quickly find and understand the data they need.
• Strengthen Data Governance: Ensure consistent enforcement of data
policies, standards, and access controls through the data catalog.
• Promote User Adoption: Increase user engagement and proficiency with
the data catalog through training programs and intuitive user interfaces.
• Integrate Diverse Data Sources: Consolidate metadata from various and
often siloed data sources to provide a unified view of the organization's data assets.
• Implement Accurate Data Classification Techniques: Develop and utilize
accurate and scalable data classification methodologies to ensure precise
categorization of data.
• Automate Data Classification Processes: Employ machine
learning and rule-based systems to automate the classification of large
volumes of data, reducing manual effort and increasing efficiency.
• Enhance Data Security: Apply appropriate security measures
based on data classification to protect sensitive information and prevent
unauthorized access.
• Ensure Regulatory Compliance: Align data classification
practices with regulatory requirements to facilitate compliance with laws such
as GDPR, CCPA, HIPAA, etc.
• Support Data Lifecycle Management: Use data classification to
manage data retention, archival, and deletion processes efficiently, ensuring
optimal use of storage resources.
• Synergize Data Classification and Cataloging: Develop
integrated workflows that leverage data classification to enrich metadata in
data catalogs, enhancing data discovery and governance.
• Utilize AI and Machine Learning: Incorporate AI and machine
learning technologies to improve the automation and accuracy of both data
classification and cataloging processes.
• Enable Real-Time Data Management: Establish real-time
capabilities for data classification and cataloging to handle dynamic data
environments effectively.
• Enhance User Interfaces: Design user-friendly interfaces for
data catalogs that make it easier for users to interact with and benefit from
classified data.
• Monitor and Improve Data Quality: Use data classification and
cataloging to continuously monitor and improve data quality, ensuring reliable
and accurate data for decision-making.

3. Scope of Work

The scope of this dissertation is to design and develop:


• An ML-based system which will classify the metadata based on the
column name and other inputs as per the given business terms.
• A Generative AI model which will generate descriptions
based on the column name and type.
• A dashboard to present the Classification metrics to the users.
• Final report of all objectives and Outcomes.

4. Detailed Plan of Work (Sample) (for 16 weeks)

The plan of work should have tangible weekly or fortnightly milestones and deliverables,
which can be measured to assess the adherence to the plan and therefore the rate of
progress in the work. The plan of work can be specified in the table given below:

Serial Number of Task/Phase | Tasks or subtasks to be done (be precise and specific) | Start Date - End Date | Planned duration in weeks | Specific Deliverable in terms of the project
1 | Data Collection, Pre-Processing, EDA | 04-05-2024 to 04-06-2024 | 4 | Processed Data for Classification
2 | Evaluate various ML Algorithms for Best fit | 05-06-2024 to 19-06-2024 | 2 | Set of Algorithms for Classification
3 | Train the dataset based on the algorithms | 20-06-2024 to 18-07-2024 | 4 | Initial Trained Model and Dataset
4 | Test the models for accuracy and retrain the model for improvement if needed | 19-07-2024 to 09-08-2024 | 3 | Final Output
5 | Metric Reporting and Final Report Preparation | 10-08-2024 to 31-08-2024 | 3 | Final Report Document

5. List of Symbols & Abbreviations used

Symbol/Abbreviation Full Form/Meaning


PII Personally Identifiable Information

PHI Protected Health Information

PCI Payment Card Information

SPI Sensitive Personal Information

CI Confidential Information

BI Business Information

SI Sensitive Information

DI Data Information

RI Restricted Information

FI Financial Information

II Internal Information

NDA Non-Disclosure Agreement

IP Intellectual Property

HR Human Resources information

6. List of Tables
Tb.1 Considerations for Choosing Various Algorithms
Tb.2 Evaluation metrics for LLMs

7. List of Figures
Fig.1 System Diagram
Fig 2 Data Flow diagram

8.Table of Contents

Title page No

1.0 Project overview & referenced previous works 8

1.1 Progress Summary as of now 10

1.2 Evolution of LLM Architecture 12

2.0 Availability of LLMs 16

2.1 Accessibility of LLMs 17

3.0 Identification of LLMs for Metadata Summarization 18

3.1 Considerations before choosing an LLM 19

3.2 Performance Evaluation of Hugging Face Models (BERT, BART, T5, LLAMA) 20

4.0 Conclusion 26

5.0 Bibliography 26

Chapter 1
1.0 Project Overview and previous work referenced for the same.

In the rapidly evolving digital landscape, managing vast amounts of data efficiently
and securely is crucial for organizations. Two essential components of modern data
management practices are Data Catalogs and Data Classification. Both play
pivotal roles in ensuring data is accessible, usable, and protected, thereby driving
better decision-making, regulatory compliance, and overall data governance. Here
in this project, we are trying to reduce the manual effort in Cataloguing and
Classification by using ML Models and Generative AI to generate metadata
descriptions.

Objectives:

The objectives of my project are as follows:


• Improve Data Accessibility and Usability: Enhance the ability
of users to discover, understand, and utilize data efficiently across the
organization.
• Ensure Data Security and Compliance: Strengthen data
protection measures and ensure adherence to relevant regulatory
standards and policies.
• Optimize Data Management Practices: Streamline data storage,
retrieval, and lifecycle management to maximize efficiency and reduce costs.
• Develop Comprehensive Metadata Management: Implement
robust metadata management practices that include technical, business, and
operational metadata to provide full context for data assets.
• Enhance Data Discovery Capabilities: Create advanced search and data
profiling tools that enable users to quickly find and understand the data they need.
• Strengthen Data Governance: Ensure consistent enforcement of
data policies, standards, and access controls through the data catalog.
• Promote User Adoption: Increase user engagement and
proficiency with the data catalog through training programs and intuitive
user interfaces.
• Integrate Diverse Data Sources: Consolidate metadata from
various and often siloed data sources to provide a unified view of the
organization's data assets.
• Implement Accurate Data Classification Techniques: Develop
and utilize accurate and scalable data classification methodologies to ensure
precise categorization of data.

Fig.1 System diagram

ML Model Flow in our Solution:

Fig.2 Data Flow diagram

1.1 Progress Summary as of Now

1. Data Identification:

o The project began with a thorough identification and collection of the dataset
required for classification. This step involved understanding the problem
domain, defining the scope of the data needed, and gathering relevant
datasets from various sources. The data collected was then reviewed to
ensure it was appropriate for the intended classification tasks, covering a
wide range of features and outcomes to facilitate robust model training and
evaluation.

2. Data Processing and Ingestion:

o Once the data was identified, it was processed and ingested into the working
environment. This stage involved converting the raw data into a structured
format that could be easily manipulated and analyzed. The data was then
stored in a database or a suitable file format, ensuring it was accessible for
further analysis. This step also included initial data exploration to
understand the underlying structure, distributions, and any potential
anomalies present in the dataset.
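
As a rough illustration of this ingestion and initial exploration step, the following Python sketch loads the collected metadata into a pandas DataFrame; the file name and fields are placeholders rather than the actual project artefacts.

import pandas as pd

# Load the raw metadata extract into a structured, analyzable form
# ("metadata_columns.csv" is a placeholder for the collected dataset).
df = pd.read_csv("metadata_columns.csv")

# Initial exploration: structure, distributions, and potential anomalies
df.info()                            # column types and non-null counts
print(df.describe(include="all"))    # basic distribution summary
print(df.isna().sum())               # missing values per column
print(df.duplicated().sum())         # duplicate records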

3. Data Cleaning:

o Data cleaning was a critical step to ensure the quality and reliability of the
dataset. This process involved handling missing values by either imputing
them with appropriate statistics or removing incomplete records. Outliers
were identified and treated to prevent them from skewing the analysis. Data
inconsistencies were resolved to maintain uniformity, and features were
normalized or standardized to bring them onto a common scale. These steps
were crucial to preparing a clean dataset that would yield accurate and
meaningful results during model training.
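
The sketch below illustrates these cleaning steps (imputation, outlier treatment, and scaling) with pandas and scikit-learn, continuing from the ingestion sketch above; the field names used are hypothetical and the exact treatment applied in the project may have differed.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# df is the ingested DataFrame; "column_name" and "name_length" are hypothetical fields.
df = df.drop_duplicates()

# Handle missing values: fill text fields with a sentinel, numeric fields with the median
df["column_name"] = df["column_name"].fillna("unknown")
df["name_length"] = df["name_length"].fillna(df["name_length"].median())

# Treat outliers by capping numeric values at the 1st and 99th percentiles
low, high = df["name_length"].quantile([0.01, 0.99])
df["name_length"] = df["name_length"].clip(lower=low, upper=high)

# Standardize numeric features onto a common scale
df[["name_length"]] = StandardScaler().fit_transform(df[["name_length"]])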

4. Model Evaluation:

o With the cleaned data, various machine learning algorithms were evaluated
to determine their suitability for the classification task. This phase involved
experimenting with different models such as Decision Trees, K-Nearest
Neighbors, and Logistic Regression. Each model's performance was assessed
using relevant metrics like accuracy, precision, recall, and F1 score. These
metrics provided a comprehensive view of how well each model could
classify the data, highlighting their strengths and weaknesses.
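
A minimal sketch of this evaluation loop with scikit-learn is shown below; representing column names as character n-gram TF-IDF features and using a "category" label column are assumptions for illustration, not necessarily the exact feature set used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed inputs: df["column_name"] holds column names, df["category"] the business-term label
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(df["column_name"])
y = df["category"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, preds, average="weighted", zero_division=0)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")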

5. Algorithm Selection:

o Based on the performance metrics obtained during evaluation, the top three
algorithms were selected for further analysis. These algorithms
demonstrated the highest potential in terms of classification accuracy and
other relevant criteria. The chosen models included Logistic Regression,
which stood out due to its simplicity, interpretability, and strong
performance metrics, making it a prime candidate for the final model.

6. Model Application:

o The selected algorithms were then applied to classify the dataset. This
involved training each model on the training data and testing them on a
separate validation set to evaluate their real-world performance. The
accuracy metrics for each model were generated and compared to identify
the best performer. Logistic Regression emerged as the best-performing
algorithm, delivering the highest accuracy and demonstrating robustness
across different validation sets.

7. Best Model Identification:

o Random Forest was identified as the best model for this classification task.
Its performance was superior in terms of accuracy and other key metrics,
confirming its effectiveness for the given dataset. The simplicity and
interpretability of Random Forest also made it a favorable choice,
providing clear insights into the relationship between the features and the
target variable.

o Random Forest Algorithm Metrics:

o Decision Tree Algorithm Metrics:

o Logistic Regression Algorithm Metrics:

8. Description Generation using LLMs:

I generated descriptions for my column names using the Hugging Face T5
model. The model produced concise explanations for each column, mapping the
names to their meanings or functions, which helped streamline the process of
turning raw column names into readable and useful metadata.
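
A simplified sketch of this description-generation step with the Hugging Face transformers library is shown below; the prompt format and the use of the t5-base checkpoint out of the box are assumptions for illustration (in practice the model can also be fine-tuned on curated name/description pairs).

from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-base is assumed here; the Conclusions mention using the T5 base model.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def describe_column(column_name: str, column_type: str) -> str:
    # Hypothetical prompt asking the model to explain the column in plain English
    prompt = f"describe database column: name={column_name}, type={column_type}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(describe_column("cust_acct_num", "VARCHAR"))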

1.1 What Are LLMs?

Large Language Models (LLMs) are a class of artificial intelligence models trained on large
amounts of text data to understand and generate human-like text. They have exhibited
non-trivial capabilities in various natural language processing (NLP) tasks and have
applications across many domains. They are neural network models designed to process and
generate text, predominantly based on architectures such as the Transformer,
which allows them to handle large volumes of complex language patterns. Key
characteristics of LLMs include:

• Scale: They are trained on extensive datasets, often containing billions or trillions of
words.
• Pre-training and Fine-tuning: They are initially pre-trained on large corpora and
can be fine-tuned as needed.
• Context Understanding: They are able to understand and generate text according to
the situation or context, and are therefore helpful in handling language-related tasks.

Applications of LLMs:

1. Text Generation:
Use Case: Content creation, chatbot responses.
Examples: Generating articles, stories, and dialogue in conversational
agents.
2. Text Summarization:
Use Case: Simplifying long documents into shorter summaries.
Examples: Summarizing news articles, research papers, or legal documents.
3. Question Answering:
Use Case: Answering questions asked based on input text or general
knowledge.
Examples: Customer support, educational tools.
4. Translation:
Use Case: Translating text from one language to another.
Examples: Website localization, cross-lingual communication.
5. Sentiment Analysis:
Use Case: Determining the sentiment expressed in a piece of text.
Examples: Analysing customer feedback, social media monitoring.
6. Text Classification:
Use Case: Categorizing text into predefined classes.
Examples: Spam detection, topic classification.
7. Paraphrasing:
Use Case: Rewriting text in different ways while preserving the meaning.
Examples: Generating alternative phrasings for content creation.
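
Many of the tasks above are exposed as ready-made pipelines in the Hugging Face transformers library. The short sketch below shows summarization and sentiment analysis as two examples; the default pipeline checkpoints are downloaded automatically and are used here only for illustration, not as a project choice.

from transformers import pipeline

# Text summarization: condense a longer passage into a short summary
summarizer = pipeline("summarization")
text = ("Data catalogs hold metadata about the tables and columns of an enterprise "
        "data warehouse, helping users discover, understand, and govern data assets.")
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])

# Sentiment analysis: classify the sentiment expressed in a piece of text
classifier = pipeline("sentiment-analysis")
print(classifier("The new data catalog makes finding datasets much easier."))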

1.2 Evolution of LLM Architectures:

1.2.0. Early Neural Networks and RNNs:

Overview:

Early neural network architectures were mostly feedforward, later complemented by
Recurrent Neural Networks (RNNs). RNNs were designed to handle sequential data and
were among the first approaches used to model natural language.

Pros:

1. Sequential Processing: Handles sequences of data by maintaining an internal state.

2. Basic Memory: Can capture some sequential dependencies in the data, though not
consistently.

Cons:

1. Vanishing/Exploding Gradient Problem: Struggled with longer sequences because of
gradient issues during training.

2. Limited Long-term Dependencies: They were not proficient at maintaining long-term
context over longer sequences.

Example Models: Simple RNNs

1.2.1. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)

Overview:

LSTM and GRU emerged to tackle the limitations of traditional RNNs. They introduced
gating mechanisms that manage the flow of information, enhancing their ability to
capture long-term dependencies.

Pros:

1. Improved Long-term Memory: Better able to learn long-term dependencies effectively.

2. Gating Mechanisms: Help govern the flow of information.

Cons:
1. Complexity: More complex and computationally intensive than simpler RNNs.
2. Training Time: Takes longer to train because of the additional parameters.

Example Models: LSTM networks, GRU networks

1.2.2. Attention Mechanism

Overview:

The attention mechanism was introduced to address the limitations of RNNs and
LSTMs by allowing models to focus on different parts of the input sequence.

Pros:

1. Selective Focus: Can weigh the importance of different parts of the input, improving the
understanding of context.

2. Parallelization: Allows for more efficient training compared to purely sequential models.

Cons:

Computational Overhead: Adds complexity and increases computational requirements.


Example Models: Attention was first used as an addition to recurrent encoder-decoder
models, before Transformers.

1.2.3. Transformers

Overview:

Transformers were introduced in the paper "Attention Is All You Need" (2017). These
models are built on self-attention mechanisms and discard the recurrence
mechanism of RNNs and LSTMs.

Pros:

1. Parallel Processing: Processes data in parallel rather than sequentially, which
speeds up training greatly.

2. Self-Attention: Captures dependencies among all positions in the input sequence,
significantly improving performance on many NLP tasks.

3. Scalability: Scales effectively with both model size and the amount of training data.

Cons:

1. Resource Intensive: Requires significant computational capacity and memory.

2. Complexity: More complex than earlier architectures, both in the architecture itself and
in the training requirements.

Example Models: BERT (Bidirectional Encoder Representations from Transformers),
GPT (Generative Pre-trained Transformer), T5 (Text-To-Text Transfer Transformer)
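
To make the self-attention idea concrete, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation behind Transformers; the dimensions are arbitrary illustrative values.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of value vectors

# Toy example: 4 tokens, d_k = 8; every token attends to every other token in parallel
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)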

1.2.4. Pre-trained and Fine-tuned Transformers

Overview:

Transformers marked a real shift when the paper "Attention Is All You
Need" was published in 2017. They operate through self-attention mechanisms and completely
drop the recurrence mechanism of RNNs and LSTMs. Self-attention takes into
account the relationships between all words in a given sentence, breaking away from
recurrent architectures in which the output of a unit is fed back into itself. As a result,
transformer networks capture long-distance dependencies in sequence data better than their
recurrent counterparts. Pre-trained transformer models are trained once on large general
corpora and then fine-tuned on smaller task-specific datasets.

Pros:

1. Generalization: Pre-training on large amounts of data lets models generalize well
before being tailored to specific tasks.

2. Versatility: These models can adapt to different tasks with only fine-tuning.

Cons:

1. Training Cost: Training these models from scratch is costly and resource-intensive.

2. Over-fitting: Fine-tuning on small datasets can result in overfitting if not handled
properly.

Model Examples: GPT-3, BERT, T5
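
As a hedged illustration of the pre-train/fine-tune pattern, the sketch below adapts a pre-trained BERT checkpoint to a small classification task with the Hugging Face Trainer API; the tiny inline dataset, label count, and hyperparameters are placeholders assumed purely for illustration.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Start from a general-purpose pre-trained checkpoint and adapt it to our labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Placeholder training examples: column names with hypothetical sensitivity labels
data = Dataset.from_dict({"text": ["cust_acct_num", "page_view_count"], "label": [1, 0]})

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=16)

train_dataset = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetuned-classifier",    # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()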

Evolution Summary:

1. Early RNNs: Offer basic sequential processing but struggle with long-term dependencies.

2. LSTMs and GRUs: Manage long-term dependencies better, at the cost of added complexity.

3. Attention Mechanism: Enhances context understanding and enables more parallel
processing.

4. Transformers: Rely on self-attention and parallelization, providing functionally
superior models for NLP.

5. Pre-trained Transformers: Rely on large-scale pre-training, which lets them dominate
across a range of NLP tasks.

Chapter 2

2.0 Availability of LLMs

2.0.0 Commercial APIs

1. OpenAI's GPT-3/GPT-4: Considered more advanced due to their ability to generate
human-like text and perform a wide range of NLP tasks.
2. Google's AI models: Include models like BERT and T5.
3. Microsoft Azure's Cognitive Services: Provides a combination of tools and APIs for
various NLP tasks, including text analysis and language understanding.
4. Typically offered through cloud-based APIs.
5. Pricing is usually pay-per-use, which scales well but means cost can vary
depending on usage and complexity.

2.0.1 Open-Source Models

1. Hugging Face's Transformers library: This includes models like BERT, GPT-2, and
T5. These models are widely used in the NLP community.
2. Available for local deployment.
3. Can be fine-tuned for specific tasks, providing flexibility and control over the
models.

2.0.2 Hosted Services

1. IBM Watson: Offers various NLP capabilities, including language translation,


sentiment analysis, and text-to-speech.
2. Amazon Comprehend: Provides NLP services for text analysis, entity recognition,
and sentiment analysis.
3. Provided as cloud services with various pricing tiers.
4. Often easier to integrate compared to deploying and managing your own models.

2.0.3 Pre-trained Models

1. Models available for download from platforms like Hugging Face's Model Hub.
2. Can be used locally for specific NLP tasks.
3. Pre-trained models are ready for use and can be further fine-tuned as needed.
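
Downloading and running such a pre-trained model locally takes only a few lines with the transformers library; the sketch below loads BERT from the Model Hub as one example checkpoint.

import torch
from transformers import AutoTokenizer, AutoModel

# Downloads the checkpoint from the Hugging Face Model Hub on first use, then caches it locally
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("customer_account_number", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # contextual embeddings for each token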

2.1. Accessibility of LLMs

1.Integration
• APIs:
o Easy to integrate into applications through RESTful APIs.
o Suitable for applications that need to scale quickly without worrying about
infrastructure.
• Libraries:
o Tools like Hugging Face Transformers provide easy-to-use libraries for
various models.
o Allow for greater customization and flexibility, but may require more setup.
2.Cost
• Commercial APIs:
o Usage-based pricing can be costly for high volume or complex queries.
o Provides ease of use and reliability but at a potentially higher cost.
• Open Source:
o Free to use but may require infrastructure for deployment.
o Cost-effective for organizations with the capability to manage their own
servers and deployments.
3.Customization
• Fine-tuning:
o Models can be fine-tuned on specific datasets to improve performance for
particular tasks.
o Requires some expertise in machine learning and access to relevant data.
• Pre-trained:
o Models can be used out-of-the-box for general tasks without additional
training.
o Useful for quick deployment and for tasks where general language
understanding is sufficient.
4.Performance
• Accuracy:
o The performance of LLMs can vary based on the model and task.
o Generally, models like GPT-3 and BERT perform very well on a wide range of
tasks.
• Latency:
o Considerations for response time, especially for real-time applications.
o Cloud-based solutions may introduce some latency compared to locally
deployed models.

5.Ease of Use
• APIs:
o Commercial APIs usually come with user-friendly documentation and
support.
o Ideal for developers looking to quickly add NLP capabilities to their
applications.
• Libraries:
o Open-source libraries may require more setup but offer flexibility and
control.

o Suitable for developers with more technical expertise who need custom
solutions.

Summary

Large Language Models (LLMs) are powerful tools in natural language processing with a
range of applications from text generation to question answering. They are available
through commercial APIs, open-source libraries, and hosted services. Choosing the right
LLM involves evaluating factors such as accuracy, cost, ease of integration, and
performance based on your specific use case.

Chapter 3

3.0 Identification of LLMs for Metadata Summarization & QA


Since we are looking for something cost-efficient and locally deployed, Hugging Face is
our best bet. Its models are open source, which involves no licensing cost, and because they can
be deployed locally, data privacy concerns do not arise. BERT, BART, T5, and Llama
are all models available through the Hugging Face Transformers ecosystem, and all
four can be used for text summarization and question answering (QA). Here's a brief
overview of each:

1. BERT (Bidirectional Encoder Representations from Transformers)

• Architecture: Transformer-based, specifically designed for bidirectional attention.


• Use Cases: Primarily used for tasks like classification and QA.
• Strengths: Excellent at understanding context within text, which makes it very
effective for QA.
• Limitations: Not designed for text generation tasks like summarization, typically
requires adaptation for such tasks.

2. BART (Bidirectional and Auto-Regressive Transformers)

• Architecture: Combines a bidirectional (BERT-like) encoder and a left-to-right


(GPT-like) decoder.
• Use Cases: Designed for text generation tasks including summarization, translation,
and QA.
• Strengths: Excellent for text summarization due to its combined bidirectional and
autoregressive nature.
• Limitations: Heavier and more computationally expensive compared to some other
models.

3. T5 (Text-to-Text Transfer Transformer)

• Architecture: Treats every NLP task as a text-to-text problem, using the same
model, loss function, and hyperparameters.
• Use Cases: Highly versatile, can be used for translation, summarization,
classification, and QA.
• Strengths: Unified approach to NLP tasks, highly flexible.
• Limitations: Requires large amounts of data for fine-tuning specific tasks, which
can be computationally expensive.

4. Llama (Large Language Model Meta AI)

• Architecture: Based on transformers, designed by Meta (formerly Facebook).


• Use Cases: Text generation, summarization, QA, and more.
• Strengths: Efficient and scalable, designed to perform well on various NLP tasks.
• Limitations: Depending on implementation, might require additional setup or fine-
tuning for specific tasks.

Hugging Face provides an easy-to-use interface for these models through the transformers
library, which supports a wide range of pre-trained models. This makes it straightforward
to experiment with and deploy these models for our use case.

Summary

1. BERT: Best suited for QA, excellent contextual understanding.


2. BART: helpful for text summarization, robust text generation capabilities.
3. T5: Highly flexible, useful for both summarization and QA effectively.
4. Llama: Efficient and scalable, suitable for various NLP tasks, including
summarization and QA (depending on availability and implementation).

3.1 Considerations before choosing a model:

Tb.1 Considerations before choosing LLM

Having compared all four shortlisted models as above, I have chosen Llama and
T5 for our use case.

I have decided to use Google Cloud's Vertex AI API service, which runs Llama and T5 in the
background, to generate metadata descriptions.
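
A rough sketch of calling such a hosted model through the Vertex AI Python SDK is shown below; it assumes a text-generation model (for example, a Llama or T5 variant from the Model Garden) has already been deployed to an online prediction endpoint, and the project ID, region, endpoint ID, and request schema are placeholders rather than values from this project.

from google.cloud import aiplatform

# Placeholders: replace with the real GCP project, region, and deployed endpoint ID
aiplatform.init(project="my-gcp-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-gcp-project/locations/us-central1/endpoints/1234567890")

# Hypothetical request asking the deployed model for a column description
response = endpoint.predict(
    instances=[{"prompt": "Describe the database column cust_acct_num (VARCHAR)."}])
print(response.predictions[0])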

10. Conclusions:

1. Random Forest was identified as the best model, with 70% accurate predictions.
2. The T5 base model was used to generate the descriptions for column names.

11. Bibliography / References

The following are referred journals from the preliminary literature review.

1. "Automated Data Classification: A Review" (Smith et al., 2020): This paper
reviews various automated data classification techniques and their
applications in data security and management.
2. "A Survey on Data Classification Algorithms" (Lee and Kwon, 2019): Provides a
comprehensive survey of data classification algorithms, discussing their
strengths and limitations.
3. "The Role of Data Catalogs in Big Data Environments" (Jones and Harris,
2018): Discusses the importance of data catalogs in managing big data and
enabling self-service analytics.
4. "Metadata Management in Data Catalogs: Challenges and Solutions" (Patel et
al., 2019): Explores various challenges in metadata management and
proposes solutions.

5. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault,
T., Louf, R., Funtowicz, M., & Brew, J. (2020). "Transformers: State-of-the-Art
Natural Language Processing". Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations.
6. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W.,
& Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-
to-Text Transformer". Journal of Machine Learning Research
7. See, A., Liu, P. J., & Manning, C. D. (2017). "Get To The Point: Summarization with
Pointer-Generator Networks". Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers).
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Kaiser, Ł.,
Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information
Processing Systems (NeurIPS). Retrieved from https://arxiv.org/abs/1706.03762
9. Hugging Face. (2023). Transformers Library. Hugging Face Documentation.
Retrieved from https://huggingface.co/transformers/

12. Appendices:

Appendix A: List of Abbreviations

Abbreviation Full Form

AI Artificial Intelligence

ML Machine Learning

ETL Extract, Transform, Load

EDW Enterprise Data Warehouse

SQL Structured Query Language

API Application Programming Interface

JSON JavaScript Object Notation

Gen AI Generative Artificial Intelligence

KPI Key Performance Indicator

NLP Natural Language Processing

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
Work Integrated Learning Programmes Division
II SEMESTER 23-24

DSECLZG628T DISSERTATION
(Final Evaluation Sheet)

NAME OF THE STUDENT: B.BHUJITH KUMAR

ID NO. : 2022DA04009

Email Address : [email protected]

NAME OF THE SUPERVISOR: Arun Kumar Shanmugam

PROJECT TITLE : Data Catalog Management for Enterprise Data


Warehouse- ML and Gen AI Based Solution

(Please put a tick ( ) mark in the appropriate box)

S.No. Criteria Excellent Good Fair Poor


1 Work Progress and Achievements ✓
2 Technical/Professional Competence ✓
3 Documentation and expression ✓
4 Initiative and originality ✓
5 Punctuality ✓
6 Reliability ✓
Recommended Final Grade Excellent

EVALUATION DETAILS

EC No. Component Weightage Marks Awarded


1 Dissertation Outline 10% 10
2 Mid-Sem Progress
Seminar 10% 10
Viva 5% 5
Work Progress 15% 12
3 Final Seminar/Viva 20% 18
4 Final Report 40% 35
Total out of 100% 90

Note : Mark awarded should be in terms of % of weightage ( consider 10% weightage as 10


marks)

Organizational Mentor
Name Arun Kumar Shanmugam
Qualification M.Tech
Designation & Address Principal Data Scientist
Email Address [email protected]
Signature

Date 08/09/2024

NB: Kindly ensure that recommended final grade is duly indicated in the above evaluation sheet.

The Final Evaluation Form should be submitted separately in the viva portal.

Check list of items for the Final report

a. Is the Cover page in proper format? Y


b. Is the Title page in proper format? Y
c. Is the Certificate from the Supervisor in proper format? Has it been signed? Y
d. Is Abstract included in the Report? Is it properly written? Y
e. Does the Table of Contents page include chapter page numbers? Y
f. Does the Report contain a summary of the literature survey? Y
i. Are the Pages numbered properly? Y
ii. Are the Figures numbered properly? Y
iii. Are the Tables numbered properly? Y
iv. Are the Captions for the Figures and Tables proper? Y
v. Are the Appendices numbered? Y
g. Does the Report have Conclusion / Recommendations of the work? Y
h. Are References/Bibliography given in the Report? Y
i. Have the References been cited in the Report? Y
j. Is the citation of References / Bibliography in proper format? Y

