
Sains Malaysiana Vol. 52 No. 12, Dec. 2023 | ISSN: 0126-6039

Development of an Indian Legal Language Model (LLM) for Enhanced Legal Text Analysis and Assistance

¹Dr. Jyothi A P, ²Shivaranjini C, ³Prajwal P
¹,² Department of Computer Science, Faculty of Engineering and Technology, MSRUAS, Bangalore, India
³ Senior ML Engineer, Finarb Analytics Consulting, Kolkata, India
Email: [email protected] | DOI: 10.5281/zenodo.10072672

Abstract
This project aims to revolutionize the field of Indian law by harnessing the power of natural language processing and machine learning. The advanced language model is trained to provide accurate answers to legal questions, analyze case law and statutes, assist with legal research, and simplify complex legal documents. By leveraging cutting-edge technology, this tool will significantly enhance efficiency, accuracy, and effectiveness in the legal domain. The ultimate goal is to offer legal professionals, researchers, and the public a reliable and easily accessible resource for navigating the intricacies of Indian law, transforming the way legal research and decision-making are conducted.
Keywords: Indian Law, Legal Data Corpus, LLaMA 2, NLP, Transformers

1. Introduction
In this era of rapid technological advancement, Natural Language Processing (NLP) stands out as a field that has fundamentally changed how people interact with machines. By giving computers the ability to understand, interpret, and produce human language, NLP enables a vast variety of applications across many domains. One of the most significant developments in NLP is the Transformer, a ground-breaking architecture that has completely changed how language models process sequential input. The recent release of large language models by several institutions has accelerated the growth of Transformers, making it important for every domain to adopt and leverage this technology.

In India, most lawyers, law students, and judges must rely on a large number of books to find information about Indian law and its application. Ordinary citizens must hire a lawyer even for small legal doubts, and practitioners must study and work through several old case documents to understand the implications of the law and its judgements. Although commercially available LLMs are trained on a huge corpus of data, they have not been introduced to Indian law [1-3]. Hence, in this work, we attempt to adapt currently available large language models to the Indian legal context.

In recent years, pre-trained language models (PLMs) based on Transformers, such as BERT, have had a major influence on the field of natural language processing (NLP). After being pre-trained on massive volumes of text data, these models have been extensively tested and deployed in a wide range of downstream use cases. Interest in domain-specific pre-training has grown as a way to address PLMs' limited coverage of specialized content areas, such as the scientific and legal fields [4-6].

As natural language processing (NLP) has grown increasingly widespread in the legal field, PLMs tailored to legal text have emerged. BERT-based models have been pre-trained on legal documents from the European Union (EU), the United Kingdom (UK), and the United States (US), and these legal-domain models have proven beneficial for tasks such as text categorization, summarization, and named entity recognition [7]. However, such models do not transfer well to Indian law. Pre-training LMs on Indian legal literature helps avoid this gap: a model pre-trained on the patterns of Indian legal literature is better prepared to deal with the complexities of the domain [11-13], which may in turn speed up legal processes under Indian law. After the boom of GPT-3.5, several Indian researchers began working on transformer models. One such contribution investigates pre-training techniques in the Indian legal domain, applying a BERT model to Indian law trained on around 5.4 million documents and showing significant improvement over models from other countries [10]. To achieve this goal, collecting the right amount of data and preparing it properly is very important. We searched for commercially available legal texts and found only a very limited amount of data for Indian law. A few researchers have addressed the issue with methods such as LSTMs, BERT, and other NLP models [11, 14, 15]. It is also worth noting that summarization methods developed for news and other domains help in understanding the Transformer architecture [16, 17]. The datasets that are available are limited to 30-40k legal texts, which is far too small to cover the entire legal system of India [12].

From the above literature survey, it is evident that there is no established methodology for scraping legal content, and the current databases of Indian legal texts are very limited. The available models are based on BERT, which is now dated and trained with far fewer parameters than current PLMs. Although the available models can recognize named entities, they remain inaccurate owing to the limited data on which they were trained. It is also evident that none of the current research focuses on annotating the data before training the PLM, which matters when training a large language model: if annotation is not provided, the model may not learn the context properly, because the structure of Indian legal texts differs significantly from that of foreign laws.

Fig. 1: Named Entities that are annotated


Keeping in mind all the above limitations, we try to bridge the gap between the data and the model by creating a large corpus of Indian legal texts and developing an efficient algorithm to extract and scrape the data from internet resources. We propose an OpenNyAI-based pipeline to annotate and validate each of the texts, and we extract the core law statements and their implications along with the legal texts to give the model robust knowledge. With this data available, we train a LLaMA 2 model on the entire database and benchmark its performance against other models. We also create an API endpoint, with a minimally designed interface, to access the model inference at later stages of the project.

Fig. 2: Annotated Rhetorical Roles


2. Method
To solve this complex problem, we designed the architecture shown in Figure 3. The first step is collection of the dataset: the raw data is gathered from different open-source websites by a data-scraping algorithm that takes links to case files as input and extracts their contents. The collected dataset contains structured (annotated) and unstructured data. The second step is preprocessing, which includes removing unwanted characters, line breaks, and irrelevant sections such as tables of contents and footnotes. After preprocessing, an algorithm annotates the PDF documents using the headings present within each document and converts the PDF files into JSON format. Annotation and feature extraction use the OpenNyAI public repository and cover the two annotation levels listed after the sketch below (see Figures 1 and 2).
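To make the preprocessing step concrete, here is a minimal sketch that strips unwanted characters and line breaks and writes the cleaned text to JSON. The regex rules and file names are illustrative assumptions, not the authors' exact cleanup rules.

```python
# A minimal preprocessing sketch: strip unwanted characters and line breaks,
# then convert the extracted text to a JSON record. The regex rules and file
# names below are illustrative assumptions, not the paper's exact rules.
import json
import re

def clean_text(raw: str) -> str:
    text = raw.replace("\x0c", " ")             # form feeds from PDF extraction
    text = re.sub(r"-\n(\w)", r"\1", text)      # re-join words hyphenated at line breaks
    text = re.sub(r"\n+", " ", text)            # collapse line breaks into spaces
    text = re.sub(r"[^\x20-\x7E\u0900-\u097F]", " ", text)  # keep ASCII + Devanagari
    return re.sub(r"\s{2,}", " ", text).strip() # squeeze repeated whitespace

def to_json(doc_id: str, raw: str, path: str) -> None:
    record = {"id": doc_id, "text": clean_text(raw)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)

# Hypothetical input/output file names for illustration.
to_json("sc_case_001",
        open("sc_case_001.txt", encoding="utf-8").read(),
        "sc_case_001.json")
```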


➔ Rhetorical role: the rhetorical role of each sentence, such as preamble, facts, ruling by the lower court, arguments, issues, and analysis.
➔ Named entity: annotating named entities in legal documents means identifying and marking specific pieces of information, such as the names of people, organizations, places, dates, and legal terms. This makes the documents easier to understand and essential information easier to find.
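To illustrate the annotation step, the following is a hedged sketch based on the Pipeline interface documented in the OpenNyAI public repository; the component names, the Data wrapper, and the output schema are assumptions that may differ across versions of the opennyai package.

```python
# A hedged sketch of annotating a judgment with the OpenNyAI pipeline.
# Component names and the Data wrapper follow the repository's documented
# interface but may differ across versions of the `opennyai` package.
import json
from opennyai import Pipeline
from opennyai.utils import Data

judgment_text = open("sample_judgment.txt", encoding="utf-8").read()  # placeholder file
data = Data([judgment_text])  # the pipeline consumes an iterable of raw texts

# Run named-entity and rhetorical-role annotation in one pass.
pipeline = Pipeline(components=["NER", "Rhetorical_Role"], use_gpu=False)
results = pipeline(data)

# Each result is a JSON-like dict with sentence-level roles and entity spans,
# which we persist as one annotated corpus entry.
with open("annotated_judgment.json", "w", encoding="utf-8") as f:
    json.dump(results[0], f, ensure_ascii=False, indent=2)
```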
The final legal data corpus (the annotated data converted to JSON format) is then tokenized, the model is trained by feeding the data in batches, and the trained model is exposed through an API to create a web interface.
2.1. Data Scraping
To scrape the data, we collected documents from various sources: some of the smaller datasets were already annotated, some were available only via the web, and most were PDFs. We developed a universal algorithm that takes files or links, preprocesses them, and stores them in dedicated storage; the source details are given in Table 2.
After all the preprocessing and scraping of the above-mentioned documents, the final data corpus adds up to 45 lakh (4.5 million) documents, around 56 GB of text, comprising legal judgements, laws, textbooks, and Acts. We then used the pre-trained models and pipelines from OpenNyAI to process the entire corpus of legal text, including the cases and their judgements. The annotations are done at two levels: rhetorical role and named entity.
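Since the paper does not publish the universal scraping algorithm in full, the following is a hedged sketch of the core idea: take case-file links as input, fetch each page, and extract the judgment text. The CSS selector, user agent, example link, and delay are assumptions to be adjusted per source site.

```python
# A hedged sketch of the scraping step: given links to case files, fetch each
# page and extract the judgment text. The selector and delay are assumptions.
import time
import requests
from bs4 import BeautifulSoup

def scrape_case(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": "legal-corpus-bot"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed container for the judgment body; adjust per source site.
    body = soup.find("div", class_="judgments") or soup.body
    return body.get_text(" ", strip=True)

case_links = ["https://indiankanoon.org/doc/1234567/"]  # illustrative link only
for url in case_links:
    text = scrape_case(url)
    print(url, len(text), "characters scraped")
    time.sleep(2)  # be polite to the source site
```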

Fig. 3: Flowchart of the AI-based models and experimental methods applied
Table 2 – Data source details

SL NO | DATA TYPE | ANNOTATED? | SOURCE | INFO
1 | PDF | Yes | Constitution Benches of the SC | 900 case details and their judgments, with annotation
2 | Text | Yes | Abstractive summaries | 7,130 Indian Supreme Court case documents and their 'abstractive' summaries
3 | JSON | Yes | Indian LegalBERT | 35k Supreme Court judgments
4 | JSON | Yes | CJPE (GitHub) | 40k Supreme Court judgements
5 | Text | Yes | Central Acts enacted by the Indian Parliament | 858 annotated Central Acts enacted by the Indian Parliament, 1838 to 2020
6 | Text | Yes | Representative Judgements Sample | 9,435 judgment sentences and 1,560 preambles
7 | Web | No | Indiankanoon | An online search engine for Indian legal documents (30 lakh documents)
8 | Web | No | advocatekhoj - bareact | An online web interface for Bare Acts of the Indian Constitution (20k documents)


2.2. Model training


The final JSON was verified, and we then wrote an algorithm to train the model with different hyperparameters using the AutoTrain modules from Hugging Face. Parameters used: batch size 4; learning rate 1e-4 (adaptive); loss function: built-in; method: multimodal; model: llama-7b-chat-hf; epochs: 3; checkpoints saved every 500 steps; logging every 100 steps. The training was done on a VM with a T4 GPU; owing to the model's size and the enormous amount of data, the model was trained for 400k steps. The model was tuned with several prompt-engineering variants and tested at several checkpoints to tune the parameters. The overall flow of the model training architecture is represented in Figure 4. The final model was pushed directly to Hugging Face for easy inference, and an API endpoint was created to consume the model. It is important to note that our number of training steps is comparatively low owing to limited GPU availability, although the data and the scripts can be used to train for up to 1.4 million steps to improve accuracy and perplexity.
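As a hedged illustration (not the authors' exact AutoTrain invocation), the listed hyperparameters could map onto a standard Hugging Face fine-tuning run roughly as follows. The dataset file, checkpoint name, and context length are assumptions, and on a single T4 a 7B model would in practice also need quantization or LoRA adapters, which this sketch omits.

```python
# A hedged sketch of mapping the stated hyperparameters onto a Hugging Face
# fine-tuning run. Dataset file, checkpoint name, and max_length are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed to be the llama-7b-chat-hf used
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# The annotated corpus converted to JSON (file name is a placeholder).
dataset = load_dataset("json", data_files="legal_corpus.json", split="train")

def tokenize(batch):
    # Truncate long judgements to an assumed 2,048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="indian-legal-llama",
    per_device_train_batch_size=4,  # batch size 4, as stated in the paper
    learning_rate=1e-4,             # 1e-4 with the default (adaptive) AdamW optimizer
    num_train_epochs=3,             # 3 epochs
    save_steps=500,                 # checkpoint every 500 steps
    logging_steps=100,              # log every 100 steps
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.push_to_hub()  # push to Hugging Face for easy inference, as in the paper
```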

Fig. 4: Model training flow


3. Results And Discussion
In this section we present our findings: the final data corpus statistics, model training, and benchmarking of the models.

3.1. Data corpus statistics and comparison


Our dataset combines several document collections and stacks up well against industry standards. We compared our final data corpus to several commercially available datasets; the results are shown in Table 3.
Table 3 – Comparison of dataset stats with other commercially available models

MODEL | DATA SOURCE | DOCUMENTS | SIZE (GB)
LEGALBERT | EU, UK, US legislation; cases from ECJ, ECtHR | 350K | 12
CASELAWBERT | Harvard Case Law (US federal and state courts) | 3.4M | 37
POLBERT | Legal analyses, court opinions, government publications, contracts, statutes, regulations, and more from the US and EU | 10M | 256
LAWFORMER | China Judgment Online (cases from Chinese courts) | 22.7M | 84
INLEGALBERT, INCASELAWBERT, CUSTOMINLAWBERT | Cases from the Indian Supreme Court and High Courts | 5.4M | 27
ILDC | Cases from the Supreme Court | 40K | 1.2
OUR DATA CORPUS | Case documents from the Supreme Court, High Courts, and local legal bodies; Acts; constitutional law; general amendments; Indian law texts | 4.5M | 56
From the above comparison it can be noted that the number of documents we have collected is substantial and compares well with the commercially available corpora. Lawformer has more data because of China's open availability of government law documents, and because China has a greater number of laws and stricter policies than India. It can also be noted that PoLBERT uses 10M documents combining US and EU sources. The analysis also shows that commercially available data for Indian law is very limited and restricted; we hope that in the coming days the Indian government will make Indian laws as openly available as possible.
3.2. Model training and evaluation
The final model we trained uses an open-source LLM with 7B parameters, chosen because of limited resource availability and time. It should be noted that training large language models takes a huge amount of time and resources, ranging from days to months; we therefore optimized the training in several ways to train the model under these constraints. The final training metrics, loss curve, and learning rate achieved are shown in Figure 5.

Fig. 5: Model training – (a) epoch and learning rate, and (b) loss vs. training steps
From the figures it can be noted that the model initially learned at a high rate, a very good sign that the model is absorbing the new values quickly; it also indicates that the data provided to the model is very new. As more data is fed, the learning rate drops sharply, which is a sign of good training. From Figure 5 we can observe that the loss is initially very high, owing to the new data and high learning rate, which changes the model weights drastically as it adapts to the new data. As training reaches 200k steps, the loss drops sharply and settles below 0.45, a sign of good training. At the end of training, the final loss is 0.41, which is very good for a large language model trained on a small machine with a huge amount of data. After evaluating the model metrics, we found the model very good at determining and retrieving all the information related to a law. We evaluated the model on various metrics: precision in generating text, recall for classification, and accuracy in returning the right sections. The results of these comparisons are given in Table 4 below.


Table 4 – Model evaluation metrics mP, mR and mF1

METRIC | LEGALBERT | CASELAWBERT | INLEGALBERT | INCASELAWBERT | OUR MODEL
mP | 79.85 | 82.62 | 83.43 | 83.05 | 84.62
mR | 78.49 | 82.42 | 83.15 | 82.82 | 86.82
mF1 | 78.21 | 82.38 | 83.09 | 82.77 | 79.07
The evaluation shows that the previous models are mainly based on BERT, which is dated. Although we trained our model with very limited resources and a limited number of steps, Figure 6 shows high model precision and recall compared with the other models. It should be noted that the model accuracy is very low, which is expected since we trained for a small number of steps. Note also that these metrics differ from classical ML metrics: precision and recall will not simply add up to accuracy, since they are measured on different tasks.
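For clarity on how macro-averaged precision (mP), recall (mR), and F1 (mF1) are computed, here is a hedged illustration, not the authors' evaluation harness; the labels are made up for a rhetorical-role classification task.

```python
# A hedged illustration of macro-averaged precision/recall/F1 (mP, mR, mF1)
# for a classification task such as rhetorical-role labelling.
# The label arrays below are made-up examples, not the paper's data.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["facts", "issues", "analysis", "ruling", "facts", "arguments"]
y_pred = ["facts", "analysis", "analysis", "ruling", "facts", "issues"]

# average="macro" weights every class equally, matching the mP/mR/mF1 convention.
mP, mR, mF1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"mP={mP:.4f} mR={mR:.4f} mF1={mF1:.4f}")
```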

Fig. 6: Graphical comparison of the different model benchmarks


The next evaluation was the KL divergence of the model scores, an industry-standard metric for question-answering text generation; the results are shown in Figure 7. From these metrics we can see that our model performs far better on all the KL divergence metrics in legal contexts. When the average KL divergence score is compared across all the models, our model shows very good results (Figure 7). In the comparison we have also included the best available question-answering model, whose KL divergence metric is 90.939; this shows that although our model is well ahead of commercially available models in legal contexts, it still needs a lot of improvement when compared with other fields. With all that said, it is still notable that a model trained with limited resources has outperformed many models trained with huge resources.
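As a sketch of the underlying computation, KL divergence compares two probability distributions, for example a reference next-token distribution against our model's; the paper does not publish its exact scoring pipeline, so the distributions below are illustrative.

```python
# A minimal KL-divergence sketch between two next-token probability
# distributions (e.g., reference model vs. our fine-tuned model).
# The distributions below are illustrative, not the paper's scores.
import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; eps avoids division by zero for unseen tokens."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return entropy(p, q)  # scipy computes sum(p * log(p / q)) for two arguments

reference = [0.50, 0.30, 0.15, 0.05]  # e.g., reference answer distribution
candidate = [0.45, 0.35, 0.10, 0.10]  # e.g., model answer distribution
print(f"KL(ref || cand) = {kl_divergence(reference, candidate):.4f}")
```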
All the models we trained were pushed to the Hugging Face repository, deployed there, and used to generate inference. Finally, we also deployed the model on Amazon SageMaker to create an inference endpoint, which the web interface then uses to access the model efficiently.
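A hedged sketch of how the web interface might call such a SageMaker endpoint follows; the endpoint name, region, and payload schema are assumptions, not the authors' actual deployment.

```python
# A hedged sketch of calling a SageMaker inference endpoint from a web backend.
# Endpoint name, region, and payload schema are assumptions.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="ap-south-1")  # assumed region

payload = {"inputs": "What does Section 420 of the Indian Penal Code cover?"}
response = runtime.invoke_endpoint(
    EndpointName="indian-legal-llama-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```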

Fig. 7: Average KL divergence metric compared with other models


Conclusion
This research project has successfully developed an Indian Legal Language Model (LLM) using advanced natural language processing techniques. By addressing
challenges related to data scarcity and model efficacy, the project has created a robust and proficient LLM capable of handling complex legal texts. The curated dataset,
enriched through precise annotations, facilitated effective model training, resulting in impressive performance metrics. The LLM outperformed existing legal models in precision, recall, and F1-score, marking a significant advancement in NLP applications for Indian law.
Looking ahead, future directions include continual model improvements through fine-tuning, adapting to the evolving legal landscape, and incorporating domain-specific
knowledge bases. Collaborating with legal professionals for validation and feedback will be crucial. Additionally, exploring multilingual capabilities and integrating legal
reasoning for predicting and explaining case outcomes will enhance the model's utility for legal practitioners. Overall, this project's findings provide a promising
foundation for further research and development, driving increased efficiency, accuracy, and accessibility in legal research and decision-making processes.
Acknowledgements
We acknowledge M.S. Ramaiah University of Applied Sciences for their constant support. We also thank Finarb Analytics Consulting for their help in providing the necessary resources.
References
1. Gupta, A., Furniturewala, S., Kumari, V. and Sharma, Y., 2023. Steno AI at SemEval-2023 Task 6: Rhetorical Role Labelling of Legal Documents using Transformers and Graph Neural Networks. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pp. 1858-1862.
2. Rawat, A.J., Ghildiyal, S. and Dixit, A.K., 2022. Topic modelling of legal documents using NLP and bidirectional encoder representations from transformers. Indonesian Journal of Electrical Engineering and Computer Science, 28(3), pp. 1749-1755.
3. de Lima, A.G., Boughanem, M., Aranha, E.H.D.S., Dkaki, T. and Moreno, J.G., 2022. Exploring SBERT and mixup data augmentation in rhetorical role labeling of Indian legal sentences. In Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France.
4. Gandhi, P.A. and Talwar, V., 2023. Artificial intelligence and ChatGPT in the legal context. Int J Med Sci, 10, pp. 1-2.
5. Furniturewala, S., Jain, R., Kumari, V. and Sharma, Y., 2021. Legal text classification and summarization using transformers and joint text features.
6. Shridhar, B.S., Kayalvizhi, S. and Thenmozhi, D., 2021. Simple transformers in rhetoric role labelling for legal judgements.
7. Paul, S., Mandal, A., Goyal, P. and Ghosh, S., 2022. Pre-training transformers on Indian legal text. arXiv preprint arXiv:2209.06049.
8. Malik, V., Sanjay, R., Nigam, S.K., Ghosh, K., Guha, S.K., Bhattacharya, A. and Modi, A., 2021. ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation. arXiv preprint arXiv:2105.13562.
9. Modi, A., Kalamkar, P., Karn, S., Tiwari, A., Joshi, A., Tanikella, S.K., Guha, S.K., Malhan, S. and Raghavan, V., 2023. SemEval 2023 Task 6: LegalEval - Understanding legal texts. arXiv preprint arXiv:2304.09548.
10. Girhepuje, S., Goel, A., Krishnan, G., Goyal, S., Pandey, S., Kumaraguru, P. and Ravindran, B., 2023. Are models trained on Indian legal data fair? arXiv preprint arXiv:2303.07247.
11. Mamakas, D., Tsotsi, P., Androutsopoulos, I. and Chalkidis, I., 2022. Processing long legal documents with pre-trained transformers: Modding LegalBERT and Longformer. arXiv preprint arXiv:2211.00974.
12. Kapoor, A., Dhawan, M., Goel, A., Arjun, T.H., Bhatnagar, A., Agrawal, V., Agrawal, A., Bhattacharya, A., Kumaraguru, P. and Modi, A., 2022. HLDC: Hindi legal documents corpus. arXiv preprint arXiv:2204.00806.
13. Kalamkar, P., Tiwari, A., Agarwal, A., Karn, S., Gupta, S., Raghavan, V. and Modi, A., 2022. Corpus for automatic structuring of legal documents. arXiv preprint arXiv:2201.13125.
14. Savelka, J., Westermann, H. and Benyekhlef, K., 2021. Cross-domain generalization and knowledge transfer in transformers trained on legal data. arXiv preprint arXiv:2112.07870.
15. Parikh, V., Mathur, V., Mehta, P., Mittal, N. and Majumder, P., 2021. LawSum: A weakly supervised approach for Indian legal document summarization. arXiv preprint arXiv:2110.01188.
16. Jyothi, A.P. and Shivaranjini, C., 2023. AI-powered news article summarization for efficient information consumption. Technix International Journal for Engineering Research (TIJER), 10(7), pp. 162-166.
17. Nayagam, V.S., Jyothi, A.P., Abirami, P., Femila Roseline, J., Sudhakar, M., Al-Ammar, E.A., Wabaidur, S.M., Hoda, N. and Sisay, A., 2022. Deep learning model on energy management in grid-connected solar systems. International Journal of Photoenergy, 2022.
