Development of an Indian Legal Language Model (LLM) for Enhanced Legal Text Analysis and Assistance
2023
1 Dr. Jyothi A P, 2 Shivaranjini C, 3 Prajwal P
1,2 Department of Computer Science, Faculty of Engineering and Technology, MSRUAS, Bangalore, India
3 Senior ML Engineer, Finarb Analytics Consulting, Kolkata, India
Email: [email protected]
DOI: 10.5281/zenodo.10072672
Abstract
This project aims to revolutionize the field of Indian law by harnessing the power of natural language processing and machine learning. The advanced language model
is trained to provide accurate answers to legal questions, analyze case law and statutes, assist with legal research, and simplify complex legal documents. By
leveraging cutting-edge technology, this tool will significantly enhance efficiency, accuracy, and effectiveness in the legal domain. The ultimate goal is to offer legal
professionals, researchers, and the public a reliable and easily accessible resource for navigating the intricacies of Indian law, transforming the way legal research
and decision-making are conducted.
Keywords: Indian Law, Legal Data Corpus, LLaMA-2, NLP, Transformers
1. Introduction
In this era of rapid technological advancement, Natural Language Processing (NLP) stands out as a field that has fundamentally changed how people interact with machines. By giving computers the ability to understand, interpret, and produce human language, NLP enables a vast variety of applications across many different fields. One of the most significant developments in NLP is the Transformer, an architecture that has transformed how language models process sequential input. The recent release of large language models from several institutions has accelerated the growth of Transformers, making it important for every domain to adopt and leverage this technology.
In India, most lawyers, law students, and judges must rely on a large number of books to find information about Indian law and its application. Ordinary citizens must hire a lawyer even for small legal doubts, and understanding the implications of a law and its judgements requires studying many old case documents. Although commercially available LLMs are trained on huge corpora of data, they have not been exposed to Indian law [1-3]. In this work we therefore attempt to train currently available large language models to work in the Indian legal context.
In recent years, pre-trained language models (PLMs) based on transformers, such as BERT, have had a major influence on the field of natural language processing (NLP). After being pre-trained on massive volumes of text data, these models have been extensively tested and deployed in a wide range of downstream use cases. Because PLMs still struggle to learn new content areas, such as the scientific and legal fields, interest in domain-specific pre-training has increased [4-6].
As NLP has grown increasingly widespread in the legal field, PLMs tailored to legal text have emerged. BERT-based models have been pre-trained on legal documents from the European Union (EU), United Kingdom (UK), and United States (US), and these legal-domain models have proven beneficial for tasks such as text categorization, summarization, and named entity recognition [7]. Pre-training LMs on Indian legal literature could help close the corresponding gap for Indian law: by pre-training on patterns seen in Indian legal texts, a model is better prepared to deal with the complexities of the domain [11-13], which may in turn allow legal processes under Indian law to progress more quickly. After the boom of GPT-3.5, several Indian researchers began working on transformer models. One such contribution investigates pre-training techniques in the Indian legal domain, applying a BERT model to Indian law trained on around 5.4 million documents; that model shows significant improvement compared to models from other countries [10].
To achieve this goal, collecting the right amount of data and preparing it properly is very important. We searched for commercially available legal texts and found only a very limited amount of data for Indian law. A few researchers have addressed the issue using methods such as LSTMs, BERT, and other NLP models [11, 14, 15]. It is also worth noting that summarization methods developed for news and other material help in understanding the transformer architecture [16, 17]. The datasets that are available are limited to 30-40k legal texts, far too small to cover the entire legal system of India [12].
From the above literature survey, it is evident that there is no proper methodology for scraping legal content, and that the current database of legal texts is very limited. It can also be noted that the available models are trained on BERT, which is dated and has far fewer parameters than currently available PLMs. Although the available models can recognize named entities, they remain inaccurate because of the limited data they have been trained on. It is also evident that none of the current research focuses on annotating the data while training the PLM, which matters when training a large language model: if annotation is not provided, the model may not learn the context properly, because the structure of Indian legal texts differs significantly from that of foreign laws.
➔ Rhetorical role: Rhetorical roles label the parts of a judgement, such as preamble, facts, ruling by the lower court, arguments, issues, and analysis.
➔ Named entity: Annotating named entities in legal documents means identifying and marking specific pieces of information, such as the names of people, organizations, places, dates, and legal terms, making essential information easier to find; a hypothetical example record is sketched below.
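To make the two annotation levels concrete, the sketch below shows what one annotated record might look like once converted to JSON; the schema and field names are our own illustration, not the paper's actual format.

# Sketch: one annotated record in the JSON corpus. Schema and field
# names are hypothetical; they are not taken from the paper.
record = {
    "doc_id": "SC_1998_0421",  # hypothetical identifier
    "rhetorical_roles": [
        {"text": "This appeal arises out of ...", "role": "PREAMBLE"},
        {"text": "The appellant was convicted under ...", "role": "FACTS"},
    ],
    "named_entities": [
        {"text": "Supreme Court of India", "label": "COURT"},
        {"text": "Section 302, Indian Penal Code", "label": "PROVISION"},
    ],
}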
The final legal data corpus (annotated data converted to JSON format) is then tokenized, a model is trained by feeding the data in batches, and the trained model is served through an API to create a web interface.
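As a sketch of this tokenize-and-batch step, the snippet below uses the Hugging Face datasets and transformers libraries; the file path, field name, and sequence length are illustrative assumptions rather than the exact configuration used in this work.

# Sketch: tokenize the annotated JSON corpus and prepare it for batched
# training. Path, field name, and max_length are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 defines no pad token

# Each JSON line is assumed to hold one annotated document under a "text" field.
corpus = load_dataset("json", data_files="legal_corpus_annotated.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)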
2.1. Data Scraping
To build the corpus we collected data from various sources: some of the small datasets were already annotated, some were available only via the web, and most were PDFs. We developed a universal algorithm which takes the files and links, preprocesses them, and stores them in a dedicated storage location; the source details are given in Table 2.
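A minimal sketch of such a universal ingestion routine follows; the dispatch on source type (web link vs. local PDF) mirrors the description above, while the library choices (requests, BeautifulSoup, pypdf) and the storage layout are our assumptions.

# Sketch: a universal ingestion routine that accepts files or links,
# extracts raw text, and writes it to a dedicated storage directory.
# Library choices and storage layout are assumptions for illustration.
import pathlib
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

STORAGE = pathlib.Path("legal_corpus_raw")  # assumed storage location
STORAGE.mkdir(exist_ok=True)

def ingest(source: str) -> pathlib.Path:
    """Extract raw text from a web URL or a local PDF and store it as .txt."""
    if source.lower().startswith("http"):
        html = requests.get(source, timeout=30).text
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        name = source.rstrip("/").split("/")[-1] or "index"
    else:  # treat anything else as a local PDF file
        reader = PdfReader(source)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        name = pathlib.Path(source).stem
    out = STORAGE / f"{name}.txt"
    out.write_text(text, encoding="utf-8")
    return out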
After all the preprocessing and scraping of the above-mentioned documents, the final unannotated corpus adds up to 45 lakh (4.5 million) documents, around 56 GB of text comprising legal judgements, laws, textbooks, and Acts. We then used the pre-trained models and pipelines from OpenNyAI to process the entire corpus of legal text, including cases and their judgements. The annotations are done at two levels: rhetorical role and named entity.
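The snippet below sketches how the OpenNyAI pipelines can be applied for the two annotation levels; the component names follow the opennyai package's published examples, but the exact API may differ across versions, so treat it as an assumption.

# Sketch: annotate the scraped corpus at two levels (rhetorical role, NER)
# with the OpenNyAI pipelines. Component names follow the opennyai
# package's published examples; the exact API may vary across versions.
import pathlib
from opennyai import Pipeline
from opennyai.utils import Data

files = sorted(pathlib.Path("legal_corpus_raw").glob("*.txt"))  # path assumed
texts = [f.read_text(encoding="utf-8") for f in files]

pipeline = Pipeline(components=["Rhetorical_Role", "NER"], use_gpu=True)
annotated = pipeline(Data(texts))  # per-document annotations as JSON-like dicts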
Fig. 3: Flowchart of the AI-based models and experimental methods applied.
Table 2 – Data source details

#   Format   Annotated?   Source                                          Description
1   PDF      Yes          Constitution Benches of the SC                  900 case details and their judgements, with annotation
2   Text     Yes          Abstractive summaries                           7,130 Indian Supreme Court case documents and their abstractive summaries
3   JSON     Yes          Indian LegalBERT                                35k Supreme Court judgements
4   JSON     Yes          CJPE (GitHub)                                   40k Supreme Court judgements
5   Text     Yes          Central Acts enacted by the Indian Parliament   858 annotated Central Acts enacted by the Indian Parliament from 1838 to 2020
6   Text     Yes          Representative Judgements Sample                9,435 judgment sentences and 1,560 preambles
7   Web      No           Indiankanoon                                    An online search engine for Indian legal documents (30 lakh)
8   Web      No           advocatekhoj - bareact                          An online web interface for Bare Acts of the Indian Constitution (20k)
For comparison, PoLBERT was pre-trained on 10M documents drawn from both US and EU sources. The above analysis also shows that the commercially available data for Indian law is very limited and restricted. We hope that in the coming days the Indian government will take steps to make Indian legal texts as openly available as possible.
3.2. Model training and evaluation
The final model that we trained uses an open-source LLM with 7B parameters (LLaMA-2); we chose this model due to limited resource availability and time. Training large language models takes a huge amount of time and compute, ranging from days to months, so we optimized the training in several ways to work within these constraints. The final training metrics, loss, and learning rate that we achieved are shown in Figure 5.
Fig. 5 – Model training: (a) epoch and learning_rate, and (b) loss, versus training steps.
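The paper does not spell out which optimizations were applied; one common recipe for fine-tuning a 7B model on constrained hardware is parameter-efficient training with LoRA, sketched below under that assumption (hyperparameters are illustrative, and tokenizer/tokenized come from the tokenization sketch above).

# Sketch: parameter-efficient fine-tuning of a 7B LLaMA-2 model with LoRA
# via the peft library. This is one plausible way to train under limited
# resources; the paper's exact recipe and hyperparameters are not specified.
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # only the small adapter weights are trainable

args = TrainingArguments(output_dir="legal-llama", per_device_train_batch_size=4,
                         gradient_accumulation_steps=8, learning_rate=2e-4,
                         num_train_epochs=1, logging_steps=100)
trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()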
From the figures it can be seen that the model initially learned at a high rate, a good sign that it is absorbing the new values and an indication that the data provided to the model is genuinely new to it. As more data is fed, the learning rate decays sharply, which is expected during well-behaved training. From Figure 5(b) we can observe that the loss is initially very high due to the new data and high learning rate, which changes the model weights drastically as it adapts. As training approaches 200k steps the loss drops sharply and settles below 0.45, a sign of good training. At the end of training the final loss of the model is 0.41, which is very good for a large language model trained on a small machine with a large amount of data. After evaluating the model metrics, we found the model to be very good at retrieving all the information related to a law. We evaluated our model with various metrics, such as precision in generating text, recall for classification, and accuracy in citing the right sections. The results of these comparisons are given in Table 4 below.
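For reference, precision, recall, F1, and accuracy for classification-style outputs (e.g. whether the model cites the right sections) can be computed as below with scikit-learn; the label values shown are purely hypothetical.

# Sketch: computing precision, recall, F1, and accuracy for the model's
# section-classification outputs with scikit-learn. Labels are hypothetical.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["IPC 302", "IPC 420", "CrPC 164", "IPC 302"]  # gold sections (example)
y_pred = ["IPC 302", "IPC 420", "IPC 302", "IPC 302"]   # model outputs (example)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")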
The model achieved strong results in precision, recall, and F1-score, marking a significant advancement in NLP applications for Indian law.
Looking ahead, future directions include continual model improvements through fine-tuning, adapting to the evolving legal landscape, and incorporating domain-specific
knowledge bases. Collaborating with legal professionals for validation and feedback will be crucial. Additionally, exploring multilingual capabilities and integrating legal
reasoning for predicting and explaining case outcomes will enhance the model's utility for legal practitioners. Overall, this project's findings provide a promising
foundation for further research and development, driving increased efficiency, accuracy, and accessibility in legal research and decision-making processes.
Acknowledgements
We acknowledge M.S. Ramaiah University of Applied Sciences for their constant support. We also thank Finarb Analytics Consulting for their help in providing the necessary resources.
References
1. Gupta, A., Furniturewala, S., Kumari, V. and Sharma, Y., 2023, July. Steno AI at SemEval-2023 Task 6: Rhetorical Role Labelling of Legal Documents using Transformers and Graph Neural Networks. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) (pp. 1858-1862).
2. Rawat, A.J., Ghildiyal, S. and Dixit, A.K., 2022. Topic modelling of legal documents using NLP and bidirectional encoder representations from transformers. Indonesian
Journal of Electrical Engineering and Computer Science, 28(3), pp.1749-1755.
3. de Lima, A.G., Boughanem, M., Aranha, E.H.D.S., Dkaki, T. and Moreno, J.G., 2022, July. Exploring SBERT and mixup data augmentation in rhetorical role labeling of Indian legal sentences. In Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July.
4. Gandhi, P.A. and Talwar, V., 2023. Artificial intelligence and ChatGPT in the legal context. Int J Med Sci, 10, pp.1-2.
5. Furniturewala, S., Jain, R., Kumari, V. and Sharma, Y., 2021. Legal text classification and summarization using transformers and joint text features.
6. Shridhar, B.S., Kayalvizhi, S. and Thenmozhi, D., 2021. Simple Transformers in Rhetoric Role Labelling for Legal Judgements.
7. Paul, S., Mandal, A., Goyal, P. and Ghosh, S., 2022. Pre-training transformers on Indian legal text. arXiv preprint arXiv:2209.06049.
8. Malik, V., Sanjay, R., Nigam, S.K., Ghosh, K., Guha, S.K., Bhattacharya, A. and Modi, A., 2021. ILDC for CJPE: Indian legal documents corpus for court judgment
prediction and explanation. arXiv preprint arXiv:2105.13562.
9. Modi, A., Kalamkar, P., Karn, S., Tiwari, A., Joshi, A., Tanikella, S.K., Guha, S.K., Malhan, S. and Raghavan, V., 2023. SemEval 2023 Task 6: LegalEval--Understanding
Legal Texts. arXiv preprint arXiv:2304.09548.
10. Girhepuje, S., Goel, A., Krishnan, G., Goyal, S., Pandey, S., Kumaraguru, P. and Ravindran, B., 2023. Are Models Trained on Indian Legal Data Fair? arXiv preprint arXiv:2303.07247.
11. Mamakas, D., Tsotsi, P., Androutsopoulos, I. and Chalkidis, I., 2022. Processing long legal documents with pre-trained transformers: Modding legalbert and
longformer. arXiv preprint arXiv:2211.00974.
12. Kapoor, A., Dhawan, M., Goel, A., Arjun, T.H., Bhatnagar, A., Agrawal, V., Agrawal, A., Bhattacharya, A., Kumaraguru, P. and Modi, A., 2022. HLDC: Hindi legal
documents corpus. arXiv preprint arXiv:2204.00806.
13. Kalamkar, P., Tiwari, A., Agarwal, A., Karn, S., Gupta, S., Raghavan, V. and Modi, A., 2022. Corpus for automatic structuring of legal documents. arXiv preprint
arXiv:2201.13125.
14. Savelka, J., Westermann, H. and Benyekhlef, K., 2021. Cross-domain generalization and knowledge transfer in transformers trained on legal data. arXiv preprint
arXiv:2112.07870.
15. Parikh, V., Mathur, V., Mehta, P., Mittal, N. and Majumder, P., 2021. LawSum: A weakly supervised approach for Indian legal document summarization. arXiv preprint arXiv:2110.01188.
16. Jyothi, A.P. and Shivaranjini, C., 2023. AI-Powered News Article Summarization for Efficient Information Consumption. Technix International Journal for Engineering Research (TIJER), 10(7), pp.162-166.
17. Nayagam, V.S., Jyothi, A.P., Abirami, P., Femila Roseline, J., Sudhakar, M., Al-Ammar, E.A., Wabaidur, S.M., Hoda, N. and Sisay, A., 2022. Deep learning model on energy
management in grid-connected solar systems. International Journal of Photoenergy, 2022.