Generate Insights From Unstructured Financial Data

This document outlines a backend storage architecture for an AI system that analyzes financial documents. Data is preprocessed through steps such as stop-word removal, lemmatization, and vectorization, then converted to embeddings and stored in a vector database so that related information can be retrieved quickly by embedding similarity. Knowledge graphs are also built by extracting entities with named entity recognition and storing the relationships between them in a graph database, enabling insights to be generated from the unstructured financial data.

Uploaded by

Azharudeen salim
Generate insights from unstructured financial data
1. Sources
1.1. Earnings Call Transcripts

1.2. News Articles

1.3. Investor Reports

2. End Product
2.1. UI Interface for Users

2.2. Backend Storage Architecture

2.3. ML Models
3. Steps for Individual Document
3.1. On a daily basis, get the latest Articles / Reports / Transcripts

3.2. Data Cleaning

3.2.1. Remove Stop words

3.2.2. Lemmatization

3.2.3. Normalize Data

3.2.4. Vectorization

3.2.5. Store to Vector DB
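The cleaning steps in 3.2 can be sketched in a few lines of plain Python. The stop-word list below is a toy placeholder (a real pipeline would use spaCy's or NLTK's full list, and proper lemmatization rather than simple lowercasing):

```python
import re

# Illustrative stop-word list only; swap in spaCy's or NLTK's full list in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def clean_text(text: str) -> list[str]:
    """Lowercase, strip punctuation (normalization), and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("Revenue in the APAC region is growing on a yearly basis."))
```

The resulting token list is what would then be vectorized and written to the vector DB.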

3.3. Overall Sentiment

3.3.1. Traditional NLP Approaches

3.3.1.1. Low cost, low accuracy

3.3.1.2. Limitations

3.3.1.2.1. High number of training documents needed

3.3.1.2.1.1. Mitigation

3.3.1.2.1.1.1. Transfer Learning

3.3.1.2.1.2. Number of labeled documents needed

3.3.1.2.1.2.1. 2,000–3,000

3.3.2. OpenAI / Pre-trained LLM

3.3.2.1. High cost, high accuracy

3.3.2.2. Limitations?

3.3.2.2.1. Token Limits

3.3.2.2.1.1. GPT-3.5: 8,192 tokens (~6,000 words)

3.3.2.2.1.2. Mitigation

3.3.2.2.1.2.1. Vectorization
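One common mitigation for the token limit above is to split a long transcript into overlapping chunks before sending each to the model. The sketch below counts words for simplicity; a production system would count actual tokens with a tokenizer such as tiktoken:

```python
def chunk_words(text: str, max_words: int = 6000, overlap: int = 200) -> list[str]:
    """Split a long transcript into overlapping word windows, each small
    enough to fit a model's context limit (~6,000 words for an 8k-token model)."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # slide forward, keeping `overlap` words of context
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary visible to both chunks, at the cost of some duplicated tokens.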

3.4. Data processing

3.4.1. NER

3.4.1.1. What are we identifying ?

3.4.1.1.1. Products / Product Lines

3.4.1.1.2. People Names

3.4.1.1.2.1. Designation

3.4.1.1.3. Geographical Information

3.4.1.1.4. Companies

3.4.1.2. How are we doing it?

3.4.1.2.1. REBEL Model

3.4.1.2.2. AWS Comprehend

3.4.1.2.3. spaCy

3.4.2. Identifying Topics / Keywords

3.4.2.1. Rule Based Approach

3.4.2.2. Train Model based on Labeled Data
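The rule-based approach in 3.4.2.1 can be as simple as matching a curated keyword lexicon per topic. The topics and keyword lists below are illustrative placeholders, not a recommended lexicon:

```python
# Hypothetical topic lexicon; a real one would be curated by analysts.
TOPIC_KEYWORDS = {
    "guidance": {"guidance", "outlook", "forecast"},
    "margins": {"margin", "margins", "profitability"},
    "supply_chain": {"supply", "logistics", "inventory"},
}

def tag_topics(text: str) -> set[str]:
    """Return every topic at least one of whose keywords appears in the text."""
    words = set(text.lower().split())
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}

print(tag_topics("Management raised full-year guidance despite inventory pressure"))
```

A trained model (3.4.2.2) would replace the hand-written lexicon once enough labeled documents are available.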

3.5. Context Aware Sentiment Analysis

3.5.1. How?
3.6. Document Summary

3.6.1. How?

3.6.1.1. GPT based Approach

3.6.1.2. seq-to-seq model
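Alongside a GPT-based or seq-to-seq summarizer, a frequency-based extractive baseline is a cheap sanity check. This is a crude sketch, not a substitute for either model:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 1) -> list[str]:
    """Score each sentence by the summed corpus frequency of its words,
    then return the top-scoring n sentences in their original order."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    freqs = Counter(re.findall(r"[a-z]+", text.lower()))
    scored = [(sum(freqs[w] for w in re.findall(r"[a-z]+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return [s for _, _, s in top]

print(extractive_summary("Revenue grew. Revenue grew fast. Costs fell."))
```

Sentences containing the document's most repeated words win, which is exactly the kind of shortcut an abstractive model is meant to improve on.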

3.7. Key Questions asked

3.7.1. Rule-based approach to identify and list the questions asked in the call

3.7.1.1. Click on Question to navigate to Response
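A first-pass rule for 3.7.1 is simply "a question is a sentence ending in '?'". Real earnings-call parsing would also use speaker turns (analyst vs. management) to anchor each question to its response:

```python
import re

def extract_questions(transcript: str) -> list[str]:
    """Naive rule: split into sentences and keep those ending in '?'."""
    sentences = re.split(r"(?<=[.?!])\s+", transcript.strip())
    return [s for s in sentences if s.endswith("?")]

call = "Thanks for taking my question. How should we think about margins? Sure."
print(extract_questions(call))
```

Storing the character offset of each matched question would support the click-to-navigate behaviour described above.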

3.8. Network Graph of Related Parties

3.8.1. How?

3.8.1.1. Neo4j

4. At Aggregate Level
4.1. Identify Trending Topics

4.2. Identify Risks

5. Document Q&A
5.1. How scalable is it?

5.1.1. Depends on Vector Store

5.1.2. Dependency on OpenAI token limits when passing in vectors


6. LLM & Prompting
6.1. Why is it Needed?

6.2. Preprocessing Required

6.2.1. Chains & Agents

6.2.2. Chunking

6.2.3. Vector Store

6.2.4. Fine tuning the LLM

6.2.4.1. Adjusting Parameters

6.2.4.2. Prompt Templates

6.2.4.3. Add more Layers by Transfer Learning

6.2.4.3.1. Not possible for OpenAI

6.2.4.3.2. Can be done for Falcon 30B?

7. Knowledge Graph
7.1. Why do we need it?

7.1.1. To identify links between information based on probability / repeated occurrence

7.2. Clean Data and Create Graph

7.2.1. REBEL

7.2.1.1. Performs NER and converts the extracted relations to triplets

7.3. Storing the Graph Information & Visualizing it Interactively

7.3.1. Neo4j

7.3.2. Graph DB + PyViz
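The (head, relation, tail) triplets that REBEL emits map directly onto a graph store. Before reaching Neo4j, a minimal in-memory sketch (with hypothetical triplets) looks like this:

```python
from collections import defaultdict

# Hypothetical triplets in REBEL's (head, relation, tail) shape.
triplets = [
    ("Acme Corp", "acquired", "Widget Inc"),
    ("Acme Corp", "headquartered_in", "Singapore"),
    ("Widget Inc", "produces", "Sensors"),
]

# Adjacency list keyed by entity; Neo4j stores the same structure as
# nodes connected by typed relationships.
graph = defaultdict(list)
for head, rel, tail in triplets:
    graph[head].append((rel, tail))

def neighbors(entity: str) -> list[tuple[str, str]]:
    """All (relation, tail) pairs leaving an entity."""
    return graph[entity]

print(neighbors("Acme Corp"))
```

Repeated occurrences of the same triplet across documents can be counted on the edge, which is what makes the probability / repeated-occurrence linking in 7.1.1 possible.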

8. LLama Index?
9. Vector Store
9.1. Why do we need it?

9.1.1. To store embeddings

9.1.2. Links between Embeddings to give back related Information Quickly

9.2. Preprocessing Required

9.2.1. Converting Data to Embeddings

9.2.1.1. Internally handled by Vector Store Creation

9.2.1.2. Word Embedding

9.2.1.3. Glove?

9.2.1.4. Internally handled by OpenAI / Hugging Face

9.3. Options Available

9.3.1. Chroma

9.3.1.1. Chroma, in the context of vectorization, refers to an AI-native open-source vector database focused on developer productivity and happiness. It is designed to store and retrieve vector representations of data
efficiently. Chroma allows you to create collections of documents and perform similarity searches to find similar documents based on their vector representations.

9.3.2. Pinecone

9.3.3. Faiss
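Whichever option is chosen, the core operation they all provide is nearest-neighbour search over embeddings. Stripped of indexing tricks, a brute-force sketch (toy 3-dimensional vectors and hypothetical document IDs) shows what "give back related information quickly" means:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy store: document id -> embedding. Chroma / Pinecone / Faiss hold thousands
# of high-dimensional vectors and use ANN indexes instead of this linear scan.
store = {
    "q3_transcript": [0.9, 0.1, 0.0],
    "merger_news":   [0.1, 0.9, 0.2],
    "q2_transcript": [0.8, 0.2, 0.1],
}

def top_k(query: list[float], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query embedding."""
    return sorted(store, key=lambda d: cosine(query, store[d]), reverse=True)[:k]

print(top_k([1.0, 0.0, 0.0]))
```

The linear scan is O(n) per query; the listed vector DBs exist precisely to replace it with sub-linear approximate search.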
