
6-Week Project Plan: Advanced NIFTY 50 Stock Prediction System
Week 1-2: Data Collection & Preprocessing
• Setup: Install Python (3.8+), create a virtual environment or Colab notebook. Install libraries:
yfinance 1, pandas, numpy, requests, beautifulsoup4, pdfplumber, snscrape (or tweepy), spacy (or nltk), etc.
Assign roles: e.g. two members on price data collection, one on reports/news, one on tweets, one on initial data organization.
• Stock Prices: Use yfinance to download 10+ years of daily and weekly prices for each
NIFTY50 ticker. For example:

import os
import yfinance as yf

tickers = ["TCS.NS", "INFY.NS", ...]  # NIFTY50 symbols
for ticker in tickers:
    os.makedirs(ticker, exist_ok=True)  # one folder per company
    df_daily = yf.Ticker(ticker).history(period="10y", interval="1d")
    df_daily.to_csv(f"{ticker}/prices.csv")
    df_weekly = yf.Ticker(ticker).history(period="10y", interval="1wk")
    df_weekly.to_csv(f"{ticker}/prices_weekly.csv")

This saves CSVs in each company’s folder (e.g. TCS/prices.csv). yfinance “offers a Pythonic
way to fetch financial & market data” from Yahoo Finance 1. Use small batches or threading to
avoid rate limits.
• Financial Reports (PDF/HTML): For each company, locate investor-relations or stock-exchange
pages. Use requests + BeautifulSoup to find PDF links of quarterly and annual reports.
Example:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ir_url = "https://example.com/IR/financials"
res = requests.get(ir_url)
soup = BeautifulSoup(res.text, 'html.parser')
for a in soup.find_all('a', href=re.compile(r'\.pdf$')):
    pdf_url = urljoin(ir_url, a['href'])   # handle relative links
    filename = pdf_url.split("/")[-1]
    content = requests.get(pdf_url).content
    with open(f"{ticker}/financials/{filename}", "wb") as f:
        f.write(content)

After downloading PDFs, extract text and tables using PDFPlumber 2 . Example:

import pdfplumber

with pdfplumber.open(f"{ticker}/financials/Q4_2023.pdf") as pdf:
    # extract_text() can return None for scanned/image-only pages
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
with open(f"{ticker}/financials/Q4_2023.txt", "w") as f:
    f.write(text)

Store the raw text (and optionally the raw PDF) in financials/. Financial tables can also be parsed with tabula-py if needed, as sketched below.
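If a report’s tables need structured extraction, a minimal tabula-py sketch (assuming Java is installed and using a hypothetical Q4_2023.pdf path) could look like:

import tabula

# Extract every table in the report as a list of DataFrames
tables = tabula.read_pdf(f"{ticker}/financials/Q4_2023.pdf", pages="all")
for i, table in enumerate(tables):
    table.to_csv(f"{ticker}/financials/Q4_2023_table{i}.csv", index=False)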
• News Headlines/Articles: Use a news API (e.g. NewsAPI) or scrape websites (like Moneycontrol,
EconomicTimes). For example, fetch via NewsAPI:

import json
import requests

api_key = "YOUR_NEWSAPI_KEY"
url = (f"https://newsapi.org/v2/everything"
       f"?q={ticker}&language=en&apiKey={api_key}")
res = requests.get(url)
articles = res.json()['articles']
with open(f"{ticker}/news.json", "w") as f:
    json.dump(articles, f)

Or scrape RSS/HTML: use BeautifulSoup to collect headlines and dates. Store them in JSON or CSV as a list of {date, title, content}.
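As a rough alternative to NewsAPI, an RSS feed can be parsed with requests + BeautifulSoup; this is only a sketch, the feed URL is a placeholder, and the "xml" parser assumes lxml is installed:

import json
import requests
from bs4 import BeautifulSoup

rss_url = "https://example.com/company-news/rss"  # placeholder feed URL
res = requests.get(rss_url)
soup = BeautifulSoup(res.content, "xml")  # "xml" parser requires lxml
items = []
for item in soup.find_all("item"):
    items.append({
        "date": item.find("pubDate").text,
        "title": item.find("title").text,
        "content": item.find("description").text if item.find("description") else "",
    })
with open(f"{ticker}/news.json", "w") as f:
    json.dump(items, f)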
• Twitter Data: Use snscrape (no API key required) to collect tweets containing company
name/handles. Example:

import snscrape.modules.twitter as sntwitter
import pandas as pd

query = f"{ticker} lang:en since:2018-01-01 until:2025-01-01"
tweets = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    tweets.append({'date': tweet.date.date(), 'content': tweet.content})
    if len(tweets) >= 10000:
        break
pd.DataFrame(tweets).to_json(f"{ticker}/tweets.json", orient='records')

Save tweets as a JSON list of {date, content}. Be mindful of the platform’s terms of service and legal constraints on scraping.
• Text Preprocessing: Ingest all collected text (news, tweets, report text). Use spaCy or NLTK to
clean: lowercase, remove stopwords, punctuation, and lemmatize. Example with spaCy:

import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def preprocess(text):
    doc = nlp(text)
    return " ".join(token.lemma_.lower() for token in doc
                    if token.is_alpha and not token.is_stop)

clean_texts = [preprocess(t) for t in raw_texts]

Store cleaned text if needed (or process on-the-fly later).


• Data Organization: Create a folder per company with this structure:

TCS/
    prices.csv          # daily prices
    prices_weekly.csv   # weekly prices
    financials/         # downloaded PDFs and extracted text/tables
    news.json           # list of news items (date, title, content)
    tweets.json         # list of tweets (date, content)
    sentiment.csv       # (to be filled later)

Use consistent naming. This structured hierarchy keeps each company’s data self-contained.
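The per-company layout can be created up front; a small sketch, assuming the tickers list from the price-download step:

import os

for ticker in tickers:
    # creates e.g. TCS.NS/financials/ (and the parent folder) if missing
    os.makedirs(f"{ticker}/financials", exist_ok=True)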
• Learning Resources:
• yfinance Tutorial: (e.g. the yfinance GitHub repo 1 or AlgorTrading101 guide)
• Web Scraping: BeautifulSoup tutorial (official docs or tutorial)
• PDF Parsing: PDFPlumber guide 3 or PyMuPDF
• NLP Basics: spaCy/NLTK documentation on tokenization, lemmatization
• Twitter Scraping: snscrape GitHub examples
• Environment: Run data collection on a local machine or Colab. No heavy GPU needed. Ensure
~10–20 GB disk for data. Use Google Drive (mounted) if using Colab to store data.

Week 3: Fine-Tuning the Financial LLM


• What to do: Prepare a domain-specific Q&A dataset and fine-tune an LLM so it can answer
finance questions.
• Q&A Data Prep: From the collected financial texts (reports, transcripts, news), generate
instruction-response pairs. For example:

Q: "What was TCS’s net profit in Q4 2023?"


A: "Net profit was ₹4,453 crore."

You can script this by identifying key figures in tables and crafting corresponding questions, or
manually create a few hundred samples to start. Format data in “instruction-input-response”
JSONL (Alpaca style). For example:

{"instruction": "What was TCS's net profit in Q4 2023?", "input": "",


"output": "₹4,453 crore"}

• Model Selection: Use an open-source model (e.g. LLaMA-2 7B or Mistral-7B). Fine-tuning these requires a substantial GPU.
• Environment: Use Google Colab Pro+ with an A100 (40 GB VRAM) or a similar GPU machine.
Install transformers, peft, and bitsandbytes for low-memory fine-tuning.
• Setup QLoRA/LoRA: Load the model in 4-bit to save memory and attach LoRA adapters.
Example code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             load_in_4bit=True,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
lora_config = LoraConfig(r=8, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

This employs 4-bit quantization (via BitsAndBytes) together with LoRA adapters – a strategy
known as QLoRA. QLoRA “quantizes a model to 4-bits and then trains it with LoRA, enabling fine-
tuning of even very large models on a single GPU” 4 .
• Training: Use the Hugging Face Trainer or TRL SFT script. For example:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./llm-ft",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,  # use mixed precision
    logging_steps=50
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=qa_dataset)
trainer.train()

Use a small batch size (1) with gradient accumulation to simulate larger batches. Enable fp16/
bf16 and possibly gradient checkpointing to fit into memory. According to HuggingFace
examples, using --load_in_4bit with --use_peft performs QLoRA training 5 . Expect
several hours of training for a few epochs.
• Validation: After training, test the model on held-out questions. For example:

prompt = "What was TCS's net profit in Q4 2023?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The model should answer consistently with the company reports.


• Tools/Libraries: PyTorch, HuggingFace Transformers, PEFT (LoRA), BitsAndBytes, Datasets (for
QA data).
• What to Learn:
• HuggingFace PEFT/QLoRA tutorials (see docs or example scripts 4 5 )
• Creating instruction-tuning datasets (Alpaca/Stanford Alpaca guides)
• Google Colab GPU usage (A100 details)
• Hardware Guidance: On an A100-40GB, a 7B model with LoRA at 4-bit typically fits with batch
size 1. Use gradient_accumulation_steps ~ 16 to emulate a larger batch. Training 3–5
epochs may take ~4–8 hours. Save checkpoints to Google Drive or Hugging Face Hub.
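With PEFT, only the LoRA adapter weights need to be checkpointed (they are small compared to the base model); a minimal sketch, with the Drive path as an example only:

# Save the LoRA adapter and tokenizer (assumes Google Drive is mounted at /content/drive)
model.save_pretrained("/content/drive/MyDrive/llm-ft/adapter")
tokenizer.save_pretrained("/content/drive/MyDrive/llm-ft/adapter")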

Week 4: Sentiment Analysis Pipeline & RAG Setup


• Sentiment Labeling: Use a financial-domain sentiment model to score news and tweets. A good
choice is FinBERT (e.g. yiyanghkust/finbert-tone on HuggingFace). This model is pre-
trained on financial text and fine-tuned for sentiment 6 . Example usage with Transformers
pipeline 7 :

from transformers import pipeline

sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="yiyanghkust/finbert-tone",
    tokenizer="yiyanghkust/finbert-tone"
)
texts = ["Revenue is growing steadily",
         "We have some concerns about cash flow"]
results = sentiment_pipeline(texts)
# results: [{'label': 'LABEL_1', 'score': 0.95}, ...]

This labels each piece as positive/neutral/negative. Apply this to each news headline/article and
tweet.
• Aggregate Scores: For each company, aggregate sentiment by day or week. For example, count
how many positive/negatives occurred on each date. Use pandas:

import pandas as pd

df = pd.read_json(f"{ticker}/news.json")  # columns: date, title, content
# Truncate long articles (rough character-level cut) to stay within the model's input limit
df['sentiment'] = df['content'].apply(lambda t: sentiment_pipeline(t[:512])[0]['label'])
daily_counts = df.groupby('date')['sentiment'].value_counts().unstack(fill_value=0)
daily_counts.to_csv(f"{ticker}/sentiment.csv")

The resulting sentiment.csv has columns like date, positive_count, neutral_count, negative_count. This file is stored in the company folder.
• Retrieval-Augmented Generation (RAG): Build a semantic search over the collected documents
so the LLM can look up facts. Steps:
• Chunk Documents: Break each company’s reports/transcripts into smaller chunks (e.g. ~1000
characters or paragraphs) to limit context size.
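A minimal character-based chunker (paragraph-level splitting works too) could look like this; the overlap value is just a starting point:

def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

report_text = open(f"{ticker}/financials/Q4_2023.txt").read()
chunks = chunk_text(report_text)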
• Embeddings: Use a sentence-transformers model (e.g. all-MiniLM-L6-v2 ) to embed each
chunk. Example:

from sentence_transformers import SentenceTransformer


embedder = SentenceTransformer('all-MiniLM-L6-v2')
chunks = ["chunk1 text", "chunk2 text", ...]
embeddings = embedder.encode(chunks)

(See the Sentence Transformers Quickstart for details.)


• Vector Store: Store embeddings in FAISS or Chroma. Using FAISS:

import faiss
import numpy as np

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.asarray(embeddings, dtype=np.float32))
# Save the index if needed, e.g. faiss.write_index(index, f"{ticker}/embeddings.faiss")

Keep a mapping from vector indices to the original text chunks (e.g. a list or dict).
• Retrieval Pipeline: At query time, embed the user query and find nearest chunks. For example:

query = "What was TCS's revenue in 2022?"
q_emb = embedder.encode([query])
D, I = index.search(q_emb.astype(np.float32), k=3)
context = " ".join(chunks[i] for i in I[0])

Prepend or append these contexts to the query prompt. Then feed to the fine-tuned LLM to
generate an answer grounded in retrieved texts.
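A minimal sketch of that final step, assuming the fine-tuned model and tokenizer from Week 3 are already loaded:

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {context}\n\n"
    f"Question: {query}\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))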
• Where to Run: Embedding and FAISS building can be done on Colab (GPU accelerates
embedding) or local machine. FAISS retrieval is fast on CPU. Sentiment pipeline can use GPU (for
speed) or CPU.
• Tools: Transformers, sentence-transformers, FAISS (or ChromaDB), pandas.
• What to Learn:
• FinBERT/model-card usage 7 and sentiment pipelines.
• Sentence-Transformers documentation (see Quickstart 8 ).
• FAISS tutorial for vector search.
• RAG concepts (HuggingFace RAG tutorials or blog posts).
• Sample Code:

# Sentiment analysis example (using FinBERT)
from transformers import pipeline

sentences = ["The market outlook is positive.",
             "Earnings fell short of expectations."]
finbert = pipeline("sentiment-analysis", model="yiyanghkust/finbert-tone")
labels = finbert(sentences)
print(labels)  # e.g. [{'label': 'LABEL_1', 'score': 0.97}, ...]

# Building and querying a FAISS index
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = ["Earnings summary ...", "Balance sheet ...", "Cash flow ..."]
embeds = model.encode(chunks)
index = faiss.IndexFlatL2(embeds.shape[1])
index.add(embeds)
query = model.encode(["TCS net profit Q1 2023?"])
D, I = index.search(query, k=2)
retrieved = [chunks[i] for i in I[0]]
print(retrieved)

• Data Structuring: Add sentiment.csv to each company folder. Save any index files or pickled
embeddings (e.g. TCS/embeddings.faiss ) separately.

Week 5: Prediction Module (Stock Forecasting)


• Features: For each company, merge historical prices, technical indicators, and sentiment
features into a training dataset. Compute indicators using TA-Lib (or pandas-ta ):

import talib
import pandas as pd

df = pd.read_csv(f"{ticker}/prices.csv", parse_dates=['Date'], index_col='Date')
df['RSI_14'] = talib.RSI(df['Close'], timeperiod=14)
df['SMA_50'] = talib.SMA(df['Close'], timeperiod=50)

Merge sentiment data (from sentiment.csv) on date, e.g. include positive_count and negative_count as additional features for the same date, as in the sketch below.
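A rough sketch of the merge, assuming sentiment.csv uses the column names produced in Week 4:

sent = pd.read_csv(f"{ticker}/sentiment.csv", parse_dates=['date'], index_col='date')
sent_cols = ['positive_count', 'negative_count']
df = df.join(sent[sent_cols], how='left')        # align on the date index
df[sent_cols] = df[sent_cols].fillna(0)          # days with no news get zero counts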
• Data Preparation: Create input-output pairs. For example, use the past 30 days of features to
predict the next day’s closing price. Normalize features with sklearn.preprocessing
(MinMax or StandardScaler). Split data chronologically (e.g. train on 2013–2021, test on 2022–
2023).
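A minimal windowing-and-scaling sketch under those assumptions (30-day window, chronological split; feature_cols is illustrative and assumes the sentiment columns have already been merged in):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

window = 30
feature_cols = ['Close', 'RSI_14', 'SMA_50', 'positive_count', 'negative_count']
data = df[feature_cols].dropna()

split = int(len(data) * 0.8)                    # chronological split point
scaler = MinMaxScaler().fit(data.iloc[:split])  # fit on the training portion only
scaled = scaler.transform(data)

X, y = [], []
for i in range(window, len(scaled)):
    X.append(scaled[i - window:i])              # past 30 days of features
    y.append(scaled[i, 0])                      # next day's (scaled) close
X, y = np.array(X), np.array(y)
X_train, X_test = X[:split - window], X[split - window:]
y_train, y_test = y[:split - window], y[split - window:]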
• Model: Implement a sequential model in PyTorch. An example LSTM regressor:

import torch.nn as nn

class StockLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

This follows the pattern in PyTorch LSTM tutorials 9 . You can also experiment with GRU or
simple Transformer layers for time series. Include sentiment features as part of input_dim .
• Training: Train with MSE loss to predict next-day price. Example:

model = StockLSTM(input_dim=number_of_features)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(epochs):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        optimizer.step()

Use appropriate batch sizing (small batches if data limited). Consider training a separate model
per ticker or a single multi-stock model (one-hot encode company as feature).
• Evaluation: Compute RMSE/MAE on test set. Plot predicted vs actual prices. Adjust
hyperparameters as needed.
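A minimal evaluation sketch, assuming the X_test/y_test arrays from the data-preparation step and the trained model:

import numpy as np
import torch
import matplotlib.pyplot as plt

model.eval()
with torch.no_grad():
    preds = model(torch.tensor(X_test, dtype=torch.float32)).squeeze().numpy()

rmse = np.sqrt(np.mean((preds - y_test) ** 2))
mae = np.mean(np.abs(preds - y_test))
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")

plt.plot(y_test, label="actual")
plt.plot(preds, label="predicted")
plt.legend()
plt.title(f"{ticker}: next-day close (scaled)")
plt.show()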
• Tools: PyTorch (or TensorFlow/Keras if preferred), sklearn (for scaling/split), talib .
• What to Learn:
• Time-series forecasting with LSTM in PyTorch (e.g. MachineLearningMastery code 9 ).
• Computing technical indicators (TA-Lib docs).

• Handling multivariate regression.
• Sample Code:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Example: prepare data as tensors
X_train = torch.tensor(train_features.values,
                       dtype=torch.float32).reshape(-1, window, input_dim)
y_train = torch.tensor(train_targets.values,
                       dtype=torch.float32).reshape(-1, 1)
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=16,
                    shuffle=True)

model = StockLSTM(input_dim=input_dim, hidden_dim=32)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()
for epoch in range(50):
    for X_batch, Y_batch in loader:
        optimizer.zero_grad()
        output = model(X_batch)
        loss = criterion(output, Y_batch)
        loss.backward()
        optimizer.step()

• Hardware: Training LSTM on one company’s data is light (~MBs of data). A CPU can suffice, but a
GPU (even a small one) will speed up training. Ensure memory for data scaling (few dozen
features × a few thousand timesteps).

Week 6: Integration & UI Planning


• Integration: Test end-to-end flow. For a user query (e.g. “What was Infosys’s profit last
quarter?”), run it through the RAG pipeline: embed query, retrieve chunks, feed to fine-tuned
LLM, and return answer. Validate with known facts. Also check prediction module: ensure it can
output forecasts given recent data. Write scripts/notebooks that tie these steps together.
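A rough sketch of the glue function used in the UI outline below (embedder, index, chunks, model, and tokenizer refer to the components built in Weeks 3–4; a per-ticker index is an assumption):

import numpy as np

def rag_llm_answer(ticker, query, k=3):
    """Retrieve relevant chunks for a ticker and ask the fine-tuned LLM."""
    # In practice, load the FAISS index and chunk list saved for this ticker
    q_emb = embedder.encode([query]).astype(np.float32)
    _, idx = index.search(q_emb, k)                      # nearest chunks
    context = " ".join(chunks[i] for i in idx[0])
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(out[0], skip_special_tokens=True)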
• UI (Planned): Design a simple interface (deferred to after core work). For now, outline a
Streamlit or Flask app. Example Streamlit idea:

import streamlit as st

st.title("NIFTY 50 Stock Predictor")
ticker = st.selectbox("Select company", tickers)
query = st.text_input("Ask a question about the company")
if st.button("Get Answer"):
    answer = rag_llm_answer(ticker, query)
    st.write(answer)
    st.line_chart(prediction_dataframe[ticker])  # show forecasts

No code needed immediately, but consider how to present retrieved context vs model answer,
and how to display price charts.

• Documentation & Testing: Write README with instructions. Each team member documents
their module (data, LLM, sentiment, prediction). Perform unit tests on functions (e.g. does data
loader handle missing values?).
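For example, a tiny pytest-style check of missing-value handling might look like this (the forward-fill is just one possible policy):

import pandas as pd

def test_prices_have_no_missing_closes():
    df = pd.DataFrame({'Close': [100.0, None, 102.0]})
    cleaned = df.ffill()                    # the loader's gap-filling policy
    assert not cleaned['Close'].isna().any()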
• What to Learn:
• Streamlit quickstart (https://docs.streamlit.io) for dashboards.
• Flask tutorial (if choosing web app) for forms and API endpoints.
• Plotting libraries (Matplotlib/Plotly) for time-series charts.
• Hardware: The UI is light – can run on any server or local machine. Final integration just uses
already-trained models.

Note: Throughout the project, communicate frequently. Use version control (Git) and share
intermediate results (e.g., checkpoints, sample data). Allocate tasks so multiple people work in parallel:
for instance, while two team members scrape and clean data (Weeks 1–2), others can begin designing
the Q&A format and LLM fine-tune pipeline (Weeks 2–3). The 5-person team can split roughly: Data
Engineering, NLP (LLM), Sentiment, Prediction Modeling, and Integration/UX.

Key References:
- yfinance (financial data) 1 ;
- FinBERT (financial sentiment) 6 7 ;
- QLoRA fine-tuning (PEFT tutorial) 4 5 ;
- PyTorch LSTM example 9 .

1 GitHub - ranaroussi/yfinance: Download market data from Yahoo! Finance's API
https://github.com/ranaroussi/yfinance

2 3 A Step-by-Step Guide to Parsing PDFs using the pdfplumber Library In Python | by Azhar Sayyad | Medium
https://azhar-sayyad.medium.com/a-step-by-step-guide-to-parsing-pdfs-using-the-pdfplumber-library-in-python-c12d94ae9f07

4 Quantization
https://huggingface.co/docs/peft/en/developer_guides/quantization

5 Llama2 fine-tunning with PEFT QLora and testing the model - Transformers - Hugging Face Forums
https://discuss.huggingface.co/t/llama2-fine-tunning-with-peft-qlora-and-testing-the-model/50581

6 GitHub - ProsusAI/finBERT: Financial Sentiment Analysis with BERT
https://github.com/ProsusAI/finBERT

7 yiyanghkust/finbert-tone · Hugging Face
https://huggingface.co/yiyanghkust/finbert-tone

8 Quickstart — Sentence Transformers documentation
https://sbert.net/docs/quickstart.html

9 LSTM for Time Series Prediction in PyTorch - MachineLearningMastery.com
https://machinelearningmastery.com/lstm-for-time-series-prediction-in-pytorch/
