6-Week Project Plan - Advanced NIFTY 50 Stock Prediction System
Week 1-2: Data Collection & Preprocessing
• Setup: Install Python (3.8+) and create a virtual environment or Colab notebook. Install libraries:
yfinance 1 , pandas , numpy , requests , beautifulsoup4 , pdfplumber ,
snscrape (or tweepy ), spacy (or nltk ), etc. Assign roles: e.g. two members on price
data collection, one on reports/news, one on tweets, one on initial data organization.
• Stock Prices: Use yfinance to download 10+ years of daily and weekly prices for each
NIFTY50 ticker. For example:
import os
import yfinance as yf

tickers = ["TCS.NS", "INFY.NS", ...]  # NIFTY50 symbols
for ticker in tickers:
    os.makedirs(ticker, exist_ok=True)  # one folder per company
    df_daily = yf.Ticker(ticker).history(period="10y", interval="1d")
    df_daily.to_csv(f"{ticker}/prices.csv")
    df_weekly = yf.Ticker(ticker).history(period="10y", interval="1wk")
    df_weekly.to_csv(f"{ticker}/prices_weekly.csv")
This saves CSVs in each company’s folder (e.g. TCS/prices.csv ). yfinance “offers a Pythonic
way to fetch financial & market data” from Yahoo Finance 1 . Use small batches or threading to
avoid rate limits.
• Financial Reports (PDF/HTML): For each company, locate investor-relations or stock-exchange
pages. Use requests + BeautifulSoup to find PDF links of quarterly and annual reports.
Example:
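A minimal sketch, assuming a placeholder investor-relations URL and a crude ".pdf" link filter (the real pages and link patterns differ per company):

import os
import requests
from bs4 import BeautifulSoup

ir_url = "https://www.example.com/investor-relations"  # placeholder IR page URL
soup = BeautifulSoup(requests.get(ir_url, timeout=30).text, "html.parser")
os.makedirs(f"{ticker}/financials", exist_ok=True)
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.lower().endswith(".pdf"):  # crude filter for quarterly/annual report PDFs
        filename = href.split("/")[-1]
        with open(f"{ticker}/financials/{filename}", "wb") as f:
            f.write(requests.get(href, timeout=60).content)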
After downloading PDFs, extract text and tables using PDFPlumber 2 . Example:
import pdfplumber

with pdfplumber.open(f"{ticker}/financials/Q4_2023.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
with open(f"{ticker}/financials/Q4_2023.txt", "w") as f:
    f.write(text)
Store raw text (and optionally the raw PDF) in financials/ . Financial tables can also be parsed with
tabula-py if needed.
• News Headlines/Articles: Use a news API (e.g. NewsAPI) or scrape websites (like Moneycontrol,
EconomicTimes). For example, fetch via NewsAPI:
import json
import requests

api_key = "YOUR_NEWSAPI_KEY"
url = (f"https://newsapi.org/v2/everything?q={ticker}"
       f"&language=en&apiKey={api_key}")
articles = requests.get(url).json()["articles"]
with open(f"{ticker}/news.json", "w") as f:
    json.dump(articles, f)
Or scrape RSS/HTML: use BeautifulSoup to collect headlines and dates. Store in JSON or CSV
as a list of {date, title, content} .
• Twitter Data: Use snscrape (no API key required) to collect tweets containing company
name/handles. Example:
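A minimal sketch using snscrape's Python module (the query string is a placeholder, and tweet attribute names vary slightly between snscrape versions):

import json
import snscrape.modules.twitter as sntwitter

query = 'TCS OR "Tata Consultancy Services" since:2013-01-01'  # placeholder search query
tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 5000:  # cap the number of tweets per company
        break
    tweets.append({"date": str(tweet.date), "content": tweet.content})
with open(f"{ticker}/tweets.json", "w") as f:
    json.dump(tweets, f)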
Save tweets as JSON list of {date, content} . Be mindful of legal/social media terms.
• Text Preprocessing: Ingest all collected text (news, tweets, report text). Use spaCy or NLTK to
clean: lowercase, remove stopwords, punctuation, and lemmatize. Example with spaCy:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def preprocess(text):
    doc = nlp(text)
    return " ".join(token.lemma_.lower() for token in doc
                    if token.is_alpha and not token.is_stop)

clean_texts = [preprocess(t) for t in raw_texts]
• Data Organization: Keep each company's data in its own folder, e.g.:

TCS/
    prices.csv           # daily prices
    prices_weekly.csv    # weekly prices
    financials/          # downloaded PDFs and extracted text/tables
    news.json            # list of news items (date, title, content)
    tweets.json          # list of tweets (date, text)
    sentiment.csv        # (to be filled later)
Use consistent naming. This structured hierarchy keeps each company’s data self-contained.
• Learning Resources:
• yfinance Tutorial: (e.g. HuggingFace repo 1 or AlgorTrading101 guide)
• Web Scraping: BeautifulSoup tutorial (official docs or tutorial)
• PDF Parsing: PDFPlumber guide 3 or PyMuPDF
• NLP Basics: spaCy/NLTK documentation on tokenization, lemmatization
• Twitter Scraping: snscrape GitHub examples
• Environment: Run data collection on a local machine or Colab. No heavy GPU needed. Ensure
~10–20 GB disk for data. Use Google Drive (mounted) if using Colab to store data.
Build instruction-style Q&A pairs from the collected financial data. You can script this by identifying
key figures in tables and crafting corresponding questions, or manually create a few hundred samples
to start. Format the data as "instruction-input-response" JSONL (Alpaca style). For example:
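A minimal sketch of writing one record (the question, context, and answer below are placeholders, not real figures):

import json

record = {
    "instruction": "What was the company's total revenue in FY2023?",
    "input": "Excerpt from the FY2023 annual report: ...",  # extracted context passage (placeholder)
    "output": "Total revenue for FY2023 was Rs X crore."    # placeholder answer
}
with open("qa_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")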
• Model Selection: Use an open-source model (e.g. LLaMA-2 7B or Mistral-7B). These require
substantial GPU memory to fine-tune.
• Environment: Use Google Colab Pro+ with an A100 (40 GB VRAM) or a similar GPU machine.
Install transformers , peft , and bitsandbytes for low-memory fine-tuning.
• Setup QLoRA/LoRA: Load the model in 4-bit to save memory and attach LoRA adapters.
Example code:
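A minimal sketch, assuming Mistral-7B and commonly used LoRA hyperparameters (the target modules should be adjusted to the chosen model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adjust per model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # attach LoRA adapters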
This employs 4-bit quantization (via BitsAndBytes) together with LoRA adapters – a strategy
known as QLoRA. QLoRA “quantizes a model to 4-bits and then trains it with LoRA, enabling fine-
tuning of even very large models on a single GPU” 4 .
• Training: Use the Hugging Face Trainer or TRL SFT script. For example:
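A minimal sketch using the Hugging Face Trainer, assuming the model and tokenizer from the previous step and the Alpaca-style qa_dataset.jsonl:

from datasets import load_dataset
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # many causal LMs define no pad token
dataset = load_dataset("json", data_files="qa_dataset.jsonl", split="train")

def format_and_tokenize(example):
    prompt = (f"### Instruction:\n{example['instruction']}\n"
              f"### Input:\n{example['input']}\n### Response:\n{example['output']}")
    return tokenizer(prompt, truncation=True, max_length=1024)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="qlora-finetune",
    per_device_train_batch_size=1,      # small batch, see note below
    gradient_accumulation_steps=16,     # simulate a larger batch
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()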
Use a small batch size (1) with gradient accumulation to simulate larger batches. Enable fp16/
bf16 and possibly gradient checkpointing to fit into memory. According to HuggingFace
examples, using --load_in_4bit with --use_peft performs QLoRA training 5 . Expect
several hours of training for a few epochs.
• Validation: After training, test the model on held-out questions. For example:
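A minimal sketch of prompting the fine-tuned model with one held-out question (the question is a placeholder):

prompt = ("### Instruction:\nWhat was the company's net profit in Q4 FY2023?\n"
          "### Input:\n\n### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))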
"sentiment-analysis",
model="yiyanghkust/finbert-tone",
tokenizer="yiyanghkust/finbert-tone"
)
texts = ["Revenue is growing steadily", "We have some concerns about
cash flow"]
results = sentiment_pipeline(texts)
# results: [{'label':'LABEL_1','score':0.95}, ...]
This labels each piece as positive/neutral/negative. Apply this to each news headline/article and
tweet.
• Aggregate Scores: For each company, aggregate sentiment by day or week. For example, count
how many positive/negatives occurred on each date. Use pandas:
import pandas as pd

df = pd.read_json(f"{ticker}/news.json")  # columns: date, content
df['sentiment'] = df['content'].apply(lambda t: sentiment_pipeline(t)[0]['label'])
daily_counts = df.groupby('date')['sentiment'].value_counts().unstack(fill_value=0)
daily_counts.to_csv(f"{ticker}/sentiment.csv")
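• Embedding Index: Chunk each company's report and news text, embed the chunks, and build a vector
index for retrieval. A minimal sketch, assuming sentence-transformers with the all-MiniLM-L6-v2
model and FAISS (both listed under Tools), and a report_text string loaded from financials/ :

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# chunks: list of text passages (here: crude fixed-size slices of the report text)
chunks = [report_text[i:i + 2000] for i in range(0, len(report_text), 2000)]
embeddings = embedder.encode(chunks)

index = faiss.IndexFlatL2(embeddings.shape[1])   # exact L2 search
index.add(np.asarray(embeddings, dtype=np.float32))
faiss.write_index(index, f"{ticker}/embeddings.faiss")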
Keep a mapping from vector indices to the original text chunks (e.g. a list or dict).
• Retrieval Pipeline: At query time, embed the user query and find nearest chunks. For example:
query = "What was TCS's revenue in 2022?"
q_emb = embedder.encode([query])
D, I = index.search(q_emb.astype(np.float32), 3)  # distances and indices of the top-3 chunks
context = " ".join(chunks[i] for i in I[0])
Prepend or append these contexts to the query prompt. Then feed to the fine-tuned LLM to
generate an answer grounded in retrieved texts.
• Where to Run: Embedding and FAISS building can be done on Colab (GPU accelerates
embedding) or local machine. FAISS retrieval is fast on CPU. Sentiment pipeline can use GPU (for
speed) or CPU.
• Tools: Transformers, sentence-transformers, FAISS (or ChromaDB), pandas.
• What to Learn:
• FinBERT/model-card usage 7 and sentiment pipelines.
• Sentence-Transformers documentation (see Quickstart 8 ).
• FAISS tutorial for vector search.
• RAG concepts (HuggingFace RAG tutorials or blog posts).
• Sample Code:
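A minimal sketch of a helper that ties retrieval and generation together (rag_llm_answer is reused by the UI later; a single in-memory index and chunk list are assumed here rather than per-ticker files):

def rag_llm_answer(ticker, query, k=3):
    """Retrieve the top-k chunks for the query and answer with the fine-tuned LLM."""
    # in a full implementation, `ticker` would select that company's index and chunks
    q_emb = embedder.encode([query]).astype(np.float32)
    _, idx = index.search(q_emb, k)
    context = " ".join(chunks[i] for i in idx[0])
    prompt = f"### Instruction:\n{query}\n### Input:\n{context}\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)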
• Data Structuring: Add sentiment.csv to each company folder. Save any index files or pickled
embeddings (e.g. TCS/embeddings.faiss ) separately.
• Feature Engineering: Compute technical indicators from the saved price data (e.g. with TA-Lib):

import pandas as pd
import talib

df = pd.read_csv(f"{ticker}/prices.csv", parse_dates=['Date'], index_col='Date')
df['RSI_14'] = talib.RSI(df['Close'], timeperiod=14)
df['SMA_50'] = talib.SMA(df['Close'], timeperiod=50)
Merge sentiment data (from sentiment.csv ) on date. For example, include pos_count ,
neg_count as additional features for the same date.
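A minimal sketch of the merge (column names depend on how sentiment.csv was written; the renames below are assumptions):

sent = pd.read_csv(f"{ticker}/sentiment.csv", parse_dates=['date'], index_col='date')
sent = sent.rename(columns={'Positive': 'pos_count', 'Negative': 'neg_count'})
# dates must share the same format/timezone for the join to align
df = df.join(sent[['pos_count', 'neg_count']], how='left').fillna(0)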
• Data Preparation: Create input-output pairs. For example, use the past 30 days of features to
predict the next day’s closing price. Normalize features with sklearn.preprocessing
(MinMax or StandardScaler). Split data chronologically (e.g. train on 2013–2021, test on 2022–
2023).
• Model: Implement a sequential model in PyTorch. An example LSTM regressor:
import torch.nn as nn

class StockLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])
This follows the pattern in PyTorch LSTM tutorials 9 . You can also experiment with GRU or
simple Transformer layers for time series. Include sentiment features as part of input_dim .
• Training: Train with MSE loss to predict next-day price. Example:
model = StockLSTM(input_dim=number_of_features)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(epochs):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        optimizer.step()
Use appropriate batch sizing (small batches if data limited). Consider training a separate model
per ticker or a single multi-stock model (one-hot encode company as feature).
• Evaluation: Compute RMSE/MAE on test set. Plot predicted vs actual prices. Adjust
hyperparameters as needed.
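A minimal sketch of the evaluation step (assumes a test_loader built the same way as train_loader):

import torch
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

model.eval()
preds, actuals = [], []
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        preds.append(model(x_batch).numpy().reshape(-1))
        actuals.append(y_batch.numpy().reshape(-1))
preds, actuals = np.concatenate(preds), np.concatenate(actuals)

rmse = np.sqrt(mean_squared_error(actuals, preds))
mae = mean_absolute_error(actuals, preds)
print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")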
• Tools: PyTorch (or TensorFlow/Keras if preferred), sklearn (for scaling/split), talib .
• What to Learn:
• Time-series forecasting with LSTM in PyTorch (e.g. MachineLearningMastery code 9 ).
• Computing technical indicators (TA-Lib docs).
• Handling multivariate regression.
• Sample Code:
import torch
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
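A minimal sketch of turning the feature DataFrame into windowed training batches with these utilities (feature_columns and the scaled df are assumed from the Data Preparation step; the 30-day window matches the setup above):

window = 30
features = df[feature_columns].values   # scaled feature matrix, shape (T, num_features)
targets = df['Close'].values            # next-day close is the prediction target

X, y = [], []
for i in range(len(features) - window):
    X.append(features[i:i + window])    # past 30 days of features
    y.append(targets[i + window])       # the following day's close
X = torch.tensor(np.array(X), dtype=torch.float32)
y = torch.tensor(np.array(y), dtype=torch.float32).unsqueeze(-1)

train_ds = TensorDataset(X, y)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=False)  # preserve time order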
• Hardware: Training LSTM on one company’s data is light (~MBs of data). A CPU can suffice, but a
GPU (even a small one) will speed up training. Ensure memory for data scaling (few dozen
features × a few thousand timesteps).
• User Interface: Build a simple dashboard (e.g. with Streamlit) that lets a user pick a company, ask a
question, and view forecasts:

import streamlit as st

st.title("NIFTY 50 Stock Predictor")
ticker = st.selectbox("Select company", tickers)
query = st.text_input("Ask a question about the company")
if st.button("Get Answer"):
    answer = rag_llm_answer(ticker, query)
    st.write(answer)
st.line_chart(prediction_dataframe[ticker])  # show forecasts
No further code is needed immediately, but consider how to present the retrieved context alongside the
model's answer, and how to display price charts.
• Documentation & Testing: Write README with instructions. Each team member documents
their module (data, LLM, sentiment, prediction). Perform unit tests on functions (e.g. does data
loader handle missing values?).
• What to Learn:
• Streamlit quickstart (https://docs.streamlit.io) for dashboards.
• Flask tutorial (if choosing web app) for forms and API endpoints.
• Plotting libraries (Matplotlib/Plotly) for time-series charts.
• Hardware: The UI is light – can run on any server or local machine. Final integration just uses
already-trained models.
Note: Throughout the project, communicate frequently. Use version control (Git) and share
intermediate results (e.g., checkpoints, sample data). Allocate tasks so multiple people work in parallel:
for instance, while two team members scrape and clean data (Weeks 1–2), others can begin designing
the Q&A format and LLM fine-tune pipeline (Weeks 2–3). The 5-person team can split roughly: Data
Engineering, NLP (LLM), Sentiment, Prediction Modeling, and Integration/UX.
Key References:
- yfinance (financial data) 1 ;
- FinBERT (financial sentiment) 6 7 ;
- QLoRA fine-tuning (PEFT tutorial) 4 5 ;
- PyTorch LSTM example 9 .
2 3 A Step-by-Step Guide to Parsing PDFs using the pdfplumber Library In Python | by Azhar Sayyad | Medium
https://azhar-sayyad.medium.com/a-step-by-step-guide-to-parsing-pdfs-using-the-pdfplumber-library-in-python-c12d94ae9f07
4 Quantization
https://huggingface.co/docs/peft/en/developer_guides/quantization
5 Llama2 fine-tunning with PEFT QLora and testing the model - Transformers - Hugging Face Forums
https://discuss.huggingface.co/t/llama2-fine-tunning-with-peft-qlora-and-testing-the-model/50581