Stage 1 - Data Ingestion and Organization
• What: Loop through each company's folder and load all raw files into Python. Use Pandas to read prices.csv, redditdata.csv, and newsdata.csv (parsing dates), plus each CSV in financials/ (e.g. balancesheet.csv, etc.). Collect these in a structured dictionary of DataFrames (e.g. data[company]["prices"], data[company]["financials"]["ratios"], etc.).
• Why: Bringing all raw data into memory lets us validate formats and dates upfront. We catch
missing files or date parsing issues early. Organizing by company ensures each model only sees its
own data (per the requirement of per-stock models [1]).
• Expected Output: For each ticker (NIFTY 50 company) we have:
{
    'CompanyX': {
        'prices': DataFrame(Date, Open, High, ..., technical indicators..., target),
        'reddit': DataFrame(Date, comment_text, ...),
        'news': DataFrame(Date, headline, ...),
        'financials': {
            'balancesheet': DataFrame, 'cashflow': DataFrame, 'ratios': DataFrame, ...
        }
    },
    ...
}
import os
import pandas as pd

data = {}
for company in os.listdir(root_dir):   # root_dir holds one folder per company
    path = os.path.join(root_dir, company)
    prices = pd.read_csv(os.path.join(path, "prices.csv"), parse_dates=['Date'])
    reddit = pd.read_csv(os.path.join(path, "redditdata.csv"), parse_dates=['Date'])
    news = pd.read_csv(os.path.join(path, "newsdata.csv"), parse_dates=['Date'])
    financials = {}
    for file in os.listdir(os.path.join(path, "financials")):
        fin_name = file.replace('.csv', '')
        financials[fin_name] = pd.read_csv(os.path.join(path, "financials", file),
                                           parse_dates=['Date'])
    data[company] = {'prices': prices, 'reddit': reddit, 'news': news,
                     'financials': financials}
• Hardware: A standard CPU and modest memory are fine for loading CSVs. A team member can run this on any machine or in a Colab notebook.
• Resources: Pandas I/O docs (e.g. pandas.read_csv). This stage is basic data engineering.
# Clean prices
prices = prices.sort_values('Date').reset_index(drop=True)
prices = prices.drop_duplicates(subset='Date')
prices = prices.ffill()   # forward-fill NaNs in technical indicator columns
# Clean news text
import re

def clean_text(s):
    s = s.lower()
    s = re.sub(r"http\S+|www\S+", "", s)   # remove URLs
    s = re.sub(r"[^a-z0-9 ]", " ", s)      # keep only alphanumerics and spaces
    return s

news['headline_clean'] = news['headline'].astype(str).apply(clean_text)
prices['return'] = prices['Close'].pct_change()               # daily return
prices['MA7'] = prices['Close'].rolling(7).mean()             # 7-day moving average
prices['volatility_7d'] = prices['return'].rolling(7).std()   # 7-day rolling volatility
• Sentiment/NLP features: Process text data into numeric signals. A practical approach is sentiment scoring: apply a pre-trained financial sentiment model (e.g. FinBERT) to each news headline and Reddit comment, then aggregate by day (e.g., average sentiment score, count of positive vs. negative items). This yields daily sentiment features. FinBERT is a BERT variant trained on finance text [2]; using the Hugging Face pipeline we can easily classify text as positive/negative/neutral [3] [4].
Example:
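A minimal sketch of the scoring step, assuming the ProsusAI/finbert checkpoint on Hugging Face and the cleaned headline column from the snippet above:

from transformers import pipeline

# Score each cleaned headline with FinBERT (assumed checkpoint: ProsusAI/finbert)
finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
results = finbert(news['headline_clean'].tolist(), truncation=True, batch_size=32)

# Map labels to signed scores: positive -> +p, negative -> -p, neutral -> 0
sign = {'positive': 1, 'negative': -1, 'neutral': 0}
news['sent_score'] = [sign.get(r['label'].lower(), 0) * r['score'] for r in results]

# Aggregate to one row per calendar day
daily_news_sent = news.groupby(news['Date'].dt.date).agg(
    avg_news_sent=('sent_score', 'mean'),
    news_count=('sent_score', 'size'),
)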
(Similarly aggregate Reddit comment sentiment or volume.) These features capture market mood. Prior work shows news sentiment adds predictive power to technical features [1] [5].
• Financial statement features: From each quarterly CSV in financials/, select relevant metrics (e.g., revenue, net income, EPS, debt ratios). Normalize them or compute growth rates (YoY revenue growth, etc.). Merge these quarterly numbers into daily data by forward-filling, so that each trading day carries the latest known fundamental values: after each quarter's release date, use that quarter's EPS as a feature until the next release (see the sketch below). These features inform long-term trends. Fundamental indicators (P/E, EBITDA, profit margins) help capture company health [1].
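A possible sketch of this forward-fill merge, assuming the ratios sheet has 'Date' and 'EPS' columns (adapt names to the actual files):

import pandas as pd

ratios = data[company]['financials']['ratios'].sort_values('Date')
ratios['eps_yoy'] = ratios['EPS'].pct_change(periods=4)    # year-over-year EPS growth

daily = prices[['Date']].sort_values('Date')
fund_daily = pd.merge_asof(daily, ratios[['Date', 'EPS', 'eps_yoy']],
                           on='Date', direction='backward')  # latest known value per trading day
fund_daily = fund_daily.rename(columns={'EPS': 'latest_eps'})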
• Feature scaling/encoding: After generating features, scale numeric columns (e.g. with StandardScaler, as sketched below) and encode any categorical data. Ensure all DataFrames align on dates and can be merged into one final feature table per company.
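A minimal scaling sketch, assuming the merged daily feature table this bullet describes; the scaler is fit on the training period only to avoid look-ahead leakage (column names and the cutoff date are illustrative):

from sklearn.preprocessing import StandardScaler

num_cols = ['MA7', 'volatility_7d', 'avg_news_sent', 'latest_eps']   # assumed column names
cutoff = '2023-01-01'                                                # hypothetical train/test boundary
scaler = StandardScaler().fit(features.loc[features['Date'] < cutoff, num_cols])
features[num_cols] = scaler.transform(features[num_cols])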
• Why: Each data source brings unique insight. Historical price features model short-term momentum, fundamental features capture longer-term value, and sentiment features encode the market's reaction to news. Combining them (as recommended by Cagliero et al. [1]) yields a richer feature set that has been shown to improve forecasting. Sentiment tends to correlate strongly with recent price changes but weakens further out [5], so it is especially useful for short horizons.
• Expected Output: For each company, a merged feature DataFrame indexed by date, containing all selected predictors. Columns might include: Close, MA7, volatility_7d, avg_news_sent, avg_reddit_sent, latest_eps, etc. Missing values (e.g. at very early dates) should be handled (e.g. drop the first few rows). This final feature set X (with columns X1…Xm) is the input to the model.
• How Connects: The feature table is directly used to train models. The next stage uses this X to learn
to predict targets (price, volatility, direction).
• Tools: Pandas (groupby, rolling), NumPy, scikit-learn (for scaling), Hugging Face Transformers for the sentiment pipeline. NLP alternatives: nltk or TextBlob (simple sentiment), but FinBERT is recommended for financial text [2] [4].
• Example Code: (continuing above)
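One way to continue the snippets above (a sketch; daily_news_sent and fund_daily come from the earlier sentiment and fundamentals examples):

import pandas as pd

daily_news_sent.index = pd.to_datetime(daily_news_sent.index)
features = prices.merge(daily_news_sent, left_on='Date', right_index=True, how='left')
features = features.merge(fund_daily, on='Date', how='left')

# Days with no news get a neutral score and zero count
features[['avg_news_sent', 'news_count']] = features[['avg_news_sent', 'news_count']].fillna(0)
features = features.dropna().reset_index(drop=True)   # drop early rows with incomplete indicators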
• Hardware: Mostly CPU for feature computations. For sentiment scoring, inference can use CPU or
GPU. If processing hundreds of thousands of text items, a GPU (e.g. Colab A100) will accelerate
FinBERT. However, you can also do sentiment batching or use a smaller model (like DistilBERT) on
CPU if needed.
• Resources: See Hugging Face's Pipelines guide for sentiment [3] and the FinBERT documentation [2]. For technical indicators: Investopedia (RSI, MACD) or the TA-Lib library. For NLP, Hugging Face tutorials or this Medium example (e.g., FinBERT with news).
• Directional Movement (Classification): Use the existing target column if it encodes up/down movement (or recompute it as (future_close > today_close)). This yields 0/1 labels for up/down. Multi-day directions could be defined similarly; separate direction labels can be created for each horizon if needed.
• Volatility (Regression): Define volatility as the standard deviation of returns over the next period. For example, 7-day realized volatility = df['Close'].pct_change().rolling(7).std().shift(-7). (Investopedia defines volatility via the standard deviation of returns [6].) Include columns like target_vol_7d.
• Long-term (Quarterly) Targets: Align with quarter windows, e.g. predict next quarter's average or closing price, or its percent change, using fundamentals. The output could be the next quarter's price or earnings. Add these if doing quarterly forecasting (see the sketch below).
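A small sketch of one such quarterly target (next quarter's percentage price change), assuming the daily prices frame from earlier:

# Resample daily closes to quarter-end and define next quarter's return as the target
quarterly = prices.set_index('Date')['Close'].resample('Q').last().to_frame('q_close')
quarterly['target_q_return'] = quarterly['q_close'].pct_change().shift(-1)   # next quarter's % change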
• Why: Supervised learning needs clear labels. Multi-horizon targets let one model predict different
timeframes. Direction (classification) and price (regression) cover different decision needs: price
helps quantify gain/loss, while direction is a simpler buy/sell signal. Volatility prediction informs risk. Multi-task learning can leverage relationships between these targets. Defining targets via shifting ensures no look-ahead (it respects chronology). Short horizons rely mostly on price/sentiment features; long-term horizons rely more on fundamental trends [5].
• Expected Output: The feature DataFrame augmented with target columns. For instance:

features['target_price_1d'] = features['Close'].shift(-1)
features['target_dir_1d'] = (features['target_price_1d'] > features['Close']).astype(int)
features['target_vol_7d'] = features['Close'].pct_change().rolling(7).std().shift(-7)

Rows with NaN targets at the end (and any early rows left by rolling windows) are removed. Now X (features) and Y (targets) are aligned for training.
• How Connects: This completed dataset (features + targets) is what the model training stage
consumes. Having explicit multi-horizon targets means the model can be trained to output any or all
of them.
• Tools: Pandas for shifting and rolling.
• Example Code:
df = features.copy()
# 1-day horizon
df['target_price_1d'] = df['Close'].shift(-1)
df['target_dir_1d'] = (df['target_price_1d'] > df['Close']).astype(int)
# 7-day volatility (std of returns)
df['target_vol_7d'] = df['Close'].pct_change().rolling(7).std().shift(-7)
# Drop rows with NaN targets
df.dropna(subset=['target_price_1d','target_vol_7d'], inplace=True)
• Hardware: CPU.
• Resources: For more on multi-step forecasting, see multi-step LSTM guides (e.g. Jason Brownlee’s
tutorial shows how to frame X/Y for sequences).
Stage 5: Model Development (Short-Term vs Long-Term)
• What (Short-Term): Design models to predict daily/weekly targets. Options include:
• Time-series models: LSTM or GRU networks that take a sequence of past days (features from day t–T to t) and predict the next 1–7 days. For example, an LSTM with input shape (window, num_features) and outputs for price, volatility, and direction. You can build a multi-output Keras model: one regression head for price, one for volatility, and one sigmoid head for direction. Example architecture:
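A minimal multi-output sketch with the Keras functional API (layer sizes and the lookback window are illustrative assumptions, not a tuned architecture):

from tensorflow import keras
from tensorflow.keras import layers

window, num_features = 30, 20   # assumed lookback length and feature count

inputs = keras.Input(shape=(window, num_features))
x = layers.LSTM(64)(inputs)
x = layers.Dense(32, activation='relu')(x)

price_out = layers.Dense(1, name='price')(x)                          # regression head
vol_out = layers.Dense(1, name='volatility')(x)                       # regression head
dir_out = layers.Dense(1, activation='sigmoid', name='direction')(x)  # classification head

model = keras.Model(inputs, [price_out, vol_out, dir_out])
model.compile(optimizer='adam',
              loss={'price': 'mse', 'volatility': 'mse', 'direction': 'binary_crossentropy'},
              metrics={'direction': ['accuracy']})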
• Classical ML: Random Forests or Gradient Boosting (e.g. XGBoost) using flattened tabular features
(no sequence). You would include lag features explicitly, and train separate models per target, e.g. a regressor for price and a classifier for direction. These models require less tuning and can serve as strong baselines.
• What (Long-Term): For quarterly horizons, a simpler regression (or even time-series extrapolation)
might suffice. Use aggregated features (e.g. quarter-end price, EPS, sentiment per quarter). A
Random Forest or even linear regression on fundamentals could predict next quarter’s price change.
Alternatively, an LSTM with a larger timestep (one step per quarter) is possible but less common. The
key is to leverage financial statement features heavily here.
• Why: LSTM/sequence models capture temporal patterns and are a natural choice for time-series
forecasting. Ensemble trees handle complex non-linearities on tabular data and automatically deal
with mixed features. Multi-task models can leverage shared signals. We choose separate per-
company models to let each stock's unique behavior guide its model (per-stock models are the objective [7]). Having both deep and classical approaches allows benchmarking.
• Expected Output: Trained model object(s) per company, e.g. saved model weights (model.save()) or a serialized sklearn model (joblib.dump). Each company ends up with its own model file (or a set of files if direction and price models are separate).
• How Connects: The trained models will be used in Stage 6 for prediction and evaluation on new
data. Models produce the predicted closing price, volatility, and direction when fed the latest
features.
• Tools:
• TensorFlow/Keras or PyTorch: For building LSTM/NNs. Keras is user-friendly for beginners.
• scikit-learn: RandomForestRegressor/Classifier, or GradientBoosting. XGBoost (the xgboost package) is popular for tabular data.
• Data pipelines: scikit-learn’s Pipeline or custom scripts to apply the same scaling and feature
engineering to training and test sets.
• Example Code (LSTM):
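A sketch of framing the windows and fitting the multi-output model above (df is the feature-plus-target table from the target-definition stage; num_features in the architecture must equal len(feature_cols)):

import numpy as np

feature_cols = [c for c in df.columns if c != 'Date' and not c.startswith('target_')]
X_seq, y_price, y_vol, y_dir = [], [], [], []
for i in range(window, len(df)):
    X_seq.append(df[feature_cols].iloc[i - window:i].values)
    y_price.append(df['target_price_1d'].iloc[i])
    y_vol.append(df['target_vol_7d'].iloc[i])
    y_dir.append(df['target_dir_1d'].iloc[i])

model.fit(np.array(X_seq),
          {'price': np.array(y_price), 'volatility': np.array(y_vol), 'direction': np.array(y_dir)},
          epochs=20, batch_size=32, validation_split=0.1)
model.save(f"models/{company}_lstm.keras")   # hypothetical output path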
For scikit-learn:
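A comparable scikit-learn baseline (a sketch; feature_cols and df follow the LSTM example above): a Pipeline with scaling plus a random forest for price, and a separate classifier for direction, using a chronological split:

import joblib
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

split = int(len(df) * 0.8)   # chronological split, no shuffling
X_train, X_test = df[feature_cols].iloc[:split], df[feature_cols].iloc[split:]

price_model = Pipeline([('scale', StandardScaler()),
                        ('rf', RandomForestRegressor(n_estimators=300, random_state=42))])
price_model.fit(X_train, df['target_price_1d'].iloc[:split])

dir_model = Pipeline([('scale', StandardScaler()),
                      ('rf', RandomForestClassifier(n_estimators=300, random_state=42))])
dir_model.fit(X_train, df['target_dir_1d'].iloc[:split])

joblib.dump(price_model, f"models/{company}_rf_price.joblib")   # hypothetical path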
• Hardware: Training deep RNNs can be slow on large windows, so use a GPU (e.g. Colab’s A100) for
LSTM training. Classical models (RF, XGBoost) are fine on CPU or modest GPU. Each company’s model
can be trained in parallel if multiple GPUs/machines are available.
• Learning Resources:
• LSTM Tutorial: MachineLearningMastery LSTM guide (step-by-step example).
• Scikit-learn Ensembles: Random Forest docs for regression/classification.
• Multi-output Keras: Keras functional API docs (the example above).
• Hugging Face Transformers (if you embed text via BERT): Transformers documentation.
• Tools: scikit-learn's TimeSeriesSplit or manual train-test splits, mean_squared_error, accuracy_score. Visualization via matplotlib.
• Example Code:
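A sketch of evaluating on the chronological hold-out (price_model, dir_model, X_test, split, and df follow the scikit-learn example above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, accuracy_score

y_true_price = df['target_price_1d'].iloc[split:]
y_pred_price = price_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_true_price, y_pred_price))

y_true_dir = df['target_dir_1d'].iloc[split:]
acc = accuracy_score(y_true_dir, dir_model.predict(X_test))
print(f"RMSE: {rmse:.2f}   Direction accuracy: {acc:.2%}")

plt.plot(y_true_price.values, label='actual')
plt.plot(y_pred_price, label='predicted')
plt.legend()
plt.title('1-day-ahead close: actual vs predicted')
plt.show()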
• Hardware: Prediction is lightweight – can run on a standard server or even a client machine. A GPU
is not needed for inference at small scale.
• Resources: Look into TensorFlow Serving or FastAPI tutorials if a real-time API is needed.
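If the team goes the FastAPI route, a minimal serving sketch might look like this (the endpoint and file paths are hypothetical):

import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

@app.get("/predict/{company}")
def predict(company: str):
    model = joblib.load(f"models/{company}_rf_price.joblib")   # hypothetical model path
    latest = pd.read_csv(f"features/{company}_latest.csv")     # hypothetical latest-feature file
    return {"company": company, "predicted_close": float(model.predict(latest)[-1])}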
This staged plan ensures that data flows smoothly from raw files to final predictions. Each stage’s output
feeds the next, allowing team members to work in parallel (e.g., one on data cleaning, another on feature/
NLP engineering). The citations above provide guidance on proven techniques: combining technical and
news features [1] [5], using domain-specific NLP (FinBERT) [2] [4], and respecting time-series validation [8]. With clear tasks and tools at each step, an intermediate Python team can build and iterate on this
decision-support system.
[1] [5] [7] Cagliero et al., "Combining News Sentiment and Technical Analysis to Predict Stock Trend Reversal"
https://www.sentic.net/sentire2019cagliero.pdf
[3] Hugging Face Transformers, "Pipelines"
https://huggingface.co/docs/transformers/en/main_classes/pipelines