The document outlines a multi-stage process for data ingestion, cleaning, feature engineering, target preparation, and model development for stock prediction using Python and Pandas. Each stage focuses on organizing raw data, ensuring data quality, creating predictive features, defining targets, and developing models for short-term and long-term forecasting. The expected output includes structured data for each company, cleaned DataFrames, feature sets, and trained models for predicting stock prices and movements.


Stage 1: Data Ingestion and Organization

• What: Loop through each company's folder and load all raw files into Python. Use Pandas to read prices.csv, redditdata.csv, and newsdata.csv (parsing dates), plus each CSV in financials/ (e.g. balancesheet.csv). Collect these in a structured dictionary or set of DataFrames (e.g. data[company]["prices"], data[company]["financials"]["ratios"], etc.).
• Why: Bringing all raw data into memory lets us validate formats and dates upfront. We catch missing files or date-parsing issues early. Organizing by company ensures each model only sees its own data (per the requirement of per-stock models [1]).
• Expected Output: For each ticker (NIFTY 50 company) we have:

{
  'CompanyX': {
    'prices': DataFrame(Date, Open, High, ..., technical indicators..., target),
    'reddit': DataFrame(Date, comment_text, ...),
    'news': DataFrame(Date, headline, ...),
    'financials': {
      'balancesheet': DataFrame,
      'cashflow': DataFrame,
      'ratios': DataFrame,
      ...
    }
  },
  ...
}

This raw data store is passed to Stage 2.


• How Connects: These DataFrames feed into cleaning and feature stages. We maintain the same
company-wise structure so that each company’s data flows through its own pipeline branch.
• Tools: Python's pandas (e.g. pd.read_csv with parse_dates), os or pathlib to list folders.
• Example Code:

import os
import pandas as pd

data = {}
root = "data/NIFTY50/"
for company in os.listdir(root):
    path = os.path.join(root, company)
    # Core daily files, parsing the Date column on read
    prices = pd.read_csv(os.path.join(path, "prices.csv"), parse_dates=['Date'])
    reddit = pd.read_csv(os.path.join(path, "redditdata.csv"), parse_dates=['Date'])
    news = pd.read_csv(os.path.join(path, "newsdata.csv"), parse_dates=['Date'])
    # Each CSV in financials/ becomes its own DataFrame keyed by file name
    financials = {}
    for file in os.listdir(os.path.join(path, "financials")):
        fin_name = file.replace('.csv', '')
        financials[fin_name] = pd.read_csv(os.path.join(path, "financials", file), parse_dates=['Date'])
    data[company] = {'prices': prices, 'reddit': reddit, 'news': news, 'financials': financials}

• Hardware: Simple CPU/memory is fine for loading CSVs. A team member can run this on any
machine or a Colab.
• Resources: Pandas I/O docs (e.g. pandas.read_csv). This stage is basic data engineering.

Stage 2: Data Cleaning and Preprocessing


• What: Clean each DataFrame to ensure consistency. For price data: sort by date, reset the index, and handle missing or duplicate rows (e.g. df.drop_duplicates(), forward-fill gaps in indicator columns with df.ffill()). For the Reddit/news text: lower-case, strip HTML or URLs, and remove non-alphanumeric characters. For financial tables: ensure numeric columns are correctly typed and dates align to quarter-ends. For all data, check for outliers or impossible values (e.g. negative volumes) and decide on filtering or correction.
• Why: Raw data often has missing or dirty values. Cleaning prevents garbage-in effects. For example,
technical indicators at the start may be NaN – we can drop early rows or fill sensibly. Text cleaning
(lowercasing, removing punctuation) simplifies NLP later. Without this, NLP models may misinterpret
URLs or stray HTML. Standardizing dates ensures merging across sources works.
• Expected Output: “Cleaned” DataFrames (overwriting or new variables) with:
• Sorted, gap-filled price series.
• Reddit/news text columns cleaned and maybe tokenized.
• Financial tables with NaNs handled or flagged.
(For example, the cleaned price DataFrame will have no missing dates within trading days, and
technical indicator columns filled or dropped.) These cleaned frames feed feature engineering.
• How Connects: Cleaned data is the input to feature extraction. E.g. a filled prices DF is ready for
computing returns or volatility, and cleaned news text is ready for sentiment analysis.
• Tools: Pandas (e.g. df.sort_values, df.interpolate / df.ffill), Python regex (the re module) or simple string ops for text. Optionally nltk or spaCy for stopword removal if desired.
• Example Code:

# Clean prices: chronological order, deduplicated dates, indicator gaps forward-filled
prices = prices.sort_values('Date').reset_index(drop=True)
prices = prices.drop_duplicates(subset='Date')
prices = prices.ffill()  # fill technical-indicator NaNs

# Clean news text
import re

def clean_text(s):
    s = s.lower()
    s = re.sub(r"http\S+|www\S+", "", s)  # remove URLs
    s = re.sub(r"[^a-z0-9 ]", " ", s)     # remove non-alphanumeric characters
    return s

news['headline_clean'] = news['headline'].astype(str).apply(clean_text)
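
Typing the financial tables and snapping their dates to quarter-ends (as described in the What bullet) is not covered by the snippet above; a minimal sketch, assuming each table has a Date column and otherwise numeric columns:

import pandas as pd

for name, fin in financials.items():
    fin['Date'] = pd.to_datetime(fin['Date'])
    # Coerce all non-date columns to numeric; unparseable strings become NaN for later review
    num_cols = [c for c in fin.columns if c != 'Date']
    fin[num_cols] = fin[num_cols].apply(pd.to_numeric, errors='coerce')
    # Roll reporting dates forward to quarter-end so they align across sources
    fin['Date'] = fin['Date'] + pd.offsets.QuarterEnd(0)
    financials[name] = fin.sort_values('Date').reset_index(drop=True)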

• Hardware: CPU. Cleaning is not computationally heavy.


• Resources: Text cleaning guides (e.g. nlp-with-python tutorial), Pandas data cleaning (e.g.
Pandas Tips & Tricks).

Stage 3: Feature Engineering


• What: Create predictive features from raw data. This includes:
• Technical/price features: Using prices.csv, compute returns ((Close − prev_Close) / prev_Close), rolling statistics (moving averages, rolling volatility/std of returns), lagged values (e.g. price or return at t–1, t–2, ...), RSI, MACD, etc. Many indicators may already exist in the data, but adding standard ones can help, e.g.:

prices['return'] = prices['Close'].pct_change()
prices['MA7'] = prices['Close'].rolling(7).mean()
prices['volatility_7d'] = prices['return'].rolling(7).std()
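
The bullet above also mentions RSI and MACD; a minimal pandas sketch of both, using simple rolling means for RSI rather than Wilder's smoothing, so values may differ slightly from TA-Lib:

delta = prices['Close'].diff()
gain = delta.clip(lower=0).rolling(14).mean()      # average gain over the window
loss = (-delta.clip(upper=0)).rolling(14).mean()   # average loss (as a positive number)
rs = gain / loss
prices['RSI14'] = 100 - 100 / (1 + rs)

# MACD: difference of 12- and 26-day exponential moving averages, plus a 9-day signal line
ema12 = prices['Close'].ewm(span=12, adjust=False).mean()
ema26 = prices['Close'].ewm(span=26, adjust=False).mean()
prices['MACD'] = ema12 - ema26
prices['MACD_signal'] = prices['MACD'].ewm(span=9, adjust=False).mean()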

• Sentiment/NLP features: Process text data into numeric signals. A practical approach is sentiment scoring: apply a pre-trained financial sentiment model (e.g. FinBERT) to each news headline and Reddit comment, then aggregate by day (e.g. average sentiment score, count of positive vs. negative). This yields daily sentiment features. FinBERT is a BERT variant trained on finance text [2]; using the Hugging Face pipeline we can easily classify text as positive/negative/neutral [3][4]. Example:

from transformers import pipeline

sent_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# Apply to headlines
news['sentiment'] = news['headline_clean'].apply(lambda x: sent_pipeline(x)[0]['label'])

# Convert labels to scores (+1/-1/0) and aggregate by date
score_map = {'positive': 1, 'negative': -1, 'neutral': 0}
news['sent_score'] = news['sentiment'].map(score_map)
daily_sent = news.groupby('Date')['sent_score'].mean().rename('avg_news_sent')

(Similarly, aggregate Reddit comment sentiment or volume.) These features capture market mood. Prior work shows news sentiment adds predictive power to technical features [1][5].
• Financial statement features: From each quarterly CSV in financials/, select relevant metrics (e.g. revenue, net income, EPS, debt ratios). Normalize or compute growth rates (YoY revenue growth, etc.). Merge these quarterly numbers into daily data by forward-filling, so that each trading day has the latest known fundamental values; for example, after each quarter's release date, use that quarter's EPS as a feature until the next release (see the sketch below). These features inform long-term trends. Fundamental indicators (P/E, EBITDA, profit margins) help capture company health [1].
• Feature scaling/encoding: After generating features, scale numeric columns (e.g. with StandardScaler) and encode any categorical data. Ensure all DataFrames align on dates and can be merged into one final feature table per company (a scaling sketch follows below).
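
A minimal sketch of that scaling step, fitting the scaler only on the earlier (training) portion so nothing leaks backwards from later dates; the feature names and the 80/20 split are illustrative:

from sklearn.preprocessing import StandardScaler
import joblib

feature_cols = ['return', 'MA7', 'volatility_7d', 'avg_news_sent']  # illustrative subset
split = int(len(features) * 0.8)                                    # chronological split, no shuffling

scaler = StandardScaler()
train_part, test_part = features.iloc[:split], features.iloc[split:]
features.loc[train_part.index, feature_cols] = scaler.fit_transform(train_part[feature_cols])
features.loc[test_part.index, feature_cols] = scaler.transform(test_part[feature_cols])

joblib.dump(scaler, "companyX_scaler.pkl")  # reused at prediction time in Stage 7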
• Why: Each data source brings unique insight. Historical price features model short-term momentum, fundamental features capture longer-term value, and sentiment features encode the market's reaction to news. Combining them (as recommended by Cagliero et al. [1]) yields a richer feature set that has been shown to improve forecasting. Sentiment tends to correlate strongly with recent price changes but weakens further out [5], so it is especially useful for short horizons.
• Expected Output: For each company, a merged feature DataFrame indexed by date, containing all selected predictors. Columns might include: Close, MA7, volatility_7d, avg_news_sent, avg_reddit_sent, latest_eps, etc. Missing values (e.g. at the earliest dates) should be handled (e.g. drop the first few rows). This final feature set X (with columns X1...Xm) is the input to the model.
• How Connects: The feature table is directly used to train models. The next stage uses this X to learn
to predict targets (price, volatility, direction).
• Tools: Pandas (groupby, rolling), NumPy, scikit-learn (for scaling), Hugging Face Transformers for the sentiment pipeline. For NLP alternatives: nltk or TextBlob (simple sentiment), but FinBERT is recommended for financial text [2][4].
• Example Code: (continuing above)

# Merge sentiment back into price DF
prices = prices.merge(daily_sent, on='Date', how='left')
prices['avg_news_sent'] = prices['avg_news_sent'].fillna(0)  # days with no news -> neutral

# Merge financials (assume 'ratios' has Date and P/E)
pe = financials['ratios'][['Date', 'PE_ratio']]
prices = prices.merge(pe, on='Date', how='left').ffill()

• Hardware: Mostly CPU for feature computations. For sentiment scoring, inference can run on CPU or GPU. If processing hundreds of thousands of text items, a GPU (e.g. a Colab A100) will accelerate FinBERT; alternatively, batch the sentiment scoring or use a smaller model (like DistilBERT) on CPU if needed.
• Resources: See Hugging Face's Pipelines guide for sentiment [3] and the FinBERT documentation [2]. For technical indicators: Investopedia (RSI, MACD) or the TA-Lib library. For NLP, Hugging Face tutorials or a Medium example (e.g. FinBERT with news).

Stage 4: Label/Target Preparation


• What: Define the prediction targets and align them with features. From prices.csv we already have (or can compute) targets:
• Closing Price (Regression): For the short term, define the future close price or return at +1, +3, and +7 days. Add columns like target_price_1d = Close.shift(-1), target_price_7d = Close.shift(-7). For each horizon, this is a regression label.

• Directional Movement (Classification): Use the existing target column if it encodes up/down movement (or recompute it as future_close > today_close). This yields 0/1 labels for up/down. Multi-day directions can be defined similarly; we may create separate direction labels for each horizon if needed.
• Volatility (Regression): Define volatility as the standard deviation of returns over the next period. For example, 7-day realized volatility = df['Close'].pct_change().rolling(7).std().shift(-7). (Investopedia defines volatility via the std dev of returns [6].) Include columns like target_vol_7d.
• Long-term (Quarterly) Targets: Align with quarter windows, e.g. predict next quarter's average/closing price or percent change using fundamentals. The output could be next quarter's price or earnings. Add these if doing quarterly forecasting (a sketch follows below).
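
One hedged way to frame such a quarterly target, assuming the daily prices DataFrame with a parsed Date column (the column names are illustrative):

# Quarter-end closes from the daily price series
quarterly = (prices.set_index('Date')['Close']
                   .resample('Q').last()
                   .to_frame('q_close'))

# Next quarter's percent change in the quarter-end close, used as a regression label
quarterly['target_next_q_return'] = quarterly['q_close'].pct_change().shift(-1)
quarterly = quarterly.dropna(subset=['target_next_q_return'])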
• Why: Supervised learning needs clear labels. Multi-horizon targets let one model predict different timeframes. Direction (classification) and price (regression) cover different decision needs: price helps quantify gain/loss, direction is a simpler buy/sell signal. Volatility prediction informs risk. Multi-task learning can leverage relationships between these targets. Defining targets via shifting ensures no look-ahead (it respects chronology). Short horizons rely mostly on price/sentiment features; long-term horizons use more fundamental trends [5].
• Expected Output: The feature DataFrame augmented with target columns. For instance:

features['target_price_1d'] = features['Close'].shift(-1)
features['target_dir_1d'] = (features['target_price_1d'] > features['Close']).astype(int)
features['target_vol_7d'] = features['Close'].pct_change().rolling(7).std().shift(-7)

Rows with NaN targets at the end (and possibly the first few for volatility) are removed. Now X
(features) and Y (targets) are aligned for training.
• How Connects: This completed dataset (features + targets) is what the model training stage
consumes. Having explicit multi-horizon targets means the model can be trained to output any or all
of them.
• Tools: Pandas for shifting and rolling.
• Example Code:

df = features.copy()
# 1-day horizon
df['target_price_1d'] = df['Close'].shift(-1)
df['target_dir_1d'] = (df['target_price_1d'] > df['Close']).astype(int)
# 7-day volatility (std of returns)
df['target_vol_7d'] = df['Close'].pct_change().rolling(7).std().shift(-7)
# Drop rows with NaN targets
df.dropna(subset=['target_price_1d','target_vol_7d'], inplace=True)

• Hardware: CPU.
• Resources: For more on multi-step forecasting, see multi-step LSTM guides (e.g. Jason Brownlee’s
tutorial shows how to frame X/Y for sequences).

Stage 5: Model Development (Short-Term vs Long-Term)
• What (Short-Term): Design models to predict daily/weekly targets. Options include:
• Time-series models: LSTM or GRU networks that take a sequence of past days (features from day t–T to t) and predict the next 1–7 days. For example, an LSTM with input shape (window, num_features) and outputs for price, volatility, and direction. You can build a multi-output Keras model: one head (regression) for price, one head for volatility, and one (sigmoid) head for direction. Example architecture:

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense

inp = Input(shape=(window, F))   # window = lookback days, F = number of features
x = LSTM(64)(inp)
price_out = Dense(1, name='price')(x)
vol_out = Dense(1, name='volatility')(x)
dir_out = Dense(1, activation='sigmoid', name='direction')(x)

model = Model(inp, [price_out, vol_out, dir_out])
model.compile(loss={'price': 'mse', 'volatility': 'mse', 'direction': 'binary_crossentropy'},
              loss_weights={'price': 1.0, 'volatility': 0.5, 'direction': 1.0},
              optimizer='adam')

• Classical ML: Random Forests or Gradient Boosting (e.g. XGBoost) using flattened tabular features (no sequence); you would include lag features explicitly (see the sketch below). These models can output price (regressor) and direction (classifier) separately, or you can train a classifier for direction and a regressor for price. They require less tuning and can serve as strong baselines.
• What (Long-Term): For quarterly horizons, a simpler regression (or even time-series extrapolation)
might suffice. Use aggregated features (e.g. quarter-end price, EPS, sentiment per quarter). A
Random Forest or even linear regression on fundamentals could predict next quarter’s price change.
Alternatively, an LSTM with a larger timestep (one step per quarter) is possible but less common. The
key is to leverage financial statement features heavily here.
• Why: LSTM/sequence models capture temporal patterns and are a natural choice for time-series forecasting. Ensemble trees handle complex non-linearities on tabular data and automatically deal with mixed features. Multi-task models can leverage shared signals. We choose separate per-company models to let each stock's unique behavior guide its model (per-stock models are the objective [7]). Having both deep and classical approaches allows benchmarking.
• Expected Output: Trained model object(s) per company, for example saved model weights (model.save()) or a serialized sklearn model (joblib.dump). Each company ends up with its own model file (or set of model files if direction and price models are separate).
• How Connects: The trained models will be used in Stage 6 for prediction and evaluation on new
data. Models produce the predicted closing price, volatility, and direction when fed the latest
features.
• Tools:
• TensorFlow/Keras or PyTorch: For building LSTM/NNs. Keras is user-friendly for beginners.

• scikit-learn: RandomForestRegressor/Classifier or GradientBoosting. XGBoost (the xgboost package) is popular for tabular data.
• Data pipelines: scikit-learn’s Pipeline or custom scripts to apply the same scaling and feature
engineering to training and test sets.
• Example Code (LSTM):

# Assume X_train is shaped [samples, window, features] and the Y arrays are aligned targets
model.fit(X_train,
          {'price': Y_price, 'volatility': Y_vol, 'direction': Y_dir},
          epochs=50, batch_size=32, validation_split=0.1)
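
The fit call above assumes X_train is already shaped [samples, window, features]; a minimal sketch of building those sliding windows from the Stage 4 DataFrame df (window length and feature list are illustrative):

import numpy as np

window = 30                                                          # days of history per sample (arbitrary)
feature_cols = ['return', 'MA7', 'volatility_7d', 'avg_news_sent']   # illustrative; F = len(feature_cols)
vals = df[feature_cols].to_numpy()

X_seq, y_price, y_vol, y_dir = [], [], [], []
for i in range(window - 1, len(df)):
    X_seq.append(vals[i - window + 1:i + 1])          # features for the `window` days ending at day i
    y_price.append(df['target_price_1d'].iloc[i])     # targets from Stage 4 (already future-shifted)
    y_vol.append(df['target_vol_7d'].iloc[i])
    y_dir.append(df['target_dir_1d'].iloc[i])

X_train = np.array(X_seq)                             # shape: (samples, window, F)
Y_price, Y_vol, Y_dir = map(np.array, (y_price, y_vol, y_dir))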

For scikit-learn:

from sklearn.ensemble import RandomForestRegressor

rf_price = RandomForestRegressor()
rf_price.fit(X_train_tabular, y_price_train)
# Similarly, use RandomForestClassifier for direction

• Hardware: Training deep RNNs can be slow on large windows, so use a GPU (e.g. Colab’s A100) for
LSTM training. Classical models (RF, XGBoost) are fine on CPU or modest GPU. Each company’s model
can be trained in parallel if multiple GPUs/machines are available.
• Learning Resources:
• LSTM Tutorial: MachineLearningMastery LSTM guide (step-by-step example).
• Scikit-learn Ensembles: Random Forest docs for regression/classification.
• Multi-output Keras: Keras functional API docs (the example above).
• Hugging Face Transformers (if you embed text via BERT): Transformers documentation.

Stage 6: Model Evaluation and Validation


• What: Test model performance using time-aware splits. Use the last portion of the time series as a hold-out, or apply walk-forward cross-validation (e.g. TimeSeriesSplit in scikit-learn [8]). Compute metrics: for price/volatility (regression) use MSE or MAE; for direction (classification) use accuracy or F1. Also examine multi-horizon errors (does the 7-day prediction degrade as expected?). Visual inspection (line plots of actual vs. predicted price) can help.
• Why: Time-series data cannot be randomly shuffled; we must validate on future (unseen) dates only, to avoid look-ahead bias [8]. Metrics quantify whether the model beats naive baselines (e.g. "predict no change"). We may iteratively adjust model complexity or features if performance is poor.
• Expected Output: Performance report (metrics table) for each company and horizon, and possibly
plots. A decision on final model parameters. The chosen model versions (weights) are kept, others
discarded.
• How Connects: Results may feed back into Stage 5 for model tuning. Once finalized, the “best”
model per company is ready for production (Stage 7).

• Tools: scikit-learn's TimeSeriesSplit or manual train-test splits, with mean_squared_error and accuracy_score for metrics. Visualization via matplotlib.
• Example Code:

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    print("Fold MSE:", mean_squared_error(y[test_idx], pred))

• Hardware: CPU is fine for evaluation.


• Resources: Scikit-learn's TimeSeriesSplit docs [8] explain the importance of order-preserving CV.

Stage 7: Deployment and Prediction Pipeline


• What: Package the data pipeline and models so new data yields predictions. For each company,
implement a routine: load the latest data (prices, news, financials), apply the same cleaning and
feature-engineering steps, then feed into the saved model to get predicted closing price, volatility,
and direction for desired horizons. Assemble these outputs into a report or visualization. Optionally,
automate this in a daily/weekly job or wrap in an API.
• Why: The decision-support system must produce forecasts on up-to-date data. An automated
pipeline ensures consistency and repeatability. Saving models and scalers means we apply identical
transformations as during training.
• Expected Output: For each new date (or end-of-week), predicted metrics for each stock: e.g., “Stock
X will close at ₹Y in 1 day, with predicted volatility Z and probability of up-move p.” These can be
saved to CSV or a database, or shown on a dashboard.
• How Connects: This stage is the end-user interface of the pipeline. It uses all previous outputs
(models, scaling parameters) and feeds into decision-making systems (trading signals or analyst
reports).
• Tools: Python for scripting. For model loading: tf.keras.models.load_model() or joblib.load(). Version control or Docker can ensure environment consistency. If deploying as a service, Flask/FastAPI can serve predictions (see the sketch at the end of this stage).
• Example Code:

import joblib
import pandas as pd
from tensorflow.keras.models import load_model

# Load model and scaler saved in Stage 5
model = load_model("companyX_model.h5")
scaler = joblib.load("companyX_scaler.pkl")

# New data ingestion
new_prices = pd.read_csv(...); new_news = ...

# Same cleaning/feature steps as in training
new_features = make_features(new_prices, new_news, new_financials)
X_new = scaler.transform(new_features)

# Predict
pred_price, pred_vol, pred_dir = model.predict(X_new)

• Hardware: Prediction is lightweight – can run on a standard server or even a client machine. A GPU
is not needed for inference at small scale.
• Resources: Look into TensorFlow Serving or FastAPI tutorials if a real-time API is needed.
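
If the team wraps the pipeline in an API, as suggested under Tools, a minimal FastAPI sketch; it reuses make_features and the artifact file names from the example above, so treat those names as assumptions rather than fixed interfaces:

from fastapi import FastAPI
import joblib
from tensorflow.keras.models import load_model

app = FastAPI()

# Per-company artifacts saved in Stage 5; a real service would keep one pair per ticker
model = load_model("companyX_model.h5")
scaler = joblib.load("companyX_scaler.pkl")

@app.get("/predict")
def predict():
    # In practice: ingest the latest data here and rerun the Stage 2-3 cleaning/feature steps,
    # i.e. the make_features(...) call shown in the example code above
    new_features = make_features(new_prices, new_news, new_financials)
    X_new = scaler.transform(new_features)
    pred_price, pred_vol, pred_dir = model.predict(X_new)
    return {
        "pred_close": pred_price[-1].item(),
        "pred_volatility": pred_vol[-1].item(),
        "prob_up": pred_dir[-1].item(),
    }

Run it with, e.g., uvicorn serve:app --reload (assuming the file is named serve.py).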

Learning Resources (Overall):


- Python/Pandas: Official Pandas docs for data handling.
- NLP Sentiment: Hugging Face Pipeline introduction [3] and the FinBERT model card [2].
- Time-Series ML: Jason Brownlee's "Deep Learning for Time Series Forecasting" tutorial for LSTM concepts (pay attention to framing multi-step problems).
- Finance Concepts: Investopedia's Volatility entry for definitions (volatility = std dev of returns [6]).
- Modeling Best Practices: Scikit-learn's TimeSeriesSplit guide [8] to avoid data leakage.

This staged plan ensures that data flows smoothly from raw files to final predictions. Each stage's output feeds the next, allowing team members to work in parallel (e.g. one on data cleaning, another on feature/NLP engineering). The citations above provide guidance on proven techniques: combining technical and news features [1][5], using domain-specific NLP (FinBERT) [2][4], and respecting time-series validation [8]. With clear tasks and tools at each step, an intermediate Python team can build and iterate on this decision-support system.

[1][5][7] Combining News Sentiment and Technical Analysis to Predict Stock Trend Reversal
https://www.sentic.net/sentire2019cagliero.pdf

[2] ProsusAI/finbert · Hugging Face
https://huggingface.co/ProsusAI/finbert

[3] Pipelines — Hugging Face Transformers documentation
https://huggingface.co/docs/transformers/en/main_classes/pipelines

[4] Enhancing Stock Market Forecasting Through a Service-Driven Approach: Microservice System
https://thesai.org/Downloads/Volume16No1/Paper_27-Enhancing_Stock_Market_Forecasting.pdf

[6] Volatility: Meaning in Finance and How It Works With Stocks — Investopedia
https://www.investopedia.com/terms/v/volatility.asp

[8] TimeSeriesSplit — scikit-learn documentation
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
