Automl Code

The document outlines the creation of a dynamic ETL pipeline for real-time data processing, focusing on extracting data from a live API, transforming it, and loading it into an AWS RDS database. It includes error handling, scheduling with AWS solutions, and data visualization using Python libraries. The project structure is organized with directories for the ETL pipeline and Terraform configuration files.


Build a Dynamic Data Pipeline for Real-Time Insights.

Objective: Create an ETL pipeline for real-time data processing and insights.
• Extract data from a live API (e.g., stock prices or weather data).
• Normalize, clean, and transform the data to handle missing values and compute new metrics.
• Load the processed data into an AWS RDS cloud-based database.
• Implement error handling to manage failed API requests and ensure data integrity.
• Schedule the pipeline using an AWS cloud-based solution.
• Visualize the data trends using Python libraries like matplotlib (a sketch follows the pipeline code below) or export the data for further analysis.
• Deploy the whole infrastructure with Terraform.

project-root/
├── etl_pipeline/
│   ├── etl_pipeline.py
│   ├── requirements.txt
│   └── README.md
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
└── .gitignore
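
Based on the imports in the script below, a minimal etl_pipeline/requirements.txt would simply list the third-party libraries used (entries are left unpinned here as a sketch; pin versions as needed):

requests
pandas
numpy
boto3
schedule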

============================ etl_pipeline.py ===========================================

import requests
import json
import pandas as pd
import numpy as np
import boto3
import schedule
import time
import logging
from datetime import datetime
from decimal import Decimal  # DynamoDB requires Decimal for numeric attributes

# Set up logging
logging.basicConfig(level=logging.INFO)

# Alpha Vantage API URL and key
api_url = "https://www.alphavantage.co/query"
api_key = "your_api_key_here"  # Replace with your actual API key

# AWS DynamoDB setup
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('StockPrices')  # Ensure the DynamoDB table is created

# Fetch stock data from the Alpha Vantage API
def fetch_data():
    try:
        params = {
            "function": "TIME_SERIES_INTRADAY",
            "symbol": "AAPL",      # Example: Apple Inc.
            "interval": "5min",    # Fetch 5-minute interval stock prices
            "apikey": api_key
        }
        response = requests.get(api_url, params=params)
        response.raise_for_status()  # Will raise an error if the API request fails
        return response.json()
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None
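
# For reference, transform_data() below assumes the Alpha Vantage intraday
# response shape sketched here. The keys mirror the real API; the values are
# made-up placeholders for illustration only.
#
# {
#     "Meta Data": { "2. Symbol": "AAPL", "4. Interval": "5min", ... },
#     "Time Series (5min)": {
#         "2024-01-02 16:00:00": {
#             "1. open": "185.0000",
#             "2. high": "185.5000",
#             "3. low": "184.8000",
#             "4. close": "185.2000",
#             "5. volume": "123456"
#         },
#         ...
#     }
# }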

# Transform the data (clean, handle missing values, compute new metrics)
def transform_data(raw_data):
    try:
        # Extract the 'Time Series (5min)' data
        time_series = raw_data.get("Time Series (5min)", {})
        if not time_series:
            logging.error("No data found in the API response.")
            return None

        # Prepare the data for processing
        records = []
        for timestamp, values in time_series.items():
            record = {
                'timestamp': timestamp,
                'price': float(values['4. close']),
            }
            records.append(record)

        # Convert to DataFrame and sort chronologically (the API may return the
        # newest entries first) so ffill and pct_change operate in time order
        df = pd.DataFrame(records)
        df = df.sort_values('timestamp').reset_index(drop=True)

        # Handle missing values (forward fill)
        df['price'] = df['price'].ffill()

        # Compute the price change percentage; the first row has no previous
        # price, so treat its change as 0 instead of NaN
        df['price_change'] = df['price'].pct_change().fillna(0.0) * 100

        # Normalize the price (optional)
        df['price_normalized'] = (df['price'] - df['price'].min()) / \
            (df['price'].max() - df['price'].min())

        return df
    except Exception as e:
        logging.error(f"Error transforming data: {e}")
        return None

# Load the data into DynamoDB
def load_to_dynamodb(df):
    try:
        for _, row in df.iterrows():
            table.put_item(
                Item={
                    'timestamp': row['timestamp'],
                    # DynamoDB does not accept Python floats; convert via Decimal
                    'price': Decimal(str(row['price'])),
                    'price_change': Decimal(str(row['price_change'])),
                    'price_normalized': Decimal(str(row['price_normalized'])),
                }
            )
        logging.info("Data loaded into DynamoDB successfully.")
    except Exception as e:
        logging.error(f"Error loading data into DynamoDB: {e}")

# Run the ETL pipeline
def run_etl():
    # Step 1: Fetch the data
    data = fetch_data()
    if data:
        # Step 2: Transform the data
        df = transform_data(data)
        if df is not None:
            # Step 3: Load the data into DynamoDB
            load_to_dynamodb(df)
        else:
            logging.error("Data transformation failed.")
    else:
        logging.error("Data fetching failed.")

# Schedule the ETL pipeline to run every 5 minutes
schedule.every(5).minutes.do(run_etl)

# Keep the script running and executing the scheduled jobs
if __name__ == "__main__":
    logging.info("Starting the ETL pipeline.")
    while True:
        schedule.run_pending()
        time.sleep(1)
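
# The objectives also call for scheduling with an AWS cloud-based solution. One
# common option (an assumption, not prescribed by this document) is to package
# run_etl() as an AWS Lambda function and trigger it from an EventBridge rule
# with a rate(5 minutes) schedule, instead of running the schedule loop above.
# The Lambda entry point would then be as small as:
def lambda_handler(event, context):
    run_etl()
    return {"status": "etl run completed"}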

===================================================================================
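
The visualization step from the objectives is not implemented in the script above. A minimal sketch using matplotlib, assuming the DataFrame returned by transform_data() (column names as defined there), might look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Plot the closing price and its percentage change from the transformed DataFrame
def plot_trends(df):
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
    ax1.plot(pd.to_datetime(df['timestamp']), df['price'])
    ax1.set_ylabel('Close price')
    ax2.plot(pd.to_datetime(df['timestamp']), df['price_change'])
    ax2.set_ylabel('Change (%)')
    ax2.set_xlabel('Timestamp')
    fig.autofmt_xdate()              # Tilt timestamp labels so they stay readable
    plt.tight_layout()
    plt.savefig('stock_trends.png')  # Or plt.show() for interactive use

Called with the DataFrame produced by transform_data(), this writes a two-panel PNG; alternatively, df.to_csv('stock_trends.csv') covers the "export the data for further analysis" option from the objectives.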
