Host A Scheduled Scraper On AWS As An API Endpoint - Amen


Scheduled scraper with Flask as an API endpoint:

● Python libraries are among the most widely used web scraping tools available today.

● Beautiful Soup is the most popular Python web scraping library.

● We'll build a web scraper app with Flask, a lightweight Python web framework.
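To make the parsing step concrete, here is a minimal Beautiful Soup sketch. The HTML snippet is invented for illustration, but the `story-link` class matches the markup scraped later in this tutorial:

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page.
html = '<html><body><a class="story-link" href="/post-1">First story</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Find the first anchor tag carrying the "story-link" class.
link = soup.find("a", class_="story-link")
print(link.get("href"))  # -> /post-1
print(link.text)         # -> First story
```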

Step 1: Setup the environment

pip install flask flask_sqlalchemy requests beautifulsoup4 newspaper3k schedule

Step 2: Create a Flask app

app.py

from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy
import os

# Store the SQLite database next to this file.
project_dir = os.path.dirname(os.path.abspath(__file__))
database_file = "sqlite:///{}".format(os.path.join(project_dir, "news_scrape.db"))

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = database_file
app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = False
app.config["SECRET_KEY"] = "newisthesecretofsecretscrape"
db = SQLAlchemy(app)

# One row per scraped article.
class Articlelist(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.Text)
    author = db.Column(db.Text)
    summary = db.Column(db.Text)

@app.route('/')
def index():
    # Render all scraped articles; requires a templates/index.html file.
    articles = Articlelist.query.all()
    return render_template("index.html", articles=articles)

if __name__ == "__main__":
    app.run()
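The index route above renders an HTML template, but since the goal is to expose the scraper as an API endpoint, a JSON route is worth sketching as well. This is a minimal, self-contained sketch: the `/api/articles` path and the in-memory `ARTICLES` list are assumptions for illustration; in the real app the rows would come from `Articlelist.query.all()`:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical rows standing in for Articlelist.query.all().
ARTICLES = [
    {"id": 1, "title": "Sample headline", "author": "Jane Doe",
     "summary": "A short summary."},
]

@app.route("/api/articles")
def api_articles():
    # Serialize the article list as a JSON array.
    return jsonify(ARTICLES)

if __name__ == "__main__":
    app.run()
```

With the app running, requesting /api/articles returns the articles as JSON, which is what makes the scraper consumable as an API rather than only as a rendered page.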

Step 3: Create the Scrape function

scrape.py
Create a scraping function that fetches articles from two news sites, https://thehackernews.com and https://www.ehackingnews.com.
We'll use the Requests library to send HTTP requests and Beautiful Soup to parse the HTML.

from bs4 import BeautifulSoup
from newspaper import Article
import requests
import sqlite3
import schedule
import time

connection = sqlite3.connect('news_scrape.db')
cursor = connection.cursor()

class NewsArticle:
    def __init__(self, news_urls):
        self.news_urls = news_urls
        for news_url in self.news_urls:
            # Download and parse each article with newspaper3k.
            article = Article(news_url)
            article.download()
            article.parse()
            self.title = article.title
            # Some articles have no author metadata, so guard the lookup.
            self.author = article.authors[0] if article.authors else "Unknown"
            article.nlp()  # summarization; requires the NLTK "punkt" data
            self.summary = article.summary
            cursor.execute("""INSERT INTO Articlelist VALUES (NULL, ?, ?, ?)""",
                           (self.title, self.author, self.summary))
            connection.commit()

def scrape_news():
    # Rebuild the table from scratch on every run.
    cursor.execute("DROP TABLE IF EXISTS Articlelist")
    print("Dropped Table")
    print("Creating Table")
    cursor.execute(
        """CREATE TABLE Articlelist(
            id INTEGER PRIMARY KEY,
            title TEXT,
            author TEXT,
            summary TEXT
        )
        """)
    connection.commit()
    print("Created Table")

    print("SCRAPING SITE ONE")
    site1_content = requests.get('https://thehackernews.com')
    site1_data = site1_content.text
    soup1 = BeautifulSoup(site1_data, 'html.parser')
    news_urls = []
    story_links1 = soup1.find_all('a', class_="story-link")
    for story_link1 in story_links1:
        url = story_link1.get('href')
        news_urls.append(url)
    site1 = NewsArticle(news_urls)
    news_urls.clear()

    print("SCRAPING SITE TWO")
    site2_content = requests.get('https://www.ehackingnews.com/search/label/Cyber%20Crime?max-results=7')
    site2_data = site2_content.text
    soup2 = BeautifulSoup(site2_data, 'html.parser')
    blog_posts = soup2.find_all('article', class_="home-post")
    for blog_post in blog_posts:
        url = blog_post.h2.a.get('href')
        news_urls.append(url)
    site2 = NewsArticle(news_urls)
    print("DONE")

#schedule.every(5).minutes.do(scrape_news)
schedule.every().day.at("00:00").do(scrape_news)  # "24:00" is not a valid time
while True:
    schedule.run_pending()
    time.sleep(1)

Step 4: Testing the web scraper

Finally, test the scraper end to end. Run `python scrape.py` in one terminal so the scheduler can populate the database, then start the Flask application with `python app.py`. Open https://localhost:5000 in your browser to see the latest news headlines displayed in your app.
