0% found this document useful (0 votes)

26 views8 pages

While Web Scraping As We Know It Today Has Existed For Well Over A Decade Now

Uploaded by

Obed ur Rahman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views8 pages

While Web Scraping As We Know It Today Has Existed For Well Over A Decade Now

Uploaded by

Obed ur Rahman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

While web scraping as we know it today has existed for well over a decade now, its

relevance has never been more profound. The ability to make it useful and value-
generating is currently applicable to almost all slices of business and beyond.

Here are some of the factors that provide evidence for web scraping’s ever-increasing
necessity in businesses:

• Scheduling and automation

• Quick, low maintenance results
• Cost-effectiveness
• Enablement of competitive advantages such as data based market analysis
• Data gathering to better understand the needs of your customers

And many more.

The main question is how will the already successful data scraping techniques be
affected by machine learning?

What is machine learning?

First, let’s take a more in-depth look at the primary features of machine learning (ML).
As an essential component of data science and a branch/byproduct of AI, it aims to
imitate the way humans learn. It does so by gathering data and using algorithms, which
are then used for gradual self-improvement in terms of predictions and their accuracy.

Such features are rather convenient since they allow for a more hands-off approach,
i.e., instead of hand-coding various software routines or instructions, this will enable you
to achieve a specific given task primarily through a machine with little interference from
a developer.

The next question is how the benefits of ML translate to real-life scenarios? To answer
this, let’s look at some use cases.

Customer service: chatbots are starting to replace human agents, with FAQs often
being answered without a human reply. The Virtual Agents of Slack and Facebook are
prime examples of this.

Web Unblocker: an AI-powered proxy solution, allows for block-free data gathering.
This proxy solution contains ML-driven proxy management as well as ML-powered
response recognition both of which ensure an effortless data collection process.
Computer Vision: the AI, and as such, ML technology, allows for the extraction of
meaningful information taken solely from visual data, upon which recognition tasks can
be achieved. A prime example of this use is ML integration within self-driving cars.

Stock trading: enables automated trading that optimizes stock portfolios by potentially
making millions of automated trades per day.

The importance of web scraping in

machine learning
Having discussed the features of Machine Learning, let’s take a look at how they all
translate within web scraping.

Primarily, web scraping in ML is centered around the core problem of gathering quality
data.

While the internal information gathered on day-to-day business can provide valuable
insights, such data is insufficient. Therefore, gathering from external sources is
essential, although a more complex task. Inaccuracy/poor data quality becomes a
severe concern when scraping, and a final clean-up step must always be included
within any scraping project, though this will be discussed in greater detail later on in this
guide.

Using machine learning for web

scraping
The following example collects historical stock prices using web scraping. Data points,
such as daily opening, daily highest, daily lowest, daily closing, will be collected as well.

Thankfully, numerous websites provide such data, and it’s usually, conveniently,
presented in a table. Typically, you’ll see the HTML code that renders these tables,
such as the following image.

Typical stock data along with HTML markup

With that in mind, let’s get started with the first step of web scraping.
Project requirements
In this blog, we’ll be working with Python 3.9. However, this code will work with Python
3.7 and 3.8 as well.

There are two sets of requirements for this project. Firstly, libraries for web scraping and
secondly, libraries for machine learning.

For web scraping, we’ll need Requests-HTML and BeautifulSoup4. Install these from
the terminal as follows:

$ python3 -m pip install requests_html beautifulsoup4

In regards to machine learning, we’ll be using multiple libraries instead. Primarily,

Pandas and Numpy are going to be our choice on how to handle our data. For
visualization, Matplotlib will be our choice. For preprocessing data, we’ll need help from
the SciKit Learn library. Finally, we’ll use Tensorflow for creating a neural network
machine learning model. Install all these libraries from the terminal as follows:

$ python3 -m pip install pandas numpy matplotlib seaborn tensorflow sklearn

Extracting the data

If we’re looking at machine learning projects, Jupyter Notebook is a great choice as it’s
easier to run and rerun a few lines of code. Moreover, the plots are in the same
Notebook.

Begin with importing required libraries as follows:

from requests_html import HTMLSession

import pandas as pd

For web scraping, we only need Requests-HTML. The primary reason is that Requests-
HTML is a powerful library that can handle all our web scraping tasks, such as
extracting the HTML code from websites and parsing this code into Python objects.
Further benefits come from the library’s ability to function as an HTML parser, meaning
collecting data and labeling can be performed using the same library.

Next, we use Pandas for loading the data in a DataFrame for further processing.

In the next cell, create a session and get the response from your target URL.

url = 'http://your-target-url'
session = HTMLSession()
r = session.get(url)

After this, use XPath to select the desired data. It’ll be easier if each row is represented
as a dictionary where the key is the column name. All these dictionaries can then be
added to a list.

rows = r.html.xpath('//table/tbody/tr')
symbol = 'AAPL'
data = []
for row in rows:
if len(row.xpath('.//td')) < 7:
continue
data.append({
'Symbol':symbol,
'Date':row.xpath('.//td[1]/span/text()')[0],
'Open':row.xpath('.//td[2]/span/text()')[0],
'High':row.xpath('.//td[3]/span/text()')[0],
'Low':row.xpath('.//td[4]/span/text()')[0],
'Close':row.xpath('.//td[5]/span/text()')[0],
'Adj Close':row.xpath('.//td[6]/span/text()')[0],
'Volume':row.xpath('.//td[7]/span/text()')[0]
})
The results of web scraping are being stored in the variable data. To understand why
such actions are taken, we must consider that these variables are a list of dictionaries
that can be easily converted to a data frame. Furthermore, completing the steps
mentioned above will also help to complete the vital step of data labeling.

Scraped data loaded in a data frame

The provided example’s data frame is not yet ready for the machine learning step. It still
needs additional cleaning.

Cleaning the data

Now that the data has been collected using web scraping, we need to clean it up. The
primary reason for this action is uncertainty whether the data frame is acceptable;
therefore, it’s recommended to verify everything by running df.info().

Data frame has everything stored as strings

As evident from the above screen-print, all the columns have data type as object. For
machine learning algorithms, these should be numbers.
Dates can be handled using Pandas.to_datetime. It’ll take a series and convert the
values to datetime. This can then be used as follows:

df['Date'] = pd.to_datetime(df['Date'])

The issue we ran into now is that the other columns were not automatically converted to
numbers because of comma separators.

Thankfully, there are multiple ways to handle this. The easiest one is to remove the
comma by calling str.replace() function. The astype function can also be called in
the same line which will then return a float.

str_cols = ['High', 'Low', 'Close', 'Adj Close', 'Volume']

df[str_cols]=df[str_cols].replace(',', '', regex=True).astype(float)

Finally, if there are any None or NaN values, these can be deleted by calling
the dropna().

df.dropna(inplace=True)

As the last step, set the Date column as the index and preview the data frame.

df = df.set_index('Date')
df.head()

The clean DataFrame is now ready to be processed.

The data frame is now clean and ready to be sent to the machine learning model.

Visualizing the data

Before we begin the section on machine learning, let’s have a quick look at the closing
price trend.

First, import the packages and set the plot styles:

import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('darkgrid')
plt.style.use("ggplot")

Next, enter the following lines to plot the Adj Close, which is the adjusted closing
price:
plt.figure(figsize=(15, 6))
df['Adj Close'].plot()
plt.ylabel('Adj Close')
plt.xlabel(None)
plt.title('Closing Price of AAPL')

Plotting the adjusted closing price of AAPL

Preparing data for machine learning
The first step to machine learning is the selection of features and values we want to
predict.

In this example, the “Adjusted Close” is dependent on the “Close” part. Therefore, we’ll
ignore the Close column and focus on Adj Close.

The features are usually stored in a variable named X and the values that we want to
predict are stored in a variable y.

features = ['Open', 'High', 'Low', 'Volume']

y = df.filter(['Adj Close'])

The next step we have to consider is feature scaling. It’s used to normalize the features,
i.e., the independent variables. Within our example, we can use MinMaxScaler. This
class is part of the preprocessing module of the Sci Kit Learn library.

First, we’ll create an object of this class. Then, we’ll train and transform the values using
the fit_transform method as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = scaler.fit_transform(df[features])

The next step is splitting the data we have received into two datasets, test and training.

The example we’re working with today is a time-series data, meaning data that changes
over a time period requires specialized handling. The TimeSeriesSplit function from
SKLearn’s model_selection module will be what we need here.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=10)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
Our approach for today will be creating a neural network that uses an LSTM or a Long
Short-Term Memory layer. LSTM expects a 3-dimensional input with information about
the batch size, timesteps, and input dimensions. We need to reshape the features as
follows:

X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])

X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])

Training the model and predictions

We’re now ready to create a model. Import the Sequential model, LSTM layer,
and Dense layer from Keras as follows:

from keras.models import Sequential

from keras.layers import LSTM, Dense

Continue by creating an instance of the Sequential model and adding two layers. The
first layer will be an LSTM with 32 units while the second will be a Dense layer.

model = Sequential()
model.add(LSTM(32, activation='relu', return_sequences=False))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

The model can be trained with the following line of code:

model.fit(X_train, y_train, epochs=100, batch_size=8)

While the predictions can be made using this line of code:

y_pred= model.predict(X_test)

Finally, let’s plot the actual values and predicted values with the following:

plt.figure(figsize=(15, 6))
plt.plot(y_test, label='Actual Value')
plt.plot(y_pred, label='Predicted Value')
plt.ylabel('Adjusted Close (Scaled)')
plt.xlabel('Time Scale')
plt.legend()

The trend and values of the predictions are very close.

The plot shows that the predictions are close to the actual values. Yet, more
importantly, the trends are similarly close as well.

Daggerheart Core Rulebook 5-20-2025 1
100% (29)
Daggerheart Core Rulebook 5-20-2025 1
415 pages
1cr in 365days Ebook
93% (14)
1cr in 365days Ebook
101 pages
5K Link Mega - NZ
100% (1)
5K Link Mega - NZ
157 pages
Sim Owner Details - Pakistan No #1 Number Information System 2025
60% (5)
Sim Owner Details - Pakistan No #1 Number Information System 2025
3 pages
Serial Number Corel Draw All Version
74% (91)
Serial Number Corel Draw All Version
1 page
Dating Format
90% (144)
Dating Format
23 pages
Corel Draw X7 Serial Number & Activation Code
61% (38)
Corel Draw X7 Serial Number & Activation Code
1 page
HC Verma Concepts of Physics Volume 1
24% (29)
HC Verma Concepts of Physics Volume 1
1 page
Xtream IPTV Activation Code 2025
67% (3)
Xtream IPTV Activation Code 2025
1 page
Aimbot+ Headshot This Codding File Free Fire Made by Rahul Gamerx
83% (24)
Aimbot+ Headshot This Codding File Free Fire Made by Rahul Gamerx
1 page
Microsoft Office 2007 Activation Keys
85% (26)
Microsoft Office 2007 Activation Keys
2 pages
Microsoft Office 2013 Professional Plus Product Keys
75% (16)
Microsoft Office 2013 Professional Plus Product Keys
1 page
VIDEO@18++ Pakcricketinfo Sapna Shah Viral Video
No ratings yet
VIDEO@18++ Pakcricketinfo Sapna Shah Viral Video
4 pages
Microsoft Windows 7 Ultimate
67% (15)
Microsoft Windows 7 Ultimate
9 pages
R. D. Sharma Class 9th Book PDF - Unlocked
73% (22)
R. D. Sharma Class 9th Book PDF - Unlocked
464 pages
All Mobile Imei Code
77% (13)
All Mobile Imei Code
2 pages
Microsoft Office 2016 Product Key
59% (17)
Microsoft Office 2016 Product Key
2 pages
Serial Number Adobe Photoshop CS5
63% (19)
Serial Number Adobe Photoshop CS5
3 pages
A2Z Telugu Boothu Kathalu
65% (52)
A2Z Telugu Boothu Kathalu
8 pages
Quiz 1
100% (2)
Quiz 1
10 pages
It - Stephen King's PDF
100% (3)
It - Stephen King's PDF
588 pages
NADANPENKODI - Malayalam Kambi Kathakal
60% (10)
NADANPENKODI - Malayalam Kambi Kathakal
8 pages
Telugu Boothu Kathala 5
65% (17)
Telugu Boothu Kathala 5
33 pages
Full 18++ Archita Phukan Viral Video Original
No ratings yet
Full 18++ Archita Phukan Viral Video Original
4 pages
(Full-Video) Archita Pukham Video Viral On Social Media X
No ratings yet
(Full-Video) Archita Pukham Video Viral On Social Media X
4 pages
(ExcluSivE HOT) 18+ Archita Pukham Viral Video Original
No ratings yet
(ExcluSivE HOT) 18+ Archita Pukham Viral Video Original
4 pages
New Job DSK
No ratings yet
New Job DSK
804 pages
18+ (Original-CLIP) Archita Pukham Viral Video On Social MEDIa
0% (1)
18+ (Original-CLIP) Archita Pukham Viral Video On Social MEDIa
4 pages
Ministirii Kutaa 8ffaa
100% (7)
Ministirii Kutaa 8ffaa
910 pages
List of Mobile Network Prefixes in The Philippines
50% (2)
List of Mobile Network Prefixes in The Philippines
4 pages

While Web Scraping As We Know It Today Has Existed For Well Over A Decade Now

Uploaded by

While Web Scraping As We Know It Today Has Existed For Well Over A Decade Now

Uploaded by

While web scraping as we know it today has existed for well over a decade now, its

• Scheduling and automation

And many more.

What is machine learning?

The importance of web scraping in

Using machine learning for web

Typical stock data along with HTML markup

$ python3 -m pip install requests_html beautifulsoup4

In regards to machine learning, we’ll be using multiple libraries instead. Primarily,

$ python3 -m pip install pandas numpy matplotlib seaborn tensorflow sklearn

Extracting the data

Begin with importing required libraries as follows:

from requests_html import HTMLSession

Scraped data loaded in a data frame

Cleaning the data

Data frame has everything stored as strings

str_cols = ['High', 'Low', 'Close', 'Adj Close', 'Volume']

The clean DataFrame is now ready to be processed.

Visualizing the data

First, import the packages and set the plot styles:

import matplotlib.pyplot as plt

Plotting the adjusted closing price of AAPL

features = ['Open', 'High', 'Low', 'Volume']

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import TimeSeriesSplit

X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])

Training the model and predictions

from keras.models import Sequential

The model can be trained with the following line of code:

model.fit(X_train, y_train, epochs=100, batch_size=8)

While the predictions can be made using this line of code:

The trend and values of the predictions are very close.

You might also like