Octoparse Webscraping 2020.08.03


www.octoparse.com

Disclaimer
© 2020 by Tim Luprich Veröffentlichungen

All rights reserved


No part of this book may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or
otherwise, without the prior written permission of the
publisher except in the case of brief quotations embodied in
critical articles and reviews. For information, please write:
[email protected]

This publication contains the author’s opinions and is designed to provide accurate and authoritative information. It is sold with the understanding that the author is not engaged in rendering legal, accounting, health, or other professional advice. The reader should seek the services of a qualified professional for such advice; the author cannot be held responsible for any loss incurred as a result of specific advice given in this publication.

Tim Luprich
Ludwigstr. 54
70176 Stuttgart
Germany


Table of Contents

1 Foreword

2 Data is the new gold

3 Web scraping - efficiency and automation

4 Advantages of web scraping

5 Problems of Web scraping

6 Web scraping software

7 Web Scraping by means of Octoparse

8 Getting Started with Octoparse


8.1 System Requirements
8.2 Download and installation
8.3 Sign up for Octoparse
8.4 The GUI – Graphic User Interface
8.4.1 The Home screen
8.4.2 The Sidebar Menu
8.4.3 Dashboard
8.4.4 Quick Filters
8.4.5 Recent Tasks
8.4.6 Data Services
8.4.7 Contact us


8.4.8 The Workspace

9 Octoparse Workflow Methods

10 Octoparse Advanced Mode


10.1 Add a new task with the Advanced Mode
10.2 Workflow Tips
10.3 Auto Detection
10.3.1 Improve the auto detect Mode
10.4 Edit or create your Advanced Workflow
10.5 Create a Scraping Task with the Advanced
Workflow from scratch
10.5.1 Click on "Go to Web Page"
10.5.2 Click the "Pagination" box
10.5.3 Click on "Click to Paginate"
10.5.4 Click on the "Loop Item" box
10.5.5 Click on "Extract Data"
10.6 Web scraping for advanced users - Subpages
10.7 Start a run
10.8 Check your data
10.9 Ways to get data
10.10 Export your data

11 Extraction with the Octoparse cloud


11.1 To run your task with cloud extraction:
11.2 To batch run tasks with cloud extraction:


11.3 Auto-data export (for Cloud data)

12 Schedule regular runs


12.1 Task scheduling

13 Octoparse Template Mode


13.1 How to use the Template Mode?

14 Octoparse Hacks for your Workflows


14.1 Extract data behind a login
14.1.1 Use cookies to optimize the workflow
14.1.2 Clear cookies instead
14.2 How to click through options in a drop-down
menu?
14.3 Text/keyword input
14.3.1 Input a single keyword into the textbox
14.3.2 Input multiple keywords into a search
box
14.4 Refine your data
14.4.1 Rename/move/duplicate/delete a field
14.5 Clean data
14.5.1 Capture HTML code
14.5.2 Extract page-level data and date & time
14.6 Octoparse Anti-Blocking settings
14.6.1 IP rotation
14.7 Switch user-agents and clear cookies


14.8 Wait before Execution


14.9 Auto switch browser
14.10 Workflow trigger
14.11 How to set up Triggers?
14.11.1 Create a new trigger
14.11.2 For general texts
14.11.3 For numerals
14.11.4 For time
14.11.5 Add more conditions using [AND] or
[OR]
14.12 Incremental Extraction- Get updated data
easily

15 Closing words


1 Foreword
Social, professional and corporate life has now been
undergoing digital change for some time. People shop
online, book a table in their favorite restaurant on the
move and compare the prices of travel bookings and
products of all kinds through portals.

Information is more important than ever before. So, it's high time to build on your personal, digital strengths.

For the formerly strong professionalism, in both the professional and private spheres, is increasingly giving way to digital transformation.

Today, the majority of work processes and tasks are automated. This is reflected not least in the changed job advertisements for most professions. Where a few years ago dealers or consultants were still being sought, IT specialists and programmers are taking their place.
The banking industry is changing particularly strongly, as are many other branches of industry, trade and services. It is therefore only logical that obsolete manual processes are being replaced by efficient, IT-supported ones.
Many of the people currently working in professional life have not enjoyed IT training. Acquiring programming skills is, in terms of the time required, equivalent to learning a new language.

Fortunately, there are more and more tools available today that spare you from learning Java, C++ or other programming languages. One such tool is Octoparse, the web scraping program described in this book.

My name is Tim Luprich, and for almost two decades now I have been working in the banking industry, which is now also becoming more and more an IT industry. I am pleased to introduce you to web scraping on the following pages and hope you enjoy designing and executing your web scraping solutions.

If you would like to dive deeper into Octoparse and its many advantages, I can recommend the Octoparse blog, which offers useful content, and in many places the premium version. You can purchase the premium version via the following affiliate link. I receive a small commission through your purchase, which does not affect your purchase price. Of course, you can also buy the premium offer without using the link.

http://agent.octoparse.com/ws/324


I wish you much success with your scraping project and especially much fun reading this book!


2 Data is the new gold

The importance of data and information is greater than ever before. The technological change that has been taking place for years now affects all people around the world. It is not only changing our private lives, but is also having a massive impact on professional, economic, political and social decision-making.

Where a few decades ago manual human activities were necessary for information processing, today algorithms work through large amounts of data independently and fully automatically.

The job descriptions and thus the requirements for employees have changed at least as drastically. Job titles such as “Big Data Analyst” or “Data Scientist” are becoming more and more common, and so almost every newly created job requires an IT affinity. This is because companies are generating more and more turnover and profits from the data available to them, and the trend is rising.

According to a study by the European Commission, the total value of the European Data Economy was estimated at 300 billion euros in 2016 and 740 billion euros in 2020. The study mentions that about 10 million people will be employed purely in the analysis and processing of data.

But data alone, without specific and targeted processing, is of little use. Just like a rough diamond, which must be polished to unfold its full splendor, data must be analyzed, processed, enriched and interpreted before it can be used. Once a data analyst and his team have completed a technical, methodological and structural analysis of the data, this task can in turn be automated to the extent that a computer-assisted algorithm will perform it independently in the future and continuously produce results. Like an industrial machine that welds the same components together again and again and creates something new, the program picks up data, processes it in the desired format and passes it on to the user or a new program.

The success of data processing has long been confirmed by the well-known economist Erik Brynjolfsson. In his research, he concluded that companies that rely on data-based decision-making are approximately 6% more efficient than their competitors who do not. The term data mining refers to the use of the data obtained and the prediction of certain events from this information.


Data mining uses information from a database, as is often available in medium and large companies, for business objectives such as maximizing profit or increasing sales, mostly by using customer data. In order to operate data mining successfully, a structured database is just as necessary as a specific interpretation of the data. This allows future trends to be predicted based on historical data.

This look into the future makes data mining more valuable than ever for private companies as well as for NGOs, governments, and political parties. Data mining helps companies to predict production peaks and to selectively add or remove resources from a process. Data mining can help physicians and medical practitioners, based on disease patterns, to make an early diagnosis of future disease development and to counteract it with appropriate treatment and medication.

Already today, data mining is used in traffic management and control by accurately predicting traffic jams at certain bottlenecks and considering possible countermeasures within route planning. Anyone who has ever used a navigation system appreciates data mining: by means of the real-time traffic situation, it bypasses the traffic jam in advance and thus saves time, nerves, fuel and money.


In the insurance sector, data miners work with probability calculations for an insurance occurrence. Insurance companies have been working since their inception to determine the potential risk of occurrence and damage in order to set the insurance premium. They draw on an enormous database that they have built up internally from their own customer cases, assess future losses quantitatively and calculate the probability of occurrence of the insured event.

In the advertising industry, companies use personal data to tailor advertising even more specifically to the personal preferences of the target group and to target advertisements at the consumer goods a customer is calculated to buy next. Many marketing experts already say that this data is more valuable than oil or gold. In recent years, according to the Data and Marketing Association, U.S. companies have spent over $10 billion to obtain data for advertising and marketing, and another $10 billion on technical solutions that turn this data into advertising delivered to customers. The sheer size of the investment shows how important data has become for the business world.

In the last US presidential election, the world witnessed the use of personal data to influence voters. With the help of information from voters on social media channels and targeted psychological grouping of users, campaign spots could be sent out to undecided voters. These ultimately helped Donald Trump to the decisive election victory in the swing states.

This kind of election advertising almost borders on voter manipulation and of course casts a bad light on data miners and on people who use these tools to solve the real problems of mankind. An insight into the effort that went into influencing the U.S. election can be found in the documentary about Cambridge Analytica, “The Great Hack”, available on Netflix at the time this book was published.

But before the specific benefit for the owner of the data can be realized, huge amounts of data must be searched with appropriate programs and patterns must be recognized. The effort involved should not be underestimated. Fortunately, data analysis is also becoming increasingly efficient, and programs help to interpret data. They already recognize patterns on their own that would otherwise have had to be worked out manually or would have remained hidden from the human eye. The more detailed a data set is, the more valuable it is.

Data mining, and the data extraction it builds on, is therefore only possible if a large database is available. A web scraper provides the basis for obtaining information that is not itself available in a structured form.


3 Web scraping - efficiency and automation

Web scrapers have different names. They are called spider, web spider, search bot, scraper, web crawler, robot and probably other names. However, they all have the same function. For simplicity, I will use the term web scraper in this book.

The task of the web scraper is to identify, structure, collect and store information available on the Internet in the form of data, and thereby make it processable. The web scraper searches the World Wide Web and analyzes web pages. The best known example of a web scraper or web crawler is the Google search engine.

The first web crawler was developed as early as 1993 with the World Wide Web Wanderer. Its purpose was to record the growth of the Internet. Shortly afterwards, in 1994, WebCrawler went online as the first search engine enabling text searches on the Internet. This was a groundbreaking innovation at this early stage of the Internet.

After a time, there were plenty of different search engines online, all of which developed different search algorithms. But the method of retrieving data using a crawler was similar for all of them. This is how search engines like WebCrawler, Yahoo and Fireball, but also Google, came into being.

Even today, the term web crawler is still used for programs that retrieve Internet data. These have nothing in common with the original WebCrawler company, which nevertheless still exists today.

Google's web scraper is used to index web pages and makes the search results available to its users after processing by an algorithm. Only after indexing is it possible to find web pages by entering them in the familiar Google search box. Depending on the quantity and quality of the web page data, such as the number of pages, words per page, but also the structure of the data, such as an existing meta description, Google can better analyze the results and display them in a way that is tailored to the user.

Comparison portals such as Trivago use the same methods of a web scraper to publish the best offers from the mass of information for their customers. Furthermore, web scrapers are also used for research purposes in sentiment analysis on social media. The program works with a special type of bot that automatically performs repetitive tasks on its own.


To visualize how a web scraper works, imagine an employee searching for smartphones on the Amazon platform. They enter the search term "smartphone" in the search field on the Amazon website and click the search button. The results are then displayed. Now the real work begins: the employee clicks into each search result individually and copies the contents of the page, such as the model name, the manufacturer, the price and the delivery time, into a table. They now do this for all 4,000 search results.

What is boring, time-consuming and error-prone work for a human being becomes a simple routine task for a web scraper, with its ability to access the HTML code of the web page directly and to output a standardized format such as CSV or XML. Challenging but solvable problems are web page errors, differently structured data on the web page and different languages, for which good solutions now exist.
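To make this concrete, here is a minimal Python sketch of exactly what the scraper automates: reading structured fields out of a page's markup and writing one row per result into a CSV table. The page content, tag names and field names are illustrative assumptions (not Amazon's real markup); a real HTML page would need a proper HTML parser such as Beautiful Soup.

```python
# A sketch of scraper output: parse a simplified, well-formed results page
# and emit CSV. All markup below is invented sample data.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_RESULTS_PAGE = """
<results>
  <item><name>Phone A</name><maker>Acme</maker><price>199.00</price></item>
  <item><name>Phone B</name><maker>Bolt</maker><price>349.00</price></item>
</results>
"""

def scrape_to_csv(page_source: str) -> str:
    """Extract one row per result item and return the data as CSV text."""
    root = ET.fromstring(page_source)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "manufacturer", "price"])  # header row
    for item in root.iter("item"):
        writer.writerow([
            item.findtext("name"),
            item.findtext("maker"),
            item.findtext("price"),
        ])
    return buffer.getvalue()

print(scrape_to_csv(SAMPLE_RESULTS_PAGE))
```

The same loop, pointed at 4,000 real result pages, is what replaces the employee's copy-and-paste work.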


4 Advantages of web scraping

The advantages of web scraping in general and the use of scrapers in particular are manifold. From an economic point of view, an automated and standardized process increases the productivity of companies and organizations.

Companies no longer need to employ staff for standard tasks, but can move them to more demanding ones. In most cases, these employees will be happy not to have to perform simple copy & paste tasks. It also eliminates the need to hire new employees, which results in a reduction of personnel costs (or at least no increase in costs).

A web scraper works at any time of day or night. As long as your systems and those of the scraping target are online, it can do its work. Tasks can be started in the evening and the results can be available at the start of work, so that this information can be used without delay.

Marketing via Google search results is an important part of lead generation for most companies, and web crawling helps Internet sites to become well known. Web scraping can also be used for competitor analysis: a competing company can be observed closely, and news, advertised job offers and further information can be obtained.


5 Problems of Web scraping

A large part of the Internet is not covered by web scrapers and therefore not covered by public search engines like Google. Website operators can block their pages against indexing, which means that they are not ranked in the search results. This part of the Internet is also called the Deep Web, Hidden Web or Darknet.

The same ability to retrieve data automatically is also abused to collect email addresses for spam messages. The use of e-mail addresses for advertising purposes, for example, is not permitted without the consent of the addressees. The Bot Traffic Report 2019 also revealed that web scrapers account for about 40% of all Internet traffic, thus polluting the World Wide Web.


6 Web scraping software

In the meantime, several providers have entered the web scraping market. This makes it difficult for you as a user to find the right web scraper for your own needs. Nevertheless, I would like to give you a small list of other providers on the market who, besides Octoparse, have dedicated themselves to simple web scraping:

• UiPath
• Helium Scraper
• Beautiful Soup (Python)
• Mozenda
• ParseHub
• Crawly
• Data Miner
• Web Scraper
• Easy Web Extract
• FMiner
• Scrapy (Python)
• Screen Scraper
• ScrapeHero


7 Web scraping by means of Octoparse

Octoparse was first released in 2016 and won its first web scraping users that year. Since then, the program has been regularly revised, maintained and updated, and its functions expanded.

The team behind Octoparse is customer-oriented and responds to the feedback of its users in its own surveys and forums. New functions requested by users are implemented after internal review.

The development team strives both to implement the functions themselves and to place the user at the center of development. Over many months, customer surveys on usability were conducted and beta phases were used to achieve the best possible user experience.

Meanwhile, tutorials and FAQs are available in English, Japanese and Spanish to make learning Octoparse as easy as possible. A blog on the official homepage informs the user base about news as well as new features and premium functions.


With the high number of web scraping tools available on the market, the question arises what makes Octoparse more interesting for users like you than its competitors.

Octoparse can primarily claim to be particularly suitable for beginners in the field of web scraping since, in contrast to many competitors, it requires no previous knowledge of a programming language; competing tools still require code input at various points in the workflow.

The workflow-based design is laid out so that Octoparse can be operated entirely within the GUI. Scripts or manual insertion of code are possible in places, but not necessary.

The design is clear, easy to understand and limited to the essential functions. A criterion not to be underestimated is cost: all basic functions are available to you free of charge from the outset. Paid functional upgrades are available as a premium version and are quite reasonable once you reach an advanced web scraping stage. At the beginning, and for relatively simple web scraping projects, you can work with the basic version without any problems.


The installation of Octoparse on your local computer is easy and the scraping is powerful. Octoparse now also offers a selection of templates with pre-built web scraping workflows for sites such as Amazon or eBay, which only need to be adapted to your needs.

A detailed FAQ as well as videos in English help you as a user to dive into the functions and to get a feeling for web scraping to the full extent that Octoparse offers.


8 Getting Started with Octoparse

8.1 System Requirements

To run Octoparse on your system and use the easy web scraping workflow, your system only needs to fulfill the following requirements:

● Win7/Win8/Win8.1/Win10 (x64) - For Mac users, you'll need to install a Windows virtual machine first, then download Octoparse.
● Microsoft .NET Framework 3.5 Service Pack 1 (.NET 3.5 SP1)
● Internet access


8.2 Download and installation

The Octoparse installation package can be downloaded from the official website:
https://www.octoparse.com/Download
To use Octoparse, you need to create an Octoparse account and sign in with it.
Then, follow the steps below to install Octoparse on your PC:

1. Unzip the downloaded file
2. Double click "Octoparse Setup" to start the installation
3. Once the installation is completed, click the Octoparse icon to run the application
4. Log in with your account information (username/email, password). If you do not have an existing account, you can sign up for a free one.


8.3 Sign up for Octoparse

Once you've downloaded and opened Octoparse, click on "Sign up for FREE" to visit the Octoparse sign up page.

Alternatively, you can also visit the Octoparse website:

www.octoparse.com

There you can sign up for a Free Account or get a 14-day Free Trial of the available premium versions to try out premium features such as 10x speed extraction, cloud extraction, task scheduling, task templates and more. After the Free Trial, you can keep using Octoparse with all basic functions for free.


8.4 The GUI – Graphic User Interface

After you have started Octoparse and logged in with your credentials, you can see the heart of Octoparse, the dashboard. The graphic user interface is what you see at the beginning of every web scraping project. There are two sections: the home screen and the sidebar.

8.4.1 The Home screen

At the center of the home screen is a search bar. You can enter the target webpage URL(s) to start building a task, or you can enter a template name (such as Amazon or eBay) to search for a pre-built scraping template.

8.4.2 The Sidebar Menu

The sidebar menu on the left contains everything you need to navigate within Octoparse.


8.4.3 Dashboard

The one place to manage all your scraping tasks. Edit, delete, rename and organize all the tasks in your account. You can also conveniently run, stop or schedule any task.

8.4.4 Quick Filters

Shows you the currently running cloud extraction tasks and the finished tasks stored in the cloud.

8.4.5 Recent Tasks

Shows you your recently edited tasks.


Team Collaboration
This link sends you to the Octoparse website, where the team will help you to fulfill every web scraping project you may have.

8.4.6 Data Services

The same applies to the "Data Services" option, where a team of web scraping experts builds the whole web scraping process customized to your needs.

8.4.7 Contact us
The contact form where you can submit questions and requests. You can reach out to [email protected] at any time. Customer service will get back to you within 24 hours.


8.4.8 The Workspace

The Octoparse workspace is the place where you'll be building your task. There are four main parts, each serving its particular purpose.
The Built-in Browser: Once you've entered a target webpage URL, the webpage is loaded in Octoparse's built-in browser. You can browse the website in Browse mode, or you can click to extract the data you need in Select mode.

The Workflow: As you proceed to interact with the webpage, such as opening a web page and clicking on a page element/button, the entire process is defined automatically in the form of a workflow.

Tips Box: Octoparse uses Smart Tips to "talk" to you during the extraction process and to guide you through the task building.

Data Preview: Have a preview of the selected data. You can also rename the data fields or remove the ones that are not needed.


9 Octoparse Workflow Methods
Now that you've downloaded Octoparse on your PC and learned about the user interface, you are ready to start your own web scraping project.
Most of the information on the web is represented as text, such as product information, news articles, blogs, job descriptions, etc. In this chapter, you will learn how to capture simple text data from a webpage using simple points and clicks.
Basic text extraction, coupled with other techniques such as pagination and list building, lays the foundation for data scraping on all kinds of webpages.
With Octoparse you have two web scraping methods that will help you to identify and scrape the information you need, as simple as copy and paste, but fully automated. The methods you can use are the Advanced Mode and the Template Mode.
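As an illustration of what point-and-click text capture corresponds to under the hood, the following Python sketch locates an element in a page's markup and reads out its text, which is exactly the information Octoparse stores when you click a headline or an author name. The sample markup and the paths are invented for illustration and are not Octoparse internals.

```python
# A minimal sketch of "capturing text data": find the clicked element in the
# page's markup and return its text. Sample markup is invented.
import xml.etree.ElementTree as ET

SAMPLE_PAGE = """
<article>
  <h1>Example headline</h1>
  <p class='author'>Jane Doe</p>
</article>
"""

def capture_text(page_source: str, path: str) -> str:
    """Return the text content of the first element matching `path`."""
    root = ET.fromstring(page_source)
    node = root.find(path)
    return node.text.strip() if node is not None and node.text else ""

print(capture_text(SAMPLE_PAGE, "h1"))                  # the headline text
print(capture_text(SAMPLE_PAGE, "p[@class='author']"))  # the author text
```

In Octoparse you never write such paths yourself for simple cases; clicking the element in the built-in browser generates the selection for you.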


Advanced Mode
The Advanced Mode lets you have total control over every step of your web scraping project. Within this mode you have every option Octoparse provides to its users, including very useful "hacks" like Anti-Blocking settings and much more.
Template Mode

The Template Mode is a fast and easy way to get the data you need from many of the most famous websites on the internet, like Amazon, eBay and more.


10 Octoparse Advanced Mode

Advanced Mode is a highly flexible and powerful web scraping mode. It is for people who want to scrape websites with complex structures. With Octoparse Advanced Mode, you can:

● achieve data scraping on almost all kinds of web pages
● extract data like text, URLs, images, and HTML
● design a workflow to interact with the webpage, such as login authentication, keyword searching and opening a drop-down menu
● customize your workflow, such as setting up a wait time, modifying XPath and reformatting the extracted data
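As a rough code analogy for two of these options, the sketch below (Python, standard library only) applies a wait time before extraction and then selects page elements with a small XPath expression. The sample markup and the XPath are illustrative assumptions, not how Octoparse works internally; Octoparse lets you configure both through the GUI instead.

```python
# A code analogy for "set up a wait time" and "modify XPath": pause, then
# pull out all nodes matching an XPath expression. Sample markup is invented.
import time
import xml.etree.ElementTree as ET

SAMPLE_PAGE = """
<html><body>
  <div class='product'><span>Item 1</span></div>
  <div class='product'><span>Item 2</span></div>
</body></html>
"""

def extract_with_wait(page_source: str, xpath: str, wait_seconds: float = 0.0):
    """Wait before extracting, then return the text of all matching nodes."""
    time.sleep(wait_seconds)  # the configured wait time before scraping
    root = ET.fromstring(page_source)
    return [node.text for node in root.findall(xpath)]

items = extract_with_wait(SAMPLE_PAGE, ".//div[@class='product']/span")
print(items)  # ['Item 1', 'Item 2']
```

Adjusting the XPath in Octoparse corresponds to changing the selection expression here: a more specific path narrows what gets extracted.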

Octoparse recommends starting with the Advanced Mode because this highly performant workflow mode is the perfect tool for your web scraping project and the heart and soul of every structured scraping process. You will get full control over every step and interaction with the website, and you will be able to customize every single click and more. For example, you can simulate hovering with your cursor over a specific link or menu, or tell Octoparse to wait several seconds instead of scraping immediately.

The Workflow Designer shows you which steps Octoparse is going to work on. Every action is editable, changeable or deletable, and you can add steps before or after the steps already implemented.

When you select a step, you get all the information about how Octoparse handles it: what the website interaction will be, or what kind of information Octoparse will transfer into your database.

In this chapter you will get familiar with the Advanced Mode, and afterwards you will be able to set up workflows in the Advanced Mode for every website you can imagine.


10.1 Add a new task with the Advanced Mode

To start directly with the Advanced Mode, begin building your first task by clicking "New" on the left side of the dashboard.

After selecting the Advanced Mode you will get to the next screen.


There you can add the website URL which contains the information you want to scrape. The second way to start a new task with the Advanced Mode is to enter the URL into the search box at the center of the home screen and click "Start" to create a new task.


10.2 Workflow Tips

Every time Octoparse interacts with your website, you will see an orange box. This "Tips" box helps you to perform the actions you want Octoparse to carry out on your website, like interacting, scrolling, hovering or just scraping the data you see. The box shows you different possibilities to interact with the website, and the options change once you perform an action in the built-in browser. So just try it and select your actions as you like!


10.3 Auto Detection

Octoparse loads the webpage URL in the built-in browser and starts the auto-detect process automatically. Wait until the process completes and more info is provided in "Tips".
The auto-detection is a simple way Octoparse offers to extract the information on the page quickly and efficiently, without the need to intervene in the workflow. After a few seconds the auto-detection will come up with a proposal of the data structure found and a preview of how the extracted data will look in the database.
When the auto-detection completes, follow the instructions provided in "Tips" and check your data in the preview section. You can rename the data fields or remove those that are not needed. The detected data will also be highlighted on the webpage for you.
After your review, go to "Tips" and check your options. Based on the type of data detected, a number of options are provided for you to choose from. In this example, list data is detected, so you are provided with the options to:
1. Extract the data in the list - This option is selected by default, as Octoparse assumes this is what you need to do.


2. Click the "Next" button to capture multiple pages - Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button to extract data from more pages. To find out if the detected button is the correct one, click "Check" and watch it get highlighted on the webpage. If you need to re-select the "Next" button, click "Edit" and follow the instructions in "Tips".
3. Click the links to capture data on the pages that follow - Octoparse is asking if you want to click on the detected links and extract more information from the detail pages. Check this option if this is what you need. To confirm that the links are the ones you'd like to click through, click "Check" to have the links highlighted on the web page.


4. After confirming the settings, click "Save Settings".


Octoparse will generate a workflow automatically
based on the data detected and the saved settings. You
can choose to run the task now or edit the workflow
manually.


10.3.1 Improve the Auto-Detect Mode

When Octoparse goes on to detect the data on any web
page, it screens the whole page and fetches one or more
sets of data using its machine learning algorithm. If you
don't see your target data being detected on the first
attempt, you can switch to the second set of data by
clicking on "Switch auto-detect results".

If the auto-detection fails to locate the Next button
correctly, you can easily fix it by clicking on "Edit",
then follow the instructions on "Tips" to re-select the
correct Next Page button.


Whenever a web page is detected with an infinite
scroll, Octoparse automatically specifies the number of
times to scroll down the page. If you prefer to scroll
more before capturing the data, you can easily adjust
the number of scroll times by clicking on "Edit", then
complete the settings.


10.4 Edit or create your Advanced Workflow
When you build a scraping task in Octoparse, it
simulates real human browsing actions, such as opening
a web page and clicking on a page element/button to
extract data automatically. The whole extraction
process is defined automatically in a workflow with
each individual step/action representing a particular
instruction in the scraping task.
Though Octoparse tries to make things easier for you
by auto-generating the workflow through auto-detection,
you can technically build the workflow from scratch or
edit the auto-generated workflow to ensure the task
does what you need it to do.
There are many different types of actions you can add
to the workflow. Each step/action has various settings
that you can modify to fine-tune your scraping task.
If Octoparse has opened the page, you can directly start
the workflow mode by clicking <<Cancel>> in the Tips
box.


There you can add extra steps to the workflow: place
your mouse where you'd like to insert the step, wait
until you see the <<+>> sign show up, click on it, and
select the action you'd like to add.


● Click: Performs a click on the specific object selected

● Extract Data: Extracts the data selected on the webpage

● Loop: Creates a loop item like loop clicks or pagination

● Branch Conditions: Sets a condition

● Open Page: Opens the page from the URL

● Enter Text: Enters text in the specific field

● Iterate through Dropdown: Iterates through a dropdown menu

● Hover Over: Simulates hovering over a specific area on the website, like a menu

● End Loop: Finishes the loop

● End Workflow: Ends the whole workflow


Or you can easily rearrange steps of the workflow by
dragging and dropping to the right spot.


Or you can hover over a specific step and check its
settings.


Modify action settings by clicking on the setting icon.


Or rename, copy, or delete a step by clicking the show
more button.


10.5 Create a Scraping Task with the Advanced Workflow from scratch
The steps of the workflow should always be read from
top to bottom, and from inside to outside for nested
steps. So, for our example, we should test the steps in
this order:

1. "Go to Web Page" → test if the web page loads properly
2. "Pagination" → test if the Next Page button is
located correctly
3. "Click to Paginate" → test if the web page
paginates properly
4. "Loop Item" → test if the list of items is
complete and correct
5. "Extract Data" → test if the data is selected and
extracted correctly


It's worth mentioning that not all tasks are created the
same; you may have a completely different task to test,
but the testing methodology can generally be extended
to tasks of all kinds.


10.5.1 Click on "Go to Web Page"


Once you click on the step, it should load the web page
in the built-in browser. If the web page loads well, there
isn't much you need to adjust; however, there are a few
things you should always watch out for.
If the web page loads with an infinite scroll-down, you
may want to select "Scroll down the page after it is
loaded" and complete the proper settings.


If the web page is taking longer than usual to load, you
may want to increase the page timeout.


10.5.2 Click the "Pagination" box


In order for pagination to work consistently, there are
two things we need to check.

● If the Next Page button/arrow is being located correctly.
● If the paginating process works well on all
pages, i.e. it needs to paginate correctly going
from page-1 to page-2, page-2 to page 3, page-3
to page-4, so on and so forth.
After you click on the pagination box, go to the
highlighted element on the web page and confirm if it is
the correct Next Page button. If you don't have the
right Next button, you may need to manually fix it
by altering the corresponding XPath.
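Under the hood, such an element XPath is simply a query against the page's HTML tree. A minimal sketch using Python's standard library (the HTML fragment and the class name "next" are illustrative, not taken from any real site or from Octoparse's internals):

```python
import xml.etree.ElementTree as ET

# A simplified page fragment; the "next" class name is made up
# to stand in for whatever marks the real Next Page button.
PAGE = """
<div>
  <ul class="pager">
    <li><a class="page" href="/p/1">1</a></li>
    <li><a class="page" href="/p/2">2</a></li>
    <li><a class="next" href="/p/2">Next</a></li>
  </ul>
</div>
"""

tree = ET.fromstring(PAGE)
# The same idea as Octoparse's element XPath: select the anchor
# whose class attribute identifies it as the Next Page button.
next_btn = tree.find('.//a[@class="next"]')
print(next_btn.get("href"))  # → /p/2
```

If this query matched the wrong element (or nothing), you would adjust the predicate, which is exactly what editing the XPath in Octoparse does.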


10.5.3 Click on "Click to Paginate"
When you click on "Click to Paginate", you are literally
instructing Octoparse to click on the Next Page button
defined in Step-2.
If things are working right, it should go from page-1 to
page-2. Repeat this two-step process (click
"Pagination" box then click "Click to Paginate") as
many times as needed to make sure pagination is
working correctly on all sequential pages. If the web
page is not paginating properly on any of the pages, fix
the element XPath in step 2 and test again.


10.5.4 Click on the "Loop Item" box
Testing the "Loop Item" is essentially confirming if all
the desired items have been selected correctly.
Once clicked, go to the web page in the built-in browser
and make sure all the items you need are being
highlighted.

Or, you can also click open the list-icon to load the list
of items and confirm if the list is complete.


10.5.5 Click on "Extract Data"


Here is the final step - check if the data is being
extracted as needed.
Once clicked, check the data in the preview section and
confirm if this is the data that you need.


10.6 Web scraping for advanced users - Subpages

In many cases, the important page information is
hidden on a subpage of the target website. This is the
case, for example, with sales sites such as Amazon or
eBay, where the information on the product, delivery
time, description and much more can only be seen by
clicking on the corresponding product link.

This is where Octoparse shows its true strength when it
comes to navigating through pages and accessing this
website information. Within the workflow you can use
a simple click-and-select method, similar to simple text
scraping, to guide Octoparse to the subpages and, after
scraping, jump back to the parent page to continue with
the next element. This is done very simply with a
so-called loop item.

To do this, start by calling up your desired web page:
create a new task and enter the URL in the URL field.
As soon as the Octoparse browser opens in the
workflow view, click on the first link in the browser
(for example the link to the product). Octoparse will
highlight the link in one color and all further links in a
different highlighting variant. Now click on the second
link to let Octoparse recognize the recurring pattern.


Now click on the option Loop click each element in the
opened Tips Box. The corresponding entry is made in
the workflow.


As soon as you have clicked on the corresponding Loop
command in the Tips Box and the entry is visible in the
workflow, you will get to the subpage in the browser.

You are now on the desired subpage and can have the
information scraped into the database by marking the
desired fields and selecting Extract selected Data. All
selected information is saved as a routine and is also
applied on the next subpage, if available.

Once you have selected all the required information, it
is inserted and stored in the Extract Data workflow
object. When you are finished, you can save the
workflow by clicking Save and start the extraction.


10.7 Start a run


Once you are done building a task, you can click the
"Run" button to start a run.

Alternatively, you can also access the task on the
Dashboard and use the <<play>> and <<stop>>
buttons to run/stop a task.


10.8 Check your data


Now that your run is completed, you can go ahead and
check your data.
Go to the Dashboard and find your task. Hover over the
task status and click on it.

Or, you can also check your data by clicking the "show
more" icon on the Dashboard, select "View data", and
then choose if you'd like to view "Cloud data" or "Local
data".


10.9 Ways to get data


There are two ways you can run the task:

● Run on your device (also known as local extraction)
● Run in the Cloud (also known as Cloud extraction)

If you run a task on your device, you will need to have
the Octoparse App open during the extraction process.
There will be an extraction window running on your PC,
and you can watch the data getting extracted and wait
for it to complete. Local data can only be accessed on
the device in which the local extraction was executed.


On the other hand, when you run a task in the Cloud,
the task will be run on the Octoparse Cloud Platform,
which means you can shut off the Octoparse App or
even your computer and come back for your data when
the job is done.
Tasks running in the Cloud generally run 6x to 20x
faster compared to local extractions. Depending on your
project requirements, you can always choose a plan that
works for you.
Data extracted in the Cloud can be accessed on any
device as long as you log into your account.


10.10 Export your data


If the data looks good to go, you can export the data
directly by clicking on "Export Data" at the lower
right-hand corner of the Data View tab. Octoparse
supports exporting data to an Excel, CSV or HTML
file, or to a database.


11 Extraction with the Octoparse Cloud
Octoparse offers a powerful Cloud platform for
premium users to run their tasks 24/7.
When you run a task with "Cloud Extraction", it runs in
the cloud on multiple servers using Octoparse's IPs.
You can shut down the app or your computer while the
task is running. There is no need to worry about
hardware limitations. Data extracted will be saved in
the cloud and can be accessed any time.
Task scheduling is also supported by Octoparse Cloud
extraction. To retrieve the most updated information,
you can schedule your task to run as frequently as you
need.


11.1 To run your task with cloud extraction:
When you finish configuring your task, click "Start
Extraction" and select "Cloud Extraction" to execute a
run in the cloud.

Once a task is set to run in the cloud, its status will
change to "Running in the cloud" on the dashboard. At
the same time, the amount of data extracted and the
extraction time spent will be shown under task status.
You can filter the tasks by their status when you click
on the arrow for “Status”.


11.2 To batch run tasks with cloud extraction:
Select any tasks that need to be run, click on "Manage
selected tasks", then select "Cloud Extraction".


11.3 Auto-data export (for Cloud data)


Data export to database can also be automated and
scheduled. If you need to export data to your databases
on a regular basis, data export scheduling can save you
tons of work.

1. Load the cloud data for your task.

2. Click "Export Data".

3. Click open "Auto-export to database", then select the
type of database you have.


4. Complete the information to connect with your
database. Click "Test connection" to test if the database
is connected successfully. Then, click "Next" to
proceed.


5. The next step is to map the data fields and choose the
desired time interval for the export.


6. Lastly, click "Next" to finish the process.
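Conceptually, the field mapping in step 5 pairs each extracted field with a database column before rows are inserted. A rough sketch with SQLite (table name, field names, and data are all invented for illustration; Octoparse handles this for you in the export dialog):

```python
import sqlite3

# Rows as a scraping run might deliver them (illustrative data).
rows = [
    {"Title": "Widget A", "Price": "19.99"},
    {"Title": "Widget B", "Price": "24.50"},
]

# Field mapping: extracted field name -> database column.
mapping = {"Title": "title", "Price": "price"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, price TEXT)")

# Build one parameterized INSERT from the mapping and reuse it per row.
cols = list(mapping.values())
sql = "INSERT INTO products ({}) VALUES ({})".format(
    ", ".join(cols), ", ".join("?" for _ in cols))
for row in rows:
    conn.execute(sql, [row[field] for field in mapping])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # → 2
```

A scheduled auto-export simply repeats this insert step at the chosen interval against your real database.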


12 Schedule regular runs

By now, you've finished building your first scraping
task and know how to run the task to get the data you
need. Let's take it to the next level and find out how you
can make your daily scraping routines more effective
and efficient using task scheduling, auto-data export,
and API.

12.1 Task scheduling


If you are planning on getting data extracted on any
regular basis, task scheduling is exactly what you need
and can save you a lot of time. You can schedule your
task to run once, on a recurring schedule, or even run
repeatedly, such as every 1 min, 5 mins, 10 mins, or 30
mins.
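Conceptually, a repeating schedule just derives the next run time from the last run plus the chosen interval. A minimal sketch (the times and the interval are illustrative):

```python
from datetime import datetime, timedelta

def next_run(last_run, interval_minutes):
    """Return the next scheduled run time for a repeating task."""
    return last_run + timedelta(minutes=interval_minutes)

# A task last run at noon with a 30-minute repeating schedule:
last = datetime(2020, 8, 3, 12, 0)
print(next_run(last, 30))  # → 2020-08-03 12:30:00
```

The recurring weekly/monthly options work the same way, only with a calendar-based step instead of a fixed number of minutes.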
1. Find your task on the Dashboard, click the show
more icon then choose "Cloud runs" and select "Set
schedule".


2. Choose how often you would like to run the task.


3. For recurring crawls, select the day of the week/day
of the month, and the time of the day to run your task.

For repeating crawls, select the desired time interval.


4. You can also save the setting for later use. Give the
setting a name and click "Save". This way, you can
always select the saved schedule setting and apply it
directly to any other tasks.

5. After everything's done, click "Save and Run" to
start running the task on schedule right away.
If you want to save the schedule only, but do not wish
to run the task on schedule yet, click "Save" instead.


6. Once you have the schedule set up, you can easily
turn it ON and OFF by clicking the show more icon on
the Dashboard, then select "Cloud runs", there you can
choose "Schedule ON" or "Schedule OFF".


7. When a task is scheduled, you'll see the next run time
on the Dashboard. Click the + sign on the Dashboard,
then select "Next Run". This way, you'll have a clear
picture of the tasks that are scheduled and when the
next run is expected.


13 Octoparse Template Mode
Have you ever wondered about the level of technical
proficiency required to build a web scraper? With the
newly launched Template Mode, almost none. More
specifically, there are now dozens of built-in templates
within the program, all ready to be used to fetch data
instantly, with nearly zero learning curve!
Many popular sites like:

● AMAZON

● INSTAGRAM

● BOOKING

● TRIPADVISOR

● TWITTER

● YOUTUBE
and many more are covered at this moment. And the
best part is that if you feel any website should be added,
you can contact the Octoparse team and they will
seriously consider having a template created for the site.


Template Mode Scraping can be especially valuable to
anyone who needs to extract data from some of the
most popular websites out there, and perhaps those who
would prefer to skip the learning curve and do not
require a high level of data customization.


13.1 How to use the Template Mode?


The Template Scrapers take over all the heavy lifting,
so all you must do is tell Octoparse your search criteria,
e.g. "iPhone", then click "Start" to get data.
With only 6 easy-to-follow steps you can set up your
web scraping workflow:

1. Select “Task Templates” from the home screen
2. Pick a template
3. Check the pre-defined data fields and parameters
4. Select “Use Template"
5. Enter the variable for the parameters, such as “iPhone” for the search keyword
6. Save the template and run

Of course, you can always change the scraping process
afterwards as you like.


14 Octoparse Hacks for your Workflows
Congratulations, you can now scrape any webpage you
might want to. Now that you have all the necessary
information for your web scraping processes, it is
sometimes handy to improve your workflow to make it
more efficient or to interact better with the webpage.
So, besides the core web scraping functionality,
Octoparse offers a huge number of helpful and efficient
ways to enhance your workflow. In this chapter you
will get to know the main hacks you can add to your
scraping process.


14.1 Extract data behind a login


When the target data is behind authentication, it is still
possible to access the data with Octoparse. Simply enter
the login information (username and password), then
click on the "Sign In" button to log in. In this chapter,
you will get to know how to extract data behind a login,
as well as how to use cookies to optimize the workflow
of your task.

Click on the textbox for username input on the web
page and select "Enter Text" from Tips.


Next, input the username into the textbox.

● Click "OK"; the username entered is automatically
populated to the username textbox on the web page
● Follow the same steps to enter the password
● Click the "Sign In" button on the page

From Tips, select "Click button" and Octoparse has
now logged in to the website successfully.


14.1.1 Use cookies to optimize the workflow
Most of the time, you can optimize the workflow by
saving the cookie in the task after login. This way,
Octoparse will send the saved cookie to the website at
loading, and there's a good chance the website will
remember "you" and skip the login steps.
● Log in to the website in Octoparse's built-in
browser if you have not already done so.
● Switch to the Workflow Mode by toggling the
Workflow switch on the top, drag a "Go To
Web Page" action to the workflow, and position
it right below the sign-in steps.
● Enter the URL of the page needed for the
capture into the text box for "Page URL"

● Under "Advanced Options", click open "Cache Settings"
● Select "Use specified Cookie"
● Click "Load cookie from current web page"


● Click "OK" to save the settings


Now as the web page is supposed to "remember" the
login and skip the login steps, we'll remove the
previously created actions for the login to avoid running
into issues when the workflow is executed. Right-click
on the action and select "Delete".
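Conceptually, the saved cookie is simply replayed with later requests so the site still recognizes the session. A minimal sketch of that idea (cookie names and values are made up for illustration; this is not Octoparse's internal implementation):

```python
# After a successful login, a site typically identifies "you" by a
# session cookie. Saving it and sending it back with later requests
# is all the "Use specified Cookie" option does conceptually.
saved_cookies = {"sessionid": "abc123", "logged_in": "yes"}

def cookie_header(cookies):
    """Serialize saved cookies into the Cookie request header."""
    return "; ".join(f"{k}={v}" for k, v in cookies.items())

print(cookie_header(saved_cookies))  # → sessionid=abc123; logged_in=yes
```

As long as the server still accepts that session value, the login steps can be skipped; when it expires, the workflow has to log in again.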


14.1.2 Clear cookies instead

As all websites handle cookies differently, to ensure the
task workflow will work consistently, you may want to
start with the login steps every time the task is executed.
To do this, you can clear any cookies saved before the
login page is loaded. This way, the target website will
always "forget" you and take you to the login page, on
which you can enter all the login information.

● Click the "Go to Web Page" action for the login page
● Select "Clear cache before opening the web page" within Cache Settings


14.2 How to click through options in a drop-down menu?

A drop-down menu is a list of items that appears when
you click on a button or text selection. To select options
from a drop-down menu, just perform these actions:
1) Click on the drop-down menu

2) From the Action Panel, click "Loop through options
in the dropdown"


3) Switch to the Workflow Mode by toggling the
Workflow switch in the upper right corner. A Loop
Item has been created and added to the workflow
automatically to loop through the options in the
drop-down menu.


4) Click on the Loop Item for the dropdown, then refer
to the looped items in the list on the right side. Check if
all the items added to the loop are desired; if not, refine
the list by using the XPath function position().
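The effect of a position() predicate (e.g. skipping a placeholder first entry, as in //option[position()>1]) can be sketched as follows. Since Python's ElementTree does not support position() directly, the sketch slices the result list instead; the menu fragment is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A simplified drop-down; the first option is a placeholder we
# do not want in the loop.
MENU = """
<select>
  <option>All categories</option>
  <option>Books</option>
  <option>Electronics</option>
</select>
"""

tree = ET.fromstring(MENU)
# Same effect as the XPath predicate position()>1: drop entry 1.
options = tree.findall("./option")[1:]
print([o.text for o in options])  # → ['Books', 'Electronics']
```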

5) Now we are done configuring the drop-down menu.
Move on to select other options or click on the
confirmation button to complete the search.


14.3 Text/keyword input

Sometimes you may need to interact with a web page
while extracting data. For example:
· You want to scrape data from a website which
requires you to log in first. So you need to input your
username and password to log in before accessing the
data you want.
· You have a list of keywords to be searched through
a search box, but you don’t want to enter them one
by one.

14.3.1 Input a single keyword into the textbox

Select the input field on the page in the built-in browser.
When you click on the input field, Octoparse can detect
that you selected a textbox. The "Enter text" action will
automatically appear in "Tips".


Once you click on "Enter text", a text box will appear
in "Tips".


Input the text or keyword in the textbox and click "OK".


You can see what you just input also appears in the
input field on the page in the built-in browser.
Octoparse will inform you with "Input Text Saved" in
"Tips", and you can also notice the "Enter text" action
is added into the workflow.


14.3.2 Input multiple keywords into a search box

If you have a series of pre-defined and specific text
values, you can add them into a "Text list" to create a
loop search action. Octoparse will automatically enter
every word in the list into the search box, one word at a
time.
Let's see how to create a "Text list" loop mode to scrape
data by searching multiple keywords on a website.

1. Drop a "Loop Item" action into the Workflow designer

2. Go to "Loop Mode" and select "Text list"


3. Go to "Text list" below and click "A" to enter the
keywords you want to search in the textbox.
Click "OK" when you finish entering. Then you can
see your keywords in the “Loop Item” box.

4. Click on the search box on the page in the built-in
browser and select "Enter text" in "Tips"


5. Input the first keyword in your "Text list" in the text box

6. Drag the "Enter Text" action into the "Loop Item" in
the Workflow designer


7. Click on the "Enter Text" action in the Workflow
designer. Go to "Loop Text" and select "Use the text in
Loop Item to fill in the text box".


8. Click the search button of the web page and select
"Click button" in "Tips". After clicking on "Click
button", you will notice the "Click Item" action is
added into the workflow.

9. Click "Save" to finish creating the "Text list" search
loop.
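The Text list loop set up above behaves, conceptually, like a plain loop over keywords that feeds each one into the search step. A minimal sketch (the keywords and the placeholder function are illustrative, not Octoparse's internals):

```python
keywords = ["iPhone", "iPad", "MacBook"]

def run_search(keyword):
    # Placeholder for what the workflow does per keyword:
    # enter the text, click the search button, extract results.
    return f"results for {keyword}"

# The Loop Item hands one keyword at a time to the Enter Text step.
results = [run_search(k) for k in keywords]
print(results[0])  # → results for iPhone
```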


Finally, don't forget to check the workflow. Let's see
how Octoparse enters these keywords into the search
box one by one and interacts with the website.

1. Click on the "Loop Item" box


You can see the keywords that you’ve just input
displayed in "Loop Item".
2. Select one keyword, and click on the "Enter Text"
action
In the built-in browser, you can see that the selected
word is entered in the search box.
3. Click on "Click Item"
Octoparse simulates real browsing activities as it clicks
the search button. You can see the search results of the
selected word on the web page in the built-in browser.


14.4 Refine your data

14.4.1 Rename/move/duplicate/delete a field
When you start extracting your data through the
Advanced Mode or the automated data detection
wizard and the data is shown in Data Preview, you can
look through the data set and start organizing your data.
A few typical things you can do to refine your data set
include renaming the fields, reordering the columns,
duplicating data fields, and deleting the fields that are
not required for your project.
To rename a field, click the pencil icon next to the field
name, then type in the new name directly. Note that you
should only use numbers, letters, and "_" for field
names.


To move a field, place your cursor at the front of the
field and drag and drop the field to the right spot.
To duplicate a field, click on the show more icon and
select "Copy". The selected field will be duplicated
automatically.
To delete a field, click on the show more icon and
select "Delete"
You can also rename/move/duplicate/delete any data
fields by going to "Action Settings" for the "Extract
Data" action of the workflow.
If you have more fields to delete, you can also batch
delete the fields. Click on the "Action settings" icon for
the "Extract Data" action. On the Setting Panel, click
the "Batch delete fields" icon, select the fields you'd
like to delete, and then click the "Delete" button.

14.5 Clean data


Octoparse provides many ways for you to clean your
data. For example, you can replace a text string, trim
extra spaces, add a prefix/suffix, replace a string with
RegEx, reformat date/time and more. You can clean
any single data field in one or more ways until the data
meets your requirements. Some of these may require
you to deal with Regular Expressions, for which you
can use the Octoparse RegEx tool for assistance.


In Data Preview, right-click the show more icon for the
data field you'd like to clean and select "Clean data".

Click "Add step", and then select what you'd like to do
with the data. You can keep working with the data by
adding more steps until the data meets your
requirements.


● Replace: replace the specific string(s) in the extracted data with the new string(s) that you want.
● Replace with Regular Expression: use a specific regular expression to replace the matched string(s) in the extracted data with the string(s) that you want.
● Match with Regular Expression: use a specific regular expression to pick up the matched string(s) from the extracted data.
● Trim spaces: remove the unwanted space(s) from the start or/and the end of the extracted data.
● Add a prefix: add a string/strings to the front of the extracted data.
● Add a suffix: add a string/strings to the end of the extracted data.
● Reformat extracted date/time: convert the extracted date/time into one of the 14 built-in formats, or into your own customized format.
● HTML: convert some specific HTML tags into plain text automatically. For example, transcode "&gt;" into ">" and "&nbsp;" into a space.
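Several of these cleaning steps map directly onto standard string and regex operations. A minimal sketch of a comparable pipeline in Python (field values, formats, and the "USD" prefix are illustrative, not Octoparse's internals):

```python
import html
import re
from datetime import datetime

def clean_price(value):
    value = html.unescape(value)           # HTML: decode entities
    value = value.strip()                  # Trim spaces
    value = re.sub(r"[^\d.]", "", value)   # Replace with RegEx: keep digits/dot
    return "USD " + value                  # Add a prefix

def reformat_date(value):
    # Reformat extracted date/time into a customized format.
    return datetime.strptime(value, "%d.%m.%Y").strftime("%Y-%m-%d")

print(clean_price(" $19.99 "))      # → USD 19.99
print(reformat_date("03.08.2020"))  # → 2020-08-03
```

Chaining steps in the "Clean data" dialog works the same way: each step takes the previous step's output as its input.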


14.5.1 Capture HTML code


When auto-detect is used to capture any data from a
web page, Octoparse automatically extracts the text and
the URL of the elements that you've selected. You can
customize the data field and tell Octoparse to extract
any HTML code. In Data Preview, right-click the show
more icon and select "Customize field".

From the "Customize field" setting panel, select what
you'd like to extract.


14.5.2 Extract page-level data and date & time
Octoparse offers several pre-defined data fields that you
can use to capture page-level data, the current date &
time, or any fixed value conveniently.

● Current date & time: the date and time of when the data is extracted from the web page
● Page-level data: page URL, page title, meta keyword, meta description, and HTML source code
● Fixed value: any fixed value you define

Click on the + sign at the upper right-hand corner of
Data Preview. Select any pre-defined data fields that
you'd like to add to the data set.


14.6 Octoparse Anti-Blocking settings

More and more web owners have equipped their sites
with all kinds of anti-scraping techniques to block
scrapers, which makes web scraping more difficult. In
this chapter, we will introduce some techniques in
Octoparse to avoid being blocked. The most popular
anti-scraping techniques are the following.

● IP Blocking
● Browser recognition
● Cookie Tracking

Octoparse has successfully implemented solutions for
these kinds of anti-scraping measures on the websites
you want to scrape. You can add these in your
workflow settings.


14.6.1 IP rotation
There are some websites that might be very sensitive to
web scraping and take serious anti-scraping measures,
like IP blocking, to stop any possible scraping
activities.

Manually setting up proxies in Octoparse is particularly
useful if you would like to access the website with
external proxies (or from a specific country), or if you
prefer to use your own proxies instead of the auto IP
rotation feature of cloud extraction.

Unlike other scraping utilities that charge for the
external proxies feature, Octoparse allows both free
and premium users to add custom proxies for IP
rotation.

Getting your IP address blocked is one of the problems
you may face when scraping websites. So a proxy or
proxy server is an essential part of web scraping, and it
is widely used for anonymous web scraping.
To use external proxies for rotation:
Click "Setting" above the workflow once you've
finished configuration.


Select "Use proxies" and click "Settings" to add custom
proxies. Currently, Octoparse only supports HTTP
proxies. Separate the IP address of the proxy server and
the port number with a colon.

If you have a list of IPs, add each proxy in "IP
Proxies" on a new line.
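The expected "IP:port, one per line" format can be sketched as a small parser (the addresses below are illustrative documentation IPs, not real proxies):

```python
# The same shape you would paste into the "IP Proxies" box.
raw = """
203.0.113.10:8080
203.0.113.11:3128
"""

def parse_proxies(text):
    """Split 'host:port' lines into (host, port) pairs."""
    proxies = []
    for line in text.strip().splitlines():
        host, port = line.split(":")
        proxies.append((host, int(port)))
    return proxies

print(parse_proxies(raw))  # → [('203.0.113.10', 8080), ('203.0.113.11', 3128)]
```

A rotation scheme then simply cycles through this list, using a different entry for successive requests.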


Click "OK" and "Save" to save your changes.
Octoparse will execute the rotation according to your
settings when running the task locally.

Use a proxy to change the IP address for logging in to
Octoparse - if you fail to log in to Octoparse because
your school or company intranet restricts some external
requests, use a proxy to log in and use Octoparse.

To do this, click "Use IP Proxy" and enter the
information requested:


Click the "Test" button to check if the connection is
successful. If it is, it will prompt:


14.7 Switch user-agents and clear cookies

Every request made by a web browser contains a
user-agent. Using one user-agent for an abnormally
large number of requests will get you blocked. To get
past the block, you should switch user-agents
frequently instead of sticking to one.
With Octoparse, you can easily enable automatic UA
rotation in your crawler to reduce the risk of being
blocked.
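Conceptually, UA rotation just means choosing a different user-agent string per request. A minimal sketch (the strings and the pool are illustrative, not Octoparse's actual UA list):

```python
import random

# A small illustrative pool; real rotation pools are larger and
# use full browser user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def pick_user_agent():
    """Choose a UA per request so no single one dominates."""
    return random.choice(USER_AGENTS)

print(pick_user_agent() in USER_AGENTS)  # → True
```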

And some websites may remember the cookies you use
for accessing the pages. We can clear the cookies
automatically to pretend it is the first time we access
the pages.


14.8 Wait before Execution

Setting up a waiting time slows down the execution of
the current step. Its ultimate purpose is to make sure the
steps of the scraper are processed properly.

In some cases, the previous step may require a certain
time to complete, like manual input, or the current page
requires a longer time to load. So you would need to set
up "Wait before execution" for steps like those.
Otherwise, it may lead to errors like missing data.

A waiting time can be set up for all steps created in the
workflow, except "Go To Web Page". "Wait before
execution" is easily found in the "Advanced Options".


These are the wait time options in Octoparse. So far, it
is not possible to customize the waiting time.


14.9 Auto switch browser


Your browser sends what’s known as a user agent for
any web page you visit. This is a string that tells the
target website what kind of device you are accessing
the page with. When scraping a website very
consistently with the same user agent, it is easy to be
detected as a scraping bot. Thus, with this feature, the
chance of being blocked can be reduced.
To set up the auto switch browser:
● Check the box for "Auto switch browser (User-
agent)".
● Click "Settings" to set up the type of user agent.
Not all the UAs work for every website, so you might
need some testing. If you want Octoparse to visit the
website "via PC" when scraping the website, you
should check the box for "Select all" and uncheck the
box for "Firefox for mobile 29.0"; if you want
Octoparse to visit the website "via mobile", you should
only check the box for "Firefox for mobile 29.0".
● Click OK to save the change.
● Either check the box for "Custom interval" and
select the number of minutes for switching user
agent or check the box for "Switch IPs
concurrently".
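The "Custom interval" option can be pictured as a rotator that moves on to the next user agent once the interval has elapsed. A conceptual sketch with placeholder UA names, not Octoparse's actual mechanism:

```python
import itertools
import time

class UARotator:
    """Cycle through a pool of user agents on a fixed time interval."""

    def __init__(self, pool, interval_seconds):
        self._cycle = itertools.cycle(pool)
        self._interval = interval_seconds
        self._current = next(self._cycle)
        self._last_switch = time.monotonic()

    def get(self):
        # Switch to the next user agent only if the interval has elapsed.
        if time.monotonic() - self._last_switch >= self._interval:
            self._current = next(self._cycle)
            self._last_switch = time.monotonic()
        return self._current
```

With a long interval the same UA is reused for many requests; with a short one the identity changes frequently, which is the effect the setting aims for.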


Octoparse will automatically switch the user agent as
configured, whether the task is running locally or in the
cloud.


14.10 Workflow trigger


With "Trigger", users can define one or more conditions
that decide whether a line of data should be extracted.
A "Trigger" can easily be added in the Extract Data step.

Triggers are very useful when you only want to scrape a
portion of the data on a web page, say, products priced
under $100. You can use a trigger to abandon the
"useless" data lines, specifically any products priced
at $100 or more, and keep only the ones you need.
To achieve this, you can create a trigger like this: if the
data field "PRICE" is greater than or equal to "100",
abandon the line of data. This way, Octoparse "judges"
whether the data meets the defined criteria before
actually extracting it. In the end, the dataset will be
clean and contain only the data desired.
Another useful application is when you need to extract
data associated with a specific date, say, all news
articles published today (e.g. 2019-01-01). To achieve
this, you can create a trigger: if the data field "DATE" is
not "2019-01-01", abandon the line of data. As a result,
you will only fetch the articles for 2019-01-01.
Multiple conditions can be used together. For example,
if you need to extract news articles for 2019-01-01 and
only when the article title contains the word "CPI", it
can be done with the following two conditions:
Condition 1: If the data field "DATE" is not "2019-01-01",
abandon the line of data
[AND]
Condition 2: If the data field "TITLE" does not contain
"CPI", abandon the line of data
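The intended result of combining the two conditions is to keep only the lines from 2019-01-01 whose title mentions "CPI". That outcome can be sketched in plain Python with made-up rows (an illustration of the intended effect, not of Octoparse's trigger engine):

```python
def abandon(row):
    """Abandon the line unless DATE is 2019-01-01 and TITLE contains "CPI"."""
    return row["DATE"] != "2019-01-01" or "CPI" not in row["TITLE"]

rows = [
    {"DATE": "2019-01-01", "TITLE": "CPI rises 2 percent"},
    {"DATE": "2019-01-01", "TITLE": "Weather report"},
    {"DATE": "2018-12-31", "TITLE": "CPI preview"},
]
kept = [row for row in rows if not abandon(row)]
```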


14.11 How to set up Triggers?

14.11.1 Create a new trigger

Click "Add trigger" to create a new trigger

Name the trigger by typing in the name directly


Select the target data field. In the example below, the
data field "TITLE" is selected.

Set the condition for the selected data field. You can set
conditions based on "text", "numerals" or "time".


14.11.2 For general texts


There are five options (is, is not, contains, does not
contain, is not blank) for general texts.
For example, if you select "CONTAINS" and type the
word "PEN" into the text box, the condition will be: if
the data field "TITLE" contains the word "PEN". If "IS
NOT BLANK" is selected, there is no need to fill in the
text box, and the condition will be: if the data field
"TITLE" is not blank.
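The five text options map naturally onto simple string checks. A sketch of how such conditions could be evaluated; the operator names mirror the options listed above, but the helper itself is hypothetical:

```python
TEXT_CONDITIONS = {
    "is":               lambda value, target: value == target,
    "is not":           lambda value, target: value != target,
    "contains":         lambda value, target: target in value,
    "does not contain": lambda value, target: target not in value,
    "is not blank":     lambda value, target: value.strip() != "",
}

def check_text(field_value, option, target=""):
    """Evaluate one text condition against a data field's value."""
    return TEXT_CONDITIONS[option](field_value, target)
```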


14.11.3 For numerals


There are four options available for numerals: greater
than, less than, greater than or equal to, and less than
or equal to.
For example, if you select the data field "PRICE", choose
"greater than", and fill in the value "8", the condition
will be: if the data field "PRICE" is greater than 8.

14.11.4 For time


There are four options available for time: after, before,
on or after, on or before.
For example, for the data field "PUBLISHED_TIME", if
you select "after", "00:00 the extraction day" and
"Abandon this line of data", the condition will be: if the
published time is after 0:00 AM on the extraction day,
then discard the line of data. As a result, only articles
published before 0:00 AM on the extraction day get
fetched.
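The "after 00:00 the extraction day" condition boils down to a timestamp comparison. A hypothetical helper sketching the logic, assuming an ISO-like timestamp format in the extracted field:

```python
from datetime import date, datetime

def abandon_if_after_midnight(row, extraction_day):
    """True ("abandon") if PUBLISHED_TIME is after 00:00 on the extraction day."""
    midnight = datetime.combine(extraction_day, datetime.min.time())
    published = datetime.fromisoformat(row["PUBLISHED_TIME"])
    return published > midnight
```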


14.11.5 Add more conditions using [AND] or [OR]

Multiple conditions can be added to the same trigger.
Use condition [AND] or condition [OR] to define the
relationships between the various conditions.

If you click "Add [AND] condition" and add a condition,
the action will be executed only if the data field meets
both conditions.


If you click "Add [OR] condition" and add a condition,
the action will be executed if the data field meets either
one of the two conditions.


Do one of the following steps

Now that you have the conditions defined, Octoparse
will execute one of the following actions when the
conditions are triggered.

Abandon this line of data

If "Abandon this line of data" is selected, Octoparse
will abandon the line of data regardless of whether the
other data in the same line has been extracted or not.
More specifically, suppose a task has two "Extract Data"
steps and only the latter one sets the trigger. Even if the
data for the first "Extract Data" step has been extracted,
Octoparse will abandon the whole line once the trigger
for the latter step fires.

End the loop

If "End the loop" is selected, you'll need to select one of
the loop items from the drop-down list. The selected
loop item will be ended once the corresponding
condition is satisfied.

Terminate the extraction


If "Terminate the extraction" is selected, the extraction
will be terminated once the corresponding condition is
satisfied.
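The difference between the three actions can be pictured as a nested extraction loop in which the trigger returns an action name per row. A conceptual illustration, not Octoparse's implementation:

```python
def extract_with_trigger(pages, trigger):
    """Collect rows page by page, honoring the trigger's action per row."""
    results = []
    for page in pages:
        for row in page:
            action = trigger(row)  # None, "abandon", "end_loop" or "terminate"
            if action == "terminate":
                return results      # stop the whole extraction
            if action == "end_loop":
                break               # stop only the current loop
            if action == "abandon":
                continue            # drop this line of data
            results.append(row)
    return results
```

"Abandon" skips one line, "end the loop" stops the current loop but lets the task continue, and "terminate" stops everything at once.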


14.12 Incremental Extraction - Get updated data easily

Websites such as news portals or forums typically have
new content added quickly, if not continuously. To stay
up to date with such websites, Octoparse’s incremental
extraction allows you to extract updated data much
more efficiently by skipping the pages that have already
been extracted; in other words, it only scrapes the new
ones.

Consider enabling incremental extraction if the
following are met:

1. You need updated data from a single website quite
frequently.
2. New information shows up as new web pages with
new URLs (as opposed to new information being
added to existing web pages).

A perfect example is CNN.com. Imagine you need to get
news feeds from CNN.com almost in real time. It is
important to schedule and run the task/crawler as
frequently as needed so that whatever gets added to the
site can be extracted in a timely manner; criterion (1) is
met. Obviously, each news article on CNN.com has a
different URL that can be easily identified, so criterion
(2) is also met.


Assuming you have a task set up for the job, it doesn't
really make sense to re-scrape the articles that have
already been captured in previous runs. Using
incremental extraction, you can easily have the URLs
checked first to make sure they have not been extracted
already, and only capture the ones that are truly new.

Incremental extraction works only if the newly added
data can be identified by new URLs. During the
extraction process, Octoparse checks each URL to judge
whether it has been crawled before. If a URL is
identified as one from a previous crawl, it is skipped
automatically when running with incremental
extraction.
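At its core, this URL check is a set lookup against the URLs seen in previous runs. A minimal sketch of the idea; `scrape` stands in for whatever the task does per page, and the URLs are made up:

```python
def incremental_scrape(urls, seen_urls, scrape):
    """Scrape only URLs not captured in previous runs; remember the new ones."""
    results = []
    for url in urls:
        if url in seen_urls:
            continue  # already crawled in a previous run: skip it
        results.append(scrape(url))
        seen_urls.add(url)
    return results
```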

You can easily enable incremental extraction by
following the steps below:
1. Make sure the Extract Data step in the workflow is
selected, then click on Settings.
2. Tick "Enable incremental extraction".
3. Select "Identify by the entire URL" or "Identify by
part of the URL".


Identify by the entire URL

With this option, Octoparse will use the entire URL for
matching. Even the slightest difference will have a URL
identified as "new".
Identify by part of the URL

In many cases, URLs are composed of various
attributes; for example, the eBay URL below includes
the attributes "_from", "_trksid", "_nkw", and "sacat"
(usually whatever comes before an "=" sign).

When running with incremental extraction, Octoparse
detects the attributes automatically and makes them
available as parameters. By selecting one or more
attributes as parameters for the match, you tell
Octoparse to compare the current URL based on the
selected attributes: if they match, skip the page;
otherwise, scrape it.
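"Identify by part of the URL" boils down to comparing the selected query attributes of two URLs. A sketch with the standard library; the example URLs imitate the eBay style but are invented for the illustration:

```python
from urllib.parse import parse_qs, urlparse

def url_key(url, attributes):
    """Reduce a URL to the query attributes selected for matching."""
    query = parse_qs(urlparse(url).query)
    return tuple(tuple(query.get(attr, [])) for attr in attributes)

def same_page(url, earlier_url, attributes):
    """True when the selected attributes match, i.e. the page is skipped."""
    return url_key(url, attributes) == url_key(earlier_url, attributes)
```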


15 Closing words

Dear reader,

You have now reached the end of the book. Thank you
for buying it, for the trust you have placed in us, and,
finally, for the time you have invested.

I very much hope that this book has helped you to
realize your web scraping project and that from now on
you will be able to build your own workflows efficiently
and in a time-saving way.

Positive product reviews are the basis of every
successful author's work. If this book has helped you, I
would be happy about a book review.

If you have any questions about the book, you can reach
me at the e-mail address:

[email protected]

If you have questions or suggestions beyond the content
of this book, you can find me on social media via the
following links.

LinkedIn: https://www.linkedin.com/in/tim-luprich-7a5a10158

Xing: https://www.xing.com/profile/Tim_Luprich/

Last but not least, Octoparse offers useful additional
content in its premium version. You can purchase it via
the following affiliate link; I receive a small commission
through your purchase, which does not affect your
price. Of course, you can also buy the premium offer
without using the link.

http://agent.octoparse.com/ws/324

I wish you much success with your scraping project and,
above all, much fun reading this book!
