Octoparse Webscraping 2020.08.03
www.octoparse.com
Disclaimer
© 2020 by Tim Luprich Veröffentlichungen
Tim Luprich
Ludwigstr. 54
70176 Stuttgart
Germany
Table of Contents
1 Foreword
15 Closing words
1 Foreword
Social, professional and corporate life has been
undergoing digital change for some time now. People shop
online, book a table at their favorite restaurant while on
the go, and compare the prices of travel bookings and
products of all kinds through portals.
http://agent.octoparse.com/ws/324
But before the owner of the data can realize any specific
benefit, huge amounts of data must be searched with
appropriate programs and patterns must be recognized.
The effort involved should not be underestimated.
Fortunately, data analysis is becoming increasingly
efficient, and programs help to interpret data. They
already recognize patterns on their own that would
otherwise have had to be worked out manually or would
have remained hidden from the human eye. The more
detailed the data set is, the more valuable it is.
4 Advantages of web scraping
• UiPath
• Helium Scraper
• Beautiful Soup (Python)
• Mozenda
• ParseHub
• Crawly
• Data Miner
• Web Scraper
• Easy Web Extract
• FMiner
• Scrapy (Python)
• Screen Scraper
• ScrapeHero
● Internet access
8.4.3 Dashboard
Team Collaboration
This link takes you to the Octoparse website, where their
team can help you carry out any web scraping project
you may have.
8.4.7 Contact us
The contact form is where you can submit questions and
requests. You can reach out to [email protected] at
any time. Customer service will respond within
24 hours.
9 Octoparse Workflow Methods
Now that you've downloaded Octoparse on your PC
and learned about the user interface, you are ready to
start your own web scraping project.
Most of the information on the web is represented as
text, such as product information, news articles, blogs,
job descriptions, etc. In this chapter, you will learn
how to capture simple text data from a webpage
using simple point-and-click actions.
Basic text extraction, when coupled with other
techniques such as pagination and list building, lays the
foundation for scraping data from all kinds of
webpages.
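To make the idea of text extraction more concrete, the following is a minimal Python sketch of what capturing text from a webpage amounts to. It is not how Octoparse works internally; it only uses Python's standard library, and the sample HTML, the `TextExtractor` class and the `extract_text` function are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text inside every element with one chosen tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while we are inside the chosen tag
        self.items = []   # one text entry per outermost matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def extract_text(html, tag):
    """Return the whitespace-normalized text of each <tag> element in html."""
    parser = TextExtractor(tag)
    parser.feed(html)
    parser.close()
    return [" ".join(item.split()) for item in parser.items]


# Hypothetical sample page: a small product list.
SAMPLE_HTML = """
<ul>
  <li class="product">Blue  Kettle <b>19.99</b></li>
  <li class="product">Red Toaster <b>24.50</b></li>
</ul>
"""

titles = extract_text(SAMPLE_HTML, "li")
print(titles)
```

Each list item becomes one row of text, which is roughly what a point-and-click selection of a list produces in a visual scraping tool.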
With Octoparse you have two web scraping methods
that help you identify and scrape the information you
need. They are as simple to use as copy and paste, but
fully automated. The two methods are the Advanced
Mode and the Template Mode.
Advanced Mode
The Advanced Mode gives you total control over every
step of your web scraping project. In this mode you
have every option Octoparse provides to its users,
including very useful "hacks" such as anti-blocking
settings and much more.
Template Mode
10 Octoparse Advanced Mode
It's worth mentioning that not all tasks are created the
same. You may have a completely different task to test,
but the testing methodology can generally be extended
to tasks of all kinds.
Alternatively, you can click the list icon to load the list
of items and confirm that the list is complete.
You are now on the desired subpage and can have the
information scraped into the database by marking the
desired fields and selecting "Extract selected Data". All
selected information is saved as a routine and is also
applied to the next subpage, if available.
Alternatively, you can check your data by clicking the
"show more" icon on the Dashboard, selecting "View
data", and then choosing whether you'd like to view
"Cloud data" or "Local data".
5. The next step is to map the data fields and choose the
desired time interval for the export.
4. You can also save the setting for later use. Give the
setting a name and click "Save". This way, you can
always select the saved schedule setting and apply it
directly to any other task.
6. Once you have the schedule set up, you can easily
turn it ON and OFF by clicking the "show more" icon on
the Dashboard and selecting "Cloud runs". There you
can choose "Schedule ON" or "Schedule OFF".
13 Octoparse Template Mode
Have you ever wondered how much technical
proficiency is required to build a web scraper? With the
newly launched Template Mode, almost none.
More specifically, there are now dozens of built-in
templates within the program, all ready to be used
to fetch data instantly, with a nearly zero learning curve.
Many popular sites like:
● AMAZON
● BOOKING
● TRIPADVISOR
● TWITTER
● YOUTUBE
and many more are covered at this moment. The
best part is that if you feel a website should be added,
you can contact the Octoparse team and they will
seriously consider having a template created for the site.
You can see that what you just entered also appears in
the input field on the page in the built-in browser.
Octoparse informs you with "Input Text Saved"
in "Tips", and you will also notice that the "Enter text"
action has been added to the workflow.
14.4.1 Rename/move/duplicate/delete a field
Once you have extracted data, whether through the
Advanced Mode or the automated data detection wizard,
and it is shown in Data Preview, you can look
through the data set and start organizing it.
A few typical things you can do to refine your data set
include renaming the fields, reordering the columns,
duplicating data fields, and deleting the fields that are
not required for your project.
To rename a field, click the pencil icon next to the field
name, then type in the new name directly. Note that you
should only use numbers, letters, and "_" for field
names.
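If you prepare field names outside of Octoparse, for example in a spreadsheet or a script, the naming rule above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes that "letters" means the plain letters A to Z in upper or lower case.

```python
import re


def is_valid_field_name(name):
    """True if the name uses only letters, numbers, and "_" (assumed: A-Z, a-z)."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


def sanitize_field_name(name):
    """Replace each run of disallowed characters with "_" and trim stray "_"."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    # Fall back to a generic name if nothing usable is left.
    return cleaned.strip("_") or "field"


print(is_valid_field_name("price_1"))
print(is_valid_field_name("price-1"))
print(sanitize_field_name("Product price (EUR)"))
```

The fallback name "field" is an arbitrary choice for the sketch; any placeholder that follows the rule would do.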
In Data Preview, right-click the "show more" icon for the
data field you'd like to clean, then select "Clean data".
● IP Blocking
● Browser recognition
● Cookie Tracking
14.6.1 IP rotation
Some websites are very sensitive to web scraping and
take serious anti-scraping measures, such as IP
blocking, to stop any possible scraping activity.
Set the condition for the selected data field. You can set
conditions based on "text", "numerals" or "time".
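The three kinds of condition can be pictured as simple filters applied to each extracted row. The Python sketch below is only an illustration of that idea and does not reflect how Octoparse evaluates conditions. The sample rows, the field names, the date format and the helper functions are all assumptions made for the example.

```python
from datetime import datetime


def text_contains(value, needle):
    """Text condition: case-insensitive substring match."""
    return needle.lower() in value.lower()


def number_at_most(value, limit):
    """Numeral condition: the field, parsed as a number, is <= limit."""
    return float(value) <= limit


def time_on_or_after(value, cutoff, fmt="%Y-%m-%d"):
    """Time condition: the field date is on or after the cutoff date."""
    return datetime.strptime(value, fmt) >= datetime.strptime(cutoff, fmt)


# Hypothetical extracted rows; scraped values usually arrive as text.
rows = [
    {"title": "Blue Kettle", "price": "19.99", "listed": "2020-07-15"},
    {"title": "Red Toaster", "price": "24.50", "listed": "2020-08-01"},
    {"title": "Blue Toaster", "price": "31.00", "listed": "2020-08-02"},
]

# Keep only rows that satisfy all three conditions.
kept = [
    row for row in rows
    if text_contains(row["title"], "toaster")
    and number_at_most(row["price"], 30)
    and time_on_or_after(row["listed"], "2020-08-01")
]
print([row["title"] for row in kept])
```

Because scraped values are plain text, the numeral and time conditions first convert the field before comparing it, which is why a consistent number and date format matters.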
15 Closing words
Dear reader,
You have now reached the end of the book. Thank you
for buying this book, for the trust you have placed in us,
and for the time you have invested.
https://www.linkedin.com/in/tim-luprich-7a5a10158
Xing https://www.xing.com/profile/Tim_Luprich/
http://agent.octoparse.com/ws/324