Web Application For Harvesting Data From Complementary Websites
Submitted By
Md. Shamim Bhuiyan

APPROVALS
This project titled “Web application for harvesting data from complementary
websites”, submitted by Md. Shamim Bhuiyan to the Department of Computer Science
& Engineering, Daffodil Institute of IT, Dhaka, Bangladesh, has been accepted as
satisfactory for the partial fulfillment of the requirements for the degree of Bachelor of
Science (B.Sc.) in Computer Science and Engineering, and approved as to its style and
contents.
………………………..… ………………………..
Examiner Examiner
……………………… …………………….
Supervisor Coordinator
Tahmina Aktar Trisha Md. Imran Hossain
Lecturer Lecturer and Coordinator
Dept. Of CSE Dept. Of CSE
Daffodil Institute of IT Daffodil Institute of IT
DECLARATION
I hereby declare that the work presented in this project, “Web application for harvesting data
from complementary websites”, has been carried out by me under the supervision of
Tahmina Aktar Trisha, Lecturer, Department of CSE, Daffodil Institute of IT. I also declare
that neither this project nor any part of it has been submitted elsewhere to any other
university, college, or organization for the award of any degree or diploma.
.……………………
Md. Shamim Bhuiyan
Reg: 14502001237
Session: 2014-15
ABSTRACT
ACKNOWLEDGEMENT
I would like to express my gratitude to the Almighty Allah, who gave us not only the
ability to design and program this system but also the patience to complete it. First of all,
my heartfelt appreciation and gratitude go to my project supervisor, Tahmina Aktar
Trisha, Lecturer, Department of Computer Science and Engineering, Daffodil Institute
of IT. I am obliged and thankful to her for her continuous encouragement, motivation, and
professional guidance during the work of this project, which has proven to be an integral
part of it. Without her valuable support and guidance, this project could not have reached
this level of development.
I also thank all the faculty members of the Department of Computer Science and
Engineering, Daffodil Institute of IT, for the valuable time they spent on requirements
analysis and evaluation of the project work.
Table of Contents
Page No.
APPROVALS……………………………………………………….. I
DECLARATION……………………………………………………. II
ABSTRACT……………………………………………………….….. III
ACKNOWLEDGEMENT…………………………………………. IV
1.5 Methodology…………………………………………….…… 3
1.5.1 Data sources……………………………………………… 3
1.6 Process Model ……………………………………….……. 4
1.6.1 Why incremental process model……………………..…. 4
1.7 Feasibility Study………………………………………..…… 5
1.7.1 Technical feasibility……………………….……..…… 5
1.7.2 Economical feasibility ………………………………… 5
1.7.3 Operational Feasibility……………………………….… 5
2.5 Other Requirements……………………………….……… 9
2.5.1 Software requirements of client …………….………. 9
2.6 Benefits of the system……………………...........………… 10
2.6.1 As a tool of marketing………...................……………..… 10
2.6.2 As a tool of data science……….........................……….... 10
2.6.3 Research…………………….................................………. 11
2.6.4 Some other potential benefits………….............………... 11
4. IMPLEMENTATION…………………………………… 21-34
4.1 Implementation………………………...........…………….. 23
6. CONCLUSION…………………………….....................……… 38-45
6.1 Conclusion…………………………………………………… 39
6.2 Appendix…………………………………………………….. 39
6.3 Reference…………………………………………………….. 40
LIST OF FIGURES
Figure 4.4.2(a): Home page for input links and choices………….. 30
CHAPTER: 1.0
PROJECT INTRODUCTION
1.1 Introduction
The goal of any system development is to develop and implement the system cost-effectively
and in the way best suited to its users; analysis is the heart of this process. Analysis is the
study of the various operations performed by the system (such as add, update, delete, and
search operations) and of the relationships maintained within the system. During analysis,
data are collected on the files, decision points, and transactions handled by the present
system.
Only the administrator can add data to the database. The data can be retrieved easily, are
well protected for personal use, and can be processed very quickly. The objective of this
system is to keep information easily accessible and maintainable.
There are some tools on the market that provide such features, taking a URL and returning
data to the user, but most of them are not very effective in today's market. Most of them
work offline, and the data they provide is not well formatted. That is why we are developing
this system, which frees users from the problems they currently face.
There are also systems that cover only a limited area of scraping. The limitations of the
existing systems are:
1.3 Objectives
Our aim is to reduce the limitations on obtaining data for different kinds of work. We know
that the internet holds a lot of data on every topic we deal with daily, but the tools for
getting that data when we need it are very limited. So we felt that creating such an
application is very important. If it is online and anyone can access it, it will be very helpful
for different professionals and students. This application is intended to make people's time
more productive: people can manually browse the internet and download some of the data
they need, but an expert in another important sector should spend that time on more
valuable work. You concentrate on your work, and our system will deliver the data into
your hands. So our objective is to provide a web application that will help to:
4. Format the harvested data into the required formats, such as JSON, XML, and CSV (a minimal sketch of this step follows below).
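As an illustration of this objective, here is a minimal sketch (not the project's own code) of writing harvested records to JSON and CSV using only the Python standard library; the sample records, field names, and file paths are assumptions made for this example.

# Sketch: writing harvested records to JSON and CSV with the standard library.
# The records, field names, and file paths are assumptions for this example.
import csv
import json

records = [
    {"title": "Example headline", "url": "https://example.com/a"},
    {"title": "Another headline", "url": "https://example.com/b"},
]

# JSON output
with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV output
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)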
1.4 Proposed Application
The proposal is a web application that is able to give actionable data to the user. My
proposed system is an application based on harvesting data from the internet: the system
takes an input, analyses it, and then returns data according to the user's choice. Some
restrictions will be applied to make sure that the data being provided is legal.
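As a rough sketch of this idea (assuming the requests and BeautifulSoup libraries that the appendix code also uses; the function name and tag choices are illustrative), the core flow of taking a URL and a chosen data type and returning matching content looks like this:

# Sketch of the core harvesting flow; the function name and tag choices are illustrative.
import requests
from bs4 import BeautifulSoup

def harvest(url, data_type):
    """Fetch a page and return the text of the elements matching the chosen data type."""
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, "html.parser")
    if data_type == "heading":
        elements = soup.find_all(["h1", "h2", "h3"])  # headline-style tags
    else:
        elements = soup.find_all("p")  # default: paragraphs
    return [el.get_text(strip=True) for el in elements]

# Example usage (URL is illustrative):
# print(harvest("https://example.com", "heading"))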
1.5 Methodology
The development process of “Internet data harvesting” follows the structure described
later in Software Analysis & Design.
In the data collection phase of this project, we collected two types of data:
Primary Data
Secondary Data
Primary data are generated by scraping the internet, where the data are publicly
available and can be collected without violating any legal rules; that is, all data are safe
and legal.
Secondary data are kept in the website's local database, where the data downloaded by
different users are safely stored for future use.
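If such secondary data were tracked in the site's local database, a small Django model along the following lines could be used; the model name and fields are assumptions for illustration and are not taken from the project code.

# Hypothetical Django model for keeping secondary (previously downloaded) data;
# the model name and fields are assumptions, not taken from the project code.
from django.conf import settings
from django.db import models

class HarvestedFile(models.Model):
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    source_url = models.URLField()
    data_type = models.CharField(max_length=20)      # e.g. 'heading', 'paragraph', 'image'
    stored_path = models.CharField(max_length=255)   # where the downloaded file is kept
    created_at = models.DateTimeField(auto_now_add=True)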
1.6 Process Model
In the incremental (iterative) model, the whole requirement is divided into various builds.
Multiple development cycles take place here, making the life cycle a “multi-waterfall”
cycle [1]. Cycles are divided into smaller, more easily managed modules. Each module
passes through the requirements, design, implementation, and testing phases. A working
version of the software is produced during the first module, so you have working software
early in the software life cycle. Each subsequent release of the module adds functions to
the previous release. The process continues until the complete system is achieved.
[Figure: Incremental process model — Initialization, Requirements, Planning, Design, Implementation, Verification, Deployment, Evaluation]
1.6.1 Why incremental process model
Generates working software quickly and early during the software life cycle.
This model is more flexible – less costly to change scope and requirements.
Lowers initial delivery cost.
1.7 Feasibility Study
A feasibility study determines whether the proposed solution is feasible or achievable for the
organization. There are three major areas of feasibility study:
Technical Feasibility
Economical Feasibility
Operational Feasibility
1.7.2 Economical Feasibility
The purpose of the economic feasibility assessment is to determine the positive economic
benefits that the proposed system will provide to the organization. Our system is
economically feasible because, by using the proposed system, much work can be done in a
short time that would not be possible by manpower within the same time. So we can say
that if they use the proposed system they will benefit economically. This is how the system
is economically feasible.
1.7.3 Operational Feasibility
Users can easily operate the proposed system because it is user friendly. It is easy to insert
inventory products and easy to create stocks. If the staff of the organization have basic
computer knowledge, they can operate the software easily. Every feature and activity
combined within the system is designed and developed according to the format they
previously used, with a more attractive user interface.
CHAPTER: 2.0
REQUIREMENT ENGINEERING
2.1 Requirement Engineering
Requirement engineering encompasses the tasks that lead to an understanding of what the
business impact of the application will be, what the user wants, and how end-users will
interact with the software.[2] Requirements engineering is defined in terms of its major
activities.
Understanding problems
Solution determination
Designing and building an elegant computer program that solves the wrong problem
serves no one’s need. That’s why it is important to understand what the customer wants
before we begin to design and build a system.
Admin requirement
System requirement
User requirement
2.2.1.8 The system needs a database for storing data and providing downloadable data.
2.4 Non-functional Requirements
Browser:
2.6.1 As a tool of marketing:
Competitor research.
Campaign study.
Market research.
Brand monitoring.
2.6.2 As a tool of Data Science:
2.6.3 Research:
2.6.4 Some other potential benefits:
All the information will be stored on the computer with its formatted screens and
built-in databases.
All the information can be handled more easily and quickly than with any other manual
process.
The admin can easily retrieve, at any time, all the information he has previously stored.
The admin can easily input the marks of the students.
The admin can manage students easily.
The security of this software is high; without the admin login details, no one can enter the
dashboard area.
CHAPTER: 3.0
ANALYSIS & DESIGN
In system analysis, a study of the system in as much detail as possible is carried out with
the help of some diagrams, i.e. the use case diagram, activity diagram, swimlane diagram,
data flow diagram, and entity relationship diagram.
[Figure: Use case diagram — actors: Admin, User; use cases: Register, Log in, Target URL, Harvesting data, Generate unique no., View result, Compare result, Download]
The entity relationship diagram (ERD) enables a software engineer to specify the data
objects that are input to and output from a system, the attributes that define the properties of
these objects, and their relationships [4]. ERDs provide a clear view of the logical structure
of data within the boundary of interest and allow the engineer to model the data without
considering its physical form. Some of the basic terms used in an ERD are described below:
Entity: An entity is an object with a physical existence, or it may be an object with a
conceptual existence.
Relationship: A relationship describes how entities are associated with each other. A
relationship is depicted by a diamond.
Foreign key: A foreign key is an attribute of a relation that refers to an existing attribute of
another relation, usually its primary key.
Relationship Cardinality:
[Figure: Entity relationship diagram — attributes include e-id, name, email, password]
3.4 Data Flow Diagram (DFD)
A Data Flow Diagram (DFD) is a graphical representation of the "flow" of data through
an information system, modeling its process aspects. A DFD is often used as a
preliminary step to create an overview of the system, which can later be elaborated. [8]
DFD can also be used for the visualization of data processing (structured design).
A DFD shows what kind of information will be input to and output from the system,
where the data will come from and go to, and where the data will be stored [5]. It does not
show information about the timing of processes or about whether processes operate in
sequence or in parallel (which would be shown on a flowchart).
Figure 3.4(a): Context-level DFD (the user registers and sends search/result requests to the system; the admin approves accounts)
Level 1 DFD
[Figure: Level 1 DFD — the User requests registration, which is handled by process 1.0 Registration and approved by the Admin; Data Processing follows]
3.5 Work Flow Diagram
A workflow diagram is a way of representing a business process, for better understanding,
using standard shapes that represent flows, decisions, processes, etc. [6] A workflow
diagram can be used in any business to clear up confusion in understanding. In this project,
the workflow diagram represents the overall process of the application:
[Figure: Workflow diagram — Access, Try link, Any obstacle? (Yes/No), Scrape data, Store temporarily, Analyze data, Download, End]
CHAPTER: 4.0
IMPLEMENTATION
4.1 Implementation
4.2 Workstation
The increasing capabilities of mainstream PCs in the late 1990s blurred the line between
them and technical/scientific workstations. The workstation market previously employed
proprietary hardware, which made workstations distinct from PCs; for instance, IBM used
RISC-based CPUs for its workstations and Intel x86 CPUs for its business/consumer PCs
during the 1990s and 2000s. However, by the early 2000s this difference had disappeared,
as workstations now use highly commoditized hardware dominated by large PC vendors,
such as Dell, HP, and Fujitsu, selling Microsoft Windows or GNU/Linux systems running
on x86-64 architectures such as Intel Core.
The specifications of the workstation used in this project are given below:
HDD: 500 GB
The most basic tools are a source code editor and a compiler or interpreter, which are
used ubiquitously and continuously. Other tools are used more or less depending on the
language, development methodology, and individual engineer, and are often used for a
discrete task, like a debugger or profiler. Tools may be discrete programs, executed
separately – often from the command line – or may be parts of a single large program,
called an integrated development environment (IDE). In many cases, particularly for
simpler use, simple ad hoc techniques are used instead of a tool, such as print
debugging instead of using a debugger, manual timing (of overall program or section of
code) instead of a profiler, or tracking bugs in a text file or spreadsheet instead of a bug
tracking system.
The distinction between tools and applications is murky. For example, developers use
simple databases (such as a file containing a list of important values) all the time as tools,
whereas a full-blown database is usually thought of as an application or software in its own
right. For many years, computer-assisted software engineering (CASE) tools were sought
after, but successful tools have proven elusive. In one sense, CASE tools emphasized design
and architecture support, such as for UML; but the most successful of these tools are IDEs.
An integrated development environment (IDE) normally consists of a source code editor,
build automation tools, and a debugger. Most modern IDEs offer intelligent code
completion; examples include Sublime Text and VS Code.
I used Sublime Text 3 for this project. A screenshot is given below:
4.3.2 Windows Command Prompt
To set up the development environment with Python, the following tools and packages were
used:
beautifulsoup4==4.7.1
lxml==4.3.4
Pillow==6.1.0
request==2019.3.22
requests==2.21.0
urllib3==1.24.1
etc.
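As a small, optional sanity check of the development environment (a sketch assuming Python 3.8 or later; it is not part of the project's setup), the installed versions of the packages listed above can be printed like this:

# Sketch (assumes Python 3.8+): print the installed versions of the packages above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["beautifulsoup4", "lxml", "Pillow", "requests", "urllib3"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")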
4.4 User Experience And Project Preview
After running this web application, the user first faces the authentication page, where only
authentic users can log in with the username and password previously given to them by the
admin.
Figure 4.4.1(b): Incorrect login alert
After successfully logging in, the user is redirected to the home page, where they can enter
a link to a complementary website from which they want to harvest data. There is also a
drop-down menu to select the type of data, a submit button, and a log-out link.
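For orientation, the views described here and listed in the appendix (login, home, scrap_data, and logout) could be wired to URLs roughly as in the following sketch; the app name, module layout, and URL patterns are assumptions for illustration, not the project's actual urls.py.

# urls.py (sketch): maps URLs to the views listed in the appendix.
# 'scraper' is an assumed app name; the real project's module names may differ.
from django.urls import path

from scraper import views

urlpatterns = [
    path("login/", views.login, name="login"),
    path("logout/", views.logout, name="logout"),
    path("", views.home, name="home"),
    path("scrap/", views.scrap_data, name="scrap_data"),
]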
Figure 4.4.2(a): Home page for input links and choices
4.4.3 Admin page
Users can only be added by the admin. The admin can add, delete, and limit users.
Figure 4.4.3(b): Django Administration page 1
Figure 4.4.3(d): User permission setup from admin
Figure 4.4.3(f): Important dates for user activity
4.4.4 Output
Users will find their data in the local folder of their computer, in the format they selected
in the input section.
Figure 4.4.4(b): Output as images
CHAPTER: 5.0
LIMITATION AND FUTURE ENHANCEMENT
5.1 Limitations
Like every other project, “Web application for harvesting data from complementary
websites” has some limitations as well. The main limitations are:
1. Multitasking: This website serves one task at a time rather than a group of tasks.
Specifically, it cannot take multiple links at a time; it just takes one link and goes through
that link to get data.
2. Accuracy in analysis: Every website has its own style of coding, so the analysis can
sometimes go wrong for that reason.
5.2 Future Enhancement
2. Make the website automated so that it can work with keywords rather than links.
CHAPTER: 6.0
CONCLUSION
6.1 Conclusion
Finally, this report demonstrates the achievements of the project and also presents an
assessment of its performance and reliability. The project makes extensive use of web
scraping and data mining technology. Moreover, it helped me to develop my coding skills
and a better understanding of technical methodologies. To conclude, I believe that the
current solution succeeded in meeting the project's requirements and deliverables. Even
though it has a series of limitations, it allows for further extensions, which would enable a
more in-depth understanding of data scraping and data mining. I hope this project will help
businessmen, scientists, marketers, students, and other users to develop their own scope of
knowledge.
6.2 Appendix
# Standard library, third-party, and Django imports used by the views below.
import os
import urllib.request
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup
import xlwt
from xlwt import Workbook

from django.contrib import auth, messages
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse
from django.shortcuts import redirect, render
def login(request):
if request.user.is_authenticated:
return redirect('home')
if request.method == 'POST':
username = request.POST.get('username')
password = request.POST.get('password')
print(password)
user = auth.authenticate(username=username, password=password)
if user is not None:
# correct username and password login the user
auth.login(request, user)
return redirect('home')
else:
messages.error(request, 'Error wrong username/password')
return render(request, 'login.html')
@login_required(login_url='login')
def home(request):
return render(request, "home.html")
def logout(request):
auth.logout(request)
return redirect('login')
@login_required(login_url='login')
def scrap_data(request):
e_links = request.POST['e_links']
e_tag = request.POST['e_tag']
print(e_links)
print(e_tag)
if e_tag == 'heading':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find_all("h2", class_="title")
f= open("result/headlines.txt", "w+")
for i in tb:
try:
ss = i.find("a").contents
# print(ss)
k = ss[0]
# # print(str(k))
# # print()
k = str(k)
print(k)
f.write(k + "\n")
except:
pass
# print("No")
f.close()
except:
pass
elif e_tag == 'paragraph':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# tb = soup.find('tr')
tb = soup.find_all("p")
# wb = Workbook()
# sheet1 = wb.add_sheet('Sheet 1')
# # add_sheet is used to create sheet.
f= open("result/paragraph.txt", "w+")
for i in tb:
try:
ss = i.contents
# print(ss[0])
k = ss[0]
# print(str(k))
# print()
k = str(k)
print(k)
f.write(k + "\n")
print(1)
except:
pass
# print("No")
f.close()
except:
pass
elif e_tag == 'telephone':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# tb = soup.find('tr')
tb = soup.find_all("td", colspan="2")
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
sheet1.write(0, 0, 'TelePhone Number')
count = 0
for i in tb:
j = i.contents
ss = j[0]
# print(ss)
try:
if ss[:3] == '+88' or ss[:2]=='01' :
count += 1
print(ss)
sheet1.write(count, 0, ss)
except:
print("no")
wb.save('result/telephone.xls')
except:
pass
elif e_tag == 'image':
try:
url = e_links
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
imgs = soup.findAll('img')
print("aa")
# user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
# headers={'User-Agent':user_agent,}
for img in imgs:
print("VV")
# print(img)
jf = img.get('src')  # the image's source URL
print(jf)
# j = img.get('src').read()
# fileName = basename(urlsplit(jf)[2])
# output = open(fileName,'wb')
# output.write(j)
# output.close()
# print(img.get('src'))
#'/home/asus/Desktop/defodilscrap/result/imagefile'
try:
my_path = 'C:/Users/Asus/Desktop/diitdata/result/imagefile'
#'/home/asus/Desktop/amarfile'
urllib.request.urlretrieve(jf, os.path.join(my_path, os.path.basename(jf)))
# with open(basename(jf), "wb") as f:
# f.write(requests.get(jf).content)
# uopen = urlopen(jf)
# stream = uopen.read()
# file = open('myfile.jpg','w')
# file.write(stream)
# file.close()
except: print('not found')
except:
pass
return HttpResponse("check your Folder")
6.3 References
[3] Jhon Mc, “Web Scraping and Crawling with Python: Beautiful Soup, Requests &
Selenium”, www.udemy.com, March 2009.
[4] Brayan Kylan, “Intro to Data Harvesting Algorithm”, www.kdnuggeds.com, April 2012.
[6] Alison Fitter, “Web Scraping Software and Tools”, www.fminer.com, 2009.
[9] Antonio Melé, Django 2 by Example: Build Powerful and Reliable Python Web
Applications from Scratch.