
Web application for harvesting data

from complementary websites

A project presented to the National University in partial fulfillment of the
requirements for the degree of Bachelor of Science (Hons.) in Computer Science &
Engineering

Submitted By

Md. Shamim Bhuiyan


Reg: 14502001237
Session: 2014-15

Department of Computer Science & Engineering


Daffodil Institute of IT, Dhaka
Under National University, Bangladesh
September 2019
APPROVALS

This project, titled “Web application for harvesting data from complementary
websites” and submitted by Md. Shamim Bhuiyan to the Department of Computer Science
& Engineering, Daffodil Institute of IT, Dhaka, Bangladesh, has been accepted as
satisfactory for the partial fulfillment of the requirements for the degree of Bachelor of
Science (B.Sc.) in Computer Science and Engineering and approved as to its style and
contents.

………………………..… ………………………..

Examiner Examiner

……………………… …………………….
Supervisor Coordinator
Tahmina Aktar Trisha Md. Imran Hossain
Lecturer Lecturer and Coordinator
Dept. Of CSE Dept. Of CSE
Daffodil Institute of IT Daffodil Institute of IT

DECLARATION

I, a student of Bachelor of Science in Computer Science and Engineering, hereby
declare that the work presented in this project, “Web application for harvesting data
from complementary websites”, has been done by me under the supervision of
Tahmina Aktar Trisha, Lecturer, Department of Computer Science & Engineering,
Daffodil Institute of IT. I also declare that neither this project nor any part of it has
been submitted elsewhere to any other university, college, or organization for any
academic qualification, certificate, or degree. I guarantee that this project work does
not infringe any copyright.

.……………………
Md. Shamim Bhuiyan

Reg: 14502001237

Session: 2014-15

ABSTRACT

“Web application for harvesting data from complementary websites” is an application
for harvesting data from the internet that is publicly available and free from any
restrictions. The goal of this project is to extract information from websites
automatically. Using this application, it is very easy to download data in any format,
such as text, CSV, or images. Its functionality is designed around the basic demands of
users. The application gets data from a user-defined data source and provides it to
users according to the given conditions. It takes as input the user's links to points of
interest (POI) and provides the information the user needs by analyzing and reading
those links.

ACKNOWLEDGEMENT

I would like to express my gratitude to almighty ALLAH, who created us with not only
the ability to design and program this system but also the power of patience. First of all,
my heartfelt gratitude goes to my project supervisor, Tahmina Aktar Trisha, Lecturer,
Department of Computer Science and Engineering, Daffodil Institute of IT. I am obliged
and thankful to her for her continuous encouragement, motivation, and professional
guidance during the work of this project, which has proven to be an integral part of it.
Without her valuable support and guidance, this project could not have reached this
level of development.

I also express my sincere thanks and gratitude to Mohammed Shakhawat Hossain,
Principal, Daffodil Institute of IT, and Md. Imran Hossain, Lecturer and Coordinator,
Department of Computer Science and Engineering, Daffodil Institute of IT,

and to all the faculty members of the Department of Computer Science and Engineering,
Daffodil Institute of IT, for the valuable time they spent on requirements analysis and
evaluation of the project work.

Table of Content
Page
No.
APPROVALS……………………………………………………….. I

DECLARATION……………………………………………………. II

ABSTRACT……………………………………………………….….. III

ACKNOWLEDGEMENT…………………………………………. IV

LIST OF FIGURES………………………………………….. VII

1. PROJECT INTRODUCTION..………………………… 1-6

1.1 Introduction …………………………………………………... 1

1.2 Background Study…………………………………………… 1

1.3 Objectives …………………………………………………….. 2


1.3.1 Specific objective ……………………………………… 2
1.4 Proposed Application……………………………….……… 3

1.5 Methodology…………………………………………….…… 3
1.5.1 Data sources……………………………………………… 3
1.6 Process Model ……………………………………….……. 4
1.6.1 Why iterative process model……………………..…. 4
1.7 Feasibility Study………………………………………..…… 5
1.7.1 Technical feasibility……………………….……..…… 5
1.7.2 Economical feasibility ………………………………… 5
1.7.3 Operational Feasibility……………………………….… 5

2. REQUIREMENT ENGINEERING.................…..…. 6-11

2.1 Requirement Engineering…………………………..……. 7

2.2 Requirement Analysis……………………………...……… 8

2.3 Functional Requirements…………………………..…….. 9

2.4 Non-functional Requirements……………………..……. 9

2.5 Other Requirements……………………………….……… 9
2.5.1 Software requirements of client …………….………. 9
2.6 Benefits of the system……………………...........………… 10
2.6.1 As a tool of marketing………...................……………..… 10
2.6.2 As a tool of data science……….........................……….... 10
2.6.3 Research…………………….................................………. 11
2.6.4 Some other potential benefits………….............………... 11

3. ANALYSIS & DESIGN………………………………….. 12-20

3.1 Use Case Diagram…………………………………...…….. 13

3.2 Entity Relationship Diagram………………..……..…… 14

3.3 Data Flow Diagram (DFD)………………………………… 16

3.4 Work Flow Diagram…………..……………………..…… 20

4. IMPLEMENTATION…………………………………… 21-34

4.1 Implementation………………………...........…………….. 23

4.2 Working Station……………………………..........………. 24

4.3 Software Tools ……………………………………………. 25


4.3.1 Python IDE……………………..........................…… 26
4.3.2 Windows Command Prompt………………………. 27
4.3.3 Supportive tools and software……………………… 27
4.4 User Experience And Project Preview…………..…… 29
4.4.1 Authentication page………………………………… 29
4.4.2 Home page ………………………………………….. 30
4.4.3 Admin page…………………………………….…… 31
4.4.4 Output………………………………………………. 35

5. LIMITATION & FUTURE ENHANCEMENT… 36-37


5.1 Limitations…………………………………...........……….…. 37

5.2 Further Enhancement……………………………………… 37

6. CONCLUSION…………………………….....................……… 38-45

6.1 Conclusion…………………………………………………… 39

6.2 Appendix…………………………………………………….. 39

6.3 References…………………………………………………….. 40

LIST OF FIGURES

Figure 1.6(a): Iterative process model……………………………….. 4

Figure 3.1(a): Use Case Diagram…………………………………….. 13

Figure 3.2(a): E-R diagram………………………….………………. 16

Figure 3.3(a): Context level DFD…………………………………… 17

Figure 3.3(b): DFD Level 1…………………………………………... 18

Figure 3.3(c): DFD Level 2…………………………………………. 19

Figure 3.4(a): Work flow diagram………………………………….. 20

Figure 4.3.1(a): Sublime Text IDE………………………………….. 25

Figure 4.3.2(a): Windows Command prompt……………………… 26

Figure 4.3.3(a): Tools and frameworks for python………………… 27

Figure 4.4.1(a): Log in form to enter into the home page…………. 28

Figure 4.4.1(b): Incorrect log in alert……………………………… 29

Figure 4.4.2(a): Home page for input links and choices………….. 30

Figure 4.4.2(b): Input url…………………………………………… 31

Figure 4.4.3(a): Admin login page and validation………………… 32

Figure 4.4.3(b): Django Administration page 1…………………… 32

Figure 4.4.3(c): User sign up info………………………………….. 33

Figure 4.4.3(d): User permission setup from admin……………… 33

Figure 4.4.3(e): Available user permission………………………… 34

Figure 4.4.3(f): Important dates for user activity…………………. 34

Figure 4.4.4(a): Output folder of their computer………………….. 35

Figure 4.4.4(b): Output as images………………………………….. 35

CHAPTER: 1.0
PROJECT INTRODUCTION
1.1 Introduction

The goal of any system development is to develop and implement the system
cost-effectively, in the form best suited to its users, and analysis is the heart of that
process. Analysis is the study of the various operations performed by the system (such
as adding, updating, deleting, and searching records) and of the relationships maintained
within the system. During analysis, data are collected on the files, decision points, and
transactions handled by the present system.

Only administrators can add data to the database. The data can be retrieved easily; the
data are well protected for personal use, and data processing is very fast. The objective
of this system is easily maintainable information.

1.2 Background of Study

There are some tools on the market that provide such things, taking a URL and giving
data to the user, but most of them are not very effective in today's market. Most of them
are offline, and the data they provide is not well formatted. That is why we are
developing this system, in which users will get rid of all the problems they face at
present.

Some existing systems perform scraping in only a limited area. The limitations of the
existing systems are:

1. Not reliable with big data.

2. Extraction conditions cannot be set by users.

3. The extracted data are not well formatted.

4. Less secure.

5. Not compatible with modern websites.

6. Slow to run.

7. Few harvesting features.

1.3 Objectives

Our aim is to reduce the scarcity of data for different kinds of work. We know that there
is a lot of data on the internet for every topic we deal with daily, but the tools for getting
that data when we need it are very limited. So we felt that the creation of such an
application was very important. If it is online and anyone can get access, it will be very
helpful for different professionals and students. This app will help make time more
productive: people can manually access the internet and download some of the data they
need, but an expert in another important sector should make their time valuable instead
of doing that. You concentrate on your work, and our system will put the data in your
hands. So our objective is to provide a web application that will help fulfil the specific
objectives below.

1.3.1 Specific objective

1. Get data from a target website through this web application.

2. Dynamically grab data from a URL defined by the user.

3. Make it useful as a marketing tool for marketers.

4. Format the extracted data as needed, e.g. JSON, XML, or CSV (a sketch follows
this list).

5. Let the admin add, delete, and update users.

6. Show retrieved data to the user based on points of interest (POI).

7. Be useful for data scientists.

8. Provide a data download option for the user.
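A minimal Python sketch of what objective 4 could look like (the record fields and the
export helper are illustrative assumptions, not the project's actual code; JSON and CSV
are shown):

    import csv
    import json

    # Example harvested records; the real fields depend on what was scraped.
    records = [{'title': 'Sample headline', 'url': 'https://example.com/a'}]

    def export(records, fmt, path):
        # Write the same records in whichever format the user selected.
        if fmt == 'json':
            with open(path, 'w', encoding='utf-8') as f:
                json.dump(records, f, indent=2)
        elif fmt == 'csv':
            with open(path, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=records[0].keys())
                writer.writeheader()
                writer.writerows(records)

    export(records, 'csv', 'result/records.csv')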

1.4 Proposed Application

A web application that will be able to give actionable data to the user. My proposed
system is an application based on the theory of harvesting data from the internet: the
system takes an input, analyzes it, and then gives data according to the user's choice.
Some restrictions will be applied to make sure that the data it gives out is legal.

1.5 Methodology

The development process of “Internet data harvesting” follows the structure described
later in Software Analysis & Design.

This study of “Internet data harvesting” is tentative in nature. It aims at the development
of a web application. The variables to manipulate were identified through a hands-on
inspection and from primary and secondary data.

1.5.1 Data Sources

In the data collection phase of this project we collected two types of data:

• Primary Data

• Secondary Data

Primary data are generated by scraping data from the internet, where the data are
publicly available without violating any legal rules; that is, all the data are safe and legal.

Secondary data are generated from the website's local database, where the data
downloaded by different users are safely stored for future use.

1.6 Process Model

In the iterative model, the whole requirement is divided into various builds. Multiple
development cycles take place here, making the life cycle a “multi-waterfall” cycle.
Cycles are divided into smaller, more easily managed modules [1]. Each module passes
through the requirements, design, implementation, and testing phases. A working
version of the software is produced during the first module, so you have working
software early in the software life cycle. Each subsequent release of the module adds
function to the previous release. The process continues until the complete system is
achieved.

Figure 1.6(a): Iterative process model (phases: initialization, planning, requirements,
design, implementation, verification, evaluation, deployment)

1.6.1 Why Iterative Process Model

• It generates working software quickly and early in the software life cycle.

• This model is more flexible, and it is less costly to change scope and requirements.

• It is easier to test and debug during a smaller iteration.

• In this model the customer can respond to each build.

• It lowers the initial delivery cost.

1.7 Feasibility Study

Feasibility study determines whether the solution is feasible or achievable for the
organization or not. There are three major areas of feasibility study:

• Technical Feasibility

• Economical Feasibility

• Operational Feasibility

1.7.1 Technical Feasibility

The technical feasibility assessment is focused on gaining an understanding of the
present technical resources of the organization and their applicability to the expected
needs of the proposed system. It is an evaluation of the hardware and software and how
they meet the needs of the proposed system. The proposed system is compatible with a
low-specification computer with only an internet connection, because it is web-based
software.

1.7.2 Economical Feasibility

The purpose of the economic feasibility assessment is to determine the positive
economic benefits that the proposed system will provide to the organization. Our system
is economically feasible because, with the proposed system, much work can be done in a
short time that would not be possible with manpower alone in the same time. So we can
say that users of the proposed system will benefit economically. This is how the system
is economically feasible.

1.7.3 Operational Feasibility

Users can easily operate the proposed system because the system is user friendly. It is
easy to insert inputs and to produce stored results. If the staff of the organization have
basic computer knowledge, they can operate the software easily. Every feature and
activity combined within the system is designed and developed to follow the previous
format the users are accustomed to, with a more attractive user interface.

CHAPTER: 2.0
REQUIREMENT ENGINEERING

2.1 Requirement Engineering

Requirement engineering encompasses the tasks that lead to an understanding of what
the business impact of the application will be, what the user wants, and how end-users
will interact with the software [2]. Requirements engineering is defined in terms of its
major activities:

• Understanding problems

• Solution determination

• Specification of a solution that is testable, understandable, maintainable, and that
satisfies project quality guidelines

Designing and building an elegant computer program that solves the wrong problem
serves no one's needs. That is why it is important to understand what the customer wants
before we begin to design and build a system.

2.2 Requirement Analysis

Requirement analysis provides the software designer with a representation of
information, function, and behavior that can be translated into data, architectural,
interface, and component-level designs.

The requirement analysis was carried out in the following task phases.

2.2.1 Admin requirements and system requirements:

Admin requirement

2.2.1.1 Admin will be allowed to remove users.

System requirement

2.2.1.2 A request gateway method should be present.

User requirement

2.2.1.3 A user can register by sending a request to the system.

System requirement

2.2.1.4 A form is needed for user registration.

User requirement

2.2.1.5 Ability to insert the link of a target website.

System requirement

2.2.1.6 An input form is needed to take the link and analyze it.

User requirement

2.2.1.7 Ability to download data.

System requirement

2.2.1.8 A database is needed to store data and serve downloadable data.

User requirement

2.2.1.9 Ability to define the data format.

System requirement

2.2.1.10 Options are needed to select those formats.
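A minimal Django sketch of the two forms these requirements call for (the form classes
are illustrative assumptions; the field names e_links and e_tag and the available data
types follow the view code in the appendix):

    from django import forms

    DATA_TYPES = [
        ('heading', 'Headings'),
        ('paragraph', 'Paragraphs'),
        ('telephone', 'Telephone numbers'),
        ('image', 'Images'),
    ]

    class RegistrationForm(forms.Form):
        # Requirement 2.2.1.4: a form for user registration.
        username = forms.CharField(max_length=150)
        password = forms.CharField(widget=forms.PasswordInput)

    class HarvestForm(forms.Form):
        # Requirements 2.2.1.6 and 2.2.1.10: take the target link and the data type.
        e_links = forms.URLField(label='Target website URL')
        e_tag = forms.ChoiceField(choices=DATA_TYPES, label='Data type')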

2.3 Functional Requirements

2.3.1 Admin can maintain the whole system.

2.3.2 Admin can add and delete users and set the scope of the user interface.

2.3.3 Admin can add, delete, and edit conditions.

2.3.4 Admin can add, delete, and edit restrictions.

2.4 Non-functional Requirements

2.4.1 Admin can log in using a username and password.

2.4.2 Only the admin can maintain the whole system.

2.4.3 The system is fully secured.

2.4.4 Any operating system is supported.

2.5 Other Requirements

2.5.1 Software requirements for the client:

Browser:

-Windows Internet Explorer 8.0 or above

-Mozilla Firefox version 47.0.1

-Google Chrome 52.0.2743.82

2.6 Benefits of the System

2.6.1 As a tool of marketing

• Competitor research.

• Campaign study.

• Getting ideas from data.

• Finding audiences from info.

• Market research.

• Brand monitoring.

2.6.2 As a tool of Data Science:

• Collection of large amounts of data.

• Getting data in the targeted format.

• Analytical data loaded directly from the source.

• Getting data in a simple form.

2.6.3 Research:

• Download data for research.

• Ease of grabbing data.

• Topics can be compared.

• One-click download ability.

2.6.4 Some other potential benefits

• The administrator will be able to maintain this system more accurately.

• All the information will be stored on the computer, with formatted screens and
built-in databases.

• All the information can be handled more easily and quickly than with any manual
process.

• The admin can easily retrieve, at any time, all the information he has previously
stored.

• The admin can input new records very easily.

• The admin can manage users easily.

• The security of this software is high; without the admin login details, no one can
enter the dashboard area.

• Users can easily download the harvested data online.

CHAPTER: 3.0
ANALYSIS & DESIGN

In system analysis, the system is studied in as much detail as possible with the help of
diagrams such as the use case diagram, activity diagram, swim lane diagram, data flow
diagram, and entity relationship diagram.

3.1 Use Case Diagram


A use case diagram at its simplest is a representation of a user's interaction with the
system that shows the relationship between the user and the different Use Cases in which
the user is involved. A use case diagram can identify the different types of users of a
system and the different use cases and will often be accompanied by other types of
diagrams as well.
Figure 3.1(a): Use Case Diagram (actors Admin and User; use cases: log in, register,
enter target URL, generate unique no., harvest data, view result, compare result,
download)

3.2 Entity Relationship Diagram (ERD)

The Entity Relationship Diagram (ERD) enables a software engineer to specify the data
objects that are input to and output from a system, the attributes that define the
properties of these objects, and their relationships. It provides an excellent graphical
representation of the data structures and relationships [4]. ERDs provide a clear view of
the logical structure of data within the boundary of interest and allow the engineer to
model the data without considering the physical form. Some of the basic terms used in
ERDs are described below:

Entity: An entity is an object with a physical existence, or it may be an object with a
conceptual existence, for example a car, a student, an employee, an applicant. An entity
is represented by a rectangle.

Relationship: A relationship is a logical linkage between two or more entities which
describes how the entities are associated with each other. A relationship is depicted by a
diamond.

Attribute: An attribute is a piece of information that describes a particular entity.

Primary key: A primary key is an attribute or collection of attributes that allows us to
identify an entity uniquely.

Foreign key: A foreign key is an attribute of a relation which refers to an existing
attribute of another relation.

Relationship cardinality: Relationship cardinality refers to the number of entity
instances involved in the relationship. The cardinality ratios are:

• 1:1 (One to One)

• 1:n (One to Many)

• n:n (Many to Many)
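These terms map directly onto Django models, which this project uses for its database
layer. The sketch below is illustrative only (the class and field names are assumptions,
not the project's actual schema): each class is an entity, each field an attribute, Django's
implicit id column is the primary key, and a ForeignKey expresses a 1:n relationship.

    from django.db import models

    class ScrapeUser(models.Model):               # entity (implicit primary key: id)
        name = models.CharField(max_length=100)   # attribute
        email = models.EmailField(unique=True)    # attribute

    class ScrapeRequest(models.Model):            # entity
        # Foreign key to ScrapeUser: one user makes many requests (1:n).
        user = models.ForeignKey(ScrapeUser, on_delete=models.CASCADE)
        target_url = models.URLField()            # attribute
        data_format = models.CharField(max_length=20)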

Figure 3.2(a): E-R diagram (entities such as admin and user, with attributes like E-id,
name, email, and password)
3.3 Data Flow Diagram (DFD)

A Data Flow Diagram (DFD) is a graphical representation of the “flow” of data through
an information system, modeling its process aspects. A DFD is often used as a
preliminary step to create an overview of the system, which can later be elaborated [8].
DFDs can also be used for the visualization of data processing (structured design).

A DFD shows what kind of information will be input to and output from the system,
where the data will come from and go to, and where the data will be stored [5]. It does
not show information about the timing of processes, or about whether processes will
operate in sequence or in parallel (which is shown on a flowchart).

Figure 3.3(a): Context level DFD (the user registers and sends search requests to the
system; the system returns results and the admin approves accounts)

Figure 3.3(b): DFD Level 1 (process 1.0, Registration: a request for registration is made
and the registration is approved)

Figure 3.3(c): DFD Level 2, all processes together (the user searches by POI link, the
data processing runs, and the admin views/manages search details)

3.4 Work Flow Diagram

A work flow diagram is a way of representing a business process for better
understanding, using standard shapes that represent flows, decisions, processes, etc. [6]
A work flow diagram can be used in any business to clear up confusion in
understanding. In this project, the work flow diagram represents the overall process of
the application after the user logs in.

Figure 3.4(a): Work flow diagram (access → try link → if any obstacle, stop; otherwise
formalize the link → visit internal links → find data in the HTML → scrape data →
store temporarily → analyze data → download → clear temporary data → end)
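A condensed Python sketch of this flow (simplified and assumed for illustration; the
application's real code appears in the appendix): fetch the link, stop on an obstacle, pull
the data out of the HTML, and store it temporarily until the user downloads it.

    import os
    import tempfile

    import requests
    from bs4 import BeautifulSoup

    def harvest(url):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:           # "Any obstacle?" -> yes: stop
            return None
        soup = BeautifulSoup(resp.content, 'html.parser')
        data = [p.get_text(strip=True) for p in soup.find_all('p')]
        fd, tmp_path = tempfile.mkstemp(suffix='.txt')    # store temporarily
        with os.fdopen(fd, 'w', encoding='utf-8') as f:   # analyze/format here
            f.write('\n'.join(data))
        return tmp_path  # caller serves the download, then clears the temp file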

CHAPTER: 4.0
IMPLEMENTATION

4.1 Implementation

In computer science, an implementation is a realization of a technical specification or


algorithm as a program, software component, or other computer system through
computer programming and deployment. Many implementations may exist for a given
specification or standard. For example, web browsers contain implementations of
World Wide Web Consortium-recommended specifications, and software development
tools contain implementations of programming languages.

A special case occurs in object-oriented programming, when a concrete class


implements an interface; in this case the concrete class is an implementation of the
interface and it includes methods which are implementations of those methods
specified by the interface.

In the IT industry, implementation refers to the post-sales process of guiding a client from


purchase to use of the software or hardware that was purchased. This includes
requirements analysis, scope analysis, customization, systems integration, user policies,
user training and delivery. These steps are often overseen by a project manager using
project management methodologies. Software Implementations involve several
professionals that are relatively new to the knowledge based economy such as business
analysts, technical analysts, solutions architects, and project managers.

To implement a system successfully, a large number of inter-related tasks need to be


carried out in an appropriate sequence. Utilizing a well-proven implementation
methodology and enlisting professional advice can help but often it is the number of
tasks, poor planning and inadequate resourcing that causes problems with an
implementation project, rather than any of the tasks being particularly difficult.
Similarly, with cultural issues, it is often the lack of adequate consultation and two-way
communication that inhibits achievement of the desired results.

4.2 Working Station

A workstation is a special computer designed for technical or scientific applications.


Intended primarily to be used by one person at a time, they are commonly connected to
a local area network and run multi-user operating systems. The term workstation has
also been used loosely to refer to everything from a mainframe computer terminal to a
PC connected to a network, but the most common form refers to the group of hardware
offered by several current and defunct companies such as Sun Microsystems, Silicon
Graphics, Apollo Computer, DEC, HP and IBM which opened the door for the 3D
graphics animation revolution of the late 1990s.

Workstations offered higher performance than mainstream personal computers,


especially with respect to CPU and graphics, memory capacity, and multitasking
capability. Workstations were optimized for the visualization and manipulation of
different types of complex data such as 3D mechanical design, engineering simulation
(e.g. computational fluid dynamics), animation and rendering of images, and
mathematical plots. Typically, the form factor is that of a desktop computer, consisting of
a high resolution display, a keyboard and a mouse at a minimum, but also offering
multiple displays, graphics tablets, 3D mice (devices for manipulating 3D objects and
navigating scenes), etc. Workstations were the first segment of the computer market to
present advanced accessories and collaboration tools.

The increasing capabilities of mainstream PCs in the late 1990s have blurred the lines
somewhat with technical/scientific workstations. The workstation market previously
employed proprietary hardware which made them distinct from PCs; for instance IBM
used RISC-based CPUs for its workstations and Intel x86 CPUs for its
business/consumer PCs during the 1990s and 2000s. However, by the early 2000s this
difference disappeared, as workstations now use highly commoditized hardware
dominated by large PC vendors such as Dell, HP, and Fujitsu, selling Microsoft
Windows or GNU/Linux systems running on x86-64 architectures such as Intel Core.

The specification of the workstation used in this project is given below:

Processor: Intel® Core™ i3-7100U CPU @ 2.40GHz

Installed memory (RAM) : 4GB

System Type: 64-bit operating system, x64-based processor.

OS: Windows 10 pro.

HDD: 500 GB

Display: 14.0 inch

Weight: 2.1kg, 2.2kg

4.3 Software Tools


A programming tool or software development tool is a computer program that software
developers use to create, debug, maintain, or otherwise support other programs and
applications. The term usually refers to relatively simple programs that can be combined
to accomplish a task, much as one might use multiple hand tools to fix a physical object.
The ability to use a variety of tools productively is one hallmark of a skilled software
engineer.

The most basic tools are a source code editor and a compiler or interpreter, which are
used ubiquitously and continuously. Other tools are used more or less depending on the
language, development methodology, and individual engineer, and are often used for a
discrete task, like a debugger or profiler. Tools may be discrete programs, executed
separately – often from the command line – or may be parts of a single large program,
called an integrated development environment (IDE). In many cases, particularly for
simpler use, simple ad hoc techniques are used instead of a tool, such as print
debugging instead of using a debugger, manual timing (of overall program or section of
code) instead of a profiler, or tracking bugs in a text file or spreadsheet instead of a bug
tracking system.

The distinction between tools and applications is murky. For example, developers use
simple databases (such as a file containing a list of important values) all the time as
tools, whereas a full-blown database is usually thought of as an application or software
in its own right. For many years, computer-assisted software engineering (CASE) tools
were sought after, but successful tools proved elusive. In one sense, CASE tools
emphasized design and architecture support, such as for UML. But the most successful
of these tools are IDEs.

4.3.1 Python IDE

An integrated development environment (IDE) is a software application that provides

comprehensive facilities to computer programmers for software development. An IDE

normally consists of a source code editor, build automation tools and a debugger. Most

modern IDEs have intelligent code completion. Examples include Sublime Text and VS Code.

I used Sublime Text 3 for this project. A screenshot is given below:

Figure 4.3.1(a): Sublime Text IDE

4.3.2 Windows Command Prompt

To set up the environment, the Windows command prompt was used.

Figure 4.3.2(a): Windows Command prompt.
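The report does not list the exact commands; a typical setup session for a Django project
at the prompt looks like the following (standard Django commands, shown here as an
assumed example, not a transcript from the project):

    pip install -r requirements.txt    (install the packages listed in 4.3.3)
    python manage.py migrate           (create the local database tables)
    python manage.py createsuperuser   (create the admin account used in 4.4.3)
    python manage.py runserver         (start the development server)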

4.3.3 Supportive tools and software

To set up the development environment with Python, the following tools and
frameworks were used:

• Python==3.7.4 (powerful programming language for web applications)

• Django==2.1.5 (a popular framework for web development with Python)

• beautifulsoup4==4.7.1

• lxml==4.3.4

• Pillow==6.1.0

• request==2019.3.22

• requests==2.21.0

• urllib3==1.24.1

etc.
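These version pins are conventionally kept in a requirements.txt file so that the
environment can be recreated in one step with pip install -r requirements.txt. A sketch
assuming the versions listed above (Python 3.7.4 itself is the interpreter and is installed
separately, not through pip):

    Django==2.1.5
    beautifulsoup4==4.7.1
    lxml==4.3.4
    Pillow==6.1.0
    requests==2.21.0
    urllib3==1.24.1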

Figure 4.3.3(a): Tools and frameworks for python development setup.

4.4 User Experience And Project Preview

“Web application for harvesting data from complementary websites” is a project whose
web application is easy for users to operate. Several platforms were used in this project.
Below are some user-experience notes and a preview of the project.

4.4.1 Authentication page

After running this web application, the user faces the authentication page, where only
authentic users can log in with the username and password previously given to them by
the admin.

Figure 4.4.1(a): Log in form to enter into the home page

Figure 4.4.1(b): Incorrect log in alert

4.4.2 Home Page

After successfully logging in, the user is redirected to the home page, where they can
enter a link to a complementary website from which they want to harvest data. There is
also a drop-down menu to select the type of data, a submit button, and a log-out link.

Figure 4.4.2(a): Home page for input links and choices

Figure 4.4.2(b): Input url
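The pages in this section are served by the view functions listed in the appendix. A
minimal urls.py sketch wiring them together (assumed, not the project's actual file; the
route names 'login' and 'home' are implied by the redirects in the appendix code):

    from django.urls import path

    from . import views

    urlpatterns = [
        path('', views.login, name='login'),
        path('home/', views.home, name='home'),
        path('logout/', views.logout, name='logout'),
        path('scrap/', views.scrap_data, name='scrap'),
    ]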

4.4.3 Admin page

Users can only be added by the admin. The admin can add, delete, and limit users.

Figure 4.4.3(a): Admin login page and validation

Figure 4.4.3(b): Django Administration page 1

Figure 4.4.3(c): User sign up info

Figure 4.4.3(d): User permission setup from admin

Figure 4.4.3(e): Available user permission

Figure 4.4.3(f): Important dates for user activity

4.4.4 Output

Users will find their data in a local folder on their computer, in the format they selected
in the input section.

Figure 4.4.4(a): Output folder of their computer.

Figure 4.4.4(b): Output as images

CHAPTER: 5.0
LIMITATION AND FUTURE ENHANCEMENT

5.1 Limitations
Like every other project, “Web application for harvesting data from complementary
websites” has some limitations as well. The main limitations are:

1. Multitasking: The website serves one request at a time rather than a group of tasks.
Specifically, it cannot take multiple links at a time; it takes one link and goes through
that link to get data.

2. Accuracy in analysis: Every website has its own style of coding, so the analysis can
sometimes go wrong for that reason.

3. Restrictions: Restricted websites cannot be analyzed, because this website follows
legal rules.

5.2 Further Enhancement

1. Add more conditions to make this website more reliable for today's demands.

2. Make the website automated so that it can work with keywords rather than links.

CHAPTER: 6.0
CONCLUSION

6.1 Conclusion
Finally, this report demonstrates the achievements of the project and also presents an
assessment of its performance and reliability. It makes extensive use of web scraping
and data mining technology. Moreover, it helped me to develop my coding skill and a
better understanding of technical methodologies. To conclude, I believe that the current
solution has succeeded in meeting the project's requirements and deliverables. Even
though it has a series of limitations, it allows for further extensions, which would enable
a more in-depth understanding of data scraping and data mining. I hope this project will
help businessmen, scientists, marketers, students, and other users to develop their own
scope of knowledge.

6.2 Appendix

from django.shortcuts import render, redirect
from django.http import HttpResponse
from django.contrib import auth, messages
from django.contrib.auth.decorators import login_required

import os
import urllib.request
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup
from xlwt import Workbook


def login(request):
    # Users who are already authenticated go straight to the home page.
    if request.user.is_authenticated:
        return redirect('home')
    if request.method == 'POST':
        username = request.POST.get('username')
        password = request.POST.get('password')
        user = auth.authenticate(username=username, password=password)
        if user is not None:
            # Correct username and password: log the user in.
            auth.login(request, user)
            return redirect('home')
        else:
            messages.error(request, 'Error: wrong username/password')
    return render(request, 'login.html')


@login_required(login_url='login')
def home(request):
    return render(request, 'home.html')


def logout(request):
    auth.logout(request)
    return redirect('login')


@login_required(login_url='login')
def scrap_data(request):
    e_links = request.POST['e_links']  # target URL entered by the user
    e_tag = request.POST['e_tag']      # type of data the user wants

    if e_tag == 'heading':
        # Collect headline text from <h2 class="title"><a>...</a></h2> elements.
        try:
            page = requests.get(e_links)
            soup = BeautifulSoup(page.content, 'html.parser')
            f = open('result/headlines.txt', 'w+')
            for i in soup.find_all('h2', class_='title'):
                try:
                    f.write(str(i.find('a').contents[0]) + '\n')
                except Exception:
                    pass
            f.close()
        except Exception:
            pass

    elif e_tag == 'paragraph':
        # Collect the text of every <p> element on the page.
        try:
            page = requests.get(e_links)
            soup = BeautifulSoup(page.content, 'html.parser')
            f = open('result/paragraph.txt', 'w+')
            for i in soup.find_all('p'):
                try:
                    f.write(str(i.contents[0]) + '\n')
                except Exception:
                    pass
            f.close()
        except Exception:
            pass

    elif e_tag == 'telephone':
        # Collect Bangladeshi phone numbers (+88... or 01...) into an Excel sheet.
        try:
            page = requests.get(e_links)
            soup = BeautifulSoup(page.content, 'html.parser')
            wb = Workbook()
            sheet1 = wb.add_sheet('Sheet 1')
            sheet1.write(0, 0, 'Telephone Number')
            count = 0
            for i in soup.find_all('td', colspan='2'):
                ss = i.contents[0]
                try:
                    if ss[:3] == '+88' or ss[:2] == '01':
                        count += 1
                        sheet1.write(count, 0, ss)
                except Exception:
                    pass
            wb.save('result/telephone.xls')
        except Exception:
            pass

    elif e_tag == 'image':
        # Download every image on the page into the local result folder.
        try:
            html = urlopen(e_links)
            soup = BeautifulSoup(html, 'lxml')
            for img in soup.findAll('img'):
                jf = img.get('src')
                try:
                    my_path = 'C:/Users/Asus/Desktop/diitdata/result/imagefile'
                    urllib.request.urlretrieve(jf, os.path.join(my_path, os.path.basename(jf)))
                except Exception:
                    print('not found')
        except Exception:
            pass

    return HttpResponse('check your Folder')

6.3 References

[1] Ryan Mitchell, Web Scraping with Python.

[2] Richard Lawson, Web Scraping with Python.

[3] Jhon Mc, “Web Scraping and Crawling with Python: Beautiful Soup, Requests &
Selenium”, www.udemy.com, March 2009.

[4] Brayan Kylan, “Intro to data harvesting algorithm”, www.kdnuggeds.com, April 2012.

[5] Garrett Alley, “Data Extraction”, www.dzone.com, January 2017.

[6] Alison Fitter, “Web scraping software and tools”, www.fminer.com, 2009.

[7] Thomas A. Powell, Web Design: The Complete Reference.

[8] Hanry Michel, “E-R diagram tools”, www.visual-paradigm.com, May 2018.

[9] Antonio Melé, Django 2 by Example: Build Powerful and Reliable Python Web
Applications from Scratch.

[10] Nigel George, Mastering Django: Core.
