Web Application For Harvesting Data From Complementary Websites
Submitted By
Md. Shamim Bhuiyan

APPROVALS
This project titled “Web application for harvesting data from complementary
websites”, submitted by Md. Shamim Bhuiyan to the Department of Computer Science
& Engineering, Daffodil Institute of IT, Dhaka, Bangladesh, has been accepted as
satisfactory for the partial fulfillment of the requirements for the degree of Bachelor of
Science (B.Sc.) in Computer Science and Engineering, and approved as to its style and
contents.
………………………..… ………………………..
Examiner Examiner
……………………… …………………….
Supervisor Coordinator
Tahmina Aktar Trisha Md. Imran Hossain
Lecturer Lecturer and Coordinator
Dept. Of CSE Dept. Of CSE
Daffodil Institute of IT Daffodil Institute of IT
DECLARATION
I hereby declare that the work presented in this project, “Web application for harvesting data
from complementary websites”, has been carried out by me under the supervision of
Tahmina Aktar Trisha, Lecturer, Department of CSE, Daffodil Institute of IT. I also declare
that neither this project nor any part of it has been submitted elsewhere to any other
university, college, or organization for the award of any degree or diploma.
.……………………
Md. Shamim Bhuiyan
Reg: 14502001237
Session: 2014-15
ABSTRACT
ACKNOWLEDGEMENT
I would like to express my gratitude to the Almighty Allah, who gave us not only the
ability to design and program this system but also the patience to complete it. First of all,
my heartfelt appreciation and gratitude go to my project supervisor, Tahmina Aktar
Trisha, Lecturer, Department of Computer Science and Engineering, Daffodil Institute
of IT. I am obliged and thankful to her for her continuous encouragement, motivation, and
professional guidance during the work of this project, which has proven to be an integral
part of it. Without her valuable support and guidance, this project could not have reached
this level of development.
I also thank all the faculty members of the Department of Computer Science and
Engineering, Daffodil Institute of IT, for the valuable time they spent on requirements
analysis and evaluation of the project work.
Table of Contents
Page No.
APPROVALS……………………………………………………….. I
DECLARATION……………………………………………………. II
ABSTRACT……………………………………………………….….. III
ACKNOWLEDGEMENT…………………………………………. IV
1.5 Methodology…………………………………………….…… 3
1.5.1 Data sources……………………………………………… 3
1.6 Process Model ……………………………………….……. 4
1.6.1 Why incremental process model……………………..…. 4
1.7 Feasibility Study………………………………………..…… 5
1.7.1 Technical feasibility……………………….……..…… 5
1.7.2 Economical feasibility ………………………………… 5
1.7.3 Operational Feasibility……………………………….… 5
2.5 Other Requirements……………………………….……… 9
2.5.1 Software requirements of client …………….………. 9
2.6 Benefits of the system……………………...........………… 10
2.6.1 As a tool of marketing………...................……………..… 10
2.6.2 As a tool of data science……….........................……….... 10
2.6.3 Research…………………….................................………. 11
2.6.4 Some other potential benefits………….............………... 11
4. IMPLEMENTATION…………………………………… 21-34
4.1 Implementation………………………...........…………….. 23
6. CONCLUSION…………………………….....................……… 38-45
6.1 Conclusion…………………………………………………… 39
6.2 Appendix…………………………………………………….. 39
6.3 Reference…………………………………………………….. 40
LIST OF FIGURES
Figure 4.4.2(a): Home page for input links and choices………….. 30
CHAPTER: 1.0
PROJECT INTRODUCTION
1.1 Introduction
The goal of any system development is to develop and implement the system cost-effectively
and in the way best suited to its users; analysis is the heart of this process. Analysis is the
study of the various operations performed by the system (such as add, update, delete, and
search operations) and of the relationships maintained within the system. During analysis,
data are collected on the files, decision points, and transactions handled by the present
system.
Only the administrator can add data to the database. The data can be retrieved easily, are
well protected for personal use, and can be processed very quickly. The objective of this
system is to keep information easily accessible and maintainable.
There are some tools on the market that provide such features, taking a URL and returning
data to the user, but most of them are not very effective in today's market. Most of them
work offline, and the data they provide is not well formatted. That is why we are developing
this system, which frees users from the problems they currently face.
There are also systems that cover only a limited area of scraping. The limitations of the
existing systems are:
1.3 Objectives
Our aim is to reduce the limitations on obtaining data for different kinds of work. We know
that the internet holds a lot of data on every topic we deal with daily, but the tools for
getting that data when we need it are very limited. So we felt that creating such an
application is very important. If it is online and anyone can access it, it will be very helpful
for different professionals and students. This application is intended to make people's time
more productive: people can manually browse the internet and download some of the data
they need, but an expert in another important sector should spend that time on more
valuable work. You concentrate on your work, and our system will deliver the data into
your hands. So our objective is to provide a web application that will help to:
4. Format the harvested data into the required formats, such as JSON, XML, and CSV (a minimal sketch of this step follows below).
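As an illustration of this objective, here is a minimal sketch (not the project's own code) of writing harvested records to JSON and CSV using only the Python standard library; the sample records, field names, and file paths are assumptions made for this example.

# Sketch: writing harvested records to JSON and CSV with the standard library.
# The records, field names, and file paths are assumptions for this example.
import csv
import json

records = [
    {"title": "Example headline", "url": "https://example.com/a"},
    {"title": "Another headline", "url": "https://example.com/b"},
]

# JSON output
with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV output
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)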
1.4 Proposed Application
The proposal is a web application that is able to give actionable data to the user. My
proposed system is an application based on harvesting data from the internet: the system
takes an input, analyses it, and then returns data according to the user's choice. Some
restrictions will be applied to make sure that the data being provided is legal.
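As a rough sketch of this idea (assuming the requests and BeautifulSoup libraries that the appendix code also uses; the function name and tag choices are illustrative), the core flow of taking a URL and a chosen data type and returning matching content looks like this:

# Sketch of the core harvesting flow; the function name and tag choices are illustrative.
import requests
from bs4 import BeautifulSoup

def harvest(url, data_type):
    """Fetch a page and return the text of the elements matching the chosen data type."""
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, "html.parser")
    if data_type == "heading":
        elements = soup.find_all(["h1", "h2", "h3"])  # headline-style tags
    else:
        elements = soup.find_all("p")  # default: paragraphs
    return [el.get_text(strip=True) for el in elements]

# Example usage (URL is illustrative):
# print(harvest("https://example.com", "heading"))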
1.5 Methodology
The development process of “Internet data harvesting” follows the structure described
later in Software Analysis & Design.
In the data collection phase of this project, we collected two types of data:
Primary Data
Secondary Data
Primary data are generated by scraping the internet, where the data are publicly
available and can be collected without violating any legal rules; that is, all data are safe
and legal.
Secondary data are kept in the website's local database, where the data downloaded by
different users are safely stored for future use.
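If such secondary data were tracked in the site's local database, a small Django model along the following lines could be used; the model name and fields are assumptions for illustration and are not taken from the project code.

# Hypothetical Django model for keeping secondary (previously downloaded) data;
# the model name and fields are assumptions, not taken from the project code.
from django.conf import settings
from django.db import models

class HarvestedFile(models.Model):
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    source_url = models.URLField()
    data_type = models.CharField(max_length=20)      # e.g. 'heading', 'paragraph', 'image'
    stored_path = models.CharField(max_length=255)   # where the downloaded file is kept
    created_at = models.DateTimeField(auto_now_add=True)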
1.6 Process Model
In the incremental (iterative) model, the whole requirement is divided into various builds.
Multiple development cycles take place here, making the life cycle a “multi-waterfall”
cycle [1]. Cycles are divided into smaller, more easily managed modules. Each module
passes through the requirements, design, implementation, and testing phases. A working
version of the software is produced during the first module, so you have working software
early in the software life cycle. Each subsequent release of the module adds functions to
the previous release. The process continues until the complete system is achieved.
[Figure: Incremental process model — Initialization, Requirements, Planning, Design, Implementation, Verification, Deployment, Evaluation]
1.6.1 Why incremental process model
Generates working software quickly and early during the software life cycle.
This model is more flexible – less costly to change scope and requirements.
Lowers initial delivery cost.
1.7 Feasibility Study
A feasibility study determines whether the proposed solution is feasible or achievable for the
organization. There are three major areas of feasibility study:
Technical Feasibility
Economical Feasibility
Operational Feasibility
1.7.2 Economical Feasibility
The purpose of the economic feasibility assessment is to determine the positive economic
benefits that the proposed system will provide to the organization. Our system is
economically feasible because, by using the proposed system, much work can be done in a
short time that would not be possible by manpower within the same time. So we can say
that if they use the proposed system they will benefit economically. This is how the system
is economically feasible.
1.7.3 Operational Feasibility
Users can easily operate the proposed system because it is user friendly. It is easy to insert
inventory products and easy to create stocks. If the staff of the organization have basic
computer knowledge, they can operate the software easily. Every feature and activity
combined within the system is designed and developed according to the format they
previously used, with a more attractive user interface.
CHAPTER: 2.0
REQUIREMENT ENGINEERING
2.1 Requirement Engineering
Requirement engineering encompasses the tasks that lead to an understanding of what the
business impact of the application will be, what the user wants, and how end-users will
interact with the software.[2] Requirements engineering is defined in terms of its major
activities.
Understanding problems
Solution determination
Designing and building an elegant computer program that solves the wrong problem
serves no one’s need. That’s why it is important to understand what the customer wants
before we begin to design and build a system.
Admin requirement
System requirement
User requirement
2.2.1.8 The system needs a database for storing data and providing downloadable data.
2.4 Non-functional Requirements
Browser:
2.6.1 As a tool of marketing:
Competitor research.
Campaign study.
Market research.
Brand monitoring.
2.6.2 As a tool of Data Science:
2.6.3 Research:
2.6.4 Some other potential benefits:
All the information will be stored on the computer with its formatted screens and
built-in databases.
All the information can be handled more easily and quickly than with any other manual
process.
The admin can easily retrieve, at any time, all the information he has previously stored.
The admin can easily input the marks of the students.
The admin can manage students easily.
The security of this software is high; without the admin login details, no one can enter the
dashboard area.
CHAPTER: 3.0
ANALYSIS & DESIGN
In system analysis, a study of the system in as much detail as possible is carried out with
the help of some diagrams, i.e. the use case diagram, activity diagram, swimlane diagram,
data flow diagram, and entity relationship diagram.
[Figure: Use case diagram — actors: Admin, User; use cases: Register, Log in, Target URL, Harvesting data, Generate unique no., View result, Compare result, Download]
The entity relationship diagram (ERD) enables a software engineer to specify the data
objects that are input to and output from a system, the attributes that define the properties of
these objects, and their relationships [4]. ERDs provide a clear view of the logical structure
of data within the boundary of interest and allow the engineer to model the data without
considering its physical form. Some of the basic terms used in an ERD are described below:
Entity: An entity is an object with a physical existence, or it may be an object with a
conceptual existence.
Relationship: A relationship describes how entities are associated with each other. A
relationship is depicted by a diamond.
Foreign key: A foreign key is an attribute of a relation that refers to an existing attribute of
another relation, usually its primary key.
Relationship Cardinality:
[Figure: Entity relationship diagram — attributes include e-id, name, email, password]
3.4 Data Flow Diagram (DFD)
A Data Flow Diagram (DFD) is a graphical representation of the "flow" of data through
an information system, modeling its process aspects. A DFD is often used as a
preliminary step to create an overview of the system, which can later be elaborated. [8]
DFD can also be used for the visualization of data processing (structured design).
A DFD shows what kind of information will be input to and output from the system,
where the data will come from and go to, and where the data will be stored [5]. It does not
show information about the timing of processes or about whether processes operate in
sequence or in parallel (which would be shown on a flowchart).
Figure 3.4(a): Context-level DFD (the user registers and sends search/result requests to the system; the admin approves accounts)
Level 1 DFD
[Figure: Level 1 DFD — the User requests registration, which is handled by process 1.0 Registration and approved by the Admin; Data Processing follows]
3.5 Work Flow Diagram
A workflow diagram is a way of representing a business process, for better understanding,
using standard shapes that represent flows, decisions, processes, etc. [6] A workflow
diagram can be used in any business to clear up confusion in understanding. In this project,
the workflow diagram represents the overall process of the application:
[Figure: Workflow diagram — Access, Try link, Any obstacle? (Yes/No), Scrape data, Store temporarily, Analyze data, Download, End]
CHAPTER: 4.0
IMPLEMENTATION
4.1 Implementation
4.2 Workstation
The increasing capabilities of mainstream PCs in the late 1990s blurred the line between
them and technical/scientific workstations. The workstation market previously employed
proprietary hardware, which made workstations distinct from PCs; for instance, IBM used
RISC-based CPUs for its workstations and Intel x86 CPUs for its business/consumer PCs
during the 1990s and 2000s. However, by the early 2000s this difference had disappeared,
as workstations now use highly commoditized hardware dominated by large PC vendors,
such as Dell, HP, and Fujitsu, selling Microsoft Windows or GNU/Linux systems running
on x86-64 architectures such as Intel Core.
The specifications of the workstation used in this project are given below:
HDD: 500 GB
The most basic tools are a source code editor and a compiler or interpreter, which are
used ubiquitously and continuously. Other tools are used more or less depending on the
language, development methodology, and individual engineer, and are often used for a
discrete task, like a debugger or profiler. Tools may be discrete programs, executed
separately – often from the command line – or may be parts of a single large program,
called an integrated development environment (IDE). In many cases, particularly for
simpler use, simple ad hoc techniques are used instead of a tool, such as print
debugging instead of using a debugger, manual timing (of overall program or section of
code) instead of a profiler, or tracking bugs in a text file or spreadsheet instead of a bug
tracking system.
The distinction between tools and applications is murky. For example, developers use
simple databases (such as a file containing a list of important values) all the time as tools,
whereas a full-blown database is usually thought of as an application or software in its own
right. For many years, computer-assisted software engineering (CASE) tools were sought
after, but successful tools have proven elusive. In one sense, CASE tools emphasized design
and architecture support, such as for UML; but the most successful of these tools are IDEs.
An integrated development environment (IDE) normally consists of a source code editor,
build automation tools, and a debugger. Most modern IDEs offer intelligent code
completion; examples include Sublime Text and VS Code.
I used Sublime Text 3 for this project. A screenshot is given below:
4.3.2 Windows Command Prompt
To set up the development environment with Python, the following tools and packages were
used:
beautifulsoup4==4.7.1
lxml==4.3.4
Pillow==6.1.0
request==2019.3.22
requests==2.21.0
urllib3==1.24.1
etc.
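As a small, optional sanity check of the development environment (a sketch assuming Python 3.8 or later; it is not part of the project's setup), the installed versions of the packages listed above can be printed like this:

# Sketch (assumes Python 3.8+): print the installed versions of the packages above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["beautifulsoup4", "lxml", "Pillow", "requests", "urllib3"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")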
4.4 User Experience And Project Preview
After running this web application, the user first faces the authentication page, where only
authentic users can log in with the username and password previously given to them by the
admin.
Figure 4.4.1(b): Incorrect login alert
After successfully logging in, the user is redirected to the home page, where they can enter
a link to a complementary website from which they want to harvest data. There is also a
drop-down menu to select the type of data, a submit button, and a log-out link.
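For orientation, the views described here and listed in the appendix (login, home, scrap_data, and logout) could be wired to URLs roughly as in the following sketch; the app name, module layout, and URL patterns are assumptions for illustration, not the project's actual urls.py.

# urls.py (sketch): maps URLs to the views listed in the appendix.
# 'scraper' is an assumed app name; the real project's module names may differ.
from django.urls import path

from scraper import views

urlpatterns = [
    path("login/", views.login, name="login"),
    path("logout/", views.logout, name="logout"),
    path("", views.home, name="home"),
    path("scrap/", views.scrap_data, name="scrap_data"),
]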
Figure 4.4.2(a): Home page for input links and choices
4.4.3 Admin page
Users can only be added by the admin. The admin can add, delete, and limit users.
Figure 4.4.3(b): Django Administration page 1
Figure 4.4.3(d): User permission setup from admin
Figure 4.4.3(f): Important dates for user activity
4.4.4 Output
Users will find their data in the local folder of their computer, in the format they selected
in the input section.
Figure 4.4.4(b): Output as images
CHAPTER: 5.0
LIMITATION AND FUTURE ENHANCEMENT
5.1 Limitations
Like every other project, “Web application for harvesting data from complementary
websites” has some limitations as well. The main limitations are:
1. Multitasking: This website serves one task at a time rather than a group of tasks.
Specifically, it cannot take multiple links at a time; it just takes one link and goes through
that link to get data.
2. Accuracy in analysis: Every website has its own style of coding, so the analysis can
sometimes go wrong for that reason.
5.2 Future Enhancement
2. Make the website automated so that it can work with keywords rather than links.
CHAPTER: 6.0
CONCLUSION
6.1 Conclusion
Finally, this report demonstrates the achievements of the project and also presents an
assessment of its performance and reliability. The project makes extensive use of web
scraping and data mining technology. Moreover, it helped me to develop my coding skills
and a better understanding of technical methodologies. To conclude, I believe that the
current solution succeeded in meeting the project's requirements and deliverables. Even
though it has a series of limitations, it allows for further extensions, which would enable a
more in-depth understanding of data scraping and data mining. I hope this project will help
businessmen, scientists, marketers, students, and other users to develop their own scope of
knowledge.
6.2 Appendix
# Standard library, third-party, and Django imports used by the views below.
import os
import urllib.request
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup
import xlwt
from xlwt import Workbook

from django.contrib import auth, messages
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse
from django.shortcuts import redirect, render
def login(request):
if request.user.is_authenticated:
return redirect('home')
if request.method == 'POST':
username = request.POST.get('username')
password = request.POST.get('password')
print(password)
user = auth.authenticate(username=username, password=password)
if user is not None:
# correct username and password login the user
auth.login(request, user)
return redirect('home')
else:
messages.error(request, 'Error wrong username/password')
return render(request, 'login.html')
@login_required(login_url='login')
def home(request):
return render(request, "home.html")
def logout(request):
auth.logout(request)
return redirect('login')
@login_required(login_url='login')
def scrap_data(request):
e_links = request.POST['e_links']
e_tag = request.POST['e_tag']
print(e_links)
print(e_tag)
if e_tag == 'heading':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find_all("h2", class_="title")
f= open("result/headlines.txt", "w+")
for i in tb:
try:
ss = i.find("a").contents
# print(ss)
k = ss[0]
# # print(str(k))
# # print()
k = str(k)
print(k)
f.write(k + "\n")
except:
pass
# print("No")
f.close()
except:
pass
elif e_tag == 'paragraph':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# tb = soup.find('tr')
tb = soup.find_all("p")
# wb = Workbook()
# sheet1 = wb.add_sheet('Sheet 1')
# # add_sheet is used to create sheet.
f= open("result/paragraph.txt", "w+")
for i in tb:
try:
ss = i.contents
# print(ss[0])
k = ss[0]
# print(str(k))
# print()
k = str(k)
print(k)
f.write(k + "\n")
print(1)
except:
pass
# print("No")
f.close()
except:
pass
elif e_tag == 'telephone':
try:
url = e_links
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# tb = soup.find('tr')
tb = soup.find_all("td", colspan="2")
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
sheet1.write(0, 0, 'TelePhone Number')
count = 0
for i in tb:
j = i.contents
ss = j[0]
# print(ss)
try:
if ss[:3] == '+88' or ss[:2]=='01' :
count += 1
print(ss)
sheet1.write(count, 0, ss)
except:
print("no")
wb.save('result/telephone.xls')
except:
pass
elif e_tag == 'image':
try:
url = e_links
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
imgs = soup.findAll('img')
print("aa")
# user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
# headers={'User-Agent':user_agent,}
for img in imgs:
print("VV")
# print(img)
jf = img.get('src')  # the image's source URL
print(jf)
# j = img.get('src').read()
# fileName = basename(urlsplit(jf)[2])
# output = open(fileName,'wb')
# output.write(j)
# output.close()
# print(img.get('src'))
#'/home/asus/Desktop/defodilscrap/result/imagefile'
try:
my_path = 'C:/Users/Asus/Desktop/diitdata/result/imagefile'
#'/home/asus/Desktop/amarfile'
urllib.request.urlretrieve(jf, os.path.join(my_path, os.path.basename(jf)))
# with open(basename(jf), "wb") as f:
# f.write(requests.get(jf).content)
# uopen = urlopen(jf)
# stream = uopen.read()
# file = open('myfile.jpg','w')
# file.write(stream)
# file.close()
except: print('not found')
except:
pass
return HttpResponse("check your Folder")
6.3 References
[3] Jhon Mc, “Web Scraping and Crawling with Python: Beautiful Soup, Requests &
Selenium”, www.udemy.com, March 2009.
[4] Brayan Kylan, “Intro to Data Harvesting Algorithm”, www.kdnuggeds.com, April 2012.
[6] Alison Fitter, “Web Scraping Software and Tools”, www.fminer.com, 2009.
[9] Antonio Melé, Django 2 by Example: Build Powerful and Reliable Python Web
Applications from Scratch.