News Document
1. INTRODUCTION
1.1 Overview of the project
News is an important way to convey information. Among the tens of thousands of news items
generated every day, obtaining useful news is an important objective, and how to get news
conveniently and efficiently has become an important research direction. Full-featured
news-gathering platforms have become increasingly popular and have good development
prospects [1]. This paper designs and develops a convenient automatic news-gathering system.
The system uses a crawler to collect domestic news, saves it after deduplication, and finally
provides news services for retrieval and viewing. It can help users find similar news, extract
hot news that users are interested in, and improve the efficiency of reading news.
With the rapid development of the Internet, network media has become a new window through
which people understand the outside world, owing to its speed and wide reach. News is a
channel through which people learn about their surroundings, but thousands of news items are
produced on the Internet every day, and only some of them are what a given reader needs. How
to obtain the required news content from websites efficiently and accurately is therefore a
real need in people's lives. This system aims to collect news from specific websites and
return it to users on concise and clear pages. Users can search for specific keywords to select
the news they are interested in, which provides a degree of personalization. The system crawls
and processes domestic financial news content, making it convenient for people to work with
the information. To avoid duplicated information, the system also implements a self-defined
deduplication rule. The system is written in Python using the Scrapy and Django frameworks,
which simplifies the system code to a certain extent. The practical value of this system lies
in timely, efficient and convenient access to the domestic financial news that people care
about, need and are interested in.
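The self-defined deduplication rule is not specified in this report. As a minimal sketch, assuming a rule that treats two items as duplicates when their normalized titles match, together with a simple case-insensitive keyword filter, the idea could look like the following. The names `NewsItem`, `fingerprint`, `deduplicate` and `search` are hypothetical and are not taken from the project code.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsItem:
    """Hypothetical record for one crawled article."""
    title: str
    url: str
    body: str


def fingerprint(item: NewsItem) -> str:
    """Hash the title after lowercasing and stripping punctuation and spaces.

    Assumed rule: two items with the same normalized title are duplicates.
    """
    normalized = re.sub(r"[\W_]+", "", item.title.lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(items):
    """Keep the first item seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = fingerprint(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def search(items, keyword):
    """Return items whose title or body contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [i for i in items if needle in i.title.lower() or needle in i.body.lower()]
```

A title-based rule is cheap but coarse: reprints with edited headlines would slip through, so a real rule might also compare body text.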
2. LITERATURE REVIEW
Low efficiency.
Requires a large amount of code.
Deduplication is not supported.
Instant access.
Improved productivity.
Optimum utilization of resources.
Efficient management of records.
Simplification of the operations.
2.2.1. Advantages
High efficiency.
Simplifies code writing and improves the speed and efficiency of the crawlers.
Duplicate news items are not allowed.
3. SYSTEM DESIGN
Analysis is a logical process. The objective of this phase is to determine exactly what must be
done to solve the problem. Tools such as class diagrams, sequence diagrams, data flow
diagrams and a data dictionary are used in developing a logical model of the system.
A software life cycle model (also termed process model) is a pictorial and diagrammatic
representation of the software life cycle. A life cycle model represents all the methods
required to make a software product transit through its life cycle stages. It also captures the
structure in which these methods are to be undertaken.
A life cycle model maps the various activities performed on a software product from
its inception to retirement.
Different software development life cycle models have been specified and designed, and are
followed during the software development phase. These models are also called "Software
Development Process Models." Each process model follows a series of phases unique to its
type to ensure success in software development.
Waterfall Model
RAD Model
Spiral Model
Incremental Model
Iterative Model
Among all these models, the spiral model is one of the best.
Spiral Model
This SDLC model helps the group adopt elements of one or more process models, such as
waterfall and incremental. The spiral technique is a combination of rapid prototyping and
concurrency in design and development activities. Each cycle in the spiral begins with the
identification of objectives for that cycle.
All projects are feasible when provided with unlimited resources and infinite time.
Unfortunately, the development of a computer-based system or product is more likely to be
plagued by a scarcity of resources and difficult delivery dates. It is both necessary and
prudent to evaluate the feasibility of a project at the earliest possible time. Months or years
of effort, thousands or millions of dollars, and untold professional embarrassment can be
averted if an ill-conceived system is recognized early in the definition phase.
Feasibility and risk analysis are related in many ways. If project risk is great, the
feasibility of producing quality software is reduced. During product engineering, however,
we concentrate our attention on four primary areas of interest.
3.2.1 Technical Feasibility
This application is going to be used in an Internet environment, the World Wide Web (WWW).
It is therefore necessary to use a technology that can provide networking facilities to the
application. The application is also able to work in a distributed environment.
The GUI is developed using HTML to capture information from the customer, and HTML is
also used to display content in the browser. HTML is a markup language interpreted by the
browser, and pages are transferred over HTTP on top of the TCP/IP protocol.
3.2.2 Economic Feasibility
The economic issues that usually arise during the economic feasibility stage are whether the
system will be used if it is developed and implemented, and whether the financial benefits
equal or exceed the costs. The cost of developing the project includes the cost of conducting a
full system investigation and the cost of hardware and software for the class of application
being considered; the benefits take the form of reduced costs or fewer costly errors.
3.2.3 Operational Feasibility
In our application the front end is developed as a GUI, so it is very easy for the customer to
enter the necessary information. However, the customer must have some knowledge of using
web applications before using our application.
Functional requirements describe what the system should do. The functional
requirements can be further categorized as follows:
The input design is the link between the information system and the user. It comprises
developing specifications and procedures for data preparation, and the steps necessary to put
transaction data into a usable form for processing. This can be achieved by instructing the
computer to read data from a written or printed document, or by having people key the data
directly into the system.
A team of five members can complete the project in 2 to 4 months if they work full time on it.
3.4 Modules
3.4.1. Service Provider
In this module, the service provider has to log in using a valid user name and password. After
a successful login, the service provider can perform operations such as View Flight Delay
Data Set Details, Search & Predict Flight Delay Data Sets, Calculate and View All Flight
Delay Prediction, View All Flights with No Delay, View All Remote Users, View Actual Flight
Delay Results by Line Chart, View Actual Flight Delay Results, and View Flight Delay
Prediction Results.
3.4.2. User
In this module, there are n users. A user must register before performing any operations.
After successful registration, the user has to log in using an authorized user name and
password. After a successful login, the user can perform operations such as posting flight
delay data sets, searching and predicting flight delay data sets, and viewing their profile.
The input controls provide ways to ensure that only authorized users access the system,
guarantee valid transactions, validate the data for accuracy and determine whether any
necessary data has been omitted. The primary input medium chosen is the display. Screens
have been developed for data input using HTML.
Design of output
Design of output involves the following decisions
Information to present
Output medium
Output layout
The output of this system is presented in an easily understandable, user-friendly manner. The
layout of the output is decided through discussions with the different users.
Design of control
The system should offer the means of detecting and handling errors.
Input controls provide ways to:
Accept only valid transactions
Validate the accuracy of data
Ensure that all mandatory data have been captured
All entries to the system are validated, and tables are updated only for valid entries. If
incorrect entries are entered into the system by chance, means are provided to edit and
correct them.
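The input controls described above can be sketched in a few lines. This is only an illustration under stated assumptions: the mandatory fields (`title`, `url`, `published`) and the function names `validate_entry` and `save_if_valid` are hypothetical, and the "table" is a plain Python list rather than a database table.

```python
REQUIRED_FIELDS = ("title", "url", "published")  # hypothetical mandatory fields


def validate_entry(entry):
    """Return a list of error messages; an empty list means the entry is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            errors.append(f"missing mandatory field: {field}")
    url = str(entry.get("url") or "")
    if url and not url.startswith(("http://", "https://")):
        errors.append("url must start with http:// or https://")
    return errors


def save_if_valid(entry, table):
    """Append the entry to the table only when validation passes."""
    errors = validate_entry(entry)
    if not errors:
        table.append(entry)
    return errors
```

Returning the error list instead of raising lets the input screen show every problem at once, so the user can correct the entry and resubmit.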
As the strategic value of software increases for many companies, the industry looks for
techniques to automate the production of software and to improve quality and reduce cost and
time-to-market. These techniques include component technology, visual programming,
patterns and frameworks.
Businesses also seek techniques to manage the complexity of systems as they increase
in scope and scale. In particular, they recognize the need to solve recurring architectural
problems, such as physical distribution, concurrency, replication, security, load balancing and
fault tolerance.
The Unified Modeling Language (UML) was designed to respond to these needs.
Simply put, systems design refers to the process of defining the architecture, components,
modules, interfaces, and data for a system to satisfy specified requirements, which can be
done easily through UML diagrams.
In this project, four basic UML diagrams from the following list are explained:
Class Diagram
Use Case Diagram
Sequence Diagram
Activity Diagram
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, and the relationships between the classes.
This is one of the most important diagrams in development. The diagram breaks each
class into three layers (name, attributes and operations).
Relationships are drawn between the classes. Developers use the class diagram to
develop the classes. Analysts use it to show the details of the system.
Fig.3.3:UML Diagram
Fig.3.5:Sequence Diagram
Activity diagrams are a loosely defined diagram technique for showing workflows of
stepwise activities and actions, with support for choice, iteration and concurrency. In the
Unified Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity diagram shows
the overall flow of control.
Fig.3.6:Activity Diagram
4. IMPLEMENTATION
4.1. Python
Below are some facts about Python. Python is one of the most widely used multi-purpose,
high-level programming languages. Python allows programming in object-oriented and
procedural paradigms. Python programs are generally smaller than equivalent programs in
languages like Java. Programmers have to type relatively little, and the language's
indentation requirement keeps code readable. Python is used by many large technology
companies, such as Google, Amazon, Facebook, Instagram, Dropbox and Uber.
The biggest strength of Python is its huge collection of libraries, which can be used for the
following:
• Machine Learning
• Test frameworks
• Multimedia
Advantages of Python :-
1. Extensive Libraries
Python ships with an extensive library that contains code for various purposes, such as
regular expressions, documentation generation, unit testing, web browsers, threading,
databases, CGI, email, image manipulation, and more. So we don’t have to write the complete
code for these manually.
2. Extensible
Python can be extended with other languages. You can write some of your code in languages
like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complementary to extensibility, Python is embeddable as well. You can put your Python code
in the source code of a different language, like C++. This lets us add scripting capabilities
to our code in the other language.
4. Improved Productivity
The language’s simplicity and extensive libraries make programmers more productive than
languages like Java and C++ do. In addition, you need to write less to get more done.
5. IOT Opportunities
Since Python forms the basis of new platforms like Raspberry Pi, its future looks bright for
the Internet of Things. This is a way to connect the language with the real world.
6. Simple and Easy
When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code.
This is why when people pick up Python, they have a hard time adjusting to other more
verbose languages like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English.
This is the reason why it is so easy to learn, understand, and code. It also does not need
curly braces to define blocks, and indentation is mandatory. This further aids the readability
of the code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real
world. A class allows the encapsulation of data and functions into one.
9. Free and Open-Source
Python is freely available. Not only can you download Python for free, but you can also
download its source code, make changes to it, and even distribute it. It comes with an
extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you
need to code only once, and you can run it anywhere. This is called Write Once Run
Anywhere (WORA). However, you need to be careful enough not to include any
system-dependent features.
11. Interpreted
Lastly, Python is an interpreted language. Since statements are executed one by one,
debugging is easier than in compiled languages.
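To make the points about brevity and object orientation concrete, here is a small illustrative sketch. The class name `Article` and its methods are hypothetical; they only show how a class bundles data with the functions that operate on it, and how a single print call produces output.

```python
class Article:
    """Bundles data (title, body) with the functions that operate on it."""

    def __init__(self, title, body):
        self.title = title
        self.body = body

    def word_count(self):
        """Count whitespace-separated words in the body."""
        return len(self.body.split())

    def summary(self, max_words=5):
        """Return the first max_words words of the body."""
        return " ".join(self.body.split()[:max_words])


article = Article("Hello", "Python keeps short programs short and readable")
print(article.title, article.word_count(), article.summary(3))
```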
4.2. HISTORY OF PYTHON
What do the alphabet and the programming language Python have in common? Right,
both start with ABC. In the Python context, ABC means the programming language ABC, a
general-purpose programming language and programming environment developed in
Amsterdam, the Netherlands, at the CWI (Centrum Wiskunde & Informatica). The greatest
achievement of ABC was to influence the design of Python. Python was conceptualized in the
late 1980s. At that time, Guido van Rossum was working at the CWI on a project called
Amoeba, a distributed operating system. In an interview with Bill Venners, Guido van
Rossum said: "In the early 1980s, I worked as an implementer on a team building a language
called ABC at Centrum voor Wiskunde en Informatica (CWI). I don't know how well people
know ABC's influence on Python. I try to mention ABC's influence because I'm indebted to
everything I learned during that project and to the people who worked on it." Later in the
same interview, Guido van Rossum continued: "I remembered all my experience and some of
my frustration with ABC. I decided to try to design a simple scripting language that
possessed some of ABC's better properties, but without its problems. So I started typing. I
created a simple virtual machine, a simple parser, and a simple runtime. I made my own
version of the various ABC parts that I liked. I created a basic syntax, used indentation for
statement grouping instead of curly braces or begin-end blocks, and developed a small
number of powerful data types: a hash table (or dictionary, as we call it), a list, strings, and
numbers."
Before we take a look at the details of various machine learning methods, let's start by
looking at what machine learning is, and what it isn't. Machine learning is often categorized
as a subfield of artificial intelligence, but I find that categorization can often be misleading at
first brush. The study of machine learning certainly arose from research in this context, but in
the data science application of machine learning methods, it's more helpful to think of
machine learning as a means of building models of data. Fundamentally, machine learning
involves building mathematical models to help understand data. "Learning" enters the fray
when we give these models tunable parameters that can be adapted to observed data; in this
way the program can be considered to be "learning" from the data. Once these models have
been fit to previously seen data, they can be used to predict and understand aspects of newly
observed data. I'll leave to the reader the more philosophical digression regarding the extent
to which this type of mathematical, model-based "learning" is similar to the "learning"
exhibited by the human brain. Understanding the problem setting in machine learning is
essential to using these tools effectively, and so we will start with some broad categorizations
of the types of approaches we'll discuss here.
At the most fundamental level, machine learning can be categorized into two main
types: supervised learning and unsupervised learning. Supervised learning involves somehow
modeling the relationship between measured features of data and some label associated with
the data; once this model is determined, it can be used to apply labels to new, unknown data.
This is further subdivided into classification tasks and regression tasks: in classification, the
labels are discrete categories, while in regression, the labels are continuous quantities. We
will see examples of both types of supervised learning in the following section. Unsupervised
learning involves modeling the features of a dataset without reference to any label, and is
often described as "letting the dataset speak for itself." These models include tasks such as
clustering and dimensionality reduction. Clustering algorithms identify distinct groups of
data, while dimensionality reduction algorithms search for more succinct representations of
the data. We will see examples of both types of unsupervised learning in the following
section.
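The difference between the two categories can be sketched with the standard library alone. The example below is a toy illustration, not one of the methods used by this project: `nearest_neighbor_predict` is a supervised classifier that needs labels, while `two_means_1d` is an unsupervised clustering routine (k-means with k = 2 on one-dimensional data) that needs none. Both function names and the sample data are assumptions made for illustration.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: label a query with the label of its closest training point."""
    distances = [math.dist(p, query) for p in train_points]
    return train_labels[distances.index(min(distances))]


def two_means_1d(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters (k-means with k = 2).

    Centers start at the minimum and maximum value, so the result is deterministic.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(cluster_low) / len(cluster_low)
        if cluster_high:
            high = sum(cluster_high) / len(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)


# Classification: discrete labels are supplied with the training data.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
labels = ["sports", "sports", "finance", "finance"]
print(nearest_neighbor_predict(points, labels, (4.8, 5.1)))

# Clustering: no labels are supplied; groups emerge from the data alone.
print(two_means_1d([1, 2, 3, 10, 11, 12]))
```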
Human beings are, at this moment, the most intelligent and advanced species on earth
because they can think, evaluate and solve complex problems. AI, on the other hand, is still
in its initial stage and has not surpassed human intelligence in many aspects. The question,
then, is why we need to make machines learn. The most suitable reason is “to make decisions,
based on data, with efficiency and scale”. Lately, organizations have been investing heavily
in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to
extract key information from data, perform several real-world tasks and solve problems. We
can call these data-driven decisions taken by machines, particularly to automate the process.
Such data-driven decisions can be used, instead of programming logic, in problems that
cannot be programmed inherently. The fact is that we cannot do without human intelligence,
but we also need to solve real-world problems efficiently at a huge scale. That is why the
need for machine learning arises.
Quality of data − Having good-quality data for ML algorithms is one of the biggest
challenges. Use of low-quality data leads to the problems related to data preprocessing and
feature extraction.
No clear objective for formulating business problems − Having no clear objective and well-
defined goal for business problems is another key challenge for ML because this technology
is not that mature yet.
Curse of dimensionality − Another challenge an ML model faces is having too many features
per data point. This can be a real hindrance.
4.5 MYSQL
MySQL is a relational database management system (RDBMS), which organizes data in the
form of tables. MySQL is one of many database servers based on the RDBMS model, which
addresses three specific things: data structures, data integrity and data manipulation. With
MySQL's cooperative server technology we can realize the benefits of open, relational
systems for all applications. MySQL makes efficient use of system resources on all hardware
architectures to deliver high performance, price performance and scalability. Any DBMS that
is to be called an RDBMS has to satisfy Dr. E. F. Codd’s rules.
MySQL is portable
The MySQL RDBMS is available on a wide range of platforms, ranging from PCs to
supercomputers, and as a multi-user loadable module for Novell NetWare. If you develop an
application on one system, you can run the same application on other systems without any
modifications.
MySQL is compatible
MySQL commands can be used for communicating with the IBM DB2 mainframe RDBMS,
which is different from MySQL; that is, MySQL is compatible with DB2. The MySQL RDBMS
is a high-performance, fault-tolerant DBMS, which is specially designed for online transaction
processing and for handling large database applications.
Multithreaded server architecture
MySQL's adaptable multithreaded server architecture delivers scalable high performance for
a very large number of users on all hardware architectures, including symmetric
multiprocessors (SMPs) and loosely coupled multiprocessors. Performance is achieved by
eliminating CPU, I/O, memory and operating system bottlenecks and by optimizing the
DBMS server code to eliminate all internal bottlenecks.
4.5.1 Features of MYSQL
One of the most popular RDBMSs in the market because of its ease of use.
Client/server architecture.
Data integrity and data security.
Parallel processing support to speed up data entry and online transaction processing for
applications.
DB procedures, functions and packages.
Dr. E. F. CODD’s RULES
These rules are used for evaluating whether a product can be called a relational database
management system. Out of the 12 rules, an RDBMS product should satisfy at least 8, plus
rule 0, which must always be satisfied.
RULE 0. FOUNDATION RULE
For any system that is to be advertised as, or claimed to be, a relational DBMS, that system
should manage the database within itself, without using an external language.
RULE 1. INFORMATION RULE
All information in a relational database is represented at the logical level in only one way,
as values in tables.
RULE 2. GUARANTEED ACCESS
Each and every datum in a relational database is guaranteed to be logically accessible by
using a combination of table name, primary key value and column name.
RULE 3. SYSTEMATIC TREATMENT OF NULL VALUES
Null values are supported for representing missing information and inapplicable information.
They must be handled in a systematic way, independent of data types.
RULE 4. DYNAMIC ONLINE CATALOG BASED ON THE RELATIONAL MODEL
The database description is represented at the logical level in the same way as ordinary data
so that authorized users can apply the same relational language to its interrogation as they do
to the regular data.
RULE 5. COMPREHENSIVE DATA SUB LANGUAGE
A relational system may support several languages and various modes of terminal use.
However, there must be one language whose statements can express all of the following: data
definitions, view definitions, data manipulations, integrity constraints, authorization and
transaction boundaries.
RULE 6. VIEW UPDATING
Any view that is theoretically updatable can be updated if changes can be made to the tables
that effect the desired changes in the view.
RULE 7. HIGH LEVEL UPDATE, INSERT and DELETE
The capability of handling a base relation or derived relation as a single operand applies
not only to the retrieval of data but also to its insertion, updating, and deletion.
RULE 8. PHYSICAL DATA INDEPENDENCE
Application programs and terminal activities remain logically unimpaired whenever any
changes are made in either storage representation or access method.
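Rules 2, 3 and 4 can be illustrated with a few queries. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a convenient stand-in for MySQL; the `news` table and its columns are hypothetical, and the catalog table `sqlite_master` is specific to SQLite (MySQL exposes its catalog through `information_schema` instead).

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT NOT NULL, source TEXT)"
)
conn.executemany(
    "INSERT INTO news (id, title, source) VALUES (?, ?, ?)",
    [(1, "Rates rise", "Example Wire"), (2, "Stocks fall", None)],
)

# Rule 2: one value is addressed by table name, primary key value and column name.
title = conn.execute("SELECT title FROM news WHERE id = ?", (2,)).fetchone()[0]

# Rule 3: NULL marks missing information and is tested with IS NULL, not with '='.
missing = conn.execute("SELECT id FROM news WHERE source IS NULL").fetchall()
equals_null = conn.execute("SELECT id FROM news WHERE source = NULL").fetchall()

# Rule 4: the catalog is queried with the same language as ordinary data.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

print(title, missing, equals_null, tables)
```

The comparison `source = NULL` matches no rows because a comparison with NULL is never true, which is why the dedicated `IS NULL` test exists.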
5. TESTING
5.1 Software Testing Techniques
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. Testing presents an interesting anomaly
for the software engineer.
2. A good test case is one that has a high probability of finding an as-yet-undiscovered error.
5.1.2 Test cases
S.No. | Test Input | Expected Behavior | Observed Behavior | Status (P = Passed, F = Failed)
1 | Login as user or admin with correct login details | Home page for the administrator (manager) or user should be displayed | -do- | P
2 | Login as user or admin with wrong login details | Error message should be displayed | -do- | P
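The two login test cases above could be automated along the following lines. This is a sketch only: the `USERS` dictionary and the `login` function are hypothetical stand-ins for the real system's authentication, and the returned page names are assumptions.

```python
import unittest

# Hypothetical credential store and login routine, standing in for the real system.
USERS = {"manager": "m-secret", "alice": "a-secret"}


def login(username, password):
    """Return the page the system would display for a login attempt."""
    if username in USERS and USERS[username] == password:
        return "manager_home" if username == "manager" else "user_home"
    return "error: invalid login details"


class LoginTests(unittest.TestCase):
    def test_correct_details_show_home_page(self):
        # Test case 1: correct details lead to the matching home page.
        self.assertEqual(login("manager", "m-secret"), "manager_home")
        self.assertEqual(login("alice", "a-secret"), "user_home")

    def test_wrong_details_show_error(self):
        # Test case 2: wrong details produce an error message.
        self.assertTrue(login("alice", "wrong").startswith("error"))
        self.assertTrue(login("nobody", "x").startswith("error"))
```

Saved to a file, the tests can be run with `python -m unittest` followed by the file name.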
This testing is also called glass box testing. In this testing, by knowing the internal
operation of a product, tests can be conducted to ensure that “all gears mesh”, that is, the
internal operation performs according to specification and all internal components have been
adequately exercised. It is a test case design method that uses the control structure of the
procedural design to derive test cases. Basis path testing is a white box testing technique.
Condition testing
Data flow testing
Loop testing
Black Box Testing
In this testing, by knowing the specified function that a product has been designed to
perform, tests can be conducted that demonstrate each function is fully operational while at
the same time searching for errors in each function. It fundamentally focuses on the
functional requirements of the software.
The methods used in black box test case design are:
Graph based testing methods
Equivalence partitioning
Boundary value analysis
Comparison testing
Graph matrices
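Two of the methods listed above, equivalence partitioning and boundary value analysis, can be illustrated with a short sketch. The rule being tested (a user name must be 3 to 20 characters long) and the function names are assumptions chosen for illustration, not requirements taken from this system.

```python
def boundary_values(low, high):
    """Boundary value analysis for an inclusive integer range [low, high]."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def partition(value, low, high):
    """Equivalence partitioning: name the class a value falls into."""
    if value < low:
        return "invalid_low"
    if value > high:
        return "invalid_high"
    return "valid"


# Hypothetical rule: a user name must be 3 to 20 characters long.
cases = {n: partition(n, 3, 20) for n in boundary_values(3, 20)}
print(cases)
```

One test input per equivalence class plus the values at each boundary gives good coverage with only six cases, instead of testing every possible length.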
Integration testing is a systematic technique for constructing the program structure while
conducting tests to uncover errors associated with interfacing. The objective is to take
unit-tested methods and build a program structure that has been dictated by design.
6. FUTURE ENHANCEMENT
If a website is accessed frequently, it may detect the crawlers and block them. To address
this problem, a strategy for handling anti-crawling measures can be configured to avoid
system failure. On the display side, the pages can be further optimized to make the
interface more concise and intuitive; on the functional side, the system's functions can be
further expanded. These are the goals and directions for this system, to be achieved
through step-by-step optimization.
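One possible way to reduce the chance of being blocked is to crawl politely rather than to evade detection. The fragment below is a sketch of a Scrapy `settings.py` using standard Scrapy setting names; the specific values are illustrative assumptions and would need tuning for the sites actually crawled.

```python
# settings.py fragment for a Scrapy project (values are illustrative assumptions).

ROBOTSTXT_OBEY = True                 # respect each site's robots.txt
DOWNLOAD_DELAY = 2.0                  # seconds to wait between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the wait between 0.5x and 1.5x of the delay
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep parallel requests per site low

AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 2.0        # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 30.0         # upper bound on the delay under high latency

RETRY_TIMES = 2                       # retry failed requests a limited number of times
```

Slower crawling trades freshness for reliability, which suits a news collector that revisits the same few sites many times a day.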
7. CONCLUSION
This system makes every effort to facilitate the processing of news information for users
and presents the news obtained from various websites to them. The simple and efficient
interface enables users to read the news clearly. The system crawls and displays only the
key information of each news item and ignores unnecessary information, so that users can
more quickly find the content they are interested in or need. In short, this system, as a
comprehensive information analysis and retrieval tool, will facilitate people's lives to a
certain extent. The system is certainly not perfect: many further functions can be
expected, and some deficiencies can be improved. For example, the system currently
crawls only a few sites, and the number of crawled sites can be expanded to make the
news content richer and more complete.
Home page
Manager login:
Manager Home
View users:
Add news:
View news:
User Registration:
User login:
User home:
View news:
APPENDIX-B: REFERENCES
1. Roger S. Pressman, “Software Engineering - A Practitioner’s Approach”, McGraw-Hill
International Editions, Fifth Edition, 2001.
2. Henry F. Korth, S. Sudarshan, “Database System Concepts”, McGraw-Hill
International Editions, Fourth Edition, 2002.
3. George Koch, Kevin Loney, “Oracle - The Complete Reference”, Tata
McGraw-Hill, Third Edition, 2001.
4. Herbert Schildt, Patrick Naughton, “Java 2: The Complete Reference”, TMH,
Third Edition, 1999.
5. James Jaworski, “Mastering JavaScript”, TMH, Third Edition, 2000.
6. Karl Avedal, “JSP Architecture”, Tata McGraw-Hill International Editions,
Fourth Edition, 2002.