
2023 IEEE International Conference on Image Processing and Computer Applications (ICIPCA), August 11-13, 2023, Changchun, China

Research on Text Generation Model of Natural Language Processing Based on Computer Artificial Intelligence

Zhijian Zhao
Nanjing Engineering Branch, Jiangsu Union Technical Institute
Nanjing, China
[email protected]

DOI: 10.1109/ICIPCA59209.2023.10257788

Abstract—This paper proposes a text generation system model for natural language processing based on artificial intelligence technology. In this model, the pre-trained dynamic word vector model ALBERT is used instead of the traditional BERT baseline for feature extraction to obtain word vectors. Through ontology, the factors that affect the semantic similarity of concepts, such as node density, node depth and node hierarchy order, are refined, and the semantic distance, the relationships between concepts, the attributes of concepts and the levels of concepts are considered comprehensively. Experiments on a public data set show that the BLEU score and the manual evaluation index of the algorithm on the test set are significantly improved compared with the baseline model, which demonstrates the effectiveness of the algorithm.

Keywords—computer, artificial intelligence, natural language, text generation model

I. INTRODUCTION

A knowledge base is the product of the combination of artificial intelligence and database technology. With the development of computer technology, its application has entered a wide range of non-numerical processing fields. As knowledge bases grow, the knowledge they hold becomes richer, including common-sense knowledge, rational knowledge, empirical knowledge and meta-knowledge. Knowledge management has therefore become a prominent problem: how to effectively utilize, organize, store, manage, maintain and update large-scale knowledge, and how to make effective use of stored knowledge for reasoning and problem solving. Knowledge maintenance is thus an important part of knowledge base management and is directly related to the quality of the knowledge system. Natural language processing (NLP) technology is booming, driven by deep learning, and has surpassed statistics-based machine learning methods in machine translation, text classification and other tasks. On the one hand, LSTM and other recurrent neural network models, combined with word embedding models such as Word2vec and GloVe, can mine sequence-level information in text and avoid the feature loss of traditional machine learning models [1]. On the other hand, pre-trained language models represented by BERT and XLNet, trained on massive data sets with unsupervised objectives, have deeper network structures and larger scale, further alleviate problems such as polysemy, achieve strong performance on multiple NLP tasks, and have become a new milestone in the field of NLP. From the perspective of natural language processing, concept matching is based on semantic similarity, and concept matching is the key to text understanding and online information retrieval. The basis of semantic similarity calculation is the support of a large thesaurus and corpus. This paper proposes an algorithm for semantic similarity between natural language concepts based on a well-designed domain ontology.

II. THEORETICAL BASIS

A. Semantic Basis

The function of a noun phrase in Chinese is much greater than that of a single noun: a single noun can only express a simple meaning, while a noun phrase can express a more complex and complete semantic meaning. The semantic knowledge representation of noun phrases is of course based on the semantic knowledge representation of nouns. A noun phrase is a centered structure with a noun as the core word. The natural language understanding of noun phrases is essentially a process of determining whether the collocation relationships between the nouns and modifiers in a phrase are reasonable and of identifying the core words. Therefore, the classification of noun phrases is the premise of correctly understanding noun phrases [2]. The natural language understanding system described in this paper is based on the field of product design. In this system, noun phrases are divided into three categories: biased (modifier-head) structures, associative structures and event nouns.

B. Concept Dependency Theory

Any natural language, whether English or Chinese, contains tens of thousands of concepts. Understanding and processing so many concepts one by one is undoubtedly a very large and complex project. If concepts are treated from an abstract point of view, the problem is simplified. Conceptual subordination holds when the extension of one concept includes the entire extension of another concept. After this dependency relationship is introduced, knowledge is divided into a series of nodes, with tree-structured connections between knowledge nodes [3]. The dependency relationship is used to abstract the knowledge blocks of a specific domain, so that they form a tree-like hierarchy of knowledge nodes at different levels of abstraction. This kind of tree model with dependency relationships is called a concept dependency tree (CDT). Figure 1 shows the concept dependency tree obtained after refining the knowledge structure of the triangle.
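As a concrete illustration of the data structure described above, the following is a minimal sketch of a concept dependency tree in Python. The node fields, the triangle sub-concepts and the subsumption check are illustrative assumptions, not the authors' implementation.

class ConceptNode:
    """A node in a concept dependency tree (CDT): one concept plus its sub-concepts."""

    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = set(attributes or [])
        self.children = []          # more specific concepts subordinate to this one
        self.parent = None

    def add_child(self, child):
        """Attach a subordinate concept whose extension is included in this concept's."""
        child.parent = self
        self.children.append(child)
        return child

    def subsumes(self, other):
        """True if `other` lies in this concept's subtree, i.e. this concept's
        extension includes the extension of `other` (conceptual subordination)."""
        node = other
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False


# A triangle knowledge block in the spirit of Figure 1 (sub-concepts are hypothetical).
triangle = ConceptNode("triangle", {"three sides", "three angles"})
right = triangle.add_child(ConceptNode("right triangle", {"one 90-degree angle"}))
isosceles = triangle.add_child(ConceptNode("isosceles triangle", {"two equal sides"}))
equilateral = isosceles.add_child(ConceptNode("equilateral triangle", {"three equal sides"}))

print(triangle.subsumes(equilateral))   # True: triangle's extension covers equilateral
print(right.subsumes(isosceles))        # False: sibling concepts do not subsume each other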

Fig. 1. Triangular concept dependency tree after knowledge refinement

III. NATURAL LANGUAGE PROCESSING TEXT GENERATION SYSTEM MODEL

A. Architecture

The intelligent resource retrieval system consists of two parts: a resource preprocessing subsystem and a resource retrieval subsystem. On the one hand, the resource preprocessing subsystem is responsible for obtaining and describing the meta information of all kinds of resources and for building the index database of resource meta information by combining Chinese information processing technology with indexing technology [4]. On the other hand, it is responsible for constructing the thesaurus and related-word thesaurus used for keyword expansion, so as to realize intelligent retrieval. The resource preprocessing subsystem is the foundation of the whole system (it obtains the required resources from massive and diversified information systems) and provides the data source, at multiple levels, for the subsequent retrieval functions. Specifically, the functions of the resource preprocessing subsystem are: (1) obtaining the meta information of each resource and describing it with an XML file; in other words, the system provides the format definition of the XML files used to describe resource meta information, and each resource has a corresponding XML description; (2) combining the XML format definition with natural language processing and indexing technology to extract information from the meta-information description files and build the index library; (3) constructing the related-word thesaurus and the thesaurus from the massive resource meta information. Figure 2 describes the architecture of the intelligent resource retrieval system.

Fig. 2. Architecture of intelligent resource retrieval system
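To make functions (1) and (2) above concrete, below is a hypothetical XML meta-information record and a minimal Python snippet that reads it into an index entry. The element names and fields are assumptions made for illustration; the paper does not publish its actual schema.

import xml.etree.ElementTree as ET

# A hypothetical meta-information description file for one resource.
RESOURCE_XML = """
<resource id="res-001">
  <title>Introduction to Product Design Ontology</title>
  <type>document</type>
  <keywords>
    <keyword>ontology</keyword>
    <keyword>product design</keyword>
  </keywords>
  <abstract>Domain ontology for describing product design knowledge.</abstract>
</resource>
"""

def build_index_entry(xml_text):
    """Parse one meta-information record into a flat entry for the index library."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default=""),
        "type": root.findtext("type", default=""),
        "keywords": [k.text for k in root.findall("./keywords/keyword")],
        "abstract": root.findtext("abstract", default=""),
    }

entry = build_index_entry(RESOURCE_XML)
print(entry["id"], entry["keywords"])   # res-001 ['ontology', 'product design']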
B. System Functions

The knowledge base system based on domain natural language understanding and its anomaly detection were described above. To assist knowledge engineers in acquiring, organizing and managing domain knowledge, a domain knowledge base management system is designed and implemented. The system integrates the following key tools:

1) Concept dependency tree editing and management tool
Object-oriented technology is used to represent knowledge in the form of a concept dependency tree, and the tool provides functions for adding, deleting and modifying knowledge, meeting the requirements of knowledge management as far as possible (Figure 3).

Fig. 3. Concept dependency tree tool main interface

2) Domain process tree editing and management tool
The domain process tree is used to describe the global context of events in the domain. The domain contains a central event, which represents a concept or generalization and is completed by a number of sub-events.

3) Domain knowledge management platform
The platform assists knowledge engineers in checking the redundancy and consistency of acquired knowledge.

The short-text writing task is relatively simple for the generation model. The short length of such texts avoids producing a large number of detectable machine features, which makes text detection difficult. Human readers, however, tend to notice the contradictions or semantic errors in an article, which makes up for the weakness of machine detection in deep semantic understanding and can greatly improve the detection accuracy for short texts [5]. Machines, on the other hand, are better at modeling the statistical distribution of text, an advantage that works well for long texts, such as online news, whose underlying statistical patterns humans often cannot discern. Therefore, establishing a human-machine cooperation mechanism in detection work can expand the scope of social text detection and improve the overall detection accuracy, letting each side play to its strengths in the detection process.

IV. RESEARCH ON ADVERSARIAL TEXT GENERATION BASED ON WORD SUBSTITUTION

A. Word Sorting

This paper starts from two perspectives, the influence of the words themselves in an input X ∈ χ on the classification decision and the influence of synonym replacement on the classification decision, and measures the importance of the words comprehensively before sorting them. How to determine an effective word ordering is the focus of this paper [6]. The method is first to build an interpretable substitute model to obtain the importance score of each word itself, namely the word importance scoring strategy. Then the important words found in this step are replaced with synonyms; after stratified sampling according to semantic similarity, the difference in the decision probability of the target model is calculated, namely the word replacement scoring strategy. Finally, the words are sorted in descending order by combining the scores obtained in the two steps above.

B. Scoring strategy of word importance

Compared with the white-box setting, the black-box scenario is closer to reality; however, in the black-box scenario the training data of the target model under attack cannot be accessed. In this paper, the framework mentioned in the literature is used to train substitute models on similar data sets so as to imitate the decision making of the target model.

F_S denotes the substitute model, where S is the abbreviation of Substitute. A substitute data set D_S = {x_S, y_S} is collected to train F_S. The purpose of training F_S is to learn a network that can select the k most important words from X ∈ x_S, namely a selector ε(X), such that the k selected words are sufficient for F_S to correctly predict the label Y corresponding to X. For a given positive integer k, let ρ_k be the set of all possible index subsets of size k:

ρ_k = {S ⊆ {1, …, d} : |S| = k}   (1)

The indices of the k words are denoted by S, and the corresponding subvector of X by X_S. The selector defines a distribution over these subsets:

S ~ P(S | X) = ε(X)   (2)

where S ∈ ρ_k and X_S ∈ R^k. The subset size k is tuned as a hyperparameter, and the goal is to find the ε that maximizes the mutual information I(X_S; Y):

max_ε I(X_S; Y)  subject to  S ~ ε(X)   (3)

In the attack, the logit output of the last layer of ε(X) is taken as the importance score α_i, i ∈ [1, n], of each word in the input X = {x_1, x_2, …, x_n}, before the k-word selection step.
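A minimal sketch of this importance-scoring step is given below. The selector ε and the substitute classifier F_S are toy stand-ins with random weights so that the example runs end to end; the paper does not specify their architectures, so everything numeric here is an assumption made only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained substitute model F_S and selector eps(X).
# In the paper these are trained on a substitute data set D_S = {x_S, y_S};
# here the weights are random purely so the sketch is runnable.
VOCAB, DIM, CLASSES = 1000, 16, 2
emb = rng.normal(size=(VOCAB, DIM))
w_select = rng.normal(size=DIM)          # selector head: one logit per word
w_cls = rng.normal(size=(DIM, CLASSES))  # substitute classifier F_S

def selector_logits(token_ids):
    """Last-layer logits of eps(X): one importance score alpha_i per word."""
    return emb[token_ids] @ w_select

def substitute_predict(token_ids):
    """F_S prediction from the mean of word embeddings (toy classifier)."""
    h = emb[token_ids].mean(axis=0)
    z = h @ w_cls
    return np.exp(z) / np.exp(z).sum()

x = rng.integers(0, VOCAB, size=12)      # token ids of the input sentence X
k = 3
alpha = selector_logits(x)               # importance scores alpha_i, cf. eqs. (2)-(3)
top_k = np.argsort(alpha)[::-1][:k]      # indices S of the k most important words
print("alpha:", np.round(alpha, 2))
print("top-k word positions:", top_k, "F_S(X):", np.round(substitute_predict(x), 3))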
C. Word replacement scoring strategy

Synonyms are selected for each x_i in X = {x_1, x_2, …, x_n} based on the synonym-oriented word vector set designed in the literature. The T words with the greatest cosine similarity to x_i are selected from this word vector set as the synonym set C. Each c_j ∈ C is substituted for x_i in turn to form a new sentence X_ij, and the semantic similarity between X_ij and X is calculated and stored in H:

Sim(X_ij, X) = Cos(Encoder(X_ij), Encoder(X))   (4)

X_ij = x_1 x_2 … c_j … x_n,  j ∈ [1, K]   (5)

Here Encoder(·) is a general-purpose sentence encoder used to vectorize sentences, Sim(·) is the semantic similarity function, and Cos(·) is the cosine similarity [7]. The entries of H whose semantic similarity is higher than the threshold β are retained, and stratified sampling is carried out. The semantic similarities within each stratum are close, that is, the sentences in each stratum express similar meanings.
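The following sketch illustrates equations (4)-(5): building the candidate sentences X_ij, scoring them with a sentence encoder, and keeping those above the threshold β. The encoder here is simply the mean of random word vectors, an assumption made only so the example is self-contained; in practice any general-purpose sentence encoder could play this role.

import numpy as np

rng = np.random.default_rng(1)
DIM = 32
vec = {}  # lazily assigned random word vectors standing in for a real embedding table

def word_vec(w):
    if w not in vec:
        vec[w] = rng.normal(size=DIM)
    return vec[w]

def encode(sentence):
    """Toy sentence encoder: mean of word vectors (stand-in for Encoder())."""
    return np.mean([word_vec(w) for w in sentence], axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X = ["the", "movie", "was", "great"]
i = 3                                    # position of the word x_i to replace
synonyms = ["excellent", "fine", "bad"]  # candidate set C for x_i (illustrative)
beta = 0.5                               # similarity threshold

H = []
for c_j in synonyms:
    X_ij = X[:i] + [c_j] + X[i + 1:]     # eq. (5): substitute c_j for x_i
    sim = cos(encode(X_ij), encode(X))   # eq. (4): Sim(X_ij, X)
    if sim > beta:                       # keep only sufficiently similar candidates
        H.append((c_j, sim))

print(sorted(H, key=lambda t: -t[1]))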

From each stratum, one candidate h_ij^l ∈ H (l ∈ [1, L]) is randomly selected, and the class probability difference ΔP_i^l between X_ij^l and X is calculated with the target model:

ΔP_i^l = P(Y_true | X) − P(Y_true | X_ij^l),  l ∈ [1, L]   (6)

ΔP_i* = max_l ΔP_i^l,  i ∈ [1, n]   (7)

For each x_i ∈ X, the idea of stratified sampling is thus used to find the synonym that has the greatest influence on the classification decision of the target model and to give it a high score ΔP_i*. Compared with the approach of Ren et al., which scores every synonym of each x_i ∈ X, this procedure evaluates only a few synonyms, reducing the number of queries to the target model.

Score_i^final = ΔP_i* · α_i   (8)

The importance score α_i of each word in X = {x_1, x_2, …, x_n} is combined with the maximum difference ΔP_i* in the decision probability of the target model after synonym replacement to obtain Score_i^final; the words x_i ∈ X are then sorted in descending order of Score_i^final, which gives the order of word replacement in the attack.
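A minimal sketch of equations (6)-(8) follows: the retained candidates are grouped into similarity strata, one candidate is drawn per stratum, the drop in the target model's probability for the true class is measured, and the largest drop is combined with the importance score α_i. The target model and the candidate list are toy stand-ins, assumed only so the sketch runs; they are not the authors' models.

import numpy as np

rng = np.random.default_rng(2)

def target_prob_true(sentence):
    """Toy stand-in for P(Y_true | sentence) returned by the black-box target model."""
    h = abs(hash(" ".join(sentence))) % 1000 / 1000.0
    return 0.5 + 0.5 * h  # some value in [0.5, 1.0)

def strata(candidates, num_layers=3):
    """Split (word, similarity) candidates into layers of close similarity."""
    ordered = sorted(candidates, key=lambda t: -t[1])
    return [ordered[l::num_layers] for l in range(num_layers) if ordered[l::num_layers]]

X = ["the", "movie", "was", "great"]
i = 3
alpha_i = 1.7                                   # importance score of x_i from the selector
H = [("excellent", 0.93), ("fine", 0.88), ("nice", 0.84), ("bad", 0.66), ("awful", 0.61)]

p_orig = target_prob_true(X)
delta_best, word_best = 0.0, None
for layer in strata(H):
    c_j, _ = layer[rng.integers(len(layer))]    # randomly pick one candidate per stratum
    X_ij = X[:i] + [c_j] + X[i + 1:]
    delta = p_orig - target_prob_true(X_ij)     # eq. (6): drop in true-class probability
    if delta > delta_best:                      # eq. (7): keep the maximum drop
        delta_best, word_best = delta, c_j

score_final = delta_best * alpha_i              # eq. (8): combine with importance score
print(word_best, round(delta_best, 3), round(score_final, 3))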
V. NATURAL LANGUAGE PROCESSING INFORMATION NEEDS

Weak information demand means that a user's demand is not clearly defined and the user cannot provide clear keywords to the web page; even if certain keywords are entered, it is difficult to retrieve the expected results from the various resources provided by language processing [8]. Such demand has to be satisfied by exploring or reading a large number of digital resources. The model proposed in this paper is mainly intended to address this kind of user need.

A. Model Definition

When expressing weak information demand, language processing users may carry characteristic values such as registration attribute values, demand preference values and weak keywords. Based on the registration information demand model and the strong information demand model, this paper defines a weak information demand model built on the characteristic values possessed by users. The structure and content of the model are shown in Figure 4.

Fig. 4. User weak information demand model

Since weak information demand has the characteristics of a conscious demand, whether a demand is strong or weak information demand must be judged by users themselves. When users use keywords to request information resource services, two situations arise: in the first, they can retrieve information resources, but whether these resources meet their needs has to be determined by the users themselves [9]; if users judge that their needs are not fully met, they can define them as weak keyword-related requirements. Similarly, users in this model can be divided into registered users and non-registered users. If the user is a registered user, then steps 1 and 2 in Figure 4 are performed; otherwise, only step 2 is performed. The following are the two similarity calculation methods corresponding to these two situations.

B. Similarity Calculation of Registered Users

If the user is a registered user, a two-tier similarity calculation is performed. First, the similarity sim(Si, Tj) between the demand attribute value vector Si, composed of the registration attribute values and demand preference values submitted by the user during registration, and the demand topic library vector Tj is calculated. Then the similarity sim(Ki, Tj) between the weak keyword vector Ki and the demand topic library vector Tj is calculated.

C. Similarity Calculation of Non-registered Users

If the user is a non-registered user, only the similarity sim(Ki, Tj) between the weak keyword vector Ki and the demand topic library vector Tj is calculated, because the demand attribute values that would come from registration information are missing [10]. When the similarity result is greater than a certain threshold, all top subject terms in the demand topic library vector that satisfy the condition are pushed to the user, providing a personalized push service.
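The two-tier matching just described can be sketched as follows. The vectors, the weighting of the two tiers and the threshold are illustrative assumptions, since the paper does not specify how sim(Si, Tj) and sim(Ki, Tj) are combined.

import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_topics(K_i, topic_library, S_i=None, threshold=0.6, w_profile=0.4):
    """Return topics whose similarity to the user exceeds the threshold.

    K_i: weak keyword vector; S_i: demand attribute vector (None for
    non-registered users, in which case only sim(Ki, Tj) is used)."""
    results = []
    for name, T_j in topic_library.items():
        score = cos(K_i, T_j)                       # sim(Ki, Tj), always computed
        if S_i is not None:                         # registered users: add sim(Si, Tj)
            score = w_profile * cos(S_i, T_j) + (1 - w_profile) * score
        if score > threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda t: -t[1])

rng = np.random.default_rng(3)
topics = {f"topic-{n}": rng.normal(size=8) for n in range(5)}   # demand topic library vectors Tj
K = rng.normal(size=8)                                          # weak keyword vector Ki
S = rng.normal(size=8)                                          # registration/preference vector Si

print("registered:", match_topics(K, topics, S_i=S))
print("non-registered:", match_topics(K, topics))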
VI. SYSTEM SIMULATION

In this experiment, a Chinese natural language processing system is taken as the object of study. A registered user of the system, whose major is library and information science, uses the database content structure analysis method "DT" (Database Tomography) as the keyword to submit an information request to Wenjin Search. A total of 531 documents in foreign languages and 91 in the Chinese and special collections are obtained. Statistics are made on the Chinese and special collection data sets in the search results, and the top 20 subject words with the highest frequency and their corresponding frequencies are obtained (Table I).

TABLE I. KEYWORDS AND FREQUENCY OF "DT" SEARCH RESULTS

Subject word | Frequency | Subject word | Frequency
Telecommunication | 5 | Iodine | 3
Magnetic resonance diffusion tensor imaging (DT-MRI) | 4 | Block copolymerization | 3
Vinyl acetate | 4 | DT company | 2
Methyl methacrylate | 4 | Belt conveyor | 2
Network optimization | 4 | Industrial enterprise management | 2
DT test | 4 | Fast neutrons for cancer | 2
Magnetic resonance imaging | 4 | Face recognition | 2
DT-CWT | 3 | Data mining | 2
DT polymerization | 3 | Image processing | 2
MCNP simulation | 3 | Neutron spectrum | 2

These subject words bear little relationship to one another or to the intended query, so we can judge that "DT" is a poor keyword; even when this user runs a "Database Tomography" search, the result is likewise blank. Once "DT" is judged to be a poor keyword, a poor keyword under the above conditions can still be analyzed [11]. When users with registration information submit such keywords, the system can analyze them according to their registration characteristic values and demand preferences through the three levels of the registration information demand model, the strong information demand model and the weak information demand model, and finally push the content of one or more topics to them. The subject terms in such a topic cover the search keywords that users need as far as possible, so that their information resource needs are satisfied to the greatest extent. Table II lists the top 20 subject words of one such topic and gives the probability that these words are related to it.

TABLE II. SUBJECT WORDS AND THEIR PROBABILITIES IN THE "DT" REQUIREMENT SUBJECT LIBRARY

Subject word | Keyword distribution probability | Subject word | Keyword distribution probability
Co-word analysis | 0.0677 | Word co-occurrence model | 0.0094
Cluster analysis | 0.0208 | Co-word analysis | 0.0083
Factor analysis | 0.0167 | Research theme | 0.0083
Knowledge graph | 0.0156 | Bibliometrics | 0.0083
Research hotspot | 0.0146 | Word co-occurrence | 0.0083
Visualization | 0.0135 | Knowledge management | 0.0083
Multidimensional scaling analysis | 0.0135 | Co-word network | 0.0073
Subject word | 0.0115 | Strategic coordinate | 0.0073
Co-word clustering | 0.0094 | Co-occurrence | 0.0070
Research structure | 0.0094 | Cluster analysis | 0.0070

We can see that most of the top 20 words by distribution probability are directly or indirectly related to the co-word analysis method. Therefore, it is very convenient to obtain the information users require from the keywords given in this mode. Users can not only obtain the resources they need, but can also exploit the strong correlation between these resources to obtain more information in the field related to "DT", which helps them build their own professional knowledge structure.

VII. CONCLUSION

In this paper, some of the factors influencing semantic similarity are improved, and a new method for calculating semantic similarity is proposed on the basis of these factors. Compared with the original classical algorithm, the semantic similarity calculated by the improved algorithm agrees better with expert experience, and its accuracy is higher.
REFERENCES

[1] Chen Zhe, Wen Dunwei. Research and implementation of a question answering system improved by natural language processing. Computer Engineering, vol. 32, pp. 205-206, February 2020.
[2] Chen Zhaoxiong, Gao Qingshi. Natural language processing. Computer Research and Development, vol. 26, pp. 16-24, November 2021.
[3] Chen Guohua, Zhao Ke, Li Yatao, et al. Coupled processing of event-class nouns in natural language processing systems. Computer Technology and Development, vol. 18, pp. 58-64, June 2020.
[4] Wang Min, Wu Gang, Xiao Jun, et al. Intelligent resource retrieval based on XML and natural language processing. Computer Engineering and Science, vol. 28, pp. 39-47, November 2021.
[5] Li Lei, Zhou Yanquan, Zhong Yixin. Research and application of natural language processing based on pragmatics. Journal of Intelligent Systems, vol. 1, pp. 68-74, February 2023.
[6] Zhang Peiying, Li Cunhe. An improved context-dependent ambiguity field segmentation algorithm. Computer Systems and Applications, vol. 15, pp. 46-48, May 2019.
[7] Guo Lei. Design of a computer intelligent scoring system for English translation based on natural language processing. Modern Electronic Technique, vol. 4, pp. 42-47, April 2021.
[8] Zhang Yu, Liu Ting, Chen Yiheng, et al. Watermarking of natural language text. Journal of Chinese Information Technology, vol. 19, pp. 98-104, January 2022.
[9] Zhuang Li, Bao Ta, Zhu Xiaoyan. Speech and natural language processing techniques in computer software systems for the blind. Journal of Chinese Information Technology, vol. 18, pp. 72-78, April 2022.
[10] Yin Hongzhi, Zhang Fan, Ding Ding, et al. AnswerSeeker: an intelligent question and answer system based on Internet mining. Journal of Computer Systems and Applications, vol. 8, pp. 8-19, January 2021.
[11] Li Xinli, Li Xinqi, Ma Kai, et al. Intelligent statistical analysis system based on natural language processing and Office COM components. Computer Applications and Software, vol. 34, pp. 49-54, December 2021.
