Let Load Create Part 3 HAPE
Loop 2 consists of two patterns: patterns 3 and 4. Let’s see what it tries to find. Pattern 3
looks for {any word tagged NNP with orth upperInitial, or any word tagged NNPS with orth
upperInitial}. This matches {Millennium, Sardar, Emirates, Wembley, Ben} from our example.
Now let’s move on to pattern 4, which repeats pattern 3. Notice, however, the “?” at the
end, which marks the pattern as optional. Basically, we want to capture the possibility of
having two words before the word “Stadium”. Using patterns 3 and 4 together in loop 2
extends the matches to {Sardar Patel, Ben Hill} from our examples, whereas {Millennium,
Emirates, Wembley} take the “?” option and match no second word.
After loop 2 has run, loop 5 runs, which allows for one further word. Again the “?”
indicates that this loop is optional. This time it matches as follows for our running
example: {Ben Hill Griffin}; for the other names it matches nothing. Hence when the outer
loop runs it detects the following:
The rest of the rule is straightforward: we make sure that the match ends with one of the
suffixes {Stadium, Circuit, Arena, Golf Club}. Pay attention to how “Golf Club” is
implemented, since it consists of two words, unlike the other cases.
(
  {Token.string =~ "[Ss]tadium"}
  |
  {Token.string =~ "[Cc]ircuit"}
  |
  {Token.string =~ "[Aa]rena"}
  |
  (
    {Token.string =~ "[Gg]olf"} {Token.string =~ "[Cc]lub"}
  )
)
Refer to this example when you need to write a generic grammar that contains a number of
patterns, some of which must be optional.
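The nesting described above can be sketched as a single rule. This is an illustrative reconstruction, not the tutorial's own grammar file: the phase name, the rule name, the macro name PROPER, the annotation type Venue and the control style are all assumptions, and a macro is used only to avoid repeating the NNP/NNPS test.

```
// Hypothetical sketch; names below are assumed, not taken from the original grammar.
Phase: VenueSketch
Input: Token
Options: control = appelt

// One capitalised proper noun, singular (NNP) or plural (NNPS).
Macro: PROPER
(
  {Token.category == NNP, Token.orth == upperInitial}
  |
  {Token.category == NNPS, Token.orth == upperInitial}
)

Rule: VenueSketch
(
  // first word, then an optional second word
  (
    (PROPER)
    (PROPER)?
  )
  // optional third word, e.g. "Ben Hill Griffin"
  (PROPER)?
  // mandatory suffix; "Golf Club" is a sequence of two tokens
  (
    {Token.string =~ "[Ss]tadium"}
    |
    {Token.string =~ "[Cc]ircuit"}
    |
    {Token.string =~ "[Aa]rena"}
    |
    ({Token.string =~ "[Gg]olf"} {Token.string =~ "[Cc]lub"})
  )
):venue
-->
:venue.Venue = {rule = "VenueSketch"}
```

Each “?” applies to the parenthesised group directly before it, so a one-word name such as “Wembley Stadium” skips both optional groups, while “Ben Hill Griffin Stadium” uses both.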
Exercise 4:
1. Write a rule that detects a location by using the context word “at”.
To demonstrate the use of priority in a JAPE grammar, we will create a few competing rules
that can identify the same entities. For example, we want to identify the name of a team
based on a few possible contexts. In the first context we use the fact that, in some
sports-related articles, player names are mentioned close to the name of their team in the
same sentence. For instance, in the following text (Example 5.txt) the team name is followed
by the names of two players:
“Manchester United players Park Ji Sung and Jonny Evans are expected to commit their
long-term futures in the coming weeks as the English, European and world champions
continue to plan for life after Sir Alex Ferguson.”
If the player names (Park Ji Sung, Jonny Evans) have already been identified, possibly
using a gazetteer or POS tags, then we can build a rule around that context to find the
team name (Manchester United). The following is one way of building such a rule:
Rule: teamcontext2
Priority: 40
(
  (
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  )
  |
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):team
(
  {Token.string=="players"}
  {Player}
  {Token.string=="and"}
  {Player}
)
-->
:team.Team = {rule= "teamcontext-teamcontext2" }
JAPEGrammar7teamcontext.jape(rule teamcontext2)
The following lines from the rule establish that the pattern starts with one or two
proper-noun (NNP) words (e.g. Manchester United, Liverpool):
(
  (
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  )
  |
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):team
This is immediately followed by the player names, with the strings “players” and “and” in
the mix:
(
  {Token.string=="players"}
  {Player}
  {Token.string=="and"}
  {Player}
)
Another way of looking at this problem is not to state explicitly what these strings are
(“players”, “and”) but to accept any token with a bare {Token}, as in the following rule:
Rule: teamcontext3
Priority: 30
(
  (
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  )
  |
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):team
(
  {Token}
  {Player}
  {Token}
  {Player}
)
-->
:team.Team = {rule= "teamcontext-teamcontext3" }
JAPEGrammar8teamcontext.jape(rule teamcontext3)
There is one more kind of context we can use to detect the team name here. If the name of
the city has been identified, then we can say that “City + United”, “City + F.C” or
“City + FC” is the name of a team. See the rule below.
Rule: teamcontext1
Priority: 50
(
  {City}
  (
    {Token.string=="United" }
    |
    {Token.string=="F.C" }
    |
    {Token.string=="FC" }
  )
):team
-->
:team.Team = {rule= "teamcontext-teamcontext1" }
All these competing rules are stored in teamcontext.jape. In the teamcontext.jape file the
control option is set to brill by default. Let’s look at the Brill style and what it does
in theory.
The Brill style means that when more than one rule matches the same region of the
document, they are all fired. As a result, a segment of text can be assigned more than one
entity type, and no priority ordering is necessary. Brill executes all matching rules
starting from a given position, then advances and continues matching from the position in
the document where the longest match finishes.
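The control style is chosen on the Options line in the header of a grammar phase. A minimal sketch of such a header follows; the phase name and the Input list are assumptions for illustration and are not copied from teamcontext.jape.

```
// Hypothetical phase header; the phase name and Input list are assumed.
Phase: TeamContext
// Only the annotation types listed here are visible to the rules of this phase.
Input: Token Player City
// brill: fire every rule that matches at the current position
Options: control = brill
```

Changing the value after "control =" is all that is needed to run the same rules under a different style.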
Let’s see what the Brill style does with our current example. Load the Example 5.txt file
from the data store and run the JAPE grammar with the transducer in the GATE GUI. You will
see that the transducer detects the name of the team once through each of the three
contexts (Figure 3). This is because all three rules match at the same position in the
document, hence the transducer fires all three. As you might have guessed, we can do
without priority ordering here, as it has no effect on the result.
You can try the same grammar with the All style by changing the control option to “all”.
The All style is similar to Brill in that it also executes all matching rules, but the
matching continues from the next offset to the current one.
For example, where [] are annotations of type Ann:
[aaa[bbb]] [ccc[ddd]]
a rule matching {Ann} and creating {Ann-2} for the same spans will generate:
BRILL: [aaabbb] [cccddd]
ALL: [aaa[bbb]] [ccc[ddd]]
The first three results are the same as the ones from the Brill style (Figure 3). The All
style also produces two more annotations, (Start=11, End=17, Rule=teamcontext2,
String=“United”) and (Start=11, End=17, Rule=teamcontext3, String=“United”). Both rules
(teamcontext2 and teamcontext3) share the following pattern, which results in those two
annotations:
(
  (                                                                       // loop 1
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
    {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  )
  |
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}     // loop 2
):team
With Brill, [Manchester United] matches loop 1 above, hence only [Manchester United], the
longest match, is selected. This is because Brill executes all matching rules starting from
a given position and then continues matching from the position in the document where the
longest match finishes.
With All, [Manchester United] matches loop 1 above while [United] matches loop 2. The
transducer therefore first selects [Manchester United] through loop 1, but goes back to
loop 2 straight after that and also detects [United]. This is because the matching
continues from the next offset to the current one.
With the appelt style, only one rule can be fired for the same region of text, according to
a set of priority rules. Priority operates in the following way:
1. From all the rules that match a region of the document starting at some point X, the one
which matches the longest region is fired.
2. If more than one rule matches the same region, the one with the highest priority is fired.
3. If there is more than one rule with the same priority, the one defined earlier in the
grammar is fired.
The longest match is identified first. In fact, there are two longest matches covering the
same region of the text: rules teamcontext2 and teamcontext3 both select “Manchester
United”. However, the priority of rule teamcontext2 (40) is higher than that of
teamcontext3 (30), hence only teamcontext2 is fired. Try swapping the priorities and you
will see that teamcontext3 is selected instead. Giving both rules the same priority fires
the rule that comes first in the grammar file.
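The tie-breaking behaviour can be tried in isolation with a much smaller grammar. The sketch below is hypothetical: the phase name, the rule names and the simplified one-word team pattern are assumptions made for illustration and are not part of teamcontext.jape.

```
// Hypothetical two-rule phase; names and patterns are illustrative only.
Phase: PrioritySketch
Input: Token
Options: control = appelt

// Explicit context: a proper noun followed by the word "players".
Rule: explicitContext
Priority: 40
(
  {Token.category == NNP}
):team
(
  {Token.string == "players"}
)
-->
:team.Team = {rule = "explicitContext"}

// Generic context: a proper noun followed by any token.
Rule: genericContext
Priority: 30
(
  {Token.category == NNP}
):team
(
  {Token}
)
-->
:team.Team = {rule = "genericContext"}
```

On a phrase such as “Liverpool players”, both rules match the same two-token region, so under appelt the higher Priority value decides and only explicitContext fires. Swapping the two Priority values, or making them equal, lets you observe steps 2 and 3 of the list above.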