
HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings
QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads
Text matching techniques
When using databases one of the most common proceedures is comparing records, either for deduplication or suppression. Deduplication is the process of removing identical records from a database. Suppression uses two or more databases, records found in one or more databases are removed from the main database, this is also known as purging. Another, similar, process is known as merging. In this process records in the suppression database(s) are compared against the main data records. Those records that are not already contained within the main database are added to it.
Searching for identical text is relatively easy, but problems arise when matching on names and addresses. For example identical text matching would not match the following records :-
Name Address Line 1 Address Line 2 Address Line 3
Mr. A B Sample 123 Sample Road Sample Town Sample County
Mr A Sample 123, Sample Road SampleTown SampleCounty
MR A SAMPLE 123 SAMPLE ROAD SAMPLE TOWN SAMPLE COUNTY
Mr Alan Sample 123 Sample Rd Sample Town Sample County
Alan Sample 123 Sample Road Sample Town Sample County
A Sample 123, Sample Rd SAmple Town Sample CouNty
MATCHES
0 2 3 3
Because of these discrepancies professional database matching software is very sophisticated and can be very expensive. Most of this software award scores for how close records get to being a perfect match to eachother. Working for a direct mailing house, where we do a lot of record comparisons, we can choose at which score level we decide to either delete or keep almost matching records. At low scores we delete less records but may not catch mail to the same address, hence people sometimes get two or more letters addressed to very slightly addresses but arriving at the same house. Incidentally, this is usually the fault of the suppliers of the database and not us. When we do something like a Mail Preference Service (MPS) suppression where people have specifically asked not to receive direct mail we keep the matching score fairly high, this means that people in the MPS suppression database will not receive the mailing at the risk of leaving people who may like to get it out.
Returning to our three example records, one way to improve the matching process is to strip out any characters that is not A-Z or 0-9. Matching will also improve if the letters are all cased the same. Applying this technique to the above records they now become :-
Name Address Line 1 Address Line 2 Address Line 3
MRABSAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY
MRASAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY
MRASAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY
MRALANSAMPLE 123SAMPLERD SAMPLETOWN SAMPLECOUNTY
ALANSAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY
ASAMPLE 123SAMPLERD SAMPLETOWN SAMPLECOUNTY
MATCHES
2 4 6 6
This is a little better, but doing a straightforward text comparison will not exactly match all of the fields in all of these records. Now you can see the value of giving scores to the matches. Some matching softwware will also remove the vowels, some will also split the NAME field into fields called TITLE, FORNAMES, INITIALS and SURNAMES. Some can make initials from the forenames as well. Some of this software contains a database of fornames and can make up the title from them.
Title Initial Forename Surname Address Line 1 Address Line 2 Address Line 3
MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
MR A ALAN SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
A ALAN SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY
MATCHES
4 6 2 6 6 6 6
To get this far may seem to be a relatively simple process but the work is only just beginning. To cut down on the work that has to be done relatively few matches will be done. We usually use surname, address line 1 and postcode or town. Some clients may want only members of one sex mailed, or only one letter to each household, so further comparisons have to be done. When deduping every record in the database has to be compared against every other record, when doing suppressions every record in every database used has to be compared. To compare any one field in the four records of our pretend database above SIX seperate comparisons have to be done, imagine how many comparisons have to be done on a database of several thousand records. For this reason most databases, Access, FoxPro etc allow the use of key indexes. Indexes are a type of sort on the records so a large number of comparisons can be bypassed. Indexing or sorting a database makes comparisons very much quicker; there is no need for addresses in Edinburgh to be compared to those in, say, Sheffield. A discussion on sorting techniques can be found on the SORT pages.
The following program, PreMatch.bas, shows how text can be prepared for the text comparison process. A word of caution, this code should NOT be used for ANY commercial application. The reason for the caution is that the seperation of names into their componant parts is fraught with difficulties, getting it wrong will certainly annoy the people in your database and may even move them to sue you. You must also remember that by keeping lists of peoples names and addresses, let alone any other details about them, makes you subject to the Data Protection Act.
DECLARE SUB PrintFlds ()
DECLARE SUB TitleCase (Text$)
DECLARE SUB Strip (Text$)
DECLARE SUB MoreStrip (Text$)
'PreMatch.bas Ray Thomas August 2000
'*** A program to get text into a form suitable for matching
'*** To simplfy the program I will assume no spaces between hyphenated surnames
'*** It will also contain a "contact" field and a salutation
TYPE InRec
Title AS STRING * 5
Initial AS STRING * 5
Surname AS STRING * 20
Fullname AS STRING * 20
Add1 AS STRING * 20
Add2 AS STRING * 20
Add3 AS STRING * 20
Salut AS STRING * 30
Contact AS STRING * 30
MName AS STRING * 20
MAdd1 AS STRING * 20
MAdd2 AS STRING * 20
MAdd3 AS STRING * 20
END TYPE
DIM SHARED Record(9) AS InRec 'An array to hold the records
DIM TitleComp(6) AS STRING 'Array to compare titles
DIM Counter AS INTEGER 'A counter for loop counts
DIM Count AS INTEGER 'A counter for loop counts
DIM Number AS INTEGER 'General numeric variable
'*** When defining type arrays they are initialised to NULLS
'*** to make text comparisons easier we are going to fill them with spaces
FOR Counter = 1 TO UBOUND(Record) - 1
Record(Counter).Title = SPACE$(LEN(Record(Counter).Title))
Record(Counter).Initial = SPACE$(LEN(Record(Counter).Initial))
Record(Counter).Surname = SPACE$(LEN(Record(Counter).Surname))
Record(Counter).Fullname = SPACE$(LEN(Record(Counter).Fullname))
Record(Counter).Add1 = SPACE$(LEN(Record(Counter).Add1))
Record(Counter).Add2 = SPACE$(LEN(Record(Counter).Add2))
Record(Counter).Add3 = SPACE$(LEN(Record(Counter).Add3))
Record(Counter).Salut = SPACE$(LEN(Record(Counter).Salut))
Record(Counter).Contact = SPACE$(LEN(Record(Counter).Contact))
Record(Counter).MName = SPACE$(LEN(Record(Counter).MName))
Record(Counter).MAdd1 = SPACE$(LEN(Record(Counter).MAdd1))
Record(Counter).MAdd2 = SPACE$(LEN(Record(Counter).MAdd2))
Record(Counter).MAdd3 = SPACE$(LEN(Record(Counter).MAdd3))
NEXT Counter
'*** Define some records ***
DATA Mr A B Sample,123 Sample Road,Sample Town,Sample County
DATA Mr A Sample,"123, Sample Road",SampleTown,SampleCounty
DATA MR A SAMPLE,123 SAMPLE ROAD,SAMPLE TOWN,SAMPLE COUNTY
DATA Mr Alan Sample,123 Sample Rd,Sample Town,Sample County
DATA Alan Sample,"123, Sample Rd",SAmple Town,Sample CouNty
DATA A Sample,123 Sample Road,Sample Town,Sample County
DATA Mr Sample,123 Sample Road,Sample Town,Sample County
DATA Mr Alpha-Sample,123 Sample Road, Sample Town, Sample County
DATA Mr A B Sample,123 Sample Road,Sample Town,Sample County
'*** Fill the record array and titlecase the fields ***
FOR Counter = 0 TO UBOUND(Record) - 1
READ Record(Counter).Fullname
TitleCase Record(Counter).Fullname
READ Record(Counter).Add1
TitleCase Record(Counter).Add1
READ Record(Counter).Add2
TitleCase Record(Counter).Add2
READ Record(Counter).Add3
TitleCase Record(Counter).Add3
NEXT Counter
'*** Define a simple titles array ***
'*** Imagine how many other variations may be found ***
DATA Mr,Miss,Mrs,Ms,Doctor,Dr
FOR Counter = 0 TO UBOUND(TitleComp$) - 1
READ TitleComp$(Counter)
NEXT Counter
'*** Strip out any "odd" characters
FOR Counter = 0 TO UBOUND(Record) - 1
Strip Record(Counter).Fullname
Strip Record(Counter).Add1
Strip Record(Counter).Add2
Strip Record(Counter).Add3
NEXT Counter
'*** Split the fullname field into title, first inital and surname ***
'*** title first ***
FOR Counter = 0 TO UBOUND(Record) - 1
Number = INSTR(Record(Counter).Fullname, " ")
Text$ = LEFT$(Record(Counter).Fullname, Number - 1)
FOR Count = 0 TO UBOUND(TitleComp$) - 1
IF UCASE$(Text$) = UCASE$(TitleComp$(Count)) THEN
Record(Counter).Title = Text$
EndTitle = LEN(TitleComp$(Count)) + 2
EXIT FOR
END IF
EndTitle = 1
NEXT Count
'*** To work out where the surname is we work backwards through the field
'*** The first space after the end spaces must be the start of the surname
LastChar$ = " "
FOR Count = LEN(Record(Counter).Fullname) TO 1 STEP -1
LastChar$ = MID$(Record(Counter).Fullname, Count, 1)
IF LastChar$ <> " " THEN
Number = Count
EXIT FOR
END IF
NEXT Count
FOR Count = 1 TO Number
CharPosn = INSTR(Count, Record(Counter).Fullname, " ")
IF CharPosn >= Number THEN
Num = LEN(Record(Counter).Fullname) - Count
Record(Counter).Surname = MID$(Record(Counter).Fullname, Count, Num)
StartSur = Count
EXIT FOR
END If
NEXT Count
'*** Get the initial
IF EndTitle <> StartSur THEN
Record(Counter).Initial = MID$(Record(Counter).Fullname, EndTitle, 1)
END IF
NEXT Counter
'*** Create the "Contact" field
Text$ = ""
FOR Counter = 0 TO UBOUND(Record) - 1
Text$ = RTRIM$(Record(Counter).Title) + " "
IF Text$ = " " THEN
Text$ = RTRIM$(Record(Counter).Initial) + " "
ELSE
Text$ = Text$ + RTRIM$(Record(Counter).Initial) + " "
END IF
Text$ = Text$ + RTRIM$(Record(Counter).Surname)
Record(Counter).Contact = Text$
NEXT Counter
'*** Create the Salutation field
FOR Counter = 0 TO UBOUND(Record) - 1
Text$ = "Dear " + Record(Counter).Title
Text$ = RTRIM$(Text$) + " " + Record(Counter).Surname
IF LEFT$(Record(Counter).Title, 1) = " " THEN
Text$ = "Dear Sir/Madam"
END IF
Record(Counter).Salut = Text$
NEXT Counter
'*** Now the text is in sort of natural casing, create the matching fields
FOR Counter = 0 TO UBOUND(Record) - 1
Record(Counter).MName = UCASE$(Record(Counter).Surname)
Record(Counter).MAdd1 = UCASE$(Record(Counter).Add1)
Record(Counter).MAdd2 = UCASE$(Record(Counter).Add2)
Record(Counter).MAdd3 = UCASE$(Record(Counter).Add3)
NEXT Counter
'*** Process the matching fields
'*** This involves stripping spaces
FOR Counter = 0 TO UBOUND(Record) - 1
MoreStrip Record(Counter).MName
MoreStrip Record(Counter).MAdd1
MoreStrip Record(Counter).MAdd2
MoreStrip Record(Counter).MAdd3
NEXT Counter
PrintFlds
END
SUB MoreStrip (Text$)
NewText$ = ""
Count = 0
FOR Counter = 1 TO LEN(Text$)
Char$ = MID$(Text$, Counter, 1)
IF INSTR("AEIOU ", UCASE$(Char$)) > 0 THEN
Char$ = ""
Count = Count + 1
END IF
'*** For some reason commas need to be taken out specificaly
IF Char$ = "," THEN
Char$ = ""
Count = Count + 1
END IF
NewText$ = NewText$ + Char$
NEXT Counter
Text$ = NewText$ + SPACE$(Count)
END SUB
SUB PrintFlds
CLS
FOR Count = 0 TO UBOUND(Record) - 1
PRINT Record(Count).Title;
PRINT Record(Count).Initial;
PRINT Record(Count).Surname
PRINT Record(Count).Fullname;
PRINT Record(Count).Salut;
PRINT Record(Count).Contact
PRINT Record(Count).Add1;
PRINT Record(Count).Add2;
PRINT Record(Count).Add3
PRINT Record(Count).MName;
PRINT Record(Count).MAdd1;
PRINT Record(Count).MAdd2;
PRINT Record(Count).MAdd3
NEXT Count
END SUB
SUB Strip (Text$)
NewText$ = ""
Count = 0
FOR Counter = 1 TO LEN(Text$)
Char$ = MID$(Text$, Counter, 1)
IF INSTR("ABCDEFGHIJKLMNOPQRSTUVWXYX 0123456789-", UCASE$(Char$)) < 1 THEN
Char$ = ""
Count = Count + 1
END IF
NewText$ = NewText$ + Char$
NEXT Counter
Text$ = NewText$ + SPACE$(Count)
END SUB
SUB TitleCase (Text$)
'*** This subroutine title cases text
'*** Split the text into words
'*** The initial letter of each word is uppercased
'*** The rest is lower cased
'*** While we're looking for spaces strip out extra ones
FinalText$ = ""
OldNum = 1
Number = 0
DO
Number = INSTR(OldNum, Text$, " ")
IF Number < 1 THEN EXIT DO
NewText$ = MID$(Text$, OldNum, Number - OldNum)
IF NewText$ > " " THEN
InitText$ = UCASE$(LEFT$(NewText$, 1))
RestText$ = LCASE$(MID$(NewText$, 2, Number - OldNum))
FinalText$ = FinalText$ + InitText$) + RestText$) + SPACE$(1)
END IF
OldNum = Number + 1
LOOP UNTIL Number < 1
'*** Special case for hyphenated names
Number = INSTR(FinalText$, "-")
IF Number > 0 THEN
MID$(FinalText$, Number + 1, 1) = UCASE$(MID$(FinalText$, Number + 1, 1))
END IF
Text$ = FinalText$
END SUB
If you look at the program you'll see it is just a text manipulation program that produces fields that are more suitable for text matching than simply casing the text. Once the matching fields have been produced they can be matched against eachother. The program also produces Salutation and Contact fields. When testing these sort of programs it is important that all possible forms of the data the program is required to deal with is tested, hence the 9 different forms of the A B Sample data.
Soundex
There is another way that text can be matched. This uses an algorithim that, strange as it seems, matches text by the way it is pronounced. The method is called SOUNDEX.
There are several variations of the Soundex algorithim and they give slightly different results. Here is a list of the common rules of creating a soundex algorithim :-
Double consonants are reduced to one letter
Adjacent letters that belong to the same group in the table below are reduced to one letter
Letters separated by H or W are treated as though the H or W are not there
The letters are substituted for the following numbers1 for B, F, P and Y
2 for C, G, J, K, Q, S, X and Z
3 for D and T
4 for L
5 for M and N
6 for RVowels are removed, except if it is the first letter
The letter Y is removed, except if is the first letter
Shorten the coded word to 4 characters
If the code produced is less than 4 characters long then make it 4 by right padding with 0's
Odd though it seems this system really does work, the US Government uses it to index its census information and several genealogy pages index their databases with it. Several of these Soundex calculators can be found on the net.
Take the words BROADCAST, HELLRAISING, BARE and BEAR
Removing H and W gives BROADCAST, HELLRAISING, BARE and BEAR
Removing double consonants gives BROADCAST, HELRAISING, BARE and BEAR
Removing the vowels gives BRDCST, HLRSNG, BR and BR
Removing Y gives BRDCST, HLRSNG, BR and BR
Removing letters of the same group gives BRDCT, HLRSNG, BR and BR
Shortening the word to four characters gives BRDC, HLRS, BR and BR
Substitution gives B632, H462, B6 and B6
Padding gives B632, H462, B600 and B600
Using this method the surnames PAINE, PAIN, PANE, PAYN, PAYNE etc all give P500
When calculating Soundex's for surnames some programs will ignore prefixes such as Von, de, D', le, du, van etc. Most, however keep Mc, Mac etc.
'Soundex.bas Ray Thomas August 2000
'A simple Soundex calculator
'*** Get the word to calculate the soundex code for
CLS
LOCATE 5, 10
PRINT "Please type the word to create the soundex code for"
LOCATE 7, 20
LINE INPUT ; UserWord$
'*** To make processing simpler uppercase the input word
NewText$ = UCASE$(UserWord$)
'*** Keep the first letter for later
InitLet$ = LEFT$(NewText$, 1)
'*** Substitute the letters for numbers
TestWord$ = ""
FOR Counter = 1 TO LEN(NewText$)
Char$ = MID$(NewText$, Counter, 1)
IF INSTR("BFPV", Char$) THEN Char$ = "1"
IF INSTR("CGJKQSXZ", Char$) THEN Char$ = "2"
IF INSTR("DT", Char$) THEN Char$ = "3"
IF INSTR("L", Char$) THEN Char$ = "4"
IF INSTR("MN", Char$) THEN Char$ = "5"
IF INSTR("R", Char$) THEN Char$ = "6"
TestWord$ = TestWord$ + Char$
NEXT Counter
' *** Strip out the letters H and W
NewText$ = ""
FOR Counter = 1 TO LEN(TestWord$)
Char$ = MID$(TestWord$, Counter, 1)
Number = INSTR("HW", Char$)
IF Number < 1 THEN NewText$ = NewText$ + Char$
NEXT Counter
'*** Check that no two adjacent codes are the same
TestWord$ = ""
FOR Counter = 1 TO LEN(NewText$)
Char$ = MID$(NewText$, Counter, 1)
IF Char$ <> RIGHT$(TestWord$, 1) THEN
TestWord$ = TestWord$ + Char$
END IF
NEXT Counter
'*** Strip out any non-alphabetic characters, vowels and Y
NewText$ = ""
FOR Counter = 1 TO LEN(TestWord$)
Char$ = MID$(TestWord$, Counter, 1)
Number = INSTR("123456", Char$)
IF Number > 0 THEN NewText$ = NewText$ + Char$
NEXT Counter
'*** Create the final code
Number = INSTR("BCDFGJKLMNPQRSTVXZ", InitLet$)
IF Number > 0 THEN
Soundex$ = InitLet$ + MID$(NewText$, 2, 3)
ELSE
Soundex$ = InitLet$ + MID$(NewText$, 1, 3)
END IF
'*** If less than four characters pad right with "0"s
IF LEN(Soundex$) < 4 THEN
Soundex$ = Soundex$ + STRING$(4 - LEN(Soundex$), "0")
END IF
LOCATE 10, 10
PRINT "The Soundex code for this word is :-"
LOCATE 12, 30
PRINT Soundex$
END
QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads
HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings