UTL_MATCH – string similarity/matching 11G

UTL_MATCH is very interesting package which implements quite advanced algorithms to match two strings. You can use them to quickly match name like First Names or Last Names.

Two basic algorithms are implemented here:

Edit Distance” – is a measure of Similarity between two strings, s1 and s2. The distance is the number of insertions, deletions or substitutions required to transform s1 to s2

Jaro-Winkler algorithm” is another way of calculating Edit distance between two strings. It is a String Comparator measure that gives values of partial agreement between two strings. The string comparator accounts for length of strings and partially accounts for typical human errors made in alphanumeric strings

Setup test

drop table test_tbl;

create table test_tbl
(id1 varchar2(30),
 id2 varchar2(30));

insert into test_tbl values('Tomasz Lesinski', 'Thomas Lesinski');
insert into test_tbl values('Angeles Lakers',  'Angeles Clippers');
insert into test_tbl values('Cunningham',      'Dunningham');
insert into test_tbl values('Perfect Match',   'Perfect Match');
insert into test_tbl values('Coca Cola',       'Pepsi');
insert into test_tbl values('Gerard',          'Gerhard');
insert into test_tbl values('Abraham',         'Abriham');

commit;

Function EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY

EDIT_DISTANCE function calculates the number of insertions, deletions or substitutions required to transform string-1 into string-2.

EDIT_DISTANCE_SIMILARITY

This function calculates the number of insertions, deletions or substations required to transform string-1 into string-2, and returns the Normalized value of the Edit Distance between two Strings. The value is typically between 0 (no match) and 100 (perfect match)

select 
  id1, id2, 
  utl_match.edit_distance(id1, id2) ed, 
  utl_match.edit_distance_similarity(id1, id2) eds
from test_tbl;

ID1                  ID2                          ED        EDS
-------------------- -------------------- ---------- ----------
Tomasz Lesinski      Thomas Lesinski               2         87 
Angeles Lakers       Angeles Clippers              5         69 
Cunningham           Dunningham                    1         90 
Perfect Match        Perfect Match                 0        100 
Coca Cola            Pepsi                         9          0 
Gerard               Gerhard                       1         86 
Abraham              Abriham                       1         86 

 7 rows selected

Function JARO_WINKLER and JARO_WINKLER_SIMILARITY

JARO_WINKLER function calculates the measure of agreement between two strings

JARO_WINKLER_SIMILARITY – function calculates the measure of agreement between two strings, and returns a score between 0 (no match) and 100 (perfect match)

 select 
  id1, id2, 
  trunc(utl_match.jaro_winkler(id1, id2),4) jw, 
  utl_match.jaro_winkler_similarity(id1, id2) jws
from test_tbl;

ID1                  ID2                          JW        JWS
-------------------- -------------------- ---------- ----------
Tomasz Lesinski      Thomas Lesinski          0.9599         96 
Angeles Lakers       Angeles Clippers         0.8946         89 
Cunningham           Dunningham               0.9333         93 
Perfect Match        Perfect Match                 1        100 
Coca Cola            Pepsi                         0          0 
Gerard               Gerhard                  0.9666         96 
Abraham              Abriham                  0.8944         89 

 7 rows selected

Compare both algorithms

select 
  id1, id2, 
  utl_match.edit_distance_similarity(id1, id2) eds,
  utl_match.jaro_winkler_similarity(id1, id2)  jws
from test_tbl;

ID1                  ID2                         EDS        JWS
-------------------- -------------------- ---------- ----------
Tomasz Lesinski      Thomas Lesinski              87         96 
Angeles Lakers       Angeles Clippers             69         89 
Cunningham           Dunningham                   90         93 
Perfect Match        Perfect Match               100        100 
Coca Cola            Pepsi                         0          0 
Gerard               Gerhard                      86         96 
Abraham              Abriham                      86         89 

 7 rows selected

I have no doubts that JARO_WINKLER is better but which is faster ?

Try to answer for this question yourself

Have a fun 🙂

Tomasz

Leave a Reply

Your email address will not be published. Required fields are marked *