UTL_MATCH is very interesting package which implements quite advanced algorithms to match two strings. You can use them to quickly match name like First Names or Last Names.
Two basic algorithms are implemented here:
“Edit Distance” – is a measure of Similarity between two strings, s1 and s2. The distance is the number of insertions, deletions or substitutions required to transform s1 to s2
“Jaro-Winkler algorithm” is another way of calculating Edit distance between two strings. It is a String Comparator measure that gives values of partial agreement between two strings. The string comparator accounts for length of strings and partially accounts for typical human errors made in alphanumeric strings
Setup test
drop table test_tbl; create table test_tbl (id1 varchar2(30), id2 varchar2(30)); insert into test_tbl values('Tomasz Lesinski', 'Thomas Lesinski'); insert into test_tbl values('Angeles Lakers', 'Angeles Clippers'); insert into test_tbl values('Cunningham', 'Dunningham'); insert into test_tbl values('Perfect Match', 'Perfect Match'); insert into test_tbl values('Coca Cola', 'Pepsi'); insert into test_tbl values('Gerard', 'Gerhard'); insert into test_tbl values('Abraham', 'Abriham'); commit;
Function EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY
EDIT_DISTANCE function calculates the number of insertions, deletions or substitutions required to transform string-1 into string-2.
EDIT_DISTANCE_SIMILARITY
This function calculates the number of insertions, deletions or substations required to transform string-1 into string-2, and returns the Normalized value of the Edit Distance between two Strings. The value is typically between 0 (no match) and 100 (perfect match)
select id1, id2, utl_match.edit_distance(id1, id2) ed, utl_match.edit_distance_similarity(id1, id2) eds from test_tbl; ID1 ID2 ED EDS -------------------- -------------------- ---------- ---------- Tomasz Lesinski Thomas Lesinski 2 87 Angeles Lakers Angeles Clippers 5 69 Cunningham Dunningham 1 90 Perfect Match Perfect Match 0 100 Coca Cola Pepsi 9 0 Gerard Gerhard 1 86 Abraham Abriham 1 86 7 rows selected
Function JARO_WINKLER and JARO_WINKLER_SIMILARITY
JARO_WINKLER function calculates the measure of agreement between two strings
JARO_WINKLER_SIMILARITY – function calculates the measure of agreement between two strings, and returns a score between 0 (no match) and 100 (perfect match)
select id1, id2, trunc(utl_match.jaro_winkler(id1, id2),4) jw, utl_match.jaro_winkler_similarity(id1, id2) jws from test_tbl; ID1 ID2 JW JWS -------------------- -------------------- ---------- ---------- Tomasz Lesinski Thomas Lesinski 0.9599 96 Angeles Lakers Angeles Clippers 0.8946 89 Cunningham Dunningham 0.9333 93 Perfect Match Perfect Match 1 100 Coca Cola Pepsi 0 0 Gerard Gerhard 0.9666 96 Abraham Abriham 0.8944 89 7 rows selected
Compare both algorithms
select id1, id2, utl_match.edit_distance_similarity(id1, id2) eds, utl_match.jaro_winkler_similarity(id1, id2) jws from test_tbl; ID1 ID2 EDS JWS -------------------- -------------------- ---------- ---------- Tomasz Lesinski Thomas Lesinski 87 96 Angeles Lakers Angeles Clippers 69 89 Cunningham Dunningham 90 93 Perfect Match Perfect Match 100 100 Coca Cola Pepsi 0 0 Gerard Gerhard 86 96 Abraham Abriham 86 89 7 rows selected
I have no doubts that JARO_WINKLER is better but which is faster ?
Try to answer for this question yourself
Have a fun 🙂
Tomasz
whats happen wehn:
ID1 = Tomasz Lesinski
and
ID2 = Tomasz Victor Ursus Lesinski
ID1 is 100% into ID2 but JARO calculates 85 ?