PageKicker v2.1.1-Keats improves acronym suggester

Version 2.1.1 of PageKicker replaces an acronym-identifying regex with a narrower one that produces better results.  It is still far from perfect.

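# Old pattern (commented out): any token containing at least two uppercase letters.
#sed 's/[[:space:]]\+/\n/g' $txtinfile  | sort -u | \
# egrep  '[[:upper:]].*[[:upper:]]' | sed 's/[\(\),]//g' | uniq
# New pattern: a run that starts with an uppercase letter, continues with letters,
# digits, +, . or &, and ends with an uppercase letter or digit.
sed 's/[[:space:]]\+/\n/g' "$txtinfile" | sort -u | \
  egrep '[A-Z][a-zA-Z0-9+.&]*[A-Z0-9]' | sed 's/[\(\),]//g' | uniq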

I reviewed a number of text analytics approaches before selecting this simpler and stupider regex approach. Most of the available tools require that the full phrase be immediately followed by the acronym, often in parentheses. There’s one that doesn’t require that, but it is written in Java, which means I’d have to traverse a learning curve to plug it in. Also, I’m not looking only for acronyms; I’m also looking for technical initialisms such as B8 or B-8.

The commit includes a very simple test file containing these terms:

cat 
dog 
fool 
M21a 
M21A 
SSN 
SSN21 
V-8 
GOLLY
V8
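
For anyone who wants to try the comparison without checking out the commit, here is a minimal sketch that writes the same terms to a temporary file and runs both patterns over it. The function names (`tokenize`, `old_suggester`, `new_suggester`) and the temporary file are made up for illustration and are not part of PageKicker. The sketch also swaps the GNU-specific `sed` whitespace split for `tr -s`, which should behave the same on this input, and pins `LC_ALL=C` so that `[A-Z]` and the sort order are byte-based.

```shell
#!/bin/sh
# Sketch only: reproduce the old-vs-new comparison on the sample terms.
# Function names and the temp file are illustrative, not part of PageKicker.
export LC_ALL=C   # byte-based ranges and sort order

# One token per line, de-duplicated; shared by both versions.
# (tr -s stands in for the post's GNU sed whitespace split.)
tokenize() {
  tr -s '[:space:]' '\n' < "$1" | sort -u
}

# Old pattern: any token with at least two uppercase letters.
old_suggester() {
  tokenize "$1" | grep -E '[[:upper:]].*[[:upper:]]' | sed 's/[\(\),]//g' | uniq
}

# New pattern: uppercase start, letters/digits/+/./& in the middle,
# uppercase letter or digit at the end.
new_suggester() {
  tokenize "$1" | grep -E '[A-Z][a-zA-Z0-9+.&]*[A-Z0-9]' | sed 's/[\(\),]//g' | uniq
}

sample=$(mktemp)
printf '%s\n' cat dog fool M21a M21A SSN SSN21 V-8 GOLLY V8 > "$sample"

echo "old:"; old_suggester "$sample"
echo "new:"; new_suggester "$sample"
rm -f "$sample"
```

Quoting the pattern matters here: left unquoted, the shell may treat it as a filename glob and replace it with any matching file names in the current directory before `grep` ever sees it.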

Output from old version:

GOLLY
M21A
SSN
SSN21

Output from new version:

GOLLY
M21A
M21a
SSN
SSN21
V8
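
V-8 is still missing from the new output because the hyphen is not in the pattern's middle character class. One possible tweak, sketched here as an assumption and not part of the 2.1.1 commit, is to add `-` to that class. The function name `suggest_with_hyphens` is hypothetical, and `tr -s` again stands in for the GNU `sed` whitespace split.

```shell
#!/bin/sh
# Hypothetical variant, not in the commit: allow '-' inside a candidate term
# so hyphenated initialisms such as V-8 or B-8 also match.
export LC_ALL=C   # byte-based ranges so [A-Z] means ASCII uppercase only

suggest_with_hyphens() {
  # '-' is placed last in the bracket expression so it is taken literally.
  tr -s '[:space:]' '\n' < "$1" | sort -u |
    grep -E '[A-Z][a-zA-Z0-9+.&-]*[A-Z0-9]' | sed 's/[\(\),]//g' | uniq
}

sample=$(mktemp)
printf '%s\n' cat dog fool M21a M21A SSN SSN21 V-8 GOLLY V8 > "$sample"
suggest_with_hyphens "$sample"
rm -f "$sample"
```

The trade-off is more noise: with the hyphen allowed, capitalized hyphenated words such as Jean-Paul would also be suggested, so this may or may not be an improvement depending on the text.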
