Version 2.1.1 of PageKicker replaces an acronym-identifying regex with a narrower one that produces better results. It is still far from perfect.
#sed 's/[[:space:]]\+/\n/g' $txtinfile | sort -u | \
#  egrep '[[:upper:]].*[[:upper:]]' | sed 's/[\(\),]//g' | uniq
sed 's/[[:space:]]\+/\n/g' "$txtinfile" | sort -u | \
  egrep '[A-Z][a-zA-Z0-9+.&]*[A-Z0-9]' | sed 's/[\(\),]//g' | uniq
I reviewed a number of text analytics approaches before settling on this simpler and stupider regex approach. Most of the available tools require that the full phrase be immediately followed by the acronym, often in parentheses. There’s one that doesn’t require that, but it is written in Java, which means I’d have to climb a learning curve to plug it in. Also, I’m not looking only for acronyms; I’m also looking for technical initialisms such as B8 or B-8.
There is a very simple test file included in the commit that includes these terms:
cat dog fool M21a M21A SSN SSN21 V-8 GOLLY V8
Output from old version:
GOLLY M21A SSN SSN21
Output from new version:
GOLLY M21A M21a SSN SSN21 V8
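The comparison above can be reproduced with a small sketch along these lines. The function name `extract_candidates` is hypothetical and not part of PageKicker. It also makes three substitutions relative to the pipeline in the post: `tr` replaces the GNU-specific `sed` whitespace split, `grep -E` replaces `egrep`, and `LC_ALL=C` pins the sort order and the meaning of `[A-Z]`.

```shell
#!/bin/sh
# Hypothetical helper (not part of PageKicker): print candidate acronyms and
# initialisms found in the plain-text file given as $1, one per line.
extract_candidates() {
    # Put each whitespace-separated token on its own line.
    tr -s '[:space:]' '\n' < "$1" |
        # De-duplicate; the C locale gives a predictable byte-wise order.
        LC_ALL=C sort -u |
        # Keep tokens containing an uppercase letter followed, after any run of
        # letters, digits, '+', '.' or '&', by an uppercase letter or a digit.
        LC_ALL=C grep -E '[A-Z][a-zA-Z0-9+.&]*[A-Z0-9]' |
        # Strip parentheses and commas, then collapse adjacent duplicates.
        sed 's/[(),]//g' |
        uniq
}

# Usage (assumes the test terms are saved in terms.txt):
#   extract_candidates terms.txt
```

Run against the test terms, this should print the six terms listed under "Output from new version", one per line. V-8 is still dropped, as it is by both versions above, because the hyphen is not in the character class.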