
sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all those W's with 'fi', and they can easily be distinguished by the fact that a capital W never occurs in the middle of a word in real English. So I need a sed one-liner that replaces all word-medial capital W's with the letters fi.

A capital W doesn't occur at the end of a word either, but it may occur in an all-caps abbreviation. So I'd replace `W` when it's immediately after a lowercase letter, or when it follows an uppercase letter and precedes a lowercase letter (aWre).


sed -e 's/\([[:lower:]]\)W/\1fi/g' -e 's/\([[:alpha:]]\)W\([[:lower:]]\)/\1fi\2/g'


This doesn't cover `fifi` (which my biggest word list finds only in “fifing”). More importantly, this doesn't cover `W` at the beginning of a word; you can capture some cases by looking at the second letter, but that's still going to miss many words that begin with `fi`. In English, many letters never appear after a word-initial W:


… -e 's/\([^[:alnum:]]\)W\([b-dfgj-npqstv-xz]\)/\1fi\2/g' \
-e 's/^W\([b-dfgj-npqstv-xz]\)/fi\1/'
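
A sketch of these two word-initial rules on their own, with the line-start rule referring to its single capture group as `\1`. The function name `fix_initial_w` and the sample line are assumptions; `LC_ALL=C` is added here so that the bracket ranges are treated as plain ASCII ranges.

```shell
# Hypothetical wrapper for the word-initial rules; reads stdin, writes stdout.
fix_initial_w() {
  # Rule 1: W after a non-alphanumeric character and before a letter that
  #         never follows a word-initial W in English.
  # Rule 2: the same test for a W at the very start of a line.
  LC_ALL=C sed -e 's/\([^[:alnum:]]\)W\([b-dfgj-npqstv-xz]\)/\1fi\2/g' \
               -e 's/^W\([b-dfgj-npqstv-xz]\)/fi\1/'
}

# Made-up sample line: two fixable words, one real W word, one abbreviation.
printf '%s\n' 'Wle a Wx for the Water WWII' | fix_initial_w
```

Here `Wle` and `Wx` are repaired, while `Water` and `WWII` are untouched. A word such as `Wre` (for "fire") would still be missed, because a vowel can legitimately follow an initial W.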


For more precise results, and to cope with other languages, you can switch to a more complex dictionary-based approach (which fancy OCR systems often use; evidently yours isn't fancy enough).
