
Two words with different last letter (Arabic)

I made a dictionary file for Arabic to be used in LibreOffice and LyX. It contains over 2.7 million Arabic words. A word that ends in `ة` is sometimes also written with `ه` as its last letter. I want a script using `sed` or `tr` that, whenever two words are identical except for the last letter, and those last letters are `ة` and `ه`, deletes the word ending in `ه`.

Example input:

الجنة
الجنه
الشجرة
الشجره

Output:

الجنة
الشجرة

Try this:


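awk -v TA=ة -v HA=ه '
  # Remember the line as written, since sub() below may modify $0.
  { orig = $0 }
  # Ends in HA: queue the original spelling unless the TA form was already seen.
  sub(HA"$", TA) { if (!($0 in ta)) ha[$0] = orig; next }
  # Ends in TA: record it and drop any queued HA spelling of the same word.
  $0 ~ TA"$" { ta[$0] = 1; delete ha[$0] }
  # Every word that does not end in HA is printed as is.
  { print }
  # Finally print the HA words that never had a TA counterpart.
  END { for (i in ha) print ha[i] }
' input_file | LC_ALL=C sort -u > output_file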
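
To check the behaviour on the sample from the question, here is a small self-contained sketch. The function name `dedup_ta_marbuta` and the extra word `كتابه` are hypothetical, added only for illustration: the function wraps the same awk program so that it reads from standard input, and `كتابه` shows that a word ending in `ه` survives when no `ة` spelling exists. It assumes one word per line and UTF-8 input.

```shell
#!/bin/sh
# Hypothetical wrapper (name is an assumption): same awk logic as above,
# reading words from stdin, one per line, and writing the result to stdout.
dedup_ta_marbuta() {
  awk -v TA=ة -v HA=ه '
    { orig = $0 }
    # Ends in HA: queue it unless the TA spelling was already seen.
    sub(HA"$", TA) { if (!($0 in ta)) ha[$0] = orig; next }
    # Ends in TA: record it and drop any queued HA spelling.
    $0 ~ TA"$" { ta[$0] = 1; delete ha[$0] }
    { print }
    # HA words with no TA counterpart are kept.
    END { for (i in ha) print ha[i] }
  ' | LC_ALL=C sort -u
}

# Sample from the question plus one HA-only word (illustrative).
printf '%s\n' الجنة الجنه الشجرة الشجره كتابه | dedup_ta_marbuta
```

This should print `الجنة`, `الشجرة` and `كتابه`, one per line. The order of the input lines does not matter, because a `ه` spelling is only queued and is discarded as soon as the matching `ة` spelling appears.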


I've tried to do something smarter, by creating a custom `LC_COLLATE`, but didn't manage it ;-)
