Mupdf: finding hyphenated words in PDF file

When I search for a word in a PDF file using `mupdf`, it only finds the word when it appears whole. For example, searching for the word “meaningless” will find it here:

This is a short, staggeringly meaningless sentence.

However, when a word is wrapped at the end of a line, it will not be found. Searching for “meaningless” won’t find the word in this example:

This is a short, staggeringly meaning-
less sentence.

There is no way I can know in advance whether a word is broken over two lines, and therefore hyphenated, or not. Searching for the hyphenated form explicitly would also be too cumbersome. The PDF viewer "Evince" behaves in the same way.

Is there a (simple) way to make "Mupdf" find hyphenated terms?

Note that the PDF doesn't contain the original text, but a description of which glyphs to put where. Searching text in a PDF depends on (1) the PDF having table(s) that describe which glyphs correspond to which Unicode characters, (2) a way to reassemble those translated characters into words, and (3) assumptions about how the generating application worked, e.g. that it put down glyphs in text order (an assumption that fails horrendously when, for example, two-column text is rendered in both columns simultaneously).

To take hyphenation into account, you'd have to implement an algorithm that detects dashes at the end of a line (different glyphs could be used for that) and then merges the word, taking special hyphenation rules into account (e.g. for German `ck`).
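To illustrate the merging step, here is a minimal Python sketch. It is not how MuPDF or Evince work internally. It assumes the text has already been extracted into a list of lines in reading order. The names `HYPHENS`, `dehyphenate` and `find_word` are made up for this example, and the set of hyphen characters is a guess at what a PDF might use. Language-specific rules such as the German `ck` case are not handled.

```python
# Characters treated as a possible line-end hyphen. This set is an assumption:
# ASCII hyphen-minus, soft hyphen, Unicode hyphen, non-breaking hyphen.
HYPHENS = "-\u00ad\u2010\u2011"


def dehyphenate(lines):
    """Join extracted lines into one string, merging words split at a line end.

    A trailing hyphen is dropped only if it directly follows a letter and the
    next line starts with a lowercase letter. This heuristic will also merge
    genuine compounds such as "well-" / "known" into "wellknown".
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if (
            len(out) >= 2
            and out[-1] in HYPHENS
            and out[-2].isalpha()
            and line[0].islower()
        ):
            out = out[:-1] + line  # drop the hyphen and glue the halves together
        elif out:
            out += " " + line      # ordinary line break becomes a space
        else:
            out = line
    return out


def find_word(lines, word):
    """Case-insensitive substring search on the de-hyphenated text."""
    return word.lower() in dehyphenate(lines).lower()


if __name__ == "__main__":
    page = ["This is a short, staggeringly meaning-", "less sentence."]
    print(dehyphenate(page))
    print(find_word(page, "meaningless"))
```

A search on the merged string finds the word, but it no longer says where the word sits on the page. A viewer would additionally have to map the match back to the two glyph runs in order to highlight them.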

So yes, it can be done, but not easily, and then it would work only for some languages/scripts anyway.