Categorization rule doesn't match Unicode combining character sequences

Kuyama · December 8, 2023, 12:11am

Hello, I’m an ActivityWatch user in Japan, and I’ve been using ActivityWatch for the past few months in my “Japanese” environment.
It works well for the most part, but I noticed a case where the AW categorization mechanism doesn’t work.

Strings that contain a Unicode “combining character sequence” don’t match the words defined in the categorization settings.

A combining character sequence is a base character followed by one or more “combining characters”, which together form a single grapheme. (See Combining character - Wikipedia)

For example, the Latin small letter e with acute accent, é, can be formed by the base letter e (U+0065) followed by the combining acute accent (U+0301).
é can also be expressed as the single precomposed character U+00E9.
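
To make the difference concrete, here is a minimal Python sketch using only the standard library. The variable names are just for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # base letter e (U+0065) + combining acute accent (U+0301)
precomposed = "\u00e9"   # single precomposed code point (U+00E9)

# Both render as the same grapheme, but they are different code point sequences.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# Normalization converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```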

Some Japanese characters can also be expressed as combining character sequences, and this form is common on Japanese macOS. For example, the file names of Microsoft Excel, Word, and PowerPoint documents often contain combining character sequences.

On the other hand, the strings you define in the categorization settings are always expressed as “precomposed characters”.

So the string matching fails even if the rule you define looks identical to the file name.
Has anyone else run into the same problem?
(I think combining characters are used in many languages.)
Is there any workaround?
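
The mismatch can be reproduced outside ActivityWatch with a short Python sketch. The rule text and file name here are made-up examples, and the decomposed title is built with NFD to imitate what I see on macOS.

```python
import re
import unicodedata

# Hypothetical rule as typed in the settings: precomposed katakana "ガイド".
rule = "\u30ac\u30a4\u30c9"

# Hypothetical window title in decomposed form: the voiced katakana are
# stored as a base character followed by U+3099 (combining voiced sound mark).
title = unicodedata.normalize("NFD", rule) + ".xlsx"

print(len(rule), len(title))    # 3 10
print(re.search(rule, title))   # None, although both look the same on screen
```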

As a fundamental solution, Unicode normalization probably needs to be applied before matching.
How about normalizing the event data strings just before regex.search()?
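
A rough sketch of what I mean is below. This is not ActivityWatch's actual code: `search_normalized` is a hypothetical helper, and I am assuming that normalizing both the rule and the event string to NFC is acceptable. Rules that rely on regex features beyond literal text would need a closer look.

```python
import re
import unicodedata


def search_normalized(pattern: str, text: str, form: str = "NFC"):
    """Hypothetical helper: normalize pattern and text, then call re.search."""
    return re.search(
        unicodedata.normalize(form, pattern),
        unicodedata.normalize(form, text),
    )


# Made-up example: precomposed rule vs. decomposed title.
rule = "\u30ac\u30a4\u30c9"                            # "ガイド", precomposed
title = unicodedata.normalize("NFD", rule) + ".xlsx"   # decomposed form

print(re.search(rule, title) is None)                  # True: plain search misses
print(search_normalized(rule, title) is not None)      # True: match after NFC
```

Normalizing both sides to the same form means it should not matter which form the OS or the settings UI happens to produce.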