What is Topic Matching?

When NewsBrain predicts your level of interest in an article, one of the things it does is scan the article text for any references to each topic in your Interest Profile. We call this "topic matching". Here are some technical details on the process.

As part of its prediction, NewsBrain looks at each topic in the profile to determine how relevant it is to the article. If it finds no matches for a topic, that topic has no relevance. If it finds many matches, the topic may be very relevant. If a match is found near the beginning of the article, the topic may be even more relevant, and so on.

In most cases, topic matching involves looking for the topic name. Specifically, NewsBrain looks for a case-insensitive match on the name, prefixed and suffixed by a word boundary. So, if the topic is "Cat", this matches the words "Cat" and "cat" but not "catch" or "scathe".

The HTML markup in an article is included in the search, so if you have a topic called "JPEG", it could match against any links in the article that include ".jpeg" file types.

Some topics include a "regex" string, and this performs a more complex search. Topic regex strings can be seen or set on the Topic Info Screen.

"Regex" is short for "regular expression" and it uses special characters and commands embedded in the string to allow for very flexible and powerful searches. One must be careful when constructing such a string, because it is easy to make a mistake. Therefore...

If you are not very familiar with regular expressions and how they work, leave Topic regex strings alone!

NewsBrain does a simple validation check on regex strings you enter, finding things like mismatched parentheses or invalid commands. However, it is possible to enter valid strings that tie up the computer or even appear to loop forever, so be careful. There are many descriptions of regex strings and how to use them on the Internet. One such tutorial is here. A detailed syntax summary of the ICU version of regex that iOS regex is based on is here. For the rest of this discussion, familiarity is assumed.
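To illustrate what a simple validation check can and cannot catch, here is a minimal sketch in Python. It is not NewsBrain's code: the function name is_valid_regex is hypothetical, and Python's re module stands in for the ICU engine used on iOS. The check only confirms that a string compiles.

```python
import re

def is_valid_regex(pattern):
    """Return True if the pattern compiles. This is a syntax check only."""
    try:
        re.compile(pattern, re.IGNORECASE)
        return True
    except re.error:
        return False

print(is_valid_regex(r"\bcat\b"))  # well-formed pattern
print(is_valid_regex(r"(cat"))     # mismatched parenthesis is rejected
print(is_valid_regex(r"(a+)+$"))   # compiles, yet can backtrack heavily on some input
```

The last pattern passes the check even though nested repetition like this can take a very long time to fail on certain text, which is why a syntax check alone is not a guarantee of safety.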

NewsBrain always uses regex strings when topic matching. All matching is case-insensitive. If a regex is specified for a topic, NewsBrain uses it directly. If not, NewsBrain constructs a regex string by quoting the topic name and requiring it to start and end on word boundaries. So if the topic is "Cat" its regex string becomes "\b\QCat\E\b". The quoting allows a topic name to use regex special characters without undesired side effects.
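The construction of a default regex string can be sketched in Python as follows. This is an illustration under assumptions, not NewsBrain's implementation: the function name default_topic_regex is hypothetical, and because Python's re module has no "\Q...\E" quoting, re.escape plays that role here.

```python
import re

def default_topic_regex(topic_name):
    """Quote the topic name and require word boundaries on both sides."""
    # re.escape stands in for ICU's \Q...\E quoting (an assumption of this sketch).
    return re.compile(r"\b" + re.escape(topic_name) + r"\b", re.IGNORECASE)

cat = default_topic_regex("Cat")
for word in ["Cat", "cat", "catch", "scathe"]:
    print(word, bool(cat.search(word)))

# Quoting keeps the "." in a topic name literal instead of "any character".
node = default_topic_regex("Node.js")
print(bool(node.search("I use Node.js daily")))
print(bool(node.search("Nodexjs")))
```

Without the quoting step, the "." in a name such as "Node.js" would match any character, so "Nodexjs" would count as a match.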

There are many possible pitfalls when constructing your own regex strings. For example:

If you have a regex string "cat", it matches not only the words "Cat" and "cat", as it would if you used it as a topic name, but also "catch" and "scathe". To fix it, you could require word boundaries, like this: "\bcat\b".

If you have a regex string like "united states", that requires a single space between the words, and would miss occurrences such as when there was a newline between words. To fix it, you could use "united\s+states", which requires one or more whitespace characters (such as space, newline, or tab) between the words.
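Both pitfalls above can be reproduced with a short Python sketch. The sample text is made up for illustration, and Python's re module is used in place of the ICU engine; "\b" and "\s+" behave the same way for this example.

```python
import re

text = "The cat tried to catch a bird.\nThe United\nStates team watched."

# Pitfall 1: a bare "cat" also matches inside longer words.
loose = re.compile(r"cat", re.IGNORECASE)
strict = re.compile(r"\bcat\b", re.IGNORECASE)
loose_count = len(loose.findall(text))    # 2: "cat" and the start of "catch"
strict_count = len(strict.findall(text))  # 1: the whole word only
print(loose_count, strict_count)

# Pitfall 2: a literal space does not match a newline between the words.
single_space = re.compile(r"united states", re.IGNORECASE)
any_space = re.compile(r"united\s+states", re.IGNORECASE)
print(bool(single_space.search(text)))
print(bool(any_space.search(text)))
```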

NewsBrain looks for a match on the full regular expression. So, you can't match on parenthesized subexpressions; parentheses are just for grouping. You might use look-ahead and look-behind assertions as a substitute. Normally subexpressions aren't needed anyway, since NewsBrain doesn't use the exact match string, only the fact of the match (it counts how many matches are in the article) and the position of the match (matches near the beginning of an article or in the headline are more relevant than those near the end).
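The two pieces of information described above, the number of matches and where they occur, can be collected as in the following Python sketch. The function name match_summary is hypothetical, and how NewsBrain actually weighs count and position is not described here, so the sketch only gathers the raw values.

```python
import re

def match_summary(pattern, article):
    """Return (number of matches, offset of the first match or None)."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Only the start of each full match is kept; group contents are ignored.
    starts = [m.start() for m in regex.finditer(article)]
    return len(starts), (starts[0] if starts else None)

article = "Cat owners say a cat is calm. Another cat agrees."
print(match_summary(r"\bcat\b", article))    # (3, 0)
print(match_summary(r"\b(cat)\b", article))  # grouping does not change the result
print(match_summary(r"\bdog\b", article))    # (0, None)
```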

Another use for such assertions is to avoid certain matches. For example you might want to match on JPEG when used in a sentence, but not when it is a filename extension, such as in an article link URL. To prevent a match when JPEG is prefixed with a period, you could use "(?<!\.)JPEG".
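The effect of the look-behind assertion can be seen in a small Python sketch. The HTML snippet is invented for illustration, and Python's re module is assumed to behave like the ICU engine for this pattern.

```python
import re

html = 'A JPEG image: <a href="photo.jpeg">photo</a>'

# Word boundaries alone still match the file extension inside the link.
plain = re.compile(r"\bJPEG\b", re.IGNORECASE)
# The negative look-behind rejects any match preceded by a period.
guarded = re.compile(r"(?<!\.)JPEG", re.IGNORECASE)

plain_count = len(plain.findall(html))      # 2: the word and ".jpeg" in the URL
guarded_count = len(guarded.findall(html))  # 1: the word only
print(plain_count, guarded_count)
```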

As mentioned above, all regex matches are case-insensitive. To make part of an expression case-sensitive, you can turn off the i flag for that part. For example, to match FBI only when it is in uppercase, you could use "(?-i:FBI)", since everything after "(?-i:" and before the closing ")" is case-sensitive.

The "^" and "$" codes match the start and end of a line, rather than the start and end of the whole article. However, line boundaries in web page articles are not usually explicit, since text wraps according to screen size, so these codes are rarely useful.

When formulating default regex strings, NewsBrain quotes the topic name with "\Q" and "\E" to keep any special characters in the topic name from being misinterpreted.
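The scoped case-sensitivity flag and the line-based meaning of "^" can be tried out with the Python sketch below. It uses the scoped form "(?-i:FBI)", which Python 3.6 and later share with ICU. The re.MULTILINE flag is an assumption made here to mimic the line-based behavior described above, and the sample strings are invented.

```python
import re

# Case-insensitive overall, but "FBI" itself must be uppercase.
fbi = re.compile(r"(?-i:FBI)\s+agents", re.IGNORECASE)
print(bool(fbi.search("Two FBI Agents arrived.")))  # "Agents" may be any case
print(bool(fbi.search("Two fbi agents arrived.")))  # lowercase "fbi" is rejected

# With line-based anchors, "^" matches after each newline, not just at the start.
text = "Dogs bark.\nCat naps."
line_anchor = re.compile(r"^cat", re.IGNORECASE | re.MULTILINE)
text_anchor = re.compile(r"^cat", re.IGNORECASE)
print(bool(line_anchor.search(text)))
print(bool(text_anchor.search(text)))
```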