Updated How Search in 1.6 works (markdown)

Chris Li 2016-06-26 07:44:27 -04:00
parent 9c1a0e351e
commit 4d99abf3ff

@ -28,7 +28,7 @@ In 1.6 I calculate [Levenshtein distance](https://en.wikipedia.org/wiki/Levensht
You may have noticed a big flaw in our system up to this step: it puts xapian results at a great disadvantage. The most relevant xapian result may have a title much longer than the search term, so its Levenshtein distance is large and its rank is low. We will address this in the next step!
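To make the flaw concrete, here is a minimal Python sketch of a textbook Levenshtein distance, applied to a search term and two made-up titles. The titles and the `levenshtein` helper are illustrative assumptions, not the code used in 1.6.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Hypothetical search term and titles, for illustration only.
term = "apple"
short_title = "apply"                  # a near miss on spelling
long_title = "apple inc. litigation"   # contains the term, but is much longer

print(levenshtein(term, short_title))
print(levenshtein(term, long_title))
```

The long title contains the whole search term, yet every extra character counts as an insertion, so it ends up far behind the short near miss when ranking by distance alone.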
## Step 4: Utilize xapian probability
If xapian gives a search result a 100% probability, what does it mean? Well, it means xapian is quite sure this is exactly the article you want. It is very confident. So we need to give such an article a boost, even though it may have a (very) long Levenshtein distance to our search term. On the other hand, if xapian gives an article a very low probability, the article may have nothing to do with our search term. We may need to give such articles a penalty, even though they may share great similarity with the search term.
A lot of the time, xapian results congregate at the high end of the probability range. In other words, you get a lot of articles with 100%, 99%, etc., but not many with 75%, 66%, 45% and the like. To better differentiate them, we need a non-linear map from probability to a boost/penalty factor. This also allows us to avoid penalizing xapian results with low probability too heavily.
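One possible shape for such a map is sketched below in Python. It is only an illustration: the power curve, the `best`/`worst` bounds, the `exponent`, and the idea of multiplying the Levenshtein distance by the factor are all assumptions chosen for the example, not the function actually used in 1.6.

```python
def probability_factor(percent: float,
                       best: float = 0.5,
                       worst: float = 1.5,
                       exponent: float = 4.0) -> float:
    """Map a xapian probability (0-100) to a multiplier for the distance.

    A factor below 1 is a boost (a smaller distance ranks higher) and a
    factor above 1 is a penalty. All parameter values are hypothetical.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # Raising to a power > 1 makes the curve steep near 100% and flat near 0%,
    # so 100% and 99% are separated more than 45% and 44%.
    strength = (percent / 100.0) ** exponent
    # The penalty is bounded by `worst`, however low the probability gets.
    return worst - (worst - best) * strength


def adjusted_score(distance: int, percent: float) -> float:
    """Hypothetical ranking score: lower is better."""
    return distance * probability_factor(percent)


for percent in (100, 99, 75, 66, 45):
    print(percent, round(probability_factor(percent), 3))
```

With a curve like this, one percentage point matters much more near 100% than near 45%, and the penalty for a low probability never exceeds `worst`. Picking the actual curve and bounds is a tuning decision.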