Information Retrieval: Boolean Retrieval - Intersection and Union (1)

  • November 21, 2019
  • Notes

Introduction

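Boolean retrieval answers a query by applying Boolean operations to a document collection. For example, take the following three documents (already normalized):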

doc1 = ["1", "hello", "word", "i", "love", "dazhu"]
doc2 = ["2", "hi", "i", "can", "speak", "love"]
doc3 = ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"]

Suppose we want the documents in this collection that contain both "i" and "can". The input is:

"i" AND "can"

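The result should be [2, 3]; that is, the operation finds that doc2 and doc3 satisfy the condition. The keys to implementing Boolean retrieval are building an inverted index and computing the intersection and union of N sets. This post starts with simple algorithms for the intersection and union of two sets.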
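To make the example query concrete, here is a minimal sketch of an inverted index over the three documents. It assumes the first token of each document is its ID, which the sample data suggests but the text does not state. The name `build_inverted_index` is hypothetical, and Python's built-in set intersection stands in for a merge-based intersection of sorted lists.

```python
from collections import defaultdict

# The three normalized documents from the example.
# Assumption: the first token of each document is its ID.
docs = [
    ["1", "hello", "word", "i", "love", "dazhu"],
    ["2", "hi", "i", "can", "speak", "love"],
    ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"],
]


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc in docs:
        doc_id = int(doc[0])
        for term in doc[1:]:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)

# "i" AND "can": keep only the IDs present in both posting lists.
result = sorted(set(index["i"]) & set(index["can"]))
print(result)  # [2, 3]
```

Keeping each posting list sorted is what lets a linear two-pointer merge replace the set operation once the lists grow large.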

Computing Intersection and Union

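Boolean retrieval first requires the intersection or union of two sets. When both sets are stored as lists sorted in ascending order, either operation takes O(x + y) time, where x and y are the lengths of the two lists. Reference code: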

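def arr_and(arr1, arr2):
    """Return the intersection of two ascending lists."""
    p1 = 0
    p2 = 0
    result = []

    # On a match, keep the value and advance both pointers;
    # otherwise advance the pointer at the smaller value.
    while p1 < len(arr1) and p2 < len(arr2):
        if arr1[p1] == arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        elif arr1[p1] < arr2[p2]:
            p1 += 1
        else:
            p2 += 1
    return result


def arr_or(arr1, arr2):
    """Return the union of two ascending lists; a shared value appears once."""
    p1 = 0
    p2 = 0
    result = []

    # Always emit the smaller value, so the output stays sorted.
    while p1 < len(arr1) and p2 < len(arr2):
        if arr1[p1] == arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        elif arr1[p1] < arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
        else:
            result.append(arr2[p2])
            p2 += 1

    # At most one list still has elements left; append them unchanged.
    if p1 < len(arr1):
        result += arr1[p1:]
    if p2 < len(arr2):
        result += arr2[p2:]

    return result


## test
arr1 = [1, 3, 5, 7, 8, 12]
arr2 = [1, 4, 5, 6, 7, 8]

print(arr_and(arr1, arr2))  # [1, 5, 7, 8]
print(arr_or(arr1, arr2))   # [1, 3, 4, 5, 6, 7, 8, 12]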
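The introduction mentions intersecting N sets. One possible way to get there, sketched under the assumption that every input list is sorted in ascending order, is to fold the two-list intersection over all the lists. The names `intersect_sorted` and `intersect_all` are hypothetical, and starting from the shortest list is a common heuristic rather than something the original post specifies.

```python
from functools import reduce


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(lists):
    """Intersect any number of ascending lists, shortest first."""
    if not lists:
        return []
    # Shorter lists first keeps the intermediate results small.
    ordered = sorted(lists, key=len)
    return reduce(intersect_sorted, ordered)


print(intersect_all([[1, 2, 3], [2, 3], [1, 3, 5, 7]]))  # [3]
```

Each intermediate result is no longer than the shortest list seen so far, so later merges tend to get cheaper.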