如何計算兩個字元串之間的文本相似度?

2019 年 11 月 13 日
筆記

前言
Jaccard 相似度
Sorensen Dice 相似度係數
Levenshtein
漢明距離
餘弦相似性
總結
參考文章

前言

最近好久沒有寫文章了，上一篇文章還是九月十一的時候寫的，距今已經兩個月了，期間一直在忙一些工作上的事情，今天終於有點空閑，所以寫一篇文章散散心。

平時的編碼中，我們經常需要判斷兩個文本的相似性，不管是用來做文本糾錯或者去重等等，那麼我們應該以什麼維度來判斷相似性呢？這些演算法又怎麼實現呢？這篇文章對常見的計算方式做一個記錄。

Jaccard 相似度

首先是 Jaccard 相似度係數，下面是它在維基百科上的一個定義及計算公式。

The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

其實總結就是一句話：集合的交集與集合的並集的比例.

java 程式碼實現如下：

    public static float jaccard(String a, String b) {          if (a == null && b == null) {              return 1f;          }          // 都為空相似度為 1          if (a == null || b == null) {              return 0f;          }          Set<Integer> aChar = a.chars().boxed().collect(Collectors.toSet());          Set<Integer> bChar = b.chars().boxed().collect(Collectors.toSet());          // 交集數量          int intersection = SetUtils.intersection(aChar, bChar).size();          if (intersection == 0) return 0;          // 並集數量          int union = SetUtils.union(aChar, bChar).size();          return ((float) intersection) / (float)union;      }

Sorensen Dice 相似度係數

與 Jaccard 類似，Dice 係數也是一種計算簡單集合之間相似度的一種計算方式。與 Jaccard 不同的是，計算方式略有不同。下面是它的定義。

The Sørensen–Dice coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen[1] and Lee Raymond Dice,[2] who published in 1948 and 1945 respectively.

需要注意的是，他是：集合交集的 2 倍除以兩個集合相加。並不是並集.

java 程式碼實現如下：

    public static float SorensenDice(String a, String b) {          if (a == null && b == null) {              return 1f;          }          if (a == null || b == null) {              return 0F;          }          Set<Integer> aChars = a.chars().boxed().collect(Collectors.toSet());          Set<Integer> bChars = b.chars().boxed().collect(Collectors.toSet());          // 求交集數量          int intersect = SetUtils.intersection(aChars, bChars).size();          if (intersect == 0) {              return 0F;          }          // 全集，兩個集合直接加起來          int aSize = aChars.size();          int bSize = bChars.size();          return (2 * (float) intersect) / ((float) (aSize + bSize));      }

Levenshtein

萊文斯坦距離，又稱 Levenshtein 距離，是編輯距離的一種。指兩個字串之間，由一個轉成另一個所需的最少編輯操作次數。

簡單的說，就是用編輯距離表示字元串相似度, 編輯距離越小，字元串越相似。

java 實現程式碼如下：

    public static float Levenshtein(String a, String b) {          if (a == null && b == null) {              return 1f;          }          if (a == null || b == null) {              return 0F;          }          int editDistance = editDis(a, b);          return 1 - ((float) editDistance / Math.max(a.length(), b.length()));      }        private static int editDis(String a, String b) {            int aLen = a.length();          int bLen = b.length();            if (aLen == 0) return aLen;          if (bLen == 0) return bLen;            int[][] v = new int[aLen + 1][bLen + 1];          for (int i = 0; i <= aLen; ++i) {              for (int j = 0; j <= bLen; ++j) {                  if (i == 0) {                      v[i][j] = j;                  } else if (j == 0) {                      v[i][j] = i;                  } else if (a.charAt(i - 1) == b.charAt(j - 1)) {                      v[i][j] = v[i - 1][j - 1];                  } else {                      v[i][j] = 1 + Math.min(v[i - 1][j - 1], Math.min(v[i][j - 1], v[i - 1][j]));                  }              }          }          return v[aLen][bLen];      }

程式碼中的編輯距離求解使用了經典的動態規劃求解法。

我們使用了** 1 – ( 編輯距離 / 兩個字元串的最大長度） ** 來表示相似度，這樣可以得到符合我們語義的相似度。

漢明距離

漢明距離是編輯距離中的一個特殊情況，僅用來計算兩個等長字元串中不一致的字元個數。

因此漢明距離不用考慮添加及刪除，只需要對比不同即可，所以實現比較簡單。

我們可以用similarity=漢明距離/長度來表示兩個字元串的相似度。

java 程式碼如下：

    public static float hamming(String a, String b) {          if (a == null || b == null) {              return 0f;          }          if (a.length() != b.length()) {              return 0f;          }            int disCount = 0;          for (int i = 0; i < a.length(); i++) {              if (a.charAt(i) != b.charAt(i)) {                  disCount++;              }          }          return (float) disCount / (float) a.length();      }

下面是測試用例：

        Assert.assertEquals(0.0f, StringSimilarity.hamming("java 開發", "大過年的幹啥"), 0f);          Assert.assertEquals(0.6666667f, StringSimilarity.hamming("大過年的吃肉", "大過年的幹啥"), 0f);

餘弦相似性

首先是餘弦相似性的定義：

餘弦相似性通過測量兩個向量的夾角的餘弦值來度量它們之間的相似性。0 度角的餘弦值是 1，而其他任何角度的餘弦值都不大於 1；並且其最小值是-1。從而兩個向量之間的角度的餘弦值確定兩個向量是否大致指向相同的方向。兩個向量有相同的指向時，餘弦相似度的值為 1；兩個向量夾角為 90°時，餘弦相似度的值為 0；兩個向量指向完全相反的方向時，餘弦相似度的值為-1。這結果是與向量的長度無關的，僅僅與向量的指向方向相關。餘弦相似度通常用於正空間，因此給出的值為 0 到 1 之間。

計算公式如下：

餘弦我們都比較熟悉，那麼是怎麼用它來計算兩個字元串之間的相似度呢？

首先我們將字元串向量化，之後就可以在一個平面空間中，求出他們向量之間夾角的餘弦值即可。

字元串向量化怎麼做呢？我舉一個簡單的例子：

A: 呼延十二  B: 呼延二十三    他們的並集 [呼，延，二，十，三]    向量就是並集中的每個字元在各自中出現的頻率。  A 的向量：[1,1,1,1,0]  B 的向量：[1,1,1,1,1]    然後調用上面的公式計算即可。

java 程式碼實現如下：

    public static float cos(String a, String b) {          if (a == null || b == null) {              return 0F;          }          Set<Integer> aChar = a.chars().boxed().collect(Collectors.toSet());          Set<Integer> bChar = b.chars().boxed().collect(Collectors.toSet());            // 統計字頻          Map<Integer, Integer> aMap = new HashMap<>();          Map<Integer, Integer> bMap = new HashMap<>();          for (Integer a1 : aChar) {              aMap.put(a1, aMap.getOrDefault(a1, 0) + 1);          }          for (Integer b1 : bChar) {              bMap.put(b1, bMap.getOrDefault(b1, 0) + 1);          }            // 向量化          Set<Integer> union = SetUtils.union(aChar, bChar);          int[] aVec = new int[union.size()];          int[] bVec = new int[union.size()];          List<Integer> collect = new ArrayList<>(union);          for (int i = 0; i < collect.size(); i++) {              aVec[i] = aMap.getOrDefault(collect.get(i), 0);              bVec[i] = bMap.getOrDefault(collect.get(i), 0);          }            // 分別計算三個參數          int p1 = 0;          for (int i = 0; i < aVec.length; i++) {              p1 += (aVec[i] * bVec[i]);          }            float p2 = 0f;          for (int i : aVec) {              p2 += (i * i);          }          p2 = (float) Math.sqrt(p2);            float p3 = 0f;          for (int i : bVec) {              p3 += (i * i);          }          p3 = (float) Math.sqrt(p3);            return ((float) p1) / (p2 * p3);      }

對上面的程式碼運行了測試用例，可以看到基本符合我們的期望。

        Assert.assertEquals(0.70710677f, StringSimilarity.cos("apple", "app"), 0f);          Assert.assertEquals(0.8944272f, StringSimilarity.cos("呼延十二", "呼延二十三"), 0f);          Assert.assertEquals(0.0f, StringSimilarity.cos("數據工程", "日本旅遊"), 0f);

總結

本文簡單的介紹了幾種不同的計算純文本之間相似度的方式，他們在一定程度上都是奏效的，但是，各自也有各自的一些含義在裡面，比如有的使用編輯距離來描述，有的用向量夾角來描述。所以在使用到本文中的方式時，還是要多多了解他的原理，結合自己的業務實際，選擇其中的一種或者幾種進行使用。

參考文章

維基百科

完。