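Python: Regular Expressions
1. Overview of regular expressions
A regular expression (regex for short) is a string made up of ordinary characters and special symbols that describes a pattern, such as a repetition or a set of alternative characters.
A regular expression matches a whole family of strings that share the features the pattern describes.
In other words, a single pattern can match many different strings.
Regular expression syntax differs between languages; this article covers Python's regular expressions.
Most of the example code is taken from "Automate the Boring Stuff with Python" (《Python編程快速上手 讓繁瑣工作自動化》).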
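2. Writing a regular expression
A regular expression is just a string. Unlike an ordinary string, it may contain zero or more regex operators and special characters; see section 1.2 of "Core Python Programming" (《Python核心編程》).
# Writing regular expressions: each sample string is followed by a pattern that matches it
'hing'      '\wing'
'123456'    '\d\d\d\d\d\d'
'regex.py'  '.*\.py'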
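As a quick check of the pairs above, the following sketch tests each sample string against its pattern with re.fullmatch(). The names pairs and results are only illustrative and do not come from the original text.

```python
import re

# (sample string, pattern) pairs taken from the listing above.
pairs = [
    ('hing', r'\wing'),           # \w matches one letter, digit, or underscore
    ('123456', r'\d\d\d\d\d\d'),  # \d matches one digit
    ('regex.py', r'.*\.py'),      # .* matches any run of characters, \. a literal dot
]

# fullmatch() succeeds only if the whole string fits the pattern.
results = {text: re.fullmatch(pattern, text) is not None for text, pattern in pairs}
print(results)
```

Every value in results should be True; a string such as 'ing' would fail against '\wing' because \w needs one character to consume.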
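3. Creating a regular expression object
A pattern string does no matching by itself. The usual way to match it against target text is to create a regular expression object by passing the pattern to re.compile(), normally written as a raw string, that is, r'…'.
>>> # re.compile() returns (creates) a Regex pattern object
>>> import re
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')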
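To illustrate why the raw-string form is preferred, here is a small sketch comparing the same pattern written with and without the r prefix. The variable names are assumptions made for this example.

```python
import re

# Without the r prefix every backslash has to be doubled.
plain = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
raw = r'\d\d\d-\d\d\d-\d\d\d\d'
same_pattern = plain == raw  # both spell the same pattern text

phone_regex = re.compile(raw)
mo = phone_regex.search('My number is 415-555-4242.')
# search() returns None when nothing matches, so guard before calling group().
found = mo.group() if mo is not None else None
print(same_pattern, found)
```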
4. Common regular expression patterns
4.1 Grouping with parentheses
>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')  # create the Regex object
>>> mo = Regex.search('My number is 415-555-4242.')  # returns a Match object
>>> mo.group()  # the Match object's group() method returns the whole matched text
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.groups()
('415', '555-4242')
>>> a, b = mo.groups()  # groups() returns a tuple with one value per group
>>> a
'415'
>>> b
'555-4242'
4.2 Matching one of several alternatives with the pipe
>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()  # the first alternative found in the text is returned
'Batman'
>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'
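The pipe can also be combined with the parentheses from section 4.1, so that several alternatives share a common prefix. The sketch below assumes an illustrative pattern and sample sentence that are not part of the original listing.

```python
import re

# The alternatives inside the group all share the prefix 'Bat'.
bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = bat_regex.search('Batmobile lost a wheel')

whole = mo.group()    # the full matched text
suffix = mo.group(1)  # only the alternative that matched inside the parentheses
print(whole, suffix)
```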
4.3 Optional matching with the question mark
>>> batRegex = re.compile(r'Bat(wo)?man')  # without the parentheses, ? would make only the 'o' optional
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
4.4 Matching zero or more times with the star
>>> batRegex = re.compile(r'Bat(wo)*man')  # to match a literal '*', use \*
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'
4.5 Matching one or more times with the plus
>>> batRegex = re.compile(r'Bat(wo)+man')  # to match a literal '+', use \+
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 is None  # 'wo' must appear at least once, so there is no match
True
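4.6 Matching a specific number of repetitions with curly braces
The '?' in the code below marks a non-greedy match. A question mark has two possible meanings in a regular expression: declaring a non-greedy match or marking an optional group. The two meanings are entirely unrelated.
>>> greedyHaRegex = re.compile(r'(Ha){3,5}')  # to match a literal '{', use \{
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'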
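Besides the {3,5} range used above, the curly braces also accept an exact count or an open-ended range. A minimal sketch, with variable names chosen only for this example:

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

exact = re.compile(r'(Ha){3}').search(text).group()      # exactly three
at_least = re.compile(r'(Ha){3,}').search(text).group()  # three or more, greedy
at_most = re.compile(r'(Ha){,5}').search(text).group()   # zero to five, greedy
too_short = re.compile(r'(Ha){3}').search('HaHa')        # fewer than three: no match

print(exact, at_least, at_most, too_short)
```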
5. Greedy and non-greedy matching
Non-greedy matching is typically used to stop the wildcard (.) from also swallowing the text that is supposed to end the match, as in the code below. Section 4.6 also shows a non-greedy match.
>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'
6. Common methods of Regex objects
As described above, re.compile() creates a Regex object. Its most commonly used methods are listed below.
6.1 search(), group(), groups()
>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')  # create the Regex object
>>> mo = Regex.search('My number is 415-555-4242.')  # returns a Match object
>>> mo.group()  # the Match object's group() method returns the whole matched text
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.groups()
('415', '555-4242')
>>> a, b = mo.groups()  # groups() returns a tuple with one value per group
>>> a
'415'
>>> b
'555-4242'
6.2 findall()
When called on a regular expression with no groups, findall() returns a list of the matched strings.
When called on a regular expression with groups, findall() returns a list of tuples of strings, one string per group.
>>> Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # has no groups
>>> Regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']
>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')  # has groups
>>> Regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]
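One case the two rules above do not spell out is a pattern with exactly one group: there findall() returns a plain list of that group's strings rather than one-element tuples. The sketch below puts the three cases side by side; the variable names are illustrative only.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)       # list of whole matches
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)     # list of the single group's text
two_groups = re.findall(r'(\d{3})-(\d{3}-\d{4})', text)  # list of tuples, one item per group

print(no_groups)
print(one_group)
print(two_groups)
```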
6.3 sub()
>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.', 1)  # replace only the first match
'CENSORED gave the secret documents to Agent Bob.'
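The replacement string passed to sub() may also refer back to a group of the match with \1, \2, and so on. A short sketch of that use, with a sample sentence chosen for illustration:

```python
import re

# Group 1 captures only the first letter of each agent's name.
agent_regex = re.compile(r'Agent (\w)\w*')
message = 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'

# \1 in the replacement is filled with the text captured by group 1.
censored = agent_regex.sub(r'\1****', message)
print(censored)
```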
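7. re.IGNORECASE, re.DOTALL, and re.VERBOSE
To make a regular expression case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().
Passing re.DOTALL as the second argument to re.compile() makes the dot character match every character, including the newline.
To spread a regular expression over several lines and add comments to it, pass the flag re.VERBOSE as the second argument to re.compile().
To use several of these flags at once, combine them with the bitwise OR operator (|).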
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
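To see what each flag changes, the following sketch applies them one at a time. The sample strings and variable names are assumptions made for this example.

```python
import re

# re.IGNORECASE (re.I): letters match regardless of case.
robocop = re.compile(r'robocop', re.I)
case_hit = robocop.search('RoboCop is part man, part machine.').group()

# Without re.DOTALL the dot stops at the first newline; with it, the dot crosses newlines.
text = 'Serve the public trust.\nProtect the innocent.'
first_line = re.compile(r'.*').search(text).group()
everything = re.compile(r'.*', re.DOTALL).search(text).group()

# re.VERBOSE: whitespace and # comments inside the pattern are ignored.
phone_regex = re.compile(r'''
    (\d{3})          # area code
    -                # separator
    (\d{3}-\d{4})    # main number
    ''', re.VERBOSE)
parts = phone_regex.search('My number is 415-555-4242.').groups()

print(case_hit, first_line, everything == text, parts)
```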
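8. Non-capturing groups: (?:…)
A (?:…) group groups part of a pattern without capturing it, so findall() below returns only the text of the one capturing group.
>>> re.findall(r'//(?:\w+\.)*(\w+\.com)', '//google.com //www.google.com //code.google.com')
['google.com', 'google.com', 'google.com']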
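For comparison, the sketch below runs the same pattern with an ordinary capturing group in place of (?:…). Because the pattern then has two groups, findall() returns tuples instead of plain strings. The variable names are illustrative, and the contents of the first tuple slot are deliberately not shown.

```python
import re

text = '//google.com //www.google.com //code.google.com'

# Two capturing groups: each result is a 2-tuple.
capturing = re.findall(r'//(\w+\.)*(\w+\.com)', text)

# One capturing group plus a non-capturing group: each result is a plain string.
non_capturing = re.findall(r'//(?:\w+\.)*(\w+\.com)', text)

# The domain is the second item of each tuple in the capturing version.
domains_from_tuples = [pair[1] for pair in capturing]
print(non_capturing)
print(domains_from_tuples)
```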
9. Code practice
# (File reading and writing) 瘋狂填詞2.py
'''
Create a Mad Libs program that reads in a text file and lets the user add
their own text anywhere the word ADJECTIVE, NOUN, ADVERB, or VERB appears
in the text file. For example, a text file may look like this:

The ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was
unaffected by these events.

The program finds these occurrences and prompts the user to replace them.

Enter an adjective: silly
Enter a noun: chandelier
Enter a verb: screamed
Enter a noun: pickup truck

The following text file is then created:

The silly panda walked to the chandelier and then screamed. A nearby pickup
truck was unaffected by these events.

The result should be printed to the screen and saved to a new text file.
'''
import re


def mad_libs(filename_path, save_path):
    # The with statement closes the file automatically, so no close() call is needed.
    with open(filename_path, 'r') as source:  # the path may be relative
        words = source.read()

    # Match only the four placeholder words, and only as whole words.
    Regex = re.compile(r'\b(?:ADJECTIVE|NOUN|ADVERB|VERB)\b')
    finds = Regex.findall(words)
    for i in finds:
        replace = input('Enter the word to replace {} with:\n'.format(i))
        # Assign the result back to words so that each substitution builds on
        # the previous one; otherwise only the last replacement would survive.
        # Passing a function keeps any backslashes in the user's input literal.
        words = Regex.sub(lambda mo: replace, words, 1)
    print(words)

    with open(save_path, 'a') as txt:
        txt.write(words + '\n')  # end each saved result with a newline


if __name__ == '__main__':
    filename_path = input('Enter the path of the txt file to fill in: ')  # '瘋狂填詞原始文檔.txt'
    save_path = input('Enter the path to save to (including the file name): ')  # '保存瘋狂填詞文檔.txt'
    mad_libs(filename_path, save_path)
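The substitution step can also be tried without files or keyboard input. The sketch below is one possible variant, not the program above: fill_placeholders and PLACEHOLDER_REGEX are hypothetical names, and the answers are supplied as a list instead of being read with input().

```python
import re

# Hypothetical helper names; the placeholder words come from the exercise text.
PLACEHOLDER_REGEX = re.compile(r'\b(?:ADJECTIVE|NOUN|ADVERB|VERB)\b')


def fill_placeholders(text, answers):
    """Replace each placeholder, left to right, with the next item of answers."""
    remaining = iter(answers)
    # sub() calls the function once per match, in order of appearance.
    return PLACEHOLDER_REGEX.sub(lambda mo: next(remaining), text)


template = ('The ADJECTIVE panda walked to the NOUN and then VERB. '
            'A nearby NOUN was unaffected by these events.')
filled = fill_placeholders(template, ['silly', 'chandelier', 'screamed', 'pickup truck'])
print(filled)
```

Doing all substitutions in a single sub() call means text typed by the user is never rescanned, so an answer that happens to contain a word such as NOUN is left alone.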