Python: Regular Expressions

April 22, 2021
Notes
Python Basics, Programming Fundamentals

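1. Overview of Regular Expressions

A regular expression (regex for short) is a string made up of ordinary characters and special symbols that describes the repetition of a pattern or stands for multiple characters.

A regular expression matches, according to some pattern, a whole family of strings that share similar features.

In other words, a single regex can match many different strings.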
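As a small illustration of one pattern matching several strings, the sketch below checks a single pattern against a few sample words. The pattern and the word list are made up for this example.

```python
import re

# Hypothetical pattern: 'b', then any one character, then 't'
pattern = re.compile(r'b.t')

candidates = ['bat', 'bit', 'but', 'boot']
# fullmatch() succeeds only when the whole string fits the pattern
matched = [word for word in candidates if pattern.fullmatch(word)]
print(matched)  # ['bat', 'bit', 'but']
```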

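Regex syntax differs from language to language; this post covers Python's regular expressions.

Most of the explanatory code is taken from Automate the Boring Stuff with Python (《Python編程快速上手讓繁瑣工作自動化》).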

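2. Writing Regular Expressions

A regular expression is just a string. Unlike an ordinary string, it contains zero or more expression symbols and special characters; see section 1.2 of Core Python Programming (《Python核心編程》).

# Writing regular expressions (patterns with a backslash are written as raw strings)

'hing'
r'\wing'
'123456'
r'\d\d\d\d\d\d'
'regex.py'
r'.*\.py'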
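One way to read the listing is as pairs: each plain string is matched by the pattern that follows it. The sketch below checks that reading with re.fullmatch(); the pairing itself is an assumption about how the listing is meant to be read.

```python
import re

# (sample text, pattern) pairs taken from the listing
pairs = [
    ('hing', r'\wing'),
    ('123456', r'\d\d\d\d\d\d'),
    ('regex.py', r'.*\.py'),
]

# fullmatch() requires the pattern to cover the entire sample text
results = [bool(re.fullmatch(pattern, text)) for text, pattern in pairs]
print(results)  # [True, True, True]
```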

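3. Creating a Regex Object

A regular expression on its own cannot match anything. To match it against target text, you need to create a regex object. Usually you pass the pattern to the compile() function as a raw string, i.e. r'...'.

>>> # The re module's compile() function returns (creates) a Regex pattern object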
>>> import re
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
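The following sketch shows why the raw-string form is convenient and then uses the compiled object once. The name phone_regex and the sample sentence are placeholders chosen for this example.

```python
import re

# In a normal string literal every backslash must be doubled;
# a raw string keeps the backslash exactly as typed.
same_pattern = (r'\d\d\d' == '\\d\\d\\d')
print(same_pattern)  # True

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phone_regex.search('Call 415-555-4242 tomorrow.')
found = mo.group() if mo else None
print(found)  # 415-555-4242
```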

4. Common Regex Patterns

4.1 Grouping with Parentheses

>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # create a Regex object
>>> mo = Regex.search('My number is 415-555-4242.')   # returns a Match object
>>> mo.group()         # the Match object's group() method returns the entire matched text
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.groups()
('415', '555-4242')
>>> a, b = mo.groups()   # groups() returns a tuple of all the group values
>>> a
'415'
>>> b
'555-4242'
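Because parentheses have a special meaning, matching a literal parenthesis requires escaping it. A minimal sketch, with paren_regex and the sample sentence chosen for illustration:

```python
import re

# \( and \) match literal parentheses; the unescaped pairs still form groups
paren_regex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = paren_regex.search('My phone number is (415) 555-4242.')
area_code, main_number = mo.groups()
print(area_code)    # (415)
print(main_number)  # 555-4242
```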

4.2 Matching Multiple Alternatives with the Pipe

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()
'Batman'
>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'
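The pipe can also be combined with parentheses so that a shared prefix is written only once. The sketch below assumes a made-up sample sentence:

```python
import re

# The shared prefix "Bat" is written once; the group holds the alternatives
bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = bat_regex.search('Batmobile lost a wheel')
whole = mo.group()    # the full match
suffix = mo.group(1)  # only the alternative that matched inside the group
print(whole)   # Batmobile
print(suffix)  # mobile
```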

4.3 Optional Matching with the Question Mark

>>> batRegex = re.compile(r'Bat(wo)?man')   # without the parentheses, ? would make only the preceding 'o' optional
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
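The same idea applies to phone numbers with an optional area code. This is a sketch with invented sample text, not a complete phone-number validator:

```python
import re

# The group (\d\d\d-) is the area code plus hyphen; ? makes the whole group optional
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
with_area = phone_regex.search('My number is 415-555-4242').group()
without_area = phone_regex.search('My number is 555-4242').group()
print(with_area)     # 415-555-4242
print(without_area)  # 555-4242
```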

4.4 Matching Zero or More Times with the Star

>>> batRegex = re.compile(r'Bat(wo)*man') # to match a literal '*', use \*
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

4.5 Matching One or More Times with the Plus

>>> batRegex = re.compile(r'Bat(wo)+man')  # to match a literal '+', use \+
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 is None
True
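To compare ?, * and + side by side, the sketch below runs the three Bat(wo)…man patterns over the same sample strings. The sample list is chosen for this comparison.

```python
import re

samples = ['Batman', 'Batwoman', 'Batwowoman']
quantifiers = {'?': r'Bat(wo)?man', '*': r'Bat(wo)*man', '+': r'Bat(wo)+man'}

# For each quantifier, record which sample strings the pattern matches in full
table = {
    symbol: [s for s in samples if re.fullmatch(pattern, s)]
    for symbol, pattern in quantifiers.items()
}
for symbol, matches in table.items():
    print(symbol, matches)
# ? ['Batman', 'Batwoman']
# * ['Batman', 'Batwoman', 'Batwowoman']
# + ['Batwoman', 'Batwowoman']
```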

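4.6 Matching Specific Repetitions with Curly Braces

In the code below, the "?" marks a non-greedy match. The question mark can have two meanings in a regular expression: declaring a non-greedy match, or marking an optional group. The two meanings are entirely unrelated.

>>> greedyHaRegex = re.compile(r'(Ha){3,5}') # to match a literal '{', use \{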
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'
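Besides the {3,5} range, the braces also accept an exact count or an open-ended bound. A short sketch of those forms on the same kind of input:

```python
import re

text = 'HaHaHaHaHa'  # five repetitions of 'Ha'

exactly_three = re.search(r'(Ha){3}', text).group()   # exactly 3 times
three_or_more = re.search(r'(Ha){3,}', text).group()  # 3 or more times
up_to_two = re.search(r'(Ha){,2}', text).group()      # 0 to 2 times
print(exactly_three)  # HaHaHa
print(three_or_more)  # HaHaHaHaHa
print(up_to_two)      # HaHa
```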

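5. Greedy and Non-greedy Matching

Non-greedy matching is usually used so that the wildcard (.) does not also consume characters that belong to the rest of the pattern, as in the code below. Section 4.6 is another example of non-greedy matching.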

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'
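The difference is easy to see with findall() on text that contains several bracketed tags. The sample string is made up for this sketch:

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # invented sample text

greedy = re.findall(r'<.*>', html)       # .* runs to the last '>'
non_greedy = re.findall(r'<.*?>', html)  # .*? stops at the first '>'
print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```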

6. Common Regex Object Methods

As described above, compile() creates a Regex object. Its most commonly used methods are shown below.

6.1 search(), group(), groups()

search() returns a Match object, and that object's group() and groups() methods retrieve the matched text.

>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # create a Regex object
>>> mo = Regex.search('My number is 415-555-4242.')   # returns a Match object
>>> mo.group()         # the Match object's group() method returns the entire matched text
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.groups()
('415', '555-4242')
>>> a, b = mo.groups()   # groups() returns a tuple of all the group values
>>> a
'415'
>>> b
'555-4242'
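search() returns None when nothing matches, so calling group() on the result without a check raises an AttributeError. A minimal sketch of a guard, where find_area_code is a hypothetical helper written for this example:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

def find_area_code(text):
    """Return the area code of the first phone number in text, or None."""
    mo = phone_regex.search(text)
    if mo is None:  # search() found nothing
        return None
    return mo.group(1)

print(find_area_code('My number is 415-555-4242.'))  # 415
print(find_area_code('No number here.'))             # None
```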

6.2 findall()

When called on a regex with no groups, findall() returns a list of the matched strings.

When called on a regex with groups, findall() returns a list of tuples of strings, one string per group.

>>> Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> Regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']
>>> Regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> Regex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]
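Two related cases are worth a quick sketch: with exactly one group, findall() returns a flat list of that group's text, and re.finditer() (not covered above) yields Match objects when both the full match and the groups are needed.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'
one_group = r'(\d\d\d)-\d\d\d-\d\d\d\d'

# One group: findall() returns just that group's text for each match
area_codes = re.findall(one_group, text)
print(area_codes)  # ['415', '212']

# finditer() yields Match objects, so the full match is still available
full_numbers = [mo.group() for mo in re.finditer(one_group, text)]
print(full_numbers)  # ['415-555-9999', '212-555-0000']
```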

6.3 sub()

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.', 1)  # count=1: replace only the first match
'CENSORED gave the secret documents to Agent Bob.'
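The replacement string may also refer back to a group with \1, \2 and so on. The sketch below keeps only the first letter of each agent's name; the sample sentence is made up.

```python
import re

# (\w) captures the first letter of the name; \w* matches the rest of it
agent_regex = re.compile(r'Agent (\w)\w*')
censored = agent_regex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew.')
print(censored)  # A**** told C**** that E**** knew.
```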

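7. re.IGNORECASE, re.DOTALL, and re.VERBOSE

To make a regex case-insensitive, pass re.IGNORECASE (or re.I) as the second argument to re.compile().

Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including the newline character.

To spread a regex over multiple lines and add comments to it, pass re.VERBOSE as the second argument to re.compile().

To use several of these options at once, combine them with the | operator: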

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
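A short sketch of what each flag changes, one at a time. The variable names and sample strings are chosen for this example.

```python
import re

# re.I (short for re.IGNORECASE): letter case is ignored
robocop = re.compile(r'robocop', re.I)
case_hit = robocop.search('ROBOCOP protects the innocent.').group()
print(case_hit)  # ROBOCOP

# The dot normally stops at a newline; re.DOTALL lets it cross lines
text = 'first line\nsecond line'
one_line = re.search(r'.*', text).group()
all_lines = re.search(r'.*', text, re.DOTALL).group()
print(one_line)           # first line
print(all_lines == text)  # True

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
phone = re.compile(r'''
    (\d{3})   # area code
    -         # separator
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    ''', re.VERBOSE)
parts = phone.search('Call 415-555-4242').groups()
print(parts)  # ('415', '555', '4242')
```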

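8. (?:…)

(?:…) groups part of a pattern without capturing it, so findall() below returns only the text of the single capturing group.

>>> re.findall(r'//(?:\w+\.)*(\w+\.com)', '//google.com //www.google.com //code.google.com')
['google.com', 'google.com', 'google.com']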
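For contrast, the sketch below runs the same search with an ordinary capturing group in place of (?:…). With two capturing groups, findall() returns a 2-tuple per match instead of a plain string.

```python
import re

urls = '//google.com //www.google.com //code.google.com'

# Capturing prefix group: findall() returns one tuple per match, one item per group
capturing = re.findall(r'//(\w+\.)*(\w+\.com)', urls)
# Non-capturing prefix group: only the (\w+\.com) group is returned
non_capturing = re.findall(r'//(?:\w+\.)*(\w+\.com)', urls)

print(capturing)      # a list of 2-tuples
print(non_capturing)  # ['google.com', 'google.com', 'google.com']
```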

9. Code Practice

# (File reading and writing) Mad Libs 2 (瘋狂填詞2.py)

'''
Create a Mad Libs program that reads in a text file and lets the user add their own text
anywhere the word ADJECTIVE, NOUN, ADVERB, or VERB appears in it. For example, a text file may look like this:
The ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN was
unaffected by these events.
The program finds these occurrences and prompts the user to replace them.
Enter an adjective:
silly
Enter a noun:
chandelier
Enter a verb:
screamed
Enter a noun:
pickup truck
The following text file is then created:
The silly panda walked to the chandelier and then screamed. A nearby pickup truck was unaffected by these events.
The results should be printed to the screen and saved to a new text file.
'''


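import re

def mad_libs(filename_path, save_path):
    with open(filename_path, 'r') as source:  # path relative to the working directory
        words = source.read()
    # Match only the four placeholder words, as whole words
    Regex = re.compile(r'\b(?:ADJECTIVE|NOUN|ADVERB|VERB)\b')
    finds = Regex.findall(words)
    for i in finds:
        replace = input('Enter the word to replace {}:\n'.format(i))
        # Assign back to words so the replacements accumulate; otherwise only the
        # last one would show up. The lambda keeps backslashes in the input literal.
        words = Regex.sub(lambda mo: replace, words, 1)  # replace the first remaining placeholder
    print(words)

    # No source.close() needed: the with statement closes the file automatically

    with open(save_path, 'a') as txt:
        txt.write(words + '\n')  # end the entry with a newline

    # Equivalent without a context manager:
    # save_txt = open('保存瘋狂填詞文檔.txt','a')
    # save_txt.write(words)
    # save_txt.close()

if __name__ == '__main__':
    filename_path = input('Path of the txt file to fill in: ')    # e.g. '瘋狂填詞原始文檔.txt'
    save_path = input('Path of the file to save to (including the file name): ')  # e.g. '保存瘋狂填詞文檔.txt'
    mad_libs(filename_path, save_path)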
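For testing the substitution logic without keyboard input or files, the replacement step can be separated out. The sketch below is one possible variant, not the program above: fill_placeholders and PLACEHOLDER are hypothetical names, and the answers are supplied as a list instead of being read with input().

```python
import re

# Assumed placeholder set, taken from the exercise description
PLACEHOLDER = re.compile(r'\b(?:ADJECTIVE|NOUN|ADVERB|VERB)\b')

def fill_placeholders(text, answers):
    """Replace each placeholder in text, in order, with the next answer."""
    remaining = iter(answers)
    # sub() calls the function once per match and inserts its return value
    return PLACEHOLDER.sub(lambda mo: next(remaining), text)

template = 'The ADJECTIVE panda walked to the NOUN and then VERB.'
filled = fill_placeholders(template, ['silly', 'chandelier', 'screamed'])
print(filled)  # The silly panda walked to the chandelier and then screamed.
```

Passing a function to sub() handles every placeholder in a single pass, so an answer that happens to contain a word such as NOUN is not replaced again.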
