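Python Regular Expressions

What Is a Regular Expression?

A regular expression (regex or regexp for short) is a powerful tool for describing string patterns. It is a special sequence of characters that lets you check whether a string contains a given pattern, or perform search-and-replace operations.

In Python, regular expressions are supported by the re module in the standard library, which lets you carry out complex string processing directly in Python code.

Why Are Regular Expressions Worth Learning?

Typical uses include:

Data validation (for example, email address and phone number formats)
Searching for and replacing specific patterns in text
Extracting information from large bodies of text
Parsing and processing text

Basic Syntax

Importing the re Module

Before using regular expressions, import Python's re module: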

python
import re

The Simplest Match

Let's start with the simplest possible example:

python
import re

text = "Hello, Python!"
result = re.search("Python", text)
print(result)  # <re.Match object; span=(7, 13), match='Python'>

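In the example above, re.search() looks for the first occurrence of the string "Python" in the text and returns a Match object describing where it was found.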
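
One detail worth keeping in mind is that re.search() returns None when nothing matches, so calling a method such as .group() on the result without checking it first raises AttributeError. The following is a minimal sketch of a guarded lookup; the helper name `find_word` is hypothetical and exists only for this illustration.

```python
import re

def find_word(word, text):
    """Return the (start, end) span of the first occurrence of word, or None."""
    # re.escape() ensures the word is treated as literal text, not as a pattern
    match = re.search(re.escape(word), text)
    if match is None:
        return None
    return match.span()

print(find_word("Python", "Hello, Python!"))  # (7, 13)
print(find_word("Java", "Hello, Python!"))    # None
```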

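Special Characters

Regular expressions use special characters to express different kinds of patterns:

Character	Description
`.`	Matches any single character except a newline
`^`	Matches the start of the string
`$`	Matches the end of the string
`*`	Matches the preceding expression zero or more times
`+`	Matches the preceding expression one or more times
`?`	Matches the preceding expression zero or one time
`\`	Escapes a special character
`[]`	Character set; matches any one character in the set
`()`	Grouping
`\d`	Matches a digit
`\w`	Matches a letter, digit, or underscore
`\s`	Matches a whitespace character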
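
To make the table more concrete, here is a small sketch that applies a few of these characters to a made-up sample string (the sample text is an assumption chosen for illustration). It also shows why patterns containing backslashes are usually written as raw strings.

```python
import re

text = "Order 66 shipped on day_3\tto Room 101"

# Raw strings (r"...") pass backslashes through to the regex engine unchanged
digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, or underscores
spaces = re.findall(r"\s", text)    # single whitespace characters (spaces and the tab)

print(digits)       # ['66', '3', '101']
print(words)        # ['Order', '66', 'shipped', 'on', 'day_3', 'to', 'Room', '101']
print(len(spaces))  # 7

# A backslash turns a special character into a literal one
escaped = re.findall(r"3\.14", "3.14 3x14")
unescaped = re.findall(r"3.14", "3.14 3x14")
print(escaped)    # ['3.14']
print(unescaped)  # ['3.14', '3x14']
```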

Common Functions

Python's re module provides several functions for working with regular expressions:

re.search()

Finds the first match anywhere in the string:

python
import re

text = "Python is amazing and Python is powerful"
match = re.search("Python", text)
print(match.group())  # Python
print(match.start())  # 0
print(match.end())    # 6

re.match()

Matches only at the beginning of the string:

python
import re

text = "Python is amazing"
match = re.match("Python", text)  # Succeeds
print(match.group())  # Python

text = "I love Python"
match = re.match("Python", text)  # Fails because "Python" is not at the start
print(match)  # None
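
The difference between anchoring at the start and matching the whole string is easy to miss. As a rough side-by-side sketch, the snippet below compares re.match() and re.search() with re.fullmatch(), a related function in the re module that this tutorial does not otherwise cover.

```python
import re

text = "Python 3"

starts_with = bool(re.match("Python", text))         # pattern matches at the start
found_anywhere = bool(re.search("3", text))          # pattern found somewhere in the string
whole_partial = bool(re.fullmatch("Python", text))   # " 3" is left over, so no match
whole_exact = bool(re.fullmatch(r"Python \d", text)) # the entire string fits the pattern

print(starts_with)     # True
print(found_anywhere)  # True
print(whole_partial)   # False
print(whole_exact)     # True
```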

re.findall()

Finds all matching substrings and returns them as a list:

python
import re

text = "Python is amazing and Python is powerful"
matches = re.findall("Python", text)
print(matches)  # ['Python', 'Python']

re.finditer()

Returns an iterator that yields a Match object for every match:

python
import re

text = "Python is amazing and Python is powerful"
for match in re.finditer("Python", text):
    print(f"Found at position {match.start()}-{match.end()}: {match.group()}")

# Output:
# Found at position 0-6: Python
# Found at position 22-28: Python

re.sub()

Replaces the matching substrings:

python
import re

text = "Python is amazing and Python is powerful"
new_text = re.sub("Python", "Java", text)
print(new_text)  # Java is amazing and Java is powerful
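
re.sub() also accepts a count argument and a function as the replacement. The sketch below illustrates both options; the second sample sentence is made up for the example.

```python
import re

text = "Python is amazing and Python is powerful"

# count=1 limits the replacement to the first match
first_only = re.sub("Python", "Java", text, count=1)
print(first_only)  # Java is amazing and Python is powerful

# A function receives each Match object and returns the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 5 pears")
print(doubled)  # 6 apples and 10 pears
```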

Metacharacters in Detail

Dot (.)

Matches any single character except a newline:

python
import re

text = "cat, bat, rat, mat"
matches = re.findall(".at", text)
print(matches)  # ['cat', 'bat', 'rat', 'mat']

Caret (^)

Matches the start of the string:

python
import re

text = "Python is amazing"
match = re.search("^Python", text)
print(match.group())  # Python

Dollar Sign ($)

Matches the end of the string:

python
import re

text = "I love Python"
match = re.search("Python$", text)
print(match.group())  # Python
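
By default, ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag (not covered elsewhere in this tutorial) they also match at the start and end of each line. A short sketch on a made-up three-line string:

```python
import re

text = "first line\nsecond line\nthird line"

# Without the flag, ^ only matches at the very start of the string
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ and $ also match at every line boundary
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts)  # ['first']
print(line_starts)     # ['first', 'second', 'third']
print(line_ends)       # ['line', 'line', 'line']
```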

Asterisk (*)

Matches the preceding expression zero or more times:

python
import re

text = "ct cat caat caaat"
matches = re.findall("ca*t", text)
print(matches)  # ['ct', 'cat', 'caat', 'caaat']

Plus Sign (+)

Matches the preceding expression one or more times:

python
import re

text = "ca cat caat caaat"
matches = re.findall("ca+t", text)
print(matches)  # ['cat', 'caat', 'caaat']

Question Mark (?)

Matches the preceding expression zero or one time:

python
import re

text = "color colour"
matches = re.findall("colou?r", text)
print(matches)  # ['color', 'colour']

Character Sets []

A character set matches any one character from a group of characters:

python
import re

text = "cat, bat, rat, mat"
matches = re.findall("[cbr]at", text)
print(matches)  # ['cat', 'bat', 'rat']

Ranges

A hyphen inside a character set denotes a range:

python
import re

text = "1 a 2 b 3 c 4 d"
# Match every digit
digits = re.findall("[0-9]", text)
print(digits)  # ['1', '2', '3', '4']

# Match every lowercase letter
letters = re.findall("[a-z]", text)
print(letters)  # ['a', 'b', 'c', 'd']

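Negated Character Sets

Placing ^ at the start of a character set negates it, so it matches any character that is not in the set: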

python
import re

text = "cat, bat, rat, mat"
matches = re.findall("[^cbr]at", text)
print(matches)  # ['mat']

Quantifiers

Quantifiers specify how many times the preceding expression should match:

`{n}`

Matches exactly n times:

python
import re

text = "ca cat caat caaat"
matches = re.findall("ca{2}t", text)
print(matches)  # ['caat']

`{n,}`

Matches at least n times:

python
import re

text = "ca cat caat caaat"
matches = re.findall("ca{2,}t", text)
print(matches)  # ['caat', 'caaat']

`{n,m}`

Matches between n and m times:

python
import re

text = "ca cat caat caaat caaaat"
matches = re.findall("ca{2,3}t", text)
print(matches)  # ['caat', 'caaat']
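
All of the quantifiers above are greedy: they match as much text as they can. Appending ? to a quantifier (for example `.*?` or `.+?`, as used in the log-parsing example later on) makes it lazy, so it matches as little as possible. A minimal sketch of the difference on a made-up snippet of markup:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string
greedy = re.findall(r"<.+>", text)
# Lazy: .+? stops at the first ">" it can
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```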

Grouping ()

Groups let you capture the parts of the text matched by a subpattern:

python
import re

text = "John Smith, Jane Doe, Bob Johnson"
pattern = r"(\w+) (\w+)"
matches = re.findall(pattern, text)
print(matches)  # [('John', 'Smith'), ('Jane', 'Doe'), ('Bob', 'Johnson')]

Named Groups

Groups can be given names, which makes them easier to refer to later:

python
import re

text = "John Smith"
pattern = r"(?P<first_name>\w+) (?P<last_name>\w+)"
match = re.search(pattern, text)
print(match.group("first_name"))  # John
print(match.group("last_name"))   # Smith
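
Named groups combine well with Match.groupdict(), which returns all named captures as a dictionary. As a small sketch, the same pattern can be applied to several names with re.finditer():

```python
import re

pattern = r"(?P<first_name>\w+) (?P<last_name>\w+)"
text = "John Smith, Jane Doe"

# groupdict() maps each group name to the text it captured
people = [m.groupdict() for m in re.finditer(pattern, text)]
print(people)
# [{'first_name': 'John', 'last_name': 'Smith'}, {'first_name': 'Jane', 'last_name': 'Doe'}]
```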

Practical Examples

Example 1: Email Validation

python
import re

def is_valid_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return bool(re.match(pattern, email))

# Test
emails = ["user@example.com", "invalid-email", "another.user@domain.co.uk"]
for email in emails:
    if is_valid_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is not a valid email address")

# Output:
# user@example.com is a valid email address
# invalid-email is not a valid email address
# another.user@domain.co.uk is a valid email address
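
One subtlety of the `$` anchor is that it also matches just before a trailing newline, so the validator above accepts "user@example.com\n". A possible stricter variant, sketched here under the assumption that trailing newlines should be rejected, uses re.fullmatch(); the names `EMAIL_RE` and `is_valid_email_strict` are hypothetical.

```python
import re

# Same character rules as above, but without ^ and $ because fullmatch()
# requires the pattern to cover the entire string
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email_strict(email):
    return EMAIL_RE.fullmatch(email) is not None

anchored = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
loose_result = bool(re.match(anchored, "user@example.com\n"))

print(loose_result)                                  # True: $ matches before the newline
print(is_valid_email_strict("user@example.com\n"))  # False
print(is_valid_email_strict("user@example.com"))    # True
```

Like the original pattern, this only checks the general shape of an address; it is not a full implementation of the email address standards.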

Example 2: Extracting Dates from Text

python
import re

text = """
Meeting schedule:
- 2023-05-15: Project kickoff
- 2023-06-30: Midterm review
- 2023-09-22: Final demo
"""

date_pattern = r"\d{4}-\d{2}-\d{2}"
dates = re.findall(date_pattern, text)

print("Extracted dates:")
for date in dates:
    print(date)

# Output:
# Extracted dates:
# 2023-05-15
# 2023-06-30
# 2023-09-22
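
The pattern above checks only the shape of a date, so a string such as "2023-13-45" would also be extracted. One way to filter out impossible dates, sketched here with the standard datetime module, is to parse each candidate after the regex has found it. The function name `extract_valid_dates` is hypothetical.

```python
import re
from datetime import datetime

def extract_valid_dates(text):
    """Return date objects for every YYYY-MM-DD substring that is a real calendar date."""
    valid = []
    for candidate in re.findall(r"\d{4}-\d{2}-\d{2}", text):
        try:
            valid.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # The regex matched the shape, but the date does not exist
            pass
    return valid

print(extract_valid_dates("Due 2023-05-15, not 2023-13-45"))
# [datetime.date(2023, 5, 15)]
```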

Example 3: Parsing a Log File

python
import re

log_line = '192.168.1.1 - - [20/May/2023:10:12:45 +0000] "GET /index.html HTTP/1.1" 200 1234'

# Extract the IP address, timestamp, request method, URL path, and status code
pattern = r'(\d+\.\d+\.\d+\.\d+).*\[(.*?)\].*"(\w+) (.*?) HTTP.*" (\d+)'
match = re.search(pattern, log_line)

if match:
    ip = match.group(1)
    timestamp = match.group(2)  # named timestamp so it cannot shadow the datetime module
    method = match.group(3)
    path = match.group(4)
    status = match.group(5)

    print(f"IP address: {ip}")
    print(f"Timestamp: {timestamp}")
    print(f"Request method: {method}")
    print(f"URL path: {path}")
    print(f"Status code: {status}")

# Output:
# IP address: 192.168.1.1
# Timestamp: 20/May/2023:10:12:45 +0000
# Request method: GET
# URL path: /index.html
# Status code: 200
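
The same log line can also be parsed with named groups, which avoids counting group numbers. The version below is only a sketch: it assumes every line follows the layout of the sample shown above, and the names `LOG_RE` and `parse_log_line` are hypothetical.

```python
import re

# Assumed layout: ip, two placeholder fields, [timestamp], "METHOD path protocol", status, size
LOG_RE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return a dict of the named fields, or None if the line does not fit the layout."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

log_line = '192.168.1.1 - - [20/May/2023:10:12:45 +0000] "GET /index.html HTTP/1.1" 200 1234'
entry = parse_log_line(log_line)
print(entry["ip"], entry["method"], entry["path"], entry["status"])
# 192.168.1.1 GET /index.html 200
```

Replacing the loose `.*` segments with explicit field patterns makes the expression stricter, so lines that do not fit the assumed layout return None instead of producing a partial match.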

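Performance Tip

For regular expressions that are used frequently, consider precompiling them with re.compile(). The re module already caches recently used patterns, so the speedup is usually modest, but a compiled pattern skips the cache lookup on every call and gives the expression a reusable name: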

python
import re

# Precompile the regular expression
email_pattern = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Reuse it as many times as needed
email = "user@example.com"
if email_pattern.match(email):
    print("Valid email address")
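
A compiled pattern exposes the same methods as the module-level functions (search, match, findall, sub, and so on). The sketch below reuses one compiled pattern across several lines; the sample lines and the name `DATE_RE` are made up for illustration.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = [
    "2023-05-15: Project kickoff",
    "No date on this line",
    "Review moved to 2023-06-30",
]

# The pattern is compiled once and reused for every line
dated_lines = [line for line in lines if DATE_RE.search(line)]
all_dates = [d for line in lines for d in DATE_RE.findall(line)]

print(dated_lines)  # ['2023-05-15: Project kickoff', 'Review moved to 2023-06-30']
print(all_dates)    # ['2023-05-15', '2023-06-30']
```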

Common Regular Expression Patterns

Here are some commonly used regular expression patterns:

python
# Match a phone number (e.g. 123-456-7890)
phone_pattern = r"\d{3}-\d{3}-\d{4}"

# Match a URL
url_pattern = r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}[/\w.-]*"

# Match an IP address
ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

# Match a date (e.g. 2023-05-15, 15/05/2023)
date_pattern1 = r"\d{4}-\d{2}-\d{2}"
date_pattern2 = r"\d{2}/\d{2}/\d{4}"

# Match a tag (e.g. <div>, <a href="...">)
tag_pattern = r"<[^>]+>"
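
As a quick illustration of how these patterns behave, the sketch below runs three of them over a made-up sentence. It also shows that they check shape rather than validity: the IP pattern accepts octets above 255.

```python
import re

phone_pattern = r"\d{3}-\d{3}-\d{4}"
url_pattern = r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}[/\w.-]*"
ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

text = "Call 123-456-7890 or visit https://example.com/docs from 10.0.0.1"

phones = re.findall(phone_pattern, text)
urls = re.findall(url_pattern, text)
ips = re.findall(ip_pattern, text)
out_of_range = re.findall(ip_pattern, "999.999.999.999")

print(phones)        # ['123-456-7890']
print(urls)          # ['https://example.com/docs']
print(ips)           # ['10.0.0.1']
print(out_of_range)  # ['999.999.999.999'] (right shape, but not a valid address)
```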

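Note

Regular expressions are powerful, but they can also become very complex. For complex string processing, another parsing approach is sometimes a better fit; for example, use a dedicated HTML parser rather than regular expressions to parse HTML.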
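
To give a sense of what the parser-based alternative looks like, here is a minimal sketch that collects href values with the standard library's html.parser module. The class name `LinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="https://example.com">Home</a> <a class="x" href=\'/about\'>About</a></p>')
print(collector.links)  # ['https://example.com', '/about']
```

The parser handles details such as attribute order and quoting style, which a hand-written regular expression would have to account for explicitly.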

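Summary

Regular expressions are a powerful tool for processing strings in Python. They can be used to:

Validate string formats (such as email addresses and phone numbers)
Find and extract specific patterns in text
Replace specific patterns in text
Split strings

Mastering regular expressions takes plenty of practice, but once you are comfortable with them, they become a valuable part of your string-processing toolbox.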
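
Splitting is the one use in this list that the examples above do not demonstrate. A minimal sketch using re.split(), which splits a string wherever the pattern matches (the sample input is made up):

```python
import re

# Split on commas or semicolons, absorbing any surrounding whitespace
parts = re.split(r"\s*[,;]\s*", "apple, banana;cherry ,  date")
print(parts)  # ['apple', 'banana', 'cherry', 'date']
```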

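Exercises

Write a regular expression that checks whether a password meets the following rules: at least 8 characters, with at least one digit, one uppercase letter, and one lowercase letter.
Write a function that extracts all Chinese mobile phone numbers from a given text (format: 11 digits starting with 1).
Write a regular expression that extracts all links from HTML text (that is, the href attribute values of <a> tags).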
