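C++ Regular Expressions

What Is a Regular Expression?

A regular expression (often shortened to regex) is a compact notation for describing string patterns, used to match, search, and replace text. The C++11 standard added native regular expression support to the standard library, so common string-processing tasks no longer require a third-party library.

Note

A regular expression works like a small "search language": it lets us describe and find complex character patterns concisely.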

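Overview of the C++ Regular Expression Library

In C++, regular expression support lives in the <regex> header. Its core components are:

std::regex class: represents a regular expression
std::regex_match function: checks whether the entire string matches the regular expression
std::regex_search function: finds a part of the string that matches the regular expression
std::regex_replace function: replaces the parts of a string that match the regular expression
std::smatch class: stores the results of a match on a std::string (an alias for std::match_results<std::string::const_iterator>)

Basic Syntax

Before going further, here is a minimal example of using a regular expression in C++: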

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Create a regular expression object
    std::regex pattern("hello");
    
    // The string to search
    std::string text = "hello world";
    
    // Check whether the string contains the pattern
    bool contains = std::regex_search(text, pattern);
    
    if (contains) {
        std::cout << "Match found!" << std::endl;
    } else {
        std::cout << "No match found!" << std::endl;
    }
    
    return 0;
}

Output:

Match found!

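Regular Expression Syntax Elements

A regular expression is built from ordinary characters and special characters, each of which has a specific meaning:

Element	Description
`.`	Matches any single character except a line terminator
`^`	Matches the start of the string
`$`	Matches the end of the string
`*`	Matches the preceding element zero or more times
`+`	Matches the preceding element one or more times
`?`	Matches the preceding element zero or one time
`{n}`	Matches the preceding element exactly n times
`{n,}`	Matches the preceding element at least n times
`{n,m}`	Matches the preceding element at least n times, but no more than m times
`[]`	Character class; matches any one of the characters inside the brackets
`()`	Group; captures the matched substring
`|`	Alternation; matches one of two or more alternatives
`\d`	Matches a digit, equivalent to `[0-9]`
`\D`	Matches a non-digit character, equivalent to `[^0-9]`
`\s`	Matches a whitespace character
`\S`	Matches a non-whitespace character
`\w`	Matches a letter, digit, or underscore, equivalent to `[a-zA-Z0-9_]`
`\W`	Matches any character that is not a letter, digit, or underscore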

Common Regular Expression Functions

`std::regex_match`

The regex_match function checks whether the entire string matches the regular expression:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex pattern("\\d{3}-\\d{4}");  // A format such as "123-4567"
    
    std::string text1 = "123-4567";
    std::string text2 = "123-45678";
    std::string text3 = "abc-defg";
    
    std::cout << "text1: " << std::regex_match(text1, pattern) << std::endl;
    std::cout << "text2: " << std::regex_match(text2, pattern) << std::endl;
    std::cout << "text3: " << std::regex_match(text3, pattern) << std::endl;
    
    return 0;
}

Output:

text1: 1
text2: 0
text3: 0

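Tip

In a C++ string literal the backslash (\) is itself an escape character, so every backslash in a regular expression must be doubled. For example, to express \d we write \\d.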

`std::regex_search`

The regex_search function finds a part of the string that matches the regular expression:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Office phone: 010-12345678, mobile: 13812345678";
    std::regex pattern("\\d{3}-\\d{8}|1\\d{10}");  // Matches a landline or a mobile number
    
    std::smatch matches;
    std::string::const_iterator searchStart(text.cbegin());
    
    while (std::regex_search(searchStart, text.cend(), matches, pattern)) {
        std::cout << "Found number: " << matches[0] << std::endl;
        searchStart = matches.suffix().first;  // Move to the position just after the current match
    }
    
    return 0;
}

Output:

Found number: 010-12345678
Found number: 13812345678

`std::regex_replace`

The regex_replace function replaces the parts of a string that match the regular expression:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "My email addresses are user@example.com and admin@test.org";
    std::regex pattern("([\\w.]+)@([\\w.]+)");
    
    // Replace each address with [email redacted]
    std::string result = std::regex_replace(text, pattern, "[email redacted]");
    std::cout << result << std::endl;
    
    // Use the capture groups to reformat each address
    std::string formatted = std::regex_replace(text, pattern, "$1 at $2");
    std::cout << formatted << std::endl;
    
    return 0;
}

Output:

My email addresses are [email redacted] and [email redacted]
My email addresses are user at example.com and admin at test.org

Groups and Captures

Parentheses () in a regular expression create capture groups, whose contents can be referenced afterwards:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Birthday: 1990-05-15";
    std::regex pattern("(\\d{4})-(\\d{2})-(\\d{2})");  // Capture the year, month, and day
    
    std::smatch matches;
    if (std::regex_search(text, matches, pattern)) {
        std::cout << "Full date: " << matches[0] << std::endl;
        std::cout << "Year: " << matches[1] << std::endl;
        std::cout << "Month: " << matches[2] << std::endl;
        std::cout << "Day: " << matches[3] << std::endl;
    }
    
    return 0;
}

Output:

Full date: 1990-05-15
Year: 1990
Month: 05
Day: 15

Regular Expression Options

C++ provides several options that control how a regular expression is parsed and matched:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Hello WORLD";
    
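    // Case-sensitive match
    std::regex pattern1("hello", std::regex::ECMAScript);
    std::cout << "Case-sensitive: " << std::regex_search(text, pattern1) << std::endl;
    
    // Case-insensitive match
    std::regex pattern2("hello", std::regex::icase);
    std::cout << "Case-insensitive: " << std::regex_search(text, pattern2) << std::endl;
    
    return 0;
}

Output:

Case-sensitive: 0
Case-insensitive: 1

Commonly used options include:

std::regex::ECMAScript: use the ECMAScript regular expression grammar (the default)
std::regex::icase: ignore case when matching
std::regex::nosubs: do not store sub-expression matches
std::regex::optimize: favor matching speed, possibly at the cost of slower construction
std::regex::collate: make character ranges such as [a-z] locale-sensitive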

Practical Examples

Example 1: Validating an Email Address Format

#include <iostream>
#include <regex>
#include <string>
#include <vector>

bool isValidEmail(const std::string& email) {
    const std::regex pattern("([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})");
    return std::regex_match(email, pattern);
}

int main() {
    std::vector<std::string> emails = {
        "user@example.com",
        "user.name@domain.co.uk",
        "user_name@domain.com",
        "invalid@domain",
        "invalid@.com",
        "@domain.com"
    };
    
    for (const auto& email : emails) {
        std::cout << email << " is " 
                  << (isValidEmail(email) ? "a valid" : "an invalid") 
                  << " email address" << std::endl;
    }
    
    return 0;
}

Output:

user@example.com is a valid email address
user.name@domain.co.uk is a valid email address
user_name@domain.com is a valid email address
invalid@domain is an invalid email address
invalid@.com is an invalid email address
@domain.com is an invalid email address

Example 2: Extracting HTML Tags

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string html = "<html><head><title>Title</title></head>"
                       "<body><h1>Heading 1</h1><p>Paragraph text</p></body></html>";
    
    // \\1 is a backreference: the closing tag must repeat the captured tag name
    std::regex tagPattern("<([a-zA-Z][a-zA-Z0-9]*)>([^<>]*)</\\1>");
    std::smatch matches;
    std::string::const_iterator searchStart(html.cbegin());
    
    while (std::regex_search(searchStart, html.cend(), matches, tagPattern)) {
        std::cout << "Tag: " << matches[1] << ", content: " << matches[2] << std::endl;
        searchStart = matches.suffix().first;
    }
    
    return 0;
}

Output:

Tag: title, content: Title
Tag: h1, content: Heading 1
Tag: p, content: Paragraph text

Example 3: Formatting Dates

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Dates in several formats
    std::vector<std::string> dates = {
        "2023-10-25",
        "10/25/2023",
        "25.10.2023"
    };
    
    // Match each date format and normalize it to YYYY-MM-DD
    std::regex iso8601("(\\d{4})-(\\d{1,2})-(\\d{1,2})");          // YYYY-MM-DD
    std::regex american("(\\d{1,2})/(\\d{1,2})/(\\d{4})");         // MM/DD/YYYY
    std::regex european("(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})");     // DD.MM.YYYY
    
    for (const auto& date : dates) {
        std::string standardized;
        
        if (std::regex_match(date, iso8601)) {
            standardized = date;  // Already in the standard format
        } else if (std::regex_match(date, american)) {
            standardized = std::regex_replace(date, american, "$3-$1-$2");
        } else if (std::regex_match(date, european)) {
            standardized = std::regex_replace(date, european, "$3-$2-$1");
        }
        
        std::cout << "Original date: " << date << ", standardized: " << standardized << std::endl;
    }
    
    return 0;
}

Output:

Original date: 2023-10-25, standardized: 2023-10-25
Original date: 10/25/2023, standardized: 2023-10-25
Original date: 25.10.2023, standardized: 2023-10-25

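Performance Considerations

Regular expressions are powerful, but careless use can cause performance problems:

Avoid pathological backtracking: nested quantifiers can cause exponential backtracking in backtracking implementations. For example, matching (a+)+b against a run of a's that is not followed by b, such as aaaaX, forces the engine to try every way of splitting the run, and the cost grows rapidly as the run gets longer
Precompile regular expressions: for a regular expression that is used frequently, create one static std::regex object instead of recompiling the pattern on every call
Use appropriate options: if capture groups are not needed, consider the std::regex::nosubs option
Prefer plain string functions for simple cases: for simple string operations, functions such as find() and substr() may be more efficient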

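Summary

The regular expression library introduced in C++11 provides a powerful and flexible tool for string processing. In this article we covered:

The basic concepts and syntax of regular expressions
The core components and functions of the C++ regular expression library
How to use std::regex_match, std::regex_search, and std::regex_replace
How to use groups and captures
Practical applications of regular expressions

Regular expressions can make string processing far more efficient to write, but they come with a learning curve. Regular practice and real use are the best way to master them.

Exercises and Further Study

Write a program that checks whether a password meets these rules: at least 8 characters, containing uppercase and lowercase letters, digits, and special characters.
Write a function that extracts every number in a text that has the format of a Chinese mobile number (11 digits starting with 1).
Write a program that replaces every URL in a text with the corresponding HTML hyperlink tag.

Learning Resources

Regular expressions are a skill that takes ongoing practice. Start with simple patterns and increase the complexity step by step, and you will see how useful they are for text processing.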
