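Python Text Processing

Text processing is one of the most common programming tasks. Whether you are analyzing log files, handling user input, or extracting data from web pages, Python offers built-in tools and libraries that simplify the work. This article covers ways to work with text in Python, from basic string operations to regular expressions, file I/O, and a few practical examples.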

String Basics

Python strings are immutable sequences used to store text. Before going deeper into text processing, let's review some basic string operations.
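As a small illustration of what "immutable" means in practice, the sketch below tries to change a character in place and then uses a method that returns a new string instead. The variable names are only illustrative.

```python
text = "Python"

# str objects do not support item assignment, so this raises TypeError.
try:
    text[0] = "J"
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Methods such as replace() build a new string; the original is unchanged.
changed = text.replace("P", "J")
print(text)     # Python
print(changed)  # Jython
```

Because every "modification" creates a new object, string methods can be chained safely without affecting the value they started from.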

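Creating Strings

Strings can be created with single quotes, double quotes, or triple quotes:

# Single quotes
single_quoted = 'Hello, World!'

# Double quotes
double_quoted = "Python Programming"

# Triple quotes (the literal can span multiple lines)
multi_line = """This is the first line
This is the second line
This is the third line"""

print(single_quoted)
print(double_quoted)
print(multi_line)

Output:

Hello, World!
Python Programming
This is the first line
This is the second line
This is the third line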

String Concatenation

There are several ways to concatenate strings:

# Using the + operator
first_name = "John"
last_name = "Doe"
full_name = first_name + " " + last_name
print(full_name)  # Output: John Doe

# Using the join() method
words = ["Python", "is", "awesome"]
sentence = " ".join(words)
print(sentence)  # Output: Python is awesome

# Using f-strings (Python 3.6+)
age = 25
message = f"{first_name} is {age} years old"
print(message)  # Output: John is 25 years old

String Indexing and Slicing

Individual characters and substrings can be accessed by index and by slice:

text = "Python Programming"

# Indexing (starts at 0)
print(text[0])    # Output: P
print(text[7])    # Output: P

# Negative indexing (counts from the end)
print(text[-1])   # Output: g
print(text[-5])   # Output: m

# Slicing [start:stop:step]
print(text[0:6])   # Output: Python
print(text[7:])    # Output: Programming
print(text[:6])    # Output: Python
print(text[::2])   # Output: Pto rgamn
print(text[::-1])  # Output: gnimmargorP nohtyP (reverses the string)

String Methods

Python has a rich set of built-in string methods that make text processing straightforward.

Case Conversion

text = "Python Programming"

print(text.upper())      # Output: PYTHON PROGRAMMING
print(text.lower())      # Output: python programming
print(text.title())      # Output: Python Programming
print(text.capitalize()) # Output: Python programming
print(text.swapcase())   # Output: pYTHON pROGRAMMING

Finding and Replacing

text = "Python is amazing and Python is fun"

# Find a substring
print(text.find("Python"))      # Output: 0 (index of the first occurrence)
print(text.find("Python", 1))   # Output: 22 (search starts at index 1)
print(text.find("Java"))        # Output: -1 (not found)

# Replace a substring
print(text.replace("Python", "JavaScript"))  # Output: JavaScript is amazing and JavaScript is fun
print(text.replace("Python", "Java", 1))     # Output: Java is amazing and Python is fun (only the first occurrence is replaced)

Splitting and Joining

# Split a string
text = "apple,banana,orange,grape"
fruits = text.split(",")
print(fruits)  # Output: ['apple', 'banana', 'orange', 'grape']

# Join strings
joined_text = "-".join(fruits)
print(joined_text)  # Output: apple-banana-orange-grape

# Split into lines
multi_line_text = "Line 1\nLine 2\nLine 3"
lines = multi_line_text.splitlines()
print(lines)  # Output: ['Line 1', 'Line 2', 'Line 3']

Checking String Content

# Check the beginning and end
filename = "document.pdf"
print(filename.startswith("doc"))   # Output: True
print(filename.endswith(".pdf"))    # Output: True

# Check what kind of characters a string contains
text = "Python123"
print(text.isalnum())  # Output: True (only letters and digits)
print(text.isalpha())  # Output: False (not all letters)
print(text.isdigit())  # Output: False (not all digits)

num = "12345"
print(num.isdigit())   # Output: True

# Whitespace check
spaces = "   "
print(spaces.isspace())  # Output: True

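Stripping Whitespace

text = "   Python Programming   "
print(text.strip())      # Output: "Python Programming" (whitespace removed from both ends)
print(text.lstrip())     # Output: "Python Programming   " (whitespace removed from the left)
print(text.rstrip())     # Output: "   Python Programming" (whitespace removed from the right)

# Strip specific characters
text = "###Python ###"
print(text.strip('#'))   # Output: "Python " (only '#' is stripped, so the space before the trailing hashes remains)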
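One detail of strip() that is easy to miss: its argument is treated as a set of characters to remove from both ends, not as a literal prefix or suffix. The sketch below illustrates this with made-up values; removeprefix() is a standard str method, but it requires Python 3.9 or later.

```python
text = "###Python ###"

# Only '#' is stripped, so the space before the trailing hashes survives.
hashes_only = text.strip("#")
# Listing both characters removes the space as well.
hashes_and_spaces = text.strip("# ")
print(repr(hashes_only))        # 'Python '
print(repr(hashes_and_spaces))  # 'Python'

# The argument is a set of characters, not a substring:
# every leading/trailing 'w', '.', 'c', 'o', 'm' is removed.
url = "www.example.com"
stripped_url = url.strip("w.com")
print(stripped_url)             # example

# removeprefix() (Python 3.9+) removes one exact leading substring instead.
without_prefix = url.removeprefix("www.")
print(without_prefix)           # example.com
```

When the goal is to drop a fixed prefix or suffix, removeprefix() and removesuffix() are usually the safer choice.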

Formatting Strings

Python offers several ways to format strings:

Using the format() Method

name = "Alice"
age = 30

# Basic formatting
message = "Hello, {}. You are {} years old.".format(name, age)
print(message)  # Output: Hello, Alice. You are 30 years old.

# Positional arguments
message = "Hello, {1}. You are {0} years old.".format(age, name)
print(message)  # Output: Hello, Alice. You are 30 years old.

# Keyword arguments
message = "Hello, {name}. You are {age} years old.".format(name=name, age=age)
print(message)  # Output: Hello, Alice. You are 30 years old.

# Formatting numbers
pi = 3.14159
formatted = "Pi is approximately {:.2f}".format(pi)
print(formatted)  # Output: Pi is approximately 3.14

Using f-strings (Python 3.6+)

name = "Bob"
age = 25
height = 1.75

# Basic usage
message = f"Hello, {name}. You are {age} years old."
print(message)  # Output: Hello, Bob. You are 25 years old.

# Evaluating expressions
message = f"{name} is {age} years old and {height * 100} cm tall."
print(message)  # Output: Bob is 25 years old and 175.0 cm tall.

# Formatting numbers
message = f"Pi is approximately {3.14159:.2f}"
print(message)  # Output: Pi is approximately 3.14

Regular Expressions

Regular expressions are a powerful tool for working with text. Python supports them through the re module.

Basic Matching

import re

text = "Python was created in 1991 by Guido van Rossum."

# Search for a literal word
result = re.search(r"Python", text)
print(result.group())  # Output: Python

# Use a pattern
# Find a four-digit number (the year)
year = re.search(r"\d{4}", text)
print(year.group())  # Output: 1991

# Find all words
words = re.findall(r"\b\w+\b", text)
print(words)  # Output: ['Python', 'was', 'created', 'in', '1991', 'by', 'Guido', 'van', 'Rossum']

Replacing Text

import re

text = "Contact us at info@example.com or support@example.org"

# Replace every email address (a deliberately loose pattern, not a full email validator)
new_text = re.sub(r"\S+@\S+\.\S+", "[EMAIL PROTECTED]", text)
print(new_text)  # Output: Contact us at [EMAIL PROTECTED] or [EMAIL PROTECTED]

# Extract every email address
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)  # Output: ['info@example.com', 'support@example.org']

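Note

Regular expressions are powerful but complex. Once you are comfortable with the basics, more advanced references can help you build on these skills.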
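One way to keep a more complex pattern readable is to compile it with re.VERBOSE and give its groups names. The sketch below is only an illustration: the date pattern, the DATE_PATTERN name, and the sample sentence are assumptions made for this example, not part of the article's own code.

```python
import re

# Hypothetical pattern for ISO-style dates. With re.VERBOSE, whitespace
# and '#' comments inside the pattern are ignored by the regex engine.
DATE_PATTERN = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

match = DATE_PATTERN.search("Released on 2023-05-15.")
if match:
    print(match.group("year"))  # 2023
    print(match.groupdict())    # {'year': '2023', 'month': '05', 'day': '15'}
```

Named groups also make the calling code less fragile, since it no longer depends on the position of each group.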

Working with Text Files

Reading and writing text files is a central part of text processing, and Python makes both easy.

Reading Text Files

# Read the whole file at once
with open("sample.txt", "r", encoding="utf-8") as file:
    content = file.read()
    print(content)

# Read line by line
with open("sample.txt", "r", encoding="utf-8") as file:
    for line in file:
        print(line.rstrip("\n"))  # rstrip("\n") drops the trailing newline

# Read all lines into a list
with open("sample.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()
    print(lines)

Writing Text Files

# Write text (overwrites any existing file)
with open("output.txt", "w", encoding="utf-8") as file:
    file.write("Hello, World!\n")
    file.write("This is a test file.")

# Append text
with open("output.txt", "a", encoding="utf-8") as file:
    file.write("\nThis line is appended.")

# Write multiple lines
lines = ["Line 1", "Line 2", "Line 3"]
with open("multiline.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(lines))

Warning

Prefer the with statement for file operations. It closes the file when the block exits, even if an exception is raised inside it.
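The behavior described in this warning can be checked directly. The following is a minimal sketch that writes to a temporary file (created with the standard tempfile module so that no existing file is touched) and raises a simulated error inside the with block.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on existing files.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    with open(path, "w", encoding="utf-8") as file:
        file.write("partial data")
        raise ValueError("simulated failure")  # error inside the with block
except ValueError:
    pass

# The with statement closed the file even though the block raised.
print(file.closed)  # True

os.remove(path)  # clean up the temporary file
```

Without the with statement, the same guarantee would need an explicit try/finally around file.close().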

Practical Examples

Let's look at a few practical text processing examples.

Example 1: Log Analysis

Suppose we have a simple log file and need to extract specific information from it:

import re

# Assume log.txt contains lines like this:
# [2023-05-15 14:32:10] INFO: User login successful - username@example.com

# Groups: timestamp, log level, message, email address
log_pattern = r"\[(.*?)\] (\w+): (.*?) - (.*@.*\.\w+)"

users_logged_in = []

with open("log.txt", "r", encoding="utf-8") as log_file:
    for line in log_file:
        match = re.search(log_pattern, line)
        if match and "User login successful" in line:
            timestamp, level, message, email = match.groups()
            users_logged_in.append((timestamp, email))

print(f"Successful logins: {len(users_logged_in)}")
for timestamp, email in users_logged_in:
    print(f"{email} logged in at {timestamp}")
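To see what each group of the pattern captures without needing a log file, the pattern can be run against the sample line from the comment above. The variable name sample_line is introduced here only for illustration.

```python
import re

log_pattern = r"\[(.*?)\] (\w+): (.*?) - (.*@.*\.\w+)"
sample_line = "[2023-05-15 14:32:10] INFO: User login successful - username@example.com"

match = re.search(log_pattern, sample_line)
if match:
    # groups() returns the four captured parts in pattern order.
    timestamp, level, message, email = match.groups()
    print(timestamp)  # 2023-05-15 14:32:10
    print(level)      # INFO
    print(message)    # User login successful
    print(email)      # username@example.com
```

Lines that do not follow this layout simply produce no match, which is why the loop above checks match before unpacking.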

Example 2: Simple Text Statistics

import re

def analyze_text(filename):
    # Return a dict of statistics, or None if the file cannot be read
    try:
        with open(filename, "r", encoding="utf-8") as file:
            text = file.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error analyzing file: {e}")
        return None

    # Basic statistics
    char_count = len(text)
    word_count = len(text.split())
    line_count = len(text.splitlines())

    # Count how often each English word occurs (case-insensitive)
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1

    # Pick the five most frequent words
    top_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:5]

    return {
        "character_count": char_count,
        "word_count": word_count,
        "line_count": line_count,
        "top_words": top_words
    }

# Use the function
stats = analyze_text("article.txt")
if stats is not None:
    print(f"Characters: {stats['character_count']}")
    print(f"Words: {stats['word_count']}")
    print(f"Lines: {stats['line_count']}")
    print("Most frequent words:")
    for word, count in stats['top_words']:
        print(f"  {word}: {count} times")
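The manual frequency dictionary can also be expressed with collections.Counter from the standard library. The sketch below is an alternative under that assumption; the helper name top_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def top_words(text, n=5):
    """Return the n most common alphabetic words in text, ignoring case."""
    words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
    # most_common(n) returns (word, count) pairs, highest count first.
    return Counter(words).most_common(n)

sample = "the cat and the dog and the bird"
result = top_words(sample, 2)
print(result)  # [('the', 3), ('and', 2)]
```

Counter handles the counting and the sorting, so the function body stays short while producing the same kind of (word, count) list.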

Example 3: Simple CSV Data Processing

CSV is a common text data format, and Python's csv module handles it well:

import csv

# Read data from a CSV file
def read_and_process_csv(filename):
    with open(filename, 'r', encoding='utf-8', newline='') as file:
        reader = csv.DictReader(file)

        # Process each row
        total_age = 0
        people_count = 0

        for row in reader:
            print(f"Name: {row['name']}, Age: {row['age']}, City: {row['city']}")
            total_age += int(row['age'])
            people_count += 1

        if people_count > 0:
            average_age = total_age / people_count
            print(f"\nAverage age: {average_age:.1f}")

# Write data to a CSV file
def write_csv_data(filename, data):
    with open(filename, 'w', encoding='utf-8', newline='') as file:
        fieldnames = ['name', 'age', 'city']
        writer = csv.DictWriter(file, fieldnames=fieldnames)

        writer.writeheader()
        for person in data:
            writer.writerow(person)

    print(f"Data written to {filename}")

# Sample data
sample_data = [
    {'name': '张三', 'age': '28', 'city': '北京'},
    {'name': '李四', 'age': '32', 'city': '上海'},
    {'name': '王五', 'age': '25', 'city': '广州'}
]

# Write the sample data, then read it back and process it
write_csv_data('people.csv', sample_data)
read_and_process_csv('people.csv')

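Summary

Python offers a rich set of text processing tools, from simple string operations to regular expression matching to reading and writing files. These skills are valuable for data cleaning, text analysis, and many everyday programming tasks.

In this article we covered:

Basic string operations, including creation, concatenation, indexing, and slicing
Common string methods, such as case conversion, splitting, and joining
String formatting techniques
Regular expressions for more complex text processing
Reading and writing text files
Practical examples: log analysis, text statistics, and CSV processing

As you progress with Python, these text processing skills will become an important part of your toolbox.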

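Exercises

To reinforce what you have learned, try the following exercises:

Write a function that counts how often each word occurs in a text and returns the N most common words.
Create a simple log parser that extracts error messages from a log file.
Write a program that reads a CSV file, filters and sorts its data, and writes the result to a new CSV file.
Implement a simple text replacement tool that finds text matching a given pattern in a file and replaces it.
Create a simple password strength checker that uses regular expressions to check whether a password contains uppercase letters, lowercase letters, digits, and special characters.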

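Additional Resources

If you want to improve your Python text processing skills further, the following resources may help:

The string handling sections of the official Python documentation
The regular expression (re) module in the official Python documentation
Python Cookbook, Chapter 2: Strings and Text
Online regular expression testers, such as regex101.com

Practice is the best way to master text processing. Solving real problems will make these tools familiar and help you apply them effectively.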
