Text Cleaner
Remove extra spaces, blank lines, special characters, and HTML tags from text.
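Text Cleaner strips unwanted formatting, extra whitespace, and common debris from text in one click. Text copied from PDFs, Word documents, web pages, or emails often contains stray line breaks, runs of consecutive spaces, invisible Unicode characters, curly quotes that break code, or HTML entities such as &amp; and &nbsp;.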
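Select the cleanup operations you need and click Clean. Available operations: remove extra whitespace and blank lines, trim leading and trailing spaces, collapse multiple spaces into one, convert curly quotes to straight quotes, remove HTML tags, decode HTML entities, remove non-printable characters, and convert Windows line endings (\r\n) to Unix line endings (\n).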
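Three of the operations listed above (straightening curly quotes, removing HTML tags, and decoding HTML entities) can be sketched with the Python standard library. This is a minimal sketch, not the tool's actual implementation: the names `straighten_quotes`, `strip_html_tags`, and `decode_entities` are hypothetical, and the regex-based tag removal assumes simple markup rather than arbitrary HTML.

```python
import html
import re

# Hypothetical mapping from common curly quotes to straight ASCII quotes.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Convert curly quotes to straight quotes."""
    return text.translate(_QUOTE_MAP)


def strip_html_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag (assumes simple markup)."""
    return re.sub(r"<[^>]+>", "", text)


def decode_entities(text: str) -> str:
    """Decode HTML entities such as &amp; into the characters they stand for."""
    return html.unescape(text)


snippet = "<p>\u201cFish &amp; chips\u201d</p>"
cleaned = straighten_quotes(decode_entities(strip_html_tags(snippet)))
print(cleaned)
```

In this sketch the tags are removed before the entities are decoded, so text that arrives escaped (for example `&lt;b&gt;`) is not mistaken for a real tag and deleted.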
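Text cleaning is a common first step in data processing pipelines. When text is loaded into a database, a machine learning model, or an API, unexpected whitespace and special characters are a frequent source of parsing errors. Cleaning the text up front prevents these downstream problems.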
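As an illustration of such a parsing error, the sketch below shows an invisible zero-width space (U+200B) making a numeric string unparseable until it is removed. The helper name `strip_invisible` and the sample value are assumptions made for this example only.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), such as zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# "42" followed by a zero-width space, as might be copied from a web page.
raw = "42\u200b"

try:
    int(raw)
    parsed_raw = True
except ValueError:
    # int() tolerates surrounding whitespace, but U+200B is not whitespace.
    parsed_raw = False

value = int(strip_invisible(raw))
print(parsed_raw, value)
```

The raw string looks identical to "42" on screen, which is why this kind of failure is hard to spot without a cleaning step.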
Code Implementation
import re
import unicodedata


def remove_control_chars(text: str) -> str:
    """Remove non-printable control characters (keep tab, newline, carriage return)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cc", "Cf") or ch in ("\t", "\n", "\r")
    )


def normalize_line_endings(text: str, style: str = "lf") -> str:
    """Normalize line endings to LF (Unix) or CRLF (Windows)."""
    # Convert everything to LF first so CRLF and bare CR are handled uniformly.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if style == "crlf":
        text = text.replace("\n", "\r\n")
    return text


def collapse_whitespace(text: str) -> str:
    """Replace multiple consecutive spaces/tabs on each line with a single space."""
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in text.splitlines())


def trim_lines(text: str) -> str:
    """Strip leading and trailing whitespace from each line."""
    return "\n".join(line.strip() for line in text.splitlines())


def remove_blank_lines(text: str) -> str:
    """Collapse multiple consecutive blank lines into one."""
    return re.sub(r"\n{3,}", "\n\n", text)


def clean_text(text: str,
               control_chars: bool = True,
               normalize_endings: bool = True,
               collapse_spaces: bool = True,
               trim: bool = True,
               blank_lines: bool = True) -> str:
    """Run all cleaning steps in sequence."""
    if control_chars:
        text = remove_control_chars(text)
    if normalize_endings:
        text = normalize_line_endings(text)
    if collapse_spaces:
        text = collapse_whitespace(text)
    if trim:
        text = trim_lines(text)
    if blank_lines:
        text = remove_blank_lines(text)
    return text


sample = " Hello\t world! \n\n\nExtra blank lines \n\x00Null byte here "
print(repr(clean_text(sample)))
# 'Hello world!\n\nExtra blank lines\nNull byte here'