🛠️ToolsShed

Word Frequency Counter

Analyze a piece of text and count how many times each word appears, ranked by frequency.

The word frequency counter analyzes a passage of text and tells you how often each word appears, ordered from most to least common. It is a handy tool for writers, editors, students, and data analysts who want to understand a document's vocabulary distribution, spot overused words, or do basic text analysis without specialized software.

Paste in your text, and the tool splits it into individual words, normalizes case ("The", "the", and "THE" count as the same word), and displays a frequency table sorted by count. Common stop words can be filtered out so you can focus on the meaningful content words in your text.

Word frequency analysis has uses well beyond writing: in linguistics it underpins readability scoring; in marketing it reveals the terms customers use most; and in SEO it helps gauge a page's natural keyword density.


Code Implementation

```python
from collections import Counter
import re

STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "in", "on", "at", "to",
    "for", "of", "with", "by", "from", "is", "are", "was", "were",
    "it", "this", "that", "be", "as", "not", "i", "you", "he", "she",
}

def word_frequency(text: str, remove_stop_words: bool = True, top_n: int = 10) -> list[tuple[str, int]]:
    # Lowercase the text, then extract words; internal apostrophes
    # (e.g. "don't") are kept, but stray apostrophes are not
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    # Count occurrences and return the top_n (word, count) pairs,
    # most frequent first
    return Counter(words).most_common(top_n)

text = """To be or not to be, that is the question.
Whether tis nobler in the mind to suffer
the slings and arrows of outrageous fortune."""

for word, count in word_frequency(text):
    print(f"{word:<20} {count}")
```
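The keyword-density idea mentioned above falls out of the same frequency counts: a word's density is simply its share of the total word count. A minimal sketch, using the same tokenization approach (`keyword_density` is a hypothetical helper, not part of the tool):

```python
from collections import Counter
import re

def keyword_density(text: str) -> dict[str, float]:
    # Tokenize: lowercase words, allowing internal apostrophes
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    total = len(words)
    counts = Counter(words)
    # Density = each word's share of total tokens, as a percentage
    return {w: round(100 * c / total, 2) for w, c in counts.items()}

sample = "SEO tools love keywords. Keywords drive SEO."
density = keyword_density(sample)
print(density["keywords"])  # "keywords" accounts for 2 of 7 tokens
```

SEO guidance usually looks at densities after stop words are removed, so in practice you would filter with `STOP_WORDS` first, just as `word_frequency` does.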
