• Fuzzy String Matching in Python with FuzzyWuzzy
  • Published 2 months ago
  • 华鑫

FuzzyWuzzy
Fuzzy string matching. It uses Levenshtein Distance to calculate the differences between sequences, wrapped in a simple-to-use package.
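Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of the classic dynamic-programming computation; the function name `levenshtein` is chosen here for illustration and is not part of FuzzyWuzzy's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("this is a test", "this is a test!"))  # 1
```

A small distance relative to the string lengths means the strings are similar, which is the intuition behind the percentage scores used in the rest of this post.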
Requirements
- Python 2.7 or higher
- difflib
- python-Levenshtein (optional; provides a 4-10x speedup in string matching, though it may produce different results in certain cases)


For testing
- pycodestyle
- hypothesis
- pytest


Installation
Using pip via PyPI

pip install fuzzywuzzy

Or use the following command to install python-Levenshtein as well:

pip install fuzzywuzzy[speedup]
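To check from Python whether the optional speedup is available, one option is to look for the `Levenshtein` module, which is the import name that python-Levenshtein provides. This is a minimal sketch, and the helper name `has_levenshtein_speedup` is hypothetical:

```python
import importlib.util


def has_levenshtein_speedup() -> bool:
    """Report whether a module named 'Levenshtein' can be imported."""
    return importlib.util.find_spec("Levenshtein") is not None


if has_levenshtein_speedup():
    print("python-Levenshtein found: the faster matcher can be used")
else:
    print("python-Levenshtein not found: expect the slower difflib path")
```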

Using pip via GitHub

pip install git+https://github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy

Add the following line to your requirements.txt file (then run pip install -r requirements.txt)

git+ssh://git@github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy

Manually via Git

git clone https://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install

Usage

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process

Simple Ratio

>>> fuzz.ratio("this is a test", "this is a test!")
    97
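To see where a score like 97 can come from, here is a minimal sketch that computes a comparable 0-100 score with the standard library's `difflib.SequenceMatcher` (difflib is listed among the requirements above). The name `simple_ratio` is made up for this illustration, and the sketch is not claimed to be the library's exact implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Scale SequenceMatcher's 0..1 similarity to an integer from 0 to 100."""
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 14 matched characters, combined length 29: 2 * 14 / 29 is about 0.966
print(simple_ratio("this is a test", "this is a test!"))  # 97
```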

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100
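The idea behind a partial ratio is to score the shorter string against the best-matching substring of the longer one. The sketch below does this by brute force, sliding a window of the shorter string's length across the longer string. It is a simplified illustration under that assumption; the library may select candidate windows differently, and `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best similarity between the shorter string and any equal-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    # Try every window of len(shorter) characters inside the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The first 14 characters of the longer string match the shorter one exactly.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```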

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
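Token sorting makes the comparison insensitive to word order: each string is split into tokens, the tokens are sorted, and the rejoined strings are compared. A minimal sketch, assuming plain whitespace tokenization and lowercasing (the library's own preprocessing may differ); `token_sort_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after sorting their whitespace-separated tokens."""
    def normalize(s: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin.
        return " ".join(sorted(s.lower().split()))

    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Both inputs normalize to "a bear fuzzy was wuzzy".
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```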

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
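A token set comparison goes one step further and ignores repeated tokens. One common way to do this is to split both strings into token sets, then compare the shared tokens against each string's "shared plus remainder" form and keep the best score. The sketch below follows that idea under simplified assumptions (whitespace tokenization, lowercasing); the names are hypothetical and it is not presented as the library's exact algorithm.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    """Integer 0..100 similarity based on difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare token sets so that duplicated words do not lower the score."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()  # shared tokens + rest of s1
    combined2 = (common + " " + only2).strip()  # shared tokens + rest of s2
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# Both strings reduce to the token set {a, bear, fuzzy, was}.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```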

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
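Conceptually, extracting the best matches means scoring every choice against the query and keeping the highest-scoring entries. The sketch below illustrates that idea with a difflib-based scorer. `extract_sketch` and `simple_score` are hypothetical names, and because the scorer here is an assumption, the numeric scores will generally differ from the library's output shown above.

```python
import heapq
from difflib import SequenceMatcher


def simple_score(query: str, choice: str) -> int:
    """Case-insensitive 0..100 similarity based on difflib."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, scorer=simple_score, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
best = extract_sketch("new york jets", choices, limit=2)
print([name for name, _ in best])  # ['New York Jets', 'New York Giants']
```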

You can also pass additional parameters to the extractOne method to make it use a specific scorer. A typical use case is matching file paths:

>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
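The snippet above does not show how `songs` is defined. To illustrate how a pluggable scorer changes which candidate wins, here is a self-contained sketch. The `songs` list is a hypothetical two-entry stand-in built from the paths in the output above, and `ratio_sketch`, `token_sort_ratio_sketch`, and `extract_one_sketch` are illustrative names rather than library functions. No expected output is given, because the scores depend on these simplified scorers.

```python
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Case-insensitive 0..100 similarity based on difflib."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Same as ratio_sketch, but insensitive to word order."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio_sketch(normalize(a), normalize(b))


def extract_one_sketch(query, choices, scorer=ratio_sketch):
    """Return the single (choice, score) pair with the highest score."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda pair: pair[1])


# Hypothetical library contents, for illustration only.
songs = [
    "/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3",
    "/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3",
]
query = "System of a down - Hypnotize - Heroin"
for scorer in (ratio_sketch, token_sort_ratio_sketch):
    print(scorer.__name__, extract_one_sketch(query, songs, scorer=scorer))
```

Passing the scorer as a parameter keeps the selection logic independent of how similarity is defined, so a different comparison strategy can be tried without touching the search code.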

Known Ports
FuzzyWuzzy has also been ported to other languages. Here are the ports we know of:
Java: xpresso's fuzzywuzzy implementation
Java: fuzzywuzzy (java port)
Rust: fuzzyrusty (Rust port)
JavaScript: fuzzball.js (JavaScript port)
C++: Tmplt/fuzzywuzzy
C#: fuzzysharp (.Net port)
Go: go-fuzzywuzzy (Go port)
