| Documentation | Build Status |
|---|---|
Julia port of SymSpell, extremely fast spelling correction and fuzzy search algorithm.
using SymSpellChecker
d = SymSpell()
push!(d, "hello")
push!(d, "world")
d["wrold"] = ["world"]Dictionaries can be created as follows
using SymSpellChecker
# Loading from file
d = SymSpell("assets/frequency_dictionary_en_30_000.txt")
# Manual update
d = SymSpell()
push!(d, "hello", 100)
push!(d, "world", 50)Third term in push! function is the word frequency, which is used later in lookup to sort results from highest frequency to the lowest.
SymSpell constructor has following arguments
- max_dictionary_edit_distance: maximum allowed search distance. High value of this argument requires lots of memory. Default value is 2.
- prefix_length: prefix length used to generate candidates, higher values corresponds to higher memory requirements, but smaller search times. Default value is 5
- count_threshold: words with frequencies below this threshold wouldn't show in search results.
Words search can be made as follows
lookup(d, "wrold") # [SuggestItem("world", 1, 50)]Here 1 is a Damerau-Levenshtein distance between world and wrold, 50 is a word frequency in current dictionary.
One can extract only words from lookup result
term.(lookup(d, "wrold")) = ["world"]There is more convenient form of lookup exists
d["wrold"] = ["world"]Search arguments can be passed either in lookup function or set globally with the help of set_options!(d::SymSpell; kwargs...) command.
set_options!(d, include_unknown = true, verbosity = "closest")
d["wrold"] = ["wrold", "world"]
# this is equivalent to
term.(lookup(d, include_unknown = true, verbosity = "closest"))Following arguments are supported
- include_unknown: whether include or not original word in results, if it falls under search criteria
- ignore_token: ignore words in lookup that contain token string or regexp.
- transfer_casing: when this option set to
true, results will try to mimic casing of the original word, for exampled["Wrold"] = ["World"] - max_edit_distance: maximum allowed distance for search. By default equals to the
max_dictionary_edit_distance - verbosity: select type of search result. Three levels of verbosity exists
- "top": only single suggestion is returned, with lowest distance and highest frequency
- "closest": all words with lowest distance are returned
- "all": all words within given
max_edit_distanceare returned
The SymSpellChecker.jl package is licensed under the MIT License. This package is based on SymSpell and it's python adaptation. Some parts of the code is based on StringDistances.jl.