Releases: finalfusion/finalfusion-python
Finalfusion in Python
This release marks a major change to finalfusion-python: the entire package has been rewritten in Python and is no longer a wrapper around finalfusion-rust.
The API is now almost on par with finalfusion-rust and in some places even goes beyond that.
- `Vocab`, `Storage`, `Metadata` and `Norms` are now accessible as properties on `Embeddings`
- Any of the chunks above can be loaded by themselves from a finalfusion file
- All chunks can be constructed from within Python
- It's possible to add, remove or change embeddings
- `Storage` types integrate directly with `numpy` arrays
- Reading and writing to all common embedding formats (word2vec, GloVe, fastText) is supported
- The API for vocabularies and subword indexers has been made more ergonomic:
  - vocab words and the word -> index mapping are accessible as properties
  - `SubwordVocab`s expose the subword indexer through `vocab.subword_indexer`
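
To make the role of a subword indexer concrete, here is a minimal stdlib-only sketch of bucket indexing. It is a toy illustration, not finalfusion's implementation: the helper names (`ngrams`, `fnv1a_32`, `bucket_indices`), the n-gram range of 3 to 6, and the choice of FNV-1a as the hash are all assumptions made for this example.

```python
# Toy bucket subword indexer. Names, n-gram range, and hash function are
# assumptions for illustration and may differ from what finalfusion does.

def ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of the bracketed word, shortest first."""
    bracketed = f"<{word}>"
    return [
        bracketed[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(bracketed) - n + 1)
    ]


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes (stand-in hash function)."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def bucket_indices(word, vocab_size, n_buckets):
    """Pair each n-gram with a row index placed after the known-word rows."""
    return [
        (ngram, vocab_size + fnv1a_32(ngram) % n_buckets)
        for ngram in ngrams(word)
    ]


pairs = bucket_indices("hi", vocab_size=1000, n_buckets=1024)
for ngram, index in pairs:
    print(ngram, index)
```

Different n-grams may hash to the same bucket and then share one embedding row. An explicit subword vocabulary avoids such collisions by giving every n-gram its own index.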
In addition to the overhauled API, finalfusion-python now comes with executables:
- `ffp-convert` to convert between embedding formats
- `ffp-similar` and `ffp-analogy` for similarity and analogy queries
- `ffp-bucket-to-explicit` to convert from bucket subword to explicit subword embeddings
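
As a rough picture of what similarity and analogy queries compute, the following stdlib-only sketch ranks words by cosine similarity. It does not show the command-line interface of the executables; the vectors are made up, and the helper names (`cosine`, `nearest`, `similar`, `analogy`) are hypothetical.

```python
import math

# Made-up 2-dimensional vectors (axes: "royalty", "maleness").
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.2, 1.0],
    "woman": [0.2, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, k, skip=()):
    """Return the k words whose vectors are most similar to the query."""
    scored = [
        (cosine(query, vec), word)
        for word, vec in VECTORS.items()
        if word not in skip
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


def similar(word, k=1):
    """Similarity query: neighbours of a word, excluding the word itself."""
    return nearest(VECTORS[word], k, skip={word})


def analogy(a, b, c, k=1):
    """Analogy query 'a is to b as c is to ?', using the offset b - a + c."""
    query = [
        vb - va + vc
        for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    ]
    return nearest(query, k, skip={a, b, c})


print(similar("king"))
print(analogy("man", "king", "woman"))
```

Excluding the query words from the results is a common convention for such queries; whether and how the executables do so is not covered here.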
Check out the documentation at https://finalfusion-python.readthedocs.io for more information!
0.6.2
0.6.1
0.6.0
Support for fastText, word2vec, and text embeddings
The largest change in this release is support for reading fastText, word2vec, and text embeddings, in addition to finalfusion embeddings.
- Add support for reading fastText (`Embeddings.read_fasttext()`), text (`Embeddings.read_text()`), textdims (`Embeddings.read_text_dims()`), and word2vec (`Embeddings.read_word2vec()`) formats.
- Each of these newly-supported formats provides a keyword argument `lossy`. If set, the embeddings will be read lossily, permitting invalid UTF-8 in words.
- Add the `embedding_similarity` method, which looks up words that are similar to a given embedding. The method for traditional word-based lookups has been renamed from `similarity` to `word_similarity`.
- Iteration over embeddings returned tuples `(word, embedding)` in previous releases. Now instances of the `Embedding` class are returned, which provide `word`, `embedding`, and `norm` properties. `norm` is the l2 norm of the embedding before it was normalized.
- Add support for memory mapping quantized embedding matrices.
- Add the `ngram_indices` and `subword_indices` methods to the `Vocab` class. These methods return the subword indices for a given word, which can be used to retrieve the subword embeddings individually. The `ngram_indices` method returns each subword with its index, whereas `subword_indices` only returns the indices.
- Update to pyo3 0.8.
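
The difference between strict and lossy reading can be sketched with the standard library alone. The example below is not finalfusion code: `parse_text_embeddings` and its parameters are hypothetical, and it assumes that lossy reading replaces invalid UTF-8 bytes with U+FFFD, as Rust's `String::from_utf8_lossy` does. It also shows the one difference between the text and textdims formats, namely a leading line with the matrix shape.

```python
# Toy parser for text embeddings; an illustration under the assumptions
# stated above, not the reader used by finalfusion.

def parse_text_embeddings(raw, has_dims_header=False, lossy=False):
    """Parse `word v1 v2 ...` lines from raw bytes into a dict."""
    errors = "replace" if lossy else "strict"
    text = raw.decode("utf-8", errors=errors)
    lines = [line for line in text.split("\n") if line.strip()]

    expected = None
    if has_dims_header:
        # textdims-style input: the first line holds "n_words dims".
        n_words, dims = (int(field) for field in lines[0].split())
        expected = (n_words, dims)
        lines = lines[1:]

    embeddings = {}
    for line in lines:
        word, *values = line.split()
        embeddings[word] = [float(value) for value in values]

    if expected is not None:
        shape_ok = len(embeddings) == expected[0] and all(
            len(vec) == expected[1] for vec in embeddings.values()
        )
        if not shape_ok:
            raise ValueError("embeddings do not match the header shape")
    return embeddings


# "caf\xe9" is Latin-1 encoded, so the byte 0xE9 is invalid UTF-8 here.
RAW = b"2 2\ncaf\xe9 1.0 0.0\ntea 0.0 1.0\n"

lossy_result = parse_text_embeddings(RAW, has_dims_header=True, lossy=True)
print(sorted(lossy_result))
```

With `lossy=False` the same input raises `UnicodeDecodeError`, which is the trade-off the flag controls: strict reading rejects the file, while lossy reading keeps every row at the cost of altered words.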
travis-0.5.0-rebuild
CI: Fix crate name in Travis-CI builds
0.4.0
0.3.1
New convenience methods
This release has the following changes:
- Add the `matrix_copy` method to get a numpy array copy of the embedding matrix.
- Add the `vocab` method to get a `Vocab` instance, which provides the `item_to_indices` method to get the indices or subword indices of a word. `Vocab` also provides indexing to look up the word corresponding to an index (e.g. `vocab[3823]`).
- Upgrade to finalfusion 0.6.
Switch to numpy arrays
- Return `numpy` arrays rather than Python lists.
- Update to `pyo3` 0.6.
- Switch from `rust2vec` to the `finalfusion` crate.