v4 page for spacy.io #13463
Draft: danieldk wants to merge 2 commits into explosion:v4 from danieldk:docs/v4-notes
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,191 @@ | ||
| --- | ||
| title: What's New in v4.0 | ||
| teaser: New features and how to upgrade | ||
| menu: | ||
| - ['New Features', 'features'] | ||
| - ['Upgrading Notes', 'upgrading'] | ||
| --- | ||
|
|
||
| ## New features {id="features",hidden="true"} | ||
|
|
||
| spaCy v4.0 supports more flexible learning rates and adds experimental support | ||
| for model distillation. This release also fixes some long-standing issues that | ||
| require minor API changes. | ||
|
|
||
| spaCy v4.0 drops support for Python 3.7 and 3.8. | ||
|
|
||
| ### Flexible learning rates {id="learn-rate"} | ||
|
|
||
| Thinc 9 adds support for more flexible learning rates that can use the step, | ||
| parameter names, and results from prior evaluations. spaCy v4 makes use of these | ||
| flexible learning rates by passing the aggregate score of the most recent | ||
| evaluation to the learning rate schedule. This makes it possible for schedules | ||
| like [`plateau`](https://thinc.ai/docs/api-schedules#plateau) to adjust the | ||
| learning rate when training is stagnant. | ||
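The idea of a score-aware schedule can be sketched in plain Python. This is a toy illustration, not Thinc's implementation: the class name `PlateauLearningRate` and its parameters are hypothetical, and the sketch assumes the schedule is called once per evaluation with that evaluation's aggregate score.

```python
from typing import Optional


class PlateauLearningRate:
    """Toy schedule: shrink the learning rate when the score stops improving.

    Hypothetical stand-in for illustration only; not Thinc's implementation.
    """

    def __init__(self, base_rate: float, scale: float, max_patience: int) -> None:
        self.base_rate = base_rate
        self.scale = scale
        self.max_patience = max_patience
        self.best_score: Optional[float] = None
        self.patience = 0
        self.num_plateaus = 0

    def __call__(self, last_score: Optional[float]) -> float:
        if last_score is not None:
            if self.best_score is None or last_score > self.best_score:
                # A new best score resets the patience counter.
                self.best_score = last_score
                self.patience = 0
            else:
                self.patience += 1
                if self.patience >= self.max_patience:
                    # Training has stagnated: scale the rate down once more.
                    self.num_plateaus += 1
                    self.patience = 0
        return self.base_rate * self.scale ** self.num_plateaus


schedule = PlateauLearningRate(base_rate=0.001, scale=0.5, max_patience=2)
scores = [0.70, 0.75, 0.75, 0.74, 0.76]  # made-up evaluation scores
rates = [schedule(score) for score in scores]
```

In this sketch the rate is halved after two consecutive evaluations without a new best score, and it stays at the reduced value afterwards.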
|
|
||
| ### Experimental support for model distillation {id="distillation"} | ||
|
|
||
| spaCy v4 lays the groundwork for model distillation. Distillation trains a | ||
| _student_ model on the predictions of a _teacher_ model using an unannotated | ||
| corpus. One of the more exciting applications of distillation is extracting | ||
| small, task-focused models from large, pretrained transformer models. | ||
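As a rough illustration of what "training on the predictions of a teacher" means, the sketch below computes a cross-entropy loss between a teacher's and a student's output distributions. This is a generic formulation and an assumption on our part: the loss that each spaCy component uses is not described here, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float]) -> float:
    """Cross-entropy of the student's distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Made-up logits for a single three-class prediction.
teacher = [2.0, 0.5, -1.0]
close_student = [1.9, 0.6, -1.1]
far_student = [-1.0, 0.5, 2.0]
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

A student whose outputs resemble the teacher's gets a lower loss, so minimizing this quantity pushes the student toward the teacher's predictions without needing gold annotations.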
|
|
||
| Distillation support consists of several parts: | ||
|
|
||
| - [`TrainablePipe`](/api/pipe) now provides a [`distill`](/api/pipe#distill) | ||
| method. This can be used to perform a distillation step, where a student is | ||
| updated to mimic the outputs of the teacher. | ||
| - A configuration section called `distillation` for configuring various | ||
| distillation settings. | ||
| - The distillation loop. | ||
| - The [`distill`](/api/cli#distill) subcommand to run distillation from the | ||
| command-line. | ||
|
|
||
| Most of the trainable pipeline components are updated to support distillation. | ||
|
|
||
| ### Saving activations {id="save-activation"} | ||
|
|
||
| Trainable pipes can now save the pipe's model activations for a document in the | ||
| [`Doc.activations`](/api/doc#attributes) dictionary. You can use this | ||
| functionality to get programmatic access to e.g. the probability distribution of | ||
| a pipe's classifier. | ||
|
|
||
| The following activations are currently available: | ||
|
|
||
| - `EditTreeLemmatizer`: `probabilities` and `tree_ids` | ||
| - `EntityLinker`: `ents` and `scores` | ||
| - `Morphologizer`: `probabilities` and `label_ids` | ||
| - `SentenceRecognizer`: `probabilities` and `label_ids` | ||
| - `SpanCategorizer`: `indices` and `scores` | ||
| - `Tagger`: `probabilities` and `label_ids` | ||
| - `TextCategorizer`: `probabilities` | ||
|
|
||
| > #### Example | ||
| > | ||
| > ```python | ||
| > import spacy | ||
| > nlp = spacy.load("de_core_news_lg") | ||
| > nlp.get_pipe("tagger").save_activations = True | ||
| > doc = nlp("Hallo Welt!") | ||
| > assert "tagger" in doc.activations | ||
| > assert "probabilities" in doc.activations["tagger"] | ||
| > ``` | ||
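Once the activations are saved, a typical next step is to turn the probabilities into labels. The sketch below uses plain lists as a stand-in and assumes, without confirmation from the text above, that `probabilities` holds one row per token with one column per label. The helper `best_labels` and all values are hypothetical.

```python
from typing import List, Sequence


def best_labels(probabilities: Sequence[Sequence[float]], labels: Sequence[str]) -> List[str]:
    """Pick the highest-scoring label for each row (assumed: one row per token)."""
    result = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        result.append(labels[best_index])
    return result


# Made-up values standing in for doc.activations["tagger"]["probabilities"].
toy_probabilities = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
]
toy_labels = ["NOUN", "INTJ", "PUNCT"]  # hypothetical label order
predicted = best_labels(toy_probabilities, toy_labels)
```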
|
|
||
| ### Additional features and improvements {id="additional-features-and-improvements"} | ||
|
|
||
| - The `--code` option that is used by several CLI subcommands now accepts | ||
| multiple files, separated by commas. | ||
| - `spacy download` does not redownload models that are already installed. | ||
| - When modifying a `Span` that was retrieved through a `SpanGroup`, the change | ||
| is now reflected in the `SpanGroup`. | ||
| - Lookups can now be downloaded from a URL using | ||
| `spacy.LookupsDataLoaderFromURL.v1`. | ||
|
|
||
| ## Notes about upgrading from v3.7 {id="upgrading"} | ||
|
|
||
| This release drops support for Python 3.7 and 3.8. Most configuration files from | ||
| spaCy 3.7 can be used with spaCy 4.0 without any modifications (excepting | ||
| configurations that use `EntityLinker.v1`, see below). However, spaCy 4.0 | ||
| introduces some (minor) API changes that are discussed in the remainder of this | ||
| section. | ||
|
|
||
| ### Removal of the `EntityRuler` class | ||
|
|
||
| The `EntityRuler` class is removed. The entity ruler is implemented as a special | ||
| case of the `SpanRuler` component. | ||
|
|
||
| See the [migration guide](/api/entityruler#migrating) for differences between | ||
| the v3 `EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler` | ||
| component. | ||
|
|
||
| ### Renamed language codes: `is` to `isl` and `xx` to `mul` | ||
|
|
||
| The language code for Icelandic has been changed from `is` to `isl` to avoid | ||
| incompatibilities with the Python `is` keyword. The language code for | ||
| multilingual models has been changed from `xx` to `mul`. Existing code that uses | ||
| these language codes should be adjusted accordingly. | ||
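If language codes are stored in your own configuration or metadata, a small mapping can handle the rename. The helper below is a hypothetical sketch and not part of spaCy. It covers only the two renames described above.

```python
# The two renames described in this section; other codes are unchanged.
RENAMED_LANG_CODES = {"is": "isl", "xx": "mul"}


def migrate_lang_code(code: str) -> str:
    """Return the v4 language code for a v3 code, leaving other codes as-is."""
    return RENAMED_LANG_CODES.get(code, code)


v3_codes = ["en", "is", "xx", "de"]
v4_codes = [migrate_lang_code(code) for code in v3_codes]
```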
|
|
||
| ### Removal of the `sentiment` attribute | ||
|
|
||
| The `sentiment` attribute is removed from the `Token`, `Span`, `Doc` and `Lexeme` | ||
| classes. If you used this attribute in a sentiment analysis component, we | ||
| recommend storing the sentiment analysis result in an | ||
| [extension attribute](/usage/processing-pipelines#custom-components-attributes) | ||
| instead. | ||
|
|
||
| ### Removal of `get_candidates_batch` | ||
|
|
||
| Prior to spaCy v4, `get_candidates()` returned an `Iterable` of candidates for a | ||
| specific mention. spaCy 3.5 and later also provided `get_candidates_batch()` for | ||
| looking up multiple mentions: given an `Iterable[Span]` of mentions, it returned | ||
| the candidates for each mention. | ||
|
|
||
| spaCy v4 replaces both functions with a single function, | ||
| [`get_candidates`](/api/entitylinker#config), that does doc-wise batching. Given | ||
| an `Iterator[SpanGroup]`, it returns the candidates for each mention in each | ||
| span group. The batching is by doc, since the [`Span`](/api/span)s in a | ||
| [`SpanGroup`](/api/spangroup) belong to the same [`Doc`](/api/doc). | ||
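The nesting of the doc-wise result can be pictured with a toy example. Plain strings stand in for `Span`, `SpanGroup` and candidate objects here, and the function name, knowledge base and entity IDs are all made up. This sketch only shows the shape of the input and output, not spaCy's actual signature.

```python
from typing import Dict, Iterable, Iterator, List

# Toy knowledge base: mention text -> candidate entity IDs (made-up values).
ToyKB = Dict[str, List[str]]


def get_candidates_docwise(
    kb: ToyKB, mention_groups: Iterable[List[str]]
) -> Iterator[List[List[str]]]:
    """Yield, for each group (one per doc), the candidates of every mention."""
    for group in mention_groups:
        yield [kb.get(mention, []) for mention in group]


kb: ToyKB = {"Paris": ["E1", "E2"], "Berlin": ["E3"]}
groups = [["Paris", "Berlin"], ["Atlantis"]]  # two docs, three mentions
candidates = list(get_candidates_docwise(kb, groups))
```

The result has one entry per doc, each entry has one list per mention, and a mention without candidates yields an empty list.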
|
|
||
| ### Removal of pool argument from `Vocab.get` and `Vocab.get_by_orth` | ||
|
|
||
| The memory pool argument was removed from the `Vocab.get` and | ||
| `Vocab.get_by_orth` Cython cdef methods. These methods can now be called without | ||
| providing the memory pool as an argument. | ||
|
|
||
| ### Optional arguments of `Span.char_span` are now keyword-only | ||
|
|
||
| > #### Example | ||
| > | ||
| > ```python | ||
| > doc = nlp("I like New York") | ||
| > # Permitted in spaCy 3 | ||
| > span = doc[1:4].char_span(5, 13, "GPE", 42) | ||
| > # spaCy 4 | ||
| > span = doc[1:4].char_span(5, 13, label="GPE", kb_id=42) | ||
| > ``` | ||
|
|
||
| The optional arguments for [`Span.char_span`](/api/span#char_span) are now | ||
| keyword-only. Existing code that uses a positional argument to pass an optional | ||
| argument to `char_span` needs to be updated to pass a keyword argument. | ||
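For readers less familiar with keyword-only parameters, the plain-Python sketch below shows the mechanism. The function `char_span_like` is a hypothetical stand-in with a simplified parameter list. It is not spaCy's signature.

```python
def char_span_like(start_idx, end_idx, *, label="", kb_id=0):
    """Hypothetical stand-in: parameters after the bare * are keyword-only."""
    return (start_idx, end_idx, label, kb_id)


# Passing the optional arguments by keyword works.
ok = char_span_like(5, 13, label="GPE", kb_id=42)

# Passing an optional argument positionally raises a TypeError.
try:
    char_span_like(5, 13, "GPE", 42)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Searching your code base for calls that pass more than two positional arguments is a quick way to find call sites that need updating.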
|
|
||
| ### Remove backoff from `Doc.vector` to `Doc.tensor` | ||
|
|
||
| In spaCy v3 and earlier, small (`sm`) pipeline packages supported | ||
| [`Doc.vector`](/api/doc#vector) and [`Token.vector`](/api/token#vector) by | ||
| backing off to context-sensitive tensors from the `tok2vec` component. These | ||
| tensors do not work well for this purpose and this backoff has been removed in | ||
| spaCy v4. | ||
|
|
||
| ### Multiple spans returned as `Tuple[Span]` | ||
|
|
||
| In spaCy v3 some methods that returned multiple `Span` objects would return an | ||
| `Iterator[Span]`, while others would return `Tuple[Span]`. In spaCy v4 such | ||
| methods always return `Tuple[Span]`. | ||
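The practical difference between the two return types can be shown without spaCy. In the sketch below, strings stand in for `Span` objects and both functions are hypothetical.

```python
from typing import Iterator, Tuple


def spans_as_iterator() -> Iterator[str]:
    # v3-style return value of some methods: a one-shot iterator.
    return iter(["New York", "Berlin"])


def spans_as_tuple() -> Tuple[str, ...]:
    # v4-style return value: a tuple that supports len(), indexing and reuse.
    return ("New York", "Berlin")


one_shot = spans_as_iterator()
first_pass = list(one_shot)
second_pass = list(one_shot)  # The iterator is already exhausted.

reusable = spans_as_tuple()
count = len(reusable)
last = reusable[-1]
```

Code that called `next()` directly on a returned iterator needs a small change, since a tuple is not an iterator: wrap the result in `iter(...)` or index into it instead.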
|
|
||
| ### Support for `EntityLinker.v1` is dropped | ||
|
|
||
| Support for `EntityLinker.v1` is dropped; migrate to `EntityLinker.v2`. | ||
|
|
||
| ### `spacy[apple]` removed from extras | ||
|
|
||
| The `thinc-apple-ops` package has been merged into Thinc v9. spaCy v4 always | ||
| uses Apple ops on Macs, so the `apple` extra is not needed anymore. | ||
|
|
||
| ### Pipeline package version compatibility {id="version-compat"} | ||
|
|
||
| spaCy v3.x pipelines are not compatible with spaCy v4.0 and need to be | ||
| retrained. | ||
|
|
||
| ### Updating v3.7 configs | ||
|
|
||
| To update a config from spaCy v3.7 with the new v4.0 settings, run | ||
| [`init fill-config`](/api/cli#init-fill-config): | ||
|
|
||
| ```cli | ||
| $ python -m spacy init fill-config config-v3.7.cfg config-v4.0.cfg | ||
| ``` | ||
|
|
||
| In many cases ([`spacy train`](/api/cli#train), | ||
| [`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in | ||
| automatically, but you'll need to fill in the new settings to run | ||
| [`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data). | ||