Description
TL;DR: Enable "editing" for markdown text - allow annotators to edit content (detoxify, refine, curate). Critical for LLM training data curation workflows.
Is your feature request related to a problem? Please describe.
I'm working on LLM fine-tuning data curation, where domain experts need to edit and label markdown-formatted documents in the same pass. Label Studio currently supports only view-only annotation, but our workflow also requires editing the content itself.
Pain points:
- No content editing capability (can only annotate existing text)
- Users must copy-paste to external editors (VSCode), edit, paste back - breaking the workflow
- No modification tracking
This is critical for data curation tasks where experts need to:
- Edit: Remove toxic/outdated content, adapt for legal compliance
- Label: Classify content (license types, political sensitivity, domain categories)
Describe the solution you'd like
Integrate a markdown editor (Monaco Editor/CodeMirror) with split-pane live preview, so that annotators can edit content while labeling it.
Key features:
- Split-pane interface: Raw markdown editor (left) + live rendered preview (right) with synchronized scrolling
- Change tracking: Track modifications as annotations, view diff, export both original and edited versions
- Export: Both edited markdown content and annotations/labels
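To illustrate the change-tracking idea, here is a minimal sketch using Python's standard `difflib`. It is not how Label Studio stores annotations; the `track_changes` function and the record keys (`original`, `edited`, `diff`, `modified`) are names I made up for illustration.

```python
import difflib


def track_changes(original: str, edited: str) -> dict:
    """Keep both versions and a unified diff between them.

    The record shape is hypothetical, not a Label Studio format.
    """
    diff_lines = difflib.unified_diff(
        original.splitlines(),
        edited.splitlines(),
        fromfile="original.md",
        tofile="edited.md",
        lineterm="",  # inputs have no trailing newlines, so add none to headers
    )
    return {
        "original": original,
        "edited": edited,
        "diff": "\n".join(diff_lines),
        "modified": original != edited,
    }


record = track_changes(
    "# Title\n\nThis is outdated text.",
    "# Title\n\nThis is refined text.",
)
print(record["diff"])
```

Storing the original next to the edited text means the diff can always be regenerated, so the export could carry either form.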
Typical workflow:
1. Import JSON documents (markdown text field)
2. Annotator reviews in split-pane view
3. Edit content (detoxify, remove outdated info, refine)
4. Add labels/tags (e.g., "license-type", "political-content")
5. Export as JSON dataset with refined content + annotations
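The last step of the workflow above could produce a record along these lines. This is only a sketch of the shape I have in mind: `build_export`, `edited_text`, and `labels` are hypothetical names, and only `data`, `text`, and `metadata` come from the import format used in this request.

```python
import json


def build_export(task: dict, edited_text: str, labels: list) -> dict:
    """Combine the imported task, the refined text, and the labels.

    "edited_text" and "labels" are hypothetical field names.
    """
    return {
        "data": task["data"],            # original content, untouched
        "edited_text": edited_text,      # annotator's refined markdown
        "labels": sorted(set(labels)),   # de-duplicated, stable order
    }


task = {
    "data": {
        "text": "# Heading\n\nOld paragraph.",
        "metadata": {"source": "book-v1", "document_id": "doc-123"},
    }
}
exported = build_export(
    task,
    "# Heading\n\nRefined paragraph.",
    ["political-content", "license-type", "license-type"],
)
print(json.dumps(exported, indent=2))
```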
Describe alternatives you've considered
- External editors (current workaround): Copy to VSCode → edit → paste back
  Problem: Breaks workflow, no tracking, error-prone
- Pre-render markdown to HTML: Import pre-rendered HTML
  Problem: No editing, only annotation
Additional context
Use cases:
- Content detoxification (remove toxic/outdated content)
- License classification per paragraph
- Political content identification
- Domain-specific refinement for LLM training data
Technical details:
- Monaco Editor (VSCode's editor) recommended - mature, excellent markdown support
- Document size: ~1000 chars per block (browser-friendly)
- Should integrate with Label Studio's existing labeling XML configuration
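For the block size mentioned above, documents could be pre-split before import. The following is a rough sketch of one way to do that, assuming paragraphs are separated by blank lines. `split_markdown`, `to_tasks`, and the `block_index` metadata field are hypothetical; the `data` / `text` / `metadata` keys match the import format used in this request.

```python
def split_markdown(text: str, max_chars: int = 1000) -> list:
    """Greedily pack blank-line-separated paragraphs into blocks.

    Each block stays within max_chars, except that a single paragraph
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    blocks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            blocks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        blocks.append(current)
    return blocks


def to_tasks(text: str, source: str, document_id: str, max_chars: int = 1000) -> list:
    """Wrap each block as one import task ("block_index" is hypothetical)."""
    return [
        {
            "data": {
                "text": block,
                "metadata": {
                    "source": source,
                    "document_id": document_id,
                    "block_index": i,
                },
            }
        }
        for i, block in enumerate(split_markdown(text, max_chars))
    ]


doc = "\n\n".join(["x" * 400] * 5)
tasks = to_tasks(doc, "book-v1", "doc-123")
print(len(tasks), [len(t["data"]["text"]) for t in tasks])
```

Splitting on paragraph boundaries keeps headings and lists intact within a block, which matters if each block is rendered on its own in the preview pane.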
Sample data format:
{
  "data": {
    "text": "# Heading\n\nParagraph with **bold** text...",
    "metadata": {"source": "book-v1", "document_id": "doc-123"}
  }
}
Why this matters:
- Extends Label Studio's paradigm: From "annotate existing content" to "curate and annotate"
- Markdown is the de facto format for LLM training data
- Data curation is a critical bottleneck in LLM fine-tuning
- Unifies editing and annotation workflows in one tool
Visual mockup:
┌──────────────────────────┬────────────────────────────┐
│ Editor (Raw Markdown)    │ Preview (Rendered)         │
├──────────────────────────┼────────────────────────────┤
│ # Title                  │ Title                      │
│ ## Section 1             │ ──────                     │
│ This is **important**    │ Section 1                  │
│ - Item 1                 │ This is important          │
│                          │ • Item 1                   │
├──────────────────────────┴────────────────────────────┤
│ Labels: [Political] [Legal-Review] [License: CC-BY]   │
└───────────────────────────────────────────────────────┘
I'm happy to contribute or provide more details about this workflow!