Commit 6c754f4

Expose incremental UTF8 decoding APIs
1 parent 3236d73 commit 6c754f4

2 files changed, 36 additions and 5 deletions

core/src/Streamly/Internal/Unicode/Stream.hs

Lines changed: 29 additions & 5 deletions
@@ -190,8 +190,25 @@ encodeLatin1Lax = encodeLatin1
 -- UTF-8 decoding
 -------------------------------------------------------------------------------
 
--- Int helps in cheaper conversion from Int to Char
+-- | CodePoint represents a specific character in the Unicode standard. The
+-- code point is a numerical value assigned to each character, and UTF-8
+-- encoding uses a variable number of bytes to represent different code points.
+--
+-- Calculate the code point value: Depending on the type of the leading byte,
+-- extract the significant bits from each byte of the sequence and combine them
+-- to form the complete code point value. The specific bit manipulations will
+-- differ based on the number of bytes used.
 type CodePoint = Int
+
+-- | DecodeState refers to the number of bytes remaining to complete the current
+-- UTF-8 character decoding. For ASCII characters (code points 0 to 127), no
+-- decoding state is necessary because they are represented by a single byte.
+-- Therefore, the decoding state for ASCII characters can be considered as 0.
+--
+-- For multi-byte characters, the decoding state indicates the number of bytes
+-- remaining to complete the character. It is usually initialized to a non-zero
+-- value corresponding to the number of bytes in the multi-byte character, e.g
+-- DecodeState will be 1 for 2-bytes char.
 type DecodeState = Word8
 
 -- We can divide the errors in three general categories:
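
The new `CodePoint` comment describes extracting payload bits from the leading byte and folding in each continuation byte. A minimal sketch of that arithmetic follows, using only `base`. The names `BytesRemaining`, `startSequence`, `addContinuation` and `decodeOne` are hypothetical and are not part of Streamly, and the sketch follows the "bytes remaining" wording of the comment; the library's real `DecodeState` values may be encoded differently. It also skips overlong-form, surrogate and range checks.

```haskell
import Control.Monad (foldM)
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- Hypothetical stand-ins for the aliases in the diff.
type CodePoint = Int
type BytesRemaining = Word8

-- | Classify a leading byte. Returns the number of continuation bytes
-- still expected and the payload bits seen so far, or Nothing if the
-- byte cannot start a sequence.
startSequence :: Word8 -> Maybe (BytesRemaining, CodePoint)
startSequence b
    | b < 0x80 = Just (0, fromIntegral b)                       -- 0xxxxxxx
    | b .&. 0xE0 == 0xC0 = Just (1, fromIntegral (b .&. 0x1F))  -- 110xxxxx
    | b .&. 0xF0 == 0xE0 = Just (2, fromIntegral (b .&. 0x0F))  -- 1110xxxx
    | b .&. 0xF8 == 0xF0 = Just (3, fromIntegral (b .&. 0x07))  -- 11110xxx
    | otherwise = Nothing

-- | Fold one continuation byte (10xxxxxx) into the partial code point.
addContinuation :: CodePoint -> Word8 -> Maybe CodePoint
addContinuation cp b
    | b .&. 0xC0 == 0x80 = Just ((cp `shiftL` 6) .|. fromIntegral (b .&. 0x3F))
    | otherwise = Nothing

-- | Decode exactly one complete sequence into its code point.
decodeOne :: [Word8] -> Maybe CodePoint
decodeOne [] = Nothing
decodeOne (b : bs) = do
    (n, cp0) <- startSequence b
    if length bs /= fromIntegral n
        then Nothing
        else foldM addContinuation cp0 bs
```

For the two-byte sequence `0xC3 0xA9`, `startSequence` leaves one byte remaining, which matches the "1 for 2-bytes char" example in the `DecodeState` comment.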
@@ -410,17 +427,24 @@ decodeUtf8EitherD :: Monad m
     => D.Stream m Word8 -> D.Stream m (Either DecodeError Char)
 decodeUtf8EitherD = resumeDecodeUtf8EitherD 0 0
 
--- |
+-- | Decode a bytestream as UTF-8 encoded characters, returning an 'Either'
+-- stream.
+--
+-- This function is similar to 'decodeUtf8', but instead of replacing the
+-- invalid codepoint encountered, it returns a 'Left' 'DecodeError'.
+--
+-- When decoding is successful and a valid character is encountered, the
+-- function returns 'Right Char'.
 --
--- /Pre-release/
 {-# INLINE decodeUtf8Either #-}
 decodeUtf8Either :: Monad m
     => Stream m Word8 -> Stream m (Either DecodeError Char)
 decodeUtf8Either = decodeUtf8EitherD
 
--- |
+-- | Resuming the decoding of a bytestream given a 'DecodeState' and a
+-- 'CodePoint'.
 --
--- /Pre-release/
+-- >>> decodeUtf8Either = resumeDecodeUtf8Either 0 0
 {-# INLINE resumeDecodeUtf8Either #-}
 resumeDecodeUtf8Either
     :: Monad m
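
The doc comment above defines `decodeUtf8Either` as `resumeDecodeUtf8Either 0 0`, that is, resuming from an empty state. To illustrate why a resumable entry point is useful, here is a list-based sketch of carrying a partial sequence across chunk boundaries. It is an assumption-laden model, not the library's implementation: `resumeDecode`, `emit` and `Remaining` are hypothetical names, errors are reported as the offending `Word8` rather than a `DecodeError`, the state is a plain count of bytes remaining, and an invalid byte is dropped instead of being re-examined as a possible leading byte.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

type CodePoint = Int
type Remaining = Word8 -- hypothetical: continuation bytes still needed

-- | Decode one chunk starting from a saved state. Returns the decoded
-- items and the state to pass in when the next chunk arrives.
resumeDecode
    :: Remaining -> CodePoint -> [Word8]
    -> ([Either Word8 Char], (Remaining, CodePoint))
resumeDecode st cp [] = ([], (st, cp))
-- No sequence in progress: the next byte must be ASCII or a leading byte.
resumeDecode 0 _ (b : bs)
    | b < 0x80 = emit (Right (chr (fromIntegral b))) (resumeDecode 0 0 bs)
    | b .&. 0xE0 == 0xC0 = resumeDecode 1 (fromIntegral (b .&. 0x1F)) bs
    | b .&. 0xF0 == 0xE0 = resumeDecode 2 (fromIntegral (b .&. 0x0F)) bs
    | b .&. 0xF8 == 0xF0 = resumeDecode 3 (fromIntegral (b .&. 0x07)) bs
    | otherwise = emit (Left b) (resumeDecode 0 0 bs)
-- Sequence in progress: the next byte must be a continuation byte.
resumeDecode n cp (b : bs)
    | b .&. 0xC0 /= 0x80 = emit (Left b) (resumeDecode 0 0 bs)
    | n == 1 = emit done (resumeDecode 0 0 bs)
    | otherwise = resumeDecode (n - 1) cp' bs
  where
    cp' = (cp `shiftL` 6) .|. fromIntegral (b .&. 0x3F)
    -- Guard against values beyond the Unicode range before calling chr.
    done = if cp' <= 0x10FFFF then Right (chr cp') else Left b

-- | Prepend one decoded item to the result of decoding the rest.
emit :: a -> ([a], s) -> ([a], s)
emit x (xs, s) = (x : xs, s)
```

With this model, a three-byte character split across two chunks is decoded by feeding the state returned for the first chunk into the call for the second chunk, while `resumeDecode 0 0` plays the role of a fresh decode.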

core/src/Streamly/Unicode/Stream.hs

Lines changed: 7 additions & 0 deletions
@@ -81,6 +81,13 @@ module Streamly.Unicode.Stream
     , decodeUtf8'
     , decodeUtf8Chunks
 
+    -- ** Resumable UTF-8 Decoding
+    , DecodeError(..)
+    , DecodeState
+    , CodePoint
+    , decodeUtf8Either
+    , resumeDecodeUtf8Either
+
     -- * Elimination (Encoding)
     , encodeLatin1
     , encodeLatin1'
