@Earlopain
Contributor

Closes #855

Starting from Ruby 2.4, escaped surrogate code points such as "\u{D800}" are a syntax error, but I don't see an easy way of representing such strings. Right now the parser actually crashes (in all versions), so I'd say it's an improvement.

Output of executing puts "\u{D800}" on all Ruby versions:

Output

===================1.8===================
u{D800}
===================1.9===================
���
===================2.0===================
���
===================2.1===================
���
===================2.2===================
���
===================2.3===================
���
===================2.4===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^
===================2.5===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================2.6===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================2.7===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.0===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.1===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.2===================
surrogate.rb: --> surrogate.rb
invalid Unicode codepoint
> 1  puts "\u{D800}"
surrogate.rb:1: invalid Unicode codepoint (SyntaxError)
puts "\u{D800}"
             ^

===================3.3===================
surrogate.rb: 
surrogate.rb:1: invalid Unicode codepoint (SyntaxError)
puts "\u{D800}"
             ^

===================3.4===================
surrogate.rb: --> surrogate.rb

invalid Unicode escape sequence

> 1  puts "\u{D800}"

surrogate.rb:1: syntax error found (SyntaxError)
> 1 | puts "\u{D800}"
    |          ^~~~ invalid Unicode escape sequence
  2 | 
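
The same check can be run from inside an interpreter instead of a separate file. This is a minimal sketch, assuming Ruby 2.4 or newer. The helper name `surrogate_escape_error` is hypothetical and is not part of the parser gem. On the versions that report "invalid Unicode codepoint" above, the first call should return SyntaxError and the second nil.

```ruby
# Hypothetical helper (not part of the parser gem): evaluates Ruby source
# and returns the exception class if it fails to parse, or nil otherwise.
def surrogate_escape_error(source)
  eval(source)
  nil
rescue SyntaxError => e
  # SyntaxError is a ScriptError, so a bare `rescue` would not catch it.
  e.class
end

# Single quotes keep the backslash, so eval sees the literal "\u{D800}".
p surrogate_escape_error('"\u{D800}"') # escaped surrogate
p surrogate_escape_error('"\u{E000}"') # first code point after the surrogate block
```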

I used this script to check that Integer#chr behaves the same on all Ruby versions:

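# Records every code point at which Integer#chr switches between
# succeeding and raising RangeError.
bounds = []
valid1 = true # result for the previous code point
valid2 = true # result for the current code point
(0..(0x110000 - 1)).each do |num|
  begin
    valid1 = valid2
    num.chr(Encoding::UTF_8)
    valid2 = true
  rescue RangeError
    valid2 = false
  ensure
    bounds << num if valid1 != valid2
  end
end
puts bounds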
Output

===================1.8===================
num_char.rb:7: uninitialized constant Encoding (NameError)
        from num_char.rb:4:in `each'
        from num_char.rb:4
===================1.9===================
55296
57344
===================2.0===================
55296
57344
===================2.1===================
55296
57344
===================2.2===================
55296
57344
===================2.3===================
55296
57344
===================2.4===================
55296
57344
===================2.5===================
55296
57344
===================2.6===================
55296
57344
===================2.7===================
55296
57344
===================3.0===================
55296
57344
===================3.1===================
55296
57344
===================3.2===================
55296
57344
===================3.3===================
55296
57344
===================3.4===================
55296
57344
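
The two boundaries printed above are 0xD800 and 0xE000 in hexadecimal, so Integer#chr rejects exactly the surrogate block 0xD800..0xDFFF. Below is a minimal sketch of a guard built on that behaviour. The helper name `codepoint_to_utf8` is hypothetical and is not necessarily how the parser handles these escapes.

```ruby
# Hypothetical helper (an illustration, not the parser gem's code):
# converts a code point to a UTF-8 string and returns nil where
# Integer#chr raises RangeError, as it does for surrogates.
def codepoint_to_utf8(codepoint)
  codepoint.chr(Encoding::UTF_8)
rescue RangeError
  nil
end

# The boundaries from the output above, written in hexadecimal.
first_invalid = 0xD800 # 55296, first surrogate
first_valid   = 0xE000 # 57344, first code point after the surrogate block

p codepoint_to_utf8(first_invalid - 1).nil? # last code point before the block
p codepoint_to_utf8(first_invalid).nil?     # first surrogate
p codepoint_to_utf8(first_valid).nil?       # first code point after the block
```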

@koic koic merged commit e261316 into whitequark:master Mar 31, 2025
9 checks passed
@Earlopain Earlopain deleted the surrogate-pairs branch March 31, 2025 06:25
Development

Successfully merging this pull request may close these issues.

Crashes during escaped Unicode surrogate pairs parsing
