Skip to content

Commit 18e70e4

Browse files
authored
Telemetry: send events in forked children (#5074)
1 parent dcb0b4f commit 18e70e4

File tree

14 files changed

+355
-28
lines changed

14 files changed

+355
-28
lines changed

docs/TelemetryDevelopment.md

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
# Telemetry Development
2+
3+
## Telemetry Presence
4+
5+
`dd-trace-rb` is written to assume that the telemetry component is always
6+
present. If telemetry is disabled, the component is still created but does
7+
nothing.
8+
9+
Most components call methods on `telemetry` unconditionally. There are two
10+
exceptons: DI and Data Streams are written to assume that `telemetry` may be nil.
11+
However, this assumption is not necessary and these components may be
12+
changed in the future to assume that `telemetry` is always present.
13+
14+
## Event Submission Prior To Start
15+
16+
Telemetry is unique among other components in that it permits events to be
17+
submitted to it prior to its worker starting. This is done so that errors
18+
during `Datadog.configure` processing can be reported via telemetry, because
19+
the errors can be produced prior to telemetry worker starting. The telemetry
20+
component keeps the events and sends them after the worker starts.
21+
22+
## Initial Event
23+
24+
`dd-trace-rb` can be initialized multiple times during application boot.
25+
For example, if customers follow our documentation and require
26+
`datadog/auto_instrument`, and call `Datadog.configure`, they would get
27+
`Datadog.configure` invoked two times total (the first time by `auto_instrument`)
28+
and thus telemetry instance would be created twice. This happens in the
29+
applications used with system tests.
30+
31+
System tests, on the other hand, require that there is only one `app-started`
32+
event emitted, because they think the application is launched once.
33+
To deal with this we have a hack in the telemetry code to send an
34+
`app-client-configuration-change` event instead of the second `app-started`
35+
event. This is implemented via the `SynthAppClientConfigurationChange` class.
36+
37+
## Fork Handling
38+
39+
We must send telemetry data from forked children.
40+
41+
Telemetry started out as a diagnostic tool used during application boot,
42+
but is now used for reporting application liveness (and settings/state)
43+
throughout the application lifetime. Live Debugger / Dynamic Instrumentation,
44+
for example, require ongoing `app-heartbeat` events emitted via telemetry
45+
to provide a working UI to customers.
46+
47+
It is somewhat common for customers to preload the application in the parent
48+
web server process and process requests from children. This means telemetry
49+
is initialized from the parent process, and it must emit events in the
50+
forked children.
51+
52+
We use the standard worker `after_fork` handler to recreated the worker
53+
thread in forked children. However, there are two caveats to keep in mind
54+
which are specific to telemetry:
55+
56+
1. Due to telemetry permitting event submission prior to its start, it is
57+
not sufficient to simply reset the state from the worker's `perform` method,
58+
as is done in other components. We must only reset the state when we are
59+
in the forked child, otherwise we'll trash any events submitted to telemetry
60+
prior to its worker starting.
61+
62+
2. The child process is a brand new application as far as the backend/UI is
63+
concerned, having a new runtime ID, and therefore the initial event in the
64+
forked child must always be `app-started`. Since we track the initial event
65+
in the telemetry component, this event must be changed to `app-started` in
66+
forked children regardless of what it was in the parent.

lib/datadog/core/telemetry/component.rb

Lines changed: 19 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -14,14 +14,23 @@
1414
module Datadog
1515
module Core
1616
module Telemetry
17-
# Telemetry entrypoint, coordinates sending telemetry events at various points in app lifecycle.
18-
# Note: Telemetry does not spawn its worker thread in fork processes, thus no telemetry is sent in forked processes.
17+
# Telemetry entry point, coordinates sending telemetry events at
18+
# various points in application lifecycle.
1919
#
2020
# @api private
2121
class Component
2222
ENDPOINT_COLLECTION_MESSAGE_LIMIT = 300
2323

24-
attr_reader :enabled, :logger, :transport, :worker
24+
attr_reader :enabled
25+
attr_reader :logger
26+
attr_reader :transport
27+
attr_reader :worker
28+
attr_reader :settings
29+
attr_reader :agent_settings
30+
31+
# Alias for consistency with other components.
32+
# TODO Remove +enabled+ method
33+
alias_method :enabled?, :enabled
2534

2635
include Core::Utils::Forking
2736
include Telemetry::Logging
@@ -110,7 +119,7 @@ def disable!
110119
end
111120

112121
def start(initial_event_is_change = false, components:)
113-
return if !@enabled
122+
return unless enabled?
114123

115124
initial_event = if initial_event_is_change
116125
Event::SynthAppClientConfigurationChange.new(
@@ -136,19 +145,19 @@ def shutdown!
136145
end
137146

138147
def emit_closing!
139-
return if !@enabled || forked?
148+
return unless enabled?
140149

141150
@worker.enqueue(Event::AppClosing.new)
142151
end
143152

144153
def integrations_change!
145-
return if !@enabled || forked?
154+
return unless enabled?
146155

147156
@worker.enqueue(Event::AppIntegrationsChange.new)
148157
end
149158

150159
def log!(event)
151-
return if !@enabled || forked? || !@log_collection_enabled
160+
return unless enabled? && @log_collection_enabled
152161

153162
@worker.enqueue(event)
154163
end
@@ -159,21 +168,21 @@ def log!(event)
159168
#
160169
# @api private
161170
def flush
162-
return if !@enabled || forked?
171+
return unless enabled?
163172

164173
@worker.flush
165174
end
166175

167176
# Report configuration changes caused by Remote Configuration.
168177
def client_configuration_change!(changes)
169-
return if !@enabled || forked?
178+
return unless enabled?
170179

171180
@worker.enqueue(Event::AppClientConfigurationChange.new(changes, 'remote_config'))
172181
end
173182

174183
# Report application endpoints
175184
def app_endpoints_loaded(endpoints, page_size: ENDPOINT_COLLECTION_MESSAGE_LIMIT)
176-
return if !@enabled || forked?
185+
return unless enabled?
177186

178187
endpoints.each_slice(page_size).with_index do |endpoints_slice, i|
179188
@worker.enqueue(Event::AppEndpointsLoaded.new(endpoints_slice, is_first: i.zero?))

lib/datadog/core/telemetry/event/app_started.rb

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,12 @@ class AppStarted < Base
1111
def initialize(components:)
1212
# To not hold a reference to the component tree, generate
1313
# the event payload here in the constructor.
14+
#
15+
# Important: do not store data that contains (or is derived from)
16+
# the runtime_id or sequence numbers.
17+
# This event is reused when a process forks, but in the
18+
# child process the runtime_id would be different and sequence
19+
# number is reset.
1420
@configuration = configuration(components.settings, components.agent_settings)
1521
@install_signature = install_signature(components.settings)
1622
@products = products(components)
@@ -30,6 +36,15 @@ def payload
3036
}
3137
end
3238

39+
# Whether the event is actually the app-started event.
40+
# For the app-started event we follow up by sending
41+
# app-dependencies-loaded, if the event is
42+
# app-client-configuration-change we don't send
43+
# app-dependencies-loaded.
44+
def app_started?
45+
true
46+
end
47+
3348
private
3449

3550
def products(components)

lib/datadog/core/telemetry/event/synth_app_client_configuration_change.rb

Lines changed: 27 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -28,13 +28,36 @@ module Event
2828
# and app-closing events.
2929
class SynthAppClientConfigurationChange < AppStarted
3030
def type
31-
'app-client-configuration-change'
31+
if reset?
32+
super
33+
else
34+
'app-client-configuration-change'
35+
end
3236
end
3337

3438
def payload
35-
{
36-
configuration: @configuration,
37-
}
39+
if reset?
40+
super
41+
else
42+
{
43+
configuration: @configuration,
44+
}
45+
end
46+
end
47+
48+
def app_started?
49+
reset?
50+
end
51+
52+
# Revert this event to a "regular" AppStarted event.
53+
#
54+
# Used in after_fork to send the AppStarted event in child processes.
55+
def reset!
56+
@reset = true
57+
end
58+
59+
def reset?
60+
!!@reset
3861
end
3962
end
4063
end

lib/datadog/core/telemetry/worker.rb

Lines changed: 51 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,10 @@
99
module Datadog
1010
module Core
1111
module Telemetry
12-
# Accumulates events and sends them to the API at a regular interval, including heartbeat event.
12+
# Accumulates events and sends them to the API at a regular interval,
13+
# including heartbeat event.
14+
#
15+
# @api private
1316
class Worker
1417
include Core::Workers::Queue
1518
include Core::Workers::Polling
@@ -40,11 +43,15 @@ def initialize(
4043
self.enabled = enabled
4144
# Workers::IntervalLoop settings
4245
self.loop_base_interval = metrics_aggregation_interval_seconds
43-
self.fork_policy = Core::Workers::Async::Thread::FORK_POLICY_STOP
46+
self.fork_policy = Core::Workers::Async::Thread::FORK_POLICY_RESTART
4447

4548
@shutdown_timeout = shutdown_timeout
4649
@buffer_size = buffer_size
4750

51+
initialize_state
52+
end
53+
54+
private def initialize_state
4855
self.buffer = buffer_klass.new(@buffer_size)
4956

5057
@initial_event_once = Utils::OnlyOnceSuccessful.new(APP_STARTED_EVENT_RETRIES)
@@ -53,12 +60,13 @@ def initialize(
5360
attr_reader :logger
5461
attr_reader :initial_event_once
5562
attr_reader :initial_event
63+
attr_reader :emitter
5664

5765
# Returns true if worker thread is successfully started,
5866
# false if worker thread was not started but telemetry is enabled,
5967
# nil if telemetry is disabled.
6068
def start(initial_event)
61-
return if !enabled? || forked?
69+
return unless enabled?
6270

6371
@initial_event = initial_event
6472

@@ -79,7 +87,21 @@ def stop(force_stop = false, timeout = @shutdown_timeout)
7987
# for not enqueueing event (presently) is that telemetry is disabled
8088
# altogether, and in this case other methods return nil.
8189
def enqueue(event)
82-
return if !enabled? || forked?
90+
return unless enabled?
91+
92+
# Start the worker if needed, including in forked children.
93+
# Needs to be done before pushing to buffer since perform
94+
# may invoke after_fork handler which resets the buffer.
95+
#
96+
# Telemetry is special in that it permits events to be submitted
97+
# to the worker with the worker not running, and the worker is
98+
# explicitly started later (to maintain proper initialization order).
99+
# Thus here we can't just call perform unconditionally and must
100+
# check if the worker is supposed to be running, and only call
101+
# perform in that case.
102+
if worker && !worker.alive?
103+
perform
104+
end
83105

84106
buffer.push(event)
85107
true
@@ -133,7 +155,7 @@ def flush
133155
private
134156

135157
def perform(*events)
136-
return if !enabled? || forked?
158+
return unless enabled?
137159

138160
if need_initial_event?
139161
started!
@@ -189,7 +211,9 @@ def started!
189211
# dependencies and send the new ones.
190212
# System tests demand only one instance of this event per
191213
# dependency.
192-
send_event(Event::AppDependenciesLoaded.new) if @dependency_collection && initial_event.class.eql?(Telemetry::Event::AppStarted) # standard:disable Style/ClassEqualityComparison:
214+
if @dependency_collection && initial_event.app_started?
215+
send_event(Event::AppDependenciesLoaded.new)
216+
end
193217

194218
true
195219
else
@@ -240,6 +264,27 @@ def disable_on_not_found!(response)
240264
disable!
241265
end
242266

267+
# Stop the worker after fork without sending closing event.
268+
# The closing event will be (or should be) sent by the worker
269+
# in the parent process.
270+
# Also, discard any accumulated events since they will be sent by
271+
# the parent.
272+
def after_fork
273+
# If telemetry is disabled, we still reset the state to avoid
274+
# having wrong state. It is possible that in the future telemetry
275+
# will be re-enabled after errors.
276+
initialize_state
277+
# In the child process, we get a new runtime_id.
278+
# As such we need to send AppStarted event.
279+
# In the parent process, the event may have been the
280+
# SynthAppClientConfigurationChange instead of AppStarted,
281+
# and in that case we need to convert it to the "regular"
282+
# AppStarted event.
283+
if @initial_event.is_a?(Event::SynthAppClientConfigurationChange)
284+
@initial_event.reset! # steep:ignore
285+
end
286+
end
287+
243288
# Deduplicate logs by counting the number of repeated occurrences of the same log
244289
# entry and replacing them with a single entry with the calculated `count` value.
245290
# Non-log events are unchanged.

sig/datadog/core/telemetry/component.rbs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,8 @@ module Datadog
2323
def self.build: (untyped settings, Datadog::Core::Configuration::AgentSettings agent_settings, Datadog::Core::Logger logger) -> Component
2424

2525
def initialize: (logger: Core::Logger, settings: untyped, agent_settings: Datadog::Core::Configuration::AgentSettings, enabled: true | false) -> void
26+
27+
def enabled?: -> bool
2628

2729
def disable!: () -> void
2830

sig/datadog/core/telemetry/event/app_started.rbs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,8 @@ module Datadog
88
def type: () -> "app-started"
99

1010
def payload: () -> { products: untyped, configuration: untyped, install_signature: untyped }
11+
12+
def app_started?: -> bool
1113

1214
private
1315

sig/datadog/core/telemetry/event/synth_app_client_configuration_change.rbs

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,13 @@ module Datadog
33
module Telemetry
44
module Event
55
class SynthAppClientConfigurationChange < AppStarted
6-
def type: () -> "app-client-configuration-change"
6+
def type: -> ("app-client-configuration-change" | "app-started")
77

8-
def payload: () -> { configuration: untyped }
8+
def payload: () -> { ?products: untyped, configuration: untyped, ?install_signature: untyped }
9+
10+
def reset?: -> bool
11+
12+
def reset!: -> void
913
end
1014
end
1115
end

sig/datadog/core/telemetry/worker.rbs

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,10 +22,11 @@ module Datadog
2222
@logger: ::Logger
2323

2424
attr_reader logger: ::Logger
25-
attr_reader initial_event: Telemetry::Event::Base
25+
attr_reader initial_event: Telemetry::Event::AppStarted
2626
attr_reader initial_event_once: Datadog::Core::Utils::OnlyOnceSuccessful
2727

2828
def initialize: (?enabled: bool, heartbeat_interval_seconds: Float, metrics_aggregation_interval_seconds: Float, emitter: Emitter, metrics_manager: MetricsManager, ?shutdown_timeout: Float | Integer, ?buffer_size: Integer, dependency_collection: bool, logger: ::Logger) -> void
29+
def initialize_state: -> void
2930

3031
def start: (Telemetry::Event::Base initial_event) -> void
3132

0 commit comments

Comments
 (0)