<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.1">Jekyll</generator><link href="https://comandeo.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://comandeo.dev/" rel="alternate" type="text/html" /><updated>2023-01-11T15:22:00+01:00</updated><id>https://comandeo.dev/feed.xml</id><title type="html">Dmitry’s blog</title><subtitle>I am writing mostly about Ruby, Mongoid, and MongoDB.</subtitle><entry><title type="html">Do Not Use Mutexes in Finalizers</title><link href="https://comandeo.dev/2023/01/01/mutexes-in-finalizers.html" rel="alternate" type="text/html" title="Do Not Use Mutexes in Finalizers" /><published>2023-01-01T01:00:00+01:00</published><updated>2023-01-01T01:00:00+01:00</updated><id>https://comandeo.dev/2023/01/01/mutexes-in-finalizers</id><content type="html" xml:base="https://comandeo.dev/2023/01/01/mutexes-in-finalizers.html"><![CDATA[<p>Ruby allows a developer to specify a <em>finalizer</em> proc for an object. This proc is called after the object has been destroyed. This is a very useful mechanism for cleaning up once the object is gone. However, it turns out that there are limitations on what you can do inside finalizers, and these limitations are the same as those for a signal trap. So, if you write a finalizer,
you should follow <a href="https://github.com/ruby/ruby/blob/master/doc/signals.rdoc">the documentation for signal traps</a>.</p>

<p>Some time ago a user opened <a href="https://jira.mongodb.org/browse/RUBY-2869">an issue</a> in our bug tracker. In his logs he noticed an exception raised by the MongoDB Ruby driver:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`synchronize': can't be called from trap context (ThreadError)
</code></pre></div></div>

<p>From the logs, we could see that the exception was raised when calling synchronize on a mutex inside the finalizer. However, the exception says that synchronize can’t be called from a “trap context”. What is that, and how is it related to our finalizers?</p>
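
<p>For reference, the restriction itself is easy to trigger in a real signal handler. The snippet below is a minimal sketch, not code from the driver: it assumes MRI on a Unix-like system where the <code class="language-plaintext highlighter-rouge">USR1</code> signal is available, and the variable names are purely illustrative.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mutex = Mutex.new
trap_error = nil

# Install a handler that tries to take the mutex (assumption: USR1 exists on this platform).
previous_handler = Signal.trap('USR1') do
  begin
    mutex.synchronize { }
  rescue ThreadError =&gt; e
    trap_error = e
  end
end

# Send the signal to ourselves and wait briefly for the handler to run.
Process.kill('USR1', Process.pid)
40.times do
  break if trap_error
  sleep(0.05)
end

# Restore whatever handler was installed before.
Signal.trap('USR1', previous_handler)

puts "#{trap_error.class}: #{trap_error.message}"
# =&gt; ThreadError: can't be called from trap context
</code></pre></div></div>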

<p>A finalizer is a proc that is called after a specific object has been destroyed by garbage collection. In the MongoDB Ruby driver, we use finalizers to close unused cursors. A cursor is returned in response to a query and can be iterated to retrieve results. Cursors are a very convenient mechanism; however, they are server-side objects, and every cursor consumes server memory. Therefore, it is a good idea to let the server know when a cursor is no longer used so that its resources can be released. If the object that represents a cursor is destroyed, the cursor is definitely unused and can be closed.</p>

<p>Below is a very simplified example of how this can be done:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Cursor</span>
  <span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">finalize</span><span class="p">(</span><span class="n">cursor_id</span><span class="p">,</span> <span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">)</span>
    <span class="nb">proc</span> <span class="k">do</span>
      <span class="nb">puts</span> <span class="s2">"Killing cursor </span><span class="si">#{</span><span class="n">cursor_id</span><span class="si">}</span><span class="s2"> on </span><span class="si">#{</span><span class="n">database</span><span class="si">}</span><span class="s2">.</span><span class="si">#{</span><span class="n">collection</span><span class="si">}</span><span class="s2">"</span>
      <span class="c1"># Execute command to close cursor</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">)</span>
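    <span class="c1"># Initialize the cursor (a random id stands in for a server-assigned one)</span>
    <span class="vi">@id</span> <span class="o">=</span> <span class="nb">rand</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
    <span class="vi">@database</span> <span class="o">=</span> <span class="n">database</span>
    <span class="vi">@collection</span> <span class="o">=</span> <span class="n">collection</span>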
    <span class="no">ObjectSpace</span><span class="p">.</span><span class="nf">define_finalizer</span><span class="p">(</span>
      <span class="nb">self</span><span class="p">,</span>
      <span class="nb">self</span><span class="p">.</span><span class="nf">class</span><span class="p">.</span><span class="nf">finalize</span><span class="p">(</span><span class="vi">@id</span><span class="p">,</span> <span class="vi">@database</span><span class="p">,</span> <span class="vi">@collection</span><span class="p">)</span>
    <span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>We can ask Ruby to do the garbage collection by calling GC.start, so we can test the code.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">5</span><span class="p">.</span><span class="nf">times</span> <span class="p">{</span> <span class="no">Cursor</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s1">'database'</span><span class="p">,</span> <span class="s1">'collection'</span><span class="p">)</span> <span class="p">}</span>
<span class="no">GC</span><span class="p">.</span><span class="nf">start</span>

<span class="c1"># =&gt; Killing cursor 258 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 938 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 791 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 705 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 114 on database.collection</span>
</code></pre></div></div>

<p>So far so good. Of course, this solution is far from ideal. It sends a command to the server every time a finalizer is called, which blocks the main thread and issues one command per cursor. We can reduce the number of commands by killing all cursors for a collection in a single command. So, we came up with the idea of a cursor reaper, a background thread that wakes up from time to time and kills unused cursors:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CursorReaper</span>
  <span class="no">Task</span> <span class="o">=</span> <span class="no">Struct</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">:id</span><span class="p">,</span> <span class="ss">:database</span><span class="p">,</span> <span class="ss">:collection</span><span class="p">)</span>

  <span class="k">def</span> <span class="nf">initialize</span>
    <span class="vi">@mutex</span> <span class="o">=</span> <span class="no">Mutex</span><span class="p">.</span><span class="nf">new</span>
    <span class="vi">@tasks</span> <span class="o">=</span> <span class="p">[]</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">schedule</span><span class="p">(</span><span class="nb">id</span><span class="p">,</span> <span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">)</span>
    <span class="vi">@mutex</span><span class="p">.</span><span class="nf">synchronize</span> <span class="k">do</span>
      <span class="vi">@tasks</span> <span class="o">&lt;&lt;</span> <span class="no">Task</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="nb">id</span><span class="p">,</span> <span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">kill_cursors</span>
    <span class="vi">@mutex</span><span class="p">.</span><span class="nf">synchronize</span> <span class="k">do</span>
      <span class="k">while</span> <span class="n">task</span> <span class="o">=</span> <span class="vi">@tasks</span><span class="p">.</span><span class="nf">pop</span>
        <span class="nb">puts</span> <span class="s2">"Killing cursor </span><span class="si">#{</span><span class="n">task</span><span class="p">.</span><span class="nf">id</span><span class="si">}</span><span class="s2"> on </span><span class="si">#{</span><span class="n">task</span><span class="p">.</span><span class="nf">database</span><span class="si">}</span><span class="s2">.</span><span class="si">#{</span><span class="n">task</span><span class="p">.</span><span class="nf">collection</span><span class="si">}</span><span class="s2">"</span>
        <span class="c1"># Group cursors per collection</span>
      <span class="k">end</span>
    <span class="k">end</span>
    <span class="c1"># Execute commands to close cursors</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">Cursor</span>
  <span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">finalize</span><span class="p">(</span><span class="nb">id</span><span class="p">,</span> <span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">,</span> <span class="n">reaper</span><span class="p">)</span>
    <span class="nb">proc</span> <span class="k">do</span>
      <span class="n">reaper</span><span class="p">.</span><span class="nf">schedule</span><span class="p">(</span><span class="nb">id</span><span class="p">,</span> <span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">database</span><span class="p">,</span> <span class="n">collection</span><span class="p">,</span> <span class="n">reaper</span><span class="p">)</span>
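    <span class="c1"># Initialize the cursor (a random id stands in for a server-assigned one)</span>
    <span class="vi">@id</span> <span class="o">=</span> <span class="nb">rand</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
    <span class="vi">@database</span> <span class="o">=</span> <span class="n">database</span>
    <span class="vi">@collection</span> <span class="o">=</span> <span class="n">collection</span>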
    <span class="no">ObjectSpace</span><span class="p">.</span><span class="nf">define_finalizer</span><span class="p">(</span>
      <span class="nb">self</span><span class="p">,</span>
      <span class="nb">self</span><span class="p">.</span><span class="nf">class</span><span class="p">.</span><span class="nf">finalize</span><span class="p">(</span><span class="vi">@id</span><span class="p">,</span> <span class="vi">@database</span><span class="p">,</span> <span class="vi">@collection</span><span class="p">,</span> <span class="n">reaper</span><span class="p">)</span>
    <span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
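
<p>The “group cursors per collection” step is left as a comment in <code class="language-plaintext highlighter-rouge">kill_cursors</code>. One possible way to do it is sketched below. The <code class="language-plaintext highlighter-rouge">group_cursor_ids</code> helper and the standalone <code class="language-plaintext highlighter-rouge">Task</code> struct are hypothetical names used only for this illustration; they are not part of the driver.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Standalone copy of the Task struct, so that the sketch runs on its own.
Task = Struct.new(:id, :database, :collection)

# Hypothetical helper: maps [database, collection] to the list of cursor ids,
# so that one kill command can be issued per collection.
def group_cursor_ids(tasks)
  tasks
    .group_by { |task| [task.database, task.collection] }
    .transform_values { |group| group.map(&amp;:id) }
end

tasks = [
  Task.new(1, 'database', 'users'),
  Task.new(2, 'database', 'orders'),
  Task.new(3, 'database', 'users')
]

group_cursor_ids(tasks).each do |(database, collection), ids|
  puts "Killing cursors #{ids.join(', ')} on #{database}.#{collection}"
end

# =&gt; Killing cursors 1, 3 on database.users
# =&gt; Killing cursors 2 on database.orders
</code></pre></div></div>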

<p>Note that there is a mutex in the CursorReaper class. The kill_cursors method of the reaper will be called in a background thread, hence the locking. Let’s test it:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reaper</span> <span class="o">=</span> <span class="no">CursorReaper</span><span class="p">.</span><span class="nf">new</span>
<span class="n">reaper_thread</span> <span class="o">=</span> <span class="no">Thread</span><span class="p">.</span><span class="nf">new</span> <span class="k">do</span>
  <span class="kp">loop</span> <span class="k">do</span>
    <span class="nb">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">reaper</span><span class="p">.</span><span class="nf">kill_cursors</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="mi">5</span><span class="p">.</span><span class="nf">times</span> <span class="p">{</span> <span class="no">Cursor</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s1">'database'</span><span class="p">,</span> <span class="s1">'collection'</span><span class="p">,</span> <span class="n">reaper</span><span class="p">)</span> <span class="p">}</span>
<span class="no">GC</span><span class="p">.</span><span class="nf">start</span>
<span class="n">reaper_thread</span><span class="p">.</span><span class="nf">join</span>

<span class="c1"># =&gt; Killing cursor 205 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 847 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 284 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 609 on database.collection</span>
<span class="c1"># =&gt; Killing cursor 485 on database.collection</span>
</code></pre></div></div>

<p>Still, no error, even though the latter example calls synchronize inside the finalizer. What is the difference between the example and the real-world situation? In the example, we trigger garbage collection manually. Normally this is triggered by Ruby itself. What if we create so many objects that Ruby actually starts the GC?</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reaper</span> <span class="o">=</span> <span class="no">CursorReaper</span><span class="p">.</span><span class="nf">new</span>
<span class="n">reaper_thread</span> <span class="o">=</span> <span class="no">Thread</span><span class="p">.</span><span class="nf">new</span> <span class="k">do</span>
  <span class="kp">loop</span> <span class="k">do</span>
    <span class="nb">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">reaper</span><span class="p">.</span><span class="nf">kill_cursors</span>
  <span class="k">end</span>
<span class="k">end</span>
<span class="n">populator_thread</span> <span class="o">=</span> <span class="no">Thread</span><span class="p">.</span><span class="nf">new</span> <span class="k">do</span>
  <span class="kp">loop</span> <span class="k">do</span>
    <span class="mi">5000</span><span class="p">.</span><span class="nf">times</span> <span class="p">{</span> <span class="no">Cursor</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s1">'database'</span><span class="p">,</span> <span class="s1">'collection'</span><span class="p">,</span> <span class="n">reaper</span><span class="p">)</span> <span class="p">}</span>
    <span class="nb">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
<span class="p">[</span><span class="n">reaper_thread</span><span class="p">,</span> <span class="n">populator_thread</span><span class="p">].</span><span class="nf">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:join</span><span class="p">)</span>
</code></pre></div></div>

<p>Yes, this code actually reproduces the problem, and the exception is raised! So, it looks like finalizers are executed inside a signal trap. Therefore, to fix the problem we should just follow <a href="https://github.com/ruby/ruby/blob/master/doc/signals.rdoc">the documentation</a> and avoid operations that are not allowed inside traps. In our case with the cursor reaper, we got rid of mutexes in finalizers by using a queue data structure, and the bug was fixed.</p>
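
<p>To give an idea of what such a change can look like, here is a minimal sketch of a reaper built on Ruby’s <code class="language-plaintext highlighter-rouge">Queue</code>. It is an illustration under my own assumptions, not the actual driver code: the <code class="language-plaintext highlighter-rouge">QueueCursorReaper</code> name is made up, and the real fix may differ in details.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical reaper that never calls Mutex#synchronize in schedule.
class QueueCursorReaper
  Task = Struct.new(:id, :database, :collection)

  def initialize
    @tasks = Queue.new
  end

  # Called from the finalizer: only pushes to the thread-safe queue.
  def schedule(id, database, collection)
    @tasks.push(Task.new(id, database, collection))
  end

  # Called from the background thread: drains the queue without blocking.
  def kill_cursors
    drained = []
    begin
      loop { drained.push(@tasks.pop(true)) }
    rescue ThreadError
      # Queue#pop(true) raises ThreadError once the queue is empty.
    end
    drained.each do |task|
      puts "Killing cursor #{task.id} on #{task.database}.#{task.collection}"
    end
    # Execute commands to close cursors
    drained
  end
end

reaper = QueueCursorReaper.new
reaper.schedule(1, 'database', 'collection')
reaper.schedule(2, 'database', 'collection')
killed = reaper.kill_cursors
</code></pre></div></div>

<p>The queue does its own synchronization, so the finalizer no longer needs a mutex, and the background thread can take as long as it needs to talk to the server.</p>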

<h2 id="we-need-to-go-deeper">We Need to Go Deeper</h2>

<p>Even though the problem was gone, I decided to find out whether finalizers are <em>really</em> executed inside a signal trap. I thought maybe the Ruby VM
uses signals internally to trigger garbage collection. I could not find any mention of such a use of signals, so I had to read
the Ruby source code. It turned out to be fun, and the outcome was very unexpected!</p>

<p>I started by finding where the error <em>“can’t be called from trap context”</em> is raised. I found it in the <code class="language-plaintext highlighter-rouge">do_mutex_lock</code> function in the <code class="language-plaintext highlighter-rouge">thread_sync.c</code> file:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* When running trap handler */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">FL_TEST_RAW</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">MUTEX_ALLOW_TRAP</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
  <span class="n">th</span><span class="o">-&gt;</span><span class="n">ec</span><span class="o">-&gt;</span><span class="n">interrupt_mask</span> <span class="o">&amp;</span> <span class="n">TRAP_INTERRUPT_MASK</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">rb_raise</span><span class="p">(</span><span class="n">rb_eThreadError</span><span class="p">,</span> <span class="s">"can't be called from trap context"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, what is actually verified is whether the execution context has a <code class="language-plaintext highlighter-rouge">TRAP_INTERRUPT_MASK</code> flag set. This flag is set in three functions: <code class="language-plaintext highlighter-rouge">rb_postponed_job_flush</code> in <code class="language-plaintext highlighter-rouge">vm_trace.c</code>, <code class="language-plaintext highlighter-rouge">rb_threadptr_execute_interrupts</code> in <code class="language-plaintext highlighter-rouge">thread.c</code>, and <code class="language-plaintext highlighter-rouge">signal_exec</code> in <code class="language-plaintext highlighter-rouge">signal.c</code>. After some debugging, I found out that in our case the flag is set in the <code class="language-plaintext highlighter-rouge">rb_postponed_job_flush</code> function. Actually, this is also confirmed by this comment for the <code class="language-plaintext highlighter-rouge">rb_gc</code> function in <a href="https://github.com/ruby/ruby/blob/master/include/ruby/internal/intern/gc.h#L230"><code class="language-plaintext highlighter-rouge">gc.h</code></a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> * Finalisers are deferred until we can handle interrupts.  See
 * `rb_postponed_job_flush` in vm_trace.c.
</code></pre></div></div>

<p>Alright, now it is more or less clear what is going on. Finalizers are not executed immediately after an object is “garbage collected”. Instead, a postponed job is created and scheduled. Such jobs are executed in the <code class="language-plaintext highlighter-rouge">rb_postponed_job_flush</code> function. This function sets the <code class="language-plaintext highlighter-rouge">TRAP_INTERRUPT_MASK</code> flag, which is later checked by <code class="language-plaintext highlighter-rouge">do_mutex_lock</code>. Hence the error. I even found <a href="https://github.com/ruby/ruby/commit/05459d1a33db59c47e98e327c9f52808ebc76a3f">the commit</a> that introduced the current behavior, and <a href="https://bugs.ruby-lang.org/issues/10595">a bug</a> that was fixed by this commit.
It looks like the Ruby team wanted to make sure that finalizers are never interrupted by a signal;
as a side effect, code inside finalizers is treated as code inside a signal trap.</p>

<p><em>To summarize, finalizers are <strong>not</strong> executed inside a signal trap; however, Ruby applies the same restrictions to signal traps and finalizers. This is not documented anywhere; further, the exception raised is a bit misleading. Be careful!</em></p>

<p>P.S. It is still unclear why we did not see the exception when we triggered
the garbage collection manually. I wasn’t able to find the answer; maybe this is
a topic for my next article.</p>]]></content><author><name></name></author><category term="ruby" /><category term="mongodb" /><summary type="html"><![CDATA[Ruby allows a developer to specify a finalizer proc for an object. This proc is called after an object was destroyed. This is a very useful mechanism that can be used for some cleanup when the object is gone. However, it turned out that there are limitations to what you can do inside finalizers. And these limitations are the same as ones for a signal trap. So, if you write a finalizer, you should follow the documentation for signal traps.]]></summary></entry></feed>