Eric's Blog
2019-04-09T03:04:55+00:00
http://blog.wjin.org
Eric
wjin.cn@gmail.com
Ceph BlueStore
2018-03-07T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore
<h1 id="introduction">Introduction</h1>
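<p>After covering BlockDevice, BlueFS, FreelistManager and Allocator, this post focuses on the implementation of BlueStore itself. BlueStore derives from ObjectStore, so it has to implement the initialization and shutdown operations such as mkfs/mount/umount, and it also provides the basic object interfaces such as read/write, plus the operations on object attributes (attr) and omap. The code base is huge, but apart from the write path most of it is tedious rather than hard to follow, so this post concentrates on the write path and on the metrics related to write performance. For how a write request is mapped from in-memory object extents to the disk address space, see this <a href="http://www.sysnote.org/2016/08/19/ceph-bluestore/">article</a>. Below, "simple write" refers to cases that need no wal, such as new writes and aligned writes (cow), and "deferred write" refers to cases that do need the wal (rmw).</p>
<p>First, a high-level view of the overall BlueStore architecture (diagram borrowed from the Ceph author):</p>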
<p><img src="/assets/img/post/ceph_bluestore.png" alt="img" /></p>
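<p>Since KernelDevice is what is mostly used in practice, the description below assumes KernelDevice. The threads involved in BlueStore are:</p>
<ul>
<li>OSD::osd_op_tp: submits io requests to KernelDevice via libaio</li>
<li>KernelDevice::aio_thread: runs the callbacks after libaio completion</li>
<li>BlueStore::kv_sync_thread: syncs the k/v data, including object metadata and disk space usage information, and cleans up wal entries</li>
<li>BlueStore::kv_finalize_thread: handles completion callbacks and other cleanup work. In the wal case it builds the dbh and submits the io request</li>
<li>BlueStore::deferred_finisher: submits deferred io requests via libaio</li>
<li>BlueStore::finishers: sharded finisher threads, used to call back and notify the user that a request has completed</li>
</ul>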
<h1 id="aiocontext">AioContext</h1>
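<p>All device writes go through libaio, so the first thing to understand is how the completion callback is executed. AioContext has two derived contexts, TransContext and DeferredBatch. The former corresponds to a simple write and is abbreviated txc; the latter corresponds to a deferred write and is abbreviated dbh. The callback is set when the block device is created and is executed by the block device's aio thread:</p>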
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">AioContext</span> <span class="p">{</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">aio_finish</span><span class="p">(</span><span class="n">BlueStore</span> <span class="o">*</span><span class="n">store</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">AioContext</span><span class="p">()</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">TransContext</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AioContext</span> <span class="p">{</span>
<span class="p">......</span>
<span class="kt">void</span> <span class="n">aio_finish</span><span class="p">(</span><span class="n">BlueStore</span> <span class="o">*</span><span class="n">store</span><span class="p">)</span> <span class="k">override</span> <span class="p">{</span>
<span class="n">store</span><span class="o">-></span><span class="n">txc_aio_finish</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// callback for a txc
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">DeferredBatch</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AioContext</span> <span class="p">{</span>
<span class="p">......</span>
<span class="kt">void</span> <span class="n">aio_finish</span><span class="p">(</span><span class="n">BlueStore</span> <span class="o">*</span><span class="n">store</span><span class="p">)</span> <span class="k">override</span> <span class="p">{</span>
<span class="n">store</span><span class="o">-></span><span class="n">_deferred_aio_finish</span><span class="p">(</span><span class="n">osr</span><span class="p">);</span> <span class="c1">// callback for a dbh
</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The callback is registered up front, when the device is created:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_open_bdev</span><span class="p">(</span><span class="kt">bool</span> <span class="n">create</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">bdev</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">string</span> <span class="n">p</span> <span class="o">=</span> <span class="n">path</span> <span class="o">+</span> <span class="s">"/block"</span><span class="p">;</span>
<span class="n">bdev</span> <span class="o">=</span> <span class="n">BlockDevice</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">aio_cb</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="k">this</span><span class="p">));</span> <span class="c1">// pass in the callback
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">aio_cb</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">priv2</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">BlueStore</span> <span class="o">*</span><span class="n">store</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">BlueStore</span><span class="o">*></span><span class="p">(</span><span class="n">priv</span><span class="p">);</span>
<span class="n">BlueStore</span><span class="o">::</span><span class="n">AioContext</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">BlueStore</span><span class="o">::</span><span class="n">AioContext</span><span class="o">*></span><span class="p">(</span><span class="n">priv2</span><span class="p">);</span>
<span class="n">c</span><span class="o">-></span><span class="n">aio_finish</span><span class="p">(</span><span class="n">store</span><span class="p">);</span> <span class="c1">// run the callback
</span><span class="p">}</span>
</code></pre></div></div>
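<p>The dispatch pattern above can be tried out in isolation. The following is a minimal sketch, not BlueStore code: <code>Store</code>, <code>TxcContext</code>, <code>BatchContext</code> and <code>FakeDevice</code> are hypothetical stand-ins, and the device "completes" a context synchronously instead of from an aio thread. It only shows how one C-style callback taking two <code>void*</code> arguments reaches the right handler through the virtual <code>aio_finish</code>.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for BlueStore: it only records which handler ran.
struct Store {
  std::vector<std::string> log;
  void txc_finish() { log.push_back("txc"); }
  void deferred_finish() { log.push_back("dbh"); }
};

// Same idea as AioContext: a single virtual completion hook.
struct AioContext {
  virtual void aio_finish(Store* store) = 0;
  virtual ~AioContext() {}
};

// Plays the role of TransContext (simple write).
struct TxcContext : AioContext {
  void aio_finish(Store* store) override { store->txc_finish(); }
};

// Plays the role of DeferredBatch (deferred write).
struct BatchContext : AioContext {
  void aio_finish(Store* store) override { store->deferred_finish(); }
};

// C-style callback held by the device: priv is the store, priv2 the context.
typedef void (*aio_callback_t)(void* priv, void* priv2);

void aio_cb(void* priv, void* priv2) {
  Store* store = static_cast<Store*>(priv);
  AioContext* c = static_cast<AioContext*>(priv2);
  assert(c != nullptr);
  c->aio_finish(store);  // virtual dispatch picks the txc or dbh handler
}

// Hypothetical device: remembers the callback and its owner at creation time.
struct FakeDevice {
  aio_callback_t cb;
  void* priv;
  FakeDevice(aio_callback_t c, void* p) : cb(c), priv(p) {}
  // A real device would call this from its completion thread.
  void complete(AioContext* ctx) { cb(priv, static_cast<void*>(ctx)); }
};
```

<p>Completing a <code>TxcContext</code> and then a <code>BatchContext</code> on the same <code>FakeDevice</code> leaves "txc" followed by "dbh" in <code>Store::log</code>; the device never needs to know which kind of context it is holding.</p>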
<h1 id="write-type">Write Type</h1>
<p>A single write request at the user or osd level may turn out, at the BlueStore layer, to be a simple write, a deferred write, or a mix of both.</p>
<h3 id="simple-write">Simple Write</h3>
<p><img src="/assets/img/post/ceph_bluestore_simple_write.png" alt="img" /></p>
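<p>For a simple write, the data is first written to new blocks and then the k/v metadata is updated. The txc state transitions are as follows:</p>
<p>Writing the new blocks:</p>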
<blockquote>
<p>STATE_PREPARE -> STATE_AIO_WAIT -> STATE_IO_DONE -> STATE_KV_QUEUED</p>
</blockquote>
<p>Writing the k/v metadata:</p>
<blockquote>
<p>STATE_KV_QUEUED -> STATE_KV_SUBMITTED -> STATE_KV_DONE -> STATE_FINISHING -> STATE_DONE</p>
</blockquote>
<p>Steps:</p>
<ul>
<li>
<p>The osd_op_tp thread pool sets STATE_PREPARE and STATE_AIO_WAIT, submits the IO and waits for the callback</p>
</li>
<li>
<p>aio_thread handles the callback and sets STATE_IO_DONE and STATE_KV_QUEUED</p>
</li>
<li>
<p>kv_sync_thread commits the k/v transaction and sets STATE_KV_SUBMITTED</p>
</li>
<li>
<p>kv_finalize_thread sets STATE_KV_DONE, STATE_FINISHING and STATE_DONE</p>
</li>
</ul>
<h3 id="deferred-write">Deferred Write</h3>
<p><img src="/assets/img/post/ceph_bluestore_deferred_write.png" alt="img" /></p>
<p>For a deferred write, the wal entry is packed directly into the k/v transaction and the log is written first, i.e. the k/v commit happens first, so the txc never passes through STATE_AIO_WAIT. Once the log has been written, a dbh is built to perform the data write.</p>
<p>Writing the k/v log:</p>
<blockquote>
<p>STATE_PREPARE -> STATE_IO_DONE -> STATE_KV_QUEUED -> STATE_KV_SUBMITTED -> STATE_KV_DONE -> STATE_DEFERRED_QUEUED</p>
</blockquote>
<p>Writing the data:</p>
<blockquote>
<p>STATE_DEFERRED_QUEUED -> STATE_DEFERRED_CLEANUP -> STATE_FINISHING -> STATE_DONE</p>
</blockquote>
<p>Steps:</p>
<ul>
<li>
<p>The osd_op_tp thread pool sets STATE_PREPARE, STATE_IO_DONE and STATE_KV_QUEUED, and queues the wal request in the kv queue to wait for commit</p>
</li>
<li>
<p>kv_sync_thread commits the k/v log and sets STATE_KV_SUBMITTED</p>
</li>
<li>
<p>kv_finalize_thread sets STATE_KV_DONE and STATE_DEFERRED_QUEUED, builds the dbh that writes the data, submits the IO request and waits for the callback</p>
</li>
<li>
<p>aio_thread handles the callback and sets STATE_DEFERRED_CLEANUP</p>
</li>
<li>
<p>kv_sync_thread removes the log entry from k/v</p>
</li>
<li>
<p>kv_finalize_thread sets STATE_FINISHING and STATE_DONE</p>
</li>
</ul>
<h3 id="simple-write--deferred-write">Simple Write + Deferred Write</h3>
<p><img src="/assets/img/post/ceph_bluestore_simple_deferred_write.png" alt="img" /></p>
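<p>This is the most complex kind of write. Its states are a combination of the previous two, and the steps are:</p>
<ul>
<li>
<p>The osd_op_tp thread pool sets STATE_PREPARE and STATE_AIO_WAIT, submits the IO and waits for the callback (this is the simple write IO)</p>
</li>
<li>
<p>aio_thread handles the callback and sets STATE_IO_DONE and STATE_KV_QUEUED</p>
</li>
<li>
<p>kv_sync_thread commits the k/v log together with part of the metadata and sets STATE_KV_SUBMITTED</p>
</li>
<li>
<p>kv_finalize_thread sets STATE_KV_DONE and STATE_DEFERRED_QUEUED, builds the dbh that writes the data, submits the IO request and waits for the callback</p>
</li>
<li>
<p>aio_thread handles the callback and sets STATE_DEFERRED_CLEANUP</p>
</li>
<li>
<p>kv_sync_thread removes the log entry from k/v</p>
</li>
<li>
<p>kv_finalize_thread sets STATE_FINISHING and STATE_DONE</p>
</li>
</ul>
<p>In every case, once STATE_KV_DONE is reached it is safe to notify the user that the write has completed. The following sections look at what each thread does.</p>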
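<p>The three state sequences listed in this section differ only in two optional segments. As a rough illustration, the hypothetical helper below (<code>txc_state_path</code> is not a BlueStore function) rebuilds each sequence from two flags: whether the txc carries simple-write IO and whether it carries a wal entry.</p>

```cpp
#include <cassert>  // assert, for callers that want to check the result
#include <string>
#include <vector>

// Returns the txc state sequence described in this section.
// has_aio:      the txc carries simple-write IO (it passes through AIO_WAIT)
// has_deferred: the txc carries a wal entry (it passes through the DEFERRED states)
std::vector<std::string> txc_state_path(bool has_aio, bool has_deferred) {
  std::vector<std::string> path;
  path.push_back("STATE_PREPARE");
  if (has_aio) {
    path.push_back("STATE_AIO_WAIT");
  }
  path.push_back("STATE_IO_DONE");
  path.push_back("STATE_KV_QUEUED");
  path.push_back("STATE_KV_SUBMITTED");
  path.push_back("STATE_KV_DONE");  // the user can be notified from here on
  if (has_deferred) {
    path.push_back("STATE_DEFERRED_QUEUED");
    path.push_back("STATE_DEFERRED_CLEANUP");
  }
  path.push_back("STATE_FINISHING");
  path.push_back("STATE_DONE");
  return path;
}
```

<p>Calling it with <code>(true, false)</code> gives the simple write path, <code>(false, true)</code> the deferred write path, and <code>(true, true)</code> the combined path.</p>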
<h1 id="write-process">Write Process</h1>
<h3 id="osdosd_op_tp">OSD::osd_op_tp</h3>
<p>As with FileStore, modifications inside a pg are packed into transactions by the osd_op_tp thread pool and submitted through queue_transactions:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">queue_transactions</span><span class="p">(</span><span class="n">Sequencer</span> <span class="o">*</span><span class="n">posr</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">Transaction</span><span class="o">>&</span> <span class="n">tls</span><span class="p">,</span> <span class="n">TrackedOpRef</span> <span class="n">op</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Get the pg's OpSequencer, which keeps operations inside a pg serialized
</span> <span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">posr</span><span class="o">-></span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="n">osr</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">OpSequencer</span> <span class="o">*></span><span class="p">(</span><span class="n">posr</span><span class="o">-></span><span class="n">p</span><span class="p">.</span><span class="n">get</span><span class="p">());</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">osr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">OpSequencer</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="k">this</span><span class="p">);</span>
<span class="n">osr</span><span class="o">-></span><span class="n">parent</span> <span class="o">=</span> <span class="n">posr</span><span class="p">;</span>
<span class="n">posr</span><span class="o">-></span><span class="n">p</span> <span class="o">=</span> <span class="n">osr</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Create the txc and queue it inside the OpSequencer
</span> <span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span> <span class="o">=</span> <span class="n">_txc_create</span><span class="p">(</span><span class="n">osr</span><span class="p">);</span> <span class="c1">// the initial txc state is STATE_PREPARE
</span>
<span class="c1">// Set up the completion callbacks (onreadable, onreadable_sync and ondisk are collected from tls; that code is elided here)
</span> <span class="n">txc</span><span class="o">-></span><span class="n">onreadable</span> <span class="o">=</span> <span class="n">onreadable</span><span class="p">;</span>
<span class="n">txc</span><span class="o">-></span><span class="n">onreadable_sync</span> <span class="o">=</span> <span class="n">onreadable_sync</span><span class="p">;</span>
<span class="n">txc</span><span class="o">-></span><span class="n">oncommit</span> <span class="o">=</span> <span class="n">ondisk</span><span class="p">;</span>
<span class="c1">// Convert the osd-level transactions into BlueStore-level transaction operations
</span> <span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">Transaction</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">tls</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">tls</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">).</span><span class="n">set_osr</span><span class="p">(</span><span class="n">osr</span><span class="p">);</span>
<span class="n">txc</span><span class="o">-></span><span class="n">bytes</span> <span class="o">+=</span> <span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">).</span><span class="n">get_num_bytes</span><span class="p">();</span>
<span class="n">_txc_add_transaction</span><span class="p">(</span><span class="n">txc</span><span class="p">,</span> <span class="o">&</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">));</span> <span class="c1">// very involved
</span> <span class="p">}</span>
<span class="c1">// Add the deferred (wal) log entry to the k/v transaction
</span> <span class="k">if</span> <span class="p">(</span><span class="n">txc</span><span class="o">-></span><span class="n">deferred_txn</span><span class="p">)</span> <span class="p">{</span>
<span class="n">txc</span><span class="o">-></span><span class="n">deferred_txn</span><span class="o">-></span><span class="n">seq</span> <span class="o">=</span> <span class="o">++</span><span class="n">deferred_seq</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="o">*</span><span class="n">txc</span><span class="o">-></span><span class="n">deferred_txn</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="n">string</span> <span class="n">key</span><span class="p">;</span>
<span class="n">get_deferred_key</span><span class="p">(</span><span class="n">txc</span><span class="o">-></span><span class="n">deferred_txn</span><span class="o">-></span><span class="n">seq</span><span class="p">,</span> <span class="o">&</span><span class="n">key</span><span class="p">);</span>
<span class="n">txc</span><span class="o">-></span><span class="n">t</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">PREFIX_DEFERRED</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Throttling
</span> <span class="p">......</span>
<span class="c1">// Run the state machine, which submits the io requests to the block device
</span> <span class="n">_txc_state_proc</span><span class="p">(</span><span class="n">txc</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// _txc_state_proc is just a state machine; the implementation is fairly simple, so it is not listed again later
</span><span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_txc_state_proc</span><span class="p">(</span><span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">txc</span><span class="o">-></span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">TransContext</span><span class="p">:</span><span class="o">:</span><span class="n">STATE_PREPARE</span><span class="o">:</span>
<span class="n">txc</span><span class="o">-></span><span class="n">log_state_latency</span><span class="p">(</span><span class="n">logger</span><span class="p">,</span> <span class="n">l_bluestore_state_prepare_lat</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">txc</span><span class="o">-></span><span class="n">ioc</span><span class="p">.</span><span class="n">has_pending_aios</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// does the txc contain simple-write io? if not, fall straight through to the next case
</span> <span class="n">txc</span><span class="o">-></span><span class="n">state</span> <span class="o">=</span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_AIO_WAIT</span><span class="p">;</span>
<span class="n">txc</span><span class="o">-></span><span class="n">had_ios</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">_txc_aio_submit</span><span class="p">(</span><span class="n">txc</span><span class="p">);</span> <span class="c1">// submit the io requests
</span> <span class="k">return</span><span class="p">;</span> <span class="c1">// return and wait for the io completion callback
</span> <span class="p">}</span>
<span class="c1">// note: no break here, execution falls through
</span>
<span class="k">case</span> <span class="n">TransContext</span><span class="p">:</span><span class="o">:</span><span class="n">STATE_AIO_WAIT</span><span class="o">:</span>
<span class="n">txc</span><span class="o">-></span><span class="n">log_state_latency</span><span class="p">(</span><span class="n">logger</span><span class="p">,</span> <span class="n">l_bluestore_state_aio_wait_lat</span><span class="p">);</span>
<span class="n">_txc_finish_io</span><span class="p">(</span><span class="n">txc</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
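<p>The important detail in <code>_txc_state_proc</code> is that it is entered more than once for the same txc: first by the submitting thread, then again by the completion callback. The reduced model below is an assumption-laden sketch (<code>Txc</code>, <code>state_proc</code> and the three-value <code>State</code> enum are made up for illustration) that keeps only the PREPARE/AIO_WAIT part and the deliberate fall-through.</p>

```cpp
#include <cassert>  // assert, for callers that want to check the result

enum State { PREPARE, AIO_WAIT, IO_DONE };

// Hypothetical, heavily reduced transaction context.
struct Txc {
  State state = PREPARE;
  int pending_aios = 0;    // number of simple-write IOs attached to the txc
  bool submitted = false;  // stands in for handing the IO to the device
};

// Returns true if the txc reached IO_DONE during this call.
bool state_proc(Txc& txc) {
  switch (txc.state) {
  case PREPARE:
    if (txc.pending_aios > 0) {
      txc.state = AIO_WAIT;
      txc.submitted = true;
      return false;  // come back later, from the completion callback
    }
    // fall through: nothing to wait for, so behave as if the IO had completed
  case AIO_WAIT:
    txc.state = IO_DONE;
    return true;
  default:
    return false;
  }
}
```

<p>A txc without pending IO goes from PREPARE to IO_DONE in a single call, which is the deferred-write-only path. A txc with pending IO stops at AIO_WAIT on the first call and only reaches IO_DONE on the second call, which is the one the aio thread would make.</p>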
<p>The latencies involved in this thread are:</p>
<ul>
<li>
<p>throttle latency</p>
</li>
<li>
<p>prepare latency</p>
</li>
<li>
<p>submit latency</p>
</li>
</ul>
<h3 id="blockdeviceaio_thread">BlockDevice::aio_thread</h3>
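<p>For a simple write, once the io completes, the state machine function _txc_state_proc runs again. Because the state is STATE_AIO_WAIT, it calls _txc_finish_io, whose main purpose is to make sure that the txcs in the pg's OpSequencer enter the kv_sync_thread queue in the order in which they were queued (libaio may complete them out of order). Why must the order be preserved? Several writes to the same object in a pg may be submitted back to back, each with its own txc. The osd layer guarantees their order through the pg lock and submits them to the ObjectStore layer in sequence, so ObjectStore has to preserve that order as well, otherwise data could be corrupted.</p>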
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_txc_finish_io</span><span class="p">(</span><span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span> <span class="o">=</span> <span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">osr</span><span class="o">-></span><span class="n">qlock</span><span class="p">);</span> <span class="c1">// take the OpSequencer lock, mainly to serialize access to the txc queue
</span> <span class="n">txc</span><span class="o">-></span><span class="n">state</span> <span class="o">=</span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_IO_DONE</span><span class="p">;</span> <span class="c1">// set the state
</span>
<span class="n">txc</span><span class="o">-></span><span class="n">ioc</span><span class="p">.</span><span class="n">running_aios</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">OpSequencer</span><span class="o">::</span><span class="n">q_list_t</span><span class="o">::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">osr</span><span class="o">-></span><span class="n">q</span><span class="p">.</span><span class="n">iterator_to</span><span class="p">(</span><span class="o">*</span><span class="n">txc</span><span class="p">);</span> <span class="c1">// locate the current txc in the queue
</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">!=</span> <span class="n">osr</span><span class="o">-></span><span class="n">q</span><span class="p">.</span><span class="n">begin</span><span class="p">())</span> <span class="p">{</span>
<span class="o">--</span><span class="n">p</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="o">-></span><span class="n">state</span> <span class="o"><</span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_IO_DONE</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// an earlier txc has not finished its io yet; this txc cannot proceed and must wait for it, so return
</span> <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="o">-></span><span class="n">state</span> <span class="o">></span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_IO_DONE</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// the earlier txc has already moved on to a later state; step p forward again and leave the loop to process the current txc
</span> <span class="o">++</span><span class="n">p</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Process the txcs in STATE_IO_DONE one after another; each is put on the kv_sync_thread queues kv_queue and kv_queue_unsubmitted
</span> <span class="k">do</span> <span class="p">{</span>
<span class="n">_txc_state_proc</span><span class="p">(</span><span class="o">&*</span><span class="n">p</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">!=</span> <span class="n">osr</span><span class="o">-></span><span class="n">q</span><span class="p">.</span><span class="n">end</span><span class="p">()</span> <span class="o">&&</span> <span class="n">p</span><span class="o">-></span><span class="n">state</span> <span class="o">==</span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_IO_DONE</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
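<p>The ordering logic of <code>_txc_finish_io</code> is easier to check on a toy model. The sketch below is an assumption, not the real data structure: the hypothetical <code>Sequencer</code> stores only txc states in a <code>std::vector</code> (BlueStore uses a list of TransContext objects protected by <code>qlock</code>), and "advancing" a txc just records its index in <code>kv_queue</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum State { AIO_WAIT, IO_DONE, KV_QUEUED };

// Hypothetical per-pg sequencer holding txc states in submission order.
struct Sequencer {
  std::vector<State> q;               // state of each txc, oldest first
  std::vector<std::size_t> kv_queue;  // txc indices in the order they were queued for commit

  std::size_t submit() {
    q.push_back(AIO_WAIT);
    return q.size() - 1;
  }

  // Called when the IO of txc i completes, possibly out of order.
  void finish_io(std::size_t i) {
    assert(i < q.size());
    q[i] = IO_DONE;
    std::size_t p = i;
    while (p > 0) {
      --p;
      if (q[p] < IO_DONE) {
        return;  // an earlier txc is still waiting for its IO; stay blocked
      }
      if (q[p] > IO_DONE) {
        ++p;     // everything before this point has already moved on
        break;
      }
    }
    // Advance the oldest blocked txc and every IO_DONE txc queued right behind it.
    do {
      q[p] = KV_QUEUED;
      kv_queue.push_back(p);
      ++p;
    } while (p < q.size() && q[p] == IO_DONE);
  }
};
```

<p>If three txcs are submitted and their IOs complete in the order 2, 1, 0, nothing is queued for commit until txc 0 completes, and then all three enter <code>kv_queue</code> as 0, 1, 2.</p>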
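<p>For a deferred write, after the first phase has written the log into k/v, the dbh for the data write is prepared and its io is submitted. On completion aio_thread runs the callback, whose main job is to set STATE_DEFERRED_CLEANUP and to push the dbh onto deferred_done_queue for kv_sync_thread to process:</p>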
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_deferred_aio_finish</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l2</span><span class="p">(</span><span class="n">osr</span><span class="o">-></span><span class="n">qlock</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">i</span> <span class="o">:</span> <span class="n">b</span><span class="o">-></span><span class="n">txcs</span><span class="p">)</span> <span class="p">{</span>
<span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span> <span class="o">=</span> <span class="o">&</span><span class="n">i</span><span class="p">;</span>
<span class="n">txc</span><span class="o">-></span><span class="n">state</span> <span class="o">=</span> <span class="n">TransContext</span><span class="o">::</span><span class="n">STATE_DEFERRED_CLEANUP</span><span class="p">;</span> <span class="c1">// set the state
</span> <span class="n">costs</span> <span class="o">+=</span> <span class="n">txc</span><span class="o">-></span><span class="n">cost</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">osr</span><span class="o">-></span><span class="n">qcond</span><span class="p">.</span><span class="n">notify_all</span><span class="p">();</span>
<span class="n">throttle_deferred_bytes</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">costs</span><span class="p">);</span> <span class="c1">// release the throttle budget
</span> <span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">kv_lock</span><span class="p">);</span>
<span class="n">deferred_done_queue</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="c1">// enqueue the dbh
</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
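<p>To sum up, once the IO completes, either a txc or a dbh is queued. The queues differ, but both are waiting for kv_sync_thread. "sync" here presumably means synchronization: after the data is written the k/v store has to be updated, for example the object's Onode and the FreelistManager's disk space information, and these updates must be applied in order.</p>
<p>A simple write always writes to new disk blocks (cow also writes new blocks; the k/v operations in the transaction additionally release the old blocks), so aio_thread writes the blocks first and kv_sync_thread syncs the metadata afterwards. No matter when a crash happens, the data is not corrupted.</p>
<p>For a deferred write, the first commit by kv_sync_thread records the wal in the k/v store before any of the later steps run. After a failure the wal can be replayed, so the data is not corrupted either.</p>
<p>The latencies involved in this thread are:</p>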
<ul>
<li>
<p>aio wait latency</p>
</li>
<li>
<p>aio done latency (the ordering step in OpSequencer may block)</p>
</li>
</ul>
<h3 id="bluestorekv_sync_thread">BlueStore::kv_sync_thread</h3>
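<p>This is the sync thread. It handles three queues: kv_queue holds the txcs that need to be committed, deferred_done_queue holds the dbhs whose deferred (wal) data write has completed, and deferred_stable_queue holds the dbhs whose data is already on disk and whose log entries are waiting to be cleaned up. The thread loops over these queues:</p>
<p><strong>kv_queue:</strong>
The txcs in kv_queue are moved into kv_committing and submitted to the k/v store, i.e. db->submit_transaction is called. Their state is set to STATE_KV_SUBMITTED, and the txcs in kv_committing are then moved to kv_committing_to_finalize for kv_finalize_thread to handle.</p>
<p><strong>deferred_done_queue:</strong>
A dbh in this queue ends up in one of two ways: 1) if no flush was performed, it is moved to deferred_stable_queue and handled on the next loop iteration; 2) if a flush was performed, the data is already on disk, i.e. already stable, so it is inserted into the stable queue right away. "Stable" here means that the data has been written, so the wal entry recorded earlier in k/v is no longer needed and can be deleted.</p>
<p><strong>deferred_stable_queue:</strong>
The txcs of each dbh are processed in turn and their wal entries are deleted from k/v. The dbh is then put on deferred_stable_to_finalize for kv_finalize_thread to handle.</p>
<p>The latencies involved in this thread are:</p>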
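<p>The interplay between deferred_done_queue and deferred_stable_queue can be modelled in a few lines. This is a simplified sketch of the description above and not the actual thread body: <code>KvSyncModel</code> is hypothetical, dbhs are plain integers, and "removing the wal entry" is reduced to appending the id to <code>cleaned</code>.</p>

```cpp
#include <cassert>  // assert, for callers that want to check the result
#include <deque>

// Hypothetical model of the two deferred queues handled by the sync thread.
struct KvSyncModel {
  std::deque<int> deferred_done_queue;    // dbh ids whose data IO has completed
  std::deque<int> deferred_stable_queue;  // dbh ids whose data is known to be on disk
  std::deque<int> cleaned;                // dbh ids whose wal entry has been removed

  // One loop iteration; flushed tells whether the device was flushed in it.
  void iterate(bool flushed) {
    std::deque<int> done, stable;
    done.swap(deferred_done_queue);
    stable.swap(deferred_stable_queue);
    if (flushed) {
      // After a flush, everything that completed is stable already.
      stable.insert(stable.end(), done.begin(), done.end());
      done.clear();
    }
    // Stable batches get their wal entries removed in this iteration.
    cleaned.insert(cleaned.end(), stable.begin(), stable.end());
    // Batches that were not covered by a flush wait for the next iteration.
    deferred_stable_queue.insert(deferred_stable_queue.end(),
                                 done.begin(), done.end());
  }
};
```

<p>Without a flush a completed dbh needs two iterations before its wal entry is cleaned; with a flush it is cleaned in the same iteration.</p>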
<ul>
<li>
<p>kv queued</p>
</li>
<li>
<p>kv flush</p>
</li>
<li>
<p>kv commit</p>
</li>
<li>
<p>kv latency</p>
</li>
</ul>
<h3 id="bluestorekv_finalize_thread">BlueStore::kv_finalize_thread</h3>
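<p>This is the cleanup thread. It handles two queues, kv_committing_to_finalize and deferred_stable_to_finalize.</p>
<p><strong>kv_committing_to_finalize:</strong>
_txc_state_proc is called again, the state is set to STATE_KV_DONE, and the callbacks are run to <strong>notify the user that the io has completed</strong>. What happens next depends on the transaction:</p>
<ul>
<li>
<p>If this is not the wal case, i.e. the txc carries no deferred transaction, the state is set to STATE_FINISHING, then _txc_finish is called and the state becomes STATE_DONE. The txc is finished.</p>
</li>
<li>
<p>In the wal case, the state is set to STATE_DEFERRED_QUEUED and _deferred_queue is called to prepare the dbh that writes the data (a DeferredBatch, which has a list member holding txcs). _deferred_submit_unlock is called next, and bdev->aio_submit hands the IO to the block device; the aio context in this case is a DeferredBatch. When the io completes, the aio thread again runs the callback, here _deferred_aio_finish, which sets the state to STATE_DEFERRED_CLEANUP and puts the batch on deferred_done_queue. kv_sync_thread takes over from there: it first moves the batch to deferred_stable_queue, and on its next pass it removes the k/v entries of the log and puts the dbh on deferred_stable_to_finalize, where kv_finalize_thread continues.</p>
</li>
</ul>
<p><strong>deferred_stable_to_finalize:</strong>
The txcs of each dbh are visited and _txc_state_proc is called once more: the state is set to STATE_FINISHING, then _txc_finish is called and the state becomes STATE_DONE. The txc is finished.</p>
<p>Latency metrics:</p>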
<ul>
<li>
<p>state kv committing</p>
</li>
<li>
<p>state kv done</p>
</li>
<li>
<p>state finishing</p>
</li>
<li>
<p>state deferred cleanup</p>
</li>
</ul>
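<p>The deferred_finisher and finishers threads are simple, they only run a callback, so they are not covered here. Almost every other thread wears several hats and handles multiple queues. The responsibilities are not cleanly separated, which makes the code somewhat hard to read. This is probably a performance decision: it avoids hopping across many threads, which would interrupt the IO pipeline and add unnecessary overhead.</p>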
<h1 id="deferred-io-order">Deferred IO Order</h1>
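<p>One more question deserves attention: when a dbh is executed, the data writes are submitted to the BlockDevice through libaio, so how is their order guaranteed? BlueStore has a member deferred_queue, which holds the OpSequencers that have deferred IO to perform. Each OpSequencer has two members, deferred_running and deferred_pending, both of type DeferredBatch, and a DeferredBatch holds a list of txcs. When a pg has write requests, their txcs are queued in deferred_pending of the pg's OpSequencer. When the time comes, all of those txcs are submitted to libaio in one go, and the next submission happens only after that one has completed, so deferred data writes are never reordered.</p>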
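<p>Before the BlueStore excerpts, here is a small model of the pending/running hand-over, written under simplifying assumptions: <code>DeferredSequencer</code> and <code>Batch</code> are hypothetical, txcs are plain integers, there is no locking, and "submitting" a batch just records its contents.</p>

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical batch of deferred txcs (ids only).
struct Batch {
  std::vector<int> txcs;
};

// Hypothetical per-pg sequencer with one pending and one running batch.
struct DeferredSequencer {
  std::unique_ptr<Batch> pending;  // collects txcs until the next submission
  std::unique_ptr<Batch> running;  // the batch currently in flight, if any
  std::vector<std::vector<int> > submitted;  // one entry per submission

  void queue(int txc) {
    if (!pending) {
      pending.reset(new Batch());
    }
    pending->txcs.push_back(txc);
  }

  // Submits the whole pending batch unless an earlier batch is still in flight.
  bool try_submit() {
    if (running || !pending) {
      return false;
    }
    running = std::move(pending);  // pending becomes empty again
    submitted.push_back(running->txcs);
    return true;
  }

  // Completion callback for the running batch.
  void aio_finish() {
    assert(running);
    running.reset();
  }
};
```

<p>Because at most one batch per sequencer is in flight, a txc queued while a batch is running can only be written after that batch has completed, which is the ordering property described above. The corresponding excerpts from BlueStore follow.</p>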
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_deferred_queue</span><span class="p">(</span><span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">deferred_lock</span><span class="p">.</span><span class="n">lock</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span> <span class="o">&&</span> <span class="o">!</span><span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="o">-></span><span class="n">deferred_running</span><span class="p">)</span> <span class="p">{</span>
<span class="n">deferred_queue</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="o">*</span><span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="p">);</span> <span class="c1">// queue the osr
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span><span class="p">)</span> <span class="p">{</span>
<span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span> <span class="o">=</span> <span class="k">new</span> <span class="n">DeferredBatch</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="p">.</span><span class="n">get</span><span class="p">());</span> <span class="c1">// create a new dbh if there is none
</span> <span class="p">}</span>
<span class="o">++</span><span class="n">deferred_queue_size</span><span class="p">;</span>
<span class="n">txc</span><span class="o">-></span><span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span><span class="o">-></span><span class="n">txcs</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="o">*</span><span class="n">txc</span><span class="p">);</span> <span class="c1">// append the txc to the end
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_deferred_submit_unlock</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
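<span class="c1">// Swap the pointers so that the next submission only happens after the current one has completed
</span> <span class="c1">// all submit-related functions check whether deferred_running is null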
</span> <span class="n">osr</span><span class="o">-></span><span class="n">deferred_running</span> <span class="o">=</span> <span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span><span class="p">;</span>
<span class="n">osr</span><span class="o">-></span><span class="n">deferred_pending</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="p">......</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">bdev</span><span class="o">-></span><span class="n">aio_write</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="o">&</span><span class="n">b</span><span class="o">-></span><span class="n">ioc</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span> <span class="c1">// prepare the write buffers for all txcs
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="n">deferred_lock</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">bdev</span><span class="o">-></span><span class="n">aio_submit</span><span class="p">(</span><span class="o">&</span><span class="n">b</span><span class="o">-></span><span class="n">ioc</span><span class="p">);</span> <span class="c1">// submit all txcs in one go
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_kv_finalize_thread</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">b</span> <span class="o">:</span> <span class="n">deferred_stable</span><span class="p">)</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">p</span> <span class="o">=</span> <span class="n">b</span><span class="o">-></span><span class="n">txcs</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">!=</span> <span class="n">b</span><span class="o">-></span><span class="n">txcs</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span> <span class="o">=</span> <span class="o">&*</span><span class="n">p</span><span class="p">;</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">b</span><span class="o">-></span><span class="n">txcs</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">_txc_state_proc</span><span class="p">(</span><span class="n">txc</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">delete</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// free the dbh
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
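<p>The pending/running pointer swap in the excerpt above can be sketched in isolation. This is a minimal model, not BlueStore code: <code>Batch</code>, <code>Sequencer</code>, and the integer txc ids are hypothetical stand-ins, and locking is left out. It only shows why a second batch cannot be submitted while one is still in flight.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a DeferredBatch: just a list of txc ids.
struct Batch {
  std::vector<int> txcs;
};

// Hypothetical stand-in for the per-sequencer deferred state (no locking).
struct Sequencer {
  std::unique_ptr<Batch> pending;  // batch still accepting txcs
  std::unique_ptr<Batch> running;  // batch whose IO is in flight

  // Queue a txc; the pending batch is created lazily if it does not exist.
  void queue(int txc) {
    if (!pending) pending = std::make_unique<Batch>();
    pending->txcs.push_back(txc);
  }

  // Move pending to running. Refuses if a batch is already in flight
  // or if there is nothing to submit.
  bool submit() {
    if (running || !pending) return false;
    running = std::move(pending);  // pending is null afterwards
    return true;
  }

  // Called when the in-flight IO completes; returns the batch size.
  std::size_t finish() {
    std::size_t n = running ? running->txcs.size() : 0;
    running.reset();
    return n;
  }
};
```

<p>Txcs queued while a batch is running simply accumulate in a fresh pending batch, which becomes submittable once <code>finish()</code> has cleared the running one.</p>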
<h1 id="summary">Summary</h1>
<ul>
<li>
<p>A single write at the OSD layer may map to several cases at the BlueStore layer: simple, deferred, or simple plus deferred.</p>
</li>
<li>
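<p>This analysis skips the most complex part of the flow, adding the transaction (_txc_add_transaction), which classifies the write operations. During classification it may call BlockDevice's aio_write, but that function only prepares the in-memory buffer and writes no data; the write IO is actually submitted later, when aio_submit is called. The flow chart in the reference article mentioned earlier gets this wrong.</p>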
</li>
<li>
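<p>A deferred write can only happen on an overwrite, so the performance gain is significant: the old cost of writing the journal, then writing the data disk, plus the overhead of the file system itself, is gone.</p>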
</li>
<li>
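<p>For object storage every write is a simple write, so the gain should be very noticeable. The community documentation claims a threefold improvement; I have yet to measure it myself.</p>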
</li>
<li>
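<p>It avoids the file system slowdown and IO jitter that appear with very large numbers of objects, which makes storing massive numbers of small files feasible; all it needs is a large number of k/v entries to hold the object metadata.</p>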
</li>
</ul>
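<p>The split between aio_write (stage a buffer) and aio_submit (issue the IO) noted in the summary can be illustrated with a toy model. Everything here is hypothetical: <code>FakeBlockDevice</code> and <code>IoContextSketch</code> are in-memory stand-ins whose signatures do not match Ceph's BlockDevice, and a <code>std::map</code> plays the role of the disk.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an IO context: writes staged but not yet issued.
struct IoContextSketch {
  std::vector<std::pair<std::uint64_t, std::string>> staged;
};

// Hypothetical in-memory "device"; the offset -> data map stands in for the disk.
struct FakeBlockDevice {
  std::map<std::uint64_t, std::string> disk;

  // Stage a write: only records the buffer, the "disk" is not touched.
  void aio_write(std::uint64_t off, std::string data, IoContextSketch* ioc) {
    ioc->staged.emplace_back(off, std::move(data));
  }

  // Issue everything staged so far; returns the number of writes issued.
  std::size_t aio_submit(IoContextSketch* ioc) {
    std::size_t n = ioc->staged.size();
    for (auto& w : ioc->staged) disk[w.first] = std::move(w.second);
    ioc->staged.clear();
    return n;
  }
};
```

<p>Staging several writes and then submitting once is what lets a batch of txcs go to the device in a single submission.</p>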
Ceph BlueStore Cache
2018-02-27T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore-cache
<h1 id="introduction">Introduction</h1>
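<p>BlueStore manages the raw device itself. With no file system underneath, it cannot use the operating system's page cache and has to manage its own caches, for both metadata and object data. The performance of this cache layer directly affects the performance of BlueStore as a whole.</p>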
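<p>The main cached objects are the Collection of each pg (the Cnode), the object metadata (Onode), and the Buffers that hold object data. An osd is responsible for relatively few pgs, so all of their Collections are loaded into memory when the osd starts. Onodes are different: a single osd holds thousands upon thousands of objects, so as much object metadata as memory allows is kept in the cache, which uses an LRU policy. Object data Buffers obviously cannot all be cached; the aim is to keep hot objects in memory, so this cache uses the 2Q algorithm rather than LRU. The main difference between the two is that LRU uses a single list while 2Q uses several. In the implementation, an abstract Cache class has two implementations, LRUCache and TwoQCache, and both cache Onodes as well as Buffers. LRUCache uses one list for Onodes and one for Buffers, both evicted with LRU. TwoQCache still uses a single LRU list for Onodes, but three lists with the 2Q policy for Buffers.</p>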
<p><img src="/assets/img/post/ceph_bluestore_cache.png" alt="img" /></p>
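<p>The Cache implementation itself is fairly simple: in essence it puts data into a list, or into one of several lists according to some policy. The points that deserve attention are how the cache is used and how to tune the cache-size parameters.</p>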
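<p>To make the "several lists" idea concrete, here is a textbook-style 2Q sketch over string keys. It is a simplification written for illustration and is not BlueStore's TwoQCache: the class name, the three capacities, and the decision to track keys only (no data, no byte accounting) are all assumptions. New keys enter an "in" list, keys pushed out of "in" are remembered in an "out" list, and a key seen again while in "out" is promoted to the "hot" list.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Textbook-style 2Q over keys only; a simplification, not BlueStore's TwoQCache.
class TwoQSketch {
 public:
  enum class Where { None, In, Out, Hot };

  TwoQSketch(std::size_t in_cap, std::size_t out_cap, std::size_t hot_cap)
      : in_cap_(in_cap), out_cap_(out_cap), hot_cap_(hot_cap) {}

  // Record an access to key and update the three lists.
  void touch(const std::string& key) {
    const Where w = where(key);
    if (w == Where::None) {
      // First sighting: enter "in"; overflow demotes its tail to "out".
      insert(Where::In, key);
      if (in_.size() > in_cap_) {
        const std::string victim = in_.back();
        remove(victim);
        insert(Where::Out, victim);
        if (out_.size() > out_cap_) remove(out_.back());
      }
    } else if (w != Where::In) {
      // Seen again from "out", or already hot: move to the front of "hot".
      remove(key);
      insert(Where::Hot, key);
      if (hot_.size() > hot_cap_) remove(hot_.back());
    }
    // A hit while the key is still in "in" changes nothing.
  }

  Where where(const std::string& key) const {
    auto it = index_.find(key);
    return it == index_.end() ? Where::None : it->second.where;
  }

 private:
  struct Entry {
    Where where;
    std::list<std::string>::iterator pos;
  };

  std::list<std::string>& list_of(Where w) {
    return w == Where::In ? in_ : (w == Where::Out ? out_ : hot_);
  }

  void insert(Where w, const std::string& key) {
    auto& l = list_of(w);
    l.push_front(key);
    index_[key] = Entry{w, l.begin()};
  }

  // Takes the key by value so callers may pass a reference into a list.
  void remove(std::string key) {
    auto it = index_.find(key);
    list_of(it->second.where).erase(it->second.pos);
    index_.erase(it);
  }

  std::size_t in_cap_, out_cap_, hot_cap_;
  std::list<std::string> in_, out_, hot_;
  std::unordered_map<std::string, Entry> index_;
};
```

<p>The point of the extra lists is that a key touched only once never reaches "hot", so a one-off scan cannot evict data that is genuinely accessed repeatedly.</p>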
<h1 id="cache-init">Cache Init</h1>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BlueStore</span><span class="o">::</span><span class="n">BlueStore</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">path</span><span class="p">)</span>
<span class="o">:</span> <span class="n">ObjectStore</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">path</span><span class="p">),</span>
<span class="p">......</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">set_cache_shards</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// initialize the cache
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">set_cache_shards</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">size_t</span> <span class="n">old</span> <span class="o">=</span> <span class="n">cache_shards</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">num</span> <span class="o">>=</span> <span class="n">old</span><span class="p">);</span>
<span class="n">cache_shards</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">num</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">old</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">num</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cache_shards</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">Cache</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bluestore_cache_type</span><span class="p">,</span> <span class="n">logger</span><span class="p">);</span> <span class="c1">// 2q by default
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="n">BlueStore</span><span class="o">::</span><span class="n">Cache</span> <span class="o">*</span><span class="n">BlueStore</span><span class="o">::</span><span class="n">Cache</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">CephContext</span><span class="o">*</span> <span class="n">cct</span><span class="p">,</span> <span class="n">string</span> <span class="n">type</span><span class="p">,</span> <span class="n">PerfCounters</span> <span class="o">*</span><span class="n">logger</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Cache</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="s">"lru"</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LRUCache</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span> <span class="c1">// LRU
</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="s">"2q"</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="k">new</span> <span class="n">TwoQCache</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span> <span class="c1">// 2q
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
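<p>The initialization looks simple, and the shard count even appears to be hard-coded to 1. In fact, the osd adjusts the number of shards from its configuration during its own initialization; this is buried fairly deep and easy to miss:</p>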
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">store</span><span class="o">-></span><span class="n">set_cache_shards</span><span class="p">(</span><span class="n">get_num_op_shards</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">get_num_op_shards</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_num_shards</span><span class="p">)</span>
<span class="k">return</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_num_shards</span><span class="p">;</span> <span class="c1">// default 0
</span> <span class="k">if</span> <span class="p">(</span><span class="n">store_is_rotational</span><span class="p">)</span>
<span class="k">return</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_num_shards_hdd</span><span class="p">;</span> <span class="c1">// default 5
</span> <span class="k">else</span>
<span class="k">return</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_num_shards_ssd</span><span class="p">;</span> <span class="c1">// default 8
</span><span class="p">}</span>
</code></pre></div></div>
<h1 id="cnode">Cnode</h1>
<p>cnode is the on-disk structure of the collection that corresponds to a pg:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">bluestore_cnode_t</span> <span class="p">{</span>
<span class="kt">uint32_t</span> <span class="n">bits</span><span class="p">;</span> <span class="c1">// the only field: the pg's significant bits, used by stable_mod
</span><span class="p">};</span>
</code></pre></div></div>
<p>When a pg is created, its metadata is stored in the kv store:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_create_collection</span><span class="p">(</span><span class="n">TransContext</span> <span class="o">*</span><span class="n">txc</span><span class="p">,</span> <span class="k">const</span> <span class="n">coll_t</span> <span class="o">&</span><span class="n">cid</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">bits</span><span class="p">,</span> <span class="n">CollectionRef</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">c</span><span class="o">-></span><span class="n">reset</span><span class="p">(</span><span class="k">new</span> <span class="n">Collection</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">cache_shards</span><span class="p">[</span><span class="n">cid</span><span class="p">.</span><span class="n">hash_to_shard</span><span class="p">(</span><span class="n">cache_shards</span><span class="p">.</span><span class="n">size</span><span class="p">())],</span> <span class="n">cid</span><span class="p">));</span> <span class="c1">// assign a cache shard to the pg
</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span><span class="p">)</span><span class="o">-></span><span class="n">cnode</span><span class="p">.</span><span class="n">bits</span> <span class="o">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="n">coll_map</span><span class="p">[</span><span class="n">cid</span><span class="p">]</span> <span class="o">=</span> <span class="o">*</span><span class="n">c</span><span class="p">;</span> <span class="c1">// the collection map
</span>
<span class="o">::</span><span class="n">encode</span><span class="p">((</span><span class="o">*</span><span class="n">c</span><span class="p">)</span><span class="o">-></span><span class="n">cnode</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="n">txc</span><span class="o">-></span><span class="n">t</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">PREFIX_COLL</span><span class="p">,</span> <span class="n">stringify</span><span class="p">(</span><span class="n">cid</span><span class="p">),</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// persist the pg info
</span><span class="p">}</span>
</code></pre></div></div>
<p>When the osd starts up, it creates the collections, loads all collection info from the k/v store, and assigns a cache to each pg; see _open_collections:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_open_collections</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">errors</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Iterator</span> <span class="n">it</span> <span class="o">=</span> <span class="n">db</span><span class="o">-></span><span class="n">get_iterator</span><span class="p">(</span><span class="n">PREFIX_COLL</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">it</span><span class="o">-></span><span class="n">upper_bound</span><span class="p">(</span><span class="n">string</span><span class="p">());</span> <span class="n">it</span><span class="o">-></span><span class="n">valid</span><span class="p">();</span> <span class="n">it</span><span class="o">-></span><span class="n">next</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// iterate over the k/v metadata of each pg's collection
</span> <span class="n">coll_t</span> <span class="n">cid</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cid</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="n">it</span><span class="o">-></span><span class="n">key</span><span class="p">()))</span> <span class="p">{</span>
<span class="n">CollectionRef</span> <span class="n">c</span><span class="p">(</span><span class="k">new</span> <span class="n">Collection</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">cache_shards</span><span class="p">[</span><span class="n">cid</span><span class="p">.</span><span class="n">hash_to_shard</span><span class="p">(</span><span class="n">cache_shards</span><span class="p">.</span><span class="n">size</span><span class="p">())],</span> <span class="n">cid</span><span class="p">));</span> <span class="c1">// create the collection
</span> <span class="n">bufferlist</span> <span class="n">bl</span> <span class="o">=</span> <span class="n">it</span><span class="o">-></span><span class="n">value</span><span class="p">();</span> <span class="c1">// fetch the corresponding value
</span> <span class="n">bufferlist</span><span class="o">::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">bl</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="k">try</span> <span class="p">{</span>
<span class="o">::</span><span class="n">decode</span><span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">cnode</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span> <span class="c1">// decode
</span> <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="n">buffer</span><span class="o">::</span><span class="n">error</span><span class="o">&</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EIO</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">coll_map</span><span class="p">[</span><span class="n">cid</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span> <span class="c1">// update the collection map
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
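<p>As shown above, the number of cache shards is fixed and usually small, so several pgs may share the same cache. If memory is plentiful and a lot of data is cached, the shard count can be raised moderately to keep the lists inside each cache from growing too long.</p>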
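<p>A small sketch of why pgs end up sharing a cache. The function <code>shard_for</code> is a hypothetical stand-in for <code>cid.hash_to_shard(n)</code>; it uses <code>std::hash</code> on a pg name, which is an assumption and not the hash Ceph actually uses. The pg names are made up.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for coll_t::hash_to_shard: hash the name, take it modulo n.
inline std::size_t shard_for(const std::string& pg_name, std::size_t num_shards) {
  return std::hash<std::string>{}(pg_name) % num_shards;
}

// Count how many pgs land on each shard.
inline std::vector<std::size_t> shard_load(const std::vector<std::string>& pgs,
                                           std::size_t num_shards) {
  std::vector<std::size_t> load(num_shards, 0);
  for (const auto& pg : pgs) ++load[shard_for(pg, num_shards)];
  return load;
}
```

<p>With more pgs than shards, at least one shard necessarily serves two or more pgs, whatever hash is used; a larger shard count spreads the same pgs over more, shorter lists.</p>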
<h1 id="onode">Onode</h1>
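<p>An onode is the metadata of an object and is needed for reads, writes, and every other operation. Each object belongs to a pg (collection), which keeps a map of the objects currently cached:</p>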
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Collection</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CollectionImpl</span> <span class="p">{</span>
<span class="n">OnodeSpace</span> <span class="n">onode_map</span><span class="p">;</span> <span class="c1">// object name -> onode
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">OnodeSpace</span> <span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">mempool</span><span class="o">::</span><span class="n">bluestore_cache_other</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span><span class="n">OnodeRef</span><span class="o">></span> <span class="n">onode_map</span><span class="p">;</span> <span class="c1">// objects within the pg
</span><span class="p">};</span>
</code></pre></div></div>
<p>Almost every entry point for an object operation starts by fetching the onode:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BlueStore</span><span class="o">::</span><span class="n">OnodeRef</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">Collection</span><span class="o">::</span><span class="n">get_onode</span><span class="p">(</span><span class="k">const</span> <span class="n">ghobject_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">create</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">OnodeRef</span> <span class="n">o</span> <span class="o">=</span> <span class="n">onode_map</span><span class="p">.</span><span class="n">lookup</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span> <span class="c1">// look up the map
</span> <span class="k">if</span> <span class="p">(</span><span class="n">o</span><span class="p">)</span>
<span class="k">return</span> <span class="n">o</span><span class="p">;</span>
<span class="p">......</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">store</span><span class="o">-></span><span class="n">db</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">PREFIX_OBJ</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">key</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="o">&</span><span class="n">v</span><span class="p">);</span> <span class="c1">// load the metadata from the k/v store
</span>
<span class="p">......</span>
<span class="k">return</span> <span class="n">onode_map</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="n">o</span><span class="p">);</span> <span class="c1">// add to the cache
</span><span class="p">}</span>
<span class="n">BlueStore</span><span class="o">::</span><span class="n">OnodeRef</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">OnodeSpace</span><span class="o">::</span><span class="n">add</span><span class="p">(</span><span class="k">const</span> <span class="n">ghobject_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="n">OnodeRef</span> <span class="n">o</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">onode_map</span><span class="p">[</span><span class="n">oid</span><span class="p">]</span> <span class="o">=</span> <span class="n">o</span><span class="p">;</span> <span class="c1">// insert into the map
</span> <span class="n">cache</span><span class="o">-></span><span class="n">_add_onode</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// add to the cache system
</span><span class="p">}</span>
</code></pre></div></div>
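<p>The lookup-then-load pattern of get_onode can be sketched as a small read-through cache. This is an illustrative model under stated assumptions: <code>OnodeCacheSketch</code>, the string object ids, and the <code>Loader</code> callback (which stands in for the k/v lookup and only reports whether the object exists) are all hypothetical, and eviction is omitted.</p>

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, heavily reduced onode: an id plus whether it exists in the k/v store.
struct Onode {
  std::string oid;
  bool exists;
};
using OnodeRef = std::shared_ptr<Onode>;

class OnodeCacheSketch {
 public:
  // Stand-in for the k/v lookup: returns true if the object has stored metadata.
  using Loader = std::function<bool(const std::string&)>;

  explicit OnodeCacheSketch(Loader loader) : loader_(std::move(loader)) {}

  // Return the cached onode, or consult the loader and cache the result.
  // A missing object yields nullptr unless create is true.
  OnodeRef get(const std::string& oid, bool create) {
    auto it = map_.find(oid);
    if (it != map_.end()) return it->second;  // cache hit

    ++loads_;
    const bool found = loader_(oid);
    if (!found && !create) return nullptr;

    auto o = std::make_shared<Onode>(Onode{oid, found});
    map_[oid] = o;  // add to the cache
    return o;
  }

  int loads() const { return loads_; }

 private:
  Loader loader_;
  std::unordered_map<std::string, OnodeRef> map_;
  int loads_ = 0;
};
```

<p>Only the first access to an object pays for the k/v lookup; later accesses are served from the map until the entry is trimmed.</p>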
<h1 id="buffer">Buffer</h1>
<p>An object's data may be spread over many buffers, all managed through a BufferSpace:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">BufferSpace</span> <span class="p">{</span>
<span class="n">mempool</span><span class="o">::</span><span class="n">bluestore_cache_other</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="kt">uint32_t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">Buffer</span><span class="o">>></span> <span class="n">buffer_map</span><span class="p">;</span> <span class="c1">// all buffers
</span> <span class="n">state_list_t</span> <span class="n">writing</span><span class="p">;</span> <span class="c1">// buffers currently being written
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>When a write completes, a flag decides whether the buffer is added to the cache:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">BufferSpace</span><span class="o">::</span><span class="n">finish_write</span><span class="p">(</span><span class="n">Cache</span><span class="o">*</span> <span class="n">cache</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">recursive_mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">cache</span><span class="o">-></span><span class="n">lock</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">i</span> <span class="o">=</span> <span class="n">writing</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">writing</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-></span><span class="n">flags</span> <span class="o">&</span> <span class="n">Buffer</span><span class="o">::</span><span class="n">FLAG_NOCACHE</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// drop it right away
</span> <span class="n">writing</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">i</span><span class="o">++</span><span class="p">);</span>
<span class="n">buffer_map</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">b</span><span class="o">-></span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">writing</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">i</span><span class="o">++</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">cache</span><span class="o">-></span><span class="n">_add_buffer</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">nullptr</span><span class="p">);</span> <span class="c1">// add to the cache
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Likewise, data may be added to the cache after a read completes:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_do_read</span><span class="p">(</span>
<span class="n">Collection</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span>
<span class="n">OnodeRef</span> <span class="n">o</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">length</span><span class="p">,</span>
<span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">,</span>
<span class="kt">uint32_t</span> <span class="n">op_flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// decide whether to cache the data
</span> <span class="kt">bool</span> <span class="n">buffered</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">op_flags</span> <span class="o">&</span> <span class="n">CEPH_OSD_OP_FLAG_FADVISE_WILLNEED</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buffered</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bluestore_default_buffered_read</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">op_flags</span> <span class="o">&</span> <span class="p">(</span><span class="n">CEPH_OSD_OP_FLAG_FADVISE_DONTNEED</span> <span class="o">|</span> <span class="n">CEPH_OSD_OP_FLAG_FADVISE_NOCACHE</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buffered</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">buffered</span><span class="p">)</span> <span class="p">{</span>
<span class="n">bptr</span><span class="o">-></span><span class="n">shared_blob</span><span class="o">-></span><span class="n">bc</span><span class="p">.</span><span class="n">did_read</span><span class="p">(</span><span class="n">bptr</span><span class="o">-></span><span class="n">shared_blob</span><span class="o">-></span><span class="n">get_cache</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">raw_bl</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">did_read</span><span class="p">(</span><span class="n">Cache</span><span class="o">*</span> <span class="n">cache</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">offset</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">recursive_mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">cache</span><span class="o">-></span><span class="n">lock</span><span class="p">);</span>
<span class="n">Buffer</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Buffer</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">Buffer</span><span class="o">::</span><span class="n">STATE_CLEAN</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="n">b</span><span class="o">-></span><span class="n">cache_private</span> <span class="o">=</span> <span class="n">_discard</span><span class="p">(</span><span class="n">cache</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">bl</span><span class="p">.</span><span class="n">length</span><span class="p">());</span>
<span class="n">_add_buffer</span><span class="p">(</span><span class="n">cache</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">nullptr</span><span class="p">);</span> <span class="c1">// add to the cache
</span><span class="p">}</span>
</code></pre></div></div>
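<p>The buffered-read decision in _do_read reduces to a small predicate, restated here as a standalone function so each case can be checked. The flag constants below are illustrative bit values chosen for this sketch; the real CEPH_OSD_OP_FLAG_FADVISE_* values are defined in Ceph's headers and may differ. The function name is hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Illustrative bit values only; the real CEPH_OSD_OP_FLAG_FADVISE_* constants
// are defined in Ceph's headers and may differ.
constexpr std::uint32_t kFadviseWillNeed = 1u << 0;
constexpr std::uint32_t kFadviseDontNeed = 1u << 1;
constexpr std::uint32_t kFadviseNoCache = 1u << 2;

// Mirrors the branch structure of the excerpt: WILLNEED always caches;
// otherwise cache only if the default says so and the caller did not opt out.
inline bool should_buffer_read(std::uint32_t op_flags, bool default_buffered_read) {
  if (op_flags & kFadviseWillNeed) return true;
  return default_buffered_read &&
         (op_flags & (kFadviseDontNeed | kFadviseNoCache)) == 0;
}
```

<p>Note the precedence: a WILLNEED hint wins even when NOCACHE is also set or when bluestore_default_buffered_read is off.</p>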
<h1 id="trim">Trim</h1>
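<p>Many of BlueStore's metadata objects are managed through memory pools, Onode and Buffer included. A background thread monitors memory usage and trims the cache once it exceeds the configured limit:</p>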
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">BlueStore</span><span class="o">::</span><span class="n">MempoolThread</span><span class="o">::</span><span class="n">entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// compute the trim target from the configuration
</span> <span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">i</span> <span class="o">:</span> <span class="n">store</span><span class="o">-></span><span class="n">cache_shards</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="o">-></span><span class="n">trim</span><span class="p">(</span><span class="n">shard_target</span><span class="p">,</span> <span class="n">store</span><span class="o">-></span><span class="n">cache_meta_ratio</span><span class="p">,</span> <span class="n">store</span><span class="o">-></span><span class="n">cache_data_ratio</span><span class="p">,</span> <span class="n">bytes_per_onode</span><span class="p">);</span> <span class="c1">// run the trim
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="n">wait</span> <span class="o">+=</span> <span class="n">store</span><span class="o">-></span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bluestore_cache_trim_interval</span><span class="p">;</span>
<span class="n">cond</span><span class="p">.</span><span class="n">WaitInterval</span><span class="p">(</span><span class="n">lock</span><span class="p">,</span> <span class="n">wait</span><span class="p">);</span> <span class="c1">// wake up periodically to run the trim
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="config">Config</h1>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Parameters that determine the number of cache_shards in BlueStore
</span><span class="n">osd_op_num_shards</span> <span class="c1">// default: 0 (fall back to the _hdd/_ssd value)
</span><span class="n">osd_op_num_shards_hdd</span> <span class="c1">// default: 5
</span><span class="n">osd_op_num_shards_ssd</span> <span class="c1">// default: 8
</span>
<span class="n">bluestore_cache_trim_interval</span> <span class="c1">// interval between cache trims, default 0.2 seconds; can be raised moderately
</span><span class="n">bluestore_cache_trim_max_skip_pinned</span> <span class="c1">// while trimming, each pinned item encountered increments a counter; once the counter exceeds this value, trimming stops. default: 64
</span><span class="n">bluestore_cache_type</span> <span class="c1">// default: 2q
</span><span class="n">bluestore_2q_cache_kin_ratio</span> <span class="c1">// share of the "in" list, default 0.5
</span><span class="n">bluestore_2q_cache_kout_ratio</span> <span class="c1">// share of the "out" list, default 0.5
</span>
<span class="c1">// Cache size; pick a value that fits the physical memory and the number of OSDs
</span><span class="n">bluestore_cache_size</span> <span class="c1">// default: 0 (fall back to the _hdd/_ssd value)
</span><span class="n">bluestore_cache_size_hdd</span> <span class="c1">// default: 1GB
</span><span class="n">bluestore_cache_size_ssd</span> <span class="c1">// default: 3GB
</span>
<span class="n">bluestore_cache_meta_ratio</span> <span class="c1">// fraction of the cache used for metadata, default 0.01
</span><span class="n">bluestore_cache_kv_ratio</span> <span class="c1">// fraction of the cache used for the rocksdb database cache, default 0.99
</span><span class="n">bluestore_cache_kv_max</span> <span class="c1">// upper bound on the cache used by the rocksdb database, default 512MB
</span></code></pre></div></div>
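<p>The effect of bluestore_cache_trim_max_skip_pinned can be illustrated with a small sketch. This is not BlueStore's code: the Item type, the trim_lru function and the use of std::list as the LRU list are assumptions made for illustration. Only the rule itself (count pinned items, stop once the count exceeds the limit) is taken from the parameter description above.</p>

```cpp
#include <cstddef>
#include <list>

// Hypothetical cache item: `pinned` stands for "still referenced elsewhere".
struct Item {
  int id;
  bool pinned;
};

// Evict from the cold end (back) of `lru` until at most `target` items remain.
// Pinned items are skipped; once more than `max_skip_pinned` of them have been
// seen, trimming gives up for this round. Returns the number of evicted items.
std::size_t trim_lru(std::list<Item>& lru, std::size_t target,
                     unsigned max_skip_pinned) {
  std::size_t evicted = 0;
  unsigned skipped = 0;
  auto it = lru.end();
  while (lru.size() > target && it != lru.begin()) {
    --it;                 // step towards the hot end
    if (it->pinned) {
      if (++skipped > max_skip_pinned)
        break;            // too many pinned items: stop trimming
      continue;           // leave the pinned item in place
    }
    it = lru.erase(it);   // erase returns the element after the erased one
    ++evicted;
  }
  return evicted;
}
```

<p>A small limit keeps a trim round short when the cold end of the list is full of pinned items, at the cost of possibly leaving the cache above its target until the next round.</p>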
<h1 id="summary">Summary</h1>
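<ul>
<li>
<p>When the OSD starts, it supplies the parameters that set the number of BlueStore cache shards, which are later used by the collections that correspond to PGs.</p>
</li>
<li>
<p>The OSD reads the collection information from disk, loads every PG's collection into memory, and assigns each collection a cache shard that handles its caching.</p>
</li>
<li>
<p>When an object operation runs, the Onode metadata is read first and placed under cache management.</p>
</li>
<li>
<p>When an object is written, the Buffer holding the object data is added to the cache depending on the flags.</p>
</li>
<li>
<p>Onode, Buffer and similar objects are all allocated from memory pools; a background thread periodically checks memory usage and trims whatever exceeds the limit.</p>
</li>
</ul>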
Ceph BlueStore Allocator
2018-02-06T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore-allocator
<h1 id="introduction">Introduction</h1>
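<p>The Allocator hands out blocks from free space. The free space of a BlockDevice is managed by the FreelistManager, which already provides allocate/release interfaces for allocating and reclaiming disk blocks. So why is there an Allocator at all? Why not simply call the FreelistManager interfaces?</p>
<p>First, the question itself is not quite accurate. BlueStore may contain several BlockDevices. The BlockDevices backing wal/db are used only by BlueFS, and BlueFS does not use the FreelistManager to track block device space usage; it persists that information in the file system's journal file. (It could not do otherwise: the FreelistManager relies on RocksDB to store the bitmap of each segment, and RocksDB relies on BlueFS, so making BlueFS depend on the FreelistManager would create a circular dependency.)</p>
<p>Second, if there is no separate wal/db device, BlueFS has to borrow space from BlueStore's slow device. That space also needs special handling and cannot be managed by the FreelistManager, for the same reason as above.</p>
<p>Finally, consider BlueStore's own slow space (where data is stored). When an object is written, disk space is first allocated through the Allocator, which only marks the space as allocated in memory, and the allocation is packed into the write transaction. Only when the write completes is the FreelistManager's free space updated and the object's disk space information recorded in the object's metadata. Different write cases (new/cow/overwrite) produce different transactions, all with the goal of keeping data consistent. Disk space management is therefore tightly coupled to the write path, and simply calling the FreelistManager interfaces is not enough.</p>
<p>In short, the Allocator is only responsible for marking free space as allocated in memory. Persisting the resulting disk space usage is the job of the Allocator's user: BlueFS records it in the file system journal, while BlueStore stores it in k/v through the FreelistManager and records each object's disk space information in the object metadata.</p>
<p>There are currently two implementations, StupidAllocator (extent based) and BitMapAllocator. Stupid was used first and was later replaced by BitMap, but recently the default was switched back to Stupid because of performance problems :(</p>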
<p><img src="/assets/img/post/ceph_bluestore_allocator.png" alt="img" /></p>
<h1 id="data-structure">Data Structure</h1>
<p>The Allocator implementation relies mainly on interval trees, which manage (offset, length) pairs efficiently:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">StupidAllocator</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Allocator</span> <span class="p">{</span>
<span class="kt">int64_t</span> <span class="n">num_free</span><span class="p">;</span> <span class="c1">// total free space
</span> <span class="kt">int64_t</span> <span class="n">num_reserved</span><span class="p">;</span> <span class="c1">// reserved space
</span>
<span class="c1">// at initialization the free vector has length 10, i.e. there are ten interval trees
</span> <span class="c1">// each interval is inserted into one of the trees according to its length
</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">btree_interval_set</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="n">allocator</span><span class="o">>></span> <span class="n">free</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">last_alloc</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
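<p>To make the role of the interval set concrete, here is a minimal sketch of inserting a free extent and merging it with its neighbours. It uses std::map rather than Ceph's btree_interval_set, and the IntervalSet name and its insert method are assumptions for illustration only.</p>

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical stand-in for an interval set: maps offset -> length and keeps
// the stored intervals disjoint and non-adjacent.
struct IntervalSet {
  std::map<uint64_t, uint64_t> m;

  // Insert [off, off+len), which must not overlap any stored interval.
  // Adjacent neighbours are merged; the resulting interval is reported
  // through out_off/out_len.
  void insert(uint64_t off, uint64_t len, uint64_t* out_off, uint64_t* out_len) {
    auto next = m.lower_bound(off);
    // merge with the predecessor if it ends exactly at off
    if (next != m.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == off) {
        off = prev->first;
        len += prev->second;
        m.erase(prev);
      }
    }
    // merge with the successor if it starts exactly at off+len
    if (next != m.end() && next->first == off + len) {
      len += next->second;
      m.erase(next);
    }
    m[off] = len;
    *out_off = off;
    *out_len = len;
  }
};
```

<p>Reporting the merged interval back to the caller matters for an allocator that bins intervals by length: after a merge the interval may have grown enough to belong to a different bin.</p>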
<h1 id="init">Init</h1>
<p>After creating an Allocator, the caller immediately adds free space to it or removes free space from it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// add free space
</span><span class="kt">void</span> <span class="n">StupidAllocator</span><span class="o">::</span><span class="n">init_add_free</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">_insert_free</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span> <span class="c1">// insert the interval into free
</span> <span class="n">num_free</span> <span class="o">+=</span> <span class="n">length</span><span class="p">;</span> <span class="c1">// update the available space
</span><span class="p">}</span>
<span class="c1">// remove free space
</span><span class="kt">void</span> <span class="n">StupidAllocator</span><span class="o">::</span><span class="n">init_rm_free</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">btree_interval_set</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="n">allocator</span><span class="o">></span> <span class="n">rm</span><span class="p">;</span>
<span class="n">rm</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">free</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">&&</span> <span class="o">!</span><span class="n">rm</span><span class="p">.</span><span class="n">empty</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">btree_interval_set</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="n">allocator</span><span class="o">></span> <span class="n">overlap</span><span class="p">;</span>
<span class="n">overlap</span><span class="p">.</span><span class="n">intersection_of</span><span class="p">(</span><span class="n">rm</span><span class="p">,</span> <span class="n">free</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span> <span class="c1">// compute the intersection
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">overlap</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// remove the overlapping part
</span> <span class="n">free</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">subtract</span><span class="p">(</span><span class="n">overlap</span><span class="p">);</span>
<span class="n">rm</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">overlap</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">num_free</span> <span class="o">-=</span> <span class="n">length</span><span class="p">;</span> <span class="c1">// update the available space
</span><span class="p">}</span>
</code></pre></div></div>
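<p>The core of the implementation is the function that inserts an interval into the interval trees in free:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// choose the interval tree (bin) by interval length: the longer the interval, the larger the bin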
</span><span class="kt">unsigned</span> <span class="n">StupidAllocator</span><span class="o">::</span><span class="n">_choose_bin</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">orig_len</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">orig_len</span> <span class="o">/</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bdev_block_size</span><span class="p">;</span>
<span class="c1">// cbits = (sizeof(v) * 8) - __builtin_clzll(v);
</span> <span class="c1">// the result is the 1-based position of the highest set bit; the larger len is, the larger the value
</span> <span class="kt">int</span> <span class="n">bin</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">cbits</span><span class="p">(</span><span class="n">len</span><span class="p">),</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">free</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">return</span> <span class="n">bin</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">StupidAllocator</span><span class="o">::</span><span class="n">_insert_free</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">off</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">bin</span> <span class="o">=</span> <span class="n">_choose_bin</span><span class="p">(</span><span class="n">len</span><span class="p">);</span> <span class="c1">// compute the id of the interval tree
</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="n">free</span><span class="p">[</span><span class="n">bin</span><span class="p">].</span><span class="n">insert</span><span class="p">(</span><span class="n">off</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="o">&</span><span class="n">off</span><span class="p">,</span> <span class="o">&</span><span class="n">len</span><span class="p">);</span>
<span class="kt">unsigned</span> <span class="n">newbin</span> <span class="o">=</span> <span class="n">_choose_bin</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">newbin</span> <span class="o">==</span> <span class="n">bin</span><span class="p">)</span>
<span class="k">break</span><span class="p">;</span>
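<span class="c1">// the insert may merge with neighbouring intervals, so the interval grows and may belong to a different bin; in that case erase it from the old bin and insert it into the new one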
</span> <span class="n">free</span><span class="p">[</span><span class="n">bin</span><span class="p">].</span><span class="n">erase</span><span class="p">(</span><span class="n">off</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="n">bin</span> <span class="o">=</span> <span class="n">newbin</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
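<p>A worked example of the bin selection, assuming a 4096-byte bdev_block_size and the 10 bins mentioned earlier. cbits is re-implemented here with a plain loop so the sketch does not depend on compiler builtins, and choose_bin is a free-function stand-in for _choose_bin, not the Ceph code itself.</p>

```cpp
#include <algorithm>
#include <cstdint>

// Number of bits needed to represent v (0 for v == 0); for non-zero v this
// equals 64 - clz(v).
unsigned cbits(uint64_t v) {
  unsigned bits = 0;
  while (v) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

// Assumed values: 4096-byte blocks and 10 bins (indices 0..9).
constexpr uint64_t kBlockSize = 4096;
constexpr int kNumBins = 10;

unsigned choose_bin(uint64_t orig_len) {
  uint64_t len = orig_len / kBlockSize;  // length in blocks
  int bin = std::min((int)cbits(len), kNumBins - 1);
  return bin;
}
```

<p>With these values a 4 KiB extent lands in bin 1, 8 KiB and 12 KiB extents in bin 2, and anything of 1 MiB (256 blocks) or more in the last bin, 9.</p>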
<p>Allocate & Release allocate and reclaim free space. Both ultimately call the interval tree's insert and erase APIs, so they are not covered in detail here.</p>
<h1 id="usage">Usage</h1>
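<p>The Allocator itself is fairly simple. What deserves more attention is how it is used, and in particular how the allocation information is eventually persisted.</p>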
<h3 id="bluefs">BlueFS</h3>
<p>The function BlueFS uses to allocate space:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueFS</span><span class="o">::</span><span class="n">_allocate</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="n">id</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">mempool</span><span class="o">::</span><span class="n">bluefs</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">bluefs_extent_t</span><span class="o">></span> <span class="o">*</span><span class="n">ev</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">alloc</span><span class="p">[</span><span class="n">id</span><span class="p">])</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">[</span><span class="n">id</span><span class="p">]</span><span class="o">-></span><span class="n">reserve</span><span class="p">(</span><span class="n">left</span><span class="p">);</span> <span class="c1">// reserve space
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o">!=</span> <span class="n">BDEV_SLOW</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// not enough space: recursively fall back to the next-level device
</span> <span class="k">return</span> <span class="n">_allocate</span><span class="p">(</span><span class="n">id</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">ev</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">AllocExtentVector</span> <span class="n">extents</span><span class="p">;</span>
<span class="n">extents</span><span class="p">.</span><span class="n">reserve</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span>
<span class="kt">int64_t</span> <span class="n">alloc_len</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">[</span><span class="n">id</span><span class="p">]</span><span class="o">-></span><span class="n">allocate</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">min_alloc_size</span><span class="p">,</span> <span class="n">hint</span><span class="p">,</span> <span class="o">&</span><span class="n">extents</span><span class="p">);</span> <span class="c1">// the actual allocation
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
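<p>The fallback behaviour can be sketched independently of BlueFS. Below, each device is reduced to a counter of free bytes, ordered from fastest to slowest (assumed here to be wal, db, slow). The Device struct and allocate_with_fallback are hypothetical names, and the real function allocates extents rather than decrementing a counter.</p>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model: one free-byte counter per device, ordered fast to slow.
struct Device {
  uint64_t free_bytes;
};

// Try device `id`; if it cannot satisfy `len`, fall back to id + 1.
// Returns the id of the device that served the request, or -1 if none could.
int allocate_with_fallback(std::vector<Device>& devs, unsigned id, uint64_t len) {
  if (id >= devs.size())
    return -1;  // ran out of devices
  if (devs[id].free_bytes < len)
    return allocate_with_fallback(devs, id + 1, len);  // degrade to the next device
  devs[id].free_bytes -= len;  // "allocate" on this device
  return static_cast<int>(id);
}
```

<p>The consequence worth remembering is that a file whose preferred device is full silently ends up on a slower device instead of failing.</p>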
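<p>This function is called in several situations, including _preallocate, which allocates space for regular files, and journal-file operations such as compact/flush. Taking a regular file as an example:</p>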
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueFS</span><span class="o">::</span><span class="n">_preallocate</span><span class="p">(</span><span class="n">FileRef</span> <span class="n">f</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">off</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">deleted</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">allocated</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">get_allocated</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">+</span> <span class="n">len</span> <span class="o">></span> <span class="n">allocated</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">want</span> <span class="o">=</span> <span class="n">off</span> <span class="o">+</span> <span class="n">len</span> <span class="o">-</span> <span class="n">allocated</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_allocate</span><span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">prefer_bdev</span><span class="p">,</span> <span class="n">want</span><span class="p">,</span> <span class="o">&</span><span class="n">f</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">extents</span><span class="p">);</span> <span class="c1">// allocate space
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="n">f</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">recalc_allocated</span><span class="p">();</span>
<span class="n">log_t</span><span class="p">.</span><span class="n">op_file_update</span><span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">fnode</span><span class="p">);</span> <span class="c1">// write a journal entry: the entry is the file's inode, which contains the extents, i.e. the physical disk space
</span> <span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
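<p>The BlueFS superblock records the inode of the journal file. After a crash, remounting the file system reads the superblock, locates the journal file, and replays the journal to rebuild the in-memory image of all files (file_map). Walking file_map is then enough to initialize the Allocator's free space.</p>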
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueFS</span><span class="o">::</span><span class="n">mount</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_replay</span><span class="p">(</span><span class="nb">false</span><span class="p">);</span> <span class="c1">// rebuild file_map
</span> <span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">p</span> <span class="o">:</span> <span class="n">file_map</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">q</span> <span class="o">:</span> <span class="n">p</span><span class="p">.</span><span class="n">second</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">extents</span><span class="p">)</span> <span class="p">{</span>
<span class="n">alloc</span><span class="p">[</span><span class="n">q</span><span class="p">.</span><span class="n">bdev</span><span class="p">]</span><span class="o">-></span><span class="n">init_rm_free</span><span class="p">(</span><span class="n">q</span><span class="p">.</span><span class="n">offset</span><span class="p">,</span> <span class="n">q</span><span class="p">.</span><span class="n">length</span><span class="p">);</span> <span class="c1">// remove the disk blocks occupied by existing files from the allocator
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
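<p>The recovery step amounts to "start with everything free, then subtract what the replayed files own". A minimal sketch, assuming a single device tracked as one flag per block; Extent, File and rebuild_free_blocks are hypothetical names, and the real code works on per-device interval sets rather than a bit vector.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Extent {
  uint64_t offset;  // in blocks
  uint64_t length;  // in blocks
};

struct File {
  std::vector<Extent> extents;
};

// Mark all blocks free, then clear the blocks referenced by any file.
std::vector<bool> rebuild_free_blocks(std::size_t total_blocks,
                                      const std::vector<File>& file_map) {
  std::vector<bool> is_free(total_blocks, true);
  for (const auto& f : file_map) {
    for (const auto& e : f.extents) {
      for (uint64_t b = e.offset; b < e.offset + e.length && b < total_blocks; ++b)
        is_free[b] = false;  // owned by a file, so not allocatable
    }
  }
  return is_free;
}
```

<p>Nothing about free space is read from disk in this scheme: it is derived entirely from the file metadata recovered by the replay.</p>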
<h3 id="bluestore">BlueStore</h3>
<p>The BlueStore case involves the I/O path for reading and writing objects, so it is left for the BlueStore analysis.</p>
<h1 id="config">Config</h1>
<p>The configuration is simple. stupid is currently the default, and no parameters need to be tuned.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bluefs_allocator</span> <span class="c1">// default: stupid
</span><span class="n">bluestore_allocator</span> <span class="c1">// default: stupid
</span>
<span class="c1">// parameters related to the bitmap allocator
</span><span class="n">bluestore_bitmapallocator_blocks_per_zone</span>
<span class="n">bluestore_bitmapallocator_span_size</span>
</code></pre></div></div>
<h1 id="summary">Summary</h1>
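<ul>
<li>
<p>The Allocator allocates disk space and only marks it in memory. There are two implementations, Stupid and BitMap; Stupid is the extent-based one.</p>
</li>
<li>
<p>The user of an Allocator must initialize its free space when creating it and, after allocating space, must persist the disk space usage at an appropriate time.</p>
</li>
<li>
<p>The Allocator's users are BlueFS and BlueStore. BlueFS persists disk space usage through the file system's journal file, while BlueStore persists it to k/v through the FreelistManager.</p>
</li>
<li>
<p>The default stupid implementation is simple. Understanding the Allocator's role in the BlueStore storage engine and how it is used matters more than its implementation.</p>
</li>
</ul>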
Ceph BlueStore FreelistManager
2018-02-05T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore-freelistmanager
<h1 id="introduction">Introduction</h1>
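<p>Because BlueStore works on a raw device, it has to manage the allocation and reclamation of disk space itself. Let a <strong>block</strong> be the smallest unit of disk storage (4k by default in Ceph). A block is either <strong>in use</strong> or <strong>free</strong>, and an implementation only needs to record blocks in one of the two states to derive the other. Ceph chooses to <strong>record free</strong> blocks for two reasons: first, it makes merging free space easier when space is reclaimed; second, allocated space is already recorded in the object's metadata, the Onode.</p>
<p>The class that manages free space is FreelistManager. Initially there were two implementations, extent and bitmap; bitmap is now the default and the extent implementation has been deprecated. Free space has to be persisted to disk and updated through transactions at runtime, so k/v storage is a natural fit. Blocks are grouped in fixed numbers into <strong>segments</strong>, and each segment maps to one k/v pair: the key is the offset of the segment's first block in the disk's physical address space, and the value is the state of every block in the segment, i.e. a bitmap of 0/1 where 0 means free and 1 means in use. Allocation and release can then be unified into a single operation: XOR with 1.</p>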
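<p>Why a single XOR covers both directions can be seen on one byte of such a bitmap. The snippet below is a sketch under assumed names (block_range_mask and toggle are not Ceph functions, and the bit order within the byte is chosen for convenience): XORing a range mask into the bitmap flips free blocks to used, and XORing the same mask again flips them back.</p>

```cpp
#include <cstdint>

// Build a mask with `count` bits set starting at bit `first` (bit i = block i).
// Assumes first + count <= 8.
uint8_t block_range_mask(unsigned first, unsigned count) {
  return static_cast<uint8_t>(((1u << count) - 1u) << first);
}

// Allocation and release are the same operation: flip the bits in the mask.
uint8_t toggle(uint8_t bitmap, uint8_t mask) {
  return static_cast<uint8_t>(bitmap ^ mask);
}
```

<p>This only works if the caller never allocates a block that is already in use and never releases one that is already free; the XOR itself cannot detect either mistake.</p>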
<h1 id="data-structure">Data Structure</h1>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BitmapFreelistManager</span> <span class="o">:</span> <span class="k">public</span> <span class="n">FreelistManager</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">meta_prefix</span><span class="p">,</span> <span class="n">bitmap_prefix</span><span class="p">;</span> <span class="c1">// key prefixes in rocksdb: B for meta, b for bitmap
</span> <span class="n">KeyValueDB</span> <span class="o">*</span><span class="n">kvdb</span><span class="p">;</span> <span class="c1">// pointer to the kvdb
</span> <span class="n">ceph</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">KeyValueDB</span><span class="o">::</span><span class="n">MergeOperator</span><span class="o">></span> <span class="n">merge_op</span><span class="p">;</span> <span class="c1">// merge operator, in practice a bitwise xor
</span> <span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">size</span><span class="p">;</span> <span class="c1">// size of the device
</span> <span class="kt">uint64_t</span> <span class="n">blocks</span><span class="p">;</span> <span class="c1">// total number of blocks on the device
</span>
<span class="kt">uint64_t</span> <span class="n">bytes_per_block</span><span class="p">;</span> <span class="c1">// block size, corresponds to bdev_block_size
</span> <span class="kt">uint64_t</span> <span class="n">blocks_per_key</span><span class="p">;</span> <span class="c1">// number of blocks covered by each key
</span> <span class="kt">uint64_t</span> <span class="n">bytes_per_key</span><span class="p">;</span> <span class="c1">// bytes of disk space covered by each key
</span>
<span class="kt">uint64_t</span> <span class="n">block_mask</span><span class="p">;</span> <span class="c1">// block mask
</span> <span class="kt">uint64_t</span> <span class="n">key_mask</span><span class="p">;</span> <span class="c1">// key mask
</span>
<span class="n">bufferlist</span> <span class="n">all_set_bl</span><span class="p">;</span>
<span class="c1">// members used to iterate over the rocksdb keys
</span> <span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Iterator</span> <span class="n">enumerate_p</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">enumerate_offset</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">enumerate_bl</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">enumerate_bl_pos</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<h1 id="init">Init</h1>
<p>When BlueStore initializes an OSD it runs mkfs, which initializes the FreelistManager (create/init). If the process is restarted later, mount is executed and only init is performed on the FreelistManager.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">mkfs</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_open_fm</span><span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_open_fm</span><span class="p">(</span><span class="kt">bool</span> <span class="n">create</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">fm</span> <span class="o">=</span> <span class="n">FreelistManager</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">freelist_type</span><span class="p">,</span> <span class="n">db</span><span class="p">,</span> <span class="n">PREFIX_ALLOC</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">create</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// first-time initialization: persist the meta parameters
</span> <span class="n">fm</span><span class="o">-></span><span class="n">create</span><span class="p">(</span><span class="n">bdev</span><span class="o">-></span><span class="n">get_size</span><span class="p">(),</span> <span class="n">min_alloc_size</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">fm</span><span class="o">-></span><span class="n">init</span><span class="p">(</span><span class="n">bdev</span><span class="o">-></span><span class="n">get_size</span><span class="p">());</span>
<span class="p">}</span>
<span class="c1">// create persists some meta parameters to the kvdb; init reads them back from the kvdb
</span><span class="kt">int</span> <span class="n">BitmapFreelistManager</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">new_size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">min_alloc_size</span><span class="p">,</span>
<span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Transaction</span> <span class="n">txn</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">txn</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">meta_prefix</span><span class="p">,</span> <span class="s">"bytes_per_block"</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// 4096
</span> <span class="n">txn</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">meta_prefix</span><span class="p">,</span> <span class="s">"blocks_per_key"</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// 128
</span> <span class="n">txn</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">meta_prefix</span><span class="p">,</span> <span class="s">"blocks"</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="n">txn</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">meta_prefix</span><span class="p">,</span> <span class="s">"size"</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// both create and init call the function below, which initializes the block/key masks
</span><span class="kt">void</span> <span class="n">BitmapFreelistManager</span><span class="o">::</span><span class="n">_init_misc</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">bufferptr</span> <span class="n">z</span><span class="p">(</span><span class="n">blocks_per_key</span> <span class="o">>></span> <span class="mi">3</span><span class="p">);</span> <span class="c1">// 128 >> 3 = 16: the value (segment) of one key covers 128 blocks at 1 bit per block, which takes 16 bytes
</span> <span class="n">memset</span><span class="p">(</span><span class="n">z</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="mh">0xff</span><span class="p">,</span> <span class="n">z</span><span class="p">.</span><span class="n">length</span><span class="p">());</span>
<span class="n">all_set_bl</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">all_set_bl</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">);</span>
<span class="n">block_mask</span> <span class="o">=</span> <span class="o">~</span><span class="p">(</span><span class="n">bytes_per_block</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// 0xFFFF FFFF FFFF F000
</span> <span class="n">bytes_per_key</span> <span class="o">=</span> <span class="n">bytes_per_block</span> <span class="o">*</span> <span class="n">blocks_per_key</span><span class="p">;</span>
<span class="n">key_mask</span> <span class="o">=</span> <span class="o">~</span><span class="p">(</span><span class="n">bytes_per_key</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// 0xFFFF FFFF FFF8 0000
</span><span class="p">}</span>
</code></pre></div></div>
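<p>The mask arithmetic is easy to check by hand with the values from the comments (4096-byte blocks, 128 blocks per key). The following sketch recomputes them; the Masks struct and make_masks are illustrative names, not part of BitmapFreelistManager.</p>

```cpp
#include <cstdint>

struct Masks {
  uint64_t bytes_per_key;
  uint64_t block_mask;
  uint64_t key_mask;
};

// Both sizes are assumed to be powers of two; otherwise the masks are meaningless.
Masks make_masks(uint64_t bytes_per_block, uint64_t blocks_per_key) {
  Masks m;
  m.bytes_per_key = bytes_per_block * blocks_per_key;
  m.block_mask = ~(bytes_per_block - 1);   // clears the offset within a block
  m.key_mask = ~(m.bytes_per_key - 1);     // clears the offset within a key's range
  return m;
}
```

<p>For example, offset 0x81000 belongs to the key at 0x80000, because each key spans 0x80000 bytes (512 KiB).</p>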
<h1 id="allocate--release">Allocate & Release</h1>
<p>The main interfaces allocate and release space. As mentioned above, the two operations are exactly the same: both are an XOR:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">BitmapFreelistManager</span><span class="o">::</span><span class="n">allocate</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">length</span><span class="p">,</span> <span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Transaction</span> <span class="n">txn</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_xor</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">length</span><span class="p">,</span> <span class="n">txn</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">BitmapFreelistManager</span><span class="o">::</span><span class="n">release</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">length</span><span class="p">,</span> <span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Transaction</span> <span class="n">txn</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_xor</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">length</span><span class="p">,</span> <span class="n">txn</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">BitmapFreelistManager</span><span class="o">::</span><span class="n">_xor</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">length</span><span class="p">,</span> <span class="n">KeyValueDB</span><span class="o">::</span><span class="n">Transaction</span> <span class="n">txn</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Note: offset and length are both aligned to block boundaries
</span> <span class="kt">uint64_t</span> <span class="n">first_key</span> <span class="o">=</span> <span class="n">offset</span> <span class="o">&</span> <span class="n">key_mask</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">last_key</span> <span class="o">=</span> <span class="p">(</span><span class="n">offset</span> <span class="o">+</span> <span class="n">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&</span> <span class="n">key_mask</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">first_key</span> <span class="o">==</span> <span class="n">last_key</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// simplest case: the operation falls within a single segment
</span> <span class="n">bufferptr</span> <span class="n">p</span><span class="p">(</span><span class="n">blocks_per_key</span> <span class="o">>></span> <span class="mi">3</span><span class="p">);</span> <span class="c1">// 16-byte buffer (128 blocks, one bit each)
</span> <span class="n">p</span><span class="p">.</span><span class="n">zero</span><span class="p">();</span> <span class="c1">// fill with zeros
</span> <span class="kt">unsigned</span> <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">offset</span> <span class="o">&</span> <span class="o">~</span><span class="n">key_mask</span><span class="p">)</span> <span class="o">/</span> <span class="n">bytes_per_block</span><span class="p">;</span> <span class="c1">// index of the first block within the segment
</span> <span class="kt">unsigned</span> <span class="n">e</span> <span class="o">=</span> <span class="p">((</span><span class="n">offset</span> <span class="o">+</span> <span class="n">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&</span> <span class="o">~</span><span class="n">key_mask</span><span class="p">)</span> <span class="o">/</span> <span class="n">bytes_per_block</span><span class="p">;</span> <span class="c1">// index of the last block within the segment
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">e</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// build the mask for this operation
</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span> <span class="o">>></span> <span class="mi">3</span><span class="p">]</span> <span class="o">^=</span> <span class="mi">1ull</span> <span class="o"><<</span> <span class="p">(</span><span class="n">i</span> <span class="o">&</span> <span class="mi">7</span><span class="p">);</span> <span class="c1">// i>>3 selects the byte holding the block's bit, 1ull<<(i&7) selects the bit; the XOR then sets that bit to 1 in the mask
</span> <span class="p">}</span>
<span class="n">string</span> <span class="n">k</span><span class="p">;</span>
<span class="n">make_offset_key</span><span class="p">(</span><span class="n">first_key</span><span class="p">,</span> <span class="o">&</span><span class="n">k</span><span class="p">);</span> <span class="c1">// encode first_key (the segment offset) as the key string
</span> <span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span>
<span class="n">bl</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">bl</span><span class="p">.</span><span class="n">hexdump</span><span class="p">(</span><span class="o">*</span><span class="n">_dout</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
<span class="n">txn</span><span class="o">-></span><span class="n">merge</span><span class="p">(</span><span class="n">bitmap_prefix</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// XOR with the current value
</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// spans several segments: handle the first, the middle ones, and the last separately; the first and last work like the single-segment case
</span>
<span class="c1">// first segment
</span> <span class="p">{</span>
<span class="c1">// same as the case above
</span> <span class="p">......</span>
<span class="c1">// advance the key to the next segment
</span> <span class="n">first_key</span> <span class="o">+=</span> <span class="n">bytes_per_key</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// middle segments: the mask is all ones here, so use all_set_bl
</span> <span class="k">while</span> <span class="p">(</span><span class="n">first_key</span> <span class="o"><</span> <span class="n">last_key</span><span class="p">)</span> <span class="p">{</span>
<span class="n">string</span> <span class="n">k</span><span class="p">;</span>
<span class="n">make_offset_key</span><span class="p">(</span><span class="n">first_key</span><span class="p">,</span> <span class="o">&</span><span class="n">k</span><span class="p">);</span>
<span class="n">all_set_bl</span><span class="p">.</span><span class="n">hexdump</span><span class="p">(</span><span class="o">*</span><span class="n">_dout</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
<span class="n">txn</span><span class="o">-></span><span class="n">merge</span><span class="p">(</span><span class="n">bitmap_prefix</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">all_set_bl</span><span class="p">);</span> <span class="c1">// XOR with the current value
</span>
<span class="c1">// advance the key to the next segment
</span> <span class="n">first_key</span> <span class="o">+=</span> <span class="n">bytes_per_key</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// last segment
</span> <span class="p">{</span>
<span class="c1">// same as the earlier cases
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// implementation behind the merge operation used above
</span><span class="k">struct</span> <span class="n">XorMergeOperator</span> <span class="o">:</span> <span class="k">public</span> <span class="n">KeyValueDB</span><span class="o">::</span><span class="n">MergeOperator</span> <span class="p">{</span>
<span class="kt">void</span> <span class="n">merge</span><span class="p">(</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ldata</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">llen</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">rdata</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">rlen</span><span class="p">,</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">*</span><span class="n">new_value</span><span class="p">)</span> <span class="k">override</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">llen</span> <span class="o">==</span> <span class="n">rlen</span><span class="p">);</span>
<span class="o">*</span><span class="n">new_value</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="n">ldata</span><span class="p">,</span> <span class="n">llen</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">rlen</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="o">*</span><span class="n">new_value</span><span class="p">)[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">rdata</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="c1">// bitwise XOR
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
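<p>To summarize: the _xor function looks complicated because it is all bit manipulation, but on closer inspection allocation and release are the same operation, XOR-ing the segment's bits with the current value. One segment corresponds to a group of blocks (128 by default) and to one value in the k/v store. For example, when the whole disk is free, the k/v state looks like (b00000000, 0x00), (b00080000, 0x00), (b00100000, 0x00), and so on, where b is the key prefix and stands for bitmap. Consecutive keys are bytes_per_key apart, which is 0x80000 with 4 KB blocks and 128 blocks per key.</p>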
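<p>The symmetry between allocate and release can be checked with a small standalone sketch. It does not use the Ceph types: the constants mirror the defaults discussed above (4 KB blocks, 128 blocks per key), and make_mask and xor_merge are hypothetical helper names that stand in for the single-segment branch of _xor and for the merge operator.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed defaults, matching the mask comments in the code above.
constexpr uint64_t kBytesPerBlock = 4096;
constexpr uint64_t kBlocksPerKey  = 128;
constexpr uint64_t kBytesPerKey   = kBytesPerBlock * kBlocksPerKey;
constexpr uint64_t kKeyMask       = ~(kBytesPerKey - 1);

// Build the XOR mask for a block-aligned range that lies within one segment.
std::vector<uint8_t> make_mask(uint64_t offset, uint64_t length) {
  assert((offset & kKeyMask) == ((offset + length - 1) & kKeyMask));
  std::vector<uint8_t> mask(kBlocksPerKey / 8, 0);
  const uint64_t s = (offset & ~kKeyMask) / kBytesPerBlock;
  const uint64_t e = ((offset + length - 1) & ~kKeyMask) / kBytesPerBlock;
  for (uint64_t i = s; i <= e; ++i) {
    mask[i >> 3] ^= static_cast<uint8_t>(1u << (i & 7));
  }
  return mask;
}

// Stand-in for the merge operator: XOR the mask into the stored value.
void xor_merge(std::vector<uint8_t>& value, const std::vector<uint8_t>& mask) {
  assert(value.size() == mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    value[i] ^= mask[i];
  }
}
```

<p>Applying the same mask twice returns the value to its original state, which is why one code path can serve both allocation and release.</p>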
<h1 id="config">Config</h1>
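<p>Only one parameter can currently be tuned, and only before deployment (it cannot be changed at runtime): the segment size. If it is set too small, the number of segments grows and the number of k/v pairs explodes, so the value should be chosen according to disk capacity. In practice, changing it is not recommended.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bluestore_freelist_blocks_per_key</span> <span class="c1">// default: 128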
</span></code></pre></div></div>
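<p>To get a feel for how the key count scales, here is a rough calculation. It assumes 4 KB blocks, as the mask comments earlier suggest, and freelist_keys is a hypothetical helper, not a Ceph function.</p>

```cpp
#include <cassert>
#include <cstdint>

// Number of bitmap keys needed to cover a device, rounded up.
// Hypothetical helper for illustration only.
constexpr uint64_t freelist_keys(uint64_t device_bytes,
                                 uint64_t bytes_per_block,
                                 uint64_t blocks_per_key) {
  return (device_bytes + bytes_per_block * blocks_per_key - 1) /
         (bytes_per_block * blocks_per_key);
}
```

<p>Under these assumptions a 4 TiB device needs 8,388,608 keys with the default of 128 blocks per key, and eight times as many (67,108,864) if the value is lowered to 16.</p>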
<h1 id="summary">Summary</h1>
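<ul>
<li>
<p>BlueStore uses FreelistManager to manage free disk space: the disk is divided into equal-sized segments, and each segment's state is recorded as a bitmap in the k/v store.</p>
</li>
<li>
<p>Allocation and release share one implementation: an XOR against the existing bitmap.</p>
</li>
</ul>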
Ceph MDS Behind On Trimming Error
2018-01-26T00:00:00+00:00
http://blog.wjin.org/posts/ceph-mds-behind-on-trimming-error
<h1 id="introduction">Introduction</h1>
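<p>When a CephFS cluster <strong>keeps writing a large number of files without pause</strong>, it reports the warning <strong>mds Behind on trimming…</strong>. According to the documentation, the warning means the journal (MDLog) is not being trimmed in time. During sustained writes, data and metadata share the same OSDs, but after tuning the OSD load is not high, so trimming should not be delayed by load on the backend cluster. The warning itself has little impact, but why trimming cannot keep up deserves a closer look.</p>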
<h1 id="trim-process">Trim Process</h1>
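<p>To get to the bottom of this, we need to understand how the MDS trims its log. First, the classes the CephFS MDS uses to manage the journal. MDLog manages the journal as a whole and tracks every LogSegment, including the currently active segment and the expiring/expired ones. When a segment reaches its limit (for example, it exceeds the number of events allowed per segment), a new segment is started, so the number of segments gradually grows and must be trimmed periodically. A LogSegment records a sequence of LogEvents (there are many event types; every update to the file system is some kind of event). MDLog also contains a Journaler object, which writes the log/events to RADOS.</p>
<p>The entry point that triggers trimming is the periodic tick function:</p>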
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDSDaemon</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">tick_event</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">reset_tick</span><span class="p">();</span> <span class="c1">// tick is called every mds_tick_interval seconds (default: 5)
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mds_rank</span><span class="p">)</span> <span class="p">{</span>
<span class="n">mds_rank</span><span class="o">-></span><span class="n">tick</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDSRankDispatcher</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_active</span><span class="p">()</span> <span class="o">||</span> <span class="n">is_stopping</span><span class="p">())</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">mdlog</span><span class="o">-></span><span class="n">trim</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDLog</span><span class="o">::</span><span class="n">trim</span><span class="p">(</span><span class="kt">int</span> <span class="n">m</span><span class="p">)</span> <span class="c1">// m defaults to -1
</span><span class="p">{</span>
<span class="c1">// read the configuration
</span> <span class="kt">unsigned</span> <span class="n">max_segments</span> <span class="o">=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_max_segments</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">max_events</span> <span class="o">=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_max_events</span><span class="p">;</span> <span class="c1">// default: -1
</span> <span class="k">if</span> <span class="p">(</span><span class="n">m</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">max_events</span> <span class="o">=</span> <span class="n">m</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mds</span><span class="o">-></span><span class="n">mdcache</span><span class="o">-></span><span class="n">is_readonly</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// read-only: return immediately
</span> <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// adjust max_events; normally it stays at -1
</span> <span class="k">if</span> <span class="p">(</span><span class="n">max_events</span> <span class="o">></span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">max_events</span> <span class="o"><=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_events_per_segment</span><span class="p">)</span> <span class="p">{</span>
<span class="n">max_events</span> <span class="o">=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_events_per_segment</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">submit_mutex</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">segments</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// no segments, so nothing to trim: return
</span> <span class="n">submit_mutex</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// cap the time spent trimming at 2 seconds per call
</span> <span class="n">utime_t</span> <span class="n">stop</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="n">stop</span> <span class="o">+=</span> <span class="mf">2.0</span><span class="p">;</span>
<span class="c1">// walk the segments; keep trimming while either the event condition or the segment condition holds
</span> <span class="n">map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="n">LogSegment</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">segments</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">!=</span> <span class="n">segments</span><span class="p">.</span><span class="n">end</span><span class="p">()</span> <span class="o">&&</span>
<span class="p">((</span><span class="n">max_events</span> <span class="o">>=</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">num_events</span> <span class="o">-</span> <span class="n">expiring_events</span> <span class="o">-</span> <span class="n">expired_events</span> <span class="o">></span> <span class="n">max_events</span><span class="p">)</span> <span class="o">||</span>
<span class="p">(</span><span class="n">segments</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="n">expiring_segments</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="n">expired_segments</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">></span> <span class="n">max_segments</span><span class="p">)))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">stop</span> <span class="o"><</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">))</span> <span class="c1">// time budget exceeded: leave the loop
</span> <span class="k">break</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">num_expiring_segments</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">expiring_segments</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">num_expiring_segments</span> <span class="o">>=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_max_expiring</span><span class="p">)</span> <span class="c1">// limit on concurrently expiring segments reached: leave the loop
</span> <span class="k">break</span><span class="p">;</span>
<span class="c1">// compute the op priority: try_to_expire may later issue rados ops, and this is the priority they run at
</span> <span class="c1">// the more segments are expiring, the higher the priority
</span> <span class="kt">int</span> <span class="n">op_prio</span> <span class="o">=</span> <span class="n">CEPH_MSG_PRIO_LOW</span> <span class="o">+</span>
<span class="p">(</span><span class="n">CEPH_MSG_PRIO_HIGH</span> <span class="o">-</span> <span class="n">CEPH_MSG_PRIO_LOW</span><span class="p">)</span> <span class="o">*</span>
<span class="n">num_expiring_segments</span> <span class="o">/</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mds_log_max_expiring</span><span class="p">;</span>
<span class="n">LogSegment</span> <span class="o">*</span><span class="n">ls</span> <span class="o">=</span> <span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">;</span>
<span class="n">assert</span><span class="p">(</span><span class="n">ls</span><span class="p">);</span>
<span class="o">++</span><span class="n">p</span><span class="p">;</span>
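<span class="c1">// if the LogSegment still has pending events, or cannot be trimmed yet (its log has not reached disk), leave the loop
</span> <span class="c1">// note that a LogSegment's seq is a log event sequence number; event sequence numbers increase monotonically and are globally unique
</span> <span class="c1">// when a segment is created, the current event's sequence number becomes the segment's sequence number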
</span> <span class="k">if</span> <span class="p">(</span><span class="n">pending_events</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">ls</span><span class="o">-></span><span class="n">seq</span><span class="p">)</span> <span class="o">||</span> <span class="n">ls</span><span class="o">-></span><span class="n">end</span> <span class="o">></span> <span class="n">safe_pos</span><span class="p">)</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// already expiring or expired: nothing to do
</span> <span class="k">if</span> <span class="p">(</span><span class="n">expiring_segments</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">ls</span><span class="p">))</span> <span class="p">{</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">expired_segments</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">ls</span><span class="p">))</span> <span class="p">{</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// otherwise, add the LogSegment to expiring
</span> <span class="n">assert</span><span class="p">(</span><span class="n">expiring_segments</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">ls</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">expiring_segments</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">ls</span><span class="p">);</span>
<span class="n">expiring_events</span> <span class="o">+=</span> <span class="n">ls</span><span class="o">-></span><span class="n">num_events</span><span class="p">;</span>
<span class="n">submit_mutex</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="kt">uint64_t</span> <span class="n">last_seq</span> <span class="o">=</span> <span class="n">ls</span><span class="o">-></span><span class="n">seq</span><span class="p">;</span>
<span class="n">try_expire</span><span class="p">(</span><span class="n">ls</span><span class="p">,</span> <span class="n">op_prio</span><span class="p">);</span> <span class="c1">// try to expire it
</span>
<span class="n">submit_mutex</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">segments</span><span class="p">.</span><span class="n">lower_bound</span><span class="p">(</span><span class="n">last_seq</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// refresh the loop iterator
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="c1">// delete the expired segments and update the journaler's expire position; this position matters because it is where journal replay starts after an MDS failure
</span> <span class="n">_trim_expired_segments</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
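<p>The priority interpolation inside the loop is easier to see in isolation. The sketch below reproduces only that arithmetic. The two constants are assumptions standing in for CEPH_MSG_PRIO_LOW and CEPH_MSG_PRIO_HIGH (taken here as 64 and 196), and expire_op_prio is a hypothetical name.</p>

```cpp
#include <cassert>

// Assumed values standing in for CEPH_MSG_PRIO_LOW / CEPH_MSG_PRIO_HIGH.
constexpr int kPrioLow  = 64;
constexpr int kPrioHigh = 196;

// Linear interpolation between low and high priority, driven by how many
// segments are already expiring relative to mds_log_max_expiring.
constexpr int expire_op_prio(int num_expiring, int max_expiring) {
  return kPrioLow + (kPrioHigh - kPrioLow) * num_expiring / max_expiring;
}
```

<p>Because the loop breaks once num_expiring_segments reaches mds_log_max_expiring, the computed priority stays strictly below the high value.</p>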
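<p>try_expire calls try_to_expire to decide whether a LogSegment can finally be trimmed, and the implementation is quite involved. A LogSegment holds several lists and sets, such as dirty dirs/inodes/dentries and information related to directory fragments. try_to_expire walks these lists and sets in turn; whenever something still prevents trimming, it creates a callback and adds it to the MDSGatherBuilder, which collects all the callbacks and their contexts, in effect producing one chained callback.</p>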
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDLog</span><span class="o">::</span><span class="n">try_expire</span><span class="p">(</span><span class="n">LogSegment</span> <span class="o">*</span><span class="n">ls</span><span class="p">,</span> <span class="kt">int</span> <span class="n">op_prio</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MDSGatherBuilder</span> <span class="n">gather_bld</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span> <span class="c1">// collects the callbacks in one place
</span> <span class="n">ls</span><span class="o">-></span><span class="n">try_to_expire</span><span class="p">(</span><span class="n">mds</span><span class="p">,</span> <span class="n">gather_bld</span><span class="p">,</span> <span class="n">op_prio</span><span class="p">);</span> <span class="c1">// check whether the LogSegment can be expired
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">gather_bld</span><span class="p">.</span><span class="n">has_subs</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// the gather holds callbacks, so the segment cannot be expired yet; it is retried later
</span> <span class="n">gather_bld</span><span class="p">.</span><span class="n">set_finisher</span><span class="p">(</span><span class="k">new</span> <span class="n">C_MaybeExpiredSegment</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">ls</span><span class="p">,</span> <span class="n">op_prio</span><span class="p">));</span>
<span class="n">gather_bld</span><span class="p">.</span><span class="n">activate</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// can be expired now
</span> <span class="n">submit_mutex</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">expiring_segments</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">ls</span><span class="p">);</span>
<span class="n">expiring_events</span> <span class="o">-=</span> <span class="n">ls</span><span class="o">-></span><span class="n">num_events</span><span class="p">;</span>
<span class="n">_expired</span><span class="p">(</span><span class="n">ls</span><span class="p">);</span> <span class="c1">// move the LogSegment into the expired set
</span> <span class="n">submit_mutex</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">logger</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">l_mdl_segexg</span><span class="p">,</span> <span class="n">expiring_segments</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="n">logger</span><span class="o">-></span><span class="n">set</span><span class="p">(</span><span class="n">l_mdl_evexg</span><span class="p">,</span> <span class="n">expiring_events</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
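<p>The gather pattern used by try_expire can be illustrated with a toy class. This is a single-threaded sketch under simplifying assumptions, not the MDSGatherBuilder API: ToyGather and its members are hypothetical, and the real class works with Ceph Context objects rather than std::function.</p>

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Toy gather: runs the finisher once every sub-callback has completed.
class ToyGather {
 public:
  // Create a sub-callback; the caller invokes it when its work finishes.
  std::function<void()> new_sub() {
    ++state_->created;
    ++state_->pending;
    auto st = state_;
    return [st] {
      if (--st->pending == 0 && st->activated && st->finisher) {
        st->finisher();
      }
    };
  }

  // True if at least one sub-callback was created.
  bool has_subs() const { return state_->created > 0; }

  void set_finisher(std::function<void()> f) {
    state_->finisher = std::move(f);
  }

  // Arm the gather; if all subs already completed, fire the finisher now.
  void activate() {
    state_->activated = true;
    if (state_->pending == 0 && state_->finisher) {
      state_->finisher();
    }
  }

 private:
  struct State {
    int created = 0;
    int pending = 0;
    bool activated = false;
    std::function<void()> finisher;
  };
  std::shared_ptr<State> state_ = std::make_shared<State>();
};
```

<p>In try_expire the finisher plays the role of C_MaybeExpiredSegment: it runs only after every outstanding sub-callback has completed, at which point the segment can be checked again.</p>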
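<p>To summarize, the flow is roughly: put the segments that need trimming into the expiring set, run try_expire to decide whether each one can be trimmed and, if so, move it to the expired set, and finally call _trim_expired_segments to clean up the expired set.</p>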
<h1 id="config">Config</h1>
<p>The main configuration parameters are:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mds_log_max_events</span> <span class="c1">// default -1, i.e. no limit
</span><span class="n">mds_log_events_per_segment</span> <span class="c1">// maximum number of events per segment, default 1024; once exceeded, a new segment is started
</span><span class="n">mds_log_segment_size</span> <span class="c1">// size of each segment; default 0, meaning the object size is used, i.e. 4 MB
</span><span class="n">mds_log_max_segments</span> <span class="c1">// maximum number of segments, default 30; it also means that once the segment count exceeds twice this value, the warning from the introduction is raised
</span><span class="n">mds_log_max_expiring</span> <span class="c1">// number of segments that may be expiring at the same time, default 20
</span></code></pre></div></div>
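<p>The warning condition described for mds_log_max_segments can be written as a one-line predicate. This is a sketch of the rule as stated above (more than twice the configured maximum), not the actual health-check code; behind_on_trimming is a hypothetical name.</p>

```cpp
#include <cassert>
#include <cstddef>

// True when the segment count exceeds twice mds_log_max_segments,
// which is the condition the text gives for the trimming warning.
constexpr bool behind_on_trimming(std::size_t num_segments,
                                  std::size_t max_segments) {
  return num_segments > 2 * max_segments;
}
```

<p>With the default of 30, the warning would therefore start at 61 segments.</p>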
<h1 id="tuning">Tuning</h1>
<p>To avoid the warning, there are two angles to consider:</p>
<ol>
<li>
<p>Reduce the number of segments</p>
</li>
<li>
<p>Increase the trim speed</p>
</li>
</ol>
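<p>For the first point, you can adjust the segment size and the per-segment event limit. The two must be adjusted together: many events contain an EMetaBlob member, a class with quite a few members of its own, so events are probably not small. The default limit of 1024 events is presumably a reasonable estimate derived from the default object size of 4 MB and the average event size.</p>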
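<p>The relationship between the two settings is simple arithmetic. The helpers below are hypothetical and only illustrate the proportional reasoning; they say nothing about real event sizes, which vary by event type.</p>

```cpp
#include <cassert>
#include <cstdint>

// Average space available per event if a segment is filled exactly.
constexpr uint64_t avg_event_budget(uint64_t segment_bytes,
                                    uint64_t events_per_segment) {
  return segment_bytes / events_per_segment;
}

// Event limit that keeps the same per-event budget after a size change.
constexpr uint64_t scaled_events_per_segment(uint64_t new_segment_bytes,
                                             uint64_t old_segment_bytes,
                                             uint64_t old_events) {
  return old_events * new_segment_bytes / old_segment_bytes;
}
```

<p>With the defaults this gives 4 MB / 1024 = 4096 bytes per event, and doubling the segment size to 8 MB would call for an event limit of 2048 to keep that ratio.</p>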
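<p>For the second point, trim is called once every 5 seconds. This interval should not be changed, because the tick function does more than trimming. Each call runs for at most two seconds; this is hard-coded, and the code should not be changed either, because trimming takes submit_mutex and should not contend for that lock too often.</p>
<p>That leaves mds_log_max_segments and mds_log_max_expiring. The former should not be set too high: it hides the warning for a while, but under sustained writes the warning eventually returns, and with too many LogSegments outstanding, replaying the journal after an MDS failure takes too long. The remaining parameter, mds_log_max_expiring, can be raised moderately according to cluster size and hardware, as long as client IO is not affected.</p>
<p>The recovery flow after a failure is shown below; journal replay starts from the position where expiry last finished:</p>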
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Call chain that initializes the starting point of journal replay:
</span><span class="n">MDLog</span><span class="o">::</span><span class="n">_recovery_thread</span><span class="p">()</span> <span class="o">-></span> <span class="n">Journaler</span><span class="o">::</span><span class="n">recover</span><span class="p">()</span> <span class="o">-></span> <span class="n">Journaler</span><span class="o">::</span><span class="n">_read_head</span><span class="p">()</span> <span class="o">-></span> <span class="n">Journaler</span><span class="o">::</span><span class="n">_finish_read_head</span><span class="p">()</span>
<span class="n">read_pos</span> <span class="o">=</span> <span class="n">requested_pos</span> <span class="o">=</span> <span class="n">received_pos</span> <span class="o">=</span> <span class="n">expire_pos</span> <span class="o">=</span> <span class="n">h</span><span class="p">.</span><span class="n">expire_pos</span><span class="p">;</span> <span class="c1">// expire position recorded in the journaler head
</span></code></pre></div></div>
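<p>Also, when mds_log_max_segments is set fairly high, for example 1000, usage of the CephFS metadata pool grows noticeably. Most of the growth is journal; the file system's own metadata is only a few hundred MB:</p>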
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GLOBAL:
SIZE AVAIL RAW USED %RAW USED
1455T 1118T 336T 23.15
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
cephfs_metadata 1 5759M 0 301T 10128861
cephfs_data 2 112T 27.10 301T 57787299
</code></pre></div></div>
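<p>For this reason, changing mds_log_max_segments in production is not recommended. In practice, mds_log_max_expiring hits its limit easily, so trimming falls behind and the warning appears. The community has already optimized this, see this <a href="https://github.com/ceph/ceph/pull/18783">patch</a>, which can be backported. If you would rather not modify the code, it is hard to judge how large mds_log_max_expiring should be, so you can simply leave it alone and handle the problem at the monitoring layer: filter the warning out of the <code class="highlighter-rouge">ceph health</code> output, monitor the segment metrics in your metric system, and alert there on a segment threshold. The threshold can be fairly large, since it only guards against a bug that lets the segment count grow without bound, for example into the millions (normally, when continuously writing several million files, the observed maximum segment count is around 10,000).</p>
<p>Some of the metrics that can be monitored:</p>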
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="s">"mds_log"</span><span class="o">:</span> <span class="p">{</span>
<span class="cp"># number of active/expiring/expired events
</span> <span class="s">"ev"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Events"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">"evts"</span>
<span class="p">},</span>
<span class="s">"evexg"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Expiring events"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="s">"evexd"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Current expired events"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="cp"># number of active/expiring/expired segments
</span> <span class="s">"seg"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Segments"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">"segs"</span>
<span class="p">},</span>
<span class="s">"segexg"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Expiring segments"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="s">"segexd"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Current expired segments"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="cp"># Journal expire/write/read positions
</span> <span class="s">"expos"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Journaler xpire position"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="s">"wrpos"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Journaler write position"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="s">"rdpos"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="o">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"description"</span><span class="o">:</span> <span class="s">"Journaler read position"</span><span class="p">,</span>
<span class="s">"nick"</span><span class="o">:</span> <span class="s">""</span>
<span class="p">},</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="summary">Summary</h1>
<ul>
<li>
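<p>The trim rate is limited by configuration parameters; do not raise mds_log_max_segments just to silence the warning.</p>
</li>
<li>
<p>Depending on the hardware and the cluster size, tune mds_log_max_expiring to speed up trimming, or backport the upstream patch.</p>
</li>
<li>
<p>A relatively safe approach is to ignore the warning in the ceph health output and instead add a threshold alert on the segment count in the metrics.</p>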
</li>
</ul>
Ceph BlueStore BlueFS
2018-01-25T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore-bluefs
<h1 id="introduction">Introduction</h1>
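<p>The BlueStore storage engine has to store both data and metadata. Because key-value stores are efficient and support transactions, keeping the metadata in a kv store is the natural choice (an object's omap attributes count as data, but they are also kept in the kv store). Luminous currently uses RocksDB for metadata by default (RocksDB itself suffers from write amplification and compaction overhead, so a kv store tailored to Ceph's workload may follow later). BlueStore, however, runs on raw devices, and RocksDB does not support raw disks. Fortunately, RocksDB provides the RocksEnv runtime environment for cross-platform support, so the obvious solution is for Ceph to implement a simple file system that offers only the interfaces RocksEnv needs. That is enough to run RocksDB, and this file system is BlueFS.</p>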
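<p>As a file system, BlueFS needs to keep a log (journal) to protect the consistency of its data. RocksDB, for its part, can place its .log files on a separate, faster disk. BlueFS therefore supports several device types (wal/db/slow), which makes it quite flexible. The rough rule is that RocksDB's .log files and BlueFS's own log prefer the wal device, ordinary BlueFS files (RocksDB's .sst files) prefer the db device, and when the current device runs out of space, allocation automatically falls back to the next device down.</p>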
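<p>The file system needs disk space for its data, but BlueFS does not need to keep its own on-disk record of free space. Instead it records the allocation and release of file space in its log. On every reload it scans the log and rebuilds the metadata of the whole file system in memory. At runtime, disk usage looks roughly like this (figure borrowed from Ceph's author, Sage):</p>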
<p><img src="/assets/img/post/ceph_bluestore_bluefs.png" alt="img" /></p>
<h1 id="data-structure">Data Structure</h1>
<p>First, the inode that identifies a file in BlueFS:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Offset and length on the physical disk; represents one storage region of a block device
</span><span class="k">class</span> <span class="nc">AllocExtent</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">;</span> <span class="c1">// physical address on the BlockDevice
</span> <span class="kt">uint32_t</span> <span class="n">length</span><span class="p">;</span> <span class="c1">// length
</span><span class="p">};</span>
<span class="k">class</span> <span class="nc">bluefs_extent_t</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AllocExtent</span><span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">uint8_t</span> <span class="n">bdev</span><span class="p">;</span> <span class="c1">// which block device the extent belongs to
</span><span class="p">};</span>
<span class="c1">// inode of a file
</span><span class="k">struct</span> <span class="n">bluefs_fnode_t</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">ino</span><span class="p">;</span> <span class="c1">// inode number
</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">;</span> <span class="c1">// file size
</span> <span class="n">utime_t</span> <span class="n">mtime</span><span class="p">;</span> <span class="c1">// modification time
</span> <span class="kt">uint8_t</span> <span class="n">prefer_bdev</span><span class="p">;</span> <span class="c1">// preferred block device
</span> <span class="n">mempool</span><span class="o">::</span><span class="n">bluefs</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">bluefs_extent_t</span><span class="o">></span> <span class="n">extents</span><span class="p">;</span> <span class="c1">// disk space backing the file
</span> <span class="kt">uint64_t</span> <span class="n">allocated</span><span class="p">;</span> <span class="c1">// space actually occupied by the file, i.e. the sum of the extents' lengths; should be greater than or equal to size
</span><span class="p">};</span>
</code></pre></div></div>
<p>Like most file systems, BlueFS needs a superblock. When the file system is mounted, the superblock must be read before the file system can be recognized:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">bluefs_super_t</span> <span class="p">{</span>
<span class="n">uuid_d</span> <span class="n">uuid</span><span class="p">;</span> <span class="c1">// unique uuid
</span> <span class="n">uuid_d</span> <span class="n">osd_uuid</span><span class="p">;</span> <span class="c1">// uuid of the owning osd
</span> <span class="kt">uint64_t</span> <span class="n">version</span><span class="p">;</span> <span class="c1">// version
</span> <span class="kt">uint32_t</span> <span class="n">block_size</span><span class="p">;</span> <span class="c1">// block size
</span>
<span class="n">bluefs_fnode_t</span> <span class="n">log_fnode</span><span class="p">;</span> <span class="c1">// file that holds the file system log
</span><span class="p">};</span>
</code></pre></div></div>
<p>Next come the file system operations. They modify the file system, so they are wrapped in transactions and recorded in the log:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">bluefs_transaction_t</span> <span class="p">{</span>
<span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
<span class="n">OP_NONE</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
<span class="n">OP_INIT</span><span class="p">,</span> <span class="c1">///< initial (empty) file system marker
</span>
<span class="c1">// add and remove block storage that file space is allocated from
</span> <span class="n">OP_ALLOC_ADD</span><span class="p">,</span> <span class="c1">///< add extent to available block storage (extent)
</span> <span class="n">OP_ALLOC_RM</span><span class="p">,</span> <span class="c1">///< remove extent from available block storage (extent)
</span>
<span class="c1">// create and remove directory entries
</span> <span class="n">OP_DIR_LINK</span><span class="p">,</span> <span class="c1">///< (re)set a dir entry (dirname, filename, ino)
</span> <span class="n">OP_DIR_UNLINK</span><span class="p">,</span> <span class="c1">///< remove a dir entry (dirname, filename)
</span>
<span class="c1">// create and remove directories
</span> <span class="n">OP_DIR_CREATE</span><span class="p">,</span> <span class="c1">///< create a dir (dirname)
</span> <span class="n">OP_DIR_REMOVE</span><span class="p">,</span> <span class="c1">///< remove a dir (dirname)
</span>
<span class="c1">// update and remove files
</span> <span class="n">OP_FILE_UPDATE</span><span class="p">,</span> <span class="c1">///< set/update file metadata (file)
</span> <span class="n">OP_FILE_REMOVE</span><span class="p">,</span> <span class="c1">///< remove file (ino)
</span>
<span class="c1">// compaction of the bluefs log file
</span> <span class="n">OP_JUMP</span><span class="p">,</span> <span class="c1">///< jump the seq # and offset
</span> <span class="n">OP_JUMP_SEQ</span><span class="p">,</span> <span class="c1">///< jump the seq #
</span> <span class="p">}</span> <span class="n">op_t</span><span class="p">;</span>
<span class="n">uuid_d</span> <span class="n">uuid</span><span class="p">;</span> <span class="c1">///< fs uuid
</span> <span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">;</span> <span class="c1">///< sequence number
</span> <span class="n">bufferlist</span> <span class="n">op_bl</span><span class="p">;</span> <span class="c1">///< encoded transaction ops
</span><span class="p">};</span>
</code></pre></div></div>
<p>Finally, the structure of the file system itself:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BlueFS</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// the file system supports several kinds of block devices
</span> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">unsigned</span> <span class="n">MAX_BDEV</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">unsigned</span> <span class="n">BDEV_WAL</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">unsigned</span> <span class="n">BDEV_DB</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">unsigned</span> <span class="n">BDEV_SLOW</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="k">enum</span> <span class="p">{</span>
<span class="n">WRITER_UNKNOWN</span><span class="p">,</span>
<span class="n">WRITER_WAL</span><span class="p">,</span> <span class="c1">// RocksDB log files
</span> <span class="n">WRITER_SST</span><span class="p">,</span> <span class="c1">// RocksDB sst files
</span> <span class="p">};</span>
<span class="c1">// file
</span> <span class="k">struct</span> <span class="n">File</span> <span class="o">:</span> <span class="k">public</span> <span class="n">RefCountedObject</span> <span class="p">{</span>
<span class="n">bluefs_fnode_t</span> <span class="n">fnode</span><span class="p">;</span> <span class="c1">// file inode
</span> <span class="kt">int</span> <span class="n">refs</span><span class="p">;</span> <span class="c1">// reference count
</span> <span class="kt">uint64_t</span> <span class="n">dirty_seq</span><span class="p">;</span> <span class="c1">// dirty sequence number
</span> <span class="kt">bool</span> <span class="n">locked</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">deleted</span><span class="p">;</span>
<span class="n">boost</span><span class="o">::</span><span class="n">intrusive</span><span class="o">::</span><span class="n">list_member_hook</span><span class="o"><></span> <span class="n">dirty_item</span><span class="p">;</span>
<span class="c1">// reader and writer counts
</span> <span class="n">std</span><span class="o">::</span><span class="n">atomic_int</span> <span class="n">num_readers</span><span class="p">,</span> <span class="n">num_writers</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic_int</span> <span class="n">num_reading</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// directory
</span> <span class="k">struct</span> <span class="n">Dir</span> <span class="o">:</span> <span class="k">public</span> <span class="n">RefCountedObject</span> <span class="p">{</span>
<span class="n">mempool</span><span class="o">::</span><span class="n">bluefs</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span><span class="n">FileRef</span><span class="o">></span> <span class="n">file_map</span><span class="p">;</span> <span class="c1">// files contained in the directory
</span> <span class="p">};</span>
<span class="c1">// in-memory image of the file system
</span> <span class="n">mempool</span><span class="o">::</span><span class="n">bluefs</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">DirRef</span><span class="o">></span> <span class="n">dir_map</span><span class="p">;</span> <span class="c1">// all directories
</span> <span class="n">mempool</span><span class="o">::</span><span class="n">bluefs</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="n">FileRef</span><span class="o">></span> <span class="n">file_map</span><span class="p">;</span> <span class="c1">// all files
</span>
<span class="n">map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="n">dirty_file_list_t</span><span class="o">></span> <span class="n">dirty_files</span><span class="p">;</span> <span class="c1">// dirty files, ordered by sequence number
</span>
<span class="c1">// file system superblock and log
</span> <span class="p">......</span>
<span class="c1">// structs FileWriter/FileReader/FileLock, used to read, write and lock a file
</span> <span class="p">......</span>
<span class="n">vector</span><span class="o"><</span><span class="n">BlockDevice</span><span class="o">*></span> <span class="n">bdev</span><span class="p">;</span> <span class="c1">// all BlockDevices available to BlueFS, including wal/db/slow
</span> <span class="n">vector</span><span class="o"><</span><span class="n">IOContext</span><span class="o">*></span> <span class="n">ioc</span><span class="p">;</span> <span class="c1">// IOContext for each bdev
</span> <span class="n">vector</span><span class="o"><</span><span class="n">interval_set</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="o">></span> <span class="n">block_all</span><span class="p">;</span> <span class="c1">// disk space of each bdev
</span> <span class="n">vector</span><span class="o"><</span><span class="n">Allocator</span><span class="o">*></span> <span class="n">alloc</span><span class="p">;</span> <span class="c1">// allocator for each bdev
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<h1 id="bluefs-init">BlueFS Init</h1>
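<p>The users of BlueFS (RocksDB/RocksEnv) only perform ordinary file system operations, such as creating and deleting files or opening files for reading and writing. Before it can be used, however, the file system must be formatted. The BlueStore storage engine manages this: BlueFS is initialized when the osd is deployed, as follows:</p>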
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">mkfs</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="n">ObjectStore</span> <span class="o">*</span><span class="n">store</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">dev</span><span class="p">,</span>
<span class="n">uuid_d</span> <span class="n">fsid</span><span class="p">,</span> <span class="kt">int</span> <span class="n">whoami</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">store</span><span class="o">-></span><span class="n">mkfs</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">mkfs</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_open_db</span><span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">BlueStore</span><span class="o">::</span><span class="n">_open_db</span><span class="p">(</span><span class="kt">bool</span> <span class="n">create</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">bluefs</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BlueFS</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span>
<span class="p">......</span>
<span class="c1">// add the slow/db/wal devices in turn
</span> <span class="n">bluefs</span><span class="o">-></span><span class="n">add_block_device</span><span class="p">(...)</span> <span class="c1">// add a device; creates a BlockDevice and its IOContext
</span> <span class="n">bluefs</span><span class="o">-></span><span class="n">add_block_extent</span><span class="p">(...)</span> <span class="c1">// add the device's storage space, normally from SUPER_RESERVED to the end of the disk; SUPER_RESERVED is 8192, i.e. usable space starts at the third 4k block
</span> <span class="p">......</span>
<span class="n">bluefs</span><span class="o">-></span><span class="n">mkfs</span><span class="p">(</span><span class="n">fsid</span><span class="p">);</span> <span class="c1">// format the file system; mainly creates the superblock, the log file, etc.
</span> <span class="n">bluefs</span><span class="o">-></span><span class="n">mount</span><span class="p">();</span> <span class="c1">// mount the file system
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
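<p>The whole flow is fairly simple. The key point is that BlueFS is an in-memory file system in the sense that all of its metadata lives in memory: at mount time it scans the log and rebuilds the state of the entire file system in memory, including dir_map and file_map. This works because BlueFS serves only RocksDB, so the file system holds just a small number of files, and both the memory footprint and the on-disk log stay small.</p>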
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">BlueFS</span><span class="o">::</span><span class="n">mount</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// read the superblock
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_open_super</span><span class="p">();</span>
<span class="p">......</span>
<span class="c1">// initialize the allocators to cover all of the disk space
</span> <span class="n">_init_alloc</span><span class="p">();</span>
<span class="p">......</span>
<span class="c1">// replay the file system log; its entries are the transaction ops above, and replaying each transaction updates dir_map/file_map
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_replay</span><span class="p">(</span><span class="nb">false</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">p</span> <span class="o">:</span> <span class="n">file_map</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">q</span> <span class="o">:</span> <span class="n">p</span><span class="p">.</span><span class="n">second</span><span class="o">-></span><span class="n">fnode</span><span class="p">.</span><span class="n">extents</span><span class="p">)</span> <span class="p">{</span>
<span class="n">alloc</span><span class="p">[</span><span class="n">q</span><span class="p">.</span><span class="n">bdev</span><span class="p">]</span><span class="o">-></span><span class="n">init_rm_free</span><span class="p">(</span><span class="n">q</span><span class="p">.</span><span class="n">offset</span><span class="p">,</span> <span class="n">q</span><span class="p">.</span><span class="n">length</span><span class="p">);</span> <span class="c1">// remove the space already used by files from the allocator
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
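<p>Once mount completes, all file system metadata, files and directories included, has been initialized in memory, and the file system can be read and written. A file is opened for writing with open_for_write, which updates dir_map if the file is new. Note that when a file is opened, its prefer_bdev defaults to BDEV_DB and is then adjusted according to the directory name (the slow and wal directories carry special suffixes). Ordinary BlueFS files therefore prefer the db device rather than the smaller wal device.</p>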
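<p>The file system exposes only a small API, and the remaining operations are equally simple: they append to the log and update the in-memory dir_map and file_map. After a crash or a process restart, replaying the log restores the in-memory state of the file system.</p>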
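<p>Another point worth noting is compaction of the BlueFS log, which comes in sync and async variants. Roughly, the in-memory image of the file system (files and directories) is re-encoded as transactions and written to a new log file, after which the old log file is deleted; the old log file is never read or rewritten.</p>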
<h1 id="config">Config</h1>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ordinary files
</span><span class="n">bluefs_alloc_size</span> <span class="c1">// minimum allocation size, 1MB by default
</span><span class="n">bluefs_max_prefetch</span> <span class="c1">// maximum number of bytes to read ahead, 1MB by default; mainly for sequential reads
</span>
<span class="c1">// log file
</span><span class="n">bluefs_min_log_runway</span> <span class="c1">// allocate more space when the free space in the bluefs log file drops below this value; 1MB by default
</span><span class="n">bluefs_max_log_runway</span> <span class="c1">// size of a single allocation for the bluefs log file, 4MB by default
</span><span class="n">bluefs_log_compact_min_ratio</span> <span class="c1">// controls compaction via the ratio of the current log size to the estimated log size; 5 by default
</span><span class="n">bluefs_log_compact_min_size</span> <span class="c1">// controls compaction via the log size; no compaction below this value; 16MB by default
</span><span class="n">bluefs_compact_log_sync</span> <span class="c1">// how the log is compacted, sync or async; false by default, i.e. async
</span>
<span class="n">bluefs_min_flush_size</span> <span class="c1">// file writes go to memory first; once the buffered content exceeds this value it is flushed to disk; 512KB by default
</span>
<span class="n">bluefs_buffered_io</span> <span class="c1">// passed by bluefs to the BlockDevice read/write calls; false by default, i.e. fd_direct is used
</span><span class="n">bluefs_sync_write</span> <span class="c1">// whether to write synchronously; false by default, i.e. aio_write is used, in which case flushing the block device has to wait for the aio to complete; see _flush_bdev_safely
</span>
<span class="n">bluefs_allocator</span> <span class="c1">// allocator bluefs uses for disk space; stupid by default, i.e. extent based
</span>
<span class="n">bluefs_preextend_wal_files</span> <span class="c1">// whether to extend the size of rocksdb wal files in advance; false by default
</span>
<span class="c1">// some further parameters relate to BlueStore and start with bluestore_bluefs_; they mainly control how BlueStore's slow storage space is handed to BlueFS
</span></code></pre></div></div>
<h1 id="metric">Metric</h1>
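<p>The metrics are easy to understand from their descriptions. Pay most attention to db/wal usage: these usually sit on a faster ssd, and BlueFS should not run short of space there and end up using BlueStore's slow space.</p>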
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="s">"bluefs"</span><span class="o">:</span> <span class="p">{</span>
<span class="s">"gift_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"reclaim_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"db_total_bytes"</span><span class="o">:</span> <span class="mi">240043163648</span><span class="p">,</span>
<span class="s">"db_used_bytes"</span><span class="o">:</span> <span class="mi">631242752</span><span class="p">,</span>
<span class="s">"wal_total_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"wal_used_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"slow_total_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"slow_used_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"num_files"</span><span class="o">:</span> <span class="mi">14</span><span class="p">,</span>
<span class="s">"log_bytes"</span><span class="o">:</span> <span class="mi">5361664</span><span class="p">,</span>
<span class="s">"log_compactions"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"logged_bytes"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"files_written_wal"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"files_written_sst"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"bytes_written_wal"</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s">"bytes_written_sst"</span><span class="o">:</span> <span class="mi">0</span>
<span class="p">},</span>
</code></pre></div></div>
<h1 id="summary">Summary</h1>
<ul>
<li>
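<p>BlueFS supports multiple devices at the same time (wal/db/slow).</p>
</li>
<li>
<p>BlueFS is a simple in-memory file system that provides only the basic operations needed to run RocksDB.</p>
</li>
<li>
<p>BlueFS has its own log that records modifications to the file system. After a failure, replaying the log rebuilds the complete in-memory image of the file system.</p>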
</li>
</ul>
Ceph BlueStore BlockDevice
2018-01-24T00:00:00+00:00
http://blog.wjin.org/posts/ceph-bluestore-blockdevice
<h1 id="introduction">Introduction</h1>
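<p>BlueStore, Ceph's new storage engine, became the default in the Luminous release. It replaces the earlier FileStore engine and drops the dependency on a file system entirely: the Ceph OSD process manages the space on the raw disk directly and performs reads and writes through libaio. The implementation abstracts a BlockDevice base class that manages the various device types, such as Kernel, NVME and NVRAM, and gives the users of the raw disk (BlueFS/BlueStore) a uniform interface. To keep up with recent storage technology, it also integrates spdk for NVME, driving NVME disks entirely from user space, which raises iops and greatly reduces io latency.</p>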
<h1 id="source-code">Source Code</h1>
<p>The BlockDevice class hierarchy is shown below:</p>
<p><img src="/assets/img/post/ceph_bluestore_blockdevice.png" alt="img" /></p>
<h3 id="data-structure">Data Structure</h3>
<p>Since most deployments still use hdd and sata ssd, those serve as the example here. The corresponding derived class is KernelDevice, whose main data members are:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">KernelDevice</span> <span class="o">:</span> <span class="k">public</span> <span class="n">BlockDevice</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">fd_direct</span><span class="p">,</span> <span class="n">fd_buffered</span><span class="p">;</span> <span class="c1">// fds obtained by opening the raw device in direct and buffered mode respectively
</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">;</span> <span class="c1">// total size of the device
</span> <span class="kt">uint64_t</span> <span class="n">block_size</span><span class="p">;</span> <span class="c1">// block size
</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">path</span><span class="p">;</span> <span class="c1">// path
</span> <span class="kt">bool</span> <span class="n">aio</span><span class="p">,</span> <span class="n">dio</span><span class="p">;</span>
<span class="c1">// libaio-related thread
</span> <span class="k">struct</span> <span class="n">AioCompletionThread</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Thread</span> <span class="p">{</span>
<span class="n">KernelDevice</span> <span class="o">*</span><span class="n">bdev</span><span class="p">;</span>
<span class="k">explicit</span> <span class="nf">AioCompletionThread</span><span class="p">(</span><span class="n">KernelDevice</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span> <span class="o">:</span> <span class="n">bdev</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">entry</span><span class="p">()</span> <span class="k">override</span> <span class="p">{</span>
<span class="n">bdev</span><span class="o">-></span><span class="n">_aio_thread</span><span class="p">();</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="n">aio_thread</span><span class="p">;</span>
<span class="c1">// queue and callback used for aio operations
</span> <span class="n">aio_queue_t</span> <span class="n">aio_queue</span><span class="p">;</span>
<span class="n">aio_callback_t</span> <span class="n">aio_callback</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">aio_callback_priv</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">aio_stop</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<h3 id="init-device">Init Device</h3>
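<p>The user-space file system BlueFS uses a BlockDevice to store file system data, and BlueStore uses a BlockDevice to store object data. A BlockDevice is created through the factory function create, which instantiates the right device for the device type and takes the callback and its argument:</p>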
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BlockDevice</span> <span class="o">*</span><span class="n">BlockDevice</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">CephContext</span><span class="o">*</span> <span class="n">cct</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">path</span><span class="p">,</span>
<span class="n">aio_callback_t</span> <span class="n">cb</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">cbpriv</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// determine the device type from the device path
</span> <span class="n">string</span> <span class="n">type</span> <span class="o">=</span> <span class="s">"kernel"</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">PATH_MAX</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">::</span><span class="n">readlink</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'\0'</span><span class="p">;</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">bname</span> <span class="o">=</span> <span class="o">::</span><span class="n">basename</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">strncmp</span><span class="p">(</span><span class="n">bname</span><span class="p">,</span> <span class="n">SPDK_PREFIX</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">SPDK_PREFIX</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">type</span> <span class="o">=</span> <span class="s">"ust-nvme"</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="s">"kernel"</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="n">KernelDevice</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cb</span><span class="p">,</span> <span class="n">cbpriv</span><span class="p">);</span> <span class="c1">// kernel
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="s">"ust-nvme"</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="n">NVMEDevice</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cb</span><span class="p">,</span> <span class="n">cbpriv</span><span class="p">);</span> <span class="c1">// nvme
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>After the device is created, the next step is to open it and initialize its basic parameters:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">KernelDevice</span><span class="o">::</span><span class="n">open</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// 分别以fd_direct/fd_buffered方式打开块设备
</span> <span class="n">fd_direct</span> <span class="o">=</span> <span class="o">::</span><span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_DIRECT</span><span class="p">);</span>
<span class="n">fd_buffered</span> <span class="o">=</span> <span class="o">::</span><span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">O_RDWR</span><span class="p">);</span>
<span class="c1">// Read parameters such as the block size
</span> <span class="n">block_size</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bdev_block_size</span><span class="p">;</span>
<span class="p">......</span>
<span class="c1">// If aio is enabled, initialize the aio parameters and start the aio thread
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_aio_start</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
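<p>The device is now ready for reads and writes. Note that managing the device's space is the job of its users, such as BlueFS and BlueStore; the device itself only exposes I/O interfaces.</p>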
<h3 id="aio_write">aio_write</h3>
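<p>Once the device is initialized, its interfaces can be called to read and write. KernelDevice provides synchronous interfaces (read/write) and asynchronous ones (aio_read/aio_write). On the asynchronous path, aio_write only stages the data in a buffer; the caller must then call aio_submit to submit the request, and once the I/O completes the aio thread runs the callback. This section walks through aio_write, the most complex of these paths.</p>
<p>The aio interfaces are built on libaio; for how to use libaio, see this <a href="http://blog.csdn.net/heyutao007/article/details/7065166">article</a>. Ceph wraps it in the class IOContext, with one IOContext per device:</p>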
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">IOContext</span> <span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">condition_variable</span> <span class="n">cond</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">list</span><span class="o"><</span><span class="n">aio_t</span><span class="o">></span> <span class="n">pending_aios</span><span class="p">;</span> <span class="c1">// aios waiting to be submitted
</span> <span class="n">std</span><span class="o">::</span><span class="n">list</span><span class="o"><</span><span class="n">aio_t</span><span class="o">></span> <span class="n">running_aios</span><span class="p">;</span> <span class="c1">// aios currently in flight
</span>
<span class="c1">// counters
</span> <span class="n">std</span><span class="o">::</span><span class="n">atomic_int</span> <span class="n">num_pending</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic_int</span> <span class="n">num_running</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">aio_t</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">iocb</span> <span class="n">iocb</span><span class="p">;</span> <span class="c1">// libaio control block
</span> <span class="p">......</span>
<span class="kt">void</span> <span class="n">pwritev</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">_offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
<span class="n">offset</span> <span class="o">=</span> <span class="n">_offset</span><span class="p">;</span>
<span class="n">length</span> <span class="o">=</span> <span class="n">len</span><span class="p">;</span>
<span class="n">io_prep_pwritev</span><span class="p">(</span><span class="o">&</span><span class="n">iocb</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&</span><span class="n">iov</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">iov</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="n">offset</span><span class="p">);</span> <span class="c1">// prepare the vectored write
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Calling aio_write really just appends an aio_t, the libaio-related structure, to the pending_aios member of the device's IOContext:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">KernelDevice</span><span class="o">::</span><span class="n">aio_write</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">off</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">&</span><span class="n">bl</span><span class="p">,</span> <span class="n">IOContext</span> <span class="o">*</span><span class="n">ioc</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">buffered</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">aio</span> <span class="o">&&</span> <span class="n">dio</span> <span class="o">&&</span> <span class="o">!</span><span class="n">buffered</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ioc</span><span class="o">-></span><span class="n">pending_aios</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">aio_t</span><span class="p">(</span><span class="n">ioc</span><span class="p">,</span> <span class="n">fd_direct</span><span class="p">));</span> <span class="c1">// append to the IOContext's pending queue, to be submitted later
</span> <span class="o">++</span><span class="n">ioc</span><span class="o">-></span><span class="n">num_pending</span><span class="p">;</span>
<span class="c1">// stage the data to be written in the aio's buffer
</span> <span class="n">aio_t</span><span class="o">&</span> <span class="n">aio</span> <span class="o">=</span> <span class="n">ioc</span><span class="o">-></span><span class="n">pending_aios</span><span class="p">.</span><span class="n">back</span><span class="p">();</span>
<span class="n">bl</span><span class="p">.</span><span class="n">prepare_iov</span><span class="p">(</span><span class="o">&</span><span class="n">aio</span><span class="p">.</span><span class="n">iov</span><span class="p">);</span>
<span class="n">aio</span><span class="p">.</span><span class="n">bl</span><span class="p">.</span><span class="n">claim_append</span><span class="p">(</span><span class="n">bl</span><span class="p">);</span>
<span class="n">aio</span><span class="p">.</span><span class="n">pwritev</span><span class="p">(</span><span class="n">off</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span> <span class="c1">// fill in the iocb for the vectored write
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
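<p>The io_prep_pwritev call in aio_t describes a vectored (gather) write: several memory segments are written to one contiguous range of the device, starting at a single offset. The following is a minimal sketch of those gather semantics only. It uses the synchronous POSIX pwritev() rather than libaio so that it stays self-contained, and the helper names gather_write and gather_roundtrip as well as the temporary file are assumptions made for illustration, not Ceph code:</p>

```cpp
#include <sys/uio.h>   // struct iovec, pwritev
#include <unistd.h>    // pread, close, unlink
#include <cassert>
#include <cstdlib>     // mkstemp
#include <string>
#include <vector>

// Hypothetical helper: write all segments to fd as one contiguous range
// starting at `offset`, the way a single pwritev-style request would.
// Returns the number of bytes written, or -1 on error.
inline ssize_t gather_write(int fd, const std::vector<std::string>& segments,
                            off_t offset) {
  assert(fd >= 0);
  std::vector<iovec> iov;
  iov.reserve(segments.size());
  for (const std::string& s : segments) {
    iovec v;
    v.iov_base = const_cast<char*>(s.data());  // pwritev only reads the buffer
    v.iov_len = s.size();
    iov.push_back(v);
  }
  return ::pwritev(fd, iov.data(), static_cast<int>(iov.size()), offset);
}

// Hypothetical round trip through a temporary file: write the segments at
// `offset`, then read the same range back as one string.
inline std::string gather_roundtrip(const std::vector<std::string>& segments,
                                    off_t offset) {
  char path[] = "/tmp/gather_write_XXXXXX";
  int fd = ::mkstemp(path);
  if (fd < 0) return std::string();
  ::unlink(path);  // the file disappears once fd is closed
  ssize_t written = gather_write(fd, segments, offset);
  std::string out;
  if (written > 0) {
    out.resize(static_cast<size_t>(written));
    ssize_t got = ::pread(fd, &out[0], out.size(), offset);
    out.resize(got > 0 ? static_cast<size_t>(got) : 0);
  }
  ::close(fd);
  return out;
}
```

<p>For example, gather_roundtrip({"ab", "cd", "ef"}, 4096) writes three separate buffers with one call and reads them back as a single contiguous range. In KernelDevice the same iovec array is handed to libaio instead, and the write only happens after aio_submit.</p>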
<h3 id="aio_submit">aio_submit</h3>
<p>After staging data with aio_write, the caller then calls aio_submit to submit the I/O requests:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">KernelDevice</span><span class="o">::</span><span class="n">aio_submit</span><span class="p">(</span><span class="n">IOContext</span> <span class="o">*</span><span class="n">ioc</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ioc</span><span class="o">-></span><span class="n">num_pending</span><span class="p">.</span><span class="n">load</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// move the pending aios onto the running list
</span> <span class="n">list</span><span class="o"><</span><span class="n">aio_t</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">e</span> <span class="o">=</span> <span class="n">ioc</span><span class="o">-></span><span class="n">running_aios</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">ioc</span><span class="o">-></span><span class="n">running_aios</span><span class="p">.</span><span class="n">splice</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">ioc</span><span class="o">-></span><span class="n">pending_aios</span><span class="p">);</span>
<span class="p">......</span>
<span class="c1">// submit the aios as a batch
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">aio_queue</span><span class="p">.</span><span class="n">submit_batch</span><span class="p">(</span><span class="n">ioc</span><span class="o">-></span><span class="n">running_aios</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">e</span><span class="p">,</span>
<span class="n">ioc</span><span class="o">-></span><span class="n">num_running</span><span class="p">.</span><span class="n">load</span><span class="p">(),</span> <span class="n">priv</span><span class="p">,</span> <span class="o">&</span><span class="n">retries</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">aio_queue_t</span><span class="o">::</span><span class="n">submit_batch</span><span class="p">(</span><span class="n">aio_iter</span> <span class="n">begin</span><span class="p">,</span> <span class="n">aio_iter</span> <span class="n">end</span><span class="p">,</span>
<span class="kt">uint16_t</span> <span class="n">aios_size</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">,</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">retries</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">while</span> <span class="p">(</span><span class="n">left</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">io_submit</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">piocb</span> <span class="o">+</span> <span class="n">done</span><span class="p">);</span> <span class="c1">// submit the I/O through the libaio API
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
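<p>The two lines in aio_submit that capture running_aios.begin() and then splice pending_aios in front of it are easy to misread. A minimal sketch of that idiom, with plain ints standing in for aio_t; the names BatchQueue and submit_pending are assumptions made for illustration:</p>

```cpp
#include <cassert>
#include <list>
#include <vector>

// Simplified stand-in for the two lists in IOContext.
struct BatchQueue {
  std::list<int> pending;
  std::list<int> running;
};

// Move everything from pending to the front of running and return the
// elements that make up the newly submitted batch, in submission order.
inline std::vector<int> submit_pending(BatchQueue& q) {
  // `e` marks the first element that was already running before this call
  // (or end() if nothing was running).
  std::list<int>::iterator e = q.running.begin();
  // splice() relinks the nodes without copying them, and `e` stays valid.
  q.running.splice(e, q.pending);
  assert(q.pending.empty());
  // The new batch is exactly the range [running.begin(), e).
  return std::vector<int>(q.running.begin(), e);
}
```

<p>With running = {1, 2} and pending = {3, 4, 5}, the call returns {3, 4, 5} and leaves running as {3, 4, 5, 1, 2}. This is why submit_batch can be handed the pair (running_aios.begin(), e): it covers only the requests that have not been submitted yet.</p>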
<h3 id="aio_thread">aio_thread</h3>
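<p>Once the aio requests are submitted, the device user's part is done. A separate thread checks for completed I/O and, when an I/O has actually finished, runs the callback to notify the caller. This is the device's aio thread, whose entry point is:</p>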
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">KernelDevice</span><span class="o">::</span><span class="n">_aio_thread</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">aio_stop</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">aio_queue</span><span class="p">.</span><span class="n">get_next_completed</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">bdev_aio_poll_ms</span><span class="p">,</span> <span class="c1">// poll libaio for completed I/O
</span> <span class="n">aio</span><span class="p">,</span> <span class="n">max</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ioc</span><span class="o">-></span><span class="n">priv</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">--</span><span class="n">ioc</span><span class="o">-></span><span class="n">num_running</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aio_callback</span><span class="p">(</span><span class="n">aio_callback_priv</span><span class="p">,</span> <span class="n">ioc</span><span class="o">-></span><span class="n">priv</span><span class="p">);</span> <span class="c1">// run the callback
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">aio_queue_t</span><span class="o">::</span><span class="n">get_next_completed</span><span class="p">(</span><span class="kt">int</span> <span class="n">timeout_ms</span><span class="p">,</span> <span class="n">aio_t</span> <span class="o">**</span><span class="n">paio</span><span class="p">,</span> <span class="kt">int</span> <span class="n">max</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">io_getevents</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">max</span><span class="p">,</span> <span class="n">event</span><span class="p">,</span> <span class="o">&</span><span class="n">t</span><span class="p">);</span> <span class="c1">// fetch completed aio requests through the libaio API
</span> <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="n">EINTR</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
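<p>The loop above invokes the callback once per batch rather than once per aio: each completion decrements num_running, and only the completion that brings it to zero triggers aio_callback. A minimal sketch of that counting rule, assuming a simplified context type; CompletionCounter and on_aio_complete are hypothetical names, and the real thread also handles contexts without a priv pointer differently:</p>

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Simplified stand-in for IOContext: only what the completion path needs.
struct CompletionCounter {
  std::atomic_int num_running{0};
  void* priv = nullptr;  // non-null means "invoke the callback when done"
};

// Called once per completed aio. Returns true if this completion was the
// last one of the batch and the callback was invoked.
inline bool on_aio_complete(CompletionCounter& ioc,
                            const std::function<void(void*)>& cb) {
  assert(ioc.num_running.load() > 0);
  if (--ioc.num_running == 0 && ioc.priv) {
    cb(ioc.priv);  // one notification for the whole batch
    return true;
  }
  return false;
}
```

<p>With num_running set to 3, the first two calls return false and only the third runs the callback, so the user of the device sees a single notification for everything it submitted together.</p>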
<h1 id="config">Config</h1>
<p>Apart from the debug-related options, the main configuration options concern libaio; as things stand they rarely need tuning:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bdev_aio <span class="c"># defaults to true; must not be changed, since only aio access to the disk is supported now</span>
bdev_aio_poll_ms <span class="c"># timeout for the libaio call io_getevents, in milliseconds; default 250</span>
bdev_aio_max_queue_depth <span class="c"># maximum queue depth passed to the libaio call io_setup; default 1024</span>
bdev_aio_reap_max <span class="c"># maximum number of events returned per io_getevents call</span>
bdev_block_size <span class="c"># disk block size; default 4096 bytes</span>
<span class="c"># NVMe-related options</span>
bdev_nvme_unbind_from_kernel
bdev_nvme_retry_count
</code></pre></div></div>
<h1 id="summary">Summary</h1>
<ul>
<li>
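<p>With BlockDevice as the base class, different device types are abstracted, including KernelDevice, NVMEDevice and PMEMDevice.</p>
</li>
<li>
<p>Users of a device initialize it through the create and open interfaces it provides.</p>
</li>
<li>
<p>A device provides synchronous and asynchronous read/write interfaces, plus operations such as flush to make sure data reaches the disk.</p>
</li>
<li>
<p>Asynchronous writes go through aio_write and aio_submit; once the I/O completes, the device's aio thread runs the callback to notify the caller.</p>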
</li>
</ul>
CephFS Remove File Error
2018-01-09T00:00:00+00:00
http://blog.wjin.org/posts/cephfs-remove-file-error
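<h1 id="问题">Problem</h1>
<p>While using CephFS, users reported an error when deleting files: <strong>No space left on device.</strong> A quick search turned up similar reports in the community. Judging from the mailing-list discussions, one cause is a parameter limit, and the other is a latent bug that is especially easy to trigger with hard links.</p>
<h1 id="分析">Analysis</h1>
<p>Deleting a file in CephFS, a distributed file system, differs from deleting one in a traditional local file system. In a local file system such as ext4, deletion only has to release the file's metadata, such as its inode; the data blocks need not be erased and are simply overwritten later. In a distributed file system like Ceph, deletion must not only update the file system metadata but also delete the file's contents, otherwise disk space leaks. A file may map to multiple objects in the backing RADOS cluster, and object names are globally unique, so if the objects are not deleted, the disk space they occupy is wasted.</p>
<p>Since the contents themselves must be deleted, a problem follows naturally: a very large file maps to many RADOS objects, and deleting that many objects is slow. Deletion therefore has to be asynchronous, and its rate has to be controlled so that it does not hurt cluster performance. In CephFS, deleting a file is an <strong>unlink</strong> operation: the file is moved into a special directory (called the <strong>stray</strong> directory inside the Ceph MDS), the file system metadata is updated, and the user is told that the deletion succeeded. The contents of the stray directory are then cleaned up gradually.</p>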
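<p>The stray directory is a special CephFS directory that users cannot see, but its implementation is no different from an ordinary directory and it obeys the same internal limits. One of them is <strong>mds_bal_fragment_size_max</strong>, which caps the number of files a directory can hold and defaults to 100,000. As noted above, deletion has to limit its concurrency so that it does not consume too much cluster bandwidth. But when many users run bulk deletes at once, entries are added to the stray directory much faster than they are purged, the stray directory quickly fills up, and users see the reported error. (The stray directory is in fact sharded: there are 10 stray directories, which together can hold one million files.)</p>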
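<p>The numbers in the paragraph above can be turned into a small back-of-the-envelope model. The sketch below is hypothetical: the constants come from the text, while check_stray_space and seconds_until_full are illustrative helpers that assume the limit is checked per stray directory and that unlinks are spread evenly across all ten, which the real MDS does not guarantee. ENOSPC is the errno that the client reports as "No space left on device".</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Values taken from the text above; treat them as the defaults of that era.
constexpr std::uint64_t kNumStray = 10;             // NUM_STRAY
constexpr std::uint64_t kFragmentSizeMax = 100000;  // mds_bal_fragment_size_max

// Total number of unlinked but not yet purged files the stray dirs can hold.
constexpr std::uint64_t stray_capacity() { return kNumStray * kFragmentSizeMax; }

// Hypothetical model of the space check for one stray directory:
// returns 0 if another dentry fits, -ENOSPC once the limit is reached.
inline int check_stray_space(std::uint64_t current_entries) {
  return current_entries >= kFragmentSizeMax ? -ENOSPC : 0;
}

// Hypothetical estimate: seconds until all stray dirs are full, given steady
// unlink and purge rates in files per second. Returns UINT64_MAX if purging
// keeps up and the directories never fill.
inline std::uint64_t seconds_until_full(std::uint64_t unlink_rate,
                                        std::uint64_t purge_rate) {
  if (unlink_rate <= purge_rate) return UINT64_MAX;
  std::uint64_t net_growth = unlink_rate - purge_rate;
  assert(net_growth > 0);
  return stray_capacity() / net_growth;
}
```

<p>Under these assumptions, unlinking 1,500 files per second while purging 500 per second fills the one million available slots in 1,000 seconds, after which further unlinks fail until the purge catches up.</p>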
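<p>The other case involves hard links, where the contents of the stray directory cannot be deleted. The mailing-list discussion suggests a latent bug in the code; our workload should not involve hard links. A brief note on Ceph's implementation: as in a local file system, a hard link increments the inode's reference count, and multiple dentries refer to the same inode. For performance, Ceph embeds the inode in the dentry. To tell whether a dentry is a hard link, this is abstracted as the linkage type, which is recorded as either primary or remote.</p>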
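<p>Too many files in a single directory hurts performance and also triggers other bugs, so enlarging the stray space in production is not recommended. As for deletion speed, there are options that adjust the concurrency, but deleting data consumes cluster bandwidth, and setting them too high affects normal use of the cluster. Since background cleanup is not urgent, keeping the defaults is recommended.</p>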
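<p>In production, it is worth adding stray-related metrics to monitor the size of the stray directories in real time and to notify users early if something looks abnormal. The Ceph source could also be modified to make the stray shard size a configurable option, so that users can adjust it as needed.</p>
<h1 id="源码分析">Source Code Analysis</h1>
<p>The key data structures follow; it is important to understand how a <strong>directory</strong>, a <strong>dentry</strong> and an <strong>inode</strong> relate to one another:</p>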
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// directory
</span><span class="k">class</span> <span class="nc">CDir</span> <span class="o">:</span> <span class="k">public</span> <span class="n">MDSCacheObject</span> <span class="p">{</span>
<span class="k">typedef</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">dentry_key_t</span><span class="p">,</span> <span class="n">CDentry</span><span class="o">*></span> <span class="n">map_t</span><span class="p">;</span>
<span class="n">map_t</span> <span class="n">items</span><span class="p">;</span> <span class="c1">// dentries contained in this directory
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="c1">// dentry
</span><span class="k">class</span> <span class="nc">CDentry</span> <span class="o">:</span> <span class="k">public</span> <span class="n">MDSCacheObject</span><span class="p">,</span> <span class="k">public</span> <span class="n">LRUObject</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">linkage_t</span> <span class="p">{</span> <span class="c1">// abstraction over the inode reference, mainly to support hard links: normally primary, remote for a hard link
</span> <span class="n">CInode</span> <span class="o">*</span><span class="n">inode</span><span class="p">;</span>
<span class="n">inodeno_t</span> <span class="n">remote_ino</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">remote_d_type</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">protected</span><span class="o">:</span>
<span class="n">linkage_t</span> <span class="n">linkage</span><span class="p">;</span> <span class="c1">// the inode is embedded in the dentry
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="c1">// inode
</span><span class="k">class</span> <span class="nc">CInode</span> <span class="o">:</span> <span class="k">public</span> <span class="n">MDSCacheObject</span><span class="p">,</span> <span class="k">public</span> <span class="n">InodeStoreBase</span> <span class="p">{</span>
<span class="k">protected</span><span class="o">:</span>
<span class="c1">// parent dentries in cache
</span> <span class="n">CDentry</span> <span class="o">*</span><span class="n">parent</span><span class="p">;</span> <span class="c1">// the common case
</span> <span class="n">compact_set</span><span class="o"><</span><span class="n">CDentry</span><span class="o">*></span> <span class="n">remote_parents</span><span class="p">;</span> <span class="c1">// hard links
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="cp">#define NUM_STRAY 10
</span><span class="k">class</span> <span class="nc">MDCache</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">CInode</span> <span class="o">*</span><span class="n">strays</span><span class="p">[</span><span class="n">NUM_STRAY</span><span class="p">];</span> <span class="c1">// inodes of the stray directories
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
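<p>To make the primary/remote distinction concrete, here is a minimal sketch of a linkage with the three states a dentry can be in. MiniInode, MiniLinkage and linkage_kind are simplified, hypothetical stand-ins modelled on the fields shown above, not the actual Ceph classes:</p>

```cpp
#include <cassert>
#include <cstdint>

struct MiniInode {
  std::uint64_t ino;
};

// Simplified model of CDentry::linkage_t.
struct MiniLinkage {
  MiniInode* inode = nullptr;    // set when the inode is embedded in this dentry
  std::uint64_t remote_ino = 0;  // set when this dentry is a hard link to another inode

  bool is_primary() const { return remote_ino == 0 && inode != nullptr; }
  bool is_remote() const { return remote_ino > 0; }
  bool is_null() const { return remote_ino == 0 && inode == nullptr; }
};

// Hypothetical helper: describe a linkage, for example for logging.
inline const char* linkage_kind(const MiniLinkage& l) {
  if (l.is_primary()) return "primary";
  if (l.is_remote()) return "remote";
  assert(l.is_null());  // the only remaining state
  return "null";
}
```

<p>A file created normally has one primary dentry that owns the inode. Each additional hard link is a remote dentry that records only the inode number, which is why the inode keeps a separate remote_parents set.</p>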
<p>The request is handled roughly as follows:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// The Server component maintains sessions and handles client requests; a delete arrives as an unlink request
</span><span class="kt">void</span> <span class="n">Server</span><span class="o">::</span><span class="n">handle_client_unlink</span><span class="p">(</span><span class="n">MDRequestRef</span><span class="o">&</span> <span class="n">mdr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">CDentry</span><span class="o">::</span><span class="n">linkage_t</span> <span class="o">*</span><span class="n">dnl</span> <span class="o">=</span> <span class="n">dn</span><span class="o">-></span><span class="n">get_linkage</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">mdr</span><span class="p">);</span>
<span class="n">straydn</span> <span class="o">=</span> <span class="n">prepare_stray_dentry</span><span class="p">(</span><span class="n">mdr</span><span class="p">,</span> <span class="n">dnl</span><span class="o">-></span><span class="n">get_inode</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">CDentry</span><span class="o">*</span> <span class="n">Server</span><span class="o">::</span><span class="n">prepare_stray_dentry</span><span class="p">(</span><span class="n">MDRequestRef</span><span class="o">&</span> <span class="n">mdr</span><span class="p">,</span> <span class="n">CInode</span> <span class="o">*</span><span class="n">in</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">CDir</span> <span class="o">*</span><span class="n">straydir</span> <span class="o">=</span> <span class="n">mdcache</span><span class="o">-></span><span class="n">get_stray_dir</span><span class="p">(</span><span class="n">in</span><span class="p">);</span> <span class="c1">// returns one dir from the strays array, which has 10 entries
</span> <span class="n">straydn</span> <span class="o">=</span> <span class="n">mdcache</span><span class="o">-></span><span class="n">get_or_create_stray_dentry</span><span class="p">(</span><span class="n">in</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">CDentry</span> <span class="o">*</span><span class="n">MDCache</span><span class="o">::</span><span class="n">get_or_create_stray_dentry</span><span class="p">(</span><span class="n">CInode</span> <span class="o">*</span><span class="n">in</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">CDir</span> <span class="o">*</span><span class="n">straydir</span> <span class="o">=</span> <span class="n">get_stray_dir</span><span class="p">(</span><span class="n">in</span><span class="p">);</span>
<span class="n">CDentry</span> <span class="o">*</span><span class="n">straydn</span> <span class="o">=</span> <span class="n">straydir</span><span class="o">-></span><span class="n">lookup</span><span class="p">(</span><span class="n">straydname</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">straydn</span><span class="p">)</span> <span class="p">{</span>
<span class="n">straydn</span> <span class="o">=</span> <span class="n">straydir</span><span class="o">-></span><span class="n">add_null_dentry</span><span class="p">(</span><span class="n">straydname</span><span class="p">);</span> <span class="c1">// add a dentry to the stray directory
</span> <span class="n">straydn</span><span class="o">-></span><span class="n">mark_new</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
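<p>After being added to the stray directory, the file has not actually been deleted yet; deletion is triggered by the periodic tick function:</p>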
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDSDaemon</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mds_rank</span><span class="p">)</span> <span class="p">{</span>
<span class="n">mds_rank</span><span class="o">-></span><span class="n">tick</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDSRankDispatcher</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_active</span><span class="p">()</span> <span class="o">||</span> <span class="n">is_stopping</span><span class="p">())</span> <span class="p">{</span>
<span class="n">mdcache</span><span class="o">-></span><span class="n">trim</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">MDCache</span><span class="o">::</span><span class="n">trim</span><span class="p">(</span><span class="kt">int</span> <span class="n">max</span><span class="p">,</span> <span class="kt">int</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">trim_inode</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">subtree_in</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">expiremap</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">MDCache</span><span class="o">::</span><span class="n">trim_inode</span><span class="p">(</span><span class="n">CDentry</span> <span class="o">*</span><span class="n">dn</span><span class="p">,</span> <span class="n">CInode</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="n">CDir</span> <span class="o">*</span><span class="n">con</span><span class="p">,</span> <span class="n">map</span><span class="o"><</span><span class="n">mds_rank_t</span><span class="p">,</span> <span class="n">MCacheExpire</span><span class="o">*>&</span> <span class="n">expiremap</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">maybe_eval_stray</span><span class="p">(</span><span class="n">in</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDCache</span><span class="o">::</span><span class="n">maybe_eval_stray</span><span class="p">(</span><span class="n">CInode</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">delay</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">dn</span><span class="o">-></span><span class="n">get_projected_linkage</span><span class="p">()</span><span class="o">-></span><span class="n">is_primary</span><span class="p">()</span> <span class="o">&&</span>
<span class="n">dn</span><span class="o">-></span><span class="n">get_dir</span><span class="p">()</span><span class="o">-></span><span class="n">get_inode</span><span class="p">()</span><span class="o">-></span><span class="n">is_stray</span><span class="p">())</span> <span class="p">{</span>
<span class="n">stray_manager</span><span class="p">.</span><span class="n">eval_stray</span><span class="p">(</span><span class="n">dn</span><span class="p">,</span> <span class="n">delay</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
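<p>The real implementation has many details to handle and is quite complex, but every path eventually calls maybe_eval_stray to decide whether the file can finally be deleted. The deletion itself is done by StrayManager's purge function; the rough flow is:</p>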
<blockquote>
<p>eval_stray -> __eval_stray -> enqueue -> consume -> purge</p>
</blockquote>
<p>When a purge completes, the callback _purge_stray_purged runs, and purging keeps iterating until the limits no longer allow it:</p>
<blockquote>
<p>_advance -> _purge_stray_purged -> _advance</p>
</blockquote>
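<p>The advance/purged cycle above is a throttled work queue: each completion frees a slot and immediately tries to start more work. A minimal single-threaded sketch of that pattern, assuming one concurrency limit in the spirit of mds_max_purge_files; MiniPurgeQueue and its methods are hypothetical names, and the real StrayManager also accounts for per-operation and per-PG limits:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical, single-threaded model of a throttled purge queue.
class MiniPurgeQueue {
 public:
  explicit MiniPurgeQueue(std::size_t max_files) : max_files_(max_files) {}

  // enqueue: remember the stray and try to start work right away.
  void enqueue(int stray_id) {
    ready_.push_back(stray_id);
    advance();
  }

  // Completion callback: free the slot, then keep iterating.
  void purged(int stray_id) {
    for (std::size_t i = 0; i < in_flight_.size(); ++i) {
      if (in_flight_[i] == stray_id) {
        in_flight_.erase(in_flight_.begin() + static_cast<std::ptrdiff_t>(i));
        ++done_;
        break;
      }
    }
    advance();
  }

  const std::vector<int>& in_flight() const { return in_flight_; }
  std::size_t queued() const { return ready_.size(); }
  std::size_t done() const { return done_; }

 private:
  // Start purges until the concurrency limit is hit or the queue is empty.
  void advance() {
    while (!ready_.empty() && in_flight_.size() < max_files_) {
      in_flight_.push_back(ready_.front());
      ready_.pop_front();
    }
    assert(in_flight_.size() <= max_files_);
  }

  std::size_t max_files_;
  std::deque<int> ready_;
  std::vector<int> in_flight_;
  std::size_t done_ = 0;
};
```

<p>With a limit of 2 and five strays enqueued, only two purges run at a time; the other three wait in the queue and are picked up one by one as purged() is called. This is also why a burst of unlinks can outrun the purge rate and let the stray directories fill up.</p>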
<h1 id="选项">Options</h1>
<p>Tunable options:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mds_bal_fragment_size_max // maximum number of files in a single directory
mds_max_purge_files // upper limit on files purged concurrently
mds_max_purge_ops // upper limit on purge operations in flight
mds_max_purge_ops_per_pg // upper limit on purge operations per PG
</code></pre></div></div>
<p>Monitoring counters:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="s2">"mds_cache"</span>: <span class="o">{</span>
<span class="s2">"num_strays"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries"</span>,
<span class="s2">"nick"</span>: <span class="s2">"stry"</span>
<span class="o">}</span>,
<span class="s2">"num_strays_purging"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries purging"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"num_strays_delayed"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries delayed"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"num_purge_ops"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Purge operations"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"strays_created"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries created"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"strays_purged"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries purged"</span>,
<span class="s2">"nick"</span>: <span class="s2">"purg"</span>
<span class="o">}</span>,
<span class="s2">"strays_reintegrated"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries reintegrated"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"strays_migrated"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"Stray dentries migrated"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"num_recovering_processing"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Files currently being recovered"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"num_recovering_enqueued"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Files waiting for recovery"</span>,
<span class="s2">"nick"</span>: <span class="s2">"recy"</span>
<span class="o">}</span>,
<span class="s2">"num_recovering_prioritized"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 2,
<span class="s2">"description"</span>: <span class="s2">"Files waiting for recovery with elevated priority"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"recovery_started"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"File recoveries started"</span>,
<span class="s2">"nick"</span>: <span class="s2">""</span>
<span class="o">}</span>,
<span class="s2">"recovery_completed"</span>: <span class="o">{</span>
<span class="s2">"type"</span>: 10,
<span class="s2">"description"</span>: <span class="s2">"File recoveries completed"</span>,
<span class="s2">"nick"</span>: <span class="s2">"recd"</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h1 id="参考">References</h1>
<p><a href="https://www.mail-archive.com/ceph-users@lists.ceph.com/msg43007.html">1.https://www.mail-archive.com/ceph-users@lists.ceph.com/msg43007.html</a></p>
<p><a href="http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013397.html">2.http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013397.html</a></p>
CephFS Kernel Client Throughput Limitation
2017-10-13T00:00:00+00:00
http://blog.wjin.org/posts/cephfs-kernel-client-throughput-limitation
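<h1 id="问题描述">Problem Description</h1>
<p>While testing the throughput of the CephFS kernel client, I found that a single client doing direct writes hits a ceiling and only gets close to 150 MB/s:</p>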
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/mnt/cephfs/jinwei_test/dd# dd if=/dev/zero of=test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.63581 s, 141 MB/s
root@n8-147-034:/mnt/cephfs/jinwei_test/dd#
</code></pre></div></div>
<p>The NIC traffic shows that the link is far from saturated:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:~# ifstat
eth0
KB/s in KB/s out
704.64 144464.6
330.61 144193.2
550.67 143631.0
604.46 143381.1
</code></pre></div></div>
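<p>Cluster load is also low and the OSD disks are mostly idle. Running the same test concurrently from several machines raises the aggregate throughput, so the limit appears to be a per-client bottleneck.</p>
<h1 id="源码分析">Source Code Analysis</h1>
<p>The cluster is not saturated and the network is not the bottleneck, so the root cause has to be found in the write I/O path of the CephFS kernel client.
The write path of the CephFS kernel client lives in fs/ceph/file.c:</p>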
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">ceph_file_fops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">ceph_open</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">ceph_release</span><span class="p">,</span>
<span class="p">.</span><span class="n">llseek</span> <span class="o">=</span> <span class="n">ceph_llseek</span><span class="p">,</span>
<span class="p">.</span><span class="n">read_iter</span> <span class="o">=</span> <span class="n">ceph_read_iter</span><span class="p">,</span>
<span class="p">.</span><span class="n">write_iter</span> <span class="o">=</span> <span class="n">ceph_write_iter</span><span class="p">,</span> <span class="c1">// hook function for file writes
</span> <span class="p">.</span><span class="n">mmap</span> <span class="o">=</span> <span class="n">ceph_mmap</span><span class="p">,</span>
<span class="p">.</span><span class="n">fsync</span> <span class="o">=</span> <span class="n">ceph_fsync</span><span class="p">,</span>
<span class="p">.</span><span class="n">lock</span> <span class="o">=</span> <span class="n">ceph_lock</span><span class="p">,</span>
<span class="p">.</span><span class="n">flock</span> <span class="o">=</span> <span class="n">ceph_flock</span><span class="p">,</span>
<span class="p">.</span><span class="n">splice_write</span> <span class="o">=</span> <span class="n">iter_file_splice_write</span><span class="p">,</span>
<span class="p">.</span><span class="n">unlocked_ioctl</span> <span class="o">=</span> <span class="n">ceph_ioctl</span><span class="p">,</span>
<span class="p">.</span><span class="n">compat_ioctl</span> <span class="o">=</span> <span class="n">ceph_ioctl</span><span class="p">,</span>
<span class="p">.</span><span class="n">fallocate</span> <span class="o">=</span> <span class="n">ceph_fallocate</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">ceph_write_iter</span><span class="p">(</span><span class="k">struct</span> <span class="n">kiocb</span> <span class="o">*</span><span class="n">iocb</span><span class="p">,</span> <span class="k">struct</span> <span class="n">iov_iter</span> <span class="o">*</span><span class="n">from</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">iocb</span><span class="o">-></span><span class="n">ki_flags</span> <span class="o">&</span> <span class="n">IOCB_DIRECT</span><span class="p">)</span>
<span class="n">written</span> <span class="o">=</span> <span class="n">ceph_direct_read_write</span><span class="p">(</span><span class="n">iocb</span><span class="p">,</span> <span class="o">&</span><span class="n">data</span><span class="p">,</span> <span class="n">snapc</span><span class="p">,</span> <span class="c1">// direct I/O path
</span> <span class="o">&</span><span class="n">prealloc_cf</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">ssize_t</span>
<span class="nf">ceph_direct_read_write</span><span class="p">(</span><span class="k">struct</span> <span class="n">kiocb</span> <span class="o">*</span><span class="n">iocb</span><span class="p">,</span> <span class="k">struct</span> <span class="n">iov_iter</span> <span class="o">*</span><span class="n">iter</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_snap_context</span> <span class="o">*</span><span class="n">snapc</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_cap_flush</span> <span class="o">**</span><span class="n">pcf</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iov_iter_count</span><span class="p">(</span><span class="n">iter</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">ceph_osdc_new_request</span><span class="p">(</span><span class="o">&</span><span class="n">fsc</span><span class="o">-></span><span class="n">client</span><span class="o">-></span><span class="n">osdc</span><span class="p">,</span> <span class="o">&</span><span class="n">ci</span><span class="o">-></span><span class="n">i_layout</span><span class="p">,</span> <span class="c1">// create the request
</span> <span class="n">vino</span><span class="p">,</span> <span class="n">pos</span><span class="p">,</span> <span class="o">&</span><span class="n">size</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
<span class="cm">/*include a 'startsync' command*/</span>
<span class="n">write</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="n">write</span> <span class="o">?</span> <span class="n">CEPH_OSD_OP_WRITE</span> <span class="o">:</span>
<span class="n">CEPH_OSD_OP_READ</span><span class="p">,</span>
<span class="n">flags</span><span class="p">,</span> <span class="n">snapc</span><span class="p">,</span>
<span class="n">ci</span><span class="o">-></span><span class="n">i_truncate_seq</span><span class="p">,</span>
<span class="n">ci</span><span class="o">-></span><span class="n">i_truncate_size</span><span class="p">,</span>
<span class="nb">false</span><span class="p">);</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">ceph_osdc_start_request</span><span class="p">(</span><span class="n">req</span><span class="o">-></span><span class="n">r_osdc</span><span class="p">,</span> <span class="n">req</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span> <span class="c1">// submit the request
</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ret</span><span class="p">)</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">ceph_osdc_wait_request</span><span class="p">(</span><span class="o">&</span><span class="n">fsc</span><span class="o">-></span><span class="n">client</span><span class="o">-></span><span class="n">osdc</span><span class="p">,</span> <span class="n">req</span><span class="p">);</span> <span class="c1">// wait for completion
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The implementation boils down to three steps: new_request, start_request and wait_request.
Intuitively the wait looks like a blocking call, so let's trace how it is implemented:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// net/ceph/osd_client.c
</span><span class="kt">int</span> <span class="nf">ceph_osdc_wait_request</span><span class="p">(</span><span class="k">struct</span> <span class="n">ceph_osd_client</span> <span class="o">*</span><span class="n">osdc</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_osd_request</span> <span class="o">*</span><span class="n">req</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">wait_request_timeout</span><span class="p">(</span><span class="n">req</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// the second argument (timeout) is 0
</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">wait_request_timeout</span><span class="p">(</span><span class="k">struct</span> <span class="n">ceph_osd_request</span> <span class="o">*</span><span class="n">req</span><span class="p">,</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// this call returns either when the condition in the first argument is met or when the timeout expires
</span> <span class="n">left</span> <span class="o">=</span> <span class="n">wait_for_completion_killable_timeout</span><span class="p">(</span><span class="o">&</span><span class="n">req</span><span class="o">-></span><span class="n">r_completion</span><span class="p">,</span>
<span class="n">ceph_timeout_jiffies</span><span class="p">(</span><span class="n">timeout</span><span class="p">));</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>First, the timeout. The value passed in is 0, which ends up as LONG_MAX, so the wait is effectively unbounded:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// include/linux/ceph/libceph.h
</span><span class="k">static</span> <span class="kr">inline</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">ceph_timeout_jiffies</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// this is the GNU extension to the C conditional operator; it is equivalent to: timeout ? timeout : MAX_SCHEDULE_TIMEOUT;
</span> <span class="k">return</span> <span class="n">timeout</span> <span class="o">?:</span> <span class="n">MAX_SCHEDULE_TIMEOUT</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// include/linux/sched.h
// the maximum value of long
</span><span class="cp">#define MAX_SCHEDULE_TIMEOUT LONG_MAX
</span></code></pre></div></div>
<p>Next, how the wait condition gets satisfied:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// kernel/sched/completion.c
</span><span class="kt">long</span> <span class="n">__sched</span>
<span class="nf">wait_for_completion_killable_timeout</span><span class="p">(</span><span class="k">struct</span> <span class="n">completion</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">wait_for_common</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">timeout</span><span class="p">,</span> <span class="n">TASK_KILLABLE</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">long</span> <span class="n">__sched</span>
<span class="nf">wait_for_common</span><span class="p">(</span><span class="k">struct</span> <span class="n">completion</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">,</span> <span class="kt">int</span> <span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// note the second argument, schedule_timeout
</span> <span class="k">return</span> <span class="n">__wait_for_common</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">schedule_timeout</span><span class="p">,</span> <span class="n">timeout</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">long</span> <span class="n">__sched</span>
<span class="nf">__wait_for_common</span><span class="p">(</span><span class="k">struct</span> <span class="n">completion</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">action</span><span class="p">)(</span><span class="kt">long</span><span class="p">),</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">,</span> <span class="kt">int</span> <span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">might_sleep</span><span class="p">();</span>
<span class="n">spin_lock_irq</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">.</span><span class="n">lock</span><span class="p">);</span>
<span class="n">timeout</span> <span class="o">=</span> <span class="n">do_wait_for_common</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">timeout</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
<span class="n">spin_unlock_irq</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">.</span><span class="n">lock</span><span class="p">);</span>
<span class="k">return</span> <span class="n">timeout</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">long</span> <span class="n">__sched</span>
<span class="nf">do_wait_for_common</span><span class="p">(</span><span class="k">struct</span> <span class="n">completion</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">action</span><span class="p">)(</span><span class="kt">long</span><span class="p">),</span> <span class="kt">long</span> <span class="n">timeout</span><span class="p">,</span> <span class="kt">int</span> <span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">x</span><span class="o">-></span><span class="n">done</span><span class="p">)</span> <span class="p">{</span>
<span class="n">DECLARE_WAITQUEUE</span><span class="p">(</span><span class="n">wait</span><span class="p">,</span> <span class="n">current</span><span class="p">);</span>
<span class="n">__add_wait_queue_tail_exclusive</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">,</span> <span class="o">&</span><span class="n">wait</span><span class="p">);</span>
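<span class="k">do</span> <span class="p">{</span> <span class="c1">// loop until x->done is set; this is the condition check. x is the completion that was passed along with the I/O request and is updated once the request finishes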
</span> <span class="k">if</span> <span class="p">(</span><span class="n">signal_pending_state</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">current</span><span class="p">))</span> <span class="p">{</span>
<span class="n">timeout</span> <span class="o">=</span> <span class="o">-</span><span class="n">ERESTARTSYS</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">__set_current_state</span><span class="p">(</span><span class="n">state</span><span class="p">);</span>
<span class="n">spin_unlock_irq</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">.</span><span class="n">lock</span><span class="p">);</span>
<span class="n">timeout</span> <span class="o">=</span> <span class="n">action</span><span class="p">(</span><span class="n">timeout</span><span class="p">);</span> <span class="c1">// action is the schedule_timeout passed in above
</span> <span class="n">spin_lock_irq</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">.</span><span class="n">lock</span><span class="p">);</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">x</span><span class="o">-></span><span class="n">done</span> <span class="o">&&</span> <span class="n">timeout</span><span class="p">);</span>
<span class="n">__remove_wait_queue</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="o">-></span><span class="n">wait</span><span class="p">,</span> <span class="o">&</span><span class="n">wait</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">x</span><span class="o">-></span><span class="n">done</span><span class="p">)</span>
<span class="k">return</span> <span class="n">timeout</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">x</span><span class="o">-></span><span class="n">done</span><span class="o">--</span><span class="p">;</span>
<span class="k">return</span> <span class="n">timeout</span> <span class="o">?:</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>According to the kernel comment, schedule_timeout simply sleeps until the timeout expires:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/**
* schedule_timeout - sleep until timeout
* @timeout: timeout value in jiffies
*
* Make the current task sleep until @timeout jiffies have
 * elapsed.
 * ......
 */
signed long __sched schedule_timeout(signed long timeout)
{
    ......
}
</code></pre></div></div>
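<p>The source makes it fairly clear: after a request is submitted, the next one is issued only once it has completed, so I/O is not sent to the backend cluster concurrently.</p>
<p>The next question is how the size of each request is determined.
It depends on the file's layout and on the current write position. When writing from file offset 0 with the default layout, a request is at most one Ceph object in size, i.e. 4 MB.
The layout is explained in the <a href="http://docs.ceph.com/docs/master/architecture/#data-striping">official documentation</a>.</p>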
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// net/ceph/osd_client.c
</span><span class="k">struct</span> <span class="n">ceph_osd_request</span> <span class="o">*</span><span class="nf">ceph_osdc_new_request</span><span class="p">(</span><span class="k">struct</span> <span class="n">ceph_osd_client</span> <span class="o">*</span><span class="n">osdc</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_file_layout</span> <span class="o">*</span><span class="n">layout</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_vino</span> <span class="n">vino</span><span class="p">,</span>
<span class="n">u64</span> <span class="n">off</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">plen</span><span class="p">,</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">which</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num_ops</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">opcode</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
<span class="k">struct</span> <span class="n">ceph_snap_context</span> <span class="o">*</span><span class="n">snapc</span><span class="p">,</span>
<span class="n">u32</span> <span class="n">truncate_seq</span><span class="p">,</span>
<span class="n">u64</span> <span class="n">truncate_size</span><span class="p">,</span>
<span class="n">bool</span> <span class="n">use_mempool</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="cm">/* calculate max write size */</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">calc_layout</span><span class="p">(</span><span class="n">layout</span><span class="p">,</span> <span class="n">off</span><span class="p">,</span> <span class="n">plen</span><span class="p">,</span> <span class="o">&</span><span class="n">objnum</span><span class="p">,</span> <span class="o">&</span><span class="n">objoff</span><span class="p">,</span> <span class="o">&</span><span class="n">objlen</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="cm">/*
* calculate the mapping of a file extent onto an object, and fill out the
* request accordingly. shorten extent as necessary if it crosses an
* object boundary.
*
* fill osd op in request message.
*/</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">calc_layout</span><span class="p">(</span><span class="k">struct</span> <span class="n">ceph_file_layout</span> <span class="o">*</span><span class="n">layout</span><span class="p">,</span> <span class="n">u64</span> <span class="n">off</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">plen</span><span class="p">,</span>
<span class="n">u64</span> <span class="o">*</span><span class="n">objnum</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">objoff</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">objlen</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// compute the mapping from the layout
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ceph_calc_file_object_mapping</span><span class="p">(</span><span class="n">layout</span><span class="p">,</span> <span class="n">off</span><span class="p">,</span> <span class="n">orig_len</span><span class="p">,</span> <span class="n">objnum</span><span class="p">,</span>
<span class="n">objoff</span><span class="p">,</span> <span class="n">objlen</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
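<h1 id="实验证明">Experimental Verification</h1>
<h3 id="调整文件属性">Adjusting the File Layout</h3>
<p>To make the latency easier to observe, change the file's layout so that the object size grows from 4 MB to 64 MB:</p>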
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/mnt/cephfs/jinwei_test/dd# touch foo
root@n8-147-034:/mnt/cephfs/jinwei_test/dd# getfattr <span class="nt">-n</span> ceph.file.layout foo
<span class="c"># file: foo</span>
ceph.file.layout<span class="o">=</span><span class="s2">"stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"</span>
root@n8-147-034:/mnt/cephfs/jinwei_test/dd# setfattr <span class="nt">-n</span> ceph.file.layout.object_size <span class="nt">-v</span> 67108864 foo
root@n8-147-034:/mnt/cephfs/jinwei_test/dd# setfattr <span class="nt">-n</span> ceph.file.layout.stripe_unit <span class="nt">-v</span> 67108864 foo
root@n8-147-034:/mnt/cephfs/jinwei_test/dd# getfattr <span class="nt">-n</span> ceph.file.layout foo
<span class="c"># file: foo</span>
ceph.file.layout<span class="o">=</span><span class="s2">"stripe_unit=67108864 stripe_count=1 object_size=67108864 pool=cephfs_data"</span>
root@n8-147-034:/mnt/cephfs/jinwei_test/dd#
</code></pre></div></div>
<h3 id="获取文件inode">Getting the File's Inode</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/mnt/cephfs/jinwei_test/dd# stat <span class="nt">-c</span> %i foo | xargs <span class="nb">printf</span> <span class="s1">'%x\n'</span>
10000000388
</code></pre></div></div>
<h3 id="文件对应的对象">Objects Backing the File</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># the file is still empty, so no objects exist yet</span>
root@n10-075-086:~# rados <span class="nt">-p</span> cephfs_data <span class="nb">ls</span> | <span class="nb">grep </span>10000000388
<span class="c"># write two objects</span>
root@n8-147-034:/mnt/cephfs/jinwei_test/dd# dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>foo <span class="nv">bs</span><span class="o">=</span>64M <span class="nv">count</span><span class="o">=</span>2 <span class="nv">oflag</span><span class="o">=</span>direct
2+0 records <span class="k">in
</span>2+0 records out
134217728 bytes <span class="o">(</span>134 MB<span class="o">)</span> copied, 0.958122 s, 140 MB/s
root@n8-147-034:/mnt/cephfs/jinwei_test/dd#
<span class="c"># check again</span>
root@n10-075-086:~# rados <span class="nt">-p</span> cephfs_data <span class="nb">ls</span> | <span class="nb">grep </span>10000000388
10000000388.00000000
10000000388.00000001
root@n10-075-086:~#
</code></pre></div></div>
<p>Look up the OSDs that hold the two objects; their primaries are osd.121 and osd.130 respectively:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n10-075-086:~# ceph osd map cephfs_data 10000000388.00000000
osdmap e2231 pool 'cephfs_data' (2) object '10000000388.00000000' -> pg 2.201b2309 (2.309) -> up ([121,95,41], p121) acting ([121,95,41], p121)
root@n10-075-086:~# ceph osd map cephfs_data 10000000388.00000001
osdmap e2231 pool 'cephfs_data' (2) object '10000000388.00000001' -> pg 2.5052a9b5 (2.9b5) -> up ([130,77,32], p130) acting ([130,77,32], p130)
root@n10-075-086:~#
</code></pre></div></div>
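<p>Run the same dd command again while watching the ops on the two primary OSDs (121 and 130), and at the same time use ftrace to observe the write path in the kernel client.</p>
<h3 id="osd机器op请求">OP Requests on the OSD Hosts</h3>
<p>Use the following command to inspect an OSD's ops, where ID is 121 or 130 from above:</p>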
<blockquote>
<p>ceph daemon osd.ID dump_historic_ops</p>
</blockquote>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"osd_op(client.717850.1:5196 2.201b2309 10000000388.00000000 [] snapc 1=[] RETRY=1 ondisk+retry+write+ordersnap+known_if_redirected e2236)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"initiated_at"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.049346"</span><span class="p">,</span><span class="w">
</span><span class="s2">"age"</span><span class="p">:</span><span class="w"> </span><span class="mf">47.697939</span><span class="p">,</span><span class="w">
</span><span class="s2">"duration"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.426153</span><span class="p">,</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">total duration is about 426 ms</span><span class="w">
</span><span class="s2">"type_data"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"commit sent; apply or cleanup"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"client"</span><span class="p">:</span><span class="w"> </span><span class="s2">"client.717850"</span><span class="p">,</span><span class="w">
</span><span class="s2">"tid"</span><span class="p">:</span><span class="w"> </span><span class="mi">5196</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.049346"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"initiated"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.121712"</span><span class="p">,</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">the network layer took about 70 ms to read the message; for one 64 MB object on a 10 GbE NIC that is roughly the time needed</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"queued_for_pg"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.121801"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"reached_pg"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.121867"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"started"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.125718"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"waiting for subops from 41,95"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.164887"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commit_queued_for_journal_write"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.164955"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"write_thread_in_journal_buffer"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.356237"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"journaled_completion_queued"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.356259"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"op_commit"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.394599"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"op_applied"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.465524"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sub_op_commit_rec from 41"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.475450"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sub_op_commit_rec from 95"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.475473"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commit_sent"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.475499"</span><span class="p">,</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">op end time</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"done"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
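<p>The above is from osd 121. The object being written is 10000000388.00000000 and the op lasted 426.153 ms, with most of the time going to the network read of the data and to waiting for the replica operations. The op started at <strong>16:04:19.049346</strong> and finished at <strong>16:04:19.475499</strong>.</p>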
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"osd_op(client.717850.1:5197 2.5052a9b5 10000000388.00000001 [] snapc 1=[] ondisk+write+ordersnap+known_if_redirected e2236)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"initiated_at"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.491627"</span><span class="p">,</span><span class="w">
</span><span class="s2">"age"</span><span class="p">:</span><span class="w"> </span><span class="mf">73.131756</span><span class="p">,</span><span class="w">
</span><span class="s2">"duration"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.439539</span><span class="p">,</span><span class="w">
</span><span class="s2">"type_data"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"commit sent; apply or cleanup"</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"client"</span><span class="p">:</span><span class="w"> </span><span class="s2">"client.717850"</span><span class="p">,</span><span class="w">
</span><span class="s2">"tid"</span><span class="p">:</span><span class="w"> </span><span class="mi">5197</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.491627"</span><span class="p">,</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">op start time; note that it is after the previous op finished</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"initiated"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.558547"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"queued_for_pg"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.558630"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"reached_pg"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.558777"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"started"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.562634"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"waiting for subops from 32,77"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.580423"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commit_queued_for_journal_write"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.580483"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"write_thread_in_journal_buffer"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.783500"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"journaled_completion_queued"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.783521"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"op_commit"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.820925"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"op_applied"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.917342"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sub_op_commit_rec from 77"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.931116"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sub_op_commit_rec from 32"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.931140"</span><span class="p">,</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commit_sent"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2017-10-13 16:04:19.931166"</span><span class="p">,</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">op end time</span><span class="w">
</span><span class="s2">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"done"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is from osd 130. The object is 10000000388.00000001 and the op lasted 439.539 ms. The op started at <strong>16:04:19.491627</strong> and finished at <strong>16:04:19.931166</strong>.</p>
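<p>The durations quoted for the two ops can be re-derived from the op tracker timestamps. The helper below is only a sketch for checking the arithmetic: the function names are made up for this example, and it assumes both timestamps fall on the same day and carry exactly six fractional digits, as in the dumps above.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Convert the "HH:MM:SS.ffffff" part of an op tracker timestamp into
// microseconds since midnight. Assumes exactly six fractional digits.
int64_t time_of_day_us(const std::string& stamp) {
  const std::string t = stamp.substr(stamp.find(' ') + 1);  // drop the date
  assert(t.size() == 15);  // "HH:MM:SS.ffffff"
  const int64_t h = std::stoll(t.substr(0, 2));
  const int64_t m = std::stoll(t.substr(3, 2));
  const int64_t s = std::stoll(t.substr(6, 2));
  const int64_t us = std::stoll(t.substr(9, 6));
  return ((h * 60 + m) * 60 + s) * 1000000 + us;
}

// Elapsed microseconds between two events recorded on the same day.
int64_t elapsed_us(const std::string& from, const std::string& to) {
  return time_of_day_us(to) - time_of_day_us(from);
}
```

<p>Applied to the "initiated" and "done" events, this gives 426153 µs for the first op and 439539 µs for the second, and 16128 µs between the first op's "done" and the second op's "initiated". The same helper works for any pair of events, for example the 203017 µs between "write_thread_in_journal_buffer" and "journaled_completion_queued" in the second op.</p>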
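<p>It is clear that the first object is written first and the second object only afterwards; there is no concurrent writing across objects. This differs from block storage: in the librbd implementation at least, when one IO maps to several objects, the per-object requests are all issued at once rather than waiting for the first object to complete before sending the IO for the second. See the following code:</p>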
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">I</span><span class="o">></span>
<span class="kt">void</span> <span class="n">AbstractAioImageWrite</span><span class="o"><</span><span class="n">I</span><span class="o">>::</span><span class="n">send_object_requests</span><span class="p">(</span>
<span class="k">const</span> <span class="n">ObjectExtents</span> <span class="o">&</span><span class="n">object_extents</span><span class="p">,</span> <span class="k">const</span> <span class="o">::</span><span class="n">SnapContext</span> <span class="o">&</span><span class="n">snapc</span><span class="p">,</span>
<span class="n">AioObjectRequests</span> <span class="o">*</span><span class="n">aio_object_requests</span><span class="p">)</span> <span class="p">{</span>
<span class="n">I</span> <span class="o">&</span><span class="n">image_ctx</span> <span class="o">=</span> <span class="k">this</span><span class="o">-></span><span class="n">m_image_ctx</span><span class="p">;</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="n">image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">;</span>
<span class="n">AioCompletion</span> <span class="o">*</span><span class="n">aio_comp</span> <span class="o">=</span> <span class="k">this</span><span class="o">-></span><span class="n">m_aio_comp</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">ObjectExtents</span><span class="o">::</span><span class="n">const_iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">object_extents</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">p</span> <span class="o">!=</span> <span class="n">object_extents</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="n">C_AioRequest</span> <span class="o">*</span><span class="n">req_comp</span> <span class="o">=</span> <span class="k">new</span> <span class="n">C_AioRequest</span><span class="p">(</span><span class="n">aio_comp</span><span class="p">);</span>
<span class="n">AioObjectRequestHandle</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">create_object_request</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="n">snapc</span><span class="p">,</span>
<span class="n">req_comp</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">request</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">aio_object_requests</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aio_object_requests</span><span class="o">-></span><span class="n">push_back</span><span class="p">(</span><span class="n">request</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">request</span><span class="o">-></span><span class="n">send</span><span class="p">();</span> <span class="c1">// send the request
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
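<p>The practical difference between the two dispatch strategies can be estimated with a very rough model. This is a sketch only: the function names are invented here, and the concurrent case is idealized (no queueing or contention between the per-object requests).</p>

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Total time when each object write waits for the previous one to finish,
// as in the direct-IO CephFS path analysed above.
double sequential_total_ms(const std::vector<double>& per_object_ms) {
  return std::accumulate(per_object_ms.begin(), per_object_ms.end(), 0.0);
}

// Total time when all object writes are issued at once and the caller
// waits for the slowest one (idealized: no queueing or contention).
double concurrent_total_ms(const std::vector<double>& per_object_ms) {
  assert(!per_object_ms.empty());
  return *std::max_element(per_object_ms.begin(), per_object_ms.end());
}
```

<p>With the two op durations measured above (426.153 ms and 439.539 ms), the sequential model gives about 865.7 ms, while the idealized concurrent model gives 439.539 ms.</p>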
<h3 id="写文件的客户端ftrace信息">Client-side ftrace output while writing a file</h3>
<p>Steps to enable ftrace:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># set the tracer type</span>
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">echo </span><span class="k">function</span> <span class="o">></span> current_tracer
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">cat </span>current_tracer
<span class="k">function</span>
<span class="c"># add the functions to trace; these are the functions from the code analysed earlier</span>
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">echo </span>ceph_osdc_new_request <span class="o">>></span> set_ftrace_filter
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">echo </span>ceph_osdc_start_request <span class="o">>></span> set_ftrace_filter
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">echo </span>ceph_osdc_wait_request <span class="o">>></span> set_ftrace_filter
<span class="c"># check that the filter is correct</span>
root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">cat </span>set_ftrace_filter
ceph_osdc_new_request <span class="o">[</span>libceph]
ceph_osdc_wait_request <span class="o">[</span>libceph]
ceph_osdc_start_request <span class="o">[</span>libceph]
root@n8-147-034:/sys/kernel/debug/tracing#
</code></pre></div></div>
<p>Inspect the trace:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/sys/kernel/debug/tracing# <span class="nb">cat </span>trace
<span class="c"># tracer: function</span>
<span class="c">#</span>
<span class="c"># entries-in-buffer/entries-written: 6/6 #P:48</span>
<span class="c">#</span>
<span class="c"># _-----=> irqs-off</span>
<span class="c"># / _----=> need-resched</span>
<span class="c"># | / _---=> hardirq/softirq</span>
<span class="c"># || / _--=> preempt-depth</span>
<span class="c"># ||| / delay</span>
<span class="c"># TASK-PID CPU# |||| TIMESTAMP FUNCTION</span>
<span class="c"># | | | |||| | |</span>
<...>-118883 <span class="o">[</span>005] .... 3978299.661306: ceph_osdc_new_request <<span class="nt">-ceph_direct_read_write</span>
<...>-118883 <span class="o">[</span>005] .... 3978299.662699: ceph_osdc_start_request <<span class="nt">-ceph_direct_read_write</span>
<...>-118883 <span class="o">[</span>005] .... 3978299.662806: ceph_osdc_wait_request <<span class="nt">-ceph_direct_read_write</span>
<span class="c"># a pause of about 500 ms</span>
<...>-118883 <span class="o">[</span>005] .... 3978300.175224: ceph_osdc_new_request <<span class="nt">-ceph_direct_read_write</span>
<...>-118883 <span class="o">[</span>005] .... 3978300.176661: ceph_osdc_start_request <<span class="nt">-ceph_direct_read_write</span>
<...>-118883 <span class="o">[</span>005] .... 3978300.176749: ceph_osdc_wait_request <<span class="nt">-ceph_direct_read_write</span>
root@n8-147-034:/sys/kernel/debug/tracing#
</code></pre></div></div>
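<p>Roughly 500 ms pass before the next request starts. The OSD-side analysis above shows the first IO taking 426 ms to complete. Add the network latency of the OSD notifying the kernel client and the kernel scheduling latency, and the numbers roughly match.</p>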
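<p>The gap between the two requests can be read off the trace mechanically. The parser below is a minimal sketch written for this post's trace format, not a general ftrace parser: the struct and function names are invented, and it assumes the task name contains no spaces and the timestamp has six fractional digits.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct TraceEvent {
  int64_t timestamp_us;  // ftrace timestamp in microseconds
  std::string function;  // traced function name
};

// Parse one line of the `function` tracer output shown above, e.g.
//   <...>-118883 [005] .... 3978299.661306: ceph_osdc_new_request <-caller
TraceEvent parse_trace_line(const std::string& line) {
  std::istringstream in(line);
  std::string task, cpu, flags, stamp, function;
  in >> task >> cpu >> flags >> stamp >> function;
  assert(!stamp.empty() && stamp.back() == ':');
  stamp.pop_back();  // drop the trailing ':'
  const std::size_t dot = stamp.find('.');
  assert(dot != std::string::npos && stamp.size() - dot - 1 == 6);
  const int64_t sec = std::stoll(stamp.substr(0, dot));
  const int64_t usec = std::stoll(stamp.substr(dot + 1));
  return TraceEvent{sec * 1000000 + usec, function};
}
```

<p>Subtracting the timestamp of the first ceph_osdc_wait_request from that of the second ceph_osdc_new_request gives 512418 µs. Taking away the 426153 µs the OSD reported for the first op leaves about 86 ms for the reply's network latency and kernel scheduling.</p>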
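<h1 id="结论">Conclusion</h1>
<p>The source code analysis, verified from both the cluster's OSD side and the kernel client side, shows that CephFS performance is indeed limited in the direct IO case. Users need not worry too much about throughput, though: writes are usually not direct, and the kernel client has a page cache, so writes are very fast:</p>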
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/mnt/cephfs/jinwei_test/dd# dd if=/dev/zero of=foo bs=64M count=1000
1000+0 records in
1000+0 records out
67108864000 bytes (67 GB) copied, 53.5032 s, 1.3 GB/s
root@n8-147-034:/mnt/cephfs/jinwei_test/dd#
</code></pre></div></div>
<p>Closer to real usage, the user writes the data first and calls sync once at the end:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@n8-147-034:/mnt/cephfs/jinwei_test/dd# dd if=/dev/zero of=foo bs=64M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
67108864000 bytes (67 GB) copied, 68.61 s, 978 MB/s
root@n8-147-034:/mnt/cephfs/jinwei_test/dd#
</code></pre></div></div>
CephFS Filesystem Init
2017-09-15T00:00:00+00:00
http://blog.wjin.org/posts/cephfs-filesystem-init
<h1 id="introduction">Introduction</h1>
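<p>File storage is one of the three interfaces offered by the Ceph distributed storage system. It is POSIX-compatible, and once mounted it can be used like a local ext4 or xfs filesystem.
Deployment requires the mds daemon, which is specific to CephFS. The mds itself stores no data; data still lives on the OSDs.
However, to access a file in the filesystem, a client must first be authorized by the mds and obtain the file's inode and related information; only then does it know which osd to send IO requests to.</p>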
<h1 id="daemon-start">Daemon Start</h1>
<p>The mds daemon is implemented in src/ceph_mds.cc and is fairly simple:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// create the messenger used for communication
</span> <span class="n">Messenger</span> <span class="o">*</span><span class="n">msgr</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span>
<span class="n">entity_name_t</span><span class="o">::</span><span class="n">MDS</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="s">"mds"</span><span class="p">,</span>
<span class="n">nonce</span><span class="p">);</span>
<span class="c1">// start receiving messages
</span> <span class="n">msgr</span><span class="o">-></span><span class="n">start</span><span class="p">();</span>
<span class="c1">// create the mds daemon object
</span> <span class="n">mds</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MDSDaemon</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">().</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">msgr</span><span class="p">,</span> <span class="o">&</span><span class="n">mc</span><span class="p">);</span>
<span class="c1">// initialize the daemon
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">mds</span><span class="o">-></span><span class="n">init</span><span class="p">();</span>
<span class="c1">// wait until shutdown
</span> <span class="n">msgr</span><span class="o">-></span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">class</span> <span class="nc">MDSDaemon</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Dispatcher</span><span class="p">,</span> <span class="k">public</span> <span class="n">md_config_obs_t</span> <span class="p">{</span>
<span class="n">Beacon</span> <span class="n">beacon</span><span class="p">;</span> <span class="c1">// dedicated to exchanging beacon messages with the mon; think of it as the mds-to-mon heartbeat
</span> <span class="n">Messenger</span> <span class="o">*</span><span class="n">messenger</span><span class="p">;</span> <span class="c1">// network communication component
</span> <span class="n">MonClient</span> <span class="o">*</span><span class="n">monc</span><span class="p">;</span> <span class="c1">// client for talking to the mon
</span> <span class="n">MDSMap</span> <span class="o">*</span><span class="n">mdsmap</span><span class="p">;</span> <span class="c1">// mdsmap information
</span> <span class="n">Objecter</span> <span class="o">*</span><span class="n">objecter</span><span class="p">;</span> <span class="c1">// client for talking to the osds
</span> <span class="n">MDSRankDispatcher</span> <span class="o">*</span><span class="n">mds_rank</span><span class="p">;</span> <span class="c1">// the class that actually implements the filesystem core
</span><span class="p">};</span>
</code></pre></div></div>
<p>The main class is MDSDaemon. Here is what its initialization does:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">MDSDaemon</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// initialize messenger, objecter and monc, and subscribe to the maps we need
</span> <span class="n">monc</span><span class="o">-></span><span class="n">sub_want</span><span class="p">(</span><span class="s">"mdsmap"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="c1">// initialize the beacon object
</span> <span class="n">beacon</span><span class="p">.</span><span class="n">init</span><span class="p">(</span><span class="n">mdsmap</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Beacon</span><span class="o">::</span><span class="n">init</span><span class="p">(</span><span class="n">MDSMap</span> <span class="k">const</span> <span class="o">*</span><span class="n">mdsmap</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_send</span><span class="p">();</span> <span class="c1">// send beacon messages to the mon periodically
</span><span class="p">}</span>
</code></pre></div></div>
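<p>Beacon is split out into its own class rather than implemented inside MDSDaemon mainly so that beacon messages are not delayed by contention on mds_lock when the daemon is busy.
Once started, the daemon waits for mdsmap and beacon messages and switches state accordingly: which daemon is Active, which is Standby, which filesystems a daemon serves, and so on.
Changes to the mdsmap, like changes to the other cluster maps, must of course go through a paxos decision, so the next step is to look at the maps involved in the filesystem.</p>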
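<p>The reason for giving Beacon its own class can be sketched in a few lines. The types below are hypothetical stand-ins, not the real Ceph classes: the only point is that the beacon takes its own lock and never mds_lock, so a heartbeat can still go out while the daemon's big lock is held.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-ins for illustration; not the real Ceph classes.
struct SketchBeacon {
  std::mutex lock;        // the beacon's own lock, independent of mds_lock
  uint64_t last_seq = 0;  // sequence number of the last beacon sent

  // "Send" one beacon. Only the beacon's own lock is taken.
  uint64_t send() {
    std::lock_guard<std::mutex> guard(lock);
    return ++last_seq;
  }
};

struct SketchDaemon {
  std::mutex mds_lock;  // big lock protecting the daemon's state
  SketchBeacon beacon;
};

// A beacon still goes out while the daemon's big lock is held.
uint64_t send_beacon_while_busy(SketchDaemon& daemon) {
  std::lock_guard<std::mutex> busy(daemon.mds_lock);  // daemon is "busy"
  const uint64_t seq = daemon.beacon.send();          // does not touch mds_lock
  assert(seq > 0);
  return seq;
}
```

<p>If send() had to take mds_lock instead, the call inside send_beacon_while_busy() would block for as long as the daemon stayed busy.</p>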
<h1 id="fsmap--mdsmap--mdsmonitor">FSMap & MDSMap & MDSMonitor</h1>
<h3 id="fsmap">FSMap</h3>
<p>FSMap records every filesystem in the cluster, since a single cluster can host multiple filesystems (although this feature is not yet stable).</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Filesystem</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">fs_cluster_id_t</span> <span class="n">fscid</span><span class="p">;</span> <span class="c1">// filesystem id
</span> <span class="n">MDSMap</span> <span class="n">mds_map</span><span class="p">;</span> <span class="c1">// the corresponding mds information
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">FSMap</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">fs_cluster_id_t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Filesystem</span><span class="o">></span> <span class="o">></span> <span class="n">filesystems</span><span class="p">;</span> <span class="c1">// all filesystems
</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">mds_gid_t</span><span class="p">,</span> <span class="n">fs_cluster_id_t</span><span class="o">></span> <span class="n">mds_roles</span><span class="p">;</span> <span class="c1">// maps each mds to its filesystem; with multiple active mds daemons, several map to the same filesystem
</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">mds_gid_t</span><span class="p">,</span> <span class="n">MDSMap</span><span class="o">::</span><span class="n">mds_info_t</span><span class="o">></span> <span class="n">standby_daemons</span><span class="p">;</span> <span class="c1">// standby mds daemons
</span><span class="p">};</span>
</code></pre></div></div>
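<p>How mds_roles ties daemons to filesystems can be illustrated with a stripped-down model. This is a sketch under simplifying assumptions: the two id types are reduced to plain integers, only the mds_roles member is modelled, and the helper functions are invented for this example.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-ins for the Ceph id types (plain integers here).
using fs_cluster_id_t = int32_t;
using mds_gid_t = uint64_t;

// Only the gid -> filesystem mapping of FSMap is modelled.
struct SketchFSMap {
  std::map<mds_gid_t, fs_cluster_id_t> mds_roles;

  // Filesystem served by one daemon; the daemon must have a role.
  fs_cluster_id_t role_of(mds_gid_t gid) const {
    const auto it = mds_roles.find(gid);
    assert(it != mds_roles.end());
    return it->second;
  }

  // All daemons assigned to one filesystem. With multiple active mds
  // daemons, several gids map to the same fscid.
  std::vector<mds_gid_t> daemons_for(fs_cluster_id_t fscid) const {
    std::vector<mds_gid_t> out;
    for (const auto& entry : mds_roles) {
      if (entry.second == fscid) out.push_back(entry.first);
    }
    return out;
  }
};
```

<p>A daemon that appears in standby_daemons instead would simply have no entry in mds_roles in this model.</p>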
<h3 id="mdsmap">MDSMap</h3>
<p>Each filesystem has one mdsmap, which records information about the mds daemons serving that filesystem. The main fields are:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MDSMap</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">set</span><span class="o"><</span><span class="kt">int64_t</span><span class="o">></span> <span class="n">data_pools</span><span class="p">;</span> <span class="c1">// pool ids of the data pools; one filesystem can have several data pools
</span> <span class="kt">int64_t</span> <span class="n">metadata_pool</span><span class="p">;</span> <span class="c1">// pool id of the metadata pool
</span>
<span class="n">mds_rank_t</span> <span class="n">max_mds</span><span class="p">;</span> <span class="c1">// maximum number of active mds daemons for this filesystem
</span>
<span class="c1">// rank-related information
</span> <span class="n">std</span><span class="o">::</span><span class="n">set</span><span class="o"><</span><span class="n">mds_rank_t</span><span class="o">></span> <span class="n">in</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">set</span><span class="o"><</span><span class="n">mds_rank_t</span><span class="o">></span> <span class="n">failed</span><span class="p">,</span> <span class="n">stopped</span><span class="p">,</span> <span class="n">damaged</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">mds_rank_t</span><span class="p">,</span> <span class="n">mds_gid_t</span><span class="o">></span> <span class="n">up</span><span class="p">;</span>
<span class="c1">// per-mds details
</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">mds_gid_t</span><span class="p">,</span> <span class="n">mds_info_t</span><span class="o">></span> <span class="n">mds_info</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
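<p>The nested DaemonState enum defines all the states an mds can be in; see the relevant documentation for the state transitions. mds_info_t holds the details of a single mds, such as its id, name, and state.</p>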
<h3 id="mdsmonitor">MDSMonitor</h3>
<p>This class is a paxos service (see the earlier monitor articles). Any update to the fsmap, including mds state changes, must go through paxos:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MDSMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">PaxosService</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">FSMap</span> <span class="n">fsmap</span><span class="p">;</span> <span class="c1">// current map
</span> <span class="n">FSMap</span> <span class="n">pending_fsmap</span><span class="p">;</span> <span class="c1">// map awaiting a paxos decision
</span>
<span class="k">struct</span> <span class="n">beacon_info_t</span> <span class="p">{</span>
<span class="n">utime_t</span> <span class="n">stamp</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">;</span>
<span class="p">};</span>
<span class="n">map</span><span class="o"><</span><span class="n">mds_gid_t</span><span class="p">,</span> <span class="n">beacon_info_t</span><span class="o">></span> <span class="n">last_beacon</span><span class="p">;</span> <span class="c1">// records each daemon's beacon information
</span><span class="p">};</span>
</code></pre></div></div>
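<p>One natural use of last_beacon is spotting daemons that have stopped sending heartbeats. The function below is a sketch of that idea, not the monitor's actual code: find_laggy and its grace parameter are assumptions made for this example, and timestamps are reduced to seconds stored in a double instead of utime_t.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using mds_gid_t = uint64_t;  // simplified stand-in for the Ceph type

struct beacon_info_t {
  double stamp;  // arrival time of the last beacon, in seconds (utime_t in Ceph)
  uint64_t seq;  // sequence number of that beacon
};

// Return the daemons whose last beacon is at least `grace` seconds old.
// `grace` acts as a heartbeat timeout chosen by the caller.
std::vector<mds_gid_t> find_laggy(
    const std::map<mds_gid_t, beacon_info_t>& last_beacon,
    double now, double grace) {
  assert(grace > 0);
  std::vector<mds_gid_t> laggy;
  for (const auto& entry : last_beacon) {
    if (now - entry.second.stamp >= grace) laggy.push_back(entry.first);
  }
  return laggy;
}
```

<p>A monitor could run such a check periodically and then propose a map change for the daemons it returns.</p>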
<h1 id="create-filesystem">Create FileSystem</h1>
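<p>Right after the cluster is deployed there is no filesystem and every mds is in the standby state. The user must create a filesystem before using it:</p>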
<blockquote>
<p>ceph fs new fs_name metadata_pool data_pool</p>
</blockquote>
<p>Clearly, running this command updates the fsmap. The paxos flow is described <a href="http://blog.wjin.org/posts/ceph-monitor-paxosservice.html">here</a>; roughly:</p>
<blockquote>
<p>MDSMonitor::prepare_update() -> MDSMonitor::prepare_command() -> MDSMonitor::management_command() -> MDSMonitor::create_new_fs()</p>
</blockquote>
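<p>First the MDSMonitor::pending_fsmap member is updated, then the monitor waits for paxos to propose. Once paxos has reached a decision, MDSMonitor::update_from_paxos() runs,
replacing the current fsmap with the newly decided content and pushing the latest fsmap/mdsmap to subscribers. Note that although a filesystem now exists, no mds has been assigned to serve it yet, so an mds that receives the new map cannot do much with it.</p>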
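<p>The split between a pending map and a committed map can be modelled with a toy example. Everything below is a simplified sketch: the Toy types are invented, the fscid allocation is arbitrary, and commit() merely stands in for "paxos reached a decision and update_from_paxos() ran".</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model of the pending/committed split; not the real MDSMonitor.
struct ToyFSMap {
  uint64_t epoch = 0;
  std::map<int32_t, std::string> filesystems;  // fscid -> filesystem name
};

struct ToyMonitor {
  ToyFSMap fsmap;          // last committed map, what subscribers see
  ToyFSMap pending_fsmap;  // working copy being prepared

  // Stage a new filesystem; nothing is visible until commit().
  int32_t create_new_fs(const std::string& name) {
    assert(!name.empty());
    const int32_t fscid =
        static_cast<int32_t>(pending_fsmap.filesystems.size()) + 1;
    pending_fsmap.filesystems[fscid] = name;
    return fscid;
  }

  // Stand-in for a successful paxos round: publish the pending map
  // as the next epoch.
  void commit() {
    pending_fsmap.epoch = fsmap.epoch + 1;
    fsmap = pending_fsmap;
  }
};
```

<p>Until commit() is called, readers of fsmap still see the old epoch with no filesystem, which mirrors the way subscribers only receive the new fsmap/mdsmap after the decision.</p>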
<h1 id="allocate-rank">Allocate Rank</h1>
<p>After the filesystem is created, no mds daemon is serving it yet. Assigning an mds is driven by the periodic tick function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDSMonitor</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="kt">bool</span> <span class="n">do_propose</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">i</span> <span class="o">:</span> <span class="n">pending_fsmap</span><span class="p">.</span><span class="n">filesystems</span><span class="p">)</span> <span class="p">{</span>
<span class="n">do_propose</span> <span class="o">|=</span> <span class="n">maybe_expand_cluster</span><span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">second</span><span class="p">);</span> <span class="c1">// if an update is needed, a paxos decision will be made
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">do_propose</span><span class="p">)</span> <span class="p">{</span>
<span class="n">propose_pending</span><span class="p">();</span> <span class="c1">// start the proposal
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">MDSMonitor</span><span class="o">::</span><span class="n">maybe_expand_cluster</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Filesystem</span><span class="o">></span> <span class="n">fs</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">bool</span> <span class="n">do_propose</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">fs</span><span class="o">-></span><span class="n">mds_map</span><span class="p">.</span><span class="n">get_num_in_mds</span><span class="p">()</span> <span class="o"><</span> <span class="kt">size_t</span><span class="p">(</span><span class="n">fs</span><span class="o">-></span><span class="n">mds_map</span><span class="p">.</span><span class="n">get_max_mds</span><span class="p">())</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">fs</span><span class="o">-></span><span class="n">mds_map</span><span class="p">.</span><span class="n">is_degraded</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// fewer mds daemons serve this filesystem than max_mds allows
</span> <span class="p">......</span>
<span class="n">mds_gid_t</span> <span class="n">newgid</span> <span class="o">=</span> <span class="n">pending_fsmap</span><span class="p">.</span><span class="n">find_replacement_for</span><span class="p">({</span><span class="n">fs</span><span class="o">-></span><span class="n">fscid</span><span class="p">,</span> <span class="n">mds</span><span class="p">},</span>
<span class="n">name</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mon_force_standby_active</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">pending_fsmap</span><span class="p">.</span><span class="n">promote</span><span class="p">(</span><span class="n">newgid</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">mds</span><span class="p">);</span> <span class="c1">// add an mds; its initial state depends on the situation, and is STATE_CREATING when the filesystem has not been created yet
</span> <span class="n">do_propose</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">do_propose</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As before, once paxos completes, MDSMonitor::update_from_paxos() runs, updates the fsmap again, and pushes the latest fsmap/mdsmap to subscribers.</p>
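<p>The expansion loop can be illustrated with a simplified model. Everything here is a hypothetical sketch rather than Ceph's implementation: ToyMDSMap and toy_expand_cluster are invented names, picking the lowest unused rank and taking the first standby from a queue are assumptions standing in for the elided parts of maybe_expand_cluster() and find_replacement_for().</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>
#include <deque>
#include <map>

// Hypothetical, heavily simplified model of one filesystem's MDS map.
struct ToyMDSMap {
  std::size_t max_mds = 1;
  bool degraded = false;
  std::map<int, unsigned long> in;  // rank -> gid of the daemon holding it
};

// Promote standbys until max_mds ranks are filled, no standby is left,
// or the map is degraded. Returns true if the map changed, meaning a
// proposal would be needed.
bool toy_expand_cluster(ToyMDSMap& m, std::deque<unsigned long>& standbys) {
  bool do_propose = false;
  while (m.in.size() < m.max_mds && !m.degraded) {
    if (standbys.empty()) break;        // no replacement available
    int rank = 0;
    while (m.in.count(rank)) ++rank;    // lowest unused rank (assumption)
    m.in[rank] = standbys.front();      // plays the role of promote()
    standbys.pop_front();
    do_propose = true;
  }
  return do_propose;
}
</code></pre></div></div>
<p>With max_mds set to 2 and three standbys queued, the first call fills ranks 0 and 1 and returns true; a second call returns false because nothing is left to expand, so no further proposal would be triggered.</p>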
<h1 id="filesystem-init">FileSystem Init</h1>
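<p>When an mds process receives the latest map and finds that a filesystem has been assigned to it, it has to initialize itself so that it can serve the filesystem:</p>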
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDSDaemon</span><span class="o">::</span><span class="n">handle_mds_map</span><span class="p">(</span><span class="n">MMDSMap</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mds_rank</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// create a rank, ready to serve the filesystem
</span> <span class="n">mds_rank</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MDSRankDispatcher</span><span class="p">(</span><span class="n">whoami</span><span class="p">,</span> <span class="n">mds_lock</span><span class="p">,</span> <span class="n">clog</span><span class="p">,</span>
<span class="n">timer</span><span class="p">,</span> <span class="n">beacon</span><span class="p">,</span> <span class="n">mdsmap</span><span class="p">,</span> <span class="n">messenger</span><span class="p">,</span> <span class="n">monc</span><span class="p">,</span> <span class="n">objecter</span><span class="p">,</span>
<span class="k">new</span> <span class="n">C_VoidFn</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">MDSDaemon</span><span class="o">::</span><span class="n">respawn</span><span class="p">),</span>
<span class="k">new</span> <span class="n">C_VoidFn</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">MDSDaemon</span><span class="o">::</span><span class="n">suicide</span><span class="p">));</span>
<span class="n">mds_rank</span><span class="o">-></span><span class="n">init</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="n">mds_rank</span><span class="o">-></span><span class="n">handle_mds_map</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">oldmap</span><span class="p">);</span> <span class="c1">// the rank handles the mdsmap and initializes the filesystem
</span><span class="p">}</span>
<span class="c1">// MDSRankDispatcher derives from MDSRank, the core of the filesystem; each component is implemented in its own class
</span><span class="k">class</span> <span class="nc">MDSRank</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">mds_rank_t</span> <span class="n">whoami</span><span class="p">;</span> <span class="c1">// rank id
</span>
<span class="n">MDSMap</span> <span class="o">*&</span><span class="n">mdsmap</span><span class="p">;</span> <span class="c1">// current mdsmap
</span>
<span class="n">Objecter</span> <span class="o">*</span><span class="n">objecter</span><span class="p">;</span> <span class="c1">// component that talks to the osds
</span>
<span class="c1">// the individual subsystems
</span> <span class="n">Server</span> <span class="o">*</span><span class="n">server</span><span class="p">;</span>
<span class="n">MDCache</span> <span class="o">*</span><span class="n">mdcache</span><span class="p">;</span>
<span class="n">Locker</span> <span class="o">*</span><span class="n">locker</span><span class="p">;</span>
<span class="n">MDLog</span> <span class="o">*</span><span class="n">mdlog</span><span class="p">;</span>
<span class="n">MDBalancer</span> <span class="o">*</span><span class="n">balancer</span><span class="p">;</span>
<span class="n">ScrubStack</span> <span class="o">*</span><span class="n">scrubstack</span><span class="p">;</span>
<span class="n">DamageTable</span> <span class="n">damage_table</span><span class="p">;</span>
<span class="n">InoTable</span> <span class="o">*</span><span class="n">inotable</span><span class="p">;</span>
<span class="n">SnapServer</span> <span class="o">*</span><span class="n">snapserver</span><span class="p">;</span>
<span class="n">SnapClient</span> <span class="o">*</span><span class="n">snapclient</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Once the rank is initialized, it processes the latest mdsmap:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MDSRankDispatcher</span><span class="o">::</span><span class="n">handle_mds_map</span><span class="p">(</span>
<span class="n">MMDSMap</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span>
<span class="n">MDSMap</span> <span class="o">*</span><span class="n">oldmap</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// many cases
</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">is_creating</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// first time: the filesystem must be created
</span> <span class="n">boot_create</span><span class="p">();</span> <span class="c1">// initialize the filesystem
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDSRank</span><span class="o">::</span><span class="n">boot_create</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// create the journal
</span> <span class="n">mdlog</span><span class="o">-></span><span class="n">create</span><span class="p">(</span><span class="n">fin</span><span class="p">.</span><span class="n">new_sub</span><span class="p">());</span>
<span class="c1">// create a new segment to hold journal entries
</span> <span class="n">mdlog</span><span class="o">-></span><span class="n">prepare_new_segment</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">whoami</span> <span class="o">==</span> <span class="n">mdsmap</span><span class="o">-></span><span class="n">get_root</span><span class="p">())</span> <span class="p">{</span>
<span class="n">mdcache</span><span class="o">-></span><span class="n">create_empty_hierarchy</span><span class="p">(</span><span class="n">fin</span><span class="p">.</span><span class="n">get</span><span class="p">());</span> <span class="c1">// create the filesystem root directory, allocating its inode, dentry, etc.
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MDCache</span><span class="o">::</span><span class="n">create_empty_hierarchy</span><span class="p">(</span><span class="n">MDSGather</span> <span class="o">*</span><span class="n">gather</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// create the root directory inode; its number is 1 by default
</span> <span class="n">CInode</span> <span class="o">*</span><span class="n">root</span> <span class="o">=</span> <span class="n">create_root_inode</span><span class="p">();</span>
<span class="c1">// create the root directory
</span> <span class="n">CDir</span> <span class="o">*</span><span class="n">rootdir</span> <span class="o">=</span> <span class="n">root</span><span class="o">-></span><span class="n">get_or_open_dirfrag</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">frag_t</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>That completes filesystem initialization. From here on the mds waits for clients to mount the filesystem and then read and write.</p>
<h1 id="summary">Summary</h1>
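<p>This post summarized the cephfs initialization flow, from mds daemon startup to the point where the filesystem has been created and is usable. It mainly involves changes to the fsmap and mdsmap, which are maintained through the paxos algorithm. This is only the tip of the iceberg, though: the core of the filesystem (the components inside MDSRank) and its many advanced features still have to be explored step by step. As the old line goes, the road ahead is long…</p>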
Ceph Manager Overview
2017-03-29T00:00:00+00:00
http://blog.wjin.org/posts/ceph-manager-overview
<h1 id="introduction">Introduction</h1>
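<p>Starting with the Kraken release, ceph adds a new daemon, mgr. Its main purpose is to move some monitor services that do not depend on paxos (such as pg statistics) out of the monitor,
lightening the monitor's load, and to offer a python plugin interface through which other monitoring systems can obtain statistics. mgr uses a master-standby model:
several instances can be deployed to avoid a single point of failure, but only one mgr is active and serving at any time.</p>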
<h1 id="daemon-start">Daemon Start</h1>
<p>The mgr daemon is implemented in src/ceph_mgr.cc and is fairly simple:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">MgrStandby</span> <span class="n">mgr</span><span class="p">;</span>
<span class="p">......</span>
<span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">mgr</span><span class="p">.</span><span class="n">init</span><span class="p">();</span> <span class="c1">// initialize
</span>
<span class="p">......</span>
<span class="k">return</span> <span class="n">mgr</span><span class="p">.</span><span class="n">main</span><span class="p">(</span><span class="n">args</span><span class="p">);</span> <span class="c1">// wait until exit
</span><span class="p">}</span>
</code></pre></div></div>
<p>Let's see what MgrStandby does during initialization; the code is in src/mgr/MgrStandby.[h|cc]:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">MgrStandby</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// initialize the messenger, monitor client, objecter and other components
</span> <span class="p">......</span>
<span class="c1">// subscribe to the mgrmap
</span> <span class="n">monc</span><span class="o">-></span><span class="n">sub_want</span><span class="p">(</span><span class="s">"mgrmap"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="c1">// send a beacon to the monitor to announce that this daemon has started
</span> <span class="n">send_beacon</span><span class="p">();</span> <span class="c1">// called periodically from a timer to tell the monitor this daemon is still alive
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">MgrStandby</span><span class="o">::</span><span class="n">main</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*></span> <span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">client_messenger</span><span class="o">-></span><span class="n">wait</span><span class="p">();</span> <span class="c1">// like other daemons, wait until shutdown
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
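<p>At this point initialization appears to be done :( and the mgr process starts waiting for an MMgrMap message. As mentioned, mgr uses a master-standby model.
With several mgr processes, which one is master and which are standby is decided by the mgrmap, whose consistency the monitor maintains through paxos, so let's look at the mgrmap first.</p>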
<h1 id="mgrmap--mgrmonitor">MgrMap & MgrMonitor</h1>
<p>The mgrmap is fairly simple; the code is in src/mon/MgrMap.h:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">StandbyInfo</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">uint64_t</span> <span class="n">gid</span><span class="p">;</span> <span class="c1">// global id, obtained from the monitor by the monitor client
</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">MgrMap</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">epoch_t</span> <span class="n">epoch</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">active_gid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// gid of the active mgr
</span> <span class="n">entity_addr_t</span> <span class="n">active_addr</span><span class="p">;</span> <span class="c1">// address of the active mgr
</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">active_name</span><span class="p">;</span> <span class="c1">// name of the active mgr
</span>
<span class="kt">bool</span> <span class="n">available</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="c1">// whether initialization has finished and the mgr is usable
</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="n">StandbyInfo</span><span class="o">></span> <span class="n">standbys</span><span class="p">;</span> <span class="c1">// information about all standby mgrs
</span><span class="p">};</span>
</code></pre></div></div>
<p>The mgrmap is also updated through paxos. Like the other maps, it has its own paxos service, MgrMonitor, which derives from the PaxosService base class; see src/mon/MgrMonitor.[h|cc]:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MgrMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">PaxosService</span>
<span class="p">{</span>
<span class="n">MgrMap</span> <span class="n">map</span><span class="p">;</span> <span class="c1">// current mgrmap
</span> <span class="n">MgrMap</span> <span class="n">pending_map</span><span class="p">;</span> <span class="c1">// pending mgrmap
</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="n">utime_t</span><span class="o">></span> <span class="n">last_beacon</span><span class="p">;</span> <span class="c1">// timestamp of the last beacon from each mgr
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The paxos flow is described in <a href="http://blog.wjin.org/posts/ceph-monitor-paxosservice.html">this post</a>. MgrMonitor currently responds to only two message types,
MSG_MGR_BEACON and MSG_MON_COMMAND:</p>
<ul>
<li>
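<p>On a beacon, if the sender is already the active mgr or a standby, the mgrmap is left unchanged. Otherwise the mgrmap must be updated
to add the mgr: the prepare_update stage runs and returns true, and once paxos completes, update_from_paxos runs and then calls check_sub,
which sends the latest map to every client that has subscribed to the mgrmap.</p>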
</li>
<li>
<p>On a command message, the command is executed. The only command supported at present is mgr fail, which forcibly removes an mgr from the mgrmap.</p>
</li>
</ul>
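<p>Note also that MgrMonitor has a tick function that periodically checks the active and standby mgr processes. If a beacon times out, the corresponding mgr is removed from the map;
the timeout is controlled by the mon_mgr_beacon_grace option.</p>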
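<p>The beacon bookkeeping and the grace check can be sketched as follows. This is a minimal model under stated assumptions, not MgrMonitor's code: ToyBeaconTracker and its methods are hypothetical names, time is passed in explicitly, and "needs a map update" is reduced to "this gid has not been seen before".</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <chrono>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical model of the beacon bookkeeping: gid -> time of last beacon.
struct ToyBeaconTracker {
  std::map<std::uint64_t, Clock::time_point> last_beacon;

  // Record a beacon. Returns true when the gid is new, i.e. when the map
  // would have to be updated to add this daemon.
  bool on_beacon(std::uint64_t gid, Clock::time_point now) {
    bool is_new = last_beacon.find(gid) == last_beacon.end();
    last_beacon[gid] = now;
    return is_new;
  }

  // Plays the role of tick(): drop every daemon whose last beacon is older
  // than the grace period and return the removed gids.
  std::vector<std::uint64_t> tick(Clock::time_point now,
                                  std::chrono::seconds grace) {
    std::vector<std::uint64_t> removed;
    for (auto it = last_beacon.begin(); it != last_beacon.end();) {
      if (now - it->second > grace) {
        removed.push_back(it->first);
        it = last_beacon.erase(it);
      } else {
        ++it;
      }
    }
    return removed;
  }
};
</code></pre></div></div>
<p>A repeated beacon from a known gid only refreshes its timestamp, which matches the case where the mgrmap is left unchanged; a gid that stays silent for longer than the grace period is returned by tick() and would be removed from the map.</p>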
<h1 id="handle-mgrmap">Handle Mgrmap</h1>
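<p>As described earlier, mgr sends a beacon to the monitor on startup, which causes a mgrmap update, and it has subscribed to the mgrmap, so it receives an MMgrMap message. The handler is:</p>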
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MgrStandby</span><span class="o">::</span><span class="n">handle_mgr_map</span><span class="p">(</span><span class="n">MMgrMap</span><span class="o">*</span> <span class="n">mmap</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">map</span> <span class="o">=</span> <span class="n">mmap</span><span class="o">-></span><span class="n">get_map</span><span class="p">();</span>
<span class="k">const</span> <span class="kt">bool</span> <span class="n">active_in_map</span> <span class="o">=</span> <span class="n">map</span><span class="p">.</span><span class="n">active_gid</span> <span class="o">==</span> <span class="n">monc</span><span class="o">-></span><span class="n">get_global_id</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">active_in_map</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">active_mgr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">active_mgr</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="k">new</span> <span class="n">Mgr</span><span class="p">(</span><span class="n">monc</span><span class="p">,</span> <span class="n">client_messenger</span><span class="p">,</span> <span class="n">objecter</span><span class="p">));</span> <span class="c1">// this daemon is active in the mgrmap: create the Mgr instance and get ready to work
</span> <span class="n">active_mgr</span><span class="o">-></span><span class="n">background_init</span><span class="p">();</span> <span class="c1">// run initialization
</span> <span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">active_mgr</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// otherwise, destroy the instance
</span> <span class="n">active_mgr</span><span class="o">-></span><span class="n">shutdown</span><span class="p">();</span>
<span class="n">active_mgr</span><span class="p">.</span><span class="n">reset</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So among all mgr daemons only one process creates an Mgr instance and does the work; the others stay in standby and never create an Mgr object.</p>
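<p>To make the failover behaviour concrete, here is a small model of two daemons reacting to the same map. It is only a sketch: ToyMgrMap, ToyMgr, and ToyMgrStandby are hypothetical names, and the real initialization and shutdown work is left out.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <memory>

// Hypothetical stand-ins; only the fields needed for the role decision.
struct ToyMgrMap { std::uint64_t active_gid = 0; };
struct ToyMgr { /* active-only state would live here */ };

class ToyMgrStandby {
public:
  explicit ToyMgrStandby(std::uint64_t gid) : my_gid(gid) {}

  // Mirrors the decision above: own the instance only while the map
  // names this daemon as active.
  void handle_mgr_map(const ToyMgrMap& map) {
    if (map.active_gid == my_gid) {
      if (!active_mgr) active_mgr = std::make_unique<ToyMgr>();  // become active
    } else {
      active_mgr.reset();  // fall back to standby
    }
  }

  bool is_active() const { return active_mgr != nullptr; }

private:
  std::uint64_t my_gid;
  std::unique_ptr<ToyMgr> active_mgr;
};
</code></pre></div></div>
<p>Feeding both daemons a map whose active_gid changes from one gid to the other flips their roles, and at no point do both hold an instance.</p>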
<h1 id="mgr-instance">Mgr Instance</h1>
<p>After all of the above, what mgr is actually for has still not been covered. For that we start from the Mgr class; the code is in src/mgr/Mgr.[h|cc]:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Mgr</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">PyModules</span> <span class="n">py_modules</span><span class="p">;</span> <span class="c1">// handles python plugins
</span> <span class="n">DaemonStateIndex</span> <span class="n">daemon_state</span><span class="p">;</span> <span class="c1">// metadata and statistics of the cluster's daemons
</span> <span class="n">ClusterState</span> <span class="n">cluster_state</span><span class="p">;</span> <span class="c1">// state of the whole cluster
</span> <span class="n">DaemonServer</span> <span class="n">server</span><span class="p">;</span> <span class="c1">// mgr is a service: it binds an address and listens so that other daemons (osd, mds, etc.) can connect to it
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Judging from this class, mgr consists mainly of the following parts:</p>
<ul>
<li>
<p>python plugin handling</p>
</li>
<li>
<p>metadata and statistics of the daemons, including PerfCounters</p>
</li>
<li>
<p>cluster state information</p>
</li>
<li>
<p>the mechanism for communicating with daemons</p>
</li>
</ul>
<p>These modules are analyzed briefly below.</p>
<h3 id="pymodules">PyModules</h3>
<p>mgr can load python plugins dynamically. The relevant code is in PyModules.[h|cc], MgrPyModule.[h|cc], and PyState.[h|cc]:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PyModules</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">MgrPyModule</span><span class="o">>></span> <span class="n">modules</span><span class="p">;</span> <span class="c1">// all plugins; two exist by default at present, rest and fsstatus, see src/pybind/mgr
</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">ServeThread</span><span class="o">>></span> <span class="n">serve_threads</span><span class="p">;</span> <span class="c1">// the thread that runs each plugin
</span> <span class="n">DaemonStateIndex</span> <span class="o">&</span><span class="n">daemon_state</span><span class="p">;</span> <span class="c1">// reference to the daemon state, so plugins can query it
</span> <span class="n">ClusterState</span> <span class="o">&</span><span class="n">cluster_state</span><span class="p">;</span> <span class="c1">// reference to the cluster state, so plugins can query it
</span><span class="p">};</span>
</code></pre></div></div>
<p>Here is how plugins are initialized:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// plugins are loaded while the Mgr instance initializes
</span><span class="kt">void</span> <span class="n">Mgr</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">py_modules</span><span class="p">.</span><span class="n">init</span><span class="p">();</span>
<span class="n">py_modules</span><span class="p">.</span><span class="n">start</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">PyModules</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">global_handle</span> <span class="o">=</span> <span class="k">this</span><span class="p">;</span> <span class="c1">// this global is defined in PyState.cc, which implements the ceph_state module, i.e. C++ wrapper functions exposed to python modules
</span>
<span class="p">......</span>
<span class="k">auto</span> <span class="n">py_logger</span> <span class="o">=</span> <span class="n">Py_InitModule</span><span class="p">(</span><span class="s">"ceph_logger"</span><span class="p">,</span> <span class="n">log_methods</span><span class="p">);</span> <span class="c1">// register the ceph_logger module for plugins, mainly used for logging
</span> <span class="n">Py_InitModule</span><span class="p">(</span><span class="s">"ceph_state"</span><span class="p">,</span> <span class="n">CephStateMethods</span><span class="p">);</span> <span class="c1">// register the ceph_state module for plugins, mainly used to query daemon state, cluster state, and related information
</span>
<span class="p">......</span>
<span class="n">boost</span><span class="o">::</span><span class="n">tokenizer</span><span class="o"><></span> <span class="n">tok</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">mgr_modules</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="k">const</span> <span class="k">auto</span><span class="o">&</span> <span class="n">module_name</span> <span class="o">:</span> <span class="n">tok</span><span class="p">)</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">mod</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">MgrPyModule</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">MgrPyModule</span><span class="p">(</span><span class="n">module_name</span><span class="p">));</span> <span class="c1">// create the python module wrapper
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">mod</span><span class="o">-></span><span class="n">load</span><span class="p">();</span> <span class="c1">// instantiate the python module and load the commands it provides
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
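<p>Note that a python plugin can call interfaces provided by mgr (see the functions in the ceph_logger and ceph_state modules); these calls are initiated by the plugin author.
In the other direction, a plugin may need to be notified in certain situations. MgrPyModule::notify does this: the mgr C++ side calls into python, and a plugin author can implement the notify function
to react when state changes. For how to write a plugin, see the <a href="http://docs.ceph.com/docs/master/mgr/plugins/">official documentation</a>.</p>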
<h3 id="daemonserver">DaemonServer</h3>
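<p>DaemonServer makes mgr listen on an address so that daemons such as osd and mds can connect to it. It also handles some command requests (the PGMonitor inside mon may be removed in the future and moved into mgr).</p>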
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DaemonServer</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Dispatcher</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">msgr</span><span class="p">;</span> <span class="c1">// this messenger binds an address and turns mgr into a service
</span>
<span class="c1">// other components
</span> <span class="n">MonClient</span> <span class="o">*</span><span class="n">monc</span><span class="p">;</span>
<span class="n">DaemonStateIndex</span> <span class="o">&</span><span class="n">daemon_state</span><span class="p">;</span>
<span class="n">ClusterState</span> <span class="o">&</span><span class="n">cluster_state</span><span class="p">;</span>
<span class="n">PyModules</span> <span class="o">&</span><span class="n">py_modules</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
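<p>The main thing to note is this messenger, which is distinct from the client_messenger used elsewhere. MgrStandby holds a client_messenger and passes it to mgr, monc, objecter, and so on;
it is used when mgr acts as a client and actively connects to other services. The messenger inside DaemonServer is the server side, used for incoming connections. Its address
is published in the mgrmap, so other daemons that obtain the mgrmap know how to connect to the mgr service:</p>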
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MgrStandby</span><span class="o">::</span><span class="n">send_beacon</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">bool</span> <span class="n">available</span> <span class="o">=</span> <span class="n">active_mgr</span> <span class="o">!=</span> <span class="nb">nullptr</span> <span class="o">&&</span> <span class="n">active_mgr</span><span class="o">-></span><span class="n">is_initialized</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">addr</span> <span class="o">=</span> <span class="n">available</span> <span class="o">?</span> <span class="n">active_mgr</span><span class="o">-></span><span class="n">get_server_addr</span><span class="p">()</span> <span class="o">:</span> <span class="n">entity_addr_t</span><span class="p">();</span> <span class="c1">// address of the messenger inside DaemonServer
</span> <span class="n">MMgrBeacon</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MMgrBeacon</span><span class="p">(</span><span class="n">monc</span><span class="o">-></span><span class="n">get_fsid</span><span class="p">(),</span>
<span class="n">monc</span><span class="o">-></span><span class="n">get_global_id</span><span class="p">(),</span>
<span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">(),</span>
<span class="n">addr</span><span class="p">,</span>
<span class="n">available</span><span class="p">);</span>
<span class="n">monc</span><span class="o">-></span><span class="n">send_mon_message</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
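<p>Where there is a server there is a client. The code is in src/mgr/MgrClient.[h|cc]: just as MonClient connects to mon, MgrClient connects to mgr.</p>
<p>Take the osd daemon as an example. During startup it sends an MMgrOpen message; mgr replies with MMgrConfigure, which mainly carries a period. From then on the osd uses that period
to report its state to mgr regularly, through the MMgrReport and MPGStats messages:</p>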
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">mgrc</span><span class="p">.</span><span class="n">init</span><span class="p">();</span>
<span class="n">client_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="o">&</span><span class="n">mgrc</span><span class="p">);</span> <span class="c1">// add the mgr client to the messenger's dispatcher list so it can handle mgrmap and other messages
</span>
<span class="p">......</span>
<span class="n">monc</span><span class="o">-></span><span class="n">sub_want</span><span class="p">(</span><span class="s">"mgrmap"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// subscribe to the mgrmap
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When the osd receives a mgrmap message, it dispatches it to MgrClient:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">MgrClient</span><span class="o">::</span><span class="n">handle_mgr_map</span><span class="p">(</span><span class="n">MMgrMap</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">map</span> <span class="o">=</span> <span class="n">m</span><span class="o">-></span><span class="n">get_map</span><span class="p">();</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">session</span> <span class="o">||</span> <span class="n">session</span><span class="o">-></span><span class="n">con</span><span class="o">-></span><span class="n">get_peer_addr</span><span class="p">()</span> <span class="o">!=</span> <span class="n">map</span><span class="p">.</span><span class="n">get_active_addr</span><span class="p">())</span> <span class="p">{</span>
<span class="n">reconnect</span><span class="p">();</span> <span class="c1">// establish the connection
</span> <span class="p">}</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MgrClient</span><span class="o">::</span><span class="n">reconnect</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">g_conf</span> <span class="o">&&</span> <span class="o">!</span><span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">is_client</span><span class="p">())</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">open</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MMgrOpen</span><span class="p">();</span>
<span class="n">open</span><span class="o">-></span><span class="n">daemon_name</span> <span class="o">=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">();</span>
<span class="n">session</span><span class="o">-></span><span class="n">con</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">open</span><span class="p">);</span> <span class="c1">// send an MMgrOpen message to the mgr process
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Sending a command request to the mgr process follows a similar path; it also goes through MgrClient.</p>
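<p>The reconnect decision in handle_mgr_map can be looked at in isolation. What follows is a minimal mock, not Ceph code: MockMgrMap, MockSession and MockMgrClient are hypothetical stand-ins, addresses are plain strings, and no message is actually sent. It only shows that a new session is opened when there is no session yet or when the active mgr address has changed.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for MgrMap and the client session; not Ceph types.
struct MockMgrMap {
  std::string active_addr;  // address of the currently active mgr
};

struct MockSession {
  std::string peer_addr;    // address this session is connected to
};

class MockMgrClient {
 public:
  // Same shape as handle_mgr_map(): store the map, reconnect only if needed.
  void handle_mgr_map(const MockMgrMap& m) {
    map_ = m;
    if (!session_ || session_->peer_addr != map_.active_addr) {
      reconnect();
    }
  }

  int reconnect_count() const { return reconnects_; }

 private:
  void reconnect() {
    // Replace the session with one that points at the active mgr.
    session_ = std::make_unique<MockSession>(MockSession{map_.active_addr});
    ++reconnects_;
    assert(session_->peer_addr == map_.active_addr);
  }

  MockMgrMap map_;
  std::unique_ptr<MockSession> session_;
  int reconnects_ = 0;
};
```

<p>Feeding the same map twice triggers only one reconnect; a map with a different active address triggers another one.</p>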
<h3 id="clusterstate">ClusterState</h3>
<p>ClusterState mainly holds the fsmap, the pgmap, and health status information. Its structure is:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClusterState</span>
<span class="p">{</span>
<span class="n">FSMap</span> <span class="n">fsmap</span><span class="p">;</span> <span class="c1">// fsmap information
</span> <span class="n">PGMap</span> <span class="n">pg_map</span><span class="p">;</span> <span class="c1">// pgmap information
</span> <span class="n">bufferlist</span> <span class="n">health_json</span><span class="p">;</span> <span class="c1">// health information
</span> <span class="n">bufferlist</span> <span class="n">mon_status_json</span><span class="p">;</span> <span class="c1">// monitor status information
</span><span class="p">};</span>
</code></pre></div></div>
<p>During initialization, the Mgr instance subscribes to mgrdigest and fsmap:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Mgr</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">monc</span><span class="o">-></span><span class="n">sub_want</span><span class="p">(</span><span class="s">"mgrdigest"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">monc</span><span class="o">-></span><span class="n">sub_want</span><span class="p">(</span><span class="s">"fsmap"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
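<p>health_json and mon_status_json are refreshed by MgrMonitor, which periodically sends the state recorded inside the mon process to the subscribed mgr processes in an MMgrDigest message.
The fsmap is pushed to the subscribing mgr in an MFSMap message by MDSMonitor once paxos has committed an fsmap update; it is pushed only once, unlike the periodic digest.
The pgmap is updated from the OSD side: each OSD process periodically sends its own PG statistics to the mgr through MgrClient in an MPGStats message.</p>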
<h3 id="daemonstate">DaemonState</h3>
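<p>This is mainly the PerfCounters data of the OSD processes, which each OSD refreshes by periodically sending an MMgrReport message through MgrClient.
Internally, the reported data is lightly reorganized so that it can be looked up by host name, ID, or service type.</p>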
<h1 id="summary">Summary</h1>
<ul>
<li>
<p>mgr runs in master-standby mode; the mon processes keep the mgrmap consistent through paxos and decide which mgr is the master, i.e. the active mgr</p>
</li>
<li>
<p>When an mgr process starts, it subscribes to the mgrmap and periodically sends beacon messages to the mon, which updates the mgrmap based on them</p>
</li>
<li>
<p>When an mgr receives an mgrmap update in which it has been chosen as active, it instantiates the Mgr class to do the actual work</p>
</li>
<li>
<p>The Mgr class mainly consists of:</p>
<ul>
<li>
<p>PyModules, the object that handles python modules</p>
</li>
<li>
<p>DaemonServer, which exposes the mgr itself as a service</p>
</li>
<li>
<p>ClusterState for cluster-wide statistics and DaemonState for per-daemon statistics</p>
</li>
</ul>
</li>
</ul>
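<p>Take care to distinguish the client messenger in MgrStandby from the server messenger in DaemonServer, and to note which message types each class dispatches.
Finally, mgr only stores information and does not affect the correct operation of the cluster. The stored information can be exposed through python plugins,
such as the default rest plugin, which provides a REST interface so that other monitoring systems can fetch it.</p>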
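<p>Also, since mgr was added only recently, this code is not yet stable and may see fairly large changes, such as bug fixes, moving the PGMonitor functionality into mgr,
or more interfaces for python plugins. The analysis stops here for now and will be updated if necessary. For the record, the code commit ID is 3c257ef131bbd825ac9c1373a21f92d4dee8b047.</p>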
Ceph PerfCounters
2017-01-18T00:00:00+00:00
http://blog.wjin.org/posts/ceph-perfcounters
<h1 id="introduction">Introduction</h1>
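<p>Ceph instruments every module with performance counters. In production these key metrics can be monitored to get a more precise picture of the cluster's internal state.
See the <a href="http://docs.ceph.com/docs/master/dev/perf_counters/">official documentation</a> for details. Usage:</p>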
<blockquote>
<p>ceph daemon osd.0 perf schema</p>
</blockquote>
<blockquote>
<p>ceph daemon osd.0 perf dump</p>
</blockquote>
<p>A note on reading the schema output: the type field is a bitmask, so a type of 5 means 4+1, a type of 10 means 8+2, and so on. Quoting the official documentation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bit meaning
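1 floating point value // float
2 unsigned 64-bit integer value // integer
4 average (sum + count pair) // with 4 set there are two values; usually compute sum/avgcount to get the average, mainly used for latency
8 counter (vs gauge) // with 8 set, a monitor may need to subtract the previous dump's value to get the delta over the interval between two dumps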
</code></pre></div></div>
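<p>The bit table above can be turned into a small helper. This is a sketch for illustration only: the constant names kFloat, kU64, kAverage and kCounter and the functions describe_type and interval_average are made up here and are not Ceph identifiers. It decodes a type value into its parts and computes the average of a sum/avgcount item over the interval between two dumps.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bit values as listed in the table above; the names are hypothetical.
enum : unsigned {
  kFloat = 1,    // floating point value
  kU64 = 2,      // unsigned 64-bit integer value
  kAverage = 4,  // average (sum + count pair)
  kCounter = 8,  // counter (vs gauge)
};

// Turn a schema "type" value into a readable description.
std::string describe_type(unsigned type) {
  std::string out = (type & kFloat) ? "float" : (type & kU64) ? "u64" : "none";
  if (type & kAverage) out += "+average";
  if (type & kCounter) out += "+counter";
  return out;
}

// One sample of an "average" item as it appears in a perf dump.
struct AvgSample {
  double sum;
  uint64_t avgcount;
};

// Mean value over the interval between two dumps; 0 if nothing was recorded.
double interval_average(const AvgSample& prev, const AvgSample& cur) {
  assert(cur.avgcount >= prev.avgcount);
  uint64_t n = cur.avgcount - prev.avgcount;
  return n == 0 ? 0.0 : (cur.sum - prev.sum) / static_cast<double>(n);
}
```

<p>For a latency item, prev and cur would be the sum and avgcount fields of the same counter taken from two consecutive perf dump outputs.</p>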
<h1 id="implementation">Implementation</h1>
<h3 id="perfcounterscollection">PerfCountersCollection</h3>
<p>First, each daemon process has a single CephContext, which holds a PerfCountersCollection member that tracks all of the daemon's PerfCounters:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CephContext</span> <span class="p">{</span>
<span class="n">PerfCountersCollection</span> <span class="o">*</span><span class="n">_perf_counters_collection</span><span class="p">;</span> <span class="c1">// the daemon's single collection
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="k">typedef</span> <span class="n">std</span><span class="o">::</span><span class="n">set</span> <span class="o"><</span><span class="n">PerfCounters</span><span class="o">*</span><span class="p">,</span> <span class="n">SortPerfCountersByName</span><span class="o">></span> <span class="n">perf_counters_set_t</span><span class="p">;</span>
<span class="k">class</span> <span class="nc">PerfCountersCollection</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">add</span><span class="p">(</span><span class="k">class</span> <span class="nc">PerfCounters</span> <span class="o">*</span><span class="n">l</span><span class="p">);</span> <span class="c1">// add a PerfCounters
</span> <span class="kt">void</span> <span class="n">remove</span><span class="p">(</span><span class="k">class</span> <span class="nc">PerfCounters</span> <span class="o">*</span><span class="n">l</span><span class="p">);</span> <span class="c1">// remove a PerfCounters
</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">perf_counters_set_t</span> <span class="n">m_loggers</span><span class="p">;</span> <span class="c1">// set sorted by name; the PerfCounters of every module in the daemon are recorded here
</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="p">,</span> <span class="n">PerfCounters</span><span class="o">::</span><span class="n">perf_counter_data_any_d</span> <span class="o">*></span> <span class="n">by_path</span><span class="p">;</span> <span class="c1">// k/v map of the items contained in all PerfCounters
</span><span class="p">};</span>
</code></pre></div></div>
<p>The add and remove APIs are straightforward and are not covered in detail here.</p>
<h3 id="perfcounters">PerfCounters</h3>
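<p>Each module has its own PerfCounters containing several key performance metrics, i.e. several items. These items are confined to the module's index range, delimited by lower_bound and upper_bound. Taking librbd as an example:</p>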
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// src/librbd/internal.h
</span><span class="k">enum</span> <span class="p">{</span>
<span class="n">l_librbd_first</span> <span class="o">=</span> <span class="mi">26000</span><span class="p">,</span> <span class="c1">// lower_bound
</span>
<span class="n">l_librbd_rd</span><span class="p">,</span> <span class="c1">// read ops
</span> <span class="n">l_librbd_rd_bytes</span><span class="p">,</span> <span class="c1">// bytes read
</span> <span class="n">l_librbd_rd_latency</span><span class="p">,</span> <span class="c1">// average latency
</span> <span class="p">......</span>
<span class="n">l_librbd_last</span><span class="p">,</span> <span class="c1">// upper_bound
</span><span class="p">};</span>
</code></pre></div></div>
<p>Because the range is bounded and every item has an index, a single vector is enough to store them:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PerfCounters</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">perf_counter_data_any_d</span> <span class="p">{</span> <span class="c1">// the value recorded for each metric
</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">description</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">nick</span><span class="p">;</span>
<span class="k">enum</span> <span class="n">perfcounter_type_d</span> <span class="n">type</span><span class="p">;</span>
<span class="n">atomic64_t</span> <span class="n">u64</span><span class="p">;</span> <span class="c1">// the counter value
</span>
<span class="c1">// the two counts below are the interesting part
</span> <span class="n">atomic64_t</span> <span class="n">avgcount</span><span class="p">;</span>
<span class="n">atomic64_t</span> <span class="n">avgcount2</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="n">m_lower_bound</span><span class="p">;</span> <span class="c1">// lower index bound
</span> <span class="kt">int</span> <span class="n">m_upper_bound</span><span class="p">;</span> <span class="c1">// upper index bound
</span> <span class="k">typedef</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">perf_counter_data_any_d</span><span class="o">></span> <span class="n">perf_counter_data_vec_t</span><span class="p">;</span>
<span class="n">perf_counter_data_vec_t</span> <span class="n">m_data</span><span class="p">;</span> <span class="c1">// all the values
</span><span class="p">};</span>
</code></pre></div></div>
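<p>To make the index-to-slot mapping concrete, here is a minimal sketch. It assumes that the first and last enum values are exclusive sentinels, which is what the idx - m_lower_bound - 1 indexing in inc() suggests. The l_demo_* enum and the DemoCounters class are hypothetical and far simpler than the real PerfCounters.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical module enum in the same shape as the librbd one above.
enum {
  l_demo_first = 1000,  // lower bound, not a real item
  l_demo_rd,
  l_demo_rd_bytes,
  l_demo_rd_latency,
  l_demo_last,          // upper bound, not a real item
};

class DemoCounters {
 public:
  DemoCounters(int lower, int upper)
      : lower_(lower),
        upper_(upper),
        data_(static_cast<std::size_t>(upper - lower - 1), uint64_t{0}) {}

  void inc(int idx, uint64_t amt = 1) { data_[slot(idx)] += amt; }
  uint64_t get(int idx) const { return data_[slot(idx)]; }
  std::size_t size() const { return data_.size(); }

 private:
  // Map an enum value to a vector slot: the first real item lands at 0.
  std::size_t slot(int idx) const {
    assert(idx > lower_ && idx < upper_);
    return static_cast<std::size_t>(idx - lower_ - 1);
  }

  int lower_;
  int upper_;
  std::vector<uint64_t> data_;
};
```

<p>With three items between the sentinels the vector has exactly three slots, and every lookup is a plain array access with no hashing or searching.</p>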
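<p>As mentioned earlier, when type includes 4, two values are maintained: sum and avgcount. When both are read to compute an average, they must come from the same update, otherwise the result is inaccurate.
The simplest approach is locking: take a lock to modify both values, and take the lock again to read both. That costs too much performance, so the code uses two counts to reach the same goal while avoiding a lock:</p>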
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// 修改
</span><span class="kt">void</span> <span class="n">PerfCounters</span><span class="o">::</span><span class="n">inc</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">amt</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">perf_counter_data_any_d</span><span class="o">&</span> <span class="n">data</span><span class="p">(</span><span class="n">m_data</span><span class="p">[</span><span class="n">idx</span> <span class="o">-</span> <span class="n">m_lower_bound</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">type</span> <span class="o">&</span> <span class="n">PERFCOUNTER_U64</span><span class="p">))</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">type</span> <span class="o">&</span> <span class="n">PERFCOUNTER_LONGRUNAVG</span><span class="p">)</span> <span class="p">{</span>
<span class="n">data</span><span class="p">.</span><span class="n">avgcount</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span> <span class="c1">// bump the first count
</span> <span class="n">data</span><span class="p">.</span><span class="n">u64</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">amt</span><span class="p">);</span> <span class="c1">// add the value, i.e. sum
</span> <span class="n">data</span><span class="p">.</span><span class="n">avgcount2</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span> <span class="c1">// bump the second count
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">data</span><span class="p">.</span><span class="n">u64</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">amt</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// read
</span><span class="n">pair</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">read_avg</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">sum</span><span class="p">,</span> <span class="n">count</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">count</span> <span class="o">=</span> <span class="n">avgcount</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
<span class="n">sum</span> <span class="o">=</span> <span class="n">u64</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">avgcount2</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o">!=</span> <span class="n">count</span><span class="p">);</span> <span class="c1">// if avgcount and avgcount2 are equal, the sum that was read is the one set for that avgcount
</span> <span class="k">return</span> <span class="n">make_pair</span><span class="p">(</span><span class="n">sum</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
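<p>Updating performance counters is a very frequent operation, and locking and unlocking on every update would add noticeable overhead. With two atomic variables, only the reader may have to read more than once, and reads happen only when data is dumped,
so the impact is small. This satisfies the statistics requirement while keeping the overhead low.</p>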
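<p>The two-counter idea can be written as a standalone sketch with std::atomic. AvgCounter is a hypothetical type, not the Ceph struct. One deliberate difference from the snippet above: the reader here samples avgcount2 first and re-checks avgcount afterwards. With that order, any writer whose amount is already in the sum must have bumped avgcount before the final check, so a matching pair cannot include a half-finished update. Treat this as an assumption of the example rather than a description of what the quoted code does.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical lock-free (sum, count) pair guarded by two counters.
struct AvgCounter {
  std::atomic<uint64_t> sum{0};
  std::atomic<uint64_t> avgcount{0};   // bumped before sum changes
  std::atomic<uint64_t> avgcount2{0};  // bumped after sum changes

  // Writer: never blocks, three atomic increments per update.
  void inc(uint64_t amt) {
    avgcount.fetch_add(1);
    sum.fetch_add(amt);
    avgcount2.fetch_add(1);
  }

  // Reader: retry until no writer was in flight while sum was sampled.
  std::pair<uint64_t, uint64_t> read_avg() const {
    uint64_t s, c;
    do {
      c = avgcount2.load();  // number of completed updates
      s = sum.load();
    } while (avgcount.load() != c);  // a started-but-unfinished update forces a retry
    return std::make_pair(s, c);
  }
};
```

<p>The writer path stays cheap, and the cost of contention falls entirely on the reader, which only runs when counters are dumped.</p>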
<h3 id="perfcountersbuilder">PerfCountersBuilder</h3>
<p>Modules use this class to build their counters; internally it adds validation of the items:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PerfCountersBuilder</span> <span class="p">{</span>
<span class="n">PerfCounters</span><span class="o">*</span> <span class="n">create_perf_counters</span><span class="p">();</span> <span class="c1">// return the pointer allocated in the constructor to the caller
</span> <span class="n">PerfCounters</span> <span class="o">*</span><span class="n">m_perf_counters</span><span class="p">;</span>
<span class="p">};</span>
<span class="n">PerfCountersBuilder</span><span class="o">::</span><span class="n">PerfCountersBuilder</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">name</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">first</span><span class="p">,</span> <span class="kt">int</span> <span class="n">last</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_perf_counters</span><span class="p">(</span><span class="k">new</span> <span class="n">PerfCounters</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">first</span><span class="p">,</span> <span class="n">last</span><span class="p">))</span> <span class="c1">// allocate a new PerfCounters
</span><span class="p">{</span>
<span class="p">}</span>
<span class="n">PerfCounters</span> <span class="o">*</span><span class="n">PerfCountersBuilder</span><span class="o">::</span><span class="n">create_perf_counters</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">PerfCounters</span><span class="o">::</span><span class="n">perf_counter_data_vec_t</span><span class="o">::</span><span class="n">const_iterator</span> <span class="n">d</span> <span class="o">=</span> <span class="n">m_perf_counters</span><span class="o">-></span><span class="n">m_data</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">PerfCounters</span><span class="o">::</span><span class="n">perf_counter_data_vec_t</span><span class="o">::</span><span class="n">const_iterator</span> <span class="n">d_end</span> <span class="o">=</span> <span class="n">m_perf_counters</span><span class="o">-></span><span class="n">m_data</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="c1">// check that every item is valid
</span> <span class="k">for</span> <span class="p">(;</span> <span class="n">d</span> <span class="o">!=</span> <span class="n">d_end</span><span class="p">;</span> <span class="o">++</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">type</span> <span class="o">==</span> <span class="n">PERFCOUNTER_NONE</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">type</span> <span class="o">!=</span> <span class="n">PERFCOUNTER_NONE</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">PerfCounters</span> <span class="o">*</span><span class="n">ret</span> <span class="o">=</span> <span class="n">m_perf_counters</span><span class="p">;</span>
<span class="n">m_perf_counters</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// clear the internal pointer
</span> <span class="k">return</span> <span class="n">ret</span><span class="p">;</span> <span class="c1">// return the dynamically allocated object to the caller
</span><span class="p">}</span>
</code></pre></div></div>
<h3 id="usage">Usage</h3>
<p>Here is how a module uses it, taking FileStore as an example:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FileStore</span><span class="o">::</span><span class="n">FileStore</span><span class="p">(</span><span class="n">CephContext</span><span class="o">*</span> <span class="n">cct</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">base</span><span class="p">,</span>
<span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">jdev</span><span class="p">,</span> <span class="n">osflagbits_t</span> <span class="n">flags</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">do_update</span><span class="p">)</span> <span class="o">:</span>
<span class="p">......</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">PerfCountersBuilder</span> <span class="n">plb</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">internal_name</span><span class="p">,</span> <span class="n">l_filestore_first</span><span class="p">,</span> <span class="n">l_filestore_last</span><span class="p">);</span> <span class="c1">// the builder's constructor dynamically creates the PerfCounters
</span>
<span class="c1">// add items
</span> <span class="n">plb</span><span class="p">.</span><span class="n">add_u64</span><span class="p">(</span><span class="n">l_filestore_journal_queue_ops</span><span class="p">,</span> <span class="s">"journal_queue_ops"</span><span class="p">,</span> <span class="s">"Operations in journal queue"</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">plb</span><span class="p">.</span><span class="n">create_perf_counters</span><span class="p">();</span> <span class="c1">// take the dynamically allocated pointer
</span>
<span class="n">cct</span><span class="o">-></span><span class="n">get_perfcounters_collection</span><span class="p">()</span><span class="o">-></span><span class="n">add</span><span class="p">(</span><span class="n">logger</span><span class="p">);</span> <span class="c1">// add the PerfCounters to the collection
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
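<p>The build, validate, hand over, register sequence above can be reproduced with a small self-contained mock. Everything here (MockItem, MockPerfCounters, MockBuilder, MockCollection) is hypothetical and much simpler than the Ceph classes; for example, ownership is expressed with std::unique_ptr instead of a raw pointer, and the collection is keyed by pointer rather than sorted by name.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified mock of the builder/collection flow.
struct MockItem {
  std::string name;
  bool defined = false;  // plays the role of "type != PERFCOUNTER_NONE"
};

struct MockPerfCounters {
  std::string name;
  int lower, upper;
  std::vector<MockItem> items;
  MockPerfCounters(std::string n, int lo, int hi)
      : name(std::move(n)), lower(lo), upper(hi),
        items(static_cast<std::size_t>(hi - lo - 1)) {}
};

class MockBuilder {
 public:
  MockBuilder(const std::string& name, int first, int last)
      : pc_(new MockPerfCounters(name, first, last)) {}

  void add_u64(int idx, const std::string& item_name) {
    assert(idx > pc_->lower && idx < pc_->upper);
    MockItem& it = pc_->items[static_cast<std::size_t>(idx - pc_->lower - 1)];
    it.name = item_name;
    it.defined = true;
  }

  // Check that every slot was defined, then hand ownership to the caller.
  std::unique_ptr<MockPerfCounters> create() {
    for (const MockItem& it : pc_->items) assert(it.defined);
    return std::move(pc_);  // the builder no longer holds the object
  }

 private:
  std::unique_ptr<MockPerfCounters> pc_;
};

// Per-daemon registry of all counters.
struct MockCollection {
  std::set<MockPerfCounters*> loggers;
  void add(MockPerfCounters* l) { loggers.insert(l); }
  void remove(MockPerfCounters* l) { loggers.erase(l); }
};
```

<p>A module would construct the builder with its enum bounds, call add_u64 once per item, call create, and register the result with the collection, in the same order as the FileStore constructor above.</p>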
<h1 id="summary">Summary</h1>
<ul>
<li>
<p>Each daemon has a single PerfCountersCollection that records the PerfCounters of all modules</p>
</li>
<li>
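<p>Each module that needs performance counting implements a PerfCounters containing many items (k/v) and is assigned an index range; the ranges of different modules may in fact overlap</p>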
</li>
<li>
<p>Modules use PerfCountersBuilder to create their PerfCounters and validate the items</p>
</li>
</ul>
Ceph AsyncMessenger Stack
2017-01-12T00:00:00+00:00
http://blog.wjin.org/posts/ceph-asyncmessenger-stack
<h1 id="introduction">Introduction</h1>
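<p>With the rapid development of hardware, the bottleneck of storage systems has gradually shifted to the storage software itself. Toolkits such as DPDK/SPDK emerged
to help storage developers take advantage of the latest hardware and improve performance.</p>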
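<p>For Ceph, an open-source distributed storage system that is increasingly popular with developers, embracing these new technologies is a natural step. On the backend, the newest single-node storage engine, BlueStore, already supports SPDK.
On the network side, Ceph adds an abstract NetworkStack at the AsyncMessenger layer to support different protocol stacks (Posix/DPDK/RDMA).</p>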
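<p>An earlier <a href="http://blog.wjin.org/posts/ceph-async-messenger.html">post</a> analyzed the AsyncMessenger workflow in detail, based on the hammer release.
The AsyncMessenger framework has changed little since then. This post looks at how the NetworkStack abstraction layer is introduced to support different protocol stacks, using PosixStack as the main example.
The code referenced is master at commit 5b97cce360fe1f6b15dfad0866d90c85262f8253.</p>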
<h1 id="initialize">Initialize</h1>
<p>As before, start with the constructor:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">AsyncMessenger</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="n">entity_name_t</span> <span class="n">name</span><span class="p">,</span>
<span class="n">string</span> <span class="n">mname</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">_nonce</span><span class="p">)</span>
<span class="o">:</span> <span class="n">SimplePolicyMessenger</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span><span class="n">mname</span><span class="p">,</span> <span class="n">_nonce</span><span class="p">),</span>
<span class="p">......</span>
<span class="p">{</span>
<span class="n">ceph_spin_init</span><span class="p">(</span><span class="o">&</span><span class="n">global_seq_lock</span><span class="p">);</span>
<span class="n">StackSingleton</span> <span class="o">*</span><span class="n">single</span><span class="p">;</span>
<span class="n">cct</span><span class="o">-></span><span class="n">lookup_or_create_singleton_object</span><span class="o"><</span><span class="n">StackSingleton</span><span class="o">></span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="s">"AsyncMessenger::NetworkStack"</span><span class="p">);</span> <span class="c1">// one stack per process; this used to be a WorkerPool
</span> <span class="n">stack</span> <span class="o">=</span> <span class="n">single</span><span class="o">-></span><span class="n">stack</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="n">stack</span><span class="o">-></span><span class="n">start</span><span class="p">();</span> <span class="c1">// start the threads
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">StackSingleton</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">NetworkStack</span><span class="o">></span> <span class="n">stack</span><span class="p">;</span>
<span class="n">StackSingleton</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
<span class="n">stack</span> <span class="o">=</span> <span class="n">NetworkStack</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">c</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">ms_async_transport_type</span><span class="p">);</span> <span class="c1">// create a different stack depending on the type
</span> <span class="p">}</span>
<span class="o">~</span><span class="n">StackSingleton</span><span class="p">()</span> <span class="p">{</span>
<span class="n">stack</span><span class="o">-></span><span class="n">stop</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="n">NetworkStack</span><span class="o">::</span><span class="n">NetworkStack</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">t</span><span class="p">)</span><span class="o">:</span> <span class="n">type</span><span class="p">(</span><span class="n">t</span><span class="p">),</span> <span class="n">started</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span> <span class="n">cct</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">num_workers</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="n">create_worker</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// create a worker
</span> <span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">init</span><span class="p">(</span><span class="n">InitEventNumber</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// initialize the event center
</span> <span class="n">workers</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">w</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
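<p>Previously the constructor created a WorkerPool to manage all workers, and the worker class inherited from the thread class. Now this is replaced by the abstract NetworkStack, and thread handling moves into each stack's own implementation.
The worker class is reduced to owning a single EventCenter, and each stack implements its own worker. The inheritance relationships are:</p>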
<p><img src="/assets/img/post/ceph_asyncmessenger_networkstack.png" alt="img" />
<img src="/assets/img/post/ceph_asyncmessenger_worker.png" alt="img" /></p>
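<p>The factory step, where a configuration string selects the concrete stack, can be sketched as follows. This is an illustrative mock only: MockStack, MockPosixStack and MockRdmaStack are hypothetical classes, the type strings "posix" and "rdma" are assumptions made for the example, and the real NetworkStack::create takes a CephContext as well.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical mock hierarchy; the real classes carry far more state.
class MockStack {
 public:
  virtual ~MockStack() = default;
  virtual std::string name() const = 0;
  // Pick a concrete stack from a transport-type string.
  static std::shared_ptr<MockStack> create(const std::string& type);
};

class MockPosixStack : public MockStack {
 public:
  std::string name() const override { return "posix"; }
};

class MockRdmaStack : public MockStack {
 public:
  std::string name() const override { return "rdma"; }
};

std::shared_ptr<MockStack> MockStack::create(const std::string& type) {
  if (type == "posix") return std::make_shared<MockPosixStack>();
  if (type == "rdma") return std::make_shared<MockRdmaStack>();
  return nullptr;  // unknown type: the caller has to handle this
}
```

<p>Callers only hold a pointer to the base class, so adding another transport means adding one subclass and one branch in the factory.</p>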
<p>Next, the worker threads are initialized:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">NetworkStack</span><span class="o">::</span><span class="n">start</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">num_workers</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">workers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">is_init</span><span class="p">())</span>
<span class="k">continue</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">function</span><span class="o"><</span><span class="kt">void</span> <span class="p">()</span><span class="o">></span> <span class="kr">thread</span> <span class="o">=</span> <span class="n">add_thread</span><span class="p">(</span><span class="n">i</span><span class="p">);</span> <span class="c1">// get the lambda used as the thread entry point
</span> <span class="n">spawn_worker</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="kr">thread</span><span class="p">));</span> <span class="c1">// start the thread
</span> <span class="p">}</span>
<span class="n">started</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// how PosixNetworkStack creates a thread
</span><span class="k">virtual</span> <span class="kt">void</span> <span class="n">spawn_worker</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">function</span><span class="o"><</span><span class="kt">void</span> <span class="p">()</span><span class="o">></span> <span class="o">&&</span><span class="n">func</span><span class="p">)</span> <span class="k">override</span> <span class="p">{</span>
<span class="n">threads</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="p">(</span><span class="n">func</span><span class="p">);</span> <span class="c1">// create the thread with the C++11 thread library
</span><span class="p">}</span>
<span class="c1">// thread entry function
</span><span class="n">std</span><span class="o">::</span><span class="n">function</span><span class="o"><</span><span class="kt">void</span> <span class="p">()</span><span class="o">></span> <span class="n">NetworkStack</span><span class="o">::</span><span class="n">add_thread</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="n">workers</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="k">return</span> <span class="p">[</span><span class="k">this</span><span class="p">,</span> <span class="n">w</span><span class="p">]()</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">EventMaxWaitUs</span> <span class="o">=</span> <span class="mi">30000000</span><span class="p">;</span>
<span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">set_owner</span><span class="p">();</span>
<span class="n">w</span><span class="o">-></span><span class="n">initialize</span><span class="p">();</span>
<span class="n">w</span><span class="o">-></span><span class="n">init_done</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">w</span><span class="o">-></span><span class="n">done</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">process_events</span><span class="p">(</span><span class="n">EventMaxWaitUs</span><span class="p">);</span> <span class="c1">// process events
</span> <span class="p">}</span>
<span class="n">w</span><span class="o">-></span><span class="n">reset</span><span class="p">();</span>
<span class="n">w</span><span class="o">-></span><span class="n">destroy</span><span class="p">();</span>
<span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
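<p>Here the AsyncMessenger constructor already finishes initializing the stack's threads, so they are ready to process events. The thread entry function is tied to each worker's event handling, which is where the different worker implementations come in.
As mentioned earlier, each worker owns one EventCenter, and each EventCenter owns one EventDriver. DPDK has a user-space poll interface, so it implements its own DPDKDriver (still derived from EventDriver),
and adds poll-related structures to EventCenter.</p>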
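<p>The one-thread-per-worker loop described above can be sketched in isolation. This is a minimal illustration, not Ceph code: MiniWorker, MiniStack and the counter-based process_events() are hypothetical stand-ins for Worker, NetworkStack and EventCenter::process_events().</p>

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a Worker: only the pieces the loop needs.
struct MiniWorker {
  std::atomic<bool> done{false};
  std::atomic<unsigned> rounds{0};
  // Stand-in for center.process_events(): one pass over pending events.
  void process_events() {
    ++rounds;
    std::this_thread::yield();
  }
};

// Hypothetical stand-in for a NetworkStack that owns workers and threads.
class MiniStack {
 public:
  explicit MiniStack(unsigned n) : workers(n) {}

  // Build the thread entry function for worker i.
  std::function<void()> add_thread(unsigned i) {
    MiniWorker *w = &workers[i];
    return [w]() {
      while (!w->done) {
        w->process_events();
      }
    };
  }

  // Spawn one thread per worker.
  void start() {
    for (unsigned i = 0; i < workers.size(); ++i) {
      threads.emplace_back(add_thread(i));
    }
  }

  // Ask every worker to leave its loop, then join the threads.
  void stop() {
    for (auto &w : workers) w.done = true;
    for (auto &t : threads) t.join();
    threads.clear();
  }

  unsigned rounds(unsigned i) const { return workers[i].rounds; }

 private:
  std::vector<MiniWorker> workers;
  std::vector<std::thread> threads;
};
```

<p>Calling start() on a MiniStack with two workers gives two independent loops; stop() must be called before the object is destroyed, because destroying a joinable std::thread terminates the program.</p>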
<h1 id="worker">Worker</h1>
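<p>The abstract Worker class is fairly simple. Its two key abstract interfaces are listen and connect: the former creates a socket listening on the given address, which can then accept incoming connection requests;
the latter creates an already-connected socket through which data can be read and written.</p>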
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Worker</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">id</span><span class="p">;</span> <span class="c1">// worker ID
</span> <span class="n">EventCenter</span> <span class="n">center</span><span class="p">;</span> <span class="c1">// processes events
</span>
<span class="p">......</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">listen</span><span class="p">(</span><span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">addr</span><span class="p">,</span>
<span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opts</span><span class="p">,</span> <span class="n">ServerSocket</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">connect</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">addr</span><span class="p">,</span>
<span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opts</span><span class="p">,</span> <span class="n">ConnectedSocket</span> <span class="o">*</span><span class="n">socket</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
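<p>Socket implementations necessarily differ between protocol stacks, and this is exactly where the stacks diverge, so two generic wrapper classes, ServerSocket and ConnectedSocket, are abstracted out to hide the details:</p>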
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ConnectedSocket</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">ConnectedSocketImpl</span><span class="o">></span> <span class="n">_csi</span><span class="p">;</span> <span class="c1">// implementation depends on the concrete protocol stack
</span> <span class="kt">ssize_t</span> <span class="nf">read</span><span class="p">(</span><span class="kt">char</span><span class="o">*</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// read data
</span> <span class="k">return</span> <span class="n">_csi</span><span class="o">-></span><span class="n">read</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">ssize_t</span> <span class="nf">send</span><span class="p">(</span><span class="n">bufferlist</span> <span class="o">&</span><span class="n">bl</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">more</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// write data
</span> <span class="k">return</span> <span class="n">_csi</span><span class="o">-></span><span class="n">send</span><span class="p">(</span><span class="n">bl</span><span class="p">,</span> <span class="n">more</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">ServerSocket</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">ServerSocketImpl</span><span class="o">></span> <span class="n">_ssi</span><span class="p">;</span> <span class="c1">// implementation depends on the concrete protocol stack
</span>
<span class="kt">int</span> <span class="nf">accept</span><span class="p">(</span><span class="n">ConnectedSocket</span> <span class="o">*</span><span class="n">sock</span><span class="p">,</span> <span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opt</span><span class="p">,</span> <span class="n">entity_addr_t</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="n">Worker</span> <span class="o">*</span><span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// accept a connection request; on success the first argument holds the connected socket
</span> <span class="k">return</span> <span class="n">_ssi</span><span class="o">-></span><span class="n">accept</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="n">opt</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">w</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
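<p>The wrapper-plus-implementation split can be tried out on its own. The sketch below is a hypothetical miniature: SocketImplSketch, LoopbackImpl, SocketSketch and connect_sketch() are names invented here, std::string stands in for bufferlist, and the "stack" is just an in-memory buffer. It only illustrates how a worker-style factory hands back a wrapper that hides the concrete implementation.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface, playing the role of ConnectedSocketImpl.
class SocketImplSketch {
 public:
  virtual ~SocketImplSketch() = default;
  virtual std::ptrdiff_t read(char *buf, std::size_t len) = 0;
  virtual std::ptrdiff_t send(const std::string &data) = 0;
};

// Toy "protocol stack": whatever is sent can be read back.
class LoopbackImpl : public SocketImplSketch {
 public:
  std::ptrdiff_t read(char *buf, std::size_t len) override {
    std::size_t n = std::min(len, pending.size());
    std::memcpy(buf, pending.data(), n);
    pending.erase(0, n);
    return static_cast<std::ptrdiff_t>(n);
  }
  std::ptrdiff_t send(const std::string &data) override {
    pending += data;
    return static_cast<std::ptrdiff_t>(data.size());
  }

 private:
  std::string pending;
};

// Wrapper that callers use; it forwards to whichever implementation it owns.
class SocketSketch {
 public:
  SocketSketch() = default;
  explicit SocketSketch(std::unique_ptr<SocketImplSketch> i) : impl(std::move(i)) {}
  std::ptrdiff_t read(char *buf, std::size_t len) { return impl->read(buf, len); }
  std::ptrdiff_t send(const std::string &data) { return impl->send(data); }
  explicit operator bool() const { return static_cast<bool>(impl); }

 private:
  std::unique_ptr<SocketImplSketch> impl;
};

// Plays the role of a worker's connect(): fills in the caller's wrapper.
int connect_sketch(SocketSketch *out) {
  *out = SocketSketch(std::unique_ptr<SocketImplSketch>(new LoopbackImpl));
  return 0;
}
```

<p>A caller default-constructs a SocketSketch, passes its address to connect_sketch(), and from then on only uses read and send; swapping LoopbackImpl for another implementation would not change the calling code.</p>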
<p>The posix implementation works the same way as before:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">PosixWorker</span><span class="o">::</span><span class="n">listen</span><span class="p">(</span><span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">sa</span><span class="p">,</span> <span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opt</span><span class="p">,</span>
<span class="n">ServerSocket</span> <span class="o">*</span><span class="n">sock</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">listen_sd</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">create_socket</span><span class="p">(</span><span class="n">sa</span><span class="p">.</span><span class="n">get_family</span><span class="p">(),</span> <span class="nb">true</span><span class="p">);</span> <span class="c1">// create the socket
</span> <span class="n">r</span> <span class="o">=</span> <span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">get_sockaddr</span><span class="p">(),</span> <span class="n">sa</span><span class="p">.</span><span class="n">get_sockaddr_len</span><span class="p">());</span> <span class="c1">// bind the address
</span> <span class="n">r</span> <span class="o">=</span> <span class="o">::</span><span class="n">listen</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="mi">128</span><span class="p">);</span> <span class="c1">// start listening
</span> <span class="o">*</span><span class="n">sock</span> <span class="o">=</span> <span class="n">ServerSocket</span><span class="p">(</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">PosixServerSocketImpl</span><span class="o">></span><span class="p">(</span>
<span class="k">new</span> <span class="n">PosixServerSocketImpl</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">listen_sd</span><span class="p">)));</span> <span class="c1">// create the serversocket and hand it back to the caller, who can then accept on it
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">PosixWorker</span><span class="o">::</span><span class="n">connect</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">addr</span><span class="p">,</span> <span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opts</span><span class="p">,</span> <span class="n">ConnectedSocket</span> <span class="o">*</span><span class="n">socket</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sd</span><span class="p">;</span>
<span class="n">sd</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">addr</span><span class="p">);</span> <span class="c1">// initiate the connection
</span> <span class="c1">// create the connectedsocket and hand it back to the caller, who can then send/read on it
</span> <span class="o">*</span><span class="n">socket</span> <span class="o">=</span> <span class="n">ConnectedSocket</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">PosixConnectedSocketImpl</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">PosixConnectedSocketImpl</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">sd</span><span class="p">,</span> <span class="o">!</span><span class="n">opts</span><span class="p">.</span><span class="n">nonblock</span><span class="p">)));</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="socket">Socket</h1>
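<p>Each communication mode implements the abstract socket interfaces in its own way. The inheritance hierarchy is as follows:</p>
<p><img src="/assets/img/post/ceph_asyncmessenger_socket.png" alt="img" /></p>
<p>Taking the posix implementation as an example:</p>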
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">PosixServerSocketImpl</span><span class="o">::</span><span class="n">accept</span><span class="p">(</span><span class="n">ConnectedSocket</span> <span class="o">*</span><span class="n">sock</span><span class="p">,</span> <span class="k">const</span> <span class="n">SocketOptions</span> <span class="o">&</span><span class="n">opt</span><span class="p">,</span> <span class="n">entity_addr_t</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="n">Worker</span> <span class="o">*</span><span class="n">w</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sd</span> <span class="o">=</span> <span class="o">::</span><span class="n">accept</span><span class="p">(</span><span class="n">_fd</span><span class="p">,</span> <span class="p">(</span><span class="n">sockaddr</span><span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">ss</span><span class="p">,</span> <span class="o">&</span><span class="n">slen</span><span class="p">);</span> <span class="c1">// accept the connection request
</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o"><</span><span class="n">PosixConnectedSocketImpl</span><span class="o">></span> <span class="n">csi</span><span class="p">(</span><span class="k">new</span> <span class="n">PosixConnectedSocketImpl</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="n">sd</span><span class="p">,</span> <span class="nb">true</span><span class="p">));</span> <span class="c1">// create the connected socket
</span> <span class="o">*</span><span class="n">sock</span> <span class="o">=</span> <span class="n">ConnectedSocket</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">csi</span><span class="p">));</span> <span class="c1">// hand it back to the caller
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
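<p>The ConnectedSocket implementation consists mainly of the read/send interfaces, which map to system calls such as read and sendmsg, so it is not covered further here.</p>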
<h1 id="listen">Listen</h1>
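<p>At this point we know that different stacks create different workers, and that a worker uses listen and connect to create a serversocket that can listen and a connectedsocket that can send and receive data, with each implementation backed by its own kind of socket.
The other modules of async messenger listen and read/write data through these two abstract sockets, and that is how network communication is implemented.</p>
<p>As an example, consider how a process binds to a specific address and listens on it; recall the earlier post:</p>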
<blockquote>
<p>AsyncMessenger.bind -> Processor.bind()</p>
</blockquote>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">Processor</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">bind_addr</span><span class="p">,</span>
<span class="k">const</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>&</span> <span class="n">avoid_ports</span><span class="p">,</span>
<span class="n">entity_addr_t</span><span class="o">*</span> <span class="n">bound_addr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// submit an external event to the worker's event center; when the event runs, it executes the lambda passed as the second argument, i.e. it calls the worker's listen
</span> <span class="n">worker</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">submit_to</span><span class="p">(</span><span class="n">worker</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">get_id</span><span class="p">(),</span> <span class="p">[</span><span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">listen_addr</span><span class="p">,</span> <span class="o">&</span><span class="n">opts</span><span class="p">,</span> <span class="o">&</span><span class="n">r</span><span class="p">]()</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">worker</span><span class="o">-></span><span class="n">listen</span><span class="p">(</span><span class="n">listen_addr</span><span class="p">,</span> <span class="n">opts</span><span class="p">,</span> <span class="o">&</span><span class="n">listen_socket</span><span class="p">);</span> <span class="c1">// once the event has run, listen_socket is set and can be used for accept
</span> <span class="p">},</span> <span class="nb">false</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
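<p>The lambda above captures r, opts and listen_addr by reference, which is only safe if the submitting thread waits until the event has run on the worker's thread; the excerpt suggests that the trailing false argument selects this blocking behaviour, but that reading is an assumption. A minimal sketch of the "run this closure on the loop thread and wait for it" pattern follows. MiniCenter and submit_and_wait() are hypothetical names, not the EventCenter API.</p>

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical miniature of an event center that runs external events
// on its own loop thread.
class MiniCenter {
 public:
  MiniCenter() : loop_thread([this] { loop(); }) {}

  ~MiniCenter() {
    {
      std::lock_guard<std::mutex> g(m);
      stopping = true;
    }
    wake.notify_all();
    loop_thread.join();
  }

  // Queue f for the loop thread and block until it has finished there.
  void submit_and_wait(std::function<void()> f) {
    bool finished = false;
    std::unique_lock<std::mutex> l(m);
    pending.push_back([&] {
      f();
      std::lock_guard<std::mutex> g(m);
      finished = true;
      done.notify_all();
    });
    wake.notify_all();
    done.wait(l, [&] { return finished; });
  }

 private:
  void loop() {
    std::unique_lock<std::mutex> l(m);
    while (true) {
      wake.wait(l, [this] { return stopping || !pending.empty(); });
      if (pending.empty()) return;  // stopping and nothing left to run
      std::function<void()> f = std::move(pending.front());
      pending.pop_front();
      l.unlock();  // run the event without holding the queue lock
      f();
      l.lock();
    }
  }

  std::mutex m;
  std::condition_variable wake;  // signals the loop thread
  std::condition_variable done;  // signals blocked submitters
  std::deque<std::function<void()>> pending;
  bool stopping = false;
  std::thread loop_thread;  // declared last so the members above exist first
};
```

<p>Because submit_and_wait() returns only after the closure has completed, a caller can capture a local result variable by reference and read it immediately afterwards, which mirrors how r is used in Processor::bind.</p>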
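<p>As before, the listening socket now exists, but its fd has not yet been registered with the event center; that still has to wait until the osd initializes:</p>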
<blockquote>
<p>OSD.init() -> AsyncMessenger.ready() -> Processor.start()</p>
</blockquote>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Processor</span><span class="o">::</span><span class="n">start</span><span class="p">()</span>
<span class="p">{</span>
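<span class="c1">// submit an external event to the worker's event center; when the event runs, it executes the lambda passed as the second argument, i.e. it registers the fd with the event center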
</span> <span class="k">if</span> <span class="p">(</span><span class="n">listen_socket</span><span class="p">)</span> <span class="p">{</span>
<span class="n">worker</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">submit_to</span><span class="p">(</span><span class="n">worker</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">get_id</span><span class="p">(),</span> <span class="p">[</span><span class="k">this</span><span class="p">]()</span> <span class="p">{</span>
<span class="n">worker</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">create_file_event</span><span class="p">(</span><span class="n">listen_socket</span><span class="p">.</span><span class="n">fd</span><span class="p">(),</span> <span class="n">EVENT_READABLE</span><span class="p">,</span> <span class="n">listen_handler</span><span class="p">);</span> <span class="p">},</span> <span class="nb">false</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// the readable event registered above uses listen_handler as its callback, which runs the following function
</span><span class="kt">void</span> <span class="n">Processor</span><span class="o">::</span><span class="n">accept</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">.......</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">listen_socket</span><span class="p">.</span><span class="n">accept</span><span class="p">(</span><span class="o">&</span><span class="n">cli_socket</span><span class="p">,</span> <span class="n">opts</span><span class="p">,</span> <span class="o">&</span><span class="n">addr</span><span class="p">,</span> <span class="n">w</span><span class="p">);</span> <span class="c1">// call accept on the listening socket; on success the first argument is a connectedsocket that can send and receive data
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="connect">Connect</h1>
<p>Next, an example of a client connecting; recall the earlier post:</p>
<blockquote>
<p>AsyncMessenger.create_connect() -> AsyncConnection.connect() -> AsyncConnection._process_connection()</p>
</blockquote>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">_process_connection</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">case</span> <span class="n">STATE_CONNECTING</span><span class="p">:</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">worker</span><span class="o">-></span><span class="n">connect</span><span class="p">(</span><span class="n">get_peer_addr</span><span class="p">(),</span> <span class="n">opts</span><span class="p">,</span> <span class="o">&</span><span class="n">cs</span><span class="p">);</span> <span class="c1">// call the worker's connect
</span> <span class="n">center</span><span class="o">-></span><span class="n">create_file_event</span><span class="p">(</span><span class="n">cs</span><span class="p">.</span><span class="n">fd</span><span class="p">(),</span> <span class="n">EVENT_READABLE</span><span class="p">,</span> <span class="n">read_handler</span><span class="p">);</span> <span class="c1">// register the connected connectedsocket with the event center
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_CONNECTING_RE</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>After that, data is read and sent through the connectedsocket; the framework is basically unchanged.</p>
<h1 id="summary">Summary</h1>
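<p>To summarize: a NetworkStack layer has been introduced, together with a worker for each kind of stack. The worker's main job is to provide the listen and connect interfaces, which create an abstract ServerSocket and ConnectedSocket respectively;
the former listens for requests and the latter reads and writes data. The main change elsewhere in AsyncMessenger is to call these two abstract worker interfaces. For DPDK's polling mode,
some special handling was added inside the worker's EventCenter, along with a DPDKDriver.</p>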
Ceph Dynamic Throttle
2016-12-02T00:00:00+00:00
http://blog.wjin.org/posts/ceph-dynamic-throttle
<h1 id="introduction">Introduction</h1>
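<p>Ceph Jewel added a new throttle implementation called BackoffThrottle, and the filestore storage engine uses it to throttle the backend journal and the apply op queue dynamically.
Because throttling is closely tied to performance, it deserves careful tuning. The principle behind BackoffThrottle is very simple: it dynamically inserts a delay that blocks the calling thread, and that delay is controlled by a set of parameters.
Since the delay changes dynamically, the throttling cost is effectively amortized over every call of every calling thread. The previous throttle simply blocked as soon as the threshold was reached, which clearly produces long-tail latency,
whereas the newly introduced BackoffThrottle spreads the wait across operations, which is smoother and avoids the long tail; see the <a href="https://github.com/ceph/ceph/pull/7767">pr</a>.</p>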
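<p>How is the delay computed? First split the interval [0,1] into three parts, [0, low_threshhold), [low_threshhold, high_threshhold) and [high_threshhold, 1], with slopes 0, s0 and s1 respectively.
Let the throttle's maximum be max and its current value be current, and let x = current/max. Once x falls into one of these intervals, the delay is computed with the following formulas:</p>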
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// see the comments in the source for details; l and h are the low and high thresholds, e is the delay at x = h, m is the delay at x = 1
</span><span class="n">delay</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span> <span class="n">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">l</span><span class="p">)</span>
<span class="n">delay</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">l</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">e</span> <span class="o">/</span> <span class="p">(</span><span class="n">h</span> <span class="o">-</span> <span class="n">l</span><span class="p">)),</span> <span class="n">x</span> <span class="n">in</span> <span class="p">[</span><span class="n">l</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
<span class="n">delay</span> <span class="o">=</span> <span class="n">e</span> <span class="o">+</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">h</span><span class="p">)</span> <span class="o">*</span> <span class="p">((</span><span class="n">m</span> <span class="o">-</span> <span class="n">e</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">h</span><span class="p">)),</span> <span class="n">x</span> <span class="n">in</span> <span class="p">[</span><span class="n">h</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p><img src="/assets/img/post/ceph_backoffthrottle.png" alt="img" /></p>
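<p>As the figure shows, in the first interval, when load is light, the delay is 0 and no wait is needed. As load grows and x enters the second interval, the delay kicks in and increases gradually.
Under heavy load x enters the third interval, where the delay grows markedly faster and the wait becomes much longer, slowing io as much as possible to relieve the pressure; hence the name dynamic throttle.</p>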
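<p>The piecewise formula can be checked with a few numbers. The function below is a direct transcription of the three cases, assuming 0 &lt;= l &lt; h &lt; 1; BackoffParams and backoff_delay() are hypothetical names for this sketch, and the sample parameters are chosen only to make the arithmetic easy, not taken from any Ceph default.</p>

```cpp
// Parameters named after the symbols in the formula:
// l, h are the low/high thresholds, e is the delay at x = h, m the delay at x = 1.
struct BackoffParams {
  double l;
  double h;
  double e;
  double m;
};

// Piecewise-linear delay for a fill ratio x = current / max.
// Assumes 0 <= l < h < 1, so neither divisor is zero.
double backoff_delay(const BackoffParams &p, double x) {
  if (x < p.l) {
    return 0.0;  // first interval: no delay
  }
  if (x < p.h) {
    return (x - p.l) * (p.e / (p.h - p.l));  // slope s0
  }
  return p.e + (x - p.h) * ((p.m - p.e) / (1.0 - p.h));  // slope s1
}
```

<p>With l = 0.25, h = 0.75, e = 1 and m = 5, the slopes are s0 = 1 / 0.5 = 2 and s1 = 4 / 0.25 = 16: x = 0.5 gives a delay of 0.5, x = 0.75 gives 1, x = 0.875 gives 3 and x = 1 gives 5. The two pieces meet at x = h, and the curve is much steeper beyond it.</p>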
<h1 id="implementation">Implementation</h1>
<p>The principle is simple and the source is easy to follow; the same approach can be borrowed wholesale when a similar situation comes up in your own code. Here is a quick look at the implementation, see Throttle.[h|cc]:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BackoffThrottle</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">next_cond</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// index into conds
</span> <span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">condition_variable</span><span class="o">></span> <span class="n">conds</span><span class="p">;</span> <span class="c1">// effectively a form of sharding, so that not every thread waits on the same condition variable
</span>
<span class="n">list</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">condition_variable</span><span class="o">*></span> <span class="n">waiters</span><span class="p">;</span> <span class="c1">// fifo queue of waiters
</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="kt">double</span><span class="o">></span> <span class="n">_get_delay</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">c</span><span class="p">)</span> <span class="k">const</span><span class="p">;</span> <span class="c1">// computes the delay, see the formulas above
</span>
<span class="n">std</span><span class="o">::</span><span class="n">list</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">condition_variable</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">_push_waiter</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">next</span> <span class="o">=</span> <span class="n">next_cond</span><span class="o">++</span><span class="p">;</span> <span class="c1">// pick the condition variable for this wait
</span> <span class="k">if</span> <span class="p">(</span><span class="n">next_cond</span> <span class="o">==</span> <span class="n">conds</span><span class="p">.</span><span class="n">size</span><span class="p">())</span>
<span class="n">next_cond</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">waiters</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">waiters</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="o">&</span><span class="p">(</span><span class="n">conds</span><span class="p">[</span><span class="n">next</span><span class="p">]));</span> <span class="c1">// append it to the list
</span> <span class="p">}</span>
<span class="kt">void</span> <span class="n">_kick_waiters</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">waiters</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span>
<span class="n">waiters</span><span class="p">.</span><span class="n">front</span><span class="p">()</span><span class="o">-></span><span class="n">notify_all</span><span class="p">();</span> <span class="c1">// wake the threads sleeping on the condition variable at the head of the list
</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The core of the implementation is the get function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="kt">double</span><span class="o">></span> <span class="n">BackoffThrottle</span><span class="o">::</span><span class="n">get</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">delay</span> <span class="o">=</span> <span class="n">_get_delay</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// compute the wait time up front
</span>
<span class="c1">// no wait needed, return immediately
</span> <span class="k">if</span> <span class="p">(</span><span class="n">delay</span> <span class="o">==</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="kt">double</span><span class="o">></span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">&&</span>
<span class="n">waiters</span><span class="p">.</span><span class="n">empty</span><span class="p">()</span> <span class="o">&&</span>
<span class="p">((</span><span class="n">max</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">current</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">||</span> <span class="p">((</span><span class="n">current</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">current</span> <span class="o">+=</span> <span class="n">c</span><span class="p">;</span>
<span class="k">return</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="kt">double</span><span class="o">></span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">auto</span> <span class="n">ticket</span> <span class="o">=</span> <span class="n">_push_waiter</span><span class="p">();</span> <span class="c1">// pick a condition variable to wait on and append it to the list
</span> <span class="k">while</span> <span class="p">(</span><span class="n">waiters</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span> <span class="o">!=</span> <span class="n">ticket</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// wait until we reach the head of the list
</span> <span class="p">(</span><span class="o">*</span><span class="n">ticket</span><span class="p">)</span><span class="o">-></span><span class="n">wait</span><span class="p">(</span><span class="n">l</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">auto</span> <span class="n">start</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">system_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
<span class="n">delay</span> <span class="o">=</span> <span class="n">_get_delay</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// compute the wait time again, now that our condition variable is at the head of the list
</span> <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">((</span><span class="n">max</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">current</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">current</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// over the limit (current + c > max): keep waiting until woken up
</span> <span class="p">(</span><span class="o">*</span><span class="n">ticket</span><span class="p">)</span><span class="o">-></span><span class="n">wait</span><span class="p">(</span><span class="n">l</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">delay</span> <span class="o">></span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="kt">double</span><span class="o">></span><span class="p">(</span><span class="mi">0</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// wait for the computed delay
</span> <span class="p">(</span><span class="o">*</span><span class="n">ticket</span><span class="p">)</span><span class="o">-></span><span class="n">wait_for</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">delay</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">assert</span><span class="p">(</span><span class="n">ticket</span> <span class="o">==</span> <span class="n">waiters</span><span class="p">.</span><span class="n">begin</span><span class="p">());</span>
<span class="n">delay</span> <span class="o">=</span> <span class="n">_get_delay</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">system_clock</span><span class="o">::</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">);</span> <span class="c1">// recompute the remaining wait time
</span> <span class="p">}</span>
<span class="n">waiters</span><span class="p">.</span><span class="n">pop_front</span><span class="p">();</span> <span class="c1">// remove this waiter's condition variable
</span> <span class="n">_kick_waiters</span><span class="p">();</span> <span class="c1">// wake up the waiters queued behind
</span>
<span class="n">current</span> <span class="o">+=</span> <span class="n">c</span><span class="p">;</span> <span class="c1">// get succeeded; update the counter
</span> <span class="k">return</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">system_clock</span><span class="o">::</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="usage">Usage</h1>
<p>There are two users. The first is journal throttling, where JournalThrottle wraps BackoffThrottle; the second is the FileStore op queue, which throttles both ops and bytes.</p>
<h1 id="tuning">Tuning</h1>
<p>For performance tuning, pay attention to the following parameters:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filestore_queue_max_ops</span>
<span class="n">filestore_queue_max_bytes</span>
<span class="n">filestore_caller_concurrency</span>
<span class="n">filestore_expected_throughput_bytes</span>
<span class="n">filestore_expected_throughput_ops</span>
<span class="n">filestore_queue_low_threshhold</span>
<span class="n">filestore_queue_high_threshhold</span>
<span class="n">filestore_queue_max_delay_multiple</span>
<span class="n">filestore_queue_high_delay_multiple</span>
<span class="n">journal_throttle_low_threshhold</span>
<span class="n">journal_throttle_high_threshhold</span>
<span class="n">journal_throttle_high_multiple</span>
<span class="n">journal_throttle_max_multiple</span>
</code></pre></div></div>
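<p>To see how the thresholds and multiples above interact, here is a minimal sketch of a piecewise-linear delay, which is how I understand _get_delay to work. BackoffParams and backoff_delay are hypothetical names for illustration, not Ceph code, and the mapping of fields to option names in the comments is an assumption.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle; each field stands for one family of the options above.
struct BackoffParams {
  double low_threshold;        // e.g. filestore_queue_low_threshhold
  double high_threshold;       // e.g. filestore_queue_high_threshhold
  double high_multiple;        // e.g. filestore_queue_high_delay_multiple
  double max_multiple;         // e.g. filestore_queue_max_delay_multiple
  double expected_throughput;  // e.g. filestore_expected_throughput_ops
  uint64_t max;                // e.g. filestore_queue_max_ops
};

// Delay in seconds for a request of size c while `current` units are in flight.
// Assumed model: no delay below the low threshold, a linear ramp between the
// two thresholds, and a steeper linear ramp from the high threshold up to max.
inline double backoff_delay(const BackoffParams& p, uint64_t current, uint64_t c) {
  assert(p.low_threshold < p.high_threshold && p.high_threshold < 1.0);
  assert(p.expected_throughput > 0);
  if (p.max == 0) return 0.0;  // throttle disabled
  const double high_delay = p.high_multiple / p.expected_throughput;
  const double max_delay = p.max_multiple / p.expected_throughput;
  const double r = static_cast<double>(current) / static_cast<double>(p.max);
  if (r < p.low_threshold) return 0.0;
  if (r < p.high_threshold) {
    const double s0 = high_delay / (p.high_threshold - p.low_threshold);
    return c * (r - p.low_threshold) * s0;
  }
  const double s1 = (max_delay - high_delay) / (1.0 - p.high_threshold);
  return c * (high_delay + (r - p.high_threshold) * s1);
}
```

<p>Under this model, raising the low threshold postpones the point at which callers start to be delayed, while the two multiples set how steeply the delay grows once the queue fills up.</p>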
Ceph Librbd Overview
2016-10-27T00:00:00+00:00
http://blog.wjin.org/posts/ceph-librbd-overview
<h1 id="overview">Overview</h1>
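<p>I skimmed the librbd code back in the early firefly and hammer releases. There were few features then, the code was only about 10,000 lines, and it was fairly simple. By the jewel release many new features had been added,
and both the code size and the complexity had grown. The newest maintainer needed only a little over a year to reach the top 5 contributors :(</p>
<p>Although there is more code now, the author's design skills deserve praise and the code itself is highly readable. This post briefly describes the role of each librbd class or file, based on jewel release 10.2.3.</p>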
<h1 id="class">Class</h1>
<ul>
<li><strong>AioCompletion</strong></li>
</ul>
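<p>Handles the upper-layer IO callback. This class tracks whether one asynchronous user IO request (mainly read/write/discard/flush) has completed. It has a member AsyncOperation async_op,
which controls the start/finish of a single IO operation. One IO may map to requests on several objects; the internal counter pending_count decides whether the IO is complete. The callback path is:</p>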
<blockquote>
<p>complete_request -> complete -> complete_cb</p>
</blockquote>
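<p>The counting idea behind that callback path can be sketched as follows. MiniCompletion is a hypothetical, single-threaded stand-in; the real AioCompletion also deals with locking and with the different IO types, which are left out here.</p>

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for AioCompletion: tracks outstanding object requests.
class MiniCompletion {
 public:
  explicit MiniCompletion(std::function<void(int)> cb) : cb_(std::move(cb)) {}

  // Set before the object requests are sent.
  void set_request_count(int n) {
    assert(n > 0);
    pending_count_ = n;
  }

  // Called once per finished object request; r < 0 is an error code.
  void complete_request(int r) {
    assert(pending_count_ > 0);
    if (r < 0 && rval_ == 0) rval_ = r;     // remember the first error
    if (--pending_count_ == 0) complete();  // last request finishes the IO
  }

  bool done() const { return done_; }

 private:
  void complete() {
    done_ = true;
    if (cb_) cb_(rval_);  // corresponds to the user's complete_cb
  }

  std::function<void(int)> cb_;
  int pending_count_ = 0;
  int rval_ = 0;
  bool done_ = false;
};
```

<p>The user callback runs exactly once, after the last object request reports back, no matter in which order the requests finish.</p>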
<ul>
<li><strong>AioImageRequestWQ</strong></li>
</ul>
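<p>The queue through which IO requests are submitted asynchronously, which makes the aio interface fully non-blocking. The thread pool has exactly one thread serving the queue. Because the queue is the single entry point for IO requests, blocking (suspending) IO becomes easy, and this class provides that capability.
When the thread pool dequeues a request it calls AioImageRequestWQ::process, which eventually calls AioImageRequest::send.</p>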
<ul>
<li><strong>AioImageRequest</strong></li>
</ul>
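<p>Processes an IO request. One image request may be split into several object requests (AioObjectRequest). Before the object requests are sent, AioCompletion::pending_count is set to the total number of requests.
Only when all object requests have finished does the check inside AioCompletion::complete_request become true, marking the user's IO request as fully complete and triggering the callback the user originally registered.
The splitting logic lives in send_request and relies mainly on Striper::file_to_extents, which converts a logical linear address range into object extents.</p>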
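<p>As a rough illustration of that splitting step, the sketch below maps an image byte range to per-object extents. It assumes the simplest layout, with no striping across objects (one stripe unit per object), so it is much simpler than Striper::file_to_extents. ObjectExtent and map_extent here are hypothetical names.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, reduced version of an object extent.
struct ObjectExtent {
  uint64_t object_no;  // index of the object inside the image
  uint64_t offset;     // byte offset inside that object
  uint64_t length;     // number of bytes in that object
};

// Split [image_off, image_off + len) into extents, one per touched object.
inline std::vector<ObjectExtent> map_extent(uint64_t image_off, uint64_t len,
                                            uint64_t object_size) {
  assert(object_size > 0);
  std::vector<ObjectExtent> out;
  while (len > 0) {
    const uint64_t object_no = image_off / object_size;
    const uint64_t off = image_off % object_size;
    const uint64_t n = std::min(len, object_size - off);  // stay inside the object
    out.push_back({object_no, off, n});
    image_off += n;
    len -= n;
  }
  return out;
}
```

<p>The size of the returned vector is the value that pending_count would be set to before the object requests are sent.</p>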
<ul>
<li><strong>AioObjectRequest</strong></li>
</ul>
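<p>The IO logic for a single object; it sends requests to rados. It is mostly read and write logic. When the image was produced by a clone and an object does not exist in the current image, the read or write has to fetch it from the parent image, i.e. a copyup operation.
Both reads and writes are small internal state machines; see should_complete.</p>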
<ul>
<li><strong>CopyupRequest</strong></li>
</ul>
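<p>Handles requests that read object content from the parent image. This is also treated as an IO operation, so there is a member AsyncOperation m_async_op that controls the start and end of the IO.
Note the difference between COR (copy-on-read) and COW (copy-on-write), and that the object map must be updated once the content has been read.</p>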
<ul>
<li><strong>ImageCtx</strong></li>
</ul>
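<p>The image's catch-all class. It holds members such as ImageWatcher/ImageState/ExclusiveLock/ObjectMap/Operations/Journal, quite a few locks, and the two most important queues, aio_work_queue and op_work_queue.
It also holds the list async_ops for tracking IO and the list async_requests for tracking maintenance operations.</p>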
<ul>
<li><strong>ImageWatcher</strong></li>
</ul>
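<p>Several clients may open the same image at once. To protect data consistency the exclusive_lock feature was added: only the client holding the exclusive lock may modify the image's data and metadata,
and after a modification it must notify the other clients about the change. This class sends and handles those notifications. When the image is opened, a watch is added in order to receive the messages that are sent.
On the OSD side the implementation relies on the watch/notify mechanism.</p>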
<ul>
<li><strong>WatchNotifyTypes</strong></li>
</ul>
<p>Mainly wraps the message types that ImageWatcher has to handle. It contains no real logic and is fairly simple.</p>
<ul>
<li><strong>ImageState</strong></li>
</ul>
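<p>With the exclusive lock feature, opening and closing an image became more complex. This class handles the image state (open/close/refresh/snap). On open it initializes the ImageWatcher member and registers a handle with rados for receiving notifications.
Internally it is a small state machine that runs a different action for each state; the individual requests are implemented in the image directory. Note that this file also contains the class ImageUpdateWatchers, which is unrelated to the ImageWatcher above.
ImageWatcher deals with multiple clients, i.e. watching across different processes, whereas ImageUpdateWatchers is the client's own local watch: when the image state changes it notifies the registered watchers so they can react, without going through the OSD watch/notify.</p>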
<ul>
<li><strong>Operations</strong></li>
</ul>
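<p>Handles maintenance operations on the image (flatten/resize/rename/snap and so on). The implementation first tries to acquire the exclusive lock. If that succeeds the operation runs locally; otherwise the operation is wrapped in a message,
sent by ImageWatcher, and executed on its behalf by the remote client that holds the exclusive lock.</p>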
<ul>
<li><strong>AsyncOperation</strong></li>
</ul>
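<p>Unlike the maintenance operations in Operations, this class belongs to the IO path. It tracks the start and end of an IO: start_op inserts the object into the list ImageCtx::async_ops and finish_op removes it.</p>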
<ul>
<li><strong>AsyncRequest</strong></li>
</ul>
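<p>Tracks the start and end of maintenance requests. Operations handled by the Operations class, and some ObjectMap operations, are eventually wrapped in a request that derives from this class. start_request inserts the object into the list ImageCtx::async_requests and finish_request removes it.</p>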
<ul>
<li><strong>ExclusiveLock</strong></li>
</ul>
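<p>The distributed lock implementation. An internal state machine handles the states and runs the matching action, i.e. sends the matching request such as acquire/release. Each request is implemented in the exclusive_lock directory.
The class can also block requests, which makes it possible to forbid maintenance operations on the image.</p>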
<ul>
<li><strong>ObjectMap</strong></li>
</ul>
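<p>Handles deleting and updating the object map. Internally it maintains a bitmap over all objects of an image, so the existence of an object can be checked quickly instead of sending a message to the rados cluster and waiting for the reply.
Each request is implemented in the object_map directory.</p>
<p>There are also files related to the journal feature and to data groups. They are not needed yet and are left for a later analysis.</p>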
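<p>The bitmap idea can be sketched in a few lines. MiniObjectMap is a hypothetical class that keeps one bit per object; it ignores persistence, locking and any extra per-object states the real object map may track.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory object map: one existence bit per object.
class MiniObjectMap {
 public:
  MiniObjectMap(uint64_t image_size, uint64_t object_size)
      : exists_(object_count(image_size, object_size), false) {}

  void mark_exists(uint64_t object_no) { exists_.at(object_no) = true; }
  void mark_removed(uint64_t object_no) { exists_.at(object_no) = false; }

  // false means the object is known to be absent, so no request
  // to the cluster is needed to find that out.
  bool may_exist(uint64_t object_no) const { return exists_.at(object_no); }

  std::size_t size() const { return exists_.size(); }

 private:
  // Number of objects needed to cover the image, rounding up.
  static std::size_t object_count(uint64_t image_size, uint64_t object_size) {
    assert(object_size > 0);
    return static_cast<std::size_t>((image_size + object_size - 1) / object_size);
  }

  std::vector<bool> exists_;
};
```

<p>A read of an object whose bit is clear can be answered locally (for a clone, by going to the parent), which is where the speedup for thin-provisioned images comes from.</p>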
<h1 id="thread">Thread</h1>
<ul>
<li>
<p>ImageWatcher has a TaskFinisher member that contains two threads, Finisher and SafeTimer. They receive and send notify messages and handle message timeouts.
These messages are mostly related to requests.</p>
</li>
<li>
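<p>ImageCtx contains two members, AioImageRequestWQ* aio_work_queue and ContextWQ* op_work_queue. The two queues share one thread pool, and that pool has only one thread.
The first queue is mainly for IO submission and the second for running callbacks; the single thread preserves IO order.</p>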
</li>
<li>
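<p>ImageState has a member ImageUpdateWatchers* m_update_watchers, which contains a ContextWQ* m_work_queue served by one thread.
The image state may change, and then the registered watchers must be notified by calling their registered callbacks. The main use case is rbd-nbd; see src/tools/rbd_nbd/rbd_nbd.cc.
Note that these watchers are different from what the ImageWatcher class does.</p>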
</li>
</ul>
Ceph Librbd Feature
2016-10-27T00:00:00+00:00
http://blog.wjin.org/posts/ceph-librbd-feature
<h1 id="feature-list">Feature List</h1>
<ul>
<li><strong>layering</strong></li>
</ul>
<p>Image cloning. You can snapshot an image, protect the snapshot, and clone a new image from it. Parent and child images use COW and share object data.</p>
<ul>
<li><strong>striping v2</strong></li>
</ul>
<p>Stripes object data, similar to RAID 0. It can improve performance for workloads dominated by sequential reads and writes.</p>
<ul>
<li><strong>exclusive lock</strong></li>
</ul>
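<p>Protects image data consistency; the lock must be held to modify the image. It can be viewed as a distributed lock. When it is enabled, make sure only one client accesses the image, otherwise lock contention makes IO performance drop sharply.
The main use case is qemu live migration.</p>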
<ul>
<li><strong>object map</strong></li>
</ul>
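<p>Depends on exclusive lock. Because image objects are thin-provisioned, enabling this feature records a bitmap over all objects of the image that marks whether each object really exists, which speeds up IO in some scenarios.</p>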
<ul>
<li><strong>fast diff</strong></li>
</ul>
<p>Depends on object map and exclusive lock. Quickly computes the differences between snapshots of an image.</p>
<ul>
<li><strong>deep-flatten</strong></li>
</ul>
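<p>With layering, a cloned image and its parent use COW, so their object files depend on each other. A flatten removes the dependency between parent and child image, but the child's snapshots still depend on the parent. The deep-flatten feature removes the snapshots' dependency as well.</p>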
<ul>
<li><strong>journaling</strong></li>
</ul>
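<p>Depends on exclusive lock. Journals every modification to the image and replicates it to another cluster (mirror), which enables off-site disaster recovery for block storage. Deploying this feature requires an additional daemon. It is still experimental, but it matters because it enables disaster recovery across clusters or data centers.</p>
<p>When an image is created, jewel enables these features by default: layering/exclusive lock/object map/fast diff/deep flatten</p>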
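<p>The dependencies in the list above can be written down as a small bitmask check. This is a sketch: the constant names are shortened for illustration, the bit positions are the ones I remember from librbd's public header and should be verified there, and features_consistent is a hypothetical helper rather than a librbd function.</p>

```cpp
#include <cassert>
#include <cstdint>

// Feature bits (positions quoted from memory; verify against librbd's header).
constexpr uint64_t FEATURE_LAYERING       = 1ull << 0;
constexpr uint64_t FEATURE_STRIPINGV2     = 1ull << 1;
constexpr uint64_t FEATURE_EXCLUSIVE_LOCK = 1ull << 2;
constexpr uint64_t FEATURE_OBJECT_MAP     = 1ull << 3;
constexpr uint64_t FEATURE_FAST_DIFF      = 1ull << 4;
constexpr uint64_t FEATURE_DEEP_FLATTEN   = 1ull << 5;
constexpr uint64_t FEATURE_JOURNALING     = 1ull << 6;

// True when every enabled feature also has its prerequisite enabled:
// object map -> exclusive lock, fast diff -> object map, journaling -> exclusive lock.
constexpr bool features_consistent(uint64_t f) {
  return (!(f & FEATURE_OBJECT_MAP) || (f & FEATURE_EXCLUSIVE_LOCK)) &&
         (!(f & FEATURE_FAST_DIFF) || (f & FEATURE_OBJECT_MAP)) &&
         (!(f & FEATURE_JOURNALING) || (f & FEATURE_EXCLUSIVE_LOCK));
}

// The jewel default set named in the text above.
inline uint64_t jewel_default_features() {
  const uint64_t f = FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK |
                     FEATURE_OBJECT_MAP | FEATURE_FAST_DIFF |
                     FEATURE_DEEP_FLATTEN;
  assert(features_consistent(f));
  return f;
}
```

<p>With these bit positions the default set adds up to 61, which is handy when a feature mask is given as a plain number.</p>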
<h1 id="exclusive-lock">Exclusive Lock</h1>
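<p>As shown above, many features depend on exclusive lock, so it deserves a closer look.</p>
<p>Exclusive lock is a distributed lock. By default a client acquires the lock on its first write and automatically releases it when it receives a lock request from another client. Since jewel enabled the feature by default, it has worked fine by itself:
clients acquire and release the lock automatically, and a client crash is handled correctly.</p>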
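<p>The cooperative behaviour described above can be modelled with a toy class. MiniExclusiveLock and its auto_release flag are hypothetical and only mirror the description; the real lock lives in the cluster and involves watch/notify messages between clients.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical in-process model of a cooperative exclusive lock.
class MiniExclusiveLock {
 public:
  explicit MiniExclusiveLock(bool auto_release = true)
      : auto_release_(auto_release) {}

  // Called by a client before its first write.
  // Returns true if the client owns the lock afterwards.
  bool acquire(const std::string& client) {
    assert(!client.empty());
    if (owner_.empty() || owner_ == client) {
      owner_ = client;
      return true;
    }
    if (auto_release_) {  // the current owner gives the lock up on request
      owner_ = client;
      return true;
    }
    return false;  // the owner keeps the lock; the caller cannot write
  }

  void release(const std::string& client) {
    if (owner_ == client) owner_.clear();
  }

  const std::string& owner() const { return owner_; }

 private:
  bool auto_release_;
  std::string owner_;
};
```

<p>With auto_release set to true, two writers simply pass the lock back and forth, which is why concurrent writers hurt performance. With it set to false, ownership only changes when the holder explicitly releases the lock.</p>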
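<p>However, people in the qemu community tied this feature to the older librbd API (rbd_lock_exclusive/rbd_lock_shared) and posted an incorrect patch to the qemu list;
see the <a href="https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg02422.html">mailing list</a> discussion. Roughly, once the qemu thread already holds the lock,
the librbd thread can never acquire it, so the locks conflict and the image becomes read-only, because both locks are implemented the same way. In response, the ceph community added new librbd API calls to explicitly acquire and release the exclusive lock;
see the <a href="https://github.com/ceph/ceph/pull/9592">pr</a>. In my opinion the old API is an advisory lock and will probably be deprecated.</p>
<p>Meanwhile, k8s currently uses ceph krbd for storage volumes and attaches a volume to a specific pod by taking a lock. It uses the lock the same way as the misleading qemu patch above;
see the <a href="https://github.com/kubernetes/kubernetes/issues/33013">issue</a>. If krbd enables the exclusive lock feature in the future, the same conflict will occur. The fix is to add an option to control the lock, so that k8s itself decides who owns it:</p>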
<ul>
<li>
<p>krbd: disable automatic exclusive lock transitions, <a href="http://tracker.ceph.com/issues/17524">issue</a></p>
</li>
<li>
<p>k8s: support rbd-nbd, <a href="https://github.com/kubernetes/kubernetes/issues/32266">issue</a>; rbd-nbd already supports disabling automatic exclusive lock transitions, <a href="https://github.com/ceph/ceph/pull/11438">pr</a></p>
</li>
</ul>
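<p>Some people in the k8s community suggest that containers should avoid krbd, because krbd depends heavily on the kernel, its features lag far behind librbd, and production systems cannot upgrade the kernel at will.
The alternative is rbd-nbd. nbd is the kernel's network block device and is already stable. The ceph community has implemented rbd-nbd, a user-space client that connects to the kernel nbd device.
The client is built on librbd, is very simple, and adds little performance overhead.</p>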
Ceph Monitor PaxosService
2016-06-13T00:00:00+00:00
http://blog.wjin.org/posts/ceph-monitor-paxosservice
<h1 id="introduction">Introduction</h1>
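<p>PaxosService is an abstract base class. It wraps the functionality of the Paxos class behind a set of interfaces, i.e. template methods, for building paxos-based services.
The current services are shown below:</p>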
<p><img src="/assets/img/post/ceph_mon_paxosservice.png" alt="img" /></p>
<p>If you want to implement your own service on top of paxos, where should you start? Roughly, consider the following aspects:</p>
<ul>
<li>Init</li>
<li>Restart</li>
<li>Process</li>
<li>Update</li>
<li>Active</li>
</ul>
<h1 id="init">Init</h1>
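<p>When the monitor process starts it initializes paxos and its services. A service that needs special initialization should override the base-class PaxosService::init interface:</p>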
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">virtual</span> <span class="kt">void</span> <span class="nf">init</span><span class="p">()</span> <span class="p">{}</span>
</code></pre></div></div>
<p>Call flow: Monitor::preinit() -> Monitor::init_paxos() -> FooService::init()</p>
<h1 id="restart">Restart</h1>
<p>The monitor process re-enters the bootstrap flow in many situations, and this restarts the services. A service should override the base-class PaxosService::on_restart interface:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">virtual</span> <span class="kt">void</span> <span class="nf">on_restart</span><span class="p">()</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>
<p>Call flow: Monitor::bootstrap() -> Monitor::_reset() -> PaxosService::restart() -> FooService::on_restart()</p>
<h1 id="process">Process</h1>
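<p>To do its job a service must respond to the commands it receives (commands are themselves wrapped in messages) and to other messages. If a paxos round is needed it starts a proposal and updates the result once the round completes.
The base class PaxosService provides template methods for these flows; a service only has to implement specific interfaces:</p>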
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">dispatch</span><span class="p">(</span><span class="n">MonOpRequestRef</span> <span class="n">op</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">PaxosServiceMessage</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">PaxosServiceMessage</span><span class="o">*></span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">get_req</span><span class="p">());</span> <span class="c1">// message type
</span>
<span class="p">......</span>
<span class="c1">// read-only message: handle it and return directly
</span> <span class="k">if</span> <span class="p">(</span><span class="n">preprocess_query</span><span class="p">(</span><span class="n">op</span><span class="p">))</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="c1">// not read-only, so an update and a paxos round are needed; only the leader may handle it
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">mon</span><span class="o">-></span><span class="n">is_leader</span><span class="p">())</span> <span class="p">{</span>
<span class="n">mon</span><span class="o">-></span><span class="n">forward_request_leader</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// not the leader: forward the message
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// not writeable right now: wait and retry
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">is_writeable</span><span class="p">())</span> <span class="p">{</span>
<span class="n">wait_for_writeable</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="k">new</span> <span class="n">C_RetryMessage</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">op</span><span class="p">));</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// update
</span> <span class="k">if</span> <span class="p">(</span><span class="n">prepare_update</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// prepare the update
</span> <span class="kt">double</span> <span class="n">delay</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">should_propose</span><span class="p">(</span><span class="n">delay</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">delay</span> <span class="o">==</span> <span class="mf">0.0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">propose_pending</span><span class="p">();</span> <span class="c1">// start the proposal
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">mon</span><span class="o">-></span><span class="n">timer</span><span class="p">.</span><span class="n">add_event_after</span><span class="p">(</span><span class="n">delay</span><span class="p">,</span> <span class="n">proposal_timer</span><span class="p">);</span> <span class="c1">// propose after a delay
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
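<p>Clearly the base class already implements most of the work. The main interfaces a service has to implement are preprocess_query and prepare_update. The former handles read-only commands or messages;
the latter handles commands or messages that modify state.</p>
<p>What goes into a proposal? The derived class must implement encode_pending:</p>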
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">propose_pending</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">MonitorDBStore</span><span class="o">::</span><span class="n">TransactionRef</span> <span class="n">t</span> <span class="o">=</span> <span class="n">paxos</span><span class="o">-></span><span class="n">get_pending_transaction</span><span class="p">();</span> <span class="c1">// get the pending paxos transaction
</span>
<span class="p">......</span>
<span class="n">encode_pending</span><span class="p">(</span><span class="n">t</span><span class="p">);</span> <span class="c1">// put the proposal content into the transaction
</span> <span class="n">have_pending</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">proposing</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">paxos</span><span class="o">-></span><span class="n">queue_pending_finisher</span><span class="p">(</span><span class="k">new</span> <span class="n">C_Committed</span><span class="p">(</span><span class="k">this</span><span class="p">));</span> <span class="c1">// callback run after completion
</span> <span class="n">paxos</span><span class="o">-></span><span class="n">trigger_propose</span><span class="p">();</span> <span class="c1">// start the proposal
</span><span class="p">}</span>
</code></pre></div></div>
<h1 id="update">Update</h1>
<p>After a proposal is committed, the service must apply the committed content, which requires implementing update_from_paxos.</p>
<p>Call flow: Paxos::do_refresh() -> Monitor::refresh_from_paxos() -> PaxosService::refresh() -> FooService::update_from_paxos()</p>
<h1 id="active">Active</h1>
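<p>Once the update is done, the original callbacks are run and the service returns to the active state. A service should override the PaxosService::on_active interface.</p>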
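<p>To tie the hooks from the sections above together, here is a toy sketch of the template-method shape. Everything in it is hypothetical: MiniPaxosService, CounterService and the Transaction alias are stand-ins, the "paxos round" commits immediately inside one process, and leader forwarding, writeability checks and proposal delays are left out.</p>

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a paxos transaction: key/value pairs to commit.
using Transaction = std::map<std::string, std::string>;

class MiniPaxosService {
 public:
  virtual ~MiniPaxosService() = default;

  // Template method with the same overall shape as PaxosService::dispatch.
  bool dispatch(const std::string& op) {
    assert(!op.empty());
    if (preprocess_query(op)) return true;      // read-only, answered directly
    if (prepare_update(op)) propose_pending();  // state change, needs a round
    return true;
  }

 protected:
  virtual bool preprocess_query(const std::string& op) = 0;
  virtual bool prepare_update(const std::string& op) = 0;
  virtual void encode_pending(Transaction& t) = 0;
  virtual void update_from_paxos(const Transaction& committed) = 0;
  virtual void on_active() {}

 private:
  void propose_pending() {
    Transaction t;
    encode_pending(t);     // the service decides what goes into the proposal
    update_from_paxos(t);  // assumption: the round commits immediately
    on_active();
  }
};

// A counter service: "get" is read-only, "inc" goes through a proposal.
class CounterService : public MiniPaxosService {
 public:
  int value() const { return value_; }
  int reads() const { return reads_; }

 protected:
  bool preprocess_query(const std::string& op) override {
    if (op != "get") return false;
    ++reads_;
    return true;
  }
  bool prepare_update(const std::string& op) override {
    if (op != "inc") return false;
    pending_ = value_ + 1;
    return true;
  }
  void encode_pending(Transaction& t) override {
    t["counter"] = std::to_string(pending_);
  }
  void update_from_paxos(const Transaction& committed) override {
    value_ = std::stoi(committed.at("counter"));
  }

 private:
  int value_ = 0;
  int pending_ = 0;
  int reads_ = 0;
};
```

<p>The point of the sketch is the division of labour: the base class owns the control flow, and the service only supplies the five hooks. Note that the committed value is applied in update_from_paxos, not in prepare_update.</p>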
<h1 id="summary">Summary</h1>
<p><img src="/assets/img/post/ceph_mon_sequence.png" alt="img" /></p>
Ceph Monitor Paxos
2016-06-03T00:00:00+00:00
http://blog.wjin.org/posts/ceph-monitor-paxos
<h1 id="introduction">Introduction</h1>
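<p>The paxos algorithm is mainly used for data consistency in distributed systems. The ceph monitor implements paxos and abstracts the PaxosService base class on top of it, on which the different services are built,
such as MonmapMonitor, OSDMonitor and PGMonitor, which correspond to monmap, osdmap and pgmap.</p>
<p>Paxos changes state along with the monitor, roughly as follows:</p>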
<ul>
<li>
<p>When the monitor starts, preinit calls init_paxos to initialize paxos</p>
</li>
<li>
<p>When the monitor enters bootstrap and prepares for a new election, it restarts paxos</p>
</li>
<li>
<p>When the monitor wins the election and becomes leader, it initializes paxos as leader</p>
</li>
<li>
<p>When the monitor loses the election and becomes a peon, it initializes paxos as peon</p>
</li>
<li>
<p>While the monitor is running, the PaxosService instances on the leader propose values for a paxos decision, i.e. propose</p>
</li>
<li>
<p>After a monitor fails and restarts, it recovers paxos</p>
</li>
</ul>
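<p>Once it is clear what each step does and how the Monitor, Paxos and PaxosService classes relate to each other, the whole flow falls into place. The first four cases are very simple;
the key parts are propose and recover. The following sections go through each step.</p>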
<h1 id="init">Init</h1>
<p>When the monitor starts it initializes paxos and its services (Monitor::preinit()->Monitor::init_paxos()->Paxos::init()):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">init_paxos</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">paxos</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// initialize paxos
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">PAXOS_NUM</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// initialize the services; only LogMonitor implements init, and it does a simple initialization
</span> <span class="p">}</span>
<span class="n">refresh_from_paxos</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span> <span class="c1">// refresh
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// load the variables used by the paxos algorithm
</span> <span class="n">last_pn</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"last_pn"</span><span class="p">);</span> <span class="c1">// last proposal number
</span> <span class="n">accepted_pn</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"accepted_pn"</span><span class="p">);</span> <span class="c1">// last accepted proposal number
</span> <span class="n">last_committed</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"last_committed"</span><span class="p">);</span> <span class="c1">// last committed version
</span> <span class="n">first_committed</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"first_committed"</span><span class="p">);</span> <span class="c1">// first committed version
</span><span class="p">}</span>
</code></pre></div></div>
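<p>Monitor's preinit is called only once, so paxos is initialized only once, which amounts to loading a few variables. Note the refresh_from_paxos call above, though:
it is called repeatedly while paxos runs, on every refresh, and a refresh happens after a commit completes or after recovery completes.</p>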
<h1 id="restart">Restart</h1>
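<p>The monitor process bootstraps again in several situations. Paxos is reset accordingly: it aborts pending proposals, clears timeout events, and enters the recovering state to wait for recovery:</p>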
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">bootstrap</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// reset
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_PROBING</span><span class="p">;</span>
<span class="n">_reset</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">_reset</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">paxos</span><span class="o">-></span><span class="n">restart</span><span class="p">();</span> <span class="c1">// restart paxos
</span>
<span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">PaxosService</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span>
<span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="o">-></span><span class="n">restart</span><span class="p">();</span> <span class="c1">// restart the services
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">restart</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cancel_events</span><span class="p">();</span> <span class="c1">// cancel all timeout events
</span> <span class="n">new_value</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span> <span class="c1">// clear the proposed value
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_writing</span><span class="p">()</span> <span class="o">||</span> <span class="n">is_writing_previous</span><span class="p">())</span> <span class="p">{</span>
<span class="n">mon</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="n">mon</span><span class="o">-></span><span class="n">store</span><span class="o">-></span><span class="n">flush</span><span class="p">();</span> <span class="c1">// wait for in-flight writes to finish
</span> <span class="n">mon</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_RECOVERING</span><span class="p">;</span> <span class="c1">// go back to the recovering state
</span> <span class="n">pending_proposal</span><span class="p">.</span><span class="n">reset</span><span class="p">();</span> <span class="c1">// reset the pending transaction
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="leader-init">Leader Init</h1>
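<p>After the monitor election finishes, the monitor tells paxos whether it is now leader or peon, and paxos acts accordingly.</p>
<p>If the leader is the only monitor in the quorum, paxos goes straight to the active state. Otherwise it sends messages to the peons and enters the recovering state, waiting for the other monitors to respond:</p>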
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">win_election</span><span class="p">(</span><span class="n">epoch_t</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>&</span> <span class="n">active</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">features</span><span class="p">,</span>
<span class="k">const</span> <span class="n">MonCommand</span> <span class="o">*</span><span class="n">cmdset</span><span class="p">,</span> <span class="kt">int</span> <span class="n">cmdsize</span><span class="p">,</span>
<span class="k">const</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">*</span><span class="n">classic_monitors</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// update state
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_LEADER</span><span class="p">;</span>
<span class="n">leader_since</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="n">leader</span> <span class="o">=</span> <span class="n">rank</span><span class="p">;</span>
<span class="n">quorum</span> <span class="o">=</span> <span class="n">active</span><span class="p">;</span>
<span class="p">......</span>
<span class="c1">// initialize paxos on the leader
</span> <span class="n">paxos</span><span class="o">-></span><span class="n">leader_init</span><span class="p">();</span>
<span class="p">......</span>
<span class="n">monmon</span><span class="p">()</span><span class="o">-></span><span class="n">election_finished</span><span class="p">();</span> <span class="c1">// activate the service
</span> <span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">PaxosService</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">p</span> <span class="o">!=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">!=</span> <span class="n">monmon</span><span class="p">())</span>
<span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="o">-></span><span class="n">election_finished</span><span class="p">();</span> <span class="c1">// activate the service
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="n">finish_election</span><span class="p">();</span> <span class="c1">// done
</span>
<span class="c1">// start the leader-only services
</span> <span class="k">if</span> <span class="p">(</span><span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">()</span> <span class="o">></span> <span class="mi">1</span> <span class="o">&&</span>
<span class="n">monmap</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">timecheck_start</span><span class="p">();</span> <span class="c1">// the leader checks for clock skew between monitors
</span> <span class="n">health_tick_start</span><span class="p">();</span> <span class="c1">// periodic health check (disk status)
</span> <span class="n">do_health_to_clog_interval</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">leader_init</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// cleanup
</span> <span class="n">cancel_events</span><span class="p">();</span>
<span class="n">new_value</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">pending_proposal</span><span class="p">.</span><span class="n">reset</span><span class="p">();</span>
<span class="n">finish_contexts</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">pending_finishers</span><span class="p">,</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">);</span>
<span class="n">finish_contexts</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">committing_finishers</span><span class="p">,</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">);</span>
<span class="c1">// with a single monitor there is nothing to collect from other monitors; go straight to active
</span> <span class="k">if</span> <span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">get_quorum</span><span class="p">().</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_ACTIVE</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// otherwise collect is needed; enter the recovering state
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_RECOVERING</span><span class="p">;</span>
<span class="n">lease_expire</span> <span class="o">=</span> <span class="n">utime_t</span><span class="p">();</span>
<span class="n">collect</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span> <span class="c1">// collect information
</span><span class="p">}</span>
</code></pre></div></div>
<h1 id="peon-init">Peon Init</h1>
<p>A peon goes straight to the recovering state and waits for the leader's collect message to assist the leader's recovery. If that times out, it starts a new election:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">lose_election</span><span class="p">(</span><span class="n">epoch_t</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">l</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">features</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// update state
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_PEON</span><span class="p">;</span>
<span class="n">leader_since</span> <span class="o">=</span> <span class="n">utime_t</span><span class="p">();</span>
<span class="n">leader</span> <span class="o">=</span> <span class="n">l</span><span class="p">;</span>
<span class="n">quorum</span> <span class="o">=</span> <span class="n">q</span><span class="p">;</span>
<span class="c1">// initialize paxos on the peon
</span> <span class="n">paxos</span><span class="o">-></span><span class="n">peon_init</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">PaxosService</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">paxos_service</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span>
<span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="o">-></span><span class="n">election_finished</span><span class="p">();</span> <span class="c1">// activate each service
</span>
<span class="p">......</span>
<span class="n">finish_election</span><span class="p">();</span> <span class="c1">// done
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">peon_init</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// cleanup
</span> <span class="n">cancel_events</span><span class="p">();</span>
<span class="n">new_value</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="c1">// the peon enters the recovering state and waits for the leader's collect message
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_RECOVERING</span><span class="p">;</span>
<span class="n">lease_expire</span> <span class="o">=</span> <span class="n">utime_t</span><span class="p">();</span>
<span class="n">reset_lease_timeout</span><span class="p">();</span> <span class="c1">// if no collect message arrives for too long, a new election is called
</span>
<span class="c1">// cleanup
</span> <span class="n">pending_proposal</span><span class="p">.</span><span class="n">reset</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
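<p>election_finished is implemented only in the PaxosService base class. Its job is to call _active, which in turn invokes the on_active function that each service implements, i.e. whatever that service has to do when paxos becomes active.</p>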
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">election_finished</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">_active</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">_active</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_active</span><span class="p">())</span>
<span class="n">on_active</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="propose">Propose</h1>
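<p>After initialization the leader enters the collect phase, which performs data recovery. Recovery is itself a paxos propose round, and its decisions depend on what was persisted during earlier proposes, so we first look at how propose is implemented.</p>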
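<p>Any modification to the data has to go through a paxos vote (propose). To keep recovery simple, Ceph decides only one value at a time. Only the leader may propose, so updates are fast and there is no livelock from competing proposers. So when does data need to be modified?</p>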
<p>Apart from the special case of upgrades, the source shows three main situations that require a propose:</p>
<ul>
<li>
<p>ConfigKeyService, when a key/value pair is modified or deleted. This service treats the monitor as a black-box k/v store; see ConfigKeyService::store_put and store_delete</p>
</li>
<li>
<p>Paxos and PaxosService, when trimming data to save storage space; see Paxos::trim and PaxosService::maybe_trim</p>
</li>
<li>
<p>The individual PaxosService services, when they need to update a value; see PaxosService::propose_pending</p>
</li>
</ul>
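<p>In all three cases, before proposing, the operation is wrapped in a transaction and stored in the Paxos member pending_proposal, a callback to run after the commit is registered, and Paxos::trigger_propose is called to start the round.</p>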
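<p>Note that the pending_proposal transaction is encoded into a bufferlist, which serves as the value of this round. It is stored among the paxos k/v entries with the version number as key and the binary bufferlist as value. At commit time the bufferlist is decoded back into a transaction and its operations are executed, so that the decided value is reflected in each service and the related maps are updated.</p>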
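<p>The following simple example walks through the flow, ignoring failures. Assume the leader and the peons start in a stable state and the paxos values in the k/v store are:</p>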
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_committed = 1
last_committed = 10
accepted_pn = 100
</code></pre></div></div>
<p>The leader now receives a request to propose, and the algorithm proceeds as follows:</p>
<blockquote>
<p>1) Leader side, see function begin</p>
</blockquote>
<ul>
<li>The leader first encodes pending_proposal into the bufferlist new_value</li>
<li>It persists the related values to the backing store</li>
<li>It sends a message to the peons with content (v11=new_value, last_committed=10, pn=100)</li>
</ul>
<p>The leader's state changes from <code class="highlighter-rouge">active to updating</code>, and its stored data is now:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_committed = 1
last_committed = 10
accepted_pn = 100
# data added by this proposal
v11=new_value; # 11 is last_committed+1; the real key carries a prefix, abbreviated here as v
pending_v=11
pending_pn=100
</code></pre></div></div>
<blockquote>
<p>2) Peon handling of the propose message, see function handle_begin</p>
</blockquote>
<ul>
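<li>A peon only handles messages with pn >= accepted_pn. Here the two are equal, so the message is processed.</li>
<li>It persists the related values to its backing store</li>
<li>It replies to the leader that the value has been accepted</li>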
</ul>
<p>The peon's state changes from <code class="highlighter-rouge">active to updating</code>; the peon now stores the same data as the leader:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_committed = 1
last_committed = 10
accepted_pn = 100
v11=new_value
pending_v=11
pending_pn=100
</code></pre></div></div>
<blockquote>
<p>3) The leader receives the accept messages sent back by the peons, see function handle_accept</p>
</blockquote>
<ul>
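<li>Check accepted_pn and last_committed</li>
<li>If the checks pass, add the sender to the accepted set</li>
<li>Once accept messages from <code class="highlighter-rouge">every member</code> of the quorum have arrived, perform the commit (note that, unlike textbook paxos, acceptance by a majority is not enough here)</li>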
</ul>
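<p>The "every quorum member must accept" rule can be sketched as below. This is a minimal illustration rather than the Ceph code: <code class="highlighter-rouge">AcceptTracker</code> and <code class="highlighter-rouge">on_accept</code> are hypothetical names, and the sketch assumes the leader counts its own accept when the round starts.</p>

```cpp
#include <set>
#include <utility>

// Hypothetical helper, not part of Ceph: tracks which quorum members
// have accepted the value currently being proposed.
struct AcceptTracker {
    std::set<int> quorum;    // ranks of all quorum members, leader included
    std::set<int> accepted;  // ranks that have accepted so far

    // Assumption of this sketch: the leader's own accept is recorded up front.
    AcceptTracker(std::set<int> q, int leader_rank) : quorum(std::move(q)) {
        accepted.insert(leader_rank);
    }

    // Records an accept and returns true once the whole quorum has accepted.
    // Accepts from ranks outside the quorum are ignored.
    bool on_accept(int from) {
        if (quorum.count(from)) accepted.insert(from);
        return accepted == quorum;
    }
};
```

<p>With a quorum of {0,1,2} and leader 0, the commit would only start after both rank 1 and rank 2 have answered; a majority-based variant would compare sizes instead of the full sets.</p>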
<p>The commit starts in function commit_start:</p>
<ul>
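<li>Update last_committed in the backing store, i.e. +1</li>
<li>Decode new_value into a transaction and queue it on the backing store. This write is asynchronous; the leader's state changes from <code class="highlighter-rouge">updating to writing</code></li>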
</ul>
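<p>Because the write is asynchronous, the monitor lock can be released while it is in flight, so other messages can be handled and the timer thread can take the lock to process events. When the backing store finishes, it calls back into commit_finish:</p>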
<ul>
<li>Increment the in-memory last_committed</li>
<li>Send commit messages to the peons</li>
<li>Set the state to <code class="highlighter-rouge">refresh</code> and refresh the PaxosService services</li>
</ul>
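<p>Each PaxosService checks whether it needs to update: refresh updates every service's cached_last_committed, and a service reacts when that value has changed. Afterwards the leader goes from <code class="highlighter-rouge">refresh back to active</code>, and its data is now:</p>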
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_committed = 1
last_committed = 11 # version bumped
accepted_pn = 100
v11=new_value
pending_v=11
pending_pn=100
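# plus the effects on the backing store of the operations recorded in the transaction when the proposal was made, i.e. the changes to the various maps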
</code></pre></div></div>
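<p>Note that after refresh completes and before the state returns to active, the lease protocol starts: the leader sends lease messages to the peons, which in turn lets the peons become active.</p>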
<blockquote>
<p>4) The peon receives the commit message, see function handle_commit</p>
</blockquote>
<ul>
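<li>Update last_committed in memory and in the backing store, i.e. +1</li>
<li>Decode new_value into a transaction and execute it through the backing store interface. This write is synchronous, unlike on the leader</li>
<li>Refresh the PaxosService services</li>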
</ul>
<p>The peon stays in the updating state throughout; its final data is:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_committed = 1
last_committed = 11 # version bumped
accepted_pn = 100
v11=new_value
pending_v=11
pending_pn=100
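# plus the effects on the backing store of the operations recorded in the transaction when the proposal was made, i.e. the changes to the various maps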
</code></pre></div></div>
<blockquote>
<p>5) The peon receives the lease message, see function handle_lease; the <code class="highlighter-rouge">peon's state changes from updating back to active</code></p>
</blockquote>
<p>To summarize, one completed propose round updates the following:</p>
<ul>
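<li>last_committed increases by 1</li>
<li>the k/v entry holding the newly committed paxos value</li>
<li>the effects on the services of the transaction carried inside the newly decided value</li>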
</ul>
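<p>The bookkeeping of one round can be modeled in a few lines. The sketch below is a hypothetical in-memory stand-in for the paxos keys used in the example (<code class="highlighter-rouge">PaxosStore</code>, <code class="highlighter-rouge">begin</code> and <code class="highlighter-rouge">commit</code> here are illustrative names, not the Ceph classes), and it leaves out the decoded transaction's effect on the service maps.</p>

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of the paxos keys from the example; not the Ceph store.
struct PaxosStore {
    std::uint64_t first_committed = 1;
    std::uint64_t last_committed = 10;
    std::uint64_t accepted_pn = 100;
    std::uint64_t pending_v = 0;   // 0 means "nothing pending" in this sketch
    std::uint64_t pending_pn = 0;
    std::map<std::uint64_t, std::string> values;  // version -> encoded value

    // Steps 1 and 2: store the proposed value under last_committed + 1.
    // Returns false when the proposal number is older than accepted_pn.
    bool begin(std::uint64_t pn, const std::string& new_value) {
        if (pn < accepted_pn) return false;
        pending_v = last_committed + 1;
        pending_pn = pn;
        values[pending_v] = new_value;
        return true;
    }

    // Steps 3 and 4: the pending version becomes the committed one.
    // pending_v and pending_pn are left in place, as in the listings above.
    void commit() {
        if (pending_v == last_committed + 1) ++last_committed;
    }
};
```

<p>Starting from the values in the example, <code class="highlighter-rouge">begin(100, "new_value")</code> followed by <code class="highlighter-rouge">commit()</code> leaves last_committed at 11 with the value stored under version 11.</p>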
<p>State transitions of the leader:</p>
<p><img src="/assets/img/post/ceph_mon_paxos_1.png" alt="img" /></p>
<p>State transitions of a peon:</p>
<p><img src="/assets/img/post/ceph_mon_paxos_2.png" alt="img" /></p>
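<p>For readers without the diagrams, the leader's failure-free path described in the walkthrough can be written as a tiny state machine. This is only a sketch: the enum and function names are made up for illustration, and the recovery states are not modeled.</p>

```cpp
// Hypothetical names; mirrors only the failure-free leader path from the text:
// active -> updating -> writing -> refresh -> active.
enum class State { Active, Updating, Writing, Refresh };
enum class Step { Begin, CommitStart, CommitFinish, RefreshDone };

// Applies `step` to `s` if it is valid in the current state.
// Returns true when the transition happened, false otherwise.
bool leader_step(State& s, Step step) {
    switch (step) {
    case Step::Begin:
        if (s == State::Active) { s = State::Updating; return true; }
        break;
    case Step::CommitStart:
        if (s == State::Updating) { s = State::Writing; return true; }
        break;
    case Step::CommitFinish:
        if (s == State::Writing) { s = State::Refresh; return true; }
        break;
    case Step::RefreshDone:
        if (s == State::Refresh) { s = State::Active; return true; }
        break;
    }
    return false;
}
```

<p>The peon's path is shorter: handle_begin moves it from active to updating, and handle_lease moves it back to active.</p>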
<p>Notice that pending_v and pending_pn are set once at the start and never referenced again. As one would guess, this pair exists for recovery after failures.</p>
<h1 id="recover">Recover</h1>
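<p>Besides the states above, paxos has three more: recovering, updating_previous and writing_previous, all related to recovery after failures. After the leader election completes, the leader enters the collect phase; paxos is then in the recovering state and tries to recover the paxos data (note that the monitor's probing phase also syncs data). Recovery depends on each node's paxos data versions (first_committed and last_committed) and on the accepted_pn number, which is globally unique across all monitors and monotonically increasing. First, how that number is generated:</p>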
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">version_t</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">get_new_proposal_number</span><span class="p">(</span><span class="n">version_t</span> <span class="n">gt</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">last_pn</span> <span class="o"><</span> <span class="n">gt</span><span class="p">)</span>
<span class="n">last_pn</span> <span class="o">=</span> <span class="n">gt</span><span class="p">;</span> <span class="c1">// start from the larger previous value
</span>
<span class="c1">// bump to the next number for this rank
</span> <span class="n">last_pn</span> <span class="o">/=</span> <span class="mi">100</span><span class="p">;</span>
<span class="n">last_pn</span><span class="o">++</span><span class="p">;</span>
<span class="n">last_pn</span> <span class="o">*=</span> <span class="mi">100</span><span class="p">;</span>
<span class="n">last_pn</span> <span class="o">+=</span> <span class="p">(</span><span class="n">version_t</span><span class="p">)</span><span class="n">mon</span><span class="o">-></span><span class="n">rank</span><span class="p">;</span>
<span class="p">.......</span>
<span class="k">return</span> <span class="n">last_pn</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
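<p>The function drops the last two digits of the previous value, moves to the next multiple of 100, and adds the monitor's rank. For example, take three monitors with ranks 0, 1 and 2 and an initial pn of 100. As long as monitor 0 stays up, each completed election raises pn by 100, giving 200, 300. If monitor 0 then goes down, monitor 1 wins and pn becomes 401, then 501 and 601 on later elections. If monitor 0 comes back, pn becomes 700, and so on. The value is updated only once per election, when the leader runs collect; later paxos rounds do not change it.</p>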
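<p>The arithmetic is easy to check in isolation. The function below is a stand-alone restatement of the lines shown above, with the monitor's rank and the previous value passed in as plain parameters (the name <code class="highlighter-rouge">next_proposal_number</code> and its signature are illustrative, not the Ceph API).</p>

```cpp
#include <cstdint>

// Stand-alone version of the arithmetic in Paxos::get_new_proposal_number.
// `last_pn` is the previously used number, `gt` a value the result must
// exceed, and `rank` stands in for mon->rank (assumed to be below 100).
std::uint64_t next_proposal_number(std::uint64_t last_pn,
                                   std::uint64_t gt,
                                   std::uint64_t rank) {
    if (last_pn < gt) last_pn = gt;  // never go below what others have seen
    last_pn /= 100;                  // drop the rank digits
    ++last_pn;                       // next "hundred"
    last_pn *= 100;
    last_pn += rank;                 // make the number unique per monitor
    return last_pn;
}
```

<p>Feeding in the sequence from the example reproduces it: 100 becomes 200 for rank 0, an accepted 300 becomes 401 for rank 1, and an accepted 601 becomes 700 once rank 0 leads again.</p>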
<p>Next we study the paxos recovery logic in two cases.</p>
<h3 id="peon-down">Peon Down</h3>
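<p>Exactly when the peon goes down does not matter; what matters is when the leader is re-bootstrapped and starts an election. When the leader receives no lease ack from a peon, its lease_ack_timeout_event fires after the timeout (judging by the timer settings, accept does not seem to time out; this needs confirmation), and a new election follows. Since the leader still has the lowest rank, it is elected leader again (we only consider the case where more than half of the monitors in monmap are alive; with fewer, the cluster cannot work). After the election a collect round performs recovery. There are two aspects: the cluster must return to normal operation while the peon is down, and again when the peon comes back up.</p>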
<blockquote>
<p>down</p>
</blockquote>
<p>Peon down implies that the leader does not change after the re-election and that no other peon has data newer than the leader's, i.e.</p>
<ul>
<li>last_committed(leader) >= last_committed(peon)</li>
<li>accepted_pn(leader) > accepted_pn(peon)</li>
</ul>
<p>In addition, timeout events run in the timer thread, which takes the monitor lock while it works. It follows that the leader's paxos flow can be interrupted at the following points:</p>
<ol>
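<li>The leader is active and no round has started</li>
<li>The leader is updating: begin has run and it is waiting for accepts. The leader holds uncommitted data and may already have received some accept messages</li>
<li>The leader is writing: all accept messages have arrived, commit_start has begun, and the transaction is queued for execution</li>
<li>The leader is writing and the write has completed, i.e. the transaction has taken effect, but the callback (commit_finish) has not run yet because it needs the monitor lock</li>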
</ol>
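<p>Cases 3 and 4 exist because the leader writes asynchronously. The leader cannot be interrupted in the refresh state: once commit_finish starts running, it completes the refresh and returns to active before the timer thread can take the lock. Case 1 needs no special handling. In case 2 there is uncommitted data, and after the re-election the leader starts a new propose for it. In cases 3 and 4 the election waits until the data already being written has been committed:</p>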
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">bootstrap</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">wait_for_paxos_write</span><span class="p">();</span> <span class="c1">// wait for in-flight paxos writes to finish
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">start_election</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">wait_for_paxos_write</span><span class="p">();</span> <span class="c1">// wait for in-flight paxos writes to finish
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Either way, after one round of messages (collect -> handle_collect -> handle_last) the data should be in sync.</p>
<blockquote>
<p>up</p>
</blockquote>
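<p>When the peon comes back up, it first syncs data in the probing phase and then calls an election, which makes the other nodes call elections too. The leader still wins; the points where it can be interrupted are the same as above, and so is the recovery.</p>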
<h3 id="leader-down">Leader Down</h3>
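<p>The leader may crash at any point in any paxos function. The new leader is then the peon with the lowest rank, and the peons' data varies somewhat depending on where the leader crashed. Again we analyze the down case and the later up case separately:</p>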
<blockquote>
<p>down</p>
</blockquote>
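<p>Peons call a new election after the lease times out. A peon may be interrupted in the active or updating state, and the peons need not all be in the same state: some may be active while others are updating:</p>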
<ol>
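<li>The leader went down while active: no special handling is needed</li>
<li>The leader went down while updating: if no peon has accepted, no special handling is needed; if some peon has accepted, the new leader has either accepted the value itself or learns it from another peon, and proposes it again</li>
<li>The leader went down while writing: every peon has accepted, and the new leader re-proposes the accepted value (the old leader may or may not have completed its write)</li>
<li>The leader went down in refresh: the old leader's write succeeded. If some peon received the commit message, the new leader learns the new commit during collect; if none did, the value is re-proposed</li>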
</ol>
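<p>Cases 3 and 4 mean that the value being proposed is always re-proposed (in the situation where no peon received the commit message), so there is no need to worry about the leader having committed data that the peons have not.</p>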
<blockquote>
<p>up</p>
</blockquote>
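<p>When the old leader comes back up, it may already sync in the probing phase, which can bring part of its data up to date. It is then elected leader again, and the collect phase syncs the few versions that still differ. If a peon holds uncommitted data, that is passed to the leader as well and re-proposed by the new leader.</p>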
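<p>The one thing to watch is the uncommitted data the leader held when it went down. As shown above, if any peon had accepted it, it has already been re-proposed; after the restart the pending data is discarded because, judging by pending_v, its version is too old. If the leader had already committed, the peons are guaranteed to commit too, so no inconsistency arises.</p>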
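<p>The updating_previous state mentioned earlier occurs when a new leader learns an uncommitted value and proposes it again; correspondingly, updating_previous is followed by writing_previous. The leader's complete state transition diagram is therefore:</p>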
<p><img src="/assets/img/post/ceph_mon_paxos_3.png" alt="img" /></p>
<p>With consecutive failures, as long as more than half of the monitors in monmap survive, recovery is just a combination of the cases above.</p>
Ceph Monitor Leader Elect
2016-06-01T00:00:00+00:00
http://blog.wjin.org/posts/ceph-monitor-leader-elect
<h1 id="introduction">Introduction</h1>
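<p>While running, the monitors need to elect a leader; all updates then go through paxos proposes issued by the leader. If a non-leader receives an update request, it forwards the request to the leader, which executes it on its behalf.</p>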
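<p>The leader election itself is not paxos. Ceph's implementation is simple and efficient: it uses each node's rank in monmap to make the nodes deliberately unequal, and the lowest rank wins, so the election completes quickly.</p>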
<h1 id="when-start-elect">When Start Elect</h1>
<p>A node calls a leader election through Monitor::start_election(), which is invoked in the following situations:</p>
<ol>
<li>
<p>The node calls bootstrap to start up, then probes the other monitors (possibly syncing data), and calls an election when done</p>
</li>
<li>
<p>The node receives an MMonElection message; if it is already in a quorum or has a lower rank, it also calls an election</p>
</li>
<li>
<p>The node receives a quorum enter/exit command</p>
</li>
</ol>
<h3 id="bootstrap">bootstrap</h3>
<p>For the first case, bootstrap is called from many places:</p>
<ul>
<li>
<p>The monitor node restarts</p>
</li>
<li>
<p>The monitor has incomplete information, or one of many events times out at runtime, and it has to restart, i.e. call bootstrap, then probe and sync data</p>
</li>
<li>
<p>The leader sends lease messages and waits for the other nodes to reply; on timeout it bootstraps again and then calls an election</p>
</li>
</ul>
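<p>The monitor's paxos implementation uses a lease protocol internally to guarantee that replica data is readable for a bounded period; the leader keeps sending lease messages to extend each peon's lease. If a peon goes down, the leader receives no lease_ack and calls a new election after the timeout. If the leader goes down, none of the peons receive lease renewals from it, and they also call a new election.</p>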
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// called on the leader
</span><span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">lease_ack_timeout</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">is_leader</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">is_active</span><span class="p">());</span>
<span class="n">lease_ack_timeout_event</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">mon</span><span class="o">-></span><span class="n">bootstrap</span><span class="p">();</span> <span class="c1">// restart
</span><span class="p">}</span>
<span class="c1">// called on a peon
</span><span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">lease_timeout</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">is_peon</span><span class="p">());</span>
<span class="n">lease_timeout_event</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">mon</span><span class="o">-></span><span class="n">bootstrap</span><span class="p">();</span> <span class="c1">// restart
</span><span class="p">}</span>
</code></pre></div></div>
<h3 id="mmonelection">MMonElection</h3>
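<p>The second case is really an indirect consequence of the first: once some node sends an election message, the other nodes react to it. The common case is that the other nodes already form a quorum with a leader; when they receive an election message and see that it comes from a node outside the quorum, a new node is joining and an election is needed:</p>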
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">handle_propose</span><span class="p">(</span><span class="n">MMonElection</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">epoch</span> <span class="o"><</span> <span class="n">epoch</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// got an "old" propose,
</span> <span class="k">if</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&&</span> <span class="c1">// in a non-election cycle
</span> <span class="n">mon</span><span class="o">-></span><span class="n">quorum</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">from</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// from someone outside the quorum
</span> <span class="n">mon</span><span class="o">-></span><span class="n">start_election</span><span class="p">();</span> <span class="c1">// this node is already in a quorum and some node has restarted
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">rank</span> <span class="o"><</span> <span class="n">from</span><span class="p">)</span> <span class="p">{</span>
<span class="n">mon</span><span class="o">-></span><span class="n">start_election</span><span class="p">();</span> <span class="c1">// my rank is lower
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="quorum-enterexit">quorum enter/exit</h3>
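<p>The third case is actually a command, apparently meant mainly for operations and testing. It sets a bool and then calls the election function, which makes this monitor join or leave the quorum. For a given monitor it can be tried with the following commands:</p>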
<blockquote>
<p>$ ceph --admin-daemon=path-to-admin-socket quorum enter</p>
</blockquote>
<blockquote>
<p>$ ceph --admin-daemon=path-to-admin-socket quorum exit</p>
</blockquote>
<p>The handling in Elector just sets the flag:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">start_participating</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">participating</span><span class="p">)</span> <span class="p">{</span>
<span class="n">participating</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="c1">// take part in elections
</span> <span class="n">call_election</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">stop_participating</span><span class="p">()</span> <span class="p">{</span> <span class="n">participating</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="p">}</span> <span class="c1">// do not take part
</span></code></pre></div></div>
<h1 id="how-elect">How Elect</h1>
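<p>The whole election is carried out in the Elector class. It maintains an election_epoch: an even value means the node has joined a quorum and is stable, an odd value means an election is in progress. The election always picks the node with the lowest rank as leader. The quorum column in the table below only lists the quorum after a successful election; during an election there is also an outside quorum denoting nodes that are newly joining the cluster.</p>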
<p>The table below illustrates how the election evolves; 0, 1 and 2 in the first row are the ranks of the three monitors:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>epoch | 0 | 1 | 2 | quorum | leader | comment
------ | --- | --- | --- | ------ | ------ | -------
epoch | 0 | 0 | 0 | | | mon(0,1,2) startup
epoch | 1 | 1 | 1 | | | electing
epoch | 2 | 2 | 2 | 0,1,2 | 0 |
...
epoch | 2 | 2 | 2 | 0,1,2 | 0 |
epoch | 3 | 2 | | | | mon2 down; leader lease_ack timeout -> electing
epoch | 3 | 3 | | | | electing
epoch | 4 | 4 | | 0,1 | 0 |
...
epoch | 4 | 4 | 2 | 0,1 | 0 | mon2 up
epoch | 4 | 4 | 3 | 0,1 | 0 | electing
epoch | 4 | 4 | 4 | 0,1,2 | 0 |
...
epoch | | 4 | 5 | | | mon0 down; mon2 lease timeout -> electing
epoch | | 5 | 5 | | | mon1 lease timeout -> electing
epoch | | 6 | 6 | 1,2 | 1 |
...
epoch | | 6 | 6 | 1,2 | 1 |
epoch | 4 | 6 | 6 | | | mon0 up
epoch | 5 | 6 | 6 | | | electing
epoch | 6 | 6 | 6 | 0,1,2 | 0 |
...
</code></pre></div></div>
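<p>Two rules are enough to read the table: an odd epoch means "electing", and the leader is the lowest rank in the quorum. A minimal sketch of both, with hypothetical helper names that do not exist in Elector:</p>

```cpp
#include <set>

// An odd election epoch means an election is in progress; even means stable.
bool is_electing(unsigned epoch) { return epoch % 2 == 1; }

// The leader is the lowest rank in the quorum; -1 signals an empty quorum.
int leader_of(const std::set<int>& quorum) {
    return quorum.empty() ? -1 : *quorum.begin();
}
```

<p>For the rows above, a quorum of {0,1,2} or {0,1} yields leader 0, while {1,2} yields leader 1 after mon0 has gone down.</p>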
<p>A brief trace through the code:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">start_election</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">wait_for_paxos_write</span><span class="p">();</span>
<span class="n">_reset</span><span class="p">();</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_ELECTING</span><span class="p">;</span>
<span class="n">cancel_probe_timeout</span><span class="p">();</span>
<span class="n">elector</span><span class="p">.</span><span class="n">call_election</span><span class="p">();</span> <span class="c1">// start the election
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">call_election</span><span class="p">()</span> <span class="p">{</span>
<span class="n">start</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">start</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">participating</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// defaults to true; can be changed with the quorum exit command
</span> <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">acked_me</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">classic_mons</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">init</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">bump_epoch</span><span class="p">(</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// make election_epoch odd, meaning an election is in progress; the value is persisted to the store
</span>
<span class="n">start_stamp</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="n">electing_me</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">acked_me</span><span class="p">[</span><span class="n">mon</span><span class="o">-></span><span class="n">rank</span><span class="p">]</span> <span class="o">=</span> <span class="n">CEPH_FEATURES_ALL</span><span class="p">;</span>
<span class="n">leader_acked</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">i</span> <span class="o">==</span> <span class="n">mon</span><span class="o">-></span><span class="n">rank</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span>
<span class="n">Message</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MMonElection</span><span class="p">(</span><span class="n">MMonElection</span><span class="o">::</span><span class="n">OP_PROPOSE</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="p">);</span> <span class="c1">// send the message to the other monitors
</span> <span class="n">mon</span><span class="o">-></span><span class="n">messenger</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="n">i</span><span class="p">));</span>
<span class="p">}</span>
<span class="n">reset_timer</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When the other nodes receive the message they handle it in handle_propose; if they accept, they call defer, which sends back an ack:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">defer</span><span class="p">(</span><span class="kt">int</span> <span class="n">who</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">leader_acked</span> <span class="o">=</span> <span class="n">who</span><span class="p">;</span>
<span class="n">ack_stamp</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="n">MMonElection</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MMonElection</span><span class="p">(</span><span class="n">MMonElection</span><span class="o">::</span><span class="n">OP_ACK</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="p">);</span> <span class="c1">// send back the ack
</span> <span class="n">m</span><span class="o">-></span><span class="n">sharing_bl</span> <span class="o">=</span> <span class="n">mon</span><span class="o">-></span><span class="n">get_supported_commands_bl</span><span class="p">();</span>
<span class="n">mon</span><span class="o">-></span><span class="n">messenger</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="n">who</span><span class="p">));</span>
<span class="c1">// set a timer
</span> <span class="n">reset_timer</span><span class="p">(</span><span class="mf">1.0</span><span class="p">);</span> <span class="c1">// give the leader some extra time to declare victory
</span><span class="p">}</span>
</code></pre></div></div>
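<p>Once all ACK messages have been received, the node declares victory (note that the check shown here waits for an ack from every node in monmap, not just a majority):</p>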
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">handle_ack</span><span class="p">(</span><span class="n">MMonElection</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">acked_me</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">())</span> <span class="p">{</span>
<span class="n">victory</span><span class="p">();</span> <span class="c1">// election won
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Elector</span><span class="o">::</span><span class="n">victory</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">bump_epoch</span><span class="p">(</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// make the epoch even
</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">quorum</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">p</span> <span class="o">!=</span> <span class="n">quorum</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span class="n">mon</span><span class="o">-></span><span class="n">rank</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span>
<span class="n">MMonElection</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MMonElection</span><span class="p">(</span><span class="n">MMonElection</span><span class="o">::</span><span class="n">OP_VICTORY</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="p">);</span> <span class="c1">// tell the other monitors
</span> <span class="n">m</span><span class="o">-></span><span class="n">quorum</span> <span class="o">=</span> <span class="n">quorum</span><span class="p">;</span>
<span class="n">m</span><span class="o">-></span><span class="n">quorum_features</span> <span class="o">=</span> <span class="n">features</span><span class="p">;</span>
<span class="n">m</span><span class="o">-></span><span class="n">sharing_bl</span> <span class="o">=</span> <span class="o">*</span><span class="n">cmds_bl</span><span class="p">;</span>
<span class="n">mon</span><span class="o">-></span><span class="n">messenger</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">mon</span><span class="o">-></span><span class="n">monmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">));</span>
<span class="p">}</span>
<span class="c1">// tell monitor
</span> <span class="n">mon</span><span class="o">-></span><span class="n">win_election</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">quorum</span><span class="p">,</span> <span class="n">features</span><span class="p">,</span> <span class="n">cmds</span><span class="p">,</span> <span class="n">cmdsize</span><span class="p">,</span> <span class="o">&</span><span class="n">copy_classic_mons</span><span class="p">);</span> <span class="c1">// won; let the monitor do the corresponding initialization
</span><span class="p">}</span>
</code></pre></div></div>
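<p>From here, the winning monitor runs win_election, which calls leader_init to set itself up as the leader, while each losing monitor runs lose_election, which calls peon_init to set itself up as a peon. At that point the monitor cluster is essentially ready for steady-state operation.</p>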
<h1 id="summary">Summary</h1>
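<p>In summary, monitor election is simple and fast. Once the preconditions are met, a monitor sends MMonElection::OP_PROPOSE to every node in the monmap, and it wins once it has received all the acks.</p>
<p>Most of the logic lives in handle_propose. As an example, suppose monitor A sends a PROPOSE message to monitor B, and consider two cases:</p>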
<blockquote>
<p>rank A < rank B</p>
</blockquote>
<ul>
<li>
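<p>epoch A < epoch B: B starts an election (and eventually acks A). A receives B's election message and updates its epoch; because rank A is lower, A does not ack B and instead proposes again with the new epoch. B then receives A's PROPOSE again, this time with equal epochs.</p>
</li>
<li>
<p>epoch A = epoch B: B acks A.</p>
</li>
<li>
<p>epoch A > epoch B: B updates its own epoch and acks A.</p>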
</li>
</ul>
<blockquote>
<p>rank A > rank B</p>
</blockquote>
<ul>
<li>
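<p>epoch A < epoch B: B starts an election and does not ack A. A receives B's message and updates its epoch; because rank A is higher, A acks B.</p>
</li>
<li>
<p>epoch A = epoch B: B starts an election and does not ack A; A acks B.</p>
</li>
<li>
<p>epoch A > epoch B: B updates its own epoch, does not ack A, and then starts an election; A acks B.</p>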
</li>
</ul>
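<p>The six cases above can be condensed into a small helper. This is only a sketch of the net outcome once the exchange in each bullet has settled, not Ceph's actual handle_propose; the names ProposeOutcome and summarize_propose are made up for illustration, and the two ranks are assumed to be different.</p>

```cpp
#include <cstdint>

// Hypothetical summary of how monitor B ends up reacting to a PROPOSE from A,
// following the case table in the text (not Ceph's real handle_propose).
struct ProposeOutcome {
  bool b_bumps_epoch;      // B adopts A's newer epoch
  bool b_starts_election;  // B sends out its own PROPOSE
  bool b_acks_a;           // B (eventually) acks A
};

inline ProposeOutcome summarize_propose(int rank_a, std::uint64_t epoch_a,
                                        int rank_b, std::uint64_t epoch_b) {
  ProposeOutcome out{};
  out.b_bumps_epoch = epoch_a > epoch_b;
  // B proposes itself if it has the lower rank, or if A's epoch is stale.
  out.b_starts_election = (rank_b < rank_a) || (epoch_a < epoch_b);
  // The lower rank wins, so B defers to A only when A's rank is lower.
  out.b_acks_a = rank_a < rank_b;
  return out;
}
```

<p>Read this way, the epoch only decides who has to catch up first, while the rank alone decides who ends up acking whom.</p>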
Ceph Monitor Overview
2016-05-26T00:00:00+00:00
http://blog.wjin.org/posts/ceph-monitor-overview
<h1 id="introduction">Introduction</h1>
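<p>The monitor plays a critical role in a Ceph cluster: it maintains several maps (monmap, osdmap, pgmap, and so on) and uses the Paxos algorithm to keep their data consistent.</p>
<p>A single monitor is enough to run a cluster, but to avoid a single point of failure you can deploy several; a typical deployment places three in different failure domains. In Ceph's implementation, monitor node information is stored in the monmap, and each monitor has a rank in it. The rank matters a great deal: a group of monitors forms a quorum (of type set<int>) whose elements are rank values, and when a leader is elected the lowest rank wins. Monitors are therefore not equal peers; the likely reason for this design is to elect a leader quickly.</p>
<p>Every map the monitor maintains is exposed as a PaxosService. Each service derives from the PaxosService base class to implement its own behavior and updates its data through Paxos. Only the leader may call the propose functions to apply an update; a peon that receives an update message must forward it to the leader. As a result there is only one proposer at any given time, there is almost no contention, and both the agreement and the update complete quickly.</p>
<p>The monitor covers roughly the following topics:</p>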
<ul>
<li>startup</li>
<li>data store</li>
<li>data sync</li>
<li>data check</li>
<li>scrub</li>
<li>leader elect</li>
<li>timecheck</li>
<li>lease</li>
<li>paxos</li>
<li>paxos service</li>
<li>consistency</li>
</ul>
<p>Each of these is briefly introduced below.</p>
<h1 id="startup">Startup</h1>
<p>For the monitor startup flow, see <a href="http://blog.wjin.org/posts/ceph-monitor-startup.html">this post</a>. The state transitions a monitor goes through are shown below:</p>
<p><img src="/assets/img/post/ceph_mon_stat.png" alt="img" /></p>
<h1 id="data-store">Data Store</h1>
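<p>The monitor maintains many maps, plus the state of its own Elector and Paxos, and all of that has to be stored somewhere. Early versions stored it in files; later versions switched to a key/value store, mainly to get atomic operations and ordered keys. LevelDB and RocksDB are currently supported. The implementation is mostly in MonitorDBStore.h: each operation on a key is wrapped in an op, and because updates that touch several keys must be transactional, callers always submit a transaction, which may contain multiple ops.</p>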
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MonitorDBStore</span>
<span class="p">{</span>
<span class="n">boost</span><span class="o">::</span><span class="n">scoped_ptr</span><span class="o"><</span><span class="n">KeyValueDB</span><span class="o">></span> <span class="n">db</span><span class="p">;</span> <span class="c1">// the concrete storage backend, either LevelDB or RocksDB
</span> <span class="p">......</span>
<span class="k">struct</span> <span class="n">Op</span> <span class="p">{</span> <span class="c1">// a single operation on a key
</span> <span class="kt">uint8_t</span> <span class="n">type</span><span class="p">;</span>
<span class="n">string</span> <span class="n">prefix</span><span class="p">;</span>
<span class="n">string</span> <span class="n">key</span><span class="p">,</span> <span class="n">endkey</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">Transaction</span> <span class="p">{</span>
<span class="n">list</span><span class="o"><</span><span class="n">Op</span><span class="o">></span> <span class="n">ops</span><span class="p">;</span> <span class="c1">// multiple ops
</span> <span class="kt">uint64_t</span> <span class="n">bytes</span><span class="p">,</span> <span class="n">keys</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
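<p>To make the op/transaction idea concrete, here is a minimal sketch that applies a batch of ops to an ordered in-memory map in one step. It is not MonitorDBStore itself: ToyStore, OpType and the put/erase helpers are hypothetical names, and std::map merely stands in for the LevelDB/RocksDB backend.</p>

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for MonitorDBStore::Op / Transaction.
enum class OpType : std::uint8_t { Put, Erase };

struct Op {
  OpType type;
  std::string prefix;
  std::string key;
  std::string value;  // unused for Erase
};

struct Transaction {
  std::vector<Op> ops;  // several ops submitted together
  void put(std::string prefix, std::string key, std::string value) {
    ops.push_back({OpType::Put, std::move(prefix), std::move(key), std::move(value)});
  }
  void erase(std::string prefix, std::string key) {
    ops.push_back({OpType::Erase, std::move(prefix), std::move(key), {}});
  }
};

// An ordered in-memory map plays the role of the k/v backend.
class ToyStore {
 public:
  // Apply every op to a copy, then swap the copy in, so a reader of data_
  // sees either all of the transaction or none of it.
  void apply_transaction(const Transaction& t) {
    auto next = data_;
    for (const Op& op : t.ops) {
      const auto k = std::make_pair(op.prefix, op.key);
      if (op.type == OpType::Put) {
        next[k] = op.value;
      } else {
        next.erase(k);
      }
    }
    data_.swap(next);
  }

  bool get(const std::string& prefix, const std::string& key, std::string* out) const {
    auto it = data_.find({prefix, key});
    if (it == data_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::pair<std::string, std::string>, std::string> data_;
};
```

<p>A caller would build one Transaction with several put and erase calls and hand it to apply_transaction once, which mirrors how the monitor groups related key updates.</p>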
<h1 id="data-sync">Data Sync</h1>
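<p>Beyond the monitor's basic key/value metadata, most of the data consists of the successive versions of each map produced by Paxos, and that data must be kept consistent. A monitor can fail (a crash or a network outage, for example), and new monitors may be added while the cluster is running; a newly joined monitor has to sync the data of the existing cluster. On startup a monitor enters the bootstrap phase and probes the other monitors. Each monitor that receives the probe replies with a message that mainly contains the following:</p>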
<ul>
<li>the current monmap</li>
<li>the current quorum</li>
<li>first_committed, the first version Paxos has committed</li>
<li>last_committed, the last version Paxos has committed</li>
</ul>
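<p>The probing monitor then processes the reply. First it compares monmaps: if the received monmap has a higher version (is newer), it updates its own. Next, if the Paxos committed versions differ by too much, it has to sync data from the peer. There are two cases. If the ranges overlap (my->last_committed >= peer->first_committed), only the missing versions are synced; otherwise a full sync is needed, copying everything from the peer's first_committed to its last_committed. If the gap is still within the accepted range, the probing phase may skip the sync entirely and leave it to Paxos to recover the missing versions later.</p>
<p>Note that only committed data matters here. Data that Paxos has proposed or accepted but not yet committed is not handled at this stage; it is dealt with later, when Paxos initializes.</p>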
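<p>The decision described above can be written down as a small function. This is a sketch of the rules as stated in this post, not the monitor's actual probe handling; choose_sync and its parameter names are hypothetical, and max_join_drift stands in for the acceptable gap (the Consistency section below names the real parameter, paxos_max_join_drift).</p>

```cpp
#include <cstdint>

enum class SyncKind { None, Incremental, Full };

// Hypothetical helper restating the rules from the text.
inline SyncKind choose_sync(std::uint64_t my_last_committed,
                            std::uint64_t peer_first_committed,
                            std::uint64_t peer_last_committed,
                            std::uint64_t max_join_drift) {
  // No overlap: our newest version is older than the peer's oldest one.
  if (my_last_committed < peer_first_committed) return SyncKind::Full;
  // Overlap, but too far behind to leave the gap to Paxos recovery.
  if (my_last_committed + max_join_drift < peer_last_committed) {
    return SyncKind::Incremental;
  }
  // Close enough: skip the sync and let Paxos catch up later.
  return SyncKind::None;
}
```

<p>For example, with a peer holding versions 100 to 200 and a drift of 10, a monitor at version 5 would do a full sync, one at 150 an incremental sync, and one at 195 no sync at all.</p>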
<h1 id="data-check">Data Check</h1>
<p>The monitor has an abstract base class, QuorumService, from which quorum-related services are derived. The class diagram is shown below:</p>
<p><img src="/assets/img/post/ceph_mon_quorumservice.png" alt="img" /></p>
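<p>The inheritance here feels somewhat overused. ConfigKeyService gives users an interface for storing custom key/value data in the monitor, which is done by having the leader issue a Paxos propose; that seems to have little to do with QuorumService.</p>
<p>HealthMonitor checks the monitor's state and holds a map of services internally:</p>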
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HealthMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">QuorumService</span>
<span class="p">{</span>
<span class="n">map</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span><span class="n">HealthService</span><span class="o">*></span> <span class="n">services</span><span class="p">;</span> <span class="c1">// the services to check
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
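<p>Only one service is implemented so far, DataHealthService. It checks the data stored by the monitor: disk space usage on the one hand, and the usage of the backing key/value store on the other:</p>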
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DataHealthService</span> <span class="o">:</span> <span class="k">public</span> <span class="n">HealthService</span>
<span class="p">{</span>
<span class="n">map</span><span class="o"><</span><span class="n">entity_inst_t</span><span class="p">,</span><span class="n">DataStats</span><span class="o">></span> <span class="n">stats</span><span class="p">;</span> <span class="c1">// the items being checked
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">DataStats</span> <span class="p">{</span>
<span class="n">ceph_data_stats_t</span> <span class="n">fs_stats</span><span class="p">;</span> <span class="c1">// file system usage
</span> <span class="n">LevelDBStoreStats</span> <span class="n">store_stats</span><span class="p">;</span> <span class="c1">// usage of the k/v backend; now that several backends are supported, the name should no longer say leveldb
</span><span class="p">};</span>
</code></pre></div></div>
<h1 id="scrub">Scrub</h1>
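<p>An OSD uses scrub to compare replica data so that inconsistencies are found and handled promptly. Monitor nodes likewise need consistent data (consistent here means that the on-disk data is not corrupted, not consistency in the Paxos sense). Because monitor data is trimmed by version and old data matters little, the need is less pressing, so there is no background task that scrubs periodically (this applies to the Hammer code; the latest master schedules scrub from a periodic timer event). Instead there is only a mechanism for issuing an explicit scrub command, which the leader carries out by sending MMonScrub messages to each peon:</p>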
<blockquote>
<p>ceph scrub</p>
</blockquote>
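<p>Scrub covers only PaxosService data, not the monitor's own metadata or the Paxos data. The monitor's own data is presumably checked at startup. The Paxos data may legitimately differ between nodes, for example while a proposal is in flight, so it is not scrubbed either. This also suggests that monitor scrub is not that important.</p>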
<h1 id="leader-elect">Leader Elect</h1>
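<p>To simplify the design, the monitors elect a leader among themselves, and the leader is responsible for issuing Paxos proposes to update data. The election is very simple: as noted above, the monitor with the lowest rank is chosen. See <a href="http://blog.wjin.org/posts/ceph-monitor-leader-elect.html">this post</a>.</p>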
<h1 id="timecheck">Timecheck</h1>
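<p>A distributed system depends on system time to run correctly. Ceph provides a timecheck mechanism that checks whether the monitors' clocks agree and issues a warning if the error (clock skew) is too large. The check is done by the leader sending MTimeCheck messages to the peons; the leader first estimates an rtt (the message round-trip time) and then the skew:</p>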
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">map</span><span class="o"><</span><span class="n">entity_inst_t</span><span class="p">,</span> <span class="kt">double</span><span class="o">></span> <span class="n">timecheck_skews</span><span class="p">;</span> <span class="c1">// clock skew
</span><span class="n">map</span><span class="o"><</span><span class="n">entity_inst_t</span><span class="p">,</span> <span class="kt">double</span><span class="o">></span> <span class="n">timecheck_latencies</span><span class="p">;</span> <span class="c1">// rtt
</span></code></pre></div></div>
<p>How often the leader checks is controlled by the following parameter, which defaults to 300 seconds:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mon_timecheck_interval
</code></pre></div></div>
<p>Severe clock drift can cause a peon's lease to expire, which sends the peon into bootstrap and triggers a new election. You can verify this by setting the leader's clock back.</p>
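<p>As a rough illustration of how an rtt and a skew can be derived from one request/reply pair, here is a generic estimate that assumes the network delay is the same in both directions. It is not the exact formula the monitor uses; TimecheckSample and both functions are hypothetical names.</p>

```cpp
// Hypothetical sample of one timecheck round; all values are in seconds.
struct TimecheckSample {
  double leader_send;  // leader clock when the request was sent
  double leader_recv;  // leader clock when the reply arrived
  double peon_time;    // peon clock stamped into the reply
};

// Round-trip time as seen by the leader.
inline double estimate_rtt(const TimecheckSample& s) {
  return s.leader_recv - s.leader_send;
}

// Positive result: the peon's clock is ahead of the leader's.
// Assumes the one-way delay is half of the round trip.
inline double estimate_skew(const TimecheckSample& s) {
  const double leader_time_at_peon = s.leader_send + estimate_rtt(s) / 2.0;
  return s.peon_time - leader_time_at_peon;
}
```

<p>The rtt has to be known first because the skew is only meaningful once the time the message spent in flight has been subtracted.</p>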
<h1 id="lease">Lease</h1>
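<p>Internally the monitors use a lease protocol. It guarantees that replica data is readable and writable within a bounded time window (writes must go through the leader), and it also serves to detect a failed monitor and trigger a new election. The leader periodically sends lease messages to extend each peon's lease. If a peon goes down, the leader does not receive its lease_ack and calls a new election after the timeout. If the leader goes down, none of the peons receive lease updates from it, and they also call a new election after the timeout. The protocol is implemented inside class Paxos, the main message type is MMonPaxos::OP_LEASE, and these are the key parameters:</p>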
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mon_lease_renew_interval</span> <span class="c1">// interval at which the leader sends lease messages, 3 seconds by default
</span><span class="n">mon_lease</span> <span class="c1">// how far each renewal extends the lease, 5 seconds by default; must be greater than mon_lease_renew_interval
</span><span class="n">mon_lease_ack_timeout</span> <span class="c1">// timeout before a new election, 10 seconds by default; must be greater than mon_lease
</span></code></pre></div></div>
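<p>Note that the last parameter is a timeout shared by the leader and the peons, so its name is not ideal. If the leader has not received every ack within this time after sending a lease message, it goes back into bootstrap and a new election. If a peon has not received a lease message within this time, it also enters bootstrap and a new election. There is therefore a window in which the lease has expired but the timeout has not yet fired; during that window neither the leader nor the peons can serve reads or writes. The expiry time is stored in the variable lease_expire:</p>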
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// check whether the lease is still valid
</span><span class="kt">bool</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">is_lease_valid</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="p">((</span><span class="n">mon</span><span class="o">-></span><span class="n">get_quorum</span><span class="p">().</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
<span class="o">||</span> <span class="p">(</span><span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">)</span> <span class="o"><</span> <span class="n">lease_expire</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">extend_lease</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">is_leader</span><span class="p">());</span>
<span class="n">lease_expire</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="n">lease_expire</span> <span class="o">+=</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">mon_lease</span><span class="p">;</span> <span class="c1">// the leader extends the lease when it sends the message
</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">handle_lease</span><span class="p">(</span><span class="n">MMonPaxos</span> <span class="o">*</span><span class="n">lease</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// extend lease
</span> <span class="k">if</span> <span class="p">(</span><span class="n">lease_expire</span> <span class="o"><</span> <span class="n">lease</span><span class="o">-></span><span class="n">lease_timestamp</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lease_expire</span> <span class="o">=</span> <span class="n">lease</span><span class="o">-></span><span class="n">lease_timestamp</span><span class="p">;</span> <span class="c1">// the peon updates its expiry time from the leader's message
</span>
<span class="n">utime_t</span> <span class="n">now</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">lease_expire</span> <span class="o"><</span> <span class="n">now</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// already in the window where reads and writes are not allowed; only a warning is logged
</span> <span class="n">utime_t</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">now</span> <span class="o">-</span> <span class="n">lease_expire</span><span class="p">;</span>
<span class="n">derr</span> <span class="o"><<</span> <span class="s">"lease_expire from "</span> <span class="o"><<</span> <span class="n">lease</span><span class="o">-></span><span class="n">get_source_inst</span><span class="p">()</span> <span class="o"><<</span> <span class="s">" is "</span> <span class="o"><<</span> <span class="n">diff</span>
<span class="o"><<</span> <span class="s">" seconds in the past; mons are probably laggy (or possibly clocks are too skewed)"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
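<p>The three time ranges described above (lease valid, lease expired but not yet timed out, timed out) can be modeled from a peon's point of view as follows. This is a sketch under the assumption that the peon's timeout is counted from the moment it last received a lease message; LeaseConfig, LeaseState, config_is_sane and classify are hypothetical names, and only the three field comments refer to real parameters.</p>

```cpp
enum class LeaseState { Valid, ExpiredWaiting, TimedOut };

// Hypothetical holder for the three parameters; values are in seconds.
struct LeaseConfig {
  double renew_interval;  // mon_lease_renew_interval
  double lease;           // mon_lease
  double ack_timeout;     // mon_lease_ack_timeout
};

// The ordering constraint stated in the parameter comments.
inline bool config_is_sane(const LeaseConfig& c) {
  return c.renew_interval < c.lease && c.lease < c.ack_timeout;
}

// Classify a point in time on the peon's local clock.
inline LeaseState classify(double now, double lease_expire,
                           double last_lease_received, const LeaseConfig& c) {
  if (now < lease_expire) return LeaseState::Valid;  // readable
  if (now < last_lease_received + c.ack_timeout) {
    return LeaseState::ExpiredWaiting;  // not readable, still waiting
  }
  return LeaseState::TimedOut;  // go back to bootstrap and re-elect
}
```

<p>With the defaults quoted above (3, 5 and 10 seconds) and a lease received at time 0, the peon would be readable until 5, unreadable but waiting between 5 and 10, and would call for a new election from 10 on.</p>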
<h1 id="paxos">Paxos</h1>
<p>Paxos keeps the data on all monitors consistent. For details, see <a href="http://blog.wjin.org/posts/ceph-monitor-paxos.html">this post</a>.</p>
<h1 id="paxosservice">PaxosService</h1>
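<p>PaxosService is fairly simple. It builds on class Paxos and provides template methods that make it easy to implement different services. For details, see <a href="http://blog.wjin.org/posts/ceph-monitor-paxosservice.html">this post</a>.</p>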
<h1 id="consistency">Consistency</h1>
<h3 id="monmap">Monmap</h3>
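<p>When a monitor first joins, it runs mkfs, adds itself to the monmap, and persists the monmap in its backing store. If it later crashes or exits, on restart it reads back the old monmap, discovers the other monitors through probing, asks to join the quorum, and a new election follows. If the monmap changed while it was down, the probing phase shares the newer monmap and updates it. There is one extreme case to be aware of. Suppose the cluster starts with monitors 0, 1 and 2:</p>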
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0,1,2
1,2 #0 down
1,2,3,4,5,6 #add 4 monitors: 3-6
3,4,5,6 #1,2 down
0,3,4,5,6 #0 up
</code></pre></div></div>
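<p>At this point monitor 0 cannot join: its probing keeps timing out and never completes. The monmap that 0 has contains only 1 and 2, so it probes only 1 and 2, and both are already down. Meanwhile 3, 4, 5 and 6 form a majority and keep working correctly, so they never re-bootstrap and never discover that 0 is back up.</p>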
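<p>Such an extreme case is unlikely in real operations, but it shows how important the monmap is: its consistency is critical and everything that follows builds on it. That is why a copy of the monmap is always backed up as a precaution when data is synced during bootstrap.</p>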
<h3 id="sync">Sync</h3>
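<p>During probing a monitor syncs data. If the gap is too large, that is, the version ranges do not overlap, it does a full sync; otherwise it does an incremental sync. Note, however, that if the gap is small and has not passed the sync threshold, no data is synced at all. The threshold is controlled by the parameter paxos_max_join_drift. This means that when probing finishes and the monitor enters the electing phase, the newly joined monitor's data may well be a few versions behind. Catching up is left to the Paxos recovering phase, which brings the data back into agreement.</p>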
<h3 id="paxos-1">Paxos</h3>
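<p>After the election, the leader runs the collect function to recover data. This guarantees that the monitors eventually converge: committed data is always identical, and any value that was accepted but not committed is proposed again.</p>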
Ceph FileStore
2016-05-04T00:00:00+00:00
http://blog.wjin.org/posts/ceph-filestore
<h1 id="introduction">Introduction</h1>
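<p>Ceph has several backend storage engines (filestore, kstore, memstore, bluestore). bluestore will eventually become the default backend, but that will take some time, and most deployments today use filestore. The filestore code is fairly easy to follow; for the execution flow, see <a href="http://blog.csdn.net/ywy463726588/article/details/42679869">this article</a>.</p>
<p>Some details still deserve attention when reading the code: ops on different PGs can run in parallel, whereas ops within one PG must run serially; the various throttling components have to cooperate; and journal replay has to watch out for non-idempotent operations.</p>
<p>This post follows the code path, looks at the details worth noting in the write flow, and then summarizes throttling, non-idempotent operations, and tuning parameters.</p>
<p>Threads involved:</p>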
<ul>
<li>
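<p>OSD::osd_op_tp -> submits write requests to the journal queue</p>
</li>
<li>
<p>FileJournal::write thread -> writes the journal</p>
</li>
<li>
<p>JournalingObjectStore::finisher -> callbacks after the journal write completes</p>
</li>
<li>
<p>FileStore::ondisk_finisher -> callback once the journal entry is on disk; the write has succeeded, but the data is not yet readable</p>
</li>
<li>
<p>FileStore::op_tp -> applies to the file system's page cache, with no guarantee that the data has reached disk</p>
</li>
<li>
<p>WBThrottle::thread -> throttles applies to the file system</p>
</li>
<li>
<p>FileStore::op_finisher -> callback once the apply to the file system completes; the data is now readable</p>
</li>
<li>
<p>FileStore::sync thread -> syncs file system contents to disk and reports the sequence number to the journal so that the journal can release and reuse space</p>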
</li>
</ul>
<p>Operation semantics:</p>
<ul>
<li>
<p>submit: queue to the journal</p>
</li>
<li>
<p>apply: write to the file system page cache</p>
</li>
<li>
<p>commit: sync the file system page cache data to disk</p>
</li>
</ul>
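<p>A note: commit here means the sync thread syncing page cache data to the <code class="highlighter-rouge">data disk</code>, not to the journal's disk. What tends to cause confusion is that osd_op_tp supplies two callbacks when it submits a transaction. The first is ondisk (which the client can treat as "on commit"), meaning the data is durable: the journal normally uses O_DIRECT + O_DSYNC, so a successful journal write means the data has reached disk, and the ondisk callback can run to tell the client that the write succeeded. This is how the journal improves write performance: it turns random writes into sequential ones, and several writes can be merged into a single journal write. The second is onreadable, meaning the data can be read: once the FileStore::op_tp thread pool has written the data into the page cache it is readable, and the onreadable callback can run.</p>
<p>Key data structures:</p>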
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">JournalingObjectStore</span> <span class="o">:</span> <span class="k">public</span> <span class="n">ObjectStore</span> <span class="p">{</span>
<span class="k">protected</span><span class="o">:</span>
<span class="k">class</span> <span class="nc">SubmitManager</span> <span class="p">{</span>
<span class="n">Mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">op_seq</span><span class="p">;</span> <span class="c1">// sequence number for journal submissions, globally unique
</span> <span class="kt">uint64_t</span> <span class="n">op_submitted</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span> <span class="n">submit_manager</span><span class="p">;</span>
<span class="k">class</span> <span class="nc">ApplyManager</span> <span class="p">{</span>
<span class="n">Mutex</span> <span class="n">apply_lock</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">blocked</span><span class="p">;</span>
<span class="n">Cond</span> <span class="n">blocked_cond</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">open_ops</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">max_applied_seq</span><span class="p">;</span> <span class="c1">// sequence number applied to the file system page cache
</span>
<span class="n">Mutex</span> <span class="n">com_lock</span><span class="p">;</span>
<span class="n">map</span><span class="o"><</span><span class="n">version_t</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">Context</span><span class="o">*></span> <span class="o">></span> <span class="n">commit_waiters</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">committing_seq</span><span class="p">,</span> <span class="n">committed_seq</span><span class="p">;</span> <span class="c1">// sequence numbers of page cache data fsynced to disk, used to tell the journal it can release space
</span> <span class="p">......</span>
<span class="p">}</span> <span class="n">apply_manager</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
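<p>To show how a submit manager can hand out globally unique, strictly ordered sequence numbers, here is a minimal sketch. It assumes the lock is taken in op_submit_start and only released in op_submit_finish, so that the order in which sequences are issued is also the order in which ops are queued; ToySubmitManager is a hypothetical simplification, not the class from the listing above.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified submit manager. The mutex is held across the
// whole start/finish window of one submission.
class ToySubmitManager {
 public:
  // Lock and allocate the next sequence number.
  std::uint64_t op_submit_start() {
    lock_.lock();
    return ++op_seq_;
  }

  // Record the op as submitted and release the lock.
  void op_submit_finish(std::uint64_t op) {
    assert(op == op_submitted_ + 1);  // submissions finish in sequence order
    op_submitted_ = op;
    lock_.unlock();
  }

  // Unsynchronized read, intended for single-threaded inspection only.
  std::uint64_t submitted() const { return op_submitted_; }

 private:
  std::mutex lock_;
  std::uint64_t op_seq_ = 0;
  std::uint64_t op_submitted_ = 0;
};
```

<p>The cost of this design is that everything between the two calls is serialized across all PGs, which is why that window should do as little work as possible.</p>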
<h1 id="write">Write</h1>
<h2 id="osdosd_op_tp">OSD::osd_op_tp</h2>
<p>When the OSD::osd_op_tp thread pool executes a PG write, it submits the request through queue_transactions:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">queue_transactions</span><span class="p">(</span><span class="n">Sequencer</span> <span class="o">*</span><span class="n">posr</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">Transaction</span><span class="o">*></span> <span class="o">&</span><span class="n">tls</span><span class="p">,</span>
<span class="n">TrackedOpRef</span> <span class="n">osd_op</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
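<span class="c1">// posr is defined in class PG, so it is always the same for a given PG; default_osr is never actually used here, and newer Jewel code has removed it
</span> <span class="c1">// OpSequencer is critical: a PG always uses the same OpSequencer, which serializes the operations on that PG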
</span> <span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">posr</span><span class="p">)</span>
<span class="n">posr</span> <span class="o">=</span> <span class="o">&</span><span class="n">default_osr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">posr</span><span class="o">-></span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="n">osr</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">OpSequencer</span> <span class="o">*></span><span class="p">(</span><span class="n">posr</span><span class="o">-></span><span class="n">p</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">osr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">OpSequencer</span><span class="p">;</span> <span class="c1">// created on the PG's first operation and reused afterwards
</span> <span class="n">osr</span><span class="o">-></span><span class="n">parent</span> <span class="o">=</span> <span class="n">posr</span><span class="p">;</span>
<span class="n">posr</span><span class="o">-></span><span class="n">p</span> <span class="o">=</span> <span class="n">osr</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">journal</span> <span class="o">&&</span> <span class="n">journal</span><span class="o">-></span><span class="n">is_writeable</span><span class="p">()</span> <span class="o">&&</span> <span class="o">!</span><span class="n">m_filestore_journal_trailing</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Op</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="n">build_op</span><span class="p">(</span><span class="n">tls</span><span class="p">,</span> <span class="n">onreadable</span><span class="p">,</span> <span class="n">onreadable_sync</span><span class="p">,</span> <span class="n">osd_op</span><span class="p">);</span>
<span class="n">op_queue_reserve_throttle</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">handle</span><span class="p">);</span> <span class="c1">// filestore-level throttle on the whole op; it is released in FileStore::_finish_op
</span>
<span class="n">journal</span><span class="o">-></span><span class="n">throttle</span><span class="p">();</span> <span class="c1">// throttle on the journal
</span>
<span class="c1">// generate a unique sequence for the op at the journal level; the journal is written by a single thread, so writes are always serial
</span> <span class="kt">uint64_t</span> <span class="n">op_num</span> <span class="o">=</span> <span class="n">submit_manager</span><span class="p">.</span><span class="n">op_submit_start</span><span class="p">();</span>
<span class="n">o</span><span class="o">-></span><span class="n">op</span> <span class="o">=</span> <span class="n">op_num</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_filestore_journal_parallel</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_filestore_journal_writeahead</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// ext4 and xfs both need write-ahead journaling
</span>
<span class="n">osr</span><span class="o">-></span><span class="n">queue_journal</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">);</span> <span class="c1">// record the journal-level sequence in the OpSequencer's journal queue; the JournalingObjectStore::finisher thread calls dequeue_journal later
</span>
<span class="n">_op_journal_transactions</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">tls</span><span class="p">,</span> <span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">,</span> <span class="c1">// submit the op; note the callback and the sequence passed here
</span> <span class="k">new</span> <span class="n">C_JournaledAhead</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">osr</span><span class="p">,</span> <span class="n">o</span><span class="p">,</span> <span class="n">ondisk</span><span class="p">),</span>
<span class="n">osd_op</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">submit_manager</span><span class="p">.</span><span class="n">op_submit_finish</span><span class="p">(</span><span class="n">op_num</span><span class="p">);</span> <span class="c1">// the op has been queued to the journal
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">JournalingObjectStore</span><span class="o">::</span><span class="n">_op_journal_transactions</span><span class="p">(</span>
<span class="n">list</span><span class="o"><</span><span class="n">ObjectStore</span><span class="o">::</span><span class="n">Transaction</span><span class="o">*>&</span> <span class="n">tls</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">op</span><span class="p">,</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">onjournal</span><span class="p">,</span> <span class="n">TrackedOpRef</span> <span class="n">osd_op</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">journal</span> <span class="o">&&</span> <span class="n">journal</span><span class="o">-></span><span class="n">is_writeable</span><span class="p">())</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">journal</span><span class="o">-></span><span class="n">submit_entry</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">tbl</span><span class="p">,</span> <span class="n">data_align</span><span class="p">,</span> <span class="n">onjournal</span><span class="p">,</span> <span class="n">osd_op</span><span class="p">);</span> <span class="c1">// put on the journal queue to wait for the write thread to execute the journal write
</span> <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">onjournal</span><span class="p">)</span> <span class="p">{</span>
<span class="n">apply_manager</span><span class="p">.</span><span class="n">add_waiter</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">onjournal</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">FileJournal</span><span class="o">::</span><span class="n">submit_entry</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">e</span><span class="p">,</span> <span class="kt">int</span> <span class="n">alignment</span><span class="p">,</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">oncommit</span><span class="p">,</span> <span class="n">TrackedOpRef</span> <span class="n">osd_op</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// take journal throttle resources
</span> <span class="n">throttle_ops</span><span class="p">.</span><span class="n">take</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">throttle_bytes</span><span class="p">.</span><span class="n">take</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">length</span><span class="p">());</span>
<span class="p">{</span>
<span class="c1">// note the lock order
</span> <span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l1</span><span class="p">(</span><span class="n">writeq_lock</span><span class="p">);</span> <span class="c1">// ** lock **
</span> <span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l2</span><span class="p">(</span><span class="n">completions_lock</span><span class="p">);</span> <span class="c1">// ** lock **
</span>
<span class="c1">// once the write thread finishes, it processes the completion queued here
</span> <span class="n">completions</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span>
<span class="n">completion_item</span><span class="p">(</span>
<span class="n">seq</span><span class="p">,</span> <span class="n">oncommit</span><span class="p">,</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">),</span> <span class="n">osd_op</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">writeq</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span>
<span class="n">writeq_cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span>
<span class="n">writeq</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">write_item</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">alignment</span><span class="p">,</span> <span class="n">osd_op</span><span class="p">));</span> <span class="c1">// put on the queue to wait for the write thread
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At this point the request is on the journal queue and the work of OSD::osd_op_tp is done.</p>
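<p>The way queue_transactions attaches an OpSequencer to a PG on first use and then reuses it can be sketched in isolation. ToySequencer, ToyOpSequencer, get_op_sequencer and queue_op are hypothetical stand-ins; the real classes carry locks and several queues that are left out here.</p>

```cpp
#include <cstdint>
#include <deque>
#include <memory>

// Hypothetical per-PG queue: ops of one PG, kept in submission order.
struct ToyOpSequencer {
  std::deque<std::uint64_t> q;
};

// Hypothetical handle owned by the PG; p is empty until the first op.
struct ToySequencer {
  std::unique_ptr<ToyOpSequencer> p;
};

// Create the per-PG queue on first use and reuse it afterwards.
inline ToyOpSequencer* get_op_sequencer(ToySequencer& posr) {
  if (!posr.p) posr.p.reset(new ToyOpSequencer);
  return posr.p.get();
}

// Ops of the same PG land in the same FIFO queue.
inline void queue_op(ToySequencer& posr, std::uint64_t op) {
  get_op_sequencer(posr)->q.push_back(op);
}
```

<p>Because every op of a PG lands in the same FIFO, ops of one PG stay ordered, while ops of different PGs sit in different queues and can proceed independently.</p>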
<h2 id="filejournalwrite_thread">FileJournal::write_thread</h2>
<p>The journal is written by FileJournal::write_thread. The flow is fairly simple; when a write finishes, it calls:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileJournal</span><span class="o">::</span><span class="n">queue_completions_thru</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// one journal write can carry the log entries of several ops, so loop over every op that has completed
</span> <span class="c1">// and put all of their callbacks on the finisher thread's queue
</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">completions_empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">completion_item</span> <span class="n">next</span> <span class="o">=</span> <span class="n">completion_peek_front</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">.</span><span class="n">seq</span> <span class="o">></span> <span class="n">seq</span><span class="p">)</span> <span class="c1">// use the sequence to tell whether the entry has been written
</span> <span class="k">break</span><span class="p">;</span>
<span class="n">completion_pop_front</span><span class="p">();</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">.</span><span class="n">finish</span><span class="p">)</span> <span class="c1">// put on the finisher queue to wait for the callback
</span> <span class="n">finisher</span><span class="o">-></span><span class="n">queue</span><span class="p">(</span><span class="n">next</span><span class="p">.</span><span class="n">finish</span><span class="p">);</span> <span class="c1">// this finisher is actually the one in JournalingObjectStore
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="n">finisher_cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
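<p>Note that both when the journal prepares a write and when it processes completions after the write finishes, the queue locks are acquired too often, which leaves room for optimization.
The master branch already has a similar patch: <a href="https://github.com/ceph/ceph/pull/6701">pr6701</a></p>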
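<p>The lock-contention remark can be illustrated with a minimal sketch. It is not the code from pr6701, and CompletionQueue, CompletionItem, take_thru and complete_thru are hypothetical names made up for this example. The idea shown is to take the queue lock once, detach every completion whose seq is at or below the written sequence, and run the callbacks after the lock has been released:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the journal's completion bookkeeping.
struct CompletionItem {
  uint64_t seq;                  // journal sequence number of the op
  std::function<void()> finish;  // callback to run once the entry is written
};

class CompletionQueue {
public:
  void push(uint64_t seq, std::function<void()> finish) {
    std::lock_guard<std::mutex> l(lock_);
    assert(items_.empty() || items_.back().seq < seq);  // ascending order
    items_.push_back(CompletionItem{seq, std::move(finish)});
  }

  // Detach every completion with seq <= written_seq under a single lock
  // acquisition and hand them back to the caller.
  std::list<CompletionItem> take_thru(uint64_t written_seq) {
    std::list<CompletionItem> done;
    std::lock_guard<std::mutex> l(lock_);
    auto it = items_.begin();
    while (it != items_.end() && it->seq <= written_seq)
      ++it;
    done.splice(done.end(), items_, items_.begin(), it);
    return done;
  }

  std::size_t pending() {
    std::lock_guard<std::mutex> l(lock_);
    return items_.size();
  }

private:
  std::mutex lock_;
  std::list<CompletionItem> items_;
};

// Runs the callbacks of all completions up to written_seq without holding
// the queue lock, and returns the completed sequence numbers in order.
std::vector<uint64_t> complete_thru(CompletionQueue& q, uint64_t written_seq) {
  std::vector<uint64_t> seqs;
  for (auto& item : q.take_thru(written_seq)) {
    if (item.finish)
      item.finish();
    seqs.push_back(item.seq);
  }
  return seqs;
}
```

<p>Compared with a peek/pop loop that locks on every step, this version takes the lock once per batch, at the cost of building a temporary list.</p>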
<h2 id="journalingobjectstorefinisher">JournalingObjectStore::finisher</h2>
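<p>The callback here is C_JournaledAhead, which then runs the function below. It does two main things: 1) it queues the op on the FileStore queue; 2) it puts the ondisk callback on FileStore::ondisk_finisher:</p>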
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_journaled_ahead</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">,</span> <span class="n">Op</span> <span class="o">*</span><span class="n">o</span><span class="p">,</span> <span class="n">Context</span> <span class="o">*</span><span class="n">ondisk</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">queue_op</span><span class="p">(</span><span class="n">osr</span><span class="p">,</span> <span class="n">o</span><span class="p">);</span> <span class="c1">// queue the op at the FileStore level, ready to be written to the file system
</span>
<span class="n">list</span><span class="o"><</span><span class="n">Context</span><span class="o">*></span> <span class="n">to_queue</span><span class="p">;</span>
<span class="n">osr</span><span class="o">-></span><span class="n">dequeue_journal</span><span class="p">(</span><span class="o">&</span><span class="n">to_queue</span><span class="p">);</span> <span class="c1">// the journal write succeeded, so take the op off the journal queue
</span>
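<span class="c1">// once the journal entry is written the data is durable, so run the ondisk callback
</span> <span class="c1">// note that the data has not reached the file system yet, so it is not readable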
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ondisk</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ondisk_finisher</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">ondisk</span><span class="p">);</span> <span class="c1">// put it on the ondisk_finisher queue to wait for the callback
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">to_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ondisk_finisher</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">to_queue</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
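<p>The journal is written by a single thread in order, and every op has a unique sequence number, so queue_op is guaranteed to be called in submission order.
However, one PG may submit many ops back to back. These ops are queued in the PG's OpSequencer, and each time the OpSequencer itself is also pushed onto
op_wq to wait for FileStore::op_tp. So when a PG submits requests continuously, the same OpSequencer appears in op_wq several times,
and several op_tp threads may pick up the same OpSequencer at once, each ready to write to the file system:</p>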
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">queue_op</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">,</span> <span class="n">Op</span> <span class="o">*</span><span class="n">o</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// queue_op is called in submission order, so ops that belong to the same OpSequencer
</span> <span class="c1">// are queued inside it in submission order, which preserves op ordering within a PG
</span> <span class="n">osr</span><span class="o">-></span><span class="n">queue</span><span class="p">(</span><span class="n">o</span><span class="p">);</span>
<span class="n">op_wq</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">osr</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="filestoreop_tp">FileStore::op_tp</h2>
<p>Once a PG's OpSequencer has been queued, the PG has ops to execute and the thread pool will process it. The entry point is:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_do_op</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">,</span> <span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">&</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">wbthrottle</span><span class="p">.</span><span class="n">throttle</span><span class="p">();</span> <span class="c1">// writeback throttling at the FileStore level
</span>
<span class="p">......</span>
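<span class="c1">// several op_tp threads may try to run requests for the same OpSequencer concurrently
</span> <span class="c1">// the lock ensures that only one op per OpSequencer (that is, per PG) is executing at a time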
</span> <span class="n">osr</span><span class="o">-></span><span class="n">apply_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">Op</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="n">osr</span><span class="o">-></span><span class="n">peek_queue</span><span class="p">();</span> <span class="c1">// fetch an op
</span>
<span class="n">apply_manager</span><span class="p">.</span><span class="n">op_apply_start</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_do_transactions</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">tls</span><span class="p">,</span> <span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">,</span> <span class="o">&</span><span class="n">handle</span><span class="p">);</span> <span class="c1">// apply the write to the file system
</span> <span class="n">apply_manager</span><span class="p">.</span><span class="n">op_apply_finish</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>After the write has been applied, the thread also runs a finish function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_finish_op</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">list</span><span class="o"><</span><span class="n">Context</span><span class="o">*></span> <span class="n">to_queue</span><span class="p">;</span>
<span class="n">Op</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="n">osr</span><span class="o">-></span><span class="n">dequeue</span><span class="p">(</span><span class="o">&</span><span class="n">to_queue</span><span class="p">);</span> <span class="c1">// remove the op from the OpSequencer queue
</span> <span class="n">osr</span><span class="o">-></span><span class="n">apply_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span> <span class="c1">// release the lock; other threads can now run apply on this OpSequencer
</span>
<span class="n">op_queue_release_throttle</span><span class="p">(</span><span class="n">o</span><span class="p">);</span> <span class="c1">// release the FileStore throttle, see queue_transactions
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">onreadable_sync</span><span class="p">)</span> <span class="p">{</span>
<span class="n">o</span><span class="o">-></span><span class="n">onreadable_sync</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">onreadable</span><span class="p">)</span> <span class="p">{</span>
<span class="n">op_finisher</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">onreadable</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">to_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">op_finisher</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">to_queue</span><span class="p">);</span> <span class="c1">// put them on the op_finisher queue; the apply callbacks mark the data as readable
</span> <span class="p">}</span>
<span class="k">delete</span> <span class="n">o</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
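<p>apply_lock in the OpSequencer serializes the ops within a PG. It does not protect the internal queues q and jq; those are guarded by a separate lock, qlock.
So while an apply is in progress, OSD::osd_op_tp can keep submitting requests to jq and, more importantly, the JournalingObjectStore::finisher thread can keep
queuing ops whose journal write has completed onto q.</p>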
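<p>As noted above, the same OpSequencer may enter FileStore::op_wq several times and be picked up by several FileStore::op_tp threads, and q and jq share one lock. Does this hurt performance?
Not much. FileStore::op_tp is a thread pool with several threads, but each thread takes apply_lock before it starts an apply
and takes qlock only when it dequeues the op from q at the end. So multiple FileStore::op_tp threads never
compete for qlock at the same time. At any moment q adds only two contenders for qlock: JournalingObjectStore::finisher and one FileStore::op_tp thread.</p>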
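<p>The split between the two locks can be sketched as follows. This is a simplified model, not FileStore code: MiniSequencer and MiniWorkQueue are hypothetical, ops are reduced to plain integers, and the work queue itself is left unlocked because the sketch is driven from one thread. It shows that the sequencer is queued once per op, that ops leave the sequencer in FIFO order, and that queuing needs only qlock, never apply_lock:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical, simplified model of an OpSequencer (one per PG).
struct MiniSequencer {
  std::mutex qlock;        // protects q only
  std::mutex apply_lock;   // serializes apply within this sequencer
  std::deque<uint64_t> q;  // ops waiting to be applied, in submission order

  void queue(uint64_t op) {
    std::lock_guard<std::mutex> l(qlock);
    q.push_back(op);
  }
  uint64_t peek() {
    std::lock_guard<std::mutex> l(qlock);
    assert(!q.empty());
    return q.front();
  }
  void dequeue() {
    std::lock_guard<std::mutex> l(qlock);
    assert(!q.empty());
    q.pop_front();
  }
};

// Stand-in for op_wq: the same sequencer is pushed once per queued op.
// A real work queue would need its own lock; it is omitted here.
struct MiniWorkQueue {
  std::deque<MiniSequencer*> wq;

  void queue_op(MiniSequencer* osr, uint64_t op) {
    osr->queue(op);     // needs qlock only
    wq.push_back(osr);
  }

  // What one worker does for one work-queue entry. Returns the applied op.
  uint64_t do_one(std::vector<uint64_t>* applied) {
    assert(!wq.empty());
    MiniSequencer* osr = wq.front();
    wq.pop_front();
    osr->apply_lock.lock();     // one apply per sequencer at a time
    uint64_t op = osr->peek();  // short qlock hold
    applied->push_back(op);     // "apply" the op; qlock is not held here
    osr->dequeue();             // short qlock hold
    osr->apply_lock.unlock();
    return op;
  }
};
```

<p>Because qlock is held only for the short queue operations, a producer is never blocked for the duration of an apply.</p>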
<h2 id="filestoresync_thread">FileStore::sync_thread</h2>
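<p>The sync thread is fairly simple. It obtains a sequence number such that all data before it has been applied, that is, the data is already in the page cache.
It then calls fsync and records the sequence number. This guarantees that all data before that sequence number is on disk and no longer needed in the journal, so the journal can be trimmed to free space.</p>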
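<p>The relationship between the committed sequence number and journal trimming can be sketched like this. MiniJournal and committed_thru are hypothetical names, and the journal is reduced to a queue of sequence numbers, so this shows only the bookkeeping, not FileJournal's on-disk format:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical model: the journal keeps an entry until the file system has
// durably absorbed the corresponding op.
struct MiniJournal {
  std::deque<uint64_t> entries;  // sequence numbers, ascending
  uint64_t committed_seq = 0;    // everything <= this is on the data disk

  void append(uint64_t seq) {
    assert(entries.empty() || entries.back() < seq);
    entries.push_back(seq);
  }

  // Called after the sync thread has synced data up to seq and recorded it.
  // Drops every entry that is no longer needed and returns how many it
  // dropped.
  std::size_t committed_thru(uint64_t seq) {
    assert(seq >= committed_seq);
    committed_seq = seq;
    std::size_t trimmed = 0;
    while (!entries.empty() && entries.front() <= committed_seq) {
      entries.pop_front();
      ++trimmed;
    }
    return trimmed;
  }
};
```

<p>Entries above committed_seq stay in the journal; they are the ones a replay after a crash would have to go through again.</p>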
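<p>Note that obtaining the sequence number blocks FileStore::op_tp, which stalls the apply path and costs performance;
the parameter filestore_max_sync_interval can be tuned accordingly. One potential problem: if no sync runs for a long time,
the directory may hold so much dirty data that a single sync takes too long, or the system may run short of memory and OOM. Tune this together with the kernel parameters dirty_ratio and dirty_expire_centisecs.</p>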
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">sync_entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">op_tp</span><span class="p">.</span><span class="n">pause</span><span class="p">();</span> <span class="c1">// pause the apply thread pool
</span> <span class="k">if</span> <span class="p">(</span><span class="n">apply_manager</span><span class="p">.</span><span class="n">commit_start</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// returns true if there are new requests to commit
</span>
<span class="kt">uint64_t</span> <span class="n">cp</span> <span class="o">=</span> <span class="n">apply_manager</span><span class="p">.</span><span class="n">get_committing_seq</span><span class="p">();</span> <span class="c1">// get the sequence number that has already been applied
</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">backend</span><span class="o">-></span><span class="n">can_checkpoint</span><span class="p">())</span> <span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">apply_manager</span><span class="p">.</span><span class="n">commit_started</span><span class="p">();</span> <span class="c1">// set blocked to false; this mainly serves journal replay
</span> <span class="n">op_tp</span><span class="p">.</span><span class="n">unpause</span><span class="p">();</span> <span class="c1">// resume the thread pool
</span>
<span class="kt">int</span> <span class="n">err</span> <span class="o">=</span> <span class="n">backend</span><span class="o">-></span><span class="n">syncfs</span><span class="p">();</span> <span class="c1">// this syncs the osd's entire current directory
</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">write_op_seq</span><span class="p">(</span><span class="n">op_fd</span><span class="p">,</span> <span class="n">cp</span><span class="p">);</span> <span class="c1">// record the committed sequence number
</span>
<span class="n">err</span> <span class="o">=</span> <span class="o">::</span><span class="n">fsync</span><span class="p">(</span><span class="n">op_fd</span><span class="p">);</span> <span class="c1">// make sure the sequence number update itself reaches disk
</span> <span class="p">}</span>
<span class="n">apply_manager</span><span class="p">.</span><span class="n">commit_finish</span><span class="p">();</span> <span class="c1">// the commit is complete; notify the journal
</span> <span class="n">wbthrottle</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">op_tp</span><span class="p">.</span><span class="n">unpause</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">stop</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">JournalingObjectStore</span><span class="o">::</span><span class="n">ApplyManager</span><span class="o">::</span><span class="n">commit_start</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">bool</span> <span class="n">ret</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">_committing_seq</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">apply_lock</span><span class="p">);</span>
<span class="n">blocked</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="c1">// this only matters for journal replay
</span>
<span class="k">while</span> <span class="p">(</span><span class="n">open_ops</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// wait for the other in-flight applies to finish
</span> <span class="n">blocked_cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">apply_lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">com_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">max_applied_seq</span> <span class="o">==</span> <span class="n">committed_seq</span><span class="p">)</span> <span class="p">{</span>
<span class="n">blocked</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">_committing_seq</span> <span class="o">=</span> <span class="n">committing_seq</span> <span class="o">=</span> <span class="n">max_applied_seq</span><span class="p">;</span> <span class="c1">// update the sequence number
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="n">ret</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">out</span><span class="o">:</span>
<span class="k">if</span> <span class="p">(</span><span class="n">journal</span><span class="p">)</span>
<span class="n">journal</span><span class="o">-></span><span class="n">commit_start</span><span class="p">(</span><span class="n">_committing_seq</span><span class="p">);</span> <span class="c1">// tell the journal too
</span> <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
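<p>The subtle point here is that the sync thread first pauses the FileStore::op_tp thread pool and then calls commit_start().
Once the pool is paused it will issue no new apply requests, so why set the variable blocked to true as well?</p>
<p>First, setting this variable to true prevents further applies:</p>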
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">JournalingObjectStore</span><span class="o">::</span><span class="n">ApplyManager</span><span class="o">::</span><span class="n">op_apply_start</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">op</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">apply_lock</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">blocked</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// new apply operations block here
</span> <span class="n">blocked_cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">apply_lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">blocked</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span> <span class="o">></span> <span class="n">committed_seq</span><span class="p">);</span>
<span class="n">open_ops</span><span class="o">++</span><span class="p">;</span>
<span class="k">return</span> <span class="n">op</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
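<p>Second, there is a special case. During journal replay the apply operations are not executed by the FileStore::op_tp thread pool;
they are executed by whichever thread called mount, as it replays the log. Pausing op_tp therefore has no effect and cannot stop those applies.
If the replay is large or slow enough that the sync thread starts working, the replaying thread has to be paused briefly
so that a sequence number can be obtained. This is where blocked comes in: it blocks the thread that called mount, which is woken to continue the replay after the commit.</p>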
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">JournalingObjectStore</span><span class="o">::</span><span class="n">ApplyManager</span><span class="o">::</span><span class="n">commit_started</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">blocked</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="c1">// set it back to false
</span> <span class="n">blocked_cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span> <span class="c1">// wake up the waiter
</span><span class="p">}</span>
</code></pre></div></div>
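<p>Also note that the sync thread's call to commit_start() may itself block, because it has to wait for all in-flight applies to finish.
So when an apply finishes it checks whether blocked is set. This differs from the previous case, although both block on the blocked variable:</p>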
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">JournalingObjectStore</span><span class="o">::</span><span class="n">ApplyManager</span><span class="o">::</span><span class="n">op_apply_finish</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">op</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">blocked</span><span class="p">)</span> <span class="p">{</span>
<span class="n">blocked_cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span> <span class="c1">// wake up the sync thread
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
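<p>The interplay of blocked, open_ops and the sequence numbers can be put together in one small model. This is a sketch under simplifying assumptions, not the ApplyManager source: MiniApplyManager is a hypothetical class, it uses a single std::mutex where the original has apply_lock and com_lock, and it calls notify_all where the original calls Signal:</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical, trimmed-down model of the apply/commit handshake.
class MiniApplyManager {
public:
  // Called before an op is applied; waits while a commit is starting.
  void op_apply_start(uint64_t op) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [this] { return !blocked_; });
    assert(op > committed_seq_);
    ++open_ops_;
  }

  // Called after an op has been applied.
  void op_apply_finish(uint64_t op) {
    std::lock_guard<std::mutex> l(lock_);
    --open_ops_;
    if (op > max_applied_seq_)
      max_applied_seq_ = op;
    if (blocked_)
      cond_.notify_all();  // a committer may be waiting for open_ops_ == 0
  }

  // Returns true if there is something new to commit. In that case new
  // applies stay blocked until commit_started() is called.
  bool commit_start() {
    std::unique_lock<std::mutex> l(lock_);
    blocked_ = true;
    cond_.wait(l, [this] { return open_ops_ == 0; });
    if (max_applied_seq_ == committed_seq_) {
      blocked_ = false;
      cond_.notify_all();
      return false;
    }
    committing_seq_ = max_applied_seq_;
    return true;
  }

  // The sequence number is captured; applies may continue.
  void commit_started() {
    std::lock_guard<std::mutex> l(lock_);
    blocked_ = false;
    cond_.notify_all();
  }

  // The data up to committing_seq_ is now durable.
  void commit_finish() {
    std::lock_guard<std::mutex> l(lock_);
    committed_seq_ = committing_seq_;
  }

  uint64_t committing_seq() {
    std::lock_guard<std::mutex> l(lock_);
    return committing_seq_;
  }
  uint64_t committed_seq() {
    std::lock_guard<std::mutex> l(lock_);
    return committed_seq_;
  }

private:
  std::mutex lock_;
  std::condition_variable cond_;
  bool blocked_ = false;
  int open_ops_ = 0;
  uint64_t max_applied_seq_ = 0;
  uint64_t committing_seq_ = 0;
  uint64_t committed_seq_ = 0;
};
```

<p>The same condition variable serves both directions: appliers wait on it for blocked to clear, and the committer waits on it for open_ops to drop to zero.</p>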
<h1 id="throttle">Throttle</h1>
<p>The FileStore implementation throttles in three places:</p>
<ul>
<li>
<p>journal</p>
</li>
<li>
<p>filestore apply</p>
</li>
<li>
<p>filestore writeback</p>
</li>
</ul>
<h2 id="journal">journal</h2>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">queue_transactions</span><span class="p">(</span><span class="n">Sequencer</span> <span class="o">*</span><span class="n">posr</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">Transaction</span><span class="o">*></span> <span class="o">&</span><span class="n">tls</span><span class="p">,</span>
<span class="n">TrackedOpRef</span> <span class="n">osd_op</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">journal</span><span class="o">-></span><span class="n">throttle</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">FileJournal</span><span class="o">::</span><span class="n">prepare_multi_write</span><span class="p">(</span><span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">&</span> <span class="n">orig_ops</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">&</span> <span class="n">orig_bytes</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">put_throttle</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">peek_write</span><span class="p">().</span><span class="n">bl</span><span class="p">.</span><span class="n">length</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="filestore-apply">filestore apply</h2>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">queue_transactions</span><span class="p">(</span><span class="n">Sequencer</span> <span class="o">*</span><span class="n">posr</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">Transaction</span><span class="o">*></span> <span class="o">&</span><span class="n">tls</span><span class="p">,</span>
<span class="n">TrackedOpRef</span> <span class="n">osd_op</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">op_queue_reserve_throttle</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">handle</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_finish_op</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">op_queue_release_throttle</span><span class="p">(</span><span class="n">o</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
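<p>Both the journal throttle and the apply throttle follow the same reserve/release pattern: a budget is taken in queue_transactions and given back when the work is done. A minimal sketch of that pattern, assuming a budget expressed as a maximum number of ops and bytes, might look like this. MiniThrottle and its methods are hypothetical, and admitting an op whenever the throttle is idle is a choice made for this sketch so that a single oversized op cannot wait forever:</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical throttle that limits both the number of queued ops and the
// number of queued bytes.
class MiniThrottle {
public:
  MiniThrottle(uint64_t max_ops, uint64_t max_bytes)
      : max_ops_(max_ops), max_bytes_(max_bytes) {}

  // Blocks until the op fits under both limits, then reserves it.
  void take(uint64_t bytes) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [&] { return fits(bytes); });
    ++ops_;
    bytes_ += bytes;
  }

  // Non-blocking variant: reserves and returns true only if the op fits.
  bool try_take(uint64_t bytes) {
    std::lock_guard<std::mutex> l(lock_);
    if (!fits(bytes))
      return false;
    ++ops_;
    bytes_ += bytes;
    return true;
  }

  // Releases a reservation made by take() or try_take().
  void put(uint64_t bytes) {
    std::lock_guard<std::mutex> l(lock_);
    assert(ops_ > 0 && bytes_ >= bytes);
    --ops_;
    bytes_ -= bytes;
    cond_.notify_all();
  }

  uint64_t ops() {
    std::lock_guard<std::mutex> l(lock_);
    return ops_;
  }
  uint64_t bytes() {
    std::lock_guard<std::mutex> l(lock_);
    return bytes_;
  }

private:
  // Caller must hold lock_.
  bool fits(uint64_t bytes) const {
    return ops_ == 0 || (ops_ < max_ops_ && bytes_ + bytes <= max_bytes_);
  }

  const uint64_t max_ops_;
  const uint64_t max_bytes_;
  std::mutex lock_;
  std::condition_variable cond_;
  uint64_t ops_ = 0;
  uint64_t bytes_ = 0;
};
```

<p>A submitter would call take() before queuing an op and the completion path would call put() with the same byte count.</p>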
<h2 id="filestore-writeback">filestore writeback</h2>
<p>For wbthrottle, see this other <a href="http://blog.wjin.org/posts/ceph-throttle-summary.html">article</a>.</p>
<h1 id="non-idempotent-op">Non-idempotent OP</h1>
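<p>If an OSD crashes, the data in the journal has not necessarily all reached the FileStore data disk, because applying an op to FileStore does not mean the data is on disk.
At that point the data may well still be in the page cache; only after the sync thread issues a system call such as fdatasync is it guaranteed to be on disk.</p>
<p>To keep data consistent after such a failure, the journal has to be replayed. As for where to start, FileStore records in the file commit_op_seq the sequence number up to which ops have been applied to the file system and
synced with fdatasync, and replay starts from the sequence number recorded in that file.</p>
<p>During replay, however, some ops may already have taken effect on disk without commit_op_seq reflecting that. Replaying them anyway causes problems for operations that must not be executed more than once, that is, non-idempotent operations.</p>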
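<p>An example:</p>
<ul>
<li>An object is cloned, and the operation has been written to the journal</li>
<li>The operation has also been applied to FileStore</li>
<li>The source object is later updated, and that update is applied to FileStore as well</li>
</ul>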
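<p>Suppose all of the above has reached disk, but the system crashes before the sync thread manages to update commit_op_seq.
After restart, the OSD replays the journal and runs the clone a second time, which copies the newer data rather than the version that was expected.
FileStore has to make replay handle this case correctly.</p>
<p>The approach is to record, in an attribute of the object file, a triple (sequence number, transaction index, op index) for the last operation applied. Each journal submission has a unique sequence number, which
identifies the transactions submitted with it; the transaction index and op index then pinpoint the last op that was applied.</p>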
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">SequencerPosition</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">seq</span><span class="p">;</span> <span class="c1">///< seq
</span> <span class="kt">uint32_t</span> <span class="n">trans</span><span class="p">;</span> <span class="c1">///< transaction in that seq (0-based)
</span> <span class="kt">uint32_t</span> <span class="n">op</span><span class="p">;</span> <span class="c1">///< op in that transaction (0-based)
</span> <span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
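<p>Take clone as the example. Before the operation it runs a check; if it may proceed, it performs the operation and then sets a guard. For a non-idempotent operation that was already executed,
a record therefore exists, so the check fails on the next attempt and the operation is skipped.</p>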
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_clone</span><span class="p">(</span><span class="k">const</span> <span class="n">coll_t</span><span class="o">&</span> <span class="n">cid</span><span class="p">,</span> <span class="k">const</span> <span class="n">ghobject_t</span><span class="o">&</span> <span class="n">oldoid</span><span class="p">,</span> <span class="k">const</span> <span class="n">ghobject_t</span><span class="o">&</span> <span class="n">newoid</span><span class="p">,</span>
<span class="k">const</span> <span class="n">SequencerPosition</span><span class="o">&</span> <span class="n">spos</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_check_replay_guard</span><span class="p">(</span><span class="n">cid</span><span class="p">,</span> <span class="n">newoid</span><span class="p">,</span> <span class="n">spos</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">......</span>
<span class="c1">// clone is non-idempotent; record our work.
</span> <span class="n">_set_replay_guard</span><span class="p">(</span><span class="o">**</span><span class="n">n</span><span class="p">,</span> <span class="n">spos</span><span class="p">,</span> <span class="o">&</span><span class="n">newoid</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
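<p>The guard logic can be tried out on a toy store. The sketch below is a simplification and not FileStore's implementation: Pos mirrors the three fields of SequencerPosition, MiniStore keeps objects and guards in in-memory maps instead of files and extended attributes, and the rule "skip when the recorded position is not older than the op being replayed" is an assumption of this sketch; the real check may distinguish more cases:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Same three fields as SequencerPosition, ordered lexicographically.
struct Pos {
  uint64_t seq;
  uint32_t trans;
  uint32_t op;
};

bool operator<(const Pos& a, const Pos& b) {
  return std::tie(a.seq, a.trans, a.op) < std::tie(b.seq, b.trans, b.op);
}

// Hypothetical in-memory store: object name -> content, plus one guard
// per object.
struct MiniStore {
  std::map<std::string, std::string> data;
  std::map<std::string, Pos> guard;

  void write(const std::string& oid, const std::string& value) {
    data[oid] = value;
  }

  // Returns true if the op at spos may still run against oid.
  bool check_replay_guard(const std::string& oid, const Pos& spos) const {
    auto it = guard.find(oid);
    if (it == guard.end())
      return true;             // no guard recorded yet
    return it->second < spos;  // run only ops newer than the guard
  }

  // Clones src to dst unless the guard shows the clone already happened.
  // Returns true if the clone was performed, false if it was skipped.
  bool clone(const std::string& src, const std::string& dst, const Pos& spos) {
    if (!check_replay_guard(dst, spos))
      return false;
    assert(data.count(src) == 1);
    data[dst] = data[src];
    guard[dst] = spos;  // clone is non-idempotent: record our work
    return true;
  }
};
```

<p>Replaying the clone after the source has changed is then a no-op, so the clone keeps the content it had when the op first ran.</p>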
<h1 id="tuning">Tuning</h1>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code># journal
journal_queue_max_bytes
journal_queue_max_ops
# filestore apply
filestore_queue_max_bytes
filestore_queue_max_ops
# filestore writeback
filestore_wbthrottle_enable
filestore_wbthrottle_xfs_bytes_start_flusher
filestore_wbthrottle_xfs_bytes_hard_limit
filestore_wbthrottle_xfs_ios_start_flusher
filestore_wbthrottle_xfs_ios_hard_limit
filestore_wbthrottle_xfs_inodes_start_flusher
</code></pre></div></div>
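<p>With filestore writeback throttling enabled, first make sure the corresponding limits are tuned; ext4 and xfs share one set of parameters.
Second, under sustained heavy I/O, FileStore::op_tp threads may be held by the throttle long enough to time out, and
the FileStore::op_wq throttle may kick in as well.</p>
<p>With writeback throttling disabled, FileStore::sync_thread may time out, in which case the parameter filestore_commit_timeout needs adjusting.
On SSDs wbthrottle can be disabled.</p>
<p>Other parameters that affect performance:</p>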
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>filestore_op_threads
filestore_fd_cache_size
journal_max_write_bytes
journal_max_write_entries
filestore_max_sync_interval
</code></pre></div></div>
Ceph OSD Heartbeat
2016-04-14T00:00:00+00:00
http://blog.wjin.org/posts/ceph-osd-heartbeat
<h1 id="introduction">Introduction</h1>
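<p>In a large distributed system, failures of all kinds happen regularly: machine crashes, network faults, disk failures and so on can all leave nodes in the cluster unable to communicate.
For a distributed system to run in a coordinated way, the processes on its nodes need a heartbeat mechanism to confirm that each node is working normally and to react promptly once a fault is found.</p>
<p>This post briefly analyzes the heartbeat mechanism between ceph osd processes.</p>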
<h1 id="heartbeat-messenger">HeartBeat Messenger</h1>
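<p>Heartbeat messages between processes travel over the ceph network layer; for how that layer works, see <a href="http://blog.wjin.org/posts/ceph-async-messenger.html">this article</a>.
During startup the osd process creates three messengers for heartbeat traffic; see ceph-osd.cc:</p>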
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">ms_hbclient</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span> <span class="c1">// messenger that sends heartbeat pings
</span> <span class="n">entity_name_t</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">whoami</span><span class="p">),</span> <span class="s">"hbclient"</span><span class="p">,</span>
<span class="n">getpid</span><span class="p">());</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">ms_hb_back_server</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span> <span class="c1">// receives heartbeat pings arriving on the back address
</span> <span class="n">entity_name_t</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">whoami</span><span class="p">),</span> <span class="s">"hb_back_server"</span><span class="p">,</span>
<span class="n">getpid</span><span class="p">());</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">ms_hb_front_server</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span> <span class="c1">// receives heartbeat pings arriving on the front address
</span> <span class="n">entity_name_t</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">whoami</span><span class="p">),</span> <span class="s">"hb_front_server"</span><span class="p">,</span>
<span class="n">getpid</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
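<p>Every osd process is a full peer: it actively sends heartbeat ping messages to other nodes, and it also receives ping messages from them.
So the communication paths are <code class="highlighter-rouge">ms_hbclient <-> ms_hb_back_server</code> and <code class="highlighter-rouge">ms_hbclient <-> ms_hb_front_server</code>.</p>
<p>A ceph deployment typically uses two NICs with two addresses, back and front, to separate north-south traffic from east-west traffic, so the osd process uses two messengers to listen for heartbeats arriving on back and on front.
Note also that at startup the osd reports its back and front addresses to the monitor, and this information is stored in the osdmap.
Other nodes look up the listening addresses in the osdmap, create connections, and then exchange heartbeat messages.</p>
<p>Although the front address exists for clients to connect to the cluster, no heartbeats are exchanged with clients here. It is still reasonable for osds to check among themselves that the front address is usable:
an osd in effect acts as a client and checks that the cluster is reachable, which guards against clients being unable to connect.</p>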
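<p>The address lookup described above can be sketched as a small function. Everything in it is hypothetical and only illustrates the idea: HbAddrs stands in for the per-osd heartbeat addresses stored in the osdmap, PingTarget and ping_targets are made-up names, the addresses used in any call are invented, and treating an empty front address as "not configured" is an assumption of this sketch:</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical slice of an osdmap: heartbeat addresses published per osd.
struct HbAddrs {
  std::string back;   // heartbeat address on the back network
  std::string front;  // heartbeat address on the front network, may be empty
};

struct PingTarget {
  int osd;
  std::string addr;
  bool is_front;
};

// Builds the list of addresses the heartbeat client would ping for the
// given peers: always the back address, plus the front address if present.
std::vector<PingTarget> ping_targets(const std::map<int, HbAddrs>& osdmap,
                                     const std::vector<int>& peers) {
  std::vector<PingTarget> out;
  for (int osd : peers) {
    auto it = osdmap.find(osd);
    if (it == osdmap.end())
      continue;  // peer not in the map: nothing to connect to
    assert(!it->second.back.empty());
    out.push_back(PingTarget{osd, it->second.back, false});
    if (!it->second.front.empty())
      out.push_back(PingTarget{osd, it->second.front, true});
  }
  return out;
}
```

<p>Each returned target corresponds to one connection from the sender's hbclient messenger to either the back or the front heartbeat server of a peer.</p>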
<h1 id="send">Send</h1>
<p>The osd uses a dedicated thread to send heartbeats:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">heartbeat_entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">heartbeat_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_stopping</span><span class="p">())</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">heartbeat_stop</span><span class="p">)</span> <span class="p">{</span>
<span class="n">heartbeat</span><span class="p">();</span> <span class="c1">// send the heartbeat messages
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">heartbeat</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// iterate over all peers and send a heartbeat to each; the peer set is selected according to certain rules
</span> <span class="k">for</span> <span class="p">(</span><span class="n">map</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span><span class="n">HeartbeatInfo</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">i</span> <span class="o">!=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">peer</span> <span class="o">=</span> <span class="n">i</span><span class="o">-></span><span class="n">first</span><span class="p">;</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_tx</span> <span class="o">=</span> <span class="n">now</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first_tx</span> <span class="o">==</span> <span class="n">utime_t</span><span class="p">())</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first_tx</span> <span class="o">=</span> <span class="n">now</span><span class="p">;</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_back</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="k">new</span> <span class="n">MOSDPing</span><span class="p">(</span><span class="n">monc</span><span class="o">-></span><span class="n">get_fsid</span><span class="p">(),</span> <span class="c1">// send to the back address
</span> <span class="n">service</span><span class="p">.</span><span class="n">get_osdmap</span><span class="p">()</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">(),</span>
<span class="n">MOSDPing</span><span class="o">::</span><span class="n">PING</span><span class="p">,</span>
<span class="n">now</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_front</span><span class="p">)</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_front</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="k">new</span> <span class="n">MOSDPing</span><span class="p">(</span><span class="n">monc</span><span class="o">-></span><span class="n">get_fsid</span><span class="p">(),</span> <span class="c1">// send to the front address
</span> <span class="n">service</span><span class="p">.</span><span class="n">get_osdmap</span><span class="p">()</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">(),</span>
<span class="n">MOSDPing</span><span class="o">::</span><span class="n">PING</span><span class="p">,</span>
<span class="n">now</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
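<p>As you can see, the heartbeat sending path is simple and self-contained. When designing a distributed system, to keep the cluster's internal state correct, avoid introducing complicating factors that could interfere with the heartbeat path.
After all, handling heartbeats quickly and correctly is the most basic requirement for keeping the cluster running normally.</p>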
<h1 id="receive">Receive</h1>
<p>To handle heartbeat messages, the osd uses a dedicated dispatcher class:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">HeartbeatDispatcher</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Dispatcher</span> <span class="p">{</span>
<span class="n">OSD</span> <span class="o">*</span><span class="n">osd</span><span class="p">;</span>
<span class="n">HeartbeatDispatcher</span><span class="p">(</span><span class="n">OSD</span> <span class="o">*</span><span class="n">o</span><span class="p">)</span> <span class="o">:</span> <span class="n">Dispatcher</span><span class="p">(</span><span class="n">cct</span><span class="p">),</span> <span class="n">osd</span><span class="p">(</span><span class="n">o</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">bool</span> <span class="n">ms_dispatch</span><span class="p">(</span><span class="n">Message</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">osd</span><span class="o">-></span><span class="n">heartbeat_dispatch</span><span class="p">(</span><span class="n">m</span><span class="p">);</span> <span class="c1">// message handler
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span> <span class="n">heartbeat_dispatcher</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// register the dispatcher during init so that incoming messages can be handled
</span> <span class="n">hbclient_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="o">&</span><span class="n">heartbeat_dispatcher</span><span class="p">);</span>
<span class="n">hb_front_server_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="o">&</span><span class="n">heartbeat_dispatcher</span><span class="p">);</span>
<span class="n">hb_back_server_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="o">&</span><span class="n">heartbeat_dispatcher</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When a message arrives, the messenger's internal dispatch thread calls the dispatcher registered earlier:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">OSD</span><span class="o">::</span><span class="n">heartbeat_dispatch</span><span class="p">(</span><span class="n">Message</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">get_type</span><span class="p">())</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">CEPH_MSG_PING</span><span class="p">:</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">MSG_OSD_PING</span><span class="p">:</span>
<span class="n">handle_osd_ping</span><span class="p">(</span><span class="k">static_cast</span><span class="o"><</span><span class="n">MOSDPing</span><span class="o">*></span><span class="p">(</span><span class="n">m</span><span class="p">));</span> <span class="c1">// handle the heartbeat
</span> <span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">CEPH_MSG_OSD_MAP</span><span class="p">:</span> <span class="c1">// this message is never generated inside the heartbeat messenger
</span> <span class="p">{</span>
<span class="n">ConnectionRef</span> <span class="n">self</span> <span class="o">=</span> <span class="n">cluster_messenger</span><span class="o">-></span><span class="n">get_loopback_connection</span><span class="p">();</span>
<span class="n">self</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">default</span><span class="o">:</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">handle_osd_ping</span><span class="p">(</span><span class="n">MOSDPing</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">op</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">MOSDPing</span><span class="p">:</span><span class="o">:</span><span class="n">PING</span><span class="o">:</span> <span class="c1">// handle a heartbeat ping
</span> <span class="p">{</span>
<span class="p">......</span>
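<span class="c1">// when the process's internal state is unhealthy, drop the ping: handling heartbeats is pointless at that point
</span> <span class="c1">// many thread pools set a timeout, and exceeding it makes the state unhealthy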
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cct</span><span class="o">-></span><span class="n">get_heartbeat_map</span><span class="p">()</span><span class="o">-></span><span class="n">is_healthy</span><span class="p">())</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Message</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MOSDPing</span><span class="p">(</span><span class="n">monc</span><span class="o">-></span><span class="n">get_fsid</span><span class="p">(),</span>
<span class="n">curmap</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">(),</span>
<span class="n">MOSDPing</span><span class="o">::</span><span class="n">PING_REPLY</span><span class="p">,</span> <span class="c1">// note that this is PING_REPLY
</span> <span class="n">m</span><span class="o">-></span><span class="n">stamp</span><span class="p">);</span>
<span class="n">m</span><span class="o">-></span><span class="n">get_connection</span><span class="p">()</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="n">r</span><span class="p">);</span> <span class="c1">// send the reply
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">MOSDPing</span><span class="p">:</span><span class="o">:</span><span class="n">PING_REPLY</span><span class="o">:</span> <span class="c1">// handle a heartbeat reply
</span> <span class="p">{</span>
<span class="p">......</span>
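<span class="c1">// update the timestamps to avoid a heartbeat timeout
</span> <span class="c1">// the osd has a dedicated tick thread that checks periodically; if it finds a timed-out heartbeat, it reports to the monitor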
</span> <span class="n">map</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span><span class="n">HeartbeatInfo</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">from</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">get_connection</span><span class="p">()</span> <span class="o">==</span> <span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_back</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_back</span> <span class="o">=</span> <span class="n">m</span><span class="o">-></span><span class="n">stamp</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_front</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_front</span> <span class="o">=</span> <span class="n">m</span><span class="o">-></span><span class="n">stamp</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">get_connection</span><span class="p">()</span> <span class="o">==</span> <span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">con_front</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_front</span> <span class="o">=</span> <span class="n">m</span><span class="o">-></span><span class="n">stamp</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
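<p>The reply bookkeeping in the PING_REPLY branch can be modelled on its own. The following is a minimal sketch, not the Ceph implementation: <code class="highlighter-rouge">PeerTimes</code>, <code class="highlighter-rouge">Link</code> and <code class="highlighter-rouge">on_ping_reply</code> are hypothetical names, timestamps are plain seconds instead of utime_t, and a boolean stands in for the con_front pointer.</p>

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for HeartbeatInfo: timestamps are plain
// seconds and 0 means "never received".
struct PeerTimes {
  bool has_front = true;     // false models con_front == NULL
  double last_rx_back = 0;
  double last_rx_front = 0;
};

enum class Link { Back, Front };

// Record a PING_REPLY that arrived over `link`. `stamp` is the send time
// echoed back by the peer, as in the handler above.
inline void on_ping_reply(PeerTimes &p, Link link, double stamp) {
  assert(stamp > 0);
  if (link == Link::Back) {
    p.last_rx_back = stamp;
    // With no front connection, the back reply counts for both links.
    if (!p.has_front)
      p.last_rx_front = stamp;
  } else if (p.has_front) {
    p.last_rx_front = stamp;
  }
}
```

<p>Because the reply carries the original send stamp rather than the receive time, a slow reply does not make a peer look fresher than it is.</p>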
<h1 id="check">Check</h1>
<p>Heartbeat timeouts are checked in two places: the sending thread checks after it sends its messages, and a dedicated tick thread also checks:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">heartbeat_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">heartbeat_check</span><span class="p">();</span> <span class="c1">// check whether any heartbeat has timed out
</span> <span class="n">heartbeat_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">heartbeat_check</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="n">map</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span><span class="n">HeartbeatInfo</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">p</span> <span class="o">!=</span> <span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">is_unhealthy</span><span class="p">(</span><span class="n">cutoff</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// timeout detected
</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_back</span> <span class="o">==</span> <span class="n">utime_t</span><span class="p">()</span> <span class="o">||</span>
<span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_front</span> <span class="o">==</span> <span class="n">utime_t</span><span class="p">())</span> <span class="p">{</span>
<span class="n">failure_queue</span><span class="p">[</span><span class="n">p</span><span class="o">-></span><span class="n">first</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_tx</span><span class="p">;</span> <span class="c1">// enqueue, to be reported to the monitor
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">failure_queue</span><span class="p">[</span><span class="n">p</span><span class="o">-></span><span class="n">first</span><span class="p">]</span> <span class="o">=</span> <span class="n">MIN</span><span class="p">(</span><span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_back</span><span class="p">,</span> <span class="n">p</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">last_rx_front</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
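<p>The check above can be tried out in isolation. This is a simplified sketch under stated assumptions: the cutoff is taken to be <code class="highlighter-rouge">now - grace</code> (the excerpt elides where it is computed), <code class="highlighter-rouge">PeerHealth</code> and <code class="highlighter-rouge">check_peers</code> are hypothetical names, and the health test ignores the special handling a freshly added peer would need.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <map>

// Hypothetical simplified peer record; times are seconds, 0 means "never".
struct PeerHealth {
  double last_tx = 0;
  double last_rx_back = 0;
  double last_rx_front = 0;

  // Simplifying assumption: a peer is unhealthy when either link has no
  // reply newer than the cutoff.
  bool is_unhealthy(double cutoff) const {
    return last_rx_back <= cutoff || last_rx_front <= cutoff;
  }
};

// Returns osd id -> "failed since" time for every unhealthy peer.
inline std::map<int, double> check_peers(const std::map<int, PeerHealth> &peers,
                                         double now, double grace) {
  assert(grace > 0);
  const double cutoff = now - grace;   // replies older than this are stale
  std::map<int, double> failure_queue;
  for (const auto &kv : peers) {
    const PeerHealth &p = kv.second;
    if (!p.is_unhealthy(cutoff))
      continue;
    if (p.last_rx_back == 0 || p.last_rx_front == 0)
      failure_queue[kv.first] = p.last_tx;   // never heard back on a link
    else
      failure_queue[kv.first] = std::min(p.last_rx_back, p.last_rx_front);
  }
  return failure_queue;
}
```

<p>Taking the older of the two receive times means the reported failure duration reflects the link that has been silent the longest.</p>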
<p>Reporting a heartbeat timeout is also done in the tick thread:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">do_mon_report</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">send_failures</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">send_failures</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">failure_queue</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">osd</span> <span class="o">=</span> <span class="n">failure_queue</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">first</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">failed_for</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)(</span><span class="kt">double</span><span class="p">)(</span><span class="n">now</span> <span class="o">-</span> <span class="n">failure_queue</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">second</span><span class="p">);</span>
<span class="n">entity_inst_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">osdmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="n">osd</span><span class="p">);</span>
<span class="n">monc</span><span class="o">-></span><span class="n">send_mon_message</span><span class="p">(</span><span class="k">new</span> <span class="n">MOSDFailure</span><span class="p">(</span><span class="n">monc</span><span class="o">-></span><span class="n">get_fsid</span><span class="p">(),</span> <span class="n">i</span><span class="p">,</span> <span class="n">failed_for</span><span class="p">,</span> <span class="n">osdmap</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">()));</span> <span class="c1">// tell the monitor that this osd's heartbeat timed out
</span> <span class="n">failure_pending</span><span class="p">[</span><span class="n">osd</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="n">failure_queue</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">osd</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
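<p>When the monitor receives the message, it processes it; once a threshold is reached, it marks the osd down through the paxos algorithm, updates the osdmap, and notifies the relevant peers.</p>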
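<p>The text does not say what the threshold is. As an illustration only, the sketch below assumes it is a count of distinct reporting osds; <code class="highlighter-rouge">FailureTally</code> and <code class="highlighter-rouge">min_reporters</code> are hypothetical names and do not come from the monitor code.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>

// Hypothetical monitor-side tally of failure reports.
class FailureTally {
 public:
  explicit FailureTally(std::size_t min_reporters)
      : min_reporters_(min_reporters) {
    assert(min_reporters > 0);
  }

  // Record that `reporter` saw `target` time out. Returns true once enough
  // distinct reporters agree, i.e. when the target should be proposed as down.
  bool report(int target, int reporter) {
    std::set<int> &who = reporters_[target];
    who.insert(reporter);   // repeated reports from one osd count once
    return who.size() >= min_reporters_;
  }

 private:
  std::size_t min_reporters_;
  std::map<int, std::set<int>> reporters_;
};
```

<p>Requiring several independent reporters keeps one osd with a bad network link from getting healthy peers marked down.</p>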
<h1 id="peer">Peer</h1>
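<p>Sending and receiving heartbeats is simple. The real question is how an osd knows which nodes it should exchange heartbeats with. It certainly cannot be every other node, because the heartbeat overhead inside the cluster would be far too high.
So heartbeat peers are also chosen according to some rules, implemented mainly in the following function:</p>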
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">maybe_update_heartbeat_peers</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_waiting_for_healthy</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// during osd startup, or after the osd receives an osdmap update, the osd state may become waiting, and the peer set then needs updating
</span> <span class="n">utime_t</span> <span class="n">now</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">last_heartbeat_resample</span> <span class="o">==</span> <span class="n">utime_t</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// first time: an update is needed; the osd has presumably just started
</span> <span class="n">last_heartbeat_resample</span> <span class="o">=</span> <span class="n">now</span><span class="p">;</span>
<span class="n">heartbeat_set_peers_need_update</span><span class="p">();</span> <span class="c1">// set the peers-need-update flag
</span> <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">heartbeat_peers_need_update</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// later updates: presumably an osdmap change message was received
</span> <span class="n">utime_t</span> <span class="n">dur</span> <span class="o">=</span> <span class="n">now</span> <span class="o">-</span> <span class="n">last_heartbeat_resample</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">dur</span> <span class="o">></span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_heartbeat_grace</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// update only after more than grace has elapsed, because only then can an osdmap change lead to a pgmap change
</span> <span class="n">heartbeat_set_peers_need_update</span><span class="p">();</span> <span class="c1">// set the peers-need-update flag
</span> <span class="n">last_heartbeat_resample</span> <span class="o">=</span> <span class="n">now</span><span class="p">;</span>
<span class="n">reset_heartbeat_peers</span><span class="p">();</span> <span class="c1">// we want *new* peers!
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">heartbeat_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">heartbeat_peers_need_update</span><span class="p">())</span>
<span class="k">return</span><span class="p">;</span> <span class="c1">// no update needed, return immediately
</span> <span class="n">heartbeat_need_update</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">heartbeat_epoch</span> <span class="o">=</span> <span class="n">osdmap</span><span class="o">-></span><span class="n">get_epoch</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_active</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// the osd must be active, otherwise updating is pointless
</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">pg_map_lock</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">spg_t</span><span class="p">,</span> <span class="n">PG</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">pg_map</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="c1">// iterate over every pg this osd is responsible for
</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">pg_map</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">PG</span> <span class="o">*</span><span class="n">pg</span> <span class="o">=</span> <span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">;</span>
<span class="n">pg</span><span class="o">-></span><span class="n">heartbeat_peer_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pg</span><span class="o">-></span><span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="c1">// iterate over the pg's peers
</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">pg</span><span class="o">-></span><span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">osdmap</span><span class="o">-></span><span class="n">is_up</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">))</span> <span class="c1">// if it is up, add it to the heartbeat set
</span> <span class="n">_add_heartbeat_peer</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pg</span><span class="o">-></span><span class="n">probe_targets</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="c1">// iterate over the probe target set
</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">pg</span><span class="o">-></span><span class="n">probe_targets</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">osdmap</span><span class="o">-></span><span class="n">is_up</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">))</span> <span class="c1">// if it is up, add it to the heartbeat set
</span> <span class="n">_add_heartbeat_peer</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">);</span>
<span class="n">pg</span><span class="o">-></span><span class="n">heartbeat_peer_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
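<span class="c1">// the rest of the flow is fairly simple
</span> <span class="c1">// 1) add the next and the previous up osds immediately adjacent to the current osd id
</span> <span class="c1">// 2) remove peers that are down
</span> <span class="c1">// 3) adjust the peers set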
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
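<p>Step 1 of the elided tail, picking the nearest up neighbours by osd id, can be sketched as follows. This is an illustration rather than the OSDMap code: <code class="highlighter-rouge">next_up_after</code> and <code class="highlighter-rouge">prev_up_before</code> are hypothetical helpers, the set of up osd ids is passed in directly, and wrapping around at the ends of the id range is an assumption.</p>

```cpp
#include <cassert>
#include <set>

// Next up osd id after `whoami`, wrapping around; -1 if there is none.
inline int next_up_after(const std::set<int> &up, int whoami) {
  assert(whoami >= 0);
  auto it = up.upper_bound(whoami);   // first id > whoami
  if (it == up.end())
    it = up.begin();                  // wrap to the smallest id
  if (it == up.end() || *it == whoami)
    return -1;                        // no other up osd
  return *it;
}

// Previous up osd id before `whoami`, wrapping around; -1 if there is none.
inline int prev_up_before(const std::set<int> &up, int whoami) {
  assert(whoami >= 0);
  auto it = up.lower_bound(whoami);   // first id >= whoami
  if (it == up.begin())
    it = up.end();                    // wrap to the largest id
  if (it == up.begin())               // the set is empty
    return -1;
  --it;
  return *it == whoami ? -1 : *it;
}
```

<p>Adding id-adjacent neighbours gives every osd at least a couple of heartbeat peers even when it shares no pg with anyone, so an idle osd is still monitored.</p>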
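<p>When does the peer set need updating, that is, when is this function called? From the implementation, the peer set is affected mainly by changes to the pgmap, so when can the pgmap change?</p>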
<ul>
<li>
<p>When a pg is created; see the function handle_pg_create</p>
</li>
<li>
<p>When the osdmap changes, the pgs hosted on the osd may need to re-peer, which may move the osd state to STATE_WAITING_FOR_HEALTHY; see the function handle_osd_map</p>
</li>
<li>
<p>Periodic checks in the tick thread, mainly because the osd calls load_pg during startup, which is similar to the second case</p>
</li>
</ul>
<p>Also note that the peers-update flag is not set only inside this function; it is also set while the pg peering state machine runs:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">PG</span><span class="o">::</span><span class="n">update_heartbeat_peers</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="kt">bool</span> <span class="n">need_update</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">heartbeat_peer_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">new_peers</span> <span class="o">==</span> <span class="n">heartbeat_peers</span><span class="p">)</span> <span class="p">{</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">heartbeat_peers</span><span class="p">.</span><span class="n">swap</span><span class="p">(</span><span class="n">new_peers</span><span class="p">);</span>
<span class="n">need_update</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="c1">// an update is needed
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">need_update</span><span class="p">)</span>
<span class="n">osd</span><span class="o">-></span><span class="n">need_heartbeat_peer_update</span><span class="p">();</span> <span class="c1">// set the update flag
</span><span class="p">}</span>
</code></pre></div></div>
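<p>To summarize: when an osd starts or exits abnormally, the monitor receives a message and runs paxos, and the result is reflected in the osdmap, which is then sent to the relevant osd processes.
An osd that receives it processes the map change, which may cause pgs to re-peer. The monitor also receives messages that create a pool or change pg_num, which eventually lead to pg creation;
an osd that receives such a message creates the pg, which also triggers peering. During osd startup, load_pg triggers peering as well. Whenever peering happens, the osd process state is STATE_WAITING_FOR_HEALTHY,
which may lead to an update of the peer set.</p>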
<h1 id="optimization">Optimization</h1>
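<p>Heartbeats are sent from a dedicated thread, and for now I see little worth optimizing there (the community seems to have a proposal to piggyback heartbeat information on ops, but it is still only a draft).</p>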
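<p>For dispatching received messages, I have an optimization patch, see <a href="https://github.com/ceph/ceph/pull/8808">pr8808</a>.
In large clusters, the simple messenger creates too many threads, so the messenger's internal dispatch thread may be delayed by system scheduling.
It may also lose the contention for the dispatch queue lock against the many heartbeat threads and go to sleep, which slows down the heartbeat callbacks and eventually causes timeouts. pr8808 fast-dispatches heartbeat messages,
which reduces the contention caused by messages having to enter the dispatch queue.</p>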
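<p>The idea behind fast dispatch can be shown with a toy model. This sketch is not the messenger code and not the content of pr8808: <code class="highlighter-rouge">ToyDispatcher</code>, <code class="highlighter-rouge">Msg</code> and <code class="highlighter-rouge">kPing</code> are hypothetical, and the model is single-threaded, so the queue lock it avoids is only described in comments.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

struct Msg {
  int type;
};

// Hypothetical message type tag for heartbeat pings in this toy model.
const int kPing = 1;

// Toy dispatcher: messages accepted by `can_fast` are handled inline by the
// caller (the reader thread in a real messenger); everything else is queued
// for a separate dispatch thread to drain later.
class ToyDispatcher {
 public:
  ToyDispatcher(std::function<bool(const Msg &)> can_fast,
                std::function<void(const Msg &)> handler)
      : can_fast_(can_fast), handler_(handler) {
    assert(can_fast_ && handler_);
  }

  void deliver(const Msg &m) {
    if (can_fast_(m))
      handler_(m);            // fast path: no queue, so no queue lock to fight over
    else
      queue_.push_back(m);    // slow path: waits for the dispatch thread
  }

  // Stand-in for the body of the dispatch thread's loop.
  void drain() {
    while (!queue_.empty()) {
      Msg m = queue_.front();
      queue_.pop_front();
      handler_(m);
    }
  }

  std::size_t queued() const { return queue_.size(); }

 private:
  std::function<bool(const Msg &)> can_fast_;
  std::function<void(const Msg &)> handler_;
  std::deque<Msg> queue_;
};
```

<p>The trade-off is that a fast-dispatched handler runs on the thread that read the message, so it must be short and must not block.</p>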
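<p>With the async messenger, the number of connection threads drops, but a different problem appears.
All async messengers in a process share one worker pool, so if every worker thread is blocked contending for a lock, the system cannot dispatch messages and heartbeat handling blocks too;
see <a href="http://tracker.ceph.com/issues/15758#change-70168">bug15758</a>.</p>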
<p>One more thing to note: the heartbeat messenger never receives osdmap update messages, see <a href="https://github.com/ceph/ceph/pull/8831">pr8831</a>.</p>
<h1 id="tuning">Tuning</h1>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>osd_heartbeat_grace
osd_heartbeat_interval
</code></pre></div></div>
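<p>In large deployments under stress testing (for example, 1000 VMs running fio), heartbeats may run into trouble, and grace may need to be raised to avoid false failure reports.
If grace is raised, interval should be increased accordingly so that heartbeats are not sent too frequently, while still ensuring that a peer is reported only after at least three heartbeats have gone unanswered.
For example, grace is currently 20 and interval is 6; if grace is 30, an interval of 9 is a reasonable choice.</p>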
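<p>The rule of thumb above (three unanswered heartbeats must fit inside the grace period) can be written down as a small calculation. This is only a sketch: <code>suggest_heartbeat_interval</code> is a hypothetical helper, not a Ceph function, and it assumes the rule is "the largest whole number of seconds with 3 * interval strictly less than grace", which reproduces the 20/6 and 30/9 pairs from the text.</p>

```cpp
#include <cassert>

// Hypothetical helper, not part of Ceph: returns the largest heartbeat
// interval (in seconds) for which three consecutive heartbeats still fit
// strictly inside the grace period, i.e. 3 * interval < grace.
int suggest_heartbeat_interval(int grace_seconds) {
  assert(grace_seconds > 3);  // smaller values leave no room for three heartbeats
  return (grace_seconds - 1) / 3;
}
```

<p>Under these assumptions a grace of 20 gives an interval of 6 and a grace of 30 gives 9. The actual values should still be checked against the cluster's behavior under load.</p>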
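<p>Raising grace also has a side effect: if an OSD exits abnormally, the other OSDs wait the full grace period before reporting it, and during that time I/O to the PGs served by that OSD hangs.
My optimization patch mentioned earlier can help here; try not to set grace too high.</p>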
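<p>If the cluster carries heavy sequential reads and writes and the network becomes the bottleneck, the following option raises the priority of heartbeat messages in the kernel network stack:</p>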
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>osd_heartbeat_use_min_delay_socket
</code></pre></div></div>
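<p>Also note that during stress testing, if a thread pool inside the OSD times out, heartbeat packets are dropped, so the timeouts of many thread pools should be adjusted.
Thread pool timeouts are managed by the following map:</p>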
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HeartbeatMap</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// register/unregister
</span> <span class="n">heartbeat_handle_d</span> <span class="o">*</span><span class="n">add_worker</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span><span class="p">);</span> <span class="c1">// register a handle
</span> <span class="kt">void</span> <span class="n">remove_worker</span><span class="p">(</span><span class="n">heartbeat_handle_d</span> <span class="o">*</span><span class="n">h</span><span class="p">);</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">m_cct</span><span class="p">;</span>
<span class="n">RWLock</span> <span class="n">m_rwlock</span><span class="p">;</span>
<span class="kt">time_t</span> <span class="n">m_inject_unhealthy_until</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">list</span><span class="o"><</span><span class="n">heartbeat_handle_d</span><span class="o">*></span> <span class="n">m_workers</span><span class="p">;</span> <span class="c1">// all registered timeout handles
</span>
<span class="p">.....</span>
<span class="kt">bool</span> <span class="n">_check</span><span class="p">(</span><span class="n">heartbeat_handle_d</span> <span class="o">*</span><span class="n">h</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">who</span><span class="p">,</span> <span class="kt">time_t</span> <span class="n">now</span><span class="p">);</span> <span class="c1">// check whether the handle has timed out
</span><span class="p">};</span>
<span class="k">struct</span> <span class="n">heartbeat_handle_d</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span><span class="p">;</span>
<span class="n">atomic_t</span> <span class="n">timeout</span><span class="p">,</span> <span class="n">suicide_timeout</span><span class="p">;</span>
<span class="kt">time_t</span> <span class="n">grace</span><span class="p">,</span> <span class="n">suicide_grace</span><span class="p">;</span> <span class="c1">// past grace the worker counts as timed out; past suicide_grace the process exits
</span> <span class="n">std</span><span class="o">::</span><span class="n">list</span><span class="o"><</span><span class="n">heartbeat_handle_d</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">list_item</span><span class="p">;</span>
<span class="n">heartbeat_handle_d</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&</span> <span class="n">n</span><span class="p">)</span>
<span class="o">:</span> <span class="n">name</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">grace</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">suicide_grace</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="p">{</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
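<p>To make the role of grace and suicide_grace concrete, here is a minimal sketch of how such a timeout check can work. It is a simplified stand-in, not the Ceph implementation: <code>WorkerHandle</code>, <code>WorkerHealth</code>, <code>reset_timeout</code>, <code>clear_timeout</code> and <code>check</code> are illustrative names, and plain <code>time_t</code> fields replace the atomics of the real structure.</p>

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Simplified stand-in for heartbeat_handle_d (illustrative, not Ceph code).
struct WorkerHandle {
  std::string name;
  time_t timeout = 0;          // absolute deadline; 0 means "not armed"
  time_t suicide_timeout = 0;  // absolute deadline after which the process should exit
};

enum class WorkerHealth { Healthy, TimedOut, Suicide };

// A worker arms its deadlines before starting a unit of work.
void reset_timeout(WorkerHandle &h, time_t now, time_t grace, time_t suicide_grace) {
  assert(grace > 0);
  h.timeout = now + grace;
  h.suicide_timeout = suicide_grace ? now + suicide_grace : 0;
}

// A worker disarms its deadlines when it goes idle.
void clear_timeout(WorkerHandle &h) {
  h.timeout = 0;
  h.suicide_timeout = 0;
}

// The checker compares the armed deadlines with the current time.
WorkerHealth check(const WorkerHandle &h, time_t now) {
  if (h.suicide_timeout && h.suicide_timeout < now) return WorkerHealth::Suicide;
  if (h.timeout && h.timeout < now) return WorkerHealth::TimedOut;
  return WorkerHealth::Healthy;
}
```

<p>In this sketch a worker that stays busy past its grace period is reported as timed out, which corresponds to the situation described above in which heartbeats start being dropped; only the much larger suicide_grace leads to process exit.</p>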
Ceph Monitor Startup
2016-02-26T00:00:00+00:00
http://blog.wjin.org/posts/ceph-monitor-startup
<h1 id="introduction">Introduction</h1>
<p>During startup, the monitor process goes through the following state transitions:</p>
<p><img src="/assets/img/post/ceph_mon_startup.png" alt="img" /></p>
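<p>Once the monitor has finished initializing, it first enters the probing state and asks the other monitors for information. When it receives the replies, it checks its own data to decide whether it needs to synchronize.
A monitor may be newly created, in which case it has to sync the existing cluster's monitor data when it joins; a monitor that has been down for a long time also needs to sync.</p>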
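<p>After the data has been synced (note that it is not necessarily fully consistent at this point; Paxos may still have to run recovery afterwards to reach consistency), the monitor can join the quorum and start an election.
When the election finishes, the winner becomes the leader and the others become peons. The outcome depends on each monitor's rank in the map: the lower rank wins.</p>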
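<p>The "lower rank wins" rule can be illustrated in a few lines. This is a deliberately reduced sketch that only captures the final outcome among the monitors taking part; <code>election_winner</code> and <code>becomes_leader</code> are hypothetical names, and the real election exchanges messages and handles timeouts.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative only: the winner among the participating monitors is the one
// with the smallest rank in the monitor map.
int election_winner(const std::vector<int> &participating_ranks) {
  assert(!participating_ranks.empty());
  return *std::min_element(participating_ranks.begin(), participating_ranks.end());
}

// A monitor becomes leader if it holds the winning rank, otherwise peon.
bool becomes_leader(int my_rank, const std::vector<int> &participating_ranks) {
  return election_winner(participating_ranks) == my_rank;
}
```

<p>For example, if the monitor with rank 0 is down and ranks 1 and 2 take part, rank 1 becomes the leader and rank 2 a peon.</p>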
<h1 id="main-thread">Main Thread</h1>
<p>The main thread's initialization is in ceph_mon.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// create the store that holds the monitor's data; the backend is mainly a k/v store
</span> <span class="n">MonitorDBStore</span> <span class="o">*</span><span class="n">store</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MonitorDBStore</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">mon_data</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">store</span><span class="o">-></span><span class="n">create_and_open</span><span class="p">(</span><span class="n">cerr</span><span class="p">);</span>
<span class="c1">// obtain the monitor map
</span> <span class="n">MonMap</span> <span class="n">monmap</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">mapbl</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">err</span> <span class="o">=</span> <span class="n">obtain_monmap</span><span class="p">(</span><span class="o">*</span><span class="n">store</span><span class="p">,</span> <span class="n">mapbl</span><span class="p">);</span> <span class="c1">// read the map
</span>
<span class="c1">// determine the listen address; the first run does mkfs and generates a map
</span> <span class="c1">// later runs read the address from that map, so fixing a wrong initial setting in the config file and restarting the monitor has no effect
</span> <span class="n">entity_addr_t</span> <span class="n">ipaddr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">monmap</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">()))</span> <span class="p">{</span>
<span class="n">ipaddr</span> <span class="o">=</span> <span class="n">monmap</span><span class="p">.</span><span class="n">get_addr</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">());</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// create the messenger used for communication
</span> <span class="n">Messenger</span> <span class="o">*</span><span class="n">msgr</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span>
<span class="n">entity_name_t</span><span class="o">::</span><span class="n">MON</span><span class="p">(</span><span class="n">rank</span><span class="p">),</span> <span class="s">"mon"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="c1">// bind to the address
</span> <span class="n">err</span> <span class="o">=</span> <span class="n">msgr</span><span class="o">-></span><span class="n">bind</span><span class="p">(</span><span class="n">ipaddr</span><span class="p">);</span>
<span class="c1">// the concrete class that implements the monitor
</span> <span class="n">mon</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Monitor</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">get_id</span><span class="p">(),</span> <span class="n">store</span><span class="p">,</span>
<span class="n">msgr</span><span class="p">,</span> <span class="o">&</span><span class="n">monmap</span><span class="p">);</span>
<span class="c1">// first initialization step
</span> <span class="n">err</span> <span class="o">=</span> <span class="n">mon</span><span class="o">-></span><span class="n">preinit</span><span class="p">();</span>
<span class="c1">// the messenger starts receiving messages
</span> <span class="n">msgr</span><span class="o">-></span><span class="n">start</span><span class="p">();</span>
<span class="c1">// continue initialization
</span> <span class="n">mon</span><span class="o">-></span><span class="n">init</span><span class="p">();</span>
<span class="c1">// wait until shutdown
</span> <span class="n">msgr</span><span class="o">-></span><span class="n">wait</span><span class="p">();</span>
<span class="c1">// close the store
</span> <span class="n">store</span><span class="o">-></span><span class="n">close</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
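<p>Note that at deployment time mkfs generates the monitor map and stores it in the monitor's backing store. On later restarts the process does not need to read these settings from the configuration, which avoids unexpected errors.
In practice colleagues often get the configuration wrong, for example by swapping the public and cluster addresses, and then edit the configuration and restart the process; this has no effect.
In addition, because the monitor map is so important, a monitor that needs to sync data during startup first backs up the map and only then modifies it. This guards against a sync that fails midway: without a monitor map, the monitor could not start on the next restart.</p>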
<h1 id="monitor-class">Monitor Class</h1>
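<p>As the main thread's flow shows, initialization is concentrated in the preinit and init functions of the Monitor class. We look at the Monitor constructor first and then analyze these two functions:</p>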
<h3 id="constructor">Constructor</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Monitor</span><span class="o">::</span><span class="n">Monitor</span><span class="p">(</span><span class="n">CephContext</span><span class="o">*</span> <span class="n">cct_</span><span class="p">,</span> <span class="n">string</span> <span class="n">nm</span><span class="p">,</span> <span class="n">MonitorDBStore</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="n">MonMap</span> <span class="o">*</span><span class="n">map</span><span class="p">)</span> <span class="o">:</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// the Paxos algorithm implementation
</span> <span class="n">paxos</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Paxos</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="s">"paxos"</span><span class="p">);</span>
<span class="c1">// the services built on top of Paxos
</span> <span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_MDSMAP</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MDSMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"mdsmap"</span><span class="p">);</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_MONMAP</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MonmapMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"monmap"</span><span class="p">);</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_OSDMAP</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">OSDMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"osdmap"</span><span class="p">);</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_PGMAP</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">PGMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"pgmap"</span><span class="p">);</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_LOG</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LogMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"logm"</span><span class="p">);</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">PAXOS_AUTH</span><span class="p">]</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AuthMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">,</span> <span class="s">"auth"</span><span class="p">);</span>
<span class="n">health_monitor</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HealthMonitor</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// mainly watches the monitor's data store usage to detect a full disk
</span> <span class="n">config_key_service</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ConfigKeyService</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">paxos</span><span class="p">);</span> <span class="c1">// stores user-defined k/v data
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="preinit">Preinit</h3>
<p>Next comes the preinit function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">preinit</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">.......</span>
<span class="n">init_paxos</span><span class="p">();</span>
<span class="n">health_monitor</span><span class="o">-></span><span class="n">init</span><span class="p">();</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">init_paxos</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">paxos</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// load the key Paxos state
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">PAXOS_NUM</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// only LogMonitor overrides init(); the other services have nothing to initialize
</span> <span class="p">}</span>
<span class="n">refresh_from_paxos</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span> <span class="c1">// refresh the state of every service
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">Paxos</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">last_pn</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"last_pn"</span><span class="p">);</span> <span class="c1">// number of the last proposal
</span> <span class="n">accepted_pn</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"accepted_pn"</span><span class="p">);</span> <span class="c1">// highest proposal number accepted so far
</span> <span class="n">last_committed</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"last_committed"</span><span class="p">);</span> <span class="c1">// last committed version
</span> <span class="n">first_committed</span> <span class="o">=</span> <span class="n">get_store</span><span class="p">()</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_name</span><span class="p">(),</span> <span class="s">"first_committed"</span><span class="p">);</span> <span class="c1">// first committed version
</span><span class="p">}</span>
</code></pre></div></div>
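<p>The committed versions determine whether data must be synced from other monitors. last_pn and accepted_pn are mainly used by Paxos to keep data consistent across monitors.</p>
<p>For each service, the PaxosService base class provides the template methods refresh and post_refresh to update the service. Each derived class (a Paxos service) implements the abstract functions that define what it actually needs to update:</p>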
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">refresh_from_paxos</span><span class="p">(</span><span class="kt">bool</span> <span class="o">*</span><span class="n">need_bootstrap</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">PAXOS_NUM</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">refresh</span><span class="p">(</span><span class="n">need_bootstrap</span><span class="p">);</span> <span class="c1">// call the template method to refresh
</span> <span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">PAXOS_NUM</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">paxos_service</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">post_refresh</span><span class="p">();</span> <span class="c1">// call the template method for post-refresh handling
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">refresh</span><span class="p">(</span><span class="kt">bool</span> <span class="o">*</span><span class="n">need_bootstrap</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// the versions are key: they decide whether an update is needed
</span> <span class="n">cached_first_committed</span> <span class="o">=</span> <span class="n">mon</span><span class="o">-></span><span class="n">store</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_service_name</span><span class="p">(),</span> <span class="n">first_committed_name</span><span class="p">);</span>
<span class="n">cached_last_committed</span> <span class="o">=</span> <span class="n">mon</span><span class="o">-></span><span class="n">store</span><span class="o">-></span><span class="n">get</span><span class="p">(</span><span class="n">get_service_name</span><span class="p">(),</span> <span class="n">last_committed_name</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">update_from_paxos</span><span class="p">(</span><span class="n">need_bootstrap</span><span class="p">);</span> <span class="c1">// each service implements its own logic here
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">PaxosService</span><span class="o">::</span><span class="n">post_refresh</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">post_paxos_update</span><span class="p">();</span> <span class="c1">// each service implements its own logic here
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mon</span><span class="o">-></span><span class="n">is_peon</span><span class="p">()</span> <span class="o">&&</span> <span class="o">!</span><span class="n">waiting_for_finished_proposal</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">finish_contexts</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">waiting_for_finished_proposal</span><span class="p">,</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
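<p>preinit mainly initializes Paxos and the PaxosService services and reads the data recorded in the store by the previous run, preparing this monitor to interact with the other monitors.</p>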
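<p>The template-method structure of refresh and post_refresh can be shown in isolation. The following is a minimal sketch under stated assumptions, not the PaxosService code: <code>ServiceBase</code>, <code>OsdMapService</code>, <code>LogService</code> and <code>refresh_all</code> are hypothetical, and a string log stands in for the real work so that the call order is visible.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative base class: refresh() fixes the skeleton, the hooks vary per service.
class ServiceBase {
public:
  virtual ~ServiceBase() {}
  void refresh(std::vector<std::string> &log) {
    log.push_back(name() + ": reload cached versions");  // common step
    update_from_store(log);                               // service-specific step
  }
  void post_refresh(std::vector<std::string> &log) { post_update(log); }
protected:
  virtual std::string name() const = 0;
  virtual void update_from_store(std::vector<std::string> &log) = 0;
  virtual void post_update(std::vector<std::string> &) {}  // optional hook
};

class OsdMapService : public ServiceBase {
protected:
  std::string name() const override { return "osdmap"; }
  void update_from_store(std::vector<std::string> &log) override {
    log.push_back("osdmap: decode latest map");
  }
};

class LogService : public ServiceBase {
protected:
  std::string name() const override { return "logm"; }
  void update_from_store(std::vector<std::string> &log) override {
    log.push_back("logm: load log summary");
  }
  void post_update(std::vector<std::string> &log) override {
    log.push_back("logm: post update");
  }
};

// Two passes, as in refresh_from_paxos: refresh everything, then post-refresh everything.
std::vector<std::string> refresh_all(const std::vector<ServiceBase *> &services) {
  std::vector<std::string> log;
  for (ServiceBase *s : services) s->refresh(log);
  for (ServiceBase *s : services) s->post_refresh(log);
  return log;
}
```

<p>Running the post-refresh hooks in a second pass means that every service has already loaded its latest state before any of them reacts to the update.</p>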
<h3 id="init">Init</h3>
<p>Next, the main thread starts the messenger so that messages can be sent and received, then calls init to begin interacting with the other monitors:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kt">int</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">timer</span><span class="p">.</span><span class="n">init</span><span class="p">();</span> <span class="c1">// start the timer thread
</span> <span class="n">new_tick</span><span class="p">();</span> <span class="c1">// schedule the tick timer event
</span>
<span class="c1">// i'm ready!
</span> <span class="n">messenger</span><span class="o">-></span><span class="n">add_dispatcher_tail</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="n">bootstrap</span><span class="p">();</span> <span class="c1">// start bootstrapping
</span>
<span class="p">......</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
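<p>As the name suggests, bootstrap is the entry point that brings the monitor up correctly. While the monitor process is running, if its information turns out to be inconsistent or incomplete, this function is called to start over,
because data is synced as part of that restart:</p>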
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">bootstrap</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">wait_for_paxos_write</span><span class="p">();</span>
<span class="n">sync_reset_requester</span><span class="p">();</span>
<span class="n">unregister_cluster_logger</span><span class="p">();</span>
<span class="n">cancel_probe_timeout</span><span class="p">();</span>
<span class="c1">// set the state
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">STATE_PROBING</span><span class="p">;</span>
<span class="n">_reset</span><span class="p">();</span> <span class="c1">// reset Paxos and its services
</span>
<span class="c1">// with a single monitor there is no need to contact other monitors for leader election
</span> <span class="k">if</span> <span class="p">(</span><span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&&</span> <span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">win_standalone_election</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="c1">// send probe messages to collect information
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">i</span> <span class="o">!=</span> <span class="n">rank</span><span class="p">)</span>
<span class="n">messenger</span><span class="o">-></span><span class="n">send_message</span><span class="p">(</span><span class="k">new</span> <span class="n">MMonProbe</span><span class="p">(</span><span class="n">monmap</span><span class="o">-></span><span class="n">fsid</span><span class="p">,</span> <span class="n">MMonProbe</span><span class="o">::</span><span class="n">OP_PROBE</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">has_ever_joined</span><span class="p">),</span>
<span class="n">monmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="n">i</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>From here on, the monitor changes state according to the messages it receives:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Monitor</span><span class="o">::</span><span class="n">handle_probe_reply</span><span class="p">(</span><span class="n">MMonProbe</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// sync data
</span> <span class="k">if</span> <span class="p">(</span><span class="n">paxos</span><span class="o">-></span><span class="n">get_version</span><span class="p">()</span> <span class="o">+</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">paxos_max_join_drift</span> <span class="o"><</span> <span class="n">m</span><span class="o">-></span><span class="n">paxos_last_version</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cancel_probe_timeout</span><span class="p">();</span>
<span class="n">sync_start</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// enough monitors are outside the quorum: start an election
</span> <span class="kt">unsigned</span> <span class="n">need</span> <span class="o">=</span> <span class="n">monmap</span><span class="o">-></span><span class="n">size</span><span class="p">()</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">outside_quorum</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">>=</span> <span class="n">need</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">outside_quorum</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">name</span><span class="p">))</span> <span class="p">{</span>
<span class="n">start_election</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
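<p>If the local committed version lags the peer's by more than paxos_max_join_drift, data is synced. After the sync completes the monitor bootstraps again and probes once more, then sends election messages; the winner becomes the leader and the losers become peons.
For the election flow see <a href="http://blog.wjin.org/posts/ceph-monitor-leader-elect.html">ceph leader elect</a>;
after the election Paxos has to recover data, see <a href="http://blog.wjin.org/posts/ceph-monitor-paxos.html">ceph paxos</a>.</p>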
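<p>The two numeric conditions in handle_probe_reply are easy to check by hand. The sketch below restates them as standalone functions; <code>needs_sync</code> and <code>majority</code> are illustrative names, and the drift value used in the example afterwards is just a sample number, not a recommendation.</p>

```cpp
#include <cassert>
#include <cstdint>

// True when the local paxos version lags the peer's last version by more than
// max_join_drift, mirroring the comparison in handle_probe_reply.
bool needs_sync(uint64_t local_version, uint64_t peer_last_version,
                uint64_t max_join_drift) {
  return local_version + max_join_drift < peer_last_version;
}

// Smallest number of monitors that forms a strict majority of the monmap.
unsigned majority(unsigned monmap_size) {
  assert(monmap_size > 0);
  return monmap_size / 2 + 1;
}
```

<p>With a drift of 10, a monitor at version 100 joins directly when a peer is at 110 but syncs first when the peer is at 111. A monmap of 3 needs 2 monitors for an election, and a monmap of 4 or 5 needs 3.</p>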
<h1 id="dispatch">Dispatch</h1>
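<p>Once data recovery finishes, the data is fully consistent and the monitor enters its working state, ready to send and receive messages. As the startup flow shows, the monitor, unlike the OSD process, does not start many thread pools.
After sending the probe messages the main thread simply waits for the process to exit. From then on the main job is to handle incoming messages, and given how Ceph's network layer is implemented,
nearly all of that handling happens in the dispatch thread:</p>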
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="nf">ms_dispatch</span><span class="p">(</span><span class="n">Message</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">_ms_dispatch</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
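<p>Dispatching a message first takes the monitor's internal lock. Only one messenger is created during initialization, so there is only one dispatch thread (considering only SimpleMessenger here;
AsyncMessenger has several worker threads doing dispatch). So why is the lock needed? It is mainly there to prevent races with the timer thread.</p>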
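<p>The monitor has a separate timer thread that handles all timeouts. In a distributed system, the reply to a message may not arrive in time: the network may have failed, the peer may be down, or the peer may simply have ignored the message.
How a timeout is handled once it has been set is therefore critical. The timeout handling in both the Elector and the Paxos implementation is done by the Monitor's internal timer thread.</p>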
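<p>From then on, the state of Monitor, Elector, and Paxos advances either when the dispatch thread handles a received message or when the timer thread handles a timeout, with the Monitor's internal lock providing mutual exclusion.
That is why the Elector, Paxos, and Paxos service implementations contain no locking of their own. When the dispatch thread receives a message, the Monitor class handles the messages that belong to it and
forwards the rest to Elector, Paxos, or a PaxosService. When a timeout fires, the timer thread usually has to do some cleanup and may then bootstrap again.</p>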
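<p>The locking scheme described above, one coarse lock taken at the two entry points and none inside the handlers, can be sketched with standard C++ threads. This is a toy model under stated assumptions: <code>MiniMonitor</code> and <code>run_both_threads</code> are hypothetical, and a counter stands in for the Elector and Paxos state.</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical miniature of the scheme; not Ceph code.
class MiniMonitor {
public:
  // Entry point used by the dispatch thread.
  void ms_dispatch() {
    std::lock_guard<std::mutex> guard(lock_);
    handle_message();
  }
  // Entry point used by the timer thread.
  void timer_fired() {
    std::lock_guard<std::mutex> guard(lock_);
    handle_timeout();
  }
  long events() {
    std::lock_guard<std::mutex> guard(lock_);
    return events_;
  }
private:
  // The handlers take no locks of their own: callers already hold lock_.
  void handle_message() { ++events_; }
  void handle_timeout() { ++events_; }

  std::mutex lock_;   // plays the role of the monitor's internal lock
  long events_ = 0;   // shared state touched by both threads
};

// Runs one "dispatch" thread and one "timer" thread against the same object.
long run_both_threads(int events_per_thread) {
  assert(events_per_thread >= 0);
  MiniMonitor mon;
  std::thread dispatcher([&mon, events_per_thread] {
    for (int i = 0; i < events_per_thread; ++i) mon.ms_dispatch();
  });
  std::thread timer([&mon, events_per_thread] {
    for (int i = 0; i < events_per_thread; ++i) mon.timer_fired();
  });
  dispatcher.join();
  timer.join();
  return mon.events();
}
```

<p>Because every state change happens under the same lock, no update is lost even though the handlers themselves are lock-free. The cost of this design is that a slow handler delays both message dispatch and timeout processing.</p>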
<h1 id="summary">Summary</h1>
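<ul>
<li>
<p>Obtain the monitor map, in preparation for communication between monitor processes</p>
</li>
<li>
<p>Initialize Paxos and load its variables, mainly the committed version numbers</p>
</li>
<li>
<p>Bootstrap and enter the probing phase, where data may be synced depending on the committed version numbers</p>
</li>
<li>
<p>Leader election</p>
</li>
<li>
<p>After the election, the leader and the peons each initialize Paxos</p>
</li>
<li>
<p>The Paxos leader recovers data through the collect phase</p>
</li>
<li>
<p>The leader and the peons become active</p>
</li>
</ul>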
Ceph Scrub Mechanism
2016-01-19T00:00:00+00:00
http://blog.wjin.org/posts/ceph-scrub-mechanism
<h1 id="introduction">Introduction</h1>
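<p>Normally Ceph keeps data strongly consistent by writing three replicas synchronously. Ceph also provides a mechanism to check whether the replicas agree with each other;
any inconsistency it finds must be repaired. That mechanism is scrub.</p>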
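<p>There are two kinds of scrub: scrub and deep scrub. The former compares only object metadata, while the latter actually reads the object contents and compares them,
which generates a lot of I/O. Worse, scrub holds the PG lock while it runs, which stalls client I/O; the community is currently working on improving this.</p>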
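<p>Inside the Ceph OSD process, a periodic timer thread checks whether a PG needs to be scrubbed. A scrub can also be triggered from the command line (ceph pg scrub pgid),
which is implemented mainly by setting a must_scrub flag. As this shows, scrub operates at PG granularity.</p>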
<p>Scheduling happens in the timer thread, and the PGs are processed in the disk_tp thread pool.</p>
<h2 id="timer-线程">Timer Thread</h2>
<p><img src="/assets/img/post/ceph_scrub_timer.png" alt="image" /></p>
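<ol>
<li>
<p>The scrub handler is scheduled with a certain probability (factors include whether the PG is primary, the configured scrub time window, the current system load, and so on)</p>
</li>
<li>
<p>The primary PG reserves a slot and sends a message asking the replica PGs to reserve a slot as well</p>
</li>
<li>
<p>Once both the primary and the replica PGs have reserved a slot, the PG is put on the scrub_wq queue</p>
</li>
</ol>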
<h2 id="disk_tp线程池">disk_tp Thread Pool</h2>
<p><img src="/assets/img/post/ceph_scrub_process.png" alt="image" /></p>
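<ol>
<li>
<p>Take a PG from scrub_wq and call PG::scrub on it</p>
</li>
<li>
<p>Sleep for a short random time, check the parameters, then enter chunky_scrub</p>
</li>
<li>
<p>chunky_scrub is driven by a simple state machine (fetch the replica PGs' scrub maps and compare the maps for consistency)</p>
</li>
</ol>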
<p>The comment in the source code describes the state machine clearly:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code> *           +------------------+
 *  _________v__________        |
 * |                    |       |
 * |      INACTIVE      |       |
 * |____________________|       |
 *           |                  |
 *           |   +----------+   |
 *  _________v___v______    |   |
 * |                    |   |   |
 * |      NEW_CHUNK     |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |     WAIT_PUSHES    |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |  WAIT_LAST_UPDATE  |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |      BUILD_MAP     |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |    WAIT_REPLICAS   |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |    COMPARE_MAPS    |   |   |
 * |____________________|   |   |
 *           |              |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |WAIT_DIGEST_UPDATES |   |   |
 * |____________________|   |   |
 *           |   |          |   |
 *           |   +----------+   |
 *  _________v__________        |
 * |                    |       |
 * |       FINISH       |       |
 * |____________________|       |
 *           |                  |
 *           +------------------+
</code></pre></div></div>
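<p>Note that even in the normal case (ignoring recovery and pg state changes), PG::scrub has to be called several times before the scrub of a pg completes.</p>
<p>First, the primary pg sends messages to the replica pgs asking for their scrub maps and then moves to the WAIT_REPLICAS state. Since it now has to wait for the replicas' replies, this scrub invocation returns for the time being,
with the scrub state kept in scrubber. When the maps arrive from the replicas, the pg is put back on the scrub_wq queue and waits for the thread pool to pick it up and run it again.
This time the scrubber state is WAIT_REPLICAS, the check passes, and the scrub maps are compared.</p>
<p>Second, once the scrub maps have been compared, the end of the object range is checked. If more objects remain to be scrubbed, the pg is added to scrub_wq again
and chunky_scrub leaves its loop, ending this invocation, so that a long-running scrub does not stall other IO on the pg. In addition, on entering PG::scrub,
if the configured sleep time is greater than 0 the thread also sleeps for a while, again so that scrub does not hold the pg lock too often.</p>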
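<p>The re-entry behaviour described above can be sketched as a toy model. This is a hypothetical illustration, not Ceph code: ToyScrubber, run_once and count_invocations are invented names, only a subset of the states in the diagram is kept, and the replica replies are simulated by a flag.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>

// State names follow the diagram; everything else is invented for this sketch.
enum class ScrubState { INACTIVE, NEW_CHUNK, WAIT_REPLICAS, COMPARE_MAPS, FINISH };

struct ToyScrubber {
  ScrubState state = ScrubState::INACTIVE;
  int next_object = 0;              // start of the next chunk
  int total_objects = 0;            // number of objects in the pg
  int chunk_size = 1;               // objects scrubbed per chunk (must be > 0)
  bool replica_maps_ready = false;  // set when the replica maps have arrived

  // One invocation by a worker thread. Returns true when the pg has to be
  // requeued (waiting for replicas, or more chunks remain).
  bool run_once() {
    for (;;) {
      switch (state) {
      case ScrubState::INACTIVE:
        state = ScrubState::NEW_CHUNK;
        break;
      case ScrubState::NEW_CHUNK:
        replica_maps_ready = false;   // a real primary would request maps here
        state = ScrubState::WAIT_REPLICAS;
        break;
      case ScrubState::WAIT_REPLICAS:
        if (!replica_maps_ready)
          return true;                // leave the thread until replies arrive
        state = ScrubState::COMPARE_MAPS;
        break;
      case ScrubState::COMPARE_MAPS:
        next_object += chunk_size;    // this chunk has been compared
        if (next_object < total_objects) {
          state = ScrubState::NEW_CHUNK;
          return true;                // requeue so other IO can take the lock
        }
        state = ScrubState::FINISH;
        break;
      case ScrubState::FINISH:
        state = ScrubState::INACTIVE;
        return false;                 // the whole pg has been scrubbed
      }
    }
  }
};

// Drives the toy scrubber the way a work queue would and counts how many
// times run_once() is entered before the pg is done.
int count_invocations(int total_objects, int chunk_size) {
  ToyScrubber s;
  s.total_objects = total_objects;
  s.chunk_size = chunk_size;
  int calls = 0;
  bool again = true;
  while (again) {
    again = s.run_once();
    ++calls;
    if (s.state == ScrubState::WAIT_REPLICAS)
      s.replica_maps_ready = true;    // simulate the replica replies arriving
  }
  return calls;
}
</code></pre></div></div>
<p>In this model every chunk costs two invocations, one that stops in WAIT_REPLICAS and one that compares the maps, which is why even a healthy pg passes through the queue several times.</p>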
<h1 id="code-analysis">Code Analysis</h1>
<h2 id="scrub-schedule">Scrub Schedule</h2>
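<p>Scrub is a periodic event: the osd process schedules it regularly and, when the conditions are met, puts the pg on the scrub_wq queue for a disk_tp thread to execute.
Like many periodic events, the check runs in the timer thread of the OSD process:</p>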
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// timer callback entry point, invoked periodically
</span><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_active</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// the osd must be active
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">scrub_random_backoff</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// schedule with some probability
</span> <span class="n">sched_scrub</span><span class="p">();</span> <span class="c1">// schedule scrub
</span> <span class="p">}</span>
<span class="n">check_replay_queue</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">sched_scrub</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">bool</span> <span class="n">load_is_low</span> <span class="o">=</span> <span class="n">scrub_should_schedule</span><span class="p">();</span> <span class="c1">// true when load is low and we are inside the allowed time window
</span>
<span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">spg_t</span><span class="o">></span> <span class="n">pos</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">service</span><span class="p">.</span><span class="n">first_scrub_stamp</span><span class="p">(</span><span class="o">&</span><span class="n">pos</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// fetch the first scrub candidate
</span> <span class="k">do</span> <span class="p">{</span>
<span class="n">utime_t</span> <span class="n">t</span> <span class="o">=</span> <span class="n">pos</span><span class="p">.</span><span class="n">first</span><span class="p">;</span>
<span class="n">spg_t</span> <span class="n">pgid</span> <span class="o">=</span> <span class="n">pos</span><span class="p">.</span><span class="n">second</span><span class="p">;</span>
<span class="n">utime_t</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">now</span> <span class="o">-</span> <span class="n">t</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">double</span><span class="p">)</span><span class="n">diff</span> <span class="o"><</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_scrub_min_interval</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// below the minimum interval: no scrub
</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">double</span><span class="p">)</span><span class="n">diff</span> <span class="o"><</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_scrub_max_interval</span> <span class="o">&&</span> <span class="o">!</span><span class="n">load_is_low</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// past the minimum but not yet at the maximum: no scrub unless load is low
</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// scrub conditions are met
</span> <span class="n">PG</span> <span class="o">*</span><span class="n">pg</span> <span class="o">=</span> <span class="n">_lookup_lock_pg</span><span class="p">(</span><span class="n">pgid</span><span class="p">);</span> <span class="c1">// look up the pg and lock it
</span> <span class="k">if</span> <span class="p">(</span><span class="n">pg</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pg</span><span class="o">-></span><span class="n">get_pgbackend</span><span class="p">()</span><span class="o">-></span><span class="n">scrub_supported</span><span class="p">()</span> <span class="o">&&</span> <span class="n">pg</span><span class="o">-></span><span class="n">is_active</span><span class="p">()</span> <span class="o">&&</span> <span class="c1">// pg也必须是active状态
</span> <span class="p">(</span><span class="n">load_is_low</span> <span class="o">||</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">diff</span> <span class="o">>=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_scrub_max_interval</span> <span class="o">||</span>
<span class="n">pg</span><span class="o">-></span><span class="n">scrubber</span><span class="p">.</span><span class="n">must_scrub</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pg</span><span class="o">-></span><span class="n">sched_scrub</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// schedule the scrub of this pg
</span> <span class="n">pg</span><span class="o">-></span><span class="n">unlock</span><span class="p">();</span> <span class="c1">// the pg is now on scrub_wq, so release the lock; a disk_tp thread takes it again when it starts the scrub
</span> <span class="k">break</span><span class="p">;</span> <span class="c1">// break: at most one pg is scheduled per pass; the next timer tick continues
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="n">pg</span><span class="o">-></span><span class="n">unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">service</span><span class="p">.</span><span class="n">next_scrub_stamp</span><span class="p">(</span><span class="n">pos</span><span class="p">,</span> <span class="o">&</span><span class="n">pos</span><span class="p">));</span> <span class="c1">// pgs earlier in the set may not be active, so keep looping to the next one
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
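<p>The two interval checks in the loop above can be condensed into a small predicate. This is only a sketch: should_try_scrub is an invented helper, not a Ceph function, and it deliberately leaves out the pg state checks and the must_scrub flag.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>

// Hypothetical condensation of the interval/load checks in OSD::sched_scrub.
// elapsed is the number of seconds since the pg's scrub stamp.
bool should_try_scrub(double elapsed, double min_interval, double max_interval,
                      bool load_is_low) {
  if (elapsed < min_interval)
    return false;  // scrubbed too recently
  if (elapsed < max_interval && !load_is_low)
    return false;  // due but not overdue, and the system is busy
  return true;     // overdue, or due while load is low
}
</code></pre></div></div>
<p>With example intervals of one day and seven days, a pg last scrubbed three days ago is only attempted when load is low, while one last scrubbed eight days ago is attempted regardless of load.</p>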
<p>_lookup_lock_pg looks up the pg that was picked for scrubbing and locks it. The pgs that are candidates for scrubbing are kept in a set ordered by timestamp:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// the pg set defined in OSD.h
</span><span class="n">set</span><span class="o"><</span> <span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span><span class="n">spg_t</span><span class="o">></span> <span class="o">></span> <span class="n">last_scrub_pg</span><span class="p">;</span>
<span class="c1">// take the pg with the oldest timestamp from the set
</span><span class="kt">bool</span> <span class="nf">first_scrub_stamp</span><span class="p">(</span><span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">spg_t</span><span class="o">></span> <span class="o">*</span><span class="n">out</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">sched_scrub_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">last_scrub_pg</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">set</span><span class="o"><</span> <span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">spg_t</span><span class="o">></span> <span class="o">>::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">last_scrub_pg</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="c1">// earlier timestamps sort first
</span> <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="o">*</span><span class="n">iter</span><span class="p">;</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
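<p>The scheduling mechanism in the timer thread only puts pgs that need scrubbing on the scrub_wq queue for the disk_tp threads to run. How to decide that a pg needs scrubbing,
that is, when a pg is added to the last_scrub_pg set, is a policy left to the pg itself, in line with the classic separation of policy and mechanism.
When a pg is initialized (PG::init), loaded at osd startup (load_pg), or created by a split (add_newly_split_pg), and it is primary,
it is added to the set.</p>
<p>Following OSD's sched_scrub further, it ends up calling the PG's sched_scrub. That function requests scrub slots, and only when every replica has reserved one
does it queue the pg for scrubbing. So for a given pg, at least two tick cycles normally have to reach this point before queue_scrub actually runs. Of course, if the messages go out to the replicas
and their replies are received before the second if executes, a single pass is enough in theory.</p>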
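<p>The ordering that first_scrub_stamp relies on comes directly from std::set and the lexicographic comparison of std::pair. The sketch below uses stand-in types (a double for utime_t and a string for spg_t) and invented helper names, so it illustrates the data structure only, not the OSD code.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <set>
#include <string>
#include <utility>

// Stand-ins for utime_t and spg_t; both are assumptions made for this sketch.
using Stamp = double;
using PgId = std::string;
using ScrubSet = std::set<std::pair<Stamp, PgId>>;

// Returns the pg with the oldest stamp, or an empty string if the set is empty.
PgId oldest_pg(const ScrubSet &s) {
  return s.empty() ? PgId() : s.begin()->second;
}

// Re-registers a pg after a scrub: drop the old entry, insert the new stamp.
void update_stamp(ScrubSet &s, const PgId &pg, Stamp old_stamp, Stamp new_stamp) {
  s.erase(std::make_pair(old_stamp, pg));
  s.insert(std::make_pair(new_stamp, pg));
}
</code></pre></div></div>
<p>Because pairs compare on the timestamp first, begin() always yields the pg that has gone longest without a scrub, and giving a pg a newer stamp moves it to the back of the order.</p>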
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">PG</span><span class="o">::</span><span class="n">sched_scrub</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// note the condition: a non-primary pg returns immediately, so only the primary pg can start a scrub
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">is_primary</span><span class="p">()</span> <span class="o">&&</span> <span class="n">is_active</span><span class="p">()</span> <span class="o">&&</span> <span class="n">is_clean</span><span class="p">()</span> <span class="o">&&</span> <span class="o">!</span><span class="n">is_scrubbing</span><span class="p">()))</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="kt">bool</span> <span class="n">ret</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="c1">// first if: add this pg to the peers and ask the replicas to reserve a slot
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">scrubber</span><span class="p">.</span><span class="n">reserved</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// on the first call, reserved is false
</span> <span class="n">assert</span><span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">reserved_peers</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span> <span class="c1">// and the peer set is empty
</span> <span class="k">if</span> <span class="p">(</span><span class="n">osd</span><span class="o">-></span><span class="n">inc_scrubs_pending</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// reserve a slot for ourselves
</span> <span class="n">scrubber</span><span class="p">.</span><span class="n">reserved</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">scrubber</span><span class="p">.</span><span class="n">reserved_peers</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">pg_whoami</span><span class="p">);</span> <span class="c1">// count this pg among the peers
</span> <span class="n">scrub_reserve_replicas</span><span class="p">();</span> <span class="c1">// ask the replicas to reserve slots as well
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// second if: check the reservation state
</span> <span class="k">if</span> <span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">reserved</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">reserve_failed</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// reserve_failed stays false while replies are pending or once the replicas have reserved successfully
</span> <span class="n">clear_scrub_reserved</span><span class="p">();</span>
<span class="n">scrub_unreserve_replicas</span><span class="p">();</span>
<span class="n">ret</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">reserved_peers</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">acting</span><span class="p">.</span><span class="n">size</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// have all replicas reserved a slot?
</span> <span class="k">if</span> <span class="p">(</span><span class="n">time_for_deep</span><span class="p">)</span> <span class="p">{</span>
<span class="n">state_set</span><span class="p">(</span><span class="n">PG_STATE_DEEP_SCRUB</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">queue_scrub</span><span class="p">();</span> <span class="c1">// if so, queue the scrub
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// still waiting for the replicas
</span> <span class="n">dout</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"sched_scrub: reserved "</span> <span class="o"><<</span> <span class="n">scrubber</span><span class="p">.</span><span class="n">reserved_peers</span> <span class="o"><<</span> <span class="s">", waiting for replicas"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
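<p>The reservation bookkeeping in PG::sched_scrub can be modelled in isolation. The sketch below is hypothetical: ToyReservation and its methods are invented, osd ids are plain ints, and message passing is replaced by direct calls.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <set>

// Hypothetical model of the slot reservation; all names are invented.
struct ToyReservation {
  enum class Result { WAITING, QUEUED, ABORTED };

  std::set<int> reserved_peers;  // osds that granted a slot, primary included
  bool reserved = false;         // the primary holds its own slot
  bool reserve_failed = false;   // some replica refused
  int acting_size = 0;           // number of osds in the acting set

  void start(int primary) { reserved = true; reserved_peers.insert(primary); }
  void on_grant(int osd) { reserved_peers.insert(osd); }
  void on_reject() { reserve_failed = true; }

  // What a sched_scrub-style pass would decide at this moment.
  Result evaluate() {
    if (!reserved)
      return Result::WAITING;
    if (reserve_failed) {
      // give everything back and try again on a later tick
      reserved = false;
      reserve_failed = false;
      reserved_peers.clear();
      return Result::ABORTED;
    }
    if (static_cast<int>(reserved_peers.size()) == acting_size)
      return Result::QUEUED;     // every member of the acting set has a slot
    return Result::WAITING;      // replies still outstanding
  }
};
</code></pre></div></div>
<p>The first pass typically ends in WAITING because only the primary is in reserved_peers; a later pass sees the grants and returns QUEUED, which matches the "at least two ticks" observation above.</p>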
<p>In the normal case, once the replicas have reserved their slots too, queue_scrub is executed; all it does is put the pg on the scrub_wq queue:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">PG</span><span class="o">::</span><span class="n">queue_scrub</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_scrubbing</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// already scrubbing, return
</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">scrubber</span><span class="p">.</span><span class="n">must_scrub</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">state_set</span><span class="p">(</span><span class="n">PG_STATE_SCRUBBING</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">must_deep_scrub</span><span class="p">)</span> <span class="p">{</span>
<span class="n">state_set</span><span class="p">(</span><span class="n">PG_STATE_DEEP_SCRUB</span><span class="p">);</span>
<span class="n">scrubber</span><span class="p">.</span><span class="n">must_deep_scrub</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">scrubber</span><span class="p">.</span><span class="n">must_repair</span><span class="p">)</span> <span class="p">{</span>
<span class="n">state_set</span><span class="p">(</span><span class="n">PG_STATE_REPAIR</span><span class="p">);</span>
<span class="n">scrubber</span><span class="p">.</span><span class="n">must_repair</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">osd</span><span class="o">-></span><span class="n">queue_for_scrub</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// put on the queue
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">queue_for_scrub</span><span class="p">(</span><span class="n">PG</span> <span class="o">*</span><span class="n">pg</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">scrub_wq</span><span class="p">.</span><span class="n">queue</span><span class="p">(</span><span class="n">pg</span><span class="p">);</span> <span class="c1">// put on the queue
</span><span class="p">}</span>
</code></pre></div></div>
<h2 id="do-scrub">Do Scrub</h2>
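<p>After scheduling succeeds, the pg sits in scrub_wq and is handled by the disk_tp thread pool. The pool's entry point is the Worker function, which takes items off the queue and calls the queue's process function on them:</p>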
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_process</span><span class="p">(</span>
<span class="n">PG</span> <span class="o">*</span><span class="n">pg</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">&</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pg</span><span class="o">-></span><span class="n">scrub</span><span class="p">(</span><span class="n">handle</span><span class="p">);</span> <span class="c1">// again, the pg itself runs the scrub
</span> <span class="n">pg</span><span class="o">-></span><span class="n">put</span><span class="p">(</span><span class="s">"ScrubWQ"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">PG</span><span class="o">::</span><span class="n">scrub</span><span class="p">(</span><span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">&</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">lock</span><span class="p">();</span>
<span class="p">......</span>
<span class="n">chunky_scrub</span><span class="p">(</span><span class="n">handle</span><span class="p">);</span>
<span class="n">unlock</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
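<p>Chunky scrub improves on classic scrub: it scrubs only the objects in a given range rather than every object in the pg.
When one range is done, the pg goes back on the queue to be scheduled again. Between two such runs, the pg lock is free to be taken by foreground requests.
Scrubbing a small batch of objects should also finish sooner than scrubbing all of them (less data is read), so chunky scrub is friendlier to foreground IO,
and it is now the default. The implementation is easy to follow from the comments in the source; it is just a simple state machine.</p>
<p>When scrub finds an inconsistent pg, it reports this to the monitor, so the information can be retrieved through the monitor.</p>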
<h1 id="tuning">Tuning</h1>
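<p>Because scrub affects foreground IO and can cause slow requests, the following parameters may need adjusting in a production deployment, in particular
osd_scrub_load_threshold and osd_scrub_sleep. The default of 0.5 for the former is so low that the load condition is almost never satisfied, and with the latter at 0 there is no sleep at all.
It may also make sense to turn deep scrub off to avoid heavy IO.</p>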
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>osd_max_scrubs
osd_scrub_min_interval & osd_scrub_max_interval
osd_scrub_begin_hour & osd_scrub_end_hour
osd_scrub_load_threshold
osd_scrub_sleep
osd_scrub_chunk_min & osd_scrub_chunk_max
</code></pre></div></div>
Ceph Rbd Metadata
2016-01-11T00:00:00+00:00
http://blog.wjin.org/posts/ceph-rbd-metadata
<h1 id="introduction">Introduction</h1>
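<p>Ceph RBD block storage can be used by qemu virtual machines through the librbd library, or exposed directly as a Linux block device through the krbd kernel module.
Either way, what rbd presents is a virtual disk. So how is that virtual disk stored internally, and what metadata does it have?</p>
<p>This post briefly summarizes how the data is organized internally. The tests use a cluster started with vstart;
see the <a href="http://docs.ceph.com/docs/master/dev/#development-mode-cluster">official documentation</a>.</p>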
<h1 id="create-image">Create Image</h1>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls</span> <span class="c"># list all objects</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd create foo <span class="nt">--size</span> 1024 <span class="nt">--image-format</span> 2 <span class="c"># create image foo</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls
</span>rbd_header.101c190e05d5
rbd_directory
rbd_id.foo
</code></pre></div></div>
<p>After creating an image in an empty pool, three new objects appear. rbd_directory stores the two-way mapping between image names and ids for every image in the pool, which makes it easy to list all images in the pool:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd listomapvals rbd_directory <span class="c"># dump the omap of the directory object</span>
id_101c190e05d5
value <span class="o">(</span>7 bytes<span class="o">)</span> :
0000 : 03 00 00 00 66 6f 6f : ....foo
name_foo
value <span class="o">(</span>16 bytes<span class="o">)</span> :
0000 : 0c 00 00 00 31 30 31 63 31 39 30 65 30 35 64 35 : ....101c190e05d5
</code></pre></div></div>
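<p>Both omap values in the dump above have the same shape: a 4-byte little-endian length followed by that many bytes of text. A minimal decoder for that layout could look like the following; decode_lp_string is an invented helper for illustration and is not part of librados or librbd.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decodes a string stored as a 4-byte little-endian length followed by the
// string bytes. Returns false if the buffer is shorter than the length claims.
bool decode_lp_string(const std::vector<uint8_t> &buf, std::string *out) {
  if (buf.size() < 4)
    return false;
  uint32_t len = uint32_t(buf[0]) | uint32_t(buf[1]) << 8 |
                 uint32_t(buf[2]) << 16 | uint32_t(buf[3]) << 24;
  if (buf.size() - 4 < len)
    return false;
  out->assign(buf.begin() + 4, buf.begin() + 4 + len);
  return true;
}
</code></pre></div></div>
<p>Fed the 7 bytes stored under id_101c190e05d5 it yields foo, and fed the 16 bytes stored under name_foo it yields 101c190e05d5, which is the two-way mapping described above.</p>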
<p>The object rbd_id.foo stores the image's ID:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd get rbd_id.foo foo <span class="c"># fetch the object's content</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>hexdump <span class="nt">-C</span> foo <span class="c"># dump the object's content</span>
00000000 0c 00 00 00 31 30 31 63 31 39 30 65 30 35 64 35 |....101c190e05d5|
</code></pre></div></div>
<p>The object rbd_header.101c190e05d5 stores the image's metadata (which can be viewed with rbd info image_name):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd listomapvals rbd_header.101c190e05d5 <span class="c"># dump the omap of the header object</span>
features
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 01 00 00 00 00 00 00 00 : ........
object_prefix
value <span class="o">(</span>25 bytes<span class="o">)</span> :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 30 31 : ....rbd_data.101
0010 : 63 31 39 30 65 30 35 64 35 : c190e05d5
order
value <span class="o">(</span>1 bytes<span class="o">)</span> :
0000 : 16 : <span class="nb">.</span>
size
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 00 00 00 40 00 00 00 00 : ...@....
snap_seq
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 00 00 00 00 00 00 00 00 : ........
</code></pre></div></div>
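<p>The order and size values in the listing above can be read as plain little-endian integers: order 0x16 is 22, so each data object covers 2^22 bytes (4 MiB), and size decodes to 0x40000000 bytes (1 GiB), matching the --size 1024 used at creation. The helpers below are invented for this sketch and only reproduce that arithmetic.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <cstdint>

// Reads an 8-byte little-endian integer, the layout of the size value above.
uint64_t decode_le64(const uint8_t (&b)[8]) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i)
    v = (v << 8) | b[i];
  return v;
}

// Object size implied by the order field: 2^order bytes.
uint64_t object_size(uint8_t order) { return uint64_t(1) << order; }

// Number of objects needed to cover the whole image, rounding up.
uint64_t object_count(uint64_t image_size, uint8_t order) {
  uint64_t osz = object_size(order);
  return (image_size + osz - 1) / osz;
}
</code></pre></div></div>
<p>For this image that works out to at most 256 data objects, none of which exist until something is written to them.</p>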
<p>From then on, each new image adds two objects, rbd_header.xx and rbd_id.xx, and updates the entries in rbd_directory.
Note also that the header and directory objects hold only key/value data, so their actual size is 0 and they are inspected with listomapvals,
whereas an rbd_id object keeps the image id in the object data itself, so it is inspected by fetching the content with get and then dumping it.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd stat rbd_header.101c190e05d5 <span class="c"># stat the header object</span>
rbd/rbd_header.101c190e05d5 mtime 2016-01-16 15:17:19.000000, size 0
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd stat rbd_directory <span class="c"># stat the directory object</span>
rbd/rbd_directory mtime 0.000000, size 0
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd stat rbd_id.foo <span class="c"># stat the rbd_id object</span>
rbd/rbd_id.foo mtime 2016-01-12 10:33:51.000000, size 16
</code></pre></div></div>
<h1 id="create-snapshot">Create Snapshot</h1>
<h2 id="create">Create</h2>
<p>Continuing the experiment above, create a snapshot of image foo:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd snap create rbd/foo@snap <span class="c"># create a snapshot named snap</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls</span> <span class="c"># list all objects</span>
rbd_header.101c190e05d5
rbd_directory
rbd_id.foo
</code></pre></div></div>
<p>No objects were added, so where is the snapshot information stored? The answer is in the header object:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd listomapvals rbd_header.101c190e05d5 <span class="c"># dump the omap of the header object</span>
features
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 01 00 00 00 00 00 00 00 : ........
object_prefix
value <span class="o">(</span>25 bytes<span class="o">)</span> :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 30 31 : ....rbd_data.101
0010 : 63 31 39 30 65 30 35 64 35 : c190e05d5
order
value <span class="o">(</span>1 bytes<span class="o">)</span> :
0000 : 16 : <span class="nb">.</span>
size
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 00 00 00 40 00 00 00 00 : ...@....
snap_seq
value <span class="o">(</span>8 bytes<span class="o">)</span> :
0000 : 02 00 00 00 00 00 00 00 : ........
snapshot_0000000000000002 <span class="c"># snapshot ID</span>
value <span class="o">(</span>81 bytes<span class="o">)</span> :
0000 : 04 01 4b 00 00 00 02 00 00 00 00 00 00 00 04 00 : ..K.............
0010 : 00 00 73 6e 61 70 00 00 00 40 00 00 00 00 01 00 : ..snap...@......
0020 : 00 00 00 00 00 00 01 01 1c 00 00 00 ff ff ff ff : ................
0030 : ff ff ff ff 00 00 00 00 fe ff ff ff ff ff ff ff : ................
0040 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 : ................
0050 : 00 : <span class="nb">.</span>
</code></pre></div></div>
<p>snap_seq is the snapid of the image's most recent snapshot.</p>
<h2 id="copy-on-write">Copy On Write</h2>
<p>Creating a snapshot in ceph takes only seconds; afterwards it is maintained through COW. Reading and writing the image shows what changes. First, using krbd,
map the image foo from above under /dev:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">sudo</span> ./rbd map rbd/foo <span class="c"># map image foo</span>
/dev/rbd0
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">sudo</span> ./rbd showmapped <span class="c"># list all mappings</span>
id pool image snap device
0 rbd foo - /dev/rbd0
</code></pre></div></div>
<p>Then write some data to image foo. The dd command asks for 4 MB, but /dev/random returns short reads, so only a few hundred bytes are actually copied, as the output shows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">sudo </span>dd <span class="k">if</span><span class="o">=</span>/dev/random <span class="nv">of</span><span class="o">=</span>/dev/rbd0 <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>4 <span class="c"># write data</span>
dd: warning: partial <span class="nb">read</span> <span class="o">(</span>78 bytes<span class="o">)</span><span class="p">;</span> suggest <span class="nv">iflag</span><span class="o">=</span>fullblock
0+4 records <span class="k">in
</span>0+4 records out
312 bytes <span class="o">(</span>312 B<span class="o">)</span> copied, 0.0270896 s, 11.5 kB/s
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls</span> <span class="c"># list all objects</span>
rbd_data.101c190e05d5.0000000000000000 <span class="c"># the newly added data object</span>
rbd_header.101c190e05d5
rbd_directory
rbd_id.foo
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">sudo</span> ./rados <span class="nt">-p</span> rbd stat rbd_data.101c190e05d5.0000000000000000
rbd/rbd_data.101c190e05d5.0000000000000000 mtime 2016-01-16 15:53:37.000000, size 4096
</code></pre></div></div>
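<p>The name of the new data object follows from the header fields: the object_prefix, a dot, and the object number written as 16 hexadecimal digits, where the object number is the byte offset shifted right by order. The sketch below assumes the default layout without custom striping, and data_object_name is an invented helper, not a librbd function.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Builds the data object name that covers a given byte offset of the image.
// prefix is the object_prefix value from the header object.
std::string data_object_name(const std::string &prefix, uint64_t offset,
                             uint8_t order) {
  uint64_t objno = offset >> order;  // which 2^order-byte object holds offset
  char suffix[17];                   // 16 hex digits plus the terminating NUL
  std::snprintf(suffix, sizeof(suffix), "%016" PRIx64, objno);
  return prefix + "." + suffix;
}
</code></pre></div></div>
<p>With order 22, every offset below 4 MiB maps to object number 0, which is why the short write above touched only rbd_data.101c190e05d5.0000000000000000.</p>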
<p>Because the snapshot was created before the data was written, one would expect snapshot data to exist by now, but that is not what happens:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>find ./dev <span class="nt">-name</span> <span class="s2">"*101c190e05d5.*"</span> <span class="c"># three identical objects, one per replica</span>
./dev/osd0/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
./dev/osd1/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
./dev/osd2/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
</code></pre></div></div>
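<p>No snapshot object exists, which means no snapshot data was stored. This is certainly not a bug. When the snapshot was first created, the image was empty (all zeros).
The later write did trigger COW, but when COW went to copy the data it found nothing but zeros, so it did not create an object to hold them, which would only waste space.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd snap create rbd/foo@snap2 <span class="c"># create another snapshot, named snap2</span>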
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls
</span>rbd_data.101c190e05d5.0000000000000000
rbd_header.101c190e05d5
rbd_directory
rbd_id.foo
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">sudo </span>dd <span class="k">if</span><span class="o">=</span>/dev/random <span class="nv">of</span><span class="o">=</span>/dev/rbd0 <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>4 <span class="c"># write to the first object again</span>
dd: warning: partial <span class="nb">read</span> <span class="o">(</span>78 bytes<span class="o">)</span><span class="p">;</span> suggest <span class="nv">iflag</span><span class="o">=</span>fullblock
0+4 records <span class="k">in
</span>0+4 records out
305 bytes <span class="o">(</span>305 B<span class="o">)</span> copied, 0.0329425 s, 9.3 kB/s
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>find ./dev <span class="nt">-name</span> <span class="s2">"*101c190e05d5.*"</span>
./dev/osd0/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
./dev/osd0/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__3_5A1CE208__0 <span class="c"># COW happened; 3 is the snapshot ID</span>
./dev/osd1/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
./dev/osd1/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__3_5A1CE208__0
./dev/osd2/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
./dev/osd2/current/0.0_head/rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__3_5A1CE208__0
</code></pre></div></div>
<p>Searching with find is rather crude; the object can also be located directly:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./ceph osd map rbd rbd_data.101c190e05d5.0000000000000000 <span class="c"># find the pg that holds the object</span>
osdmap e22 pool <span class="s1">'rbd'</span> <span class="o">(</span>0<span class="o">)</span> object <span class="s1">'rbd_data.101c190e05d5.0000000000000000'</span> -> pg 0.5a1ce208 <span class="o">(</span>0.0<span class="o">)</span> -> up <span class="o">([</span>0,2,1], p0<span class="o">)</span> acting <span class="o">([</span>0,2,1], p0<span class="o">)</span> <span class="c"># 0.0 is the pg ID</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span><span class="nb">ls</span> ./dev/osd0/current/0.0_head/ <span class="c"># the directory for pg 0.0</span>
__head_00000000__0 rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__3_5A1CE208__0 rbd<span class="se">\u</span>data.101c190e05d5.0000000000000000__head_5A1CE208__0
</code></pre></div></div>
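<p>The osd map output above shows the object hash 0x5a1ce208 being folded down to PG 0.0. The following is a minimal sketch of that folding step, modelled on Ceph's ceph_stable_mod helper. The pg_num value of 8 used in the usage note is an assumption (the listing does not show the pool's pg_num), and the names pg_mask, stable_mod and object_hash_to_pg are illustrative only.</p>

```cpp
#include <cassert>
#include <cstdint>

// Smallest (2^n - 1) mask that covers pg_num - 1. pg_num must be >= 1.
inline uint32_t pg_mask(uint32_t pg_num) {
  uint32_t mask = 0;
  while (mask < pg_num - 1) mask = (mask << 1) | 1;
  return mask;
}

// Stable modulo: behaves like x % b when b is a power of two, and keeps
// most placements unchanged while b grows towards the next power of two.
inline uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  if ((x & bmask) < b) return x & bmask;
  return x & (bmask >> 1);
}

// Fold a 32-bit object hash into a PG index in [0, pg_num).
inline uint32_t object_hash_to_pg(uint32_t hash, uint32_t pg_num) {
  return stable_mod(hash, pg_num, pg_mask(pg_num));
}
```

<p>Under the assumed pg_num of 8, object_hash_to_pg(0x5a1ce208, 8) gives 0, which matches the 0.0 shown in the listing (pool 0, PG 0).</p>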
<h1 id="clone-image">Clone Image</h1>
<p>Again, start a fresh cluster with the vstart script, then create an image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd create foo <span class="nt">--size</span> 1024 <span class="nt">--image-format</span> 2 <span class="c"># create the image</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls
</span>rbd_directory
rbd_id.foo
rbd_header.10146b8b4567
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd snap create rbd/foo@snap <span class="c"># create a snapshot</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls
</span>rbd_directory
rbd_id.foo
rbd_header.10146b8b4567
</code></pre></div></div>
<p>Next, protect the snapshot and clone a new image from it:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd snap protect rbd/foo@snap <span class="c"># protect the snapshot</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd clone rbd/foo@snap rbd/bar <span class="c"># clone the image</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd <span class="nb">ls
</span>bar
foo
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd <span class="nb">ls
</span>rbd_id.bar
rbd_children <span class="c"># stores parent/child relationships</span>
rbd_directory
rbd_id.foo
rbd_header.10146b8b4567
rbd_header.101e6b8b4567
</code></pre></div></div>
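<p>As shown, besides the two objects that belong to the new image itself, cloning also produces an extra object, rbd_children, which records the parent/child relationship between the images:</p>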
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd listomapvals rbd_children
key <span class="o">(</span>32 bytes<span class="o">)</span>:
00000000 00 00 00 00 00 00 00 00 0c 00 00 00 31 30 31 34 |............1014|
00000010 36 62 38 62 34 35 36 37 04 00 00 00 00 00 00 00 |6b8b4567........|
00000020
value <span class="o">(</span>20 bytes<span class="o">)</span> :
00000000 01 00 00 00 0c 00 00 00 31 30 31 65 36 62 38 62 |........101e6b8b|
00000010 34 35 36 37 |4567|
00000014
</code></pre></div></div>
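<p>The hexdump above can be read field by field. The sketch below decodes it under the layout the dump suggests: little-endian fixed-width integers and strings prefixed by a 4-byte length. The key then holds the parent pool ID (8 bytes), the parent image ID and the snapshot ID (8 bytes); the value holds a 4-byte count followed by the child image IDs. Reader, ParentKey, decode_parent_key and decode_children are hypothetical names, not Ceph APIs.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Cursor over a byte buffer; all integers are read as little-endian.
class Reader {
 public:
  explicit Reader(const std::vector<uint8_t>& buf) : buf_(buf) {}

  // Read an unsigned integer of `width` bytes (1..8).
  uint64_t get_uint(std::size_t width) {
    need(width);
    uint64_t v = 0;
    for (std::size_t i = 0; i < width; ++i)
      v |= static_cast<uint64_t>(buf_[pos_ + i]) << (8 * i);
    pos_ += width;
    return v;
  }

  // Read a string stored as a 4-byte length followed by the bytes.
  std::string get_string() {
    std::size_t len = static_cast<std::size_t>(get_uint(4));
    need(len);
    std::string s(buf_.begin() + pos_, buf_.begin() + pos_ + len);
    pos_ += len;
    return s;
  }

 private:
  void need(std::size_t n) const {
    if (buf_.size() - pos_ < n) throw std::out_of_range("short buffer");
  }
  const std::vector<uint8_t>& buf_;
  std::size_t pos_ = 0;
};

struct ParentKey {
  uint64_t pool_id;
  std::string image_id;
  uint64_t snap_id;
};

// Key layout assumed from the dump: pool ID, image ID string, snapshot ID.
ParentKey decode_parent_key(const std::vector<uint8_t>& key) {
  Reader r(key);
  ParentKey k;
  k.pool_id = r.get_uint(8);
  k.image_id = r.get_string();
  k.snap_id = r.get_uint(8);
  return k;
}

// Value layout assumed from the dump: element count, then one string each.
std::vector<std::string> decode_children(const std::vector<uint8_t>& value) {
  Reader r(value);
  std::vector<std::string> ids;
  uint64_t n = r.get_uint(4);
  for (uint64_t i = 0; i < n; ++i) ids.push_back(r.get_string());
  return ids;
}
```

<p>Applied to the bytes above, the key decodes to pool 0, image ID 10146b8b4567 (the ID in foo's header object) and snapshot ID 4, and the value decodes to a single child, 101e6b8b4567 (the ID in bar's header object).</p>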
<p>If the cloned image is flattened, that is, detached from its parent, the parent/child entry is removed from the rbd_children object:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rbd flatten bar <span class="c"># flatten the image</span>
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>./rados <span class="nt">-p</span> rbd listomapvals rbd_children
<span class="o">[</span>jw@localhost src]<span class="nv">$ </span>
</code></pre></div></div>
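<p>The objects above make up most of the metadata. Note also that if the newer object map feature is enabled (it is on by default from the jewel release onward), there will be additional objects that store the object map metadata.</p>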
Ceph Async Messenger
2015-12-28T00:00:00+00:00
http://blog.wjin.org/posts/ceph-async-messenger
<h1 id="overview">Overview</h1>
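<p>The Ceph source tree contains three messenger implementations. SimpleMessenger is the oldest: every pair of communicating peers needs four threads to maintain the connection (two on each side, one for reading and one for writing),
so a large cluster ends up creating a huge number of threads. Since epoll became available on Linux, high-concurrency network I/O has generally been built on system calls of this kind,
for example through the libevent library. Ceph therefore also provides AsyncMessenger, built on epoll, which reduces the number of threads the cluster needs for network communication.
At the time of writing the implementation is not yet stable enough to be the default messenger, but it looks set to replace SimpleMessenger in the future.</p>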
<h1 id="server">Server</h1>
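<p>The server side listens on a port, waits for incoming connection requests, then accepts them and establishes connections for communication.</p>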
<h3 id="initialization">Initialization</h3>
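<p>Take the osd process as an example. During startup it creates Messenger objects that manage network connections, listen on ports, and receive requests. The code is in src/ceph_osd.cc:</p>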
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// ms_public handles client traffic
</span> <span class="n">Messenger</span> <span class="o">*</span><span class="n">ms_public</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span>
<span class="n">entity_name_t</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">whoami</span><span class="p">),</span> <span class="s">"client"</span><span class="p">,</span>
<span class="n">getpid</span><span class="p">());</span>
<span class="c1">// ms_cluster handles traffic inside the cluster
</span> <span class="n">Messenger</span> <span class="o">*</span><span class="n">ms_cluster</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">,</span> <span class="n">g_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span>
<span class="n">entity_name_t</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">whoami</span><span class="p">),</span> <span class="s">"cluster"</span><span class="p">,</span>
<span class="n">getpid</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// src/msg/Messenger.cc
</span><span class="n">Messenger</span> <span class="o">*</span><span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">type</span><span class="p">,</span>
<span class="n">entity_name_t</span> <span class="n">name</span><span class="p">,</span> <span class="n">string</span> <span class="n">lname</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">nonce</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// In src/common/config_opts.h, the async-related options below must currently be set for this branch to take effect
</span> <span class="c1">// OPTION(enable_experimental_unrecoverable_data_corrupting_features, OPT_STR, "ms-type-async")
</span> <span class="c1">// OPTION(ms_type, OPT_STR, "async")
</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">r</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">||</span> <span class="n">type</span> <span class="o">==</span> <span class="s">"async"</span><span class="p">)</span> <span class="o">&&</span>
<span class="n">cct</span><span class="o">-></span><span class="n">check_experimental_feature_enabled</span><span class="p">(</span><span class="s">"ms-type-async"</span><span class="p">))</span>
<span class="k">return</span> <span class="k">new</span> <span class="n">AsyncMessenger</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">lname</span><span class="p">,</span> <span class="n">nonce</span><span class="p">);</span>
<span class="p">......</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
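<p>The AsyncMessenger constructor deserves attention. Although the osd process creates 6 messengers during startup, they all share a single WorkerPool.
The function lookup_or_create_singleton_object&lt;WorkerPool&gt; guarantees that only one pool is created, because the name passed in, WorkerPool::name, is always the same:</p>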
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">AsyncMessenger</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="n">entity_name_t</span> <span class="n">name</span><span class="p">,</span>
<span class="n">string</span> <span class="n">mname</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">_nonce</span><span class="p">)</span>
<span class="o">:</span> <span class="n">SimplePolicyMessenger</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span><span class="n">mname</span><span class="p">,</span> <span class="n">_nonce</span><span class="p">),</span>
<span class="n">processor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">cct</span><span class="p">,</span> <span class="n">_nonce</span><span class="p">),</span>
<span class="n">lock</span><span class="p">(</span><span class="s">"AsyncMessenger::lock"</span><span class="p">),</span>
<span class="n">nonce</span><span class="p">(</span><span class="n">_nonce</span><span class="p">),</span> <span class="n">need_addr</span><span class="p">(</span><span class="nb">true</span><span class="p">),</span> <span class="n">did_bind</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span>
<span class="n">global_seq</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">deleted_lock</span><span class="p">(</span><span class="s">"AsyncMessenger::deleted_lock"</span><span class="p">),</span>
<span class="n">cluster_protocol</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">stopped</span><span class="p">(</span><span class="nb">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ceph_spin_init</span><span class="p">(</span><span class="o">&</span><span class="n">global_seq_lock</span><span class="p">);</span>
<span class="n">cct</span><span class="o">-></span><span class="n">lookup_or_create_singleton_object</span><span class="o"><</span><span class="n">WorkerPool</span><span class="o">></span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">WorkerPool</span><span class="o">::</span><span class="n">name</span><span class="p">);</span> <span class="c1">// create the pool object; note that the second argument is a static constant in WorkerPool
</span> <span class="c1">// create a local connection object used to send messages to ourselves
</span> <span class="n">local_connection</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AsyncConnection</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">pool</span><span class="o">-></span><span class="n">get_worker</span><span class="p">()</span><span class="o">-></span><span class="n">center</span><span class="p">);</span>
<span class="n">init_local_connection</span><span class="p">();</span> <span class="c1">// initialize the local connection
</span><span class="p">}</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="kt">void</span> <span class="n">lookup_or_create_singleton_object</span><span class="p">(</span><span class="n">T</span><span class="o">*&</span> <span class="n">p</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">name</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ceph_spin_lock</span><span class="p">(</span><span class="o">&</span><span class="n">_associated_objs_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">_associated_objs</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">name</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// the name ensures that a process has only one pool
</span> <span class="n">p</span> <span class="o">=</span> <span class="k">new</span> <span class="n">T</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// allocate a new object, here a WorkerPool
</span> <span class="n">_associated_objs</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="n">AssociatedSingletonObject</span><span class="o">*></span><span class="p">(</span><span class="n">p</span><span class="p">);</span> <span class="c1">// add it to the map
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">p</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="n">T</span><span class="o">*></span><span class="p">(</span><span class="n">_associated_objs</span><span class="p">[</span><span class="n">name</span><span class="p">]);</span>
<span class="p">}</span>
<span class="n">ceph_spin_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">_associated_objs_lock</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
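<p>Note also that this process-wide pool is allocated in the messenger constructor, but the messenger destructor does not free it. The pool is shared by several messengers,
and destroying one messenger does not mean the others are gone too. The pool pointer is stored in the CephContext member _associated_objs;
since a daemon process has a single global CephContext, the pool's memory is released when that CephContext is destroyed.</p>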
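<p>The sharing and ownership rules described above can be condensed into a small sketch. It is a simplified stand-in rather than Ceph's code: it uses std::mutex instead of a spin lock and a common base class instead of reinterpret_cast, and the names Associated, Registry and Pool are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Common base so the registry can own and delete objects of different types.
struct Associated {
  virtual ~Associated() = default;
};

// Plays the role of the context object: it owns every singleton it creates.
class Registry {
 public:
  // Return the object registered under `name`, creating it on first use.
  // Callers must always request the same T for a given name.
  template <typename T>
  T* lookup_or_create(const std::string& name) {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = objs_.find(name);
    if (it == objs_.end())
      it = objs_.emplace(name, std::unique_ptr<Associated>(new T(this))).first;
    return static_cast<T*>(it->second.get());
  }

  std::size_t size() const { return objs_.size(); }

 private:
  std::mutex lock_;
  std::map<std::string, std::unique_ptr<Associated>> objs_;
};

// Stand-in for the shared pool; `live` counts instances for demonstration.
struct Pool : Associated {
  static const std::string name;
  static int live;
  explicit Pool(Registry*) { ++live; }
  ~Pool() override { --live; }
};
const std::string Pool::name = "Pool";
int Pool::live = 0;
```

<p>Two calls to lookup_or_create&lt;Pool&gt;(Pool::name) on the same Registry return the same pointer, and the Pool is destroyed only when the Registry itself goes out of scope, which mirrors the messenger/CephContext relationship described above.</p>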
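<p>An osd process has only one WorkerPool, so what does the pool do when it is initialized? As the name suggests, a WorkerPool manages Workers:
its constructor creates the Worker objects, and since Worker derives from the thread class, each Worker is a thread that does its work independently. The code is in
src/msg/async/AsyncMessenger.{h,cc}:</p>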
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WorkerPool</span><span class="o">::</span><span class="n">WorkerPool</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span><span class="o">:</span> <span class="n">cct</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">started</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span>
<span class="n">barrier_lock</span><span class="p">(</span><span class="s">"WorkerPool::WorkerPool::barrier_lock"</span><span class="p">),</span>
<span class="n">barrier_count</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">ms_async_op_threads</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">ms_async_op_threads</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Worker</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="k">this</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// create a new Worker object
</span> <span class="n">workers</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">w</span><span class="p">);</span> <span class="c1">// keep it in a vector so that all workers can be tracked
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="k">class</span> <span class="nc">Worker</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Thread</span> <span class="p">{</span> <span class="c1">// derives from the thread class, so each Worker has its own thread
</span> <span class="k">static</span> <span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">InitEventNumber</span> <span class="o">=</span> <span class="mi">5000</span><span class="p">;</span> <span class="c1">// number of events
</span> <span class="k">static</span> <span class="k">const</span> <span class="kt">uint64_t</span> <span class="n">EventMaxWaitUs</span> <span class="o">=</span> <span class="mi">30000000</span><span class="p">;</span> <span class="c1">// maximum time to wait for events: 30 seconds
</span> <span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">;</span>
<span class="n">WorkerPool</span> <span class="o">*</span><span class="n">pool</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">done</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">id</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">EventCenter</span> <span class="n">center</span><span class="p">;</span> <span class="c1">// event center
</span> <span class="n">Worker</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="n">WorkerPool</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="o">:</span> <span class="n">cct</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="n">pool</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">done</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span> <span class="n">id</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="n">center</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
<span class="n">center</span><span class="p">.</span><span class="n">init</span><span class="p">(</span><span class="n">InitEventNumber</span><span class="p">);</span> <span class="c1">// initialize the event driver, which in practice sets up the epoll structures
</span> <span class="p">}</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">entry</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">stop</span><span class="p">();</span>
<span class="p">};</span>
</code></pre></div></div>
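<p>The AsyncMessenger constructor shown earlier obtains a worker through pool->get_worker(), and the pool keeps a seq counter. A plausible reading is that workers are handed out in round-robin order; the sketch below illustrates that policy under that assumption. It is a simplified model: the Worker here has no thread or EventCenter, and the code is not thread-safe.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a worker; the real Worker also owns a thread and an EventCenter.
struct Worker {
  explicit Worker(int i) : id(i) {}
  int id;
};

class WorkerPool {
 public:
  explicit WorkerPool(int n) {
    assert(n > 0);  // mirrors the ms_async_op_threads > 0 check
    for (int i = 0; i < n; ++i)
      workers_.push_back(std::unique_ptr<Worker>(new Worker(i)));
  }

  // Assumed policy: hand out workers in round-robin order.
  Worker* get_worker() {
    Worker* w = workers_[seq_ % workers_.size()].get();
    ++seq_;
    return w;
  }

 private:
  std::vector<std::unique_ptr<Worker>> workers_;  // the pool owns its workers
  std::size_t seq_ = 0;                           // next slot to hand out
};
```

<p>With a pool of three workers, successive get_worker calls return workers 0, 1, 2 and then 0 again, so connections are spread evenly across the event loops.</p>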
<p>To keep the code generic, a separate abstraction layer, EventCenter, manages the different event drivers, such as epoll, kqueue, and select.
The code is in src/msg/async/Event.{h,cc}:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EventCenter</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">FileEvent</span> <span class="o">*</span><span class="n">file_events</span><span class="p">;</span> <span class="c1">// all I/O events
</span> <span class="n">EventDriver</span> <span class="o">*</span><span class="n">driver</span><span class="p">;</span> <span class="c1">// the concrete driver
</span> <span class="n">map</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">TimeEvent</span><span class="o">></span> <span class="o">></span> <span class="n">time_events</span><span class="p">;</span> <span class="c1">// all timer events
</span> <span class="p">......</span>
<span class="p">};</span>
<span class="c1">// EventDriver interface
// The epoll driver implements this interface by wrapping the three epoll system calls: epoll_create, epoll_ctl, and epoll_wait
</span><span class="k">class</span> <span class="nc">EventDriver</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">EventDriver</span><span class="p">()</span> <span class="p">{}</span> <span class="c1">// we want a virtual destructor!!!
</span> <span class="k">virtual</span> <span class="kt">int</span> <span class="n">init</span><span class="p">(</span><span class="kt">int</span> <span class="n">nevent</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">add_event</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">cur_mask</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mask</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">del_event</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">cur_mask</span><span class="p">,</span> <span class="kt">int</span> <span class="n">del_mask</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">event_wait</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">FiredFileEvent</span><span class="o">></span> <span class="o">&</span><span class="n">fired_events</span><span class="p">,</span> <span class="k">struct</span> <span class="n">timeval</span> <span class="o">*</span><span class="n">tp</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">resize_events</span><span class="p">(</span><span class="kt">int</span> <span class="n">newsize</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">EpollDriver</span> <span class="o">:</span> <span class="k">public</span> <span class="n">EventDriver</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">epfd</span><span class="p">;</span> <span class="c1">// epoll fd
</span> <span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">events</span><span class="p">;</span> <span class="c1">// array that receives ready events; see the epoll documentation
</span> <span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">size</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The Worker constructor calls center.init. What does that function do?</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">EventCenter</span><span class="o">::</span><span class="n">init</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">driver</span> <span class="o">=</span> <span class="k">new</span> <span class="n">EpollDriver</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span> <span class="c1">// create a driver object
</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">driver</span><span class="o">-></span><span class="n">init</span><span class="p">(</span><span class="n">n</span><span class="p">);</span> <span class="c1">// initialize the concrete driver
</span>
<span class="kt">int</span> <span class="n">fds</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span> <span class="c1">// the pipe is used to wake up the worker thread; analyzed later
</span> <span class="k">if</span> <span class="p">(</span><span class="n">pipe</span><span class="p">(</span><span class="n">fds</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" can't create notify pipe"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">notify_receive_fd</span> <span class="o">=</span> <span class="n">fds</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">notify_send_fd</span> <span class="o">=</span> <span class="n">fds</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="p">......</span>
<span class="n">create_file_event</span><span class="p">(</span><span class="n">notify_receive_fd</span><span class="p">,</span> <span class="n">EVENT_READABLE</span><span class="p">,</span> <span class="n">EventCallbackRef</span><span class="p">(</span><span class="k">new</span> <span class="n">C_handle_notify</span><span class="p">()));</span> <span class="c1">// watch the pipe for readable events
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// initialize epoll
</span><span class="kt">int</span> <span class="n">EpollDriver</span><span class="o">::</span><span class="n">init</span><span class="p">(</span><span class="kt">int</span> <span class="n">nevent</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">events</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">epoll_event</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">epoll_event</span><span class="p">)</span><span class="o">*</span><span class="n">nevent</span><span class="p">);</span> <span class="c1">// nevent is InitEventNumber from the Worker class
</span> <span class="n">memset</span><span class="p">(</span><span class="n">events</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">epoll_event</span><span class="p">)</span><span class="o">*</span><span class="n">nevent</span><span class="p">);</span>
<span class="n">epfd</span> <span class="o">=</span> <span class="n">epoll_create</span><span class="p">(</span><span class="mi">1024</span><span class="p">);</span> <span class="c1">// obtain an epoll fd
</span> <span class="n">size</span> <span class="o">=</span> <span class="n">nevent</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
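<p>The combination of an epoll instance and a notification pipe shown above can be tried in isolation. The following Linux-only sketch registers the read end of a pipe with epoll so that a single byte written to the other end wakes up a waiting loop. The class name EpollLoop and its methods are hypothetical; it uses epoll_create1 rather than epoll_create and is not the Ceph implementation.</p>

```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

class EpollLoop {
 public:
  EpollLoop() : epfd_(epoll_create1(0)), events_(64) {
    int fds[2] = {-1, -1};
    int rc = pipe(fds);
    if (rc == 0) {
      rfd_ = fds[0];  // read end: watched by epoll
      wfd_ = fds[1];  // write end: used to wake the loop
    }
    ok_ = (epfd_ >= 0 && rc == 0);
    if (ok_) {
      epoll_event ev{};
      ev.events = EPOLLIN;
      ev.data.fd = rfd_;
      ok_ = (epoll_ctl(epfd_, EPOLL_CTL_ADD, rfd_, &ev) == 0);
    }
  }
  ~EpollLoop() {
    if (rfd_ >= 0) close(rfd_);
    if (wfd_ >= 0) close(wfd_);
    if (epfd_ >= 0) close(epfd_);
  }
  EpollLoop(const EpollLoop&) = delete;
  EpollLoop& operator=(const EpollLoop&) = delete;

  bool ok() const { return ok_; }

  // Wake up a thread blocked in wait() by making the pipe readable.
  bool wakeup() {
    char c = 'w';
    return write(wfd_, &c, 1) == 1;
  }

  // Wait up to timeout_ms for events. Returns the number of ready fds
  // (or -1 on error) and sets *woken if the notification pipe fired.
  int wait(int timeout_ms, bool* woken) {
    *woken = false;
    int n = epoll_wait(epfd_, events_.data(),
                       static_cast<int>(events_.size()), timeout_ms);
    for (int i = 0; i < n; ++i) {
      if (events_[i].data.fd == rfd_) {
        char buf[64];
        if (read(rfd_, buf, sizeof(buf)) > 0) *woken = true;  // drain the pipe
      }
    }
    return n;
  }

 private:
  int epfd_;
  int rfd_ = -1;
  int wfd_ = -1;
  bool ok_ = false;
  std::vector<epoll_event> events_;
};
```

<p>A thread blocked in wait returns as soon as another thread calls wakeup, which is the role the notify pipe plays for a worker in the code above.</p>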
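<p>So far we have gone from the osd process to the AsyncMessenger class, then to the WorkerPool shared by all messengers, then to each Worker in that process-wide pool, and finally to the EventCenter through which each worker manages all of its events
and which initializes the concrete event mechanism, such as epoll. Is everything ready, then?
Not quite. First, the worker threads have not been started. Second, the osd messengers have not yet been bound to any port to listen on. Osd startup therefore still has further steps to go through.</p>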
<h3 id="bind-and-listen">Bind and Listen</h3>
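<p>After the messengers are created, their policies and throttling parameters are set, and then the addresses are bound. The socket-level work, such as socket/bind/listen/accept, is handled mainly by the Processor class:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ceph_osd.cc, continued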
</span><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// set the protocol
</span> <span class="n">ms_cluster</span><span class="o">-></span><span class="n">set_cluster_protocol</span><span class="p">(</span><span class="n">CEPH_OSD_PROTOCOL</span><span class="p">);</span>
<span class="p">......</span>
<span class="c1">// set the policy and throttlers
</span> <span class="n">ms_public</span><span class="o">-></span><span class="n">set_default_policy</span><span class="p">(</span><span class="n">Messenger</span><span class="o">::</span><span class="n">Policy</span><span class="o">::</span><span class="n">stateless_server</span><span class="p">(</span><span class="n">supported</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="n">ms_public</span><span class="o">-></span><span class="n">set_policy_throttlers</span><span class="p">(</span><span class="n">entity_name_t</span><span class="o">::</span><span class="n">TYPE_CLIENT</span><span class="p">,</span>
<span class="n">client_byte_throttler</span><span class="p">.</span><span class="n">get</span><span class="p">(),</span>
<span class="n">client_msg_throttler</span><span class="p">.</span><span class="n">get</span><span class="p">());</span>
<span class="p">......</span>
<span class="c1">// bind the addresses
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ms_public</span><span class="o">-></span><span class="n">bind</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">public_addr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">ms_cluster</span><span class="o">-></span><span class="n">bind</span><span class="p">(</span><span class="n">g_conf</span><span class="o">-></span><span class="n">cluster_addr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">ms_public</span><span class="o">-></span><span class="n">start</span><span class="p">();</span> <span class="c1">// start the threads
</span>
<span class="p">......</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">osd</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// this call is crucial; analyzed later
</span>
<span class="p">......</span>
<span class="n">ms_public</span><span class="o">-></span><span class="n">wait</span><span class="p">();</span> <span class="c1">// wait for the threads to finish
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">bind_addr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// bind to a socket
</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">avoid_ports</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">processor</span><span class="p">.</span><span class="n">bind</span><span class="p">(</span><span class="n">bind_addr</span><span class="p">,</span> <span class="n">avoid_ports</span><span class="p">);</span> <span class="c1">// delegate to the Processor object
</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// Processor simply wraps the socket API: socket, bind, listen
// create the socket, bind it to the given port, then listen on it
</span><span class="kt">int</span> <span class="n">Processor</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span> <span class="o">&</span><span class="n">bind_addr</span><span class="p">,</span> <span class="k">const</span> <span class="n">set</span><span class="o"><</span><span class="kt">int</span><span class="o">>&</span> <span class="n">avoid_ports</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">listen_sd</span> <span class="o">=</span> <span class="o">::</span><span class="n">socket</span><span class="p">(</span><span class="n">family</span><span class="p">,</span> <span class="n">SOCK_STREAM</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">rc</span> <span class="o">=</span> <span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="p">(</span><span class="k">struct</span> <span class="n">sockaddr</span> <span class="o">*</span><span class="p">)</span> <span class="o">&</span><span class="n">listen_addr</span><span class="p">.</span><span class="n">ss_addr</span><span class="p">(),</span> <span class="n">listen_addr</span><span class="p">.</span><span class="n">addr_size</span><span class="p">());</span>
<span class="p">......</span>
<span class="n">rc</span> <span class="o">=</span> <span class="o">::</span><span class="n">listen</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="mi">128</span><span class="p">);</span>
<span class="p">......</span>
<span class="n">msgr</span><span class="o">-></span><span class="n">init_local_connection</span><span class="p">();</span> <span class="c1">// update the address; no dispatcher is registered yet, so the connection is not handled
</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">init_local_connection</span><span class="p">()</span> <span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">_init_local_connection</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">_init_local_connection</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">local_connection</span><span class="o">-></span><span class="n">peer_addr</span> <span class="o">=</span> <span class="n">my_inst</span><span class="p">.</span><span class="n">addr</span><span class="p">;</span>
<span class="n">local_connection</span><span class="o">-></span><span class="n">peer_type</span> <span class="o">=</span> <span class="n">my_inst</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">type</span><span class="p">();</span>
<span class="n">ms_deliver_handle_fast_connect</span><span class="p">(</span><span class="n">local_connection</span><span class="p">.</span><span class="n">get</span><span class="p">());</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">ms_deliver_handle_fast_connect</span><span class="p">(</span><span class="n">Connection</span> <span class="o">*</span><span class="n">con</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="n">list</span><span class="o"><</span><span class="n">Dispatcher</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">fast_dispatchers</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="c1">// fast_dispatchers is still empty at this point
</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">fast_dispatchers</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">)</span>
<span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="o">-></span><span class="n">ms_handle_fast_connect</span><span class="p">(</span><span class="n">con</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
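<p>The socket, bind, listen sequence that Processor::bind wraps can be tried in isolation. The following is a minimal sketch, not Ceph code: the helper names open_listen_socket and bound_port are made up for illustration, and it binds to all local addresses on a kernel-chosen port instead of a configured public_addr.</p>

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Ceph): create a TCP socket, bind it to
// the given port on all local addresses and start listening.
// Returns the listening fd, or -1 on error.
int open_listen_socket(uint16_t port, int backlog = 128)
{
  int sd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (sd < 0)
    return -1;

  // Allow quick restarts on the same port.
  int on = 1;
  ::setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);  // port 0 lets the kernel pick a free port

  if (::bind(sd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      ::listen(sd, backlog) < 0) {
    ::close(sd);
    return -1;
  }
  return sd;
}

// Report the port the socket is actually bound to, or 0 on error.
uint16_t bound_port(int sd)
{
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (::getsockname(sd, reinterpret_cast<sockaddr*>(&addr), &len) < 0)
    return 0;
  return ntohs(addr.sin_port);
}
```

<p>Calling open_listen_socket(0) and then bound_port on the result shows which port was assigned. As in the Ceph code above, nothing is accepted at this stage: the fd only becomes useful once some event loop watches it.</p>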
<h3 id="deal-with-event">Deal with Event</h3>
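<p>After binding the address and listening on the port, the messenger waits for connections to arrive. Handling those connection requests requires Worker threads, which are created next:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ceph_osd.cc goes on to call messenger->start(), see the code above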
</span><span class="kt">int</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">start</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="p">......</span>
<span class="n">pool</span><span class="o">-></span><span class="n">start</span><span class="p">();</span> <span class="c1">// start all worker threads
</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">WorkerPool</span><span class="o">::</span><span class="n">start</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">started</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">workers</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">workers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">create</span><span class="p">();</span> <span class="c1">// create the thread
</span> <span class="p">}</span>
<span class="n">started</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// thread entry function
</span><span class="kt">void</span> <span class="o">*</span><span class="n">Worker</span><span class="o">::</span><span class="n">entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">center</span><span class="p">.</span><span class="n">set_owner</span><span class="p">(</span><span class="n">pthread_self</span><span class="p">());</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// the thread keeps looping and processing events
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">center</span><span class="p">.</span><span class="n">process_events</span><span class="p">(</span><span class="n">EventMaxWaitUs</span><span class="p">);</span> <span class="c1">// the event center does the work; note that the maximum wait is 30 seconds
</span> <span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// epoll_wait returns every ready fd, then the callback of each one is invoked in turn
</span><span class="kt">int</span> <span class="n">EventCenter</span><span class="o">::</span><span class="n">process_events</span><span class="p">(</span><span class="kt">int</span> <span class="n">timeout_microseconds</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">vector</span><span class="o"><</span><span class="n">FiredFileEvent</span><span class="o">></span> <span class="n">fired_events</span><span class="p">;</span>
<span class="n">next_time</span> <span class="o">=</span> <span class="n">shortest</span><span class="p">;</span>
<span class="n">numevents</span> <span class="o">=</span> <span class="n">driver</span><span class="o">-></span><span class="n">event_wait</span><span class="p">(</span><span class="n">fired_events</span><span class="p">,</span> <span class="o">&</span><span class="n">tv</span><span class="p">);</span> <span class="c1">// collect the I/O events that are ready now
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">numevents</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">rfired</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">FileEvent</span> <span class="o">*</span><span class="n">event</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">file_lock</span><span class="p">);</span>
<span class="n">event</span> <span class="o">=</span> <span class="n">_get_file_event</span><span class="p">(</span><span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">event</span><span class="o">-></span><span class="n">mask</span> <span class="o">&</span> <span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">mask</span> <span class="o">&</span> <span class="n">EVENT_READABLE</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rfired</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">event</span><span class="o">-></span><span class="n">read_cb</span><span class="o">-></span><span class="n">do_request</span><span class="p">(</span><span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">);</span> <span class="c1">// handle the readable event
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">event</span><span class="o">-></span><span class="n">mask</span> <span class="o">&</span> <span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">mask</span> <span class="o">&</span> <span class="n">EVENT_WRITABLE</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">rfired</span> <span class="o">||</span> <span class="n">event</span><span class="o">-></span><span class="n">read_cb</span> <span class="o">!=</span> <span class="n">event</span><span class="o">-></span><span class="n">write_cb</span><span class="p">)</span>
<span class="n">event</span><span class="o">-></span><span class="n">write_cb</span><span class="o">-></span><span class="n">do_request</span><span class="p">(</span><span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">);</span> <span class="c1">// handle the writable event
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">EpollDriver</span><span class="o">::</span><span class="n">event_wait</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">FiredFileEvent</span><span class="o">></span> <span class="o">&</span><span class="n">fired_events</span><span class="p">,</span> <span class="k">struct</span> <span class="n">timeval</span> <span class="o">*</span><span class="n">tvp</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">retval</span><span class="p">,</span> <span class="n">numevents</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">events</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
<span class="n">tvp</span> <span class="o">?</span> <span class="p">(</span><span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span><span class="o">*</span><span class="mi">1000</span> <span class="o">+</span> <span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span><span class="o">/</span><span class="mi">1000</span><span class="p">)</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// epoll_wait system call: returns when an fd is ready or the timeout expires
</span> <span class="k">if</span> <span class="p">(</span><span class="n">retval</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">numevents</span> <span class="o">=</span> <span class="n">retval</span><span class="p">;</span>
<span class="n">fired_events</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">numevents</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">numevents</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">e</span> <span class="o">=</span> <span class="n">events</span> <span class="o">+</span> <span class="n">j</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLIN</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">EVENT_READABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLOUT</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">EVENT_WRITABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLERR</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">EVENT_WRITABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLHUP</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">EVENT_WRITABLE</span><span class="p">;</span>
<span class="c1">// record the events that fired
</span> <span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span> <span class="o">=</span> <span class="n">e</span><span class="o">-></span><span class="n">data</span><span class="p">.</span><span class="n">fd</span><span class="p">;</span>
<span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">mask</span> <span class="o">=</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">numevents</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
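<p>Note that process_events handles three kinds of events: read/write events on fds, time events, and externally added events.
While handling fds, if no fd is ready the call waits until the timeout expires (the timeout never exceeds the time remaining until the next time event). During that wait,
there are two cases in which the thread must be woken up: a time event with an earlier deadline has been added, or an external event has been added.</p>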
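<p>The rule that the fd wait never runs past the next time event can be written down as a small function. This is a sketch under assumptions: TimeEvents and wait_budget_us are illustrative names, and the real EventCenter keeps its timers in its own structures.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified model of the time events held by an event center:
// deadline (microseconds on some monotonic clock) -> event id.
using TimeEvents = std::multimap<uint64_t, int>;

// How long the fd wait may block, in microseconds.
// It never exceeds max_wait_us and never runs past the nearest deadline.
uint64_t wait_budget_us(const TimeEvents& timers, uint64_t now_us,
                        uint64_t max_wait_us)
{
  if (timers.empty())
    return max_wait_us;               // nothing scheduled: use the full wait
  uint64_t nearest = timers.begin()->first;  // multimap is ordered by deadline
  if (nearest <= now_us)
    return 0;                         // a time event is already due: do not block
  return std::min(max_wait_us, nearest - now_us);
}
```

<p>With a 30-second maximum and a timer due 5 ms from now, the budget is 5 ms. If another thread then adds a timer due in 1 ms, a worker already sleeping with the old budget would oversleep, which is the first of the two wake-up cases described above.</p>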
<h3 id="add-listen-fd">Add Listen Fd</h3>
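<p>The Worker thread loops over events without stopping: it calls epoll_wait, gets back the fds that are ready, and invokes the read_cb or write_cb registered for each of them. Clearly, any fd that epoll_wait can return
must have been added to epoll earlier. When was that done? Recall that in the Bind step the Processor created listen_sd; to receive requests on this fd, it has to be added to epoll.</p>
<p>Yet nothing in the osd code up to this point seems to add it. Right after the osd calls messenger->start() comes:</p>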
<blockquote>
<p>err = osd->init();</p>
</blockquote>
<p>This is where the trick lies:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// i'm ready!
</span> <span class="n">client_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="n">cluster_messenger</span><span class="o">-></span><span class="n">add_dispatcher_head</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">add_dispatcher_head</span><span class="p">(</span><span class="n">Dispatcher</span> <span class="o">*</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">bool</span> <span class="n">first</span> <span class="o">=</span> <span class="n">dispatchers</span><span class="p">.</span><span class="n">empty</span><span class="p">();</span> <span class="c1">// the list starts out empty, so first is true
</span> <span class="n">dispatchers</span><span class="p">.</span><span class="n">push_front</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">ms_can_fast_dispatch_any</span><span class="p">())</span>
<span class="n">fast_dispatchers</span><span class="p">.</span><span class="n">push_front</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">first</span><span class="p">)</span>
<span class="n">ready</span><span class="p">();</span> <span class="c1">// get ready to add the fd to epoll
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">ready</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">get_myaddr</span><span class="p">()</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="n">pool</span><span class="o">-></span><span class="n">get_worker</span><span class="p">();</span> <span class="c1">// pick a worker to do the job
</span> <span class="n">processor</span><span class="p">.</span><span class="n">start</span><span class="p">(</span><span class="n">w</span><span class="p">);</span> <span class="c1">// listen_sd lives in the Processor
</span><span class="p">}</span>
<span class="kt">int</span> <span class="n">Processor</span><span class="o">::</span><span class="n">start</span><span class="p">(</span><span class="n">Worker</span> <span class="o">*</span><span class="n">w</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">msgr</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="c1">// start thread
</span> <span class="k">if</span> <span class="p">(</span><span class="n">listen_sd</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">worker</span> <span class="o">=</span> <span class="n">w</span><span class="p">;</span>
<span class="c1">// create a readable event; this ends up calling epoll_ctl to put listen_sd under epoll's management
</span> <span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">create_file_event</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="n">EVENT_READABLE</span><span class="p">,</span>
<span class="n">EventCallbackRef</span><span class="p">(</span><span class="k">new</span> <span class="n">C_processor_accept</span><span class="p">(</span><span class="k">this</span><span class="p">)));</span> <span class="c1">// note the callback attached to the event
</span> <span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
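<p>What create_file_event boils down to, registering an fd with epoll_ctl and remembering its callback, can be sketched in a few lines. MiniCenter below is a hypothetical stand-in for illustration, not the Ceph EventCenter, and it handles readable events only.</p>

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical miniature of an event center (not the Ceph class): it owns an
// epoll instance and remembers one read callback per registered fd.
class MiniCenter {
  int epfd;
  std::map<int, std::function<void(int)>> read_cbs;

public:
  MiniCenter() : epfd(::epoll_create1(0)) {}
  ~MiniCenter() { if (epfd >= 0) ::close(epfd); }
  MiniCenter(const MiniCenter&) = delete;
  MiniCenter& operator=(const MiniCenter&) = delete;

  // Register fd for readability; returns 0 on success, -1 on error.
  int add_readable(int fd, std::function<void(int)> cb) {
    epoll_event ee{};
    ee.events = EPOLLIN;
    ee.data.fd = fd;
    if (::epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ee) < 0)
      return -1;
    read_cbs[fd] = std::move(cb);
    return 0;
  }

  // Wait up to timeout_ms, then run the callback of every readable fd.
  // Returns the number of callbacks invoked, or -1 on error.
  int process_once(int timeout_ms) {
    epoll_event events[16];
    int n = ::epoll_wait(epfd, events, 16, timeout_ms);
    if (n < 0)
      return -1;
    int ran = 0;
    for (int i = 0; i < n; ++i) {
      auto it = read_cbs.find(events[i].data.fd);
      if (it != read_cbs.end() && (events[i].events & EPOLLIN)) {
        it->second(events[i].data.fd);
        ++ran;
      }
    }
    return ran;
  }
};
```

<p>Registering a listening socket with add_readable and a callback that calls accept gives the same shape as Processor::start: until the fd has been added, process_once can never report it, no matter how many clients are trying to connect.</p>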
<h3 id="accept-connection">Accept Connection</h3>
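<p>Once the listen fd has been added, initialization is complete. When a new connection request arrives, the worker thread calls process_events as described above, and the callback runs:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// callback for the listen fd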
</span><span class="k">class</span> <span class="nc">C_processor_accept</span> <span class="o">:</span> <span class="k">public</span> <span class="n">EventCallback</span> <span class="p">{</span>
<span class="n">Processor</span> <span class="o">*</span><span class="n">pro</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">C_processor_accept</span><span class="p">(</span><span class="n">Processor</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span><span class="o">:</span> <span class="n">pro</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">do_request</span><span class="p">(</span><span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">pro</span><span class="o">-></span><span class="n">accept</span><span class="p">();</span> <span class="c1">// the callback just calls Processor::accept
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">Processor</span><span class="o">::</span><span class="n">accept</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">errors</span> <span class="o"><</span> <span class="mi">4</span><span class="p">)</span> <span class="p">{</span>
<span class="n">entity_addr_t</span> <span class="n">addr</span><span class="p">;</span>
<span class="n">socklen_t</span> <span class="n">slen</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">addr</span><span class="p">.</span><span class="n">ss_addr</span><span class="p">());</span>
<span class="kt">int</span> <span class="n">sd</span> <span class="o">=</span> <span class="o">::</span><span class="n">accept</span><span class="p">(</span><span class="n">listen_sd</span><span class="p">,</span> <span class="p">(</span><span class="n">sockaddr</span><span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">addr</span><span class="p">.</span><span class="n">ss_addr</span><span class="p">(),</span> <span class="o">&</span><span class="n">slen</span><span class="p">);</span> <span class="c1">// accept the connection request
</span> <span class="k">if</span> <span class="p">(</span><span class="n">sd</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">msgr</span><span class="o">-></span><span class="n">add_accept</span><span class="p">(</span><span class="n">sd</span><span class="p">);</span> <span class="c1">// hand the accepted socket sd to the messenger
</span> <span class="k">continue</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">AsyncConnectionRef</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">add_accept</span><span class="p">(</span><span class="kt">int</span> <span class="n">sd</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="n">pool</span><span class="o">-></span><span class="n">get_worker</span><span class="p">();</span>
<span class="n">AsyncConnectionRef</span> <span class="n">conn</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AsyncConnection</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">);</span> <span class="c1">// create the connection
</span> <span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">.</span><span class="n">dispatch_event_external</span><span class="p">(</span><span class="n">EventCallbackRef</span><span class="p">(</span><span class="k">new</span> <span class="n">C_conn_accept</span><span class="p">(</span><span class="n">conn</span><span class="p">,</span> <span class="n">sd</span><span class="p">)));</span> <span class="c1">// dispatch the event; it comes from a new outside connection, hence "external"
</span> <span class="n">accepting_conns</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">conn</span><span class="p">);</span> <span class="c1">// remember the connection that is about to become active; it is removed from this set once setup completes
</span> <span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">return</span> <span class="n">conn</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">EventCenter</span><span class="o">::</span><span class="n">dispatch_event_external</span><span class="p">(</span><span class="n">EventCallbackRef</span> <span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">external_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">external_events</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> <span class="c1">// queue the event's callback in the event center, to be executed later
</span> <span class="n">external_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="n">wakeup</span><span class="p">();</span> <span class="c1">// wake up the worker thread
</span><span class="p">}</span>
</code></pre></div></div>
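<p>It is not entirely clear to me why the callback has to be queued for the worker's next process_events call. Could it be executed directly instead?</p>
<p>Either way, once it is queued the callback still has to run. When does that happen? Clearly inside process_events on the worker thread.
But the worker thread may be sleeping in epoll_wait (none of the fds managed by epoll is ready, so it can only wait for the timeout), whereas a new connection should be accepted immediately.
The sleeping worker therefore has to be woken up, which is the purpose of the wakeup call that follows. It writes one byte to one end of a pipe (the pipe is created in EventCenter::init()),
which makes the other end readable: notify_receive_fd becomes ready, epoll_wait returns its readable event, and its callback runs (the callback merely reads the pipe). The worker thread can then carry on
and execute the callback that was just queued.</p>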
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">EventCenter</span><span class="o">::</span><span class="n">wakeup</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'c'</span><span class="p">;</span>
<span class="c1">// wake up "event_wait"
</span> <span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">write</span><span class="p">(</span><span class="n">notify_send_fd</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// wake up the worker thread
</span> <span class="c1">// FIXME ?
</span> <span class="n">assert</span><span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">EventCenter</span><span class="o">::</span><span class="n">process_events</span><span class="p">(</span><span class="kt">int</span> <span class="n">timeout_microseconds</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">numevents</span> <span class="o">=</span> <span class="n">driver</span><span class="o">-></span><span class="n">event_wait</span><span class="p">(</span><span class="n">fired_events</span><span class="p">,</span> <span class="o">&</span><span class="n">tv</span><span class="p">);</span> <span class="c1">// the worker thread may be sleeping here; wakeup() wakes it
</span>
<span class="c1">// at this point at least one fd is ready, namely notify_receive_fd
</span> <span class="c1">// run the callbacks of all ready fds; the one for notify_receive_fd just reads the pipe and does nothing else
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">numevents</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">event</span><span class="o">-></span><span class="n">read_cb</span><span class="o">-></span><span class="n">do_request</span><span class="p">(</span><span class="n">fired_events</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">);</span>
<span class="p">.....</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="c1">// then drain the external event queue, which is exactly why the worker was woken up
</span> <span class="p">{</span>
<span class="n">external_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">external_events</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">EventCallbackRef</span> <span class="n">e</span> <span class="o">=</span> <span class="n">external_events</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
<span class="n">external_events</span><span class="p">.</span><span class="n">pop_front</span><span class="p">();</span>
<span class="n">external_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="p">)</span>
<span class="n">e</span><span class="o">-></span><span class="n">do_request</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span> <span class="c1">// callback for the connection request
</span> <span class="n">external_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">external_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
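<p>The wakeup-plus-queue pattern in the excerpt above can be reproduced in a few lines. The following is a minimal sketch, not Ceph code: MiniEventCenter and its members are hypothetical names, poll stands in for the epoll driver, and a plain pipe plays the role of notify_send_fd/notify_receive_fd.</p>

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <poll.h>
#include <unistd.h>

// Hypothetical miniature of the pattern: one pipe used only for wakeups,
// plus a mutex-protected queue of callbacks submitted by other threads.
class MiniEventCenter {
  int notify_receive_fd = -1;
  int notify_send_fd = -1;
  std::mutex external_lock;
  std::deque<std::function<void()>> external_events;

public:
  MiniEventCenter() {
    int fds[2];
    int r = pipe(fds);
    assert(r == 0);
    (void)r;
    notify_receive_fd = fds[0];
    notify_send_fd = fds[1];
  }
  ~MiniEventCenter() {
    close(notify_receive_fd);
    close(notify_send_fd);
  }

  // Called from any thread: queue the callback, then write one byte so
  // that a worker blocked in poll() returns.
  void dispatch_event_external(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> l(external_lock);
      external_events.push_back(std::move(cb));
    }
    char buf = 'c';
    ssize_t n = write(notify_send_fd, &buf, 1);
    assert(n == 1);
    (void)n;
  }

  // Called by the worker: wait for the pipe to become readable, drain it,
  // then run the queued callbacks. Returns how many callbacks ran.
  int process_events(int timeout_ms) {
    pollfd pfd{notify_receive_fd, POLLIN, 0};
    if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
      char buf[256];
      ssize_t n = read(notify_receive_fd, buf, sizeof(buf));  // the read is the whole "callback"
      (void)n;
    }
    int ran = 0;
    std::unique_lock<std::mutex> l(external_lock);
    while (!external_events.empty()) {
      std::function<void()> cb = std::move(external_events.front());
      external_events.pop_front();
      l.unlock();  // do not hold the queue lock while running user code
      if (cb) {
        cb();
        ++ran;
      }
      l.lock();
    }
    return ran;
  }
};
```

<p>As in the original, the lock is dropped around each callback so that a callback may itself queue further external events without deadlocking.</p>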
<h3 id="add-accept-fd">Add Accept Fd</h3>
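<p>As the analysis above shows, the connection-request callback runs almost immediately. accept has already returned an fd for the new connection; that fd now has to be added to the epoll set so that it can be managed and used for communication.
That is ultimately all the callback does:</p>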
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// the callback type placed in the queue
</span><span class="k">class</span> <span class="nc">C_conn_accept</span> <span class="o">:</span> <span class="k">public</span> <span class="n">EventCallback</span> <span class="p">{</span>
<span class="n">AsyncConnectionRef</span> <span class="n">conn</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">C_conn_accept</span><span class="p">(</span><span class="n">AsyncConnectionRef</span> <span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="n">s</span><span class="p">)</span><span class="o">:</span> <span class="n">conn</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="n">fd</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">do_request</span><span class="p">(</span><span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">conn</span><span class="o">-></span><span class="n">accept</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">accept</span><span class="p">(</span><span class="kt">int</span> <span class="n">incoming</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">async_msgr</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" sd="</span> <span class="o"><<</span> <span class="n">incoming</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">assert</span><span class="p">(</span><span class="n">sd</span> <span class="o"><</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">sd</span> <span class="o">=</span> <span class="n">incoming</span><span class="p">;</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_ACCEPTING</span><span class="p">;</span>
<span class="n">center</span><span class="o">-></span><span class="n">create_file_event</span><span class="p">(</span><span class="n">sd</span><span class="p">,</span> <span class="n">EVENT_READABLE</span><span class="p">,</span> <span class="n">read_handler</span><span class="p">);</span> <span class="c1">// sd is the accepted fd; register it with epoll
</span> <span class="n">process</span><span class="p">();</span> <span class="c1">// start the server-side state machine, which first sends the BANNER message to the client
</span><span class="p">}</span>
</code></pre></div></div>
<h3 id="communication">Communication</h3>
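<p>Note that the server-side AsyncConnection state machine starts in STATE_ACCEPTING, and the server first sends a BANNER message to the client.
From then on, whenever data arrives, the worker thread calls read_handler, which calls process, and the state machine keeps moving from state to state:</p>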
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// the registered callback class
</span><span class="k">class</span> <span class="nc">C_handle_read</span> <span class="o">:</span> <span class="k">public</span> <span class="n">EventCallback</span> <span class="p">{</span>
<span class="n">AsyncConnectionRef</span> <span class="n">conn</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">C_handle_read</span><span class="p">(</span><span class="n">AsyncConnectionRef</span> <span class="n">c</span><span class="p">)</span><span class="o">:</span> <span class="n">conn</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">do_request</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd_or_id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">conn</span><span class="o">-></span><span class="n">process</span><span class="p">();</span> <span class="c1">// hand the event to the connection
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">process</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prev_state</span> <span class="o">=</span> <span class="n">state</span><span class="p">;</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">prev_state</span> <span class="o">=</span> <span class="n">state</span><span class="p">;</span>
<span class="c1">// connection state machine
</span> <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">STATE_OPEN</span><span class="p">:</span>
<span class="p">......</span>
<span class="k">default</span><span class="o">:</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_process_connection</span><span class="p">()</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
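<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">prev_state</span> <span class="o">!=</span> <span class="n">state</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>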
<span class="n">fail</span><span class="o">:</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// connection setup states are handled separately
</span><span class="kt">int</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">_process_connection</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">switch</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">STATE_WAIT_SEND</span><span class="p">:</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
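<p>AsyncConnection is the class responsible for communication. To understand how its state machine works you need to understand Ceph's application-layer wire protocol;
see the explanation in the <a href="http://docs.ceph.com/docs/master/dev/network-protocol/">official documentation</a>.</p>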
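<p>The shape of process, a switch inside a loop that repeats for as long as the state keeps changing, can be illustrated with a toy connection. This is only a sketch under stated assumptions: the State values, MiniConnection, and the string "messages" are hypothetical and far simpler than the real protocol states.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical states; the real AsyncConnection has many more.
enum class State { Accepting, WaitBanner, Open, Closed };

struct MiniConnection {
  State state = State::Accepting;
  std::vector<std::string> inbox;      // stands in for data readable on the socket
  std::vector<std::string> sent;       // what this side has written
  std::vector<std::string> delivered;  // messages handed to the application

  // One call corresponds to one invocation of the read handler.
  void process() {
    State prev_state;
    do {
      prev_state = state;
      switch (state) {
      case State::Accepting:
        sent.push_back("BANNER");      // the accepting side speaks first
        state = State::WaitBanner;
        break;
      case State::WaitBanner:
        if (inbox.empty())
          break;                       // nothing to read: state unchanged, loop ends
        state = (inbox.front() == "BANNER") ? State::Open : State::Closed;
        inbox.erase(inbox.begin());
        break;
      case State::Open:
        while (!inbox.empty()) {       // deliver everything currently readable
          delivered.push_back(inbox.front());
          inbox.erase(inbox.begin());
        }
        break;
      case State::Closed:
        break;
      }
    } while (prev_state != state);     // keep going only while progress is made
  }
};
```

<p>Because the loop exits as soon as a pass leaves the state unchanged, a handler that lacks input simply returns and is re-entered on the next readable event.</p>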
<p>That completes the tour of the AsyncMessenger framework. Every time a new connection request arrives, the following steps are repeated:</p>
<ul>
<li>
<p>accept connection</p>
</li>
<li>
<p>add accept fd</p>
</li>
<li>
<p>communication</p>
</li>
</ul>
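<p>This shows that the number of threads does not grow linearly with the number of connections; it is determined solely by how many workers were started at initialization.</p>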
<h1 id="client">Client</h1>
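<p>On the client side the main job is to initiate a connect, establish the connection, and communicate. Every client is built on the librados library and connects to the cluster through RadosClient:</p>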
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">Rados</span><span class="o">::</span><span class="n">connect</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">client</span><span class="o">-></span><span class="n">connect</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">RadosClient</span><span class="o">::</span><span class="n">connect</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// create the messenger
</span> <span class="n">messenger</span> <span class="o">=</span> <span class="n">Messenger</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">ms_type</span><span class="p">,</span> <span class="n">entity_name_t</span><span class="o">::</span><span class="n">CLIENT</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span>
<span class="s">"radosclient"</span><span class="p">,</span> <span class="n">nonce</span><span class="p">);</span>
<span class="p">......</span>
<span class="c1">// create the objecter
</span> <span class="c1">// code that sends messages, librbd for example, goes through the objecter
</span> <span class="c1">// the objecter relies on the messenger to send, so the messenger created above is passed to it
</span> <span class="n">objecter</span> <span class="o">=</span> <span class="k">new</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">nothrow</span><span class="p">)</span> <span class="n">Objecter</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">messenger</span><span class="p">,</span> <span class="o">&</span><span class="n">monclient</span><span class="p">,</span>
<span class="o">&</span><span class="n">finisher</span><span class="p">,</span>
<span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rados_mon_op_timeout</span><span class="p">,</span>
<span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rados_osd_op_timeout</span><span class="p">);</span>
<span class="c1">// likewise, talking to the monitor requires sending and receiving messages
</span> <span class="n">monclient</span><span class="p">.</span><span class="n">set_messenger</span><span class="p">(</span><span class="n">messenger</span><span class="p">);</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">init</span><span class="p">();</span>
<span class="n">messenger</span><span class="o">-></span><span class="n">add_dispatcher_tail</span><span class="p">(</span><span class="n">objecter</span><span class="p">);</span>
<span class="n">messenger</span><span class="o">-></span><span class="n">add_dispatcher_tail</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="n">messenger</span><span class="o">-></span><span class="n">start</span><span class="p">();</span>
<span class="p">......</span>
<span class="n">messenger</span><span class="o">-></span><span class="n">set_myname</span><span class="p">(</span><span class="n">entity_name_t</span><span class="o">::</span><span class="n">CLIENT</span><span class="p">(</span><span class="n">monclient</span><span class="p">.</span><span class="n">get_global_id</span><span class="p">()));</span> <span class="c1">// the ID is globally unique, so it has to be obtained from the monitor
</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
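<p>The connect operation only initializes the messenger object; a connection is actually established only when communication is needed. Take _op_submit in Objecter.cc as an example:</p>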
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ceph_tid_t</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_op_submit</span><span class="p">(</span><span class="n">Op</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">&</span> <span class="n">lc</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_get_session</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">osd</span><span class="p">,</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_get_session</span><span class="p">(</span><span class="kt">int</span> <span class="n">osd</span><span class="p">,</span> <span class="n">OSDSession</span> <span class="o">**</span><span class="n">session</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">&</span> <span class="n">lc</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// no session exists yet, so a new one is created
</span> <span class="n">s</span><span class="o">-></span><span class="n">con</span> <span class="o">=</span> <span class="n">messenger</span><span class="o">-></span><span class="n">get_connection</span><span class="p">(</span><span class="n">osdmap</span><span class="o">-></span><span class="n">get_inst</span><span class="p">(</span><span class="n">osd</span><span class="p">));</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">ConnectionRef</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">get_connection</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_inst_t</span><span class="o">&</span> <span class="n">dest</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">create_connect</span><span class="p">(</span><span class="n">dest</span><span class="p">.</span><span class="n">addr</span><span class="p">,</span> <span class="n">dest</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">type</span><span class="p">());</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="n">AsyncConnectionRef</span> <span class="n">AsyncMessenger</span><span class="o">::</span><span class="n">create_connect</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span><span class="o">&</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">type</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// create connection
</span> <span class="n">Worker</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="n">pool</span><span class="o">-></span><span class="n">get_worker</span><span class="p">();</span>
<span class="n">AsyncConnectionRef</span> <span class="n">conn</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AsyncConnection</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="k">this</span><span class="p">,</span> <span class="o">&</span><span class="n">w</span><span class="o">-></span><span class="n">center</span><span class="p">);</span> <span class="c1">// create the connection
</span> <span class="n">conn</span><span class="o">-></span><span class="n">connect</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">type</span><span class="p">);</span> <span class="c1">// connect
</span> <span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">conns</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">addr</span><span class="p">));</span>
<span class="n">conns</span><span class="p">[</span><span class="n">addr</span><span class="p">]</span> <span class="o">=</span> <span class="n">conn</span><span class="p">;</span>
<span class="k">return</span> <span class="n">conn</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">connect</span><span class="p">(</span><span class="k">const</span> <span class="n">entity_addr_t</span><span class="o">&</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">type</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">set_peer_type</span><span class="p">(</span><span class="n">type</span><span class="p">);</span>
<span class="n">set_peer_addr</span><span class="p">(</span><span class="n">addr</span><span class="p">);</span>
<span class="n">policy</span> <span class="o">=</span> <span class="n">msgr</span><span class="o">-></span><span class="n">get_policy</span><span class="p">(</span><span class="n">type</span><span class="p">);</span>
<span class="n">_connect</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">_connect</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_CONNECTING</span><span class="p">;</span> <span class="c1">// this initial value matters: it is the starting state of the client-side state machine
</span> <span class="n">stopping</span><span class="p">.</span><span class="n">set</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="n">center</span><span class="o">-></span><span class="n">dispatch_event_external</span><span class="p">(</span><span class="n">read_handler</span><span class="p">);</span> <span class="c1">// queue the handler and wait for a worker to process it
</span><span class="p">}</span>
</code></pre></div></div>
<p>As before, a worker picks up this external event, read_handler calls process, and control then passes to _process_connection:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">AsyncConnection</span><span class="o">::</span><span class="n">_process_connection</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">switch</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">STATE_CONNECTING</span><span class="p">:</span> <span class="c1">// initial state
</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">sd</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">get_peer_addr</span><span class="p">());</span> <span class="c1">// the net helper ultimately issues the connect system call to set up the socket
</span>
<span class="c1">// once connected, add the socket fd to epoll so it is managed
</span> <span class="n">center</span><span class="o">-></span><span class="n">create_file_event</span><span class="p">(</span><span class="n">sd</span><span class="p">,</span> <span class="n">EVENT_READABLE</span><span class="p">,</span> <span class="n">read_handler</span><span class="p">);</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">STATE_CONNECTING_WAIT_BANNER</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
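<p>From here on the client and the server communicate entirely through the AsyncConnection state machine. Likewise, even if a client creates several messengers,
they still share one WorkerPool; the thread count is fixed when that pool is initialized and does not grow linearly as connections are added.</p>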
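<p>The two properties described for the client, connections created lazily and cached per address, and a worker count that never changes, can be sketched as follows. All names here (MiniWorker, MiniConn, MiniMessenger) are hypothetical, and the round-robin choice of worker is an assumption made for illustration; the source does not say how get_worker picks one.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are Ceph types.
struct MiniWorker {
  int id;
  int connections = 0;
  explicit MiniWorker(int i) : id(i) {}
};

struct MiniConn {
  std::string addr;
  MiniWorker* worker;  // the worker whose event loop owns this connection
};

class MiniMessenger {
  std::vector<MiniWorker> workers;  // sized once, never grows
  std::size_t next_worker = 0;
  std::map<std::string, std::shared_ptr<MiniConn>> conns;

public:
  explicit MiniMessenger(int num_workers) {
    assert(num_workers > 0);
    workers.reserve(num_workers);
    for (int i = 0; i < num_workers; ++i)
      workers.emplace_back(i);
  }

  // Return the cached connection for addr, creating it on first use.
  std::shared_ptr<MiniConn> get_connection(const std::string& addr) {
    auto it = conns.find(addr);
    if (it != conns.end())
      return it->second;
    MiniWorker* w = &workers[next_worker];  // round-robin: an assumption
    next_worker = (next_worker + 1) % workers.size();
    ++w->connections;
    auto conn = std::make_shared<MiniConn>(MiniConn{addr, w});
    conns[addr] = conn;
    return conn;
  }

  std::size_t worker_count() const { return workers.size(); }
  std::size_t connection_count() const { return conns.size(); }
};
```

<p>With two workers and three destinations, the first and third connections end up on the same worker, while asking again for an existing address returns the cached object.</p>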
<h1 id="summary">Summary</h1>
<ul>
<li>
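<p>All AsyncMessenger instances in a process share one WorkerPool, which manages all the workers</p>
</li>
<li>
<p>Worker threads handle the actual events through EventCenter</p>
</li>
<li>
<p>Application-layer network communication is handled by the AsyncConnection state machine</p>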
</li>
</ul>
Ceph Throttle Summary
2015-12-18T00:00:00+00:00
http://blog.wjin.org/posts/ceph-throttle-summary
<h1 id="overview">Overview</h1>
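<p>Ceph's I/O stack is fairly deep. Like an assembly line, an I/O operation passes through many queues, each drained by a dedicated thread or thread pool, before the object data finally lands on disk.
Many of the key parameters along the way, such as queue depth and thread pool size, are configurable. Ceph also provides a mechanism to throttle each component,
so that operations do not all pile up at one stage of the pipeline and intensify contention.</p>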
<p>The Ceph source contains four basic throttle implementations, analyzed one by one below.</p>
<h1 id="simplethrottle">SimpleThrottle</h1>
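<p>This is the simplest one. It keeps a max value and a current count. Acquiring a resource increments the count; if the count has already reached max, the caller blocks. When the resource is released, the count is decremented.
The implementation lives in src/common/Throttle.h and src/common/Throttle.cc; see the class SimpleThrottle:</p>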
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// class declaration (header)
</span><span class="k">class</span> <span class="nc">SimpleThrottle</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">SimpleThrottle</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">max</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">ignore_enoent</span><span class="p">);</span>
<span class="o">~</span><span class="n">SimpleThrottle</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">start_op</span><span class="p">();</span> <span class="c1">// count + 1; blocks once the maximum is reached
</span> <span class="kt">void</span> <span class="n">end_op</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">);</span> <span class="c1">// count - 1
</span> <span class="kt">bool</span> <span class="n">pending_error</span><span class="p">()</span> <span class="k">const</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">wait_for_ret</span><span class="p">();</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">mutable</span> <span class="n">Mutex</span> <span class="n">m_lock</span><span class="p">;</span>
<span class="n">Cond</span> <span class="n">m_cond</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">m_max</span><span class="p">;</span> <span class="c1">// maximum number of concurrent operations
</span> <span class="kt">uint64_t</span> <span class="n">m_current</span><span class="p">;</span> <span class="c1">// current number of concurrent operations
</span> <span class="kt">int</span> <span class="n">m_ret</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">m_ignore_enoent</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// simple callback class that increments and decrements the count automatically
</span><span class="k">class</span> <span class="nc">C_SimpleThrottle</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Context</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">C_SimpleThrottle</span><span class="p">(</span><span class="n">SimpleThrottle</span> <span class="o">*</span><span class="n">throttle</span><span class="p">)</span> <span class="o">:</span> <span class="n">m_throttle</span><span class="p">(</span><span class="n">throttle</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_throttle</span><span class="o">-></span><span class="n">start_op</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_throttle</span><span class="o">-></span><span class="n">end_op</span><span class="p">(</span><span class="n">r</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">SimpleThrottle</span> <span class="o">*</span><span class="n">m_throttle</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// member function implementations
// acquire a resource
</span><span class="kt">void</span> <span class="n">SimpleThrottle</span><span class="o">::</span><span class="n">start_op</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_lock</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">m_max</span> <span class="o">==</span> <span class="n">m_current</span><span class="p">)</span> <span class="c1">// block and wait
</span> <span class="n">m_cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">m_lock</span><span class="p">);</span>
<span class="o">++</span><span class="n">m_current</span><span class="p">;</span> <span class="c1">// increment the count
</span><span class="p">}</span>
<span class="c1">// release a resource
</span><span class="kt">void</span> <span class="n">SimpleThrottle</span><span class="o">::</span><span class="n">end_op</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_lock</span><span class="p">);</span>
<span class="o">--</span><span class="n">m_current</span><span class="p">;</span> <span class="c1">// decrement the count
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">&&</span> <span class="o">!</span><span class="n">m_ret</span> <span class="o">&&</span> <span class="o">!</span><span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="n">ENOENT</span> <span class="o">&&</span> <span class="n">m_ignore_enoent</span><span class="p">))</span>
<span class="n">m_ret</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
<span class="n">m_cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This simple throttle is used mainly in librbd, for example for image rollback, export, and import.</p>
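<p>For readers without the Ceph tree at hand, the same counting throttle can be written against the standard library. This is a minimal sketch, not the Ceph class: MiniSimpleThrottle and in_flight are hypothetical names, std::mutex and std::condition_variable replace Mutex and Cond, and the body of wait_for_ret (block until nothing is in flight, then return the first recorded error) is an assumption, since the source only shows its declaration.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical standard-library rendition of the counting throttle.
class MiniSimpleThrottle {
  std::mutex m_lock;
  std::condition_variable m_cond;
  const std::uint64_t m_max;    // maximum number of concurrent operations
  std::uint64_t m_current = 0;  // operations currently in flight
  int m_ret = 0;                // first error reported by any operation
  const bool m_ignore_enoent;

public:
  MiniSimpleThrottle(std::uint64_t max, bool ignore_enoent)
    : m_max(max), m_ignore_enoent(ignore_enoent) {}

  // Acquire one slot; blocks while all m_max slots are taken.
  void start_op() {
    std::unique_lock<std::mutex> l(m_lock);
    m_cond.wait(l, [this] { return m_current < m_max; });
    ++m_current;
  }

  // Release one slot and remember the first error, optionally ignoring ENOENT.
  void end_op(int r) {
    std::lock_guard<std::mutex> l(m_lock);
    --m_current;
    if (r < 0 && !m_ret && !(r == -ENOENT && m_ignore_enoent))
      m_ret = r;
    // notify_all because start_op and wait_for_ret wait on different predicates
    m_cond.notify_all();
  }

  // Assumed behaviour: wait until nothing is in flight, then report the error.
  int wait_for_ret() {
    std::unique_lock<std::mutex> l(m_lock);
    m_cond.wait(l, [this] { return m_current == 0; });
    return m_ret;
  }

  std::uint64_t in_flight() {
    std::lock_guard<std::mutex> l(m_lock);
    return m_current;
  }
};
```

<p>A typical caller invokes start_op before issuing each asynchronous operation, end_op from its completion callback, and wait_for_ret once at the end to collect the overall result.</p>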
<h1 id="throttle">Throttle</h1>
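<p>Unlike SimpleThrottle, the Throttle class lets a single request acquire several resources at once instead of exactly one,
and when resources run short it queues the waiting requests in FIFO order. The implementation is also in src/common/Throttle.h and src/common/Throttle.cc;
see the class Throttle:</p>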
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Throttle</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">atomic_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">max</span><span class="p">;</span> <span class="c1">// current count and maximum
</span> <span class="n">Mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="n">list</span><span class="o"><</span><span class="n">Cond</span><span class="o">*></span> <span class="n">cond</span><span class="p">;</span> <span class="c1">// FIFO wait queue
</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Throttle</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">n</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">_use_perf</span> <span class="o">=</span> <span class="nb">true</span><span class="p">);</span>
<span class="o">~</span><span class="n">Throttle</span><span class="p">();</span>
<span class="k">private</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">_reset_max</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">m</span><span class="p">);</span>
<span class="kt">bool</span> <span class="nf">_should_wait</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// would taking c exceed the limit?
</span> <span class="kt">int64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">max</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
<span class="kt">int64_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">count</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
<span class="k">return</span>
<span class="n">m</span> <span class="o">&&</span>
<span class="p">((</span><span class="n">c</span> <span class="o"><=</span> <span class="n">m</span> <span class="o">&&</span> <span class="n">cur</span> <span class="o">+</span> <span class="n">c</span> <span class="o">></span> <span class="n">m</span><span class="p">)</span> <span class="o">||</span> <span class="c1">// normally stay under max
</span> <span class="p">(</span><span class="n">c</span> <span class="o">>=</span> <span class="n">m</span> <span class="o">&&</span> <span class="n">cur</span> <span class="o">></span> <span class="n">m</span><span class="p">));</span> <span class="c1">// except for large c
</span> <span class="p">}</span>
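<span class="c1">// core function: a caller that must block creates its own condition variable, appends it to the list, and waits to be signalled
</span> <span class="c1">// repeated calls build up a list of blocked waiters, which are woken in FIFO order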
</span> <span class="kt">bool</span> <span class="n">_wait</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span><span class="p">);</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">bool</span> <span class="n">wait</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// set the limit max to m, waiting if necessary for earlier waiters to finish; blocking
</span>
<span class="p">......</span>
<span class="c1">// set the limit max to m, waiting if necessary for earlier waiters to finish
</span> <span class="c1">// block until c units can be taken
</span> <span class="c1">// then increase the current count by c
</span> <span class="kt">bool</span> <span class="n">get</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// acquire resources and increase the count; blocking
</span>
<span class="kt">int64_t</span> <span class="n">put</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// release resources and decrease the count; non-blocking
</span><span class="p">};</span>
</code></pre></div></div>
<p>The core logic lives in a single internal function, _wait; the two commonly used public APIs are get and put:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">Throttle</span><span class="o">::</span><span class="n">_wait</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">utime_t</span> <span class="n">start</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">waited</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_should_wait</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">||</span> <span class="o">!</span><span class="n">cond</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Must wait: over the limit, or earlier waiters are already queued
</span> <span class="n">Cond</span> <span class="o">*</span><span class="n">cv</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Cond</span><span class="p">;</span> <span class="c1">// Create this caller's own condition variable
</span> <span class="n">cond</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">cv</span><span class="p">);</span> <span class="c1">// Append it to the FIFO queue
</span> <span class="k">do</span> <span class="p">{</span>
<span class="n">waited</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">cv</span><span class="o">-></span><span class="n">Wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span> <span class="c1">// Sleep
</span> <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">_should_wait</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">||</span> <span class="n">cv</span> <span class="o">!=</span> <span class="n">cond</span><span class="p">.</span><span class="n">front</span><span class="p">());</span> <span class="c1">// After waking, re-check that enough capacity is available and that this waiter is at the head of the queue, hence the do-while loop
</span>
<span class="k">delete</span> <span class="n">cv</span><span class="p">;</span>
<span class="n">cond</span><span class="p">.</span><span class="n">pop_front</span><span class="p">();</span>
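<span class="c1">// Wake the waiters queued behind this one
</span> <span class="c1">// Reaching this point means this caller's request can be satisfied, so it wakes the next waiter right away instead of waiting until it releases its own count
</span> <span class="c1">// This is because the preceding put may have released a large amount, enough to satisfy several of the queued waiters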
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cond</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span>
<span class="n">cond</span><span class="p">.</span><span class="n">front</span><span class="p">()</span><span class="o">-></span><span class="n">SignalOne</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">waited</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">Throttle</span><span class="o">::</span><span class="n">get</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="mi">0</span> <span class="o">==</span> <span class="n">max</span><span class="p">.</span><span class="n">read</span><span class="p">())</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">waited</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Reset the limit max
</span> <span class="n">assert</span><span class="p">(</span><span class="n">m</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span>
<span class="n">_reset_max</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">waited</span> <span class="o">=</span> <span class="n">_wait</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Wait until the resources are available
</span> <span class="n">count</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Increase the count
</span> <span class="p">}</span>
<span class="k">return</span> <span class="n">waited</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int64_t</span> <span class="n">Throttle</span><span class="o">::</span><span class="n">put</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="mi">0</span> <span class="o">==</span> <span class="n">max</span><span class="p">.</span><span class="n">read</span><span class="p">())</span> <span class="p">{</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cond</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="c1">// Wake the next waiter
</span> <span class="n">cond</span><span class="p">.</span><span class="n">front</span><span class="p">()</span><span class="o">-></span><span class="n">SignalOne</span><span class="p">();</span>
<span class="n">count</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Decrease the count
</span> <span class="p">}</span>
<span class="k">return</span> <span class="n">count</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
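<p>This Throttle implementation is fairly generic: apart from the special-purpose throttles described below, everything else uses the Throttle class for rate limiting.</p>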
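<p>The FIFO wake-up scheme described above can be sketched with the standard library alone. This is a minimal illustration, not Ceph's code: the class name FifoThrottle and the helper would_block are hypothetical, std::mutex and std::condition_variable stand in for Ceph's Mutex and Cond, and resetting max at runtime is left out.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <list>
#include <mutex>

// Hypothetical stand-in for Ceph's Throttle, built on the standard library.
class FifoThrottle {
  std::mutex lock;
  std::list<std::condition_variable*> waiters;  // FIFO queue of blocked callers
  int64_t count = 0;                             // units currently held
  int64_t max;                                   // limit; 0 disables throttling

  // Same rule as _should_wait: normally stay under max, but let a request
  // of size >= max through once the count has dropped to max or below.
  bool should_wait(int64_t c) const {
    return max != 0 &&
           ((c <= max && count + c > max) || (c >= max && count > max));
  }

public:
  explicit FifoThrottle(int64_t m) : max(m) {}

  // Hypothetical helper: would get(c) have to sleep right now?
  bool would_block(int64_t c) {
    std::lock_guard<std::mutex> l(lock);
    return should_wait(c) || !waiters.empty();
  }

  // Acquire c units; returns true if the caller had to sleep.
  bool get(int64_t c = 1) {
    std::unique_lock<std::mutex> l(lock);
    bool waited = false;
    if (should_wait(c) || !waiters.empty()) {
      std::condition_variable cv;  // this caller's own condition variable
      waiters.push_back(&cv);      // queue up behind earlier waiters
      do {
        waited = true;
        cv.wait(l);
      } while (should_wait(c) || &cv != waiters.front());
      waiters.pop_front();
      // Pass the wake-up along: the last put may have freed enough for
      // several queued waiters.
      if (!waiters.empty())
        waiters.front()->notify_one();
    }
    count += c;
    return waited;
  }

  // Release c units and wake the head of the queue; never blocks.
  int64_t put(int64_t c = 1) {
    std::lock_guard<std::mutex> l(lock);
    if (!waiters.empty())
      waiters.front()->notify_one();
    count -= c;
    return count;
  }
};
```

<p>Giving each waiter its own condition variable is what makes strict FIFO ordering possible: only the head of the queue is ever signaled, and a waiter that wakes out of turn goes back to sleep because it is not at the front.</p>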
<h1 id="wbthrottle">WBThrottle</h1>
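<p>WB stands for write back. This throttle was designed specifically for FileStore, mainly to keep FileStore from writing faster than the backing storage device can absorb. For SSD block devices, the feature can be disabled.
WBThrottle runs its own thread and throttles on three dimensions together (number of IOs, number of bytes, and number of objects):</p>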
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">WBThrottle</span> <span class="o">:</span> <span class="n">Thread</span><span class="p">,</span> <span class="k">public</span> <span class="n">md_config_obs_t</span> <span class="p">{</span> <span class="c1">// Derives from the Thread class
</span> <span class="n">ghobject_t</span> <span class="n">clearing</span><span class="p">;</span>
<span class="c1">// <soft, hard>
</span> <span class="c1">// Once any of the three dimensions reaches its soft value, fds start being flushed
</span> <span class="c1">// Once any of the three dimensions reaches its hard value, the throttle kicks in and _do_op blocks
</span> <span class="n">pair</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">></span> <span class="n">size_limits</span><span class="p">;</span> <span class="c1">// Unflushed bytes
</span> <span class="n">pair</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">></span> <span class="n">io_limits</span><span class="p">;</span> <span class="c1">// Unflushed IOs
</span> <span class="n">pair</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">></span> <span class="n">fd_limits</span><span class="p">;</span> <span class="c1">// Unflushed fds, i.e. objects
</span>
<span class="kt">uint64_t</span> <span class="n">cur_ios</span><span class="p">;</span> <span class="c1">// Current number of unflushed IOs
</span> <span class="kt">uint64_t</span> <span class="n">cur_size</span><span class="p">;</span> <span class="c1">// Current number of unflushed bytes
</span>
<span class="c1">// Per-object bookkeeping for pending write-back
</span> <span class="k">class</span> <span class="nc">PendingWB</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">bool</span> <span class="n">nocache</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">size</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">ios</span><span class="p">;</span>
<span class="n">PendingWB</span><span class="p">()</span> <span class="o">:</span> <span class="n">nocache</span><span class="p">(</span><span class="nb">true</span><span class="p">),</span> <span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">ios</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">add</span><span class="p">(</span><span class="kt">bool</span> <span class="n">_nocache</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">_size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">_ios</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">_nocache</span><span class="p">)</span>
<span class="n">nocache</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="c1">// only nocache if all writes are nocache
</span> <span class="n">size</span> <span class="o">+=</span> <span class="n">_size</span><span class="p">;</span>
<span class="n">ios</span> <span class="o">+=</span> <span class="n">_ios</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// LRU implementation
</span> <span class="n">list</span><span class="o"><</span><span class="n">ghobject_t</span><span class="o">></span> <span class="n">lru</span><span class="p">;</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">ghobject_t</span><span class="o">>::</span><span class="n">iterator</span><span class="o">></span> <span class="n">rev_lru</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">remove_object</span><span class="p">(</span><span class="k">const</span> <span class="n">ghobject_t</span> <span class="o">&</span><span class="n">oid</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">list</span><span class="o"><</span><span class="n">ghobject_t</span><span class="o">>::</span><span class="n">iterator</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span>
<span class="n">rev_lru</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">iter</span> <span class="o">==</span> <span class="n">rev_lru</span><span class="p">.</span><span class="n">end</span><span class="p">())</span>
<span class="k">return</span><span class="p">;</span>
<span class="n">lru</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">iter</span><span class="o">-></span><span class="n">second</span><span class="p">);</span>
<span class="n">rev_lru</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">iter</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">ghobject_t</span> <span class="nf">pop_object</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">lru</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="n">ghobject_t</span> <span class="n">oid</span><span class="p">(</span><span class="n">lru</span><span class="p">.</span><span class="n">front</span><span class="p">());</span>
<span class="n">lru</span><span class="p">.</span><span class="n">pop_front</span><span class="p">();</span>
<span class="n">rev_lru</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="k">return</span> <span class="n">oid</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">insert_object</span><span class="p">(</span><span class="k">const</span> <span class="n">ghobject_t</span> <span class="o">&</span><span class="n">oid</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">rev_lru</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">oid</span><span class="p">)</span> <span class="o">==</span> <span class="n">rev_lru</span><span class="p">.</span><span class="n">end</span><span class="p">());</span>
<span class="n">lru</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="n">rev_lru</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="o">--</span><span class="n">lru</span><span class="p">.</span><span class="n">end</span><span class="p">()));</span>
<span class="p">}</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">pair</span><span class="o"><</span><span class="n">PendingWB</span><span class="p">,</span> <span class="n">FDRef</span><span class="o">></span> <span class="o">></span> <span class="n">pending_wbs</span><span class="p">;</span> <span class="c1">// Objects waiting to be flushed
</span>
<span class="c1">// Sleeps while no soft value has been reached
</span> <span class="c1">// Once a soft value is reached, takes one object from the lru so that the operations pending on it can be flushed
</span> <span class="kt">bool</span> <span class="n">get_next_should_flush</span><span class="p">(</span>
<span class="n">boost</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">FDRef</span><span class="p">,</span> <span class="n">PendingWB</span><span class="o">></span> <span class="o">*</span><span class="n">next</span> <span class="c1">///< [out] next to flush
</span> <span class="p">);</span> <span class="c1">///< @return false if we are shutting down
</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">enum</span> <span class="n">FS</span> <span class="p">{</span>
<span class="n">BTRFS</span><span class="p">,</span>
<span class="n">XFS</span>
<span class="p">};</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">FS</span> <span class="n">fs</span><span class="p">;</span>
<span class="kt">void</span> <span class="n">set_from_conf</span><span class="p">();</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">WBThrottle</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">);</span>
<span class="o">~</span><span class="n">WBThrottle</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">start</span><span class="p">();</span> <span class="c1">// Create the thread
</span> <span class="kt">void</span> <span class="n">stop</span><span class="p">();</span> <span class="c1">// Tear down the thread
</span>
<span class="c1">/// Set fs as XFS or BTRFS
</span> <span class="kt">void</span> <span class="nf">set_fs</span><span class="p">(</span><span class="n">FS</span> <span class="n">new_fs</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">fs</span> <span class="o">=</span> <span class="n">new_fs</span><span class="p">;</span>
<span class="n">set_from_conf</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// Records a write to an object in pending_wbs and the lru, where it waits to be flushed
</span> <span class="kt">void</span> <span class="n">queue_wb</span><span class="p">(</span>
<span class="n">FDRef</span> <span class="n">fd</span><span class="p">,</span> <span class="c1">///< [in] FDRef to oid
</span> <span class="k">const</span> <span class="n">ghobject_t</span> <span class="o">&</span><span class="n">oid</span><span class="p">,</span> <span class="c1">///< [in] object
</span> <span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="c1">///< [in] offset written
</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">,</span> <span class="c1">///< [in] length written
</span> <span class="kt">bool</span> <span class="n">nocache</span> <span class="c1">///< [in] try to clear out of cache after write
</span> <span class="p">);</span>
<span class="c1">// Sleeps if any of the three metrics has reached its hard value
</span> <span class="kt">void</span> <span class="n">throttle</span><span class="p">();</span>
<span class="c1">/// md_config_obs_t
</span> <span class="k">const</span> <span class="kt">char</span><span class="o">**</span> <span class="n">get_tracked_conf_keys</span><span class="p">()</span> <span class="k">const</span><span class="p">;</span>
<span class="kt">void</span> <span class="n">handle_conf_change</span><span class="p">(</span><span class="k">const</span> <span class="n">md_config_t</span> <span class="o">*</span><span class="n">conf</span><span class="p">,</span>
<span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">set</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">></span> <span class="o">&</span><span class="n">changed</span><span class="p">);</span>
<span class="c1">// Thread entry point: calls get_next_should_flush to obtain an fd to flush,
</span> <span class="c1">// then calls fdatasync or fsync to flush that object's data
</span> <span class="kt">void</span> <span class="o">*</span><span class="n">entry</span><span class="p">();</span>
<span class="p">};</span>
</code></pre></div></div>
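<p>The lru/rev_lru pair in the class above is a list plus a map from key to list iterator, which gives constant-time removal from the middle of the queue. A minimal sketch of that pattern follows; std::string stands in for ghobject_t, and the class name FlushOrder and the touch helper are hypothetical names used only for illustration.</p>

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical simplification: std::string stands in for ghobject_t.
class FlushOrder {
  std::list<std::string> lru;  // front = least recently written object
  std::unordered_map<std::string, std::list<std::string>::iterator> rev_lru;

public:
  bool empty() const { return lru.empty(); }

  // Drop oid from the queue if present; O(1) thanks to the stored iterator.
  void remove_object(const std::string& oid) {
    auto it = rev_lru.find(oid);
    if (it == rev_lru.end())
      return;
    lru.erase(it->second);
    rev_lru.erase(it);
  }

  // Append oid at the tail; any old entry must have been removed first.
  void insert_object(const std::string& oid) {
    assert(rev_lru.find(oid) == rev_lru.end());
    lru.push_back(oid);
    rev_lru.emplace(oid, std::prev(lru.end()));
  }

  // What a new write to oid amounts to: move it to the tail of the queue.
  void touch(const std::string& oid) {
    remove_object(oid);
    insert_object(oid);
  }

  // Take the least recently written object off the front.
  std::string pop_object() {
    assert(!lru.empty());
    std::string oid = lru.front();
    lru.pop_front();
    rev_lru.erase(oid);
    return oid;
  }
};
```

<p>std::list is a natural fit here because its iterators stay valid when other elements are inserted or erased, so the iterators stored in rev_lru never dangle as long as each one is erased together with its list node.</p>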
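<p>FileStore is the only user of this class. FileStore holds a WBThrottle member, calls its start() function at mount time to launch the thread, and calls its stop() function at umount time to end it.
Throttling happens in FileStore::_do_op, which calls WBThrottle::throttle(). After a chain of calls,
execution reaches FileStore::_write(), which writes to the page cache through the file system interface and then registers the write via WBThrottle::queue_wb() in pending_wbs, where it waits for the thread to flush it,
that is, to issue the fsync/fdatasync system call.</p>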
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Blocks if any of the three dimensions has reached its hard value
</span><span class="kt">void</span> <span class="n">WBThrottle</span><span class="o">::</span><span class="n">throttle</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stopping</span> <span class="o">&&</span> <span class="o">!</span><span class="p">(</span>
<span class="n">cur_ios</span> <span class="o"><</span> <span class="n">io_limits</span><span class="p">.</span><span class="n">second</span> <span class="o">&&</span>
<span class="n">pending_wbs</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o"><</span> <span class="n">fd_limits</span><span class="p">.</span><span class="n">second</span> <span class="o">&&</span>
<span class="n">cur_size</span> <span class="o"><</span> <span class="n">size_limits</span><span class="p">.</span><span class="n">second</span><span class="p">))</span> <span class="p">{</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_do_op</span><span class="p">(</span><span class="n">OpSequencer</span> <span class="o">*</span><span class="n">osr</span><span class="p">,</span> <span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">&</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">wbthrottle</span><span class="p">.</span><span class="n">throttle</span><span class="p">();</span> <span class="c1">// Throttle
</span>
<span class="p">....</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_do_transactions</span><span class="p">(</span><span class="n">o</span><span class="o">-></span><span class="n">tls</span><span class="p">,</span> <span class="n">o</span><span class="o">-></span><span class="n">op</span><span class="p">,</span> <span class="o">&</span><span class="n">handle</span><span class="p">);</span> <span class="c1">// Perform the writes
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_do_transactions</span><span class="p">(</span>
<span class="n">list</span><span class="o"><</span><span class="n">Transaction</span><span class="o">*></span> <span class="o">&</span><span class="n">tls</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">op_seq</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">for</span> <span class="p">(</span><span class="n">list</span><span class="o"><</span><span class="n">Transaction</span><span class="o">*>::</span><span class="n">iterator</span> <span class="n">p</span> <span class="o">=</span> <span class="n">tls</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">p</span> <span class="o">!=</span> <span class="n">tls</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">p</span><span class="p">,</span> <span class="n">trans_num</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_do_transaction</span><span class="p">(</span><span class="o">**</span><span class="n">p</span><span class="p">,</span> <span class="n">op_seq</span><span class="p">,</span> <span class="n">trans_num</span><span class="p">,</span> <span class="n">handle</span><span class="p">);</span> <span class="c1">// Execute each transaction in order
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">unsigned</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_do_transaction</span><span class="p">(</span>
<span class="n">Transaction</span><span class="o">&</span> <span class="n">t</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">op_seq</span><span class="p">,</span> <span class="kt">int</span> <span class="n">trans_num</span><span class="p">,</span>
<span class="n">ThreadPool</span><span class="o">::</span><span class="n">TPHandle</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Transaction</span><span class="o">::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">SequencerPosition</span> <span class="n">spos</span><span class="p">(</span><span class="n">op_seq</span><span class="p">,</span> <span class="n">trans_num</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">have_op</span><span class="p">())</span> <span class="p">{</span>
<span class="p">......</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">op</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">Transaction</span><span class="p">:</span><span class="o">:</span><span class="n">OP_NOP</span><span class="o">:</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">Transaction</span><span class="p">:</span><span class="o">:</span><span class="n">OP_WRITE</span><span class="o">:</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_write</span><span class="p">(</span><span class="n">cid</span><span class="p">,</span> <span class="n">oid</span><span class="p">,</span> <span class="n">off</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="n">fadvise_flags</span><span class="p">);</span> <span class="c1">// Write operation
</span> <span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">FileStore</span><span class="o">::</span><span class="n">_write</span><span class="p">(</span><span class="n">coll_t</span> <span class="n">cid</span><span class="p">,</span> <span class="k">const</span> <span class="n">ghobject_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span>
<span class="k">const</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">fadvise_flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">bl</span><span class="p">.</span><span class="n">write_fd</span><span class="p">(</span><span class="o">**</span><span class="n">fd</span><span class="p">);</span> <span class="c1">// Write the file contents
</span>
<span class="c1">// flush?
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">replaying</span> <span class="o">&&</span>
<span class="n">g_conf</span><span class="o">-></span><span class="n">filestore_wbthrottle_enable</span><span class="p">)</span>
<span class="n">wbthrottle</span><span class="p">.</span><span class="n">queue_wb</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">oid</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="c1">// Record what needs to be flushed
</span> <span class="n">fadvise_flags</span> <span class="o">&</span> <span class="n">CEPH_OSD_OP_FLAG_FADVISE_DONTNEED</span><span class="p">);</span>
<span class="n">lfn_close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">WBThrottle</span><span class="o">::</span><span class="n">queue_wb</span><span class="p">(</span>
<span class="n">FDRef</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="n">ghobject_t</span> <span class="o">&</span><span class="n">hoid</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">offset</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">len</span><span class="p">,</span>
<span class="kt">bool</span> <span class="n">nocache</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">pair</span><span class="o"><</span><span class="n">PendingWB</span><span class="p">,</span> <span class="n">FDRef</span><span class="o">></span> <span class="o">>::</span><span class="n">iterator</span> <span class="n">wbiter</span> <span class="o">=</span>
<span class="n">pending_wbs</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">hoid</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">wbiter</span> <span class="o">==</span> <span class="n">pending_wbs</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Not tracked yet
</span> <span class="n">wbiter</span> <span class="o">=</span> <span class="n">pending_wbs</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span> <span class="c1">// Create a new entry
</span> <span class="n">make_pair</span><span class="p">(</span><span class="n">hoid</span><span class="p">,</span>
<span class="n">make_pair</span><span class="p">(</span>
<span class="n">PendingWB</span><span class="p">(),</span>
<span class="n">fd</span><span class="p">))).</span><span class="n">first</span><span class="p">;</span>
<span class="n">logger</span><span class="o">-></span><span class="n">inc</span><span class="p">(</span><span class="n">l_wbthrottle_inodes_dirtied</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// Already tracked
</span> <span class="n">remove_object</span><span class="p">(</span><span class="n">hoid</span><span class="p">);</span> <span class="c1">// Remove the stale entry from the lru; the object is re-inserted below
</span> <span class="p">}</span>
<span class="n">cur_ios</span><span class="o">++</span><span class="p">;</span> <span class="c1">// Update the IO count
</span> <span class="n">cur_size</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span> <span class="c1">// Update the byte count
</span>
<span class="n">wbiter</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">nocache</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">insert_object</span><span class="p">(</span><span class="n">hoid</span><span class="p">);</span> <span class="c1">// insert the object into the lru
</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span> <span class="c1">// wake the wb thread so it can flush, and threads blocked in _do_op so they can proceed with their writes
</span><span class="p">}</span>
</code></pre></div></div>
<p>The entry function of WBThrottle's own thread is entry:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">WBThrottle</span><span class="o">::</span><span class="n">entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">boost</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">FDRef</span><span class="p">,</span> <span class="n">PendingWB</span><span class="o">></span> <span class="n">wb</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">get_next_should_flush</span><span class="p">(</span><span class="o">&</span><span class="n">wb</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// fetch the next item to flush
</span> <span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="c1">// perform the sync
</span><span class="cp">#ifdef HAVE_FDATASYNC
</span> <span class="o">::</span><span class="n">fdatasync</span><span class="p">(</span><span class="o">**</span><span class="n">wb</span><span class="p">.</span><span class="n">get</span><span class="o"><</span><span class="mi">1</span><span class="o">></span><span class="p">());</span>
<span class="cp">#else
</span> <span class="o">::</span><span class="n">fsync</span><span class="p">(</span><span class="o">**</span><span class="n">wb</span><span class="p">.</span><span class="n">get</span><span class="o"><</span><span class="mi">1</span><span class="o">></span><span class="p">());</span>
<span class="cp">#endif
</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">cur_ios</span> <span class="o">-=</span> <span class="n">wb</span><span class="p">.</span><span class="n">get</span><span class="o"><</span><span class="mi">2</span><span class="o">></span><span class="p">().</span><span class="n">ios</span><span class="p">;</span> <span class="c1">// update the io count
</span> <span class="n">cur_size</span> <span class="o">-=</span> <span class="n">wb</span><span class="p">.</span><span class="n">get</span><span class="o"><</span><span class="mi">2</span><span class="o">></span><span class="p">().</span><span class="n">size</span><span class="p">;</span> <span class="c1">// update the size count
</span> <span class="n">cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span> <span class="c1">// wake waiters blocked earlier in _do_op
</span> <span class="n">wb</span> <span class="o">=</span> <span class="n">boost</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">FDRef</span><span class="p">,</span> <span class="n">PendingWB</span><span class="o">></span><span class="p">();</span> <span class="c1">// reset wb
</span> <span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">WBThrottle</span><span class="o">::</span><span class="n">get_next_should_flush</span><span class="p">(</span>
<span class="n">boost</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">FDRef</span><span class="p">,</span> <span class="n">PendingWB</span><span class="o">></span> <span class="o">*</span><span class="n">next</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">next</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stopping</span> <span class="o">&&</span>
<span class="n">cur_ios</span> <span class="o"><</span> <span class="n">io_limits</span><span class="p">.</span><span class="n">first</span> <span class="o">&&</span>
<span class="n">pending_wbs</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o"><</span> <span class="n">fd_limits</span><span class="p">.</span><span class="n">first</span> <span class="o">&&</span>
<span class="n">cur_size</span> <span class="o"><</span> <span class="n">size_limits</span><span class="p">.</span><span class="n">first</span><span class="p">)</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span> <span class="c1">// all three values are below their soft limits: sleep
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">stopping</span><span class="p">)</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">pending_wbs</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="n">ghobject_t</span> <span class="n">obj</span><span class="p">(</span><span class="n">pop_object</span><span class="p">());</span> <span class="c1">// take the object to flush from the lru, removing it from the lru
</span>
<span class="c1">// look up the item to flush in this round
</span> <span class="n">ceph</span><span class="o">::</span><span class="n">unordered_map</span><span class="o"><</span><span class="n">ghobject_t</span><span class="p">,</span> <span class="n">pair</span><span class="o"><</span><span class="n">PendingWB</span><span class="p">,</span> <span class="n">FDRef</span><span class="o">></span> <span class="o">>::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span>
<span class="n">pending_wbs</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">obj</span><span class="p">);</span>
<span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">boost</span><span class="o">::</span><span class="n">make_tuple</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">second</span><span class="p">,</span> <span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first</span><span class="p">);</span>
<span class="n">pending_wbs</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">i</span><span class="p">);</span> <span class="c1">// erase the item
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
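<p>The condition variable cond in the WBThrottle class is shared by two kinds of waiters. Operations in _do_op may wait on it when any one of the throttle's three values exceeds its hard limit,
while the throttle's internal flush thread may wait on it when none of the three values has reached its soft limit. It may be worth using two separate condition variables, one per kind of waiter,
to avoid some unnecessary wakeups.</p>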
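<p>The two-condition-variable idea raised above could look roughly like the sketch below. It is not Ceph code: the class TwoCondThrottle and all of its members are hypothetical, it uses the C++ standard library instead of Ceph's Mutex/Cond, and it tracks a single counter rather than the three values (ios, bytes, fds) that WBThrottle tracks.</p>

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified model of the throttle: one counter instead of
// the three (ios, bytes, fds) that WBThrottle tracks. Not Ceph code.
class TwoCondThrottle {
public:
  TwoCondThrottle(uint64_t soft, uint64_t hard) : soft_(soft), hard_(hard) {}

  // Writer side: block while the hard limit is reached, then account the io.
  void queue(uint64_t n = 1) {
    std::unique_lock<std::mutex> l(lock_);
    below_hard_.wait(l, [this] { return stopping_ || cur_ < hard_; });
    cur_ += n;
    if (cur_ >= soft_)
      above_soft_.notify_one();  // only the flusher waits on this one
  }

  // Flusher side: sleep until the soft limit is reached.
  // Returns false when the throttle is stopping.
  bool wait_should_flush() {
    std::unique_lock<std::mutex> l(lock_);
    above_soft_.wait(l, [this] { return stopping_ || cur_ >= soft_; });
    return !stopping_;
  }

  // Flusher side: called after n units have been synced to disk.
  void flushed(uint64_t n) {
    std::lock_guard<std::mutex> l(lock_);
    cur_ -= (n < cur_ ? n : cur_);
    if (cur_ < hard_)
      below_hard_.notify_all();  // only writers wait on this one
  }

  void stop() {
    std::lock_guard<std::mutex> l(lock_);
    stopping_ = true;
    above_soft_.notify_all();
    below_hard_.notify_all();
  }

  uint64_t current() {
    std::lock_guard<std::mutex> l(lock_);
    return cur_;
  }

private:
  std::mutex lock_;
  std::condition_variable below_hard_;  // writers wait here
  std::condition_variable above_soft_;  // the flusher waits here
  uint64_t soft_, hard_;
  uint64_t cur_ = 0;
  bool stopping_ = false;
};
```

<p>With this split, a new write wakes the flusher only once the soft limit is reached, and a completed flush wakes writers only once the counter is back under the hard limit. Whether the saved wakeups matter in practice would need to be measured.</p>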
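<p>Also note that FileStore has its own sync thread, which periodically syncs the entire current directory where the osd stores its data, whereas WBThrottle flushes only one object at a time.</p>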
<h1 id="asyncobjectthrottle">AsyncObjectThrottle</h1>
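<p>The last kind of throttle lives in librbd/AsyncObjectThrottle.h. It is mainly used to throttle management operations on an image, such as flatten/snapshot/resize. While such an operation runs,
it limits how many objects can be processed concurrently. For how it works, see the <a href="http://blog.wjin.org/posts/ceph-librbd-flatten.html">article</a> that analyzes flatten.</p>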
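<p>As a rough illustration of limiting the number of objects in flight, the sketch below keeps a fixed-size window of outstanding operations and starts one more each time one completes. It is only a model of the idea: ObjectThrottle, start, finish_one and the issue callback are hypothetical names and do not match the librbd interface.</p>

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical sketch of a concurrency window over object numbers
// [begin, end). Names are illustrative and do not match librbd.
class ObjectThrottle {
public:
  // issue(object_no) is expected to start an asynchronous operation on
  // one object; its completion must later be reported via finish_one().
  ObjectThrottle(uint64_t begin, uint64_t end,
                 std::function<void(uint64_t)> issue)
    : next_(begin), end_(end), issue_(std::move(issue)) {}

  // Start up to max_concurrent operations.
  void start(uint64_t max_concurrent) {
    while (in_flight_ < max_concurrent && start_next()) {
    }
  }

  // Report one completed operation and refill the window by one.
  // Returns true once every object has been processed.
  bool finish_one() {
    --in_flight_;
    start_next();
    return in_flight_ == 0;
  }

  uint64_t in_flight() const { return in_flight_; }

private:
  bool start_next() {
    if (next_ >= end_)
      return false;
    ++in_flight_;
    issue_(next_++);
    return true;
  }

  uint64_t next_;
  uint64_t end_;
  uint64_t in_flight_ = 0;
  std::function<void(uint64_t)> issue_;
};
```

<p>In this model the number of outstanding operations never exceeds the value passed to start, regardless of how many objects the image has.</p>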
<h1 id="summary">Summary</h1>
<ul>
<li>
<p>WBThrottle targets FileStore's writeback (flush) mechanism and has its own thread</p>
</li>
<li>
<p>AsyncObjectThrottle is used when managing an image, to limit the number of objects operated on concurrently</p>
</li>
<li>
<p>In most cases Throttle is what is used for rate limiting; waiters are ordered in a fifo queue, which keeps it fairly fair</p>
</li>
</ul>
Ceph Track Op Process
2015-11-28T00:00:00+00:00
http://blog.wjin.org/posts/ceph-track-op-process
<h1 id="overview">Overview</h1>
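<p>In the ceph storage system, the osd processes of the backend rados cluster parse every message they receive. For an op-type message, a new OpRequest object is created.
This object lives through the entire execution of the operation until it completes and is destroyed. ceph provides a mechanism to track this object and record events on it, which helps with performance analysis and tuning.</p>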
<h1 id="implementation">Implementation</h1>
<h2 id="initialize-optracker">Initialize OpTracker</h2>
<p>The OSD class defines op_tracker, a member of type OpTracker, which manages TrackedOp objects:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OSD</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Dispatcher</span><span class="p">,</span>
<span class="k">public</span> <span class="n">md_config_obs_t</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">OpTracker</span> <span class="n">op_tracker</span><span class="p">;</span> <span class="c1">// helper class that manages TrackedOp objects
</span>
<span class="p">......</span>
<span class="p">};</span>
<span class="c1">// OSD constructor
</span><span class="n">OSD</span><span class="o">::</span><span class="n">OSD</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct_</span><span class="p">,</span> <span class="n">ObjectStore</span> <span class="o">*</span><span class="n">store_</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">id</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">internal_messenger</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">external_messenger</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">hb_clientm</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">hb_front_serverm</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">hb_back_serverm</span><span class="p">,</span>
<span class="n">Messenger</span> <span class="o">*</span><span class="n">osdc_messenger</span><span class="p">,</span>
<span class="n">MonClient</span> <span class="o">*</span><span class="n">mc</span><span class="p">,</span>
<span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">dev</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="o">&</span><span class="n">jdev</span><span class="p">)</span> <span class="o">:</span>
<span class="n">Dispatcher</span><span class="p">(</span><span class="n">cct_</span><span class="p">),</span>
<span class="p">......</span>
<span class="n">op_tracker</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_enable_op_tracker</span><span class="p">,</span> <span class="c1">// first argument: the CephContext; second: whether the tracker is enabled in the config (if not, ops are not tracked)
</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_num_op_tracker_shard</span><span class="p">),</span> <span class="c1">// third argument: the number of shards; ops are sharded internally so that they are not all kept in one list, which would be a bottleneck
</span> <span class="p">......</span>
<span class="p">{</span>
<span class="n">monc</span><span class="o">-></span><span class="n">set_messenger</span><span class="p">(</span><span class="n">client_messenger</span><span class="p">);</span>
<span class="n">op_tracker</span><span class="p">.</span><span class="n">set_complaint_and_threshold</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_complaint_time</span><span class="p">,</span> <span class="c1">// longest tolerated op processing time; beyond it a warning is logged
</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_log_threshold</span><span class="p">);</span> <span class="c1">// upper bound on the number of warning log entries
</span> <span class="n">op_tracker</span><span class="p">.</span><span class="n">set_history_size_and_duration</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_history_size</span><span class="p">,</span> <span class="c1">// size of the OpHistory
</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_op_history_duration</span><span class="p">);</span> <span class="c1">// longest time an op stays in the OpHistory
</span><span class="p">}</span>
</code></pre></div></div>
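<p>OpTracker manages TrackedOp objects. It not only manages ops in progress (inflight ops) but also keeps a record of completed ops (through an OpHistory object).
When needed, you can send a command to the osd process to dump the ops that have already completed.</p>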
<p>Naturally, OpRequest has to inherit from TrackedOp so that OpTracker can manage it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">OpRequest</span> <span class="o">:</span> <span class="k">public</span> <span class="n">TrackedOp</span> <span class="p">{</span>
<span class="k">friend</span> <span class="k">class</span> <span class="nc">OpTracker</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<h2 id="ophistory">OpHistory</h2>
<p>First, look at how OpHistory records completed ops, in src/common/TrackedOp.h and src/common/TrackedOp.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// header file
</span><span class="k">class</span> <span class="nc">OpHistory</span> <span class="p">{</span>
<span class="n">set</span><span class="o"><</span><span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">TrackedOpRef</span><span class="o">></span> <span class="o">></span> <span class="n">arrived</span><span class="p">;</span> <span class="c1">// ordered by arrival time
</span> <span class="n">set</span><span class="o"><</span><span class="n">pair</span><span class="o"><</span><span class="kt">double</span><span class="p">,</span> <span class="n">TrackedOpRef</span><span class="o">></span> <span class="o">></span> <span class="n">duration</span><span class="p">;</span> <span class="c1">// ordered by duration
</span>
<span class="n">Mutex</span> <span class="n">ops_history_lock</span><span class="p">;</span> <span class="c1">// lock protecting the two sets above
</span>
<span class="kt">void</span> <span class="n">cleanup</span><span class="p">(</span><span class="n">utime_t</span> <span class="n">now</span><span class="p">);</span> <span class="c1">// erase ops that no longer qualify
</span> <span class="kt">bool</span> <span class="n">shutdown</span><span class="p">;</span>
<span class="c1">// ops cannot be recorded without bound; they are trimmed by the two limits below
</span> <span class="kt">uint32_t</span> <span class="n">history_size</span><span class="p">;</span> <span class="c1">// maximum number of ops
</span> <span class="kt">uint32_t</span> <span class="n">history_duration</span><span class="p">;</span> <span class="c1">// longest time an op is kept
</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">OpHistory</span><span class="p">()</span> <span class="o">:</span> <span class="n">ops_history_lock</span><span class="p">(</span><span class="s">"OpHistory::Lock"</span><span class="p">),</span> <span class="n">shutdown</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span>
<span class="n">history_size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">history_duration</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="o">~</span><span class="n">OpHistory</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">arrived</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">duration</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">insert</span><span class="p">(</span><span class="n">utime_t</span> <span class="n">now</span><span class="p">,</span> <span class="n">TrackedOpRef</span> <span class="n">op</span><span class="p">);</span> <span class="c1">// insert a new op; calls cleanup to trim the excess
</span> <span class="kt">void</span> <span class="n">dump_ops</span><span class="p">(</span><span class="n">utime_t</span> <span class="n">now</span><span class="p">,</span> <span class="n">Formatter</span> <span class="o">*</span><span class="n">f</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">on_shutdown</span><span class="p">();</span>
<span class="kt">void</span> <span class="nf">set_size_and_duration</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">new_size</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">new_duration</span><span class="p">)</span> <span class="p">{</span>
<span class="n">history_size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
<span class="n">history_duration</span> <span class="o">=</span> <span class="n">new_duration</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// cpp file
</span><span class="kt">void</span> <span class="n">OpHistory</span><span class="o">::</span><span class="n">insert</span><span class="p">(</span><span class="n">utime_t</span> <span class="n">now</span><span class="p">,</span> <span class="n">TrackedOpRef</span> <span class="n">op</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">shutdown</span><span class="p">)</span>
<span class="k">return</span><span class="p">;</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">history_lock</span><span class="p">(</span><span class="n">ops_history_lock</span><span class="p">);</span>
<span class="n">duration</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">get_duration</span><span class="p">(),</span> <span class="n">op</span><span class="p">));</span> <span class="c1">// insert into the duration set
</span> <span class="n">arrived</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">get_initiated</span><span class="p">(),</span> <span class="n">op</span><span class="p">));</span> <span class="c1">// insert into the arrived set
</span> <span class="n">cleanup</span><span class="p">(</span><span class="n">now</span><span class="p">);</span> <span class="c1">// trim the excess
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">OpHistory</span><span class="o">::</span><span class="n">cleanup</span><span class="p">(</span><span class="n">utime_t</span> <span class="n">now</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">arrived</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">now</span> <span class="o">-</span> <span class="n">arrived</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">first</span> <span class="o">></span>
<span class="p">(</span><span class="kt">double</span><span class="p">)(</span><span class="n">history_duration</span><span class="p">)))</span> <span class="p">{</span> <span class="c1">// too old
</span> <span class="n">duration</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span>
<span class="n">arrived</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">second</span><span class="o">-></span><span class="n">get_duration</span><span class="p">(),</span>
<span class="n">arrived</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">second</span><span class="p">));</span>
<span class="n">arrived</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">arrived</span><span class="p">.</span><span class="n">begin</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">while</span> <span class="p">(</span><span class="n">duration</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">></span> <span class="n">history_size</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// over the count limit
</span> <span class="n">arrived</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span>
<span class="n">duration</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">second</span><span class="o">-></span><span class="n">get_initiated</span><span class="p">(),</span>
<span class="n">duration</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span><span class="o">-></span><span class="n">second</span><span class="p">));</span>
<span class="n">duration</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">duration</span><span class="p">.</span><span class="n">begin</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
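<p>TrackedOp objects are managed with plain sets. Because these objects have all finished executing, inserting into and erasing from the sets needs no sharding and does not affect performance. Note that when the count exceeds history_size, cleanup erases from the front of duration, that is, the op with the shortest duration, so the slowest ops are the ones that are kept.</p>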
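<p>To make the two-set bookkeeping easier to experiment with, here is a minimal standalone sketch of the same cleanup policy. It is an approximation under stated assumptions: FinishedOp, OpRef, make_op and MiniHistory are hypothetical names, plain double values stand in for utime_t, and there is no locking.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <set>
#include <utility>

// Hypothetical stand-in for a finished TrackedOp.
struct FinishedOp {
  double initiated;  // arrival time, in seconds
  double duration;   // how long the op took, in seconds
  uint64_t id;
};
using OpRef = std::shared_ptr<const FinishedOp>;

inline OpRef make_op(double initiated, double duration, uint64_t id) {
  return std::make_shared<FinishedOp>(FinishedOp{initiated, duration, id});
}

// Bounded history mirroring the arrived/duration pair of sets.
class MiniHistory {
public:
  MiniHistory(size_t max_ops, double max_age)
    : max_ops_(max_ops), max_age_(max_age) {}

  void insert(double now, OpRef op) {
    duration_.insert(std::make_pair(op->duration, op));
    arrived_.insert(std::make_pair(op->initiated, op));
    cleanup(now);
  }

  size_t size() const { return arrived_.size(); }

  bool contains(uint64_t id) const {
    for (const auto& p : arrived_)
      if (p.second->id == id)
        return true;
    return false;
  }

private:
  void cleanup(double now) {
    // 1) Drop ops that arrived longer ago than max_age_.
    while (!arrived_.empty() && now - arrived_.begin()->first > max_age_) {
      OpRef op = arrived_.begin()->second;
      duration_.erase(std::make_pair(op->duration, op));
      arrived_.erase(arrived_.begin());
    }
    // 2) Over capacity: drop the shortest ops, keeping the slowest.
    while (duration_.size() > max_ops_) {
      OpRef op = duration_.begin()->second;
      arrived_.erase(std::make_pair(op->initiated, op));
      duration_.erase(duration_.begin());
    }
  }

  std::set<std::pair<double, OpRef>> arrived_;   // ordered by arrival time
  std::set<std::pair<double, OpRef>> duration_;  // ordered by duration
  size_t max_ops_;
  double max_age_;
};
```

<p>For example, with max_ops of 2, inserting ops that took 0.5s, 0.1s and 0.9s leaves the 0.5s and 0.9s ops in the history, because the 0.1s op sits at the front of the duration set and is erased first.</p>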
<h2 id="optracker">OpTracker</h2>
<p>Next, look at the actual management class, OpTracker:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OpTracker</span> <span class="p">{</span>
<span class="c1">// this class serves as the deleter object for the smart pointer
</span> <span class="k">class</span> <span class="nc">RemoveOnDelete</span> <span class="p">{</span>
<span class="n">OpTracker</span> <span class="o">*</span><span class="n">tracker</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">RemoveOnDelete</span><span class="p">(</span><span class="n">OpTracker</span> <span class="o">*</span><span class="n">tracker</span><span class="p">)</span> <span class="o">:</span> <span class="n">tracker</span><span class="p">(</span><span class="n">tracker</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="k">operator</span><span class="p">()(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">op</span><span class="p">);</span>
<span class="p">};</span>
<span class="k">friend</span> <span class="k">class</span> <span class="nc">RemoveOnDelete</span><span class="p">;</span>
<span class="k">friend</span> <span class="k">class</span> <span class="nc">OpHistory</span><span class="p">;</span>
<span class="n">atomic64_t</span> <span class="n">seq</span><span class="p">;</span> <span class="c1">// monotonically increasing atomic counter, unique per op; used to pick the shard
</span>
<span class="c1">// inflight ops are sharded to avoid contention; added in hammer, where the single list was a performance bottleneck
</span> <span class="c1">// per-shard data
</span> <span class="k">struct</span> <span class="n">ShardedTrackingData</span> <span class="p">{</span>
<span class="n">Mutex</span> <span class="n">ops_in_flight_lock_sharded</span><span class="p">;</span> <span class="c1">// protects the xlist
</span> <span class="n">xlist</span><span class="o"><</span><span class="n">TrackedOp</span> <span class="o">*></span> <span class="n">ops_in_flight_sharded</span><span class="p">;</span> <span class="c1">// list of tracked ops
</span> <span class="n">ShardedTrackingData</span><span class="p">(</span><span class="n">string</span> <span class="n">lock_name</span><span class="p">)</span><span class="o">:</span>
<span class="n">ops_in_flight_lock_sharded</span><span class="p">(</span><span class="n">lock_name</span><span class="p">.</span><span class="n">c_str</span><span class="p">())</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="n">vector</span><span class="o"><</span><span class="n">ShardedTrackingData</span><span class="o">*></span> <span class="n">sharded_in_flight_list</span><span class="p">;</span> <span class="c1">// array of shards
</span> <span class="kt">uint32_t</span> <span class="n">num_optracker_shards</span><span class="p">;</span> <span class="c1">// number of shards
</span>
<span class="n">OpHistory</span> <span class="n">history</span><span class="p">;</span> <span class="c1">// OpHistory manages the ops that have completed
</span>
<span class="kt">float</span> <span class="n">complaint_time</span><span class="p">;</span> <span class="c1">// time threshold for op warning logs
</span> <span class="kt">int</span> <span class="n">log_threshold</span><span class="p">;</span> <span class="c1">// upper bound on the number of op warning log entries
</span>
<span class="kt">void</span> <span class="n">_mark_event</span><span class="p">(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">evt</span><span class="p">,</span> <span class="n">utime_t</span> <span class="n">now</span><span class="p">);</span> <span class="c1">// record an event
</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">bool</span> <span class="n">tracking_enabled</span><span class="p">;</span> <span class="c1">// whether tracking is enabled
</span> <span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
<p>First the constructor and destructor, which mainly initialize and destroy the shards:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">OpTracker</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct_</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">tracking</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">num_shards</span><span class="p">)</span> <span class="o">:</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
<span class="n">num_optracker_shards</span><span class="p">(</span><span class="n">num_shards</span><span class="p">),</span>
<span class="n">complaint_time</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">log_threshold</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
<span class="n">tracking_enabled</span><span class="p">(</span><span class="n">tracking</span><span class="p">),</span> <span class="n">cct</span><span class="p">(</span><span class="n">cct_</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">num_optracker_shards</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">char</span> <span class="n">lock_name</span><span class="p">[</span><span class="mi">32</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="n">snprintf</span><span class="p">(</span><span class="n">lock_name</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">lock_name</span><span class="p">),</span> <span class="s">"%s:%d"</span><span class="p">,</span> <span class="s">"OpTracker::ShardedLock"</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="n">ShardedTrackingData</span><span class="o">*</span> <span class="n">one_shard</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ShardedTrackingData</span><span class="p">(</span><span class="n">lock_name</span><span class="p">);</span> <span class="c1">// initialize one shard
</span> <span class="n">sharded_in_flight_list</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">one_shard</span><span class="p">);</span> <span class="c1">// store it in the array
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">OpTracker</span><span class="p">()</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">sharded_in_flight_list</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">((</span><span class="n">sharded_in_flight_list</span><span class="p">.</span><span class="n">back</span><span class="p">())</span><span class="o">-></span><span class="n">ops_in_flight_sharded</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="k">delete</span> <span class="n">sharded_in_flight_list</span><span class="p">.</span><span class="n">back</span><span class="p">();</span> <span class="c1">// destroy the shard
</span> <span class="n">sharded_in_flight_list</span><span class="p">.</span><span class="n">pop_back</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
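<p>The sharding scheme described in the comments above (a unique, increasing seq per op that also selects the shard) can be modeled in a few lines. This is a sketch under assumptions, not the OpTracker implementation: ShardedRegistry and its methods are hypothetical, std::mutex and std::list replace Ceph's Mutex and xlist, and only sequence numbers are stored instead of TrackedOp pointers.</p>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical model of a sharded in-flight list; not Ceph code.
class ShardedRegistry {
  struct Shard {
    std::mutex lock;          // protects ops
    std::list<uint64_t> ops;  // sequence numbers of in-flight ops
  };

public:
  explicit ShardedRegistry(uint32_t num_shards) {
    if (num_shards == 0)
      num_shards = 1;  // avoid a modulo by zero below
    for (uint32_t i = 0; i < num_shards; i++)
      shards_.push_back(std::unique_ptr<Shard>(new Shard()));
  }

  // Assign the next sequence number and record it in shard (seq % n).
  uint64_t register_op() {
    uint64_t s = ++seq_;
    Shard& sh = *shards_[s % shards_.size()];
    std::lock_guard<std::mutex> l(sh.lock);
    sh.ops.push_back(s);
    return s;
  }

  // Remove an op; only the lock of its own shard is taken.
  void unregister_op(uint64_t s) {
    Shard& sh = *shards_[s % shards_.size()];
    std::lock_guard<std::mutex> l(sh.lock);
    sh.ops.remove(s);
  }

  size_t shard_size(uint32_t i) {
    std::lock_guard<std::mutex> l(shards_[i]->lock);
    return shards_[i]->ops.size();
  }

  size_t in_flight() {
    size_t total = 0;
    for (uint32_t i = 0; i < shards_.size(); i++)
      total += shard_size(i);
    return total;
  }

private:
  std::atomic<uint64_t> seq_{0};
  std::vector<std::unique_ptr<Shard>> shards_;
};
```

<p>Because consecutive ops get consecutive sequence numbers, they land on different shards, so threads registering and unregistering ops at the same time mostly contend on different locks.</p>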
<h2 id="track-op">Track OP</h2>
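<p>Next, how ops are tracked. After the osd receives a message it dispatches it (there are two dispatch paths); for an op request, an OpRequest object is created through the factory method create_request:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// fast dispatch path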
</span><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">ms_fast_dispatch</span><span class="p">(</span><span class="n">Message</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">OpRequestRef</span> <span class="n">op</span> <span class="o">=</span> <span class="n">op_tracker</span><span class="p">.</span><span class="n">create_request</span><span class="o"><</span><span class="n">OpRequest</span><span class="o">></span><span class="p">(</span><span class="n">m</span><span class="p">);</span> <span class="c1">// create the OpRequest
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="c1">// regular dispatch path
</span><span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">_dispatch</span><span class="p">(</span><span class="n">Message</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">case</span> <span class="n">MSG_OSD_REP_SCRUB</span><span class="p">:</span>
<span class="n">handle_rep_scrub</span><span class="p">(</span><span class="k">static_cast</span><span class="o"><</span><span class="n">MOSDRepScrub</span><span class="o">*></span><span class="p">(</span><span class="n">m</span><span class="p">));</span>
<span class="k">break</span><span class="p">;</span>
<span class="c1">// -- need OSDMap --
</span>
<span class="k">default</span><span class="o">:</span>
<span class="p">{</span>
<span class="n">OpRequestRef</span> <span class="n">op</span> <span class="o">=</span> <span class="n">op_tracker</span><span class="p">.</span><span class="n">create_request</span><span class="o"><</span><span class="n">OpRequest</span><span class="p">,</span> <span class="n">Message</span><span class="o">*></span><span class="p">(</span><span class="n">m</span><span class="p">);</span> <span class="c1">// create the OpRequest
</span> <span class="n">op</span><span class="o">-></span><span class="n">mark_event</span><span class="p">(</span><span class="s">"waiting_for_osdmap"</span><span class="p">);</span>
<span class="c1">// no map? starting up?
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">osdmap</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"no OSDMap, not booted"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">waiting_for_osdmap</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">op</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// need OSDMap
</span> <span class="n">dispatch_op</span><span class="p">(</span><span class="n">op</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that OpRequest derives from TrackedOp. The base-class constructor adds the new object to a list managed by the OpTracker:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TrackedOp</span><span class="p">(</span><span class="n">OpTracker</span> <span class="o">*</span><span class="n">_tracker</span><span class="p">,</span> <span class="k">const</span> <span class="n">utime_t</span><span class="o">&</span> <span class="n">initiated</span><span class="p">)</span> <span class="o">:</span>
<span class="n">xitem</span><span class="p">(</span><span class="k">this</span><span class="p">),</span>
<span class="n">tracker</span><span class="p">(</span><span class="n">_tracker</span><span class="p">),</span>
<span class="n">initiated_at</span><span class="p">(</span><span class="n">initiated</span><span class="p">),</span>
<span class="n">lock</span><span class="p">(</span><span class="s">"TrackedOp::lock"</span><span class="p">),</span>
<span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
<span class="n">warn_interval_multiplier</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">tracker</span><span class="o">-></span><span class="n">register_inflight_op</span><span class="p">(</span><span class="o">&</span><span class="n">xitem</span><span class="p">);</span> <span class="c1">// register this op with the tracker
</span> <span class="n">events</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span><span class="n">initiated_at</span><span class="p">,</span> <span class="s">"initiated"</span><span class="p">));</span> <span class="c1">// record the "initiated" event
</span><span class="p">}</span>
<span class="c1">// the registration function
</span><span class="kt">void</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">register_inflight_op</span><span class="p">(</span><span class="n">xlist</span><span class="o"><</span><span class="n">TrackedOp</span><span class="o">*>::</span><span class="n">item</span> <span class="o">*</span><span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tracking_enabled</span><span class="p">)</span> <span class="c1">// tracking is disabled, nothing to do
</span> <span class="k">return</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">current_seq</span> <span class="o">=</span> <span class="n">seq</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span> <span class="c1">// increment the atomic counter
</span> <span class="kt">uint32_t</span> <span class="n">shard_index</span> <span class="o">=</span> <span class="n">current_seq</span> <span class="o">%</span> <span class="n">num_optracker_shards</span><span class="p">;</span> <span class="c1">// seq modulo the shard count gives the shard index
</span> <span class="n">ShardedTrackingData</span><span class="o">*</span> <span class="n">sdata</span> <span class="o">=</span> <span class="n">sharded_in_flight_list</span><span class="p">[</span><span class="n">shard_index</span><span class="p">];</span> <span class="c1">// fetch the shard's data structure
</span>
<span class="n">assert</span><span class="p">(</span><span class="nb">NULL</span> <span class="o">!=</span> <span class="n">sdata</span><span class="p">);</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">locker</span><span class="p">(</span><span class="n">sdata</span><span class="o">-></span><span class="n">ops_in_flight_lock_sharded</span><span class="p">);</span>
<span class="n">sdata</span><span class="o">-></span><span class="n">ops_in_flight_sharded</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">i</span><span class="p">);</span> <span class="c1">// append to the shard's list
</span> <span class="n">sdata</span><span class="o">-></span><span class="n">ops_in_flight_sharded</span><span class="p">.</span><span class="n">back</span><span class="p">()</span><span class="o">-></span><span class="n">seq</span> <span class="o">=</span> <span class="n">current_seq</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
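<p>The sharding idea behind register_inflight_op can be illustrated with a small self-contained sketch. This is not Ceph code: Op, ShardedRegistry and shard_size are hypothetical names, and std::atomic, std::mutex and std::list stand in for Ceph's own counter, Mutex and intrusive xlist. It only shows how a sequence number taken modulo the shard count spreads ops over independently locked lists.</p>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for TrackedOp; only the sequence number matters here.
struct Op {
    uint64_t seq = 0;
};

// Hypothetical stand-in for OpTracker's sharded in-flight list.
class ShardedRegistry {
    struct Shard {
        std::mutex lock;     // protects only this shard's list
        std::list<Op*> ops;  // in-flight ops that mapped to this shard
    };
    std::atomic<uint64_t> next_seq{0};
    std::vector<std::unique_ptr<Shard>> shards;

public:
    explicit ShardedRegistry(std::size_t num_shards) {
        for (std::size_t i = 0; i < num_shards; ++i)
            shards.push_back(std::make_unique<Shard>());
    }

    // Assign a sequence number and append the op to shard (seq % N).
    void register_op(Op* op) {
        op->seq = ++next_seq;  // atomic increment; the first op gets seq 1
        Shard& s = *shards[op->seq % shards.size()];
        std::lock_guard<std::mutex> guard(s.lock);
        s.ops.push_back(op);
    }

    // The stored seq identifies the shard again, so no global lock is needed.
    // std::list::remove is linear; an intrusive list would unlink in O(1).
    void unregister_op(Op* op) {
        Shard& s = *shards[op->seq % shards.size()];
        std::lock_guard<std::mutex> guard(s.lock);
        s.ops.remove(op);
    }

    std::size_t shard_size(std::size_t i) {
        std::lock_guard<std::mutex> guard(shards[i]->lock);
        return shards[i]->ops.size();
    }
};
```

<p>With two shards, three consecutively registered ops receive seq 1, 2 and 3, so two of them land in shard 1 and one in shard 0. Threads that register ops concurrently therefore contend on different locks most of the time.</p>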
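<p>When an OpRequest object is created, it is placed in a smart pointer. Pay attention to the smart pointer's deleter: it removes the op from the in-flight list
and then hands it to OpHistory, which keeps recording it so that the dump commands can print historical ops:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// at creation time, T is OpRequest and U is Message*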
</span><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="p">,</span> <span class="k">typename</span> <span class="n">U</span><span class="o">></span>
<span class="k">typename</span> <span class="n">T</span><span class="o">::</span><span class="n">Ref</span> <span class="n">create_request</span><span class="p">(</span><span class="n">U</span> <span class="n">params</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// typename T::Ref is actually defined as typedef ceph::shared_ptr<OpRequest> Ref;
</span> <span class="c1">// i.e. it is a shared_ptr type
</span> <span class="k">typename</span> <span class="n">T</span><span class="o">::</span><span class="n">Ref</span> <span class="n">retval</span><span class="p">(</span><span class="k">new</span> <span class="n">T</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="k">this</span><span class="p">),</span> <span class="c1">// first argument: new OpRequest(Message*, OpTracker*)
</span> <span class="n">RemoveOnDelete</span><span class="p">(</span><span class="k">this</span><span class="p">));</span> <span class="c1">// second argument: a deleter object whose operator() is called when the smart pointer releases the op
</span> <span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// called when the smart pointer releases the op
</span><span class="kt">void</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">RemoveOnDelete</span><span class="o">::</span><span class="k">operator</span><span class="p">()(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">op</span><span class="p">)</span> <span class="p">{</span>
<span class="n">op</span><span class="o">-></span><span class="n">mark_event</span><span class="p">(</span><span class="s">"done"</span><span class="p">);</span> <span class="c1">// mark the op's completion event
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tracker</span><span class="o">-></span><span class="n">tracking_enabled</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// tracking is disabled
</span> <span class="n">op</span><span class="o">-></span><span class="n">_unregistered</span><span class="p">();</span> <span class="c1">// hook left by TrackedOp, currently an empty function
</span> <span class="k">delete</span> <span class="n">op</span><span class="p">;</span> <span class="c1">// free the object
</span> <span class="k">return</span><span class="p">;</span> <span class="c1">// done
</span> <span class="p">}</span>
<span class="n">tracker</span><span class="o">-></span><span class="n">unregister_inflight_op</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// tracking is enabled, handled separately
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">unregister_inflight_op</span><span class="p">(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// caller checks;
</span> <span class="n">assert</span><span class="p">(</span><span class="n">tracking_enabled</span><span class="p">);</span>
<span class="c1">// locate the shard and remove the op from that shard's list
</span> <span class="kt">uint32_t</span> <span class="n">shard_index</span> <span class="o">=</span> <span class="n">i</span><span class="o">-></span><span class="n">seq</span> <span class="o">%</span> <span class="n">num_optracker_shards</span><span class="p">;</span>
<span class="n">ShardedTrackingData</span><span class="o">*</span> <span class="n">sdata</span> <span class="o">=</span> <span class="n">sharded_in_flight_list</span><span class="p">[</span><span class="n">shard_index</span><span class="p">];</span>
<span class="n">assert</span><span class="p">(</span><span class="nb">NULL</span> <span class="o">!=</span> <span class="n">sdata</span><span class="p">);</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">locker</span><span class="p">(</span><span class="n">sdata</span><span class="o">-></span><span class="n">ops_in_flight_lock_sharded</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">xitem</span><span class="p">.</span><span class="n">get_list</span><span class="p">()</span> <span class="o">==</span> <span class="o">&</span><span class="n">sdata</span><span class="o">-></span><span class="n">ops_in_flight_sharded</span><span class="p">);</span>
<span class="n">i</span><span class="o">-></span><span class="n">xitem</span><span class="p">.</span><span class="n">remove_myself</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">i</span><span class="o">-></span><span class="n">_unregistered</span><span class="p">();</span> <span class="c1">// hook
</span>
<span class="n">utime_t</span> <span class="n">now</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span>
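<span class="c1">// hand the op over to OpHistory; the object is not destroyed here
</span> <span class="c1">// no delete is needed: the originally allocated pointer is wrapped in a shared_ptr again, this time without a custom deleter, so delete runs automatically once the reference count drops to 0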
</span> <span class="n">history</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">now</span><span class="p">,</span> <span class="n">TrackedOpRef</span><span class="p">(</span><span class="n">i</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
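<p>The custom-deleter trick above can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the Ceph implementation: Op, Tracker, Retire and max_history are hypothetical names, and a std::deque stands in for OpHistory. It shows that the deleter does not have to free the object; it can re-wrap the raw pointer in a second shared_ptr that uses the default deleter.</p>

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for TrackedOp.
struct Op {
    std::vector<std::string> events;
};

// Hypothetical stand-in for OpTracker plus OpHistory.
struct Tracker {
    std::deque<std::shared_ptr<Op>> history;  // finished ops kept for later dumps
    std::size_t max_history = 2;              // oldest entries are dropped beyond this

    // Deleter functor, playing the role of RemoveOnDelete.
    struct Retire {
        Tracker* tracker;
        void operator()(Op* op) const {
            op->events.push_back("done");
            // Re-wrap the raw pointer with the default deleter; the object is
            // freed only when the history drops its entry.
            tracker->history.emplace_back(op);
            if (tracker->history.size() > tracker->max_history)
                tracker->history.pop_front();
        }
    };

    // The Tracker must outlive every request it creates.
    std::shared_ptr<Op> create_request() {
        return std::shared_ptr<Op>(new Op, Retire{this});
    }
};
```

<p>When the last external reference to a request goes away, the op appears at the back of history with a trailing "done" event instead of being destroyed.</p>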
<h2 id="op-warning-info">Op Warning Info</h2>
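<p>When an op has been in flight for longer than complaint_time, a warning is printed. This works because the OSD registers a timer event at startup that checks the in-flight ops periodically:</p>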
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">tick_timer</span><span class="p">.</span><span class="n">add_event_after</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">osd_heartbeat_interval</span><span class="p">,</span> <span class="k">new</span> <span class="n">C_Tick</span><span class="p">(</span><span class="k">this</span><span class="p">));</span> <span class="c1">// add the first timer event
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="c1">// callback of the timer event
</span><span class="k">class</span> <span class="nc">C_Tick</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Context</span> <span class="p">{</span>
<span class="n">OSD</span> <span class="o">*</span><span class="n">osd</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">C_Tick</span><span class="p">(</span><span class="n">OSD</span> <span class="o">*</span><span class="n">o</span><span class="p">)</span> <span class="o">:</span> <span class="n">osd</span><span class="p">(</span><span class="n">o</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">osd</span><span class="o">-></span><span class="n">tick</span><span class="p">();</span> <span class="c1">// call OSD::tick
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">tick</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">check_ops_in_flight</span><span class="p">();</span> <span class="c1">// run the check
</span>
<span class="n">tick_timer</span><span class="p">.</span><span class="n">add_event_after</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="k">new</span> <span class="n">C_Tick</span><span class="p">(</span><span class="k">this</span><span class="p">));</span> <span class="c1">// schedule the next event one second later, so the check keeps running
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">check_ops_in_flight</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">warnings</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">op_tracker</span><span class="p">.</span><span class="n">check_ops_in_flight</span><span class="p">(</span><span class="n">warnings</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// there are warnings
</span> <span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">warnings</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">i</span> <span class="o">!=</span> <span class="n">warnings</span><span class="p">.</span><span class="n">end</span><span class="p">();</span>
<span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">clog</span><span class="o">-></span><span class="n">warn</span><span class="p">()</span> <span class="o"><<</span> <span class="o">*</span><span class="n">i</span><span class="p">;</span> <span class="c1">// print the warning
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">check_ops_in_flight</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="o">&</span><span class="n">warning_vector</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// check every shard and see whether its oldest op has exceeded complaint_time
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
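<p>The body of OpTracker::check_ops_in_flight is elided above. The following sketch only illustrates what its comment describes, under stated assumptions: Op, Shard and the warning text are hypothetical, times are plain doubles in seconds, and each shard lists its ops oldest first because ops are appended at registration. The real function may differ in detail.</p>

```cpp
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-in for an in-flight op.
struct Op {
    double initiated_at;  // seconds on an arbitrary clock
    std::string desc;
};

// Each shard lists its ops oldest first.
using Shard = std::list<Op>;

// Collect one warning per op older than complaint_time.
// Returns true if at least one warning was produced.
bool check_ops_in_flight(const std::vector<Shard>& shards, double now,
                         double complaint_time,
                         std::vector<std::string>& warnings) {
    for (const Shard& shard : shards) {
        for (const Op& op : shard) {
            double age = now - op.initiated_at;
            if (age <= complaint_time)
                break;  // the list is ordered, so the remaining ops are younger
            warnings.push_back("slow request " + std::to_string(age) +
                               " seconds old: " + op.desc);
        }
    }
    return !warnings.empty();
}
```

<p>Because every shard is ordered by registration time, looking at the front of each list is enough to decide whether anything needs to be reported, which keeps the periodic check cheap.</p>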
<h2 id="op-event">Op Event</h2>
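<p>Besides automatically detecting ops that take unusually long, the OpTracker machinery also provides an event mechanism: an op can be marked with a named event at any point in time, for later analysis:</p>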
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TrackedOp</span> <span class="p">{</span>
<span class="k">protected</span><span class="o">:</span>
<span class="n">OpTracker</span> <span class="o">*</span><span class="n">tracker</span><span class="p">;</span> <span class="c1">/// the tracker we are associated with
</span>
<span class="n">list</span><span class="o"><</span><span class="n">pair</span><span class="o"><</span><span class="n">utime_t</span><span class="p">,</span> <span class="n">string</span><span class="o">></span> <span class="o">></span> <span class="n">events</span><span class="p">;</span> <span class="c1">// <time of the event, description>
</span> <span class="k">mutable</span> <span class="n">Mutex</span> <span class="n">lock</span><span class="p">;</span> <span class="c1">// protects the event list
</span>
<span class="kt">void</span> <span class="n">mark_event</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">event</span><span class="p">);</span> <span class="c1">// record that an event happened
</span>
<span class="p">......</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">TrackedOp</span><span class="o">::</span><span class="n">mark_event</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">event</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tracker</span><span class="o">-></span><span class="n">tracking_enabled</span><span class="p">)</span> <span class="c1">// tracking is not enabled, return immediately
</span> <span class="k">return</span><span class="p">;</span>
<span class="n">utime_t</span> <span class="n">now</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">g_ceph_context</span><span class="p">);</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">events</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">make_pair</span><span class="p">(</span><span class="n">now</span><span class="p">,</span> <span class="n">event</span><span class="p">));</span> <span class="c1">// append to the event list
</span> <span class="p">}</span>
<span class="n">tracker</span><span class="o">-></span><span class="n">mark_event</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">event</span><span class="p">);</span> <span class="c1">// let the OpTracker process the event as well
</span> <span class="n">_event_marked</span><span class="p">();</span> <span class="c1">// hook
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">mark_event</span><span class="p">(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">dest</span><span class="p">,</span> <span class="n">utime_t</span> <span class="n">time</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tracking_enabled</span><span class="p">)</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">return</span> <span class="n">_mark_event</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">time</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OpTracker</span><span class="o">::</span><span class="n">_mark_event</span><span class="p">(</span><span class="n">TrackedOp</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">evt</span><span class="p">,</span>
<span class="n">utime_t</span> <span class="n">time</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// log the event at debug level 5; the log level has to be raised to at least 5 to see it
</span> <span class="n">dout</span><span class="p">(</span><span class="mi">5</span><span class="p">);</span>
<span class="o">*</span><span class="n">_dout</span> <span class="o"><<</span> <span class="s">"seq: "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">seq</span>
<span class="o"><<</span> <span class="s">", time: "</span> <span class="o"><<</span> <span class="n">time</span> <span class="o"><<</span> <span class="s">", event: "</span> <span class="o"><<</span> <span class="n">evt</span>
<span class="o"><<</span> <span class="s">", op: "</span><span class="p">;</span>
<span class="n">op</span><span class="o">-></span><span class="n">_dump_op_descriptor_unlocked</span><span class="p">(</span><span class="o">*</span><span class="n">_dout</span><span class="p">);</span>
<span class="o">*</span><span class="n">_dout</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
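<p>To find out where most of the time goes while an op is processed, call mark_event before entering a key function to mark the start of that stage,
and call it again afterwards to mark its completion. Once the op has completed and is being destroyed, all of its events can be analysed to see how long each stage took.</p>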
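<p>As a small illustration of that analysis, the sketch below computes the time spent between consecutive events of one op. It is an assumption-laden example rather than Ceph code: Event mirrors the pair of time and description stored in the events list, but uses a plain double in seconds instead of utime_t, and stage_durations is a hypothetical helper.</p>

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// <timestamp in seconds, description>, mirroring TrackedOp's events list.
using Event = std::pair<double, std::string>;

// For every event except the first, report how long it took to reach it
// from the previous event.
std::vector<std::pair<std::string, double>>
stage_durations(const std::vector<Event>& events) {
    std::vector<std::pair<std::string, double>> out;
    for (std::size_t i = 1; i < events.size(); ++i)
        out.emplace_back(events[i].second,
                         events[i].first - events[i - 1].first);
    return out;
}
```

<p>For an op with events at 10.0 ("initiated"), 10.5 ("stage_begin") and 12.5 ("done"), the helper reports 0.5 seconds to reach "stage_begin" and 2.0 seconds to reach "done", which points at the second stage as the expensive one. The event names here are made up for the example.</p>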
<h1 id="summary">Summary</h1>
<ul>
<li>
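<p>Ceph uses the OpTracker class to track the execution of ops and prints warnings when necessary</p>
</li>
<li>
<p>Dump commands can be sent to the OSD to inspect op information</p>
</li>
<li>
<p>By marking events, the processing flow of an op can be analysed</p>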
</li>
</ul>
Ceph Dynamic Log/Option Mechanism
2015-11-22T00:00:00+00:00
http://blog.wjin.org/posts/ceph-dynamic-log-option-mechanism
<h1 id="overview">Overview</h1>
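<p>For a large system like Ceph, having logging and configurable options, and being able to change log levels and option values dynamically at runtime, is very helpful for troubleshooting and performance tuning.</p>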
<h1 id="implementation">Implementation</h1>
<h2 id="log-class">Log Class</h2>
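<p>In Ceph, every process, whether a backend OSD daemon, a client, or a command-line tool, creates a CephContext object at startup
(global_init -> global_pre_init -> common_init). While this object is being constructed,
the logging facilities are initialized as well, including starting the thread that writes the log.</p>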
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CephContext</span><span class="o">::</span><span class="n">CephContext</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">module_type_</span><span class="p">)</span>
<span class="o">:</span> <span class="n">nref</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
<span class="n">_conf</span><span class="p">(</span><span class="k">new</span> <span class="n">md_config_t</span><span class="p">()),</span> <span class="c1">// create the configuration object
</span> <span class="n">_log</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_module_type</span><span class="p">(</span><span class="n">module_type_</span><span class="p">),</span>
<span class="n">_crypto_inited</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span>
<span class="n">_service_thread</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_log_obs</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_admin_socket</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_perf_counters_collection</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_perf_counters_conf_obs</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_heartbeat_map</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_crypto_none</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_crypto_aes</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_lockdep_obs</span><span class="p">(</span><span class="nb">NULL</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">_log</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ceph</span><span class="o">::</span><span class="n">log</span><span class="o">::</span><span class="n">Log</span><span class="p">(</span><span class="o">&</span><span class="n">_conf</span><span class="o">-></span><span class="n">subsys</span><span class="p">);</span> <span class="c1">// create the log manager object
</span> <span class="n">_log</span><span class="o">-></span><span class="n">start</span><span class="p">();</span> <span class="c1">// start its thread
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
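<p>The log subsystem is implemented mainly under src/log. Its central class is Log, which inherits from the thread class, so Log brings its own thread for handling log entries.
Writing logs affects system performance, and C++ streams make the cost more noticeable, so Ceph applies a few optimizations here:</p>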
<ul>
<li>
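<p>Ceph predefines a log level for each subsystem, and the level can be changed dynamically</p>
</li>
<li>
<p>Every log entry carries a level and is printed only if that level does not exceed the level configured for its subsystem</p>
</li>
<li>
<p>A log entry only needs to be submitted to the Log class; its dedicated thread takes over writing the log in the background, so the submitting thread is barely affected</p>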
</li>
</ul>
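<p>The first two points can be sketched as a per-subsystem level table that is consulted before an entry is even built. This is a simplified illustration, not the Ceph source: SubsysLevel, LevelTable, set_log_level and should_gather are names chosen for the example, and the rule that an entry is kept when its level is within either the log level or the gather level is an assumption of this sketch.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified per-subsystem level record.
struct SubsysLevel {
    std::string name;
    int log_level = 0;     // threshold for writing an entry to the log
    int gather_level = 0;  // threshold for keeping an entry in memory
};

class LevelTable {
    std::vector<SubsysLevel> subs;  // indexed by subsystem id

public:
    void add(std::size_t id, const std::string& name, int log, int gather) {
        if (id >= subs.size())
            subs.resize(id + 1);
        subs[id].name = name;
        subs[id].log_level = log;
        subs[id].gather_level = gather;
    }

    // Levels can be changed at runtime; the next check sees the new value.
    void set_log_level(std::size_t id, int level) { subs[id].log_level = level; }

    // Cheap check done by the caller before formatting a message.
    bool should_gather(std::size_t id, int level) const {
        return level <= subs[id].gather_level || level <= subs[id].log_level;
    }
};
```

<p>With a subsystem configured as log 1 and gather 5, an entry at level 10 is skipped without any string formatting; after set_log_level raises the log level to 20, the same entry passes the check.</p>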
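<p>The Log class itself is fairly simple. It maintains two queues: m_new receives newly submitted entries, and a flush takes the entries from m_new and writes them out; m_recent keeps the most recent entries.
For example, when a user sends the dump log command through the admin socket, the entries in m_recent are dumped to the file:</p>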
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">Log</span> <span class="o">:</span> <span class="k">private</span> <span class="n">Thread</span> <span class="c1">// inherits from the thread class
</span><span class="p">{</span>
<span class="n">Log</span> <span class="o">**</span><span class="n">m_indirect_this</span><span class="p">;</span>
<span class="n">SubsystemMap</span> <span class="o">*</span><span class="n">m_subs</span><span class="p">;</span> <span class="c1">// map of per-subsystem log levels
</span>
<span class="n">pthread_mutex_t</span> <span class="n">m_queue_mutex</span><span class="p">;</span> <span class="c1">// this lock is used when submitting entries
</span> <span class="n">pthread_mutex_t</span> <span class="n">m_flush_mutex</span><span class="p">;</span> <span class="c1">// this lock is used when writing out submitted entries
</span> <span class="n">pthread_cond_t</span> <span class="n">m_cond_loggers</span><span class="p">;</span>
<span class="n">pthread_cond_t</span> <span class="n">m_cond_flusher</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">m_queue_mutex_holder</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">m_flush_mutex_holder</span><span class="p">;</span>
<span class="n">EntryQueue</span> <span class="n">m_new</span><span class="p">;</span> <span class="c1">// queue of submitted entries
</span> <span class="n">EntryQueue</span> <span class="n">m_recent</span><span class="p">;</span> <span class="c1">// holds the most recent entries
</span>
<span class="p">......</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">entry</span><span class="p">();</span> <span class="c1">// thread entry point
</span>
<span class="kt">void</span> <span class="n">_flush</span><span class="p">(</span><span class="n">EntryQueue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="n">EntryQueue</span> <span class="o">*</span><span class="n">requeue</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">crash</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">_log_message</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">crash</span><span class="p">);</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Log</span><span class="p">(</span><span class="n">SubsystemMap</span> <span class="o">*</span><span class="n">s</span><span class="p">);</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">Log</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">set_flush_on_exit</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">set_max_new</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">set_max_recent</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">set_log_file</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">fn</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">reopen_log_file</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">flush</span><span class="p">();</span> <span class="c1">// flush pending entries
</span>
<span class="kt">void</span> <span class="n">dump_recent</span><span class="p">();</span> <span class="c1">// print the recent entries, usually in response to an admin socket request
</span>
<span class="n">Entry</span> <span class="o">*</span><span class="n">create_entry</span><span class="p">(</span><span class="kt">int</span> <span class="n">level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">subsys</span><span class="p">);</span> <span class="c1">// create a log entry
</span> <span class="kt">void</span> <span class="n">submit_entry</span><span class="p">(</span><span class="n">Entry</span> <span class="o">*</span><span class="n">e</span><span class="p">);</span> <span class="c1">// submit a log entry
</span>
<span class="kt">void</span> <span class="n">start</span><span class="p">();</span> <span class="c1">// start the thread that flushes the log
</span> <span class="kt">void</span> <span class="n">stop</span><span class="p">();</span> <span class="c1">// stop the thread
</span>
<span class="c1">/// true if the log lock is held by our thread
</span> <span class="kt">bool</span> <span class="n">is_inside_log_lock</span><span class="p">();</span>
<span class="c1">/// induce a segv on the next log event
</span> <span class="kt">void</span> <span class="n">inject_segv</span><span class="p">();</span>
<span class="p">};</span>
</code></pre></div></div>
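<p>The two-queue design can be approximated in a few lines of standard C++. The sketch below is a simplification under assumptions, not the real Log class: MiniLog, pending, recent and sink are hypothetical names, entries are plain strings, a std::vector stands in for the log file, and flush is called directly instead of from a background thread.</p>

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class MiniLog {
    std::mutex queue_mutex;           // guards pending, in the spirit of m_queue_mutex
    std::deque<std::string> pending;  // plays the role of m_new
    std::deque<std::string> recent;   // plays the role of m_recent
    std::size_t max_recent;

public:
    std::vector<std::string> sink;    // stands in for the log file

    explicit MiniLog(std::size_t max_recent_entries)
        : max_recent(max_recent_entries) {}

    // Called by worker threads: just enqueue, no I/O on the caller's path.
    void submit_entry(std::string e) {
        std::lock_guard<std::mutex> guard(queue_mutex);
        pending.push_back(std::move(e));
    }

    // Called by the single flusher: grab the whole batch under the lock,
    // then write it out without blocking submitters.
    void flush() {
        std::deque<std::string> batch;
        {
            std::lock_guard<std::mutex> guard(queue_mutex);
            batch.swap(pending);
        }
        for (std::string& e : batch) {
            sink.push_back(e);
            recent.push_back(std::move(e));
            if (recent.size() > max_recent)
                recent.pop_front();  // keep only the newest entries
        }
    }

    // Snapshot of the bounded recent-entries buffer.
    std::vector<std::string> dump_recent() const {
        return std::vector<std::string>(recent.begin(), recent.end());
    }
};
```

<p>Submitting "a", "b" and "c" and then flushing writes all three to the sink, while a MiniLog constructed with max_recent_entries of 2 retains only "b" and "c" for a later dump. Swapping the queue under the lock is what keeps the critical section short for submitting threads.</p>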
<h2 id="log-level--const-option">Log Level & Const Option</h2>
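<p>Pay special attention to the Log constructor: it takes a subsystem map. This map is the subsys member of the md_config_t object that is new'd in the initializer list of the CephContext constructor (before the Log object is created).
What is this map for?</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// describes the logging settings of a single subsystem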
</span><span class="k">struct</span> <span class="n">Subsystem</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">log_level</span><span class="p">;</span> <span class="c1">// log level
</span> <span class="kt">int</span> <span class="n">gather_level</span><span class="p">;</span> <span class="c1">// gather level, rarely used in practice
</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span><span class="p">;</span> <span class="c1">// subsystem name
</span>
<span class="n">Subsystem</span><span class="p">()</span> <span class="o">:</span> <span class="n">log_level</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">gather_level</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">SubsystemMap</span> <span class="p">{</span>
<span class="c1">// each subsystem occupies one slot of the vector
</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Subsystem</span><span class="o">></span> <span class="n">m_subsys</span><span class="p">;</span> <span class="c1">// per-subsystem log settings
</span> <span class="kt">unsigned</span> <span class="n">m_max_name_len</span><span class="p">;</span>
<span class="p">......</span>
<span class="c1">// add the settings of one subsystem
</span> <span class="kt">void</span> <span class="n">SubsystemMap</span><span class="o">::</span><span class="n">add</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">subsys</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">log</span><span class="p">,</span> <span class="kt">int</span> <span class="n">gather</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">subsys</span> <span class="o">>=</span> <span class="n">m_subsys</span><span class="p">.</span><span class="n">size</span><span class="p">())</span>
<span class="n">m_subsys</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">subsys</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">m_subsys</span><span class="p">[</span><span class="n">subsys</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
<span class="n">m_subsys</span><span class="p">[</span><span class="n">subsys</span><span class="p">].</span><span class="n">log_level</span> <span class="o">=</span> <span class="n">log</span><span class="p">;</span>
<span class="n">m_subsys</span><span class="p">[</span><span class="n">subsys</span><span class="p">].</span><span class="n">gather_level</span> <span class="o">=</span> <span class="n">gather</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">name</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">></span> <span class="n">m_max_name_len</span><span class="p">)</span>
<span class="n">m_max_name_len</span> <span class="o">=</span> <span class="n">name</span><span class="p">.</span><span class="n">length</span><span class="p">();</span>
<span class="p">}</span>
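<span class="c1">// decides whether a log entry is worth creating at all
</span> <span class="c1">// sub is the vector index that selects the subsystem
</span> <span class="c1">// level is the level of the log entry being considered
</span> <span class="c1">// e.g. if a subsystem is configured at level 5, one of its entries at level 3 is printed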
</span> <span class="kt">bool</span> <span class="n">should_gather</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">sub</span><span class="p">,</span> <span class="kt">int</span> <span class="n">level</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">sub</span> <span class="o"><</span> <span class="n">m_subsys</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="k">return</span> <span class="n">level</span> <span class="o"><=</span> <span class="n">m_subsys</span><span class="p">[</span><span class="n">sub</span><span class="p">].</span><span class="n">gather_level</span> <span class="o">||</span>
<span class="n">level</span> <span class="o"><=</span> <span class="n">m_subsys</span><span class="p">[</span><span class="n">sub</span><span class="p">].</span><span class="n">log_level</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
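<p>To make the two thresholds concrete, here is a minimal sketch. MiniSubsystemMap, should_log and make_example_map are hypothetical names invented for this illustration, not Ceph classes, and the levels 1/5 are made-up values. should_gather mirrors the check above; should_log mirrors the comparison against the log level alone that _flush performs later in this post.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, trimmed-down stand-in for SubsystemMap (not the Ceph class).
struct MiniSubsystem {
  int log_level = 0;
  int gather_level = 0;
  std::string name;
};

class MiniSubsystemMap {
  std::vector<MiniSubsystem> subsys_;  // indexed by subsystem id
public:
  void add(unsigned id, const std::string& name, int log, int gather) {
    if (id >= subsys_.size())
      subsys_.resize(id + 1);
    subsys_[id].name = name;
    subsys_[id].log_level = log;
    subsys_[id].gather_level = gather;
  }
  // An entry is created if either threshold admits it.
  bool should_gather(unsigned id, int level) const {
    assert(id < subsys_.size());
    return level <= subsys_[id].gather_level ||
           level <= subsys_[id].log_level;
  }
  // Only the log level decides whether the entry is written out.
  bool should_log(unsigned id, int level) const {
    assert(id < subsys_.size());
    return level <= subsys_[id].log_level;
  }
};

// Builds a map with an illustrative "rbd" subsystem at log 1 / gather 5.
inline MiniSubsystemMap make_example_map() {
  MiniSubsystemMap m;
  m.add(0, "none", 0, 5);
  m.add(1, "rbd", 1, 5);
  return m;
}
```

<p>With these made-up settings, an entry at level 3 passes should_gather but not should_log: it would be queued and kept in the recent list without being written to the log file.</p>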
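<p>Why is m_subsys a vector? A map<string, int>, keyed by subsystem name with the log level as the value, would seem more natural, since it allows looking up a subsystem's level by name.
The answer lies in the configuration header that defines the log levels: the C++ code #includes that header directly, and the position of a subsystem in it determines its index here, so a vector is used instead of a map.
During the same include, constants for the options are also declared, and they are initialized in the constructor.</p>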
<p>The configuration header is src/common/config_opts.h; it is processed in src/common/config.h and src/common/config.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// src/common/config_opts.h
// The header contains three kinds of entries: OPTION, DEFAULT_SUBSYS, SUBSYS
// Each include site defines the macros it needs; the kinds it does not select expand to nothing
</span>
<span class="p">......</span>
<span class="n">OPTION</span><span class="p">(</span><span class="n">xio_portal_threads</span><span class="p">,</span> <span class="n">OPT_INT</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="c1">// (option name, option type, default value)
</span>
<span class="n">DEFAULT_SUBSYS</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">SUBSYS</span><span class="p">(</span><span class="n">lockdep</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">SUBSYS</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">// (subsystem name, log level, gather level)
</span><span class="p">......</span>
</code></pre></div></div>
<p>Here is how the header is included, taking src/common/config.h as the example:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">md_config_t</span> <span class="p">{</span>
<span class="p">......</span>
<span class="c1">// pick up each option here and declare a constant for it
</span> <span class="c1">// one macro per option type
</span> <span class="cp">#define OPTION_OPT_INT(name) const int name;
</span> <span class="cp">#define OPTION_OPT_LONGLONG(name) const long long name;
</span> <span class="cp">#define OPTION_OPT_STR(name) const std::string name;
</span> <span class="cp">#define OPTION_OPT_DOUBLE(name) const double name;
</span> <span class="cp">#define OPTION_OPT_FLOAT(name) const float name;
</span> <span class="cp">#define OPTION_OPT_BOOL(name) const bool name;
</span> <span class="cp">#define OPTION_OPT_ADDR(name) const entity_addr_t name;
</span> <span class="cp">#define OPTION_OPT_U32(name) const uint32_t name;
</span> <span class="cp">#define OPTION_OPT_U64(name) const uint64_t name;
</span> <span class="cp">#define OPTION_OPT_UUID(name) const uuid_d name;
</span>
<span class="c1">// OPTION dispatches to the per-type macro above
</span> <span class="cp">#define OPTION(name, ty, init) OPTION_##ty(name)
</span>
<span class="c1">// the two kinds not needed here, SUBSYS/DEFAULT_SUBSYS, are defined as empty
</span> <span class="cp">#define SUBSYS(name, log, gather)
</span> <span class="cp">#define DEFAULT_SUBSYS(log, gather)
</span>
<span class="c1">// include the config header; in effect this picks up every OPTION
</span> <span class="cp">#include "common/config_opts.h"
</span>
<span class="c1">// undefine all of the macros defined above
</span> <span class="cp">#undef OPTION_OPT_INT
</span> <span class="cp">#undef OPTION_OPT_LONGLONG
</span> <span class="cp">#undef OPTION_OPT_STR
</span> <span class="cp">#undef OPTION_OPT_DOUBLE
</span> <span class="cp">#undef OPTION_OPT_FLOAT
</span> <span class="cp">#undef OPTION_OPT_BOOL
</span> <span class="cp">#undef OPTION_OPT_ADDR
</span> <span class="cp">#undef OPTION_OPT_U32
</span> <span class="cp">#undef OPTION_OPT_U64
</span> <span class="cp">#undef OPTION_OPT_UUID
</span> <span class="cp">#undef OPTION
</span> <span class="cp">#undef SUBSYS
</span> <span class="cp">#undef DEFAULT_SUBSYS
</span>
<span class="p">......</span>
<span class="p">};</span>
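<span class="c1">// Define an enum, which in effect generates the index of each subsystem by relying on enumerator auto-increment,
// so the position in the header determines the subsystem's index in the vector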
</span><span class="k">enum</span> <span class="n">config_subsys_id</span> <span class="p">{</span>
<span class="n">ceph_subsys_</span><span class="p">,</span> <span class="c1">// default
</span>
<span class="c1">// OPTION is not needed here, so define it as empty first
</span> <span class="cp">#define OPTION(a,b,c)
</span>
<span class="c1">// macro that generates one enumerator
</span> <span class="cp">#define SUBSYS(name, log, gather) \
ceph_subsys_##name,
</span>
<span class="c1">// DEFAULT_SUBSYS is not needed here either, define it as empty
</span> <span class="cp">#define DEFAULT_SUBSYS(log, gather)
</span>
<span class="c1">// include the config header; in effect this picks up every SUBSYS
</span> <span class="cp">#include "common/config_opts.h"
</span>
<span class="c1">// undefine all of the macros defined above
</span> <span class="cp">#undef SUBSYS
</span> <span class="cp">#undef OPTION
</span> <span class="cp">#undef DEFAULT_SUBSYS
</span>
<span class="n">ceph_subsys_max</span>
<span class="p">};</span>
</code></pre></div></div>
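<p>The enum trick above is the classic X-macro pattern. The sketch below is a self-contained miniature of it, under one loud assumption: instead of #including a separate header, the entries live in a list macro called DEMO_CONFIG_LIST. That macro, the demo_ prefix and the helper demo_subsys_name are all hypothetical names made up for this illustration.</p>

```cpp
#include <string>

// Hypothetical stand-in for the included config header: a list macro that
// invokes whatever OPTION/SUBSYS are defined as at the point of use.
#define DEMO_CONFIG_LIST \
  OPTION(demo_threads, int, 2) \
  SUBSYS(lockdep, 0, 1) \
  SUBSYS(context, 0, 1) \
  SUBSYS(rbd, 0, 5)

// Pass 1: one enumerator per SUBSYS; OPTION expands to nothing.
enum demo_subsys_id {
  demo_subsys_,  // default slot, index 0
#define OPTION(name, type, def)
#define SUBSYS(name, log, gather) demo_subsys_##name,
  DEMO_CONFIG_LIST
#undef SUBSYS
#undef OPTION
  demo_subsys_max
};

// Pass 2: a name table generated from the same list, so the positions
// line up with the enumerators by construction.
static const char* const demo_subsys_names[] = {
  "none",
#define OPTION(name, type, def)
#define SUBSYS(name, log, gather) #name,
  DEMO_CONFIG_LIST
#undef SUBSYS
#undef OPTION
};

// Looks up a subsystem name by its generated index.
inline std::string demo_subsys_name(demo_subsys_id id) {
  return demo_subsys_names[id];
}
```

<p>Because both passes walk the same list, inserting a new SUBSYS line shifts the enumerator and the table entry together. This is why a plain vector indexed by the enum is enough and no name-to-level map is needed.</p>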
<p>That covers the header; now look at the constructor implementation in src/common/config.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">md_config_t</span><span class="o">::</span><span class="n">md_config_t</span><span class="p">()</span>
<span class="o">:</span> <span class="n">cluster</span><span class="p">(</span><span class="s">"ceph"</span><span class="p">),</span>
<span class="c1">// the header declared the constants; they are initialized here
</span><span class="cp">#define OPTION_OPT_INT(name, def_val) name(def_val),
#define OPTION_OPT_LONGLONG(name, def_val) name((1LL) * def_val),
#define OPTION_OPT_STR(name, def_val) name(def_val),
#define OPTION_OPT_DOUBLE(name, def_val) name(def_val),
#define OPTION_OPT_FLOAT(name, def_val) name(def_val),
#define OPTION_OPT_BOOL(name, def_val) name(def_val),
#define OPTION_OPT_ADDR(name, def_val) name(def_val),
#define OPTION_OPT_U32(name, def_val) name(def_val),
#define OPTION_OPT_U64(name, def_val) name(((uint64_t)1) * def_val),
#define OPTION_OPT_UUID(name, def_val) name(def_val),
</span>
<span class="cp">#define OPTION(name, type, def_val) OPTION_##type(name, def_val)
</span>
<span class="c1">// the following two kinds are not needed
</span><span class="cp">#define SUBSYS(name, log, gather)
#define DEFAULT_SUBSYS(log, gather)
</span>
<span class="c1">// include the header
</span><span class="cp">#include "common/config_opts.h"
</span>
<span class="c1">// undefine all of the macros
</span><span class="cp">#undef OPTION_OPT_INT
#undef OPTION_OPT_LONGLONG
#undef OPTION_OPT_STR
#undef OPTION_OPT_DOUBLE
#undef OPTION_OPT_FLOAT
#undef OPTION_OPT_BOOL
#undef OPTION_OPT_ADDR
#undef OPTION_OPT_U32
#undef OPTION_OPT_U64
#undef OPTION_OPT_UUID
#undef OPTION
#undef SUBSYS
#undef DEFAULT_SUBSYS
</span>
<span class="n">lock</span><span class="p">(</span><span class="s">"md_config_t"</span><span class="p">,</span> <span class="nb">true</span><span class="p">,</span> <span class="nb">false</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">init_subsys</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">md_config_t</span><span class="o">::</span><span class="n">init_subsys</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// add each subsystem to the vector; the first argument is the enumerator generated in the header, i.e. the index
</span><span class="cp">#define SUBSYS(name, log, gather) \
subsys.add(ceph_subsys_##name, STRINGIFY(name), log, gather);
#define DEFAULT_SUBSYS(log, gather) \
subsys.add(ceph_subsys_, "none", log, gather);
</span>
<span class="c1">// filter out the options
</span><span class="cp">#define OPTION(a, b, c)
</span>
<span class="c1">// include the header
</span><span class="cp">#include "common/config_opts.h"
</span>
<span class="c1">// undefine all of the macros
</span><span class="cp">#undef OPTION
#undef SUBSYS
#undef DEFAULT_SUBSYS
</span><span class="p">}</span>
</code></pre></div></div>
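<p>The declare-then-initialize passes can be reproduced in a few lines. This is only a sketch under stated assumptions: demo_config_t, DEMO_LIST and DEMO_STRINGIFY are hypothetical names, the entries sit in a list macro rather than an included header, and only two option types are modelled.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical list standing in for the config header.
#define DEMO_LIST \
  OPTION(demo_threads, OPT_INT, 2) \
  OPTION(demo_name, OPT_STR, "ceph") \
  SUBSYS(lockdep, 0, 1) \
  SUBSYS(rbd, 0, 5)

#define DEMO_STRINGIFY(x) #x

struct demo_config_t {
  // Pass 1: declare one const member per OPTION, dispatching on the type.
#define OPTION_OPT_INT(name) const int name;
#define OPTION_OPT_STR(name) const std::string name;
#define OPTION(name, ty, init) OPTION_##ty(name)
#define SUBSYS(name, log, gather)
  DEMO_LIST
#undef OPTION_OPT_INT
#undef OPTION_OPT_STR
#undef OPTION
#undef SUBSYS

  std::vector<std::string> subsys_names;
  demo_config_t();
};

demo_config_t::demo_config_t()
  :
  // Pass 2: one "name(default)," initializer per OPTION, in declaration order.
#define OPTION(name, ty, init) name(init),
#define SUBSYS(name, log, gather)
  DEMO_LIST
#undef OPTION
#undef SUBSYS
  subsys_names()  // a regular member ends the list after the trailing comma
{
  // Pass 3: register every SUBSYS; options are filtered out.
#define OPTION(name, ty, init)
#define SUBSYS(name, log, gather) subsys_names.push_back(DEMO_STRINGIFY(name));
  DEMO_LIST
#undef OPTION
#undef SUBSYS
}
```

<p>Each generated initializer ends with a comma, so an ordinary member has to follow the list. In the listing above, lock("md_config_t", true, false) plays that role.</p>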
<h2 id="log-thread">Log Thread</h2>
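<p>When the Log is created it records the log levels of the whole system (process), i.e. the SubsystemMap. Next, see how its thread does its work:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// thread entry function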
</span><span class="kt">void</span> <span class="o">*</span><span class="n">Log</span><span class="o">::</span><span class="n">entry</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">m_stop</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">m_new</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// new entries are pending
</span> <span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">flush</span><span class="p">();</span> <span class="c1">// flush the entries
</span> <span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">pthread_cond_wait</span><span class="p">(</span><span class="o">&</span><span class="n">m_cond_flusher</span><span class="p">,</span> <span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">flush</span><span class="p">();</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Log</span><span class="o">::</span><span class="n">flush</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m_flush_mutex</span><span class="p">);</span>
<span class="n">m_flush_mutex_holder</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="n">EntryQueue</span> <span class="n">t</span><span class="p">;</span> <span class="c1">// temporary queue
</span> <span class="n">t</span><span class="p">.</span><span class="n">swap</span><span class="p">(</span><span class="n">m_new</span><span class="p">);</span> <span class="c1">// O(1) swap, after which m_new can accept newly submitted entries again
</span> <span class="n">pthread_cond_broadcast</span><span class="p">(</span><span class="o">&</span><span class="n">m_cond_loggers</span><span class="p">);</span>
<span class="n">m_queue_mutex_holder</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span> <span class="c1">// release the lock early so other threads can keep submitting
</span> <span class="n">_flush</span><span class="p">(</span><span class="o">&</span><span class="n">t</span><span class="p">,</span> <span class="o">&</span><span class="n">m_recent</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span> <span class="c1">// actually print the temporary queue, and record the entries in m_recent
</span>
<span class="c1">// trim
</span> <span class="k">while</span> <span class="p">(</span><span class="n">m_recent</span><span class="p">.</span><span class="n">m_len</span> <span class="o">></span> <span class="n">m_max_recent</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// m_recent is size-limited; the excess is deleted
</span> <span class="k">delete</span> <span class="n">m_recent</span><span class="p">.</span><span class="n">dequeue</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">m_flush_mutex_holder</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m_flush_mutex</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Log</span><span class="o">::</span><span class="n">_flush</span><span class="p">(</span><span class="n">EntryQueue</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="n">EntryQueue</span> <span class="o">*</span><span class="n">requeue</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">crash</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Entry</span> <span class="o">*</span><span class="n">e</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">80</span><span class="p">];</span>
<span class="k">while</span> <span class="p">((</span><span class="n">e</span> <span class="o">=</span> <span class="n">t</span><span class="o">-></span><span class="n">dequeue</span><span class="p">())</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">sub</span> <span class="o">=</span> <span class="n">e</span><span class="o">-></span><span class="n">m_subsys</span><span class="p">;</span> <span class="c1">// vector index: which subsystem this entry belongs to
</span>
<span class="kt">bool</span> <span class="n">should_log</span> <span class="o">=</span> <span class="n">crash</span> <span class="o">||</span> <span class="n">m_subs</span><span class="o">-></span><span class="n">get_log_level</span><span class="p">(</span><span class="n">sub</span><span class="p">)</span> <span class="o">>=</span> <span class="n">e</span><span class="o">-></span><span class="n">m_prio</span><span class="p">;</span> <span class="c1">// printed only if the entry's level is <= the level configured for the subsystem
</span>
<span class="kt">bool</span> <span class="n">do_fd</span> <span class="o">=</span> <span class="n">m_fd</span> <span class="o">>=</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">should_log</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">do_syslog</span> <span class="o">=</span> <span class="n">m_syslog_crash</span> <span class="o">>=</span> <span class="n">e</span><span class="o">-></span><span class="n">m_prio</span> <span class="o">&&</span> <span class="n">should_log</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">do_stderr</span> <span class="o">=</span> <span class="n">m_stderr_crash</span> <span class="o">>=</span> <span class="n">e</span><span class="o">-></span><span class="n">m_prio</span> <span class="o">&&</span> <span class="n">should_log</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">do_fd</span> <span class="o">||</span> <span class="n">do_syslog</span> <span class="o">||</span> <span class="n">do_stderr</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">do_fd</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">safe_write</span><span class="p">(</span><span class="n">m_fd</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">buflen</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">do_syslog</span><span class="p">)</span> <span class="p">{</span>
<span class="n">syslog</span><span class="p">(</span><span class="n">LOG_USER</span><span class="p">,</span> <span class="s">"%s%s"</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">do_stderr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cerr</span> <span class="o"><<</span> <span class="n">buf</span> <span class="o"><<</span> <span class="n">s</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">requeue</span><span class="o">-></span><span class="n">enqueue</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> <span class="c1">// keep the entry in the m_recent queue
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
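<p>The core idea of flush() is to hold the queue lock only for an O(1) swap and do the slow output afterwards. The following sketch shows that pattern with std::mutex instead of raw pthread calls. MiniLog and all of its members are hypothetical names for this illustration, a std::vector stands in for the fd/syslog/stderr sinks, and condition variables are left out.</p>

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of the queue-swap pattern (not ceph::log::Log).
class MiniLog {
  std::mutex queue_mutex_;          // protects new_ only
  mutable std::mutex flush_mutex_;  // serialises flushes; protects recent_ and sink_
  std::deque<std::string> new_;     // entries waiting to be flushed
  std::deque<std::string> recent_;  // bounded history of flushed entries
  std::vector<std::string> sink_;   // stands in for the real output
  std::size_t max_recent_;
public:
  explicit MiniLog(std::size_t max_recent) : max_recent_(max_recent) {}

  void submit(std::string line) {
    std::lock_guard<std::mutex> l(queue_mutex_);
    new_.push_back(std::move(line));
  }

  void flush() {
    std::lock_guard<std::mutex> fl(flush_mutex_);
    std::deque<std::string> t;
    {
      // Submitters are blocked only for the duration of this swap.
      std::lock_guard<std::mutex> ql(queue_mutex_);
      t.swap(new_);
    }
    // The slow part runs without queue_mutex_ held.
    for (auto& line : t) {
      sink_.push_back(line);
      recent_.push_back(std::move(line));
    }
    // Trim the history to its size limit.
    while (recent_.size() > max_recent_)
      recent_.pop_front();
  }

  std::size_t written() const {
    std::lock_guard<std::mutex> l(flush_mutex_);
    return sink_.size();
  }
  std::size_t recent_size() const {
    std::lock_guard<std::mutex> l(flush_mutex_);
    return recent_.size();
  }
};
```

<p>Using two mutexes keeps the roles separate, as in the original: one guards the submission queue, the other makes sure only one flush runs at a time.</p>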
<h2 id="submit-log-entry">Submit Log Entry</h2>
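<p>The thread keeps taking entries to flush from the m_new queue: it swaps them into a temporary queue for printing and releases the lock immediately, so that m_new can accept new requests again.
Although every thread may submit log entries, the lock is held only briefly, which keeps the performance impact as small as possible.</p>
<p>How do other threads submit their log requests? Take a simple example from librbd that prints image information:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// every subsystem defines these macros at the top of its source files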
</span><span class="cp">#define dout_subsys ceph_subsys_rbd
#undef dout_prefix
#define dout_prefix *_dout << "librbd: "
</span>
<span class="p">......</span>
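<span class="c1">// Much like the usual cout, with two extra arguments: the CephContext and the log level
// The cct argument is only needed on the client side; daemon processes can use the other macro, dout(20), directly,
// because a daemon has a single global CephContext, whereas a client may open several CephContexts and has to name one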
</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"info "</span> <span class="o"><<</span> <span class="n">ictx</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span> <span class="c1">// dendl must be paired with ldout/dout and the like
</span>
</code></pre></div></div>
<p>The macros are mainly defined in src/common/dout.h:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define dout_impl(cct, sub, v) \
do { \
if (cct->_conf->subsys.should_gather(sub, v)) { \
if (0) { \
char __array[((v >= -1) && (v <= 200)) ? 0 : -1] __attribute__((unused)); \
} \
ceph::log::Entry *_dout_e = cct->_log->create_entry(v, sub); \
ostream _dout_os(&_dout_e->m_streambuf); \
CephContext *_dout_cct = cct; \
std::ostream* _dout = &_dout_os;
</span>
<span class="cp">#define ldout(cct, v) dout_impl(cct, dout_subsys, v) dout_prefix
</span>
<span class="cp">#define dendl std::flush; \
_ASSERT_H->_log->submit_entry(_dout_e); \
} \
} while (0)
</span>
<span class="c1">// this macro lives in src/include/assert.h
</span><span class="cp">#define _ASSERT_H _dout_cct
</span></code></pre></div></div>
<p>Expanding <code class="highlighter-rouge">ldout(ictx->cct, 20) << "info " << ictx << dendl</code> gives:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">subsys</span><span class="p">.</span><span class="n">should_gather</span><span class="p">(</span><span class="n">sub</span><span class="p">,</span> <span class="n">v</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// if the level check fails nothing is done, so the cost is negligible
</span>
<span class="n">ceph</span><span class="o">::</span><span class="n">log</span><span class="o">::</span><span class="n">Entry</span> <span class="o">*</span><span class="n">_dout_e</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_log</span><span class="o">-></span><span class="n">create_entry</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">sub</span><span class="p">);</span> <span class="c1">// create the entry
</span>
<span class="c1">// define an output stream _dout_os backed by the entry's buffer; anything written to the stream ends up in that buffer
</span> <span class="n">ostream</span> <span class="n">_dout_os</span><span class="p">(</span><span class="o">&</span><span class="n">_dout_e</span><span class="o">-></span><span class="n">m_streambuf</span><span class="p">);</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">_dout_cct</span> <span class="o">=</span> <span class="n">cct</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">ostream</span><span class="o">*</span> <span class="n">_dout</span> <span class="o">=</span> <span class="o">&</span><span class="n">_dout_os</span><span class="p">;</span> <span class="c1">// store the stream's address in _dout
</span>
<span class="c1">// next comes the expansion of the dout_prefix macro
</span> <span class="o">*</span><span class="n">_dout</span> <span class="o"><<</span> <span class="s">"librbd: "</span>
<span class="c1">// the message supplied by this particular call
</span> <span class="o"><<</span> <span class="s">"info "</span> <span class="o"><<</span> <span class="n">ictx</span> <span class="o"><<</span>
<span class="c1">// next comes the expansion of dendl
</span> <span class="n">std</span><span class="o">::</span><span class="n">flush</span><span class="p">;</span>
<span class="n">_dout_cct</span><span class="o">-></span><span class="n">_log</span><span class="o">-></span><span class="n">submit_entry</span><span class="p">(</span><span class="n">_dout_e</span><span class="p">);</span> <span class="c1">// submit the entry
</span> <span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
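<p>The same open-scope/close-scope macro pairing can be tried out in isolation. The sketch below is an assumption-laden miniature: DemoLogger, demo_out, demo_endl and demo_count_evaluations are hypothetical names, and a std::ostringstream replaces the entry's stream buffer. It also shows why a filtered message is cheap: the stream operands are never evaluated.</p>

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical logger with a single threshold (not a Ceph type).
struct DemoLogger {
  int threshold;
  std::vector<std::string> lines;
  bool should_gather(int level) const { return level <= threshold; }
  void submit(const std::string& s) { lines.push_back(s); }
};

// Opens "do { if (...) {" and leaves a stream expression unfinished,
// so the caller can keep appending with <<.
#define demo_out(logger, v) \
  do { \
    if ((logger).should_gather(v)) { \
      DemoLogger& _demo_logger = (logger); \
      std::ostringstream _demo_os; \
      _demo_os << "demo: "

// Finishes the stream expression, submits the text, closes both scopes.
#define demo_endl \
      std::flush; \
      _demo_logger.submit(_demo_os.str()); \
    } \
  } while (0)

// Returns how many times the streamed operand was evaluated (0 or 1).
inline int demo_count_evaluations(DemoLogger& lg, int level) {
  int calls = 0;
  demo_out(lg, level) << "value " << ++calls << demo_endl;
  return calls;
}
```

<p>Since demo_endl supplies the closing braces, forgetting it is a compile error rather than a silent bug. The same holds for the dout/dendl pairing described above.</p>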
<p>create_entry and submit_entry above are straightforward member functions of the Log class:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Entry</span> <span class="o">*</span><span class="n">Log</span><span class="o">::</span><span class="n">create_entry</span><span class="p">(</span><span class="kt">int</span> <span class="n">level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">subsys</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="n">Entry</span><span class="p">(</span><span class="n">ceph_clock_now</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span> <span class="c1">// allocate a new Entry
</span> <span class="n">pthread_self</span><span class="p">(),</span>
<span class="n">level</span><span class="p">,</span> <span class="n">subsys</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// kludge for perf testing
</span> <span class="n">Entry</span> <span class="o">*</span><span class="n">e</span> <span class="o">=</span> <span class="n">m_recent</span><span class="p">.</span><span class="n">dequeue</span><span class="p">();</span>
<span class="n">e</span><span class="o">-></span><span class="n">m_stamp</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">e</span><span class="o">-></span><span class="n">m_thread</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="n">e</span><span class="o">-></span><span class="n">m_prio</span> <span class="o">=</span> <span class="n">level</span><span class="p">;</span>
<span class="n">e</span><span class="o">-></span><span class="n">m_subsys</span> <span class="o">=</span> <span class="n">subsys</span><span class="p">;</span>
<span class="k">return</span> <span class="n">e</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Log</span><span class="o">::</span><span class="n">submit_entry</span><span class="p">(</span><span class="n">Entry</span> <span class="o">*</span><span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span> <span class="c1">// the lock mentioned earlier, taken while submitting an entry
</span>
<span class="c1">// wait for flush to catch up
</span> <span class="k">while</span> <span class="p">(</span><span class="n">m_new</span><span class="p">.</span><span class="n">m_len</span> <span class="o">></span> <span class="n">m_max_new</span><span class="p">)</span>
<span class="n">pthread_cond_wait</span><span class="p">(</span><span class="o">&</span><span class="n">m_cond_loggers</span><span class="p">,</span> <span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="n">m_new</span><span class="p">.</span><span class="n">enqueue</span><span class="p">(</span><span class="n">e</span><span class="p">);</span> <span class="c1">// enqueue
</span> <span class="n">pthread_cond_signal</span><span class="p">(</span><span class="o">&</span><span class="n">m_cond_flusher</span><span class="p">);</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m_queue_mutex</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>One more detail: the Entry implementation contains a small optimization for short log messages, which fit into a fixed-size buffer:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Entry</span> <span class="p">{</span>
<span class="n">utime_t</span> <span class="n">m_stamp</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">m_thread</span><span class="p">;</span>
<span class="kt">short</span> <span class="n">m_prio</span><span class="p">,</span> <span class="n">m_subsys</span><span class="p">;</span>
<span class="n">Entry</span> <span class="o">*</span><span class="n">m_next</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">m_static_buf</span><span class="p">[</span><span class="n">CEPH_LOG_ENTRY_PREALLOC</span><span class="p">];</span> <span class="c1">// the macro is defined as 80
</span> <span class="n">PrebufferedStreambuf</span> <span class="n">m_streambuf</span><span class="p">;</span>
<span class="n">Entry</span><span class="p">()</span>
<span class="o">:</span> <span class="n">m_thread</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">m_prio</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">m_subsys</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
<span class="n">m_next</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">m_streambuf</span><span class="p">(</span><span class="n">m_static_buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">m_static_buf</span><span class="p">))</span> <span class="c1">// m_streambuf starts out on the 80-byte static buffer and only grows dynamically into a string when that runs out
</span> <span class="p">{}</span>
<span class="p">......</span>
<span class="p">};</span>
</code></pre></div></div>
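<p>How such a stream buffer might work can be sketched with a std::streambuf subclass. This is not Ceph's PrebufferedStreambuf. DemoPrebufStreambuf, DemoEntry and demo_format are hypothetical names, the static buffer is shrunk to 16 bytes to keep the example small, and the spill strategy (append the excess to a std::string one character at a time) is a simplification chosen for clarity.</p>

```cpp
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical sketch: writes go to a caller-provided static buffer first
// and spill into a std::string only when that buffer is full.
class DemoPrebufStreambuf : public std::streambuf {
  char* static_buf_;
  std::size_t static_len_;
  std::string overflow_;  // holds only the part that did not fit
  bool spilled_;
public:
  DemoPrebufStreambuf(char* buf, std::size_t len)
    : static_buf_(buf), static_len_(len), spilled_(false) {
    setp(static_buf_, static_buf_ + static_len_);  // put area = static buffer
  }

  bool spilled() const { return spilled_; }

  std::string str() const {
    if (!spilled_)
      return std::string(pbase(), pptr());
    return std::string(static_buf_, static_len_) + overflow_;
  }

protected:
  // Called when the put area is full, i.e. the static buffer is exhausted.
  int_type overflow(int_type c) override {
    if (traits_type::eq_int_type(c, traits_type::eof()))
      return traits_type::not_eof(c);
    if (!spilled_) {
      spilled_ = true;
      setp(nullptr, nullptr);  // no put area: later chars all come here
    }
    overflow_.push_back(traits_type::to_char_type(c));
    return c;
  }
};

// Mirrors the layout of Entry: the buffer is declared before the streambuf.
struct DemoEntry {
  char static_buf[16];
  DemoPrebufStreambuf streambuf;
  DemoEntry() : streambuf(static_buf, sizeof(static_buf)) {}
};

// Streams msg into the entry and returns the accumulated text.
inline std::string demo_format(DemoEntry& e, const std::string& msg) {
  std::ostream os(&e.streambuf);
  os << msg;
  return e.streambuf.str();
}
```

<p>A message that fits into the static buffer never touches the heap-backed string. Since each Entry already carries that buffer inline, no extra allocation is needed for a short log line.</p>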
<h2 id="dynamic-change-config">Dynamic Change Config</h2>
<h3 id="single-daemon">Single Daemon</h3>
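<p>One way to apply a change is to edit the configuration file and restart the process. Ceph itself tolerates such restarts, but it also provides another mechanism for changing parameters at runtime (both log levels and other options).
It is built on a unix domain socket and is called AdminSocket in Ceph. Leaving the implementation of the class aside for now, it listens on a unix domain socket and parses and executes each command request it receives.
For example, the following command shows the configuration of osd 0:</p>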
<blockquote>
<p>ceph daemon path_to_admin_socket/osd.0.asok config show</p>
</blockquote>
<p>or</p>
<blockquote>
<p>ceph daemon osd.0 config show</p>
</blockquote>
<p>How is this implemented? Like the Log class, AdminSocket is initialized in CephContext:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CephContext</span><span class="o">::</span><span class="n">CephContext</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">module_type_</span><span class="p">)</span>
<span class="o">:</span> <span class="n">nref</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
<span class="n">_conf</span><span class="p">(</span><span class="k">new</span> <span class="n">md_config_t</span><span class="p">()),</span>
<span class="n">_log</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_module_type</span><span class="p">(</span><span class="n">module_type_</span><span class="p">),</span>
<span class="n">_service_thread</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_log_obs</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_admin_socket</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_perf_counters_collection</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_perf_counters_conf_obs</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_heartbeat_map</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_crypto_none</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">_crypto_aes</span><span class="p">(</span><span class="nb">NULL</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_log</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ceph</span><span class="o">::</span><span class="n">log</span><span class="o">::</span><span class="n">Log</span><span class="p">(</span><span class="o">&</span><span class="n">_conf</span><span class="o">-></span><span class="n">subsys</span><span class="p">);</span>
<span class="n">_log</span><span class="o">-></span><span class="n">start</span><span class="p">();</span>
<span class="n">_log_obs</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LogObs</span><span class="p">(</span><span class="n">_log</span><span class="p">);</span>
<span class="n">_conf</span><span class="o">-></span><span class="n">add_observer</span><span class="p">(</span><span class="n">_log_obs</span><span class="p">);</span>
<span class="n">_admin_socket</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AdminSocket</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// create the AdminSocket object
</span>
<span class="n">_admin_hook</span> <span class="o">=</span> <span class="k">new</span> <span class="n">CephContextHook</span><span class="p">(</span><span class="k">this</span><span class="p">);</span> <span class="c1">// hook that executes commands
</span> <span class="n">_admin_socket</span><span class="o">-></span><span class="n">register_command</span><span class="p">(</span><span class="s">"perfcounters_dump"</span><span class="p">,</span> <span class="s">"perfcounters_dump"</span><span class="p">,</span> <span class="n">_admin_hook</span><span class="p">,</span> <span class="s">""</span><span class="p">);</span> <span class="c1">// register the commands it recognizes
</span> <span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
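<p>Right after initialization, it registers the commands it recognizes together with their handlers (AdminSocketHook), then starts the AdminSocket thread (like Log, AdminSocket inherits from Thread)
and begins accepting requests. When a request arrives, AdminSocket::do_accept() is called; if the request matches a previously registered command, the hook supplied at registration is invoked, i.e. AdminSocketHook::call():</p>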
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CephContextHook</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AdminSocketHook</span> <span class="p">{</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">m_cct</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">CephContextHook</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">)</span> <span class="o">:</span> <span class="n">m_cct</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">bool</span> <span class="n">call</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">command</span><span class="p">,</span> <span class="n">cmdmap_t</span><span class="o">&</span> <span class="n">cmdmap</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">format</span><span class="p">,</span> <span class="c1">// execute the command
</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">out</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_cct</span><span class="o">-></span><span class="n">do_command</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">cmdmap</span><span class="p">,</span> <span class="n">format</span><span class="p">,</span> <span class="o">&</span><span class="n">out</span><span class="p">);</span> <span class="c1">// delegate to CephContext::do_command
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">CephContext</span><span class="o">::</span><span class="n">do_command</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">command</span><span class="p">,</span> <span class="n">cmdmap_t</span><span class="o">&</span> <span class="n">cmdmap</span><span class="p">,</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">format</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">command</span> <span class="o">==</span> <span class="s">"perfcounters_dump"</span> <span class="o">||</span> <span class="n">command</span> <span class="o">==</span> <span class="s">"1"</span> <span class="o">||</span>
<span class="n">command</span> <span class="o">==</span> <span class="s">"perf dump"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_perf_counters_collection</span><span class="o">-></span><span class="n">dump_formatted</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="k">else</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">command</span> <span class="o">==</span> <span class="s">"config show"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_conf</span><span class="o">-></span><span class="n">show_config</span><span class="p">(</span><span class="n">f</span><span class="p">);</span> <span class="c1">// print every config option
</span> <span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">command</span> <span class="o">==</span> <span class="s">"config set"</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// command that changes a parameter
</span> <span class="p">......</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_conf</span><span class="o">-></span><span class="n">set_val</span><span class="p">(</span><span class="n">var</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">valstr</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span> <span class="c1">// call md_config_t::set_val() to change the config
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
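<p>The register-then-dispatch flow described above can be sketched without any sockets. The names SketchHook, SketchAdminSocket and SketchConfigHook are hypothetical, and the hook signature is simplified to a command string and an output string; the real AdminSocketHook::call() takes the extra arguments shown in the listing above.</p>

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical hook interface, loosely modelled on AdminSocketHook::call().
struct SketchHook {
  virtual ~SketchHook() {}
  virtual bool call(const std::string &command, std::string &out) = 0;
};

// Hypothetical registry: maps a command string to the hook that handles it.
class SketchAdminSocket {
  std::map<std::string, SketchHook*> m_hooks;  // hooks are not owned

public:
  // Returns false if the command was already registered.
  bool register_command(const std::string &command, SketchHook *hook) {
    return m_hooks.insert(std::make_pair(command, hook)).second;
  }

  // Looks the command up and invokes its hook; false for unknown commands.
  bool dispatch(const std::string &command, std::string &out) {
    std::map<std::string, SketchHook*>::iterator it = m_hooks.find(command);
    if (it == m_hooks.end()) {
      out = "unknown command";
      return false;
    }
    return it->second->call(command, out);
  }
};

// Example hook that answers "config show" from an in-memory table.
struct SketchConfigHook : SketchHook {
  std::map<std::string, std::string> conf;

  bool call(const std::string &command, std::string &out) override {
    if (command != "config show")
      return false;
    for (const auto &kv : conf)
      out += kv.first + " = " + kv.second + "\n";
    return true;
  }
};
```

<p>In this sketch one hook object can be registered under several commands, which mirrors how a single CephContextHook serves every command that CephContext registers.</p>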
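<p>The rest of the logic is fairly simple: log levels can be changed at runtime, and so can the option values that were defined earlier as "constants" inside the class (yes, really):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// for an option, set_val calls the function below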
</span><span class="kt">int</span> <span class="n">md_config_t</span><span class="o">::</span><span class="n">set_val_impl</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">val</span><span class="p">,</span> <span class="k">const</span> <span class="n">config_option</span> <span class="o">*</span><span class="n">opt</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">set_val_raw</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="n">opt</span><span class="p">);</span> <span class="c1">// change the value
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="n">changed</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">opt</span><span class="o">-></span><span class="n">name</span><span class="p">);</span> <span class="c1">// record the changed option so observers can be notified (the observer design pattern)
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">md_config_t</span><span class="o">::</span><span class="n">set_val_raw</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">val</span><span class="p">,</span> <span class="k">const</span> <span class="n">config_option</span> <span class="o">*</span><span class="n">opt</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">opt</span><span class="o">-></span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">OPT_INT</span><span class="p">:</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">err</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">f</span> <span class="o">=</span> <span class="n">strict_sistrtoll</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="o">&</span><span class="n">err</span><span class="p">);</span> <span class="c1">// convert to an integer
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">err</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="o">*</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span><span class="p">)</span><span class="n">opt</span><span class="o">-></span><span class="n">conf_ptr</span><span class="p">(</span><span class="k">this</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span> <span class="c1">// write through the option's address inside the object, which changes the "constant"
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">config_option</span><span class="o">::</span><span class="n">conf_ptr</span><span class="p">(</span><span class="n">md_config_t</span> <span class="o">*</span><span class="n">conf</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)(((</span><span class="kt">char</span><span class="o">*</span><span class="p">)</span><span class="n">conf</span><span class="p">)</span> <span class="o">+</span> <span class="n">md_conf_off</span><span class="p">);</span> <span class="c1">// compute the member's address from the offset
</span> <span class="k">return</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// define an array holding a triple for each option (name, type, offset within md_config_t)
</span><span class="k">struct</span> <span class="n">config_option</span> <span class="n">config_optionsp</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="cp">#define OPTION(name, type, def_val) \
{ STRINGIFY(name), type, offsetof(struct md_config_t, name) }, // the third field records the offset
</span>
<span class="cp">#define SUBSYS(name, log, gather)
#define DEFAULT_SUBSYS(log, gather)
</span>
<span class="cp">#include "common/config_opts.h"
</span>
<span class="cp">#undef OPTION
#undef SUBSYS
#undef DEFAULT_SUBSYS
</span><span class="p">};</span>
<span class="cp">#ifndef offsetof
#define offsetof(STRUCTURE,FIELD) ((int)((char*)&((STRUCTURE*)0)->FIELD)) // cast 0 to a struct pointer and take the member's address to get the offset; the linux kernel's linked-list implementation uses the same trick
#endif
</span></code></pre></div></div>
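<p>The offset trick is easier to see in isolation. Below is a minimal sketch under stated assumptions: SketchConfig, SketchOption, sketch_options and sketch_set_val are hypothetical names, only int options are handled, and parsing uses std::strtol rather than Ceph's own helpers. Each table entry stores the byte offset of a member, and the setter writes through base address plus offset, just as conf_ptr() does.</p>

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical miniature of md_config_t: plain int members act as options.
struct SketchConfig {
  int threads;
  int debug_level;
};

// One table entry per option: its name and its byte offset in SketchConfig.
struct SketchOption {
  const char *name;
  std::size_t offset;
};

static const SketchOption sketch_options[] = {
  { "threads",     offsetof(SketchConfig, threads) },
  { "debug_level", offsetof(SketchConfig, debug_level) },
};

// Parses val as a decimal integer and stores it in the named member.
// Returns 0 on success, -ENOENT for an unknown option, -EINVAL if val
// is not a complete decimal number.
inline int sketch_set_val(SketchConfig *conf, const char *name, const char *val) {
  for (const SketchOption &opt : sketch_options) {
    if (std::strcmp(opt.name, name) != 0)
      continue;
    char *end = nullptr;
    long parsed = std::strtol(val, &end, 10);
    if (end == val || *end != '\0')
      return -EINVAL;
    // Same idea as conf_ptr(): base address of the object plus the offset.
    int *slot = reinterpret_cast<int*>(reinterpret_cast<char*>(conf) + opt.offset);
    *slot = static_cast<int>(parsed);
    return 0;
  }
  return -ENOENT;
}
```

<p>The setter never names a member directly, so adding an option in this sketch only means adding a member and one table row.</p>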
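<p>For log levels, the change is written directly into the subsystem map described earlier (the values in its vector), so the next log statement reads the new level when deciding whether to print. For ordinary options, a subsystem that cares about particular options
should register an observer, which is then notified whenever one of those values changes.</p>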
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">md_config_t</span><span class="o">::</span><span class="n">_apply_changes</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">ostream</span> <span class="o">*</span><span class="n">oss</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="c1">// notify the previously registered observers, i.e. the daemon processes (osd, mon, mds, etc.)
</span> <span class="k">for</span> <span class="p">(</span><span class="n">rev_obs_map_t</span><span class="o">::</span><span class="n">const_iterator</span> <span class="n">r</span> <span class="o">=</span> <span class="n">robs</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">r</span> <span class="o">!=</span> <span class="n">robs</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">md_config_obs_t</span> <span class="o">*</span><span class="n">obs</span> <span class="o">=</span> <span class="n">r</span><span class="o">-></span><span class="n">first</span><span class="p">;</span>
<span class="n">obs</span><span class="o">-></span><span class="n">handle_conf_change</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">r</span><span class="o">-></span><span class="n">second</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">changed</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
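<p>The notification step can be sketched as a small observer registry. All names here (SketchObserver, SketchConf, SketchRecorder, tracked_keys, mark_changed) are hypothetical and the interfaces are simplified; the sketch only shows the shape of the technique: collect changed keys, group them per interested observer, call each observer once, then clear the set.</p>

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical observer interface, simplified from md_config_obs_t.
struct SketchObserver {
  virtual ~SketchObserver() {}
  virtual std::vector<std::string> tracked_keys() const = 0;
  virtual void handle_conf_change(const std::set<std::string> &changed) = 0;
};

class SketchConf {
  std::multimap<std::string, SketchObserver*> m_observers;  // key -> observer
  std::set<std::string> m_changed;                          // pending changes

public:
  // Registers the observer under every key it says it cares about.
  void add_observer(SketchObserver *obs) {
    for (const std::string &key : obs->tracked_keys())
      m_observers.insert(std::make_pair(key, obs));
  }

  // Records that a key changed; observers are told later, in one batch.
  void mark_changed(const std::string &key) { m_changed.insert(key); }

  // Builds observer -> changed keys, notifies each observer once, then resets.
  void apply_changes() {
    std::map<SketchObserver*, std::set<std::string> > robs;
    for (const std::string &key : m_changed) {
      auto range = m_observers.equal_range(key);
      for (auto it = range.first; it != range.second; ++it)
        robs[it->second].insert(key);
    }
    for (auto &r : robs)
      r.first->handle_conf_change(r.second);
    m_changed.clear();
  }
};

// Example observer that just remembers what it was told.
struct SketchRecorder : SketchObserver {
  std::vector<std::string> keys;
  std::set<std::string> seen;
  int calls;

  explicit SketchRecorder(const std::vector<std::string> &k) : keys(k), calls(0) {}

  std::vector<std::string> tracked_keys() const override { return keys; }

  void handle_conf_change(const std::set<std::string> &changed) override {
    ++calls;
    seen.insert(changed.begin(), changed.end());
  }
};
```

<p>Batching the changes means an observer that tracks several keys gets one callback per apply_changes() call rather than one per key.</p>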
<h3 id="multi-daemon">Multi Daemon</h3>
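<p>Note that the approaches above either send a command to one daemon's admin socket or edit the configuration and restart one daemon, so they only affect a single process.
To change a setting on every osd process, you could edit the configuration, distribute it to all machines, and restart the osd processes on each of them. Alternatively, a tool such as ansible
could run a script remotely that finds the admin socket of every daemon on a machine and sends the command to each.</p>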
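<p>Ceph also provides another way to change the configuration of all processes at once. For example, the following command applies a change to every osd process:</p>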
<blockquote>
<p>ceph tell osd.* injectargs '--key value'</p>
</blockquote>
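<p>How does this work?
This command runs through a completely different mechanism from the admin socket. Instead of talking to one osd process directly over its admin socket, it uses the librados API
to send a command-request message to every osd process in the rados cluster. When an osd receives the message and recognizes it as a command request, it puts the operation on the command queue (similar to the op queue used for reads and writes).
A thread from the command thread pool then takes the item off the queue and executes it, eventually calling the function below:</p>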
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_process</span><span class="p">(</span><span class="n">Command</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">osd</span><span class="o">-></span><span class="n">osd_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">osd</span><span class="o">-></span><span class="n">is_stopping</span><span class="p">())</span> <span class="p">{</span>
<span class="n">osd</span><span class="o">-></span><span class="n">osd_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">delete</span> <span class="n">c</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">osd</span><span class="o">-></span><span class="n">do_command</span><span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">con</span><span class="p">.</span><span class="n">get</span><span class="p">(),</span> <span class="n">c</span><span class="o">-></span><span class="n">tid</span><span class="p">,</span> <span class="n">c</span><span class="o">-></span><span class="n">cmd</span><span class="p">,</span> <span class="n">c</span><span class="o">-></span><span class="n">indata</span><span class="p">);</span> <span class="c1">// execute the command request
</span> <span class="n">osd</span><span class="o">-></span><span class="n">osd_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="k">delete</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">OSD</span><span class="o">::</span><span class="n">do_command</span><span class="p">(</span><span class="n">Connection</span> <span class="o">*</span><span class="n">con</span><span class="p">,</span> <span class="n">ceph_tid_t</span> <span class="n">tid</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">>&</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">prefix</span> <span class="o">==</span> <span class="s">"injectargs"</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// an injectargs command request
</span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">argsvec</span><span class="p">;</span>
<span class="n">cmd_getval</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="n">cmdmap</span><span class="p">,</span> <span class="s">"injected_args"</span><span class="p">,</span> <span class="n">argsvec</span><span class="p">);</span> <span class="c1">// parse the arguments into argsvec
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">argsvec</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="n">ss</span> <span class="o"><<</span> <span class="s">"ignoring empty injectargs"</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">string</span> <span class="n">args</span> <span class="o">=</span> <span class="n">argsvec</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">a</span> <span class="o">=</span> <span class="o">++</span><span class="n">argsvec</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">a</span> <span class="o">!=</span> <span class="n">argsvec</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">a</span><span class="p">)</span>
<span class="n">args</span> <span class="o">+=</span> <span class="s">" "</span> <span class="o">+</span> <span class="o">*</span><span class="n">a</span><span class="p">;</span>
<span class="n">osd_lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">injectargs</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="o">&</span><span class="n">ss</span><span class="p">);</span> <span class="c1">// change the parameters; in the end this still goes through md_config_t
</span> <span class="n">osd_lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">md_config_t</span><span class="o">::</span><span class="n">injectargs</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&</span> <span class="n">s</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">ostream</span> <span class="o">*</span><span class="n">oss</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">parse_injectargs</span><span class="p">(</span><span class="n">nargs</span><span class="p">,</span> <span class="n">oss</span><span class="p">);</span> <span class="c1">// set the parameter values
</span>
<span class="p">......</span>
<span class="n">_apply_changes</span><span class="p">(</span><span class="n">oss</span><span class="p">);</span> <span class="c1">// as before, call back the observers to tell them the parameters changed
</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
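<p>To give a feel for what turning an injected argument string into settings involves, here is a rough sketch. It is not Ceph's parse_injectargs: the function name sketch_parse_injectargs is hypothetical, and the accepted syntax (whitespace-separated "--key value" or "--key=value" pairs) is an assumption made for the example.</p>

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical parser: splits "--key value --other=value" into key/value
// pairs. Returns false on malformed input (a token that does not start
// with "--", an empty key, or a key with no value following it).
inline bool sketch_parse_injectargs(const std::string &args,
                                    std::map<std::string, std::string> *out) {
  std::istringstream in(args);
  std::string tok;
  while (in >> tok) {
    if (tok.size() < 3 || tok.compare(0, 2, "--") != 0)
      return false;
    std::string key = tok.substr(2);
    std::string val;
    std::string::size_type eq = key.find('=');
    if (eq != std::string::npos) {
      val = key.substr(eq + 1);  // "--key=value" form
      key.erase(eq);
    } else if (!(in >> val)) {
      return false;              // "--key" with nothing after it
    }
    if (key.empty())
      return false;
    (*out)[key] = val;
  }
  return true;
}
```

<p>In this sketch the resulting pairs would then be handed to a setter one by one, followed by a single notification pass, which matches the order shown in md_config_t::injectargs() above.</p>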
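<p>One more thing to note: the ceph tell … command was originally implemented in C++ and is now implemented in Python, in the file src/ceph.in, which is installed to /usr/bin/ceph by default.
Using the Python library and the supplied cluster configuration file (/etc/ceph/ceph.conf by default), it calls the C++ librados interface to send messages to the cluster.
So there is no need to log in to a machine in the rados cluster: the command can be run from an admin host as long as the network is reachable, ceph is installed, and the configuration file is in place. This is very convenient for performance analysis and parameter tuning.</p>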
<h1 id="summary">Summary</h1>
<ul>
<li>
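<p>Ceph defines a log level for each subsystem; at startup it builds a map of these levels from the configuration file, and the map can be changed at runtime</p>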
</li>
<li>
<p>A dedicated thread processes all submitted log entries</p>
</li>
<li>
<p>When the log level is not met, there is almost no extra overhead (do{}while(0))</p>
</li>
<li>
<p>Log entries are submitted quickly, holding the lock only briefly</p>
</li>
<li>
<p>The log stream is optimized with a preallocated static buffer</p>
</li>
<li>
<p>System parameters can be changed at runtime, with interested parties notified through the observer pattern</p>
</li>
<li>
<p>Parameters of every osd in the cluster can be adjusted at runtime from an admin host, which makes performance tuning easier</p>
</li>
</ul>
Ceph Class Plugin
2015-11-11T00:00:00+00:00
http://blog.wjin.org/posts/ceph-class-plugin
<h1 id="overview">Overview</h1>
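<p>Ceph provides the librados API for accessing the backend rados storage cluster, but that API only covers simple object operations (create, delete, update, read, and so on).
Adding more APIs would make librados increasingly complex, and it still could not meet the special needs of every specific scenario.
So Ceph supports dynamically loadable plugins (called a class in Ceph, which is not the C++ notion of a class; it is simply a shared library). Each application can build a plugin tailored to its own needs,
and on a user request the rados backend has the OSD process load the plugin itself (dlopen) and carry out the user's custom operation.</p>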
<h1 id="implementation">Implementation</h1>
<h2 id="manage-plugin">manage plugin</h2>
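<p>A plugin is a shared library, and in C/C++ managing one inevitably comes down to dlopen and its related functions. To manage loaded plugins in one place, the OSD process relies on the class src/osd/ClassHandler,
which contains two nested types (ClassData/ClassMethod): the former describes a plugin and the latter describes a method (function) that the plugin contains.</p>
<p>ClassHandler holds a map recording every plugin the osd has loaded, keyed by plugin name with a ClassData handle as the value.
ClassData in turn holds a map of every method the plugin provides, keyed by method name with a ClassMethod handle as the value. The header is fairly simple:</p>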
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassHandler</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">ClassData</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">ClassMethod</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">;</span>
<span class="n">string</span> <span class="n">name</span><span class="p">;</span> <span class="c1">// name
</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">;</span>
<span class="n">cls_method_call_t</span> <span class="n">func</span><span class="p">;</span> <span class="c1">// C function
</span> <span class="n">cls_method_cxx_call_t</span> <span class="n">cxx_func</span><span class="p">;</span> <span class="c1">// C++ version of the function
</span>
<span class="kt">int</span> <span class="n">exec</span><span class="p">(</span><span class="n">cls_method_context_t</span> <span class="n">ctx</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">indata</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">outdata</span><span class="p">);</span> <span class="c1">// invoke the method, which really just calls func or cxx_func above
</span> <span class="kt">void</span> <span class="n">unregister</span><span class="p">();</span>
<span class="kt">int</span> <span class="nf">get_flags</span><span class="p">()</span> <span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">handler</span><span class="o">-></span><span class="n">mutex</span><span class="p">);</span>
<span class="k">return</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ClassMethod</span><span class="p">()</span> <span class="o">:</span> <span class="n">cls</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">flags</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">func</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">cxx_func</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">ClassData</span> <span class="p">{</span>
<span class="k">enum</span> <span class="n">Status</span> <span class="p">{</span> <span class="c1">// status of the plugin
</span> <span class="n">CLASS_UNKNOWN</span><span class="p">,</span>
<span class="n">CLASS_MISSING</span><span class="p">,</span> <span class="c1">// missing
</span> <span class="n">CLASS_MISSING_DEPS</span><span class="p">,</span> <span class="c1">// missing dependencies
</span> <span class="n">CLASS_INITIALIZING</span><span class="p">,</span> <span class="c1">// calling init() right now
</span> <span class="n">CLASS_OPEN</span><span class="p">,</span> <span class="c1">// initialized, usable
</span> <span class="p">}</span> <span class="n">status</span><span class="p">;</span>
<span class="n">string</span> <span class="n">name</span><span class="p">;</span> <span class="c1">// plugin name
</span> <span class="n">ClassHandler</span> <span class="o">*</span><span class="n">handler</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">handle</span><span class="p">;</span>
<span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">ClassMethod</span><span class="o">></span> <span class="n">methods_map</span><span class="p">;</span> <span class="c1">// method name -> handle describing the method; every method the plugin registers lives in this map
</span>
<span class="n">set</span><span class="o"><</span><span class="n">ClassData</span> <span class="o">*></span> <span class="n">dependencies</span><span class="p">;</span> <span class="c1">// other plugins (so files) this plugin depends on
</span> <span class="n">set</span><span class="o"><</span><span class="n">ClassData</span> <span class="o">*></span> <span class="n">missing_dependencies</span><span class="p">;</span>
<span class="n">ClassData</span><span class="p">()</span> <span class="o">:</span> <span class="n">status</span><span class="p">(</span><span class="n">CLASS_UNKNOWN</span><span class="p">),</span>
<span class="n">handler</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span>
<span class="n">handle</span><span class="p">(</span><span class="nb">NULL</span><span class="p">)</span> <span class="p">{}</span>
<span class="o">~</span><span class="n">ClassData</span><span class="p">()</span> <span class="p">{</span> <span class="p">}</span>
<span class="n">ClassMethod</span> <span class="o">*</span><span class="n">register_method</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">mname</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="n">cls_method_call_t</span> <span class="n">func</span><span class="p">);</span> <span class="c1">// register a method of the plugin
</span> <span class="n">ClassMethod</span> <span class="o">*</span><span class="n">register_cxx_method</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">mname</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="n">cls_method_cxx_call_t</span> <span class="n">func</span><span class="p">);</span> <span class="c1">// cxx means the C++ version of the function
</span> <span class="kt">void</span> <span class="n">unregister_method</span><span class="p">(</span><span class="n">ClassMethod</span> <span class="o">*</span><span class="n">method</span><span class="p">);</span> <span class="c1">// remove a method
</span> <span class="p">};</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Mutex</span> <span class="n">mutex</span><span class="p">;</span>
<span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">ClassData</span><span class="o">></span> <span class="n">classes</span><span class="p">;</span> <span class="c1">// plugin name -> handle describing the plugin; every plugin is recorded in this map
</span>
<span class="n">ClassData</span> <span class="o">*</span><span class="n">_get_class</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cname</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">_load_class</span><span class="p">(</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">);</span> <span class="c1">// load the so
</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">ClassHandler</span><span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="n">cct_</span><span class="p">)</span> <span class="o">:</span> <span class="n">cct</span><span class="p">(</span><span class="n">cct_</span><span class="p">),</span> <span class="n">mutex</span><span class="p">(</span><span class="s">"ClassHandler"</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">int</span> <span class="n">open_all_classes</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">open_class</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cname</span><span class="p">,</span> <span class="n">ClassData</span> <span class="o">**</span><span class="n">pcls</span><span class="p">);</span> <span class="c1">// calls _load_class, which dlopens the .so
</span>
<span class="n">ClassData</span> <span class="o">*</span><span class="n">register_class</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cname</span><span class="p">);</span> <span class="c1">// register a plugin
</span> <span class="kt">void</span> <span class="n">unregister_class</span><span class="p">(</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">);</span> <span class="c1">// remove a plugin
</span>
<span class="kt">void</span> <span class="n">shutdown</span><span class="p">();</span>
<span class="p">};</span>
</code></pre></div></div>
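<p>register_class does not dlopen the .so itself. It looks up the plugin's entry in the map (creating it if absent) and returns it only when the class is in the CLASS_INITIALIZING state, which is the case while the plugin's __cls_init is running. unregister_class is simpler still: it does nothing.</p>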
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">register_class</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cname</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">mutex</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="n">_get_class</span><span class="p">(</span><span class="n">cname</span><span class="p">);</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"register_class "</span> <span class="o"><<</span> <span class="n">cname</span> <span class="o"><<</span> <span class="s">" status "</span> <span class="o"><<</span> <span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">!=</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_INITIALIZING</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"class "</span> <span class="o"><<</span> <span class="n">cname</span> <span class="o"><<</span> <span class="s">" isn't loaded; is the class registering under the wrong name?"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">cls</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">_get_class</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cname</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">;</span>
<span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">ClassData</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">classes</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">cname</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">iter</span> <span class="o">!=</span> <span class="n">classes</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">cls</span> <span class="o">=</span> <span class="o">&</span><span class="n">iter</span><span class="o">-></span><span class="n">second</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">cls</span> <span class="o">=</span> <span class="o">&</span><span class="n">classes</span><span class="p">[</span><span class="n">cname</span><span class="p">];</span> <span class="c1">// record the plugin in the map
</span> <span class="n">dout</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"_get_class adding new class name "</span> <span class="o"><<</span> <span class="n">cname</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">cls</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">cls</span><span class="o">-></span><span class="n">name</span> <span class="o">=</span> <span class="n">cname</span><span class="p">;</span>
<span class="n">cls</span><span class="o">-></span><span class="n">handler</span> <span class="o">=</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">cls</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">ClassHandler</span><span class="o">::</span><span class="n">unregister_class</span><span class="p">(</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* FIXME: do we really need this one? */</span>
<span class="p">}</span>
</code></pre></div></div>
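<p>The interplay between _get_class and register_class is easier to see in isolation. The following is a minimal sketch, not Ceph's code: MiniHandler, its ClassData stand-in and the order of the enum values are all hypothetical, and locking and logging are left out. It only models the two rules visible in the listing above: a lookup creates the map entry on first use, and registration succeeds only while the class is in the CLASS_INITIALIZING state.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for ClassHandler::ClassData (not Ceph's definition).
struct ClassData {
  enum Status { CLASS_UNKNOWN, CLASS_MISSING, CLASS_INITIALIZING, CLASS_OPEN };
  Status status = CLASS_UNKNOWN;
  std::string name;
};

// Hypothetical, lock-free model of the registry part of ClassHandler.
class MiniHandler {
  std::map<std::string, ClassData> classes;  // plugin name -> plugin record

public:
  // Return the record for cname, creating an empty one on first use.
  ClassData *get_class(const std::string &cname) {
    auto it = classes.find(cname);
    if (it != classes.end())
      return &it->second;
    ClassData *cls = &classes[cname];
    cls->name = cname;
    return cls;
  }

  // Succeeds only while the plugin is in the middle of its init call.
  ClassData *register_class(const char *cname) {
    ClassData *cls = get_class(cname);
    if (cls->status != ClassData::CLASS_INITIALIZING)
      return nullptr;
    return cls;
  }

  std::size_t size() const { return classes.size(); }
};
```

<p>In this model, calling register_class for a name that is not being initialized still leaves an entry in the map but returns a null pointer, which matches the "is the class registering under the wrong name?" branch in the listing.</p>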
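<p>Later, when a plugin is actually needed, ClassHandler::open_class dlopens the plugin's .so together with any other plugins it depends on, and then calls the plugin's entry function __cls_init to initialize it:</p>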
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">ClassHandler</span><span class="o">::</span><span class="n">open_class</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cname</span><span class="p">,</span> <span class="n">ClassData</span> <span class="o">**</span><span class="n">pcls</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="n">_get_class</span><span class="p">(</span><span class="n">cname</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">!=</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_OPEN</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_load_class</span><span class="p">(</span><span class="n">cls</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">*</span><span class="n">pcls</span> <span class="o">=</span> <span class="n">cls</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">ClassHandler</span><span class="o">::</span><span class="n">_load_class</span><span class="p">(</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// already open
</span> <span class="k">if</span> <span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">==</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_OPEN</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">==</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_UNKNOWN</span> <span class="o">||</span>
<span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">==</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_MISSING</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="n">cls</span><span class="o">-></span><span class="n">handle</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="n">fname</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span> <span class="c1">// dlopen the plugin
</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="c1">// recursively call _load_class to resolve dependencies
</span> <span class="p">......</span>
<span class="c1">// initialize
</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">cls_init</span><span class="p">)()</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)())</span><span class="n">dlsym</span><span class="p">(</span><span class="n">cls</span><span class="o">-></span><span class="n">handle</span><span class="p">,</span> <span class="s">"__cls_init"</span><span class="p">);</span> <span class="c1">// the plugin entry point: look up the __cls_init symbol, which every plugin defines
</span> <span class="k">if</span> <span class="p">(</span><span class="n">cls_init</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cls</span><span class="o">-></span><span class="n">status</span> <span class="o">=</span> <span class="n">ClassData</span><span class="o">::</span><span class="n">CLASS_INITIALIZING</span><span class="p">;</span>
<span class="n">cls_init</span><span class="p">();</span> <span class="c1">// initialize the plugin by calling its entry function __cls_init
</span> <span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
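<p>The control flow of _load_class (return early if already open, load dependencies first, then run the entry point) can be sketched without touching dlopen. This is a hedged model only: Loader, Plugin and Status are hypothetical names, a std::function stands in for the symbol returned by dlsym, a missing map entry stands in for a failed dlopen, and the final transition to Open is an assumption because the listing elides what happens after cls_init returns. The sketch also assumes the dependency graph has no cycles.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

enum class Status { Unknown, Initializing, Open };

// Hypothetical plugin record; init stands in for the __cls_init symbol.
struct Plugin {
  Status status = Status::Unknown;
  std::vector<std::string> deps;  // plugins that must be loaded first
  std::function<void()> init;
};

struct Loader {
  std::map<std::string, Plugin> plugins;
  std::vector<std::string> init_order;  // order in which plugins became open

  int load(const std::string &name) {
    auto it = plugins.find(name);
    if (it == plugins.end())
      return -1;  // stands in for a failed dlopen
    Plugin &p = it->second;
    if (p.status == Status::Open)
      return 0;  // already open
    for (const auto &d : p.deps) {  // resolve dependencies recursively
      int r = load(d);
      if (r)
        return r;
    }
    if (p.init) {
      p.status = Status::Initializing;  // registration is allowed from here on
      p.init();
    }
    init_order.push_back(name);
    p.status = Status::Open;  // assumption: the source elides this step
    return 0;
  }
};
```

<p>Loading a plugin whose only dependency is not yet open initializes the dependency first, and a second load of the same name returns immediately without running init again.</p>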
<p>While the osd process starts up, it creates a ClassHandler object with new, so that plugins can be loaded dynamically later on:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">OSD</span><span class="o">::</span><span class="n">init</span><span class="p">()</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="n">class_handler</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ClassHandler</span><span class="p">(</span><span class="n">cct</span><span class="p">);</span>
<span class="n">cls_initialize</span><span class="p">(</span><span class="n">class_handler</span><span class="p">);</span> <span class="c1">// this is the key step
</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The osd has created its own ClassHandler and can therefore manage plugins, but that capability still has to be exposed to the plugin code.
This is what cls_initialize does; the source is in src/objclass/class_api.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">ClassHandler</span> <span class="o">*</span><span class="n">ch</span><span class="p">;</span> <span class="c1">// file-scope static variable
</span>
<span class="kt">void</span> <span class="nf">cls_initialize</span><span class="p">(</span><span class="n">ClassHandler</span> <span class="o">*</span><span class="n">h</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ch</span> <span class="o">=</span> <span class="n">h</span><span class="p">;</span> <span class="c1">// remember the address of the ClassHandler object; all later calls go through it
</span><span class="p">}</span>
</code></pre></div></div>
<h2 id="write-plugin">write plugin</h2>
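<p>So far it is clear that the osd process can load plugins through ClassHandler. But how is a plugin written, and when is it used?</p>
<p>Ceph wraps a few simple interfaces under src/objclass/. One part covers plugin version management and initialization;
the other part wraps APIs for accessing objects, which plugin code can call directly:</p>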
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1">// plugin version
</span><span class="cp">#define CLS_VER(maj,min) \
int __cls_ver__## maj ## _ ##min = 0; \
int __cls_ver_maj = maj; \
int __cls_ver_min = min;
</span>
<span class="c1">// plugin name
</span><span class="cp">#define CLS_NAME(name) \
int __cls_name__## name = 0; \
const char *__cls_name = #name;
</span>
<span class="c1">// method flags: read / write
</span><span class="cp">#define CLS_METHOD_RD 0x1
#define CLS_METHOD_WR 0x2
#define CLS_METHOD_PUBLIC 0x4
</span>
<span class="kt">void</span> <span class="n">__cls_init</span><span class="p">();</span> <span class="c1">// entry function; every plugin must implement it and register the methods it wants to expose inside it
</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">cls_handle_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">cls_method_handle_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">cls_method_context_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">cls_method_call_t</span><span class="p">)(</span><span class="n">cls_method_context_t</span> <span class="n">ctx</span><span class="p">,</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">indata</span><span class="p">,</span> <span class="kt">int</span> <span class="n">datalen</span><span class="p">,</span>
<span class="kt">char</span> <span class="o">**</span><span class="n">outdata</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">outdatalen</span><span class="p">);</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ver</span><span class="p">;</span>
<span class="p">}</span> <span class="n">cls_deps_t</span><span class="p">;</span>
</code></pre></div></div>
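<p>To see what CLS_VER and CLS_NAME leave behind in a compiled plugin, the two macros can be expanded by hand. The sketch below copies the macro definitions from the listing above and applies them the way the hello plugin would; nothing else from Ceph is needed. Note that identifiers containing a double underscore are reserved in standard C++, so this naming is reproduced here only to mirror the original header.</p>

```cpp
#include <cassert>
#include <cstring>

// Copied from the listing above: each macro defines plain global variables.
#define CLS_VER(maj,min) \
int __cls_ver__## maj ## _ ##min = 0; \
int __cls_ver_maj = maj; \
int __cls_ver_min = min;

#define CLS_NAME(name) \
int __cls_name__## name = 0; \
const char *__cls_name = #name;

// After preprocessing, these two lines become:
//   int __cls_ver__1_0 = 0; int __cls_ver_maj = 1; int __cls_ver_min = 0;
//   int __cls_name__hello = 0; const char *__cls_name = "hello";
CLS_VER(1,0)
CLS_NAME(hello)
```

<p>The version and name are thus encoded twice: once as values (__cls_ver_maj, __cls_ver_min, __cls_name) and once in the symbol names themselves (__cls_ver__1_0, __cls_name__hello), so they remain visible in the shared object's symbol table.</p>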
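<p>The basic functions needed to write a plugin all reach the osd's ClassHandler through the static global variable ch:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// register the plugin itself; the name must be globally unique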
</span><span class="kt">int</span> <span class="nf">cls_register</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="n">cls_handle_t</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="n">ch</span><span class="o">-></span><span class="n">register_class</span><span class="p">(</span><span class="n">name</span><span class="p">);</span> <span class="c1">// takes one entry in ClassHandler's map
</span> <span class="o">*</span><span class="n">handle</span> <span class="o">=</span> <span class="p">(</span><span class="n">cls_handle_t</span><span class="p">)</span><span class="n">cls</span><span class="p">;</span>
<span class="k">return</span> <span class="p">(</span><span class="n">cls</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">cls_unregister</span><span class="p">(</span><span class="n">cls_handle_t</span> <span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="p">(</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="p">)</span><span class="n">handle</span><span class="p">;</span>
<span class="n">ch</span><span class="o">-></span><span class="n">unregister_class</span><span class="p">(</span><span class="n">cls</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// register a plugin method (C function)
</span><span class="kt">int</span> <span class="nf">cls_register_method</span><span class="p">(</span><span class="n">cls_handle_t</span> <span class="n">hclass</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
<span class="n">cls_method_call_t</span> <span class="n">class_call</span><span class="p">,</span> <span class="n">cls_method_handle_t</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="p">(</span><span class="n">CLS_METHOD_RD</span> <span class="o">|</span> <span class="n">CLS_METHOD_WR</span><span class="p">)))</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="p">(</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="p">)</span><span class="n">hclass</span><span class="p">;</span>
<span class="n">cls_method_handle_t</span> <span class="n">hmethod</span> <span class="o">=</span><span class="p">(</span><span class="n">cls_method_handle_t</span><span class="p">)</span><span class="n">cls</span><span class="o">-></span><span class="n">register_method</span><span class="p">(</span><span class="n">method</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">class_call</span><span class="p">);</span> <span class="c1">// takes one entry in ClassData's method map
</span> <span class="k">if</span> <span class="p">(</span><span class="n">handle</span><span class="p">)</span>
<span class="o">*</span><span class="n">handle</span> <span class="o">=</span> <span class="n">hmethod</span><span class="p">;</span>
<span class="k">return</span> <span class="p">(</span><span class="n">hmethod</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// register a plugin method (C++ function)
</span><span class="kt">int</span> <span class="nf">cls_register_cxx_method</span><span class="p">(</span><span class="n">cls_handle_t</span> <span class="n">hclass</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
<span class="n">cls_method_cxx_call_t</span> <span class="n">class_call</span><span class="p">,</span> <span class="n">cls_method_handle_t</span> <span class="o">*</span><span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span> <span class="o">=</span> <span class="p">(</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="p">)</span><span class="n">hclass</span><span class="p">;</span>
<span class="n">cls_method_handle_t</span> <span class="n">hmethod</span> <span class="o">=</span> <span class="p">(</span><span class="n">cls_method_handle_t</span><span class="p">)</span><span class="n">cls</span><span class="o">-></span><span class="n">register_cxx_method</span><span class="p">(</span><span class="n">method</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">class_call</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">handle</span><span class="p">)</span>
<span class="o">*</span><span class="n">handle</span> <span class="o">=</span> <span class="n">hmethod</span><span class="p">;</span>
<span class="k">return</span> <span class="p">(</span><span class="n">hmethod</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">cls_unregister_method</span><span class="p">(</span><span class="n">cls_method_handle_t</span> <span class="n">handle</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassMethod</span> <span class="o">*</span><span class="n">method</span> <span class="o">=</span> <span class="p">(</span><span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassMethod</span> <span class="o">*</span><span class="p">)</span><span class="n">handle</span><span class="p">;</span>
<span class="n">method</span><span class="o">-></span><span class="n">unregister</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
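<p>The registration rules in cls_register_method can be tried out in a small stand-alone model. This is a sketch under stated assumptions: PluginClass, Method, method_fn and register_method are hypothetical names, std::string replaces the real buffer types, and only two behaviours from the listing are reproduced: a method must carry at least one of CLS_METHOD_RD or CLS_METHOD_WR (otherwise -EINVAL), and success is reported as 1 rather than 0.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Flag values as shown in the header listing earlier in this post.
constexpr int CLS_METHOD_RD = 0x1;
constexpr int CLS_METHOD_WR = 0x2;
constexpr int CLS_METHOD_PUBLIC = 0x4;

// Hypothetical callback type; the real API uses bufferlist, not std::string.
using method_fn = int (*)(const std::string &in, std::string *out);

struct Method {
  int flags = 0;
  method_fn fn = nullptr;
};

// Hypothetical stand-in for ClassData: it just owns the method map.
struct PluginClass {
  std::map<std::string, Method> methods;  // method name -> method record
};

// Same return convention as the listing: -EINVAL on bad flags, 1 on success.
int register_method(PluginClass *cls, const char *name, int flags,
                    method_fn fn, Method **handle) {
  if (!(flags & (CLS_METHOD_RD | CLS_METHOD_WR)))
    return -EINVAL;
  Method *m = &cls->methods[name];
  m->flags = flags;
  m->fn = fn;
  if (handle)
    *handle = m;
  return 1;
}

// Example method, mirroring the spirit of a "hello" plugin.
static int say_hello(const std::string &, std::string *out) {
  out->append("Hello, world!");
  return 0;
}
```

<p>One detail worth noticing in the original listing: the flag check appears only in cls_register_method, while cls_register_cxx_method as shown passes the flags through unchecked.</p>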
<h2 id="plugin-demo">plugin demo</h2>
<p>With the main functions understood, writing a plugin is straightforward. Here is a hello world example:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CLS_VER</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span> <span class="c1">// plugin version
</span><span class="n">CLS_NAME</span><span class="p">(</span><span class="n">hello</span><span class="p">)</span> <span class="c1">// plugin name
</span>
<span class="n">cls_handle_t</span> <span class="n">h_class</span><span class="p">;</span> <span class="c1">// plugin handle
</span><span class="n">cls_method_handle_t</span> <span class="n">h_say_hello</span><span class="p">;</span> <span class="c1">// handle of the plugin method
</span><span class="cm">/* other plugin methods */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">say_hello</span><span class="p">(</span><span class="n">cls_method_context_t</span> <span class="n">hctx</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
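<span class="c1">// The method body: arbitrarily complex logic can live here, which is where plugins get their power
</span> <span class="c1">// src/objclass wraps many APIs for accessing objects and their attributes
</span> <span class="c1">// Unlike client code, this runs inside the osd process of the storage cluster and accesses objects directly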
</span> <span class="n">out</span><span class="o">-></span><span class="n">append</span><span class="p">(</span><span class="s">"Hello, world!"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Initialization: every plugin must provide it; it runs during open_class
</span><span class="kt">void</span> <span class="nf">__cls_init</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cls_register</span><span class="p">(</span><span class="s">"hello"</span><span class="p">,</span> <span class="o">&</span><span class="n">h_class</span><span class="p">);</span> <span class="c1">// register the plugin
</span>
<span class="n">cls_register_cxx_method</span><span class="p">(</span><span class="n">h_class</span><span class="p">,</span> <span class="s">"say_hello"</span><span class="p">,</span> <span class="c1">// register the plugin method
</span> <span class="n">CLS_METHOD_RD</span><span class="p">,</span>
<span class="n">say_hello</span><span class="p">,</span> <span class="o">&</span><span class="n">h_say_hello</span><span class="p">);</span>
<span class="cm">/* register other plugin methods */</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="access-plugin">access plugin</h2>
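<p>Writing a plugin is easy, and it can carry complex logic of our own. But how is it accessed?</p>
<p>A simple example is the code in src/cls/lock/cls_lock_client.cc, which is the client side of the lock plugin. As an aside,
a plugin usually comes in two parts: server-side code that the osd loads and calls, and matching client-side code for our clients to use.
When the client calls the local code, the request is still wrapped into an op and sent to the backend cluster. The backend parses it,
sees that it targets a method of a particular plugin, loads that plugin and calls the requested method:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// lock an object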
</span><span class="kt">int</span> <span class="nf">lock</span><span class="p">(</span><span class="n">IoCtx</span> <span class="o">*</span><span class="n">ioctx</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">name</span><span class="p">,</span> <span class="n">ClsLockType</span> <span class="n">type</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cookie</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">tag</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">description</span><span class="p">,</span> <span class="k">const</span> <span class="n">utime_t</span><span class="o">&</span> <span class="n">duration</span><span class="p">,</span>
<span class="kt">uint8_t</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ObjectWriteOperation</span> <span class="n">op</span><span class="p">;</span> <span class="c1">// from the rados layer's point of view this is a write operation
</span>
<span class="n">lock</span><span class="p">(</span><span class="o">&</span><span class="n">op</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">cookie</span><span class="p">,</span> <span class="n">tag</span><span class="p">,</span> <span class="n">description</span><span class="p">,</span> <span class="n">duration</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span> <span class="c1">// call the overload below to fill in the plugin-defined operation
</span>
<span class="k">return</span> <span class="n">ioctx</span><span class="o">-></span><span class="n">operate</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="o">&</span><span class="n">op</span><span class="p">);</span> <span class="c1">// send the write operation out through the Objecter
</span><span class="p">}</span>
<span class="kt">void</span> <span class="nf">lock</span><span class="p">(</span><span class="n">ObjectWriteOperation</span> <span class="o">*</span><span class="n">rados_op</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">name</span><span class="p">,</span> <span class="n">ClsLockType</span> <span class="n">type</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">cookie</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">tag</span><span class="p">,</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">description</span><span class="p">,</span>
<span class="k">const</span> <span class="n">utime_t</span><span class="o">&</span> <span class="n">duration</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">cls_lock_lock_op</span> <span class="n">op</span><span class="p">;</span> <span class="c1">// note: this is the plugin's own op; the plugin decodes it itself once it reaches the backend cluster
</span>
<span class="c1">// fill in the plugin-defined op
</span> <span class="n">op</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">type</span> <span class="o">=</span> <span class="n">type</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">cookie</span> <span class="o">=</span> <span class="n">cookie</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">tag</span> <span class="o">=</span> <span class="n">tag</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">description</span> <span class="o">=</span> <span class="n">description</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">duration</span> <span class="o">=</span> <span class="n">duration</span><span class="p">;</span>
<span class="n">op</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="n">flags</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">in</span><span class="p">;</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span> <span class="c1">// encode the plugin op into a bufferlist that serves as the input argument
</span>
<span class="n">rados_op</span><span class="o">-></span><span class="n">exec</span><span class="p">(</span><span class="s">"lock"</span><span class="p">,</span> <span class="s">"lock"</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span> <span class="c1">// 1st argument: plugin name; 2nd: the plugin method; 3rd: the method's input
</span><span class="p">}</span>
</code></pre></div></div>
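<p>The round trip described above (the client names a plugin class and a method, and the backend looks both up and calls the method with the encoded input) can be modelled in a few lines. This is only an illustrative sketch: CallOp and Registry are hypothetical types, std::string stands in for bufferlist, the registry is filled by hand instead of through dlopen, and the -ENOENT error code for an unknown class or method is chosen arbitrarily for the example.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for an op carrying a plugin call request.
struct CallOp {
  std::string cls;     // plugin name, e.g. "lock"
  std::string method;  // method name inside the plugin
  std::string indata;  // encoded, plugin-defined input
};

using method_fn = std::function<int(const std::string &in, std::string *out)>;

// Hypothetical server-side table: plugin name -> (method name -> callback).
struct Registry {
  std::map<std::string, std::map<std::string, method_fn>> classes;

  // Find the requested method and run it on the op's input.
  int dispatch(const CallOp &op, std::string *out) const {
    auto c = classes.find(op.cls);
    if (c == classes.end())
      return -ENOENT;  // unknown plugin (error code is an assumption)
    auto m = c->second.find(op.method);
    if (m == c->second.end())
      return -ENOENT;  // unknown method (error code is an assumption)
    return m->second(op.indata, out);
  }
};
```

<p>The point of the model is that the generic layer never interprets indata: it only routes by the two names, and the plugin method decodes its own input, just as cls_lock_lock_op is decoded by the lock plugin itself.</p>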
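<p>Once the plugin's own parameters are filled in, the op for this write operation is set up so that the plugin request can be sent to the backend cluster. This works as for the ops seen before, except that the op type is a special one:</p>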
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">librados</span><span class="o">::</span><span class="n">ObjectOperation</span><span class="o">::</span><span class="n">exec</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cls</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">inbl</span><span class="p">)</span>
<span class="p">{</span>
<span class="o">::</span><span class="n">ObjectOperation</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="p">(</span><span class="o">::</span><span class="n">ObjectOperation</span> <span class="o">*</span><span class="p">)</span><span class="n">impl</span><span class="p">;</span>
<span class="n">o</span><span class="o">-></span><span class="n">call</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">method</span><span class="p">,</span> <span class="n">inbl</span><span class="p">);</span> <span class="c1">// delegate to the Objecter-level ::ObjectOperation
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">call</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cname</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">&</span><span class="n">indata</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">add_call</span><span class="p">(</span><span class="n">CEPH_OSD_OP_CALL</span><span class="p">,</span> <span class="n">cname</span><span class="p">,</span> <span class="n">method</span><span class="p">,</span> <span class="n">indata</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span> <span class="c1">// CEPH_OSD_OP_CALL: the backend uses this op code to recognize a plugin request
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">add_call</span><span class="p">(</span><span class="kt">int</span> <span class="n">op</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cname</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">&</span><span class="n">indata</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">outbl</span><span class="p">,</span> <span class="n">Context</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">prval</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">OSDOp</span><span class="o">&</span> <span class="n">osd_op</span> <span class="o">=</span> <span class="n">add_op</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// append a new op
</span>
<span class="c1">// initialize the op's fields
</span> <span class="kt">unsigned</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ops</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">out_handler</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">;</span>
<span class="n">out_bl</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">outbl</span><span class="p">;</span>
<span class="n">out_rval</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">prval</span><span class="p">;</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">op</span> <span class="o">=</span> <span class="n">op</span><span class="p">;</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">class_len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">cname</span><span class="p">);</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">method_len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">method</span><span class="p">);</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">indata_len</span> <span class="o">=</span> <span class="n">indata</span><span class="p">.</span><span class="n">length</span><span class="p">();</span>
<span class="c1">// every plugin input travels in this write op's indata: plugin name, method name, then the input arguments
</span> <span class="n">osd_op</span><span class="p">.</span><span class="n">indata</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cname</span><span class="p">,</span> <span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">class_len</span><span class="p">);</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">indata</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">method</span><span class="p">,</span> <span class="n">osd_op</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">method_len</span><span class="p">);</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">indata</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">indata</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
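<p>The packing scheme above (three lengths in the op header, three byte ranges appended back to back) can be illustrated with a small standalone sketch. It does not use Ceph's types: CallHeader, PackedCall, pack_call and unpack_call are hypothetical names, std::string stands in for bufferlist, and the field widths are an assumption made for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the length fields kept in osd_op.op.cls.
struct CallHeader {
  std::uint32_t class_len = 0;
  std::uint32_t method_len = 0;
  std::uint32_t indata_len = 0;
};

// Hypothetical stand-in for an OSDOp: a header plus one flat payload.
struct PackedCall {
  CallHeader hdr;
  std::string payload;  // class name + method name + input, back to back
};

struct UnpackedCall {
  std::string cname;
  std::string method;
  std::string indata;
};

// Sender side: record the three lengths, then append the three byte ranges.
PackedCall pack_call(const std::string& cname, const std::string& method,
                     const std::string& indata) {
  PackedCall c;
  c.hdr.class_len = static_cast<std::uint32_t>(cname.size());
  c.hdr.method_len = static_cast<std::uint32_t>(method.size());
  c.hdr.indata_len = static_cast<std::uint32_t>(indata.size());
  c.payload = cname + method + indata;
  return c;
}

// Receiver side: slice the payload using the lengths, rejecting short input.
UnpackedCall unpack_call(const PackedCall& c) {
  const std::size_t need = static_cast<std::size_t>(c.hdr.class_len) +
                           c.hdr.method_len + c.hdr.indata_len;
  if (c.payload.size() < need) {
    throw std::out_of_range("payload shorter than the header claims");
  }
  UnpackedCall u;
  std::size_t off = 0;
  u.cname = c.payload.substr(off, c.hdr.class_len);
  off += c.hdr.class_len;
  u.method = c.payload.substr(off, c.hdr.method_len);
  off += c.hdr.method_len;
  u.indata = c.payload.substr(off, c.hdr.indata_len);
  return u;
}
```

<p>Because the payload carries no delimiters, the receiver can only split it correctly by consuming the fields in the same order the sender appended them.</p>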
<p>The earlier ioctx->operate call sends the operation. Here is how the backend cluster handles it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">ReplicatedPG</span><span class="o">::</span><span class="n">do_osd_ops</span><span class="p">(</span><span class="n">OpContext</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">OSDOp</span><span class="o">>&</span> <span class="n">ops</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">op</span><span class="p">)</span> <span class="p">{</span>
<span class="p">......</span>
<span class="k">case</span> <span class="n">CEPH_OSD_OP_CALL</span><span class="p">:</span> <span class="c1">// op type
</span> <span class="p">{</span>
<span class="c1">// parse the input arguments
</span> <span class="n">string</span> <span class="n">cname</span><span class="p">,</span> <span class="n">mname</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">indata</span><span class="p">;</span>
<span class="k">try</span> <span class="p">{</span>
<span class="n">bp</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">class_len</span><span class="p">,</span> <span class="n">cname</span><span class="p">);</span>
<span class="n">bp</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">method_len</span><span class="p">,</span> <span class="n">mname</span><span class="p">);</span>
<span class="n">bp</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">cls</span><span class="p">.</span><span class="n">indata_len</span><span class="p">,</span> <span class="n">indata</span><span class="p">);</span>
<span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="n">buffer</span><span class="o">::</span><span class="n">error</span><span class="o">&</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"call unable to decode class + method + indata"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"in dump: "</span><span class="p">;</span>
<span class="n">osd_op</span><span class="p">.</span><span class="n">indata</span><span class="p">.</span><span class="n">hexdump</span><span class="p">(</span><span class="o">*</span><span class="n">_dout</span><span class="p">);</span>
<span class="o">*</span><span class="n">_dout</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">osd</span><span class="p">,</span> <span class="n">do_osd_op_pre_call</span><span class="p">,</span> <span class="n">soid</span><span class="p">.</span><span class="n">oid</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">soid</span><span class="p">.</span><span class="n">snap</span><span class="p">.</span><span class="n">val</span><span class="p">,</span> <span class="s">"???"</span><span class="p">,</span> <span class="s">"???"</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassData</span> <span class="o">*</span><span class="n">cls</span><span class="p">;</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">osd</span><span class="o">-></span><span class="n">class_handler</span><span class="o">-></span><span class="n">open_class</span><span class="p">(</span><span class="n">cname</span><span class="p">,</span> <span class="o">&</span><span class="n">cls</span><span class="p">);</span> <span class="c1">// load the plugin via dlopen and initialize it
</span> <span class="n">assert</span><span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// init_op_flags() already verified this works.
</span>
<span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassMethod</span> <span class="o">*</span><span class="n">method</span> <span class="o">=</span> <span class="n">cls</span><span class="o">-></span><span class="n">get_method</span><span class="p">(</span><span class="n">mname</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span> <span class="c1">// method handles were registered during plugin initialization
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">method</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"call method "</span> <span class="o"><<</span> <span class="n">cname</span> <span class="o"><<</span> <span class="s">"."</span> <span class="o"><<</span> <span class="n">mname</span> <span class="o"><<</span> <span class="s">" does not exist"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">method</span><span class="o">-></span><span class="n">get_flags</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">CLS_METHOD_WR</span><span class="p">)</span>
<span class="n">ctx</span><span class="o">-></span><span class="n">user_modify</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">outdata</span><span class="p">;</span>
<span class="n">dout</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"call method "</span> <span class="o"><<</span> <span class="n">cname</span> <span class="o"><<</span> <span class="s">"."</span> <span class="o"><<</span> <span class="n">mname</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prev_rd</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-></span><span class="n">num_read</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prev_wr</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-></span><span class="n">num_write</span><span class="p">;</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">method</span><span class="o">-></span><span class="n">exec</span><span class="p">((</span><span class="n">cls_method_context_t</span><span class="p">)</span><span class="o">&</span><span class="n">ctx</span><span class="p">,</span> <span class="n">indata</span><span class="p">,</span> <span class="n">outdata</span><span class="p">);</span> <span class="c1">// execute the method
</span> <span class="p">......</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">......</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
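<p>Why is a method described through a handle rather than called directly? Because a method may be registered as either a C function or a C++ function, and a C function cannot take a bufferlist argument.
The input and output arguments therefore have to be converted:</p>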
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">ClassHandler</span><span class="o">::</span><span class="n">ClassMethod</span><span class="o">::</span><span class="n">exec</span><span class="p">(</span><span class="n">cls_method_context_t</span> <span class="n">ctx</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">indata</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">outdata</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cxx_func</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// C++ call version
</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">cxx_func</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&</span><span class="n">indata</span><span class="p">,</span> <span class="o">&</span><span class="n">outdata</span><span class="p">);</span> <span class="c1">// C++ functions are called directly
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// C version
</span> <span class="kt">char</span> <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">olen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">func</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">indata</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">indata</span><span class="p">.</span><span class="n">length</span><span class="p">(),</span> <span class="o">&</span><span class="n">out</span><span class="p">,</span> <span class="o">&</span><span class="n">olen</span><span class="p">);</span> <span class="c1">// call after converting the arguments
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// convert the output back into a bufferlist
</span> <span class="c1">// assume *out was allocated via cls_alloc (which calls malloc!)
</span> <span class="n">buffer</span><span class="o">::</span><span class="n">ptr</span> <span class="n">bp</span> <span class="o">=</span> <span class="n">buffer</span><span class="o">::</span><span class="n">claim_malloc</span><span class="p">(</span><span class="n">olen</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
<span class="n">outdata</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">bp</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
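<p>The same adapter idea can be sketched outside of Ceph. In the sketch below, MethodHandle, c_method_t, cxx_method_t and echo_c are hypothetical names, and std::string stands in for bufferlist. One deliberate difference from the code above: the sketch copies the C function's malloc'd output and then frees it, whereas the original hands ownership of the buffer to the bufferlist through buffer::claim_malloc.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical signatures modeled on the two calling styles described above.
using c_method_t = int (*)(const char* in, int in_len, char** out, int* out_len);
using cxx_method_t = std::function<int(const std::string& in, std::string* out)>;

// A handle holds one of the two flavors and hides the difference from callers.
struct MethodHandle {
  c_method_t c_func = nullptr;
  cxx_method_t cxx_func;

  int exec(const std::string& in, std::string& out) const {
    if (cxx_func) {
      return cxx_func(in, &out);  // C++ flavor: pass the objects through
    }
    if (!c_func) {
      return -1;  // nothing registered
    }
    // C flavor: flatten the input, then wrap the raw output again.
    char* raw = nullptr;
    int raw_len = 0;
    const int ret = c_func(in.data(), static_cast<int>(in.size()), &raw, &raw_len);
    if (raw) {
      if (raw_len > 0) {
        out.append(raw, static_cast<std::size_t>(raw_len));
      }
      std::free(raw);  // assumes the C method allocated with malloc
    }
    return ret;
  }
};

// Example C-style method: returns a malloc'd copy of its input.
int echo_c(const char* in, int in_len, char** out, int* out_len) {
  *out = static_cast<char*>(std::malloc(in_len > 0 ? in_len : 1));
  if (!*out) {
    return -1;
  }
  std::memcpy(*out, in, static_cast<std::size_t>(in_len));
  *out_len = in_len;
  return 0;
}
```

<p>The caller only ever sees exec, so the dispatch site does not need to know which flavor a plugin author chose.</p>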
<p>At this point the plugin request has executed successfully. In effect, it is just a remote call :(</p>
<h1 id="summary">Summary</h1>
<ol>
<li>
<p>Ceph's dynamically loadable plugins extend the librados API: a plugin can be written for a specific client-side need</p>
</li>
<li>
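<p>The osd process manages plugins through the ClassHandler class. At startup it creates one ClassHandler object and stores its address in the global variable ch</p>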
</li>
<li>
<p>Plugins register themselves in the osd's ClassHandler map through a set of wrapper functions that call into ch</p>
</li>
<li>
<p>When needed, the osd automatically loads and initializes a plugin (__cls_init) and then executes the user's request</p>
</li>
<li>
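<p>Executing a plugin means the client sends an op request (still through the basic librados API). The osd receives the request, parses the op type, and handles it, which usually means invoking a registered plugin method</p>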
</li>
</ol>
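<p>The register-then-dispatch flow in the summary can be sketched as a small in-memory registry. This is an illustration only: Registry, ClassData, Method, kNoClass and kNoMethod are hypothetical names rather than Ceph's API, and the lazy dlopen step is left out entirely.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical method signature: input bytes in, output bytes out.
using Method = std::function<int(const std::string& in, std::string* out)>;

// Hypothetical error codes for this sketch.
constexpr int kNoClass = -1;
constexpr int kNoMethod = -2;

// Everything one plugin registered.
struct ClassData {
  std::map<std::string, Method> methods;
};

// Plays the role of the handler that owns the map of plugins.
class Registry {
 public:
  // Called by a plugin during its initialization.
  void register_method(const std::string& cls, const std::string& name, Method m) {
    classes_[cls].methods[name] = std::move(m);
  }

  // Called when a request names a plugin and a method.
  int call(const std::string& cls, const std::string& name,
           const std::string& in, std::string* out) const {
    const auto c = classes_.find(cls);
    if (c == classes_.end()) {
      return kNoClass;
    }
    const auto m = c->second.methods.find(name);
    if (m == c->second.methods.end()) {
      return kNoMethod;
    }
    return m->second(in, out);
  }

 private:
  std::map<std::string, ClassData> classes_;
};
```

<p>A request that names an unknown method fails at lookup time, which mirrors the "does not exist" branch in do_osd_ops.</p>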
Ceph Watch/Notify Mechanism
2015-11-04T00:00:00+00:00
http://blog.wjin.org/posts/ceph-watchnotify-mechanism
<h1 id="overview">Overview</h1>
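<p>Ceph has an important watch/notify mechanism that works at object granularity. It lets different clients communicate so that their state stays consistent,
and the newer librbd exclusive lock feature is built on it as well.</p>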
<p>In block storage, when an image's metadata changes (that is, its header object changes), every current client has to be told. How is that done?</p>
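<p>When a client opens an image in anything other than read-only mode, it registers a watch on the header object through the ImageWatcher class interface.
From then on, it is notified whenever the header object is modified.</p>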
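<p>As a rough mental model, watch/notify behaves like a per-object observer list. The sketch below is an in-memory illustration under that assumption, not how the OSD implements it: WatchedObject and NotifyCallback are hypothetical names, and networking, persistence, timeouts and acknowledgements are all omitted.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical callback type: receives the notify payload.
using NotifyCallback = std::function<void(const std::string& payload)>;

// One watched object, for example an image's header object.
class WatchedObject {
 public:
  // Register a watcher and return the cookie that identifies it.
  std::uint64_t watch(NotifyCallback cb) {
    const std::uint64_t cookie = next_cookie_++;
    watchers_[cookie] = std::move(cb);
    return cookie;
  }

  // Remove a watcher; returns false if the cookie is unknown.
  bool unwatch(std::uint64_t cookie) { return watchers_.erase(cookie) == 1; }

  // Deliver a payload to every watcher; returns how many were notified.
  std::size_t notify(const std::string& payload) const {
    for (const auto& entry : watchers_) {
      entry.second(payload);
    }
    return watchers_.size();
  }

 private:
  std::uint64_t next_cookie_ = 1;
  std::map<std::uint64_t, NotifyCallback> watchers_;
};
```

<p>Each client that opened the image would hold one cookie, and a metadata change would call notify once on the header object.</p>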
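<p>The librbd API exposes no watch interface, so users of the block storage API do not need to care about this detail. Code that uses the librados API directly can register watches when needed.
Opening an image serves as the example here:</p>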
<h1 id="watch">Watch</h1>
<h2 id="librbd-level">librbd level</h2>
<p>To read or write an image, a client must first open it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="n">RBD</span><span class="o">::</span><span class="n">open</span><span class="p">(</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="n">Image</span><span class="o">&</span> <span class="n">image</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">snap_name</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ImageCtx</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">snap_name</span><span class="p">,</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span> <span class="c1">// the trailing false means not read_only, which is what allows the watch to be registered
</span> <span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_enter</span><span class="p">,</span> <span class="n">ictx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">id</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snap_name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">open_image</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span> <span class="c1">// 调用internal.cc
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">image</span><span class="p">.</span><span class="n">ctx</span> <span class="o">=</span> <span class="p">(</span><span class="n">image_ctx_t</span><span class="p">)</span> <span class="n">ictx</span><span class="p">;</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_exit</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As usual, the call lands in internal.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="nf">open_image</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">init</span><span class="p">();</span> <span class="c1">// read the metadata to initialize the image info
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">err_close</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// not read-only
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">register_watch</span><span class="p">();</span> <span class="c1">// register this client as a watcher
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error registering a watch: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_close</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">owner_locker</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">);</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">ictx_refresh</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span> <span class="c1">// refresh the image state
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">err_close</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">r</span> <span class="o">=</span> <span class="n">_snap_set</span><span class="p">(</span><span class="n">ictx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snap_name</span><span class="p">.</span><span class="n">c_str</span><span class="p">()))</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">err_close</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="nl">err_close:</span>
<span class="n">close_image</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Registration itself goes through the ImageWatcher class, which calls the librados API:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// registered only once; the watch then stays in place
</span><span class="kt">int</span> <span class="n">ImageCtx</span><span class="o">::</span><span class="n">register_watch</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">image_watcher</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">image_watcher</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ImageWatcher</span><span class="p">(</span><span class="o">*</span><span class="k">this</span><span class="p">);</span>
<span class="k">return</span> <span class="n">image_watcher</span><span class="o">-></span><span class="n">register_watch</span><span class="p">();</span> <span class="c1">// register through the ImageWatcher management class
</span><span class="p">}</span>
<span class="kt">int</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">register_watch</span><span class="p">()</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_watch_lock</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_watch_state</span> <span class="o">==</span> <span class="n">WATCH_STATE_UNREGISTERED</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">md_ctx</span><span class="p">.</span><span class="n">watch2</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">header_oid</span><span class="p">,</span> <span class="c1">// note: the watch is placed on header_oid
</span> <span class="o">&</span><span class="n">m_watch_handle</span><span class="p">,</span> <span class="c1">// set to the linger op's address; an incoming notify uses it to identify the op and therefore the watch (several watches may be registered)
</span> <span class="o">&</span><span class="n">m_watch_ctx</span><span class="p">);</span> <span class="c1">// callback invoked to process each incoming notify
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">m_watch_state</span> <span class="o">=</span> <span class="n">WATCH_STATE_REGISTERED</span><span class="p">;</span> <span class="c1">// update the state
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
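<p>The comment on m_watch_handle says the handle is the address of the linger op, which is later used to find the right watch when a notify arrives. A minimal sketch of that cookie-from-address pattern follows. LingerEntry and LingerTable are hypothetical stand-ins, not the Objecter's real data structures, and the sketch assumes an entry stays allocated for as long as its cookie is in use.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a linger op: it owns the notify callback.
struct LingerEntry {
  std::function<void(const std::string&)> on_notify;

  // The cookie is simply this entry's address, widened to 64 bits.
  std::uint64_t cookie() const {
    return reinterpret_cast<std::uintptr_t>(this);
  }
};

// Hypothetical table that maps cookies back to live entries.
class LingerTable {
 public:
  std::uint64_t add(std::function<void(const std::string&)> cb) {
    std::unique_ptr<LingerEntry> entry(new LingerEntry());
    entry->on_notify = std::move(cb);
    const std::uint64_t c = entry->cookie();
    entries_[c] = std::move(entry);  // the heap address does not change
    return c;
  }

  // Route an incoming notify to the watch identified by the cookie.
  bool dispatch(std::uint64_t cookie, const std::string& msg) const {
    const auto it = entries_.find(cookie);
    if (it == entries_.end()) {
      return false;  // unknown or already cancelled
    }
    it->second->on_notify(msg);
    return true;
  }

  void cancel(std::uint64_t cookie) { entries_.erase(cookie); }

 private:
  std::map<std::uint64_t, std::unique_ptr<LingerEntry>> entries_;
};
```

<p>Looking the cookie up in a table, instead of casting it straight back to a pointer, means a stale cookie is rejected rather than dereferenced.</p>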
<h2 id="librados-level">librados level</h2>
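<p>To the backend RADOS cluster, registration is an op, so it is wrapped by the objecter and sent to the cluster as a message.
It is also a write, because the watching client's information must be persisted on disk so that the OSD still knows which watches exist after a restart.</p>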
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtx</span><span class="o">::</span><span class="n">watch2</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">cookie</span><span class="p">,</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx2</span> <span class="o">*</span><span class="n">ctx2</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">object_t</span> <span class="n">obj</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="k">return</span> <span class="n">io_ctx_impl</span><span class="o">-></span><span class="n">watch</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">cookie</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">ctx2</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span><span class="o">::</span><span class="n">watch</span><span class="p">(</span><span class="k">const</span> <span class="n">object_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="n">handle</span><span class="p">,</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx2</span> <span class="o">*</span><span class="n">ctx2</span><span class="p">)</span>
<span class="p">{</span>
<span class="o">::</span><span class="n">ObjectOperation</span> <span class="n">wr</span><span class="p">;</span> <span class="c1">// define a write operation
</span> <span class="n">version_t</span> <span class="n">objver</span><span class="p">;</span>
<span class="n">C_SaferCond</span> <span class="n">onfinish</span><span class="p">;</span>
<span class="n">Objecter</span><span class="o">::</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">linger_op</span> <span class="o">=</span> <span class="n">objecter</span><span class="o">-></span><span class="n">linger_register</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="n">oloc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// allocates a LingerOp and records it in the objecter's map/set
</span> <span class="o">*</span><span class="n">handle</span> <span class="o">=</span> <span class="n">linger_op</span><span class="o">-></span><span class="n">get_cookie</span><span class="p">();</span> <span class="c1">// the cookie is the address of that LingerOp, converted to a 64-bit integer
</span> <span class="n">linger_op</span><span class="o">-></span><span class="n">watch_context</span> <span class="o">=</span> <span class="k">new</span> <span class="n">WatchInfo</span><span class="p">(</span><span class="k">this</span><span class="p">,</span>
<span class="n">oid</span><span class="p">,</span> <span class="n">ctx</span><span class="p">,</span> <span class="n">ctx2</span><span class="p">);</span> <span class="c1">// ctx is NULL (a deprecated parameter); ctx2 is the ImageWatcher member m_watch_ctx
</span>
<span class="n">prepare_assert_ops</span><span class="p">(</span><span class="o">&</span><span class="n">wr</span><span class="p">);</span>
<span class="n">wr</span><span class="p">.</span><span class="n">watch</span><span class="p">(</span><span class="o">*</span><span class="n">handle</span><span class="p">,</span> <span class="n">CEPH_OSD_WATCH_OP_WATCH</span><span class="p">);</span> <span class="c1">// append a watch op to wr; the op type is CEPH_OSD_WATCH_OP_WATCH
</span> <span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">linger_watch</span><span class="p">(</span><span class="n">linger_op</span><span class="p">,</span> <span class="n">wr</span><span class="p">,</span> <span class="c1">// ask the objecter to send it
</span> <span class="n">snapc</span><span class="p">,</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="nb">NULL</span><span class="p">),</span> <span class="n">bl</span><span class="p">,</span>
<span class="o">&</span><span class="n">onfinish</span><span class="p">,</span> <span class="c1">// onfinish.wait() below blocks on this condition until the operation completes
</span> <span class="o">&</span><span class="n">objver</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">onfinish</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="c1">// wait for the callback
</span>
<span class="n">set_sync_op_version</span><span class="p">(</span><span class="n">objver</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">linger_cancel</span><span class="p">(</span><span class="n">linger_op</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
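<p>The synchronous <code class="highlighter-rouge">watch</code> call above depends on a completion object that the caller blocks on until the reply path signals it. The following is a minimal sketch of that pattern, assuming nothing about Ceph's real <code class="highlighter-rouge">C_SaferCond</code> beyond the <code class="highlighter-rouge">complete()</code>/<code class="highlighter-rouge">wait()</code> pair used above. <code class="highlighter-rouge">BlockingCompletion</code> and <code class="highlighter-rouge">sync_call</code> are hypothetical names built only on the C++ standard library:</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a C_SaferCond-style completion: one thread
// signals it with a result code, another thread blocks until that happens.
class BlockingCompletion {
public:
  // Called by the reply-handling thread with the operation's result code.
  void complete(int r) {
    std::lock_guard<std::mutex> l(lock_);
    result_ = r;
    done_ = true;
    cond_.notify_all();
  }

  // Called by the submitting thread; returns once complete() has run.
  int wait() {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [this] { return done_; });
    return result_;
  }

private:
  std::mutex lock_;
  std::condition_variable cond_;
  bool done_ = false;
  int result_ = 0;
};

// Simulates a synchronous call: hand the completion to another thread
// (standing in for the reply path), then block until it is signalled.
int sync_call(int simulated_result) {
  BlockingCompletion onfinish;
  std::thread replier([&onfinish, simulated_result] {
    onfinish.complete(simulated_result);
  });
  int r = onfinish.wait();
  replier.join();
  return r;
}
```

<p>Because <code class="highlighter-rouge">wait()</code> checks the <code class="highlighter-rouge">done_</code> flag under the lock, it returns correctly even if <code class="highlighter-rouge">complete()</code> ran before the caller started waiting.</p>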
<h2 id="osdc-level">osdc level</h2>
<p>Next, the operation is sent out through the objecter. The path is much the same as for any other generic op, with a few extra steps in front: <code class="highlighter-rouge">linger_watch -> _linger_submit -> _send_linger -> _op_submit</code></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ceph_tid_t</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">linger_watch</span><span class="p">(</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
<span class="n">ObjectOperation</span><span class="o">&</span> <span class="n">op</span><span class="p">,</span>
<span class="k">const</span> <span class="n">SnapContext</span><span class="o">&</span> <span class="n">snapc</span><span class="p">,</span> <span class="n">utime_t</span> <span class="n">mtime</span><span class="p">,</span>
<span class="n">bufferlist</span><span class="o">&</span> <span class="n">inbl</span><span class="p">,</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">oncommit</span><span class="p">,</span>
<span class="n">version_t</span> <span class="o">*</span><span class="n">objver</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">is_watch</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="c1">// this flag matters: an incoming notify message carries the cookie (the linger_op address), and this flag tells whether the client-side linger_op it maps to is a watch or a notify
</span> <span class="n">info</span><span class="o">-></span><span class="n">snapc</span> <span class="o">=</span> <span class="n">snapc</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">mtime</span> <span class="o">=</span> <span class="n">mtime</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">CEPH_OSD_FLAG_WRITE</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">ops</span> <span class="o">=</span> <span class="n">op</span><span class="p">.</span><span class="n">ops</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">inbl</span> <span class="o">=</span> <span class="n">inbl</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">poutbl</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">pobjver</span> <span class="o">=</span> <span class="n">objver</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">on_reg_commit</span> <span class="o">=</span> <span class="n">oncommit</span><span class="p">;</span> <span class="c1">// key callback: completing it wakes the wait() in the watch function
</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">wl</span><span class="p">(</span><span class="n">rwlock</span><span class="p">);</span>
<span class="n">_linger_submit</span><span class="p">(</span><span class="n">info</span><span class="p">);</span> <span class="c1">// submit the linger op
</span> <span class="n">logger</span><span class="o">-></span><span class="n">inc</span><span class="p">(</span><span class="n">l_osdc_linger_active</span><span class="p">);</span>
<span class="k">return</span> <span class="n">info</span><span class="o">-></span><span class="n">linger_id</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_linger_submit</span><span class="p">(</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">rwlock</span><span class="p">.</span><span class="n">is_wlocked</span><span class="p">());</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span> <span class="n">lc</span><span class="p">(</span><span class="n">rwlock</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">::</span><span class="n">TakenForWrite</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">linger_id</span><span class="p">);</span>
<span class="c1">// Populate Op::target
</span> <span class="n">OSDSession</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">_calc_target</span><span class="p">(</span><span class="o">&</span><span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">,</span> <span class="o">&</span><span class="n">info</span><span class="o">-></span><span class="n">last_force_resend</span><span class="p">);</span>
<span class="c1">// Create LingerOp<->OSDSession relation
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_get_session</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">osd</span><span class="p">,</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">s</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">get_write</span><span class="p">();</span>
<span class="n">_session_linger_op_assign</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span> <span class="c1">// associate the session with the linger op
</span> <span class="n">s</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">put_session</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="n">_send_linger</span><span class="p">(</span><span class="n">info</span><span class="p">);</span> <span class="c1">// send
</span><span class="p">}</span>
<span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_send_linger</span><span class="p">(</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">rwlock</span><span class="p">.</span><span class="n">is_wlocked</span><span class="p">());</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span> <span class="n">lc</span><span class="p">(</span><span class="n">rwlock</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">::</span><span class="n">TakenForWrite</span><span class="p">);</span>
<span class="n">vector</span><span class="o"><</span><span class="n">OSDOp</span><span class="o">></span> <span class="n">opv</span><span class="p">;</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">oncommit</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">watch_lock</span><span class="p">.</span><span class="n">get_read</span><span class="p">();</span> <span class="c1">// just to read registered status
</span> <span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">registered</span> <span class="o">&&</span> <span class="n">info</span><span class="o">-></span><span class="n">is_watch</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">15</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"send_linger "</span> <span class="o"><<</span> <span class="n">info</span><span class="o">-></span><span class="n">linger_id</span> <span class="o"><<</span> <span class="s">" reconnect"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">opv</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">OSDOp</span><span class="p">());</span>
<span class="n">opv</span><span class="p">.</span><span class="n">back</span><span class="p">().</span><span class="n">op</span><span class="p">.</span><span class="n">op</span> <span class="o">=</span> <span class="n">CEPH_OSD_OP_WATCH</span><span class="p">;</span>
<span class="n">opv</span><span class="p">.</span><span class="n">back</span><span class="p">().</span><span class="n">op</span><span class="p">.</span><span class="n">watch</span><span class="p">.</span><span class="n">cookie</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">get_cookie</span><span class="p">();</span>
<span class="n">opv</span><span class="p">.</span><span class="n">back</span><span class="p">().</span><span class="n">op</span><span class="p">.</span><span class="n">watch</span><span class="p">.</span><span class="n">op</span> <span class="o">=</span> <span class="n">CEPH_OSD_WATCH_OP_RECONNECT</span><span class="p">;</span>
<span class="n">opv</span><span class="p">.</span><span class="n">back</span><span class="p">().</span><span class="n">op</span><span class="p">.</span><span class="n">watch</span><span class="p">.</span><span class="n">gen</span> <span class="o">=</span> <span class="o">++</span><span class="n">info</span><span class="o">-></span><span class="n">register_gen</span><span class="p">;</span>
<span class="n">oncommit</span> <span class="o">=</span> <span class="k">new</span> <span class="n">C_Linger_Reconnect</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">15</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"send_linger "</span> <span class="o"><<</span> <span class="n">info</span><span class="o">-></span><span class="n">linger_id</span> <span class="o"><<</span> <span class="s">" register"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">opv</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">ops</span><span class="p">;</span>
<span class="n">oncommit</span> <span class="o">=</span> <span class="k">new</span> <span class="n">C_Linger_Commit</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span> <span class="c1">// the first send creates this callback
</span> <span class="p">}</span>
<span class="n">info</span><span class="o">-></span><span class="n">watch_lock</span><span class="p">.</span><span class="n">put_read</span><span class="p">();</span>
<span class="n">Op</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Op</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">base_oid</span><span class="p">,</span> <span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">base_oloc</span><span class="p">,</span>
<span class="n">opv</span><span class="p">,</span> <span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">|</span> <span class="n">CEPH_OSD_FLAG_READ</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span>
<span class="n">info</span><span class="o">-></span><span class="n">pobjver</span><span class="p">);</span>
<span class="n">o</span><span class="o">-></span><span class="n">oncommit_sync</span> <span class="o">=</span> <span class="n">oncommit</span><span class="p">;</span> <span class="c1">// store the new callback in the op's oncommit_sync member, which is reserved for watch/notify; handle_osd_op_reply invokes it
</span> <span class="n">o</span><span class="o">-></span><span class="n">snapid</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">snap</span><span class="p">;</span>
<span class="n">o</span><span class="o">-></span><span class="n">snapc</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">snapc</span><span class="p">;</span>
<span class="n">o</span><span class="o">-></span><span class="n">mtime</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">mtime</span><span class="p">;</span>
<span class="n">o</span><span class="o">-></span><span class="n">target</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">target</span><span class="p">;</span>
<span class="n">o</span><span class="o">-></span><span class="n">tid</span> <span class="o">=</span> <span class="n">last_tid</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span>
<span class="c1">// do not resend this; we will send a new op to reregister
</span> <span class="n">o</span><span class="o">-></span><span class="n">should_resend</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">register_tid</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// repeat send. cancel old registration op, if any.
</span> <span class="n">info</span><span class="o">-></span><span class="n">session</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">get_write</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">session</span><span class="o">-></span><span class="n">ops</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">register_tid</span><span class="p">))</span> <span class="p">{</span>
<span class="n">Op</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="n">info</span><span class="o">-></span><span class="n">session</span><span class="o">-></span><span class="n">ops</span><span class="p">[</span><span class="n">info</span><span class="o">-></span><span class="n">register_tid</span><span class="p">];</span>
<span class="n">_op_cancel_map_check</span><span class="p">(</span><span class="n">o</span><span class="p">);</span>
<span class="n">_cancel_linger_op</span><span class="p">(</span><span class="n">o</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">info</span><span class="o">-></span><span class="n">session</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">info</span><span class="o">-></span><span class="n">register_tid</span> <span class="o">=</span> <span class="n">_op_submit</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span> <span class="c1">// send
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// first send
</span> <span class="n">info</span><span class="o">-></span><span class="n">register_tid</span> <span class="o">=</span> <span class="n">_op_submit_with_budget</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span> <span class="c1">// send
</span> <span class="p">}</span>
<span class="n">logger</span><span class="o">-></span><span class="n">inc</span><span class="p">(</span><span class="n">l_osdc_linger_send</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
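<p>The branch at the top of <code class="highlighter-rouge">_send_linger</code> decides whether the op is sent as a first registration or as a reconnect. A minimal sketch of just that decision follows. <code class="highlighter-rouge">LingerState</code>, <code class="highlighter-rouge">SendKind</code> and <code class="highlighter-rouge">choose_send</code> are hypothetical names that model only the three fields the branch reads; they are not part of the objecter:</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down view of the LingerOp fields that the
// register/reconnect branch reads.
struct LingerState {
  bool registered = false;        // set once the first registration commits
  bool is_watch = false;          // true for watch, false for notify
  std::uint32_t register_gen = 0; // bumped on every reconnect
};

enum class SendKind { Register, Reconnect };

// Mirrors the branch: only an already-registered watch is re-sent as a
// reconnect (with a new generation number); everything else replays the
// original ops as a registration.
SendKind choose_send(LingerState& s) {
  if (s.registered && s.is_watch) {
    ++s.register_gen;
    return SendKind::Reconnect;
  }
  return SendKind::Register;
}
```

<p>In this model a notify never takes the reconnect path, even when <code class="highlighter-rouge">registered</code> is set, which matches the <code class="highlighter-rouge">info->registered &amp;&amp; info->is_watch</code> condition in the code above.</p>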
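<p>Once the op is submitted, it is sent to the RADOS cluster as a message. The cluster dispatches it by message type and op type and, after a series of steps, persists the watching client's information to disk.
It then sends back a completion message. This is the same message as for other ops (<code class="highlighter-rouge">CEPH_MSG_OSD_OPREPLY</code>); it is handled in ms_dispatch and ends up in <code class="highlighter-rouge">handle_osd_op_reply</code>.</p>
<p>That function is fairly involved, but it eventually invokes the callback <code class="highlighter-rouge">op->oncommit_sync</code>:</p>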
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">handle_osd_op_reply</span><span class="p">(</span><span class="n">MOSDOpReply</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">......</span>
<span class="k">if</span> <span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">oncommit_sync</span><span class="p">)</span> <span class="p">{</span>
<span class="n">op</span><span class="o">-></span><span class="n">oncommit_sync</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">rc</span><span class="p">);</span> <span class="c1">// invoke the callback
</span> <span class="n">op</span><span class="o">-></span><span class="n">oncommit_sync</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">num_uncommitted</span><span class="p">.</span><span class="n">dec</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As the analysis above shows, oncommit_sync here is a C_Linger_Commit:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">struct</span> <span class="n">C_Linger_Commit</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Context</span> <span class="p">{</span>
<span class="n">Objecter</span> <span class="o">*</span><span class="n">objecter</span><span class="p">;</span>
<span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">;</span>
<span class="n">C_Linger_Commit</span><span class="p">(</span><span class="n">Objecter</span> <span class="o">*</span><span class="n">o</span><span class="p">,</span> <span class="n">LingerOp</span> <span class="o">*</span><span class="n">l</span><span class="p">)</span> <span class="o">:</span> <span class="n">objecter</span><span class="p">(</span><span class="n">o</span><span class="p">),</span> <span class="n">info</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> <span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">C_Linger_Commit</span><span class="p">()</span> <span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">_linger_commit</span><span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span> <span class="c1">// continue the callback chain
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_linger_commit</span><span class="p">(</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span> <span class="kt">int</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">wl</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">watch_lock</span><span class="p">);</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"_linger_commit "</span> <span class="o"><<</span> <span class="n">info</span><span class="o">-></span><span class="n">linger_id</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">on_reg_commit</span><span class="p">)</span> <span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">on_reg_commit</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">r</span><span class="p">);</span> <span class="c1">// continue the chain; this finally wakes the wait() in the watch function
</span> <span class="n">info</span><span class="o">-></span><span class="n">on_reg_commit</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// only tell the user the first time we do this
</span> <span class="n">info</span><span class="o">-></span><span class="n">registered</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">info</span><span class="o">-></span><span class="n">pobjver</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
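<p>The reply path is a chain of one-shot callbacks: the op's <code class="highlighter-rouge">oncommit_sync</code> fires, which completes the linger op's <code class="highlighter-rouge">on_reg_commit</code>, which in turn wakes the waiting caller. The sketch below models that chain with <code class="highlighter-rouge">std::function</code> in place of Ceph's <code class="highlighter-rouge">Context</code>. All type and function names (<code class="highlighter-rouge">OneShot</code>, <code class="highlighter-rouge">FakeOp</code>, <code class="highlighter-rouge">FakeLingerOp</code>, <code class="highlighter-rouge">attach_commit</code>) are hypothetical:</p>

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical one-shot callback slot, standing in for a Context* that is
// completed once and then cleared.
struct OneShot {
  std::function<void(int)> fn;

  // Runs the callback at most once; returns whether it ran.
  bool complete(int r) {
    if (!fn) return false;
    auto f = std::move(fn);
    fn = nullptr;  // clear the slot before running, like setting it to NULL
    f(r);
    return true;
  }
};

struct FakeLingerOp {
  OneShot on_reg_commit;   // wakes the caller blocked in watch()
  bool registered = false;
};

struct FakeOp {
  OneShot oncommit_sync;   // invoked when the OSD reply arrives
};

// Wires op.oncommit_sync so that it completes linger.on_reg_commit and
// then marks the linger op as registered, mirroring _linger_commit.
void attach_commit(FakeOp& op, FakeLingerOp& linger) {
  op.oncommit_sync.fn = [&linger](int r) {
    linger.on_reg_commit.complete(r);
    linger.registered = true;
  };
}
```

<p>Clearing each slot after it fires is what makes a second reply for the same op harmless in this model: the second <code class="highlighter-rouge">complete()</code> finds an empty slot and does nothing.</p>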
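<p>That completes the watch operation. From this point on the client will receive notify messages, and a client can of course also send notify messages itself. We first look at how a notify is sent, and then at how incoming messages are handled.</p>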
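<p>The comments in the watch code describe the cookie as the address of the <code class="highlighter-rouge">LingerOp</code> converted to a 64-bit integer. A minimal sketch of that round trip follows, assuming a hypothetical <code class="highlighter-rouge">FakeLingerOp</code> type and a hypothetical <code class="highlighter-rouge">from_cookie</code> helper rather than the real objecter classes:</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down linger op; the real LingerOp has many more fields.
struct FakeLingerOp {
  std::uint64_t linger_id = 0;
  bool is_watch = false;

  // The cookie is this object's address widened to 64 bits.
  std::uint64_t get_cookie() {
    return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(this));
  }
};

// Recovers the linger op from a cookie carried in an incoming message.
// The result is only valid while the original object is still alive.
FakeLingerOp* from_cookie(std::uint64_t cookie) {
  return reinterpret_cast<FakeLingerOp*>(static_cast<std::uintptr_t>(cookie));
}
```

<p>Once the pointer is recovered, a receiver can read <code class="highlighter-rouge">is_watch</code> to tell a watch from a notify, which is the role the <code class="highlighter-rouge">linger_watch</code> comment assigns to that flag.</p>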
<h1 id="notify">Notify</h1>
<h2 id="send">send</h2>
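<p>When one client needs to tell other clients about an event, it calls the notify functions. These notifications are almost always initiated by the ImageWatcher class,
for example notify_header_update and notify_released_lock. They also end up as an op that is created through the objecter and sent out. Unlike the watch operation,
this op is a read, because nothing has to be persisted and so no write is performed.</p>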
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtx</span><span class="o">::</span><span class="n">notify2</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">timeout_ms</span><span class="p">,</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">preplybl</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">object_t</span> <span class="n">obj</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="k">return</span> <span class="n">io_ctx_impl</span><span class="o">-></span><span class="n">notify</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="n">timeout_ms</span><span class="p">,</span> <span class="n">preplybl</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span><span class="o">::</span><span class="n">notify</span><span class="p">(</span><span class="k">const</span> <span class="n">object_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">timeout_ms</span><span class="p">,</span>
<span class="n">bufferlist</span> <span class="o">*</span><span class="n">preply_bl</span><span class="p">,</span>
<span class="kt">char</span> <span class="o">**</span><span class="n">preply_buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="o">*</span><span class="n">preply_buf_len</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">bufferlist</span> <span class="n">inbl</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">C_NotifyFinish</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Context</span> <span class="p">{</span>
<span class="n">Cond</span> <span class="n">cond</span><span class="p">;</span>
<span class="n">Mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">done</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">result</span><span class="p">;</span>
<span class="n">bufferlist</span> <span class="n">reply_bl</span><span class="p">;</span>
<span class="n">C_NotifyFinish</span><span class="p">()</span>
<span class="o">:</span> <span class="n">lock</span><span class="p">(</span><span class="s">"IoCtxImpl::notify::C_NotifyFinish::lock"</span><span class="p">),</span>
<span class="n">done</span><span class="p">(</span><span class="nb">false</span><span class="p">),</span>
<span class="n">result</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
<span class="kt">void</span> <span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">complete</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="n">done</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Signal</span><span class="p">();</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">wait</span><span class="p">()</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">)</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">lock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span> <span class="n">notify_private</span><span class="p">;</span>
<span class="n">Objecter</span><span class="o">::</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">linger_op</span> <span class="o">=</span> <span class="n">objecter</span><span class="o">-></span><span class="n">linger_register</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="n">oloc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// allocate a new linger op
</span> <span class="n">linger_op</span><span class="o">-></span><span class="n">on_notify_finish</span> <span class="o">=</span> <span class="o">&</span><span class="n">notify_private</span><span class="p">;</span> <span class="c1">// this callback is also key
</span> <span class="n">linger_op</span><span class="o">-></span><span class="n">notify_result_bl</span> <span class="o">=</span> <span class="o">&</span><span class="n">notify_private</span><span class="p">.</span><span class="n">reply_bl</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">prot_ver</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">timeout</span> <span class="o">=</span> <span class="n">notify_timeout</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">timeout_ms</span><span class="p">)</span>
<span class="n">timeout</span> <span class="o">=</span> <span class="n">timeout_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">;</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">prot_ver</span><span class="p">,</span> <span class="n">inbl</span><span class="p">);</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">timeout</span><span class="p">,</span> <span class="n">inbl</span><span class="p">);</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">bl</span><span class="p">,</span> <span class="n">inbl</span><span class="p">);</span>
<span class="c1">// Construct RADOS op
</span> <span class="o">::</span><span class="n">ObjectOperation</span> <span class="n">rd</span><span class="p">;</span>
<span class="n">prepare_assert_ops</span><span class="p">(</span><span class="o">&</span><span class="n">rd</span><span class="p">);</span>
<span class="n">rd</span><span class="p">.</span><span class="n">notify</span><span class="p">(</span><span class="n">linger_op</span><span class="o">-></span><span class="n">get_cookie</span><span class="p">(),</span> <span class="n">inbl</span><span class="p">);</span>
<span class="c1">// Issue RADOS op
</span> <span class="n">C_SaferCond</span> <span class="n">onack</span><span class="p">;</span> <span class="c1">// completed once the op has succeeded and the op reply has arrived
</span> <span class="n">version_t</span> <span class="n">objver</span><span class="p">;</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">linger_notify</span><span class="p">(</span><span class="n">linger_op</span><span class="p">,</span> <span class="c1">// send the op out
</span> <span class="n">rd</span><span class="p">,</span> <span class="n">snap_seq</span><span class="p">,</span> <span class="n">inbl</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span>
<span class="o">&</span><span class="n">onack</span><span class="p">,</span> <span class="o">&</span><span class="n">objver</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r_issue</span> <span class="o">=</span> <span class="n">onack</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="c1">// wait for the op to execute successfully
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r_issue</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">notify_private</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="c1">// wait for the returning notify message, not for the op-success reply
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">client</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" failed to initiate notify, r = "</span>
<span class="o"><<</span> <span class="n">r_issue</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">......</span>
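<span class="c1">// Unlike watch, which registers a long-lived linger_op that keeps receiving notify messages (told apart by cookie),
</span> <span class="c1">// a notify no longer needs its linger_op once it has completed, so the op must be cancelled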
</span> <span class="n">objecter</span><span class="o">-></span><span class="n">linger_cancel</span><span class="p">(</span><span class="n">linger_op</span><span class="p">);</span>
<span class="p">......</span>
<span class="p">}</span>
</code></pre></div></div>
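<p>Note that there are two waits here. The first follows the same steps as a watch operation: <code class="highlighter-rouge">linger_notify -> _linger_submit -> _send_linger -> _op_submit</code>.
It waits for the op-completion message <code class="highlighter-rouge">CEPH_MSG_OSD_OPREPLY</code>, which is handled in ms_dispatch and ends in a call to <code class="highlighter-rouge">handle_osd_op_reply</code>, waking the first wait.</p>
<p>The second wait is woken when a different message, <code class="highlighter-rouge">CEPH_MSG_WATCH_NOTIFY</code>, arrives: ms_dispatch calls <code class="highlighter-rouge">handle_watch_notify</code>, which eventually wakes the second wait.
This happens because the backend RADOS cluster, on receiving the notify, forwards it to every watcher. The notifying client is itself a watcher, so it receives the message as well, and different clients take different branches when handling it.</p>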
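<p>The two-stage wait can be modelled without any Ceph code. The following is a minimal single-threaded sketch, not the Objecter implementation: <code class="highlighter-rouge">MsgType</code>, <code class="highlighter-rouge">Msg</code>, <code class="highlighter-rouge">Completion</code> and <code class="highlighter-rouge">NotifyCall</code> are hypothetical names, and the blocking condition variable is replaced by draining an in-memory inbox. It only illustrates that the op reply and the notify completion arrive as separate messages and wake separate waiters.</p>

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-ins for CEPH_MSG_OSD_OPREPLY and CEPH_MSG_WATCH_NOTIFY.
enum class MsgType { OsdOpReply, WatchNotify };

struct Msg {
  MsgType type;
  int result;
};

// One-shot completion, loosely modelled on the role C_SaferCond plays above.
struct Completion {
  bool done = false;
  int result = 0;
  void complete(int r) { result = r; done = true; }
};

struct NotifyCall {
  Completion onack;          // woken by the op reply (first wait)
  Completion notify_finish;  // woken by the notify completion (second wait)
  std::vector<std::string> trace;

  // Route each incoming message to the handler that owns it.
  void ms_dispatch(const Msg& m) {
    if (m.type == MsgType::OsdOpReply) {
      trace.push_back("handle_osd_op_reply");
      onack.complete(m.result);
    } else {
      trace.push_back("handle_watch_notify");
      notify_finish.complete(m.result);
    }
  }

  // Dispatch queued messages until completion c fires.
  // Returns its result, or -1 if the inbox ran dry first.
  int wait(Completion& c, std::queue<Msg>& inbox) {
    while (!c.done && !inbox.empty()) {
      ms_dispatch(inbox.front());
      inbox.pop();
    }
    return c.done ? c.result : -1;
  }
};
```

<p>Calling <code class="highlighter-rouge">wait(onack, inbox)</code> and then <code class="highlighter-rouge">wait(notify_finish, inbox)</code> mirrors the order used in the notify code: the second completion stays pending until the notify message itself has been dispatched.</p>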
<h2 id="receive">receive</h2>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">handle_watch_notify</span><span class="p">(</span><span class="n">MWatchNotify</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">rwlock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">initialized</span><span class="p">.</span><span class="n">read</span><span class="p">())</span> <span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="n">LingerOp</span><span class="o">*></span><span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">cookie</span><span class="p">);</span> <span class="c1">// as described earlier, the cookie is simply the address of the op
</span> <span class="k">if</span> <span class="p">(</span><span class="n">linger_ops_set</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">info</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" cookie "</span> <span class="o"><<</span> <span class="n">m</span><span class="o">-></span><span class="n">cookie</span> <span class="o"><<</span> <span class="s">" dne"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">wl</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">watch_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">opcode</span> <span class="o">==</span> <span class="n">CEPH_WATCH_EVENT_DISCONNECT</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">info</span><span class="o">-></span><span class="n">last_error</span><span class="p">)</span> <span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">last_error</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOTCONN</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">watch_context</span><span class="p">)</span> <span class="p">{</span>
<span class="n">finisher</span><span class="o">-></span><span class="n">queue</span><span class="p">(</span><span class="k">new</span> <span class="n">C_DoWatchError</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="o">-</span><span class="n">ENOTCONN</span><span class="p">));</span>
<span class="n">_linger_callback_queue</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">info</span><span class="o">-></span><span class="n">is_watch</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// false for the notifier's own linger op: the notify we sent has been answered, so invoke the callback to wake the earlier wait
</span> <span class="c1">// notify completion; we can do this inline since we know the only user
</span> <span class="c1">// (librados) is safe to call in fast-dispatch context
</span> <span class="n">assert</span><span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">on_notify_finish</span><span class="p">);</span>
<span class="n">info</span><span class="o">-></span><span class="n">notify_result_bl</span><span class="o">-></span><span class="n">claim</span><span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">get_data</span><span class="p">());</span>
<span class="n">info</span><span class="o">-></span><span class="n">on_notify_finish</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">return_code</span><span class="p">);</span> <span class="c1">// wake our own wait
</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// other watching clients receive this message too; the cookie is the op they registered, and is_watch was set to true at watch registration
</span> <span class="n">finisher</span><span class="o">-></span><span class="n">queue</span><span class="p">(</span><span class="k">new</span> <span class="n">C_DoWatchNotify</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="n">m</span><span class="p">));</span> <span class="c1">// this ends up running ImageWatcher::m_watch_ctx, supplied at registration
</span> <span class="n">_linger_callback_queue</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
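<p>The cookie lookup and the <code class="highlighter-rouge">is_watch</code> branch above can be reduced to a few lines. This is a hedged sketch rather than Objecter code: the <code class="highlighter-rouge">Client</code> type and the returned strings are hypothetical, and only the idea of using the op's address as the cookie is taken from the function above.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Simplified linger op: the cookie is the object's own address.
struct LingerOp {
  bool is_watch;
  uint64_t get_cookie() const {
    return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(this));
  }
};

// Hypothetical client holding the set of live linger ops.
struct Client {
  std::set<LingerOp*> linger_ops;

  // Decide which branch an incoming watch-notify message takes.
  std::string on_watch_notify(uint64_t cookie) const {
    LingerOp* info =
        reinterpret_cast<LingerOp*>(static_cast<uintptr_t>(cookie));
    if (linger_ops.count(info) == 0)
      return "dne";                 // unknown or already cancelled op
    if (!info->is_watch)
      return "notify_complete";     // reply to a notify we sent ourselves
    return "watch_callback";        // notify delivered to our registered watch
  }
};
```

<p>Checking set membership before dereferencing matters: once a linger op has been cancelled its address is no longer in the set, so a late message carrying that cookie is dropped instead of touching freed memory.</p>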
<p>Next, look at how the callback set at watch time gets invoked:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">C_DoWatchNotify</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Context</span> <span class="p">{</span>
<span class="n">Objecter</span> <span class="o">*</span><span class="n">objecter</span><span class="p">;</span>
<span class="n">Objecter</span><span class="o">::</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">;</span>
<span class="n">MWatchNotify</span> <span class="o">*</span><span class="n">msg</span><span class="p">;</span>
<span class="n">C_DoWatchNotify</span><span class="p">(</span><span class="n">Objecter</span> <span class="o">*</span><span class="n">o</span><span class="p">,</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">i</span><span class="p">,</span> <span class="n">MWatchNotify</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="o">:</span> <span class="n">objecter</span><span class="p">(</span><span class="n">o</span><span class="p">),</span> <span class="n">info</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="n">msg</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span>
<span class="n">info</span><span class="o">-></span><span class="n">get</span><span class="p">();</span>
<span class="n">info</span><span class="o">-></span><span class="n">_queued_async</span><span class="p">();</span>
<span class="n">msg</span><span class="o">-></span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">objecter</span><span class="o">-></span><span class="n">_do_watch_notify</span><span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">msg</span><span class="p">);</span> <span class="c1">// invoke the callback
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_do_watch_notify</span><span class="p">(</span><span class="n">LingerOp</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span> <span class="n">MWatchNotify</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">rwlock</span><span class="p">.</span><span class="n">get_read</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">initialized</span><span class="p">.</span><span class="n">read</span><span class="p">());</span>
<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-></span><span class="n">canceled</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rwlock</span><span class="p">.</span><span class="n">put_read</span><span class="p">();</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">rwlock</span><span class="p">.</span><span class="n">put_read</span><span class="p">();</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">opcode</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">CEPH_WATCH_EVENT_NOTIFY</span><span class="p">:</span>
<span class="c1">// at registration time there is this line: linger_op->watch_context = new WatchInfo(this, oid, ctx, ctx2);
</span> <span class="n">info</span><span class="o">-></span><span class="n">watch_context</span><span class="o">-></span><span class="n">handle_notify</span><span class="p">(</span><span class="n">m</span><span class="o">-></span><span class="n">notify_id</span><span class="p">,</span> <span class="n">m</span><span class="o">-></span><span class="n">cookie</span><span class="p">,</span>
<span class="n">m</span><span class="o">-></span><span class="n">notifier_gid</span><span class="p">,</span> <span class="n">m</span><span class="o">-></span><span class="n">bl</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">out</span><span class="o">:</span>
<span class="n">info</span><span class="o">-></span><span class="n">finished_async</span><span class="p">();</span>
<span class="n">info</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="n">m</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="n">_linger_callback_finish</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// next, WatchInfo
</span><span class="k">struct</span> <span class="n">WatchInfo</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">WatchContext</span> <span class="p">{</span>
<span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span> <span class="o">*</span><span class="n">ioctx</span><span class="p">;</span>
<span class="n">object_t</span> <span class="n">oid</span><span class="p">;</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">;</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx2</span> <span class="o">*</span><span class="n">ctx2</span><span class="p">;</span>
<span class="n">WatchInfo</span><span class="p">(</span><span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span> <span class="o">*</span><span class="n">io</span><span class="p">,</span> <span class="n">object_t</span> <span class="n">o</span><span class="p">,</span>
<span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="n">librados</span><span class="o">::</span><span class="n">WatchCtx2</span> <span class="o">*</span><span class="n">c2</span><span class="p">)</span>
<span class="o">:</span> <span class="n">ioctx</span><span class="p">(</span><span class="n">io</span><span class="p">),</span> <span class="n">oid</span><span class="p">(</span><span class="n">o</span><span class="p">),</span> <span class="n">ctx</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="n">ctx2</span><span class="p">(</span><span class="n">c2</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ioctx</span><span class="o">-></span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">WatchInfo</span><span class="p">()</span> <span class="p">{</span>
<span class="n">ioctx</span><span class="o">-></span><span class="n">put</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">handle_notify</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">notify_id</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">cookie</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">notifier_id</span><span class="p">,</span>
<span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ctx2</span><span class="p">)</span>
<span class="n">ctx2</span><span class="o">-></span><span class="n">handle_notify</span><span class="p">(</span><span class="n">notify_id</span><span class="p">,</span> <span class="n">cookie</span><span class="p">,</span> <span class="n">notifier_id</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// call back into ctx2, i.e. ImageWatcher::m_watch_ctx
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">WatchCtx</span><span class="o">::</span><span class="n">handle_notify</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">notify_id</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">handle</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">notifier_id</span><span class="p">,</span>
<span class="n">bufferlist</span><span class="o">&</span> <span class="n">bl</span><span class="p">)</span> <span class="p">{</span>
<span class="n">image_watcher</span><span class="p">.</span><span class="n">handle_notify</span><span class="p">(</span><span class="n">notify_id</span><span class="p">,</span> <span class="n">handle</span><span class="p">,</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// continue the callback chain
</span><span class="p">}</span>
<span class="c1">// finally the notify message is decoded; unsurprisingly, this still happens inside the ImageWatcher class
</span><span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">handle_notify</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">notify_id</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">handle</span><span class="p">,</span>
<span class="n">bufferlist</span> <span class="o">&</span><span class="n">bl</span><span class="p">)</span> <span class="p">{</span>
<span class="n">NotifyMessage</span> <span class="n">notify_message</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">bl</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// legacy notification for header updates
</span> <span class="n">notify_message</span> <span class="o">=</span> <span class="n">NotifyMessage</span><span class="p">(</span><span class="n">HeaderUpdatePayload</span><span class="p">());</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">try</span> <span class="p">{</span>
<span class="n">bufferlist</span><span class="o">::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">bl</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="o">::</span><span class="n">decode</span><span class="p">(</span><span class="n">notify_message</span><span class="p">,</span> <span class="n">iter</span><span class="p">);</span>
<span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="k">const</span> <span class="n">buffer</span><span class="o">::</span><span class="n">error</span> <span class="o">&</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" error decoding image notification: "</span>
<span class="o"><<</span> <span class="n">err</span><span class="p">.</span><span class="n">what</span><span class="p">()</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">apply_visitor</span><span class="p">(</span><span class="n">HandlePayloadVisitor</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="n">notify_id</span><span class="p">,</span> <span class="n">handle</span><span class="p">),</span> <span class="c1">// process the notify message
</span> <span class="n">notify_message</span><span class="p">.</span><span class="n">payload</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Each message type (payload) is handled by a different overload, after which an acknowledgement is sent back:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">struct</span> <span class="n">HandlePayloadVisitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">boost</span><span class="o">::</span><span class="n">static_visitor</span><span class="o"><</span><span class="kt">void</span><span class="o">></span> <span class="p">{</span> <span class="c1">// derives from boost::static_visitor so that apply_visitor can dispatch on it later
</span> <span class="n">ImageWatcher</span> <span class="o">*</span><span class="n">image_watcher</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">notify_id</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">handle</span><span class="p">;</span>
<span class="n">HandlePayloadVisitor</span><span class="p">(</span><span class="n">ImageWatcher</span> <span class="o">*</span><span class="n">image_watcher_</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">notify_id_</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">handle_</span><span class="p">)</span>
<span class="o">:</span> <span class="n">image_watcher</span><span class="p">(</span><span class="n">image_watcher_</span><span class="p">),</span> <span class="n">notify_id</span><span class="p">(</span><span class="n">notify_id_</span><span class="p">),</span> <span class="n">handle</span><span class="p">(</span><span class="n">handle_</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kr">inline</span> <span class="kt">void</span> <span class="k">operator</span><span class="p">()(</span><span class="k">const</span> <span class="n">WatchNotify</span><span class="o">::</span><span class="n">HeaderUpdatePayload</span> <span class="o">&</span><span class="n">payload</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span>
<span class="n">bufferlist</span> <span class="n">out</span><span class="p">;</span>
<span class="n">image_watcher</span><span class="o">-></span><span class="n">handle_payload</span><span class="p">(</span><span class="n">payload</span><span class="p">,</span> <span class="o">&</span><span class="n">out</span><span class="p">);</span>
<span class="n">image_watcher</span><span class="o">-></span><span class="n">acknowledge_notify</span><span class="p">(</span><span class="n">notify_id</span><span class="p">,</span> <span class="n">handle</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Payload</span><span class="o">></span>
<span class="kr">inline</span> <span class="kt">void</span> <span class="k">operator</span><span class="p">()(</span><span class="k">const</span> <span class="n">Payload</span> <span class="o">&</span><span class="n">payload</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="c1">// apply_visitor ends up calling this
</span> <span class="n">bufferlist</span> <span class="n">out</span><span class="p">;</span>
<span class="n">image_watcher</span><span class="o">-></span><span class="n">handle_payload</span><span class="p">(</span><span class="n">payload</span><span class="p">,</span> <span class="o">&</span><span class="n">out</span><span class="p">);</span> <span class="c1">// selects the handle_payload overload that matches the payload type
</span> <span class="n">image_watcher</span><span class="o">-></span><span class="n">acknowledge_notify</span><span class="p">(</span><span class="n">notify_id</span><span class="p">,</span> <span class="n">handle</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span> <span class="c1">// acknowledge the notify; the acknowledgement is sent to the other clients
</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
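<p>The visitor dispatch can be tried out in isolation. The sketch below swaps <code class="highlighter-rouge">boost::variant</code> and <code class="highlighter-rouge">apply_visitor</code> for the standard C++17 <code class="highlighter-rouge">std::variant</code> and <code class="highlighter-rouge">std::visit</code>, so it is an analogue of the technique, not the librbd code. The payload structs and the <code class="highlighter-rouge">Watcher</code> type are hypothetical simplifications.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical payload types; the real ones live in librbd's WatchNotify types.
struct HeaderUpdatePayload {};
struct FlattenPayload { uint64_t request_id; };
struct UnknownPayload {};

using Payload =
    std::variant<HeaderUpdatePayload, FlattenPayload, UnknownPayload>;

// Hypothetical watcher with one handle_payload overload per payload type.
struct Watcher {
  std::string last_handled;
  int acks = 0;

  void handle_payload(const HeaderUpdatePayload&) {
    last_handled = "header_update";
  }
  void handle_payload(const FlattenPayload& p) {
    last_handled = "flatten:" + std::to_string(p.request_id);
  }
  void handle_payload(const UnknownPayload&) { last_handled = "unknown"; }

  void acknowledge_notify() { ++acks; }
};

// Generic call operator: overload resolution picks the right handle_payload,
// then every message is acknowledged the same way.
struct HandlePayloadVisitor {
  Watcher* watcher;

  template <typename P>
  void operator()(const P& payload) const {
    watcher->handle_payload(payload);
    watcher->acknowledge_notify();
  }
};

inline void handle_notify(Watcher& w, const Payload& p) {
  std::visit(HandlePayloadVisitor{&w}, p);
}
```

<p>The design keeps the acknowledgement in one place: adding a payload type only requires a new <code class="highlighter-rouge">handle_payload</code> overload, and a missing overload shows up as a compile error instead of a silently ignored message.</p>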
<h1 id="summary">Summary</h1>
<ol>
<li>
<p>A client that opens an image in anything other than read-only mode registers a watch to receive messages.</p>
</li>
<li>
<p>A client may send notify messages to other clients on its own initiative, and it also receives messages from other clients.</p>
</li>
<li>
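<p>A client also receives the notify messages it sent itself. The cookie identifies the linger op (there may be two or more linger ops here: one created when the client registered its watch, and one created for the notify), and the linger op's is_watch flag tells whether the message is the response to a notify the client itself sent.</p>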
</li>
</ol>
Ceph Librbd Flatten
2015-11-02T00:00:00+00:00
http://blog.wjin.org/posts/ceph-librbd-flatten
<h1 id="overview">Overview</h1>
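<p>In Ceph, once an image (a block device) has been created, you can take a snapshot of it and then clone a new image from that snapshot.
Because cloning uses copy-on-write, it is very fast. When the clone chain is deep, however, locating an object may require walking back through several parent images,
which is slow. Ceph also offers copy-on-read to mitigate this, but to avoid the problem entirely you can flatten the image,
which detaches the child image from its parent. A more recent feature, deep-flatten, additionally flattens the image's snapshots.</p>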
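<p>To make the cost of a deep clone chain concrete, here is a toy in-memory model. It is only a sketch under strong assumptions: <code class="highlighter-rouge">Image</code>, <code class="highlighter-rouge">read_object</code> and this <code class="highlighter-rouge">flatten</code> are hypothetical, objects are plain strings, and snapshots, locking and notifications are all ignored.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy image: objects written locally, plus an optional parent to fall back to.
struct Image {
  std::map<uint64_t, std::string> objects;
  const Image* parent = nullptr;
  uint64_t num_objects = 0;  // number of object slots in the image
};

// Read an object, walking up the clone chain until some image has it.
// If hops is given, it receives how many parent links were followed.
inline const std::string* read_object(const Image& img, uint64_t no,
                                      int* hops = nullptr) {
  int n = 0;
  for (const Image* cur = &img; cur != nullptr; cur = cur->parent, ++n) {
    auto it = cur->objects.find(no);
    if (it != cur->objects.end()) {
      if (hops) *hops = n;
      return &it->second;
    }
  }
  if (hops) *hops = n;
  return nullptr;  // object does not exist anywhere in the chain
}

// Flatten: copy every object the child still lacks, then drop the parent link.
inline void flatten(Image& img) {
  for (uint64_t no = 0; no < img.num_objects; ++no) {
    if (img.objects.count(no)) continue;  // already owned by the child
    if (const std::string* data = read_object(img, no))
      img.objects[no] = *data;
  }
  img.parent = nullptr;
}
```

<p>Before flattening, a read of an untouched object in a grandchild image follows two parent links. After flattening, every read is served by the image itself, at the price of copying the data once.</p>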
<h1 id="code-analysis">Code Analysis</h1>
<h2 id="invoke-flatten">invoke flatten</h2>
<p>Start with the API exposed by the librbd Image class:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="n">Image</span><span class="o">::</span><span class="n">flatten</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span> <span class="o">=</span> <span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="p">)</span><span class="n">ctx</span><span class="p">;</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">flatten_enter</span><span class="p">,</span> <span class="n">ictx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">id</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span>
<span class="n">librbd</span><span class="o">::</span><span class="n">NoOpProgressContext</span> <span class="n">prog_ctx</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">flatten</span><span class="p">(</span><span class="n">ictx</span><span class="p">,</span> <span class="n">prog_ctx</span><span class="p">);</span> <span class="c1">// calls into internal.cc
</span> <span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">flatten_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Next, step into internal.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="nf">flatten</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">,</span> <span class="n">ProgressContext</span> <span class="o">&</span><span class="n">prog_ctx</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">;</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"flatten"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ictx_check</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span> <span class="c1">// validate the image
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// an image opened read-only cannot be flattened
</span> <span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">parent_locker</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">parent_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">parent_md</span><span class="p">.</span><span class="n">spec</span><span class="p">.</span><span class="n">pool_id</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// no parent, so flatten has no meaning
</span> <span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"image has no parent"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">request_id</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">async_request_seq</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span> <span class="c1">// atomic increment yielding the id of this async request; the pair <client_id, request_id> identifies a notification event across the whole cluster
</span> <span class="c1">// the first bind corresponds to librbd.cc/async_flatten --> local_request
</span> <span class="c1">// the second bind corresponds to the ImageWatcher member function notify_flatten --> remote_request
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">invoke_async_request</span><span class="p">(</span><span class="n">ictx</span><span class="p">,</span> <span class="s">"flatten"</span><span class="p">,</span> <span class="nb">false</span><span class="p">,</span>
<span class="n">boost</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="o">&</span><span class="n">async_flatten</span><span class="p">,</span> <span class="n">ictx</span><span class="p">,</span> <span class="n">_1</span><span class="p">,</span>
<span class="n">boost</span><span class="o">::</span><span class="n">ref</span><span class="p">(</span><span class="n">prog_ctx</span><span class="p">)),</span>
<span class="n">boost</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="o">&</span><span class="n">ImageWatcher</span><span class="o">::</span><span class="n">notify_flatten</span><span class="p">,</span>
<span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="p">,</span> <span class="n">request_id</span><span class="p">,</span>
<span class="n">boost</span><span class="o">::</span><span class="n">ref</span><span class="p">(</span><span class="n">prog_ctx</span><span class="p">)));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">notify_change</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">md_ctx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">header_oid</span><span class="p">,</span> <span class="n">ictx</span><span class="p">);</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"flatten finished"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
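<p>Many management operations, such as flatten, resize, and snap_create, are executed asynchronously through the function invoke_async_request.
Its logic is fairly involved: the ImageWatcher class coordinates flatten, resize, snap_create and similar operations on an image,
and it also implements the distributed lock (exclusive lock), mainly on top of Ceph's <a href="http://blog.wjin.org/posts/ceph-watchnotify-mechanism.html#sthash.fU8N80yn.dpuf">watch/notify mechanism</a>.</p>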
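<p>When a client opens an image it registers a watch, which is managed by the ImageWatcher class. The client also holds the image's owner_lock when necessary;
take care to distinguish this lock from the exclusive lock. ImageWatcher records the state of the distributed lock (exclusive lock): m_lock_owner_state indicates whether the current client holds the exclusive lock,
and m_owner_client_id is the ID of the client that holds it. The word "owner" is easily misleading here.</p>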
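<p>If the client holds the exclusive lock, it can execute flatten and similar operations itself by calling local_request. If another client holds the lock instead,
the operation must be forwarded to that client, which is what remote_request does.</p>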
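<p>The routing rule can be sketched in a few lines. This is only an illustration: <code class="highlighter-rouge">Route</code> and <code class="highlighter-rouge">choose_route</code> are hypothetical names, not librbd identifiers, and the two booleans stand in for is_lock_supported() and is_lock_owner().</p>

```cpp
#include <cassert>

// Hypothetical names for illustration; not part of librbd.
enum class Route { Local, Remote };

// Decide where a maintenance request (e.g. flatten) should run.
//   lock_supported: the exclusive lock applies to this image at all
//   is_lock_owner : this client currently holds the exclusive lock
inline Route choose_route(bool lock_supported, bool is_lock_owner) {
  if (!lock_supported || is_lock_owner) {
    return Route::Local;   // nothing to forward: run local_request
  }
  return Route::Remote;    // another client owns the lock: run remote_request
}
```

<p>Note that an image without exclusive-lock support also takes the local path, because there is no owner to forward to.</p>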
<p>The helper function is_lock_supported determines whether acquiring the exclusive lock is supported:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">is_lock_supported</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">snap_lock</span><span class="p">);</span>
<span class="k">return</span> <span class="n">is_lock_supported</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">snap_lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">is_lock_supported</span><span class="p">(</span><span class="k">const</span> <span class="n">RWLock</span> <span class="o">&</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">snap_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="c1">// true when: 1) the exclusive lock feature is enabled 2) the image is not read-only 3) it is not a snapshot
</span> <span class="k">return</span> <span class="p">((</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">features</span> <span class="o">&</span> <span class="n">RBD_FEATURE_EXCLUSIVE_LOCK</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">read_only</span> <span class="o">&&</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">snap_id</span> <span class="o">==</span> <span class="n">CEPH_NOSNAP</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Next, look at the function that launches the asynchronous operation:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">invoke_async_request</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&</span> <span class="n">request_type</span><span class="p">,</span>
<span class="kt">bool</span> <span class="n">permit_snapshot</span><span class="p">,</span>
<span class="k">const</span> <span class="n">boost</span><span class="o">::</span><span class="n">function</span><span class="o"><</span><span class="kt">int</span><span class="p">(</span><span class="n">Context</span><span class="o">*</span><span class="p">)</span><span class="o">>&</span> <span class="n">local_request</span><span class="p">,</span>
<span class="k">const</span> <span class="n">boost</span><span class="o">::</span><span class="n">function</span><span class="o"><</span><span class="kt">int</span><span class="p">()</span><span class="o">>&</span> <span class="n">remote_request</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">C_SaferCond</span> <span class="n">ctx</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">);</span> <span class="c1">// take owner_lock for reading
</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">snap_locker</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">snap_lock</span><span class="p">);</span> <span class="c1">// take snap_lock for reading
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span> <span class="o">||</span>
<span class="p">(</span><span class="o">!</span><span class="n">permit_snapshot</span> <span class="o">&&</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snap_id</span> <span class="o">!=</span> <span class="n">CEPH_NOSNAP</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// permit_snapshot mainly exists for deep-flatten of snapshots
</span> <span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">while</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_supported</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// exclusive lock is supported
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">prepare_image_update</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span> <span class="c1">// try to acquire the exclusive lock
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">())</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span> <span class="c1">// we hold the exclusive lock: leave the loop and run local_request
</span> <span class="p">}</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">remote_request</span><span class="p">();</span> <span class="c1">// another client holds the lock, so forward the request to that client
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ETIMEDOUT</span> <span class="o">&&</span> <span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// on -ETIMEDOUT or -ERESTART, skip this branch and go around the while loop again
</span> <span class="k">return</span> <span class="n">r</span><span class="p">;</span> <span class="c1">// any other result (success or a different error) is returned to the caller
</span> <span class="p">}</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="o"><<</span> <span class="n">request_type</span> <span class="o"><<</span> <span class="s">" timed out notifying lock owner"</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// out of the loop: we hold the exclusive lock (or it is not supported), so execute locally
</span> <span class="n">r</span> <span class="o">=</span> <span class="n">local_request</span><span class="p">(</span><span class="o">&</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="c1">// wait until local_request's callback fires and signals ctx
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="o"><<</span> <span class="n">request_type</span> <span class="o"><<</span> <span class="s">" interrupted: restarting"</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">);</span> <span class="c1">// interrupted, so go around the loop again
</span> <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">prepare_image_update</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">()</span> <span class="o">&&</span> <span class="o">!</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_wlocked</span><span class="p">());</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// no watcher; this happens when the image is opened read-only
</span> <span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_supported</span><span class="p">()</span> <span class="o">||</span> <span class="c1">// first condition: exclusive lock is not supported
</span> <span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// second condition: it is supported and we already hold the lock
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">//
</span> <span class="c1">// need to upgrade to a write lock
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">acquired_lock</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">.</span><span class="n">put_read</span><span class="p">();</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">())</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">try_lock</span><span class="p">();</span> <span class="c1">// try to acquire the exclusive lock
</span> <span class="n">acquired_lock</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">();</span> <span class="c1">// check whether the acquisition succeeded
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">acquired_lock</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// finish any AIO that was previously waiting on acquiring the
</span> <span class="c1">// exclusive lock
</span> <span class="n">ictx</span><span class="o">-></span><span class="n">flush_async_operations</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">.</span><span class="n">get_read</span><span class="p">();</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
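<p>Stripped of locking, the control flow of invoke_async_request is two nested retry loops. The sketch below is a simplified model under stated assumptions: <code class="highlighter-rouge">invoke_with_retry</code>, <code class="highlighter-rouge">kTimedOut</code> and <code class="highlighter-rouge">kRestart</code> are hypothetical names, the callbacks replace the real ImageWatcher calls, and the lock acquisition attempt done by prepare_image_update is left out.</p>

```cpp
#include <cassert>
#include <functional>

// Stand-in error codes for this sketch (hypothetical values).
constexpr int kTimedOut = 110;
constexpr int kRestart = 85;

// Simplified model of the retry structure in invoke_async_request.
//   is_owner(): does this client hold the exclusive lock right now?
//   remote():   forward the request to the lock owner, return its result
//   local():    run the request on this client, return its result
inline int invoke_with_retry(bool lock_supported,
                             const std::function<bool()>& is_owner,
                             const std::function<int()>& remote,
                             const std::function<int()>& local) {
  int r = 0;
  do {
    while (lock_supported) {
      if (is_owner()) {
        break;  // we hold the lock: fall through to the local path
      }
      r = remote();
      if (r != -kTimedOut && r != -kRestart) {
        return r;  // success or a non-retryable error
      }
      // timed out or interrupted: ask the owner again
    }
    r = local();
  } while (r == -kRestart);  // the local run was interrupted: start over
  return r;
}
```

<p>The inner loop only retries forwarding; the outer loop restarts the whole decision, since lock ownership may have changed while the request was in flight.</p>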
<h2 id="proxy-flatten">proxy flatten</h2>
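<p>Next come the two call paths, remote_request and local_request. On the remote_request path,
the request is sent through the watch/notify mechanism to the client that holds the exclusive lock. When that remote client receives the notification, it ends up calling the same function that local_request maps to: async_flatten.
Once the operation succeeds, the result is sent back to the requesting client, again through watch/notify.</p>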
<p>Here is how remote_request sends the notification. It corresponds to the call: <code class="highlighter-rouge">ictx->image_watcher->notify_flatten(request_id, prog_ctx)</code></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">notify_flatten</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">request_id</span><span class="p">,</span> <span class="n">ProgressContext</span> <span class="o">&</span><span class="n">prog_ctx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span> <span class="c1">// a client that opens an image may hold the image's owner_lock; do not confuse it with the exclusive lock
</span> <span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">is_lock_owner</span><span class="p">());</span> <span class="c1">// we do not hold the exclusive lock, as described earlier: notify asks another client to execute on our behalf
</span>
<span class="n">AsyncRequestId</span> <span class="n">async_request_id</span><span class="p">(</span><span class="n">get_client_id</span><span class="p">(),</span> <span class="n">request_id</span><span class="p">);</span> <span class="c1">// globally unique ID
</span>
<span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">NotifyMessage</span><span class="p">(</span><span class="n">FlattenPayload</span><span class="p">(</span><span class="n">async_request_id</span><span class="p">)),</span> <span class="n">bl</span><span class="p">);</span> <span class="c1">// encode the flatten payload message
</span>
<span class="k">return</span> <span class="n">notify_async_request</span><span class="p">(</span><span class="n">async_request_id</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="n">prog_ctx</span><span class="p">);</span> <span class="c1">// send the notification
</span><span class="p">}</span>
<span class="kt">int</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">notify_async_request</span><span class="p">(</span><span class="k">const</span> <span class="n">AsyncRequestId</span> <span class="o">&</span><span class="n">async_request_id</span><span class="p">,</span>
<span class="n">bufferlist</span> <span class="o">&</span><span class="n">in</span><span class="p">,</span>
<span class="n">ProgressContext</span><span class="o">&</span> <span class="n">prog_ctx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" async request: "</span> <span class="o"><<</span> <span class="n">async_request_id</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">C_SaferCond</span> <span class="n">ctx</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="n">m_async_requests</span><span class="p">[</span><span class="n">async_request_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">AsyncRequest</span><span class="p">(</span><span class="o">&</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&</span><span class="n">prog_ctx</span><span class="p">);</span> <span class="c1">// record the pending notify request
</span> <span class="p">}</span>
<span class="n">BOOST_SCOPE_EXIT</span><span class="p">(</span> <span class="p">(</span><span class="o">&</span><span class="n">ctx</span><span class="p">)(</span><span class="n">async_request_id</span><span class="p">)(</span><span class="o">&</span><span class="n">m_task_finisher</span><span class="p">)</span> <span class="c1">// on function exit, remove the request record and the timeout event
</span> <span class="p">(</span><span class="o">&</span><span class="n">m_async_requests</span><span class="p">)(</span><span class="o">&</span><span class="n">m_async_request_lock</span><span class="p">)</span> <span class="p">)</span> <span class="p">{</span>
<span class="n">m_task_finisher</span><span class="o">-></span><span class="n">cancel</span><span class="p">(</span><span class="n">Task</span><span class="p">(</span><span class="n">TASK_CODE_ASYNC_REQUEST</span><span class="p">,</span> <span class="n">async_request_id</span><span class="p">));</span> <span class="c1">// cancel the timeout event
</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="n">m_async_requests</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">async_request_id</span><span class="p">);</span> <span class="c1">// remove the request record
</span> <span class="p">}</span> <span class="n">BOOST_SCOPE_EXIT_END</span>
<span class="n">schedule_async_request_timed_out</span><span class="p">(</span><span class="n">async_request_id</span><span class="p">);</span> <span class="c1">// schedule a timeout event
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">notify_lock_owner</span><span class="p">(</span><span class="n">in</span><span class="p">);</span> <span class="c1">// ask the client holding the exclusive lock to execute the request
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// return immediately on error
</span> <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
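<span class="c1">// 1) wait for a while; on timeout, the timer event registered above wakes this wait, which then returns -ERESTART
</span> <span class="c1">// 2) if the notify complete message arrives before the timeout, the operation has finished: the wake-up callback is invoked earlier, while that notify message is handled, and wait returns the operation's result
</span> <span class="c1">// message handling eventually reaches: void ImageWatcher::handle_payload(const AsyncCompletePayload &payload, bufferlist *out)
</span> <span class="c1">// 3) in every case, the code inside the BOOST_SCOPE_EXIT block runs when the function scope is left and cleans up the timer event and the request record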
</span> <span class="k">return</span> <span class="n">ctx</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
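<p>The cleanup guarantee in notify_async_request comes from BOOST_SCOPE_EXIT: the block runs on every return path. The same idea can be sketched with a small RAII guard. This is a minimal illustration, not the librbd code: <code class="highlighter-rouge">ScopeGuard</code> and <code class="highlighter-rouge">tracked_request</code> are hypothetical names, and a plain std::map stands in for m_async_requests.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Minimal RAII guard: runs the stored callable when the guard leaves scope.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() { if (fn_) fn_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical stand-in for notify_async_request: register the request,
// then leave through different return paths. The guard erases the record
// on each of them.
inline int tracked_request(std::map<int, std::string>& pending, int id,
                           int send_result) {
  pending[id] = "in-flight";
  ScopeGuard cleanup([&pending, id] { pending.erase(id); });
  if (send_result < 0) {
    return send_result;  // early error return: record is still removed
  }
  return 0;  // normal return: record is removed as well
}
```

<p>Because the guard's destructor runs after the return value has been computed, the caller still sees the original result while the bookkeeping is already cleaned up.</p>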
<p>First the timeout path: when the timeout fires, async_request_timed_out is called, which makes the wait return -ERESTART, and invoke_async_request then reissues the call:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">schedule_async_request_timed_out</span><span class="p">(</span><span class="k">const</span> <span class="n">AsyncRequestId</span> <span class="o">&</span><span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">ctx</span> <span class="o">=</span> <span class="k">new</span> <span class="n">FunctionContext</span><span class="p">(</span><span class="n">boost</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span>
<span class="o">&</span><span class="n">ImageWatcher</span><span class="o">::</span><span class="n">async_request_timed_out</span><span class="p">,</span> <span class="k">this</span><span class="p">,</span> <span class="n">id</span><span class="p">));</span> <span class="c1">// function invoked when the timeout fires: async_request_timed_out
</span>
<span class="n">Task</span> <span class="n">task</span><span class="p">(</span><span class="n">TASK_CODE_ASYNC_REQUEST</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
<span class="n">m_task_finisher</span><span class="o">-></span><span class="n">cancel</span><span class="p">(</span><span class="n">task</span><span class="p">);</span> <span class="c1">// cancel any earlier instance of the same task
</span>
<span class="n">md_config_t</span> <span class="o">*</span><span class="n">conf</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="p">;</span>
<span class="n">m_task_finisher</span><span class="o">-></span><span class="n">add_event_after</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">conf</span><span class="o">-></span><span class="n">rbd_request_timed_out_seconds</span><span class="p">,</span> <span class="c1">// add the timeout task
</span> <span class="n">ctx</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">async_request_timed_out</span><span class="p">(</span><span class="k">const</span> <span class="n">AsyncRequestId</span> <span class="o">&</span><span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">AsyncRequestId</span><span class="p">,</span> <span class="n">AsyncRequest</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">it</span> <span class="o">=</span>
<span class="n">m_async_requests</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">id</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">it</span> <span class="o">!=</span> <span class="n">m_async_requests</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" request timed-out: "</span> <span class="o"><<</span> <span class="n">id</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">it</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="o">-</span><span class="n">ERESTART</span><span class="p">);</span> <span class="c1">// invoke the callback, which wakes up ctx.wait() in notify_async_request
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
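<p>On the other, normal path, the client holding the exclusive lock sends an async complete message once it has finished. The requesting client receives this message
and calls ImageWatcher::handle_notify (see the separate article analyzing watch/notify), which then dispatches on the payload type to:</p>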
<blockquote>
<p>ImageWatcher::handle_payload(const AsyncCompletePayload &payload, bufferlist *out)</p>
</blockquote>
<p>Like the timeout handler above, this function wakes up the wait; only the return value differs:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">handle_payload</span><span class="p">(</span><span class="k">const</span> <span class="n">AsyncCompletePayload</span> <span class="o">&</span><span class="n">payload</span><span class="p">,</span>
<span class="n">bufferlist</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">AsyncRequestId</span><span class="p">,</span> <span class="n">AsyncRequest</span><span class="o">>::</span><span class="n">iterator</span> <span class="n">req_it</span> <span class="o">=</span>
<span class="n">m_async_requests</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">req_it</span> <span class="o">!=</span> <span class="n">m_async_requests</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" request finished: "</span>
<span class="o"><<</span> <span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span> <span class="o"><<</span> <span class="s">"="</span>
<span class="o"><<</span> <span class="n">payload</span><span class="p">.</span><span class="n">result</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">req_it</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">first</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">result</span><span class="p">);</span> <span class="c1">// wake up the waiter
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
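<p>Both wake-up paths call complete() on the same context, and the waiter simply returns whatever value was passed in. A rough model of such a waitable context is shown below. It is a simplified stand-in built on the standard library, with the hypothetical name <code class="highlighter-rouge">WaitableResult</code>; it does not reproduce the implementation of Ceph's C_SaferCond.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Simplified stand-in for a waitable completion context (assumed behavior,
// not Ceph's C_SaferCond implementation).
class WaitableResult {
 public:
  // Called by whichever path finishes: the timeout handler (-ERESTART in the
  // article) or the completion handler (the remote operation's result).
  void complete(int r) {
    {
      std::lock_guard<std::mutex> l(mutex_);
      result_ = r;
      done_ = true;
    }
    cond_.notify_all();
  }

  // Blocks until complete() has been called, then returns its argument.
  int wait() {
    std::unique_lock<std::mutex> l(mutex_);
    cond_.wait(l, [this] { return done_; });
    return result_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool done_ = false;
  int result_ = 0;
};
```

<p>In real use, wait() runs on the requesting thread while complete() is called from a timer or notification thread; the predicate form of <code class="highlighter-rouge">cond_.wait</code> also covers the case where complete() happens first.</p>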
<p>With the wake-up paths covered, return to notify_async_request, the function that sends the request. It calls notify_lock_owner to do the sending:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">notify_lock_owner</span><span class="p">(</span><span class="n">bufferlist</span> <span class="o">&</span><span class="n">bl</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="c1">// since we need to ack our own notifications, release the owner lock just in
</span> <span class="c1">// case another notification occurs before this one and it requires the lock
</span> <span class="n">bufferlist</span> <span class="n">response_bl</span><span class="p">;</span>
<span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">put_read</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">md_ctx</span><span class="p">.</span><span class="n">notify2</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">header_oid</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="n">NOTIFY_TIMEOUT</span><span class="p">,</span> <span class="c1">// send the request to the client holding the lock and wait for its result
</span> <span class="o">&</span><span class="n">response_bl</span><span class="p">);</span>
<span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">get_read</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ETIMEDOUT</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" lock owner notification failed: "</span>
<span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// process the notify result held in response_bl
</span> <span class="k">typedef</span> <span class="n">std</span><span class="o">::</span><span class="n">map</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">pair</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="kt">uint64_t</span><span class="o">></span><span class="p">,</span> <span class="n">bufferlist</span><span class="o">></span> <span class="n">responses_t</span><span class="p">;</span>
<span class="n">responses_t</span> <span class="n">responses</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">response_bl</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">try</span> <span class="p">{</span>
<span class="n">bufferlist</span><span class="o">::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">response_bl</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="o">::</span><span class="n">decode</span><span class="p">(</span><span class="n">responses</span><span class="p">,</span> <span class="n">iter</span><span class="p">);</span>
<span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="k">const</span> <span class="n">buffer</span><span class="o">::</span><span class="n">error</span> <span class="o">&</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">bufferlist</span> <span class="n">response</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">lock_owner_responded</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">responses_t</span><span class="o">::</span><span class="n">iterator</span> <span class="n">i</span> <span class="o">=</span> <span class="n">responses</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">responses</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">lock_owner_responded</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// only one response may be non-empty: the one from the client currently holding the exclusive lock
</span> <span class="k">return</span> <span class="o">-</span><span class="n">EIO</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">lock_owner_responded</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">response</span><span class="p">.</span><span class="n">claim</span><span class="p">(</span><span class="n">i</span><span class="o">-></span><span class="n">second</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">lock_owner_responded</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ETIMEDOUT</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">try</span> <span class="p">{</span>
<span class="n">bufferlist</span><span class="o">::</span><span class="n">iterator</span> <span class="n">iter</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="n">ResponseMessage</span> <span class="n">response_message</span><span class="p">;</span>
<span class="o">::</span><span class="n">decode</span><span class="p">(</span><span class="n">response_message</span><span class="p">,</span> <span class="n">iter</span><span class="p">);</span> <span class="c1">// decode the result
</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">response_message</span><span class="p">.</span><span class="n">result</span><span class="p">;</span>
<span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="k">const</span> <span class="n">buffer</span><span class="o">::</span><span class="n">error</span> <span class="o">&</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
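<p>The response-selection loop above can be tried in isolation. The following is only a minimal sketch: <code>Responses</code> and <code>pick_lock_owner_response</code> are hypothetical names, and <code>std::string</code> stands in for <code>bufferlist</code>. It keeps the same rule as the listing: exactly one watcher may answer with a non-empty payload.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the decoded notify responses:
// (notifier id, cookie) -> payload. std::string replaces bufferlist here.
using Responses = std::map<std::pair<uint64_t, uint64_t>, std::string>;

// Returns 0 and fills *out when exactly one watcher sent a non-empty payload,
// -EIO when more than one did, and -ETIMEDOUT when none did.
int pick_lock_owner_response(const Responses &responses, std::string *out) {
  assert(out != nullptr);
  bool lock_owner_responded = false;
  for (const auto &entry : responses) {
    if (entry.second.empty()) {
      continue;  // plain ack from a watcher that does not hold the lock
    }
    if (lock_owner_responded) {
      return -EIO;  // two clients both answered as the lock owner
    }
    lock_owner_responded = true;
    *out = entry.second;
  }
  return lock_owner_responded ? 0 : -ETIMEDOUT;
}
```

<p>Treating "nobody answered" as a timeout lets the caller retry, while two non-empty answers are reported as an I/O error because they cannot both come from the lock owner.</p>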
<p>After the lock-holding client receives the flatten message, it eventually runs the following function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ImageWatcher</span><span class="o">::</span><span class="n">handle_payload</span><span class="p">(</span><span class="k">const</span> <span class="n">FlattenPayload</span> <span class="o">&</span><span class="n">payload</span><span class="p">,</span> <span class="c1">// the flatten payload
</span> <span class="n">bufferlist</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_lock_owner_state</span> <span class="o">==</span> <span class="n">LOCK_OWNER_STATE_LOCKED</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">new_request</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
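<span class="c1">// this case means we had no lock at first, acquired the exclusive lock after sending the message, and then received our own notify message
</span> <span class="c1">// note that the watch/notify code runs on every client, so always keep in mind which client is executing this code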
</span> <span class="k">if</span> <span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">.</span><span class="n">client_id</span> <span class="o">==</span> <span class="n">get_client_id</span><span class="p">())</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_async_pending</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_async_pending</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">);</span>
<span class="n">new_request</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">new_request</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// a new request
</span> <span class="n">RemoteProgressContext</span> <span class="o">*</span><span class="n">prog_ctx</span> <span class="o">=</span>
<span class="k">new</span> <span class="n">RemoteProgressContext</span><span class="p">(</span><span class="o">*</span><span class="k">this</span><span class="p">,</span> <span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">);</span>
<span class="n">RemoteContext</span> <span class="o">*</span><span class="n">ctx</span> <span class="o">=</span> <span class="k">new</span> <span class="n">RemoteContext</span><span class="p">(</span><span class="o">*</span><span class="k">this</span><span class="p">,</span> <span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">,</span> <span class="c1">// RemoteContext eventually sends out the notify-complete message
</span> <span class="n">prog_ctx</span><span class="p">);</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="s">" remote flatten request: "</span>
<span class="o"><<</span> <span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">async_flatten</span><span class="p">(</span><span class="o">&</span><span class="n">m_image_ctx</span><span class="p">,</span> <span class="n">ctx</span><span class="p">,</span> <span class="o">*</span><span class="n">prog_ctx</span><span class="p">);</span> <span class="c1">// hand the work to async_flatten
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">delete</span> <span class="n">ctx</span><span class="p">;</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">WLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_async_request_lock</span><span class="p">);</span>
<span class="n">m_async_pending</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">payload</span><span class="p">.</span><span class="n">async_request_id</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="o">::</span><span class="n">encode</span><span class="p">(</span><span class="n">ResponseMessage</span><span class="p">(</span><span class="n">r</span><span class="p">),</span> <span class="o">*</span><span class="n">out</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
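<p>The admission logic in handle_payload (reject our own request, start each foreign request only once) can be sketched on its own. Everything here is illustrative: <code>RequestId</code>, <code>Admit</code> and <code>PendingRequests</code> are hypothetical names, and the enum replaces the <code>-ERESTART</code> / <code>new_request</code> pair used in the listing.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical request id: (client id, per-client request number).
using RequestId = std::pair<uint64_t, uint64_t>;

enum class Admit { kOwnRequest, kDuplicate, kNew };

class PendingRequests {
 public:
  explicit PendingRequests(uint64_t my_client_id) : m_client_id(my_client_id) {}

  // Mirrors the branches in the listing: a request that we sent ourselves is
  // bounced back, a request already being worked on is only acknowledged,
  // and anything else is recorded as pending and should be started.
  Admit admit(const RequestId &id) {
    if (id.first == m_client_id) {
      return Admit::kOwnRequest;
    }
    return m_pending.insert(id).second ? Admit::kNew : Admit::kDuplicate;
  }

  // Called when starting the work failed, so a later retry is admitted again.
  void forget(const RequestId &id) {
    assert(m_pending.count(id) == 1);
    m_pending.erase(id);
  }

 private:
  uint64_t m_client_id;
  std::set<RequestId> m_pending;
};
```

<p>Deduplicating on the request id matters because the same notify may be delivered more than once, and the flatten must not be started twice.</p>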
<h1 id="do-flatten">do flatten</h1>
<p>As the analysis shows, both local_request and remote_request end up calling the following function to do the actual work:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="nf">async_flatten</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">,</span> <span class="n">Context</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">ProgressContext</span> <span class="o">&</span><span class="n">prog_ctx</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="o">!</span><span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_supported</span><span class="p">()</span> <span class="o">||</span> <span class="c1">// exclusive lock is not supported
</span> <span class="n">ictx</span><span class="o">-></span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">());</span> <span class="c1">// or we already hold the exclusive lock; only the lock holder may do the work
</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">cct</span><span class="p">;</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"flatten"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span><span class="p">;</span>
<span class="c1">// ictx_check also updates parent data
</span> <span class="k">if</span> <span class="p">((</span><span class="n">r</span> <span class="o">=</span> <span class="n">ictx_check</span><span class="p">(</span><span class="n">ictx</span><span class="p">,</span> <span class="nb">true</span><span class="p">))</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"ictx_check failed"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">object_size</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">overlap</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">overlap_objects</span><span class="p">;</span>
<span class="o">::</span><span class="n">SnapContext</span> <span class="n">snapc</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">snap_lock</span><span class="p">);</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l2</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">parent_lock</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span> <span class="o">||</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snap_id</span> <span class="o">!=</span> <span class="n">CEPH_NOSNAP</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// can't flatten a non-clone
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">parent_md</span><span class="p">.</span><span class="n">spec</span><span class="p">.</span><span class="n">pool_id</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">snap_id</span> <span class="o">!=</span> <span class="n">CEPH_NOSNAP</span> <span class="o">||</span> <span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">snapc</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snapc</span><span class="p">;</span>
<span class="n">assert</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">parent</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">get_parent_overlap</span><span class="p">(</span><span class="n">CEPH_NOSNAP</span><span class="p">,</span> <span class="o">&</span><span class="n">overlap</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">overlap</span> <span class="o"><=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">size</span><span class="p">);</span>
<span class="n">object_size</span> <span class="o">=</span> <span class="n">ictx</span><span class="o">-></span><span class="n">get_object_size</span><span class="p">();</span>
<span class="n">overlap_objects</span> <span class="o">=</span> <span class="n">Striper</span><span class="o">::</span><span class="n">get_num_objects</span><span class="p">(</span><span class="n">ictx</span><span class="o">-></span><span class="n">layout</span><span class="p">,</span> <span class="n">overlap</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">AsyncFlattenRequest</span> <span class="o">*</span><span class="n">req</span> <span class="o">=</span>
<span class="k">new</span> <span class="n">AsyncFlattenRequest</span><span class="p">(</span><span class="o">*</span><span class="n">ictx</span><span class="p">,</span> <span class="n">ctx</span><span class="p">,</span> <span class="n">object_size</span><span class="p">,</span> <span class="n">overlap_objects</span><span class="p">,</span> <span class="c1">// this class does the real work of the flatten
</span> <span class="n">snapc</span><span class="p">,</span> <span class="n">prog_ctx</span><span class="p">);</span>
<span class="n">req</span><span class="o">-></span><span class="n">send</span><span class="p">();</span> <span class="c1">// run the request
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
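<p>Next comes how the request is executed. This involves the AsyncRequest framework, from which many maintenance operations (flatten, trim, resize, and so on) derive,
plus the AsyncObjectThrottle class, which caps the number of concurrent operations:</p>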
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AsyncFlattenRequest</span><span class="o">::</span><span class="n">send</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">;</span>
<span class="n">m_state</span> <span class="o">=</span> <span class="n">STATE_FLATTEN_OBJECTS</span><span class="p">;</span>
<span class="n">AsyncObjectThrottle</span><span class="o">::</span><span class="n">ContextFactory</span> <span class="n">context_factory</span><span class="p">(</span> <span class="c1">// define a factory function object; invoking it news an AsyncFlattenObjectContext
</span> <span class="n">boost</span><span class="o">::</span><span class="n">lambda</span><span class="o">::</span><span class="n">bind</span><span class="p">(</span><span class="n">boost</span><span class="o">::</span><span class="n">lambda</span><span class="o">::</span><span class="n">new_ptr</span><span class="o"><</span><span class="n">AsyncFlattenObjectContext</span><span class="o">></span><span class="p">(),</span>
<span class="n">boost</span><span class="o">::</span><span class="n">lambda</span><span class="o">::</span><span class="n">_1</span><span class="p">,</span> <span class="o">&</span><span class="n">m_image_ctx</span><span class="p">,</span> <span class="n">m_object_size</span><span class="p">,</span> <span class="n">m_snapc</span><span class="p">,</span>
<span class="n">boost</span><span class="o">::</span><span class="n">lambda</span><span class="o">::</span><span class="n">_2</span><span class="p">));</span>
<span class="c1">// use AsyncObjectThrottle to throttle the per-object operations
</span> <span class="c1">// this is the current request, here the derived class AsyncFlattenRequest
</span> <span class="c1">// context_factory is the factory object that produces the flatten-specific context: AsyncFlattenObjectContext
</span> <span class="c1">// create_callback_context() also yields a callback object, bound to AsyncRequest::complete, that is invoked when the operation finishes
</span> <span class="n">AsyncObjectThrottle</span> <span class="o">*</span><span class="n">throttle</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AsyncObjectThrottle</span><span class="p">(</span>
<span class="k">this</span><span class="p">,</span> <span class="n">m_image_ctx</span><span class="p">,</span> <span class="n">context_factory</span><span class="p">,</span> <span class="n">create_callback_context</span><span class="p">(),</span> <span class="n">m_prog_ctx</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">m_overlap_objects</span><span class="p">);</span>
<span class="n">throttle</span><span class="o">-></span><span class="n">start_ops</span><span class="p">(</span><span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_concurrent_management_ops</span><span class="p">);</span> <span class="c1">// start the operations; the argument is the number of objects processed concurrently
</span><span class="p">}</span>
</code></pre></div></div>
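<p>The factory built with <code>boost::lambda::bind</code> fixes some constructor arguments and leaves two placeholders to be supplied at call time. A rough equivalent, written with a C++11 lambda instead of Boost.Lambda, might look like the sketch below. All type names are simplified stand-ins rather than the librbd classes, and <code>std::unique_ptr</code> is used here in place of the raw pointer the original returns.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>

struct Throttle {};  // hypothetical stand-in for AsyncObjectThrottle

struct ObjectContext {  // stand-in for the base class C_AsyncObjectThrottle
  virtual ~ObjectContext() = default;
  virtual int send() = 0;
};

struct FlattenObjectContext : ObjectContext {
  FlattenObjectContext(Throttle &t, uint64_t object_size, uint64_t object_no)
      : throttle(t), object_size(object_size), object_no(object_no) {}
  int send() override { return 0; }  // the real class issues the write here
  Throttle &throttle;
  uint64_t object_size;
  uint64_t object_no;
};

// The two call-time arguments play the role of the _1 and _2 placeholders.
using ContextFactory =
    std::function<std::unique_ptr<ObjectContext>(Throttle &, uint64_t)>;

// object_size is captured once; throttle and object number vary per call.
ContextFactory make_flatten_factory(uint64_t object_size) {
  assert(object_size > 0);
  return [object_size](Throttle &t, uint64_t object_no) {
    return std::unique_ptr<ObjectContext>(
        new FlattenObjectContext(t, object_size, object_no));
  };
}
```

<p>Because the throttle only sees the base-class interface, another operation can supply a different factory without any change to the throttle itself.</p>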
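<p>The work is done per object, so it can run concurrently, and the concurrency is configurable (rbd_concurrent_management_ops in the code above).</p>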
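<p>To summarize the flow: start_ops begins by launching n ops concurrently. When an op completes, finish_op is called back,
and it starts the next op if any remain. m_current_ops starts at zero; start_next_op increments it each time it successfully dispatches an op,
and the finish_op callback decrements it when that op completes. When the count drops back to zero, every op has finished (dispatched, executed, and called back), and the upper-layer callback is invoked.
In effect n pipelines run at once: whenever one finishes its op it picks up a new one, until all ops (bounded by the end object) are done.</p>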
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AsyncObjectThrottle</span><span class="o">::</span><span class="n">start_ops</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">max_concurrent</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="kt">bool</span> <span class="n">complete</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_lock</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">max_concurrent</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// concurrency limit
</span> <span class="n">start_next_op</span><span class="p">();</span> <span class="c1">// launch one op
</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_ret</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">m_current_ops</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">complete</span> <span class="o">=</span> <span class="p">(</span><span class="n">m_current_ops</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">complete</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_ctx</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">m_ret</span><span class="p">);</span>
<span class="k">delete</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">AsyncObjectThrottle</span><span class="o">::</span><span class="n">start_next_op</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">bool</span> <span class="n">done</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_async_request</span><span class="o">-></span><span class="n">is_canceled</span><span class="p">()</span> <span class="o">&&</span> <span class="n">m_ret</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// allow in-flight ops to complete, but don't start new ops
</span> <span class="n">m_ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_ret</span> <span class="o">!=</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">m_object_no</span> <span class="o">>=</span> <span class="n">m_end_object_no</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// the end object is the stop condition
</span> <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">ono</span> <span class="o">=</span> <span class="n">m_object_no</span><span class="o">++</span><span class="p">;</span> <span class="c1">// the object handled by this op
</span>
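<span class="c1">// the key step: the context factory produces a context specific to this kind of request
</span> <span class="c1">// the factory passed in earlier produces AsyncFlattenObjectContext, which derives from C_AsyncObjectThrottle
</span> <span class="c1">// hence the base-class pointer; operations such as trim reuse this framework by passing a different factory to AsyncObjectThrottle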
</span> <span class="n">C_AsyncObjectThrottle</span> <span class="o">*</span><span class="n">ctx</span> <span class="o">=</span> <span class="n">m_context_factory</span><span class="p">(</span><span class="o">*</span><span class="k">this</span><span class="p">,</span> <span class="n">ono</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-></span><span class="n">send</span><span class="p">();</span> <span class="c1">// actually run the context
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_ret</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
<span class="k">delete</span> <span class="n">ctx</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// op completed immediately
</span> <span class="k">delete</span> <span class="n">ctx</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// a request was dispatched; wait for its completion callback
</span> <span class="o">++</span><span class="n">m_current_ops</span><span class="p">;</span> <span class="c1">// one more op in flight
</span> <span class="n">done</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">m_prog_ctx</span><span class="p">.</span><span class="n">update_progress</span><span class="p">(</span><span class="n">ono</span><span class="p">,</span> <span class="n">m_end_object_no</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
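<p>The counting scheme described above can be simulated without any asynchronous I/O. The sketch below is a simplified, single-threaded model: <code>ThrottleSketch</code> and <code>run_all</code> are hypothetical names, error handling, cancellation and locking are left out, and <code>finish_op</code> is written from the prose description because the real function is not listed here. It checks that at most n ops are in flight and that the counter reaches zero exactly when the last op completes.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Single-threaded model of the throttle's bookkeeping (no locks, no errors).
struct ThrottleSketch {
  uint64_t next_object;
  uint64_t end_object;
  uint64_t current_ops = 0;  // ops dispatched but not yet completed
  uint64_t peak_ops = 0;     // highest in-flight count seen so far

  ThrottleSketch(uint64_t begin, uint64_t end)
      : next_object(begin), end_object(end) {}

  // Dispatch one more op unless the end object has been reached.
  void start_next_op() {
    if (next_object >= end_object) {
      return;
    }
    ++next_object;
    ++current_ops;
    peak_ops = std::max(peak_ops, current_ops);
  }

  // Launch up to max_concurrent ops; true means there was nothing to do.
  bool start_ops(uint64_t max_concurrent) {
    for (uint64_t i = 0; i < max_concurrent; ++i) {
      start_next_op();
    }
    return current_ops == 0;
  }

  // Completion callback for one op; true means all work is finished.
  bool finish_op() {
    assert(current_ops > 0);
    --current_ops;
    start_next_op();
    return current_ops == 0;
  }
};

// Drives the model to completion and returns how many ops completed.
uint64_t run_all(ThrottleSketch &t, uint64_t max_concurrent) {
  uint64_t completions = 0;
  if (t.start_ops(max_concurrent)) {
    return completions;
  }
  for (;;) {
    ++completions;
    if (t.finish_op()) {
      break;
    }
  }
  return completions;
}
```

<p>In this model, running ten objects with a limit of three completes ten ops and never has more than three in flight, which matches the "n pipelines" picture in the summary.</p>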
<p>Now look at what the ctx's send function does. Flattening an image really means issuing a write to each of the image's objects to trigger copy-on-write,
except that the write's buffer is empty:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">AsyncFlattenObjectContext</span> <span class="o">:</span> <span class="k">public</span> <span class="n">C_AsyncObjectThrottle</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">AsyncFlattenObjectContext</span><span class="p">(</span><span class="n">AsyncObjectThrottle</span> <span class="o">&</span><span class="n">throttle</span><span class="p">,</span> <span class="n">ImageCtx</span> <span class="o">*</span><span class="n">image_ctx</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">object_size</span><span class="p">,</span> <span class="o">::</span><span class="n">SnapContext</span> <span class="n">snapc</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">object_no</span><span class="p">)</span>
<span class="o">:</span> <span class="n">C_AsyncObjectThrottle</span><span class="p">(</span><span class="n">throttle</span><span class="p">,</span> <span class="o">*</span><span class="n">image_ctx</span><span class="p">),</span> <span class="n">m_object_size</span><span class="p">(</span><span class="n">object_size</span><span class="p">),</span>
<span class="n">m_snapc</span><span class="p">(</span><span class="n">snapc</span><span class="p">),</span> <span class="n">m_object_no</span><span class="p">(</span><span class="n">object_no</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">int</span> <span class="n">send</span><span class="p">()</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_supported</span><span class="p">()</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">image_watcher</span><span class="o">-></span><span class="n">is_lock_owner</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"lost exclusive lock during flatten"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ERESTART</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">bufferlist</span> <span class="n">bl</span><span class="p">;</span> <span class="c1">// contains no data
</span> <span class="n">string</span> <span class="n">oid</span> <span class="o">=</span> <span class="n">m_image_ctx</span><span class="p">.</span><span class="n">get_object_name</span><span class="p">(</span><span class="n">m_object_no</span><span class="p">);</span>
<span class="n">AioWrite</span> <span class="o">*</span><span class="n">req</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AioWrite</span><span class="p">(</span><span class="o">&</span><span class="n">m_image_ctx</span><span class="p">,</span> <span class="n">oid</span><span class="p">,</span> <span class="n">m_object_no</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">bl</span><span class="p">,</span> <span class="n">m_snapc</span><span class="p">,</span> <span class="c1">// the write IO request
</span> <span class="k">this</span><span class="p">);</span> <span class="c1">// the this pointer matters: it is the callback invoked when the IO completes
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">req</span><span class="o">-></span><span class="n">has_parent</span><span class="p">())</span> <span class="p">{</span>
<span class="c1">// stop early if the parent went away - it just means
</span> <span class="c1">// another flatten finished first or the image was resized
</span> <span class="k">delete</span> <span class="n">req</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">req</span><span class="o">-></span><span class="n">send</span><span class="p">();</span> <span class="c1">// send the IO request
</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="kt">uint64_t</span> <span class="n">m_object_size</span><span class="p">;</span>
<span class="o">::</span><span class="n">SnapContext</span> <span class="n">m_snapc</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">m_object_no</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
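<p>When the IO request completes, the callback (the this pointer passed above) is invoked. Following the inheritance chain, the call ends up in the function below:</p>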
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">C_AsyncObjectThrottle</span><span class="o">::</span><span class="n">finish</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">);</span>
<span class="n">m_finisher</span><span class="p">.</span><span class="n">finish_op</span><span class="p">(</span><span class="n">r</span><span class="p">);</span> <span class="c1">// m_finisher is the AsyncObjectThrottle
</span><span class="p">}</span>
<span class="c1">// so the callback finally reaches this function
</span><span class="kt">void</span> <span class="n">AsyncObjectThrottle</span><span class="o">::</span><span class="n">finish_op</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">m_image_ctx</span><span class="p">.</span><span class="n">owner_lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="kt">bool</span> <span class="n">complete</span><span class="p">;</span>
<span class="p">{</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">m_lock</span><span class="p">);</span>
<span class="o">--</span><span class="n">m_current_ops</span><span class="p">;</span> <span class="c1">// one op finished, decrement the count
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ENOENT</span> <span class="o">&&</span> <span class="n">m_ret</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_ret</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">start_next_op</span><span class="p">();</span> <span class="c1">// move on to the next op
</span> <span class="n">complete</span> <span class="o">=</span> <span class="p">(</span><span class="n">m_current_ops</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// true once every callback has finished
</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">complete</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m_ctx</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">m_ret</span><span class="p">);</span> <span class="c1">// only after every op has finished does the callback propagate upward; m_ctx leads to AsyncRequest::complete
</span> <span class="k">delete</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
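<p>The in-flight counting shown above can be sketched in isolation. The block below is a simplified, single-threaded model and not the librbd code: <code class="highlighter-rouge">ThrottleSketch</code>, <code class="highlighter-rouge">send_op</code> and <code class="highlighter-rouge">on_done</code> are hypothetical names, locking is omitted, and the return-code convention (negative for error, positive for "completed immediately", zero for "in flight") is taken from the send function above.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical, single-threaded stand-in for AsyncObjectThrottle (no locking).
class ThrottleSketch {
public:
  // Models ctx->send(): <0 error, >0 completed immediately, 0 now in flight.
  using SendOp = std::function<int(uint64_t object_no)>;
  using OnDone = std::function<void(int ret)>;

  ThrottleSketch(uint64_t begin, uint64_t end, SendOp send_op, OnDone on_done)
    : m_object_no(begin), m_end_object_no(end),
      m_send_op(std::move(send_op)), m_on_done(std::move(on_done)) {}

  // Try to put up to max_concurrent ops in flight.
  void start_ops(uint64_t max_concurrent) {
    for (uint64_t i = 0; i < max_concurrent; ++i) {
      start_next_op();
      if (m_ret < 0 && m_current_ops == 0) break;
    }
    if (m_current_ops == 0) m_on_done(m_ret);  // nothing is pending
  }

  // Called once for every in-flight op when its IO completes.
  void finish_op(int r) {
    assert(m_current_ops > 0);
    --m_current_ops;
    if (r < 0 && r != -ENOENT && m_ret == 0) m_ret = r;  // keep the first real error
    start_next_op();                                     // refill the freed slot
    if (m_current_ops == 0) m_on_done(m_ret);            // last callback has arrived
  }

  uint64_t in_flight() const { return m_current_ops; }

private:
  void start_next_op() {
    bool done = false;
    while (!done) {
      if (m_ret != 0 || m_object_no >= m_end_object_no) return;
      int r = m_send_op(m_object_no++);
      if (r < 0) { m_ret = r; return; }
      if (r == 0) { ++m_current_ops; done = true; }
      // r > 0: the op completed immediately, so try the next object
    }
  }

  uint64_t m_object_no;
  uint64_t m_end_object_no;
  uint64_t m_current_ops = 0;
  int m_ret = 0;
  SendOp m_send_op;
  OnDone m_on_done;
};
```

<p>With five objects and a limit of two, each call to <code class="highlighter-rouge">finish_op</code> starts one more op until the range is exhausted, and <code class="highlighter-rouge">on_done</code> fires only when the counter drops back to zero.</p>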
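<p>Once all ops have completed, AsyncRequest::complete is called. It decides whether the request as a whole is finished, and that logic lives in the should_complete function.
After the ops finish, metadata such as the header object still has to be updated, and the deep flatten case is handled here as well:</p>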
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">void</span> <span class="nf">complete</span><span class="p">(</span><span class="kt">int</span> <span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_canceled</span> <span class="o">&&</span> <span class="n">safely_cancel</span><span class="p">(</span><span class="n">r</span><span class="p">))</span> <span class="p">{</span>
<span class="n">m_on_finish</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="o">-</span><span class="n">ERESTART</span><span class="p">);</span>
<span class="k">delete</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">should_complete</span><span class="p">(</span><span class="n">r</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// should_complete updates some metadata here; it is a simple state machine
</span> <span class="n">m_on_finish</span><span class="o">-></span><span class="n">complete</span><span class="p">(</span><span class="n">filter_return_code</span><span class="p">(</span><span class="n">r</span><span class="p">));</span> <span class="c1">// all done; invoke the callback from async_flatten
</span> <span class="k">delete</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
Ceph Librbd Create Image
2015-11-01T00:00:00+00:00
http://blog.wjin.org/posts/ceph-librbd-create-image
<h1 id="overview">Overview</h1>
<h1 id="code-analysis">Code Analysis</h1>
<h2 id="librbd-level">librbd level</h2>
<p>Start from librbd.cc, which is a thin wrapper that calls create in internal.cc:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">// io_ctx is the librados connection, name is the image name, size is the image size
</span> <span class="c1">// order is the log2 of the object size; the default is 22, i.e. 4 MB (1 << 22)
</span> <span class="kt">int</span> <span class="n">RBD</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">order</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">create_enter</span><span class="p">,</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">get_pool_name</span><span class="p">().</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">get_id</span><span class="p">(),</span> <span class="n">name</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="o">*</span><span class="n">order</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">order</span><span class="p">);</span> <span class="c1">// call the implementation in internal.cc
</span> <span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">create_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="o">*</span><span class="n">order</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
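<p>To make the meaning of order concrete, here is a minimal sketch of the arithmetic mentioned in the comment above. The helper name <code class="highlighter-rouge">object_size_for_order</code> is hypothetical and not part of librbd; it assumes an order below 64, since shifting a 64-bit value by 64 is undefined.</p>

```cpp
#include <cstdint>

// Hypothetical helper: the object size in bytes is 2^order.
// Assumes 0 <= order < 64.
constexpr uint64_t object_size_for_order(int order) {
  return uint64_t{1} << order;
}

// The default order of 22 gives 4 MiB objects; the minimum of 12 gives 4 KiB.
static_assert(object_size_for_order(22) == 4ull * 1024 * 1024, "order 22 is 4 MiB");
static_assert(object_size_for_order(12) == 4096, "order 12 is 4 KiB");
```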
<p>The bulk of the work is in internal.cc. It first settles the image's basic parameters, including the format, order and stripe settings.
For striping, see the official explanation of <a href="http://docs.ceph.com/docs/v0.80.5/architecture/#data-striping">data striping</a>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="nf">create</span><span class="p">(</span><span class="n">librados</span><span class="o">::</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">imgname</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">,</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">order</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="p">)</span><span class="n">io_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">();</span> <span class="c1">// cct is a handle to the Ceph cluster configuration and is used everywhere
</span> <span class="kt">bool</span> <span class="n">old_format</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_default_format</span> <span class="o">==</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// format 2 is the norm nowadays, so this is usually false
</span> <span class="kt">uint64_t</span> <span class="n">features</span> <span class="o">=</span> <span class="n">old_format</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_default_features</span><span class="p">;</span> <span class="c1">// the new format supports the newly added features
</span> <span class="k">return</span> <span class="n">create</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">imgname</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">old_format</span><span class="p">,</span> <span class="n">features</span><span class="p">,</span> <span class="n">order</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// the function that does the real work
</span> <span class="kt">int</span> <span class="nf">create</span><span class="p">(</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">imgname</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">,</span>
<span class="kt">bool</span> <span class="n">old_format</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">features</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">order</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">stripe_unit</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">stripe_count</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">order</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="p">)</span><span class="n">io_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">features</span> <span class="o">&</span> <span class="o">~</span><span class="n">RBD_FEATURES_ALL</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"librbd does not support requested features."</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ENOSYS</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// make sure it doesn't already exist, in either format
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">detect_format</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">imgname</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span> <span class="c1">// detect the image format
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"Could not tell if "</span> <span class="o"><<</span> <span class="n">imgname</span> <span class="o"><<</span> <span class="s">" already exists"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"rbd image "</span> <span class="o"><<</span> <span class="n">imgname</span> <span class="o"><<</span> <span class="s">" already exists"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EEXIST</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!*</span><span class="n">order</span><span class="p">)</span>
<span class="o">*</span><span class="n">order</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_default_order</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!*</span><span class="n">order</span><span class="p">)</span>
<span class="o">*</span><span class="n">order</span> <span class="o">=</span> <span class="n">RBD_DEFAULT_OBJ_ORDER</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">order</span> <span class="o">&&</span> <span class="p">(</span><span class="o">*</span><span class="n">order</span> <span class="o">></span> <span class="mi">64</span> <span class="o">||</span> <span class="o">*</span><span class="n">order</span> <span class="o"><</span> <span class="mi">12</span><span class="p">))</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"order must be in the range [12, 64]"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EDOM</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Rados</span> <span class="n">rados</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">);</span> <span class="c1">// used to talk to rados, i.e. the Ceph cluster
</span> <span class="c1">// fetch the client id, which the RadosClient object obtains from the monitor
</span> <span class="c1">// Rados wraps a RadosClient, which is initialized at connect time
</span> <span class="kt">uint64_t</span> <span class="n">bid</span> <span class="o">=</span> <span class="n">rados</span><span class="p">.</span><span class="n">get_instance_id</span><span class="p">();</span>
<span class="c1">// if striping is enabled, use possibly custom defaults
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">old_format</span> <span class="o">&&</span> <span class="p">(</span><span class="n">features</span> <span class="o">&</span> <span class="n">RBD_FEATURE_STRIPINGV2</span><span class="p">)</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">stripe_unit</span> <span class="o">&&</span> <span class="o">!</span><span class="n">stripe_count</span><span class="p">)</span> <span class="p">{</span>
<span class="n">stripe_unit</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_default_stripe_unit</span><span class="p">;</span>
<span class="n">stripe_count</span> <span class="o">=</span> <span class="n">cct</span><span class="o">-></span><span class="n">_conf</span><span class="o">-></span><span class="n">rbd_default_stripe_count</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// normalize for default striping
</span> <span class="k">if</span> <span class="p">(</span><span class="n">stripe_unit</span> <span class="o">==</span> <span class="p">(</span><span class="mi">1ull</span> <span class="o"><<</span> <span class="o">*</span><span class="n">order</span><span class="p">)</span> <span class="o">&&</span> <span class="n">stripe_count</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">stripe_unit</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">stripe_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">((</span><span class="n">stripe_unit</span> <span class="o">||</span> <span class="n">stripe_count</span><span class="p">)</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">features</span> <span class="o">&</span> <span class="n">RBD_FEATURE_STRIPINGV2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"STRIPINGV2 and format 2 or later required for non-default striping"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">((</span><span class="n">stripe_unit</span> <span class="o">&&</span> <span class="o">!</span><span class="n">stripe_count</span><span class="p">)</span> <span class="o">||</span>
<span class="p">(</span><span class="o">!</span><span class="n">stripe_unit</span> <span class="o">&&</span> <span class="n">stripe_count</span><span class="p">))</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">old_format</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// old format
</span> <span class="k">if</span> <span class="p">(</span><span class="n">stripe_unit</span> <span class="o">&&</span> <span class="n">stripe_unit</span> <span class="o">!=</span> <span class="p">(</span><span class="mi">1ull</span> <span class="o"><<</span> <span class="o">*</span><span class="n">order</span><span class="p">))</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">stripe_count</span> <span class="o">&&</span> <span class="n">stripe_count</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="k">return</span> <span class="n">create_v1</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">imgname</span><span class="p">,</span> <span class="n">bid</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="o">*</span><span class="n">order</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// format 2 takes this path
</span> <span class="k">return</span> <span class="n">create_v2</span><span class="p">(</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">imgname</span><span class="p">,</span> <span class="n">bid</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="o">*</span><span class="n">order</span><span class="p">,</span> <span class="n">features</span><span class="p">,</span>
<span class="n">stripe_unit</span><span class="p">,</span> <span class="n">stripe_count</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
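<p>The stripe checks in the function above are easier to follow as a pure function. The sketch below mirrors those checks only; <code class="highlighter-rouge">StripeParams</code> and <code class="highlighter-rouge">normalize_striping</code> are hypothetical names, the <code class="highlighter-rouge">striping_v2</code> flag stands in for the RBD_FEATURE_STRIPINGV2 feature bit, and the step that fills in configured defaults is left out.</p>

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical container for the two striping parameters.
struct StripeParams {
  uint64_t unit;
  uint64_t count;
};

// Mirrors the stripe validation in create(): returns 0 on success (possibly
// normalizing *p in place) or -EINVAL for an unsupported combination.
// Assumes order < 64.
int normalize_striping(StripeParams* p, int order, bool old_format,
                       bool striping_v2) {
  const uint64_t object_size = 1ull << order;

  // One object-sized unit with count 1 is the default layout, stored as 0/0.
  if (p->unit == object_size && p->count == 1) {
    p->unit = 0;
    p->count = 0;
  }
  // Any non-default striping requires the striping feature.
  if ((p->unit || p->count) && !striping_v2) return -EINVAL;
  // Unit and count must be given together or not at all.
  if ((p->unit != 0) != (p->count != 0)) return -EINVAL;
  // The old format supports only the default layout.
  if (old_format) {
    if (p->unit && p->unit != object_size) return -EINVAL;
    if (p->count && p->count != 1) return -EINVAL;
  }
  return 0;
}
```

<p>For example, a 64 KiB unit with a count of 16 passes only when the striping feature is enabled, while a 4 MiB unit with a count of 1 at order 22 is silently normalized to the default.</p>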
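<p>Creating an image only records some basic information about it:
the metadata objects <code class="highlighter-rouge">rbd_id.foo</code> and <code class="highlighter-rouge">rbd_header.foo</code> are created, while the actual data objects <code class="highlighter-rouge">rbd_data*</code>
are not created at all. This is because Ceph uses <code class="highlighter-rouge">thin-provisioning</code>, which allows a block device to be created within seconds and lets the backend over-commit capacity.</p>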
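<p>A rough sense of what thin provisioning saves: the sketch below computes how many data objects an image could eventually own, none of which exist right after creation. Both helpers are hypothetical, and the object name layout (the <code class="highlighter-rouge">rbd_data.</code> prefix, the image id, then the object number as 16 hex digits) is an assumption about the format 2 naming convention rather than something shown in the code here.</p>

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Upper bound on the data objects an image can own: ceil(size / 2^order).
// Assumes order < 64. Right after creation none of these objects exist.
uint64_t max_data_objects(uint64_t image_size, int order) {
  const uint64_t object_size = uint64_t{1} << order;
  return (image_size + object_size - 1) / object_size;
}

// Assumed format 2 naming: "rbd_data.<image id>.<object number, 16 hex digits>".
std::string data_object_name(const std::string& image_id, uint64_t object_no) {
  char suffix[17];  // 16 hex digits plus the terminating NUL
  std::snprintf(suffix, sizeof(suffix), "%016" PRIx64, object_no);
  return "rbd_data." + image_id + "." + suffix;
}
```

<p>A 10 GiB image at the default order of 22 can grow to 2560 data objects, yet creating it only writes the handful of metadata objects described above.</p>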
<p>Many steps of the creation are requested through the cls_client plugin (<code class="highlighter-rouge">src/cls/rbd</code>); the plugin sends the corresponding commands to rados for execution:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="nf">create_v2</span><span class="p">(</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">imgname</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">bid</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">size</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">order</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">features</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">stripe_unit</span><span class="p">,</span>
<span class="kt">uint64_t</span> <span class="n">stripe_count</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ostringstream</span> <span class="n">bid_ss</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">extra</span><span class="p">;</span>
<span class="n">string</span> <span class="n">id</span><span class="p">,</span> <span class="n">id_obj</span><span class="p">,</span> <span class="n">header_oid</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">remove_r</span><span class="p">;</span>
<span class="n">ostringstream</span> <span class="n">oss</span><span class="p">;</span>
<span class="n">CephContext</span> <span class="o">*</span><span class="n">cct</span> <span class="o">=</span> <span class="p">(</span><span class="n">CephContext</span> <span class="o">*</span><span class="p">)</span><span class="n">io_ctx</span><span class="p">.</span><span class="n">cct</span><span class="p">();</span>
<span class="n">ceph_file_layout</span> <span class="n">layout</span><span class="p">;</span>
<span class="n">id_obj</span> <span class="o">=</span> <span class="n">id_obj_name</span><span class="p">(</span><span class="n">imgname</span><span class="p">);</span>
<span class="c1">// first ask librados to create a metadata object such as rbd_id.foo
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">create</span><span class="p">(</span><span class="n">id_obj</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error creating rbd id object: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">extra</span> <span class="o">=</span> <span class="n">rand</span><span class="p">()</span> <span class="o">%</span> <span class="mh">0xFFFFFFFF</span><span class="p">;</span>
<span class="n">bid_ss</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">hex</span> <span class="o"><<</span> <span class="n">bid</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">hex</span> <span class="o"><<</span> <span class="n">extra</span><span class="p">;</span>
<span class="n">id</span> <span class="o">=</span> <span class="n">bid_ss</span><span class="p">.</span><span class="n">str</span><span class="p">();</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">cls_client</span><span class="o">::</span><span class="n">set_id</span><span class="p">(</span><span class="o">&</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">id_obj</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span> <span class="c1">// set the image id through the plugin
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error setting image id: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_remove_id</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"adding rbd image to directory..."</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">cls_client</span><span class="o">::</span><span class="n">dir_add_image</span><span class="p">(</span><span class="o">&</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">RBD_DIRECTORY</span><span class="p">,</span> <span class="n">imgname</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span> <span class="c1">// add the image's (name, id) pair to the rbd directory object; every created image is recorded there, so ls can list all images in the pool
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error adding image to directory: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_remove_id</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">oss</span> <span class="o"><<</span> <span class="n">RBD_DATA_PREFIX</span> <span class="o"><<</span> <span class="n">id</span><span class="p">;</span>
<span class="n">header_oid</span> <span class="o">=</span> <span class="n">header_name</span><span class="p">(</span><span class="n">id</span><span class="p">);</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">cls_client</span><span class="o">::</span><span class="n">create_image</span><span class="p">(</span><span class="o">&</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">header_oid</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">order</span><span class="p">,</span> <span class="c1">// create the image; this actually creates a header object named by header_name(id), i.e. the rbd_header. prefix plus the image id
</span> <span class="n">features</span><span class="p">,</span> <span class="n">oss</span><span class="p">.</span><span class="n">str</span><span class="p">());</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error writing header: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_remove_from_dir</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">((</span><span class="n">stripe_unit</span> <span class="o">||</span> <span class="n">stripe_count</span><span class="p">)</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">stripe_count</span> <span class="o">!=</span> <span class="mi">1</span> <span class="o">||</span> <span class="n">stripe_unit</span> <span class="o">!=</span> <span class="p">(</span><span class="mi">1ull</span> <span class="o"><<</span> <span class="n">order</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">cls_client</span><span class="o">::</span><span class="n">set_stripe_unit_count</span><span class="p">(</span><span class="o">&</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">header_oid</span><span class="p">,</span> <span class="c1">// store the striping parameters in the header metadata
</span> <span class="n">stripe_unit</span><span class="p">,</span> <span class="n">stripe_count</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error setting striping parameters: "</span>
<span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_remove_header</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">((</span><span class="n">features</span> <span class="o">&</span> <span class="n">RBD_FEATURE_OBJECT_MAP</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// set up the object map
</span> <span class="k">if</span> <span class="p">((</span><span class="n">features</span> <span class="o">&</span> <span class="n">RBD_FEATURE_EXCLUSIVE_LOCK</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"cannot use object map without exclusive lock"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">err_remove_header</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">memset</span><span class="p">(</span><span class="o">&</span><span class="n">layout</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">layout</span><span class="p">));</span>
<span class="n">layout</span><span class="p">.</span><span class="n">fl_object_size</span> <span class="o">=</span> <span class="mi">1ull</span> <span class="o"><<</span> <span class="n">order</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">stripe_unit</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">stripe_count</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">layout</span><span class="p">.</span><span class="n">fl_stripe_unit</span> <span class="o">=</span> <span class="n">layout</span><span class="p">.</span><span class="n">fl_object_size</span><span class="p">;</span>
<span class="n">layout</span><span class="p">.</span><span class="n">fl_stripe_count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">layout</span><span class="p">.</span><span class="n">fl_stripe_unit</span> <span class="o">=</span> <span class="n">stripe_unit</span><span class="p">;</span>
<span class="n">layout</span><span class="p">.</span><span class="n">fl_stripe_count</span> <span class="o">=</span> <span class="n">stripe_count</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">librados</span><span class="o">::</span><span class="n">ObjectWriteOperation</span> <span class="n">op</span><span class="p">;</span>
<span class="n">cls_client</span><span class="o">::</span><span class="n">object_map_resize</span><span class="p">(</span><span class="o">&</span><span class="n">op</span><span class="p">,</span> <span class="n">Striper</span><span class="o">::</span><span class="n">get_num_objects</span><span class="p">(</span><span class="n">layout</span><span class="p">,</span> <span class="n">size</span><span class="p">),</span>
<span class="n">OBJECT_NONEXISTENT</span><span class="p">);</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">operate</span><span class="p">(</span><span class="n">ObjectMap</span><span class="o">::</span><span class="n">object_map_name</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">CEPH_NOSNAP</span><span class="p">),</span> <span class="o">&</span><span class="n">op</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">goto</span> <span class="n">err_remove_header</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"done."</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// creation complete
</span>
<span class="c1">// error handling: undo the earlier steps in reverse order
</span> <span class="nl">err_remove_header:</span>
<span class="n">remove_r</span> <span class="o">=</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">header_oid</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">remove_r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error cleaning up image header after creation failed: "</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="nl">err_remove_from_dir:</span>
<span class="n">remove_r</span> <span class="o">=</span> <span class="n">cls_client</span><span class="o">::</span><span class="n">dir_remove_image</span><span class="p">(</span><span class="o">&</span><span class="n">io_ctx</span><span class="p">,</span> <span class="n">RBD_DIRECTORY</span><span class="p">,</span>
<span class="n">imgname</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">remove_r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error cleaning up image from rbd_directory object "</span>
<span class="o"><<</span> <span class="s">"after creation failed: "</span> <span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">remove_r</span><span class="p">)</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="nl">err_remove_id:</span>
<span class="n">remove_r</span> <span class="o">=</span> <span class="n">io_ctx</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">id_obj</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">remove_r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lderr</span><span class="p">(</span><span class="n">cct</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"error cleaning up id object after creation failed: "</span>
<span class="o"><<</span> <span class="n">cpp_strerror</span><span class="p">(</span><span class="n">remove_r</span><span class="p">)</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
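<p>To keep track of which RADOS objects <code class="highlighter-rouge">create_v2</code> touches, here is a minimal standalone sketch. It is not librbd code: the struct and both helper functions are hypothetical, and the literal prefixes are assumed to match the values behind <code class="highlighter-rouge">RBD_DATA_PREFIX</code> and the related constants. The object count covers only the default striping case (<code class="highlighter-rouge">stripe_count == 1</code>, stripe unit equal to the object size), where it reduces to a rounded-up division.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical summary of the objects involved in creating a format-2 image.
// The prefix strings are assumptions standing in for the RBD_* constants.
struct ImageObjects {
  std::string id_obj;       // maps the image name to the image id
  std::string header_obj;   // holds size, order, features, striping
  std::string data_prefix;  // common prefix of all data objects
};

ImageObjects image_objects(const std::string& name, const std::string& id) {
  return ImageObjects{"rbd_id." + name, "rbd_header." + id, "rbd_data." + id};
}

// Number of data objects under default striping only:
// one stripe, stripe unit == object size == 2^order bytes.
uint64_t num_objects_default_striping(uint64_t size, int order) {
  assert(order > 0 && order < 64);
  const uint64_t object_size = 1ull << order;
  return (size + object_size - 1) / object_size;  // round up
}
```

<p>For example, a 1 GiB image with <code class="highlighter-rouge">order = 22</code> (4 MiB objects) needs 256 object map entries, which is the count that the object map is resized to in the listing above.</p>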
<h2 id="librados-level">librados level</h2>
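<p>Whether an operation goes through a cls plugin or calls the librados API directly, it is ultimately packaged into a message and sent to the backend OSD cluster.</p>
<p>Let's take a brief look at how object creation is implemented; the other operations work in a similar way.</p>
<p>In <code class="highlighter-rouge">create_v2</code>, the first step is to call the create function of the librados class IoCtx, which creates a concrete object: <code class="highlighter-rouge">rbd_id.foo</code></p>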
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtx</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">exclusive</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">object_t</span> <span class="n">obj</span><span class="p">(</span><span class="n">oid</span><span class="p">);</span>
<span class="k">return</span> <span class="n">io_ctx_impl</span><span class="o">-></span><span class="n">create</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">exclusive</span><span class="p">);</span> <span class="c1">// IoCtx delegates to IoCtxImpl
</span><span class="p">}</span>
<span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span><span class="o">::</span><span class="n">create</span><span class="p">(</span><span class="k">const</span> <span class="n">object_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">exclusive</span><span class="p">)</span>
<span class="p">{</span>
<span class="o">::</span><span class="n">ObjectOperation</span> <span class="n">op</span><span class="p">;</span> <span class="c1">// the global-namespace ObjectOperation, i.e. the class that belongs to the Objecter code under src/osdc (Objecter.h)
</span> <span class="n">prepare_assert_ops</span><span class="p">(</span><span class="o">&</span><span class="n">op</span><span class="p">);</span> <span class="c1">// object-version assertion handling
</span> <span class="n">op</span><span class="p">.</span><span class="n">create</span><span class="p">(</span><span class="n">exclusive</span><span class="p">);</span> <span class="c1">// fill in the actual content of the op (a create)
</span> <span class="k">return</span> <span class="n">operate</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="o">&</span><span class="n">op</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span> <span class="c1">// hand the op to the Objecter (src/osdc/Objecter.cc), which sends it to the OSD as a message
</span><span class="p">}</span>
</code></pre></div></div>
<p>Next, let's see how the op is sent to the OSD as a message:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtxImpl</span><span class="o">::</span><span class="n">operate</span><span class="p">(</span><span class="k">const</span> <span class="n">object_t</span><span class="o">&</span> <span class="n">oid</span><span class="p">,</span> <span class="o">::</span><span class="n">ObjectOperation</span> <span class="o">*</span><span class="n">o</span><span class="p">,</span>
<span class="kt">time_t</span> <span class="o">*</span><span class="n">pmtime</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">utime_t</span> <span class="n">ut</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pmtime</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ut</span> <span class="o">=</span> <span class="n">utime_t</span><span class="p">(</span><span class="o">*</span><span class="n">pmtime</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">ut</span> <span class="o">=</span> <span class="n">ceph_clock_now</span><span class="p">(</span><span class="n">client</span><span class="o">-></span><span class="n">cct</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/* can't write to a snapshot */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">snap_seq</span> <span class="o">!=</span> <span class="n">CEPH_NOSNAP</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EROFS</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">o</span><span class="o">-></span><span class="n">size</span><span class="p">())</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">Mutex</span> <span class="n">mylock</span><span class="p">(</span><span class="s">"IoCtxImpl::operate::mylock"</span><span class="p">);</span>
<span class="n">Cond</span> <span class="n">cond</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">done</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span><span class="p">;</span>
<span class="n">version_t</span> <span class="n">ver</span><span class="p">;</span>
<span class="n">Context</span> <span class="o">*</span><span class="n">oncommit</span> <span class="o">=</span> <span class="k">new</span> <span class="n">C_SafeCond</span><span class="p">(</span><span class="o">&</span><span class="n">mylock</span><span class="p">,</span> <span class="o">&</span><span class="n">cond</span><span class="p">,</span> <span class="o">&</span><span class="n">done</span><span class="p">,</span> <span class="o">&</span><span class="n">r</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">op</span> <span class="o">=</span> <span class="n">o</span><span class="o">-></span><span class="n">ops</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">op</span><span class="p">.</span><span class="n">op</span><span class="p">;</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">client</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="n">ceph_osd_op_name</span><span class="p">(</span><span class="n">op</span><span class="p">)</span> <span class="o"><<</span> <span class="s">" oid="</span> <span class="o"><<</span> <span class="n">oid</span> <span class="o"><<</span> <span class="s">" nspace="</span> <span class="o"><<</span> <span class="n">oloc</span><span class="p">.</span><span class="n">nspace</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="c1">// build an Objecter::Op from the ObjectOperation prepared earlier; this allocates a new Objecter::Op
</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">Op</span> <span class="o">*</span><span class="n">objecter_op</span> <span class="o">=</span> <span class="n">objecter</span><span class="o">-></span><span class="n">prepare_mutate_op</span><span class="p">(</span><span class="n">oid</span><span class="p">,</span> <span class="n">oloc</span><span class="p">,</span>
<span class="o">*</span><span class="n">o</span><span class="p">,</span> <span class="n">snapc</span><span class="p">,</span> <span class="n">ut</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span> <span class="n">oncommit</span><span class="p">,</span> <span class="o">&</span><span class="n">ver</span><span class="p">);</span>
<span class="c1">// submit the op
</span> <span class="n">objecter</span><span class="o">-></span><span class="n">op_submit</span><span class="p">(</span><span class="n">objecter_op</span><span class="p">);</span>
<span class="n">mylock</span><span class="p">.</span><span class="n">Lock</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">)</span>
<span class="n">cond</span><span class="p">.</span><span class="n">Wait</span><span class="p">(</span><span class="n">mylock</span><span class="p">);</span>
<span class="n">mylock</span><span class="p">.</span><span class="n">Unlock</span><span class="p">();</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">client</span><span class="o">-></span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"Objecter returned from "</span>
<span class="o"><<</span> <span class="n">ceph_osd_op_name</span><span class="p">(</span><span class="n">op</span><span class="p">)</span> <span class="o"><<</span> <span class="s">" r="</span> <span class="o"><<</span> <span class="n">r</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">set_sync_op_version</span><span class="p">(</span><span class="n">ver</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
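<p>The way <code class="highlighter-rouge">operate</code> turns an asynchronous submit into a blocking call can be reduced to a small pattern: the caller parks on a condition variable, and the completion callback stores the result, sets a flag and wakes it up. The sketch below uses the C++ standard library instead of Ceph's <code class="highlighter-rouge">Mutex</code>, <code class="highlighter-rouge">Cond</code> and <code class="highlighter-rouge">C_SafeCond</code>; <code class="highlighter-rouge">SyncWaiter</code> and <code class="highlighter-rouge">operate_sync</code> are hypothetical names, and the worker thread merely stands in for the thread that delivers the OSD reply.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Stand-in for the state that the completion context points at:
// a lock, a condition variable, a done flag and the result code.
struct SyncWaiter {
  std::mutex lock;
  std::condition_variable cond;
  bool done = false;
  int result = 0;

  // Completion side: runs on the thread that delivers the reply.
  void complete(int r) {
    std::lock_guard<std::mutex> g(lock);
    result = r;
    done = true;
    cond.notify_all();
  }

  // Caller side: the equivalent of the `while (!done) cond.Wait(mylock)` loop.
  int wait() {
    std::unique_lock<std::mutex> g(lock);
    cond.wait(g, [this] { return done; });
    assert(done);
    return result;
  }
};

// Hypothetical synchronous wrapper: run `op` elsewhere, block until it reports.
int operate_sync(const std::function<int()>& op) {
  SyncWaiter w;
  std::thread worker([&] { w.complete(op()); });  // plays the reply thread
  int r = w.wait();
  worker.join();
  return r;
}
```

<p>The flag is what makes the wait safe: if the completion fires before the caller reaches <code class="highlighter-rouge">wait</code>, the predicate is already true and the caller returns immediately instead of missing the wakeup.</p>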
<p>Before sending, a simple throttling step is applied:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ceph_tid_t</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">op_submit</span><span class="p">(</span><span class="n">Op</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">ctx_budget</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">RLocker</span> <span class="n">rl</span><span class="p">(</span><span class="n">rwlock</span><span class="p">);</span>
<span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span> <span class="n">lc</span><span class="p">(</span><span class="n">rwlock</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">::</span><span class="n">TakenForRead</span><span class="p">);</span>
<span class="k">return</span> <span class="n">_op_submit_with_budget</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">lc</span><span class="p">,</span> <span class="n">ctx_budget</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// simple throttling so that ops are not submitted too fast
</span><span class="n">ceph_tid_t</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_op_submit_with_budget</span><span class="p">(</span><span class="n">Op</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">&</span> <span class="n">lc</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">ctx_budget</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">initialized</span><span class="p">.</span><span class="n">read</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">ops</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">op</span><span class="o">-></span><span class="n">out_bl</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">ops</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">op</span><span class="o">-></span><span class="n">out_rval</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">ops</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">==</span> <span class="n">op</span><span class="o">-></span><span class="n">out_handler</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="c1">// throttle. before we look at any state, because
</span> <span class="c1">// take_op_budget() may drop our lock while it blocks.
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">op</span><span class="o">-></span><span class="n">ctx_budgeted</span> <span class="o">||</span> <span class="p">(</span><span class="n">ctx_budget</span> <span class="o">&&</span> <span class="p">(</span><span class="o">*</span><span class="n">ctx_budget</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)))</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">op_budget</span> <span class="o">=</span> <span class="n">_take_op_budget</span><span class="p">(</span><span class="n">op</span><span class="p">);</span>
<span class="c1">// take and pass out the budget for the first OP
</span> <span class="c1">// in the context session
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ctx_budget</span> <span class="o">&&</span> <span class="p">(</span><span class="o">*</span><span class="n">ctx_budget</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="o">*</span><span class="n">ctx_budget</span> <span class="o">=</span> <span class="n">op_budget</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">C_CancelOp</span> <span class="o">*</span><span class="n">cb</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">osd_timeout</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cb</span> <span class="o">=</span> <span class="k">new</span> <span class="n">C_CancelOp</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="n">op</span><span class="o">-></span><span class="n">ontimeout</span> <span class="o">=</span> <span class="n">cb</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ceph_tid_t</span> <span class="n">tid</span> <span class="o">=</span> <span class="n">_op_submit</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span> <span class="c1">// this is where the send is actually handled
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cb</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cb</span><span class="o">-></span><span class="n">set_tid</span><span class="p">(</span><span class="n">tid</span><span class="p">);</span>
<span class="n">Mutex</span><span class="o">::</span><span class="n">Locker</span> <span class="n">l</span><span class="p">(</span><span class="n">timer_lock</span><span class="p">);</span>
<span class="n">timer</span><span class="p">.</span><span class="n">add_event_after</span><span class="p">(</span><span class="n">osd_timeout</span><span class="p">,</span> <span class="n">op</span><span class="o">-></span><span class="n">ontimeout</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">tid</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
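<p>The budget idea behind <code class="highlighter-rouge">_take_op_budget</code> can be illustrated with a counting throttle: a submitter takes some units before sending and gives them back when the op completes, and a submitter that would exceed the limit waits. This is only a sketch under simplifying assumptions. <code class="highlighter-rouge">BudgetThrottle</code> is a hypothetical class that tracks a single counter, and it does not reproduce how the Objecter computes a budget or releases its lock while blocked.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal budget throttle: at most `max` units may be in flight at once.
// Note: a request larger than `max` would block forever in take().
class BudgetThrottle {
 public:
  explicit BudgetThrottle(uint64_t max) : max_(max) {}

  // Non-blocking attempt; returns true if the budget was taken.
  bool try_take(uint64_t n) {
    std::lock_guard<std::mutex> g(lock_);
    if (in_flight_ + n > max_) return false;
    in_flight_ += n;
    return true;
  }

  // Blocking variant: waits until enough budget has been returned.
  void take(uint64_t n) {
    std::unique_lock<std::mutex> g(lock_);
    cond_.wait(g, [&] { return in_flight_ + n <= max_; });
    in_flight_ += n;
  }

  // Give budget back when an op completes and wake blocked submitters.
  void put(uint64_t n) {
    std::lock_guard<std::mutex> g(lock_);
    assert(n <= in_flight_);
    in_flight_ -= n;
    cond_.notify_all();
  }

  uint64_t in_flight() const {
    std::lock_guard<std::mutex> g(lock_);
    return in_flight_;
  }

 private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  const uint64_t max_;
  uint64_t in_flight_ = 0;
};
```

<p>In this model the submit path would call <code class="highlighter-rouge">take</code> before sending and the reply path would call <code class="highlighter-rouge">put</code>, so a slow cluster naturally slows down the client instead of letting requests pile up without bound.</p>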
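<p>Before submitting, the Objecter works out whether it needs a newer map. If it does, it sends the monitor a request for the latest osdmap, and the op stays in its session until the time is right to send it.
Two things cause the ops held in a session to be sent again:</p>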
<ul>
<li>
<p><code class="highlighter-rouge">schedule_tick</code> rescans and sends periodically: it arms a timer, runs when the timer fires, and then arms a new timer, so the scan keeps repeating.</p>
</li>
<li>
<p><code class="highlighter-rouge">handle_osd_map</code> calls <code class="highlighter-rouge">scan_request</code> to send them out. <code class="highlighter-rouge">handle_osd_map</code> is invoked by the dispatch function, which routes messages by type, once a <code class="highlighter-rouge">CEPH_MSG_OSD_MAP</code> message has been received.</p>
</li>
</ul>
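<p>Before reading the real <code class="highlighter-rouge">_op_submit</code>, the park-and-resend behaviour described above can be modelled in a few lines. Everything here is a toy: <code class="highlighter-rouge">ToySession</code>, <code class="highlighter-rouge">PendingOp</code> and the <code class="highlighter-rouge">min_epoch</code> field are hypothetical, and a single epoch comparison stands in for the much richer target calculation that the Objecter performs. The point is only that both triggers, the periodic tick and the arrival of a new map, end up running the same rescan over the ops that are still waiting.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: an op can be sent only once the client's map epoch is new enough.
struct PendingOp {
  uint64_t tid;
  uint32_t min_epoch;  // hypothetical: epoch required before the op can go out
};

class ToySession {
 public:
  // Send immediately if possible, otherwise keep the op parked in the session.
  bool submit(const PendingOp& op, uint32_t current_epoch) {
    if (op.min_epoch <= current_epoch) {
      sent_.push_back(op.tid);
      return true;
    }
    waiting_.push_back(op);
    return false;
  }

  // Shared by both triggers (timer tick, new map). Returns the number of
  // parked ops that could be sent with the given epoch.
  std::size_t rescan(uint32_t current_epoch) {
    std::size_t n = 0;
    std::vector<PendingOp> still_waiting;
    for (const PendingOp& op : waiting_) {
      if (op.min_epoch <= current_epoch) {
        sent_.push_back(op.tid);
        ++n;
      } else {
        still_waiting.push_back(op);
      }
    }
    waiting_.swap(still_waiting);
    return n;
  }

  const std::vector<uint64_t>& sent() const { return sent_; }
  std::size_t waiting() const { return waiting_.size(); }

 private:
  std::vector<PendingOp> waiting_;
  std::vector<uint64_t> sent_;
};
```

<p>A rescan with an unchanged epoch sends nothing, which is why the timer alone is not enough and the map update path also has to trigger it.</p>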
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ceph_tid_t</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_op_submit</span><span class="p">(</span><span class="n">Op</span> <span class="o">*</span><span class="n">op</span><span class="p">,</span> <span class="n">RWLock</span><span class="o">::</span><span class="n">Context</span><span class="o">&</span> <span class="n">lc</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">rwlock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" op "</span> <span class="o"><<</span> <span class="n">op</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="c1">// pick target
</span> <span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">session</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">OSDSession</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="c1">// compute the target; if the pool is missing from our map, a newer map has to be checked for
</span> <span class="kt">bool</span> <span class="k">const</span> <span class="n">check_for_latest_map</span> <span class="o">=</span> <span class="n">_calc_target</span><span class="p">(</span><span class="o">&</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">,</span> <span class="o">&</span><span class="n">op</span><span class="o">-></span><span class="n">last_force_resend</span><span class="p">)</span> <span class="o">==</span> <span class="n">RECALC_OP_TARGET_POOL_DNE</span><span class="p">;</span>
<span class="c1">// Try to get a session, including a retry if we need to take write lock
</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_get_session</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">osd</span><span class="p">,</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span> <span class="c1">// 获取session, session中包含connection与OSD的连接信息
</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">s</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">lc</span><span class="p">.</span><span class="n">promote</span><span class="p">();</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_get_session</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">osd</span><span class="p">,</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="n">lc</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">assert</span><span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">s</span><span class="p">);</span> <span class="c1">// may be homeless
</span>
<span class="c1">// We may need to take wlock if we will need to _set_op_map_check later.
</span> <span class="k">if</span> <span class="p">(</span><span class="n">check_for_latest_map</span> <span class="o">&&</span> <span class="o">!</span><span class="n">lc</span><span class="p">.</span><span class="n">is_wlocked</span><span class="p">())</span> <span class="p">{</span>
<span class="n">lc</span><span class="p">.</span><span class="n">promote</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">_send_op_account</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// Update the op statistics
</span>
<span class="c1">// send?
</span> <span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"_op_submit oid "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">base_oid</span>
<span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">base_oloc</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">target_oloc</span>
<span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">ops</span> <span class="o"><<</span> <span class="s">" tid "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">tid</span>
<span class="o"><<</span> <span class="s">" osd."</span> <span class="o"><<</span> <span class="p">(</span><span class="o">!</span><span class="n">s</span><span class="o">-></span><span class="n">is_homeless</span><span class="p">()</span> <span class="o">?</span> <span class="n">s</span><span class="o">-></span><span class="n">osd</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="p">(</span><span class="n">CEPH_OSD_FLAG_READ</span><span class="o">|</span><span class="n">CEPH_OSD_FLAG_WRITE</span><span class="p">));</span>
<span class="kt">bool</span> <span class="n">need_send</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="c1">// Decide whether the op can be sent now
</span> <span class="k">if</span> <span class="p">((</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="n">CEPH_OSD_FLAG_WRITE</span><span class="p">)</span> <span class="o">&&</span>
<span class="n">osdmap</span><span class="o">-></span><span class="n">test_flag</span><span class="p">(</span><span class="n">CEPH_OSDMAP_PAUSEWR</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">" paused modify "</span> <span class="o"><<</span> <span class="n">op</span> <span class="o"><<</span> <span class="s">" tid "</span> <span class="o"><<</span> <span class="n">last_tid</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">paused</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">_maybe_request_map</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="n">CEPH_OSD_FLAG_READ</span><span class="p">)</span> <span class="o">&&</span>
<span class="n">osdmap</span><span class="o">-></span><span class="n">test_flag</span><span class="p">(</span><span class="n">CEPH_OSDMAP_PAUSERD</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o"><<</span> <span class="s">" paused read "</span> <span class="o"><<</span> <span class="n">op</span> <span class="o"><<</span> <span class="s">" tid "</span> <span class="o"><<</span> <span class="n">last_tid</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">paused</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">_maybe_request_map</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="n">CEPH_OSD_FLAG_WRITE</span><span class="p">)</span> <span class="o">&&</span> <span class="n">_osdmap_full_flag</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o"><<</span> <span class="s">" FULL, paused modify "</span> <span class="o"><<</span> <span class="n">op</span> <span class="o"><<</span> <span class="s">" tid "</span> <span class="o"><<</span> <span class="n">last_tid</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="n">op</span><span class="o">-></span><span class="n">target</span><span class="p">.</span><span class="n">paused</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">_maybe_request_map</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">s</span><span class="o">-></span><span class="n">is_homeless</span><span class="p">())</span> <span class="p">{</span>
<span class="n">need_send</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">_maybe_request_map</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">MOSDOp</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">need_send</span><span class="p">)</span> <span class="p">{</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">_prepare_osd_op</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// If the op is to be sent, build a message that carries it
</span> <span class="p">}</span>
<span class="n">s</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">get_write</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">tid</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">op</span><span class="o">-></span><span class="n">tid</span> <span class="o">=</span> <span class="n">last_tid</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span> <span class="c1">// Allocate a unique tid
</span> <span class="n">_session_op_assign</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">op</span><span class="p">);</span> <span class="c1">// Record the op in the session. The firefly release kept it in a map on the objecter itself; tying it to a specific session is clearly the better fit
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">need_send</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_send_op</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span> <span class="c1">// Send the message
</span> <span class="p">}</span>
<span class="c1">// Last chance to touch Op here, after giving up session lock it can be
</span> <span class="c1">// freed at any time by response handler.
</span> <span class="n">ceph_tid_t</span> <span class="n">tid</span> <span class="o">=</span> <span class="n">op</span><span class="o">-></span><span class="n">tid</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">check_for_latest_map</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_send_op_map_check</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="c1">// The map must be refreshed: ask the monitor for the osdmap. The op stays in the session and is resent later when schedule_tick or handle_osd_map calls scan_request
</span> <span class="p">}</span>
<span class="n">op</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">s</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">put_session</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="o"><<</span> <span class="n">num_unacked</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o"><<</span> <span class="s">" unacked, "</span> <span class="o"><<</span> <span class="n">num_uncommitted</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="o"><<</span> <span class="s">" uncommitted"</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">tid</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
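<p>The chain of checks that sets <code class="highlighter-rouge">need_send</code> in the listing above can be reduced to a small pure function. This is only a sketch under stated assumptions, not Ceph code: <code class="highlighter-rouge">OP_READ</code>, <code class="highlighter-rouge">OP_WRITE</code>, <code class="highlighter-rouge">MapState</code>, <code class="highlighter-rouge">Decision</code> and <code class="highlighter-rouge">decide</code> are hypothetical names, and the flag values are made up for illustration.</p>

```cpp
#include <cassert>

// Hypothetical flag values; the real CEPH_OSD_FLAG_* constants differ.
enum : unsigned { OP_READ = 1u << 0, OP_WRITE = 1u << 1 };

// Hypothetical summary of the osdmap flags that the listing tests.
struct MapState {
  bool pause_rd;  // stands in for CEPH_OSDMAP_PAUSERD
  bool pause_wr;  // stands in for CEPH_OSDMAP_PAUSEWR
  bool full;      // stands in for _osdmap_full_flag()
};

enum class Decision {
  Send,        // session has a real OSD: build the message and send it
  Paused,      // op is parked and a newer map is requested
  WaitForMap   // homeless session: nothing to send to yet
};

// Mirrors the order of the if/else-if chain: write pause, read pause,
// full cluster, then the homeless check.
Decision decide(unsigned op_flags, const MapState& m, bool homeless) {
  assert(op_flags & (OP_READ | OP_WRITE));
  if ((op_flags & OP_WRITE) && m.pause_wr) return Decision::Paused;
  if ((op_flags & OP_READ) && m.pause_rd) return Decision::Paused;
  if ((op_flags & OP_WRITE) && m.full) return Decision::Paused;
  if (!homeless) return Decision::Send;
  return Decision::WaitForMap;
}
```

<p>In the listing, every outcome other than sending also calls <code class="highlighter-rouge">_maybe_request_map()</code>; the sketch only classifies the op and leaves that side effect to the caller.</p>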
<p>The op is recorded in the session's map so that it can be found again when the reply arrives:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Objecter</span><span class="o">::</span><span class="n">_session_op_assign</span><span class="p">(</span><span class="n">OSDSession</span> <span class="o">*</span><span class="n">to</span><span class="p">,</span> <span class="n">Op</span> <span class="o">*</span><span class="n">op</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">to</span><span class="o">-></span><span class="n">lock</span><span class="p">.</span><span class="n">is_locked</span><span class="p">());</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">session</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">op</span><span class="o">-></span><span class="n">tid</span><span class="p">);</span>
<span class="n">get_session</span><span class="p">(</span><span class="n">to</span><span class="p">);</span>
<span class="n">op</span><span class="o">-></span><span class="n">session</span> <span class="o">=</span> <span class="n">to</span><span class="p">;</span>
<span class="n">to</span><span class="o">-></span><span class="n">ops</span><span class="p">[</span><span class="n">op</span><span class="o">-></span><span class="n">tid</span><span class="p">]</span> <span class="o">=</span> <span class="n">op</span><span class="p">;</span> <span class="c1">// Record the op. When the OSD finishes, its reply is handled in handle_osd_op_reply, and this tid is the key used to find the matching op
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">to</span><span class="o">-></span><span class="n">is_homeless</span><span class="p">())</span> <span class="p">{</span>
<span class="n">num_homeless_ops</span><span class="p">.</span><span class="n">inc</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">ldout</span><span class="p">(</span><span class="n">cct</span><span class="p">,</span> <span class="mi">15</span><span class="p">)</span> <span class="o"><<</span> <span class="n">__func__</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">to</span><span class="o">-></span><span class="n">osd</span> <span class="o"><<</span> <span class="s">" "</span> <span class="o"><<</span> <span class="n">op</span><span class="o">-></span><span class="n">tid</span> <span class="o"><<</span> <span class="n">dendl</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
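<p>The tid bookkeeping can also be shown in isolation: record the op under its tid when it is submitted, then use the tid carried by the reply to find it and complete it. This is a minimal sketch, not Ceph code. <code class="highlighter-rouge">MiniOp</code>, <code class="highlighter-rouge">MiniSession</code>, <code class="highlighter-rouge">assign_op</code> and <code class="highlighter-rouge">handle_reply</code> are hypothetical names, and the session lock is left out.</p>

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for an in-flight op: only what the sketch needs.
struct MiniOp {
  uint64_t tid = 0;
  std::function<void(int)> on_reply;  // called with the result code
};

// Hypothetical stand-in for a session: in-flight ops keyed by tid.
struct MiniSession {
  std::map<uint64_t, std::unique_ptr<MiniOp>> ops;
};

// Give the op a fresh tid if it has none, then record it in the session.
uint64_t assign_op(MiniSession& s, uint64_t& last_tid,
                   std::unique_ptr<MiniOp> op) {
  if (op->tid == 0)
    op->tid = ++last_tid;
  const uint64_t tid = op->tid;  // read before the op is moved away
  s.ops[tid] = std::move(op);
  return tid;
}

// Find the op by the tid in the reply, remove it, and run its callback.
// Returns false when the tid is unknown (for example a duplicate reply).
bool handle_reply(MiniSession& s, uint64_t tid, int result) {
  auto it = s.ops.find(tid);
  if (it == s.ops.end())
    return false;
  std::unique_ptr<MiniOp> op = std::move(it->second);
  s.ops.erase(it);
  if (op->on_reply)
    op->on_reply(result);
  return true;
}
```

<p>Keeping the map per session means that when a session is closed or retargeted, its outstanding ops can be found without scanning every op the client has in flight.</p>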
Ceph Librbd Introduction
2015-10-25T00:00:00+00:00
http://blog.wjin.org/posts/ceph-librbd-introduction
<h1 id="introduction">Introduction</h1>
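<p>librbd is the block-storage interface that Ceph exposes to clients such as QEMU virtual machines and the rbd engine of the fio benchmark.
It offers both a C and a C++ API. In C++, the two central classes are <code class="highlighter-rouge">RBD</code> and <code class="highlighter-rouge">Image</code>.
RBD creates, removes and clones images, while Image handles reads, writes and other per-image operations.</p>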
<h1 id="headfiles">Header Files</h1>
<p>As a general-purpose C/C++ library, librbd has to ship header files that define its API.
These are <strong>src/include/rbd/librbd.h</strong> (C) and <strong>src/include/rbd/librbd.hpp</strong> (C++).</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">namespace</span> <span class="n">librbd</span> <span class="p">{</span> <span class="c1">// The library lives in the librbd namespace
</span>
<span class="k">using</span> <span class="n">librados</span><span class="o">::</span><span class="n">IoCtx</span><span class="p">;</span> <span class="c1">// Interface exported by the librados library
</span>
<span class="k">class</span> <span class="nc">Image</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">image_ctx_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">completion_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">callback_t</span><span class="p">)(</span><span class="n">completion_t</span> <span class="n">cb</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span> <span class="c1">// Callback type for asynchronous operations
</span>
<span class="p">...</span>
<span class="k">class</span> <span class="nc">CEPH_RBD_API</span> <span class="n">RBD</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">RBD</span><span class="p">();</span>
<span class="o">~</span><span class="n">RBD</span><span class="p">();</span>
<span class="c1">// This must be dynamically allocated with new, and
</span> <span class="c1">// must be released with release().
</span> <span class="c1">// Do not use delete.
</span> <span class="k">struct</span> <span class="n">AioCompletion</span> <span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">pc</span><span class="p">;</span>
<span class="n">AioCompletion</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">cb_arg</span><span class="p">,</span> <span class="n">callback_t</span> <span class="n">complete_cb</span><span class="p">);</span>
<span class="kt">bool</span> <span class="n">is_complete</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">wait_for_complete</span><span class="p">();</span>
<span class="kt">ssize_t</span> <span class="n">get_return_value</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">release</span><span class="p">();</span>
<span class="p">};</span>
<span class="c1">// Followed by further APIs: open/create/clone/remove/rename, etc.
</span>
<span class="k">private</span><span class="o">:</span>
<span class="cm">/* We don't allow assignment or copying */</span>
<span class="n">RBD</span><span class="p">(</span><span class="k">const</span> <span class="n">RBD</span><span class="o">&</span> <span class="n">rhs</span><span class="p">);</span>
<span class="k">const</span> <span class="n">RBD</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">RBD</span><span class="o">&</span> <span class="n">rhs</span><span class="p">);</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">CEPH_RBD_API</span> <span class="n">Image</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Image</span><span class="p">();</span>
<span class="o">~</span><span class="n">Image</span><span class="p">();</span>
<span class="c1">// Image read/write, flatten, trim and similar operations
</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">friend</span> <span class="k">class</span> <span class="nc">RBD</span><span class="p">;</span>
<span class="n">Image</span><span class="p">(</span><span class="k">const</span> <span class="n">Image</span><span class="o">&</span> <span class="n">rhs</span><span class="p">);</span>
<span class="k">const</span> <span class="n">Image</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">Image</span><span class="o">&</span> <span class="n">rhs</span><span class="p">);</span>
<span class="n">image_ctx_t</span> <span class="n">ctx</span><span class="p">;</span> <span class="c1">// void*, actually points to the implementing class
</span><span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
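<p>Almost every operation in Ceph is asynchronous, and an asynchronous call takes a callback. librbd's asynchronous interface is implemented by the <code class="highlighter-rouge">librbd::RBD::AioCompletion</code> class.
A librbd user, such as the driver in QEMU, first allocates a <code class="highlighter-rouge">librbd::RBD::AioCompletion</code> object with new and fills in the callback and its argument.
The constructor of librbd::RBD::AioCompletion creates a <code class="highlighter-rouge">librbd::AioCompletion</code> (note the missing RBD scope) and assigns it to <code class="highlighter-rouge">void *pc</code>.
Every other member function foo of librbd::RBD::AioCompletion simply calls pc->foo. Only the header is exposed, and the implementation details stay hidden.</p>
<p>The Image class works the same way: its only data member is <code class="highlighter-rouge">void *ctx</code>, which points to the real ImageCtx once it has been initialized and hides the details. An operation foo on an image ends up
as an operation on ctx->foo.</p>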
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RBD</span><span class="o">::</span><span class="n">AioCompletion</span><span class="o">::</span><span class="n">AioCompletion</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">cb_arg</span><span class="p">,</span> <span class="n">callback_t</span> <span class="n">complete_cb</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">librbd</span><span class="o">::</span><span class="n">AioCompletion</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">aio_create_completion</span><span class="p">(</span><span class="n">cb_arg</span><span class="p">,</span>
<span class="n">complete_cb</span><span class="p">);</span>
<span class="n">pc</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">c</span><span class="p">;</span>
<span class="n">c</span><span class="o">-></span><span class="n">rbd_comp</span> <span class="o">=</span> <span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
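<p>The same opaque-pointer idea can be shown without any Ceph types. This is a minimal sketch under stated assumptions: <code class="highlighter-rouge">impl::Completion</code>, <code class="highlighter-rouge">PublicCompletion</code> and <code class="highlighter-rouge">complete_request</code> are hypothetical names, and the sketch frees the implementation in a destructor rather than through a <code class="highlighter-rouge">release()</code> call as librbd does.</p>

```cpp
namespace impl {
// Hypothetical implementation class. In a real library this would live in
// a private header that users never include.
struct Completion {
  bool done = false;
  long rval = 0;
  void finish(long r) { rval = r; done = true; }
};
}  // namespace impl

// Public class: the header only needs to mention a void*, so the
// implementation type can change without touching user code.
class PublicCompletion {
 public:
  PublicCompletion() : pc(new impl::Completion) {}
  ~PublicCompletion() { delete get(); }
  PublicCompletion(const PublicCompletion&) = delete;
  PublicCompletion& operator=(const PublicCompletion&) = delete;

  // Each public member just forwards to the hidden object.
  bool is_complete() const { return get()->done; }
  long get_return_value() const { return get()->rval; }

  void* pc;  // opaque handle to the implementation object

 private:
  impl::Completion* get() const {
    return static_cast<impl::Completion*>(pc);
  }
};

// Library-side helper: what the I/O path would call once a request is done.
void complete_request(PublicCompletion& c, long result) {
  static_cast<impl::Completion*>(c.pc)->finish(result);
}
```

<p>A caller only ever sees <code class="highlighter-rouge">PublicCompletion</code>: it creates one, hands it to the library, and later checks <code class="highlighter-rouge">is_complete()</code> and <code class="highlighter-rouge">get_return_value()</code>.</p>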
<h1 id="implementation">Implementation</h1>
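<p>The library is implemented in the <strong>src/librbd/</strong> directory. The source file that matches the headers is <strong>librbd.cc</strong>, which contains only thin wrappers;
the real work is done in <strong>internal.cc</strong>.</p>
<p>The pc pointer mentioned above corresponds to <strong>AioCompletion.h/AioCompletion.cc</strong>,
and the ctx pointer to <strong>ImageCtx.h/ImageCtx.cc</strong>.</p>
<p><code class="highlighter-rouge">How is ctx initialized?</code> The Image constructor takes no such parameter, and there is no public member function such as set_ctx. The only opening is <code class="highlighter-rouge">friend class RBD</code>,
so clearly only the RBD class can initialize the ctx member.</p>
<p>In practice, a librbd user first creates a block device through the RBD class and later opens that device through the RBD class as well.
ctx is initialized during open.</p>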
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">int</span> <span class="n">RBD</span><span class="o">::</span><span class="n">open</span><span class="p">(</span><span class="n">IoCtx</span><span class="o">&</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="n">Image</span><span class="o">&</span> <span class="n">image</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">snap_name</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ImageCtx</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">snap_name</span><span class="p">,</span> <span class="n">io_ctx</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_enter</span><span class="p">,</span> <span class="n">ictx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">id</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">snap_name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">read_only</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">librbd</span><span class="o">::</span><span class="n">open_image</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">image</span><span class="p">.</span><span class="n">ctx</span> <span class="o">=</span> <span class="p">(</span><span class="n">image_ctx_t</span><span class="p">)</span> <span class="n">ictx</span><span class="p">;</span> <span class="c1">// Initialize ctx
</span> <span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">open_image_exit</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Destruction
</span> <span class="n">Image</span><span class="o">::~</span><span class="n">Image</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ctx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span> <span class="o">=</span> <span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="p">)</span><span class="n">ctx</span><span class="p">;</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">close_image_enter</span><span class="p">,</span> <span class="n">ictx</span><span class="p">,</span> <span class="n">ictx</span><span class="o">-></span><span class="n">name</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">ictx</span><span class="o">-></span><span class="n">id</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span>
<span class="n">close_image</span><span class="p">(</span><span class="n">ictx</span><span class="p">);</span>
<span class="n">tracepoint</span><span class="p">(</span><span class="n">librbd</span><span class="p">,</span> <span class="n">close_image_exit</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">close_image</span><span class="p">(</span><span class="n">ImageCtx</span> <span class="o">*</span><span class="n">ictx</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">...</span>
<span class="k">delete</span> <span class="n">ictx</span><span class="p">;</span> <span class="c1">// Free ctx
</span> <span class="p">}</span>
</code></pre></div></div>
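<p>The ownership pattern in the listing above, where a friend class fills in a private handle and the destructor frees it, can be reproduced in a few lines. This is only a sketch: <code class="highlighter-rouge">Ctx</code>, <code class="highlighter-rouge">Handle</code> and <code class="highlighter-rouge">Opener</code> are hypothetical stand-ins for ImageCtx, Image and RBD, the <code class="highlighter-rouge">live</code> counter exists purely to make the lifetime visible, and the error codes are my own choice.</p>

```cpp
#include <cerrno>
#include <string>

// Hypothetical implementation type standing in for ImageCtx.
struct Ctx {
  std::string name;
  static int live;  // number of live contexts, for demonstration only
  explicit Ctx(const std::string& n) : name(n) { ++live; }
  ~Ctx() { --live; }
};
int Ctx::live = 0;

class Opener;  // plays the role of RBD

// Plays the role of Image: no constructor argument and no public setter.
class Handle {
 public:
  Handle() : ctx(nullptr) {}
  ~Handle() { delete static_cast<Ctx*>(ctx); }  // frees the context, if any
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;
  bool is_open() const { return ctx != nullptr; }

 private:
  friend class Opener;  // the only code allowed to set ctx
  void* ctx;
};

class Opener {
 public:
  // Returns 0 on success and a negative errno-style value on failure.
  int open(Handle& h, const char* name) {
    if (name == nullptr || *name == '\0')
      return -EINVAL;
    if (h.ctx != nullptr)
      return -EBUSY;  // already open; the sketch refuses to reopen
    h.ctx = new Ctx(name);
    return 0;
  }
};
```

<p>Because <code class="highlighter-rouge">Handle</code> owns the context after a successful <code class="highlighter-rouge">open</code>, letting the handle go out of scope is enough to release it.</p>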
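<p>The actual I/O requests live in <strong>AioRequest.h/AioRequest.cc</strong> and are eventually sent out through the <code class="highlighter-rouge">send</code> function; the details are left for a later walk through the read and write paths.</p>
<p>For time-consuming management operations such as flatten, trim and resize, the hammer release adds asynchronous implementations, in files named <strong>Async…</strong>.</p>
<p>In addition, asynchronous operations are now submitted through a work queue that dedicated threads drain, which keeps the caller from blocking under certain conditions.</p>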
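<p>The idea of handing requests to a queue and letting a dedicated thread submit them can be sketched with the standard library alone. This is a hedged illustration, not librbd's work queue: <code class="highlighter-rouge">SubmitQueue</code> and <code class="highlighter-rouge">submit</code> are hypothetical names, and a single worker thread is an assumption made to keep the sketch short.</p>

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Callers enqueue a request and return at once; one worker thread runs
// the requests in FIFO order.
class SubmitQueue {
 public:
  SubmitQueue() : stop_(false), worker_([this] { run(); }) {}

  // Requests still queued at destruction are run before the worker exits.
  ~SubmitQueue() {
    {
      std::lock_guard<std::mutex> l(mtx_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Never blocks on the request itself, only briefly on the queue lock.
  void submit(std::function<void()> req) {
    {
      std::lock_guard<std::mutex> l(mtx_);
      q_.push(std::move(req));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> l(mtx_);
    for (;;) {
      cv_.wait(l, [this] { return stop_ || !q_.empty(); });
      if (q_.empty())
        return;  // only reachable once stop_ is set and the queue is drained
      std::function<void()> req = std::move(q_.front());
      q_.pop();
      l.unlock();
      req();  // run the request without holding the queue lock
      l.lock();
    }
  }

  std::mutex mtx_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> q_;
  bool stop_;
  std::thread worker_;  // declared last so it starts after the other members
};
```

<p>The request runs outside the lock, so a slow submission delays only the worker and never a thread that is calling <code class="highlighter-rouge">submit</code>.</p>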
<h1 id="python-demo">Python Demo</h1>
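<p>A simple Python example, taken from the <a href="http://docs.ceph.com/docs/v0.80.5/rbd/librbdpy/">official Ceph documentation</a>.
It involves four objects: Rados and IoCtx at the librados layer, and RBD and Image at the librbd layer. The C++ code is similar; see the implementation of the rbd command in <strong>src/rbd.cc</strong>.</p>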
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c"># Create a Rados instance; the argument is the cluster configuration file</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="n">rados</span><span class="o">.</span><span class="n">Rados</span><span class="p">(</span><span class="n">conffile</span><span class="o">=</span><span class="s">'my_ceph.conf'</span><span class="p">)</span>
<span class="c"># Connect to the RADOS cluster</span>
<span class="n">cluster</span><span class="o">.</span><span class="n">connect</span><span class="p">()</span>
<span class="c"># Get an ioctx handle for operating on a pool</span>
<span class="n">ioctx</span> <span class="o">=</span> <span class="n">cluster</span><span class="o">.</span><span class="n">open_ioctx</span><span class="p">(</span><span class="s">'mypool'</span><span class="p">)</span>
<span class="c"># Create an RBD object</span>
<span class="n">rbd_inst</span> <span class="o">=</span> <span class="n">rbd</span><span class="o">.</span><span class="n">RBD</span><span class="p">()</span>
<span class="c"># Create a block device; it is tied to the ioctx, which determines the pool it is created in</span>
<span class="n">size</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span> <span class="c"># 4 GiB</span>
<span class="n">rbd_inst</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">ioctx</span><span class="p">,</span> <span class="s">'myimage'</span><span class="p">,</span> <span class="n">size</span><span class="p">)</span>
<span class="c"># Create an Image instance for reading and writing</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">rbd</span><span class="o">.</span><span class="n">Image</span><span class="p">(</span><span class="n">ioctx</span><span class="p">,</span> <span class="s">'myimage'</span><span class="p">)</span>
<span class="c"># Write data</span>
<span class="n">data</span> <span class="o">=</span> <span class="s">'foo'</span> <span class="o">*</span> <span class="mi">200</span>
<span class="n">image</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># Close</span>
<span class="n">image</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">ioctx</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">cluster</span><span class="o">.</span><span class="n">shutdown</span><span class="p">()</span>
</code></pre></div></div>
Mutex of pthread_cond_wait
2014-10-17T00:00:00+00:00
http://blog.wjin.org/posts/mutex-of-pthread_cond_wait
<h1 id="introduction">Introduction</h1>
<p>Question: <code class="highlighter-rouge">Why does pthread_cond_wait need a lock?</code></p>
<p>When we call <code class="highlighter-rouge">pthread_cond_wait</code>, we must already hold a locked mutex.</p>
<p>Like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">gx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_mutex_t</span> <span class="n">mtx</span> <span class="o">=</span> <span class="n">PTHREAD_MUTEX_INITIALIZER</span><span class="p">;</span>
<span class="n">pthread_cond_t</span> <span class="n">cond</span> <span class="o">=</span> <span class="n">PTHREAD_COND_INITIALIZER</span><span class="p">;</span>
<span class="c1">// thread A
</span><span class="p">...</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">gx</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">pthread_cond_wait</span><span class="p">(</span><span class="o">&</span><span class="n">cond</span><span class="p">,</span> <span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
</code></pre></div></div>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// thread B
</span><span class="p">...</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="n">gx</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="n">pthread_cond_signal</span><span class="p">(</span><span class="o">&</span><span class="n">cond</span><span class="p">);</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Why do we need such a lock? Can we eliminate it? The answer is <code class="highlighter-rouge">NO</code>: the mutex is what
allows the pthread library to hide a potential lost-signal problem from you.</p>
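<p>The two fragments above can be assembled into a complete program. The following is only a minimal sketch: the function names <code class="highlighter-rouge">waiter</code>, <code class="highlighter-rouge">setter</code> and <code class="highlighter-rouge">run_wait_signal_demo</code> are made up for illustration, and error checking is omitted.</p>

```cpp
#include <pthread.h>

// Shared state, same names as in the snippet above.
static int gx = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

// Thread A: sleeps until gx becomes non-zero.
static void* waiter(void* out) {
    pthread_mutex_lock(&mtx);
    while (gx == 0) {                    // re-check the predicate after every wakeup
        pthread_cond_wait(&cond, &mtx);  // releases mtx while blocked, re-acquires it before returning
    }
    *static_cast<int*>(out) = gx;        // read the predicate while still holding the lock
    pthread_mutex_unlock(&mtx);
    return nullptr;
}

// Thread B: changes the predicate under the lock, then signals.
static void* setter(void*) {
    pthread_mutex_lock(&mtx);
    gx = 1;
    pthread_mutex_unlock(&mtx);
    pthread_cond_signal(&cond);
    return nullptr;
}

// Hypothetical driver: runs both threads and returns the value thread A observed.
int run_wait_signal_demo() {
    int seen = 0;
    pthread_t a, b;
    pthread_create(&a, nullptr, waiter, &seen);
    pthread_create(&b, nullptr, setter, nullptr);
    pthread_join(a, nullptr);
    pthread_join(b, nullptr);
    return seen;
}
```

<p>Whichever thread is scheduled first, thread A leaves the loop only after it has seen <code class="highlighter-rouge">gx == 1</code>. If thread B runs first, thread A never calls <code class="highlighter-rouge">pthread_cond_wait</code> at all.</p>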
<h1 id="analysis">Analysis</h1>
<p>According to the man page:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">pthread_cond_timedwait</span><span class="p">(</span><span class="n">pthread_cond_t</span> <span class="o">*</span><span class="k">restrict</span> <span class="n">cond</span><span class="p">,</span>
<span class="n">pthread_mutex_t</span> <span class="o">*</span><span class="k">restrict</span> <span class="n">mutex</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="n">timespec</span> <span class="o">*</span><span class="k">restrict</span> <span class="n">abstime</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">pthread_cond_wait</span><span class="p">(</span><span class="n">pthread_cond_t</span> <span class="o">*</span><span class="k">restrict</span> <span class="n">cond</span><span class="p">,</span>
<span class="n">pthread_mutex_t</span> <span class="o">*</span><span class="k">restrict</span> <span class="n">mutex</span><span class="p">);</span>
<span class="n">These</span> <span class="n">functions</span> <span class="n">atomically</span> <span class="n">release</span> <span class="n">mutex</span> <span class="n">and</span> <span class="n">cause</span> <span class="n">the</span> <span class="n">calling</span> <span class="kr">thread</span> <span class="n">to</span> <span class="n">block</span>
<span class="n">on</span> <span class="n">the</span> <span class="n">condition</span> <span class="n">variable</span> <span class="n">cond</span><span class="p">;</span> <span class="n">atomically</span> <span class="n">here</span> <span class="n">means</span> <span class="s">"atomically with respect</span><span class="err">
</span><span class="s">to access by another thread to the mutex and then the condition variable"</span><span class="p">.</span> <span class="n">That</span> <span class="n">is</span><span class="p">,</span>
<span class="k">if</span> <span class="n">another</span> <span class="kr">thread</span> <span class="n">is</span> <span class="n">able</span> <span class="n">to</span> <span class="n">acquire</span> <span class="n">the</span> <span class="n">mutex</span> <span class="n">after</span> <span class="n">the</span> <span class="n">about</span><span class="o">-</span><span class="n">to</span><span class="o">-</span><span class="n">block</span> <span class="kr">thread</span>
<span class="n">has</span> <span class="n">released</span> <span class="n">it</span><span class="p">,</span> <span class="n">then</span> <span class="n">a</span> <span class="n">subsequent</span> <span class="n">call</span> <span class="n">to</span> <span class="n">pthread_cond_broadcast</span><span class="p">()</span> <span class="n">or</span> <span class="n">pthread_cond_signal</span><span class="p">()</span>
<span class="n">in</span> <span class="n">that</span> <span class="kr">thread</span> <span class="n">shall</span> <span class="n">behave</span> <span class="n">as</span> <span class="k">if</span> <span class="n">it</span> <span class="n">were</span> <span class="n">issued</span> <span class="n">after</span> <span class="n">the</span> <span class="n">about</span><span class="o">-</span><span class="n">to</span><span class="o">-</span><span class="n">block</span> <span class="kr">thread</span> <span class="n">has</span> <span class="n">blocked</span><span class="p">.</span>
</code></pre></div></div>
<p>The key word is <code class="highlighter-rouge">atomically</code>: no other thread may slip in between <code class="highlighter-rouge">unlock</code> and <code class="highlighter-rouge">wait</code>.
If this property holds, a signal sent by a thread that acquired the mutex after the waiter released it is guaranteed to <code class="highlighter-rouge">behave as if it were issued
after the about-to-block thread has blocked.</code> Otherwise, the signal might be <code class="highlighter-rouge">lost</code>.</p>
<p>If <code class="highlighter-rouge">pthread_cond_wait</code> did not take a mutex, we would have to write code like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// thread A
</span><span class="p">...</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">gx</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span> <span class="c1">// step1
</span> <span class="n">pthread_cond_wait</span><span class="p">(</span><span class="o">&</span><span class="n">cond</span><span class="p">);</span> <span class="c1">// step2
</span> <span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
</code></pre></div></div>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// thread B
</span><span class="p">...</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="n">gx</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">mtx</span><span class="p">);</span>
<span class="n">pthread_cond_signal</span><span class="p">(</span><span class="o">&</span><span class="n">cond</span><span class="p">);</span> <span class="c1">// step3
</span><span class="p">...</span>
</code></pre></div></div>
<p>However, the snippet above has a problem: step1 and step2 are no longer <code class="highlighter-rouge">atomic</code>.
This can lead to a <code class="highlighter-rouge">lost signal</code>.</p>
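<p>For example, suppose thread A has just finished step1 when control switches to thread B, and thread A does not
run again until thread B has executed step3. At that moment no thread is waiting on the condition variable, so the signal wakes nobody. When thread A resumes and executes step2, the signal is already gone and thread A blocks, possibly forever.</p>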
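<p>The reason the signal disappears is that a condition variable does not remember signals sent while nobody is waiting. The sketch below replays the bad ordering deterministically in a single thread instead of relying on a real race: it signals first and waits afterwards, with a timeout so that it cannot hang. The helper name <code class="highlighter-rouge">wait_after_signal</code> is hypothetical and error checking is omitted.</p>

```cpp
#include <errno.h>
#include <pthread.h>
#include <time.h>

// Hypothetical helper: signal first, wait second, and report what the wait returned.
int wait_after_signal(long timeout_ms) {
    pthread_mutex_t m;
    pthread_cond_t c;
    pthread_mutex_init(&m, nullptr);
    pthread_cond_init(&c, nullptr);

    // "step3" runs first: nobody is blocked on c yet, so this wakes no one
    // and leaves nothing behind for a later waiter.
    pthread_cond_signal(&c);

    // Absolute deadline on CLOCK_REALTIME, the default clock of a condition variable.
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    // "step2" runs afterwards: barring a spurious wakeup, the wait ends only by timing out.
    pthread_mutex_lock(&m);
    int rc = pthread_cond_timedwait(&c, &m, &deadline);
    pthread_mutex_unlock(&m);

    pthread_cond_destroy(&c);
    pthread_mutex_destroy(&m);
    return rc;
}
```

<p>Calling <code class="highlighter-rouge">wait_after_signal(50)</code> is expected to return <code class="highlighter-rouge">ETIMEDOUT</code>: the earlier signal was not stored anywhere, which is exactly what happens to thread A in the interleaving described above.</p>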
<p>So how does the C library avoid it?</p>
<h1 id="library-implementaion">Library Implementation</h1>
<p>As I am familiar with Bionic, the C library in Android, I will quote its code directly.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">pthread_cond_wait</span><span class="p">(</span><span class="n">pthread_cond_t</span><span class="o">*</span> <span class="n">cond</span><span class="p">,</span> <span class="n">pthread_mutex_t</span><span class="o">*</span> <span class="n">mutex</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">__pthread_cond_timedwait</span><span class="p">(</span><span class="n">cond</span><span class="p">,</span> <span class="n">mutex</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">COND_GET_CLOCK</span><span class="p">(</span><span class="n">cond</span><span class="o">-></span><span class="n">value</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">pthread_cond_timedwait</span><span class="p">(</span><span class="n">pthread_cond_t</span> <span class="o">*</span><span class="n">cond</span><span class="p">,</span> <span class="n">pthread_mutex_t</span> <span class="o">*</span> <span class="n">mutex</span><span class="p">,</span> <span class="k">const</span> <span class="n">timespec</span> <span class="o">*</span><span class="n">abstime</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">__pthread_cond_timedwait</span><span class="p">(</span><span class="n">cond</span><span class="p">,</span> <span class="n">mutex</span><span class="p">,</span> <span class="n">abstime</span><span class="p">,</span> <span class="n">COND_GET_CLOCK</span><span class="p">(</span><span class="n">cond</span><span class="o">-></span><span class="n">value</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">__pthread_cond_timedwait</span><span class="p">(</span><span class="n">pthread_cond_t</span><span class="o">*</span> <span class="n">cond</span><span class="p">,</span> <span class="n">pthread_mutex_t</span><span class="o">*</span> <span class="n">mutex</span><span class="p">,</span> <span class="k">const</span> <span class="n">timespec</span><span class="o">*</span> <span class="n">abstime</span><span class="p">,</span> <span class="n">clockid_t</span> <span class="n">clock</span><span class="p">)</span> <span class="p">{</span>
<span class="n">timespec</span> <span class="n">ts</span><span class="p">;</span>
<span class="n">timespec</span><span class="o">*</span> <span class="n">tsp</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">abstime</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">__timespec_from_absolute</span><span class="p">(</span><span class="o">&</span><span class="n">ts</span><span class="p">,</span> <span class="n">abstime</span><span class="p">,</span> <span class="n">clock</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">ETIMEDOUT</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">tsp</span> <span class="o">=</span> <span class="o">&</span><span class="n">ts</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">tsp</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">__pthread_cond_timedwait_relative</span><span class="p">(</span><span class="n">cond</span><span class="p">,</span> <span class="n">mutex</span><span class="p">,</span> <span class="n">tsp</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">__pthread_cond_timedwait_relative</span><span class="p">(</span><span class="n">pthread_cond_t</span><span class="o">*</span> <span class="n">cond</span><span class="p">,</span> <span class="n">pthread_mutex_t</span><span class="o">*</span> <span class="n">mutex</span><span class="p">,</span> <span class="k">const</span> <span class="n">timespec</span><span class="o">*</span> <span class="n">reltime</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">old_value</span> <span class="o">=</span> <span class="n">cond</span><span class="o">-></span><span class="n">value</span><span class="p">;</span> <span class="c1">// ***this is the key point***
</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="c1">// call system call with ***old value***
</span> <span class="kt">int</span> <span class="n">status</span> <span class="o">=</span> <span class="n">__futex_wait_ex</span><span class="p">(</span><span class="o">&</span><span class="n">cond</span><span class="o">-></span><span class="n">value</span><span class="p">,</span> <span class="n">COND_IS_SHARED</span><span class="p">(</span><span class="n">cond</span><span class="o">-></span><span class="n">value</span><span class="p">),</span> <span class="n">old_value</span><span class="p">,</span> <span class="n">reltime</span><span class="p">);</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">==</span> <span class="o">-</span><span class="n">ETIMEDOUT</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">ETIMEDOUT</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">__futex_wake_ex</span><span class="p">(</span><span class="k">volatile</span> <span class="kt">void</span><span class="o">*</span> <span class="n">ftx</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">shared</span><span class="p">,</span> <span class="kt">int</span> <span class="n">count</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">__futex</span><span class="p">(</span><span class="n">ftx</span><span class="p">,</span> <span class="n">shared</span> <span class="o">?</span> <span class="n">FUTEX_WAKE</span> <span class="o">:</span> <span class="n">FUTEX_WAKE_PRIVATE</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="n">__always_inline</span> <span class="kt">int</span> <span class="nf">__futex</span><span class="p">(</span><span class="k">volatile</span> <span class="kt">void</span><span class="o">*</span> <span class="n">ftx</span><span class="p">,</span> <span class="kt">int</span> <span class="n">op</span><span class="p">,</span> <span class="kt">int</span> <span class="n">value</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">timespec</span><span class="o">*</span> <span class="n">timeout</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Our generated syscall assembler sets errno, but our callers (pthread functions) don't want to.
</span> <span class="kt">int</span> <span class="n">saved_errno</span> <span class="o">=</span> <span class="n">errno</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">result</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">__NR_futex</span><span class="p">,</span> <span class="n">ftx</span><span class="p">,</span> <span class="n">op</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">timeout</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">__predict_false</span><span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="n">result</span> <span class="o">=</span> <span class="o">-</span><span class="n">errno</span><span class="p">;</span>
<span class="n">errno</span> <span class="o">=</span> <span class="n">saved_errno</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
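<p>The key point of the implementation is this: <code class="highlighter-rouge">before unlocking, it saves a copy of the condition variable's current value,
and then passes that old value to the futex system call.</code> In Bionic, signalling updates the counter stored in <code class="highlighter-rouge">cond->value</code>. So even if a context switch
happens right after the unlock and another thread signals in the gap, the futex call sees that the value no longer matches the saved copy and returns immediately instead of sleeping. The waiter therefore still learns that a signal has already happened.</p>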
<p>Here is the man page's explanation of the futex system call:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">futex</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">uaddr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">op</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">timespec</span> <span class="o">*</span><span class="n">timeout</span><span class="p">,</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">uaddr2</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val3</span><span class="p">);</span>
<span class="n">FUTEX_WAIT</span>
<span class="n">This</span> <span class="n">operation</span> <span class="n">atomically</span> <span class="n">verifies</span> <span class="n">that</span> <span class="n">the</span> <span class="n">futex</span> <span class="n">address</span> <span class="n">uaddr</span> <span class="n">still</span> <span class="n">contains</span> <span class="n">the</span> <span class="n">value</span> <span class="n">val</span><span class="p">,</span> <span class="n">and</span> <span class="n">sleeps</span> <span class="n">awaiting</span> <span class="n">FUTEX_WAKE</span>
<span class="n">on</span> <span class="k">this</span> <span class="n">futex</span> <span class="n">address</span><span class="p">.</span> <span class="n">If</span> <span class="n">the</span> <span class="n">timeout</span> <span class="n">argument</span> <span class="n">is</span> <span class="n">non</span><span class="o">-</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">its</span> <span class="n">contents</span> <span class="n">describe</span> <span class="n">the</span> <span class="n">maximum</span> <span class="n">duration</span> <span class="n">of</span> <span class="n">the</span> <span class="n">wait</span><span class="p">,</span> <span class="n">which</span>
<span class="n">is</span> <span class="n">infinite</span> <span class="n">otherwise</span><span class="p">.</span> <span class="n">The</span> <span class="n">arguments</span> <span class="n">uaddr2</span> <span class="n">and</span> <span class="n">val3</span> <span class="n">are</span> <span class="n">ignored</span><span class="p">.</span>
</code></pre></div></div>
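<p>The compare-then-sleep behaviour of <code class="highlighter-rouge">FUTEX_WAIT</code> can be illustrated with a toy, single-threaded model. This is not the real futex API and not Bionic's code: <code class="highlighter-rouge">ToyCond</code>, <code class="highlighter-rouge">toy_futex_wait</code>, <code class="highlighter-rouge">toy_signal</code> and <code class="highlighter-rouge">replay</code> are invented names, and incrementing the value on signal is an assumption of the model (all that matters is that a signal changes it).</p>

```cpp
// Outcome of a modelled wait call.
enum class WaitResult { WouldSleep, ReturnedImmediately };

// Toy stand-in for the condition variable's futex word.
struct ToyCond {
    int value = 0;
};

// Models FUTEX_WAIT: the caller sleeps only if *uaddr still holds `expected`.
WaitResult toy_futex_wait(const int* uaddr, int expected) {
    return (*uaddr == expected) ? WaitResult::WouldSleep
                                : WaitResult::ReturnedImmediately;
}

// Models a signal: change the value so that a waiter holding a stale snapshot is detected.
// (A real implementation would also wake threads that are already asleep.)
void toy_signal(ToyCond& c) {
    ++c.value;
}

// Replays the waiter's steps, optionally letting a signal land in the
// window between "unlock" and "wait".
WaitResult replay(bool signal_in_gap) {
    ToyCond c;
    int old_value = c.value;  // snapshot taken while the mutex is still held
    // ... the mutex would be released here ...
    if (signal_in_gap) {
        toy_signal(c);        // another thread signals inside the window
    }
    return toy_futex_wait(&c.value, old_value);
}
```

<p>In this model <code class="highlighter-rouge">replay(true)</code> yields <code class="highlighter-rouge">ReturnedImmediately</code>, because the snapshot no longer matches, while <code class="highlighter-rouge">replay(false)</code> yields <code class="highlighter-rouge">WouldSleep</code>. The snapshot is what closes the window that the hand-written unlock-then-wait version leaves open.</p>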
<h1 id="conclusion">Conclusion</h1>
<p>The library takes the mutex so that it can make <code class="highlighter-rouge">unlock and wait</code> behave as a single atomic step.
<code class="highlighter-rouge">Atomic</code> here does not mean one uninterruptible operation in the usual sense. Instead, the implementation uses a trick: it snapshots the condition variable's value before unlocking and lets futex compare against that snapshot, which avoids the
<code class="highlighter-rouge">lost signal</code> bug. The library handles this carefully, hides the potential problem, and provides an API that is easy to use.</p>
Manacher
2014-10-12T00:00:00+00:00
http://blog.wjin.org/posts/manacher
<h1 id="introduction">Introduction</h1>
<p><strong>Question: Longest Palindromic Substring</strong></p>
<p>Given a string S, find the longest palindromic substring in S.</p>
<p>You may assume that the maximum length of S is 1000, and there exists one unique longest palindromic substring.</p>
<h1 id="analysis">Analysis</h1>
<p>A DP solution with O(n^2) time complexity (and O(n^2) space for the table) is straightforward.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// dp
// time complexity: O(n^2)
</span><span class="k">class</span> <span class="nc">Solution</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">string</span> <span class="n">longestPalindrome</span><span class="p">(</span><span class="n">string</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="s">""</span><span class="p">;</span>
<span class="n">vector</span><span class="o"><</span><span class="n">vector</span><span class="o"><</span><span class="kt">bool</span><span class="o">>></span> <span class="n">dp</span><span class="p">(</span><span class="n">len</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="kt">bool</span><span class="o">></span><span class="p">(</span><span class="n">len</span><span class="p">,</span> <span class="nb">false</span><span class="p">));</span>
<span class="kt">int</span> <span class="n">subStrLen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">start</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="c1">// dp[i][j] is true means s[i]...s[j] is palindrome
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col</span> <span class="o"><</span> <span class="n">len</span><span class="p">;</span> <span class="n">col</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row</span> <span class="o"><=</span> <span class="n">col</span><span class="p">;</span> <span class="n">row</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">row</span> <span class="o">==</span> <span class="n">col</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// single char is palindrome
</span> <span class="n">dp</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">col</span> <span class="o">-</span> <span class="n">row</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">></span> <span class="n">subStrLen</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// update the longest candidate
</span> <span class="n">subStrLen</span> <span class="o">=</span> <span class="n">col</span> <span class="o">-</span> <span class="n">row</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">row</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="n">row</span><span class="p">]</span> <span class="o">==</span> <span class="n">s</span><span class="p">[</span><span class="n">col</span><span class="p">])</span> <span class="p">{</span>
<span class="c1">// short circuit evaluation
</span> <span class="c1">// row + 1 == col means only have two letters, be careful
</span> <span class="k">if</span> <span class="p">(</span><span class="n">row</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">==</span> <span class="n">col</span> <span class="o">||</span> <span class="n">dp</span><span class="p">[</span><span class="n">row</span> <span class="o">+</span> <span class="mi">1</span><span class="p">][</span><span class="n">col</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="p">{</span>
<span class="n">dp</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">col</span> <span class="o">-</span> <span class="n">row</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">></span> <span class="n">subStrLen</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// update the longest candidate
</span> <span class="n">subStrLen</span> <span class="o">=</span> <span class="n">col</span> <span class="o">-</span> <span class="n">row</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">row</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">s</span><span class="p">.</span><span class="n">substr</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">subStrLen</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Because a palindrome is symmetric, we can use its <code class="highlighter-rouge">symmetry property</code> to get a solution that needs only O(1) extra space.</p>
<p>Traverse each character in the source string, treat it as a center, and expand to both sides. We need to consider two cases
when expanding: odd length and even length. For example: aba and abba.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// time complexity: O(n^2)
// space complexity: O(1)
</span><span class="k">class</span> <span class="nc">Solution2</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">string</span> <span class="n">expandAroundCenter</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c1</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c2</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">l</span> <span class="o">=</span> <span class="n">c1</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">c2</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">length</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">l</span> <span class="o">>=</span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">r</span> <span class="o"><=</span> <span class="n">n</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">&&</span> <span class="n">s</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">==</span> <span class="n">s</span><span class="p">[</span><span class="n">r</span><span class="p">])</span> <span class="p">{</span>
<span class="n">l</span><span class="o">--</span><span class="p">;</span>
<span class="n">r</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">s</span><span class="p">.</span><span class="n">substr</span><span class="p">(</span><span class="n">l</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">r</span> <span class="o">-</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">string</span> <span class="n">longestPalindrome</span><span class="p">(</span><span class="n">string</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">length</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="s">""</span><span class="p">;</span>
<span class="n">string</span> <span class="n">longest</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">substr</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// a single char itself is a palindrome
</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">string</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">expandAroundCenter</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// odd
</span> <span class="k">if</span> <span class="p">(</span><span class="n">p1</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">></span> <span class="n">longest</span><span class="p">.</span><span class="n">length</span><span class="p">())</span>
<span class="n">longest</span> <span class="o">=</span> <span class="n">p1</span><span class="p">;</span>
<span class="n">string</span> <span class="n">p2</span> <span class="o">=</span> <span class="n">expandAroundCenter</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// even
</span> <span class="k">if</span> <span class="p">(</span><span class="n">p2</span><span class="p">.</span><span class="n">length</span><span class="p">()</span> <span class="o">></span> <span class="n">longest</span><span class="p">.</span><span class="n">length</span><span class="p">())</span>
<span class="n">longest</span> <span class="o">=</span> <span class="n">p2</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">longest</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
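<p>Strictly speaking, the version above still builds a temporary string for every center. A variant that tracks only indices keeps the extra space at O(1) apart from the returned result. The following is a sketch of that idea, not the original solution. The names <code class="highlighter-rouge">expandAroundCenterIdx</code> and <code class="highlighter-rouge">longestPalindromeIdx</code> are made up for illustration.</p>

```cpp
#include <string>
#include <utility>

// Expands around the center (c1, c2) and returns {start, length} of the
// palindrome found. Use c1 == c2 for odd length, c2 == c1 + 1 for even length.
std::pair<int, int> expandAroundCenterIdx(const std::string& s, int c1, int c2) {
    int l = c1, r = c2;
    int n = static_cast<int>(s.length());
    while (l >= 0 && r < n && s[l] == s[r]) {
        l--;
        r++;
    }
    // The loop overshoots by one position on each side.
    return {l + 1, r - l - 1};
}

std::string longestPalindromeIdx(const std::string& s) {
    int n = static_cast<int>(s.length());
    if (n == 0) return "";

    int bestStart = 0, bestLen = 1;  // a single char itself is a palindrome

    for (int i = 0; i < n - 1; i++) {
        std::pair<int, int> odd = expandAroundCenterIdx(s, i, i);
        if (odd.second > bestLen) {
            bestStart = odd.first;
            bestLen = odd.second;
        }
        std::pair<int, int> even = expandAroundCenterIdx(s, i, i + 1);
        if (even.second > bestLen) {
            bestStart = even.first;
            bestLen = even.second;
        }
    }
    // Only one substring is materialized, at the very end.
    return s.substr(bestStart, bestLen);
}
```

<p>For instance, <code class="highlighter-rouge">longestPalindromeIdx("cbbd")</code> returns <code class="highlighter-rouge">"bb"</code>. When several palindromes share the maximum length, this sketch keeps the first one it finds.</p>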
<p>The best-known solution is probably <code class="highlighter-rouge">Manacher's algorithm</code>. Both its time and space complexities are O(n). You can refer to this
<a href="http://leetcode.com/2011/11/longest-palindromic-substring-part-ii.html">explanation</a>.</p>
<p>Of course, many string problems can be solved in O(n) time with a <code class="highlighter-rouge">suffix tree</code> or <code class="highlighter-rouge">suffix array</code>,
and this problem is no exception.</p>
<p>However, Manacher’s algorithm is simpler and easy to program compared with suffix tree or array.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// time complexity: O(n)
// space complexity: O(n)
</span><span class="k">class</span> <span class="nc">Solution3</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">string</span> <span class="n">preProcess</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">string</span> <span class="n">ret</span><span class="p">(</span><span class="s">"^"</span><span class="p">);</span> <span class="c1">// leading with special character
</span>
<span class="c1">// insert delimiter
</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">e</span> <span class="o">:</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">+=</span> <span class="sc">'#'</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">+=</span> <span class="n">e</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ret</span> <span class="o">+=</span> <span class="s">"#$"</span><span class="p">;</span> <span class="c1">// ending with $
</span> <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">string</span> <span class="n">longestPalindrome</span><span class="p">(</span><span class="n">string</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">string</span> <span class="n">t</span><span class="p">(</span><span class="n">preProcess</span><span class="p">(</span><span class="n">s</span><span class="p">));</span>
<span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">p</span><span class="p">(</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="c1">// calculate array p[i]
</span> <span class="c1">// p[i] is the palindrome radius in t centered at i, counting t[i] itself,
</span> <span class="c1">// i.e. how far the palindrome extends to the left or to the right.
</span> <span class="c1">// e.g. p[i] = 2 means t[i-1] t[i] t[i+1] is a palindrome
</span> <span class="c1">// p[i] - 1 is the length of the corresponding palindrome in the original string
</span> <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">mx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">len</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">mx</span> <span class="o">></span> <span class="n">i</span><span class="p">)</span> <span class="o">?</span> <span class="n">min</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">id</span> <span class="o">-</span> <span class="n">i</span><span class="p">],</span> <span class="n">mx</span> <span class="o">-</span> <span class="n">i</span><span class="p">)</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">==</span> <span class="n">t</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]])</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">i</span> <span class="o">></span> <span class="n">mx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">id</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="n">mx</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">maxLenId</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="n">p</span><span class="p">[</span><span class="n">maxLenId</span><span class="p">])</span> <span class="p">{</span>
<span class="n">maxLenId</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// be careful with substr start point
</span> <span class="k">return</span> <span class="n">s</span><span class="p">.</span><span class="n">substr</span><span class="p">((</span><span class="n">maxLenId</span> <span class="o">-</span> <span class="n">p</span><span class="p">[</span><span class="n">maxLenId</span><span class="p">]</span> <span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="p">[</span><span class="n">maxLenId</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
TCP/IP Review
2014-10-02T00:00:00+00:00
http://blog.wjin.org/posts/tcpip-review
<h1 id="network">Network</h1>
<h2 id="terms">Terms</h2>
<p><strong>OSI</strong>: Open Systems Interconnection</p>
<ul>
<li>
<p>Application layer</p>
</li>
<li>
<p>Presentation Layer</p>
</li>
<li>
<p>Session Layer</p>
</li>
<li>
<p>Transport layer</p>
</li>
<li>
<p>Network layer</p>
</li>
<li>
<p>Data Link layer</p>
</li>
<li>
<p>Physical layer</p>
</li>
</ul>
<p><strong>TCP/IP</strong>: Transmission Control Protocol and Internet Protocol</p>
<ul>
<li>
<p>Application layer</p>
</li>
<li>
<p>Transport layer</p>
</li>
<li>
<p>Internet layer</p>
</li>
<li>
<p>Link layer</p>
</li>
<li>
<p>Physical layer</p>
</li>
</ul>
<p><strong>Abbreviation</strong></p>
<p>RFC: Request for Comments</p>
<p>MTU: Maximum Transmission Unit (Link)</p>
<p>RTT: Round-Trip Time</p>
<p>TTL: Time-to-Live (IP)</p>
<p>MSS: Maximum Segment Size (TCP)</p>
<p>MSL: Maximum Segment Lifetime (TCP)</p>
<p>IP datagram</p>
<p>UDP datagram</p>
<p>TCP segment</p>
<h2 id="ip-address">IP Address</h2>
<p><img src="/assets/img/post/ip_class.png" alt="ip class" /></p>
<p><img src="/assets/img/post/ip_range.png" alt="ip range" /></p>
<p>The IPv4 address space was originally divided into <strong>five classes</strong>. Classes A, B, and C were used for
assigning addresses to interfaces on the Internet (<strong>unicast</strong> addresses) and for some other special-case uses.</p>
<p>The classes are defined by the first few bits in the address: 0 for class A, 10 for class B, 110 for class C, and so on.
Class D addresses are for <strong>multicast</strong> use, and class E addresses remain reserved.</p>
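<p>The leading-bit rule can be sketched as a small C++ function. This is only an illustration: the names <code class="highlighter-rouge">IpClass</code> and <code class="highlighter-rouge">classify</code> are made up for this example, and the address is assumed to be a 32-bit value in host byte order.</p>

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstdint>

// Classful IPv4 address classes, identified by the leading bits of the address.
enum class IpClass { A, B, C, D, E };

// `addr` is the 32-bit address in host byte order, e.g. 10.0.0.1 == 0x0A000001.
inline IpClass classify(std::uint32_t addr) {
    const std::uint32_t first = addr >> 24;           // first octet
    if ((first & 0x80u) == 0x00u) return IpClass::A;  // 0xxxxxxx
    if ((first & 0xC0u) == 0x80u) return IpClass::B;  // 10xxxxxx
    if ((first & 0xE0u) == 0xC0u) return IpClass::C;  // 110xxxxx
    if ((first & 0xF0u) == 0xE0u) return IpClass::D;  // 1110xxxx (multicast)
    return IpClass::E;                                // 1111xxxx (reserved)
}
```

<p>For example, <code class="highlighter-rouge">assert(classify(0xC0A80001u) == IpClass::C)</code> holds for 192.168.0.1.</p>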
<p>The <strong>host portion</strong> can be further divided into a <code class="highlighter-rouge">subnetwork (subnet) number</code> and a <code class="highlighter-rouge">host number</code>.
The subnet mask marks the boundary: its one bits cover the network and subnet number, and its zero bits cover the host number.</p>
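<p>A minimal sketch of how a mask separates the two parts, assuming addresses are 32-bit values in host byte order. <code class="highlighter-rouge">maskFromPrefix</code>, <code class="highlighter-rouge">split</code> and <code class="highlighter-rouge">SplitAddress</code> are hypothetical helper names, not part of any library.</p>

```cpp
#include <cassert>
#include <cstdint>

struct SplitAddress {
    std::uint32_t network;  // network + subnet bits (where the mask has 1 bits)
    std::uint32_t host;     // host bits (where the mask has 0 bits)
};

// Build a mask with `prefixLen` leading one bits (0..32).
inline std::uint32_t maskFromPrefix(int prefixLen) {
    assert(prefixLen >= 0 && prefixLen <= 32);
    // Shifting a 32-bit value by 32 is undefined, so handle 0 separately.
    return prefixLen == 0 ? 0u : ~std::uint32_t{0} << (32 - prefixLen);
}

// Separate an address into its network/subnet part and its host part.
inline SplitAddress split(std::uint32_t addr, std::uint32_t mask) {
    return SplitAddress{addr & mask, addr & ~mask};
}
```

<p>With 192.168.5.130 (<code class="highlighter-rouge">0xC0A80582</code>) and a 24-bit mask, <code class="highlighter-rouge">split</code> yields network <code class="highlighter-rouge">0xC0A80500</code> and host <code class="highlighter-rouge">0x82</code>.</p>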
<h2 id="ip-icmp-igmp-arp-rarp">IP (ICMP, IGMP, ARP, RARP)</h2>
<p><img src="/assets/img/post/ip_protocol.png" alt="ip" /></p>
<p>Header fields are transmitted in big-endian (network) byte order.</p>
<p>The header is 20 bytes long when it carries no options.</p>
<p>TTL: Time-to-Live, sets an upper limit on the number of routers through which a datagram can pass (hop number).</p>
<p>IP fragmentation and reassembly:</p>
<p>IP Routing Order:</p>
<blockquote>
<p>host -> subnet -> net number -> default routing or fail</p>
</blockquote>
<p>IGP: RIP</p>
<p>EGP:</p>
<p>ARP: IP -> MAC</p>
<p>RARP: MAC -> IP</p>
<p>ICMP:</p>
<p><code class="highlighter-rouge">Ping</code>: sends an ICMP Echo Request (type 8) and receives an ICMP Echo Reply (type 0).</p>
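<p>ICMP (RFC 792) defines Echo Request as type 8 and Echo Reply as type 0. The sketch below builds the bytes of an Echo Request and fills in the Internet checksum from RFC 1071. It only constructs the message in memory and does not open a raw socket; <code class="highlighter-rouge">internetChecksum</code> and <code class="highlighter-rouge">makeEchoRequest</code> are names invented for this example.</p>

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <cstdint>
#include <vector>

// RFC 1071 Internet checksum: one's-complement sum of 16-bit big-endian words.
// The 32-bit accumulator is enough for the small messages used here.
inline std::uint16_t internetChecksum(const std::vector<std::uint8_t>& data) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i + 1 < data.size(); i += 2)
        sum += (std::uint32_t{data[i]} << 8) | data[i + 1];
    if (data.size() % 2 != 0)                 // pad a trailing odd byte with zero
        sum += std::uint32_t{data.back()} << 8;
    while (sum >> 16)                         // fold the carries back in
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return static_cast<std::uint16_t>(~sum & 0xFFFFu);
}

// Layout: type, code, checksum (2 bytes), identifier (2), sequence (2), payload.
inline std::vector<std::uint8_t> makeEchoRequest(std::uint16_t id, std::uint16_t seq,
                                                 const std::vector<std::uint8_t>& payload) {
    std::vector<std::uint8_t> pkt = {
        8, 0,  // type = Echo Request, code = 0
        0, 0,  // checksum placeholder, must be zero while summing
        static_cast<std::uint8_t>(id >> 8), static_cast<std::uint8_t>(id & 0xFF),
        static_cast<std::uint8_t>(seq >> 8), static_cast<std::uint8_t>(seq & 0xFF)};
    pkt.insert(pkt.end(), payload.begin(), payload.end());
    const std::uint16_t csum = internetChecksum(pkt);
    pkt[2] = static_cast<std::uint8_t>(csum >> 8);
    pkt[3] = static_cast<std::uint8_t>(csum & 0xFF);
    return pkt;
}
```

<p>A receiver can validate the message the same way: running <code class="highlighter-rouge">internetChecksum</code> over the complete message, checksum field included, gives 0 when nothing was corrupted.</p>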
<p><code class="highlighter-rouge">Traceroute</code>: uses the IP TTL field, increasing it by 1 for each round, and tells intermediate routers from the destination by the reply it gets back: Time Exceeded versus Port Unreachable.</p>
<p>The traceroute tool is used to determine the routers used along a path from a sender to a destination.</p>
<p>The approach involves sending datagrams <em>first with an IPv4 TTL field set to 1</em> and allowing the expiring datagrams
to induce routers along the path to send ICMPv4 <strong>Time Exceeded (code 0) messages</strong>.</p>
<p>Each round, the sending TTL value is <strong>increased by 1</strong>, causing the routers that are one hop farther to
expire the datagrams and generate ICMP messages.</p>
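<p>The TTL loop can be illustrated with a small simulation. It sends nothing on the network: the path is just a list of router names, and <code class="highlighter-rouge">probe</code>, <code class="highlighter-rouge">traceroute</code>, <code class="highlighter-rouge">Hop</code> and <code class="highlighter-rouge">Reply</code> are hypothetical names. The default limit of 30 hops is an assumption of this sketch.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Reply { TimeExceeded, PortUnreachable };

struct Hop {
    int ttl;           // TTL the probe was sent with
    std::string node;  // who answered
    Reply reply;       // ICMP message that came back
};

// Simulate one probe: every router decrements the TTL and, when it reaches 0,
// drops the datagram and answers with Time Exceeded. The destination answers
// with Port Unreachable because the probe targets an unused UDP port.
inline Hop probe(const std::vector<std::string>& routers, const std::string& dest, int ttl) {
    assert(ttl >= 1);
    int remaining = ttl;
    for (const std::string& r : routers) {
        if (--remaining == 0) return Hop{ttl, r, Reply::TimeExceeded};
    }
    return Hop{ttl, dest, Reply::PortUnreachable};
}

// Start with TTL 1 and add 1 each round until the destination itself replies.
inline std::vector<Hop> traceroute(const std::vector<std::string>& routers,
                                   const std::string& dest, int maxTtl = 30) {
    std::vector<Hop> hops;
    for (int ttl = 1; ttl <= maxTtl; ++ttl) {
        hops.push_back(probe(routers, dest, ttl));
        if (hops.back().reply == Reply::PortUnreachable) break;  // reached the destination
    }
    return hops;
}
```

<p>For a path with routers <code class="highlighter-rouge">r1</code> and <code class="highlighter-rouge">r2</code>, the result has three hops: two Time Exceeded replies from the routers, then Port Unreachable from the destination.</p>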
<h2 id="udp">UDP</h2>
<p><img src="/assets/img/post/udp.png" alt="udp" /></p>
<ul>
<li>
<p>Simple, Unreliable, Fixed header (8 bytes)</p>
</li>
<li>
<p>Broadcast and multicast use UDP because they are one-to-many, whereas a TCP connection always has exactly two endpoints</p>
</li>
<li>
<p>Path MTU Discovery with UDP</p>
</li>
</ul>
<h2 id="tcp">TCP</h2>
<p><img src="/assets/img/post/tcp.png" alt="tcp" /></p>
<p>TCP provides a <strong>reliable</strong>, <strong>connection-oriented</strong>, <strong>byte stream</strong>, transport-layer service.</p>
<p><strong>Reliable:</strong></p>
<ul>
<li>
<p>Acknowledge data packet</p>
</li>
<li>
<p>Time out retransmission</p>
</li>
<li>
<p>Flow control and congestion control</p>
</li>
</ul>
<h3 id="connection">Connection</h3>
<p><img src="/assets/img/post/initiate_connection.png" alt="set up connection" /></p>
<p><img src="/assets/img/post/close_conection.png" alt="close connection" /></p>
<p>normal open: three-way handshake</p>
<p>normal close: four-way handshake</p>
<p>simultaneous open and close: both are four-way handshake</p>
<p>half close: can still receive data</p>
<p><strong>What happens if the third packet (ACK) is lost while a connection is being set up?</strong></p>
<ol>
<li>
<p>If the client sends data to the server, the server replies with RST (this should not normally happen, because the data segment itself carries the ACK).</p>
</li>
<li>
<p>If the client sends no data before the server's retransmission timer expires, the server re-sends the second packet (SYN-ACK).</p>
</li>
</ol>
<h3 id="state-transition">State Transition</h3>
<p><img src="/assets/img/post/tcp_state.png" alt="tcp" /></p>
<h3 id="interactive-communication">Interactive Communication</h3>
<p><strong>Delayed acknowledgments</strong></p>
<p>Interactive data is normally transmitted in segments smaller than the MSS. <strong>Delayed acknowledgments</strong> (200ms) may
be used by the <strong>receiver</strong> of these small segments to see if the acknowledgment can be piggybacked along with data
going back to the sender.</p>
<p>This often reduces the number of segments, but it may introduce additional delay.</p>
<p><strong>Nagle algorithm</strong></p>
<p>This algorithm <strong>limits the sender to a single small packet of unacknowledged data at any time</strong>.
On connections with relatively large round-trip times, such as WANs, the <strong>Nagle algorithm</strong> is often used to <strong>reduce the number of small segments</strong>.
However, it adds delay that is sometimes unacceptable to applications. We can use <code class="highlighter-rouge">TCP_NODELAY</code> to disable Nagle algorithm.</p>
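<p>A minimal sketch of toggling the option with the POSIX socket API. <code class="highlighter-rouge">setNoDelay</code> and <code class="highlighter-rouge">noDelayEnabled</code> are wrapper names chosen for this example; the underlying calls are the standard <code class="highlighter-rouge">setsockopt</code> and <code class="highlighter-rouge">getsockopt</code>.</p>

```cpp
#include <cassert>       // used by the checks in the usage note
#include <netinet/in.h>  // IPPROTO_TCP
#include <netinet/tcp.h> // TCP_NODELAY
#include <sys/socket.h>
#include <unistd.h>

// Disable (enable == true) or re-enable (enable == false) the Nagle algorithm
// on a TCP socket. Returns true on success.
inline bool setNoDelay(int fd, bool enable) {
    const int flag = enable ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}

// Read the option back; returns true when the Nagle algorithm is disabled.
inline bool noDelayEnabled(int fd) {
    int flag = 0;
    socklen_t len = sizeof(flag);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0) return false;
    return flag != 0;
}
```

<p>Typical use is to create the socket with <code class="highlighter-rouge">socket(AF_INET, SOCK_STREAM, 0)</code> and call <code class="highlighter-rouge">setNoDelay(fd, true)</code> before exchanging latency-sensitive small messages.</p>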
<h3 id="bulk-data-communication">Bulk Data Communication</h3>
<p><img src="/assets/img/post/sliding_window.png" alt="sliding window" /></p>
<p><code class="highlighter-rouge">Sliding Window</code> is used to deal with bulk data.</p>
<ol>
<li>
<p>The window closes as the <strong>left edge advances</strong> to the right. This happens when data that has been sent is <strong>acknowledged</strong> and the window size gets smaller.</p>
</li>
<li>
<p>The window opens when the <strong>right edge moves to the right</strong>, allowing more data to be sent. This happens when the receiving process on the other end
<strong>reads acknowledged data</strong>, freeing up space in its TCP receive buffer (kernel).</p>
</li>
<li>
<p>The window shrinks when the right edge moves to the left. RFC strongly discourages this.</p>
</li>
</ol>
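<p>The edge movements above can be modelled with a few lines of C++. This is a simplified sender-side model that counts bytes and ignores sequence number wrap-around; <code class="highlighter-rouge">SendWindow</code> and its members are names invented for the sketch.</p>

```cpp
#include <cassert>
#include <cstdint>

// Sender-side view of the sliding window, in bytes.
struct SendWindow {
    std::uint32_t leftEdge;  // oldest unacknowledged byte
    std::uint32_t nextSeq;   // next byte to send
    std::uint32_t window;    // window advertised by the receiver

    std::uint32_t rightEdge() const { return leftEdge + window; }

    // Bytes that may still be sent; 0 if the receiver shrank the window.
    std::uint32_t usable() const {
        return rightEdge() > nextSeq ? rightEdge() - nextSeq : 0;
    }

    // Try to send `n` bytes; returns how many the window allowed.
    std::uint32_t send(std::uint32_t n) {
        const std::uint32_t sent = n < usable() ? n : usable();
        nextSeq += sent;
        return sent;
    }

    // Process an ACK carrying the cumulative `ackNo` and a new advertised window.
    void onAck(std::uint32_t ackNo, std::uint32_t advertised) {
        assert(ackNo >= leftEdge && ackNo <= nextSeq);
        leftEdge = ackNo;     // left edge advances: the window closes
        window = advertised;  // right edge = ackNo + advertised: it may open
    }
};
```

<p>Starting from <code class="highlighter-rouge">SendWindow{0, 0, 4000}</code>, an ACK of 2000 bytes with an advertised window of 2000 keeps the right edge at 4000 (the window only closes), while a later ACK that advertises 4000 again moves the right edge to the right (the window opens).</p>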
<p><strong>Two problems:</strong></p>
<ol>
<li>
<p>A zero window is advertised and the Window Update packet that follows it is lost</p>
</li>
<li>
<p>Silly Window Syndrome (SWS)</p>
</li>
</ol>
<p><strong>Solution:</strong></p>
<p>For the first problem, there is a <code class="highlighter-rouge">Persist Timer</code> to deal with the loss of <code class="highlighter-rouge">Window Update</code> packet.</p>
<p>For SWS, <strong>small data segments</strong> are exchanged across the connection instead of full-size segments.
This leads to undesirable <strong>inefficiency</strong> because each segment has relatively <strong>high overhead</strong>—a small
number of data bytes relative to the number of bytes in the headers. We can fix it on both receiver and sender sides:</p>
<ul>
<li>
<p>For receiver, small windows are not advertised.</p>
</li>
<li>
<p>For sender, small segments are not sent and the Nagle algorithm governs when to send.</p>
</li>
</ul>
<h3 id="congestion-control">Congestion Control</h3>
<p>Congestion is a situation in which a router is forced to <strong>discard data</strong> because it cannot handle the arriving traffic rate.</p>
<p>In TCP, an assumption is made that <strong>a lost packet</strong> is an indicator of congestion.</p>
<p><code class="highlighter-rouge">Slow Start</code> and <code class="highlighter-rouge">Congestion Avoidance</code> algorithms are triggered when loss is detected,
either by the fast retransmit algorithm or by retransmission timeouts.</p>
<p><strong>Slow Start</strong></p>
<p>The purpose of slow start is to help TCP <strong>find a value for cwnd</strong> before probing for more available bandwidth using congestion avoidance.</p>
<p>Typically, a TCP begins a new connection in slow start, eventually drops a packet, and then settles into steady-state operation using the congestion avoidance algorithm.</p>
<p>It can also be triggered when a timeout-based retransmission occurs.</p>
<p><strong>Congestion Avoidance</strong></p>
<p>There is a congestion window at the sender called <code class="highlighter-rouge">cwnd</code>. A standard TCP limits its window to the minimum of cwnd and the window advertised by the receiver (the sliding window).</p>
<p><strong>Selecting between Slow Start and Congestion Avoidance</strong></p>
<p>Slow start grows the value of the congestion window <strong>exponentially with time</strong>, and congestion avoidance grows it about <strong>linearly with time</strong>.</p>
<p>Only <strong>one</strong> of the two algorithms is in operation at any one time, and this decision is made by comparing the current value of the
congestion window to the slow start threshold <strong>ssthresh</strong>:</p>
<ul>
<li>
<p>cwnd < ssthresh, slow start is used</p>
</li>
<li>
<p>cwnd > ssthresh, congestion avoidance is used</p>
</li>
</ul>
<p>The value of <strong>ssthresh</strong> is not fixed but instead <strong>varies</strong> over time. Its main purpose is to remember the last “best” estimate
of the operating window when no loss was present.</p>
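<p>A rough per-round-trip model of how the two algorithms alternate. It counts in whole segments, caps slow start at ssthresh, and uses cwnd as a stand-in for the amount of data in flight when a timeout occurs; these are simplifications of this sketch, and <code class="highlighter-rouge">Congestion</code> is a made-up type. When cwnd equals ssthresh the sketch chooses congestion avoidance.</p>

```cpp
#include <algorithm>
#include <cassert>   // used by the checks in the usage note

struct Congestion {
    double cwnd;      // congestion window, in segments
    double ssthresh;  // slow start threshold, in segments

    bool inSlowStart() const { return cwnd < ssthresh; }

    // One round trip without loss: every segment in the window is ACKed.
    void onRoundTripAcked() {
        if (inSlowStart())
            cwnd = std::min(cwnd * 2.0, ssthresh);  // +1 segment per ACK: doubles per RTT
        else
            cwnd += 1.0;                            // about +1 segment per RTT
    }

    // Retransmission timeout: remember half of the window as the new
    // threshold (at least 2 segments) and restart from one segment.
    void onTimeout() {
        ssthresh = std::max(cwnd / 2.0, 2.0);
        cwnd = 1.0;
    }
};
```

<p>Starting from <code class="highlighter-rouge">Congestion{1, 8}</code>, cwnd grows 2, 4, 8 and then 9, 10 per round trip; a timeout at 10 sets ssthresh to 5 and cwnd back to 1.</p>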
<h3 id="timer">Timer</h3>
<p>In my humble opinion, timers exist to handle <code class="highlighter-rouge">exceptions</code> such as timeouts, packet loss, crashes or failures, in order to guarantee TCP reliability.</p>
<p>A pure <code class="highlighter-rouge">ACK</code> packet and a <code class="highlighter-rouge">Window Update</code> packet carry no data, so the peer does not acknowledge them and their delivery is not reliable.</p>
<p><strong>Retransmission</strong></p>
<p>Deal with <code class="highlighter-rouge">time out</code>.</p>
<p>TCP measures the RTT and then uses these measurements to keep track of a smoothed RTT estimator and a smoothed mean deviation estimator.
These two estimators are then used to calculate the next retransmission timeout value.</p>
<p>Without the Timestamps option, a TCP measures only a single RTT per window of data. Karn’s algorithm removes the retransmission
ambiguity problem by preventing the use of RTT measurements for segments that have been lost.</p>
<p>Today, most TCPs use the Timestamps option, which permits each segment to be individually timed.</p>
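<p>The two estimators and the resulting timeout can be sketched with the formulas from RFC 6298 (gains of 1/8 and 1/4, RTO = SRTT + 4 * RTTVAR, a floor of 1 second). The sketch ignores clock granularity and timer backoff, and <code class="highlighter-rouge">RttEstimator</code> is a hypothetical class name.</p>

```cpp
#include <algorithm>
#include <cassert>   // used by the checks in the usage note
#include <cmath>

class RttEstimator {
public:
    // Feed one RTT sample, in seconds. Following Karn's algorithm, the caller
    // must not pass samples measured on retransmitted segments.
    void addSample(double r) {
        if (!initialized_) {
            srtt_ = r;           // first measurement seeds both estimators
            rttvar_ = r / 2.0;
            initialized_ = true;
        } else {
            // RTTVAR is updated first, using the previous SRTT.
            rttvar_ = (1.0 - kBeta) * rttvar_ + kBeta * std::fabs(srtt_ - r);
            srtt_ = (1.0 - kAlpha) * srtt_ + kAlpha * r;
        }
    }

    double srtt() const { return srtt_; }
    double rttvar() const { return rttvar_; }

    // Retransmission timeout in seconds; 1 second before any sample exists.
    double rto() const {
        return initialized_ ? std::max(1.0, srtt_ + 4.0 * rttvar_) : 1.0;
    }

private:
    static constexpr double kAlpha = 1.0 / 8.0;  // gain for the smoothed RTT
    static constexpr double kBeta = 1.0 / 4.0;   // gain for the mean deviation
    bool initialized_ = false;
    double srtt_ = 0.0;
    double rttvar_ = 0.0;
};
```

<p>After samples of 2.0 s and 1.0 s the smoothed RTT is 1.875 s, the deviation is 1.0 s, and the timeout is 5.875 s.</p>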
<p><strong>2MSL</strong></p>
<p>Deal with the <code class="highlighter-rouge">loss of ACK</code> that would acknowledge FIN.</p>
<p>After entering the TIME_WAIT state, this side waits for 2 MSL so that it can re-send the final ACK if that ACK is lost.
The other side retransmits its FIN after a timeout, so when the second FIN arrives this side can still answer with an ACK.</p>
<p><strong>Persist</strong></p>
<p>Deal with the <code class="highlighter-rouge">loss of a Window Update packet</code> that would open the window.</p>
<p>Considering sliding window, when the receiver once again has space available, it provides a window update
to the sender to indicate that data is permitted to flow once again. Because such updates do not generally
contain data (they are a form of “pure ACK”), they are not reliably delivered by TCP.</p>
<p>TCP must therefore handle the case where such window updates that would open the window are lost.</p>
<p>If an acknowledgment is lost, we could end up with both sides waiting for
the other. To prevent this form of <strong>deadlock</strong> from occurring, the <code class="highlighter-rouge">sender</code> uses a persist timer to
query the receiver periodically, <strong>to find out if the window size has increased</strong>.</p>
<p><strong>Keepalive</strong></p>
<p>Deal with one side <code class="highlighter-rouge">crash or failure</code>.</p>
<p>The keepalive feature was originally intended for <strong>server applications</strong> that might <strong>tie up resources</strong>
and want to know <strong>if the client host crashes or goes away</strong>. Of course, it can be used on client side as well
even though it is uncommon.</p>
<h1 id="reference">Reference</h1>
<ul>
<li>TCP/IP Illustrated, Volume 1, Second Edition</li>
</ul>
Network I/O Model
2014-10-02T00:00:00+00:00
http://blog.wjin.org/posts/network-io-model
<h1 id="synchronous-vs-asynchronous">Synchronous vs Asynchronous</h1>
<p><strong>Synchronous I/O</strong> blocks the calling process until the I/O operation is done.
The <strong>calling process</strong> completes the I/O operation itself, within <strong>process context</strong>.</p>
<p><strong>Asynchronous I/O</strong> does not block the calling process. The calling process just submits the request
and continues to work. The <em>operating system</em> performs the I/O operation and
notifies the calling process once it has finished.</p>
<h1 id="io-model">I/O Model</h1>
<p>There are two stages in I/O operation and four I/O models in general.</p>
<h3 id="stage">Stage</h3>
<ol>
<li>
<p>Wait for data to be ready</p>
</li>
<li>
<p>Copy data from kernel space to user space</p>
</li>
</ol>
<h3 id="model">Model</h3>
<p><strong>Blocking</strong></p>
<p>Block on the first stage, as the recvfrom system call does.</p>
<p><strong>Non-blocking</strong></p>
<p>Poll on the first stage.</p>
<p><strong>I/O Multiplexing</strong></p>
<p>Block on select/poll/epoll system calls.</p>
<p>The three models above all belong to synchronous I/O.</p>
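<p>The non-blocking and multiplexing models can be tried out on any file descriptor, for example a pipe. The sketch below uses the POSIX calls <code class="highlighter-rouge">fcntl</code> and <code class="highlighter-rouge">poll</code>; the wrapper names <code class="highlighter-rouge">setNonBlocking</code> and <code class="highlighter-rouge">waitReadable</code> are made up for this example.</p>

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cerrno>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

// Non-blocking model: with O_NONBLOCK set, read() returns -1 with
// EAGAIN/EWOULDBLOCK instead of waiting for the first stage to complete.
inline bool setNonBlocking(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// I/O multiplexing model: wait in poll() for up to `timeoutMs` milliseconds
// (first stage); the caller then does the copy (second stage) with read().
inline bool waitReadable(int fd, int timeoutMs) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    const int n = poll(&pfd, 1, timeoutMs);
    return n == 1 && (pfd.revents & POLLIN) != 0;
}
```

<p>On an empty pipe whose read end is non-blocking, <code class="highlighter-rouge">read</code> fails with <code class="highlighter-rouge">EAGAIN</code> and <code class="highlighter-rouge">waitReadable(fd, 0)</code> returns false; once the other end writes, <code class="highlighter-rouge">waitReadable</code> returns true and <code class="highlighter-rouge">read</code> succeeds.</p>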
<p><strong>Asynchronous</strong></p>
<p>The kernel performs the I/O and then notifies the application, for example through the POSIX aio_read and aio_write APIs.
However, they are not widely used by applications.</p>
<p>The concepts above are from the book Unix Network Programming.</p>
<p>As for the implementation on Linux, there are two ways to do aio: glibc aio (the POSIX standard implementation) and libaio. Both are specific to file I/O, not sockets.
Glibc aio is simulated with multiple threads and its performance is not good. libaio, on the other hand, is a library that wraps the Linux aio system calls,
so it performs better than glibc aio. Normally we use libaio to develop applications. Its major APIs are listed here:</p>
<ul>
<li>
<p>io_setup</p>
</li>
<li>
<p>io_destroy</p>
</li>
<li>
<p>io_submit</p>
</li>
<li>
<p>io_cancel</p>
</li>
<li>
<p>io_getevents</p>
</li>
</ul>
Redis Slowlog
2014-09-19T00:00:00+00:00
http://blog.wjin.org/posts/redis-slowlog
<h1 id="introduction">Introduction</h1>
<p>Slowlog, as its name suggests, records commands whose execution time exceeds a configured limit. It is useful for optimizing server performance.</p>
<p>There are two config options for this feature: <code class="highlighter-rouge">slowlog-log-slower-than</code> and <code class="highlighter-rouge">slowlog-max-len</code>.</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>slowlog-log-slower-than: execution time threshold, in microseconds
slowlog-max-len: maximum number of log entries
</code></pre></div></div>
<h1 id="implementation">Implementation</h1>
<p><strong>Data Structure</strong></p>
<p>File: <code class="highlighter-rouge">slowlog.h</code> and <code class="highlighter-rouge">slowlog.c</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">slowlogEntry</span> <span class="p">{</span>
<span class="n">robj</span> <span class="o">**</span><span class="n">argv</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">argc</span><span class="p">;</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">id</span><span class="p">;</span> <span class="cm">/* Unique entry identifier. */</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">duration</span><span class="p">;</span> <span class="cm">/* Time spent by the query, in microseconds. */</span>
<span class="kt">time_t</span> <span class="n">time</span><span class="p">;</span> <span class="cm">/* Unix time at which the query was executed. */</span>
<span class="p">}</span> <span class="n">slowlogEntry</span><span class="p">;</span>
</code></pre></div></div>
<p>All entries are stored in a single list and each entry has a unique ID.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">redisServer</span> <span class="p">{</span>
<span class="p">...</span>
<span class="n">list</span> <span class="o">*</span><span class="n">slowlog</span><span class="p">;</span> <span class="cm">/* SLOWLOG list of commands */</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">slowlog_entry_id</span><span class="p">;</span> <span class="cm">/* SLOWLOG current entry ID */</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">slowlog_log_slower_than</span><span class="p">;</span> <span class="cm">/* SLOWLOG time limit (to get logged) */</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">slowlog_max_len</span><span class="p">;</span> <span class="cm">/* SLOWLOG max number of items logged */</span>
<span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
<p><strong>Init</strong></p>
<p>When the redis server starts up, it calls the <code class="highlighter-rouge">initServer</code> function, which in turn calls <code class="highlighter-rouge">slowlogInit</code> to initialize the slowlog-related properties.</p>
<p>It also registers a callback <code class="highlighter-rouge">slowlogFreeEntry</code> to release slowlog entry.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">slowlogInit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">server</span><span class="p">.</span><span class="n">slowlog</span> <span class="o">=</span> <span class="n">listCreate</span><span class="p">();</span>
<span class="n">server</span><span class="p">.</span><span class="n">slowlog_entry_id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// id start from 0
</span> <span class="n">listSetFreeMethod</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">,</span><span class="n">slowlogFreeEntry</span><span class="p">);</span> <span class="c1">// free entry callback
</span><span class="p">}</span>
</code></pre></div></div>
<p><strong>Add Entry</strong></p>
<p>A new node is inserted at the list head, and the oldest nodes are deleted when the list exceeds the maximum length. <code class="highlighter-rouge">slowlogCreateEntry</code> normally just increases the object reference count.</p>
<p>However, it sometimes creates a new object instead, to save memory:</p>
<ol>
<li>
<p>the string is too long</p>
</li>
<li>
<p>there are too many arguments; the last logged argument is then used to indicate that more arguments were omitted</p>
</li>
</ol>
<p>This is acceptable because the entry is only log information.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">slowlogPushEntryIfNeeded</span><span class="p">(</span><span class="n">robj</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">duration</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog_log_slower_than</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="cm">/* Slowlog disabled */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">duration</span> <span class="o">>=</span> <span class="n">server</span><span class="p">.</span><span class="n">slowlog_log_slower_than</span><span class="p">)</span>
<span class="n">listAddNodeHead</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">,</span><span class="n">slowlogCreateEntry</span><span class="p">(</span><span class="n">argv</span><span class="p">,</span><span class="n">argc</span><span class="p">,</span><span class="n">duration</span><span class="p">));</span>
<span class="cm">/* Remove old entries if needed. */</span>
<span class="k">while</span> <span class="p">(</span><span class="n">listLength</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">)</span> <span class="o">></span> <span class="n">server</span><span class="p">.</span><span class="n">slowlog_max_len</span><span class="p">)</span>
<span class="n">listDelNode</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">,</span><span class="n">listLast</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">));</span>
<span class="p">}</span>
<span class="n">slowlogEntry</span> <span class="o">*</span><span class="nf">slowlogCreateEntry</span><span class="p">(</span><span class="n">robj</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">duration</span><span class="p">)</span> <span class="p">{</span>
<span class="n">slowlogEntry</span> <span class="o">*</span><span class="n">se</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">se</span><span class="p">));</span>
<span class="kt">int</span> <span class="n">j</span><span class="p">,</span> <span class="n">slargc</span> <span class="o">=</span> <span class="n">argc</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">slargc</span> <span class="o">></span> <span class="n">SLOWLOG_ENTRY_MAX_ARGC</span><span class="p">)</span> <span class="n">slargc</span> <span class="o">=</span> <span class="n">SLOWLOG_ENTRY_MAX_ARGC</span><span class="p">;</span>
<span class="n">se</span><span class="o">-></span><span class="n">argc</span> <span class="o">=</span> <span class="n">slargc</span><span class="p">;</span>
<span class="n">se</span><span class="o">-></span><span class="n">argv</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">robj</span><span class="o">*</span><span class="p">)</span><span class="o">*</span><span class="n">slargc</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">slargc</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* increase obj reference or create new object */</span>
<span class="p">....</span>
<span class="p">}</span>
<span class="n">se</span><span class="o">-></span><span class="n">time</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">se</span><span class="o">-></span><span class="n">duration</span> <span class="o">=</span> <span class="n">duration</span><span class="p">;</span>
<span class="n">se</span><span class="o">-></span><span class="n">id</span> <span class="o">=</span> <span class="n">server</span><span class="p">.</span><span class="n">slowlog_entry_id</span><span class="o">++</span><span class="p">;</span>
<span class="k">return</span> <span class="n">se</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
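<p>The push-at-head, trim-at-tail behaviour can be modelled outside of redis with a <code class="highlighter-rouge">std::deque</code>. This is a simplified C++ model and not redis code: <code class="highlighter-rouge">SlowLog</code> and <code class="highlighter-rouge">Entry</code> are hypothetical types, and the command is stored as a plain string instead of an <code class="highlighter-rouge">robj</code> array.</p>

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

struct Entry {
    long long id;          // unique, monotonically increasing
    long long durationUs;  // execution time in microseconds
    std::string command;
};

class SlowLog {
public:
    SlowLog(long long slowerThanUs, std::size_t maxLen)
        : slowerThanUs_(slowerThanUs), maxLen_(maxLen) {}

    // Same decision order as slowlogPushEntryIfNeeded: a negative threshold
    // disables logging, slow commands go to the head, old entries are trimmed.
    void pushIfNeeded(std::string command, long long durationUs) {
        if (slowerThanUs_ < 0) return;
        if (durationUs >= slowerThanUs_)
            entries_.push_front(Entry{nextId_++, durationUs, std::move(command)});
        while (entries_.size() > maxLen_) entries_.pop_back();  // drop the oldest
    }

    const std::deque<Entry>& entries() const { return entries_; }

private:
    long long slowerThanUs_;
    std::size_t maxLen_;
    long long nextId_ = 0;  // ids start from 0
    std::deque<Entry> entries_;
};
```

<p>With <code class="highlighter-rouge">SlowLog log(100, 2)</code>, three slow commands in a row leave only the two newest entries, with the newest at the front.</p>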
<p><strong>Related Commands</strong></p>
<p>Three subcommands are available so far: <code class="highlighter-rouge">reset</code>, <code class="highlighter-rouge">len</code> and <code class="highlighter-rouge">get</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">slowlogCommand</span><span class="p">(</span><span class="n">redisClient</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">argc</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="o">!</span><span class="n">strcasecmp</span><span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">-></span><span class="n">ptr</span><span class="p">,</span><span class="s">"reset"</span><span class="p">))</span> <span class="p">{</span>
<span class="n">slowlogReset</span><span class="p">();</span>
<span class="n">addReply</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">shared</span><span class="p">.</span><span class="n">ok</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">argc</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="o">!</span><span class="n">strcasecmp</span><span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">-></span><span class="n">ptr</span><span class="p">,</span><span class="s">"len"</span><span class="p">))</span> <span class="p">{</span>
<span class="n">addReplyLongLong</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">listLength</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">slowlog</span><span class="p">));</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">c</span><span class="o">-></span><span class="n">argc</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">||</span> <span class="n">c</span><span class="o">-></span><span class="n">argc</span> <span class="o">==</span> <span class="mi">3</span><span class="p">)</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">strcasecmp</span><span class="p">(</span><span class="n">c</span><span class="o">-></span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">-></span><span class="n">ptr</span><span class="p">,</span><span class="s">"get"</span><span class="p">))</span>
<span class="p">{</span>
<span class="cm">/* generate each entry info and return it to client */</span>
<span class="p">...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">addReplyError</span><span class="p">(</span><span class="n">c</span><span class="p">,</span>
<span class="s">"Unknown SLOWLOG subcommand or wrong # of args. Try GET, RESET, LEN."</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="example">Example</h1>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Weis-MacBook-Pro:redis eric<span class="nv">$ </span>redis-cli
127.0.0.1:6379> config <span class="nb">set </span>slowlog-log-slower-than 0
OK
127.0.0.1:6379> config <span class="nb">set </span>slowlog-max-len 5
OK
127.0.0.1:6379> <span class="nb">set </span>msg <span class="s2">"hello, world"</span>
OK
127.0.0.1:6379> <span class="nb">set </span>idx 1
OK
127.0.0.1:6379> <span class="nb">set </span>x y
OK
127.0.0.1:6379> slowlog get
1<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 4
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105058
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 8
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"x"</span>
3<span class="o">)</span> <span class="s2">"y"</span>
2<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 3
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105037
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 11
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"idx"</span>
3<span class="o">)</span> <span class="s2">"1"</span>
3<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 2
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105027
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 30
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"msg"</span>
3<span class="o">)</span> <span class="s2">"hello, world"</span>
4<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105016
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 9
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"config"</span>
2<span class="o">)</span> <span class="s2">"set"</span>
3<span class="o">)</span> <span class="s2">"slowlog-max-len"</span>
4<span class="o">)</span> <span class="s2">"5"</span>
5<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 0
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411104979
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 26
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"config"</span>
2<span class="o">)</span> <span class="s2">"set"</span>
3<span class="o">)</span> <span class="s2">"slowlog-log-slower-than"</span>
4<span class="o">)</span> <span class="s2">"0"</span>
127.0.0.1:6379> slowlog get
1<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 5
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105073
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 31
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"slowlog"</span>
2<span class="o">)</span> <span class="s2">"get"</span>
2<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 4
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105058
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 8
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"x"</span>
3<span class="o">)</span> <span class="s2">"y"</span>
3<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 3
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105037
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 11
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"idx"</span>
3<span class="o">)</span> <span class="s2">"1"</span>
4<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 2
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105027
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 30
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"set"</span>
2<span class="o">)</span> <span class="s2">"msg"</span>
3<span class="o">)</span> <span class="s2">"hello, world"</span>
5<span class="o">)</span> 1<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1
2<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 1411105016
3<span class="o">)</span> <span class="o">(</span>integer<span class="o">)</span> 9
4<span class="o">)</span> 1<span class="o">)</span> <span class="s2">"config"</span>
2<span class="o">)</span> <span class="s2">"set"</span>
3<span class="o">)</span> <span class="s2">"slowlog-max-len"</span>
4<span class="o">)</span> <span class="s2">"5"</span>
127.0.0.1:6379>
</code></pre></div></div>
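<p>The session above shows the two settings at work: with the threshold set to 0 every command is logged, and with a maximum length of 5 the oldest entry (id 0) is gone by the second <code class="highlighter-rouge">slowlog get</code>. The following is a minimal C++ sketch of that behavior, not the Redis implementation; <code class="highlighter-rouge">SlowLog</code>, <code class="highlighter-rouge">SlowEntry</code> and their members are hypothetical names chosen for illustration.</p>

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of one slow log entry (names are illustrative only).
struct SlowEntry {
    long long id;                   // monotonically increasing entry id
    long long duration_us;          // execution time in microseconds
    std::vector<std::string> argv;  // command name and its arguments
};

// Hypothetical bounded slow log: newest entry first, oldest dropped.
class SlowLog {
public:
    SlowLog(long long slower_than_us, std::size_t max_len)
        : slower_than_us_(slower_than_us), max_len_(max_len), next_id_(0) {}

    // Record the command if it ran at least as long as the threshold.
    // A negative threshold is treated as "logging disabled" (assumption).
    void pushIfNeeded(const std::vector<std::string> &argv, long long duration_us) {
        if (slower_than_us_ < 0) return;
        if (duration_us >= slower_than_us_) {
            entries_.push_front(SlowEntry{next_id_++, duration_us, argv});
        }
        // Trim from the tail so only the newest max_len_ entries remain.
        while (entries_.size() > max_len_) entries_.pop_back();
    }

    std::size_t len() const { return entries_.size(); }
    void reset() { entries_.clear(); }
    const std::deque<SlowEntry> &entries() const { return entries_; }

private:
    long long slower_than_us_;
    std::size_t max_len_;
    long long next_id_;
    std::deque<SlowEntry> entries_;
};
```

<p>Pushing six commands into <code class="highlighter-rouge">SlowLog log(0, 5)</code> leaves five entries with ids 5 down to 1, which matches the second listing above. Ids keep growing after a <code class="highlighter-rouge">reset()</code> in this sketch, since only the entries are cleared.</p>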
Iterator
2014-09-17T00:00:00+00:00
http://blog.wjin.org/posts/iterator
<h1 id="iterator">Iterator</h1>
<h2 id="introduction">Introduction</h2>
<p>The iterator pattern provides a way to access the elements of an aggregate object sequentially without exposing its underlying representation.</p>
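<p>As a small illustration of that definition, here is a sketch in which client code sums the elements of a container through an abstract iterator without ever seeing how the container stores them. <code class="highlighter-rouge">IntBag</code>, <code class="highlighter-rouge">BagIterator</code> and <code class="highlighter-rouge">sum</code> are hypothetical names made up for this example.</p>

```cpp
#include <cstddef>
#include <vector>

// Abstract iterator interface in the hasNext()/next() style.
template <typename T>
class Iterator {
public:
    virtual ~Iterator() {}
    virtual bool hasNext() = 0;
    virtual T next() = 0;
};

// Hypothetical aggregate: its std::vector storage stays private.
class IntBag {
public:
    void add(int v) { items_.push_back(v); }

    class BagIterator : public Iterator<int> {
    public:
        explicit BagIterator(const std::vector<int> &items)
            : items_(items), pos_(0) {}
        bool hasNext() override { return pos_ < items_.size(); }
        // The caller is expected to check hasNext() before calling next().
        int next() override { return items_[pos_++]; }
    private:
        const std::vector<int> &items_;
        std::size_t pos_;
    };

    // The iterator is only valid while the bag is alive and unmodified.
    BagIterator iterator() const { return BagIterator(items_); }

private:
    std::vector<int> items_;
};

// Client code depends only on the Iterator interface.
int sum(Iterator<int> &it) {
    int total = 0;
    while (it.hasNext()) total += it.next();
    return total;
}
```

<p>Swapping the vector for another container would only change <code class="highlighter-rouge">IntBag</code> and its iterator; <code class="highlighter-rouge">sum</code> would stay the same.</p>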
<h2 id="example">Example</h2>
<p>Here I use a binary tree to practice the iterator pattern, with three iterators that correspond to preorder, inorder, and postorder traversal respectively.</p>
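<p>When checking the three iterators, it helps to have the expected visiting orders at hand. The sketch below computes them with plain recursion, assuming the same input encoding as <code class="highlighter-rouge">CreateTree</code> in this post (a preorder string in which <code class="highlighter-rouge">#</code> marks an empty subtree). <code class="highlighter-rouge">Node</code>, <code class="highlighter-rouge">Order</code>, <code class="highlighter-rouge">build</code>, <code class="highlighter-rouge">walk</code> and <code class="highlighter-rouge">traverse</code> are hypothetical helpers, not part of the code in this post.</p>

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for the reference traversals.
struct Node {
    char data;
    std::unique_ptr<Node> left, right;
    explicit Node(char d) : data(d) {}
};

enum class Order { Pre, In, Post };

// Build a tree from a preorder string where '#' marks an empty subtree.
// Running out of input is treated like '#', so malformed strings do not crash.
std::unique_ptr<Node> build(const std::string &s, std::size_t &idx) {
    if (idx >= s.size() || s[idx] == '#') {
        ++idx;
        return nullptr;
    }
    std::unique_ptr<Node> n(new Node(s[idx++]));
    n->left = build(s, idx);
    n->right = build(s, idx);
    return n;
}

// Append node labels to out in the requested order.
void walk(const Node *n, Order order, std::string &out) {
    if (!n) return;
    if (order == Order::Pre) out += n->data;
    walk(n->left.get(), order, out);
    if (order == Order::In) out += n->data;
    walk(n->right.get(), order, out);
    if (order == Order::Post) out += n->data;
}

// Convenience wrapper: decode the string, then return the visiting order.
std::string traverse(const std::string &encoded, Order order) {
    std::size_t idx = 0;
    std::unique_ptr<Node> root = build(encoded, idx);
    std::string out;
    walk(root.get(), order, out);
    return out;
}
```

<p>For the input <code class="highlighter-rouge">"AB#D##C##"</code> (root A, left child B whose right child is D, right child C), <code class="highlighter-rouge">traverse</code> yields <code class="highlighter-rouge">ABDC</code> for preorder, <code class="highlighter-rouge">BDAC</code> for inorder and <code class="highlighter-rouge">DBCA</code> for postorder; an iterator-based traversal should visit the nodes in the same order.</p>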
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <iostream>
#include <stack>
#include <string>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="c1">// abstract iterator interface
</span><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">class</span> <span class="nc">Iterator</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">bool</span> <span class="n">hasNext</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
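<span class="k">virtual</span> <span class="n">T</span> <span class="n">next</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">Iterator</span><span class="p">()</span> <span class="p">{}</span>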
<span class="p">};</span>
<span class="c1">// binary tree implementation
</span><span class="k">class</span> <span class="nc">BinaryTree</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">struct</span> <span class="n">TreeNode</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">data</span><span class="p">;</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">left</span><span class="p">,</span> <span class="o">*</span><span class="n">right</span><span class="p">;</span>
<span class="n">TreeNode</span><span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">TreeNode</span> <span class="o">*</span><span class="n">l</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">,</span> <span class="n">TreeNode</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="o">:</span>
<span class="n">data</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="n">left</span><span class="p">(</span><span class="n">l</span><span class="p">),</span> <span class="n">right</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">root</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// preorder iterator
</span> <span class="k">class</span> <span class="nc">PreorderIterator</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">cur</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">PreorderIterator</span><span class="p">(</span><span class="n">TreeNode</span> <span class="o">*</span><span class="n">root</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="o">:</span> <span class="n">cur</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">hasNext</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span> <span class="o">&&</span> <span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">tc</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span> <span class="c1">// temp cur
</span> <span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">ts</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span> <span class="c1">// temp stack
</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">next</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">n</span><span class="p">)</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="c1">// restore cur and stack s
</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">tc</span><span class="p">;</span>
<span class="n">swap</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">ts</span><span class="p">);</span> <span class="c1">// faster than s = ts;
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// might return nullptr
</span> <span class="c1">// hasNext will use it to make a decision
</span> <span class="n">TreeNode</span><span class="o">*</span> <span class="n">next</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">ret</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span> <span class="o">&&</span> <span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// inorder iterator
</span> <span class="k">class</span> <span class="nc">InorderIterator</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">cur</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">InorderIterator</span><span class="p">(</span><span class="n">TreeNode</span> <span class="o">*</span><span class="n">root</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="o">:</span> <span class="n">cur</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">hasNext</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span> <span class="o">&&</span> <span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// never returns nullptr as long as hasNext() is true
</span> <span class="n">TreeNode</span><span class="o">*</span> <span class="n">next</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">ret</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="c1">// uncomment the guard below to avoid a crash when next() is
</span> <span class="c1">// called even though hasNext() returned false
</span> <span class="c1">// if (!cur && s.empty()) return ret;
</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">&&</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// postorder iterator
</span> <span class="k">class</span> <span class="nc">PostorderIterator</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Iterator</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">cur</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">PostorderIterator</span><span class="p">(</span><span class="n">TreeNode</span><span class="o">*</span> <span class="n">root</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="o">:</span> <span class="n">cur</span><span class="p">(</span><span class="n">root</span><span class="p">),</span> <span class="n">prev</span><span class="p">(</span><span class="nb">nullptr</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">hasNext</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cur</span> <span class="o">&&</span> <span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// never returns nullptr as long as hasNext() is true
</span> <span class="n">TreeNode</span><span class="o">*</span> <span class="n">next</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">ret</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="c1">// if (!cur && s.empty()) return ret;
</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="o">-></span><span class="n">right</span> <span class="o">&&</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span> <span class="o">!=</span> <span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">prev</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">cur</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="k">private</span><span class="o">:</span>
<span class="c1">// create tree recursively with pre-order
</span> <span class="c1">// s[i] == '#' means this tree is empty
</span> <span class="n">TreeNode</span><span class="o">*</span> <span class="n">CreateTree</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="c1">// null tree
</span> <span class="k">return</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">t</span> <span class="o">=</span> <span class="k">new</span> <span class="n">TreeNode</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="n">idx</span><span class="p">]);</span>
<span class="n">t</span><span class="o">-></span><span class="n">left</span> <span class="o">=</span> <span class="n">CreateTree</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">++</span><span class="n">idx</span><span class="p">);</span>
<span class="n">t</span><span class="o">-></span><span class="n">right</span> <span class="o">=</span> <span class="n">CreateTree</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">++</span><span class="n">idx</span><span class="p">);</span>
<span class="k">return</span> <span class="n">t</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">destroy</span><span class="p">(</span><span class="n">TreeNode</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="n">destroy</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">left</span><span class="p">);</span>
<span class="n">destroy</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">right</span><span class="p">);</span>
<span class="k">delete</span> <span class="n">t</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Init</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">root</span> <span class="o">=</span> <span class="n">CreateTree</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">BinaryTree</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// the caller must guarantee that s is a valid encoding;
</span> <span class="c1">// it is not checked here
</span> <span class="n">Init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">BinaryTree</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">destroy</span><span class="p">(</span><span class="n">root</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Handle</span><span class="p">(</span><span class="n">TreeNode</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">data</span><span class="p">)</span> <span class="o"><<</span> <span class="s">"-->"</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">PreOrder</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">cur</span> <span class="o">=</span> <span class="n">root</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">||</span> <span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">Handle</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">InOrder</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span><span class="o">*</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">root</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">||</span> <span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">Handle</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">PostOrder</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">TreeNode</span><span class="o">*</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">root</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="n">stack</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">*></span> <span class="n">s</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">||</span> <span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">top</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cur</span><span class="o">-></span><span class="n">right</span> <span class="o">&&</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span> <span class="o">!=</span> <span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">right</span><span class="p">;</span>
<span class="n">s</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">cur</span><span class="o">-></span><span class="n">left</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">s</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">Handle</span><span class="p">(</span><span class="n">cur</span><span class="p">);</span>
<span class="n">prev</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
<span class="n">cur</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">PreorderIterator</span><span class="o">></span> <span class="n">CreatePreorderIterator</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">PreorderIterator</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">PreorderIterator</span><span class="p">(</span><span class="n">root</span><span class="p">));</span>
<span class="p">}</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">InorderIterator</span><span class="o">></span> <span class="n">CreateInorderIterator</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">InorderIterator</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">InorderIterator</span><span class="p">(</span><span class="n">root</span><span class="p">));</span>
<span class="p">}</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">PostorderIterator</span><span class="o">></span> <span class="n">CreatePostorderIterator</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">PostorderIterator</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">PostorderIterator</span><span class="p">(</span><span class="n">root</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cm">/*
* Test Tree
*
A
/ \
B F
/ \ /
C D G
/ \
E H
*/</span>
<span class="c1">// preorder of tree node, # means empty subtree
</span> <span class="n">string</span> <span class="n">s</span><span class="p">(</span><span class="s">"ABC##DE###FG#H###"</span><span class="p">);</span>
<span class="n">BinaryTree</span> <span class="n">t</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"PreOrder: "</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">t</span><span class="p">.</span><span class="n">PreOrder</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">pre</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">CreatePreorderIterator</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">pre</span><span class="o">-></span><span class="n">hasNext</span><span class="p">())</span> <span class="p">{</span>
<span class="n">t</span><span class="p">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">pre</span><span class="o">-></span><span class="n">next</span><span class="p">());</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"InOrder: "</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">t</span><span class="p">.</span><span class="n">InOrder</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">in</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">CreateInorderIterator</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">in</span><span class="o">-></span><span class="n">hasNext</span><span class="p">())</span> <span class="p">{</span>
<span class="n">t</span><span class="p">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">in</span><span class="o">-></span><span class="n">next</span><span class="p">());</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"PostOrder: "</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">t</span><span class="p">.</span><span class="n">PostOrder</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">post</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">CreatePostorderIterator</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="n">post</span><span class="o">-></span><span class="n">hasNext</span><span class="p">())</span> <span class="p">{</span>
<span class="n">t</span><span class="p">.</span><span class="n">Handle</span><span class="p">(</span><span class="n">post</span><span class="o">-></span><span class="n">next</span><span class="p">());</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
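<p>The test string in <code>main</code> encodes the tree in preorder, with <code>#</code> standing for an empty subtree. The following is only a minimal sketch of how such a string could be parsed and checked against plain recursive traversals. <code>Node</code>, <code>Build</code>, <code>Walk</code>, <code>Traverse</code> and the <code>Order</code> enum are hypothetical names introduced for this illustration and are not the post's <code>BinaryTree</code> implementation.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for this sketch; the post's TreeNode may differ.
struct Node {
    char val;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
    explicit Node(char v) : val(v) {}
};

// Build one subtree from s starting at pos; '#' marks an empty subtree.
std::unique_ptr<Node> Build(const std::string& s, std::size_t& pos)
{
    if (pos >= s.size() || s[pos] == '#') {
        ++pos; // consume the '#' marker (or step past a truncated input)
        return nullptr;
    }
    std::unique_ptr<Node> node(new Node(s[pos++]));
    node->left = Build(s, pos);
    node->right = Build(s, pos);
    return node;
}

enum Order { kPre, kIn, kPost };

// Recursive reference traversal that appends node values to out.
void Walk(const Node* n, Order order, std::string& out)
{
    if (!n) return;
    if (order == kPre) out += n->val;
    Walk(n->left.get(), order, out);
    if (order == kIn) out += n->val;
    Walk(n->right.get(), order, out);
    if (order == kPost) out += n->val;
}

// Parse a preorder string and return the node values in the requested order.
std::string Traverse(const std::string& s, Order order)
{
    std::size_t pos = 0;
    std::unique_ptr<Node> root = Build(s, pos);
    assert(pos == s.size() && "malformed preorder string");
    std::string out;
    Walk(root.get(), order, out);
    return out;
}
```

<p>For the test tree above, <code>Traverse("ABC##DE###FG#H###", kPre)</code> yields <code>ABCDEFGH</code>, <code>kIn</code> yields <code>CBEDAGHF</code> and <code>kPost</code> yields <code>CEDBHGFA</code>, which is the order the iterative versions should visit the nodes in as well.</p>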
State
2014-09-10T00:00:00+00:00
http://blog.wjin.org/posts/state
<h1 id="state">State</h1>
<h2 id="introduction">Introduction</h2>
<p>The state pattern allows an object to <strong>alter its behavior when its internal state changes</strong>, so the object appears to change its class. Its UML diagram is similar to that of the Strategy pattern.</p>
<h2 id="example">Example</h2>
<p>Last weekend I watched an F1 race and noticed that Hamilton ran into trouble at the start because the car’s start mode button did not work. A conspiracy?</p>
<p>I also noticed that the engine has a burst mode that drivers can use to get much more power at the cost of a little more fuel.</p>
<p>So here I take a car as the example, as I always pick things I am familiar with. Normally, a car has three modes: start mode, normal mode and burst mode. With the state pattern it is easy to add other modes, so an auto mode is added to my fancy car, thanks to Google’s advanced unmanned driving technology. I wish there were a fly mode in the near future :(</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
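<p>The state classes and <code>Ferrari</code> reference each other: each state holds a <code>Ferrari</code> pointer, and <code>Ferrari</code> holds the states. To compile, the code is split into a header file with a forward declaration and a cpp file with the definitions.</p>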
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// File State.h
</span><span class="k">class</span> <span class="nc">Ferrari</span><span class="p">;</span>
<span class="c1">// abstract car state
</span><span class="k">class</span> <span class="nc">CarState</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">drive</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="o">~</span><span class="n">CarState</span><span class="p">()</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">StartState</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CarState</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Ferrari</span><span class="o">*</span> <span class="n">m_car</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">StartState</span><span class="p">(</span><span class="n">Ferrari</span><span class="o">*</span> <span class="n">car</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">drive</span><span class="p">();</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">NormalState</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CarState</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Ferrari</span><span class="o">*</span> <span class="n">m_car</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">NormalState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">drive</span><span class="p">();</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">BurstState</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CarState</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Ferrari</span><span class="o">*</span> <span class="n">m_car</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">BurstState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">drive</span><span class="p">();</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">AutoState</span> <span class="o">:</span> <span class="k">public</span> <span class="n">CarState</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Ferrari</span><span class="o">*</span> <span class="n">m_car</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">AutoState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">drive</span><span class="p">();</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">Ferrari</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="c1">// all states
</span> <span class="n">CarState</span><span class="o">*</span> <span class="n">m_startState</span><span class="p">;</span> <span class="c1">// start state
</span> <span class="n">CarState</span><span class="o">*</span> <span class="n">m_normalState</span><span class="p">;</span> <span class="c1">// normal state
</span> <span class="n">CarState</span><span class="o">*</span> <span class="n">m_burstState</span><span class="p">;</span> <span class="c1">// burst state
</span> <span class="n">CarState</span><span class="o">*</span> <span class="n">m_autoState</span><span class="p">;</span> <span class="c1">// auto state
</span>
<span class="n">CarState</span><span class="o">*</span> <span class="n">m_state</span><span class="p">;</span> <span class="c1">// current car state
</span> <span class="kt">int</span> <span class="n">m_speed</span><span class="p">;</span> <span class="c1">// current car speed
</span>
<span class="kt">void</span> <span class="nf">setState</span><span class="p">(</span><span class="n">CarState</span><span class="o">*</span> <span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_state</span> <span class="o">=</span> <span class="n">m</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">setSpeed</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_speed</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Ferrari</span><span class="p">();</span>
<span class="o">~</span><span class="n">Ferrari</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">pushStartStateButton</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">pushNormalStateButton</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">pushBurstStateButton</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">pushAutoStateButton</span><span class="p">();</span>
<span class="kt">void</span> <span class="n">drive</span><span class="p">();</span>
<span class="p">};</span>
<span class="c1">// File State.cpp
</span>
<span class="cp">#include <iostream>
#include <typeinfo>
#include "State.h"
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="n">StartState</span><span class="o">::</span><span class="n">StartState</span><span class="p">(</span><span class="n">Ferrari</span><span class="o">*</span> <span class="n">car</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_car</span> <span class="o">=</span> <span class="n">car</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">StartState</span><span class="o">::</span><span class="n">drive</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_car</span><span class="o">-></span><span class="n">setSpeed</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Start State: help start smoothly"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">NormalState</span><span class="o">::</span><span class="n">NormalState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_car</span> <span class="o">=</span> <span class="n">car</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">NormalState</span><span class="o">::</span><span class="n">drive</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_car</span><span class="o">-></span><span class="n">setSpeed</span><span class="p">(</span><span class="mi">50</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Normal State: have fun"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">BurstState</span><span class="o">::</span><span class="n">BurstState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_car</span> <span class="o">=</span> <span class="n">car</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">BurstState</span><span class="o">::</span><span class="n">drive</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_car</span><span class="o">-></span><span class="n">setSpeed</span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Burst State: please concentrate"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">AutoState</span><span class="o">::</span><span class="n">AutoState</span><span class="p">(</span><span class="n">Ferrari</span> <span class="o">*</span><span class="n">car</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_car</span> <span class="o">=</span> <span class="n">car</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">AutoState</span><span class="o">::</span><span class="n">drive</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_car</span><span class="o">-></span><span class="n">setSpeed</span><span class="p">(</span><span class="mi">30</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Auto State: relax, just enjoy"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Ferrari</span><span class="o">::</span><span class="n">Ferrari</span><span class="p">()</span>
<span class="o">:</span> <span class="n">m_startState</span><span class="p">(</span><span class="k">new</span> <span class="n">StartState</span><span class="p">(</span><span class="k">this</span><span class="p">)),</span>
<span class="n">m_normalState</span><span class="p">(</span><span class="k">new</span> <span class="n">NormalState</span><span class="p">(</span><span class="k">this</span><span class="p">)),</span>
<span class="n">m_burstState</span><span class="p">(</span><span class="k">new</span> <span class="n">BurstState</span><span class="p">(</span><span class="k">this</span><span class="p">)),</span>
<span class="n">m_autoState</span><span class="p">(</span><span class="k">new</span> <span class="n">AutoState</span><span class="p">(</span><span class="k">this</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">setState</span><span class="p">(</span><span class="n">m_normalState</span><span class="p">);</span>
<span class="n">setSpeed</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">Ferrari</span><span class="o">::~</span><span class="n">Ferrari</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">delete</span> <span class="n">m_startState</span><span class="p">;</span>
<span class="k">delete</span> <span class="n">m_normalState</span><span class="p">;</span>
<span class="k">delete</span> <span class="n">m_burstState</span><span class="p">;</span>
<span class="k">delete</span> <span class="n">m_autoState</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Ferrari</span><span class="o">::</span><span class="n">pushStartStateButton</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">setState</span><span class="p">(</span><span class="n">m_startState</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Ferrari</span><span class="o">::</span><span class="n">pushNormalStateButton</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">setState</span><span class="p">(</span><span class="n">m_normalState</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Ferrari</span><span class="o">::</span><span class="n">pushBurstStateButton</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// safety check: burst state may only be entered from normal state
</span> <span class="k">if</span> <span class="p">(</span><span class="k">typeid</span><span class="p">(</span><span class="o">*</span><span class="n">m_state</span><span class="p">)</span> <span class="o">!=</span> <span class="k">typeid</span><span class="p">(</span><span class="n">NormalState</span><span class="p">))</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Warning: cannot enter burst state"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">setState</span><span class="p">(</span><span class="n">m_burstState</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Ferrari</span><span class="o">::</span><span class="n">pushAutoStateButton</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">setState</span><span class="p">(</span><span class="n">m_autoState</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Ferrari</span><span class="o">::</span><span class="n">drive</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_state</span><span class="o">-></span><span class="n">drive</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current Speed:"</span> <span class="o"><<</span> <span class="n">m_speed</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">Ferrari</span> <span class="n">car</span><span class="p">;</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushStartStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushNormalStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushBurstStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushAutoStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushBurstStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">pushNormalStateButton</span><span class="p">();</span>
<span class="n">car</span><span class="p">.</span><span class="n">drive</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
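<p>The version above owns its states through raw pointers and guards the burst transition with <code>typeid</code>. As a possible variation, and only a sketch under my own assumptions, the states could be held in <code>std::unique_ptr</code> members and the guard could compare pointers, since the car owns exactly one instance of each state. The class <code>Car</code>, the <code>speed()</code>/<code>name()</code> methods and the <code>bool</code> return value of <code>pushBurstStateButton()</code> are hypothetical and not part of the original code; the start state is left out to keep the sketch short.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical state interface for this sketch: states only report data,
// so they do not need a back pointer to the car.
class CarState {
public:
    virtual ~CarState() {}
    virtual int speed() const = 0;
    virtual std::string name() const = 0;
};

class NormalState : public CarState {
public:
    int speed() const override { return 50; }
    std::string name() const override { return "normal"; }
};

class BurstState : public CarState {
public:
    int speed() const override { return 100; }
    std::string name() const override { return "burst"; }
};

class AutoState : public CarState {
public:
    int speed() const override { return 30; }
    std::string name() const override { return "auto"; }
};

class Car {
private:
    // The car owns one instance of every state; no manual delete is needed.
    std::unique_ptr<CarState> m_normalState;
    std::unique_ptr<CarState> m_burstState;
    std::unique_ptr<CarState> m_autoState;
    CarState* m_state; // non-owning pointer to the current state

public:
    Car()
        : m_normalState(new NormalState),
          m_burstState(new BurstState),
          m_autoState(new AutoState),
          m_state(m_normalState.get())
    {
    }

    void pushNormalStateButton() { m_state = m_normalState.get(); }
    void pushAutoStateButton() { m_state = m_autoState.get(); }

    // Returns false when the transition is refused. Pointer identity is
    // enough here because there is a single NormalState instance.
    bool pushBurstStateButton()
    {
        if (m_state != m_normalState.get()) {
            return false;
        }
        m_state = m_burstState.get();
        return true;
    }

    int drive() const
    {
        assert(m_state != nullptr);
        return m_state->speed();
    }

    std::string mode() const { return m_state->name(); }
};
```

<p>With this layout, a default-constructed <code>Car</code> starts in normal mode, <code>pushBurstStateButton()</code> succeeds from there and is refused from auto mode. Whether this is preferable to the <code>typeid</code> check is a design choice: it avoids RTTI, but it relies on the car never creating a second instance of a state.</p>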
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// abstract car state</span>
<span class="kd">interface</span> <span class="nc">CarState</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">StartState</span> <span class="kd">implements</span> <span class="n">CarState</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Ferrari</span> <span class="n">car</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">StartState</span><span class="o">(</span><span class="n">Ferrari</span> <span class="n">car</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">car</span> <span class="o">=</span> <span class="n">car</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">()</span> <span class="o">{</span>
<span class="n">car</span><span class="o">.</span><span class="na">setSpeed</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Start State: help start smoothly"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">NormalState</span> <span class="kd">implements</span> <span class="n">CarState</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Ferrari</span> <span class="n">car</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">NormalState</span><span class="o">(</span><span class="n">Ferrari</span> <span class="n">car</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">car</span> <span class="o">=</span> <span class="n">car</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">()</span> <span class="o">{</span>
<span class="n">car</span><span class="o">.</span><span class="na">setSpeed</span><span class="o">(</span><span class="mi">50</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Normal State: have fun"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">BurstState</span> <span class="kd">implements</span> <span class="n">CarState</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Ferrari</span> <span class="n">car</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">BurstState</span><span class="o">(</span><span class="n">Ferrari</span> <span class="n">car</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">car</span> <span class="o">=</span> <span class="n">car</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">()</span> <span class="o">{</span>
<span class="n">car</span><span class="o">.</span><span class="na">setSpeed</span><span class="o">(</span><span class="mi">100</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Burst State: please concentrate"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">AutoState</span> <span class="kd">implements</span> <span class="n">CarState</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Ferrari</span> <span class="n">car</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">AutoState</span><span class="o">(</span><span class="n">Ferrari</span> <span class="n">car</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">car</span> <span class="o">=</span> <span class="n">car</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">()</span> <span class="o">{</span>
<span class="n">car</span><span class="o">.</span><span class="na">setSpeed</span><span class="o">(</span><span class="mi">30</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Auto State: relax, just enjoy"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">Ferrari</span> <span class="o">{</span>
<span class="c1">// all states</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">CarState</span> <span class="n">m_startState</span><span class="o">;</span> <span class="c1">// start state</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">CarState</span> <span class="n">m_normalState</span><span class="o">;</span> <span class="c1">// normal state</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">CarState</span> <span class="n">m_burstState</span><span class="o">;</span> <span class="c1">// burst state</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">CarState</span> <span class="n">m_autoState</span><span class="o">;</span> <span class="c1">// auto state</span>
<span class="kd">private</span> <span class="n">CarState</span> <span class="n">m_state</span><span class="o">;</span> <span class="c1">// current car state</span>
<span class="kd">private</span> <span class="kt">int</span> <span class="n">m_speed</span><span class="o">;</span> <span class="c1">// current car speed</span>
<span class="kd">private</span> <span class="kt">void</span> <span class="nf">setState</span><span class="o">(</span><span class="n">CarState</span> <span class="n">state</span><span class="o">)</span> <span class="o">{</span>
<span class="n">m_state</span> <span class="o">=</span> <span class="n">state</span><span class="o">;</span>
<span class="o">}</span>
<span class="kt">void</span> <span class="nf">setSpeed</span><span class="o">(</span><span class="kt">int</span> <span class="n">speed</span><span class="o">)</span> <span class="o">{</span>
<span class="n">m_speed</span> <span class="o">=</span> <span class="n">speed</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="nf">Ferrari</span><span class="o">()</span> <span class="o">{</span>
<span class="n">m_startState</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StartState</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="n">m_normalState</span> <span class="o">=</span> <span class="k">new</span> <span class="n">NormalState</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="n">m_burstState</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BurstState</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="n">m_autoState</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AutoState</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="n">setState</span><span class="o">(</span><span class="n">m_normalState</span><span class="o">);</span>
<span class="n">setSpeed</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">pushStartStateButton</span><span class="o">()</span> <span class="o">{</span>
<span class="n">setState</span><span class="o">(</span><span class="n">m_startState</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">pushNormalStateButton</span><span class="o">()</span> <span class="o">{</span>
<span class="n">setState</span><span class="o">(</span><span class="n">m_normalState</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">pushBurstStateButton</span><span class="o">()</span> <span class="o">{</span>
<span class="c1">// security check</span>
<span class="k">if</span> <span class="o">(!(</span><span class="n">m_state</span> <span class="k">instanceof</span> <span class="n">NormalState</span><span class="o">))</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Warning: cannot enter burst state"</span><span class="o">);</span>
<span class="k">return</span><span class="o">;</span>
<span class="o">}</span>
<span class="n">setState</span><span class="o">(</span><span class="n">m_burstState</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">pushAutoStateButton</span><span class="o">()</span> <span class="o">{</span>
<span class="n">setState</span><span class="o">(</span><span class="n">m_autoState</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">drive</span><span class="o">()</span> <span class="o">{</span>
<span class="n">m_state</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Current Speed:"</span> <span class="o">+</span> <span class="n">m_speed</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">State</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Ferrari</span> <span class="n">car</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Ferrari</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushStartStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushNormalStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushBurstStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushAutoStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushBurstStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">pushNormalStateButton</span><span class="o">();</span>
<span class="n">car</span><span class="o">.</span><span class="na">drive</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Union Find Set
2014-09-06T00:00:00+00:00
http://blog.wjin.org/posts/union-find-set
<h1 id="introduction">Introduction</h1>
<p>Union-Find is a data structure that keeps track of a set of elements partitioned into a number of disjoint subsets. It supports two useful operations:</p>
<ul>
<li>
<p>Find: determine which subset a particular element is in</p>
</li>
<li>
<p>Union: join two subsets into a single subset</p>
</li>
</ul>
<p>In practice, the structure is implemented with <strong>path compression</strong> and <strong>union by rank</strong>. With both in place, comparing the results of two Find operations tells us whether two elements are in the same subset in at most <strong>O(log n)</strong> time, and in nearly constant amortized time over a sequence of operations.</p>
<p>It is also widely used inside other well-known algorithms, such as <strong>Tarjan</strong>'s offline algorithm for the LCA problem and <strong>Kruskal</strong>'s algorithm for the minimum spanning tree.</p>
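<p>To make the Kruskal connection concrete, here is a minimal sketch of Kruskal's algorithm, which builds a minimum spanning tree by adding edges in increasing weight order and using Find/Union to skip edges that would close a cycle. The names <code>Edge</code>, <code>DisjointSet</code> and <code>kruskalWeight</code> are made up for this illustration, and the union-find here is deliberately simplified (path compression only, no rank).</p>

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical edge type for this sketch: an undirected edge u-v with weight w.
struct Edge {
    int u, v, w;
};

// Simplified union-find used only by this sketch (path compression, no rank).
struct DisjointSet {
    std::vector<int> parent;

    explicit DisjointSet(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0); // every node starts as its own root
    }

    int find(int x) {
        if (parent[x] != x) {
            parent[x] = find(parent[x]); // compress path
        }
        return parent[x];
    }

    // Returns true if x and y were in different sets and have now been merged.
    bool unite(int x, int y) {
        int rx = find(x), ry = find(y);
        if (rx == ry) return false;
        parent[rx] = ry;
        return true;
    }
};

// Returns the total weight of a minimum spanning tree,
// assuming the graph with vertices 0..n-1 is connected.
int kruskalWeight(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b) { return a.w < b.w; });
    DisjointSet ds(n);
    int total = 0;
    for (const Edge &e : edges) {
        if (ds.unite(e.u, e.v)) { // edge joins two components, so keep it
            total += e.w;
        }
    }
    return total;
}
```

<p>An edge whose endpoints already share a root is skipped, which is exactly the "are these two elements in the same subset" question from above.</p>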
<h1 id="example">Example</h1>
<h2 id="people-infected-virus">People Infected Virus</h2>
<p><strong>Description</strong></p>
<p>There are n students (numbered 0…n-1) and m student unions; each union has k students.
If student 0 is infected with a virus and the virus spreads among members of the same union, how many students are infected in total?</p>
<p>Input:</p>
<p>The first line has two numbers, n and m. Each of the following m lines describes one union:
the first number is the student count k, followed by the k student IDs.</p>
<p>A final line of 0 0 ends the input.</p>
<p>100 4</p>
<p>2 1 2</p>
<p>5 10 13 11 12 14</p>
<p>2 0 1</p>
<p>2 99 2</p>
<p>200 2</p>
<p>1 5</p>
<p>5 1 2 3 4 5</p>
<p>1 0</p>
<p>0 0</p>
<p><strong>Analysis</strong></p>
<p>While reading the input, we merge the students of each union into one set
and keep track of how many students each set contains. The size of the set containing student 0 is the answer.</p>
<p>This is a typical use of a union-find set (UFS); here each node only records a student count.
In other problems, the node can record whatever information that problem needs.</p>
<p><strong>Code</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">UnionFindSet</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">struct</span> <span class="n">Node</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">parent</span><span class="p">;</span> <span class="c1">// parent of this node
</span> <span class="kt">int</span> <span class="n">rank</span><span class="p">;</span> <span class="c1">// rank value for merge
</span>
<span class="c1">// can record any data here
</span> <span class="kt">int</span> <span class="n">cnt</span><span class="p">;</span> <span class="c1">// number of people infected in this set
</span>
<span class="n">Node</span><span class="p">()</span><span class="o">:</span> <span class="n">parent</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">rank</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">cnt</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="n">vector</span><span class="o"><</span><span class="n">Node</span><span class="o">></span> <span class="n">node</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">UnionFindSet</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="o">:</span> <span class="n">node</span><span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">int</span> <span class="n">Find</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">parent</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="k">return</span> <span class="n">node</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">parent</span> <span class="o">=</span> <span class="n">Find</span><span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">parent</span><span class="p">);</span> <span class="c1">// compress path
</span> <span class="p">}</span>
<span class="kt">void</span> <span class="n">Union</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">u1</span> <span class="o">=</span> <span class="n">Find</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">u2</span> <span class="o">=</span> <span class="n">Find</span><span class="p">(</span><span class="n">y</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">u1</span> <span class="o">==</span> <span class="n">u2</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="c1">// same set
</span>
<span class="k">if</span> <span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">rank</span> <span class="o"><</span> <span class="n">node</span><span class="p">[</span><span class="n">u2</span><span class="p">].</span><span class="n">rank</span><span class="p">)</span> <span class="p">{</span>
<span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">parent</span> <span class="o">=</span> <span class="n">u2</span><span class="p">;</span>
<span class="n">node</span><span class="p">[</span><span class="n">u2</span><span class="p">].</span><span class="n">cnt</span> <span class="o">+=</span> <span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">cnt</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// >=
</span> <span class="n">node</span><span class="p">[</span><span class="n">u2</span><span class="p">].</span><span class="n">parent</span> <span class="o">=</span> <span class="n">u1</span><span class="p">;</span>
<span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">rank</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">rank</span><span class="p">,</span> <span class="n">node</span><span class="p">[</span><span class="n">u2</span><span class="p">].</span><span class="n">rank</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">node</span><span class="p">[</span><span class="n">u1</span><span class="p">].</span><span class="n">cnt</span> <span class="o">+=</span> <span class="n">node</span><span class="p">[</span><span class="n">u2</span><span class="p">].</span><span class="n">cnt</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">GetNum</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">node</span><span class="p">[</span><span class="n">Find</span><span class="p">(</span><span class="n">x</span><span class="p">)].</span><span class="n">cnt</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cp">#ifndef ONLINE_JUDGE
</span> <span class="n">freopen</span><span class="p">(</span><span class="s">"input"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
<span class="c1">// freopen("output","w",stdout);
</span><span class="cp">#endif
</span>
<span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">k</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">n</span> <span class="o">>></span> <span class="n">m</span> <span class="o">&&</span> <span class="n">n</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">UnionFindSet</span> <span class="n">ufs</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">m</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">;</span> <span class="c1">// two students
</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">k</span><span class="p">;</span>
<span class="n">k</span><span class="o">--</span><span class="p">;</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">x</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">k</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">y</span><span class="p">;</span>
<span class="n">ufs</span><span class="p">.</span><span class="n">Union</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">ufs</span><span class="p">.</span><span class="n">GetNum</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
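<p>As a quick check of the analysis against the sample input, the same idea can be run on in-memory data. This is a compact variant rather than the class above: it uses an iterative Find with path halving and union by size, and the names <code>countInfected</code> and <code>groups</code> are made up for this sketch.</p>

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: n students (0..n-1); each inner vector is one student union.
// Returns the size of the set that contains student 0.
int countInfected(int n, const std::vector<std::vector<int>> &groups) {
    std::vector<int> parent(n), size(n, 1);
    std::iota(parent.begin(), parent.end(), 0); // every student starts alone

    auto find = [&](int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    };

    for (const auto &g : groups) {
        for (std::size_t i = 1; i < g.size(); ++i) {
            int a = find(g[0]), b = find(g[i]);
            if (a == b) continue;                   // already in the same set
            if (size[a] < size[b]) std::swap(a, b); // attach the smaller set under the larger
            parent[b] = a;
            size[a] += size[b];
        }
    }
    return size[find(0)];
}
```

<p>For the first sample (<code>100 4</code>) this returns 4, because students 0, 1, 2 and 99 end up in one set. For the second sample (<code>200 2</code>) it returns 1, since student 0 belongs to no union.</p>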
Farewell Wordpress
2014-09-05T00:00:00+00:00
http://blog.wjin.org/posts/farewell-wordpress
<h1 id="farewell">Farewell</h1>
<p>Farewell to my WordPress blog. I am not sure whether I will switch back to WordPress in the future. It seems that NoSql is increasingly prevalent even for a tiny blog :(</p>
<p>Currently, the Jekyll and Bootstrap template works pretty well. Most importantly, I can use a version control system to manage my blog on GitHub, and GitHub can also host the blog under my own domain name. Thanks to GitHub :(</p>
<h2 id="wordpress-plugin">Wordpress Plugin</h2>
<p>Here are some plugins I used before. <strong>syntaxhighlighter</strong> is pretty good: I tried a lot of highlighters, but most of them did not work well together with the markdown plugin.</p>
<ul>
<li>
<p>table-of-contents-plus</p>
</li>
<li>
<p>disqus-comment-system</p>
</li>
<li>
<p>jetpack-markdown</p>
</li>
<li>
<p>syntaxhighlighter</p>
</li>
<li>
<p>google-syntax-highlighter</p>
</li>
<li>
<p>send-to-kindle</p>
</li>
</ul>
Template
2014-09-02T00:00:00+00:00
http://blog.wjin.org/posts/template
<h1 id="template-method">Template Method</h1>
<h2 id="introduction">Introduction</h2>
<p>The template method pattern <strong>defines the skeleton of an algorithm</strong> in a method, <strong>deferring</strong> some steps to subclasses. Template method lets subclasses redefine certain steps of an algorithm without changing the algorithm’s structure.</p>
<p>From the description above, we can see that the <a href="http://blog.wjin.org/posts/factory.html">factory method pattern</a> is a special case of the template method: the step it defers to subclasses is the decision of which class to instantiate.</p>
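<p>A small sketch may help show that relationship: the template method fixes the skeleton, and the one deferred step happens to be object creation, i.e. a factory method. All class names here (<code>Parser</code>, <code>Importer</code> and so on) are hypothetical and chosen only for illustration.</p>

```cpp
#include <memory>
#include <string>

// Hypothetical product hierarchy for this sketch.
struct Parser {
    virtual ~Parser() = default;
    virtual std::string name() const = 0;
};
struct JsonParser : Parser {
    std::string name() const override { return "json"; }
};
struct XmlParser : Parser {
    std::string name() const override { return "xml"; }
};

class Importer {
public:
    virtual ~Importer() = default;

    // Template method: the skeleton is fixed here (non-virtual).
    std::string run() {
        std::unique_ptr<Parser> parser = createParser(); // the deferred step
        return "parsed with " + parser->name();
    }

protected:
    // Factory method: subclasses decide which class to instantiate.
    virtual std::unique_ptr<Parser> createParser() = 0;
};

class JsonImporter : public Importer {
protected:
    std::unique_ptr<Parser> createParser() override {
        return std::unique_ptr<Parser>(new JsonParser());
    }
};

class XmlImporter : public Importer {
protected:
    std::unique_ptr<Parser> createParser() override {
        return std::unique_ptr<Parser>(new XmlParser());
    }
};
```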
<h2 id="example">Example</h2>
<p>Taking the same example as in the <a href="http://blog.wjin.org/posts/command.html">command pattern</a> post, this time I use the template method pattern to implement the worker class. Different workers share the same workflow: read a request from the master process, analyse the request, execute the request and return the result.</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// abstract worker
</span><span class="k">class</span> <span class="nc">Worker</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// template method, do not modify
</span> <span class="k">virtual</span> <span class="kt">void</span> <span class="n">process</span><span class="p">()</span> <span class="k">final</span> <span class="p">{</span>
<span class="c1">// hook for subclass
</span> <span class="n">hook</span><span class="p">();</span>
<span class="n">readRequest</span><span class="p">();</span>
<span class="n">analyseRequest</span><span class="p">();</span>
<span class="n">executeRequest</span><span class="p">();</span>
<span class="n">returnResult</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">readRequest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"IPC: read request from master process"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">analyseRequest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Protocol Analysis: analyse request and store result to handle"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">executeRequest</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">returnResult</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Return: return result in buffer to master process"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="nf">hook</span><span class="p">()</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="c1">// append log worker
</span><span class="k">class</span> <span class="nc">WorkerForLog</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Worker</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">executeRequest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Execute: open file and append log"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// access database worker
</span><span class="k">class</span> <span class="nc">WorkerForDB</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Worker</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">executeRequest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Execute: open database and read tables"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Worker</span><span class="o">></span> <span class="n">log</span><span class="p">(</span><span class="k">new</span> <span class="n">WorkerForLog</span><span class="p">());</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Worker</span><span class="o">></span> <span class="n">db</span><span class="p">(</span><span class="k">new</span> <span class="n">WorkerForDB</span><span class="p">());</span>
<span class="n">log</span><span class="o">-></span><span class="n">process</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">db</span><span class="o">-></span><span class="n">process</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
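<p>The <code>hook()</code> step in the C++ version above is never overridden, so here is a minimal sketch of what overriding it looks like. It uses a simplified stand-in for <code>Worker</code> (the hypothetical <code>TracedWorker</code>) that records step names instead of printing them, so the order of the steps is easy to inspect.</p>

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the Worker class: steps are recorded, not printed.
class TracedWorker {
public:
    virtual ~TracedWorker() = default;

    // Template method: fixed order of steps, with an optional hook first.
    std::vector<std::string> process() {
        steps.clear();
        hook();
        steps.push_back("read");
        steps.push_back("analyse");
        executeRequest();
        steps.push_back("return");
        return steps;
    }

protected:
    virtual void executeRequest() = 0; // required step
    virtual void hook() {}             // optional step, empty by default
    std::vector<std::string> steps;
};

// Hypothetical worker that uses the hook to add an audit step.
class AuditedLogWorker : public TracedWorker {
protected:
    void executeRequest() override { steps.push_back("append log"); }
    void hook() override { steps.push_back("audit"); }
};
```

<p>Calling <code>process()</code> on an <code>AuditedLogWorker</code> yields the steps audit, read, analyse, append log, return. A subclass that leaves <code>hook()</code> alone simply gets the last four.</p>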
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// abstract worker</span>
<span class="kd">abstract</span> <span class="kd">class</span> <span class="nc">Worker</span> <span class="o">{</span>
<span class="c1">// template method, do not modify</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="kt">void</span> <span class="nf">process</span><span class="o">()</span> <span class="o">{</span>
<span class="c1">// hook for subclass</span>
<span class="n">hook</span><span class="o">();</span>
<span class="n">readRequest</span><span class="o">();</span>
<span class="n">analyseRequest</span><span class="o">();</span>
<span class="n">executeRequest</span><span class="o">();</span>
<span class="n">returnResult</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">readRequest</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"IPC: read request from master process"</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">analyseRequest</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span>
<span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Protocol Analysis: analyse request and store result to handle"</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">abstract</span> <span class="kt">void</span> <span class="nf">executeRequest</span><span class="o">();</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">returnResult</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Return: return result in buffer to master process"</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">hook</span><span class="o">()</span> <span class="o">{</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// append log worker</span>
<span class="kd">class</span> <span class="nc">WorkerForLog</span> <span class="kd">extends</span> <span class="n">Worker</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">executeRequest</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Execute: open file and append log"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// access database worker</span>
<span class="kd">class</span> <span class="nc">WorkerForDB</span> <span class="kd">extends</span> <span class="n">Worker</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">executeRequest</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Execute: open database and read tables"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Template</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Worker</span> <span class="n">log</span> <span class="o">=</span> <span class="k">new</span> <span class="n">WorkerForLog</span><span class="o">();</span>
<span class="n">Worker</span> <span class="n">db</span> <span class="o">=</span> <span class="k">new</span> <span class="n">WorkerForDB</span><span class="o">();</span>
<span class="n">log</span><span class="o">.</span><span class="na">process</span><span class="o">();</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">();</span>
<span class="n">db</span><span class="o">.</span><span class="na">process</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Adapter and Facade
2014-09-02T00:00:00+00:00
http://blog.wjin.org/posts/adapter-and-facade
<h1 id="adapter-and-facade">Adapter and Facade</h1>
<h2 id="adapter">Adapter</h2>
<p>The adapter pattern <strong>converts the interface</strong> of a class into another interface the clients expect. Adapter lets classes work together that couldn’t otherwise because of incompatible interfaces.</p>
<p>There are two kinds of adapter pattern:</p>
<ul>
<li>
<p>object adapter using <strong>composition</strong></p>
</li>
<li>
<p>class adapter using <strong>multiple inheritance</strong></p>
</li>
</ul>
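<p>The two kinds can be sketched side by side. The interface <code>Logger</code> and the incompatible class <code>LegacyPrinter</code> are hypothetical names invented for this example.</p>

```cpp
#include <string>

// Interface the client expects (hypothetical).
class Logger {
public:
    virtual ~Logger() = default;
    virtual std::string log(const std::string &msg) = 0;
};

// Existing class with an incompatible interface (hypothetical).
class LegacyPrinter {
public:
    std::string printLine(const char *text) {
        return std::string("[legacy] ") + text;
    }
};

// Object adapter: holds a reference to the adaptee (composition).
class ObjectAdapter : public Logger {
public:
    explicit ObjectAdapter(LegacyPrinter &p) : printer(p) {}
    std::string log(const std::string &msg) override {
        return printer.printLine(msg.c_str());
    }

private:
    LegacyPrinter &printer;
};

// Class adapter: inherits the target interface publicly
// and the adaptee privately (multiple inheritance).
class ClassAdapter : public Logger, private LegacyPrinter {
public:
    std::string log(const std::string &msg) override {
        return printLine(msg.c_str());
    }
};
```

<p>The object adapter can wrap any existing <code>LegacyPrinter</code> instance at run time, while the class adapter fixes the adaptee type at compile time.</p>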
<h2 id="facade">Facade</h2>
<p>The façade pattern provides a <strong>unified interface</strong> to a set of interfaces in a subsystem. Façade defines a higher level interface that makes the subsystem easier to use.</p>
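<p>As a minimal sketch of the idea, assume a subsystem of three small classes that the client would otherwise have to drive one by one. The classes and the <code>HomeTheaterFacade</code> name are hypothetical, and each method returns a string instead of doing real work.</p>

```cpp
#include <string>

// Hypothetical subsystem classes.
class Amplifier {
public:
    std::string on() { return "amp on"; }
};
class Projector {
public:
    std::string on() { return "projector on"; }
};
class Player {
public:
    std::string play(const std::string &title) { return "playing " + title; }
};

// Facade: one higher-level call that drives the whole subsystem.
class HomeTheaterFacade {
public:
    std::string watch(const std::string &title) {
        return amp.on() + ", " + projector.on() + ", " + player.play(title);
    }

private:
    Amplifier amp;
    Projector projector;
    Player player;
};
```

<p>The subsystem classes stay usable on their own; the facade only adds a simpler entry point.</p>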
<h2 id="difference">Difference</h2>
<ul>
<li>
<p>Adapter: change interface</p>
</li>
<li>
<p>Façade: simplify interface</p>
</li>
<li>
<p>Decorator: do not change interface but add responsibility</p>
</li>
</ul>
Command
2014-08-31T00:00:00+00:00
http://blog.wjin.org/posts/command
<h1 id="command">Command</h1>
<h2 id="introduction">Introduction</h2>
<p>The command pattern <strong>encapsulates a request as an object</strong>, thereby letting you parameterize other objects with different requests, queue or log requests, and support undoable operations.</p>
<p><strong>Key Point</strong></p>
<ul>
<li>
<p>decouples the invoker from the receiver</p>
</li>
<li>
<p>macro command: several commands grouped into one</p>
</li>
<li>
<p>typical usage: queuing and logging requests</p>
</li>
</ul>
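<p>The introduction mentions undoable operations, which the demo code in this post does not cover. Here is a minimal sketch of how undo could look, assuming a hypothetical <code>Counter</code> receiver and a command interface extended with <code>undo()</code> (all names here are illustrative):</p>

```cpp
#include <memory>
#include <stack>

// Receiver (hypothetical): a counter the commands act on.
struct Counter
{
    int value = 0;
};

// Command interface extended with undo().
class UndoableCommand
{
public:
    virtual ~UndoableCommand() = default;
    virtual void execute() = 0;
    virtual void undo() = 0;
};

// Adds a fixed amount; undo subtracts the same amount.
class AddCommand : public UndoableCommand
{
public:
    AddCommand(Counter &c, int n) : counter(c), amount(n) {}
    void execute() override { counter.value += amount; }
    void undo() override { counter.value -= amount; }
private:
    Counter &counter;
    int amount;
};

// Invoker: runs commands and keeps them on a history stack.
class Invoker
{
public:
    void run(std::shared_ptr<UndoableCommand> cmd)
    {
        cmd->execute();
        history.push(cmd);
    }
    // Undo the most recent command; returns false if nothing is left.
    bool undoLast()
    {
        if (history.empty())
            return false;
        history.top()->undo();
        history.pop();
        return true;
    }
private:
    std::stack<std::shared_ptr<UndoableCommand>> history;
};
```

<p>Because each request is an object, the invoker can store it after execution and reverse it later without knowing what the command actually does.</p>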
<h2 id="example">Example</h2>
<p>Suppose a master server process dispatches requests (append a log entry, read a database, and so on) to worker threads or processes. Each request can be encapsulated as an object and placed in a work queue. Each worker takes a task from the queue and simply executes the command it encapsulates, without caring which kind of task it is.</p>
<p>The following demo code uses the command pattern to implement such a work queue. The server (command invoker) and the worker (command receiver) are decoupled.</p>
<p>For simplicity, the server and the worker are each wrapped in a class and communicate through a global variable. In the real world, this would be a multi-threaded or multi-process model.</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
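<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span>
<span class="cp">#include</span> <span class="cpf">&lt;memory&gt;</span>
<span class="cp">#include</span> <span class="cpf">&lt;queue&gt;</span>
<span class="cp">#include</span> <span class="cpf">&lt;vector&gt;</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="c1">// abstract command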
</span><span class="k">class</span> <span class="nc">Command</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
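<span class="k">virtual</span> <span class="o">~</span><span class="n">Command</span><span class="p">()</span> <span class="p">{}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">execute</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>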
<span class="p">};</span>
<span class="c1">// append log command
</span><span class="k">class</span> <span class="nc">AddLog</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Command</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">execute</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Open file and append log"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// access database command
</span><span class="k">class</span> <span class="nc">AccessDatabase</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Command</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">execute</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Open database and read tables"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// macro command
// include a few commands together
</span><span class="k">class</span> <span class="nc">AccessAndLog</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Command</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">vector</span><span class="o"><</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">>></span> <span class="n">cmds</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">AccessAndLog</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">>></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span> <span class="o">:</span> <span class="n">cmds</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">execute</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">e</span> <span class="o">:</span> <span class="n">cmds</span><span class="p">)</span>
<span class="n">e</span><span class="o">-></span><span class="n">execute</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// shared queue between server and worker
</span><span class="n">queue</span><span class="o"><</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">>></span> <span class="n">workqueue</span><span class="p">;</span>
<span class="c1">// command invoker
</span><span class="k">class</span> <span class="nc">Server</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">addCommand</span><span class="p">(</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">cmd</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">workqueue</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">cmd</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// command receiver
</span><span class="k">class</span> <span class="nc">Worker</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">executeCommand</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">workqueue</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">cmd</span> <span class="o">=</span> <span class="n">workqueue</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
<span class="n">workqueue</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
<span class="n">cmd</span><span class="o">-></span><span class="n">execute</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">Server</span> <span class="n">s</span><span class="p">;</span>
<span class="n">Worker</span> <span class="n">w</span><span class="p">;</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">log</span><span class="p">(</span><span class="k">new</span> <span class="n">AddLog</span><span class="p">());</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">db</span><span class="p">(</span><span class="k">new</span> <span class="n">AccessDatabase</span><span class="p">());</span>
<span class="n">vector</span><span class="o"><</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">>></span> <span class="n">v</span> <span class="o">=</span> <span class="p">{</span> <span class="n">log</span><span class="p">,</span> <span class="n">db</span> <span class="p">};</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">macro</span><span class="p">(</span><span class="k">new</span> <span class="n">AccessAndLog</span><span class="p">(</span><span class="n">v</span><span class="p">));</span>
<span class="n">s</span><span class="p">.</span><span class="n">addCommand</span><span class="p">(</span><span class="n">log</span><span class="p">);</span>
<span class="n">s</span><span class="p">.</span><span class="n">addCommand</span><span class="p">(</span><span class="n">db</span><span class="p">);</span>
<span class="n">s</span><span class="p">.</span><span class="n">addCommand</span><span class="p">(</span><span class="n">macro</span><span class="p">);</span>
<span class="n">w</span><span class="p">.</span><span class="n">executeCommand</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Queue</span><span class="o">;</span>
<span class="c1">// abstract command</span>
<span class="kd">interface</span> <span class="nc">Command</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">();</span>
<span class="o">};</span>
<span class="c1">// append log command</span>
<span class="kd">class</span> <span class="nc">AddLog</span> <span class="kd">implements</span> <span class="n">Command</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Open file and append log"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="c1">// access database command</span>
<span class="kd">class</span> <span class="nc">AccessDatabase</span> <span class="kd">implements</span> <span class="n">Command</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Open database and read tables"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="c1">// macro command</span>
<span class="c1">// include a few commands together</span>
<span class="kd">class</span> <span class="nc">AccessAndLog</span> <span class="kd">implements</span> <span class="n">Command</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Command</span><span class="o">[]</span> <span class="n">cmds</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">AccessAndLog</span><span class="o">(</span><span class="n">Command</span><span class="o">[]</span> <span class="n">v</span><span class="o">)</span> <span class="o">{</span>
<span class="n">cmds</span> <span class="o">=</span> <span class="n">v</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">()</span> <span class="o">{</span>
<span class="k">for</span> <span class="o">(</span><span class="n">Command</span> <span class="n">e</span> <span class="o">:</span> <span class="n">cmds</span><span class="o">)</span>
<span class="n">e</span><span class="o">.</span><span class="na">execute</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="c1">// shared queue between Server and Worker</span>
<span class="kd">class</span> <span class="nc">SharedQueue</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Queue</span><span class="o"><</span><span class="n">Command</span><span class="o">></span> <span class="n">workqueue</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LinkedList</span><span class="o"><</span><span class="n">Command</span><span class="o">>();</span>
<span class="o">}</span>
<span class="c1">// command invoker</span>
<span class="kd">class</span> <span class="nc">Server</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">addCommand</span><span class="o">(</span><span class="n">Command</span> <span class="n">cmd</span><span class="o">)</span> <span class="o">{</span>
<span class="n">SharedQueue</span><span class="o">.</span><span class="na">workqueue</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">cmd</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// command receiver</span>
<span class="kd">class</span> <span class="nc">Worker</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">executeCommand</span><span class="o">()</span> <span class="o">{</span>
<span class="k">while</span> <span class="o">(!</span><span class="n">SharedQueue</span><span class="o">.</span><span class="na">workqueue</span><span class="o">.</span><span class="na">isEmpty</span><span class="o">())</span> <span class="o">{</span>
<span class="n">Command</span> <span class="n">c</span> <span class="o">=</span> <span class="n">SharedQueue</span><span class="o">.</span><span class="na">workqueue</span><span class="o">.</span><span class="na">remove</span><span class="o">();</span>
<span class="n">c</span><span class="o">.</span><span class="na">execute</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">CommandTest</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Server</span> <span class="n">s</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Server</span><span class="o">();</span>
<span class="n">Worker</span> <span class="n">w</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Worker</span><span class="o">();</span>
<span class="n">Command</span> <span class="n">log</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AddLog</span><span class="o">();</span>
<span class="n">Command</span> <span class="n">db</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AccessDatabase</span><span class="o">();</span>
<span class="n">Command</span><span class="o">[]</span> <span class="n">v</span> <span class="o">=</span> <span class="o">{</span> <span class="n">log</span><span class="o">,</span> <span class="n">db</span> <span class="o">};</span>
<span class="n">Command</span> <span class="n">macro</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AccessAndLog</span><span class="o">(</span><span class="n">v</span><span class="o">);</span>
<span class="n">s</span><span class="o">.</span><span class="na">addCommand</span><span class="o">(</span><span class="n">log</span><span class="o">);</span>
<span class="n">s</span><span class="o">.</span><span class="na">addCommand</span><span class="o">(</span><span class="n">db</span><span class="o">);</span>
<span class="n">s</span><span class="o">.</span><span class="na">addCommand</span><span class="o">(</span><span class="n">macro</span><span class="o">);</span>
<span class="n">w</span><span class="o">.</span><span class="na">executeCommand</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Singleton
2014-08-30T00:00:00+00:00
http://blog.wjin.org/posts/singleton
<h1 id="singleton">Singleton</h1>
<h2 id="introduction">Introduction</h2>
<p>The singleton pattern ensures a class has only one instance, and provides a global point of access to it.</p>
<p><strong>Multithread</strong></p>
<p>Singleton is the simplest of the design patterns, but it needs care in a multithreaded environment. The usual options are:</p>
<ul>
<li>
<p>a Java <strong>synchronized</strong> method, if performance is not critical to the application</p>
</li>
<li>
<p>an <strong>eagerly created instance</strong> instead of a lazily created one (the JVM guarantees thread safety when it loads the class, although multiple class loaders can still produce more than one instance, and memory is wasted if the instance is never used)</p>
</li>
<li>
<p><strong>double-checked locking</strong> to reduce the use of synchronization in getInstance(), with the unique instance declared volatile. With a plain pointer this idiom is not safe in C++; since C++11, use an atomic pointer or, more simply, a function-local static as in the Cpp code below</p>
</li>
<li>
<p><strong>static code block</strong> in Java and <strong>static constructor</strong> in C#</p>
</li>
</ul>
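<p>For comparison with the Java version, here is a sketch of double-checked locking in C++11 using <code>std::atomic</code> and <code>std::mutex</code>. The class name <code>Registry</code> is hypothetical, and the instance is deliberately never deleted to keep the sketch short. In practice the function-local static shown in the Cpp section below is simpler, because C++11 guarantees that its initialization is thread-safe:</p>

```cpp
#include <atomic>
#include <mutex>

class Registry   // hypothetical singleton class
{
public:
    static Registry &getInstance()
    {
        // First check without the lock. The acquire load pairs with
        // the release store below, so a non-null pointer always
        // refers to a fully constructed object.
        Registry *p = instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> guard(mtx);
            // Second check under the lock: another thread may have
            // created the instance while we were waiting.
            p = instance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Registry();   // intentionally never deleted
                instance.store(p, std::memory_order_release);
            }
        }
        return *p;
    }
    Registry(const Registry &) = delete;
    Registry &operator=(const Registry &) = delete;
private:
    Registry() {}
    static std::atomic<Registry *> instance;
    static std::mutex mtx;
};

std::atomic<Registry *> Registry::instance{nullptr};
std::mutex Registry::mtx;
```

<p>The atomic pointer plays the role that <code>volatile</code> plays in the Java version: it prevents another thread from seeing the pointer before the object behind it is fully constructed.</p>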
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
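<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="k">class</span> <span class="nc">Singleton</span>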
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">static</span> <span class="n">Singleton</span><span class="o">&</span> <span class="n">getInstance</span><span class="p">()</span> <span class="c1">// reference
</span> <span class="p">{</span>
<span class="c1">// Guaranteed to be destroyed. Instantiated on first use
</span> <span class="k">static</span> <span class="n">Singleton</span> <span class="n">instance</span><span class="p">;</span>
<span class="k">return</span> <span class="n">instance</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">dump</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"I am singleton pattern: "</span> <span class="o"><<</span> <span class="k">this</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Singleton</span><span class="p">()</span> <span class="p">{}</span> <span class="c1">// private constructor; the empty {} body is needed here
</span>
<span class="c1">// Delete the copy constructor and copy assignment so that
</span> <span class="c1">// copies of the singleton cannot be made by accident.
</span> <span class="n">Singleton</span><span class="p">(</span><span class="k">const</span> <span class="n">Singleton</span> <span class="o">&</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span> <span class="c1">// no copy construction
</span> <span class="n">Singleton</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">Singleton</span><span class="o">&</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span> <span class="c1">// no copy assignment
</span><span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">Singleton</span><span class="o">::</span><span class="n">getInstance</span><span class="p">().</span><span class="n">dump</span><span class="p">();</span>
<span class="n">Singleton</span><span class="o">::</span><span class="n">getInstance</span><span class="p">().</span><span class="n">dump</span><span class="p">();</span>
<span class="n">Singleton</span><span class="o">::</span><span class="n">getInstance</span><span class="p">().</span><span class="n">dump</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// lazy creation, not thread-safe</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Singleton</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="n">instance</span><span class="o">;</span> <span class="c1">// static</span>
<span class="kd">private</span> <span class="nf">Singleton</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// private</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="nf">getInstance</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// static</span>
<span class="k">if</span> <span class="o">(</span><span class="n">instance</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Singleton</span><span class="o">();</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Singleton</span> <span class="n">obj1</span> <span class="o">=</span> <span class="n">getInstance</span><span class="o">();</span>
<span class="n">Singleton</span> <span class="n">obj2</span> <span class="o">=</span> <span class="n">getInstance</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(</span><span class="n">obj1</span> <span class="o">!=</span> <span class="n">obj2</span><span class="o">)</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Wrong"</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Right"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// synchronized method</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Singleton</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="n">instance</span><span class="o">;</span> <span class="c1">// static</span>
<span class="kd">private</span> <span class="nf">Singleton</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// private</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">synchronized</span> <span class="n">Singleton</span> <span class="nf">getInstance</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// static</span>
<span class="k">if</span> <span class="o">(</span><span class="n">instance</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Singleton</span><span class="o">();</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="c1">// ...</span>
<span class="o">}</span>
<span class="c1">// double-checked locking</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Singleton</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">volatile</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="n">instance</span><span class="o">;</span> <span class="c1">// volatile static</span>
<span class="kd">private</span> <span class="nf">Singleton</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// private</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="nf">getInstance</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// static</span>
<span class="k">if</span> <span class="o">(</span><span class="n">instance</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">synchronized</span> <span class="o">(</span><span class="n">Singleton</span><span class="o">.</span><span class="na">class</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">instance</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Singleton</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="c1">// ...</span>
<span class="o">}</span>
<span class="c1">// eagerly created instance</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Singleton</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Singleton</span><span class="o">();</span> <span class="c1">// static</span>
<span class="kd">private</span> <span class="nf">Singleton</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// private</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="nf">getInstance</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// static</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="c1">// ...</span>
<span class="o">}</span>
<span class="c1">// static code block</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Singleton</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="n">instance</span><span class="o">;</span> <span class="c1">// static</span>
<span class="kd">private</span> <span class="nf">Singleton</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// private</span>
<span class="o">}</span>
<span class="kd">static</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Singleton</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">Singleton</span> <span class="nf">getInstance</span><span class="o">()</span> <span class="o">{</span> <span class="c1">// static</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="c1">// ...</span>
<span class="o">}</span>
</code></pre></div></div>
Factory
2014-08-29T00:00:00+00:00
http://blog.wjin.org/posts/factory
<h1 id="factory">Factory</h1>
<h2 id="introduction">Introduction</h2>
<p>The factory pattern encapsulates object creation so that it <strong>decreases dependencies</strong> between classes and keeps them loosely coupled. We should <strong>depend on abstractions instead of concrete classes</strong>.</p>
<p>There are three kinds of factory pattern: simple factory, factory method and abstract factory.</p>
<p><strong>Simple Factory</strong> is the most basic form: it moves object creation into a separate class or just a plain, normally static, function.</p>
<p><strong>Factory Method</strong> defines an interface for creating an object, but lets <strong>subclasses</strong> decide which class to instantiate. Factory Method lets a class <strong>defer instantiation to subclasses</strong>.</p>
<p><strong>Abstract Factory</strong> groups a series of factory methods to create a family of related objects. We can consider factory method as a special case (creating just one object) of abstract factory.</p>
<p>Besides, factory method can be seen as a special case of the template method pattern, and in C++ there is a technical term for this kind of programming idiom: <strong>NVI</strong> (non-virtual interface).</p>
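<p>The code in this post only covers factory method, so here is a rough sketch of the other two variants. All names (<code>Resampler</code>, <code>Mixer</code>, <code>AudioFactory</code>, <code>makeFactory</code> and the concrete classes) are hypothetical and chosen only for illustration:</p>

```cpp
#include <memory>
#include <string>

// Two product interfaces that belong to one family (hypothetical).
class Resampler
{
public:
    virtual ~Resampler() = default;
    virtual std::string name() const = 0;
};

class Mixer
{
public:
    virtual ~Mixer() = default;
    virtual std::string name() const = 0;
};

class DefaultResampler : public Resampler
{
public:
    std::string name() const override { return "default resampler"; }
};

class DefaultMixer : public Mixer
{
public:
    std::string name() const override { return "default mixer"; }
};

class IntelResampler : public Resampler
{
public:
    std::string name() const override { return "intel resampler"; }
};

class IntelMixer : public Mixer
{
public:
    std::string name() const override { return "intel mixer"; }
};

// Abstract factory: one factory method per product in the family.
class AudioFactory
{
public:
    virtual ~AudioFactory() = default;
    virtual std::shared_ptr<Resampler> createResampler() const = 0;
    virtual std::shared_ptr<Mixer> createMixer() const = 0;
};

class DefaultAudioFactory : public AudioFactory
{
public:
    std::shared_ptr<Resampler> createResampler() const override
    {
        return std::make_shared<DefaultResampler>();
    }
    std::shared_ptr<Mixer> createMixer() const override
    {
        return std::make_shared<DefaultMixer>();
    }
};

class IntelAudioFactory : public AudioFactory
{
public:
    std::shared_ptr<Resampler> createResampler() const override
    {
        return std::make_shared<IntelResampler>();
    }
    std::shared_ptr<Mixer> createMixer() const override
    {
        return std::make_shared<IntelMixer>();
    }
};

// Simple factory: a plain function that picks the concrete factory.
std::unique_ptr<AudioFactory> makeFactory(const std::string &vendor)
{
    if (vendor == "intel")
        return std::unique_ptr<AudioFactory>(new IntelAudioFactory());
    return std::unique_ptr<AudioFactory>(new DefaultAudioFactory());
}
```

<p>Client code asks <code>makeFactory()</code> for a factory once and then obtains every product through the abstract interfaces, so products from different families are never mixed by accident.</p>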
<h2 id="example">Example</h2>
<p>Again, take the Intel SRC (sample rate converter) as an example. In the strategy pattern example, the SRC object is created directly with the <strong>new</strong> operator. Here I use a factory method to encapsulate the <strong>creation behavior</strong>.</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
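<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span>
<span class="cp">#include</span> <span class="cpf">&lt;memory&gt;</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="c1">// Base class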
</span><span class="k">class</span> <span class="nc">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// interface, ignore arguments :(
</span> <span class="c1">// convert data from input to output buffer
</span> <span class="k">virtual</span> <span class="kt">void</span> <span class="n">convert</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// Default SRC
</span><span class="k">class</span> <span class="nc">DefaultSRC</span> <span class="o">:</span> <span class="k">public</span> <span class="n">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">convert</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Default Naive SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Intel SRC
</span><span class="k">class</span> <span class="nc">IntelSRC</span> <span class="o">:</span> <span class="k">public</span> <span class="n">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">convert</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Intel High Performance SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Base Audio Control class
</span><span class="k">class</span> <span class="nc">AudioControl</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// this is a NVI template method to
</span> <span class="c1">// deal with audio data, do not modify
</span> <span class="kt">void</span> <span class="n">processData</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// convert all channel's data to the same sample rate
</span> <span class="c1">// and then mix them together
</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span> <span class="n">resampler</span> <span class="o">=</span> <span class="n">createSRC</span><span class="p">();</span>
<span class="n">resampler</span><span class="o">-></span><span class="n">convert</span><span class="p">();</span>
<span class="n">mixChannel</span><span class="p">();</span>
<span class="c1">// ...
</span> <span class="p">}</span>
<span class="kt">void</span> <span class="n">mixChannel</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"mix data"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// factory...
</span><span class="nl">protected:</span>
<span class="k">virtual</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span> <span class="n">createSRC</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// Default Audio Control class
</span><span class="k">class</span> <span class="nc">DefaultAudioControl</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AudioControl</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">virtual</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span> <span class="n">createSRC</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Create Default SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">DefaultSRC</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Specific Intel Audio Control class
</span><span class="k">class</span> <span class="nc">IntelAudioControl</span> <span class="o">:</span> <span class="k">public</span> <span class="n">AudioControl</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">virtual</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span> <span class="n">createSRC</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Create Intel SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">return</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span><span class="p">(</span><span class="k">new</span> <span class="n">IntelSRC</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">AudioControl</span><span class="o">></span> <span class="n">obj</span><span class="p">(</span><span class="k">new</span> <span class="n">DefaultAudioControl</span><span class="p">());</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">AudioControl</span><span class="o">></span> <span class="n">iobj</span><span class="p">(</span><span class="k">new</span> <span class="n">IntelAudioControl</span><span class="p">());</span>
<span class="n">obj</span><span class="o">-></span><span class="n">processData</span><span class="p">();</span>
<span class="n">iobj</span><span class="o">-></span><span class="n">processData</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">interface</span> <span class="nc">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">DefaultSRC</span> <span class="kd">implements</span> <span class="n">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Default Naive SRC"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">IntelSRC</span> <span class="kd">implements</span> <span class="n">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Intel High Performance SRC"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">abstract</span> <span class="kd">class</span> <span class="nc">AudioControl</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">processData</span><span class="o">()</span> <span class="o">{</span>
<span class="n">SRC</span> <span class="n">resampler</span> <span class="o">=</span> <span class="n">createSRC</span><span class="o">();</span>
<span class="n">resampler</span><span class="o">.</span><span class="na">convert</span><span class="o">();</span>
<span class="n">mixChannel</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">mixChannel</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"mix data"</span><span class="o">);</span>
<span class="o">}</span>
<span class="c1">// factory...</span>
<span class="kd">protected</span> <span class="kd">abstract</span> <span class="n">SRC</span> <span class="nf">createSRC</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">DefaultAudioControl</span> <span class="kd">extends</span> <span class="n">AudioControl</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">protected</span> <span class="n">SRC</span> <span class="nf">createSRC</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Create Default SRC"</span><span class="o">);</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">DefaultSRC</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">IntelAudioControl</span> <span class="kd">extends</span> <span class="n">AudioControl</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">protected</span> <span class="n">SRC</span> <span class="nf">createSRC</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Create Intel SRC"</span><span class="o">);</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">IntelSRC</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Factory</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">AudioControl</span> <span class="n">obj</span> <span class="o">=</span> <span class="k">new</span> <span class="n">DefaultAudioControl</span><span class="o">();</span>
<span class="n">AudioControl</span> <span class="n">iobj</span> <span class="o">=</span> <span class="k">new</span> <span class="n">IntelAudioControl</span><span class="o">();</span>
<span class="n">obj</span><span class="o">.</span><span class="na">processData</span><span class="o">();</span>
<span class="n">iobj</span><span class="o">.</span><span class="na">processData</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Decorator
2014-08-28T00:00:00+00:00
http://blog.wjin.org/posts/decorator
<h1 id="decorator">Decorator</h1>
<h2 id="introduction">Introduction</h2>
<p>The decorator pattern <strong>attaches additional responsibilities</strong> on an object dynamically. Decorators provide a flexible alternative to subclassing for extending functionality. Decorators have <strong>the same supertype</strong> as the objects they decorate.</p>
<h2 id="example">Example</h2>
<p>Everybody plays games, right? When we get a fantastic weapon, we can hardly wait to <em>decorate</em> our character into a superman. Here is an example that uses Sword and Wand to decorate our warrior and wizard respectively, plus a crazy warrior carrying both Sword and Wand.</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// abstract component
</span><span class="k">class</span> <span class="nc">Character</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">attack</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// concrete component
</span><span class="k">class</span> <span class="nc">Warrior</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Character</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">attack</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Warrior is attacking"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// another concrete component
</span><span class="k">class</span> <span class="nc">Wizard</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Character</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">attack</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Wizard is attacking"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// abstract decorator
// decorator has the same type as the object
// that it will decorate
</span><span class="k">class</span> <span class="nc">Weapon</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Character</span>
<span class="p">{</span>
<span class="p">};</span>
<span class="c1">// concrete decorator
</span><span class="k">class</span> <span class="nc">Sword</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Weapon</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">person</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Sword</span><span class="p">(</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">ptr</span><span class="p">)</span> <span class="o">:</span> <span class="n">person</span><span class="p">(</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">attack</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Begin..."</span><span class="p">;</span>
<span class="n">person</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"with Sword"</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"...End"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// another concrete decorator
</span><span class="k">class</span> <span class="nc">Wand</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Weapon</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">person</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Wand</span><span class="p">(</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">ptr</span><span class="p">)</span> <span class="o">:</span> <span class="n">person</span><span class="p">(</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">attack</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Begin..."</span><span class="p">;</span>
<span class="n">person</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"with Wand"</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"...End"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">war</span><span class="p">(</span><span class="k">new</span> <span class="n">Warrior</span><span class="p">());</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">wiz</span><span class="p">(</span><span class="k">new</span> <span class="n">Wizard</span><span class="p">());</span>
<span class="n">war</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">wiz</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">warWithSword</span><span class="p">(</span><span class="k">new</span> <span class="n">Sword</span><span class="p">(</span><span class="n">war</span><span class="p">));</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">warWithWand</span><span class="p">(</span><span class="k">new</span> <span class="n">Wand</span><span class="p">(</span><span class="n">war</span><span class="p">));</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">wizWithSword</span><span class="p">(</span><span class="k">new</span> <span class="n">Sword</span><span class="p">(</span><span class="n">wiz</span><span class="p">));</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">wizWithWand</span><span class="p">(</span><span class="k">new</span> <span class="n">Wand</span><span class="p">(</span><span class="n">wiz</span><span class="p">));</span>
<span class="n">warWithSword</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">warWithWand</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">wizWithSword</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="n">wizWithWand</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="c1">// crazy warrior with sword and wand :(
</span> <span class="n">shared_ptr</span><span class="o"><</span><span class="n">Character</span><span class="o">></span> <span class="n">warWithSwordAndWand</span><span class="p">(</span><span class="k">new</span> <span class="n">Wand</span><span class="p">(</span><span class="n">warWithSword</span><span class="p">));</span>
<span class="n">warWithSwordAndWand</span><span class="o">-></span><span class="n">attack</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
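<p>In the listing above, Sword and Wand each keep their own pointer to the wrapped character and repeat the same forwarding logic. One possible variation, sketched here under a few assumptions, moves that pointer into the Weapon base class. The <code>weaponName</code> parameter and the choice to return a <code>std::string</code> instead of printing are illustrative changes, not part of the original example.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Abstract component. attack() returns a description instead of printing
// (an assumption made here so the result is easy to inspect).
class Character
{
public:
    virtual ~Character() = default;
    virtual std::string attack() const = 0;
};

// Concrete component
class Warrior : public Character
{
public:
    std::string attack() const override { return "Warrior is attacking"; }
};

// Base decorator: holds the wrapped character once, for every weapon.
// The protected constructor keeps it from being created on its own.
class Weapon : public Character
{
private:
    std::shared_ptr<Character> person;
    std::string name; // hypothetical field, used only for the description
protected:
    Weapon(std::shared_ptr<Character> ptr, std::string weaponName)
        : person(std::move(ptr)), name(std::move(weaponName))
    {
        assert(person != nullptr); // a decorator needs something to wrap
    }
public:
    // Delegate to the wrapped object, then add this decorator's part
    std::string attack() const override
    {
        return person->attack() + " with " + name;
    }
};

// Concrete decorators only say which weapon they are
class Sword : public Weapon
{
public:
    explicit Sword(std::shared_ptr<Character> ptr)
        : Weapon(std::move(ptr), "Sword") {}
};

class Wand : public Weapon
{
public:
    explicit Wand(std::shared_ptr<Character> ptr)
        : Weapon(std::move(ptr), "Wand") {}
};
```

<p>Wrapping a Warrior in a Sword and then in a Wand yields the description "Warrior is attacking with Sword with Wand", because each decorator appends to the result of the object it wraps.</p>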
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// abstract component</span>
<span class="kd">abstract</span> <span class="kd">class</span> <span class="nc">Character</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">attack</span><span class="o">()</span> <span class="o">{</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// concrete component</span>
<span class="kd">class</span> <span class="nc">Warrior</span> <span class="kd">extends</span> <span class="n">Character</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">attack</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Warrior is attacking"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// another concrete component</span>
<span class="kd">class</span> <span class="nc">Wizard</span> <span class="kd">extends</span> <span class="n">Character</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">attack</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Wizard is attacking"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// abstract decorator</span>
<span class="c1">// decorator has the same type as the object</span>
<span class="c1">// that it will decorate</span>
<span class="kd">class</span> <span class="nc">Weapon</span> <span class="kd">extends</span> <span class="n">Character</span> <span class="o">{</span>
<span class="o">}</span>
<span class="c1">// concrete decorator</span>
<span class="kd">class</span> <span class="nc">Sword</span> <span class="kd">extends</span> <span class="n">Weapon</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Character</span> <span class="n">person</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">Sword</span><span class="o">(</span><span class="n">Character</span> <span class="n">ptr</span><span class="o">)</span> <span class="o">{</span>
<span class="n">person</span> <span class="o">=</span> <span class="n">ptr</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">attack</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">print</span><span class="o">(</span><span class="s">"Begin..."</span><span class="o">);</span>
<span class="n">person</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">print</span><span class="o">(</span><span class="s">"with Sword"</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"...End"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// concrete decorator</span>
<span class="kd">class</span> <span class="nc">Wand</span> <span class="kd">extends</span> <span class="n">Weapon</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Character</span> <span class="n">person</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">Wand</span><span class="o">(</span><span class="n">Character</span> <span class="n">ptr</span><span class="o">)</span> <span class="o">{</span>
<span class="n">person</span> <span class="o">=</span> <span class="n">ptr</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">attack</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">print</span><span class="o">(</span><span class="s">"Begin..."</span><span class="o">);</span>
<span class="n">person</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">print</span><span class="o">(</span><span class="s">"with Wand"</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"...End"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Decorator</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Character</span> <span class="n">war</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Warrior</span><span class="o">();</span>
<span class="n">Character</span> <span class="n">wiz</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Wizard</span><span class="o">();</span>
<span class="n">war</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">wiz</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">Character</span> <span class="n">warWithSword</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Sword</span><span class="o">(</span><span class="n">war</span><span class="o">);</span>
<span class="n">Character</span> <span class="n">warWithWand</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Wand</span><span class="o">(</span><span class="n">war</span><span class="o">);</span>
<span class="n">Character</span> <span class="n">wizWithSword</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Sword</span><span class="o">(</span><span class="n">wiz</span><span class="o">);</span>
<span class="n">Character</span> <span class="n">wizWithWand</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Wand</span><span class="o">(</span><span class="n">wiz</span><span class="o">);</span>
<span class="n">warWithSword</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">warWithWand</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">wizWithSword</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="n">wizWithWand</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="c1">// crazy warrior with sword and wand :(</span>
<span class="n">Character</span> <span class="n">warWithSwordAndWand</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Wand</span><span class="o">(</span><span class="n">warWithSword</span><span class="o">);</span>
<span class="n">warWithSwordAndWand</span><span class="o">.</span><span class="na">attack</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Strategy
2014-08-27T00:00:00+00:00
http://blog.wjin.org/posts/strategy
<h1 id="strategy">Strategy</h1>
<h2 id="introduction">Introduction</h2>
<p>The strategy pattern defines <strong>a family of algorithms</strong>, encapsulates each one, and makes them <strong>interchangeable</strong>. Strategy lets the algorithm <strong>vary independently from clients</strong> that use it.</p>
<h2 id="example">Example</h2>
<p>In previous work on the Android audio framework, I was responsible for integrating Intel SRC (sample rate conversion) into the framework, because Intel SRC performs better than the default SRC on x86 platforms.</p>
<p>The default design is easy to understand and extend because it follows the Strategy pattern. I implemented a new derived class that encapsulates the Intel SRC algorithm (a family of algorithms).</p>
<p>We can also switch back to the default SRC dynamically when Intel SRC does not support a specific kind of conversion (interchangeable).</p>
<p>The client, audio control (AudioFlinger in Android), does not care which algorithm it is using; it simply delegates the conversion to the resampler (vary independently from clients).</p>
<h2 id="code">Code</h2>
<h3 id="cpp">Cpp</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Base class
</span><span class="k">class</span> <span class="nc">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// interface, ignore arguments :(
</span> <span class="c1">// convert data from input to output buffer
</span> <span class="k">virtual</span> <span class="kt">void</span> <span class="n">convert</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// Default SRC
</span><span class="k">class</span> <span class="nc">DefaultSRC</span> <span class="o">:</span> <span class="k">public</span> <span class="n">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">convert</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Default Naive SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Intel SRC
</span><span class="k">class</span> <span class="nc">IntelSRC</span> <span class="o">:</span> <span class="k">public</span> <span class="n">SRC</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">convert</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Intel High Performance SRC"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Audio Control class
</span><span class="k">class</span> <span class="nc">AudioControl</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">shared_ptr</span><span class="o"><</span><span class="n">SRC</span><span class="o">></span> <span class="n">m_resampler</span><span class="p">;</span> <span class="c1">// encapsulate SRC algorithm
</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">AudioControl</span><span class="p">()</span> <span class="o">:</span> <span class="n">m_resampler</span><span class="p">(</span><span class="k">new</span> <span class="n">DefaultSRC</span><span class="p">())</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="n">processData</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// convert all channel's data to the same sample rate
</span> <span class="c1">// and then mix them together
</span> <span class="n">m_resampler</span><span class="o">-></span><span class="n">convert</span><span class="p">();</span>
<span class="n">mixChannel</span><span class="p">();</span>
<span class="c1">// ...
</span> <span class="p">}</span>
<span class="kt">void</span> <span class="n">mixChannel</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"mix data"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// dynamically switch SRC
</span> <span class="kt">void</span> <span class="n">switchSRC</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_resampler</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="k">new</span> <span class="n">IntelSRC</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">AudioControl</span> <span class="n">obj</span><span class="p">;</span>
<span class="n">obj</span><span class="p">.</span><span class="n">processData</span><span class="p">();</span>
<span class="n">obj</span><span class="p">.</span><span class="n">switchSRC</span><span class="p">();</span>
<span class="n">obj</span><span class="p">.</span><span class="n">processData</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">interface</span> <span class="nc">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">DefaultSRC</span> <span class="kd">implements</span> <span class="n">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Default Naive SRC"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">IntelSRC</span> <span class="kd">implements</span> <span class="n">SRC</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">convert</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Intel High Performance SRC"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">AudioControl</span> <span class="o">{</span>
<span class="kd">private</span> <span class="n">SRC</span> <span class="n">m_resampler</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">AudioControl</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">m_resampler</span> <span class="o">=</span> <span class="k">new</span> <span class="n">DefaultSRC</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">processData</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">m_resampler</span><span class="o">.</span><span class="na">convert</span><span class="o">();</span>
<span class="n">mixChannel</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">mixChannel</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"mix data"</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">swithSRC</span><span class="o">()</span>
<span class="o">{</span>
<span class="n">m_resampler</span> <span class="o">=</span> <span class="k">new</span> <span class="n">IntelSRC</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Strategy</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">AudioControl</span> <span class="n">obj</span> <span class="o">=</span> <span class="k">new</span> <span class="n">AudioControl</span><span class="o">();</span>
<span class="n">obj</span><span class="o">.</span><span class="na">processData</span><span class="o">();</span>
<span class="n">obj</span><span class="o">.</span><span class="na">swithSRC</span><span class="o">();</span>
<span class="n">obj</span><span class="o">.</span><span class="na">processData</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Observer
2014-08-27T00:00:00+00:00
http://blog.wjin.org/posts/observer
<h1 id="observer">Observer</h1>
<h2 id="introduction">Introduction</h2>
<p>The observer pattern defines a <strong>one-to-many</strong> dependency between objects so that when one object changes state, all of its dependents are <strong>notified and updated</strong> automatically.</p>
<p><strong>Key Point</strong></p>
<ul>
<li>
<p>Subject and observers are loosely coupled</p>
</li>
<li>
<p>Do not rely on notification order</p>
</li>
<li>
<p>Pull/push</p>
</li>
</ul>
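<p>To make the pull/push point concrete: in the push style the subject hands the data to each observer, while in the pull style it only signals a change and each observer reads what it needs from the subject. Below is a minimal C++ sketch of the pull style. The names <code>PullSubject</code>, <code>PullObserver</code>, and <code>LastSeen</code> are hypothetical and exist only for illustration.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

class PullSubject;

// Pull-style observer: it is handed the subject, not the data itself.
class PullObserver {
public:
    virtual ~PullObserver() = default;
    virtual void update(const PullSubject& subject) = 0;
};

// Hypothetical subject that only signals "something changed".
class PullSubject {
public:
    void attach(PullObserver* observer) {
        assert(observer != nullptr);
        observers_.push_back(observer);
    }

    // Store the new state, then let every observer pull what it wants.
    void setState(const std::string& state) {
        state_ = state;
        for (PullObserver* observer : observers_) {
            observer->update(*this);
        }
    }

    const std::string& state() const { return state_; }

private:
    std::vector<PullObserver*> observers_;
    std::string state_;
};

// Example observer that pulls the state and remembers it.
class LastSeen : public PullObserver {
public:
    void update(const PullSubject& subject) override { seen = subject.state(); }
    std::string seen;
};
```

<p>An observer that only cares about part of the state can query just that part, which is the usual argument for pull. Push is simpler when every observer wants the same payload.</p>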
<h2 id="example">Example</h2>
<p>Nowadays, many people use social networking sites such as Facebook, Twitter, and LinkedIn to make friends and have fun. If you post a new message, all the friends who follow you get it almost instantly. This is the observer pattern, or the publish/subscribe model, as IT people call it :(</p>
<p>Another example is the Redis NoSQL database. Redis has a feature called pub/sub: a client can subscribe to any channels it is interested in and then receives notifications from those channels. This is also a <strong>one-to-many</strong> relationship, and Redis uses a list to record the clients for each channel. In my opinion, the observer pattern is an abstract concept that shows up in a lot of software, and how it is implemented is up to you.</p>
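<p>The channel-to-subscribers bookkeeping described above can be sketched in a few lines of C++. This is an in-process toy, not Redis's implementation or API: <code>ChannelBroker</code>, <code>subscribe</code>, and <code>publish</code> are hypothetical names, and a <code>std::vector</code> stands in for the per-channel client list.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-process broker; illustrative only, not Redis's API.
class ChannelBroker {
public:
    using Callback = std::function<void(const std::string&)>;

    // Record a subscriber callback under the given channel name.
    void subscribe(const std::string& channel, Callback callback) {
        assert(callback && "subscriber callback must be callable");
        subscribers_[channel].push_back(std::move(callback));
    }

    // Deliver the message to every subscriber of the channel and
    // return how many subscribers were notified.
    std::size_t publish(const std::string& channel,
                        const std::string& message) const {
        auto it = subscribers_.find(channel);
        if (it == subscribers_.end()) {
            return 0;  // nobody listens on this channel
        }
        for (const Callback& callback : it->second) {
            callback(message);
        }
        return it->second.size();
    }

private:
    // One subscriber list per channel: the one-to-many relationship.
    std::map<std::string, std::vector<Callback>> subscribers_;
};
```

<p>Publishing to a channel with no subscribers simply notifies nobody, which keeps publishers unaware of who (if anyone) is listening.</p>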
<h2 id="code">Code</h2>
<p>The demo code shows that all followers (Beauties) are notified when the subject (HandsomeEric) publishes a new post. And they are, of course, loosely coupled.</p>
<h3 id="cpp">Cpp</h3>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Observer</span><span class="p">;</span>
<span class="c1">// abstract subject
</span><span class="k">class</span> <span class="nc">Subject</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">registerObserver</span><span class="p">(</span><span class="n">Observer</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">removeObserver</span><span class="p">(</span><span class="n">Observer</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">notifyObservers</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// abstract observer
</span><span class="k">class</span> <span class="nc">Observer</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// concrete observer
</span><span class="k">class</span> <span class="nc">Beauty1</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Observer</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Beauty1 got message: "</span> <span class="o"><<</span> <span class="n">s</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">follow</span><span class="p">(</span><span class="n">Subject</span> <span class="o">*</span><span class="n">sub</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">sub</span><span class="o">-></span><span class="n">registerObserver</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">unfollow</span><span class="p">(</span><span class="n">Subject</span> <span class="o">*</span><span class="n">sub</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">sub</span><span class="o">-></span><span class="n">removeObserver</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// another concrete observer
</span><span class="k">class</span> <span class="nc">Beauty2</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Observer</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Beauty2 got message: "</span> <span class="o"><<</span> <span class="n">s</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">follow</span><span class="p">(</span><span class="n">Subject</span> <span class="o">*</span><span class="n">sub</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">sub</span><span class="o">-></span><span class="n">registerObserver</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">unfollow</span><span class="p">(</span><span class="n">Subject</span> <span class="o">*</span><span class="n">sub</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">sub</span><span class="o">-></span><span class="n">removeObserver</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// concrete subject
</span><span class="k">class</span> <span class="nc">HandsomeEric</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Subject</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">list</span><span class="o"><</span><span class="n">Observer</span><span class="o">*></span> <span class="n">followers</span><span class="p">;</span>
<span class="n">string</span> <span class="n">latestPost</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">registerObserver</span><span class="p">(</span><span class="n">Observer</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">followers</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">obj</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">removeObserver</span><span class="p">(</span><span class="n">Observer</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// list::remove erases every matching pointer; erasing inside the loop would invalidate the iterator</span>
<span class="n">followers</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">obj</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">virtual</span> <span class="kt">void</span> <span class="n">notifyObservers</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">e</span> <span class="o">:</span> <span class="n">followers</span><span class="p">)</span> <span class="p">{</span>
<span class="n">e</span><span class="o">-></span><span class="n">update</span><span class="p">(</span><span class="n">latestPost</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">newPost</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">latestPost</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
<span class="n">notifyObservers</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">HandsomeEric</span> <span class="n">eric</span><span class="p">;</span>
<span class="n">Beauty1</span> <span class="n">b1</span><span class="p">;</span>
<span class="n">Beauty2</span> <span class="n">b2</span><span class="p">;</span>
<span class="n">b1</span><span class="p">.</span><span class="n">follow</span><span class="p">(</span><span class="o">&</span><span class="n">eric</span><span class="p">);</span>
<span class="n">b2</span><span class="p">.</span><span class="n">follow</span><span class="p">(</span><span class="o">&</span><span class="n">eric</span><span class="p">);</span>
<span class="n">eric</span><span class="p">.</span><span class="n">newPost</span><span class="p">(</span><span class="s">"anyone wants to have a dinner with me?"</span><span class="p">);</span>
<span class="n">b1</span><span class="p">.</span><span class="n">unfollow</span><span class="p">(</span><span class="o">&</span><span class="n">eric</span><span class="p">);</span>
<span class="n">eric</span><span class="p">.</span><span class="n">newPost</span><span class="p">(</span><span class="s">"anyone wants to watch football match tonight?"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="java">Java</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">java.util.LinkedList</span><span class="o">;</span>
<span class="c1">// abstract Subject</span>
<span class="kd">interface</span> <span class="nc">Subject</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">registerObserver</span><span class="o">(</span><span class="n">Observer</span> <span class="n">obj</span><span class="o">);</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">removeObserver</span><span class="o">(</span><span class="n">Observer</span> <span class="n">obj</span><span class="o">);</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">notifyObservers</span><span class="o">();</span>
<span class="o">};</span>
<span class="c1">// abstract Observer</span>
<span class="kd">interface</span> <span class="nc">Observer</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">);</span>
<span class="o">};</span>
<span class="c1">// concrete Observer</span>
<span class="kd">class</span> <span class="nc">Beauty1</span> <span class="kd">implements</span> <span class="n">Observer</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Beauty1 got message: "</span> <span class="o">+</span> <span class="n">s</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">follow</span><span class="o">(</span><span class="n">Subject</span> <span class="n">sub</span><span class="o">)</span> <span class="o">{</span>
<span class="n">sub</span><span class="o">.</span><span class="na">registerObserver</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">unfollow</span><span class="o">(</span><span class="n">Subject</span> <span class="n">sub</span><span class="o">)</span> <span class="o">{</span>
<span class="n">sub</span><span class="o">.</span><span class="na">removeObserver</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="c1">// another concrete Observer</span>
<span class="kd">class</span> <span class="nc">Beauty2</span> <span class="kd">implements</span> <span class="n">Observer</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Beauty2 got message: "</span> <span class="o">+</span> <span class="n">s</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">follow</span><span class="o">(</span><span class="n">Subject</span> <span class="n">sub</span><span class="o">)</span> <span class="o">{</span>
<span class="n">sub</span><span class="o">.</span><span class="na">registerObserver</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">unfollow</span><span class="o">(</span><span class="n">Subject</span> <span class="n">sub</span><span class="o">)</span> <span class="o">{</span>
<span class="n">sub</span><span class="o">.</span><span class="na">removeObserver</span><span class="o">(</span><span class="k">this</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="c1">// concrete Subject</span>
<span class="kd">class</span> <span class="nc">HandsomeEric</span> <span class="kd">implements</span> <span class="n">Subject</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">LinkedList</span><span class="o"><</span><span class="n">Observer</span><span class="o">></span> <span class="n">followers</span><span class="o">;</span>
<span class="kd">private</span> <span class="n">String</span> <span class="n">latestPost</span><span class="o">;</span>
<span class="n">HandsomeEric</span><span class="o">()</span> <span class="o">{</span>
<span class="n">followers</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LinkedList</span><span class="o"><</span><span class="n">Observer</span><span class="o">>();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">registerObserver</span><span class="o">(</span><span class="n">Observer</span> <span class="n">obj</span><span class="o">)</span> <span class="o">{</span>
<span class="n">followers</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">removeObserver</span><span class="o">(</span><span class="n">Observer</span> <span class="n">obj</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// remove(Object) avoids modifying the list while a for-each loop is iterating over it</span>
<span class="n">followers</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">notifyObservers</span><span class="o">()</span> <span class="o">{</span>
<span class="k">for</span> <span class="o">(</span><span class="n">Observer</span> <span class="n">e</span> <span class="o">:</span> <span class="n">followers</span><span class="o">)</span> <span class="o">{</span>
<span class="n">e</span><span class="o">.</span><span class="na">update</span><span class="o">(</span><span class="n">latestPost</span><span class="o">);</span> <span class="c1">// push</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">newPost</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
<span class="n">latestPost</span> <span class="o">=</span> <span class="n">s</span><span class="o">;</span>
<span class="n">notifyObservers</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Observe</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">HandsomeEric</span> <span class="n">eric</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HandsomeEric</span><span class="o">();</span>
<span class="n">Beauty1</span> <span class="n">b1</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Beauty1</span><span class="o">();</span>
<span class="n">Beauty2</span> <span class="n">b2</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Beauty2</span><span class="o">();</span>
<span class="n">b1</span><span class="o">.</span><span class="na">follow</span><span class="o">(</span><span class="n">eric</span><span class="o">);</span>
<span class="n">b2</span><span class="o">.</span><span class="na">follow</span><span class="o">(</span><span class="n">eric</span><span class="o">);</span>
<span class="n">eric</span><span class="o">.</span><span class="na">newPost</span><span class="o">(</span><span class="s">"anyone wants to have a dinner with me?"</span><span class="o">);</span>
<span class="n">b1</span><span class="o">.</span><span class="na">unfollow</span><span class="o">(</span><span class="n">eric</span><span class="o">);</span>
<span class="n">eric</span><span class="o">.</span><span class="na">newPost</span><span class="o">(</span><span class="s">"anyone wants to watch football match tonight?"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
Binary Indexed Tree
2014-08-26T00:00:00+00:00
http://blog.wjin.org/posts/binary-indexed-tree
<h1 id="introduction">Introduction</h1>
<p>A <strong>Fenwick tree</strong> or <strong>Binary Indexed Tree</strong> is a data structure providing efficient methods for calculation and manipulation of the <strong>prefix sums</strong> or <strong>cumulative frequency</strong> of a table of values. It was proposed by Peter Fenwick in 1994. Here is an excellent explanation from <a href="http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees">topcoder</a>.</p>
<p><strong>Question Definition</strong></p>
<p>Let’s define the following problem: We have n boxes. Possible queries are:</p>
<ol>
<li>
<p>add marble to box i</p>
</li>
<li>
<p>sum marbles from box k to box l</p>
</li>
</ol>
<p>The naive solution has time complexity of O(1) for query 1 and O(n) for query 2. Suppose we make m queries. The worst case (when all queries are 2) has time complexity O(n * m).</p>
<p>Using a suitable data structure (e.g., a segment tree, as used for <strong>RMQ</strong>), we can solve this problem with a worst-case time complexity of <strong>O(m log n)</strong>.</p>
<p>Another approach is to use a Binary Indexed Tree, which also has a worst-case time complexity of <strong>O(m log n)</strong>. However, a Binary Indexed Tree is much easier to code and requires less memory than the RMQ approach.</p>
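<p>As a rough illustration of the two queries in the marble problem, here is a minimal C++ sketch. Query 2 (the sum from box k to box l) is answered as the difference of two prefix sums. The class name <code>MarbleBoxes</code> and its method names are hypothetical, and boxes are assumed to be numbered from 1.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for the marble problem; boxes are numbered 1..n.
class MarbleBoxes {
public:
    explicit MarbleBoxes(int n) : tree_(n + 1, 0) { assert(n > 0); }

    // Query 1: add `count` marbles to box i, in O(log n).
    void add(int i, int count) {
        assert(i >= 1 && i < static_cast<int>(tree_.size()));
        for (; i < static_cast<int>(tree_.size()); i += i & (-i)) {
            tree_[i] += count;
        }
    }

    // Marbles in boxes 1..i, in O(log n).
    int prefix(int i) const {
        int sum = 0;
        for (; i > 0; i -= i & (-i)) {
            sum += tree_[i];
        }
        return sum;
    }

    // Query 2: marbles in boxes k..l, as a difference of two prefix sums.
    int rangeSum(int k, int l) const {
        assert(1 <= k && k <= l && l < static_cast<int>(tree_.size()));
        return prefix(l) - prefix(k - 1);
    }

private:
    std::vector<int> tree_;  // tree_[0] is unused
};
```

<p>Both queries touch at most one array cell per bit of the index, which is where the O(m log n) bound for m queries comes from.</p>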
<p><strong>Basic idea</strong></p>
<p>Let f[idx] be the frequency at index idx, and let r be the position of the <strong>last (least significant) 1 bit</strong> of idx, counting from 0. For example, if idx = 100100 in binary, then r is 2.</p>
<blockquote>
<p>bit[idx] = f[idx-2^r+1] + f[idx-2^r+2] + … + f[idx]</p>
</blockquote>
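<p>To see which frequencies a given cell is responsible for, the formula above can be turned into a small helper. This is only an illustrative sketch: <code>coveredRange</code> is a hypothetical name, and <code>lowbit(idx)</code> plays the role of 2^r.</p>

```cpp
#include <cassert>
#include <utility>

// Lowest set bit of idx, i.e. 2^r in the notation used above.
inline int lowbit(int idx) { return idx & (-idx); }

// Inclusive range [first, second] of f[] indices that bit[idx] sums up.
inline std::pair<int, int> coveredRange(int idx) {
    assert(idx > 0 && "BIT indices start at 1");
    return {idx - lowbit(idx) + 1, idx};
}
```

<p>For idx = 100100 in binary (36 in decimal), 2^r is 4, so bit[36] covers f[33] through f[36]. An odd index covers only itself, and a power of two covers everything from f[1] up to itself.</p>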
<p><strong>Useful Tips</strong></p>
<ol>
<li>
<p>BIT input data often have an ordering dependency defined by the problem, so we normally need to sort them first.</p>
</li>
<li>
<p>A BIT covers a fixed index range, so when the input values span too large a range (a huge BIT array cannot be allocated), we need to discretize them first.</p>
</li>
<li>
<p>2D BIT</p>
</li>
</ol>
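<p>The discretization mentioned in tip 2 is usually done by replacing each value with its rank among the distinct values. Below is a minimal C++ sketch of that idea. The function name <code>discretize</code> is hypothetical, and ranks start at 1 so that they can be used directly as BIT indices.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Map each value to its 1-based rank among the distinct input values,
// so a BIT of size (number of distinct values) is enough.
std::vector<int> discretize(const std::vector<long long>& values) {
    std::vector<long long> sorted(values);
    std::sort(sorted.begin(), sorted.end());
    sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());

    std::vector<int> ranks;
    ranks.reserve(values.size());
    for (long long v : values) {
        auto it = std::lower_bound(sorted.begin(), sorted.end(), v);
        assert(it != sorted.end() && *it == v);  // every value is present
        ranks.push_back(static_cast<int>(it - sorted.begin()) + 1);
    }
    return ranks;
}
```

<p>The relative order of the values is preserved, which is all that order-based BIT problems (such as counting inversions) rely on.</p>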
<p><strong>Code Template</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BIT</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">maxIdx</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">;</span> <span class="c1">// max index
</span> <span class="kt">int</span> <span class="n">bitMask</span><span class="p">;</span> <span class="c1">// bit mask for binary search
</span> <span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">tree</span><span class="p">;</span> <span class="c1">// binary indexed tree array, tree[0] is not used
</span>
<span class="k">public</span><span class="o">:</span>
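<span class="n">BIT</span><span class="p">()</span> <span class="o">:</span> <span class="n">tree</span><span class="p">(</span><span class="n">maxIdx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">// indices 1..maxIdx must be valid</span>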
<span class="p">{</span>
<span class="n">bitMask</span> <span class="o">=</span> <span class="n">highbit</span><span class="p">(</span><span class="n">maxIdx</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// clear idx except the LSB '1'
</span> <span class="c1">// 10101000 -> 00001000
</span> <span class="kt">int</span> <span class="n">lowbit</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">idx</span> <span class="o">&</span> <span class="p">(</span><span class="o">-</span><span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// clear idx except the MSB '1'
</span> <span class="c1">// 01010000 -> 01000000
</span> <span class="kt">int</span> <span class="n">highbit</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// strip low bits until only the MSB remains; looping until zero would always return 0</span>
<span class="k">while</span> <span class="p">(</span><span class="n">idx</span> <span class="o">-</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">))</span> <span class="p">{</span>
<span class="n">idx</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">idx</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// update tree
</span> <span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">idx</span> <span class="o"><=</span> <span class="n">maxIdx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tree</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">+=</span> <span class="n">num</span><span class="p">;</span>
<span class="n">idx</span> <span class="o">+=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// get cumulative sum of f[1]...f[idx]
</span> <span class="kt">int</span> <span class="n">read</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">tree</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
<span class="n">idx</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// get frequency of f[idx]
</span> <span class="cm">/*
int readSingle(int idx)
{
return (read(idx) - read(idx - 1));
}
*/</span>
<span class="kt">int</span> <span class="n">readSingle</span><span class="p">(</span><span class="kt">int</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">tree</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span> <span class="c1">// sum will be decreased
</span> <span class="k">if</span> <span class="p">(</span><span class="n">idx</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// special case
</span> <span class="kt">int</span> <span class="n">z</span> <span class="o">=</span> <span class="n">idx</span> <span class="o">-</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span> <span class="c1">// make z first
</span>
<span class="n">idx</span><span class="o">--</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">idx</span> <span class="o">!=</span> <span class="n">z</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// at some iteration idx (y) will become z
</span> <span class="n">sum</span> <span class="o">-=</span> <span class="n">tree</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
<span class="c1">// subtract tree frequency which is between y and "the same path"
</span> <span class="n">idx</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">idx</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// the following two functions use binary search to find an index
</span> <span class="c1">// whose cumulative frequency is equal to the target cumFre.
</span> <span class="c1">// be careful: binary search is applicable if and only if
</span> <span class="c1">// f[i] >= 0, 1 <= i <= maxIdx
</span>
<span class="c1">// bitMask - initially, it is the greatest bit of maxIdx
</span> <span class="c1">// the mask stores the interval which should be searched
</span>
<span class="c1">// if more than one index in the tree has the same
</span> <span class="c1">// cumulative frequency, this procedure will return
</span> <span class="c1">// one of them (we do not know which one)
</span> <span class="kt">int</span> <span class="n">find</span><span class="p">(</span><span class="kt">int</span> <span class="n">cumFre</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// this var is the result of the function
</span> <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">bitMask</span><span class="p">;</span> <span class="c1">// local copy, so bitMask stays intact for later calls
</span>
<span class="k">while</span> <span class="p">((</span><span class="n">mask</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&&</span> <span class="p">(</span><span class="n">idx</span> <span class="o"><</span> <span class="n">maxIdx</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// nobody likes overflow :)
</span> <span class="kt">int</span> <span class="n">tIdx</span> <span class="o">=</span> <span class="n">idx</span> <span class="o">+</span> <span class="n">mask</span><span class="p">;</span> <span class="c1">// we make midpoint of interval
</span> <span class="n">mask</span> <span class="o">>>=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// halve current interval
</span> <span class="k">if</span> <span class="p">(</span><span class="n">tIdx</span> <span class="o">></span> <span class="n">maxIdx</span><span class="p">)</span> <span class="c1">// midpoint lies outside the tree
</span> <span class="k">continue</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cumFre</span> <span class="o">==</span> <span class="n">tree</span><span class="p">[</span><span class="n">tIdx</span><span class="p">])</span> <span class="c1">// if it is equal, we just return tIdx
</span> <span class="k">return</span> <span class="n">tIdx</span><span class="p">;</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">cumFre</span> <span class="o">></span> <span class="n">tree</span><span class="p">[</span><span class="n">tIdx</span><span class="p">])</span> <span class="p">{</span>
<span class="c1">// if tree frequency "can fit" into cumFre,
</span> <span class="c1">// then include it
</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">tIdx</span><span class="p">;</span> <span class="c1">// update index
</span> <span class="n">cumFre</span> <span class="o">-=</span> <span class="n">tree</span><span class="p">[</span><span class="n">tIdx</span><span class="p">];</span> <span class="c1">// set frequency for next loop
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cumFre</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="c1">// maybe given cumulative frequency doesn't exist
</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">else</span>
<span class="k">return</span> <span class="n">idx</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// if more than one index in the tree has the same
</span> <span class="c1">// cumulative frequency, this procedure will return
</span> <span class="c1">// the greatest one
</span> <span class="kt">int</span> <span class="n">findG</span><span class="p">(</span><span class="kt">int</span> <span class="n">cumFre</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">bitMask</span><span class="p">;</span> <span class="c1">// local copy, so bitMask stays intact for later calls
</span> <span class="k">while</span> <span class="p">((</span><span class="n">mask</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&&</span> <span class="p">(</span><span class="n">idx</span> <span class="o"><</span> <span class="n">maxIdx</span><span class="p">))</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">tIdx</span> <span class="o">=</span> <span class="n">idx</span> <span class="o">+</span> <span class="n">mask</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">tIdx</span> <span class="o"><=</span> <span class="n">maxIdx</span><span class="p">)</span> <span class="o">&&</span> <span class="p">(</span><span class="n">cumFre</span> <span class="o">>=</span> <span class="n">tree</span><span class="p">[</span><span class="n">tIdx</span><span class="p">]))</span> <span class="p">{</span>
<span class="c1">// if current cumulative frequency is equal to cumFre,
</span> <span class="c1">// we are still looking for higher index (if exists)
</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">tIdx</span><span class="p">;</span>
<span class="n">cumFre</span> <span class="o">-=</span> <span class="n">tree</span><span class="p">[</span><span class="n">tIdx</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">mask</span> <span class="o">>>=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cumFre</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">else</span>
<span class="k">return</span> <span class="n">idx</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
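<p>To make the operations above concrete, here is a minimal, self-contained sketch of a tree over indices 1..n. The <code>Fenwick</code> struct, its constructor and the way it derives the top bit inside <code>findG</code> are assumptions made for illustration; they are not the class listed above.</p>

```cpp
#include <cassert>
#include <vector>

// Minimal Fenwick tree (BIT) over indices 1..n.
// The name `Fenwick` and this layout are illustrative assumptions.
struct Fenwick {
    int n;
    std::vector<int> tree;

    explicit Fenwick(int size) : n(size), tree(size + 1, 0) {}

    static int lowbit(int x) { return x & (-x); }

    // add num to f[idx]
    void update(int idx, int num) {
        assert(idx >= 1 && idx <= n);
        for (; idx <= n; idx += lowbit(idx)) tree[idx] += num;
    }

    // cumulative sum f[1] + ... + f[idx]
    int read(int idx) const {
        assert(idx >= 0 && idx <= n);
        int sum = 0;
        for (; idx > 0; idx -= lowbit(idx)) sum += tree[idx];
        return sum;
    }

    // greatest idx with read(idx) == cumFre, or -1 if there is none;
    // only valid when every f[i] >= 0
    int findG(int cumFre) const {
        int mask = 1;
        while ((mask << 1) <= n) mask <<= 1;  // highest power of two <= n
        int idx = 0;
        for (; mask != 0; mask >>= 1) {
            int t = idx + mask;
            if (t <= n && cumFre >= tree[t]) {
                idx = t;
                cumFre -= tree[t];
            }
        }
        return cumFre == 0 ? idx : -1;
    }
};
```

<p>For example, with f = [1, 0, 2, 0, 3] built through <code>update(1, 1)</code>, <code>update(3, 2)</code> and <code>update(5, 3)</code> on a tree of size 5, <code>read(3)</code> returns 3, <code>findG(3)</code> returns 4 (the greatest index whose prefix sum is 3) and <code>findG(2)</code> returns -1 because no prefix sums to 2.</p>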
<h1 id="examples">Examples</h1>
<h2 id="stars">Stars</h2>
<p><strong>Description</strong></p>
<p>Astronomers often examine star maps where stars are represented by points on a plane and each star has Cartesian coordinates. Let the level of a star be an amount of the stars that are not higher and not to the right of the given star. Astronomers want to know the distribution of the levels of the stars.</p>
<p>You are to write a program that will count the amounts of the stars of each level on a given map.</p>
<p>Input</p>
<p>The first line of the input file contains a number of stars N (1<=N<=15000). The following N lines describe coordinates of stars (two integers X and Y per line separated by a space, 0<=X,Y<=32000). There can be only one star at one point of the plane. Stars are listed in ascending order of Y coordinate. Stars with equal Y
coordinates are listed in ascending order of X coordinate.</p>
<p>Output</p>
<p>The output should contain N lines, one number per line. The first line contains amount of stars of the level 0, the second does amount of stars of the level 1 and so on, the last line contains amount of stars of the level N-1.</p>
<p>Sample Input</p>
<p>5</p>
<p>1 1</p>
<p>5 1</p>
<p>7 1</p>
<p>3 3</p>
<p>5 5</p>
<p>Sample Output</p>
<p>1</p>
<p>2</p>
<p>1</p>
<p>1</p>
<p>0</p>
<p><strong>Analysis</strong></p>
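<p>The input is already sorted by Y (ties by X), so every star that can count toward a star's level appears before it in the input. For each star, query the BIT for the number of earlier stars whose X is not greater than its own (that is its level), then insert its X into the BIT.</p>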
<p><strong>Code</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Solution</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">MAX</span> <span class="o">=</span> <span class="mi">32002</span><span class="p">;</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">c</span><span class="p">;</span> <span class="c1">// binary indexed tree
</span> <span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">level</span><span class="p">;</span> <span class="c1">// level[k] is the number of stars of level k
</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// c is indexed 1..MAX inclusive, so it needs MAX + 1 slots
</span> <span class="n">Solution</span><span class="p">()</span> <span class="o">:</span> <span class="n">c</span><span class="p">(</span><span class="n">MAX</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="n">level</span><span class="p">(</span><span class="n">MAX</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">int</span> <span class="n">lowbit</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">&</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span> <span class="o"><=</span> <span class="n">MAX</span><span class="p">)</span> <span class="p">{</span>
<span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">+=</span> <span class="n">num</span><span class="p">;</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// read returns cumulative sum of c[1]...c[x]
</span> <span class="c1">// here it is the level of this star
</span> <span class="kt">int</span> <span class="n">read</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">];</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">solve</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">n</span><span class="p">;</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">n</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">x</span> <span class="o">>></span> <span class="n">y</span><span class="p">;</span>
<span class="c1">// as points are sorted by y coordinate with ascending order,
</span> <span class="c1">// for each point, only count points before it in the input stream
</span> <span class="c1">// x + 1 is to avoid case: x=0
</span> <span class="n">level</span><span class="p">[</span><span class="n">read</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span><span class="o">++</span><span class="p">;</span>
<span class="n">update</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cp">#ifndef ONLINE_JUDGE
</span> <span class="n">freopen</span><span class="p">(</span><span class="s">"input"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
<span class="c1">// freopen("output","w",stdout);
</span><span class="cp">#endif
</span>
<span class="n">Solution</span> <span class="n">sol</span><span class="p">;</span>
<span class="n">sol</span><span class="p">.</span><span class="n">solve</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
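<p>To try the idea on the sample without going through stdin, the same counting can be wrapped in a function. This is only a sketch: <code>countLevels</code> and its <code>maxX</code> parameter are hypothetical names, and the stars are assumed to arrive already sorted by Y and then by X, as the problem guarantees.</p>

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: `stars` holds (x, y) pairs sorted by y, then by x,
// with 0 <= x <= maxX. Returns level[k] = number of stars of level k.
std::vector<int> countLevels(const std::vector<std::pair<int, int>> &stars,
                             int maxX = 32000) {
    const int size = maxX + 1;        // BIT indices 1..size (x is shifted by +1)
    std::vector<int> c(size + 1, 0);  // binary indexed tree
    std::vector<int> level(stars.size(), 0);
    for (const auto &s : stars) {
        int x = s.first + 1;          // shift so that x = 0 maps to index 1
        assert(x >= 1 && x <= size);
        int seen = 0;                 // earlier stars with X <= this star's X
        for (int i = x; i > 0; i -= i & (-i)) seen += c[i];
        level[seen]++;
        for (int i = x; i <= size; i += i & (-i)) c[i] += 1;
    }
    return level;
}
```

<p>For the sample input (1,1), (5,1), (7,1), (3,3), (5,5) the function returns {1, 2, 1, 1, 0}, which matches the sample output.</p>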
<h2 id="circus-pyramid">Circus Pyramid</h2>
<p><strong>Description</strong></p>
<p>A circus is designing a tower routine consisting of people standing atop one
another’s shoulders. For practical and aesthetic reasons, each person must be
both shorter and lighter than the person below him or her. Given the heights
and weights of each person in the circus, write a method to compute the largest
possible number of people in such a tower.</p>
<p>EXAMPLE:</p>
<p>Input (ht,wt): (65, 100) (70, 150) (56, 90) (75, 190) (60, 95) (68, 110)</p>
<p>Output: The longest tower is length 6 and includes, from top to bottom:</p>
<p>(56, 90) (60, 95) (65, 100) (68, 110) (70, 150) (75, 190)</p>
<p><strong>Analysis</strong></p>
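<p>Sort the people by height, then process them in that order with a BIT indexed by weight. Before inserting a person, query the BIT for the largest tower that can be built from the strictly lighter people seen so far; that person's best tower is one more than the result. The answer is the largest value found.</p>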
<p><strong>Code</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Solution</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">struct</span> <span class="n">Person</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">ht</span><span class="p">;</span> <span class="c1">// height
</span> <span class="kt">int</span> <span class="n">wt</span><span class="p">;</span> <span class="c1">// weight
</span>
<span class="n">Person</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">w</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">:</span> <span class="n">ht</span><span class="p">(</span><span class="n">h</span><span class="p">),</span> <span class="n">wt</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="c1">// height ascending; equal heights by weight descending, so that
</span> <span class="c1">// two people of the same height never end up in one tower
</span> <span class="k">static</span> <span class="kt">bool</span> <span class="nf">cmp</span><span class="p">(</span><span class="k">const</span> <span class="n">Person</span> <span class="o">&</span><span class="n">lhs</span><span class="p">,</span> <span class="k">const</span> <span class="n">Person</span> <span class="o">&</span><span class="n">rhs</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">ht</span> <span class="o">!=</span> <span class="n">rhs</span><span class="p">.</span><span class="n">ht</span><span class="p">)</span>
<span class="k">return</span> <span class="n">lhs</span><span class="p">.</span><span class="n">ht</span> <span class="o"><</span> <span class="n">rhs</span><span class="p">.</span><span class="n">ht</span><span class="p">;</span>
<span class="k">return</span> <span class="n">lhs</span><span class="p">.</span><span class="n">wt</span> <span class="o">></span> <span class="n">rhs</span><span class="p">.</span><span class="n">wt</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">MAX</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">;</span> <span class="c1">// weights are assumed to lie in [1, MAX]
</span> <span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">c</span><span class="p">;</span> <span class="c1">// binary indexed tree of prefix maxima, indexed by weight
</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Solution</span><span class="p">()</span> <span class="o">:</span> <span class="n">c</span><span class="p">(</span><span class="n">MAX</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">int</span> <span class="n">lowbit</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">&</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// record a tower of size num whose bottom person weighs x
</span> <span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span> <span class="o"><=</span> <span class="n">MAX</span><span class="p">)</span> <span class="p">{</span>
<span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="n">num</span><span class="p">);</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// largest tower whose bottom person weighs at most x
</span> <span class="kt">int</span> <span class="n">read</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">best</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="n">best</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">best</span><span class="p">,</span> <span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">]);</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">best</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">solve</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">n</span><span class="p">;</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">n</span><span class="p">;</span>
<span class="n">vector</span><span class="o"><</span><span class="n">Person</span><span class="o">></span> <span class="n">v</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">ht</span> <span class="o">>></span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">wt</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">sort</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">e</span> <span class="o">:</span> <span class="n">v</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// best tower with e at the bottom: e plus the best tower of strictly lighter people
</span> <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">wt</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">ret</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="n">update</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">wt</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">ret</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> <span class="c1">// size of the largest tower
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cp">#ifndef ONLINE_JUDGE
</span> <span class="n">freopen</span><span class="p">(</span><span class="s">"input"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
<span class="c1">// freopen("output","w",stdout);
</span><span class="cp">#endif
</span>
<span class="n">Solution</span> <span class="n">sol</span><span class="p">;</span>
<span class="n">sol</span><span class="p">.</span><span class="n">solve</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
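<p>The problem is a longest increasing subsequence in disguise: after sorting by height, the tower is the longest strictly increasing run of weights. A possible standalone sketch, using a BIT that stores prefix maxima instead of sums, is shown here. The function name <code>longestTower</code> and the <code>maxWt</code> bound are assumptions for illustration, and weights are assumed to lie in 1..maxWt.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: people are (height, weight) pairs, 1 <= weight <= maxWt.
// Returns the size of the largest tower.
int longestTower(std::vector<std::pair<int, int>> people, int maxWt = 1000) {
    // height ascending; equal heights by weight descending so they never stack
    std::sort(people.begin(), people.end(),
              [](const std::pair<int, int> &a, const std::pair<int, int> &b) {
                  if (a.first != b.first) return a.first < b.first;
                  return a.second > b.second;
              });
    std::vector<int> best(maxWt + 1, 0);  // BIT of prefix maxima, indexed by weight
    int answer = 0;
    for (const auto &p : people) {
        int wt = p.second;
        assert(wt >= 1 && wt <= maxWt);
        int len = 0;  // best tower among strictly lighter people seen so far
        for (int i = wt - 1; i > 0; i -= i & (-i)) len = std::max(len, best[i]);
        ++len;        // put this person underneath it
        answer = std::max(answer, len);
        for (int i = wt; i <= maxWt; i += i & (-i)) best[i] = std::max(best[i], len);
    }
    return answer;
}
```

<p>On the example above this yields 6. For (1, 5), (2, 1), (3, 6) it yields 2: simply counting how many lighter people came earlier would report 3, which is why the tree keeps maxima rather than counts.</p>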
<h2 id="ultra-quicksort">Ultra-QuickSort</h2>
<p><strong>Description</strong></p>
<p>In this problem, you have to analyze a particular sorting algorithm.
The algorithm processes a sequence of n distinct integers by swapping
two adjacent sequence elements until the sequence is sorted in ascending
order.</p>
<p>For the input sequence</p>
<p>9 1 0 5 4 ,</p>
<p>Ultra-QuickSort produces the output</p>
<p>0 1 4 5 9</p>
<p>Your task is to determine how many swap operations Ultra-QuickSort needs to
perform in order to sort a given input sequence.</p>
<p>Input</p>
<p>The input contains several test cases. Every test case begins with a line that
contains a single integer n < 500,000 – the length of the input sequence.
Each of the following n lines contains a single integer 0 <= a[i] <= 999,999,999,
the i-th input sequence element. Input is terminated by a sequence of length n = 0.
This sequence must not be processed.</p>
<p>Output</p>
<p>For every input sequence, your program prints a single line containing an integer number
op, the minimum number of swap operations necessary to sort the given input sequence.</p>
<p>Sample Input</p>
<p>5</p>
<p>9</p>
<p>1</p>
<p>0</p>
<p>5</p>
<p>4</p>
<p>3</p>
<p>1</p>
<p>2</p>
<p>3</p>
<p>0</p>
<p>Sample Output</p>
<p>6</p>
<p>0</p>
<p><strong>Analysis</strong></p>
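<p>The minimum number of adjacent swaps equals the number of inversions in the sequence, that is, pairs i < j with a[i] > a[j]. There are two common ways to count them in O(n log n):</p>
<ol>
<li>
<p>Discretize the values to ranks 1..n, then count with a BIT.</p>
</li>
<li>
<p>Count the inversions during a merge sort.</p>
</li>
</ol>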
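<p>The code in this section takes the first approach. For comparison, a minimal sketch of the merge sort approach could look like the following; <code>countInversions</code> and <code>minAdjacentSwaps</code> are hypothetical names, not part of the solution in this post.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts pairs i < j with a[i] > a[j] inside a[lo, hi) and sorts that range.
// `buf` is scratch space at least as large as `a`.
long long countInversions(std::vector<int> &a, std::vector<int> &buf,
                          std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= a.size() && buf.size() >= a.size());
    if (hi - lo < 2) return 0;
    std::size_t mid = lo + (hi - lo) / 2;
    long long inv = countInversions(a, buf, lo, mid) + countInversions(a, buf, mid, hi);
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[i] <= a[j]) {
            buf[k++] = a[i++];
        } else {
            // a[j] is smaller than every remaining element of the left half
            inv += static_cast<long long>(mid - i);
            buf[k++] = a[j++];
        }
    }
    while (i < mid) buf[k++] = a[i++];
    while (j < hi) buf[k++] = a[j++];
    for (k = lo; k < hi; ++k) a[k] = buf[k];
    return inv;
}

// Convenience wrapper that works on a copy and leaves the caller's data untouched.
long long minAdjacentSwaps(std::vector<int> a) {
    std::vector<int> buf(a.size());
    return countInversions(a, buf, 0, a.size());
}
```

<p>For the two sample sequences, <code>minAdjacentSwaps({9, 1, 0, 5, 4})</code> gives 6 and <code>minAdjacentSwaps({1, 2, 3})</code> gives 0. The result is a <code>long long</code> because n close to 500,000 can produce more inversions than a 32-bit integer holds.</p>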
<p><strong>Code</strong></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Solution</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">struct</span> <span class="n">node</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">val</span><span class="p">;</span> <span class="c1">// value
</span> <span class="kt">int</span> <span class="n">id</span><span class="p">;</span> <span class="c1">// order in the input stream
</span> <span class="p">};</span>
<span class="k">static</span> <span class="kt">bool</span> <span class="nf">cmp</span><span class="p">(</span><span class="k">const</span> <span class="n">node</span> <span class="o">&</span><span class="n">lhs</span><span class="p">,</span> <span class="k">const</span> <span class="n">node</span> <span class="o">&</span><span class="n">rhs</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">lhs</span><span class="p">.</span><span class="n">val</span> <span class="o"><</span> <span class="n">rhs</span><span class="p">.</span><span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">MAX</span><span class="p">;</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">c</span><span class="p">;</span> <span class="c1">// binary indexed tree
</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">int</span> <span class="n">lowbit</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">&</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">update</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span> <span class="o"><=</span> <span class="n">MAX</span><span class="p">)</span> <span class="p">{</span>
<span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">+=</span> <span class="n">num</span><span class="p">;</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">read</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">c</span><span class="p">[</span><span class="n">x</span><span class="p">];</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">lowbit</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">solveOne</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MAX</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">c</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="n">c</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">vector</span><span class="o"><</span><span class="n">node</span><span class="o">></span> <span class="n">v</span><span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="c1">// read data
</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">val</span><span class="p">;</span>
<span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">id</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// discretize input
</span> <span class="n">sort</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">discrete</span><span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">discrete</span><span class="p">[</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">id</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// counting result
</span> <span class="c1">// i is the input index
</span> <span class="c1">// read(discrete[i]) is the number of elements seen so far that are less than or equal to discrete[i]
</span> <span class="c1">// i - read(discrete[i]) is the number of elements seen so far that are greater than discrete[i]
</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">update</span><span class="p">(</span><span class="n">discrete</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">ret</span> <span class="o">+=</span> <span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="n">read</span><span class="p">(</span><span class="n">discrete</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">ret</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">solve</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">n</span><span class="p">,</span> <span class="n">n</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">solveOne</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cp">#ifndef ONLINE_JUDGE
</span> <span class="n">freopen</span><span class="p">(</span><span class="s">"input"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
<span class="c1">// freopen("output","w",stdout);
</span><span class="cp">#endif
</span>
<span class="n">Solution</span> <span class="n">sol</span><span class="p">;</span>
<span class="n">sol</span><span class="p">.</span><span class="n">solve</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="count-cards">Count Cards</h2>
<p><strong>Description</strong></p>
<p>There is an array of n cards, each placed face down on the table. Two kinds of queries must be supported:</p>
<ol>
<li>
<p>T(i, j): turn over the cards from index i to index j, inclusive. A card that was face down becomes face up, and a card that was face up becomes face down.</p>
</li>
<li>
<p>Q(i): answer 0 if the i-th card is face down, otherwise answer 1.</p>
</li>
</ol>
<p><strong>Analysis</strong></p>
<p>Both kinds of query can be answered in O(log n) time. Keep an array f of length n + 1 (stored in a binary indexed tree) and, for each query T(i, j), set <strong>f[i]++</strong> and <strong>f[j + 1]--</strong>.</p>
<p>For each card k between i and j (inclusive), the sum f[1] + f[2] + … + f[k] increases by 1; for every other card it stays the same. The answer to Q(k) is therefore this sum (the cumulative frequency) <strong>modulo 2</strong>.</p>
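<p>The idea above can be sketched as follows. This is a minimal illustration, not the original solution: the class name <code>CardTable</code> and its methods <code>flip</code> and <code>query</code> are hypothetical, and cards are assumed to be numbered from 1 to n.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: a binary indexed tree over the difference array f.
// flip(i, j) performs f[i]++ and f[j + 1]--; query(k) returns the prefix
// sum f[1] + ... + f[k] modulo 2.
class CardTable {
public:
    // Index n + 1 must exist because flip(i, n) touches f[n + 1].
    explicit CardTable(int n) : n_(n), tree_(n + 2, 0) {}

    // T(i, j): turn over cards i..j (1-based, inclusive).
    void flip(int i, int j) {
        assert(1 <= i && i <= j && j <= n_);
        add(i, 1);
        add(j + 1, -1);
    }

    // Q(i): 0 if card i is face down, 1 if it is face up.
    int query(int i) const {
        int sum = 0;
        for (; i > 0; i -= i & -i) sum += tree_[i];
        return sum & 1;  // parity of the number of flips covering card i
    }

private:
    // Point update on the binary indexed tree.
    void add(int idx, int delta) {
        for (; idx <= n_ + 1; idx += idx & -idx) tree_[idx] += delta;
    }

    int n_;
    std::vector<int> tree_;
};
```

<p>With five cards, calling <code>flip(2, 4)</code> makes <code>query(2)</code> and <code>query(4)</code> return 1 while <code>query(1)</code> and <code>query(5)</code> still return 0.</p>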
Segment Tree
2014-08-20T00:00:00+00:00
http://blog.wjin.org/posts/segment-tree
<h1 id="introduction">Introduction</h1>
<p>Segment tree is a tree data structure for storing intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle, a static structure; that is, its content cannot be modified once the structure is built. A similar data structure is the interval tree.</p>
<p>A segment tree for a set of n intervals uses <strong>O(n log n)</strong> storage and can be built in <strong>O(n log n)</strong> time. Segment trees support searching for all the intervals that contain a query point in <strong>O(log n + k)</strong>, k being the number of retrieved intervals or segments.</p>
<p><strong>Reference</strong></p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Segment_tree">http://en.wikipedia.org/wiki/Segment_tree</a></li>
</ul>
<h1 id="application">Application</h1>
<h2 id="balanced-lineup"><a href="http://poj.org/problem?id=3264">Balanced Lineup</a></h2>
<p><strong>Description</strong></p>
<p>For the daily milking, Farmer John’s N cows (1 <= N <= 50,000) always line up in the same order. One day Farmer John decides to organize a game of Ultimate Frisbee with some of the cows. To keep things simple, he will take a contiguous range of cows from the milking lineup to play the game. However, for all the cows to have fun they should not differ too much in height.</p>
<p>Farmer John has made a list of Q (1 <= Q <= 200,000) potential groups of cows and their heights (1 <= height <= 1,000,000). For each group, he wants your help to determine the difference in height between the shortest and the tallest cow in the group.</p>
<p><strong>Input</strong></p>
<p>Line 1: Two space-separated integers, N and Q.</p>
<p>Lines 2..N+1: Line i+1 contains a single integer that is the height of cow i</p>
<p>Lines N+2..N+Q+1: Two integers A and B (1 <= A <= B <= N), representing the range of cows from A to B inclusive.</p>
<p><strong>Output</strong></p>
<p>Lines 1..Q: Each line contains a single integer that is the answer to a query: the difference in height between the tallest and shortest cow in the range.</p>
<p><strong>Sample Input</strong></p>
<p>6 3</p>
<p>1</p>
<p>7</p>
<p>3</p>
<p>4</p>
<p>2</p>
<p>5</p>
<p>1 5</p>
<p>4 6</p>
<p>2 2</p>
<p><strong>Sample Output</strong></p>
<p>6</p>
<p>3</p>
<p>0</p>
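<p>As a sanity check on the sample, a brute-force scan gives the same answers. The helper name <code>rangeDiff</code> is hypothetical and this is only a baseline sketch: it costs O(N) per query, which is what the segment tree is meant to improve on.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Brute-force answer for one query: tallest minus shortest among cows a..b.
// heights is 1-based (heights[0] is unused), matching the problem statement.
int rangeDiff(const std::vector<int> &heights, int a, int b) {
    assert(1 <= a && a <= b && b < static_cast<int>(heights.size()));
    auto first = heights.begin() + a;
    auto last = heights.begin() + b + 1;  // one past cow b
    auto mm = std::minmax_element(first, last);
    return *mm.second - *mm.first;        // max minus min
}
```

<p>For the sample heights 1 7 3 4 2 5, the range 1..5 has a tallest cow of 7 and a shortest of 1, so the answer is 6.</p>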
<p><strong>Code</strong></p>
<p>Using an array to implement the segment tree.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SegmentTree</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="c1">// tree node
</span> <span class="k">struct</span> <span class="n">TreeNode</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">left</span><span class="p">;</span> <span class="c1">// segment's left point
</span> <span class="kt">int</span> <span class="n">right</span><span class="p">;</span> <span class="c1">// segment's right point
</span>
<span class="c1">// here we can record anything specific to a problem
</span> <span class="c1">// e.g. sum, max or min element, and so on
</span> <span class="kt">int</span> <span class="n">minEle</span><span class="p">;</span> <span class="c1">// min element in the scope[left...right]
</span> <span class="kt">int</span> <span class="n">maxEle</span><span class="p">;</span> <span class="c1">// max element in the scope[left...right]
</span>
<span class="n">TreeNode</span><span class="p">()</span><span class="o">:</span> <span class="n">left</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">right</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">minEle</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">maxEle</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="n">vector</span><span class="o"><</span><span class="n">TreeNode</span><span class="o">></span> <span class="n">treeNode</span><span class="p">;</span> <span class="c1">// tree node set, treeNode[1] is root
</span>
<span class="c1">// build the tree recursively (it is not necessarily a perfect binary tree)
</span> <span class="c1">// stored in the array treeNode: children of node t are 2 * t and 2 * t + 1
</span> <span class="kt">void</span> <span class="nf">build</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">data</span><span class="p">,</span> <span class="kt">int</span> <span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// set segment's start and end point
</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">left</span> <span class="o">=</span> <span class="n">left</span><span class="p">;</span>
<span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">right</span> <span class="o">=</span> <span class="n">right</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="o">==</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">minEle</span> <span class="o">=</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">maxEle</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">left</span><span class="p">];</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// build left and right tree recursively
</span> <span class="kt">int</span> <span class="n">mid</span> <span class="o">=</span> <span class="n">left</span> <span class="o">+</span> <span class="p">((</span><span class="n">right</span> <span class="o">-</span> <span class="n">left</span><span class="p">)</span> <span class="o">>></span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">leftRoot</span> <span class="o">=</span> <span class="n">t</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">rightRoot</span> <span class="o">=</span> <span class="n">leftRoot</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">build</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">leftRoot</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="n">build</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">rightRoot</span><span class="p">,</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="c1">// update info
</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">minEle</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">treeNode</span><span class="p">[</span><span class="n">leftRoot</span><span class="p">].</span><span class="n">minEle</span><span class="p">,</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">rightRoot</span><span class="p">].</span><span class="n">minEle</span><span class="p">);</span>
<span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">maxEle</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">treeNode</span><span class="p">[</span><span class="n">leftRoot</span><span class="p">].</span><span class="n">maxEle</span><span class="p">,</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">rightRoot</span><span class="p">].</span><span class="n">maxEle</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">query</span><span class="p">(</span><span class="kt">int</span> <span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">lower</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">upper</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">left</span> <span class="o">==</span> <span class="n">left</span> <span class="o">&&</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">right</span> <span class="o">==</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lower</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">lower</span><span class="p">,</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">minEle</span><span class="p">);</span>
<span class="n">upper</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">upper</span><span class="p">,</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">maxEle</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">mid</span> <span class="o">=</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">left</span> <span class="o">+</span> <span class="p">((</span><span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">right</span> <span class="o">-</span> <span class="n">treeNode</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="n">left</span><span class="p">)</span> <span class="o">>></span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="o">></span> <span class="n">mid</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// search right child
</span> <span class="n">query</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">right</span> <span class="o"><=</span> <span class="n">mid</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// search left child
</span> <span class="n">query</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">t</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// search both
</span> <span class="n">query</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">t</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="n">query</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// 4 * n nodes are enough for n leaves; 2 * n is not, because placing
</span> <span class="c1">// children at 2 * t and 2 * t + 1 leaves gaps (10 leaves already reach index 25)
</span> <span class="n">SegmentTree</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span> <span class="o">:</span> <span class="n">treeNode</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">*</span> <span class="mi">4</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">build</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">maxDiff</span><span class="p">(</span><span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">maxData</span> <span class="o">=</span> <span class="n">INT_MIN</span><span class="p">,</span> <span class="n">minData</span> <span class="o">=</span> <span class="n">INT_MAX</span><span class="p">;</span>
<span class="n">query</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">minData</span><span class="p">,</span> <span class="n">maxData</span><span class="p">);</span>
<span class="k">return</span> <span class="n">maxData</span> <span class="o">-</span> <span class="n">minData</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="cp">#ifndef ONLINE_JUDGE
</span> <span class="n">freopen</span><span class="p">(</span><span class="s">"input"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
<span class="c1">// freopen("output","w",stdout);
</span><span class="cp">#endif
</span>
<span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="n">q</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">;</span>
<span class="n">cin</span> <span class="o">>></span> <span class="n">n</span> <span class="o">>></span> <span class="n">q</span><span class="p">;</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">v</span><span class="p">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">cin</span> <span class="o">>></span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">SegmentTree</span> <span class="n">sol</span><span class="p">(</span><span class="n">v</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">a</span> <span class="o">>></span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">sol</span><span class="p">.</span><span class="n">maxDiff</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
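<p>When sizing the array, it helps to check how far the 2 * t / 2 * t + 1 indexing actually reaches. The sketch below walks the same recursion as <code>build</code> and returns the largest index touched. The helper name <code>maxIndexUsed</code> is hypothetical and not part of the solution above. For 10 leaves it reports 25, which is more than 2 * n; 4 * n is the commonly used safe bound.</p>

```cpp
#include <algorithm>
#include <cassert>

// Largest array index touched when the segment [left, right] is rooted at
// index t, children live at 2 * t and 2 * t + 1, and the split point is
// mid = left + (right - left) / 2, as in build().
int maxIndexUsed(int t, int left, int right) {
    assert(left <= right);
    if (left == right) return t;  // leaf: only index t is used
    int mid = left + ((right - left) >> 1);
    return std::max(maxIndexUsed(2 * t, left, mid),
                    maxIndexUsed(2 * t + 1, mid + 1, right));
}
```

<p>Calling <code>maxIndexUsed(1, 1, n)</code> for a given n gives the minimum array size minus one needed by this layout.</p>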
<p>Using linked nodes (pointers) to implement the segment tree.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SegmentTree</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="c1">// tree node
</span> <span class="k">struct</span> <span class="n">TreeNode</span> <span class="p">{</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">lChild</span><span class="p">,</span> <span class="o">*</span><span class="n">rChild</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">left</span><span class="p">;</span> <span class="c1">// segment's left point
</span> <span class="kt">int</span> <span class="n">right</span><span class="p">;</span> <span class="c1">// segment's right point
</span>
<span class="c1">// here we can record anything specific to a problem
</span> <span class="c1">// e.g. sum, max or min element, and so on
</span> <span class="kt">int</span> <span class="n">minEle</span><span class="p">;</span> <span class="c1">// min element in the scope[left...right]
</span> <span class="kt">int</span> <span class="n">maxEle</span><span class="p">;</span> <span class="c1">// max element in the scope[left...right]
</span> <span class="n">TreeNode</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="n">l</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span><span class="o">:</span>
<span class="n">lChild</span><span class="p">(</span><span class="nb">nullptr</span><span class="p">),</span> <span class="n">rChild</span><span class="p">(</span><span class="nb">nullptr</span><span class="p">),</span> <span class="n">left</span><span class="p">(</span><span class="n">l</span><span class="p">),</span> <span class="n">right</span><span class="p">(</span><span class="n">r</span><span class="p">),</span> <span class="n">minEle</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">maxEle</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="n">TreeNode</span> <span class="o">*</span><span class="n">root</span><span class="p">;</span>
<span class="c1">// build binary tree
</span> <span class="n">TreeNode</span><span class="o">*</span> <span class="nf">build</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">data</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// set segment's start and end point
</span> <span class="n">TreeNode</span> <span class="o">*</span><span class="n">root</span> <span class="o">=</span> <span class="k">new</span> <span class="n">TreeNode</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">root</span> <span class="o">==</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="k">throw</span> <span class="n">runtime_error</span><span class="p">(</span><span class="s">"no memory"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="o">==</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="n">root</span><span class="o">-></span><span class="n">minEle</span> <span class="o">=</span> <span class="n">root</span><span class="o">-></span><span class="n">maxEle</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">left</span><span class="p">];</span>
<span class="k">return</span> <span class="n">root</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// build left and right tree recursively
</span> <span class="kt">int</span> <span class="n">mid</span> <span class="o">=</span> <span class="n">left</span> <span class="o">+</span> <span class="p">((</span><span class="n">right</span> <span class="o">-</span> <span class="n">left</span><span class="p">)</span> <span class="o">>></span> <span class="mi">1</span><span class="p">);</span>
<span class="n">root</span><span class="o">-></span><span class="n">lChild</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="n">root</span><span class="o">-></span><span class="n">rChild</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="c1">// update info
</span> <span class="n">root</span><span class="o">-></span><span class="n">minEle</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">root</span><span class="o">-></span><span class="n">lChild</span><span class="o">-></span><span class="n">minEle</span><span class="p">,</span> <span class="n">root</span><span class="o">-></span><span class="n">rChild</span><span class="o">-></span><span class="n">minEle</span><span class="p">);</span>
<span class="n">root</span><span class="o">-></span><span class="n">maxEle</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">root</span><span class="o">-></span><span class="n">lChild</span><span class="o">-></span><span class="n">maxEle</span><span class="p">,</span> <span class="n">root</span><span class="o">-></span><span class="n">rChild</span><span class="o">-></span><span class="n">maxEle</span><span class="p">);</span>
<span class="k">return</span> <span class="n">root</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">query</span><span class="p">(</span><span class="n">TreeNode</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">lower</span><span class="p">,</span> <span class="kt">int</span> <span class="o">&</span><span class="n">upper</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">t</span> <span class="o">==</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="c1">// should not happen
</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">left</span> <span class="o">==</span> <span class="n">left</span> <span class="o">&&</span> <span class="n">t</span><span class="o">-></span><span class="n">right</span> <span class="o">==</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lower</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">lower</span><span class="p">,</span> <span class="n">t</span><span class="o">-></span><span class="n">minEle</span><span class="p">);</span>
<span class="n">upper</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">upper</span><span class="p">,</span> <span class="n">t</span><span class="o">-></span><span class="n">maxEle</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">mid</span> <span class="o">=</span> <span class="n">t</span><span class="o">-></span><span class="n">left</span> <span class="o">+</span> <span class="p">((</span><span class="n">t</span><span class="o">-></span><span class="n">right</span> <span class="o">-</span> <span class="n">t</span><span class="o">-></span><span class="n">left</span><span class="p">)</span> <span class="o">>></span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="o">></span> <span class="n">mid</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// search right child
</span> <span class="n">query</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">rChild</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">right</span> <span class="o"><=</span> <span class="n">mid</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// search left child
</span> <span class="n">query</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">lChild</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// search both
</span> <span class="n">query</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">lChild</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="n">query</span><span class="p">(</span><span class="n">t</span><span class="o">-></span><span class="n">rChild</span><span class="p">,</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">lower</span><span class="p">,</span> <span class="n">upper</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">SegmentTree</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">root</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">maxDiff</span><span class="p">(</span><span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">maxData</span> <span class="o">=</span> <span class="n">INT_MIN</span><span class="p">,</span> <span class="n">minData</span> <span class="o">=</span> <span class="n">INT_MAX</span><span class="p">;</span>
<span class="n">query</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">minData</span><span class="p">,</span> <span class="n">maxData</span><span class="p">);</span>
<span class="k">return</span> <span class="n">maxData</span> <span class="o">-</span> <span class="n">minData</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
Heap
2014-08-18T00:00:00+00:00
http://blog.wjin.org/posts/heap
<h1 id="heap-implementation">Heap Implementation</h1>
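<p>This heap is similar to <code class="highlighter-rouge">priority_queue</code> in the C++ STL. Its core operations are <em>sift_down</em> and <em>sift_up</em>. The implementation below is a max heap stored in a 1-based array, so the parent of node <em>i</em> is <em>i / 2</em> and its children are <em>2i</em> and <em>2i + 1</em>.</p>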
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// max heap
</span><span class="k">class</span> <span class="nc">Heap</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">m_data</span><span class="p">;</span> <span class="c1">// heap data
</span> <span class="kt">size_t</span> <span class="n">m_size</span><span class="p">;</span> <span class="c1">// heap size
</span>
<span class="kt">void</span> <span class="nf">sift_up</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">></span> <span class="mi">1</span> <span class="o">&&</span> <span class="n">m_data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">></span> <span class="n">m_data</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="mi">2</span><span class="p">])</span> <span class="p">{</span>
<span class="n">swap</span><span class="p">(</span><span class="n">m_data</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">m_data</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="mi">2</span><span class="p">]);</span>
<span class="n">i</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">sift_down</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// j is the left child
</span>
<span class="k">while</span> <span class="p">(</span><span class="n">j</span> <span class="o"><=</span> <span class="n">m_size</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// loop until leaf
</span> <span class="c1">// find max child
</span> <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="mi">1</span> <span class="o"><=</span> <span class="n">m_size</span> <span class="o">&&</span> <span class="n">m_data</span><span class="p">[</span><span class="n">j</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">m_data</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// swap if possible
</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o"><</span> <span class="n">m_data</span><span class="p">[</span><span class="n">j</span><span class="p">])</span> <span class="p">{</span>
<span class="n">swap</span><span class="p">(</span><span class="n">m_data</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">m_data</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
<span class="n">j</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">make_heap</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m_size</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sift_down</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">Heap</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_size</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="c1">// m_data[0] is left unused to simplify sift_down:
</span> <span class="c1">// with 0-based indexing, 2*i could not locate the left child of i == 0
</span> <span class="n">m_data</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">m_size</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">m_data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="n">copy</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="n">m_data</span><span class="p">.</span><span class="n">begin</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">make_heap</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">push_heap</span><span class="p">(</span><span class="kt">int</span> <span class="n">val</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_data</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">val</span><span class="p">);</span>
<span class="n">m_size</span><span class="o">++</span><span class="p">;</span>
<span class="n">sift_up</span> <span class="p">(</span><span class="n">m_size</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">pop_heap</span><span class="p">()</span> <span class="c1">// precondition: the heap is not empty</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">val</span> <span class="o">=</span> <span class="n">get_top</span><span class="p">();</span>
<span class="n">swap</span><span class="p">(</span><span class="n">m_data</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">m_data</span><span class="p">[</span><span class="n">m_size</span><span class="p">]);</span>
<span class="n">m_data</span><span class="p">.</span><span class="n">pop_back</span><span class="p">();</span>
<span class="n">m_size</span><span class="o">--</span><span class="p">;</span>
<span class="n">sift_down</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="k">return</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">get_top</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">m_data</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">get_size</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">m_size</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
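<p>For comparison, here is a minimal sketch of the same push / top / pop cycle using <code class="highlighter-rouge">std::priority_queue</code>, which is a max heap by default. The helper name <code class="highlighter-rouge">drain_descending</code> is hypothetical and only serves the illustration.</p>

```cpp
#include <queue>
#include <vector>

// Hypothetical helper: pushes every value into a std::priority_queue
// (a max heap by default) and pops them back out, largest first.
std::vector<int> drain_descending(const std::vector<int> &v)
{
    std::priority_queue<int> pq;      // plays the role of the Heap class
    for (int x : v)
        pq.push(x);                   // counterpart of push_heap

    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());      // counterpart of get_top
        pq.pop();                     // counterpart of pop_heap, but returns void
    }
    return out;
}
```

<p>One visible difference: <code class="highlighter-rouge">pop()</code> in the STL does not return the removed element, so the top has to be read first.</p>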
<h1 id="application">Application</h1>
<h2 id="heap-sort">Heap Sort</h2>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HeapSort</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">Heap</span> <span class="n">hp</span><span class="p">;</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">HeapSort</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span> <span class="o">:</span> <span class="n">hp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">sort</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">ret</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">hp</span><span class="p">.</span><span class="n">get_size</span><span class="p">())</span> <span class="p">{</span>
<span class="n">ret</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">hp</span><span class="p">.</span><span class="n">pop_heap</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
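<p>The class above returns the elements in descending order, because it repeatedly pops the maximum. Alternatively, we can sort the vector in place, which yields ascending order:</p>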
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HeapSort</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">sift_down</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">,</span> <span class="kt">int</span> <span class="n">vs</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// j is the left child
</span>
<span class="k">while</span> <span class="p">(</span><span class="n">j</span> <span class="o"><=</span> <span class="n">vs</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// loop until leaf
</span> <span class="c1">// find max child
</span> <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="mi">1</span> <span class="o"><=</span> <span class="n">vs</span> <span class="o">&&</span> <span class="n">v</span><span class="p">[</span><span class="n">j</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// swap if possible
</span> <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o"><</span> <span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">])</span> <span class="p">{</span>
<span class="n">swap</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
<span class="n">j</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">make_heap</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">,</span> <span class="kt">int</span> <span class="n">vs</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">vs</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sift_down</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">vs</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">Sort</span><span class="p">(</span><span class="n">vector</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="o">&</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">vs</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">size</span><span class="p">();</span>
<span class="c1">// v[0] is a dummy element that simplifies sift_down:
</span> <span class="c1">// with 0-based indexing, 2*i could not locate the left child of i == 0
</span> <span class="n">v</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
<span class="n">make_heap</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">vs</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">vs</span> <span class="o">></span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">swap</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">v</span><span class="p">[</span><span class="n">vs</span><span class="p">]);</span>
<span class="n">vs</span><span class="o">--</span><span class="p">;</span>
<span class="n">sift_down</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">vs</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">v</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
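<p>The dummy element can be avoided by switching to 0-based child indices (<em>2i + 1</em> and <em>2i + 2</em>). The following is a sketch of that variant, not the version used above; the names <code class="highlighter-rouge">sift_down0</code> and <code class="highlighter-rouge">heap_sort0</code> are made up for this example.</p>

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Restore the max-heap property for the subtree rooted at i,
// looking only at the first n elements of v (0-based indices).
void sift_down0(std::vector<int> &v, std::size_t n, std::size_t i)
{
    for (;;) {
        std::size_t j = 2 * i + 1;                 // left child
        if (j >= n) break;                         // i is a leaf
        if (j + 1 < n && v[j + 1] > v[j]) ++j;     // pick the larger child
        if (v[i] >= v[j]) break;                   // heap property already holds
        std::swap(v[i], v[j]);
        i = j;
    }
}

// Sorts v in ascending order without inserting a dummy element.
void heap_sort0(std::vector<int> &v)
{
    std::size_t n = v.size();
    for (std::size_t i = n / 2; i-- > 0;)          // build the heap bottom-up
        sift_down0(v, n, i);
    while (n > 1) {
        std::swap(v[0], v[n - 1]);                 // move the current max to the end
        --n;                                       // shrink the heap part
        sift_down0(v, n, 0);
    }
}
```

<p>This trades the simpler <em>2i</em> arithmetic for not having to insert at and erase from the front of the vector, both of which are O(n) operations.</p>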
<h2 id="find-the-median-of-a-data-flow-at-anytime">Find the median of a data stream at any time</h2>
<p><strong>Analysis</strong></p>
<p>Maintain two heaps: a <em>small</em> heap and a <em>big</em> heap. Every element in the small heap is less than or equal to every element in the big heap.</p>
<p>We use a <strong>max heap</strong> for the small heap (its top is the maximum element) and a <strong>min heap</strong> for the big heap (its top is the minimum element).</p>
<p>When a new <strong>element</strong> arrives, compare it with the tops of the two heaps (smallTop, bigTop).</p>
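<ul>
<li>
<p>insert <strong>element</strong> into the small heap if element < smallTop</p>
</li>
<li>
<p>insert <strong>element</strong> into the big heap if element > bigTop</p>
</li>
<li>
<p>otherwise smallTop <= element <= bigTop, and <strong>element</strong> may go into either heap</p>
</li>
<li>
<p>after each insertion, move the top of the larger heap to the other heap if needed, so that abs(smallSize - bigSize) <= 1 always holds.</p>
</li>
</ul>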
<p>For simplicity, keep smallSize >= bigSize, so that at any time the median is either</p>
<blockquote>
<p>smallTop, (smallSize > bigSize)</p>
</blockquote>
<p>or</p>
<blockquote>
<p>(smallTop + bigTop) / 2, (smallSize == bigSize)</p>
</blockquote>
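<p>Note: there is an O(n) selection algorithm for finding the median of a <strong>given array</strong>, but that is a different problem. Here the data arrives one element at a time and the median can be requested after every insertion.</p>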
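<p>The two-heap scheme can be sketched with <code class="highlighter-rouge">std::priority_queue</code> as follows. The class name <code class="highlighter-rouge">RunningMedian</code> and its methods are assumptions made for this example, and the median is returned as a <code class="highlighter-rouge">double</code> so that the even case does not truncate.</p>

```cpp
#include <functional>
#include <queue>
#include <stdexcept>
#include <vector>

// Hypothetical class: after every add(), small.size() equals
// big.size() or big.size() + 1.
class RunningMedian
{
private:
    std::priority_queue<int> small;                                     // max heap: lower half
    std::priority_queue<int, std::vector<int>, std::greater<int>> big;  // min heap: upper half
public:
    void add(int x)
    {
        if (small.empty() || x <= small.top())
            small.push(x);
        else
            big.push(x);

        // rebalance so that small holds the same number of elements as big, or one more
        if (small.size() > big.size() + 1) {
            big.push(small.top());
            small.pop();
        } else if (big.size() > small.size()) {
            small.push(big.top());
            big.pop();
        }
    }
    double median() const
    {
        if (small.empty())
            throw std::logic_error("no data yet");
        if (small.size() > big.size())
            return small.top();                                         // odd count
        return (static_cast<double>(small.top()) + big.top()) / 2.0;    // even count
    }
};
```

<p>Each insertion costs O(log n) and reading the median costs O(1), since only the two tops are inspected.</p>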
Football Word
2014-08-17T00:00:00+00:00
http://blog.wjin.org/posts/football-word
<h1 id="pitch">Pitch</h1>
<p>field / pitch 足球场</p>
<p>midfield 中场</p>
<p>backfield 后场</p>
<p>kickoff circle / center circle 中圈</p>
<p>halfway line 中线</p>
<p>touchline / sideline 边线</p>
<p>goal line 球门线</p>
<p>end line 底线</p>
<p>penalty mark (点球)罚球点</p>
<p>penalty area 禁区(罚球区)</p>
<p>goal area 小禁区(球门区)</p>
<h1 id="team">Team</h1>
<p>coach 教练</p>
<p>head coach 主教练</p>
<p>football player 足球运动员</p>
<p>referee 裁判</p>
<p>linesman 巡边员</p>
<p>captain / leader 队长</p>
<p>forward / striker 前锋</p>
<p>midfielder 前卫</p>
<p>left midfielder 左前卫</p>
<p>right midfielder 右前卫</p>
<p>attacking midfielder 攻击型前卫(前腰)</p>
<p>defending midfielder 防守型前卫(后腰)</p>
<p>center forward 中锋</p>
<p>full back 后卫</p>
<p>center back 中后卫</p>
<p>left back 左后卫</p>
<p>right back 右后卫</p>
<p>sweeper 清道夫,拖后中卫</p>
<p>goalkeeper / goalie 守门员</p>
<p>cheer team 拉拉队</p>
<h1 id="term">Term</h1>
<p>kick-off 开球</p>
<p>bicycle kick / overhead kick 倒钩球</p>
<p>chest-high ball 半高球</p>
<p>corner kick / corner 角球</p>
<p>goal kick 球门球</p>
<p>handball 手球</p>
<p>header 头球</p>
<p>penalty kick 点球</p>
<p>place kick 定位球</p>
<p>own goal 乌龙球</p>
<p>hat-trick 帽子戏法</p>
<p>free kick 任意球</p>
<p>direct free kick 直接任意球</p>
<p>indirect free kick 间接任意球</p>
<p>stopping 停球</p>
<p>chesting 胸部停球</p>
<p>pass 传球</p>
<p>short pass 短传</p>
<p>long pass 长传</p>
<p>cross pass 横传</p>
<p>spot pass 球传到位</p>
<p>consecutive passes 连续传球</p>
<p>take a pass 接球</p>
<p>triangular pass 三角传球</p>
<p>flank pass 边线传球</p>
<p>lobbing pass 高吊传球</p>
<p>volley pass 凌空传球</p>
<p>slide tackle 铲球</p>
<p>rolling pass / ground pass 地滚球</p>
<p>flying header 跳起顶球</p>
<p>clearance kick 解围</p>
<p>shoot 射门</p>
<p>close-range shot 近射</p>
<p>long shot 远射</p>
<p>offside 越位</p>
<p>throw-in 掷界外球</p>
<p>block tackle 正面抢截</p>
<p>body check 阻挡</p>
<p>fair charge 合理冲撞</p>
<p>diving header 鱼跃顶球</p>
<p>dribbling 盘球,带球</p>
<p>clean catching (守门员)接高球</p>
<p>finger-tip save (守门员)托救球</p>
<p>deceptive movement 假动作</p>
<p>break through 突破</p>
<p>kick-out 踢出界</p>
<h1 id="strategy">Strategy</h1>
<p>set the pace 掌握进攻节奏</p>
<p>ward off an assault 击退一次攻势</p>
<p>break up an attack 破坏一次攻势</p>
<p>disorganize the defence 搅乱防守</p>
<p>total football 全攻全守足球战术</p>
<p>open football 拉开的足球战术</p>
<p>off-side trap 越位战术</p>
<p>wing play 边锋战术</p>
<p>time wasting tactics 拖延战术</p>
<p>4-3-3 formation 433阵型</p>
<p>4-4-2 formation 442阵型</p>
<p>beat the offside trap 反越位成功</p>
<p>foul 犯规</p>
<p>technical foul 技术犯规</p>
<p>break loose 摆脱</p>
<p>control the midfield 控制中场</p>
<p>set a wall 筑人墙</p>
<p>close-marking defence 盯人防守</p>
<h1 id="match">Match</h1>
<p>half-time interval 中场休息</p>
<p>round robin 循环赛</p>
<p>group round robin 小组循环赛</p>
<p>extra time 加时赛</p>
<p>elimination match 淘汰赛</p>
<p>injury time 伤停补时</p>
<p>golden goal / sudden death 金球制,突然死亡法</p>
<p>eighth-final 八分之一决赛</p>
<p>quarterfinal 四分之一决赛</p>
<p>semi-final 半决赛</p>
<p>final match 决赛</p>
<p>preliminary match 预赛</p>
<p>one-sided game 一边倒的比赛</p>
<p>competition regulations 比赛条例</p>
<p>disqualification 取消比赛资格</p>
<p>match ban 禁赛命令</p>
<p>doping test 药检</p>
<p>draw / sortition 抽签</p>
<p>send a player off 判罚出场</p>
<p>red card 红牌</p>
<p>yellow card 黄牌</p>
<p>goal 球门,进球数</p>
<p>draw 平局</p>
<p>goal drought 进球荒</p>
<p>ranking 排名(名次)</p>
Redis Event Library
2014-08-16T00:00:00+00:00
http://blog.wjin.org/posts/redis-event-library
<h1 id="introduction">Introduction</h1>
<p>As a network service, Redis must wait for client connections and handle client requests. To deal with these socket events, Redis ships its own small event library rather than depending on the well-known libevent library, which keeps the code simple.</p>
<p>The Redis server process starts up like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="p">...</span>
<span class="c1">// create event loop handle
</span> <span class="n">server</span><span class="p">.</span><span class="n">el</span> <span class="o">=</span> <span class="n">aeCreateEventLoop</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">maxclients</span><span class="o">+</span><span class="n">REDIS_EVENTLOOP_FDSET_INCR</span><span class="p">);</span>
<span class="p">...</span>
<span class="n">aeSetBeforeSleepProc</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">el</span><span class="p">,</span><span class="n">beforeSleep</span><span class="p">);</span> <span class="c1">// set function ptr
</span> <span class="n">aeMain</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">el</span><span class="p">);</span> <span class="c1">// loop
</span>
<span class="c1">// stop
</span> <span class="n">aeDeleteEventLoop</span><span class="p">(</span><span class="n">server</span><span class="p">.</span><span class="n">el</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">aeMain</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">)</span> <span class="p">{</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">stop</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">stop</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">beforesleep</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">beforesleep</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="c1">// deal with events
</span> <span class="n">aeProcessEvents</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">,</span> <span class="n">AE_ALL_EVENTS</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The snippet above shows how the Redis server process starts and waits for connections. It creates an event loop handle and then loops until it is explicitly stopped, handling client requests on each iteration.</p>
<h1 id="event-library">Event Library</h1>
<p>Redis cares about two kinds of events: <strong>file events</strong> and <strong>time events</strong>. The macro <code class="highlighter-rouge">AE_DONT_WAIT</code> requests <strong>non-blocking</strong> processing, which means the call returns to the caller without waiting for new events.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define AE_FILE_EVENTS 1
#define AE_TIME_EVENTS 2
#define AE_ALL_EVENTS (AE_FILE_EVENTS|AE_TIME_EVENTS)
#define AE_DONT_WAIT 4
</span></code></pre></div></div>
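<p>Because each flag occupies its own bit, a caller can OR them together and the processing code can test them one by one. The sketch below copies the flag values from the macros above; the three helper functions are hypothetical and not part of ae.h.</p>

```cpp
// Flag values copied from the macros above.
constexpr int AE_FILE_EVENTS = 1;
constexpr int AE_TIME_EVENTS = 2;
constexpr int AE_ALL_EVENTS  = AE_FILE_EVENTS | AE_TIME_EVENTS;
constexpr int AE_DONT_WAIT   = 4;

// Hypothetical helpers that test a single bit of the flags argument.
bool wants_file_events(int flags) { return (flags & AE_FILE_EVENTS) != 0; }
bool wants_time_events(int flags) { return (flags & AE_TIME_EVENTS) != 0; }
bool may_block(int flags)         { return (flags & AE_DONT_WAIT) == 0; }
```

<p>For example, passing <code class="highlighter-rouge">AE_ALL_EVENTS | AE_DONT_WAIT</code> asks for both kinds of events without blocking.</p>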
<p>A file event can be registered for two kinds of operations: <strong>read</strong> and <strong>write</strong>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define AE_NONE 0
#define AE_READABLE 1
#define AE_WRITABLE 2
</span></code></pre></div></div>
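<p>Depending on the OS, the event library wraps a different low-level implementation, such as evport, epoll, kqueue or select. The functions of these low-level implementations all start with <code class="highlighter-rouge">aeApi</code> (for example <code class="highlighter-rouge">aeApiPoll</code>).</p>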
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
</span> <span class="cp">#ifdef HAVE_EPOLL
</span> <span class="cp">#include "ae_epoll.c"
</span> <span class="cp">#else
</span> <span class="cp">#ifdef HAVE_KQUEUE
</span> <span class="cp">#include "ae_kqueue.c"
</span> <span class="cp">#else
</span> <span class="cp">#include "ae_select.c"
</span> <span class="cp">#endif
</span> <span class="cp">#endif
#endif
</span></code></pre></div></div>
<h2 id="data-structure">Data Structure</h2>
<p>Related file: <strong>ae.h</strong></p>
<h3 id="file-event">File event</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">aeFileEvent</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">mask</span><span class="p">;</span> <span class="c1">// bitmask: AE_READABLE, AE_WRITABLE, or both
</span> <span class="n">aeFileProc</span> <span class="o">*</span><span class="n">rfileProc</span><span class="p">;</span> <span class="c1">// read call back function
</span> <span class="n">aeFileProc</span> <span class="o">*</span><span class="n">wfileProc</span><span class="p">;</span> <span class="c1">// write call back function
</span> <span class="kt">void</span> <span class="o">*</span><span class="n">clientData</span><span class="p">;</span> <span class="c1">// caller's private data, passed to the callbacks
</span><span class="p">}</span> <span class="n">aeFileEvent</span><span class="p">;</span>
</code></pre></div></div>
<p>Fired file event:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">aeFiredEvent</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span> <span class="n">aeFiredEvent</span><span class="p">;</span>
</code></pre></div></div>
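<p>To show how the two structures work together, here is a simplified sketch of dispatching fired events to the registered callbacks. The types are stripped-down stand-ins for the ae.h ones (the real callbacks also receive the event loop and <code class="highlighter-rouge">clientData</code>), the function <code class="highlighter-rouge">dispatch</code> is hypothetical, and it assumes that registered events are stored in an array indexed by fd.</p>

```cpp
#include <vector>

constexpr int AE_NONE = 0, AE_READABLE = 1, AE_WRITABLE = 2;

// Simplified callback type: only the fd and the fired mask are passed.
using FileProc = void (*)(int fd, int mask);

struct FileEvent {
    int mask = AE_NONE;            // what the fd is registered for
    FileProc rfileProc = nullptr;  // read callback
    FileProc wfileProc = nullptr;  // write callback
};

struct FiredEvent {
    int fd;
    int mask;                      // what actually happened on the fd
};

// Runs the read and/or write callback for every fired event.
// Returns the number of fired events that were handled.
int dispatch(const std::vector<FileEvent> &events, const std::vector<FiredEvent> &fired)
{
    int processed = 0;
    for (const FiredEvent &f : fired) {
        if (f.fd < 0 || f.fd >= static_cast<int>(events.size()))
            continue;                                  // fd is outside the tracked range
        const FileEvent &fe = events[f.fd];            // assumption: indexed by fd
        if ((fe.mask & f.mask & AE_READABLE) && fe.rfileProc)
            fe.rfileProc(f.fd, f.mask);
        if ((fe.mask & f.mask & AE_WRITABLE) && fe.wfileProc)
            fe.wfileProc(f.fd, f.mask);
        ++processed;
    }
    return processed;
}
```

<p>A callback only runs when the operation is both registered in the file event's mask and reported in the fired mask.</p>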
<h3 id="time-event">Time event</h3>
<p>All time events are <strong>linked in a singly linked list</strong>. A new time event is inserted at the head, so the list holds the events in reverse order of creation.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">aeTimeEvent</span> <span class="p">{</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">id</span><span class="p">;</span> <span class="cm">/* time event identifier. */</span>
<span class="kt">long</span> <span class="n">when_sec</span><span class="p">;</span> <span class="cm">/* seconds */</span>
<span class="kt">long</span> <span class="n">when_ms</span><span class="p">;</span> <span class="cm">/* milliseconds */</span>
<span class="n">aeTimeProc</span> <span class="o">*</span><span class="n">timeProc</span><span class="p">;</span>
<span class="n">aeEventFinalizerProc</span> <span class="o">*</span><span class="n">finalizerProc</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">clientData</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">aeTimeEvent</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="p">}</span> <span class="n">aeTimeEvent</span><span class="p">;</span>
</code></pre></div></div>
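<p>The head insertion described above can be sketched as follows. This is a minimal illustration, not the code from ae.c: <em>timeEvent</em>, <em>timerList</em>, <em>timerListAdd</em> and <em>timerListFree</em> are hypothetical names, and the structs keep only the fields needed to show the list handling.</p>

```c
#include <stdlib.h>

/* Simplified stand-in for aeTimeEvent: only the fields the list needs. */
typedef struct timeEvent {
    long long id;
    struct timeEvent *next;
} timeEvent;

/* Simplified stand-in for the time event part of aeEventLoop. */
typedef struct timerList {
    long long nextId;   /* plays the role of timeEventNextId */
    timeEvent *head;    /* plays the role of timeEventHead */
} timerList;

/* Push a new event at the head. Returns its id, or -1 if malloc fails. */
long long timerListAdd(timerList *l) {
    timeEvent *te = malloc(sizeof(*te));
    if (te == NULL) return -1;
    te->id = l->nextId++;
    te->next = l->head;   /* the new node points at the old head */
    l->head = te;         /* and becomes the new head */
    return te->id;
}

/* Free every node and leave the list empty. */
void timerListFree(timerList *l) {
    while (l->head) {
        timeEvent *next = l->head->next;
        free(l->head);
        l->head = next;
    }
}
```

<p>Adding three events yields ids 0, 1 and 2, and the list then reads 2, 1, 0 from the head. Since the list is not ordered by firing time, finding the nearest timer means scanning every node, which is presumably what <em>aeSearchNearestTimer</em> does in <em>aeProcessEvents</em>.</p>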
<h3 id="event-handle">Event handle</h3>
<p><em>setsize</em> is the number of file descriptors this event loop can track. For example, if setsize is 1024, the loop can handle fds 0 to 1023; if setsize is 10, it can only handle fds 0 to 9.</p>
<p><em>maxfd</em> is the highest fd currently registered, or -1 when none is registered. It is always less than <em>setsize</em>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">aeEventLoop</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">maxfd</span><span class="p">;</span> <span class="cm">/* highest file descriptor currently registered */</span>
<span class="kt">int</span> <span class="n">setsize</span><span class="p">;</span> <span class="cm">/* max number of file descriptors tracked */</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">timeEventNextId</span><span class="p">;</span>
<span class="kt">time_t</span> <span class="n">lastTime</span><span class="p">;</span> <span class="cm">/* Used to detect system clock skew */</span>
<span class="n">aeFileEvent</span> <span class="o">*</span><span class="n">events</span><span class="p">;</span> <span class="cm">/* Registered events */</span>
<span class="n">aeFiredEvent</span> <span class="o">*</span><span class="n">fired</span><span class="p">;</span> <span class="cm">/* Fired events */</span>
<span class="n">aeTimeEvent</span> <span class="o">*</span><span class="n">timeEventHead</span><span class="p">;</span> <span class="c1">// time event single list head
</span> <span class="kt">int</span> <span class="n">stop</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">apidata</span><span class="p">;</span> <span class="cm">/* This is used for polling API specific data */</span>
<span class="n">aeBeforeSleepProc</span> <span class="o">*</span><span class="n">beforesleep</span><span class="p">;</span>
<span class="p">}</span> <span class="n">aeEventLoop</span><span class="p">;</span>
</code></pre></div></div>
<h2 id="event-library-api">Event Library API</h2>
<p>Related file: <strong>ae.c</strong></p>
<h3 id="init">Init</h3>
<p>Create the event loop handle, which is later used to add and delete events. The <em>events</em> and <em>fired</em> arrays are allocated according to <em>setsize</em>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aeEventLoop</span> <span class="o">*</span><span class="nf">aeCreateEventLoop</span><span class="p">(</span><span class="kt">int</span> <span class="n">setsize</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">eventLoop</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">eventLoop</span><span class="p">)))</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
<span class="c1">// allocate memory for file events
</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">aeFileEvent</span><span class="p">)</span><span class="o">*</span><span class="n">setsize</span><span class="p">);</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">aeFiredEvent</span><span class="p">)</span><span class="o">*</span><span class="n">setsize</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="o">||</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">setsize</span> <span class="o">=</span> <span class="n">setsize</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">lastTime</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">timeEventHead</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// null list
</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">timeEventNextId</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">stop</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">maxfd</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// no event added at present, so set it -1
</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">beforesleep</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">aeApiCreate</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
<span class="c1">// all file event's initial state is AE_NONE
</span> <span class="c1">// it will be changed when adding event
</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">setsize</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">mask</span> <span class="o">=</span> <span class="n">AE_NONE</span><span class="p">;</span>
<span class="k">return</span> <span class="n">eventLoop</span><span class="p">;</span>
<span class="nl">err:</span>
<span class="p">...</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="destroy">Destroy</h3>
<p>Release the polling API state, the <em>events</em> and <em>fired</em> arrays, and finally the event loop itself.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">aeDeleteEventLoop</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aeApiFree</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="adddelete-event">Add/Delete Event</h3>
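<p>After checking that <em>fd</em> is below <em>setsize</em>, call <em>aeApiAddEvent</em> to register the fd and mask with epoll, then record the mask, callback and client data in the <em>events</em> slot indexed by fd.</p>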
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">aeCreateFileEvent</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mask</span><span class="p">,</span>
<span class="n">aeFileProc</span> <span class="o">*</span><span class="n">proc</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">clientData</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// check fd
</span> <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">>=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">setsize</span><span class="p">)</span> <span class="p">{</span>
<span class="n">errno</span> <span class="o">=</span> <span class="n">ERANGE</span><span class="p">;</span>
<span class="k">return</span> <span class="n">AE_ERR</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">aeFileEvent</span> <span class="o">*</span><span class="n">fe</span> <span class="o">=</span> <span class="o">&</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">[</span><span class="n">fd</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="n">aeApiAddEvent</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">AE_ERR</span><span class="p">;</span>
<span class="n">fe</span><span class="o">-></span><span class="n">mask</span> <span class="o">|=</span> <span class="n">mask</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mask</span> <span class="o">&</span> <span class="n">AE_READABLE</span><span class="p">)</span> <span class="n">fe</span><span class="o">-></span><span class="n">rfileProc</span> <span class="o">=</span> <span class="n">proc</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mask</span> <span class="o">&</span> <span class="n">AE_WRITABLE</span><span class="p">)</span> <span class="n">fe</span><span class="o">-></span><span class="n">wfileProc</span> <span class="o">=</span> <span class="n">proc</span><span class="p">;</span>
<span class="n">fe</span><span class="o">-></span><span class="n">clientData</span> <span class="o">=</span> <span class="n">clientData</span><span class="p">;</span>
<span class="c1">// update maxfd
</span> <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">></span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">maxfd</span><span class="p">)</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">maxfd</span> <span class="o">=</span> <span class="n">fd</span><span class="p">;</span>
<span class="k">return</span> <span class="n">AE_OK</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Deleting an event (<em>aeDeleteFileEvent</em>) is similar to adding one.</p>
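<p>A sketch of what the delete path might look like, assuming it mirrors the add path: clear the requested bits from the slot's mask and, if the highest registered fd just became unused, scan downwards for the new <em>maxfd</em>. The types <em>fileEvent</em> and <em>loop</em>, the function <em>loopDeleteFileEvent</em>, and the numeric values of the AE_* constants are assumptions made for this illustration; the call into the polling backend is left out.</p>

```c
/* Assumed values for this sketch; only "AE_NONE means no bits set" matters. */
#define AE_NONE     0
#define AE_READABLE 1
#define AE_WRITABLE 2

/* Simplified stand-ins for aeFileEvent and aeEventLoop. */
typedef struct fileEvent {
    int mask;
} fileEvent;

typedef struct loop {
    int maxfd;          /* highest registered fd, -1 if none */
    int setsize;        /* number of slots in events[] */
    fileEvent *events;  /* indexed by fd */
} loop;

/* Remove the bits in `mask` from the events registered for `fd`. */
void loopDeleteFileEvent(loop *el, int fd, int mask) {
    if (fd < 0 || fd >= el->setsize) return;   /* fd outside the table */
    fileEvent *fe = &el->events[fd];
    if (fe->mask == AE_NONE) return;           /* nothing registered */

    fe->mask &= ~mask;

    /* If the highest fd is now unused, walk down to the next used slot.
     * When no slot is in use the loop ends with j == -1. */
    if (fd == el->maxfd && fe->mask == AE_NONE) {
        int j;
        for (j = el->maxfd - 1; j >= 0; j--)
            if (el->events[j].mask != AE_NONE) break;
        el->maxfd = j;
    }
    /* A real implementation would also update the polling backend here. */
}
```

<p>With fds 3 and 5 registered, deleting fd 5 moves <em>maxfd</em> down to 3; deleting the last remaining event resets it to -1, the same value <em>aeCreateEventLoop</em> starts with.</p>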
<h3 id="deal-with-event">Deal with Event</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">aeProcessEvents</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">processed</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">numevents</span><span class="p">;</span>
<span class="cm">/* Nothing to do? return ASAP */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_TIME_EVENTS</span><span class="p">)</span> <span class="o">&&</span> <span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_FILE_EVENTS</span><span class="p">))</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="cm">/* Note that we want call select() even if there are no
* file events to process as long as we want to process time
* events, in order to sleep until the next time event is ready
* to fire. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">maxfd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="o">||</span>
<span class="p">((</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_TIME_EVENTS</span><span class="p">)</span> <span class="o">&&</span> <span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_DONT_WAIT</span><span class="p">)))</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">j</span><span class="p">;</span>
<span class="n">aeTimeEvent</span> <span class="o">*</span><span class="n">shortest</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">,</span> <span class="o">*</span><span class="n">tvp</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_TIME_EVENTS</span> <span class="o">&&</span> <span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_DONT_WAIT</span><span class="p">))</span>
<span class="n">shortest</span> <span class="o">=</span> <span class="n">aeSearchNearestTimer</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">shortest</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">long</span> <span class="n">now_sec</span><span class="p">,</span> <span class="n">now_ms</span><span class="p">;</span>
<span class="cm">/* Calculate the time missing for the nearest
* timer to fire. */</span>
<span class="n">aeGetTime</span><span class="p">(</span><span class="o">&</span><span class="n">now_sec</span><span class="p">,</span> <span class="o">&</span><span class="n">now_ms</span><span class="p">);</span>
<span class="n">tvp</span> <span class="o">=</span> <span class="o">&</span><span class="n">tv</span><span class="p">;</span>
<span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span> <span class="o">=</span> <span class="n">shortest</span><span class="o">-></span><span class="n">when_sec</span> <span class="o">-</span> <span class="n">now_sec</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">shortest</span><span class="o">-></span><span class="n">when_ms</span> <span class="o"><</span> <span class="n">now_ms</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span> <span class="o">=</span> <span class="p">((</span><span class="n">shortest</span><span class="o">-></span><span class="n">when_ms</span><span class="o">+</span><span class="mi">1000</span><span class="p">)</span> <span class="o">-</span> <span class="n">now_ms</span><span class="p">)</span><span class="o">*</span><span class="mi">1000</span><span class="p">;</span>
<span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span> <span class="o">--</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span> <span class="o">=</span> <span class="p">(</span><span class="n">shortest</span><span class="o">-></span><span class="n">when_ms</span> <span class="o">-</span> <span class="n">now_ms</span><span class="p">)</span><span class="o">*</span><span class="mi">1000</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="cm">/* If we have to check for events but need to return
* ASAP because of AE_DONT_WAIT we need to set the timeout
* to zero */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_DONT_WAIT</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">=</span> <span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">tvp</span> <span class="o">=</span> <span class="o">&</span><span class="n">tv</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="cm">/* Otherwise we can block */</span>
<span class="n">tvp</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="cm">/* wait forever */</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">numevents</span> <span class="o">=</span> <span class="n">aeApiPoll</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">,</span> <span class="n">tvp</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">numevents</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aeFileEvent</span> <span class="o">*</span><span class="n">fe</span> <span class="o">=</span> <span class="o">&</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">[</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">mask</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">rfired</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="cm">/* note the fe->mask & mask & ... code: maybe an already processed
* event removed an element that fired and we still didn't
* processed, so we check if the event is still valid. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fe</span><span class="o">-></span><span class="n">mask</span> <span class="o">&</span> <span class="n">mask</span> <span class="o">&</span> <span class="n">AE_READABLE</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rfired</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">fe</span><span class="o">-></span><span class="n">rfileProc</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">,</span><span class="n">fd</span><span class="p">,</span><span class="n">fe</span><span class="o">-></span><span class="n">clientData</span><span class="p">,</span><span class="n">mask</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fe</span><span class="o">-></span><span class="n">mask</span> <span class="o">&</span> <span class="n">mask</span> <span class="o">&</span> <span class="n">AE_WRITABLE</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">rfired</span> <span class="o">||</span> <span class="n">fe</span><span class="o">-></span><span class="n">wfileProc</span> <span class="o">!=</span> <span class="n">fe</span><span class="o">-></span><span class="n">rfileProc</span><span class="p">)</span>
<span class="n">fe</span><span class="o">-></span><span class="n">wfileProc</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">,</span><span class="n">fd</span><span class="p">,</span><span class="n">fe</span><span class="o">-></span><span class="n">clientData</span><span class="p">,</span><span class="n">mask</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">processed</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="cm">/* Check time events */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">AE_TIME_EVENTS</span><span class="p">)</span>
<span class="n">processed</span> <span class="o">+=</span> <span class="n">processTimeEvents</span><span class="p">(</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="k">return</span> <span class="n">processed</span><span class="p">;</span> <span class="cm">/* return the number of processed file/time events */</span>
<span class="p">}</span>
</code></pre></div></div>
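<p>The timeout arithmetic in the listing is easier to follow with numbers. For a timer due at 10 s 200 ms while the clock reads 8 s 700 ms, the borrow branch gives tv_usec = (200 + 1000 - 700) * 1000 = 500000 and tv_sec = 2 - 1 = 1, that is 1.5 seconds. For an overdue timer whose <em>when_ms</em> is smaller than <em>now_ms</em>, the same borrow leaves tv_usec positive while tv_sec is clamped to 0, so the computed wait is not zero. The sketch below is an alternative formulation, not the code from ae.c: the hypothetical helper <em>timeUntil</em> computes the difference in milliseconds first and clamps once.</p>

```c
#include <sys/time.h>

/* Time remaining until (when_sec, when_ms), measured from (now_sec, now_ms).
 * Returns a zero timeval when the deadline is now or already in the past. */
struct timeval timeUntil(long when_sec, long when_ms,
                         long now_sec, long now_ms) {
    struct timeval tv = {0, 0};
    /* Work in milliseconds so seconds and milliseconds cannot disagree. */
    long long ms = (long long)(when_sec - now_sec) * 1000 + (when_ms - now_ms);
    if (ms > 0) {
        tv.tv_sec = ms / 1000;
        tv.tv_usec = (ms % 1000) * 1000;
    }
    return tv;
}
```

<p>Here <em>timeUntil(10, 200, 8, 700)</em> yields 1 second and 500000 microseconds, matching the hand calculation, while <em>timeUntil(5, 100, 6, 300)</em> yields zero for the overdue case.</p>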
<h2 id="low-level-api">Low Level API</h2>
<p>These low-level APIs are used by the event library and hide the operating-system-specific details.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">aeApiCreate</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">aeApiAddEvent</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mask</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">aeApiDelEvent</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">delmask</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">aeApiPoll</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="k">struct</span> <span class="n">timeval</span> <span class="o">*</span><span class="n">tvp</span><span class="p">);</span>
</code></pre></div></div>
<p>Taking the epoll implementation as an example, the functions map as follows:</p>
<blockquote>
<p>aeApiCreate -&gt; epoll_create</p>
</blockquote>
<blockquote>
<p>aeApiAddEvent/aeApiDelEvent -&gt; epoll_ctl</p>
</blockquote>
<blockquote>
<p>aeApiPoll -&gt; epoll_wait</p>
</blockquote>
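<p>To make the mapping concrete, here is a small Linux-only sketch that uses the three epoll calls directly, without the ae wrappers. <em>epollPipeDemo</em> is a hypothetical helper written for this illustration: it registers the read end of a pipe for EPOLLIN, writes one byte to the pipe, and checks that epoll_wait reports the fd as readable.</p>

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if epoll reports the pipe's read end as readable after a write,
 * 0 if it does not, and -1 if any system call fails. */
int epollPipeDemo(void) {
    int fds[2];
    int epfd = -1;
    int n;
    int result = -1;
    struct epoll_event ee;
    struct epoll_event fired[1];

    if (pipe(fds) == -1) return -1;

    /* Counterpart of aeApiCreate: the size argument is only a hint. */
    epfd = epoll_create(1024);
    if (epfd == -1) goto done;

    /* Counterpart of aeApiAddEvent: watch the read end for input. */
    memset(&ee, 0, sizeof(ee));
    ee.events = EPOLLIN;
    ee.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ee) == -1) goto done;

    /* Make the read end readable. */
    if (write(fds[1], "x", 1) != 1) goto done;

    /* Counterpart of aeApiPoll: wait at most 1000 ms for one event. */
    n = epoll_wait(epfd, fired, 1, 1000);
    if (n == -1) goto done;
    result = (n == 1 && fired[0].data.fd == fds[0] &&
              (fired[0].events & EPOLLIN)) ? 1 : 0;

done:
    if (epfd != -1) close(epfd);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

<p>The ae wrappers add two things on top of these calls: the translation between AE_* masks and EPOLL* bits, and the bookkeeping in the <em>events</em> and <em>fired</em> arrays.</p>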
<h3 id="epoll-wrapper-data-structure">Epoll wrapper data structure</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">aeApiState</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">epfd</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">events</span><span class="p">;</span>
<span class="p">}</span> <span class="n">aeApiState</span><span class="p">;</span>
</code></pre></div></div>
<h3 id="create-epoll-handle">Create epoll handle</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">aeApiCreate</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// allocate memory (error handling omitted here)
</span> <span class="n">aeApiState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">aeApiState</span><span class="p">));</span>
<span class="n">state</span><span class="o">-></span><span class="n">events</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">epoll_event</span><span class="p">)</span><span class="o">*</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">setsize</span><span class="p">);</span>
<span class="c1">// create epoll fd
</span> <span class="n">state</span><span class="o">-></span><span class="n">epfd</span> <span class="o">=</span> <span class="n">epoll_create</span><span class="p">(</span><span class="mi">1024</span><span class="p">);</span>
<span class="c1">// initialize specific api data
</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">apidata</span> <span class="o">=</span> <span class="n">state</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="adddelete-event-to-epoll-fd">Add/Delete event to epoll fd</h3>
<p>It converts the event library mask to an epoll mask and then calls epoll_ctl.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">aeApiAddEvent</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mask</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aeApiState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">apidata</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">epoll_event</span> <span class="n">ee</span><span class="p">;</span>
<span class="cm">/* If the fd was already monitored for some event, we need a MOD
* operation. Otherwise we need an ADD operation. */</span>
<span class="kt">int</span> <span class="n">op</span> <span class="o">=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">[</span><span class="n">fd</span><span class="p">].</span><span class="n">mask</span> <span class="o">==</span> <span class="n">AE_NONE</span> <span class="o">?</span>
<span class="n">EPOLL_CTL_ADD</span> <span class="o">:</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">;</span>
<span class="n">ee</span><span class="p">.</span><span class="n">events</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">mask</span> <span class="o">|=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">events</span><span class="p">[</span><span class="n">fd</span><span class="p">].</span><span class="n">mask</span><span class="p">;</span> <span class="cm">/* Merge old events */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mask</span> <span class="o">&</span> <span class="n">AE_READABLE</span><span class="p">)</span> <span class="n">ee</span><span class="p">.</span><span class="n">events</span> <span class="o">|=</span> <span class="n">EPOLLIN</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mask</span> <span class="o">&</span> <span class="n">AE_WRITABLE</span><span class="p">)</span> <span class="n">ee</span><span class="p">.</span><span class="n">events</span> <span class="o">|=</span> <span class="n">EPOLLOUT</span><span class="p">;</span>
<span class="n">ee</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">fd</span> <span class="o">=</span> <span class="n">fd</span><span class="p">;</span>
<span class="c1">// epoll system call
</span> <span class="k">if</span> <span class="p">(</span><span class="n">epoll_ctl</span><span class="p">(</span><span class="n">state</span><span class="o">-></span><span class="n">epfd</span><span class="p">,</span><span class="n">op</span><span class="p">,</span><span class="n">fd</span><span class="p">,</span><span class="o">&</span><span class="n">ee</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Deleting an event is similar to adding one.</p>
<h3 id="deal-with-event-1">Deal with event</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">aeApiPoll</span><span class="p">(</span><span class="n">aeEventLoop</span> <span class="o">*</span><span class="n">eventLoop</span><span class="p">,</span> <span class="k">struct</span> <span class="n">timeval</span> <span class="o">*</span><span class="n">tvp</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aeApiState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">eventLoop</span><span class="o">-></span><span class="n">apidata</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">retval</span><span class="p">,</span> <span class="n">numevents</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">epoll_wait</span><span class="p">(</span><span class="n">state</span><span class="o">-></span><span class="n">epfd</span><span class="p">,</span><span class="n">state</span><span class="o">-></span><span class="n">events</span><span class="p">,</span><span class="n">eventLoop</span><span class="o">-></span><span class="n">setsize</span><span class="p">,</span>
<span class="n">tvp</span> <span class="o">?</span> <span class="p">(</span><span class="n">tvp</span><span class="o">-></span><span class="n">tv_sec</span><span class="o">*</span><span class="mi">1000</span> <span class="o">+</span> <span class="n">tvp</span><span class="o">-></span><span class="n">tv_usec</span><span class="o">/</span><span class="mi">1000</span><span class="p">)</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">retval</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">j</span><span class="p">;</span>
<span class="n">numevents</span> <span class="o">=</span> <span class="n">retval</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">numevents</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">e</span> <span class="o">=</span> <span class="n">state</span><span class="o">-></span><span class="n">events</span><span class="o">+</span><span class="n">j</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLIN</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">AE_READABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLOUT</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">AE_WRITABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLERR</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">AE_WRITABLE</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="o">-></span><span class="n">events</span> <span class="o">&</span> <span class="n">EPOLLHUP</span><span class="p">)</span> <span class="n">mask</span> <span class="o">|=</span> <span class="n">AE_WRITABLE</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">fd</span> <span class="o">=</span> <span class="n">e</span><span class="o">-></span><span class="n">data</span><span class="p">.</span><span class="n">fd</span><span class="p">;</span>
<span class="n">eventLoop</span><span class="o">-></span><span class="n">fired</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">mask</span> <span class="o">=</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">numevents</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
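<p>The timeout argument that <em>aeApiPoll</em> passes to <em>epoll_wait</em> is in milliseconds, with -1 meaning "block until something happens". The conversion can be isolated as a small helper; the name <em>timeval_to_epoll_timeout</em> is a hypothetical one chosen for this sketch and is not part of Redis:</p>

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: convert an optional timeval into the millisecond
 * timeout that epoll_wait() expects. NULL means "block indefinitely". */
static int timeval_to_epoll_timeout(const struct timeval *tvp) {
    if (tvp == NULL) return -1;
    /* Sub-millisecond remainders are truncated, as in the code above. */
    return (int)(tvp->tv_sec * 1000 + tvp->tv_usec / 1000);
}
```

<p>Note that a timeout shorter than one millisecond truncates to 0, which makes <em>epoll_wait</em> return immediately instead of sleeping.</p>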
<h1 id="reference">Reference</h1>
<ul>
<li><a href="http://redis.io/topics/internals-eventlib">http://redis.io/topics/internals-eventlib</a></li>
<li><a href="http://redis.io/topics/internals-rediseventlib">http://redis.io/topics/internals-rediseventlib</a></li>
</ul>
Redis Communication Protocol
2014-08-16T00:00:00+00:00
http://blog.wjin.org/posts/redis-communication-protocol
<h1 id="introduction">Introduction</h1>
<p>Redis clients communicate with the Redis server using a protocol called <strong>RESP</strong> (REdis Serialization Protocol).</p>
<p>RESP is simple, fast to parse, and human readable. It is also binary safe and can serialize different data types such as integers, strings, and arrays. It is an application-layer protocol that runs over TCP: a client opens a <strong>TCP connection</strong> to the Redis server, by default on port 6379.</p>
<p>Note: RESP is only used for client-server communication. Redis Cluster uses a different binary protocol in order to exchange messages between nodes.</p>
<h1 id="protocol-specification">Protocol Specification</h1>
<h2 id="data-type">Data Type</h2>
<p>RESP has five basic data types. The first byte of a value identifies its type:</p>
<ul>
<li>Simple Strings : +</li>
<li>Errors : -</li>
<li>Integers : :</li>
<li>Bulk Strings : $</li>
<li>Bulk Arrays : *</li>
</ul>
<h2 id="bulk-string">Bulk String</h2>
<p>Bulk Strings are encoded in the following way:</p>
<ul>
<li>A “$” byte</li>
<li>Number of bytes composing the string</li>
<li>CRLF</li>
<li>Actual string data</li>
<li>CRLF</li>
</ul>
<p>Example:</p>
<ul>
<li>normal bulk string: “$6\r\nfoobar\r\n”</li>
<li>empty string: “$0\r\n\r\n”</li>
<li>null string: “$-1\r\n”</li>
</ul>
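<p>The encoding rules above can be sketched as a small C function. This is a minimal illustration, not code from Redis or any client library; the name <em>resp_encode_bulk</em> and its signature are assumptions made for the example:</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode `len` bytes of `data` as a RESP Bulk String.
 * Pass data == NULL to produce the null bulk string "$-1\r\n".
 * Returns the number of bytes written, or -1 if `out` is too small. */
static int resp_encode_bulk(char *out, size_t cap, const char *data, size_t len) {
    if (data == NULL) {
        int n = snprintf(out, cap, "$-1\r\n");
        return (n < 0 || (size_t)n >= cap) ? -1 : n;
    }
    int n = snprintf(out, cap, "$%zu\r\n", len);      /* "$" + byte count + CRLF */
    if (n < 0 || (size_t)n + len + 2 > cap) return -1;
    memcpy(out + n, data, len);                       /* binary safe: length-driven copy */
    memcpy(out + n + len, "\r\n", 2);                 /* trailing CRLF */
    return n + (int)len + 2;
}
```

<p>Because the payload is copied by length rather than scanned for a terminator, it may contain any byte, including CR, LF, or NUL. The output is not NUL-terminated, so compare it with <em>memcmp</em> rather than <em>strcmp</em>.</p>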
<h2 id="bulk-array">Bulk Array</h2>
<p>Bulk Arrays are sent using the following format:</p>
<ul>
<li>A “*” byte</li>
<li>Number of elements in the array</li>
<li>CRLF</li>
</ul>
<p>The header is followed by one RESP-encoded value for each element of the array; elements may be of different types.</p>
<p>Example:</p>
<ul>
<li>null array: “*-1\r\n”</li>
<li>empty array: “*0\r\n”</li>
<li>integer array: “*3\r\n:1\r\n:2\r\n:3\r\n”</li>
<li>string array: “*2\r\n$3\r\nfoo\r\n$3\r\nbar\r\n”</li>
<li>mixed array: “*2\r\n$3\r\nfoo\r\n:1\r\n”</li>
<li>array of arrays:</li>
</ul>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*2\r\n
*3\r\n
:1\r\n
:2\r\n
:3\r\n
*2\r\n
+Foo\r\n
-Bar\r\n
</code></pre></div></div>
<ul>
<li>null element in array:</li>
</ul>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*3\r\n
$3\r\n
foo\r\n
$-1\r\n
$3\r\n
bar\r\n
</code></pre></div></div>
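<p>As a small illustration of the array format, the following sketch encodes an array of integers. The function name <em>resp_encode_int_array</em> is hypothetical and the code is not taken from Redis:</p>

```c
#include <stdio.h>

/* Hypothetical helper: encode `count` integers as a RESP Array of Integers.
 * Returns bytes written (excluding the trailing NUL), or -1 on overflow. */
static int resp_encode_int_array(char *out, size_t cap,
                                 const long long *vals, size_t count) {
    int n = snprintf(out, cap, "*%zu\r\n", count);    /* array header */
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;
    for (size_t i = 0; i < count; i++) {
        /* Each element is a complete RESP value of its own. */
        n = snprintf(out + used, cap - used, ":%lld\r\n", vals[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

<p>Encoding the values 1, 2, 3 yields the same bytes as the integer array example in the list above.</p>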
<h1 id="request-and-response">Request and Response</h1>
<ul>
<li>Clients send commands to a Redis server as a <strong>RESP Array</strong> of Bulk Strings.</li>
<li>The server replies with <strong>one of the RESP types</strong>, depending on the specific command.</li>
</ul>
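<p>A request can therefore be built by writing an array header followed by one Bulk String per argument. The sketch below assumes NUL-terminated arguments for brevity, so unlike the protocol itself it is not binary safe; <em>resp_encode_command</em> is a hypothetical name:</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode argv[0..argc-1] as a RESP Array of Bulk
 * Strings, the format clients use to send a command.
 * Returns bytes written (excluding the trailing NUL), or -1 on overflow. */
static int resp_encode_command(char *out, size_t cap, int argc, const char **argv) {
    int n = snprintf(out, cap, "*%d\r\n", argc);      /* number of arguments */
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]);                 /* assumption: no embedded NUL */
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n", len, argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

<p>For example, the arguments SET, key, value are encoded as “*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n”.</p>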
<h1 id="pipelining">Pipelining</h1>
<p>Clients and servers are connected via a network link. Whatever the network latency is, packets need time to travel from the client to the server, and the reply needs time to travel back. This time is called <strong>RTT</strong> (Round Trip Time). When every command waits for the previous reply, this latency, rather than the server itself, can limit how many commands a client completes per second.</p>
<p>Pipelining sends <strong>multiple commands</strong> to the server without waiting for the individual replies, and then reads all the replies in a single step. The RTT is therefore paid once for the whole batch rather than once per command.</p>
<p>NOTE: While the client is sending pipelined commands, the server must queue the replies in memory. To limit memory use, send commands in batches of a reasonable size, for example <strong>10k</strong> commands at a time.</p>
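<p>The saving can be estimated with simple arithmetic. The sketch below models only network latency (it ignores server processing and bandwidth), and the function names are hypothetical:</p>

```c
#include <stddef.h>

/* Round trips needed to send `total` commands when each pipelined batch
 * carries at most `batch` commands (batch == 1 means no pipelining). */
static size_t round_trips(size_t total, size_t batch) {
    if (batch == 0) return 0;               /* invalid batch size */
    return (total + batch - 1) / batch;     /* ceiling division */
}

/* Time spent purely waiting on the network, in milliseconds. */
static double latency_cost_ms(size_t total, size_t batch, double rtt_ms) {
    return (double)round_trips(total, batch) * rtt_ms;
}
```

<p>Under this model, 100,000 commands over a link with a 1 ms RTT spend 100,000 ms waiting when sent one by one, but only 10 ms when sent in batches of 10k.</p>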
<h1 id="reference">Reference</h1>
<ul>
<li><a href="http://redis.io/topics/protocol">http://redis.io/topics/protocol</a></li>
<li><a href="http://redis.io/topics/pipelining">http://redis.io/topics/pipelining</a></li>
</ul>
Concurrent Server Model
2014-08-16T00:00:00+00:00
http://blog.wjin.org/posts/concurrent-server-model
<h1 id="introduction">Introduction</h1>
<p>Developing large-scale, highly concurrent network services is challenging. Common models include PPC (process per connection), TPC (thread per connection), and I/O multiplexing with select, poll, or epoll.</p>
<h1 id="ppc-and-tpc">PPC and TPC</h1>
<p>PPC and TPC create a new process or thread for each connection. This model has an inherent drawback: when the number of connections grows dramatically, the overhead of creating a new process or thread for each one can no longer be ignored.</p>
<p>Even though a <strong>process/thread pool</strong> may alleviate this, context switching and resource contention between processes/threads can noticeably slow down the system's response. This is the well-known <strong>Apache web server avalanche</strong> problem.</p>
<h1 id="io-multiplexing">I/O Multiplexing</h1>
<h2 id="select-and-poll">Select and Poll</h2>
<p>Select and poll allow a program to monitor multiple file descriptors, waiting until one or more of them become “ready”. A file descriptor is considered ready if it is possible to perform the corresponding I/O operation without blocking.</p>
<p>There are some drawbacks for select:</p>
<ul>
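<li>The number of descriptors it can monitor is capped by FD_SETSIZE (typically 1024)</li>
<li>O(n) time complexity: every call scans all monitored descriptors to find the ready ones</li>
<li>The descriptor set is copied between user space and kernel space on every call</li>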
</ul>
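<p>A minimal sketch of waiting on a single descriptor with <em>select</em> shows where these costs come from. The wrapper name <em>wait_readable_select</em> is hypothetical; the calls inside it are the standard POSIX ones:</p>

```c
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical wrapper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error (including
 * fd >= FD_SETSIZE, which an fd_set cannot represent). */
static int wait_readable_select(int fd, int timeout_ms) {
    if (fd < 0 || fd >= FD_SETSIZE) return -1;
    fd_set rfds;
    FD_ZERO(&rfds);                 /* the set is rebuilt before every call... */
    FD_SET(fd, &rfds);              /* ...and copied into the kernel by select() */
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0) return ready;   /* 0: timeout, -1: error */
    return FD_ISSET(fd, &rfds) ? 1 : 0;   /* caller must test each fd it cares about */
}
```

<p>With many descriptors, both the set construction and the <em>FD_ISSET</em> checks become loops over every monitored descriptor, which is the O(n) cost listed above.</p>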
<h2 id="linux-epoll">Linux Epoll</h2>
<p><strong>Epoll</strong> stands for event poll. It is a variant of poll that can be used either as an <strong>edge-triggered</strong> or a <strong>level-triggered</strong> interface and scales well to large numbers of watched file descriptors. It overcomes the disadvantages of select and poll:</p>
<ul>
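<li>No fixed limit on the number of descriptors; concurrency is bounded only by how many files the process can open</li>
<li>Only ready descriptors are returned, so the cost of a wait does not grow with the number of watched descriptors</li>
<li>The interest list is kept in the kernel and registered once with epoll_ctl, so the full descriptor set is not copied on every call</li>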
</ul>
<p>For a highly concurrent, high-performance service on Linux, epoll is the natural choice, and its system calls are easy to use.</p>
<h3 id="epoll_create">epoll_create</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">epoll_create</span><span class="p">(</span><span class="kt">int</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>
<h3 id="epoll_ctl">epoll_ctl</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">epoll_ctl</span><span class="p">(</span><span class="kt">int</span> <span class="n">epfd</span><span class="p">,</span> <span class="kt">int</span> <span class="n">op</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">event</span><span class="p">);</span>
</code></pre></div></div>
<h3 id="epoll_wait">epoll_wait</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">epoll_wait</span><span class="p">(</span><span class="kt">int</span> <span class="n">epfd</span><span class="p">,</span> <span class="k">struct</span> <span class="n">epoll_event</span> <span class="o">*</span><span class="n">events</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">maxevents</span><span class="p">,</span> <span class="kt">int</span> <span class="n">timeout</span><span class="p">);</span>
<span class="k">struct</span> <span class="n">epoll_event</span> <span class="p">{</span>
<span class="n">__uint32_t</span> <span class="n">events</span><span class="p">;</span> <span class="c1">// Epoll events
</span> <span class="n">epoll_data_t</span> <span class="n">data</span><span class="p">;</span> <span class="c1">// User data variable
</span><span class="p">};</span>
<span class="k">typedef</span> <span class="k">union</span> <span class="n">epoll_data</span> <span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
<span class="n">__uint32_t</span> <span class="n">u32</span><span class="p">;</span>
<span class="n">__uint64_t</span> <span class="n">u64</span><span class="p">;</span>
<span class="p">}</span> <span class="n">epoll_data_t</span><span class="p">;</span>
</code></pre></div></div>
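<p>The three calls fit together as follows. This is a minimal, Linux-only sketch that watches one descriptor and waits once; the wrapper name <em>epoll_wait_one</em> is hypothetical, and a real server would create the epoll instance once and reuse it for every wait:</p>

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical wrapper: register fd for read readiness on a fresh epoll
 * instance and wait once. Returns the fd reported ready, -2 on timeout,
 * or -1 on error. */
static int epoll_wait_one(int fd, int timeout_ms) {
    int epfd = epoll_create(1);     /* size is only a hint but must be > 0 */
    if (epfd == -1) return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;            /* level-triggered unless EPOLLET is set */
    ev.data.fd = fd;                /* user data handed back by epoll_wait */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    struct epoll_event fired[1];
    int n = epoll_wait(epfd, fired, 1, timeout_ms);
    int result = (n == 1) ? fired[0].data.fd : (n == 0 ? -2 : -1);
    close(epfd);
    return result;
}
```

<p>Unlike <em>select</em>, the caller does not scan its descriptors afterwards: <em>epoll_wait</em> fills the array with the ready entries only, and <em>data.fd</em> identifies each one.</p>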
Redis Internal Data Structure: Intset
2014-08-15T00:00:00+00:00
http://blog.wjin.org/posts/redis-internal-data-structure-intset
<h1 id="introduction">Introduction</h1>
<p>Intset is used to store <strong>sorted</strong>, <strong>non-duplicate</strong> integer data.</p>
<p>It can store three types of integers: int16, int32 and int64. The whole set is upgraded automatically when a newly added value does not fit the current encoding (overflow). The encoding only moves upward: <strong>int16 to int32</strong>, <strong>int32 to int64</strong>, or directly from int16 to int64 when the new value requires it. It is never downgraded.</p>
<p>Adding or deleting a value shifts the elements after it, so both operations take <strong>O(n)</strong> time. The shift is done with <strong>memmove</strong>, which many runtime libraries optimise heavily, so it is faster than moving the integers one by one.</p>
<p>It also works on both <strong>little endian</strong> and <strong>big endian</strong> machines.</p>
<p>Here is intset data structure overview:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----------+--------+-----------+------------+-----------+
| encoding | length | first int | second int | ... |
+----------+--------+-----------+------------+-----------+
|
`-> data stored here
</code></pre></div></div>
<h1 id="implementation">Implementation</h1>
<p>Related files: <strong>intset.h</strong> and <strong>intset.c</strong></p>
<h2 id="data-structure">Data Structure</h2>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">intset</span> <span class="p">{</span>
<span class="kt">uint32_t</span> <span class="n">encoding</span><span class="p">;</span> <span class="c1">// encoding
</span> <span class="kt">uint32_t</span> <span class="n">length</span><span class="p">;</span> <span class="c1">// number of integers
</span> <span class="kt">int8_t</span> <span class="n">contents</span><span class="p">[];</span> <span class="c1">// data store here
</span><span class="p">}</span> <span class="n">intset</span><span class="p">;</span>
</code></pre></div></div>
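<p>Because each encoding constant equals the element width in bytes, both the encoding required for a value and the size of the whole blob follow directly from it. The sketch below illustrates the idea with hypothetical names (<em>value_encoding</em>, <em>blob_size</em>, <em>ENC_*</em>); it is not copied from <strong>intset.c</strong>:</p>

```c
#include <stddef.h>
#include <stdint.h>

#define ENC_INT16 (sizeof(int16_t))
#define ENC_INT32 (sizeof(int32_t))
#define ENC_INT64 (sizeof(int64_t))

/* Smallest encoding (element width in bytes) able to hold v. */
static uint8_t value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return ENC_INT64;
    if (v < INT16_MIN || v > INT16_MAX) return ENC_INT32;
    return ENC_INT16;
}

/* Total bytes of an intset-like blob: two uint32_t header fields
 * (encoding and length) followed by `len` elements. */
static size_t blob_size(uint8_t encoding, uint32_t len) {
    return 2 * sizeof(uint32_t) + (size_t)encoding * len;
}
```

<p>For example, three int16 values occupy 8 + 2 * 3 = 14 bytes, and comparing two encodings with <em>&lt;</em> or <em>&gt;</em> tells which one is wider.</p>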
<h2 id="initialization">Initialization</h2>
<p>The default encoding is int16; it is upgraded automatically to int32 or int64 when needed.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">intset</span> <span class="o">*</span><span class="nf">intsetNew</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">intset</span> <span class="o">*</span><span class="n">is</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">intset</span><span class="p">));</span>
<span class="n">is</span><span class="o">-></span><span class="n">encoding</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">INTSET_ENC_INT16</span><span class="p">);</span>
<span class="n">is</span><span class="o">-></span><span class="n">length</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">is</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Function <em>intrev32ifbe</em> handles byte order. It reverses the bytes of an int32 on big endian hosts and does nothing on little endian hosts, so the stored fields are always little endian. There is a family of similar functions:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if (BYTE_ORDER == LITTLE_ENDIAN)
#define memrev16ifbe(p)
#define memrev32ifbe(p)
#define memrev64ifbe(p)
#define intrev16ifbe(v) (v)
#define intrev32ifbe(v) (v)
#define intrev64ifbe(v) (v)
#else
#define memrev16ifbe(p) memrev16(p)
#define memrev32ifbe(p) memrev32(p)
#define memrev64ifbe(p) memrev64(p)
#define intrev16ifbe(v) intrev16(v)
#define intrev32ifbe(v) intrev32(v)
#define intrev64ifbe(v) intrev64(v)
#endif
</span></code></pre></div></div>
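<p>What the reversing variants do can be sketched in a few lines. The names below (<em>rev32</em>, <em>host_is_little_endian</em>, <em>rev32_if_be</em>) are hypothetical, and the byte order is detected at run time for simplicity, whereas the macros above decide at compile time:</p>

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t rev32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* 1 on a little endian host, 0 on a big endian host. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;  /* low byte stored first? */
}

/* Convert between host order and little endian storage order. */
static uint32_t rev32_if_be(uint32_t v) {
    return host_is_little_endian() ? v : rev32(v);
}
```

<p>The same function converts in both directions, because reversing the bytes twice restores the original value.</p>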
<h2 id="destroy">Destroy</h2>
<p>There is no dedicated destroy API for intset. Because initialization makes a single dynamic allocation, destroying an intset only requires releasing that memory.</p>
<h2 id="find">Find</h2>
<p>Because the data in an intset is sorted, we can use <strong>binary search</strong> to find an element. <strong>intsetSearch</strong> also reports the position where <em>value</em> should be inserted when <em>value</em> is not in the set.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="nf">intsetFind</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">value</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">valenc</span> <span class="o">=</span> <span class="n">_intsetValueEncoding</span><span class="p">(</span><span class="n">value</span><span class="p">);</span> <span class="c1">// get encoding for value
</span> <span class="k">return</span> <span class="n">valenc</span> <span class="o"><=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">encoding</span><span class="p">)</span> <span class="o">&&</span> <span class="n">intsetSearch</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">value</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">uint8_t</span> <span class="nf">intsetSearch</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">pos</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">min</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">max</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">mid</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="kt">int64_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="cm">/* The value can never be found when the set is empty */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="cm">/* Check for the case where we know we cannot find the value,
* but do know the insert position. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">value</span> <span class="o">></span> <span class="n">_intsetGet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">value</span> <span class="o"><</span> <span class="n">_intsetGet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="mi">0</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// binary search
</span> <span class="k">while</span><span class="p">(</span><span class="n">max</span> <span class="o">>=</span> <span class="n">min</span><span class="p">)</span> <span class="p">{</span>
<span class="n">mid</span> <span class="o">=</span> <span class="n">min</span><span class="o">+</span><span class="p">((</span><span class="n">max</span><span class="o">-</span><span class="n">min</span><span class="p">)</span> <span class="o">>></span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// (min+max)/2;
</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">_intsetGet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">mid</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">value</span> <span class="o">></span> <span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">min</span> <span class="o">=</span> <span class="n">mid</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">value</span> <span class="o"><</span> <span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="n">max</span> <span class="o">=</span> <span class="n">mid</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// update position
</span> <span class="k">if</span> <span class="p">(</span><span class="n">value</span> <span class="o">==</span> <span class="n">cur</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="n">mid</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="n">min</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
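<p>The same search idea, stripped of the encoding and endian handling, can be sketched on a plain <em>int64_t</em> array. The function name <em>sorted_search</em> is hypothetical, and it uses a half-open range instead of the inclusive <em>min</em>/<em>max</em> bounds above:</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Binary search over a sorted array without duplicates.
 * Returns 1 and stores the index in *pos if value is found; otherwise
 * returns 0 and stores the index where value should be inserted. */
static int sorted_search(const int64_t *a, uint32_t len, int64_t value, uint32_t *pos) {
    uint32_t lo = 0, hi = len;                 /* search the range [lo, hi) */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;     /* avoids overflow of lo + hi */
        if (a[mid] < value) {
            lo = mid + 1;
        } else if (a[mid] > value) {
            hi = mid;
        } else {
            if (pos) *pos = mid;
            return 1;
        }
    }
    if (pos) *pos = lo;                        /* insert position keeps the order */
    return 0;
}
```

<p>Searching for 4 in {1, 3, 5, 7} returns 0 with position 2, which is exactly where 4 would have to be inserted.</p>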
<h2 id="add">Add</h2>
<p>Function <em>intsetAdd</em> first checks whether the current encoding can store <em>value</em>; if not, it upgrades the encoding automatically and adds the value. Otherwise, it checks whether <em>value</em> is already present, since duplicates are not stored. If the value is new, the function grows the array by one entry, moves the tail elements, and inserts the value. Because elements are moved, the time complexity is <strong>O(n)</strong>. The move uses <strong>memmove</strong>, which is fast thanks to the many optimisations applied to the mem* functions in libc. <strong>memcpy</strong> must not be used here because the source and destination regions overlap.</p>
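<p>Before reading the Redis source, the element shift can be looked at in isolation. The sketch below inserts into a plain <em>int64_t</em> array that already has room for one more element; <em>insert_at</em> is a hypothetical name and the resizing step is left out:</p>

```c
#include <stdint.h>
#include <string.h>

/* Insert value at index pos of an array currently holding len elements.
 * The caller must guarantee capacity for len + 1 elements and pos <= len.
 * Returns the new length. */
static uint32_t insert_at(int64_t *a, uint32_t len, uint32_t pos, int64_t value) {
    if (pos < len) {
        /* Shift the tail one slot to the right. Source and destination
         * overlap, so memmove is required; memcpy would be undefined. */
        memmove(a + pos + 1, a + pos, (size_t)(len - pos) * sizeof a[0]);
    }
    a[pos] = value;
    return len + 1;
}
```

<p>Inserting 5 at position 2 of {1, 3, 7, 9} moves 7 and 9 with one <em>memmove</em> call and yields {1, 3, 5, 7, 9}. Appending at the end moves nothing.</p>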
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">intset</span> <span class="o">*</span><span class="nf">intsetAdd</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">success</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">valenc</span> <span class="o">=</span> <span class="n">_intsetValueEncoding</span><span class="p">(</span><span class="n">value</span><span class="p">);</span> <span class="c1">// get corresponding encoding for value
</span> <span class="kt">uint32_t</span> <span class="n">pos</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">success</span><span class="p">)</span> <span class="o">*</span><span class="n">success</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">valenc</span> <span class="o">></span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">encoding</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// cannot store, upgrade encoding
</span> <span class="k">return</span> <span class="n">intsetUpgradeAndAdd</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// can store value with current encoding
</span> <span class="k">if</span> <span class="p">(</span><span class="n">intsetSearch</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">value</span><span class="p">,</span><span class="o">&</span><span class="n">pos</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// duplicate
</span> <span class="k">if</span> <span class="p">(</span><span class="n">success</span><span class="p">)</span> <span class="o">*</span><span class="n">success</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">is</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">is</span> <span class="o">=</span> <span class="n">intsetResize</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// realloc one more entry
</span>
<span class="c1">// if insert position is not the last pos, need to move elements
</span> <span class="k">if</span> <span class="p">(</span><span class="n">pos</span> <span class="o"><</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">))</span> <span class="n">intsetMoveTail</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">pos</span><span class="p">,</span><span class="n">pos</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">_intsetSet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">pos</span><span class="p">,</span><span class="n">value</span><span class="p">);</span> <span class="c1">// set value
</span> <span class="n">is</span><span class="o">-></span><span class="n">length</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// length++
</span> <span class="k">return</span> <span class="n">is</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">intsetMoveTail</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">from</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">to</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="o">*</span><span class="n">dst</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">-</span><span class="n">from</span><span class="p">;</span> <span class="c1">// number of elements to move, scaled to bytes below
</span> <span class="kt">uint32_t</span> <span class="n">encoding</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">encoding</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">encoding</span> <span class="o">==</span> <span class="n">INTSET_ENC_INT64</span><span class="p">)</span> <span class="p">{</span>
<span class="n">src</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int64_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">from</span><span class="p">;</span>
<span class="n">dst</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int64_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">to</span><span class="p">;</span>
<span class="n">bytes</span> <span class="o">*=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int64_t</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">encoding</span> <span class="o">==</span> <span class="n">INTSET_ENC_INT32</span><span class="p">)</span> <span class="p">{</span>
<span class="n">src</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int32_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">from</span><span class="p">;</span>
<span class="n">dst</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int32_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">to</span><span class="p">;</span>
<span class="n">bytes</span> <span class="o">*=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int32_t</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">src</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int16_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">from</span><span class="p">;</span>
<span class="n">dst</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int16_t</span><span class="o">*</span><span class="p">)</span><span class="n">is</span><span class="o">-></span><span class="n">contents</span><span class="o">+</span><span class="n">to</span><span class="p">;</span>
<span class="n">bytes</span> <span class="o">*=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int16_t</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">memmove</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span><span class="n">src</span><span class="p">,</span><span class="n">bytes</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="delete">Delete</h2>
<p>Removing a value shifts every element after the <em>deleted</em> position one slot to the left, so the time complexity is O(n).</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">intset</span> <span class="o">*</span><span class="nf">intsetRemove</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">success</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">valenc</span> <span class="o">=</span> <span class="n">_intsetValueEncoding</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">pos</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">success</span><span class="p">)</span> <span class="o">*</span><span class="n">success</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">valenc</span> <span class="o"><=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">encoding</span><span class="p">)</span> <span class="o">&&</span> <span class="n">intsetSearch</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">value</span><span class="p">,</span><span class="o">&</span><span class="n">pos</span><span class="p">))</span> <span class="p">{</span>
<span class="kt">uint32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">);</span>
<span class="cm">/* We know we can delete */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">success</span><span class="p">)</span> <span class="o">*</span><span class="n">success</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="cm">/* Overwrite value with tail and update length */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pos</span> <span class="o"><</span> <span class="p">(</span><span class="n">len</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="n">intsetMoveTail</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">pos</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="n">pos</span><span class="p">);</span> <span class="c1">// move tail element
</span> <span class="n">is</span> <span class="o">=</span> <span class="n">intsetResize</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">len</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// shrink the allocation by one entry
</span> <span class="n">is</span><span class="o">-></span><span class="n">length</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">len</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// update length
</span> <span class="p">}</span>
<span class="k">return</span> <span class="n">is</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
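<p>The shift performed on delete can be sketched on a plain sorted array. This is a minimal illustration, not the intset code: the helper name <code class="highlighter-rouge">sorted_remove_at</code> and the fixed <code class="highlighter-rouge">int32_t</code> element type are assumptions made for the example.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: remove the element at index pos from an array of
 * len elements by shifting the tail one slot to the left.
 * Returns the new length; an out-of-range pos leaves the array untouched. */
static uint32_t sorted_remove_at(int32_t *a, uint32_t len, uint32_t pos) {
    if (pos >= len) return len;
    if (pos < len - 1) {
        /* memmove is required because source and destination overlap */
        memmove(&a[pos], &a[pos + 1], (len - 1 - pos) * sizeof(a[0]));
    }
    return len - 1;
}
```

<p>Removing the last element needs no move at all, which matches the <code class="highlighter-rouge">pos &lt; (len-1)</code> check in <em>intsetRemove</em>; the O(n) cost comes from removals near the front.</p>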
<h2 id="upgrade">Upgrade</h2>
<p><em>intsetUpgradeAndAdd</em> is called only when the current encoding cannot hold <em>value</em> (it would overflow). Because <em>value</em> lies outside the range of every existing element, after the upgrade it must be either the <strong>min</strong> (negative) or the <strong>max</strong> (positive) element. The local variable <strong>prepend</strong> selects the insert position: the beginning (min) or the end (max).</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">intset</span> <span class="o">*</span><span class="nf">intsetUpgradeAndAdd</span><span class="p">(</span><span class="n">intset</span> <span class="o">*</span><span class="n">is</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">value</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">curenc</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">encoding</span><span class="p">);</span> <span class="c1">// get current encoding
</span> <span class="kt">uint8_t</span> <span class="n">newenc</span> <span class="o">=</span> <span class="n">_intsetValueEncoding</span><span class="p">(</span><span class="n">value</span><span class="p">);</span> <span class="c1">// new encoding
</span> <span class="kt">int</span> <span class="n">length</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">prepend</span> <span class="o">=</span> <span class="n">value</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// an out-of-range negative value is the new minimum, so prepend it
</span>
<span class="n">is</span><span class="o">-></span><span class="n">encoding</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">newenc</span><span class="p">);</span> <span class="c1">// set new encoding
</span> <span class="n">is</span> <span class="o">=</span> <span class="n">intsetResize</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// allocate one more entry memory
</span>
<span class="cm">/* Upgrade back-to-front so we don't overwrite values.
* Note that the "prepend" variable is used to make sure we have an empty
* space at either the beginning or the end of the intset. */</span>
<span class="c1">// read each element with the old encoding and write it back
</span> <span class="c1">// with the new one; going back to front never overwrites unread data
</span> <span class="k">while</span><span class="p">(</span><span class="n">length</span><span class="o">--</span><span class="p">)</span>
<span class="n">_intsetSet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">length</span><span class="o">+</span><span class="n">prepend</span><span class="p">,</span><span class="n">_intsetGetEncoded</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">length</span><span class="p">,</span><span class="n">curenc</span><span class="p">));</span>
<span class="cm">/* Set the value at the beginning or the end. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prepend</span><span class="p">)</span>
<span class="n">_intsetSet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">value</span><span class="p">);</span> <span class="c1">// minimum element
</span> <span class="k">else</span>
<span class="n">_intsetSet</span><span class="p">(</span><span class="n">is</span><span class="p">,</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">),</span><span class="n">value</span><span class="p">);</span> <span class="c1">// max element
</span>
<span class="n">is</span><span class="o">-></span><span class="n">length</span> <span class="o">=</span> <span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">intrev32ifbe</span><span class="p">(</span><span class="n">is</span><span class="o">-></span><span class="n">length</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// update length
</span> <span class="k">return</span> <span class="n">is</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
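<p>The back-to-front widening can be sketched on a raw buffer. This is a simplified illustration under stated assumptions: it only handles the 16-bit to 32-bit case, ignores endianness, and the name <code class="highlighter-rouge">widen_and_add</code> is hypothetical rather than part of Redis.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: buf holds len sorted int16_t values in malloc'd
 * memory. Widen them to int32_t in place and add value, which is assumed
 * to lie outside the int16_t range. Returns the reallocated buffer holding
 * len + 1 int32_t values, or NULL if realloc fails. */
static int32_t *widen_and_add(void *buf, uint32_t len, int32_t value) {
    int prepend = value < 0 ? 1 : 0;
    unsigned char *p = realloc(buf, ((size_t)len + 1) * sizeof(int32_t));
    if (p == NULL) return NULL;

    /* Back to front: the wide slot written for element i starts at byte
     * 4*(i+prepend), past every narrow element that is still unread. */
    uint32_t i = len;
    while (i--) {
        int16_t v16;
        int32_t v32;
        memcpy(&v16, p + (size_t)i * sizeof(v16), sizeof(v16));
        v32 = v16;
        memcpy(p + ((size_t)i + (size_t)prepend) * sizeof(v32), &v32, sizeof(v32));
    }

    /* The new value goes into the slot left free at the front or the back. */
    size_t slot = prepend ? 0 : (size_t)len;
    memcpy(p + slot * sizeof(value), &value, sizeof(value));
    return (int32_t *)p;
}
```

<p>Copying front to back instead would overwrite the second narrow element while writing the first wide one, which is why the loop counts down.</p>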
Redis Internal Data Structure : Skiplist
2014-08-15T00:00:00+00:00
http://blog.wjin.org/posts/redis-internal-data-structure--skiplist
<h1 id="introduction">Introduction</h1>
<p>A skip list offers <strong>O(log n)</strong> average time complexity for search, insert, and delete, and it is easier to implement than an AVL tree or a red-black tree. That is why Redis uses it to implement the sorted set.</p>
<h1 id="implementation">Implementation</h1>
<h2 id="data-structure">Data Structure</h2>
<p>Related file: <strong>redis.h</strong></p>
<p>Redis adds a <em>backward</em> pointer to each skip list node so the list can be traversed in reverse. Each level entry also carries a <em>span</em> field that records how many nodes are crossed when following that level's forward pointer. While traversing the list, accumulating the spans gives the <strong>rank</strong> of a node in the sorted set.</p>
<p>Here is an example of a skip list, drawn without the dummy head node. It has three nodes sorted by score and two lists, level 1 and level 2. From node1 we can reach node3 directly through the level 2 list, whose span is 2, or step through node2 using the level 1 list.</p>
<p><img src="/assets/img/post/redis_skiplist.png" alt="img" /></p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">zskiplistNode</span> <span class="p">{</span>
<span class="n">robj</span> <span class="o">*</span><span class="n">obj</span><span class="p">;</span> <span class="c1">// redis generic object
</span> <span class="kt">double</span> <span class="n">score</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">backward</span><span class="p">;</span> <span class="c1">// backward pointer, exists only in the level-0 list
</span> <span class="k">struct</span> <span class="n">zskiplistLevel</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">forward</span><span class="p">;</span> <span class="c1">// next node at this level, may skip many nodes
</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">span</span><span class="p">;</span> <span class="c1">// number of nodes crossed to reach the next node
</span> <span class="p">}</span> <span class="n">level</span><span class="p">[];</span> <span class="c1">// level array to make up lists
</span><span class="p">}</span> <span class="n">zskiplistNode</span><span class="p">;</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">zskiplist</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">header</span><span class="p">,</span> <span class="o">*</span><span class="n">tail</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">length</span><span class="p">;</span> <span class="c1">// number of nodes
</span> <span class="kt">int</span> <span class="n">level</span><span class="p">;</span> <span class="c1">// current level
</span><span class="p">}</span> <span class="n">zskiplist</span><span class="p">;</span>
</code></pre></div></div>
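<p>To make the span accumulation concrete, here is a minimal sketch of a rank lookup. It is not the Redis implementation: the <code class="highlighter-rouge">demoNode</code> type, the fixed two levels, and <code class="highlighter-rouge">demo_rank</code> are hypothetical names, and scores are assumed to be unique so that no object comparison is needed.</p>

```c
#include <stddef.h>

#define DEMO_LEVELS 2  /* assumption: a fixed, tiny number of levels */

typedef struct demoNode {
    double score;
    struct demoLevel {
        struct demoNode *forward; /* next node at this level */
        unsigned int span;        /* nodes crossed by following forward */
    } level[DEMO_LEVELS];
} demoNode;

/* Return the 1-based rank of the node with the given score, or 0 if no
 * such node exists. header is the dummy head and is not counted. */
static unsigned long demo_rank(demoNode *header, int levels, double score) {
    unsigned long rank = 0;
    demoNode *x = header;
    for (int i = levels - 1; i >= 0; i--) {
        /* advance along level i while the next node is not past the target */
        while (x->level[i].forward && x->level[i].forward->score <= score) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
        if (x != header && x->score == score) return rank;
    }
    return 0;
}
```

<p>With the three-node example above, looking up node3 follows the level 2 list and adds spans 1 and 2, giving rank 3 without ever visiting node2.</p>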
<h2 id="core-api">Core API</h2>
<p>Related file: <strong>t_zset.c</strong></p>
<h3 id="init">Init</h3>
<p>Allocate memory for the skip list and create a <strong>dummy head</strong> node. The dummy head has the maximum number of levels (<code class="highlighter-rouge">ZSKIPLIST_MAXLEVEL</code>), and the forward pointer of every level is initialised to null, so the skip list starts out empty. In effect, it holds <code class="highlighter-rouge">ZSKIPLIST_MAXLEVEL</code> empty singly linked lists.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zskiplist</span> <span class="o">*</span><span class="nf">zslCreate</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">j</span><span class="p">;</span>
<span class="n">zskiplist</span> <span class="o">*</span><span class="n">zsl</span><span class="p">;</span>
<span class="n">zsl</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">zsl</span><span class="p">));</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">level</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// only one level
</span> <span class="n">zsl</span><span class="o">-></span><span class="n">length</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// no node at present
</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">header</span> <span class="o">=</span> <span class="n">zslCreateNode</span><span class="p">(</span><span class="n">ZSKIPLIST_MAXLEVEL</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="c1">// initialise each level to empty list
</span> <span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">ZSKIPLIST_MAXLEVEL</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">forward</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">span</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="o">-></span><span class="n">backward</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">return</span> <span class="n">zsl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">zskiplistNode</span> <span class="o">*</span><span class="nf">zslCreateNode</span><span class="p">(</span><span class="kt">int</span> <span class="n">level</span><span class="p">,</span> <span class="kt">double</span> <span class="n">score</span><span class="p">,</span> <span class="n">robj</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">zn</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">zn</span><span class="p">)</span><span class="o">+</span><span class="n">level</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">zskiplistLevel</span><span class="p">));</span>
<span class="n">zn</span><span class="o">-></span><span class="n">score</span> <span class="o">=</span> <span class="n">score</span><span class="p">;</span>
<span class="n">zn</span><span class="o">-></span><span class="n">obj</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>
<span class="k">return</span> <span class="n">zn</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
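<p>The node allocation relies on a C99 flexible array member: the level array has no fixed size in the struct, so each node is allocated with exactly as many levels as it needs. A minimal sketch of the idiom, using hypothetical names (<code class="highlighter-rouge">sketchNode</code>, <code class="highlighter-rouge">sketch_node_new</code>) and plain <code class="highlighter-rouge">malloc</code> instead of <code class="highlighter-rouge">zmalloc</code>:</p>

```c
#include <stdlib.h>

typedef struct sketchNode {
    double score;
    int nlevels;                  /* hypothetical field, kept for the demo */
    struct sketchNode *forward[]; /* flexible array member, sized at alloc */
} sketchNode;

/* Allocate a node with nlevels forward pointers, all set to NULL.
 * Returns NULL if the allocation fails. */
static sketchNode *sketch_node_new(int nlevels, double score) {
    sketchNode *n = malloc(sizeof(*n) + (size_t)nlevels * sizeof(n->forward[0]));
    if (n == NULL) return NULL;
    n->score = score;
    n->nlevels = nlevels;
    for (int i = 0; i < nlevels; i++) n->forward[i] = NULL;
    return n;
}
```

<p>A dummy head would be created with the maximum level count, while ordinary nodes only pay for the levels they actually take part in.</p>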
<h3 id="destroy">Destroy</h3>
<p>Redis has a generic object type, robj. To save memory, one robj can be shared by several structures through reference counting. So destroying a node only decrements the reference count of its robj; the object itself is released once no references remain.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">zslFree</span><span class="p">(</span><span class="n">zskiplist</span> <span class="o">*</span><span class="n">zsl</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="p">,</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="p">);</span> <span class="c1">// free the dummy head (its first successor is already saved in node)
</span>
<span class="c1">// free all nodes one by one
</span> <span class="c1">// just use level[0] list because all nodes must be in level[0] list
</span> <span class="k">while</span><span class="p">(</span><span class="n">node</span><span class="p">)</span> <span class="p">{</span>
<span class="n">next</span> <span class="o">=</span> <span class="n">node</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="n">zslFreeNode</span><span class="p">(</span><span class="n">node</span><span class="p">);</span>
<span class="n">node</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">zsl</span><span class="p">);</span> <span class="cm">/* free skiplist itself */</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">zslFreeNode</span><span class="p">(</span><span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">node</span><span class="p">)</span> <span class="p">{</span>
<span class="n">decrRefCount</span><span class="p">(</span><span class="n">node</span><span class="o">-></span><span class="n">obj</span><span class="p">);</span> <span class="c1">// decrease reference count
</span> <span class="n">zfree</span><span class="p">(</span><span class="n">node</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
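<p>The reference counting idea can be sketched in a few lines. This is a generic illustration, not the robj implementation: <code class="highlighter-rouge">sharedObj</code>, its helpers, and the <code class="highlighter-rouge">live_objects</code> counter are hypothetical and exist only to make the lifetime visible.</p>

```c
#include <stdlib.h>

static int live_objects = 0; /* demo-only counter of allocated objects */

typedef struct sharedObj {
    int refcount;
    int payload;
} sharedObj;

/* Create an object owned by one holder. Returns NULL on allocation failure. */
static sharedObj *shared_new(int payload) {
    sharedObj *o = malloc(sizeof(*o));
    if (o == NULL) return NULL;
    o->refcount = 1;
    o->payload = payload;
    live_objects++;
    return o;
}

/* A second structure starts sharing the object. */
static void shared_incr(sharedObj *o) {
    o->refcount++;
}

/* A holder lets go; the memory is released with the last reference. */
static void shared_decr(sharedObj *o) {
    if (--o->refcount == 0) {
        live_objects--;
        free(o);
    }
}
```

<p>Freeing a node therefore never needs to know who else holds the object: it drops its own reference and the last holder triggers the actual release.</p>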
<h3 id="insert">Insert</h3>
<p><strong>zslInsert</strong> is the most important skip list function. For each level, the <strong>update</strong> array stores the node after which the new node will be linked, and the <strong>rank</strong> array stores the rank of that predecessor, i.e. how many nodes were crossed to reach it.</p>
<p>Steps:</p>
<ol>
<li>walk down from the top level to fill the <em>update</em> and <em>rank</em> arrays</li>
<li>create a new node with a random level</li>
<li>link the new node in at each level according to <em>update</em> and <em>rank</em></li>
<li>update the remaining bookkeeping, such as span, backward pointer, and length</li>
</ol>
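<p>Step 2 depends on how the random level is drawn. The usual approach for skip lists is a geometric distribution: start at level 1 and keep promoting with a fixed probability. The sketch below shows that idea only; the constants <code class="highlighter-rouge">SKETCH_MAXLEVEL</code> and <code class="highlighter-rouge">SKETCH_P</code>, the use of <code class="highlighter-rouge">rand()</code>, and the function name are assumptions, not the values or code used by <em>zslRandomLevel</em>.</p>

```c
#include <stdlib.h>

#define SKETCH_MAXLEVEL 32 /* assumed cap on the number of levels */
#define SKETCH_P 0.25      /* assumed promotion probability */

/* Draw a level in [1, SKETCH_MAXLEVEL]. Each extra level is reached with
 * probability SKETCH_P, so high levels become exponentially rarer. */
static int sketch_random_level(void) {
    int level = 1;
    while (level < SKETCH_MAXLEVEL &&
           (double)rand() / ((double)RAND_MAX + 1.0) < SKETCH_P) {
        level++;
    }
    return level;
}
```

<p>Because most nodes end up with a single level and only a few reach higher ones, the upper lists stay sparse, which is what gives the search its O(log n) average cost.</p>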
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zskiplistNode</span> <span class="o">*</span><span class="nf">zslInsert</span><span class="p">(</span><span class="n">zskiplist</span> <span class="o">*</span><span class="n">zsl</span><span class="p">,</span> <span class="kt">double</span> <span class="n">score</span><span class="p">,</span> <span class="n">robj</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">update</span><span class="p">[</span><span class="n">ZSKIPLIST_MAXLEVEL</span><span class="p">],</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">rank</span><span class="p">[</span><span class="n">ZSKIPLIST_MAXLEVEL</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">,</span> <span class="n">level</span><span class="p">;</span>
<span class="c1">// get update and rank array
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* store rank that is crossed to reach the insert position */</span>
<span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span> <span class="o">==</span> <span class="p">(</span><span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">score</span> <span class="o"><</span> <span class="n">score</span> <span class="o">||</span>
<span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">score</span> <span class="o">==</span> <span class="n">score</span> <span class="o">&&</span>
<span class="n">compareStringObjects</span><span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">obj</span><span class="p">,</span><span class="n">obj</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// create new node
</span> <span class="n">level</span> <span class="o">=</span> <span class="n">zslRandomLevel</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">level</span> <span class="o">></span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="p">;</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">length</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">level</span> <span class="o">=</span> <span class="n">level</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">zslCreateNode</span><span class="p">(</span><span class="n">level</span><span class="p">,</span><span class="n">score</span><span class="p">,</span><span class="n">obj</span><span class="p">);</span>
<span class="c1">// insert
</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">=</span> <span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="cm">/* update span covered by update[i] as x is inserted here */</span>
<span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">=</span> <span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">-</span> <span class="p">(</span><span class="n">rank</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">=</span> <span class="p">(</span><span class="n">rank</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">rank</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* increment span for untouched levels */</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="n">level</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// update info
</span> <span class="n">x</span><span class="o">-></span><span class="n">backward</span> <span class="o">=</span> <span class="p">(</span><span class="n">update</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="p">)</span> <span class="o">?</span> <span class="nb">NULL</span> <span class="o">:</span> <span class="n">update</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="p">)</span>
<span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">backward</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="k">else</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">tail</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">length</span><span class="o">++</span><span class="p">;</span>
<span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="delete">Delete</h3>
<p>As with insert, we first walk the list to fill the <strong>update</strong> array, which records the predecessor of the target node at every level.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">zslDelete</span><span class="p">(</span><span class="n">zskiplist</span> <span class="o">*</span><span class="n">zsl</span><span class="p">,</span> <span class="kt">double</span> <span class="n">score</span><span class="p">,</span> <span class="n">robj</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">update</span><span class="p">[</span><span class="n">ZSKIPLIST_MAXLEVEL</span><span class="p">],</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="c1">// get update info
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">&&</span>
<span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">score</span> <span class="o"><</span> <span class="n">score</span> <span class="o">||</span>
<span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">score</span> <span class="o">==</span> <span class="n">score</span> <span class="o">&&</span>
<span class="n">compareStringObjects</span><span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">obj</span><span class="p">,</span><span class="n">obj</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// find it, delete
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&&</span> <span class="n">score</span> <span class="o">==</span> <span class="n">x</span><span class="o">-></span><span class="n">score</span> <span class="o">&&</span> <span class="n">equalStringObjects</span><span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">obj</span><span class="p">,</span><span class="n">obj</span><span class="p">))</span> <span class="p">{</span>
<span class="n">zslDeleteNode</span><span class="p">(</span><span class="n">zsl</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">update</span><span class="p">);</span>
<span class="n">zslFreeNode</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* not found */</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">zslDeleteNode</span><span class="p">(</span><span class="n">zskiplist</span> <span class="o">*</span><span class="n">zsl</span><span class="p">,</span> <span class="n">zskiplistNode</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="n">zskiplistNode</span> <span class="o">**</span><span class="n">update</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">==</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// previous pointer pointed to this node
</span> <span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">+=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">forward</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// previous pointer jumps past this node (or is NULL at this level)
</span> <span class="c1">// just decrease span
</span> <span class="n">update</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">span</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// update backward
</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">forward</span><span class="o">-></span><span class="n">backward</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">backward</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">tail</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">backward</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// top level lists may be null
</span> <span class="k">while</span><span class="p">(</span><span class="n">zsl</span><span class="o">-></span><span class="n">level</span> <span class="o">></span> <span class="mi">1</span> <span class="o">&&</span> <span class="n">zsl</span><span class="o">-></span><span class="n">header</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">forward</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">level</span><span class="o">--</span><span class="p">;</span>
<span class="n">zsl</span><span class="o">-></span><span class="n">length</span><span class="o">--</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="other-apis">Other APIs</h3>
<p>There are many other useful APIs, but they are easy to follow once you understand the <strong>update</strong> and <strong>rank</strong> arrays used in the <em>Insert</em> operation.</p>
<p>Remember that a skip list is a stack of up to ZSKIPLIST_MAXLEVEL singly linked lists. <strong>update[i]</strong> records the predecessor of the target position in the list at level i, and <strong>rank[i]</strong> is the position of update[i] in the sorted order, counted from the header (nodes are sorted by score, then by object).</p>
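<p>The relationship between <strong>update</strong>, <strong>rank</strong> and <strong>span</strong> can be sketched with a stripped-down node type. This is a minimal illustration, not the Redis code: the names demoNode, demoBuild and demoFindUpdate are hypothetical, nodes are ordered by score only (no tie-breaking on the object), and the list is wired by hand instead of being built by random level selection.</p>

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXLEVEL 4   /* hypothetical small cap, not ZSKIPLIST_MAXLEVEL */

typedef struct demoNode {
    double score;
    struct demoLevel {
        struct demoNode *forward;
        unsigned long span;   /* number of level-0 steps this pointer covers */
    } level[DEMO_MAXLEVEL];
} demoNode;

/* Wire nodes[0] (header) .. nodes[4] into a two-level list:
 *   level 1: header ------> 20 ------> 40
 *   level 0: header -> 10 -> 20 -> 30 -> 40
 */
static void demoBuild(demoNode nodes[5]) {
    static const double scores[5] = {0, 10, 20, 30, 40};
    for (int n = 0; n < 5; n++) {
        nodes[n].score = scores[n];
        for (int i = 0; i < DEMO_MAXLEVEL; i++) {
            nodes[n].level[i].forward = NULL;
            nodes[n].level[i].span = 0;
        }
    }
    for (int n = 0; n < 4; n++) {        /* level 0 links every node */
        nodes[n].level[0].forward = &nodes[n + 1];
        nodes[n].level[0].span = 1;
    }
    for (int n = 0; n < 4; n += 2) {     /* level 1 links every second node */
        nodes[n].level[1].forward = &nodes[n + 2];
        nodes[n].level[1].span = 2;
    }
}

/* Walk from the header, top level first, and record for every level:
 *   update[i]: last node at level i whose score is smaller than `score`
 *   rank[i]:   position of update[i], where the header has position 0
 */
static void demoFindUpdate(demoNode *header, int levels, double score,
                           demoNode **update, unsigned long *rank) {
    demoNode *x = header;
    assert(levels >= 1 && levels <= DEMO_MAXLEVEL);
    for (int i = levels - 1; i >= 0; i--) {
        /* start from the rank already accumulated on the level above */
        rank[i] = (i == levels - 1) ? 0 : rank[i + 1];
        while (x->level[i].forward && x->level[i].forward->score < score) {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
}
```

<p>Searching this list for score 35 stops at node 20 on level 1 and at node 30 on level 0, so a node inserted with that score would end up at position rank[0] + 1.</p>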
<h1 id="reference">Reference</h1>
<ul>
<li><a href="http://epaperpress.com/sortsearch/download/skiplist.pdf">http://epaperpress.com/sortsearch/download/skiplist.pdf</a></li>
<li><a href="http://en.wikipedia.org/wiki/Skip_list">http://en.wikipedia.org/wiki/Skip_list</a></li>
</ul>
Redis Internal Data Structure : Doubly Linked List
2014-08-14T00:00:00+00:00
http://blog.wjin.org/posts/redis-internal-data-structure--doubly-linked-list
<h1 id="introduction">Introduction</h1>
<p>Almost every programmer uses lists in their code, and the Redis author is no exception. Redis ships a simple implementation of this widely used data structure.</p>
<p>It looks like:</p>
<p><img src="/assets/img/post/redis_list.png" alt="img" /></p>
<h1 id="implementation">Implementation</h1>
<p>In Redis it is called adlist; I guess ‘adlist’ stands for ‘a doubly linked list’. The related files are <strong>src/adlist.h</strong> and <strong>src/adlist.c</strong>.</p>
<h2 id="data-structure">Data Structure</h2>
<p>Here is the list node structure. Note that the value type is void*, so a node can hold any kind of data, but the list itself knows nothing about what it stores.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// list node
</span><span class="k">typedef</span> <span class="k">struct</span> <span class="n">listNode</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">listNode</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">listNode</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span> <span class="c1">// list node data
</span><span class="p">}</span> <span class="n">listNode</span><span class="p">;</span>
</code></pre></div></div>
<p>And here is the list itself. The member <strong>len</strong> records the number of nodes, so the list length is available in O(1) time. Three function pointers let the caller define how a node's <strong>value</strong> is copied, freed and compared.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// list itself
</span><span class="k">typedef</span> <span class="k">struct</span> <span class="n">list</span> <span class="p">{</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">head</span><span class="p">;</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">tail</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">dup</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">);</span> <span class="c1">// copies a node's value
</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">);</span> <span class="c1">// releases a node's value
</span> <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">);</span> <span class="c1">// compares a node's value with a key
</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">;</span> <span class="c1">// list length
</span><span class="p">}</span> <span class="n">list</span><span class="p">;</span>
</code></pre></div></div>
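<p>To illustrate why the hooks exist, here is a minimal sketch of a key search that uses <strong>match</strong> when it is set and falls back to comparing pointers otherwise. The miniNode, miniList, miniAddTail, miniSearch and miniMatchString names are hypothetical stand-ins written for this example; they only mirror the shape of the adlist structures and keep the nodes on the caller's stack to avoid allocation.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct miniNode {
    struct miniNode *prev, *next;
    void *value;
} miniNode;

typedef struct miniList {
    miniNode *head, *tail;
    int (*match)(void *ptr, void *key);   /* optional comparison hook */
    unsigned long len;
} miniList;

/* Append a caller-owned node to the tail of the list. */
static void miniAddTail(miniList *l, miniNode *n, void *value) {
    assert(l != NULL && n != NULL);
    n->value = value;
    n->next = NULL;
    n->prev = l->tail;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    l->len++;
}

/* Return the first node whose value matches key, or NULL.
 * Without a match hook the raw pointers are compared. */
static miniNode *miniSearch(miniList *l, void *key) {
    for (miniNode *n = l->head; n != NULL; n = n->next) {
        if (l->match ? l->match(n->value, key) : n->value == key) return n;
    }
    return NULL;
}

/* Example hook: treat the stored values as C strings. */
static int miniMatchString(void *ptr, void *key) {
    return strcmp((const char *)ptr, (const char *)key) == 0;
}
```

<p>With no hook installed, a search only succeeds for the very same pointer that was stored; after setting match to miniMatchString, any string with equal contents is found.</p>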
<p>For convenience, adlist provides an iterator for traversing the nodes of a list.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// list iterator
</span><span class="k">typedef</span> <span class="k">struct</span> <span class="n">listIter</span> <span class="p">{</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">direction</span><span class="p">;</span> <span class="c1">// AL_START_HEAD or AL_START_TAIL
</span><span class="p">}</span> <span class="n">listIter</span><span class="p">;</span>
</code></pre></div></div>
<p>We can use the functions <strong>listGetIterator</strong> and <strong>listNext</strong> to traverse a list, like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">iter</span> <span class="o">=</span> <span class="n">listGetIterator</span><span class="p">(</span><span class="n">list</span><span class="p">,</span><span class="n">direction</span><span class="p">);</span>
<span class="k">while</span> <span class="p">((</span><span class="n">node</span> <span class="o">=</span> <span class="n">listNext</span><span class="p">(</span><span class="n">iter</span><span class="p">))</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="n">doSomething</span><span class="p">(</span><span class="n">listNodeValue</span><span class="p">(</span><span class="n">node</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
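<p>The traversal pattern above can be tried out with a small self-contained stand-in. This is a sketch under assumptions: the mini* names and MINI_START_* constants are hypothetical, and the iterator is returned by value here, whereas adlist hands out a heap-allocated iterator that is later released with listReleaseIterator.</p>

```c
#include <assert.h>
#include <stddef.h>

#define MINI_START_HEAD 0   /* iterate head -> tail */
#define MINI_START_TAIL 1   /* iterate tail -> head */

typedef struct miniNode {
    struct miniNode *prev, *next;
    void *value;
} miniNode;

typedef struct miniList {
    miniNode *head, *tail;
    unsigned long len;
} miniList;

typedef struct miniIter {
    miniNode *next;      /* node that the next call will return */
    int direction;
} miniIter;

/* Append a caller-owned node to the tail of the list. */
static void miniAddTail(miniList *l, miniNode *n, void *value) {
    assert(l != NULL && n != NULL);
    n->value = value;
    n->next = NULL;
    n->prev = l->tail;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    l->len++;
}

static miniIter miniGetIterator(miniList *l, int direction) {
    miniIter it;
    it.next = (direction == MINI_START_HEAD) ? l->head : l->tail;
    it.direction = direction;
    return it;
}

/* Return the current node and advance; NULL marks the end. */
static miniNode *miniNext(miniIter *it) {
    miniNode *current = it->next;
    if (current != NULL)
        it->next = (it->direction == MINI_START_HEAD) ? current->next
                                                      : current->prev;
    return current;
}

/* Copy up to cap int values into out, in iteration order. */
static unsigned long miniCollect(miniList *l, int direction,
                                 int *out, unsigned long cap) {
    miniIter it = miniGetIterator(l, direction);
    miniNode *node;
    unsigned long n = 0;
    while (n < cap && (node = miniNext(&it)) != NULL)
        out[n++] = *(int *)node->value;
    return n;
}
```

<p>Because the iterator remembers the node to return next rather than the current one, the caller may safely unlink the node it has just received before asking for the following one.</p>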
<h2 id="initialization">Initialization</h2>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">list</span> <span class="o">*</span><span class="nf">listCreate</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">list</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">list</span><span class="p">)))</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">list</span><span class="o">-></span><span class="n">head</span> <span class="o">=</span> <span class="n">list</span><span class="o">-></span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">list</span><span class="o">-></span><span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">list</span><span class="o">-></span><span class="n">dup</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">list</span><span class="o">-></span><span class="n">free</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">list</span><span class="o">-></span><span class="n">match</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">return</span> <span class="n">list</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="destroy">Destroy</h2>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">listRelease</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">;</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">current</span><span class="p">,</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="n">current</span> <span class="o">=</span> <span class="n">list</span><span class="o">-></span><span class="n">head</span><span class="p">;</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">list</span><span class="o">-></span><span class="n">len</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">len</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">next</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">list</span><span class="o">-></span><span class="n">free</span><span class="p">)</span> <span class="n">list</span><span class="o">-></span><span class="n">free</span><span class="p">(</span><span class="n">current</span><span class="o">-></span><span class="n">value</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">current</span><span class="p">);</span>
<span class="n">current</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">list</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="all-apis">All APIs</h1>
<p>Compared with the Linux kernel list, this simple data structure is easy to use and understand. The full API is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">list</span> <span class="o">*</span><span class="n">listCreate</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listRelease</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">);</span>
<span class="n">list</span> <span class="o">*</span><span class="n">listAddNodeHead</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
<span class="n">list</span> <span class="o">*</span><span class="n">listAddNodeTail</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
<span class="n">list</span> <span class="o">*</span><span class="n">listInsertNode</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="n">listNode</span> <span class="o">*</span><span class="n">old_node</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">,</span> <span class="kt">int</span> <span class="n">after</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listDelNode</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="n">listNode</span> <span class="o">*</span><span class="n">node</span><span class="p">);</span>
<span class="n">listIter</span> <span class="o">*</span><span class="n">listGetIterator</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="kt">int</span> <span class="n">direction</span><span class="p">);</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">listNext</span><span class="p">(</span><span class="n">listIter</span> <span class="o">*</span><span class="n">iter</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listReleaseIterator</span><span class="p">(</span><span class="n">listIter</span> <span class="o">*</span><span class="n">iter</span><span class="p">);</span>
<span class="n">list</span> <span class="o">*</span><span class="n">listDup</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">orig</span><span class="p">);</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">listSearchKey</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">);</span>
<span class="n">listNode</span> <span class="o">*</span><span class="n">listIndex</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="kt">long</span> <span class="n">index</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listRewind</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="n">listIter</span> <span class="o">*</span><span class="n">li</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listRewindTail</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">,</span> <span class="n">listIter</span> <span class="o">*</span><span class="n">li</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">listRotate</span><span class="p">(</span><span class="n">list</span> <span class="o">*</span><span class="n">list</span><span class="p">);</span>
</code></pre></div></div>
Redis Internal Data Structure : Dictionary
2014-08-14T00:00:00+00:00
http://blog.wjin.org/posts/redis-internal-data-structure--dictionary
<h1 id="introduction">Introduction</h1>
<p>In Redis, the dictionary is as widely used as sds and adlist, and it is the most important data structure because Redis is a key-value store.</p>
<p>It is implemented as a hash table. Each dictionary holds two hash tables so that it can perform <strong>incremental rehashing</strong>. A dictionary <strong>auto-resizes</strong> when needed, the internal table size is always a <strong>power of two</strong>, and collisions are resolved by <strong>chaining</strong>.</p>
<p>Here is an overview of the dictionary. ht[1] is empty unless a rehash is in progress.</p>
<p><img src="/assets/img/post/redis_dict.png" alt="img" /></p>
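<p>Keeping the table size a power of two means a bucket index can be computed with a bit mask instead of a modulo. The following is a small sketch of that idea; demoNextPower, demoBucket and DEMO_INITIAL_SIZE are hypothetical names chosen for this example, and the initial size of 4 is an assumption of the sketch.</p>

```c
#include <assert.h>
#include <limits.h>

#define DEMO_INITIAL_SIZE 4UL   /* assumed smallest table size in this sketch */

/* Round a requested size up to the next power of two.
 * Doubling stops before it would overflow, so for absurdly large
 * requests the result may be smaller than the request. */
static unsigned long demoNextPower(unsigned long size) {
    unsigned long i = DEMO_INITIAL_SIZE;
    while (i < size && i <= ULONG_MAX / 2) i *= 2;
    return i;
}

/* Map a hash value to a bucket; tablesize must be a power of two,
 * so (tablesize - 1) is a mask of the low bits. */
static unsigned long demoBucket(unsigned long hash, unsigned long tablesize) {
    assert(tablesize != 0 && (tablesize & (tablesize - 1)) == 0);
    return hash & (tablesize - 1);
}
```

<p>For a table of size 8 the mask is 7, so a hash of 13 lands in bucket 5, the same result as 13 % 8 but without a division.</p>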
<h1 id="implementation">Implementation</h1>
<p>Related files: <strong>dict.h</strong> and <strong>dict.c</strong></p>
<h2 id="data-structure">Data Structure</h2>
<p>To keep the dictionary flexible and generic, the <strong>dictType</strong> structure provides six hooks for handling keys and values.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">dictType</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">hashFunction</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">);</span> <span class="c1">// hash function
</span> <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">keyDup</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">);</span> <span class="c1">// copy key
</span> <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">valDup</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">obj</span><span class="p">);</span> <span class="c1">// copy value
</span> <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">keyCompare</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key1</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key2</span><span class="p">);</span> <span class="c1">// compare keys
</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">keyDestructor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">);</span> <span class="c1">// free key
</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">valDestructor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">obj</span><span class="p">);</span> <span class="c1">// free value
</span><span class="p">}</span> <span class="n">dictType</span><span class="p">;</span>
</code></pre></div></div>
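<p>As an illustration of how such hooks might be filled in, here is a sketch of a hook table for keys that are plain C strings. Everything prefixed with demo is hypothetical: demoType only copies the shape of dictType shown above, and the hash is the simple djb2 string hash swapped in for brevity, not the hash function Redis actually uses. The compare hook is written to return nonzero when the keys are equal, which is my understanding of how the dictionary interprets keyCompare.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same six hooks as dictType, redeclared so the example stands alone. */
typedef struct demoType {
    unsigned int (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} demoType;

/* djb2 string hash: h = h * 33 + c, starting from 5381. */
static unsigned int demoStrHash(const void *key) {
    const unsigned char *p = key;
    unsigned int h = 5381;
    while (*p) h = h * 33u + *p++;
    return h;
}

/* Copy the key so the dictionary owns its own string. */
static void *demoStrDup(void *privdata, const void *key) {
    (void)privdata;
    size_t n = strlen(key) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, key, n);
    return copy;
}

/* Nonzero means "the two keys are equal". */
static int demoStrCompare(void *privdata, const void *k1, const void *k2) {
    (void)privdata;
    return strcmp(k1, k2) == 0;
}

static void demoStrFree(void *privdata, void *key) {
    (void)privdata;
    free(key);
}

/* Values are left alone: NULL hooks mean "store the pointer as is". */
static demoType demoStringKeys = {
    demoStrHash, demoStrDup, NULL, demoStrCompare, demoStrFree, NULL
};
```

<p>A dictionary created with this table would duplicate every key on insertion and free it on removal, while leaving value ownership to the caller.</p>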
<p>This is the dictionary structure itself:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">dict</span> <span class="p">{</span>
<span class="n">dictType</span> <span class="o">*</span><span class="n">type</span><span class="p">;</span> <span class="c1">// hooks
</span> <span class="kt">void</span> <span class="o">*</span><span class="n">privdata</span><span class="p">;</span> <span class="c1">// private data used in hooks function
</span> <span class="n">dictht</span> <span class="n">ht</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span> <span class="c1">// two hash tables for incremental rehashing
</span> <span class="kt">int</span> <span class="n">rehashidx</span><span class="p">;</span> <span class="c1">// rehashing tag
</span> <span class="kt">int</span> <span class="n">iterators</span><span class="p">;</span> <span class="c1">// number of iterators
</span><span class="p">}</span> <span class="n">dict</span><span class="p">;</span>
</code></pre></div></div>
<p>Member <strong>rehashidx</strong> is important because it tells whether the dict is currently rehashing. If it is not <em>-1</em>, the dictionary is rehashing, the bucket array of <em>ht[1]</em> has been allocated, and all new pairs <key, val> are added to ht[1].</p>
<p>The two structures below describe the hash table itself. As structure <em>dictEntry</em> shows, entries that collide in the same slot are chained together through the pointer <em>next</em>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">dictEntry</span> <span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">;</span> <span class="c1">// key
</span> <span class="k">union</span> <span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">val</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">u64</span><span class="p">;</span>
<span class="kt">int64_t</span> <span class="n">s64</span><span class="p">;</span>
<span class="p">}</span> <span class="n">v</span><span class="p">;</span> <span class="c1">// value
</span> <span class="k">struct</span> <span class="n">dictEntry</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span> <span class="c1">// next node in the same slot
</span><span class="p">}</span> <span class="n">dictEntry</span><span class="p">;</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">dictht</span> <span class="p">{</span>
<span class="n">dictEntry</span> <span class="o">**</span><span class="n">table</span><span class="p">;</span> <span class="c1">// hash node array (buckets)
</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">size</span><span class="p">;</span> <span class="c1">// number of slots in the bucket array
</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">sizemask</span><span class="p">;</span> <span class="c1">// size - 1, used to compute the slot index
</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">used</span><span class="p">;</span> <span class="c1">// number of entries stored in the table
</span><span class="p">}</span> <span class="n">dictht</span><span class="p">;</span>
</code></pre></div></div>
<p>The member <em>sizemask</em> is used to compute a slot index in the hash table by <strong>dictHashKey(key) & sizemask</strong>. It is always initialized to <strong>size - 1</strong>. Because the table size is a power of two, <strong>size - 1</strong> has all of its low bits set, so the bitwise AND gives the same result as taking the hash modulo <strong>size</strong>. For example, if size is 16, sizemask is 15 and <strong>dictHashKey(key) & sizemask</strong> always falls in the range 0 to 15, which are exactly the valid indexes of array <strong>table</strong>.</p>
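<p>As a quick check of the masking trick, here is a minimal sketch that is not taken from the Redis source. The helper names <code class="highlighter-rouge">slot_index</code> and <code class="highlighter-rouge">check_slot_index</code> are hypothetical, and the sketch assumes the size is a non-zero power of two.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Redis function): map a hash value to a slot.
 * Assumes size is a non-zero power of two, so size - 1 has every low bit
 * set and the AND is equivalent to hash % size. */
unsigned long slot_index(uint64_t hash, unsigned long size)
{
    unsigned long sizemask = size - 1;
    return (unsigned long)(hash & sizemask);
}

/* Compare masking against the modulo operator for a few sample hashes. */
void check_slot_index(void)
{
    const uint64_t hashes[] = {0, 15, 16, 17, 12345, 0xdeadbeefULL};
    for (size_t i = 0; i < sizeof(hashes) / sizeof(hashes[0]); i++)
        assert(slot_index(hashes[i], 16) == hashes[i] % 16);
}
```

<p>The equivalence only holds for power-of-two sizes. With a size such as 10, the mask would be 9 (binary 1001) and some slots could never be selected.</p>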
<h2 id="core-api">Core API</h2>
<p>In the following sections I analyse some core APIs: init, destroy, add, delete and replace. The code shown does not exactly match the original source because I have removed some trivial parts.</p>
<h3 id="init">Init</h3>
<p>Allocate memory for the dictionary structure and initialize it with default values. Note that both hash tables ht[0] and ht[1] are still empty at this point: their bucket arrays are NULL and their sizes are 0.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dict</span> <span class="o">*</span><span class="nf">dictCreate</span><span class="p">(</span><span class="n">dictType</span> <span class="o">*</span><span class="n">type</span><span class="p">,</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">privDataPtr</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">dict</span> <span class="o">*</span><span class="n">d</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">d</span><span class="p">));</span>
<span class="n">_dictInit</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">type</span><span class="p">,</span><span class="n">privDataPtr</span><span class="p">);</span>
<span class="k">return</span> <span class="n">d</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">_dictInit</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="n">dictType</span> <span class="o">*</span><span class="n">type</span><span class="p">,</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">privDataPtr</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_dictReset</span><span class="p">(</span><span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="n">_dictReset</span><span class="p">(</span><span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
<span class="n">d</span><span class="o">-></span><span class="n">type</span> <span class="o">=</span> <span class="n">type</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">privdata</span> <span class="o">=</span> <span class="n">privDataPtr</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">iterators</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">DICT_OK</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">_dictReset</span><span class="p">(</span><span class="n">dictht</span> <span class="o">*</span><span class="n">ht</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">ht</span><span class="o">-></span><span class="n">table</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">ht</span><span class="o">-></span><span class="n">size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">ht</span><span class="o">-></span><span class="n">sizemask</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">ht</span><span class="o">-></span><span class="n">used</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="destroy">Destroy</h3>
<p>Clear hash tables ht[0] and ht[1] and then release dict memory.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">dictRelease</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_dictClear</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">_dictClear</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">_dictClear</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="n">dictht</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">void</span><span class="p">(</span><span class="n">callback</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">))</span> <span class="p">{</span>
<span class="c1">// traverse all slots
</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">ht</span><span class="o">-></span><span class="n">size</span> <span class="o">&&</span> <span class="n">ht</span><span class="o">-></span><span class="n">used</span> <span class="o">></span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">callback</span> <span class="o">&&</span> <span class="p">(</span><span class="n">i</span> <span class="o">&</span> <span class="mi">65535</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">callback</span><span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">privdata</span><span class="p">);</span>
<span class="k">if</span> <span class="p">((</span><span class="n">he</span> <span class="o">=</span> <span class="n">ht</span><span class="o">-></span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span> <span class="c1">// null slot, continue
</span> <span class="k">while</span><span class="p">(</span><span class="n">he</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// iterate current slot
</span> <span class="n">nextHe</span> <span class="o">=</span> <span class="n">he</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="n">dictFreeKey</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">he</span><span class="p">);</span>
<span class="n">dictFreeVal</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">he</span><span class="p">);</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">he</span><span class="p">);</span>
<span class="n">ht</span><span class="o">-></span><span class="n">used</span><span class="o">--</span><span class="p">;</span>
<span class="n">he</span> <span class="o">=</span> <span class="n">nextHe</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">ht</span><span class="o">-></span><span class="n">table</span><span class="p">);</span> <span class="c1">// free bucket array
</span> <span class="n">_dictReset</span><span class="p">(</span><span class="n">ht</span><span class="p">);</span> <span class="c1">// reset
</span> <span class="k">return</span> <span class="n">DICT_OK</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="add">Add</h3>
<p>Add a pair <key, val> to the dictionary: it first calls function <em>dictAddRaw</em> to get an entry pointer and then calls <em>dictSetVal</em> to set the value. <em>dictAddRaw</em> returns NULL when the key already exists.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dictAdd</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">val</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">dictEntry</span> <span class="o">*</span><span class="n">entry</span> <span class="o">=</span> <span class="n">dictAddRaw</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">key</span><span class="p">);</span> <span class="c1">// get entry
</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">entry</span><span class="p">)</span> <span class="k">return</span> <span class="n">DICT_ERR</span><span class="p">;</span> <span class="c1">// key already exists
</span> <span class="n">dictSetVal</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">entry</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span> <span class="c1">// set value
</span> <span class="k">return</span> <span class="n">DICT_OK</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">dictEntry</span> <span class="o">*</span><span class="nf">dictAddRaw</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// get the index
</span> <span class="k">if</span> <span class="p">((</span><span class="n">index</span> <span class="o">=</span> <span class="n">_dictKeyIndex</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="c1">// if rehashing, add to the second hash table
</span> <span class="n">ht</span> <span class="o">=</span> <span class="n">dictIsRehashing</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="o">?</span> <span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">:</span> <span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="c1">// allocate entry memory
</span> <span class="n">entry</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">));</span>
<span class="c1">// add entry to slot index, similar to insert node to list head
</span> <span class="n">entry</span><span class="o">-></span><span class="n">next</span> <span class="o">=</span> <span class="n">ht</span><span class="o">-></span><span class="n">table</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="n">ht</span><span class="o">-></span><span class="n">table</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">entry</span><span class="p">;</span>
<span class="n">ht</span><span class="o">-></span><span class="n">used</span><span class="o">++</span><span class="p">;</span>
<span class="c1">// set the hash entry field
</span> <span class="n">dictSetKey</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">entry</span><span class="p">,</span> <span class="n">key</span><span class="p">);</span>
<span class="k">return</span> <span class="n">entry</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// be careful about the hook function
// all other hooks are used in the same way
</span><span class="cp">#define dictSetVal(d, entry, _val_) do { \
if ((d)->type->valDup) \
entry->v.val = (d)->type->valDup((d)->privdata, _val_); \
else \
entry->v.val = (_val_); \
} while(0)
</span></code></pre></div></div>
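<p>To make the head insertion and the duplicate check concrete, here is a minimal standalone sketch. It is not the Redis implementation: the names <code class="highlighter-rouge">toyEntry</code>, <code class="highlighter-rouge">toyTable</code>, <code class="highlighter-rouge">toy_hash</code> and <code class="highlighter-rouge">toy_add</code> are hypothetical, the table has a fixed size, there are no hooks, and the hash function is a simple djb2-style hash chosen only for illustration.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SLOTS 4 /* power of two, so TOY_SLOTS - 1 works as a mask */

typedef struct toyEntry {
    const char *key;        /* not copied: the caller keeps the string alive */
    int val;
    struct toyEntry *next;  /* next entry in the same slot */
} toyEntry;

typedef struct toyTable {
    toyEntry *table[TOY_SLOTS];
    unsigned long used;     /* number of entries stored */
} toyTable;

/* djb2-style string hash, an assumption made for this sketch only */
unsigned long toy_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns 0 on success, -1 if the key already exists or allocation fails. */
int toy_add(toyTable *t, const char *key, int val)
{
    unsigned long idx = toy_hash(key) & (TOY_SLOTS - 1);

    /* reject duplicates by walking the chain of this slot */
    for (toyEntry *e = t->table[idx]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return -1;

    toyEntry *entry = malloc(sizeof(*entry));
    if (!entry) return -1;
    entry->key = key;
    entry->val = val;

    /* insert at the head of the chain */
    entry->next = t->table[idx];
    t->table[idx] = entry;
    t->used++;
    return 0;
}
```

<p>Inserting at the head keeps <code class="highlighter-rouge">toy_add</code> at constant cost once the duplicate check is done, and it places the most recently added entry first in its chain.</p>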
<h3 id="delete">Delete</h3>
<p>There are two versions of the delete operation. The difference is whether the memory of the key and the value is freed when the entry is removed.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dictDelete</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">dictGenericDelete</span><span class="p">(</span><span class="n">ht</span><span class="p">,</span><span class="n">key</span><span class="p">,</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// nofree = 1: remove the entry without freeing its key and value
</span><span class="kt">int</span> <span class="nf">dictDeleteNoFree</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">dictGenericDelete</span><span class="p">(</span><span class="n">ht</span><span class="p">,</span><span class="n">key</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">dictGenericDelete</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nofree</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">dictHashKey</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">key</span><span class="p">);</span> <span class="c1">// get hash result
</span> <span class="k">for</span> <span class="p">(</span><span class="n">table</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">table</span> <span class="o"><=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">table</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">h</span> <span class="o">&</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="n">table</span><span class="p">].</span><span class="n">sizemask</span><span class="p">;</span> <span class="c1">// get slot
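</span> <span class="n">he</span> <span class="o">=</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="n">table</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span> <span class="c1">// get first entry in slot idx
</span> <span class="n">prevHe</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// previous entry in the chain
</span>
<span class="c1">// iterate current slot
</span> <span class="k">while</span><span class="p">(</span><span class="n">he</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">dictCompareKeys</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">he</span><span class="o">-></span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
<span class="c1">// unlink the entry from the chain before freeing it
</span> <span class="k">if</span> <span class="p">(</span><span class="n">prevHe</span><span class="p">)</span>
<span class="n">prevHe</span><span class="o">-></span><span class="n">next</span> <span class="o">=</span> <span class="n">he</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="k">else</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="n">table</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">he</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nofree</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dictFreeKey</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">he</span><span class="p">);</span>
<span class="n">dictFreeVal</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">he</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">he</span><span class="p">);</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="n">table</span><span class="p">].</span><span class="n">used</span><span class="o">--</span><span class="p">;</span>
<span class="k">return</span> <span class="n">DICT_OK</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">prevHe</span> <span class="o">=</span> <span class="n">he</span><span class="p">;</span>
<span class="n">he</span> <span class="o">=</span> <span class="n">he</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>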
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">dictIsRehashing</span><span class="p">(</span><span class="n">d</span><span class="p">))</span> <span class="k">break</span><span class="p">;</span> <span class="c1">// no need to traverse table 1 if no rehashing
</span> <span class="p">}</span>
<span class="k">return</span> <span class="n">DICT_ERR</span><span class="p">;</span> <span class="c1">// not found
</span><span class="p">}</span>
</code></pre></div></div>
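<p>Deleting from a chained slot means unlinking the entry from a singly linked list before freeing it, which requires remembering the previous node. Here is a minimal standalone sketch of that step. The names <code class="highlighter-rouge">chainNode</code>, <code class="highlighter-rouge">chain_push</code> and <code class="highlighter-rouge">chain_remove</code> are hypothetical and the keys are plain integers to keep it short.</p>

```c
#include <assert.h>
#include <stdlib.h>

typedef struct chainNode {
    int key;
    struct chainNode *next;
} chainNode;

/* Push a key at the head of the chain. Returns the new head,
 * or the unchanged head if allocation fails. */
chainNode *chain_push(chainNode *head, int key)
{
    chainNode *n = malloc(sizeof(*n));
    if (!n) return head;
    n->key = key;
    n->next = head;
    return n;
}

/* Remove the first node holding key. Returns 1 if a node was removed,
 * 0 if the key was not found. */
int chain_remove(chainNode **head, int key)
{
    chainNode *prev = NULL;
    for (chainNode *cur = *head; cur; prev = cur, cur = cur->next) {
        if (cur->key != key) continue;
        if (prev)
            prev->next = cur->next; /* middle or tail: bypass the node */
        else
            *head = cur->next;      /* head: the slot points to the next node */
        free(cur);
        return 1;
    }
    return 0;
}
```

<p>In a hash table, <code class="highlighter-rouge">*head</code> plays the role of the slot in the bucket array, so removing the first entry of a chain updates the slot itself.</p>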
<h3 id="replace">Replace</h3>
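<p>The <em>Replace</em> operation first tries to add the pair <key, value>. If <em>add</em> fails because the key already exists, it finds the existing entry and sets the new value on it. It returns 1 if the key was newly added and 0 if an existing value was replaced.</p>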
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dictReplace</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">val</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">dictAdd</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="o">==</span> <span class="n">DICT_OK</span><span class="p">)</span> <span class="c1">// try add
</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">entry</span> <span class="o">=</span> <span class="n">dictFind</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">key</span><span class="p">);</span> <span class="c1">// get entry
</span> <span class="cm">/* Set the new value and free the old one. Note that it is important
* to do that in this order, as the value may just be exactly the same
* as the previous one. In this context, think to reference counting,
* you want to increment (set), and then decrement (free), and not the
* reverse. */</span>
<span class="n">auxentry</span> <span class="o">=</span> <span class="o">*</span><span class="n">entry</span><span class="p">;</span>
<span class="n">dictSetVal</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">entry</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
<span class="n">dictFreeVal</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="o">&</span><span class="n">auxentry</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
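<p>The comment about ordering is easier to see with a concrete reference-counted value. The sketch below is only an illustration under the assumption that the value hooks implement reference counting. The names <code class="highlighter-rouge">rcObj</code>, <code class="highlighter-rouge">rc_new</code>, <code class="highlighter-rouge">rc_incr</code>, <code class="highlighter-rouge">rc_decr</code> and <code class="highlighter-rouge">slot_replace</code> are hypothetical and do not come from Redis.</p>

```c
#include <assert.h>
#include <stdlib.h>

typedef struct rcObj {
    int refcount;
    int payload;
} rcObj;

/* Create an object owned by one reference. Returns NULL on failure. */
rcObj *rc_new(int payload)
{
    rcObj *o = malloc(sizeof(*o));
    if (!o) return NULL;
    o->refcount = 1;
    o->payload = payload;
    return o;
}

void rc_incr(rcObj *o) { o->refcount++; }

/* Drop one reference. Returns 1 if the object was freed, 0 otherwise. */
int rc_decr(rcObj *o)
{
    if (--o->refcount == 0) {
        free(o);
        return 1;
    }
    return 0;
}

/* Store val in *slot: take the new reference first, release the old one
 * afterwards. Returns 1 if the old object was freed. With the reverse
 * order, replacing a value by itself would drop its only reference,
 * free it, and then store a dangling pointer. */
int slot_replace(rcObj **slot, rcObj *val)
{
    rcObj *old = *slot;
    rc_incr(val);        /* "set": the slot now also owns val */
    *slot = val;
    return rc_decr(old); /* "free": release the previous value */
}
```

<p>Replacing a value by the same object leaves its reference count unchanged, which is the case the comment in <em>dictReplace</em> warns about.</p>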
<h2 id="rehashing">Rehashing</h2>
<p>We need to adjust the number of buckets (slots) according to the number of entries stored in the hash table. This is called <strong>auto-resize</strong> or <strong>rehashing</strong>. For example, when the hash table is heavily loaded (entries / slot_size is too large), we need to enlarge the bucket array to reduce collisions. Conversely, if the hash table is nearly empty, we need to shrink the bucket array to save memory.</p>
<p>Rehashing is controlled by two variables <code class="highlighter-rouge">dict_can_resize</code> and <code class="highlighter-rouge">dict_force_resize_ratio</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">dict_can_resize</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">dict_force_resize_ratio</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
</code></pre></div></div>
<p>When <code class="highlighter-rouge">dict_can_resize</code> is set to 0, not all resizes are prevented: a hash table is still allowed to grow if the ratio between the number of elements and the number of buckets is greater than <code class="highlighter-rouge">dict_force_resize_ratio</code>. This keeps lookups efficient: with the default value of 5, growth is forced once a slot holds more than 5 entries on average.</p>
<p>When the dict is in the process of rehashing, ht[1] is in use and all newly added pairs go into it. The core function that performs the rehash is <strong>dictRehash</strong>. Each call moves up to n non-empty slots from ht[0] to ht[1], where n is the second parameter.</p>
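<p>The growth rule can be written as a small predicate. This is a sketch, not the Redis function: the name <code class="highlighter-rouge">should_expand</code> and its parameters are hypothetical, and it assumes that growth is only considered once there is at least one entry per slot, which is how I recall the check in the Redis source working.</p>

```c
#include <assert.h>

/* Hypothetical predicate for the growth rule (not a Redis function).
 * Assumption: growth is considered only when used >= size. */
int should_expand(unsigned long used, unsigned long size,
                  int can_resize, unsigned int force_ratio)
{
    if (size == 0) return 1;          /* empty table: allocate buckets */
    if (used < size) return 0;        /* less than one entry per slot */
    if (can_resize) return 1;         /* normal case: grow right away */
    return used / size > force_ratio; /* resizing disabled: grow only when forced */
}
```

<p>With 4 slots and resizing disabled, 20 entries give a ratio of exactly 5 and do not trigger growth, while 24 entries give a ratio of 6 and do.</p>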
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Performs N steps of incremental rehashing. Returns 1 if there are still
* keys to move from the old to the new hash table, otherwise 0 is returned.
* Note that a rehashing step consists in moving a bucket (that may have more
* than one key as we use chaining) from the old to the new hash table. */</span>
<span class="kt">int</span> <span class="nf">dictRehash</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">dictIsRehashing</span><span class="p">(</span><span class="n">d</span><span class="p">))</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">n</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dictEntry</span> <span class="o">*</span><span class="n">de</span><span class="p">,</span> <span class="o">*</span><span class="n">nextde</span><span class="p">;</span>
<span class="cm">/* Check if we already rehashed the whole table... */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">used</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">table</span><span class="p">);</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">_dictReset</span><span class="p">(</span><span class="o">&</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
<span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Note that rehashidx can't overflow as we are sure there are more
* elements because ht[0].used != 0 */</span>
<span class="n">assert</span><span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">size</span> <span class="o">></span> <span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span><span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="p">);</span>
<span class="k">while</span><span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="o">++</span><span class="p">;</span>
<span class="n">de</span> <span class="o">=</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="p">];</span>
<span class="cm">/* Move all the keys in this bucket from the old to the new hash HT */</span>
<span class="k">while</span><span class="p">(</span><span class="n">de</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">h</span><span class="p">;</span>
<span class="n">nextde</span> <span class="o">=</span> <span class="n">de</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="cm">/* Get the index in the new hash table */</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">dictHashKey</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">de</span><span class="o">-></span><span class="n">key</span><span class="p">)</span> <span class="o">&</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">sizemask</span><span class="p">;</span>
<span class="n">de</span><span class="o">-></span><span class="n">next</span> <span class="o">=</span> <span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">h</span><span class="p">];</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">h</span><span class="p">]</span> <span class="o">=</span> <span class="n">de</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">used</span><span class="o">--</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">used</span><span class="o">++</span><span class="p">;</span>
<span class="n">de</span> <span class="o">=</span> <span class="n">nextde</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">d</span><span class="o">-></span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">table</span><span class="p">[</span><span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="p">]</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">d</span><span class="o">-></span><span class="n">rehashidx</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>However, to keep server response time low, the rehash should not be completed in a single call. What we need is <strong>incremental rehashing</strong>.</p>
<p>Function <code class="highlighter-rouge">_dictRehashStep</code> moves one slot from ht[0] to ht[1] per call, and only when no iterators are active on the dict.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">_dictRehashStep</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">d</span><span class="o">-></span><span class="n">iterators</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">dictRehash</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
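<p>To make the one-slot-per-call idea concrete, here is a minimal sketch on a toy table of integer keys. It is not the Redis code: <code class="highlighter-rouge">toy_dict</code>, <code class="highlighter-rouge">toy_rehash_step</code> and <code class="highlighter-rouge">toy_find</code> are hypothetical names, the hash is the identity function, and, to keep it short, an empty slot also counts as one step.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy chained hash table of integer keys; all names are illustrative. */
typedef struct node { unsigned key; struct node *next; } node;

typedef struct {
    node **old_tab, **new_tab;   /* play the roles of ht[0] and ht[1] */
    unsigned old_size, new_size; /* both assumed to be powers of two */
    long rehashidx;              /* next old slot to migrate, -1 when idle */
} toy_dict;

/* Move the slot at rehashidx into the new table.
 * Returns 1 if slots remain, 0 once the migration is complete. */
static int toy_rehash_step(toy_dict *d) {
    if (d->rehashidx < 0) return 0;
    assert((unsigned long)d->rehashidx < d->old_size);
    node *n = d->old_tab[d->rehashidx];
    while (n) {
        node *next = n->next;
        unsigned h = n->key & (d->new_size - 1); /* identity hash for brevity */
        n->next = d->new_tab[h];                 /* push onto the new chain */
        d->new_tab[h] = n;
        n = next;
    }
    d->old_tab[d->rehashidx] = NULL;
    d->rehashidx++;
    if ((unsigned long)d->rehashidx == d->old_size) {
        d->rehashidx = -1;                       /* migration finished */
        return 0;
    }
    return 1;
}

/* While a rehash is in progress a key may sit in either table, so check both. */
static node *toy_find(const toy_dict *d, unsigned key) {
    node *n;
    for (n = d->old_tab[key & (d->old_size - 1)]; n; n = n->next)
        if (n->key == key) return n;
    for (n = d->new_tab[key & (d->new_size - 1)]; n; n = n->next)
        if (n->key == key) return n;
    return NULL;
}
```

<p>Calling <code class="highlighter-rouge">toy_rehash_step</code> once per add, delete or find spreads the migration cost over many operations instead of paying it in one long pause.</p>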
<p><strong>Incremental rehashing</strong> happens during the following operations: add, delete, find, and getRandomKey.</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis/src/dict.c|344| <<dictAddRaw>> if (dictIsRehashing(d)) _dictRehashStep(d);
redis/src/dict.c|408| <<dictGenericDelete>> if (dictIsRehashing(d)) _dictRehashStep(d);
redis/src/dict.c|487| <<dictFind>> if (dictIsRehashing(d)) _dictRehashStep(d);
redis/src/dict.c|622| <<dictGetRandomKey>> if (dictIsRehashing(d)) _dictRehashStep(d);
</code></pre></div></div>
<p>Besides limiting how many slots are moved per rehashing step, there is another function, <strong>dictRehashMilliseconds</strong>, that limits how long rehashing may run.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Rehash for an amount of time between ms milliseconds and ms+1 milliseconds */</span>
<span class="kt">int</span> <span class="nf">dictRehashMilliseconds</span><span class="p">(</span><span class="n">dict</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">ms</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">start</span> <span class="o">=</span> <span class="n">timeInMilliseconds</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">rehashes</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">dictRehash</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="mi">100</span><span class="p">))</span> <span class="p">{</span>
<span class="n">rehashes</span> <span class="o">+=</span> <span class="mi">100</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">timeInMilliseconds</span><span class="p">()</span><span class="o">-</span><span class="n">start</span> <span class="o">></span> <span class="n">ms</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">rehashes</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
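<p>The same time-boxing pattern can be tried outside Redis. The sketch below is an assumption-laden illustration, not the Redis implementation: <code class="highlighter-rouge">time_in_ms</code> is a stand-in clock built on <code class="highlighter-rouge">gettimeofday</code>, and <code class="highlighter-rouge">batch_fn</code>, <code class="highlighter-rouge">run_for_ms</code> and <code class="highlighter-rouge">countdown</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds (assumed helper, based on gettimeofday). */
static long long time_in_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return ((long long)tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

/* A unit of work: performs up to n steps and returns 1 if work remains. */
typedef int (*batch_fn)(void *ctx, int n);

/* Run `work` in batches until it finishes or roughly `ms` milliseconds pass.
 * The clock is only read between batches, so the budget can be overshot
 * by at most the duration of one batch. */
static int run_for_ms(batch_fn work, void *ctx, int batch, int ms) {
    long long start = time_in_ms();
    int done = 0;
    assert(batch > 0 && ms >= 0);
    while (work(ctx, batch)) {
        done += batch;
        if (time_in_ms() - start > ms) break;
    }
    return done;
}

/* Example work function: count a counter down to zero. */
static int countdown(void *ctx, int n) {
    int *left = ctx;
    *left = (*left > n) ? *left - n : 0;
    return *left > 0;
}
```

<p>With a counter of 250, batches of 100 and a generous budget, <code class="highlighter-rouge">run_for_ms</code> returns 200: like the loop above, it only counts batches after which work still remained, so the return value is an approximation rather than an exact step count.</p>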
Redis Internal Data Structure : SDS
2014-08-13T00:00:00+00:00
http://blog.wjin.org/posts/redis-internal-data-structure--sds
<h1 id="introduction">Introduction</h1>
<p>SDS stands for <strong>Simple Dynamic Strings</strong>. It is the most basic data structure in Redis and is widely used across its modules. Its purpose is to replace char* in C.</p>
<p>Redis provides SDS because it can return the <strong>string length</strong> efficiently and <strong>append</strong> another string to the end without allocating memory each time.</p>
<p>Also, it is <strong>binary safe</strong>: the length is stored explicitly, so it does not matter whether the data contains or ends with ‘\0’.</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+-------------------------------+-----------+
| Header | Binary safe C alike string... | Null term |
+--------+-------------------------------+-----------+
|
`-> Pointer returned to the user.
</code></pre></div></div>
<p>It has since been extracted as a standalone project: <a href="https://github.com/antirez/sds">https://github.com/antirez/sds</a>.</p>
<h1 id="implementation">Implementation</h1>
<p>SDS is implemented in <strong>src/sds.h</strong> and <strong>src/sds.c</strong>.</p>
<h2 id="data-structure">Data Structure</h2>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">char</span> <span class="o">*</span><span class="n">sds</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">len</span><span class="p">;</span> <span class="c1">// current string length
</span> <span class="kt">int</span> <span class="n">free</span><span class="p">;</span> <span class="c1">// available space in buffer
</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[];</span> <span class="c1">// string content stored here, c99 grammar
</span><span class="p">};</span>
</code></pre></div></div>
<h2 id="initialization">Initialization</h2>
<p>The core function for creating a new SDS string is <strong>sdsnewlen</strong>. The string is always null-terminated. It is also binary safe and can contain ‘\0’ characters in the middle, because the length is stored in the sds header.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sds</span> <span class="nf">sdsnewlen</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">init</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">initlen</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="o">*</span><span class="n">sh</span><span class="p">;</span>
<span class="c1">// +1 means there is always a '\0' in the end
</span> <span class="k">if</span> <span class="p">(</span><span class="n">init</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sh</span> <span class="o">=</span> <span class="n">zmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)</span><span class="o">+</span><span class="n">initlen</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">sh</span> <span class="o">=</span> <span class="n">zcalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)</span><span class="o">+</span><span class="n">initlen</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sh</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// no memory
</span> <span class="n">sh</span><span class="o">-></span><span class="n">len</span> <span class="o">=</span> <span class="n">initlen</span><span class="p">;</span> <span class="c1">// init string length
</span> <span class="n">sh</span><span class="o">-></span><span class="n">free</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// init available space to zero
</span> <span class="k">if</span> <span class="p">(</span><span class="n">initlen</span> <span class="o">&&</span> <span class="n">init</span><span class="p">)</span>
<span class="n">memcpy</span><span class="p">(</span><span class="n">sh</span><span class="o">-></span><span class="n">buf</span><span class="p">,</span> <span class="n">init</span><span class="p">,</span> <span class="n">initlen</span><span class="p">);</span>
<span class="n">sh</span><span class="o">-></span><span class="n">buf</span><span class="p">[</span><span class="n">initlen</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'\0'</span><span class="p">;</span>
<span class="c1">// return the real string content, not including header
</span> <span class="c1">// user can use it as usual
</span> <span class="k">return</span> <span class="p">(</span><span class="kt">char</span><span class="o">*</span><span class="p">)</span><span class="n">sh</span><span class="o">-></span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>Note</strong>: sdsnewlen always returns a pointer to the <strong>real string content</strong>, not to the sds header. All other APIs do the same, so an sds value can be passed to any function that expects a plain C string.</p>
<p>Given the layout above, the string length and the spare buffer size can both be read in O(1) time.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">size_t</span> <span class="nf">sdslen</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="o">*</span><span class="n">sh</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="n">s</span><span class="o">-</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)));</span>
<span class="k">return</span> <span class="n">sh</span><span class="o">-></span><span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">size_t</span> <span class="nf">sdsavail</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="o">*</span><span class="n">sh</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="n">s</span><span class="o">-</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)));</span>
<span class="k">return</span> <span class="n">sh</span><span class="o">-></span><span class="n">free</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
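<p>The header trick can be exercised in isolation. The following is a minimal sketch that assumes plain <code class="highlighter-rouge">malloc</code>/<code class="highlighter-rouge">free</code> instead of zmalloc/zfree; <code class="highlighter-rouge">mini_hdr</code>, <code class="highlighter-rouge">mini_new</code>, <code class="highlighter-rouge">mini_len</code> and <code class="highlighter-rouge">mini_free</code> are hypothetical names, not part of SDS. It shows why a stored length makes the string binary safe while <code class="highlighter-rouge">strlen</code> is not.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative header placed directly in front of the string bytes. */
struct mini_hdr {
    int len;     /* bytes stored, not counting the terminator */
    char buf[];  /* C99 flexible array member */
};

/* Allocate header + payload + '\0' and return a pointer to the payload. */
static char *mini_new(const void *data, size_t n) {
    struct mini_hdr *h = malloc(sizeof(*h) + n + 1);
    if (h == NULL) return NULL;
    h->len = (int)n;
    if (n && data) memcpy(h->buf, data, n);
    h->buf[n] = '\0';
    return h->buf;
}

/* Step back over the header to read the stored length in O(1). */
static size_t mini_len(const char *s) {
    assert(s != NULL);
    const struct mini_hdr *h = (const void *)(s - sizeof(struct mini_hdr));
    return (size_t)h->len;
}

/* Free the whole block, starting from the header address. */
static void mini_free(char *s) {
    if (s) free(s - sizeof(struct mini_hdr));
}
```

<p>For the 5-byte payload "ab\0cd", <code class="highlighter-rouge">strlen</code> stops at the embedded zero and reports 2, while <code class="highlighter-rouge">mini_len</code> reports 5 without scanning the buffer.</p>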
<h2 id="destroy">Destroy</h2>
<p>SDS memory (including the sds header) is dynamically allocated as one block, so releasing it takes a single zfree call on the header address.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">sdsfree</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="n">zfree</span><span class="p">(</span><span class="n">s</span><span class="o">-</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="memory-allocation-strategy">Memory Allocation Strategy</h2>
<p>SDS uses the function <strong>sdsMakeRoomFor</strong> to adjust the buf size. It takes a second parameter <em>addlen</em> and guarantees that, after the call, there are at least <em>addlen</em> bytes of free space at the end of buf.</p>
<p>It <strong>pre-allocates</strong> memory to reduce the number of allocations. It doubles the required length (<em>len + addlen</em>) when that is less than <code class="highlighter-rouge">SDS_MAX_PREALLOC</code> (1MB), and otherwise adds <code class="highlighter-rouge">SDS_MAX_PREALLOC</code> extra bytes. This is similar to the growth strategy of a C++ vector, and it is why appending to a string does not need to allocate memory every time.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sds</span> <span class="nf">sdsMakeRoomFor</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">addlen</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="o">*</span><span class="n">sh</span><span class="p">,</span> <span class="o">*</span><span class="n">newsh</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">free</span> <span class="o">=</span> <span class="n">sdsavail</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">newlen</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">free</span> <span class="o">>=</span> <span class="n">addlen</span><span class="p">)</span> <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">sdslen</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="n">sh</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="p">(</span><span class="n">s</span><span class="o">-</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)));</span>
<span class="n">newlen</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span><span class="o">+</span><span class="n">addlen</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">newlen</span> <span class="o"><</span> <span class="n">SDS_MAX_PREALLOC</span><span class="p">)</span>
<span class="n">newlen</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span> <span class="c1">// double size
</span> <span class="k">else</span>
<span class="n">newlen</span> <span class="o">+=</span> <span class="n">SDS_MAX_PREALLOC</span><span class="p">;</span>
<span class="n">newsh</span> <span class="o">=</span> <span class="n">zrealloc</span><span class="p">(</span><span class="n">sh</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)</span><span class="o">+</span><span class="n">newlen</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">newsh</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">newsh</span><span class="o">-></span><span class="n">free</span> <span class="o">=</span> <span class="n">newlen</span> <span class="o">-</span> <span class="n">len</span><span class="p">;</span>
<span class="k">return</span> <span class="n">newsh</span><span class="o">-></span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
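<p>The sizing rule is easier to see with the allocation stripped away. This sketch only reproduces the arithmetic; <code class="highlighter-rouge">grown_capacity</code> and <code class="highlighter-rouge">PREALLOC_LIMIT</code> are hypothetical names, the limit is assumed to be 1MB as stated above, and the early return for strings that already have enough free space is left out.</p>

```c
#include <assert.h>
#include <stddef.h>

#define PREALLOC_LIMIT (1024 * 1024) /* assumed equal to SDS_MAX_PREALLOC (1MB) */

/* Buffer length chosen when `len` bytes are in use and `addlen` more are needed. */
static size_t grown_capacity(size_t len, size_t addlen) {
    size_t newlen = len + addlen;
    assert(newlen >= len);          /* guard against size_t wrap-around */
    if (newlen < PREALLOC_LIMIT)
        newlen *= 2;                /* small strings: double what is required */
    else
        newlen += PREALLOC_LIMIT;   /* large strings: add a fixed 1MB of slack */
    return newlen;
}
```

<p>For example, a 5-byte string that needs 3 more bytes gets a 16-byte buffer, while a 3MB string that needs one more byte gets 4MB + 1 bytes. Capping the slack at 1MB keeps large strings from wasting memory proportional to their size.</p>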
<p>Function <strong>sdsRemoveFreeSpace</strong> is used to release the free space at the end of buf.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sds</span> <span class="nf">sdsRemoveFreeSpace</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sdshdr</span> <span class="o">*</span><span class="n">sh</span><span class="p">;</span>
<span class="n">sh</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="p">(</span><span class="n">s</span><span class="o">-</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)));</span>
<span class="n">sh</span> <span class="o">=</span> <span class="n">zrealloc</span><span class="p">(</span><span class="n">sh</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">sdshdr</span><span class="p">)</span><span class="o">+</span><span class="n">sh</span><span class="o">-></span><span class="n">len</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="n">sh</span><span class="o">-></span><span class="n">free</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">return</span> <span class="n">sh</span><span class="o">-></span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="all-apis">All APIs</h1>
<p>Most SDS APIs resemble the standard C string functions.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sds</span> <span class="n">sdsnewlen</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">init</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">initlen</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsnew</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">init</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsempty</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">sdslen</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsdup</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsfree</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">sdsavail</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsgrowzero</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscatlen</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscat</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">t</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscatsds</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="n">sds</span> <span class="n">t</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscpylen</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscpy</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">t</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscatprintf</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">fmt</span><span class="p">,</span> <span class="p">...);</span>
<span class="n">sds</span> <span class="n">sdscatfmt</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">char</span> <span class="k">const</span> <span class="o">*</span><span class="n">fmt</span><span class="p">,</span> <span class="p">...);</span>
<span class="n">sds</span> <span class="n">sdstrim</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cset</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsrange</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">start</span><span class="p">,</span> <span class="kt">int</span> <span class="n">end</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsupdatelen</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsclear</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">sdscmp</span><span class="p">(</span><span class="k">const</span> <span class="n">sds</span> <span class="n">s1</span><span class="p">,</span> <span class="k">const</span> <span class="n">sds</span> <span class="n">s2</span><span class="p">);</span>
<span class="n">sds</span> <span class="o">*</span><span class="n">sdssplitlen</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">sep</span><span class="p">,</span> <span class="kt">int</span> <span class="n">seplen</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">count</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsfreesplitres</span><span class="p">(</span><span class="n">sds</span> <span class="o">*</span><span class="n">tokens</span><span class="p">,</span> <span class="kt">int</span> <span class="n">count</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdstolower</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdstoupper</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsfromlonglong</span><span class="p">(</span><span class="kt">long</span> <span class="kt">long</span> <span class="n">value</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdscatrepr</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
<span class="n">sds</span> <span class="o">*</span><span class="n">sdssplitargs</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">line</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">argc</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsmapchars</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">from</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">to</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">setlen</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsjoin</span><span class="p">(</span><span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">sep</span><span class="p">);</span>
<span class="cm">/* Low level functions exposed to the user API */</span>
<span class="n">sds</span> <span class="n">sdsMakeRoomFor</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">addlen</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">sdsIncrLen</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">incr</span><span class="p">);</span>
<span class="n">sds</span> <span class="n">sdsRemoveFreeSpace</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">sdsAllocSize</span><span class="p">(</span><span class="n">sds</span> <span class="n">s</span><span class="p">);</span>
</code></pre></div></div>
Redis Introduction
2014-08-12T00:00:00+00:00
http://blog.wjin.org/posts/redis-introduction
<h1 id="background">Background</h1>
<p>More and more data is generated on the Internet every day. Storing, analysing and processing it efficiently is becoming increasingly important for the Internet giants.</p>
<p>It is no surprise that many solutions coexist, such as the well-known commercial RDBMS Oracle, the open source RDBMS MySQL (acquired by Oracle), and several kinds of NoSQL stores, such as MongoDB, Redis and HBase. See more details here: <a href="http://db-engines.com/en/ranking">DB ranking</a>.</p>
<p>In this article, I will give a brief introduction to the most popular key-value store, <strong>Redis</strong>.</p>
<h1 id="introduction">Introduction</h1>
<p>Redis is an <strong><em>in-memory persistent key-value</em></strong> store. It can be used as a cache layer in front of a backend DB. In fact, it can be used anywhere you would use Memcached.</p>
<p>It is <strong>blazing fast</strong> because all of its data is in memory. It also writes data back to disk for persistence using either <strong>RDB</strong> or <strong>AOF</strong> (I will explain these in later articles). Last but not least, it is much more than a key-value store because it exposes five different data structures (<strong><em>string, list, set, sorted-set, hash</em></strong>), which makes it more popular than Memcached.</p>
<p>In addition, Redis Cluster is under development: <a href="http://redis.io/topics/cluster-spec">cluster specification</a>.</p>
<h1 id="install">Install</h1>
<p>Installation follows the usual *nix build steps:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/antirez/redis.git
<span class="nb">cd </span>redis
make
make <span class="nb">test</span> <span class="o">(</span>optional<span class="o">)</span>
make install <span class="o">(</span>optional<span class="o">)</span>
</code></pre></div></div>
<h1 id="run">Run</h1>
<p>After the build, you can start the redis server directly:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Weis-MacBook-Pro:redis eric<span class="nv">$ </span>./src/redis-server
<span class="o">[</span>5512] 01 May 01:18:52.548 <span class="c"># Warning: no config file specified, using the default config. In order to specify a config file use ./src/redis-server /path/to/redis.conf</span>
<span class="o">[</span>5512] 01 May 01:18:52.549 <span class="k">*</span> Increased maximum number of open files to 10032 <span class="o">(</span>it was originally <span class="nb">set </span>to 256<span class="o">)</span><span class="nb">.</span>
_._
_.-<span class="sb">``</span>__ <span class="s1">''</span>-._
_.-<span class="sb">``</span> <span class="sb">`</span><span class="nb">.</span> <span class="sb">`</span>_. <span class="s1">''</span>-._ Redis 2.9.11 <span class="o">(</span>11d9ecb7/0<span class="o">)</span> 64 bit
.-<span class="sb">``</span> .-<span class="sb">```</span><span class="nb">.</span> <span class="sb">```</span><span class="se">\/</span> _.,_ <span class="s1">''</span>-._
<span class="o">(</span> <span class="s1">' , .-` | `, ) Running in stand alone mode
|`-._`-...-` __...-.``-._|'</span><span class="sb">`</span> _.-<span class="s1">'| Port: 6379
| `-._ `._ / _.-'</span> | PID: 5512
<span class="sb">`</span>-._ <span class="sb">`</span>-._ <span class="sb">`</span>-./ _.-<span class="s1">' _.-'</span>
|<span class="sb">`</span>-._<span class="sb">`</span>-._ <span class="sb">`</span>-.__.-<span class="s1">' _.-'</span>_.-<span class="s1">'|
| `-._`-._ _.-'</span>_.-<span class="s1">' | http://redis.io
`-._ `-._`-.__.-'</span>_.-<span class="s1">' _.-'</span>
|<span class="sb">`</span>-._<span class="sb">`</span>-._ <span class="sb">`</span>-.__.-<span class="s1">' _.-'</span>_.-<span class="s1">'|
| `-._`-._ _.-'</span>_.-<span class="s1">' |
`-._ `-._`-.__.-'</span>_.-<span class="s1">' _.-'</span>
<span class="sb">`</span>-._ <span class="sb">`</span>-.__.-<span class="s1">' _.-'</span>
<span class="sb">`</span>-._ _.-<span class="s1">'
`-.__.-'</span>
<span class="o">[</span>5512] 01 May 01:18:52.550 <span class="c"># Server started, Redis version 2.9.11</span>
<span class="o">[</span>5512] 01 May 01:18:52.550 <span class="k">*</span> The server is now ready to accept connections on port 6379
</code></pre></div></div>
<p>Then run the client to connect to the Redis server:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Weis-MacBook-Pro:redis eric<span class="nv">$ </span>./src/redis-cli
127.0.0.1:6379>
</code></pre></div></div>
<p>Redis also ships with a benchmark tool that you can run to see how fast it is:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Weis-MacBook-Pro:redis eric<span class="nv">$ </span>./src/redis-benchmark
</code></pre></div></div>
<h1 id="usage">Usage</h1>
<p>Redis is a key-value store, so keys and values are its fundamental concepts. Below is a simple example that stores and reads a value with the Redis SET and GET commands. The full command reference is at <a href="http://redis.io/commands">http://redis.io/commands</a>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Weis-MacBook-Pro:~ eric<span class="nv">$ </span>redis-cli
127.0.0.1:6379> <span class="nb">set test </span>eric
OK
127.0.0.1:6379> get <span class="nb">test</span>
<span class="s2">"eric"</span>
127.0.0.1:6379>
</code></pre></div></div>
<h1 id="notes">Notes</h1>
<ul>
<li>Keys are always strings, and each key identifies one piece of data (its value).</li>
<li>Values can be plain strings (arbitrary byte arrays) or other types such as lists, sets, hashes and sorted sets.</li>
</ul>
Consistent Hash
2014-08-12T00:00:00+00:00
http://blog.wjin.org/posts/consistent-hash
<h1 id="introduction">Introduction</h1>
<p>A traditional hash table places each key with <strong>hash(key) mod N</strong>, so changing the number of slots N forces almost every key to be remapped.</p>
<p>Consistent hashing is a special kind of hashing such that when a hash table is <strong><em>resized</em></strong>, only <strong><em>K/N</em></strong> keys need to be remapped on average, where K is the number of keys and N is the number of slots. <a href="http://www.codeproject.com/Articles/56138/Consistent-hashing">This</a> article gives a good explanation of it.</p>
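<p>To make the cost of plain modulo placement concrete, here is a minimal sketch. It assumes integer keys whose id doubles as the hash value, and <code>CountMovedKeys</code> is a hypothetical helper that does not appear elsewhere in this post.</p>

```cpp
#include <cassert>
#include <cstddef>

// Count how many of the keys 0..num_keys-1 land in a different slot when the
// table is resized from old_slots to new_slots under "hash(key) mod N".
// Assumption: the key id itself is used as the hash value.
std::size_t CountMovedKeys(std::size_t num_keys,
                           std::size_t old_slots,
                           std::size_t new_slots)
{
    assert(old_slots > 0 && new_slots > 0); // avoid modulo by zero
    std::size_t moved = 0;
    for (std::size_t key = 0; key < num_keys; ++key) {
        if (key % old_slots != key % new_slots) {
            ++moved;
        }
    }
    return moved;
}
```

<p>With 1000 keys, growing from 4 to 5 slots moves 800 of them, while K/N for the same resize is only 1000/5 = 200.</p>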
<h1 id="property">Property</h1>
<ul>
<li>Monotonicity</li>
<li>Balance</li>
</ul>
<h1 id="usage">Usage</h1>
<ul>
<li>Distributed Caching System</li>
<li>Load Balancing</li>
</ul>
<h1 id="code-practice">Code Practice</h1>
<p>As practice, I designed a cache system that stores person information across a few cache servers using consistent hashing.</p>
<p>I then remove one server from the cache system to observe which data is affected. Only the data on the removed server is affected; data stored on the other servers does not need to be remapped.</p>
<p>This simple code snippet simulates how consistent hashing works.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <iostream>
#include <map>
#include <vector>
#include <string>
#include <functional>
#include <stdexcept>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="c1">// Hash Function Class used to generate a number between 0 to MAX
// leverage the STL hash template class
</span><span class="k">class</span> <span class="nc">HashFunction</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">size_t</span> <span class="k">operator</span><span class="p">()(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">str</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
<span class="c1">// just let ring have 100 slots: 0 to 99
</span> <span class="c1">// so that it is easy to observe output
</span> <span class="k">return</span> <span class="n">m_hash</span><span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="o">%</span> <span class="mi">100</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">hash</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">m_hash</span><span class="p">;</span>
<span class="p">};</span>
<span class="c1">// according to consistent hashing algorithm
// we will map server node and data key in this ring
// and then choose appropriate server to store data
</span><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">HashFunc</span><span class="o">></span>
<span class="k">class</span> <span class="nc">HashRing</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">HashRing</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">replicas</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_replicas</span><span class="p">(</span><span class="n">replicas</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">AddNode</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">size_t</span> <span class="n">slot</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Adding server : "</span> <span class="o"><<</span> <span class="n">node</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">m_replicas</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">slot</span> <span class="o">=</span> <span class="n">m_hash</span><span class="p">(</span><span class="n">node</span> <span class="o">+</span> <span class="n">to_string</span><span class="p">(</span><span class="n">i</span><span class="p">));</span>
<span class="n">m_ring</span><span class="p">[</span><span class="n">slot</span><span class="p">]</span> <span class="o">=</span> <span class="n">node</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">" add virtual node "</span> <span class="o"><<</span> <span class="n">i</span> <span class="o"><<</span> <span class="s">" , slot "</span> <span class="o"><<</span> <span class="n">slot</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">RemoveNode</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">m_replicas</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">size_t</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">m_hash</span><span class="p">(</span><span class="n">node</span> <span class="o">+</span> <span class="n">to_string</span><span class="p">(</span><span class="n">i</span><span class="p">));</span>
<span class="n">m_ring</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">slot</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">GetNode</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span><span class="o">&</span> <span class="n">data</span><span class="p">)</span> <span class="k">const</span>
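<span class="p">{</span>
<span class="c1">// empty ring: return an empty name so that callers can detect it
</span> <span class="k">static</span> <span class="k">const</span> <span class="n">string</span> <span class="n">no_node</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">m_ring</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="k">return</span> <span class="n">no_node</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">m_hash</span><span class="p">(</span><span class="n">data</span><span class="p">);</span>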
<span class="c1">// Look for the first node >= hash
</span> <span class="k">auto</span> <span class="n">ite</span> <span class="o">=</span> <span class="n">m_ring</span><span class="p">.</span><span class="n">lower_bound</span><span class="p">(</span><span class="n">slot</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ite</span> <span class="o">==</span> <span class="n">m_ring</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="c1">// Wrapped around; get the first node
</span> <span class="n">ite</span> <span class="o">=</span> <span class="n">m_ring</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ite</span><span class="o">-></span><span class="n">second</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">map</span><span class="o"><</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">string</span><span class="o">></span> <span class="n">m_ring</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">size_t</span> <span class="n">m_replicas</span><span class="p">;</span> <span class="c1">// virtual node number
</span> <span class="n">HashFunc</span> <span class="n">m_hash</span><span class="p">;</span> <span class="c1">// hash function object
</span><span class="p">};</span>
<span class="k">struct</span> <span class="n">Person</span> <span class="p">{</span>
<span class="n">Person</span><span class="p">(</span><span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">string</span> <span class="n">name</span><span class="o">=</span><span class="s">""</span><span class="p">,</span> <span class="n">string</span> <span class="n">addr</span> <span class="o">=</span> <span class="s">""</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">phone</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">:</span>
<span class="n">m_id</span><span class="p">(</span><span class="n">id</span><span class="p">),</span> <span class="n">m_name</span><span class="p">(</span><span class="n">name</span><span class="p">),</span> <span class="n">m_addr</span><span class="p">(</span><span class="n">addr</span><span class="p">),</span> <span class="n">m_phone</span><span class="p">(</span><span class="n">phone</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">m_id</span><span class="p">;</span>
<span class="n">string</span> <span class="n">m_name</span><span class="p">;</span>
<span class="n">string</span> <span class="n">m_addr</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">m_phone</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">Key</span><span class="p">,</span> <span class="k">typename</span> <span class="n">Value</span><span class="o">></span>
<span class="k">class</span> <span class="nc">CacheServer</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">Set</span><span class="p">(</span><span class="k">const</span> <span class="n">Key</span><span class="o">&</span> <span class="n">k</span><span class="p">,</span> <span class="k">const</span> <span class="n">Value</span><span class="o">&</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_cache</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">Value</span> <span class="n">Get</span><span class="p">(</span><span class="k">const</span> <span class="n">Key</span><span class="o">&</span> <span class="n">k</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
<span class="n">Value</span> <span class="n">v</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">ite</span> <span class="o">=</span> <span class="n">m_cache</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">k</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ite</span> <span class="o">!=</span> <span class="n">m_cache</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">ite</span><span class="o">-></span><span class="n">second</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">Remove</span><span class="p">(</span><span class="k">const</span> <span class="n">Key</span><span class="o">&</span> <span class="n">k</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">ite</span> <span class="o">=</span> <span class="n">m_cache</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">k</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ite</span> <span class="o">!=</span> <span class="n">m_cache</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span>
<span class="n">m_cache</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">ite</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="c1">// data can be stored on this server
</span> <span class="c1">// simply use person's id as key
</span> <span class="n">map</span><span class="o"><</span><span class="n">Key</span><span class="p">,</span> <span class="n">Value</span><span class="o">></span> <span class="n">m_cache</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">>></span> <span class="n">servers</span><span class="p">;</span>
<span class="c1">// initialize servers according to server host name
</span> <span class="c1">// here we have 3 servers used to store Person information
</span> <span class="c1">// CacheServer stores person as : <id, info>
</span> <span class="n">servers</span><span class="p">[</span><span class="s">"cache1.wjin.org"</span><span class="p">]</span> <span class="o">=</span> <span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">></span><span class="p">();</span>
<span class="n">servers</span><span class="p">[</span><span class="s">"cache2.wjin.org"</span><span class="p">]</span> <span class="o">=</span> <span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">></span><span class="p">();</span>
<span class="n">servers</span><span class="p">[</span><span class="s">"cache3.wjin.org"</span><span class="p">]</span> <span class="o">=</span> <span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">></span><span class="p">();</span>
<span class="n">HashRing</span><span class="o"><</span><span class="n">HashFunction</span><span class="o">></span> <span class="n">ring</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="c1">// persons
</span> <span class="n">vector</span><span class="o"><</span><span class="n">Person</span><span class="o">></span> <span class="n">persons</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Eric King"</span><span class="p">,</span> <span class="s">"Shanghai"</span><span class="p">,</span> <span class="mi">1111</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">2</span><span class="p">,</span> <span class="s">"Peter Will"</span><span class="p">,</span> <span class="s">"Beijing"</span><span class="p">,</span> <span class="mi">2222</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">3</span><span class="p">,</span> <span class="s">"Smith John"</span><span class="p">,</span> <span class="s">"Shenzhen"</span><span class="p">,</span> <span class="mi">3333</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">4</span><span class="p">,</span> <span class="s">"Joe Richard"</span><span class="p">,</span> <span class="s">"Guangzhou"</span><span class="p">,</span> <span class="mi">4444</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">5</span><span class="p">,</span> <span class="s">"Tim Hans"</span><span class="p">,</span> <span class="s">"Chengdu"</span><span class="p">,</span> <span class="mi">5555</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">6</span><span class="p">,</span> <span class="s">"Tom Paul"</span><span class="p">,</span> <span class="s">"Hangzhou"</span><span class="p">,</span> <span class="mi">6666</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">7</span><span class="p">,</span> <span class="s">"Kate James"</span><span class="p">,</span> <span class="s">"Hangzhou"</span><span class="p">,</span> <span class="mi">7777</span> <span class="p">},</span>
<span class="p">{</span> <span class="mi">8</span><span class="p">,</span> <span class="s">"Jim Jordan"</span><span class="p">,</span> <span class="s">"Hangzhou"</span><span class="p">,</span> <span class="mi">8888</span> <span class="p">},</span>
<span class="p">};</span>
<span class="c1">// add server to the hash ring
</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">ite</span> <span class="o">=</span> <span class="n">servers</span><span class="p">.</span><span class="n">begin</span><span class="p">();</span> <span class="n">ite</span> <span class="o">!=</span> <span class="n">servers</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">ite</span><span class="p">)</span> <span class="p">{</span>
<span class="n">string</span> <span class="n">name</span> <span class="o">=</span> <span class="n">ite</span><span class="o">-></span><span class="n">first</span><span class="p">;</span>
<span class="n">ring</span><span class="p">.</span><span class="n">AddNode</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"-------------------------------------------------"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// store person info
</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">p</span> <span class="o">:</span> <span class="n">persons</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// use id + name to find an appropriate server name in ring
</span> <span class="n">string</span> <span class="n">host</span> <span class="o">=</span> <span class="n">ring</span><span class="p">.</span><span class="n">GetNode</span><span class="p">(</span><span class="n">to_string</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">)</span> <span class="o">+</span> <span class="n">p</span><span class="p">.</span><span class="n">m_name</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">host</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="k">throw</span> <span class="n">runtime_error</span><span class="p">(</span><span class="s">"No server available"</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"storing ID "</span> <span class="o"><<</span> <span class="n">p</span><span class="p">.</span><span class="n">m_id</span> <span class="o"><<</span> <span class="s">" on server "</span> <span class="o"><<</span> <span class="n">host</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">servers</span><span class="p">[</span><span class="n">host</span><span class="p">].</span><span class="n">Set</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"-------------------------------------------------"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// read data from server
</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">p</span> <span class="o">:</span> <span class="n">persons</span><span class="p">)</span> <span class="p">{</span>
<span class="n">string</span> <span class="n">host</span> <span class="o">=</span> <span class="n">ring</span><span class="p">.</span><span class="n">GetNode</span><span class="p">(</span><span class="n">to_string</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">)</span> <span class="o">+</span> <span class="n">p</span><span class="p">.</span><span class="n">m_name</span><span class="p">);</span>
<span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">></span> <span class="n">server</span> <span class="o">=</span> <span class="n">servers</span><span class="p">[</span><span class="n">host</span><span class="p">];</span>
<span class="n">Person</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">server</span><span class="p">.</span><span class="n">Get</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">);</span> <span class="c1">// read data
</span> <span class="n">cout</span> <span class="o"><<</span> <span class="s">"ID "</span> <span class="o"><<</span> <span class="n">p</span><span class="p">.</span><span class="n">m_id</span> <span class="o"><<</span> <span class="s">" stored on server "</span> <span class="o"><<</span> <span class="n">host</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">" ID : "</span> <span class="o"><<</span> <span class="n">p</span><span class="p">.</span><span class="n">m_id</span> <span class="o"><<</span> <span class="s">" Name : "</span> <span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_name</span> <span class="o"><<</span> <span class="s">" Addr : "</span> \
<span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_addr</span> <span class="o"><<</span> <span class="s">" Phone : "</span> <span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_phone</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"-------------------------------------------------"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// remove server cache3.wjin.org
</span> <span class="n">ring</span><span class="p">.</span><span class="n">RemoveNode</span><span class="p">(</span><span class="s">"cache3.wjin.org"</span><span class="p">);</span>
<span class="c1">// read data again
</span> <span class="c1">// server cache3 has been removed, so its keys now map to other servers
</span> <span class="c1">// that do not hold the data, and we get default person info for them
</span> <span class="c1">// keys on the remaining servers are unaffected: that is consistent hashing
</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">p</span> <span class="o">:</span> <span class="n">persons</span><span class="p">)</span> <span class="p">{</span>
<span class="n">string</span> <span class="n">host</span> <span class="o">=</span> <span class="n">ring</span><span class="p">.</span><span class="n">GetNode</span><span class="p">(</span><span class="n">to_string</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">)</span> <span class="o">+</span> <span class="n">p</span><span class="p">.</span><span class="n">m_name</span><span class="p">);</span>
<span class="n">CacheServer</span><span class="o"><</span><span class="kt">int</span><span class="p">,</span> <span class="n">Person</span><span class="o">></span> <span class="n">server</span> <span class="o">=</span> <span class="n">servers</span><span class="p">[</span><span class="n">host</span><span class="p">];</span>
<span class="n">Person</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">server</span><span class="p">.</span><span class="n">Get</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">m_id</span><span class="p">);</span> <span class="c1">// read data
</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"ID "</span> <span class="o"><<</span> <span class="n">p</span><span class="p">.</span><span class="n">m_id</span> <span class="o"><<</span> <span class="s">" stored on server "</span> <span class="o"><<</span> <span class="n">host</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">" ID : "</span> <span class="o"><<</span> <span class="n">p</span><span class="p">.</span><span class="n">m_id</span> <span class="o"><<</span> <span class="s">" Name : "</span> <span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_name</span> <span class="o"><<</span> <span class="s">" Addr : "</span> \
<span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_addr</span> <span class="o"><<</span> <span class="s">" Phone : "</span> <span class="o"><<</span> <span class="n">ret</span><span class="p">.</span><span class="n">m_phone</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="next">Next</h1>
<p>The solution above is naive. Normally, when a server fails, the data on it should be remapped to the remaining servers. Likewise, when a new server is added to the ring, part of the data on the existing servers needs to move to the newly added server.</p>
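<p>As a rough sketch of that remapping, the snippet below counts which keys change owner when a server leaves the ring. All names here (<code>Fnv1a</code>, <code>SimpleRing</code>, <code>MoveStats</code>, <code>RemoveAndCount</code>) are hypothetical, and it swaps in a 32-bit FNV-1a hash instead of the 100-slot hash used earlier, so treat it as an illustration under those assumptions and not as a finished design.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// FNV-1a, a simple 32-bit string hash (chosen only for illustration).
inline std::uint32_t Fnv1a(const std::string& s)
{
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

class SimpleRing
{
public:
    explicit SimpleRing(std::size_t replicas) : m_replicas(replicas) {}

    void AddNode(const std::string& node)
    {
        for (std::size_t i = 0; i < m_replicas; ++i) {
            // emplace keeps the existing owner if two virtual nodes collide
            m_ring.emplace(Fnv1a(node + "#" + std::to_string(i)), node);
        }
    }

    void RemoveNode(const std::string& node)
    {
        for (std::size_t i = 0; i < m_replicas; ++i) {
            auto it = m_ring.find(Fnv1a(node + "#" + std::to_string(i)));
            // erase only the points that really belong to this node
            if (it != m_ring.end() && it->second == node) {
                m_ring.erase(it);
            }
        }
    }

    // Owner of a key: the first virtual node at or after the key's hash,
    // wrapping around to the start. Precondition: the ring is not empty.
    const std::string& GetNode(const std::string& key) const
    {
        assert(!m_ring.empty());
        auto it = m_ring.lower_bound(Fnv1a(key));
        if (it == m_ring.end()) {
            it = m_ring.begin();
        }
        return it->second;
    }

private:
    std::map<std::uint32_t, std::string> m_ring;
    std::size_t m_replicas;
};

// Keys whose owner changed, split by where they lived before the removal.
struct MoveStats {
    std::size_t from_victim = 0;
    std::size_t from_others = 0;
};

// Works on a copy of the ring, so the caller's ring is left untouched.
inline MoveStats RemoveAndCount(SimpleRing ring,
                                const std::vector<std::string>& keys,
                                const std::string& victim)
{
    std::vector<std::string> before;
    for (const auto& k : keys) {
        before.push_back(ring.GetNode(k));
    }
    ring.RemoveNode(victim);

    MoveStats stats;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (ring.GetNode(keys[i]) == before[i]) {
            continue; // owner unchanged
        }
        if (before[i] == victim) {
            ++stats.from_victim;
        } else {
            ++stats.from_others;
        }
    }
    return stats;
}
```

<p>After a removal, <code>from_others</code> stays at zero: only keys that lived on the removed server get a new owner, and those are the keys a real system would have to reload or migrate.</p>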
<h1 id="reference">Reference</h1>
<ul>
<li><a href="http://thor.cs.ucsb.edu/~ravenben/papers/coreos/kll%2B97.pdf">http://thor.cs.ucsb.edu/~ravenben/papers/coreos/kll%2B97.pdf</a></li>
<li><a href="http://en.wikipedia.org/wiki/Consistent_hashing">http://en.wikipedia.org/wiki/Consistent_hashing</a></li>
</ul>
Skip List
2014-08-08T00:00:00+00:00
http://blog.wjin.org/posts/skiplist
<h1 id="introduction">Introduction</h1>
<p>Skip List is a data structure that allows <strong>fast</strong> search within an <strong>ordered</strong> sequence of elements.</p>
<p>Fast search is made possible by maintaining a <strong>linked hierarchy</strong> of subsequences, each skipping over fewer elements. The elements that are skipped over may be chosen <strong>probabilistically</strong>.</p>
<p>Here is an overview of this data structure:</p>
<p><img src="/assets/img/post/skiplist_overview.png" alt="img" /></p>
<p>This animation is helpful for understanding how insertion works:</p>
<p><img src="/assets/img/post/skiplist_add_element.gif" alt="img" /></p>
<p>Note: both images are from Wikipedia; see the references.</p>
<h1 id="complexity">Complexity</h1>
<p>You may already know AVL trees and red-black trees, both of which guarantee O(log n) in the worst case. However, they can be hard to implement in a short time, especially AVL trees.</p>
<p>A skip list is another choice. It works well in practice and, most importantly, it is easy to implement. See my code below.</p>
<p>Redis uses a skip list in its implementation of sorted sets.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Average</th>
<th style="text-align: right">Worst case</th>
</tr>
</thead>
<tbody>
<tr>
<td>Space</td>
<td>O(n)</td>
<td style="text-align: right">O(n)</td>
</tr>
<tr>
<td>Insert</td>
<td>O(log n)</td>
<td style="text-align: right">O(n)</td>
</tr>
<tr>
<td>Delete</td>
<td>O(log n)</td>
<td style="text-align: right">O(n)</td>
</tr>
<tr>
<td>Find</td>
<td>O(log n)</td>
<td style="text-align: right">O(n)</td>
</tr>
</tbody>
</table>
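<p>The average-case bounds in the table depend on node heights being drawn at random, with each extra level being less likely than the one below it. The following is a minimal sketch of such a draw, assuming a promotion probability of 1/4 and using <code>std::mt19937</code> as the random source; <code>RandomLevel</code> and <code>AverageLevel</code> here are illustrative free functions.</p>

```cpp
#include <cassert>
#include <random>

// Draw a node height in [1, max_level]. Each promotion succeeds with
// probability 1/4: the low 16 bits of the generator are compared with 0x4000.
inline int RandomLevel(std::mt19937& rng, int max_level)
{
    assert(max_level >= 1);
    int level = 1;
    while (level < max_level && (rng() & 0xFFFF) < 0x4000) {
        ++level;
    }
    return level;
}

// Average height over `samples` draws from a generator seeded with `seed`.
// For a large max_level the expected value is 1 / (1 - 1/4) = 4/3.
inline double AverageLevel(unsigned seed, int samples, int max_level)
{
    assert(samples > 0);
    std::mt19937 rng(seed);
    long long total = 0;
    for (int i = 0; i < samples; ++i) {
        total += RandomLevel(rng, max_level);
    }
    return static_cast<double>(total) / samples;
}
```

<p>Because the expected height is a small constant, the expected number of forward pointers grows linearly with the number of nodes, which is where the O(n) average space comes from.</p>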
<h1 id="code-practice">Code Practice</h1>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <iostream>
#include <vector>
#include <climits>
#include <cstdlib>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="c1">// simple skiplist implementation
// 1) do not consider duplicate data
// 2) no backward pointer in the node
</span>
<span class="k">class</span> <span class="nc">SkipList</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">struct</span> <span class="n">skiplistNode</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">data</span><span class="p">;</span> <span class="c1">// node's data
</span> <span class="n">vector</span><span class="o"><</span><span class="n">skiplistNode</span><span class="o">*></span> <span class="n">level</span><span class="p">;</span> <span class="c1">// level array
</span>
<span class="n">skiplistNode</span><span class="p">(</span><span class="kt">int</span> <span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">l</span><span class="p">)</span> <span class="o">:</span> <span class="n">data</span><span class="p">(</span><span class="n">d</span><span class="p">),</span> <span class="n">level</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="n">m_maxlevel</span><span class="p">;</span> <span class="c1">// max level of skip list
</span> <span class="kt">int</span> <span class="n">m_curlevel</span><span class="p">;</span> <span class="c1">// current level of skip list
</span> <span class="kt">int</span> <span class="n">m_len</span><span class="p">;</span> <span class="c1">// number of nodes
</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">m_prob</span><span class="p">;</span> <span class="c1">// probability
</span>
<span class="n">skiplistNode</span> <span class="n">head</span><span class="p">;</span> <span class="c1">// dummy head
</span>
<span class="c1">// get random level for a node
</span> <span class="kt">int</span> <span class="nf">RandomLevel</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">level</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// m_prob probability to reach to upper level
</span> <span class="k">while</span> <span class="p">((</span><span class="n">random</span><span class="p">()</span> <span class="o">&</span> <span class="mh">0xFFFF</span><span class="p">)</span> <span class="o"><</span> <span class="p">(</span><span class="n">m_prob</span> <span class="o">*</span> <span class="mh">0xFFFF</span><span class="p">))</span>
<span class="n">level</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">return</span> <span class="p">(</span><span class="n">level</span> <span class="o"><</span> <span class="n">m_maxlevel</span><span class="p">)</span> <span class="o">?</span> <span class="n">level</span> <span class="o">:</span> <span class="n">m_maxlevel</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">SkipList</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="n">ml</span> <span class="o">=</span> <span class="mi">32</span><span class="p">,</span> <span class="k">const</span> <span class="kt">double</span> <span class="n">p</span> <span class="o">=</span> <span class="mf">0.25</span><span class="p">)</span> <span class="o">:</span>
<span class="n">m_maxlevel</span><span class="p">(</span><span class="n">ml</span><span class="p">),</span> <span class="n">m_prob</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">head</span><span class="p">(</span><span class="n">INT_MIN</span><span class="p">,</span> <span class="n">m_maxlevel</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">m_curlevel</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// only one level when initialization
</span> <span class="n">m_len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">Insert</span><span class="p">(</span><span class="kt">int</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Considering a single linked list, when inserting
</span> <span class="c1">// an element, we need to get the previous pointer
</span>
<span class="c1">// As for skiplist, each level is a list, so we need
</span> <span class="c1">// to keep a previous pointer for each level
</span>
<span class="n">vector</span><span class="o"><</span><span class="n">skiplistNode</span><span class="o">*></span> <span class="n">prev</span><span class="p">(</span><span class="n">m_maxlevel</span><span class="p">);</span>
<span class="c1">// traverse each level to get prev array
</span> <span class="n">skiplistNode</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="o">&</span><span class="n">head</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m_curlevel</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// minus i : move towards bottom
</span> <span class="c1">// forward x : move towards right
</span> <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&&</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">data</span> <span class="o"><</span> <span class="n">data</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span> <span class="c1">// update current level's previous pointer
</span> <span class="p">}</span>
<span class="c1">// duplicate data
</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&&</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-></span><span class="n">data</span> <span class="o">==</span> <span class="n">data</span><span class="p">)</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">level</span> <span class="o">=</span> <span class="n">RandomLevel</span><span class="p">();</span>
<span class="c1">// increase skiplist's level
</span> <span class="k">if</span> <span class="p">(</span><span class="n">level</span> <span class="o">></span> <span class="n">m_curlevel</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m_curlevel</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="o">&</span><span class="n">head</span><span class="p">;</span> <span class="c1">// update prev array
</span> <span class="p">}</span>
<span class="n">m_curlevel</span> <span class="o">=</span> <span class="n">level</span><span class="p">;</span> <span class="c1">// update level
</span> <span class="p">}</span>
<span class="c1">// insert new node
</span> <span class="n">x</span> <span class="o">=</span> <span class="k">new</span> <span class="n">skiplistNode</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">level</span><span class="p">);</span>
<span class="c1">// plain new throws std::bad_alloc on failure instead of returning nullptr
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">level</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// insert x after pointer prev[i]
</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">m_len</span><span class="o">++</span><span class="p">;</span>
<span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">Delete</span><span class="p">(</span><span class="kt">int</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// again, as when deleting from a singly linked list,
</span> <span class="c1">// we need the previous pointer at each level
</span> <span class="n">vector</span><span class="o"><</span><span class="n">skiplistNode</span><span class="o">*></span> <span class="n">prev</span><span class="p">(</span><span class="n">m_maxlevel</span><span class="p">);</span>
<span class="c1">// traverse each level to get prev array
</span> <span class="n">skiplistNode</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="o">&</span><span class="n">head</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m_curlevel</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&&</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">data</span> <span class="o"><</span> <span class="n">data</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span> <span class="c1">// update current level's previous pointer
</span> <span class="p">}</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="nb">nullptr</span> <span class="o">||</span> <span class="n">x</span><span class="o">-></span><span class="n">data</span> <span class="o">!=</span> <span class="n">data</span><span class="p">)</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span> <span class="c1">// not exist
</span>
<span class="c1">// delete
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">m_curlevel</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="c1">// delete node
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">delete</span> <span class="n">x</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="n">m_curlevel</span> <span class="o">></span> <span class="mi">1</span> <span class="o">&&</span> <span class="n">head</span><span class="p">.</span><span class="n">level</span><span class="p">[</span><span class="n">m_curlevel</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="nb">nullptr</span><span class="p">)</span>
<span class="n">m_curlevel</span><span class="o">--</span><span class="p">;</span> <span class="c1">// delete null list on the top
</span>
<span class="n">m_len</span><span class="o">--</span><span class="p">;</span> <span class="c1">// update length
</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">bool</span> <span class="n">Find</span><span class="p">(</span><span class="kt">int</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">skiplistNode</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="o">&</span><span class="n">head</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">m_curlevel</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&&</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">data</span> <span class="o"><</span> <span class="n">data</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&&</span> <span class="n">x</span><span class="o">-></span><span class="n">level</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-></span><span class="n">data</span> <span class="o">==</span> <span class="n">data</span><span class="p">)</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">Test</span> <span class="p">{</span>
<span class="kt">void</span> <span class="n">orderedTest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">SkipList</span> <span class="n">sl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"*************Ordered Test*************"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert Nodes:"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert: "</span> <span class="o"><<</span> <span class="n">i</span> <span class="o"><<</span> <span class="s">" Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Insert</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Nodes:"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find: "</span> <span class="o"><<</span> <span class="n">i</span> <span class="o"><<</span> <span class="s">" Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Delete Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert Node : 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Insert</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">randomTest</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">SkipList</span> <span class="n">sl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"*************Random Test*************"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert Nodes:"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span> <span class="o">%</span> <span class="mi">20</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert: "</span> <span class="o"><<</span> <span class="n">r</span> <span class="o"><<</span> <span class="s">" Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Insert</span><span class="p">(</span><span class="n">r</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Nodes:"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span> <span class="o">%</span> <span class="mi">20</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find: "</span> <span class="o"><<</span> <span class="n">r</span> <span class="o"><<</span> <span class="s">" Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="n">r</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Delete Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Insert Node : 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Insert</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Find Node: 50 "</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Result: "</span> <span class="o"><<</span> <span class="n">sl</span><span class="p">.</span><span class="n">Find</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">Test</span> <span class="n">t</span><span class="p">;</span>
<span class="n">srandom</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="nb">nullptr</span><span class="p">));</span>
<span class="n">t</span><span class="p">.</span><span class="n">orderedTest</span><span class="p">();</span>
<span class="n">t</span><span class="p">.</span><span class="n">randomTest</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h1 id="reference">Reference</h1>
<ul>
<li><a href="http://epaperpress.com/sortsearch/download/skiplist.pdf">http://epaperpress.com/sortsearch/download/skiplist.pdf</a></li>
<li><a href="http://en.wikipedia.org/wiki/Skip_list">http://en.wikipedia.org/wiki/Skip_list</a></li>
</ul>
Kindle Tips
2014-08-08T00:00:00+00:00
http://blog.wjin.org/posts/kindle-tips
<h1 id="introduction">Introduction</h1>
<p>Normally, you can send an email with an attachment from your authorised mail account to your Kindle mail address, such as mykindle@kindle.com (registered at Amazon).</p>
<p>If you set the subject of this mail to ‘convert’, Amazon will automatically convert your attachment to Kindle format (.azw). After that, the converted file is synced to your Kindle device once it is connected to Wi-Fi.</p>
<p>However, sending an email every time is tedious. Here is a simple way to automate that routine, thanks to <a href="https://ifttt.com/">ifttt</a>, a great Internet robot. You only need to create a recipe:</p>
<p><img src="/assets/img/post/ifttt_recipe.png" alt="image" /></p>
<h1 id="preparation">Preparation</h1>
<ul>
<li>Kindle mail address</li>
<li>Authorised mail</li>
<li><a href="https://dropbox.com">Dropbox</a></li>
<li><a href="https://ifttt.com/">ifttt</a> account</li>
</ul>
<h1 id="howto">Howto</h1>
<ul>
<li>Create a public directory under your Dropbox directory, such as <strong>Dropbox/Public/sendToKindle/</strong></li>
<li>Log in to your <em>ifttt</em> account and <strong>activate</strong> the dropbox and gmail channels. Then create a new recipe:
<ul>
<li>click the <strong><em>create</em></strong> hyperlink</li>
<li>click the <strong><em>this</em></strong> hyperlink for the trigger, and choose the dropbox channel. Fill in your directory
<img src="/assets/img/post/ifttt_trigger.png" alt="image" /></li>
<li>click the <strong><em>that</em></strong> hyperlink for the action, and choose the gmail channel (my authorised mail is gmail)
<ol>
<li>fill in your Kindle mail address</li>
<li>set the subject to ‘convert’
<img src="/assets/img/post/ifttt_action.png" alt="image" /></li>
</ol>
</li>
</ul>
</li>
</ul>
<h1 id="enjoy">Enjoy</h1>
<ul>
<li>Drag your file to <strong>Dropbox/Public/sendToKindle/</strong></li>
<li>Wait for the ifttt recipe to be triggered; it might take a few minutes</li>
<li>Open your Kindle device and enjoy the converted file that you just dragged to dropbox</li>
</ul>