Static Oneplus
不可控制论
http://yjliu.net/blog
2015-05-20T16:00:00Z
Oneplus
New neural parser added to LTP
http://yjliu.net/blog/2015/05/21/new-ltp-release.html
2015-05-20T16:00:00Z
2017-12-13T15:33:28+08:00
<p>[<a href="https://github.com/HIT-SCIR/ltp/releases/tag/v3.3.0">Github</a>]
[<a href="http://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569#path=%252Fltp-models%252F3.3.0">Pre-trained Model</a>]</p>
<p>We recently added a new dependency parser to LTP.
This parser is a transition-based neural network parser, mainly based on Chen and Manning (2014).
Beyond the parser originally described in their work, ours incorporates additional features, including global features (Zhang and Nivre, 2011), word clusters, and a dynamic oracle (Goldberg et al., 2014).
Dr. Jiang Guo developed the prototype for this parser in his <em>ACL2015</em> work.</p>
<p>As a quick summary, our new parser’s main features include</p>
<ul>
<li>Fast linear parsing time: over 8,000 tokens/second.</li>
<li>High parsing accuracy: 85.24 UAS on CTB5.</li>
</ul>
<p>and the key techniques include</p>
<ul>
<li>Linear-time transition-based parsing.</li>
<li>Neural network classifier with a cubic activation function.</li>
<li>Support for cluster features and global features.</li>
<li>Support for learning from a dynamic oracle.</li>
</ul>
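<p>As a side note on the cubic activation: Chen and Manning (2014) replace the usual tanh hidden layer with a cube function, which combines input features multiplicatively. A minimal numpy sketch (the layer sizes here are illustrative, not those of the released model):</p>

```python
import numpy as np

def cube_hidden_layer(x, W, b):
    # Cube activation from Chen and Manning (2014): h = (W x + b)^3.
    # Cubing the pre-activation mixes the concatenated embedding features
    # multiplicatively, which suits feature-combination-heavy parsing.
    return np.power(W.dot(x) + b, 3)

# Illustrative sizes: 10 concatenated embedding features -> 5 hidden units.
rng = np.random.RandomState(0)
W = rng.randn(5, 10) * 0.01
b = np.zeros(5)
x = rng.randn(10)
h = cube_hidden_layer(x, W, b)
```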
<h3 id="word-cluster">Word Cluster</h3>
<p>Word clusters are among the most reliable features across syntactic tasks, including POS tagging, NER, and parsing.
Thus, the performance improvement we gain from word clusters is as expected.
In our system, the word cluster is added as another ‘POS tag’.</p>
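<p>As a hypothetical illustration (the words, bitstrings, and prefix length below are made up, not LTP’s actual data), a Brown-cluster feature can be fed to the model exactly like an extra tag:</p>

```python
# Hypothetical Brown-cluster bitstrings; real ones come from running
# Brown clustering over a large corpus.
brown_clusters = {
    "bank": "110100",
    "river": "110101",
    "run": "011100",
}

def cluster_feature(word, prefix_len=4):
    # A fixed-length prefix of the bitstring acts as a coarse,
    # POS-tag-like category; unseen words get a dedicated NULL tag.
    bits = brown_clusters.get(word)
    return bits[:prefix_len] if bits is not None else "NULL"

features = [cluster_feature(w) for w in ["bank", "river", "jump"]]
```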
<h3 id="global-feature">Global Feature</h3>
<p>Zhang and Nivre (2012) study the interaction between search and learning. Their findings can be summarized as</p>
<ul>
<li>local learning + greedy search: [good]</li>
<li>local learning + beam search: [bad]</li>
<li>global learning + greedy search: [bad]</li>
<li>global learning + beam search: [good]</li>
</ul>
<p>They also say that</p>
<ul>
<li>local learning + global features: [bad]</li>
<li>global learning + global features: [good]</li>
</ul>
<p>We also found that global features contribute nothing when added to the parser on their own.
However, when coupled with word clusters, global features do further help.
Thus, this feature is included in the released version.</p>
<h3 id="dynamic-oracle">Dynamic Oracle</h3>
<p>The dynamic oracle is another reliable technique!
On NLTK’s issue tracker, Dr. Honnibal (the author of spaCy) noted that</p>
<blockquote>
<p>Second, when you train the parser, you should really use the Goldberg and Nivre (2012) “dynamic oracle” strategy.</p>
</blockquote>
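<p>To make the idea concrete, here is a toy sketch (not the parser’s real transition system) of the error-exploration training loop a dynamic oracle enables: the learner follows its own predictions, and the oracle supplies the best action from whatever state it lands in:</p>

```python
class ToyState(object):
    # A toy "parser state": a position in a sequence of gold actions.
    # The real parser's states are stack/buffer configurations.
    def __init__(self, gold_actions, step=0):
        self.gold = gold_actions
        self.step = step

    def is_final(self):
        return self.step >= len(self.gold)

    def apply(self, action):
        return ToyState(self.gold, self.step + 1)

def oracle_best(state):
    # A dynamic oracle answers "what is the best action *from here*?",
    # even when the current state is already off the gold path.
    return state.gold[state.step]

def train_with_dynamic_oracle(states, predict):
    # Error exploration in the spirit of Goldberg and Nivre (2012):
    # follow the model's own predictions and record an update whenever
    # the prediction disagrees with the oracle-best action.
    updates = []
    for state in states:
        while not state.is_final():
            gold = oracle_best(state)
            pred = predict(state)
            if pred != gold:
                updates.append((state.step, gold, pred))
            state = state.apply(pred)  # keep going even after a mistake
    return updates

# A deliberately bad model that always predicts SHIFT.
states = [ToyState(["SHIFT", "LEFT-ARC", "SHIFT"])]
updates = train_with_dynamic_oracle(states, lambda s: "SHIFT")
```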
<p>We also found that the dynamic oracle works great on Chinese datasets.
Detailed results are presented in the experiment section.</p>
<h3 id="experiment">Experiment</h3>
<p>We report our parser’s performance on Chinese Treebank 5 (CTB5) and the Chinese Dependency Treebank (CDT).
Parsing accuracy is evaluated with UAS/LAS.
We also compare our results with our old LTP parser and other state-of-the-art parsers.</p>
<h4 id="ctb5-experiment">CTB5 experiment</h4>
<p>The CTB5 data is split according to Zhang and Clark (2008), with 16,091 sentences for training, 1,910 for testing, and 803 for development.
Word embeddings trained with word2vec on Gigaword Xinhua news are used for initialization.
Brown clustering results with 1,000 clusters on the same data are also used in our experiment.</p>
<table>
<thead>
<tr>
<th>Parser</th>
<th>Dev.UAS</th>
<th>Dev.LAS</th>
<th>Test.UAS</th>
<th>Test.LAS</th>
<th>Test Speed (tokens/sec.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ZPar (b=64)</td>
<td>85.15</td>
<td>85.45</td>
<td>85.82</td>
<td>84.54</td>
<td>about 700</td>
</tr>
<tr>
<td>LTP (o2sib)</td>
<td>84.46</td>
<td>82.96</td>
<td>84.05</td>
<td>82.62</td>
<td>40.99(?)</td>
</tr>
<tr>
<td>NN (e=50,h=200)</td>
<td>82.87</td>
<td>81.13</td>
<td>83.22</td>
<td>81.68</td>
<td>16737.41</td>
</tr>
<tr>
<td>NN-nondet(e=50,h=200)</td>
<td>83.48</td>
<td>81.75</td>
<td>83.95</td>
<td>82.40</td>
<td> </td>
</tr>
<tr>
<td>NN-explore(e=50,h=200)</td>
<td>84.42</td>
<td>82.68</td>
<td>84.44</td>
<td>82.70</td>
<td> </td>
</tr>
<tr>
<td>NN-explore+glob.feat(e=50,h=200)</td>
<td>84.48</td>
<td>82.79</td>
<td>84.74</td>
<td>83.02</td>
<td>14627.79</td>
</tr>
<tr>
<td>NN-explore+cluster(e=50,h=200)</td>
<td>85.08</td>
<td>83.26</td>
<td>84.98</td>
<td>83.28</td>
<td>9734.88</td>
</tr>
<tr>
<td>NN-explore+glob.feat+cluster(e=50,h=200)</td>
<td>85.16</td>
<td>83.54</td>
<td>85.24</td>
<td>83.61</td>
<td>9325.72</td>
</tr>
</tbody>
</table>
<h4 id="cdt-experiment">CDT experiment</h4>
<p>The suggested data split is used in the CDT experiment.
The embedding and cluster settings are almost identical to the CTB5 experiment, except that the data are segmented with a PKU-standard segmentor.</p>
<table>
<thead>
<tr>
<th>Parser</th>
<th>Dev.UAS</th>
<th>Dev.LAS</th>
<th>Test.UAS</th>
<th>Test.LAS</th>
<th>Test Speed (tokens/sec.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ZPar (b=64)</td>
<td>86.43</td>
<td>83.68</td>
<td>85.48</td>
<td>82.41</td>
<td> </td>
</tr>
<tr>
<td>LTP (o2sib)</td>
<td>84.95</td>
<td>82.09</td>
<td>83.99</td>
<td>81.31</td>
<td>263.89</td>
</tr>
<tr>
<td>NN (e=50,h=400)</td>
<td>84.33</td>
<td>81.57</td>
<td>83.04</td>
<td>80.01</td>
<td>11478.48</td>
</tr>
<tr>
<td>NN-explore(e=50,h=400)</td>
<td>85.02</td>
<td>82.41</td>
<td>83.74</td>
<td>80.81</td>
<td> </td>
</tr>
<tr>
<td>NN-explore+glob.feat(e=50,h=400)</td>
<td>84.81</td>
<td>82.23</td>
<td>83.56</td>
<td>80.65</td>
<td> </td>
</tr>
<tr>
<td>NN-explore+cluster(e=50,h=400)</td>
<td>85.82</td>
<td>83.25</td>
<td>83.49</td>
<td>80.60</td>
<td> </td>
</tr>
<tr>
<td>NN-explore+glob.feat+cluster(e=50,h=400)</td>
<td>85.80</td>
<td>83.19</td>
<td>84.11</td>
<td>81.17</td>
<td>8448.45</td>
</tr>
</tbody>
</table>
<p>Generally speaking, the dynamic oracle improves parsing accuracy on both datasets.
Moreover, Brown clusters are also a great feature for boosting the score.</p>
<h3 id="in-a-way-to-kill-old-people">“In a way to kill old people”</h3>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/1.0.2/Chart.min.js"></script>
<p>We can see the parsing accuracy and speed benchmarks of the new parser against the old LTP parser in the following charts.</p>
<canvas id="speed" width="200" height="300"></canvas>
<canvas id="accuracy" width="200" height="300"></canvas>
<script>
var spd_dat = {
labels: ["old parser", "new parser",],
datasets: [
{
label: "parsing speed",
fillColor: "rgba(220,220,220,0.5)",
strokeColor: "rgba(220,220,220,0.8)",
highlightFill: "rgba(220,220,220,0.75)",
highlightStroke: "rgba(220,220,220,1)",
data: [263.89, 8448.45]
}]
};
var acc_dat = {
labels: ["old parser", "new parser",],
datasets: [
{
label: "parsing accuracy",
fillColor: "rgba(220,220,220,0.5)",
strokeColor: "rgba(220,220,220,0.8)",
highlightFill: "rgba(220,220,220,0.75)",
highlightStroke: "rgba(220,220,220,1)",
data: [83.99, 84.11]
}]
};
var spd_ctx= document.getElementById("speed").getContext("2d");
new Chart(spd_ctx).Bar(spd_dat, {barShowStroke: false});
var acc_ctx= document.getElementById("accuracy").getContext("2d");
new Chart(acc_ctx).Bar(acc_dat, {barShowStroke: false});
</script>
<p>“In a way to kill old people,” as the phrase goes.</p>
<h3 id="reference">Reference</h3>
<ul>
<li>Danqi Chen and Christopher Manning, 2014, A Fast and Accurate Dependency Parser using Neural Networks, In Proc. <em>EMNLP2014</em></li>
<li>Yue Zhang and Joakim Nivre, 2011, Transition-based Dependency Parsing with Rich Non-local Features, In Proc. <em>ACL2011</em></li>
<li>Yoav Goldberg, Francesco Sartorio and Giorgio Satta, 2014, A Tabular Method for Dynamic Oracles in Transition-Based Parsing, In <em>TACL2014</em></li>
<li>Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang and Ting Liu, 2015, Cross-lingual Dependency Parsing Based on Distributed Representations, (to appear) In Proc. <em>ACL2015</em></li>
<li>Yue Zhang and Joakim Nivre, 2012, Analyzing the Effect of Global Learning and Beam-Search on Transition-Based Dependency Parsing, In Proc. <em>COLING2012</em></li>
<li>Yue Zhang and Stephen Clark, 2008, A Tale of Two Parsers: Investigating and Combining Graph-based and Transition-based Dependency Parsing, In Proc. <em>EMNLP2008</em></li>
</ul>
<style>
table{ border-collapse: collapse; border-spacing: 0; border:2px solid #ff0000; width: 100%; }
th{ border:2px solid #000000; }
td{ border:1px solid #000000; }
</style>
Parallel and HPC with Python (or numpy)
http://yjliu.net/blog/2014/04/25/parallel-in-python.html
2014-04-24T16:00:00Z
2017-12-13T15:33:28+08:00
<p>For people working on natural language processing problems, processing
tons of data is a daily task. To handle millions of lines of sentences, I would have preferred C/C++ or
Java in the past, especially in scenarios like running machine learning algorithms
over the data. However, these days I wrote a very slow Python program (built around
<code>numpy</code>, an important clue for the story ahead). After wasting too much time on this single-threaded
program, I decided to parallelize it.</p>
<h3 id="buzz-in-the-task">Buzz in the task</h3>
<p>Let me briefly introduce my task (this is usually important for choosing an appropriate
parallel model). I have a collection of data containing about 200 thousand entries.
My algorithm is a nested loop and can be illustrated with the following pseudocode.</p>
<div class="highlight"><pre class="highlight python"><code><span class="k">while</span> <span class="ow">not</span> <span class="n">terminal</span><span class="o">-</span><span class="n">condition</span><span class="p">:</span>
<span class="n">init</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
<span class="k">for</span> <span class="n">instance</span> <span class="ow">in</span> <span class="n">instances</span><span class="p">:</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">time</span><span class="o">-</span><span class="n">consuming</span><span class="o">-</span><span class="n">process</span><span class="p">(</span><span class="n">instance</span><span class="p">)</span>
<span class="n">do</span><span class="o">-</span><span class="n">something</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
</code></pre></div>
<p>Since <code>time-consuming-process</code> is very time consuming, we can easily use a <code>producer</code>
to distribute these tasks onto several <code>consumers</code>. What a textbook parallel model! To
make it clearer, and for convenience of future discussion, let me put it into some
meaningless but runnable code.</p>
<div class="highlight"><pre class="highlight python"><code><span class="c">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="n">nr_instances</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="n">nr_dim</span> <span class="o">=</span> <span class="mi">257241</span>
<span class="k">def</span> <span class="nf">do_something</span><span class="p">(</span><span class="n">vector</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">consumer</span><span class="p">():</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span> <span class="c"># use to simulate the time consuming</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span> <span class="c"># numpy array operation.</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="n">ret</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">nr_dim</span><span class="p">,</span> <span class="mi">20</span><span class="p">)]</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
<span class="k">return</span> <span class="n">ret</span>
<span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="n">instances</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">nr_instances</span><span class="p">)</span>
<span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">instance</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">instances</span><span class="p">):</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">consumer</span><span class="p">()</span>
<span class="n">do_something</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">"__main__"</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">producer</span><span class="p">()</span>
<span class="k">print</span> <span class="s">"iter </span><span class="si">%</span><span class="s">d is done."</span> <span class="o">%</span> <span class="n">i</span>
</code></pre></div>
<p>A simple <code>time</code> command shows that the above code runs in <code>1m29.155s</code> on my server.</p>
<h3 id="threading">Threading</h3>
<p>As I mentioned before, I decided to parallelize the above code. The first thing that came to
my mind was <strong>threading</strong>. In my past experience, multi-threaded programming
is always the first choice when you have a server with several cores.</p>
<p>Distributing the producer’s work to several threads can be done painlessly with the Python <code>threading</code>
module. The producer’s job is to divide the instances into several groups, feed them to each
thread, and wait for all the threads to finish their work. A wrapper around the consumer is
implemented to receive data and invoke the actual consumer process.</p>
<p>After a slight modification, the above program became the multi-threaded version.</p>
<div class="highlight"><pre class="highlight python"><code><span class="c">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">threading</span>
<span class="kn">from</span> <span class="nn">basic</span> <span class="kn">import</span> <span class="n">consumer</span><span class="p">,</span> <span class="n">do_something</span><span class="p">,</span> <span class="n">nr_instances</span><span class="p">,</span> <span class="n">nr_dim</span>
<span class="n">nr_threads</span> <span class="o">=</span> <span class="mi">4</span>
<span class="k">def</span> <span class="nf">consumer_wrapper</span><span class="p">(</span><span class="n">instances</span><span class="p">,</span> <span class="n">results</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span>
<span class="n">global_vector</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="k">for</span> <span class="n">instance</span> <span class="ow">in</span> <span class="n">instances</span><span class="p">:</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">consumer</span><span class="p">()</span>
<span class="n">results</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">global_vector</span>
<span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="n">instances</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">nr_instances</span><span class="p">)</span>
<span class="n">threads</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">nr_threads</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">nr_threads</span>
<span class="n">fence</span> <span class="o">=</span> <span class="n">nr_instances</span> <span class="o">/</span> <span class="n">nr_threads</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">nr_threads</span><span class="p">):</span>
<span class="n">chunk</span> <span class="o">=</span> <span class="n">instances</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="n">fence</span><span class="p">:</span> <span class="p">(</span><span class="n">L</span> <span class="k">if</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">==</span><span class="n">nr_instances</span> <span class="k">else</span> <span class="n">i</span><span class="o">*</span><span class="n">fence</span><span class="o">+</span><span class="n">fence</span><span class="p">)]</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">threading</span><span class="o">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">consumer_wrapper</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">chunk</span><span class="p">,</span> <span class="n">results</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">nr_threads</span><span class="p">):</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">join</span><span class="p">()</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
<span class="n">do_something</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">"__main__"</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">producer</span><span class="p">()</span>
<span class="k">print</span> <span class="s">"iter </span><span class="si">%</span><span class="s">d is done."</span> <span class="o">%</span> <span class="n">i</span>
</code></pre></div>
<p>I was expecting the multi-threaded version to bring a 2x to 3x speedup when
configured with 4 threads. However, this code ran in <code>1m33.678s</code> on the same server.
I could hardly believe that a multi-threaded program runs slower than the single-threaded
one.</p>
<p>After a survey of this issue, I found the answer. The program suffers from the Python GIL, which
prevents the interpreter from running bytecode on two cores at once. There are lots of articles about the
GIL problem, so I won’t write more on it. The conclusion is that <em>multi-threading in
Python doesn’t work for my task</em>.</p>
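<p>The effect is easy to reproduce. A minimal sketch (the timings are machine-dependent, so I only claim the qualitative outcome): a CPU-bound pure-Python loop gains nothing from extra threads, because only one thread can execute bytecode at a time.</p>

```python
import threading
import time

def cpu_bound(n, out, idx):
    # Pure-Python arithmetic holds the GIL for essentially the whole loop.
    total = 0
    for i in range(n):
        total += i * i
    out[idx] = total

N = 200000

start = time.time()
serial_out = [0]
cpu_bound(N, serial_out, 0)
serial_time = time.time() - start

start = time.time()
threaded_out = [0, 0]
threads = [threading.Thread(target=cpu_bound, args=(N, threaded_out, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.time() - start
# Two threads do twice the work, and on CPython they typically take about
# twice as long: the threads serialize on the GIL instead of using two cores.
```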
<h3 id="multiprocessing">Multiprocessing</h3>
<p>The failure of the multi-threaded program drove me to seek an alternative, and I found
the <code>multiprocessing</code> module. At the beginning of its documentation page, it says,</p>
<blockquote>
<p><em>effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads</em></p>
</blockquote>
<p>To my understanding, the <code>multiprocessing</code> module treats each worker
as a process. When creating a worker, it actually copies (forks) the entire process into
a new one.</p>
<div class="highlight"><pre class="highlight python"><code><span class="c">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span>
<span class="kn">from</span> <span class="nn">basic</span> <span class="kn">import</span> <span class="n">consumer</span><span class="p">,</span> <span class="n">do_something</span><span class="p">,</span> <span class="n">nr_dim</span><span class="p">,</span> <span class="n">nr_instances</span>
<span class="n">nr_threads</span> <span class="o">=</span> <span class="mi">4</span>
<span class="k">def</span> <span class="nf">consumer_wrapper</span><span class="p">(</span><span class="n">instances</span><span class="p">):</span>
<span class="n">global_vector</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="k">for</span> <span class="n">instance</span> <span class="ow">in</span> <span class="n">instances</span><span class="p">:</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">consumer</span><span class="p">()</span>
<span class="k">return</span> <span class="n">global_vector</span>
<span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
<span class="n">instances</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">nr_instances</span><span class="p">)</span>
<span class="n">pool</span> <span class="o">=</span> <span class="n">Pool</span><span class="p">(</span><span class="n">processes</span> <span class="o">=</span> <span class="n">nr_threads</span><span class="p">)</span>
<span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">instances</span><span class="p">)</span>
<span class="n">fence</span> <span class="o">=</span> <span class="n">nr_instances</span> <span class="o">/</span> <span class="n">nr_threads</span>
<span class="n">arguments</span> <span class="o">=</span> <span class="p">[(</span><span class="n">instances</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="n">fence</span><span class="p">:(</span><span class="n">L</span> <span class="k">if</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">==</span><span class="n">nr_threads</span> <span class="k">else</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">fence</span><span class="p">)])</span> \
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">nr_threads</span><span class="p">)]</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">pool</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">consumer_wrapper</span><span class="p">,</span> <span class="n">arguments</span><span class="p">))</span>
<span class="n">pool</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">pool</span><span class="o">.</span><span class="n">join</span><span class="p">()</span>
<span class="n">do_something</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">"__main__"</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">producer</span><span class="p">()</span>
<span class="k">print</span> <span class="s">"iter </span><span class="si">%</span><span class="s">d fin"</span> <span class="o">%</span> <span class="n">i</span>
</code></pre></div>
<p>It takes <code>0m33.970s</code> for the multiprocessing version to run. The multiprocessing module
brings about a 2.5x speedup.</p>
<p>However, one disappointing aspect of multiprocessing is that it copies the entire program,
resulting in large memory consumption. This makes it very hard to scale when the <em>single-process
version</em> consumes a lot of memory. In my experiments, my script consumed 8 GB of memory.
When I applied it to 8 processes, the program exploded to 64 GB (or more), almost reaching the
limit of the server.</p>
<h3 id="mpi">MPI</h3>
<p>The first time I met <code>MPI</code> was when reading the source code of a machine-translation toolkit.
The MPI calls were embedded in a mess of C++ code, which made them very difficult to understand.</p>
<p>Now it came to me again, because its documentation page claims that</p>
<blockquote>
<p>MPI is not an IEEE or ISO standard, but has in fact, become the “industry standard”
for writing message passing programs on HPC platforms.</p>
</blockquote>
<p>My supervisor also endorses it. It seems to be a widely used library for parallel programming,
and its Python binding also makes it less painful (or even painless) to use.</p>
<p>In MPI, the producer-consumer model can be implemented very clearly by letting the zero-ranked
(master) process distribute the instances (tasks) and keep receiving data from the consumers.
The running status of each consumer can be easily obtained by checking the <code>tag</code> field of the received
data.</p>
<p>Revisiting my problem, the MPI version is shown below.</p>
<div class="highlight"><pre class="highlight python"><code><span class="c">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">mpi4py</span> <span class="kn">import</span> <span class="n">MPI</span>
<span class="kn">from</span> <span class="nn">basic</span> <span class="kn">import</span> <span class="n">consumer</span><span class="p">,</span> <span class="n">do_something</span><span class="p">,</span> <span class="n">nr_dim</span><span class="p">,</span> <span class="n">nr_instances</span>
<span class="n">READY</span><span class="p">,</span> <span class="n">START</span><span class="p">,</span> <span class="n">DONE</span><span class="p">,</span> <span class="n">EXIT</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span>
<span class="n">comm</span> <span class="o">=</span> <span class="n">MPI</span><span class="o">.</span><span class="n">COMM_WORLD</span>
<span class="n">size</span> <span class="o">=</span> <span class="n">comm</span><span class="o">.</span><span class="n">size</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">comm</span><span class="o">.</span><span class="n">rank</span>
<span class="n">status</span> <span class="o">=</span> <span class="n">MPI</span><span class="o">.</span><span class="n">Status</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">consumer_daemon</span><span class="p">():</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">MPI</span><span class="o">.</span><span class="n">Get_processor_name</span><span class="p">()</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">comm</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">READY</span><span class="p">)</span>
<span class="n">task</span> <span class="o">=</span> <span class="n">comm</span><span class="o">.</span><span class="n">recv</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">MPI</span><span class="o">.</span><span class="n">ANY_TAG</span><span class="p">,</span> <span class="n">status</span><span class="o">=</span><span class="n">status</span><span class="p">)</span>
<span class="n">tag</span> <span class="o">=</span> <span class="n">status</span><span class="o">.</span><span class="n">Get_tag</span><span class="p">()</span>
<span class="k">if</span> <span class="n">tag</span> <span class="o">==</span> <span class="n">START</span><span class="p">:</span>
<span class="n">instances</span> <span class="o">=</span> <span class="n">task</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="k">for</span> <span class="n">instance</span> <span class="ow">in</span> <span class="n">instances</span><span class="p">:</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">consumer</span><span class="p">()</span>
<span class="n">comm</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">global_vector</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">DONE</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">tag</span> <span class="o">==</span> <span class="n">EXIT</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">comm</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">EXIT</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
<span class="n">instances</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">nr_instances</span><span class="p">)</span>
<span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">instances</span><span class="p">)</span>
<span class="n">fence</span> <span class="o">=</span> <span class="n">L</span><span class="o">/</span> <span class="p">(</span><span class="n">size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">arguments</span> <span class="o">=</span> <span class="p">[(</span><span class="n">instances</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="n">fence</span><span class="p">:(</span><span class="n">L</span> <span class="k">if</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">==</span><span class="n">size</span><span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">fence</span><span class="p">)])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">comm</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">arguments</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">dest</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">START</span><span class="p">)</span>
<span class="n">finished</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">global_vector</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nr_dim</span><span class="p">)</span>
<span class="k">while</span> <span class="n">finished</span> <span class="o"><</span> <span class="n">size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">comm</span><span class="o">.</span><span class="n">recv</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="n">MPI</span><span class="o">.</span><span class="n">ANY_SOURCE</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">MPI</span><span class="o">.</span><span class="n">ANY_TAG</span><span class="p">,</span> <span class="n">status</span><span class="o">=</span><span class="n">status</span><span class="p">)</span>
<span class="n">tag</span> <span class="o">=</span> <span class="n">status</span><span class="o">.</span><span class="n">Get_tag</span><span class="p">()</span>
<span class="k">if</span> <span class="n">tag</span> <span class="o">==</span> <span class="n">DONE</span><span class="p">:</span>
<span class="n">global_vector</span> <span class="o">+=</span> <span class="n">data</span>
<span class="n">finished</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">do_something</span><span class="p">(</span><span class="n">global_vector</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">"__main__"</span><span class="p">:</span>
<span class="k">if</span> <span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">producer</span><span class="p">()</span>
<span class="k">print</span> <span class="s">"iter </span><span class="si">%</span><span class="s">d is done."</span> <span class="o">%</span> <span class="n">i</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">comm</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">tag</span><span class="o">=</span><span class="n">EXIT</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">consumer_daemon</span><span class="p">()</span>
</code></pre></div><p>Under the same settings, the above program runs in <code>0m31.332s</code>, and its memory
footprint is better than that of the multiprocessing version. One reason for the speedup,
I think, is that the consumers are not shut down between iterations.</p>
<h3 id="summary">Summary</h3>
<p>When it comes to parallelism in Python, my advice is to avoid <code>threading</code>,
especially for code that contends for the GIL. If the task only needs a little memory,
I recommend <code>multiprocessing</code>, because it is easier to use and easier to switch
to from a threading-oriented program. If you decide to build a <em>real</em> parallel
program (i.e., one that runs on a multi-core server or even across several servers),
mpi4py is no doubt the better choice.</p>
<h4 id="reference">Reference</h4>
<ul>
<li>There are also some unlucky guys who found Python threading is slower: <a href="http://stackoverflow.com/questions/3121109/python-threading-unexpectedly-slower">Python threading unexpectedly slower</a></li>
<li>Introduction to MPI : <a href="https://computing.llnl.gov/tutorials/mpi/">Message Passing Interface (MPI)</a></li>
<li>And the Python embedding : <a href="http://mpi4py.scipy.org/">MPI4Py - SciPy</a></li>
<li>And some fancy examples! : <a href="https://github.com/jbornschein/mpi4py-examples">jbornschein/mpi4py-examples</a></li>
</ul>
My First Year as a Graduate Student
http://yjliu.net/blog/2013/08/05/summary-on-first-year-as-graduate-student.html
2013-08-04T17:07:11Z
2017-12-13T15:33:28+08:00
Article Author
<p>Let me lay out a timeline for this past year. I have wanted to write this post for a long time, even though I suspected it would come out full of negativity.</p>
<h3 id="section">2012</h3>
<h4 id="section-1">August - September</h4>
<p>I wrote a paper titled "A Comparative Analysis of Sequence-Labeling Models for Chinese Word Segmentation and POS Tagging" and submitted it to a student conference. The idea was roughly to argue that classification methods can match sequence-labeling models on sequential problems like word segmentation and POS tagging, while also being faster at prediction time. The experiments did not come out as hoped, so the paper's thesis ended up rather odd, and both writing and revising it were painful. It was later re-submitted to the Journal of Chinese Information Processing, and I had to go through the painful revision once more.</p>
<h4 id="section-2">September - October</h4>
<p>I took part in the Weibo word-segmentation shared task. The system we finally submitted was a CRF model mixed with a big pile of preprocessed and semi-supervised features. Apart from re-implementing techniques others had already validated, it introduced almost nothing new. The organizers never published an official ranking, so I do not know exactly where our system placed, but a rough tally at the workshop put it around second. About half a year later, while reviewing the system, I discovered that I had not enabled the negative-feature switch when training with the crfsuite toolkit, so the true result should have been higher than what we submitted. The organizers never released the data either, so there was no way to rerun the experiment.</p>
<h4 id="section-3">October - November</h4>
<p>I carved out what little time I had to study machine learning and implemented a few simple algorithms. I reorganized my blog, surveyed part of the domain-adaptation literature, revised my earlier weak paper, and handled assorted coursework and lab assignments.</p>
<h4 id="section-4">November - December</h4>
<p>I wrote the report for the Weibo segmentation task and read up on Gibbs sampling. I prepared the talk for the conference in Shanghai, attended it, and then revised the paper for the Journal of Chinese Information Processing.</p>
<h4 id="section-5">December - January</h4>
<p>The first half of the month was exams and a pile of course projects; in the second half I went to Tianjin for the Weibo segmentation workshop, busy the whole time with the poster and other chores. In the latter half I also took over the lab's network-admin duties. In between, I hastily ran an experiment on using character clustering to help word segmentation; that rather poor idea was later submitted to CCL 2013.</p>
<h3 id="section-6">2013</h3>
<h4 id="section-7">January - February</h4>
<p>The first half was spent repairing hard drives. Right after I became network admin, the lab's disk array died: two disks in the RAID5 had apparently failed without anyone noticing, and when a third failed the whole array was gone. Hauling disks around for data recovery in the dead of winter was thoroughly miserable. Afterwards I crammed for the TOEFL for ten days and bombed my first attempt.</p>
<h4 id="section-8">February - March</h4>
<p>I decided to take the GRE, spent more than a month over the Spring Festival memorizing vocabulary, and booked the GRE for May and the TOEFL for June.</p>
<h4 id="section-9">March - May</h4>
<p>I began investigating deep learning for word segmentation and read through rnnlm more or less completely. The initial idea was to segment with a language model, but the results inspired little confidence; I then switched to using embeddings in a semi-supervised setup, with no more confidence. My feeling at the time was that the core problem was that the embeddings do not act linearly on the segmentation model. I also considered building a C&amp;W-style segmentation network directly, but some students from CAS appeared to have implemented that idea at CCL, and their results did not look convincing either. I wonder whether DL is simply ill-suited to NLP's battle against the state of the art.</p>
<p>I kept preparing for the English tests on and off.</p>
<p>The other thing those two months brought was new equipment for the server room: setting up machines, rewiring the electrics, fixing the air conditioning. I rarely got to sit down and read a book or a paper, and my experiments kept going wrong.</p>
<h4 id="section-10">May - June</h4>
<p>Facing a pile of exams, I finally buckled and cancelled the May GRE. The exams still managed to hand me two scores of 60.</p>
<h4 id="section-11">June - August</h4>
<p>In a state of utter gloom, I spent a week tidying up the earlier work on clustering for segmentation. The goal was to show how character representations should be built to help the segmentation task; the conclusion was that representing characters in isolation is unreliable, and character clustering has to take each character's context into account. After submitting that modest paper I refactored Zhenghua's dependency parser, dparser. I had estimated about a month of work and under 10k lines of code, but once underway I realized the other LTP modules could be refactored along the way. The result: I rewrote the unified sequence-labeling framework from my undergraduate thesis, ending up with 17k lines of code for the whole project, plus 9k characters of documentation, Python and Ruby clients, and part of a web front end. In short, two months as a complete code monkey.</p>
<p>At the end of June I took the TOEFL a second time, again after ten days of preparation, and again flopped, scoring a good-for-nothing 93. That pretty much buried my hopes of going abroad.
One more thing: the lab's website was hacked and shut down by the network center. I had to rebuild the homepage with a static site generator, migrate and upgrade the server, and make several trips to the network center. Some services are still down. Frustrating.</p>
<p>This year has mostly passed in a busy, depressed state. I barely had weekends and had no time to exercise. My weight fell back below 110 jin (55 kg), and yet, sadly, my belly seems to be trending fatter. The weekend before last my parents planned to visit Harbin; luckily they could not get train tickets. The building lost power that day, and when it came back the server room heated up to 50°C. Had they come, I would have had to leave them in a hotel all day.</p>
<p>Sometimes I no longer know how I want to live. There is no time to think about it anyway; I am simply pushed along by invisible forces, not of my own accord.</p>
<p>Around the time the seniors graduated, I was woken several times at three in the morning by drunks singing. I would go cool off in a chair on the dorm terrace, sometimes watching the sky turn from black to gray and then pale blue, and it felt strangely unfamiliar.</p>
Luxation
http://yjliu.net/blog/2013/07/21/luxation.html
2013-07-21T04:28:17Z
2017-12-13T15:33:28+08:00
Article Author
<p>Last night, lying in bed before sleep, I yawned and accidentally dislocated my jaw.</p>
<p>This is far from the first time. The first dislocation came, I think, from opening my mouth too wide on an apple; since then it has mostly been yawns. I blame the condition on my genes, since I recall my mother has dislocated her jaw too. Perhaps the bone structure is wrong, or the ligaments too weak, but either way it must be genetic. Having accepted this premise, I am fairly at peace with it and have never blamed the apples or the yawns.</p>
<p>From repeated dislocations I have learned some basic self-treatment. Usually I push my jaw slightly to the left until I feel the bones shift, perhaps with an audible click. With luck, the joint pops back in. Without luck, I try a few more times; one attempt always works.</p>
<p>I have some confidence in both my body and my technique, so I lay in bed, left hand bracing my head, right hand on my jaw, and pushed to the left. No click. I knew I had failed, rested a few seconds, changed position, pushed left again, and failed again, of course. "Maybe lying down is the problem," I thought, and sat up. The cool night wind of a Harbin summer drifted in slowly through the window; with sweat on my forehead, I keenly caught that unexpected summer breeze.</p>
<p>Sitting up, I repeated the same self-rescue strategy several times, each ending in failure. Fortunately I did not panic; if these college years taught me one thing, it is to calm down before panicking. In the calm moment that followed, a flood of thoughts rushed into my head: "There seems to have been an explosion at the Capital Airport"; "Did the chengguan beat someone to death again recently"; "How do I raise my TOEFL speaking score, this is urgent"; "There was a fight over by Fuhua yesterday"; "I wonder whether Landy Wen ever dated Jay Chou back then"; "The goddess suggested I switch to a Stefanie Sun wallpaper, should I consider it"; "I seem to have written an awful lot of code lately, practically a code monkey"; "Ah, life is pointless"; "But I had better not tell anyone, lest I sound like a whiner".</p>
<p>After this burst of intense contemplation of the ultimate questions of the universe, I realized: "It seems I cannot fix this dislocation myself."</p>
<p>Just then another guy in the room started snoring, and against the stars twinkling in the ink-blue sky it sounded remarkably rhythmic.</p>
<p>After roughly one more failed attempt, I gave up on treatment and carefully got dressed. I would go to the hospital. Nothing serious, no need to disturb anyone; I could handle it alone.</p>
<p>"What an independent person I am," I thought, glancing at the dorm mirror to make sure my dislocated face would not look too hideous to bystanders.</p>
<p>Then, everything in order, I dragged this exaggerated, stunned-at-the-world expression out into the lonely night.</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2013/07/暴走漫画-www.iPc_.me33.png"><img src="http://blog.oneplus.info/wp-content/uploads/2013/07/暴走漫画-www.iPc_.me33.png" alt="" title="superize" width="118" height="145" class="aligncenter size-full wp-image-820" /></a></p>
Implementing a Slightly Faster Hashmap
http://yjliu.net/blog/2013/06/18/implementation-of-a-faster-hashmap.html
2013-06-18T13:54:02Z
2017-12-13T15:33:28+08:00
Article Author
<p>I have been writing a parser lately and once again ran into the feature-mapping problem. For last year's undergraduate thesis on word segmentation and POS tagging, this part was implemented with <code>__gnu_cxx::hash_map<string,int></code>. The table below shows the feature-dictionary sizes under several datasets.</p>
<table width="100%" border="1">
<tr><td>Dataset</td><td>Ctb5</td><td>Ctb7</td><td>People’s Daily</td></tr>
<tr><td># sentences</td><td>18k</td><td>47k</td><td>184k</td></tr>
<tr><td># segmentation features</td><td>2.031M</td><td>3.348M</td><td>7.749M</td></tr>
<tr><td># POS-tagging features</td><td>1.587M</td><td>2.742M</td><td>7.513M</td></tr>
</table>
<p>At this scale, the feature dictionary's speed during feature retrieval already affects the performance of the whole analyzer. Still, word segmentation and POS tagging are sequence models: each element of the sequence needs only one round of extraction, so making the feature dictionary a bit faster or slower has no dramatic effect on overall speed.</p>
<p>In a parser, however, feature retrieval is a different story.
As a form of structured learning, a parser extracts features for every pair of elements in the sequence, both while learning the model parameters and while predicting.
Suppose we have 30 feature templates and a 30-word sentence: POS tagging needs <code>30*30=900</code> feature retrievals, while dependency parsing needs <code>30*30*30=27000</code>.
So a faster feature dictionary is certainly worth having.</p>
<p>Another thing that annoys me is that C++'s <code>hash_map</code> has no good story for serialization.
Dumping a <code>hash_map</code> to disk as one block of memory is impossible;
it has to be handled one key-value pair at a time.
Ideally, then, we would have a <code>hash map</code> that</p>
<ul>
<li>is a dynamic dictionary;</li>
<li>uses string or const char * as the key;</li>
<li>performs comparably to <code>__gnu_cxx::hash_map</code>, or better;</li>
<li>can be serialized conveniently;</li>
<li>need not support deletion.</li>
</ul>
<h3 id="section">Data Persistence</h3>
<p>Thinking it through, the best way to persist the data is to put everything into one contiguous block of memory: dumping writes that block straight to disk, and loading reads it straight back.
A memory pool suddenly becomes an excellent choice.
STL's string is essentially impossible to persist this way, so forget it.
<code>const char *</code>, on the other hand, works nicely: we can pack all the keys into one <code>char *</code> buffer.</p>
<p>For the values there is one caveat: if the value type is trivially serializable, say int or double, a pool can store the values as well; but user-defined types such as classes have no direct serialization to begin with, so this little shop cannot serve them.</p>
<p>A pool is a fine idea, but the structure's dynamism matters even more than its raw performance.
Making the pool dynamic essentially borrows the technique of STL's allocator:
keep a capacity and a current size; when a new element would push the size past the capacity, double the capacity, allocate a new block, and copy the old contents over.</p>
<p>With that, persistence can be solved along this line. A simple version of the growth step looks like this:</p>
<div class="highlight"><pre class="highlight cpp"><code><span class="k">if</span> <span class="p">(</span><span class="n">pool_cap</span> <span class="o"><=</span> <span class="p">(</span><span class="n">new_size</span><span class="o">=</span><span class="p">(</span><span class="n">pool_size</span><span class="o">+</span><span class="n">element_size</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">pool_cap</span><span class="o">=</span><span class="p">((</span><span class="n">new_size</span><span class="p">)</span><span class="o"><<</span><span class="mi">1</span><span class="p">);</span>
<span class="n">Element_type</span> <span class="o">*</span> <span class="n">new_pool</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Element_type</span><span class="p">[</span><span class="n">pool_cap</span><span class="p">];</span>
<span class="n">std</span><span class="o">::</span><span class="n">copy</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">pool</span> <span class="o">+</span> <span class="n">pool_size</span><span class="p">,</span> <span class="n">new_pool</span><span class="p">);</span>
<span class="k">delete</span> <span class="p">[](</span><span class="n">pool</span><span class="p">);</span>
<span class="n">pool</span> <span class="o">=</span> <span class="n">new_pool</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<h3 id="section-1">Performance Optimization</h3>
<p>The hash function largely determines a hashmap's performance, but it is not something we control here.
About all we can provide is a sensible hashtable size, so that collisions stay moderate.
What size is sensible? Too large is wasteful, too small is collision-heavy, so the hashtable itself has to be dynamic.
Even a dynamic hashtable, though, must decide when to resize. Stack Overflow has a question on exactly this, "<a href="http://stackoverflow.com/questions/1603712/when-should-i-do-rehashing-of-entire-hash-table/1604428#1604428">When should I do rehashing of entire hash table?</a>", whose top answer gives an empirical rule; translated and summarized:</p>
<blockquote>First be clear about the load factor: the ratio of the number of elements N to the number of buckets M, load factor = N/M. The right threshold then depends on the kind of hashtable you use (for both load factors and hashtable types, see section 5.7.1 of Hou Jie's "The Annotated STL Sources").
<ul><li>Linear probing: resize at a load factor of roughly 60%.</li>
<li>Quadratic probing: resize at 80%-85%.</li>
<li>Separate chaining: resize when the load factor exceeds 150%.</li>
</ul>
</blockquote>
<p>Next comes the question of how large to resize to.
Doubling is one choice, but not necessarily a good one: a quadratic-probing hashtable, for instance, wants a prime table size, and doubling destroys primality.
A better choice is a prime somewhat above double; this <a href="http://planetmath.org/goodhashtableprimes">page</a> gives a table of good hashtable-size primes.
I later found the same prime table inside the STL.</p>
<p>Concretely, after every insert the hashmap checks whether the resize condition is met.
If it is, a new hashtable is allocated, every element of the old table is enumerated, and each is inserted into the new one.</p>
<p>Good; at this point we have a fair picture of the hashtable implementation.
But the result may be no more than a clone of hash_map. How do we push performance further?
In an earlier discussion, Lu Zilong suggested inserting high-frequency keys into the hashtable preferentially, so that, probabilistically, a hot key is more likely to be found on the first probe.
This looks like a promising optimization, and gnu.trove apparently already implements such a mechanism.</p>
<p>Separate chaining makes this mechanism easier to implement than the other schemes.
To keep high-frequency keys near the front, it suffices to keep each chain ordered as elements are inserted.
Since no chain holds many elements, the idea behind exchange sort is enough:
when a key is inserted again its frequency increases, so we check whether the preceding key in the chain now has a lower frequency and, if so, swap the key forward.
Because each chain is short and was ordered before the increment, the key's correct position is found in negligible time.</p>
<p>That settles the design of the whole data structure. The hash node becomes</p>
<div class="highlight"><pre class="highlight cpp"><code><span class="k">struct</span> <span class="n">hash_node_t</span> <span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">__key_off</span><span class="p">;</span> <span class="c1">// offset of the key in key_pool
</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">__val_off</span><span class="p">;</span> <span class="c1">// offset of the value in value_pool
</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">__freq</span><span class="p">;</span> <span class="c1">// access frequency
</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">__hash_val</span><span class="p">;</span> <span class="c1">// cached hash value, for speed
</span> <span class="kt">int</span> <span class="n">__next_off</span><span class="p">;</span> <span class="c1">// offset of the successor node in node_pool
</span><span class="p">};</span>
</code></pre></div>
<p>Because the pools' addresses change as they grow, a hash node cannot store raw pointers; storing offsets instead avoids the problem.</p>
<p>There are three pools in total: one for keys, one for values, and one for hash nodes, plus an array of pointers (offsets, really) holding the head node of each bucket. The layout is shown in the figure below.</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2013/06/structure.jpg"><img src="http://blog.oneplus.info/wp-content/uploads/2013/06/structure.jpg" alt="Data structure of the hash table" title="structure" width="500" class="aligncenter size-full wp-image-805" /></a>
My implementation is <a href="https://github.com/Oneplus/libutilities/blob/master/src/smartmap/smartmap.hpp">here</a>.</p>
<h3 id="section-2">Benchmark</h3>
<p>To show how this hashmap performs, I ran a simple benchmark that fully simulates the real scenario: first build the feature dictionary, then retrieve features from it. The methodology follows <a href="http://incise.org/hash-table-benchmarks.html">this post</a>. Roughly, the dictionary is built from a key set containing duplicates, and then values are retrieved using a much larger key set. Three datasets were used: the segmentation feature sets of CTB5 and CTB6, and the first-order dependency feature set of CDT. The key set is the keys used to build the feature space; the retrieve set is the full feature set. The statistics are listed in the table below.</p>
<table width="100%" border="1">
<tr><td>data set</td><td># of keys</td><td># of unique keys</td><td># retrieve entries</td></tr>
<tr><td>CTB5</td><td>12.89M</td><td>2.2M</td><td>77.3M</td></tr>
<tr><td>CTB6</td><td>16.91M</td><td>2.7M</td><td>101.5M</td></tr>
<tr><td>CDT</td><td>55.6M</td><td>5.2M</td><td>198.9M</td></tr>
</table>
<p>The hashmaps compared are __gnu_cxx::hash_map, google_sparse_hash, and google_dense_hash, each wrapped in a similar interface. The benchmark code is <a href="https://github.com/Oneplus/libutilities/tree/master/benchmark/smartmap">here</a>. It was run with</p>
<div class="highlight"><pre class="highlight plaintext"><code>nice -n-20 ionice -c1 -n0 python bench.py
</code></pre></div>
<p>to guarantee the process's CPU and I/O priority.</p>
<p>The experiments ran on a Xeon 5650 2.67GHz server, with the rather old GCC 4.1.2.</p>
<p>Time and memory efficiency on the test datasets are shown below.</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2013/06/speed.png"><img src="http://blog.oneplus.info/wp-content/uploads/2013/06/speed.png" alt="" title="speed benchmark" width="500" class="aligncenter size-full wp-image-810" /></a></p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2013/06/memory.png"><img src="http://blog.oneplus.info/wp-content/uploads/2013/06/memory.png" alt="" title="memory benchmark" width="500" class="aligncenter size-full wp-image-811" /></a></p>
<p>Overall it met my expectations. Next I plan to merge this module into LTP, hoping to optimize it further!</p>
<h3 id="section-3">References</h3>
<ul>
<li><a href="http://book.douban.com/subject/1110934/">The Annotated STL Sources</a></li>
<li><a href="http://stackoverflow.com/questions/1603712/when-should-i-do-rehashing-of-entire-hash-table/1604428#1604428">When should I do rehashing of entire hash table?</a></li>
<li><a href="http://incise.org/hash-table-benchmarks.html">Hash Table Benchmarks </a></li>
</ul>
<p>Come to think of it, it has been a long time since I last blogged.</p>
Notes on Restructuring My Blog
http://yjliu.net/blog/2012/10/13/restruct-my-blog.html
2012-10-13T13:58:33Z
2017-12-13T15:33:28+08:00
Article Author
<p>I bought the <a href="http://oneplus.info">oneplus.info</a> domain and its hosting space in early 2011, so by now it has been almost two years. In those two years the blog accumulated 38 posts and 24k page views, and two of them, "<a href="http://blog.oneplus.info/archives/535">A Survey of the Gender Ratio at HIT</a>" and "<a href="http://blog.oneplus.info/archives/455">An Idea for a Song-Dedication Social Network</a>", were recommended on the front page of Douban Jiudian. On the whole, I have put real care into the content here.</p>
<p>Although the blog has been running without incident, I have wanted to reorganize the site structure for a long time. When I set it up I had no experience (I still don't), so I installed WordPress directly in the web root public_html, and the oneplus.info domain pointed straight at the blog. Now I think there should be a homepage there, with the old blog moved under the blog.oneplus.info subdomain. With that in mind, I finished both tasks in my spare time this week.</p>
<h3 id="wordpress">Migrating WordPress</h3>
<p>Before the migration, my <code>public_html</code> looked like this:</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2012/10/before.jpg"><img src="http://blog.oneplus.info/wp-content/uploads/2012/10/before.jpg" alt="" title="before" width="399" height="282" class="aligncenter size-full wp-image-764" /></a></p>
<p>I wanted it to become this,</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2012/10/after.jpg"><img src="http://blog.oneplus.info/wp-content/uploads/2012/10/after.jpg" alt="" title="after" width="398" height="152" class="aligncenter size-full wp-image-765" /></a></p>
<p>and to be reachable via blog.oneplus.info.</p>
<p>To achieve this, the first step is to resolve blog.oneplus.info to the host's IP.
That only requires adding an A record for the domain at the DNS provider, so that HTTP requests whose HOST_NAME is blog.oneplus.info are sent to my host.
My DNS provider is Godaddy; the record added in the Domain Manager panel looks like this.</p>
<p><a href="http://blog.oneplus.info/wp-content/uploads/2012/10/godaddy.jpg"><img src="http://blog.oneplus.info/wp-content/uploads/2012/10/godaddy.jpg" alt="" title="godaddy" width="550" class="aligncenter size-full wp-image-766" /></a></p>
<p>It took effect about an hour after being added.</p>
<p>With the DNS work done, the next step is to make the host handle the request.
A bit of Googling shows the problem can be solved in roughly three ways:</p>
<ul>
<li>configure the Apache server and add a Virtual Host;</li>
<li>use the mod_rewrite module to redirect blog requests to blog.oneplus.info/blog/;</li>
<li>add a subdomain in cPanel.</li>
</ul>
<p>Of these, the first is out for a cPanel-managed host, since users cannot touch httpd.
The second is a bit fiddly; for details, google ".htaccess", "redirect", and "subdomain".
The third is the simplest: just add a subdomain named blog.oneplus.info under cPanel's Subdomains, completely foolproof.</p>
<p>Once the host can handle requests for blog.oneplus.info, the next step is moving WordPress itself.
Since my WordPress version is 3.3.1 and the move stays within the same site, this is very simple:
in <code>Settings -&gt; General</code>, set both the <code>WordPress Address</code> and the <code>Site Address</code> to blog.oneplus.info.
After saving, the site breaks temporarily, but once the WordPress files are moved into the blog folder the change takes effect.</p>
<p>After these steps, blog.oneplus.info worked normally. One remaining problem was that some image links in posts still pointed at oneplus.info/wp-content/; the scientific fix is to export the database, replace every oneplus.info with blog.oneplus.info, and import it back.</p>
<p>With all of that done, the migration was basically complete. None of it is difficult, but I overlooked the very first step and wasted an evening for nothing.</p>
<h3 id="feed">Redirecting the Feed</h3>
<p>After the migration, my blog is affected in the following ways:</p>
<ul>
<li>blog subscriptions;</li>
<li>search-engine ranking;</li>
<li>wumii's Like button.</li>
</ul>
<p>For someone who writes a blog, the second item matters, but there is little to be done about it (besides, I rather dislike SEO, even though information retrieval is my field).
So serving my own feed subscribers is the real priority.</p>
<p>The situation is that, since the feed's output address changed, subscriptions made through the old oneplus.info/feed stopped working.
Opening Google Reader and checking the subscription to my own blog, the Statistics tab showed a Parsing Error.
Fortunately the main domain is still in my hands, so it suffices to redirect requests for oneplus.info/feed to blog.oneplus.info/feed.</p>
<p>This uses the somewhat fiddly <code>mod_rewrite</code> mentioned earlier: add a rewrite condition and rewrite rule to the .htaccess in the web root. My current rewrite rule for the feed is</p>
<div class="highlight"><pre class="highlight plaintext"><code>RewriteCond %{HTTP_HOST} ^oneplus.info$
RewriteCond %{REQUEST_URI} ^/feed$ [NC]
RewriteRule .* http://blog.oneplus.info/feed [NC,L,R=301]
</code></pre></div><p>It means: redirect every request whose hostname is oneplus.info and whose URI is /feed to blog.oneplus.info/feed.</p>
<p>After adding the rule, you can visit oneplus.info/feed to check that it takes effect.
Some other sites offer mod_rewrite testing, for instance <a href="http://martinmelin.se/rewrite-rule-tester/">here</a>; if the redirect fails, you can paste your .htaccess into that site and try a few sample requests.</p>
<p>As for the third item, I did indeed lose the hundred-plus "likes" on the gender-ratio post, but I don't care.</p>
<h3 id="section">Homepage</h3>
<p>With the migration finished, I realized oneplus.info should get a homepage. I eventually decided to put a personal profile on it (kept around for bragging).
This time I wanted to try a CSS framework (there is never a chance to in the lab), so I tried bootstrap, blueprint, and foundation. I settled on bootstrap, because I found a patch for IE6 online.
As for why IE6 compatibility matters at all, this Douban <a href="http://www.douban.com/note/241422302/">note</a> records the reason.</p>
<p>Later I decided the homepage should also list recent posts.
Querying the database directly would be a fine choice, but I also want to run a mirror at ir.hit.edu.cn/~yjliu/.
So I wrote a small PHP page, query, on the host to query the database.
That way, both the homepage and ir.hit.edu.cn/~yjliu/ can fetch the latest posts through this one page.
However, my host is overseas and slow to reach from China, and the page changes rarely, so a cache is a must.</p>
<p>At first I spent a long time pondering how to cache inside query.php, before realizing that the right end to cache is the homepage. So I added the following code to the homepage:</p>
<div class="highlight"><pre class="highlight plaintext"><code>$cache = new Cache(3600, &quot;some_path&quot;);
$key = &quot;last_post&quot;;
$values = $cache-&gt;get( $key );
if ($values == false) {
    $page = '';
    $handler = fopen('some_url', 'r');
    while (!feof($handler)) {
        $page .= fread( $handler, 1048576 );
    }
    fclose( $handler );
    $values = $page;
    $cache-&gt;put( $key, $values );
    echo $values;
} else {
    echo $values;
}
</code></pre></div>
<p>For each homepage request, check whether the cache has expired; if it hasn't, return the cached result directly, which saves a good deal of network transfer. The Cache class is adapted from <a href="http://www.mangguo.org/the-simple-php-cache-class/">this post</a>.</p>
<h3 id="section-1">Summary</h3>
<p>With that, the blog restructuring comes to a close. My homepage is now reachable at <a href="http://www.oneplus.info">www.oneplus.info</a>, and the blog at <a href="http://blog.oneplus.info">blog.oneplus.info</a>.
There are still a few things I would like to do, but I have other work and cannot spend too much time on this, so this is it for now.</p>
<p>PS: this post also serves to test whether the feed output works properly.</p>