Wednesday, November 11, 2009

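Untitled

I saw my housemate Kwoon lim cooking dinner for Joshua, who was under the weather, and complimented him: now that's a true buddy. He replied: "It's nothing, I was just 顺手牵羊." (He meant "it was no extra trouble"; the idiom, literally "leading off a sheep in passing", actually means walking off with someone else's things.)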

Tuesday, October 20, 2009

A Day Trip to the Presidential Palace

Keywords: golf course, botanical garden, rare treasures, didn't see Brother Dan (丹哥), hot

Thursday, September 17, 2009

An Illustrated Look at How CUDA Hides Memory Access Latency by Scheduling Multiple Threads

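I have been tinkering with CUDA lately, and the few small algorithms I implemented run 40~50x faster than on a single-threaded CPU. Unfortunately, not many people around here work on this, so there is basically no face-to-face discussion. I hope to use it in some real projects later and tap its potential. Hehe.

I should be writing up some notes soon to record what I learn. I have actually meant to do this all along, I was just lazy... Starting today, haha.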

An Illustrated Look at How CUDA Hides Memory Access Latency by Scheduling Multiple Threads

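Although the bandwidth between the GPU and device RAM is high (about 150 GB/s on the GTX285, for example), memory access latency is still large (400~600 cycles). A CPU handles memory latency well with caches. A GPU, however, usually processes large data sets whose locality may not be high, so even a very large data cache would not necessarily help; the GPU therefore simply drops the data cache and instead exposes user-controlled shared memory for exploiting data locality. Most importantly, although one GPU processor (SM, Streaming Multiprocessor) can execute the threads of only one warp (32 threads) at a time, it can hold a large number of active threads (active threads <= 1024). When a warp is blocked on a memory access, the SM can immediately switch to another ready warp and run it until that one blocks on memory too. As long as one full round (the time to run through all the other warps and come back to the first blocked warp) takes at least as long as the 600-cycle latency, that latency is hidden.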

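The figure below illustrates a case where memory access latency cannot be completely hidden. Suppose the SM holds only 4 warps, each of which reads data from memory, computes on it, and repeats. We could equally view each warp as terminating after one computation, with a new warp spawned to take its place. The figure makes it easy to see how the latency shown by the dashed segments is hidden, and equally easy to see that, because each warp's computation is not long enough, not all of the latency can be hidden.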


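Applied to the current G200 architecture: each SM holds at most 1024 active threads, i.e. 32 warps, and each warp takes 4 cycles to execute one basic instruction. So hiding 600 cycles requires each warp to execute, on average, 600/4/32 = 4.69 basic instructions. This is why a compute to memory ratio of at least 5 is recommended. And that assumes the ideal case of 1024 active threads; if the SM holds fewer threads, the ratio has to be higher.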
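The arithmetic above can be put into a small helper to see how the required ratio grows when fewer threads are resident. This is only a sketch of the back-of-the-envelope estimate in this post; the function names are made up for illustration and the model ignores everything except latency, issue cycles, and warp count.

```c
#include <assert.h>

/* Number of warps covering a given number of active threads
 * (32 threads per warp, rounded up). */
unsigned warps_from_threads(unsigned active_threads)
{
    return (active_threads + 31) / 32;
}

/* Average number of basic instructions each warp must issue between
 * memory accesses so that one round over all resident warps takes at
 * least as long as the memory latency. Hypothetical helper: it simply
 * evaluates latency / cycles_per_instruction / resident_warps. */
double instructions_to_hide_latency(double latency_cycles,
                                    double cycles_per_instruction,
                                    unsigned resident_warps)
{
    assert(cycles_per_instruction > 0.0 && resident_warps > 0);
    return latency_cycles / cycles_per_instruction / resident_warps;
}
```

With the numbers used here, instructions_to_hide_latency(600, 4, warps_from_threads(1024)) evaluates to 4.6875. With only 512 active threads (16 warps) the same call gives 9.375, so the kernel would need roughly twice as much computation per memory access.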

Wednesday, July 01, 2009

RIP

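Lights off, the melody of "Gone Too Soon" starts in my headphones. The breaths woven into the singing, the sound of breathing, the faint catches in the throat, everything sounds so clear and so close, as if he were still singing beside my ear, as if he had never left.

Our era has no Beatles and no Elvis, but we were lucky to have MJ. He was really always an innocent child. It is just that this world is too complicated. May you rest in peace. A silent blessing.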

Monday, June 15, 2009

C code for Confidence level computation for BER testing

The code is based on the article "Statistical Confidence Levels for Estimating Error Probability".

The first function calculates the confidence level (CL) given the number of bit errors measured (N), the total number of bits tested (n), and the expected BER (ph).


#include <math.h>

// CL = 1 - e^{-n * ph} * \sum_{k=0}^{N} (n * ph)^k / k!
// N: bit errors measured, n: bits tested, ph: expected BER
double solve_CL_from_N_n_ph(unsigned N, double n, double ph)
{
    double emnp = exp(-n * ph);
    double sum = 1;     // k = 0 term
    double product = 1; // running value of (n * ph)^k / k!
    for(unsigned k = 1; k <= N; k++)
    {
        product *= (n * ph / k);
        sum += product;
    }
    sum *= emnp;
    return(1 - sum);
}


The second function computes the minimum number of bits (n) that must be tested to reach a given confidence level (CL) for an expected BER (ph). If no more than N bit errors are observed in that many bits, the hypothesis is confirmed. For N greater than 0, the formula cannot be solved for n in closed form, so the function instead tries different values of n by binary search, using the first function (the inverse of the one in question), until n has converged to within a set interval.

// Requires <math.h> for log() and fabs().
double solve_n_from_ph_CL_N(double ph, double CL, unsigned N)
{
    if(N==0)
        return(-log(1-CL)/ph); // closed form when no errors are allowed
    else
    {
        // solve n using binary search, based on "solve_CL_from_N_n_ph"
        double upper_n = N/ph * (N > 100 ? N : 100), // enlarge at least 100 times
               lower_n = 100000, // assumes at least 100000 bits are needed
               cur_n = upper_n, next_n;
        double min_interval = 100; // minimum interval to break the search
        while(1)
        {
            double cur_CL = solve_CL_from_N_n_ph(N, cur_n, ph);
            if(cur_CL >= CL) // if calculated value is more confident, try to reduce n
            {
                upper_n = cur_n;
                next_n = lower_n + (upper_n - lower_n)/2;
            } else
            {
                lower_n = cur_n;
                next_n = upper_n - (upper_n - lower_n)/2;
            }
            if(fabs(next_n - cur_n) < min_interval)
                break;
            cur_n = next_n;
        }
        return(cur_n);
    }
}

Tuesday, January 27, 2009

My classmate Luo Jun's catchphrase: "cheese 我了" (a pun on 气死我了, "I'm so mad").

What's the best part of being a Christian?

You can do anything you want. In the end, you just confess and get forgiveness from God.

Wednesday, January 21, 2009

Ng

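Ng does not stand for "not good"; it is the surname 黄 (Huang), which a lot of Singaporeans have. Ng is just two consonants, so many people cannot pronounce it and in practice simply say "N". Spot an acquaintance named Ng far down the street, yell "Mr N", and he won't hear you even if you shout yourself hoarse. Oh wait, no, you can't possibly shout yourself hoarse on that.

There are also people surnamed Gn. That one I have no idea how to pronounce.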

Sunday, January 18, 2009

X Fair Price

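I heard that a new Fair Price has opened at Jurong Point. Because it is so big, it even gets a different name: X Fair Price. It sounds awkward to me; I keep thinking it sounds like "ex-wife". Which makes me sigh that Singapore really isn't what it used to be.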

Friday, January 09, 2009

The busiest street in Singapore is called "Squid Road" (乌贼路, a pun on 乌节路, Orchard Road).