Wednesday, November 2, 2011

MKL parallel execution benchmark (part 2)

Using the same code, but for 1024x1024 real symmetric matrices.

The parallel efficiency is better for these larger matrices, but still not very good at this size. (Still, the calculation is somewhat faster with more threads.)

Same setup as before, for a 32x32-site system (diagonalization of 1024x1024 real symmetric matrices).

As expected, the parallel efficiency improves as the matrix size grows. At this size, increasing the number of threads gives a worthwhile reduction in run time, but the efficiency is still mediocre: from the timings below, 2 threads give a speedup of 6811/4559 ≈ 1.5, and 8 threads only 6811/2897 ≈ 2.4 (about 29% parallel efficiency).


**** smp 1 ****

Run time: 6811 seconds
Command being timed: "./a.out"
User time (seconds): 6809.87
System time (seconds): 0.02
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:53:31

**** smp 2 ****

Run time: 4559 seconds
Command being timed: "./a.out"
User time (seconds): 8988.00
System time (seconds): 52.54
Percent of CPU this job got: 198%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15:59

**** smp 4 ****

Run time: 3358 seconds
Command being timed: "./a.out"
User time (seconds): 13043.14
System time (seconds): 153.09
Percent of CPU this job got: 392%
Elapsed (wall clock) time (h:mm:ss or m:ss): 55:57.95

**** smp 8 ****

Run time: 2897 seconds
Command being timed: "./a.out"
User time (seconds): 22270.76
System time (seconds): 360.07
Percent of CPU this job got: 781%
Elapsed (wall clock) time (h:mm:ss or m:ss): 48:17.03

MKL parallel execution benchmark

In the following example, most of the computation time (not carefully profiled, but likely the dominant part) is spent in LAPACK DSYEV, obtaining all eigenvalues of 400x400 real symmetric matrices.

The benchmark is FK_2.c: for a 20x20-site system, the eigenvalue problem of the tight-binding Hamiltonian is solved 20x20x2 times per MC step, for 10 MC steps.

The effect of parallelization is not so large at this size; from the timings below, 8 threads give a speedup of only 209/169 ≈ 1.2. (See also the next post.)
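
For reference, here is a minimal sketch of the DSYEV call pattern assumed above. The Fortran-style dsyev_ prototype is the symbol MKL exports for LAPACK; the matrix contents and error handling are illustrative, not the actual FK_2.c code.

#include <stdio.h>
#include <stdlib.h>

/* Fortran-style LAPACK symbol, linked via -lmkl_lapack */
extern void dsyev_(const char *jobz, const char *uplo, const int *n,
                   double *a, const int *lda, double *w,
                   double *work, const int *lwork, int *info);

#define N 400

static double a[N * N];   /* symmetric matrix, column-major */
static double w[N];       /* eigenvalues in ascending order */

int main(void)
{
    int i, j, n = N, lda = N, lwork, info;
    double wkopt, *work;

    /* fill a with some symmetric test data */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i + j * N] = 1.0 / (1.0 + i + j);

    /* workspace query (lwork = -1), then the real call;
       jobz = "N": eigenvalues only, uplo = "U": upper triangle */
    lwork = -1;
    dsyev_("N", "U", &n, a, &lda, w, &wkopt, &lwork, &info);
    lwork = (int) wkopt;
    work = malloc(sizeof(double) * lwork);
    dsyev_("N", "U", &n, a, &lda, w, work, &lwork, &info);

    if (info == 0)
        printf("lowest eigenvalue: %g\n", w[0]);
    free(work);
    return 0;
}

When linked against MKL as in the next post, this single-threaded main program still gets DSYEV parallelized according to OMP_NUM_THREADS.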


**** smp 1 ****

Run time: 209 seconds
Command being timed: "./a.out"
User time (seconds): 209.29
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:29.47

**** smp 2 ****

Run time: 185 seconds
Command being timed: "./a.out"
User time (seconds): 364.18
System time (seconds): 4.30
Percent of CPU this job got: 199%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:04.85

**** smp 4 ****

Run time: 205 seconds
Command being timed: "./a.out"
User time (seconds): 483.49
System time (seconds): 12.41
Percent of CPU this job got: 242%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:24.66

**** smp 8 ****

Run time: 169 seconds
Command being timed: "./a.out"
User time (seconds): 1314.38
System time (seconds): 29.81
Percent of CPU this job got: 794%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:49.19

Using LAPACK/MKL with the Intel compiler (fast, and MKL is parallelized)

$ icc -parallel -fast -mtune=core2 FK_2.c mt19937ar.c -lmkl_lapack -lmkl_em64t

Compiled with the command above and checked that it runs. The versions of icc and MKL are as follows:

[oshikawa@cdg ~]$ icc --version
icc (ICC) 11.0 20090318
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.

Package ID: l_mkl_enh_p_9.1.018
Package Contents: Intel(R) Math Kernel Library 9.1 for Linux*

The main program here is not parallelized; nevertheless, the MKL library itself can run in parallel.

Under Sun Grid Engine, the number of parallel threads (within an SMP node) can be specified as follows:

#$ -v OMP_NUM_THREADS= (number of threads)
#$ -pe smp (number of threads)

The following example job file specifies 4 threads (SMP, 4-way parallel):

#!/bin/bash -x
#$ -V # Inherit the submission environment
#$ -cwd # Start job in submission directory
#$ -N myFK_L32_smp4 # Job Name
#$ -j y # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID # Name of the output file (e.g. myMPI.oJobID)
#$ -v OMP_NUM_THREADS=4
#$ -pe smp 4
#$ -q all.q # Queue name
/usr/bin/time -v ./a.out # Run the executable "a.out" under time -v

Using large arrays in C

Arrays declared inside a function are allocated on the stack, so if you declare large arrays in functions you may hit a segmentation fault at run time when the stack runs out.

Workarounds (a short sketch follows the list):

1) Raise the stack size limit
   in bash: ulimit -s
   with qsub: put the following line in the job script (this does not seem to work correctly? needs checking)
   #$ -l h_stack=256M

2) Make the large arrays global variables

3) Allocate them dynamically with malloc
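
As a minimal sketch of options 2) and 3) (the 400 MB array size is illustrative), with the stack-breaking local declaration left as a comment:

#include <stdio.h>
#include <stdlib.h>

#define N (50 * 1024 * 1024)   /* 50M doubles = 400 MB */

/* option 2): a global array lives in static storage, not on the stack */
double big_global[N];

int main(void)
{
    /* double big_local[N];    <- on the stack: at this size it
       segfaults unless the limit is raised with ulimit -s (option 1) */

    /* option 3): dynamic allocation on the heap */
    double *big_heap = malloc(sizeof(double) * N);
    if (big_heap == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    big_global[N - 1] = 1.0;
    big_heap[N - 1] = 1.0;
    printf("two 400 MB arrays allocated without touching the stack\n");

    free(big_heap);
    return 0;
}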