Is rindow_clblast used for matrix operations? #3
Hi. Before getting into this topic: the manual does not describe how to program for the GPU. Composer's processing is very fast; if even that time bothers you, there are bigger things to work on.
The cross() function trades speed for convenience. The la() function calls the linear algebra functions directly.
Your source code doesn't use the GPU yet. Here is an example that can run on either the CPU or the GPU in blocking mode:
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;
$mo = new MatrixOperator;
//$mode = 'CPU-NORMAL';
//$mode = 'CPU-RAW';
$mode = 'GPU';
$size = 1000;
$epochs = 100;
//$dtype = NDArray::float32;
$dtype = NDArray::float64;
switch($mode) {
    case 'CPU-NORMAL': {
        $la = $mo->la();
        break;
    }
    case 'CPU-RAW': {
        $la = $mo->laRawMode();
        break;
    }
    case 'GPU': {
        $la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
        echo "blocking mode...\n";
        $la->blocking(true);
        break;
    }
    default: {
        throw new Exception('Invalid mode');
    }
}
//$la = $mo->la();
$accel = $la->accelerated()?'GPU':'CPU';
echo "Mode:$accel($mode)\n";
$fp64 = $la->fp64()?'TRUE':'FALSE';
echo "Supports float64 on this device: $fp64\n";
if($dtype==NDArray::float64 && $fp64=='FALSE') {
    $dtype = NDArray::float32;
}
if($accel=='CPU') {
    $name = $la->getBlas()->getCorename();
    echo "CPU core name :$name\n";
    $threads = $la->getBlas()->getNumThreads();
    echo "CPU threads:$threads\n";
} else {
    $i = 0;
    $devices = $la->getContext()->getInfo(OpenCL::CL_CONTEXT_DEVICES);
    $name = $devices->getInfo($i,OpenCL::CL_DEVICE_NAME);
    echo "GPU device name :$name\n";
    $cu = $devices->getInfo($i,OpenCL::CL_DEVICE_MAX_COMPUTE_UNITS);
    echo "GPU Compute units :$cu\n";
}
$strdtype = $mo->dtypetostring($dtype);
echo "data type: $strdtype\n";
echo "computing size: [$size,$size]\n";
echo "epochs: $epochs\n";
$a = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$b = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$a = $la->array($a);
$b = $la->array($b);
$start = microtime(true);
for($i=0;$i<$epochs;$i++) {
    $c = $la->gemm($a,$b);
}
echo "elapsed time:".(microtime(true)-$start)."\n";
Even in CPU mode, rindow-math runs multithreaded via the OpenBLAS library. In GPU mode, when non-blocking, the CPU and GPU work asynchronously, so the processing may not be finished even after the function call returns. Therefore, while the GPU is operating, the CPU can also work in parallel. Good luck!
Thank you! I ran your code and the results are a bit strange:
Shouldn't GPU mode be faster than CPU mode? And I don't understand the reported number of GPU compute units. My GPU card has 640 cores, not 5. Maybe that is why there is no speed improvement? Or is a "compute unit" not the same thing as a "core"? I also don't understand how you create the matrices ($a, $b): I can't find anything about the arange or reshape methods. Could you please show me an example of how to convert a PHP array into one accepted by the code? Best wishes,
That's a good question.

Effective use of GPU

The GPU is faster when doing large matrix operations.

== 1000x1000 ==
== 5000x5000 ==

About N-Vidia

I don't know how N-Vidia configures compute units. To illustrate the concept, a less precise explanation is as follows: compute units are the number of threads that can run independently and in parallel, while cores are the number of calculators. The cores controlled by one compute unit can only perform the same operation at the same time. For example, if 8 compute units each control 16 cores, 128 cores will run simultaneously. The GPU is built on a completely different concept than the CPU.

Linear Algebra Functions

See this page: they only support floating point; integers are not supported.
We have selected and implemented frequently used functions from among these.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;
$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
//// CPU
$la = $mo->la();
$a = $la->array([[1,2],[3,4]],dtype:$dtype);
$b = $la->array([[5,6],[7,8]],dtype:$dtype);
$c = $la->array([9,10],dtype:$dtype);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());
//// GPU
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = $mo->la()->array([[1,2],[3,4]],dtype:$dtype);
$b = $mo->la()->array([[5,6],[7,8]],dtype:$dtype);
$c = $mo->la()->array([9,10],dtype:$dtype);
$a = $la->array($a);
$b = $la->array($b);
$c = $la->array($c);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());

Some reference manuals

Thanks,
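For readers without the extensions installed, the three calls above can be worked out by hand. The sketch below is plain PHP showing what gemm, gemv and axpy produce for these particular inputs; it is not how the library implements them, just a reference calculation.

```php
<?php
// Plain-PHP versions of the three calls above, showing the expected
// values for these inputs (this does not use rindow-math-matrix).
$a = [[1, 2], [3, 4]];
$b = [[5, 6], [7, 8]];
$c = [9, 10];

// gemm: matrix-matrix product y = a x b
$y = [];
foreach ($a as $i => $row) {
    foreach (array_keys($b[0]) as $j) {
        $y[$i][$j] = 0;
        foreach ($row as $r => $v) {
            $y[$i][$j] += $v * $b[$r][$j];
        }
    }
}

// gemv: matrix-vector product z = a x c
$z = [];
foreach ($a as $i => $row) {
    $z[$i] = 0;
    foreach ($row as $r => $v) {
        $z[$i] += $v * $c[$r];
    }
}

// axpy: element-wise b = a + b
foreach ($a as $i => $row) {
    foreach ($row as $j => $v) {
        $b[$i][$j] += $v;
    }
}

var_dump($y === [[19, 22], [43, 50]]); // bool(true)
var_dump($z === [29, 67]);             // bool(true)
var_dump($b === [[6, 8], [10, 12]]);   // bool(true)
```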
Anyway, BLAS is magic! I tested matrix multiplication with a real-life example of mine, and even without any GPU support the results are amazing:
The first number in brackets is the conversion time of both input arrays from PHP, e.g.
array() syntax

Actually, I shouldn't have given the GPU function the same name, array(). When computing with the GPU, data is transferred as follows:

PHP world memory => CPU world memory => GPU world memory

The first copy uses $cpuLA->array(). When using the GPU, be sure to write it as follows.

GPU calculation results

I have a very bad feeling about GPUs: the calculation results all differ depending on the GPU hardware and drivers. The reason why PHP's internal calculations and the CPU calculation results are the same is that they are computed on the same CPU. It may be hard to believe for those unfamiliar with scientific computing.

Optimizing programs on your data

I don't think a single change to the matrix product will improve things any further. If your entire application is built on PHP's native arrays, you'll need to rewrite it all on top of a matrix math library. The program will also need to be rewritten to be optimized for the GPU.
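The point about results differing between devices comes down to floating-point precision. A plain-PHP sketch (no extension needed) shows how much is lost just by narrowing PHP's native float64 values to the float32 that most GPUs compute in:

```php
<?php
// 0.1 has no exact binary representation. Packing PHP's float64
// value down to float32 and back shows the precision that is lost
// when data is sent to a device computing in single precision.
$asFloat32 = unpack('f', pack('f', 0.1))[1];

printf("float64: %.17g\n", 0.1);          // 0.10000000000000001
printf("float32: %.17g\n", $asFloat32);   // 0.10000000149011612

// The values differ once compared at float64 precision:
var_dump($asFloat32 == 0.1);              // bool(false)
var_dump(abs($asFloat32 - 0.1) < 1e-7);   // bool(true)
```

Different GPUs, drivers and math modes can also reorder or fuse operations, which shifts these last bits further; that is why bit-identical results across hardware are not something to expect.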
Thank you again for your explanation. As rewriting the code is my long-term plan, for now I have to focus on optimizing the existing one. Probably I can manipulate my matrices so that passing them to MatrixOperator saves conversion time. First findings on the construction of a new NDArrayPhp object: there are some operations for which I measured the execution time:
So, the most costly are B & D, e.g.
Looking into NDArrayPhp:
In particular cases the sorting is probably not necessary. When the fragment above is commented out, some improvement is achieved:
As you may have noticed, the most time-consuming part is converting between PHP arrays and CPU memory. NDArrayPhp does the heavy lifting to easily populate a crudely constructed PHP array into CPU memory. However, if the data structure is known, the developer can populate the memory themselves.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;
function php2array($data,$a) {
    foreach($data as $i=>$row) {
        foreach($row as $j=>$d) {
            $a[$i][$j] = $d;
        }
    }
}
$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
$dataA = [[1,2],[3,4]];
$dataB = [[5,6],[7,8]];
$la = $mo->la();
$g_la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$g_la->blocking(true);
$a = $la->alloc([2,2],dtype:$dtype);
$b = $la->alloc([2,2],dtype:$dtype);
php2array($dataA,$a);
php2array($dataB,$b);
$g_a = $g_la->array($a);
$g_b = $g_la->array($b);
$g_y = $g_la->gemm($g_a,$g_b);
$y = $g_la->toNDArray($g_y);
foreach($y as $row) {
    foreach($row as $d) {
        echo $d."\n";
    }
}

In other words, the rindow-math-matrix library does not assume PHP arrays as input. In fact, when using rindow-math-matrix for machine learning, you don't use PHP arrays as input; it is used with the input data already copied into CPU memory. This is the design concept, because an application-specific algorithm is optimal for the task of expanding application input data into the CPU memory space.
Well, I have added some code; however, I have a problem with PHP objects: I cannot find the constructor of the OpenBlasBuffer class (called from the newBuffer method of the NDArrayPhp class). May I ask for your kind support once again? I noticed that creating the buffer by adding elements one by one is slow. I'd like to test copying the whole array at once, but I feel a bit lost with class extensions, interfaces and implements :(.
OpenBlasBuffer.php

use Rindow\OpenBLAS\Buffer as BufferImplement;
class OpenBlasBuffer extends BufferImplement implements LinearBuffer
{....}

Rindow\OpenBLAS\Buffer lives in Buffer.c and is written entirely in C. The only way to pass PHP data into the CPU memory space is via this Buffer.c. The array2Flat function in NDArrayPhp also uses this Buffer's offsetSet to put values in one by one. So slow! In PHP, $a[0] = 1.0; and $a->offsetSet(0, 1.0); have the same meaning. Buffer also implements load and dump using binary data. However, it did not work well when floating point values were converted to strings and loaded with the PHP standard function pack(), so I'm not using it for floating point inputs. It is a very big problem that converting data from the PHP world into data the CPU can handle directly is so slow. So far I haven't been able to get past this issue.
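To illustrate why per-element offsetSet writes are slow while a bulk binary load can be fast, here is a plain-PHP sketch that simulates both paths with strings. The real Buffer class is implemented in C and is not used here; this only models the cost difference between many small writes and one packed row at a time.

```php
<?php
// Two ways to build the same binary float64 buffer in pure PHP.
$rows = [[1.0, 2.0], [3.0, 4.0]];

// Slow path: one pack() call per element, mimicking per-element
// offsetSet() writes into the Buffer.
$oneByOne = '';
foreach ($rows as $row) {
    foreach ($row as $v) {
        $oneByOne .= pack('d', $v);
    }
}

// Fast path: pack a whole row at once with the splat operator.
$bulk = '';
foreach ($rows as $row) {
    $bulk .= pack('d*', ...$row);
}

// Identical bytes either way: 8 bytes per float64 value.
var_dump($oneByOne === $bulk);     // bool(true)
var_dump(strlen($bulk) === 4 * 8); // bool(true)
```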
Finally I found a way to decrease the array loading time from 0.809 s to just 0.097 s. It's based on pack() and works pretty well. As you can notice, I dropped many of the checks, assuming the data is checked/prepared at an earlier stage of the code. The main trick is to pack a whole row of the array at once (call_user_func_array costs some time as well, so I asked the PHP team to consider something like ...). I just copied the NDArrayPhp class and added my own constructor. Probably this is not the optimal way, but it works ;-).
Files which I changed/added:
EDIT: somebody proposed a simple solution with pack, which works and is even faster:
Wonderful! More generally and elegantly, you can also write, for example:

function php2array($php)
{
    $buf = '';
    foreach($php as $row) {
        $buf .= pack('d*',...$row);
    }
    return $buf;
}

$a = $la->alloc([??,??],NDArray::float64);
$a->buffer()->load(php2array($php));

No need to modify existing code.
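As a sanity check that the packed buffer has the row-major layout load() expects, the helper can be round-tripped in plain PHP with unpack. This restates php2array only for verification; the alloc()/load() calls themselves still need the extension and are omitted here.

```php
<?php
// The helper above, restated so the buffer can be inspected with
// unpack() alone (alloc()/load() require the rindow extension).
function php2array(array $php): string
{
    $buf = '';
    foreach ($php as $row) {
        $buf .= pack('d*', ...$row);
    }
    return $buf;
}

$data = [[1.5, 2.5], [3.5, 4.5]];
$buf  = php2array($data);

// unpack('d*') returns a 1-indexed flat array of doubles: exactly
// the row-major layout that the buffer load expects to receive.
$flat = array_values(unpack('d*', $buf));
var_dump($flat === [1.5, 2.5, 3.5, 4.5]); // bool(true)
```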
Hi!

function myArray(object $la, array $arr) : object
{
    $buf = '';
    foreach($arr as $row) {
        $buf .= pack('d*',...$row);
    }
    $a = $la->alloc([count($arr),count(reset($arr))],NDArray::float64);
    $a->buffer()->load($buf);
    return $a;
}

However, this doesn't work in GPU mode:

$mo = new MatrixOperator;
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = myArray($la,$A);

Still don't know why. I will investigate.
Please remember the memory spaces:

PHP world memory => CPU world memory => GPU world memory

$mo = new MatrixOperator();
$la = $mo->la();
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$gla->blocking(true);
$a = myArray($la,$A);
$ga = $gla->array($a);
### something ####
$y = $gla->toNDArray($gy);
Hi. I'm testing different ideas for the GPU, still without significant improvement over the CPU. I'm wondering about some things. Can I ask for your advice again, please?
function regMultiply(array $A, array $B) : array
{
    global $_USE_BLAS,$_USE_GPU;
    if( $_USE_BLAS ) {
        $mo = new MatrixOperator;
        $la = $mo->laRawMode();
        $A = regMyArray($la,$A);
        $B = regMyArray($la,$B);
        if( $_USE_GPU ) {
            $gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
            $gla->blocking(true);
            $A = $gla->array($A);
            $B = $gla->array($B);
            $C = $gla->gemm($A,$B);
            $C = $gla->toNDArray($C);
        } else {
            $C = $la->gemm($A,$B);
        }
        return $C->toArray();
    } else {
        $rows = count($A);
        $cols = count($B[0]);
        $m = count($A[0]);
        $C = [];
        for( $i=0; $i<$rows; $i++ ) {
            for( $j=0; $j<$cols; $j++ ) {
                $C[$i][$j] = 0;
                for( $r=0; $r<$m; $r++ ) {
                    $C[$i][$j] += $A[$i][$r]*$B[$r][$j];
                }
            }
        }
        return $C;
    }
}
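The non-BLAS fallback branch of regMultiply() can be checked in isolation against a hand-computed product. Here it is extracted into a standalone function (the name naiveMultiply is mine, just for the test):

```php
<?php
// Naive triple-loop fallback from regMultiply(), extracted so the
// result can be verified against a hand-computed product.
function naiveMultiply(array $A, array $B): array
{
    $rows = count($A);
    $cols = count($B[0]);
    $m    = count($A[0]);
    $C    = [];
    for ($i = 0; $i < $rows; $i++) {
        for ($j = 0; $j < $cols; $j++) {
            $C[$i][$j] = 0;
            for ($r = 0; $r < $m; $r++) {
                $C[$i][$j] += $A[$i][$r] * $B[$r][$j];
            }
        }
    }
    return $C;
}

// [[1,2],[3,4]] x [[5,6],[7,8]] = [[19,22],[43,50]]
var_dump(naiveMultiply([[1,2],[3,4]], [[5,6],[7,8]]) === [[19,22],[43,50]]);
```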
Hi ;-)
I have discovered* that ..., and the next saving is replacing .... So, in the myArray function I can save approx. 70% of execution time. *EDIT: it depends on the memory available. The code is faster if there is enough memory available for the script (for me, 8 GB was not enough); otherwise the second version (with .=) works much, much faster. I just added this info because I think it's worth knowing, though maybe it's not useful for you.
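The exact snippets are missing from the note above, but the memory trade-off described can be illustrated with two hypothetical buffer-building strategies. Both yield identical bytes; which is faster depends on data size and available memory, which matches the EDIT above.

```php
<?php
$rows = [[1.0, 2.0], [3.0, 4.0]];

// Strategy 1: grow a single string with .= on every row
// (lower peak memory, but repeated reallocation on huge data).
$concat = '';
foreach ($rows as $row) {
    $concat .= pack('d*', ...$row);
}

// Strategy 2: collect the packed rows first, then join once with
// implode() (fewer reallocations, but all pieces held in memory).
$pieces = [];
foreach ($rows as $row) {
    $pieces[] = pack('d*', ...$row);
}
$joined = implode('', $pieces);

// Both strategies produce byte-identical buffers.
var_dump($concat === $joined); // bool(true)
```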
Hi @yuichiis ,
I have to say the speed improvement is impressive! I got a 10x acceleration, which means calculations that used to take 7 days are now done in just a few hours. I've found that this can be even more productive when using threads from the PHP parallel extension (to be honest, it's a bit surprising to me: does this mean the GPU units still have some free time when using the rindow extension?). And it was enough that only one function (matrix multiplication = cross product) was migrated. Thank you for your excellent work with parallelism in PHP!
Since time is crucial for my purposes, I must drop any unnecessary code. That's why I avoid using composer/autoloader (checking the PHP version against 5.6 and a cascade of includes and function calls each time is time-wasting). So I found a minimal configuration that works:
And in the php.ini I added:
As you noticed, there is nothing about
rindow_clblast
. Does this mean that this extension is not for matrix manipulation? If I add it, will I get an extra speed-up in any way?