Changeset 406

Feb 12, 2013 5:01:12 AM (8 years ago)

Merge branch 'cuda' of 'gitclone'

Signed-off-by: Kshitij Kulshreshtha <kshitij@…>

The following commits were merged:

commit b62001306013a3cff8685cd41355db4bbcdbdfe1
Author: Kshitij Kulshreshtha <kshitij@…>
Date: Tue Feb 12 10:46:48 2013 +0100

Add examples to build system, copyrights and multi-include protection

Signed-off-by: Kshitij Kulshreshtha <kshitij@…>

commit ce0f1c68301992d307136ac114af10b0a05e8915
Author: Alina Koniaeva <alinak@…>
Date: Thu Jul 12 16:19:38 2012 +0200

sqrt and log function changed in the header file adoublecuda.h

Signed-off-by: Alina Koniaeva <alinak@…>

commit eed8665516916084e3b7dd0a783938f814d26391
Author: Alina Koniaeva <alinak@…>
Date: Thu Jul 12 16:18:34 2012 +0200

Examples for the use of ADOL-C with Cuda

Signed-off-by: Alina Koniaeva <alinak@…>

commit c1dff45b95768c536366cd688281488721279c3d
Author: Alina Koniaeva <alinak@…>
Date: Thu Jul 12 16:13:34 2012 +0200

Documentation for the use of ADOL-C with Cuda

Signed-off-by: Alina Koniaeva <alinak@…>

commit 7d3ffd136d5785162beeb7e7f1063e50d236f001
Author: Alina Koniaeva <alinak@…>
Date: Fri Jun 1 16:52:05 2012 +0200

Example for the use of ADOL-C with Cuda

Signed-off-by: Alina Koniaeva <alinak@…>

commit e59c0d04e59e01e3c498f6e3e60aa1371ed0fbc2
Author: Alina Koniaeva <alinak@…>
Date: Fri Jun 1 16:51:20 2012 +0200

Headerfile adoublecuda.h included in makefile

Signed-off-by: Alina Koniaeva <alinak@…>

commit 3a81a3ae62be40673cd6d915e5aa56e57996e918
Author: Alina Koniaeva <alinak@…>
Date: Fri Jun 1 16:50:37 2012 +0200

Headerfile for the use of ADOL-C with Cuda

Signed-off-by: Alina Koniaeva <alinak@…>

5 added
4 edited


  • trunk/ADOL-C/doc/adolc-manual.tex

    r379 r406  
    34313431    \end{itemize}
     3435\section{Traceless forward differentiation in ADOL-C using Cuda}
     3438One major drawback of using the traceless version of ADOL-C is the fact
     3439that several function evaluations are needed to compute derivatives
     3440at many different points. More precisely, to calculate the Jacobian of
     3441a function $F:\mathbb{R}^n\rightarrow \mathbb{R}^m$ at $M$ points, $M$
     3442function evaluations are needed for the traceless vector mode and even
     3443$M*n$ for the traceless scalar mode. Depending on the size of the function
     3444this can result in a long runtime. To achieve better performance one can
     3445use parallelisation techniques, since the same operations are
     3446performed at every point during a function evaluation. One possibility is to use GPUs
     3447since they are optimized for data parallel computation. Starting with
     3448version 2.3.0 ADOL-C now features a traceless forward mode for computing
     3449first order derivatives in scalar mode on GPUs using the
     3450general purpose parallel computing architecture Cuda.
     3452The idea is to include parallel code that executes in many GPU threads across
     3453processing elements. This can be done by using kernel functions, that is, functions
     3454which are executed on the GPU as an array of threads in parallel. In general all
     3455threads execute the same code. They are grouped into blocks, which are then grouped
     3456into grids. A kernel is executed as a grid of blocks of threads. For more
     3457details see, e.g., the NVIDIA CUDA C Programming Guide which can be downloaded from
     3458the web page {\sf}. To solve the problem of calculating
     3459the Jacobian of $F$ at $M$ points it is possible to let each thread perform a
     3460function evaluation and thus the computation of derivatives for one direction
     3461at one point. The advantage is that the function is evaluated at different points
     3462in parallel, which can result in a shorter wall-clock runtime.
     3464The following subsection describes the source code modifications required to use
     3465the traceless forward mode of ADOL-C with Cuda. 
     3467\subsection{Modifying the source code}
     3469Let us again consider the coordinate transformation from Cartesian to spherical
     3470polar coordinates given by the function $F: \mathbb{R}^3 \to \mathbb{R}^3$, $y
     3471= F(x)$, with
     3473y_1  =  \sqrt{x_1^2 + x_2^2 + x_3^2},\qquad
     3474y_2  =  \arctan\left(\sqrt{x_1^2 + x_2^2}/x_3\right),\qquad
     3475y_3  =  \arctan(x_2/x_1),
     3477as an example. We now calculate the Jacobian at $M=1024$ different points.
     3478The source code for one point is shown in \autoref{fig:tapeless}. This example
     3479has no real application but can still be used to show the combination of the traceless version
     3480of ADOL-C and Cuda.
     3482To use this mode, a Cuda toolkit suitable for the graphics card at hand
     3483has to be installed. Furthermore, it is important to check that the graphics card
     3484used supports double precision (for details see, e.g., the NVIDIA CUDA C Programming Guide).
     3485Otherwise the data type employed inside the {\sf adouble} class has to be adapted
     3486to {\sf float}. To use the traceless forward mode with Cuda, one has to include the header
     3487files \verb#adoublecuda.h# and \verb#cuda.h#. The first one contains the
     3488definition of the class {\sf adouble} and the overloaded operators
     3489for this version of ADOL-C. As in the other versions all derivative
     3490calculations are introduced by calls to overloaded operators. The
     3491second header file is needed for the use of Cuda. 
     3493One possibility to solve the problem above is the following. First of all
     3494three {\sf double} arrays are needed: {\sf x} for the independent variables,
     3495{\sf y} for the dependent variables and {\sf deriv} for the values of the
     3496Jacobian matrices. To initialise the independent variables, the points at
     3497which the function should be evaluated are stored consecutively in a
     3498single array of length $3*M$. For the computation on GPUs one also has to allocate memory
     3499on this device. Using Cuda syntax, one can allocate an array of
     3500length $3*M$ (number of independent variables times number of points) for
     3501the independent variables as follows:
     3503  \begin{tabular}{l}
     3504    {\sf double * devx;}\\
     3505    {\sf cudaMalloc((void**)\&devx, 3*M*sizeof(double));}\\
     3506  \end{tabular}
     3508The arrays for the dependent variables and the values of the Jacobian matrices
     3509are allocated in the same way. Then the values of the independent variables
     3510have to be copied to the GPU using the following command
     3512  \begin{tabular}{l}
     3513    {\sf cudaMemcpy(devx, x, sizeof(double)*3*M, cudaMemcpyHostToDevice);}\\
     3514  \end{tabular}
     3516The argument {\sf cudaMemcpyHostToDevice} indicates that the values are copied from
     3517the host to the GPU. In this case the values stored in {\sf x} are copied to {\sf devx}.
     3519Now all required information has been transferred to the GPU. The changes
     3520in the source code made so far are summarized in \autoref{fig:cudacode1}.
     3526\hspace*{-1cm} \= \kill
     3527\> {\sf \#include $<$iostream$>$}\\
     3528\> {\sf \#include $<$cuda.h$>$}\\
     3529\> {\sf \#include $<$adoublecuda.h$>$}\\
     3530\> {\sf using namespace std;}\\
     3532\> {\sf int main() \{}\\
     3533\> {\sf \rule{0.5cm}{0pt}int M=1024;}\\
     3534\> {\sf \rule{0.5cm}{0pt}double* deriv = new double[9*M];}\\
     3535\> {\sf \rule{0.5cm}{0pt}double* y = new double[3*M];}\\
     3536\> {\sf \rule{0.5cm}{0pt}double* x = new double[3*M];}\\
     3538\> {\sf \rule{0.5cm}{0pt}// Initialize $x_i$}\\
     3539\> {\sf \rule{0.5cm}{0pt}for (int k=0; k$<$M; ++k)\{}\\
     3540\> {\sf \rule{1cm}{0pt}for (int i=0; i$<$3; ++i)}\\
     3541\> {\sf \rule{1.5cm}{0pt}x[k*3+i] = i + 1.0/(k+1);\}}\\
     3543\> {\sf \rule{0.5cm}{0pt}// Allocate array for independent and dependent variables}\\
     3544\> {\sf \rule{0.5cm}{0pt}// and Jacobian matrices on GPU}\\
     3545\> {\sf \rule{0.5cm}{0pt}double * devx;}\\
     3546\> {\sf \rule{0.5cm}{0pt}cudaMalloc((void**)\&devx, 3*M*sizeof(double));}\\
     3547\> {\sf \rule{0.5cm}{0pt}double * devy;}\\
     3548\> {\sf \rule{0.5cm}{0pt}cudaMalloc((void**)\&devy, 3*M*sizeof(double));}\\
     3549\> {\sf \rule{0.5cm}{0pt}double * devderiv;}\\
     3550\> {\sf \rule{0.5cm}{0pt}cudaMalloc((void**)\&devderiv, 3*3*M*sizeof(double));}\\
     3552\> {\sf \rule{0.5cm}{0pt}// Copy values of independent variables from host to GPU}\\
     3553\> {\sf \rule{0.5cm}{0pt}cudaMemcpy(devx, x, sizeof(double)*3*M, cudaMemcpyHostToDevice);}\\
     3555\> {\sf \rule{0.5cm}{0pt}// Call function to specify amount of blocks and threads to be used}\\
     3556\> {\sf \rule{0.5cm}{0pt}kernellaunch(devx, devy, devderiv,M);}\\
     3558\> {\sf \rule{0.5cm}{0pt}// Copy values of dependent variables and Jacobian matrices from GPU to host}\\
     3559\> {\sf \rule{0.5cm}{0pt}cudaMemcpy(y, devy, sizeof(double)*3*M,cudaMemcpyDeviceToHost);}\\
     3560\> {\sf \rule{0.5cm}{0pt}cudaMemcpy(deriv, devderiv, sizeof(double)*3*3*M, cudaMemcpyDeviceToHost);}\\
     3561\> {\sf \}}
     3565\caption{Example for traceless scalar forward mode with Cuda}
     3569In the next step the user has to specify how many blocks and threads per block
     3570will be needed for the function evaluations. In the present example this is done by
     3571calling the function {\sf kernellaunch}, see \autoref{fig:cudacode2}.
     3572In this case the blocks are two dimensional: the x-dimension is determined
     3573by the number of points $M$ at which the Jacobian matrix has to be calculated
     3574while the y-dimension is given by the number of independent variables, i.e., 3.
     3575Since a block cannot contain more than 1024 threads, the
     3576x-dimension in the example is 64 instead of $M=1024$. Therefore $1024/64=16$
     3577blocks are needed. This division into blocks is reasonable, as each
     3578thread has to perform a function evaluation for one point and one direction,
     3579hence $M*3$ threads are needed, where 3 is the number of independent
     3580variables, which equals the number of directions needed for the computation
     3581of the Jacobian at one point.
     3586\hspace*{-1cm} \= \kill
     3587\> {\sf cudaError\_t kernellaunch(double* inx, double* outy, double* outderiv, int M) \{}\\
     3588\> {\sf \rule{0.5cm}{0pt}// Create 16 blocks}\\
     3589\> {\sf \rule{0.5cm}{0pt}int Blocks=16;}\\
     3590\> {\sf \rule{0.5cm}{0pt}// Two dimensional (M/Blocks)$\times$3 blocks}\\
     3591\> {\sf \rule{0.5cm}{0pt}dim3 threadsPerBlock(M/Blocks,3);}\\
     3593\> {\sf \rule{0.5cm}{0pt}// Call kernel function with 16 blocks with (M/Blocks)$\times$3 threads per block}\\
     3594\> {\sf \rule{0.5cm}{0pt}kernel $<<<$ Blocks, threadsPerBlock $>>>$(inx, outy, outderiv);}\\
     3595\> {\sf \rule{0.5cm}{0pt}cudaError\_t cudaErr = cudaGetLastError();}\\
     3597\> {\sf \rule{0.5cm}{0pt}return cudaErr;}\\
     3598\> {\sf \}}
     3601\caption{Example for traceless scalar forward mode with Cuda}
     3605We can now perform the function evaluations together with the calculation
     3606of the Jacobians. The corresponding code is illustrated in \autoref{fig:cudacode3}.
     3607Since the function evaluations should be performed on a GPU a kernel function is
     3608needed which is defined by using the \mbox{{\sf \_\_global\_\_}} declaration specifier
     3609(see \autoref{fig:cudacode3}). Then each thread executes the operations that are
     3610defined in the kernel.
     3615\hspace*{-1cm} \= \kill
     3616\> {\sf \_\_global\_\_ void kernel(double* inx, double* outy, double* outderiv) \{}\\
     3617\> {\sf \rule{0.5cm}{0pt}const int index = threadIdx.x ;}\\
     3618\> {\sf \rule{0.5cm}{0pt}const int index1 = threadIdx.y;}\\
     3619\> {\sf \rule{0.5cm}{0pt}const int index2 = blockIdx.x;}\\
     3620\> {\sf \rule{0.5cm}{0pt}const int index3 = blockDim.x;}\\
     3621\> {\sf \rule{0.5cm}{0pt}const int index4 = blockDim.x*blockDim.y;}\\
     3623\> {\sf \rule{0.5cm}{0pt}// Declare dependent and independent variables as {\sf adouble}s}\\
     3624\> {\sf \rule{0.5cm}{0pt}adtlc::adouble y[3], x[3];}\\
     3625\> {\sf \rule{0.5cm}{0pt}// Read out point for function evaluation}\\
     3626\> {\sf \rule{0.5cm}{0pt}for(int i=0; i $<$ 3; i++)}\\
     3627\> {\sf \rule{1cm}{0pt}x[i]=inx[index2*index4+index*3+i];}\\
     3628\> {\sf \rule{0.5cm}{0pt}// Set direction for calculation of derivatives}\\
     3629\> {\sf \rule{0.5cm}{0pt}x[index1].setADValue(1);}\\
     3631\> {\sf \rule{0.5cm}{0pt}// Function evaluation}\\
     3632\> {\sf \rule{0.5cm}{0pt}y[0] = sqrt(x[0]*x[0]+x[1]*x[1]+x[2]*x[2]);}\\
     3633\> {\sf \rule{0.5cm}{0pt}y[1] = atan(sqrt(x[0]*x[0]+x[1]*x[1])/x[2]);}\\
     3634\> {\sf \rule{0.5cm}{0pt}y[2] = atan(x[1]/x[0]);}\\
     3636\> {\sf \rule{0.5cm}{0pt}for(int i=0; i $<$ 3; i++)}\\
     3637\> {\sf \rule{1cm}{0pt}outy[(index2*index3+index)*3+i]=y[i].getValue();}\\
     3638\> {\sf \rule{0.5cm}{0pt}for(int i=0; i $<$ 3; i++)}\\
     3639\> {\sf \rule{1cm}{0pt}outderiv[(index2*index4+index*3+index1)*3+i]=y[i].getADValue();}\\
     3640\> {\sf \}}
     3643\caption{Example for traceless scalar forward mode with Cuda}
     3647In this example each thread is assigned the task of calculating the
     3648derivatives at one point with respect to one independent variable. Therefore
     3649some indices are needed for the implementation. Each thread has a unique thread
     3650ID within its block, which has an x- and a y-dimension. The blocks have an ID as
     3651well. In the example the following indices are used.
     3654 \item {\sf index = threadIdx.x} denotes the x-dimension of a thread (ranges from 0 to 63 in the example)
     3655 \item {\sf index1 = threadIdx.y} denotes the y-dimension of a thread (ranges from 0 to 2 in the example)
     3656 \item {\sf index2 = blockIdx.x} denotes the block index (ranges from 0 to 15 in the example)
     3657 \item {\sf index3 = blockDim.x} denotes the x-dimension of a block (always 64 in the example)
     3658 \item {\sf index4 = blockDim.x*blockDim.y} denotes the size of a block (always $64*3=192$ in the example)
     3661For the calculation of derivatives the function evaluation has to be performed
     3662with {\sf adouble}s. Therefore the dependent and the independent variables
     3663have to be declared as {\sf adouble}s as in the other versions of ADOL-C (see, e.g., \autoref{tapeless}).
     3664As in the traceless version without Cuda, the namespace \verb#adtlc# is used.
     3665Now each thread has to read out the point at which the function evaluation is
     3666then performed. This is determined by the block index and the x-dimension of
     3667the thread; therefore the independent variables in a thread
     3668have the values
     3670  \begin{tabular}{l}
     3671    {\sf \rule{0.5cm}{0pt}x[i]=inx[index2*index4+index*3+i];}\\
     3672  \end{tabular}
     3674for {\sf i=0,1,2} where {\sf inx} corresponds to the vector {\sf devx}. The direction for
     3675the derivatives is given by {\sf index1}.
     3677  \begin{tabular}{l}
     3678     {\sf \rule{0.5cm}{0pt}x[index1].setADValue(1);}\\
     3679  \end{tabular}
     3681The functions for setting and getting the value and the derivative value of an
     3682{\sf adouble} are the same as in the traceless version of ADOL-C for first
     3683order derivatives (see \autoref{tapeless}).
     3684The function evaluation is then performed with the {\sf adouble} array {\sf x} on the GPU
     3685and the results are saved in the {\sf adouble} array {\sf y}. The function evaluation itself remains unchanged
     3686(see \autoref{fig:cudacode3}). Then the values of the function
     3687evaluation at each point are stored consecutively in one array:
     3689  \begin{tabular}{l}
     3690   {\sf \rule{0.5cm}{0pt}outy[(index2*index3+index)*3+i]=y[i].getValue();}\\
     3691  \end{tabular}
     3693for {\sf i=0,1,2}. The values of each Jacobian are stored column by column
     3694in a contiguous segment:
     3696  \begin{tabular}{l}
     3697   {\sf \rule{0.5cm}{0pt}outderiv[(index2*index4+index*3+index1)*3+i]=y[i].getADValue();}\\
     3698  \end{tabular}
     3700for {\sf i=0,1,2}.
     3702One thing is left to do: the values calculated on the GPU have to be copied
     3703back to the host. For the dependent variables in the example this can be done by the
     3704following call:
     3706  \begin{tabular}{l}
     3707 {\sf \rule{0.5cm}{0pt}cudaMemcpy(y, devy, sizeof(double)*3*M,cudaMemcpyDeviceToHost);}\\
     3708  \end{tabular}
     3710see \autoref{fig:cudacode1}. The argument {\sf cudaMemcpyDeviceToHost} indicates that
     3711the values are copied from the GPU to the host.
     3714\subsection{Compiling and Linking the Source Code}
     3716After incorporating the required changes, one has to compile the
     3717source code and link the object files to get the executable. For
     3718the compilation of a Cuda file the {\sf nvcc} compiler is needed.
     3719The compile sequence should be similar to the following example:
     3721  \begin{tabular}{l}
     3722    {\sf nvcc -arch=sm\_20 -o traceless\_cuda traceless\}
     3723  \end{tabular}
     3725The compiler option \verb#-arch=sm_20# specifies the compute capability that is
     3726assumed, in this case one that supports double precision.
     3728The mentioned source code {\sf traceless\} illustrating the use of the
     3729forward traceless scalar mode with Cuda and a further example {\sf}
     3730can be found in the directory examples. The second example is an adaptation of
     3731the OpenMP example to the traceless version with Cuda.
  • trunk/ADOL-C/examples/additional_examples/

    r106 r406  
    1616SUBDIRS              = clock hessmat lufact ode sparse tapesave timing \
    1717                       detexam helm lighthouse scal speelpenning taylor pow \
    18                        checkpointing ext_diff_func fixpoint_exam openmp_exam
     18                       checkpointing ext_diff_func fixpoint_exam openmp_exam \
     19                       cuda
    2223SUBDIRS              = clock hessmat lufact ode sparse tapesave timing \
    2324                       detexam helm lighthouse scal speelpenning taylor pow \
    24                        checkpointing ext_diff_func fixpoint_exam openmp_exam
     25                       checkpointing ext_diff_func fixpoint_exam openmp_exam \
     26                       cuda
  • trunk/ADOL-C/include/adolc/

    r375 r406  
    1616                       externfcts.h checkpointing.h fixpoint.h\
    1717                       adolc_sparse.h adolc_openmp.h \
    18                        revolve.h advector.h adolc_settings.h
     18                       revolve.h advector.h adolc_settings.h \
     19                       adoublecuda.h
    2021SUBDIRS = drivers tapedoc
  • trunk/

    r402 r406  
    465465                ADOL-C/examples/Makefile
    466466                ADOL-C/examples/additional_examples/Makefile
     467                ADOL-C/examples/additional_examples/cuda/Makefile
    467468                ADOL-C/examples/additional_examples/clock/Makefile
    468469                ADOL-C/examples/additional_examples/checkpointing/Makefile