Privatization (computer programming)
Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus simultaneously.[1]
Each parallel algorithm specifies whether a variable is shared or private. Many errors in implementation can arise if the variable is declared to be shared but the algorithm requires it to be private, or vice versa.[2]
Traditionally, parallelizing compilers could apply privatization to scalar variables only. To exploit parallelism that occurs across iterations within a parallel program (loop-level parallelism), the need grew for compilers that can also perform array variable privatization.[3] Most of today's compilers can perform array privatization, with additional features and functions to enhance the performance of the parallel program in general. An example is the Polaris parallelizing compiler.[4]
Description
A shared-memory multiprocessor is a "computer system composed of multiple independent processors that execute different instruction streams".[4] The shared memory programming model is the most widely used for parallel processor designs.[1] This programming model starts by identifying possibilities for parallelism within a piece of code and then mapping these parallel tasks into threads.
The next step is to determine the scope of variables used in a parallel program, which is one of the key steps and main concerns within this model.
Variable scope
The next step in the model groups tasks together into bigger tasks, as there are typically more tasks than available processors. Typically, the number of execution threads that the tasks are assigned to is chosen to be less than or equal to the number of processors, with each thread assigned to a unique processor.[1]
Right after this step, the use of variables within tasks needs to be analyzed. This step determines whether each variable should be shared-by-all or private-to-each thread.[1] This step is unique to shared-memory programming. (An alternative is message passing, in which all variables are private.)[1]
According to their behavior, the variables are then categorized as follows (a short sketch after this list illustrates each category):
- Read-Only: when a variable is only read by all the parallel tasks.
- Read/Write Non-conflicting: when a variable is read, written, or both by only one task. If the variable is not scalar, different elements may be read/written by different parallel tasks.
- Read/Write Conflicting: when a variable is written by one task and may be read by another. If the variable is not scalar, the same elements may be read and written by different parallel tasks.
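As a minimal sketch (with hypothetical names, not taken from the cited sources), the loop below exhibits all three categories at once when its iterations are treated as the parallel tasks:

//Illustrative classification sketch: the iterations of the loop are
//the parallel tasks.
#include <stdio.h>

int main(void) {
    int n = 10;           // Read-Only: every task only reads n
    int b[10] = {0};      // Read/Write Non-conflicting: each iteration
                          // reads and writes a distinct element b[i]
    int sum = 0;          // Read/Write Conflicting: every iteration
                          // reads and writes the same scalar
    for (int i = 0; i < n; i++) {
        b[i] = b[i] + i;  // distinct element per task: no conflict
        sum = sum + b[i]; // same scalar across tasks: conflict
    }
    printf("%d\n", sum);
    return 0;
}

Here only sum would block parallelization of the loop.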
As their definitions suggest, Read/Write Conflicting variables introduce dependencies between different execution threads and hence prevent the automatic parallelization of the program. The two major techniques used to remove these dependencies are privatization and reduction. In reduction, each thread is provided with a copy of the R/W Conflicting variable to operate on, producing a partial result which is then combined with other threads' copies to produce a global result.[1] Another technique similar to privatization is called expansion, in which a scalar variable is expanded into an array, so that each thread accesses a different array element.[5] If the variable to be expanded is an array, then expansion adds another dimension to the array.[6]
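As a hedged sketch of expansion (the names and data are illustrative only), the conflicting scalar x below is expanded into an array indexed by the loop iteration, so each iteration, and hence each task, touches only its own element:

//Expansion sketch: the scalar x becomes x_exp[i], one element per
//iteration, removing the cross-iteration conflict.
#include <stdio.h>
#define N 8

int main(void) {
    double a[N + 1], b[N], c[N];
    for (int i = 0; i <= N; i++) a[i] = i;          // illustrative data
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    double x_exp[N];                                // was: double x;
    for (int i = 0; i < N; i++) {
        x_exp[i] = (b[i] + c[i]) / 2;  // each iteration writes...
        b[i] = a[i + 1] + x_exp[i];    // ...and reads only its own element
    }
    printf("%f\n", b[0]);
    return 0;
}

Note that the memory cost of expansion grows with the number of iterations, a point revisited in the comparison section below.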
Privatization
Dependencies – potential conflicts between different threads during execution – prevent parallelization, and these conflicts appear when there are Read/Write Conflicting variables. One technique to remove these conflicts is privatization. The basic principle involves making a private copy of a variable for each thread, rather than sharing one instance. This changes the variable's category from Read/Write Conflicting to Read/Write Non-conflicting.
The local (private) instances of the Read/Write Conflicting variables are created at compile time, by allocating a separate area of memory for each copy, stored at a different memory location. The architecture of shared-memory multiprocessors makes this straightforward, as all threads share a single address space.
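A minimal OpenMP sketch in C (assuming an OpenMP-capable compiler; the variable name is illustrative) makes the distinct memory locations visible by printing each thread's copy of a private variable and its address:

//Each thread's private x occupies its own location within the
//shared address space.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0;
    #pragma omp parallel private(x) num_threads(4)
    {
        x = omp_get_thread_num();   // each thread writes its own copy
        printf("thread %d: x = %d at %p\n",
               omp_get_thread_num(), x, (void *)&x);
    }
    return 0;
}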
There are two situations in which a variable can be described as privatizable:
- When the variable is written before it is read by the same task during the original program's sequential order. In this case, if the task wrote to its private copy rather than the shared one, the conflict/dependency would be removed. The reason for this is that the program's sequential order will ensure that the value will be the one written by the same task, removing any conflicts that might occur by other threads accessing the same variable. See § Example 1.
- When the variable is read before it is written by the same task. The difference here is that the value the task is trying to read is one from a prior computing step in another task. But if each task wrote to its own private copy, any conflicts or dependencies would be resolved during execution, as every task would read a value known ahead of time and then write its correct values to its own copy. See § Example 2.
Because Read/Write Conflicting variables are the only category that prevents parallelization, there is no need to explicitly declare Read-Only and Read/Write Non-conflicting variables as private. Doing so will not affect the correctness of the program, but may use more memory for unnecessary copies.
Limitations
Sometimes a variable can be neither privatized nor reduced to remove the read/write conflict. In these cases, the Read/Write Conflicting variable needs to be updated by different tasks at different points in time. This case is illustrated in § Example 3.
This problem can sometimes be solved by changing the scope of parallelism to explore a different parallel region. This might produce good results, because after reanalyzing the code, some Read/Write Conflicting variables often change to Read/Write Non-conflicting.[1] If the variable still causes conflicts, the last resort is to declare it as shared, protect its access with a critical section, and provide synchronization if accesses to the variable need to happen in a specified order to ensure correctness.
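As a minimal sketch of this last resort (using OpenMP in C; the array and its contents are illustrative), the conflicting variable stays shared and every update to it is serialized through a critical section:

//Last-resort sketch: sum remains shared; the critical section
//serializes its updates, preserving correctness at a performance cost.
#include <stdio.h>

int main(void) {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int sum = 0;
    #pragma omp parallel for shared(sum, a)
    for (int i = 0; i < 10; i++) {
        #pragma omp critical
        sum = sum + a[i];
    }
    printf("%d\n", sum);  // prints 55
    return 0;
}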
Arrays
Read/Write Conflicting variables can be scalar, or compound types such as arrays, matrices, structured types and so on. Privatization can be applied to both kinds of variables.
When applied to scalar variables, the additional space and overhead introduced by making the extra private copies per thread is relatively small, because scalars are small.[1] However, applying privatization on arrays, matrices or other compound types is much more complex.
When dealing with arrays, the compiler tries to analyze the behavior of each array element separately and check the order in which it is read and written. If each element is written before it is read in the same iteration, the array can be privatized. To do this, the compiler needs to further analyze the array to combine its accesses into sections. Moreover, the compiler needs extra functions to manipulate and reason about the array elements. For example, some array expressions may have symbolic terms; hence, to privatize such an array, the compiler needs some advanced symbolic manipulation functions.[5]
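As an illustrative sketch (with hypothetical names; OpenMP in C), the temporary array below is privatizable because every element of it is written before it is read within the same iteration of the outer loop:

//Array privatization sketch: each thread gets its own copy of tmp.
#include <stdio.h>
#define N 4
#define M 3

int main(void) {
    double a[N][M] = {{1, 2, 3}}, b[N][M];
    double tmp[M];
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++)
            tmp[j] = a[i][j] + 1.0;    // every element written first...
        for (int j = 0; j < M; j++)
            b[i][j] = tmp[j] * tmp[j]; // ...then read in the same iteration
    }
    printf("%f\n", b[0][0]);
    return 0;
}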
Examples
Example 1
There are several situations in which it is appropriate to privatize a variable. The first case is when each thread writes to the variable before reading from it. In this case, it does not matter what the other threads are doing, because the current thread always writes to the variable before reading (using) it. See the simple example below, in which the variable "x" is used as a temporary to help swap three different pairs of variables. Because "x" is always written to before being used, it can be privatized.
//Sequential Code:
//Swap Function
//Assume the variables have already been initialized
x = a; a = b; b = x;
x = c; c = d; d = x;
x = e; e = f; f = x;
The block above is the sequential code. Notice that without privatizing the variable "x", the code could not be parallelized. The code below shows what becomes possible by privatizing "x": the code can be split up and run on three different threads, each with its own copy of "x".
//Parallel Code:
//Swap Function
//Assume the variables have already been initialized
Thread 1:
x = a; a = b; b = x;
Thread 2:
x' = c; c = d; d = x';
Thread 3:
x'' = e; e = f; f = x'';
Example 2
Another case in which privatization is possible is when a variable's value is known before it is used, even if it is not redefined by the same thread. The example below demonstrates this. The variable "x" is redefined in the middle of each thread; however, the value it is redefined to is known when the program is written. By making "x" private and defining it at the beginning of each thread, the code can be run in parallel. The example below first shows the sequential code and then how it can be parallelized with the help of privatization.
//Sequential Code:
//Assume the variables have already been declared
x = 1;
y = x * 3;
x = 4;
z = y/x;
a = x * 9;
x = 3;
b = a/x;
c = x * 1;
x = 11;
d = c/x;
Notice that in order to make the sequential code above parallel, a few extra lines of code had to be added so that "x" could be privatized. Because of this, this particular example may not see much of a speedup. In larger examples, however, this technique can improve performance considerably.
//Parallel Code:
//Assume the variables have already been declared
Thread 1:
x = 1;
y = x * 3;
x = 4;
z = y/x;
Thread 2:
x' = 4;
a = x' * 9;
x' = 3;
b = a/x';
Thread 3:
x'' = 3;
c = x'' * 1;
x'' = 11;
d = c/x'';
Example 3
One example in which privatization fails is when a value is written in one task and read in another, and the value is not known ahead of time. Take this example of summing an array. The sum is a shared variable that is read and written in each iteration of the loop. In sequential code, this works fine. However, if the loop iterations are run in parallel on different threads, the wrong sum may be calculated. In this case, privatization does not work: the sum cannot be privatized because each iteration relies on the result of the previous one. While there are techniques that can still parallelize this code (see the reduction sketch after the loop below), simple privatization does not work.
//Sequential Code:
//Assume the variables have already been declared
sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + a[i];
}
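One such technique is the reduction described earlier. As a minimal sketch (using OpenMP in C; the array contents are illustrative), each thread accumulates a private partial sum, and the partial results are then combined into the shared total:

//Reduction sketch: reduction(+:sum) gives each thread a private
//partial sum, then combines the partial results at the end.
#include <stdio.h>

int main(void) {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10; i++) {
        sum = sum + a[i];
    }
    printf("%d\n", sum);  // prints 55
    return 0;
}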
Privatization in OpenMP
OpenMP is an application programming interface (API) that supports shared-memory multiprocessor programming. Read/write conflicting variables will therefore inevitably occur, and in these cases privatization can sometimes be used to allow parallel execution of the code. The example below shows two versions of the same code. The first is the original sequential code, which includes a dependence that would normally prevent it from being run in parallel. The second shows the code parallelized, with the privatization technique used to remove the dependence.[2]
//Sequential Code:
//Assume the variables have already been initialized
do i = 10, N - 1
   x = (b(i) + c(i))/2
   b(i) = a(i + 1) + x
enddo
For each iteration of the loop above, x is assigned and then read. Because x is a single shared variable, the loop cannot be executed in parallel as written: x would constantly be overwritten, and b(i) would not always be assigned the correct value.[2]
//Parallel Code:
//Assume the variables have already been initialized
!$omp parallel do shared(a, b) private(x)
do i = 10, N - 1
   x = (b(i) + c(i))/2
   b(i) = a(i + 1) + x
enddo
For the parallel code, x is declared as private, which means each thread gets its own copy, and the dependence is therefore removed.
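For comparison, a rough C equivalent of the same pattern (array names, sizes, and data are illustrative assumptions) uses the same private clause:

//Equivalent sketch in C: private(x) gives each thread its own x.
#include <stdio.h>
#define N 100

int main(void) {
    double a[N + 1], b[N], c[N], x;
    for (int i = 0; i <= N; i++) a[i] = i;   // illustrative data
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = i; }
    #pragma omp parallel for shared(a, b, c) private(x)
    for (int i = 10; i < N; i++) {
        x = (b[i] + c[i]) / 2;
        b[i] = a[i + 1] + x;
    }
    printf("%f\n", b[10]);
    return 0;
}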
Comparison with other techniques
Normally, when a variable is read/write conflicting, the solution is to declare it as shared, protect access to it with a critical section, and provide synchronization when needed. Because the critical section serializes execution and decreases overall system performance, this technique is avoided as much as possible.
Thus, a variable is first checked to see whether it can be reduced. If it cannot, the variable is checked for privatization, taking into consideration whether it is more tolerable to add the extra private copies or to serialize access to the variable through the critical section.
Compared with reduction, privatization requires only one task instead of two. That task, in an abstract form, is analyzing the code to identify the privatizable variables. The two tasks required by reduction are: identifying the reduction variable, and then parallelizing the reduction operator.[7] By observing each of the two techniques, it is easy to tell what type of overhead each one adds to the parallel program: reduction increases the computation overhead, while privatization increases the memory consumed by the program.[5]
Compared with expansion, privatization produces less memory overhead. The reason is that the memory space needed for privatization is proportional to the number of processors, while in expansion it is proportional to the number of loop iterations.[5] As previously mentioned, the number of tasks is typically higher than the number of processors available, which makes the memory required for expansion much larger than what is required for privatization.
As previously mentioned, changing the scope of parallelism to explore a different parallel region is another technique used to enable parallelization. Changing the scope in which the parallel tasks are identified can greatly change the behavior of the variables; hence, reanalyzing the code and applying this technique may often change Read/Write Conflicting variables into non-conflicting ones.[1]
References
- ^ a b c d e f g h i Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. ISBN 9781482211184.
- ^ a b c Chandra, Rohit (2001). Penrose, Denise (ed.). Parallel Programming in OpenMP (PDF). Morgan Kaufmann. pp. 48, 74, 143. ISBN 978-1-55860-671-5.
- ^ Gupta, M. (1997-04-01). On privatization of variables for data-parallel execution. pp. 533–541. CiteSeerX 10.1.1.50.2508. doi:10.1109/IPPS.1997.580952. ISBN 978-0-8186-7793-9.
- ^ a b Ceze, Luis H. (2011-01-01). "Shared-Memory Multiprocessors". In Padua, David (ed.). Encyclopedia of Parallel Computing. Springer US. pp. 1810–1812. doi:10.1007/978-0-387-09766-4_142. ISBN 9780387097657.
- ^ a b c d Padua, David (2011-01-01). "Parallelization, Automatic". In Padua, David (ed.). Encyclopedia of Parallel Computing. Springer US. pp. 1442–1450 – Paraphrased by Bruce Leasure. doi:10.1007/978-0-387-09766-4_197. ISBN 9780387097657.
- ^ Tu, Peng; Padua, David (1993-08-12). Banerjee, Utpal; Gelernter, David; Nicolau, Alex; Padua, David (eds.). Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 500–521. CiteSeerX 10.1.1.3.5746. doi:10.1007/3-540-57659-2_29. ISBN 9783540576594.
- ^ Yu, Hao; Rauchwerger, Lawrence (2014-01-01). Adaptive Reduction Parallelization Techniques. New York, NY, USA: ACM. pp. 311–322. doi:10.1145/2591635.2667180. ISBN 9781450328401.