Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus, simultaneously.[1]

Each parallel algorithm specifies whether a variable is shared or private. Many errors in implementation can arise if the variable is declared to be shared but the algorithm requires it to be private, or vice versa.[2]

Traditionally, parallellizing compilers could apply privatization to scalar elements only. To exploit parallelism that occurs across iterations within a parallel program (loop-level parallelism), the need grew for compilers that can also perform array variable privatization.[3] Most of today's compilers can performing array privatization with more features and functions to enhance the performance of the parallel program in general. An example is the Polaris parallelizing compiler.[4]

Description

A shared-memory multiprocessor is a "computer system composed of multiple independent processors that execute different instruction streams".[4] The shared memory programming model is the most widely used for parallel processor designs.[1] This programming model starts by identifying possibilities for parallelism within a piece of code and then mapping these parallel tasks into threads.

The next step is to determine the scope of variables used in a parallel program, which is one of the key steps and main concerns within this model.

Variable scope

The next step in the model groups tasks together into bigger tasks, as there are typically more tasks than available processors. Typically, the number of execution threads that the tasks are assigned to, is chosen to be less than or equal to the number of processors, with each thread assigned to a unique processor.[1]

Right after this step, the use of variables within tasks needs to be analyzed. This step determines whether each variable should be shared-by-all or private-to-each thread.[1] This step is unique to shared-memory programming. (An alternative is message passing, in which all variables are private.)[1]

According to their behavior, the variables are then categorized as:

  • Read-Only: when a variable is only read by all the parallel tasks.
  • Read/Write Non-conflicting: when a variable is read, written, or both by only one task. If the variable is not scalar, different elements may be read/written by different parallel tasks.
  • Read/Write Conflicting: when a variable is written by a task and may be read by another. If the variable is not scalar, different elements are read/written by different parallel tasks.

As it appears from their definition, Read/Write Conflicting variables introduce dependencies between different execution threads and hence prevent the automatic parallelization of the program. The two major techniques used to remove these dependencies are privatization and reduction. In reduction, each thread is provided with a copy of the R/W Conflicting variable to operate on it and produce a partial result, which is then combined with other threads' copies to produce a global result.[1] Another technique similar to privatization is called expansion, in which a scalar variable is expanded into an array, which makes each thread access a different array element.[5] If the variable to be expanded is an array, then expansion adds another dimension to the array.[6]

Privatization

Dependencies potential conflicts between different threads during execution prevent parallelization, and these conflicts appear when we have Read/Write Conflicting variables. One technique to remove these conflicts is privatization. The basic principle involves making a private copy of a variable for each thread, rather than share one instance. This changes the category of the variable from Read/Write Conflicting to Read/Write Non-conflicting.

The actual local (private) instances of the Read/Write Conflicting variables are created at compile time, by allocating several areas of memory for the variables stored at different memory locations. The architecture of shared-memory multiprocessors helps, as threads share an address space.

There are two situations in which a variable can be described as privatizable:

  1. When the variable is written before it is read by the same task during the original program's sequential order. In this case, if the task wrote to its private copy rather than the shared one, the conflict/dependency would be removed. The reason for this is that the program's sequential order will ensure that the value will be the one written by the same task, removing any conflicts that might occur by other threads accessing the same variable. See § Example 1.
  2. When the variable is read before it is written by the same task. The difference here is that the value the task is trying to read is one from a prior computing step in another task. But if each task wrote to its own private copy, any conflicts or dependencies would be solved during execution, as they would all read a value known ahead of time and then write their correct values on their own copies. See § Example 2.

Because Read/Write Conflicting variables are the only category that prevents parallelization, there is no need explicitly to declare Read-only and Read/Write Non-conflicting variables as private. Doing so will not affect the correctness of the program, but may use more memory for unnecessary copies.

Limitations

Sometimes a variable can be neither privatized nor reduced to remove the read/write conflict. In these cases, the Read/Write Conflicting variable needs to be updated by different tasks at different points of time. See § Example 3.

This problem can sometimes be solved by changing the scope of parallelism to explore a different parallel region. This might produce good results, as it is often that after reanalyzing the code, some Read/Write Conflicting variables may change to Read/Write Non-conflicting.[1] If the variable still causes conflicts, the last resort is to declaring it as shared and protecting its access by some form of mutual exclusion, and providing synchronization if accesses to the variable needs to happen in a specified order to ensure correctness.

Arrays

Read/Write Conflicting variables can be scalar, or compound types such as arrays, matrices, structured types an so on. Privatization can be applied to both types of variables.

When applied to scalar variables, the additional space and overhead introduced by making the extra private copies per thread is relatively small, because scalars are small.[1] However, applying privatization on arrays, matrices or other compound types is much more complex.

When dealing with arrays, the compiler tries to analyze the behavior of each array element separately and check for the order it is read and written. If each element is written before it is read in the same iteration, this array can be privatized. To do this, the compiler needs to further analyze the array to combine its accesses into sections. Moreover, the compiler should have extra functions, to be able to manipulate and deal with the array elements. For example, some array expressions may have symbolic terms, hence, to be able to privatize such array, the compiler needs to have some advanced symbolic manipulation functions.[5]

Examples

Example 1

A variable can be privatized if each task will write to it before reading from it. In this case, it does not matter whether other threads are doing so. In the code below, the variable x is used to help swap three different pairs of variables. Because it is always written to before being read, it can be privatized.

//Sequential Code:

x = a;
a = b;
b = x;

x = c;
c = d;
d = x;

x = e;
e = f;
b = x;

This code cannot be made parallel without privatizing x. With x privatized, it can run on three different threads, each with its own private x:

//Parallel Code:

//Thread 1:
x[1] = a;
a = b;
b = x[1];

// Thread 2:
x[2] = c;
c = d;
d = x[2];

// Thread 3:
x[3]= e;
e = f;
b = x[3];

Example 2

Privatization is possible when a variable's value is known before it is used – even if it is written to by a different task. The code below demonstrates this. The variable x is written to in the middle of each task, but that value could be computed when the program was compiled. By making x private and defining it at the beginning of each task, the code can be run in parallel:

//Sequential Code:
x = 1;
y = x * 3;
x = 4;
z = y/x;

a = x * 9;
x = 3;
b = a/x;

c = x * 1;
x = 11;
d = c/x;

To make the sequential code above parallel, a few lines of code must be added so that x can be privatized:

//Parallel Code
//Thread 0:
x[0] = 1;
y = x[0] * 3;
x[0] = 4;
z = y/x[0];

//Thread 1:
x[1] = 4;
a = x[1] * 9;
x[1] = 3;
b = a/x[1];

//Thread 2:
x[2] = 3;
c = x[2] * 1;
x[2] = 11;
d = c/x[2];

Because of the extra code, this short example may not actually see much of a speed up. But in real-life, longer code this technique can greatly improve performance.

Example 3

Privatization fails is when a variable is written in one task and read in another, and the value is not known ahead of time. An example is summing the elements of an array. The sum is a shared variable and is read/written in each iteration of the loop.

In sequential code, this works fine. But if the iterations were each done in a different thread, the wrong sum would be calculated. In this case, privatization does not work. The sum cannot be made private because it relies on its value from the previous iteration.

//Sequential Code:

sum = 0;
for (i = 0; i < 100; i++)
    sum += a[i];

This problem can be solved in part by loop unrolling. Because it does not matter which order the elements are added, the loop can be split into an arbitrary number of parts:

// Thread 0:
sum[0] = 0;
for (i[0] = 0; i[0] < 100; i[0] += 3)
    sum[0] += a[i[0]];

// Thread 1:
sum[1] = 0;
for (i[1] = 1; i[1] < 100; i[1] += 3)
    sum[1] += a[i[1]];

// Thread 2:
sum[2] = 0;
for (i[2] = 2; i[2] < 100; i[2] += 3)
    sum[2] += a[i[2]];

// "Master" thread:
wait_for_all(thread[0], thread[1], thread[2]);
sum = sum[0] + sum[1] + sum[2];

OpenMP

OpenMP is a programming language that supports multiprocessor programming with shared memory. Because of this, Read/Write Conflicting variables will undoubtedly occur. In these cases, privatization can sometimes be used to allow parallel execution of the code.

Given the sequential code:

  do i = 10, N - 1
       x = (b(i) + c(i))/2
       b(i) = a(i + 1) + x
  enddo

For each iteration of the loop, x is written to and then read from. Because x is only a scalar variable, the loop cannot be executed in parallel because it would be overwritten in different threads, and b(i) would not always be assigned the correct value.[2]

Equivalent parallelized code using privatization is:

  !$omp parallel do shared(a, b) private(x)
  do i = 10, N - 1
       x = (b(i) + c(i))/2
       b(i) = a(i + 1) + x
  enddo

Because x is declared as private, each thread gets its own copy and the dependence is removed.[2]

Comparison with other techniques

Normally, when a variable is Read/Write Conflicting, the solution will be declaring it as shared and protecting access to it by mutual exclusion, providing synchronization when needed. Because mutual exclusion slows things down, this technique is avoided as much as possible.

Thus, the compiler or programmer first checks whether the variable can be reduced. If it cannot, the next check is for privatization. Privatization trades space for time, so mutual exclusion can be a better option when memory is limited.

Compared to reduction, privatization requires one modelling step: merely to analyze the code to identify the privatizable variables. On the other hand, the two steps required by reduction are: identifying the reduction variable, and then parallelizing the reduction operator.[7] By observing each of the two techniques, it is easy to tell what type of overhead each one adds to the parallel program; reduction increases the computation overhead while privatization increases the memory consumed by the program.[5]

Compared to expansion, privatization has less memory overhead. The memory space needed for privatization is proportional to the number of processors, while in expansion, it is proportional to the number of iterations.[5] Since the number of tasks is typically higher than the number of processors, the memory required by expansion is much larger than by privatization.

Changing the scope of parallelism can be done to explore a different parallel region. Doing so may sometimes greatly change the behavior of the variables. So reanalyzing the code and performing this technique may often change Read/Write Conflicting variables to be Non-conflicting.[1]

See also

References

  1. 1 2 3 4 5 6 7 8 9 Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. ISBN 978-1-4822-1118-4.
  2. 1 2 3 Chandra, Rohit (2001). Penrose, Denise (ed.). Parallel Programming in OpenMP (PDF). Morgan Kaufmann. pp. 48, 74, 143. ISBN 978-1-55860-671-5.
  3. Gupta, M. (1997-04-01). "On privatization of variables for data-parallel execution". Proceedings 11th International Parallel Processing Symposium. pp. 533–541. CiteSeerX 10.1.1.50.2508. doi:10.1109/IPPS.1997.580952. ISBN 978-0-8186-7793-9. S2CID 17389658.
  4. 1 2 Ceze, Luis H. (2011-01-01). "Shared-Memory Multiprocessors". In Padua, David (ed.). Encyclopedia of Parallel Computing. Springer US. pp. 1810–1812. doi:10.1007/978-0-387-09766-4_142. ISBN 978-0-387-09765-7.
  5. 1 2 3 4 Padua, David (2011-01-01). "Parallelization, Automatic". In Padua, David (ed.). Encyclopedia of Parallel Computing. Springer US. pp. 1442–1450 – Paraphrased by Bruce Leasure. doi:10.1007/978-0-387-09766-4_197. ISBN 978-0-387-09765-7.
  6. Tu, Peng; Padua, David (1993-08-12). "Automatic Array Privatization". In Banerjee, Utpal; Gelernter, David; Nicolau, Alex; Padua, David (eds.). Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 500–521. CiteSeerX 10.1.1.3.5746. doi:10.1007/3-540-57659-2_29. ISBN 978-3-540-57659-4.
  7. Yu, Hao; Rauchwerger, Lawrence (2014-01-01). "Adaptive Reduction Parallelization Techniques". ACM International Conference on Supercomputing 25th Anniversary Volume. New York, NY, USA: ACM. pp. 311–322. doi:10.1145/2591635.2667180. ISBN 978-1-4503-2840-1. S2CID 52865514.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.