
User:Sbbhuva/sandbox/Program Level Parallelism


Program level parallelism is a type of parallelism that combines the concepts of task-oriented parallelism and multi-threading. It increases processor throughput by increasing the net amount of useful work done in the same time frame.[1]

Background

With the advent of multi-processor systems and intense competition in the market, the emphasis has always been on building systems that are faster, more power efficient and require fewer instructions. This is where the concepts of parallelism come in: multiple processors can work on different blocks of a program and hence complete the given task in less time.

The execution time (T) of a program is mathematically defined as the product of the number of instructions executed, the average number of clock cycles per instruction (CPI) and the time required for each clock cycle:

T = Total number of instructions * Cycles Per Instruction (CPI) * Time required for each clock cycle [2]
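
As a purely illustrative calculation (the figures are assumptions, not from the cited sources), a program of 2 billion instructions with an average CPI of 1.5 on a 2 GHz processor (a 0.5 ns clock cycle) would take

T = 2,000,000,000 * 1.5 * 0.5 ns = 1.5 seconds

Parallelism reduces the number of instructions each individual processor must execute, and therefore the overall execution time.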

However, looking at the factors affecting the execution time, lowering the clock cycle time is of limited help, as there is a threshold beyond which the clock frequency cannot be increased with current manufacturing processes. Moreover, power consumption grows with frequency, which can cause the device to overheat and be damaged.

The other two factors, the number of instructions and the CPI, are continuously being improved through changes in processor architecture and design. The common thread that binds these improvements is parallelism.

Program level parallelism is one such way of improving performance.

Description

Program level parallelism, as the name suggests, is achieved by splitting a program into different parts, each executed by a different processor.

Consider the example of matrix multiplication.

[C] = [A] * [B]

If matrix [A] is m*n and [B] is n*d, then by the properties of matrix multiplication the resultant matrix [C] will be an m*d matrix. The following is generalized code for carrying out matrix multiplication:

for (int i = 0; i < m; i++)
    for (int j = 0; j < d; j++)
    {
        int sum = 0;
        for (int k = 0; k < n; k++)
            sum = sum + a[i][k] * b[k][j];   /* accumulate the dot product of row i and column j */
        c[i][j] = sum;
    }

Analyzing the code for matrix multiplication, we see that there is no dependency between iterations of the first two loops while calculating the value of sum. Thus, the work of the first two loops can be split across different processors. However, the value of sum changes in every iteration of the third loop (over k), so there is a true dependency between iterations of that loop and it cannot be split across multiple processors (i.e. the same processor needs to execute that task).

Suppose both [A] and [B] are 2*2 matrices, so [C] will also be a 2*2 matrix. Assume we have 4 processors that can execute this code in parallel; each processor can then be used to calculate one element of the result. The code can be split amongst the 4 processors as follows.

Processor 1:

int sum1 = 0;
for (int k = 0; k < 2; k++)
    sum1 = sum1 + a[0][k] * b[k][0];
c[0][0] = sum1;

Processor 2:

int sum2 = 0;
for (int k = 0; k < 2; k++)
    sum2 = sum2 + a[0][k] * b[k][1];
c[0][1] = sum2;

Processor 3:

int sum3 = 0;
for (int k = 0; k < 2; k++)
    sum3 = sum3 + a[1][k] * b[k][0];
c[1][0] = sum3;

Processor 4:

int sum4 = 0;
for (int k = 0; k < 2; k++)
    sum4 = sum4 + a[1][k] * b[k][1];
c[1][1] = sum4;

Thus, a program which needed three nested loops now needs each processor to execute only the inner loop, and it completes more quickly because all four elements are computed in parallel.
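
A minimal sketch of the same decomposition using standard C++ threads is shown below. It mirrors the table above with one thread per element of [C]; the matrix values and variable names are purely illustrative assumptions, not part of the cited sources.

#include <thread>
#include <vector>

int main() {
    // 2*2 input matrices, 0-based indexing as in the generalized loop nest
    int a[2][2] = {{1, 2}, {3, 4}};
    int b[2][2] = {{5, 6}, {7, 8}};
    int c[2][2] = {};

    // One thread per output element: each thread runs only the inner k loop
    std::vector<std::thread> workers;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            workers.emplace_back([&, i, j] {
                int sum = 0;
                for (int k = 0; k < 2; k++)
                    sum = sum + a[i][k] * b[k][j];
                c[i][j] = sum;   // each thread writes a distinct element, so no race
            });

    for (auto &t : workers) t.join();
}

On a machine with at least four cores, the four threads can run simultaneously, matching the four processors assumed above.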

Types

Iterative code within a loop

A code segment whose loop iterations have no data dependencies between them is one scenario where the code can be parallelized at program level. In such a scenario, the values computed at one index do not depend on the values at the preceding or succeeding indices. Thus, every iteration can be executed as a separate unit, with all of them running in parallel. [2]

Independent code

Parts of the code which are independent of each other can be made to execute in parallel. In a program there may be multiple pieces of code which are not related to or dependent on each other in any way. Distributing those code segments and treating each of them separately is another form of program level parallelism.[2]

Example

Consider the following piece of code, having a loop over i,

for (i=0; i<=n; i++) 
{
c[i] = a[i] + b[i];  /* code segment S1 */
f[i] = d[i] + e[i];  /* code segment S2 */
}

In code segment S1, the value computed in iteration i does not depend on the previous iteration (i-1) or the next iteration (i+1), which means there is no data dependency. Thus the computations for all iterations, from i = 0 to i = n, can be parallelized.

Independent processors can therefore be asked to calculate the values of c[0], c[1], c[2], ..., c[n], helping the system perform the task more quickly.

c[0] = a[0] + b[0];

c[1] = a[1] + b[1];

c[2] = a[2] + b[2];

…..

c[n] = a[n] + b[n];

All the above computations can run in parallel as there is no dependency between any of them.
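
The following sketch, an assumption about one possible implementation rather than code from the cited source, distributes the iterations of S1 across a fixed number of C++ threads, each handling a contiguous chunk of the index range. The function name, chunking scheme and default thread count are illustrative choices.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: split the independent iterations of S1 across several threads,
// each computing a contiguous block of c[i] = a[i] + b[i].
void parallel_add(const std::vector<int>& a, const std::vector<int>& b,
                  std::vector<int>& c, unsigned num_threads = 4)
{
    std::vector<std::thread> workers;
    std::size_t n = a.size();
    std::size_t chunk = (n + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; t++) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; i++)
                c[i] = a[i] + b[i];   // no dependency between iterations
        });
    }
    for (auto& w : workers) w.join();
}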

Further, code segments S1 and S2 act like two independent blocks, which means S1 and S2 can be executed in parallel. To achieve this, the above code can be decomposed into the following two loops:

for (i=0; i<=n; i++) 
{
c[i] = a[i] + b[i];  /* code segment S1 */
}

for (i=0; i<=n; i++) 
{
f[i] = d[i] + e[i];  /* code segment S2 */
}

Thus, two independent processors are used to execute the loops rather than just one, helping reduce the completion time.
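
A minimal sketch of this two-loop decomposition using two C++ threads, one per code segment, might look as follows (the function and parameter names are illustrative assumptions):

#include <cstddef>
#include <thread>
#include <vector>

// Sketch: S1 and S2 are independent, so each loop can run on its own thread.
void run_s1_and_s2(const std::vector<int>& a, const std::vector<int>& b,
                   std::vector<int>& c,
                   const std::vector<int>& d, const std::vector<int>& e,
                   std::vector<int>& f)
{
    std::thread t1([&] {                       // code segment S1
        for (std::size_t i = 0; i < a.size(); i++)
            c[i] = a[i] + b[i];
    });
    std::thread t2([&] {                       // code segment S2
        for (std::size_t i = 0; i < d.size(); i++)
            f[i] = d[i] + e[i];
    });
    t1.join();
    t2.join();
}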

Realization

Different processors executing different tasks

Implementation

One method of implementing program level parallelism is a multiprocessor system in which each processor executes a different task and all processors are connected to a single bus.

Figure 1: Multiple processors, each with its own cache, connected to a single bus

As the adjacent figure shows, multiple processors, each with its own cache, execute different programs, speeding up the total computation. This form of parallelism is mostly seen in servers employing four to eight processors. [1]

Limitations and challenges

The load may differ across processors, which can lead to some processors doing more work than others. To handle such situations, load balancing techniques are used to ensure that the work is distributed evenly across all processors.
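
One common approach, sketched below as an assumption rather than a method taken from the cited sources, is dynamic load balancing: threads claim work items from a shared atomic counter, so a faster or less loaded thread simply ends up processing more items.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of dynamic load balancing: instead of giving each thread a fixed
// block of work, every thread repeatedly claims the next unprocessed item.
void process_all(std::vector<int>& items, unsigned num_threads = 4)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++)
        workers.emplace_back([&] {
            for (;;) {
                std::size_t i = next.fetch_add(1);   // claim the next item
                if (i >= items.size()) break;
                items[i] *= 2;                       // stand-in for the real per-item work
            }
        });
    for (auto& w : workers) w.join();
}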

Single program(task) spread across multiple threads

Implementation

A program executing on a single processor can be sped up by distributing it across multiple threads running on multiple processors. With all the threads running concurrently on all the processors, the overall execution time is reduced. [1]

In the adjoining figure, Thread A, Thread B and Thread C are all executing concurrently.

Limitations and Challenges

Creating enough threads so that all the available cores are optimally utilized is one of the challenges faced by software developers. If too many threads are created, this may lead to heavy rescheduling and context switching. Moreover, programs developed against a fixed threading model cannot easily adjust to changes in the multi-core system architecture.[3]

Limitations

Limitations on speed-up

How effectively a program can be distributed amongst multiple processors, so that each processor shares an equal load, is the most important question in parallelism. The extent to which parallelism is possible therefore depends largely on the program being executed. As per Amdahl's law, if a program is parallel only in short bursts and the majority of it is sequential code, it will theoretically need many processors to achieve any significant speed-up. However, if the number of processors is limited, some operations will be delayed relative to others. An important point is that this delay adds to the execution time only if the delayed operation is on the critical path.[4]
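
For illustration, Amdahl's law bounds the achievable speed-up as

Speed-up = 1 / ((1 - P) + P / N)

where P is the fraction of the program that can run in parallel and N is the number of processors. If, for example, only 50% of a program is parallelizable (P = 0.5), then even with an unlimited number of processors the speed-up can never exceed 1 / (1 - 0.5) = 2.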

Data Dependency

Executing a program in parallel is often limited by data dependencies that sections of the program have on each other. Because of this, a processor may have to wait until another processor completes its execution, so synchronization signals come into play, which increases the communication overhead. A further limitation is that programs with DOPIPE and DOACROSS dependencies cannot simply be parallelized at program level.[5]
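
For example, a loop of the following form (a generic illustration, not taken from the cited source) carries a true dependency from each iteration to the next, so its iterations cannot simply be handed to different processors:

// iteration i reads the value written by iteration i-1,
// so it cannot start until the previous iteration has finished
for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];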

Note

There is a popular belief that maximum parallelism can be extracted by rewriting the existing code so that it calls multiple functions, or by using a more functional language. This, however, is not the case. After simulation and multiple experiments, it was observed that even when false dependencies are removed, the performance of a multi-threaded model is still not up to the mark. Thus, large-scale parallelism cannot be achieved merely by translating the existing code, but by redesigning the application at the algorithmic level. This significantly boosts smoothability and improves parallelism.[4]

See Also

1) Data Dependency

2) Parallelism Types

References

  1. ^ a b c Dowling, Eric M. (Jan 2, 2001), Apparatus and method for program level parallelism in a VLIW processor, retrieved 2016-09-05
  2. ^ a b c "3 High Performance Computer Architecture". www.phy.ornl.gov. Retrieved 2016-09-05.
  3. ^ Drepper, Ulrich (May 27, 2010), Methods and systems for managing program-level parallelism, retrieved 2016-09-05
  4. ^ a b Theobald, K. B.; Gao, G. R.; Hendren, L. J. (1992-12-01). "On The Limits Of Program Parallelism And Its Smoothability". Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25): 10–19. doi:10.1109/MICRO.1992.696993.
  5. ^ Solihin, Yan (2016). Fundamentals of parallel multicore architecture. Boca Raton, FL: CRC Press, Taylor & Francis Group.
