Thursday 2 January 2014

Parallel IDEA Algorithm

For one of my coursework assignments this year, we had to parallelise the International Data Encryption Algorithm using different CPU concurrency techniques and analyse the effect on performance.

This post will provide an overview of the work carried out and the results that were obtained. If you are interested in reading the full report, you can do so at the following link.

Introduction
The aim of this piece of work is to evaluate the performance of the International Data Encryption Algorithm (IDEA) and to investigate different CPU concurrency and parallel techniques in an attempt to improve the performance of the algorithm.

Hardware

Home Machine

  • CPU – i7-2600K @ 3.4GHz (4 physical cores, 8 threads with Hyper-Threading)
  • RAM – 16GB
  • GPU – GTX 560 Ti
  • OS – Windows 8, 64-bit

Methodology
The following techniques were investigated in an attempt to speed up the algorithm.


OpenMP (Open Multi-Processing)
OpenMP is an API that allows programmers to easily make use of shared memory multi-processing. OpenMP uses #pragmas in C++ with a range of different constructs that allow the programmer to make use of multiple threads, work sharing, data sharing and synchronization. The main techniques that will be investigated with OpenMP are the parallel for construct and the different scheduling attributes. 

Parallel For
The parallel for directive allows the work of a for loop to be split across a number of threads. The runtime parallelises the loop for the programmer by dividing the iterations between the threads. Before using a parallel for, two conditions must hold: the number of iterations the loop will complete must be known when it starts, and the loop must be written in canonical form. If either condition is not met, OpenMP will not be able to parallelise the loop.

The parallel for will be used within the cipher method of the IDEA algorithm, as the majority of the work done inside the cipher method is contained within a for loop. A parallel for is a good fit here because we know the number of iterations the loop will complete and it is already in canonical form.
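
As a rough illustration of the construct (not the actual coursework code), the sketch below parallelises a simple canonical-form loop; the array, thread count and schedule clause are placeholders chosen for the example.

    #include <vector>

    // Compile with an OpenMP-enabled compiler, e.g. g++ -fopenmp example.cpp
    int main()
    {
        std::vector<int> values(1 << 20, 1);
        const int n = static_cast<int>(values.size()); // iteration count known up front

        // Canonical form: a single integer index, fixed bounds, simple increment.
        // The runtime splits the n iterations across the requested threads;
        // schedule(static) hands each thread one contiguous chunk.
        #pragma omp parallel for num_threads(8) schedule(static)
        for (int i = 0; i < n; ++i)
        {
            values[i] *= 2; // each iteration is independent of the others
        }

        return 0;
    }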

C++11 Threads
The new standard of C++ published in 2011 (C++11) gave developers support for creating multithreaded applications without relying on platform-specific extensions.

Using the thread library features of C++11, a parallel for will be implemented to handle the cipher method, and a comparison will be made between OpenMP and C++11 threads in terms of which gives the best performance and which is easier to implement.
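
As a minimal illustration of the standard interface (not the coursework code itself), the snippet below launches and joins two threads using only the C++11 standard library.

    #include <iostream>
    #include <thread>

    void worker(int id)
    {
        std::cout << "thread " << id << " running\n";
    }

    int main()
    {
        // Threads are created and joined with the standard library alone;
        // no platform-specific APIs such as Win32 threads or pthreads.
        std::thread t1(worker, 1);
        std::thread t2(worker, 2);

        t1.join();
        t2.join();

        return 0;
    }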

Results
This section will look at the results obtained through testing the various techniques described. 

Sequential Version
The IDEA algorithm was run 100 times for a range of different data sizes and an average time for each of the individual sections of the algorithm was calculated. This was done in an attempt to find the bottlenecks within the application. All times are in milliseconds unless otherwise stated.
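
For reference, the per-section timings can be gathered with something like the sketch below; cipher_encrypt is a placeholder for whichever stage is being measured, and the structure is an assumption rather than the exact coursework harness.

    #include <chrono>
    #include <iostream>

    // Placeholder for the stage being measured (generate, encrypt, decrypt or validate).
    void cipher_encrypt() { /* ... */ }

    int main()
    {
        const int runs = 100;
        double total_ms = 0.0;

        for (int i = 0; i < runs; ++i)
        {
            auto start = std::chrono::high_resolution_clock::now();
            cipher_encrypt();
            auto end = std::chrono::high_resolution_clock::now();
            total_ms += std::chrono::duration<double, std::milli>(end - start).count();
        }

        std::cout << "Average time: " << total_ms / runs << " ms\n";
        return 0;
    }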

Initial Timings
Data Size    Generate Data    Cipher (encrypt)    Cipher (decrypt)    Validate    Total
2^20         2                56.93               57.02               2.46        118.41
2^22         6                227.86              227.99              10.66       472.51
2^24         25               911.55              912.86              43.51       1892.92
2^26         100              3647.37             3656.72             174.77      7578.86

From the results in the table we can see that the cipher method is the main bottleneck in the application. Comparing the timings, the encrypt and decrypt calls together account for roughly 95% of the total runtime, while the generate data and validate methods account for only a small fraction of it.

Due to this large difference in time between the methods, the main focus of the parallelisation effort will be on the cipher method and, if time allows, possibly the generate data and validate methods.

Exponential Growth Rate
Each time the data size is increased by two powers of 2 (i.e. quadrupled), the time taken to complete the algorithm increases by approximately 300%, on both the games lab machine and the home machine.

Old Time (ms)    New Time (ms)    Increase
118.41           472.51           299%
472.51           1892.92          300%
1892.92          7578.86          300%

Using this rate of growth (multiplying the time by roughly four for each step up in data size) we can estimate the times for extremely large data sets.

Data Size    Expected Time (ms)
2^28         30314.58
2^30         121258.32

For approximately 1 billion values (2^30) we can estimate that it would take 121.25 seconds to generate, encrypt, decrypt and validate the data on the home machine. This is a very large increase in the time taken to complete the algorithm and shows that this rate of growth is detrimental to the performance of the application for large data sets.

OpenMP
As the main bulk of the cipher method is a for loop, we can make use of an OpenMP parallel for to attempt to improve performance.

This is done using the construct #pragma omp parallel for num_threads(number_of_threads). There are some considerations to take into account when using the parallel for: we need to identify which variables are shared, and could therefore be subject to race conditions, and which variables need to be private to each individual thread.

Looking at the cipher method, we can identify that all the variables used within the for loop need to be private to each thread so that values cannot be overwritten by other threads. This is simple to do with OpenMP and only requires the private(variables) clause to be added to the #pragma construct.
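
A rough sketch of how the clause is laid out is shown below; x1-x4 stand in for the per-block working variables of the real cipher method, and the loop body is only a placeholder for the IDEA rounds.

    #include <vector>

    void cipher(std::vector<unsigned short>& data, int number_of_threads)
    {
        // IDEA operates on 64-bit blocks, i.e. four 16-bit words per block.
        const int blocks = static_cast<int>(data.size() / 4);
        unsigned short x1, x2, x3, x4;

        // Each thread receives its own copies of x1..x4, so the intermediate
        // values of one block can never be overwritten by another thread.
        #pragma omp parallel for num_threads(number_of_threads) private(x1, x2, x3, x4)
        for (int i = 0; i < blocks; ++i)
        {
            x1 = data[i * 4];     x2 = data[i * 4 + 1];
            x3 = data[i * 4 + 2]; x4 = data[i * 4 + 3];

            // ... the per-block rounds would go here; placeholder work only ...
            data[i * 4] = static_cast<unsigned short>(x1 ^ x2 ^ x3 ^ x4);
        }
    }

    int main()
    {
        std::vector<unsigned short> data(1 << 20, 1);
        cipher(data, 8);
        return 0;
    }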

Timings
Data Size    Sequential Time (ms)    Parallel Time (ms)    Speed Up
2^20         116.41                  25.54                 78.06%
2^22         466.51                  90.89                 80.52%
2^24         1867.92                 362.94                80.57%
2^26         7478.86                 1453.70               80.56%

Looking at the table above, we can see a definite increase in performance between the sequential and parallel versions of the algorithm from implementing the parallel for. For all data sizes tested, the execution time on the home machine was reduced by at least 78%, a speed-up factor of roughly 4.5-5x.

Exponential Growth Rate
As before, we can work out the time the parallel version would be expected to take for larger data sizes.

Data Size    Home Expected Time (ms)
2^28         5814.80
2^30         23259.20

In the original sequential version on the home machine, running the algorithm on a data set of over 1 billion values would have taken an estimated 121.25 seconds; by adding two lines of code and thinking about the problem in a different way, we have been able to reduce this to an estimated 23.26 seconds. That is a saving of almost 98 seconds, which is a very long time.

C++11 Threads
Another method we could use to parallelise the for loop inside the cipher method would be to make use of the thread library that is part of C++11.

Using C++ threads involves a little more work from the programmer. Rather than simply adding a #pragma statement inside the cipher method to create a parallel for and having the runtime handle how the work is divided between the threads, we need to do that work ourselves.
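
A minimal sketch of that hand-rolled division of work is shown below, assuming a helper called parallel_for and a simple lambda in place of the real per-block cipher code; both names are illustrative rather than taken from the coursework.

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    // Split the iteration range into one contiguous chunk per thread --
    // the division of work that the OpenMP runtime previously did for us.
    void parallel_for(int begin, int end, int num_threads,
                      const std::function<void(int)>& work)
    {
        std::vector<std::thread> workers;
        const int count = end - begin;
        const int chunk = (count + num_threads - 1) / num_threads;

        for (int t = 0; t < num_threads; ++t)
        {
            const int first = begin + t * chunk;
            const int last  = std::min(end, first + chunk);
            if (first >= last) break;

            workers.emplace_back([first, last, &work]
            {
                for (int i = first; i < last; ++i)
                    work(i);
            });
        }

        for (auto& w : workers)
            w.join();
    }

    int main()
    {
        std::vector<int> data(1 << 20, 1);

        // Placeholder body standing in for the per-block cipher work.
        parallel_for(0, static_cast<int>(data.size()), 8,
                     [&](int i) { data[i] *= 2; });

        return 0;
    }

Compared with the single #pragma needed for OpenMP, this extra plumbing is the main reason the OpenMP version was quicker to develop.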

Timings
Data Size    Sequential Time (ms)    C++ Threads (ms)    Speed Up
2^20         116.41                  23.38               79.92%
2^22         466.51                  92.38               80.12%
2^24         1867.92                 361.00              80.67%
2^26         7478.86                 1438.52             80.77%

Conclusion
The aim of this report was to evaluate the performance of the International Data Encryption Algorithm (IDEA) and to investigate different CPU concurrency and parallel techniques in an attempt to improve the performance of the algorithm. This was achieved by making use of OpenMP and C++11 threads.

Of the two techniques, there was very little difference in the actual time taken to complete the algorithm; however, in terms of simplicity of implementation, OpenMP was the much better choice. Far less development time was spent on the OpenMP version than on the C++ threads version, and as both gave similar speed-ups, OpenMP appears to be the best option.

There are still a number of CPU concurrency techniques that could be attempted in order to increase performance further; however, due to time constraints these were not attempted. Further techniques which may be worthwhile to investigate are SIMD operations and a combination of OpenMP and C++ threads.

There are also optimisations that could be performed on both the generate data and validate methods that could improve the performance of the application. However, as these had little effect on performance at the data sizes tested, they were not attempted.
