This post provides an overview of the work carried out and the results that were obtained. If you are interested in reading the full report, you can do so at the following link
Introduction
The aim of this piece of work is to evaluate the performance of the International Data Encryption Algorithm (IDEA) and to investigate different CPU concurrency and parallel techniques in an attempt to improve the performance of the algorithm.
Hardware
Home Machine
- CPU – i7-2600K @ 3.4GHz, 4 physical cores with Hyper-Threading = 8 logical cores
- RAM – 16GB
- GPU – GTX 560 Ti
- OS – Windows 8 64 bit
Methodology
The following techniques were investigated in an attempt to speed up the performance of the algorithm.
OpenMP (Open Multi-Processing)
OpenMP is an API that allows programmers to easily make use of shared memory multi-processing. OpenMP uses #pragmas in C++ with a range of different constructs that allow the programmer to make use of multiple threads, work sharing, data sharing and synchronization. The main techniques that will be investigated with OpenMP are the parallel for construct and the different scheduling attributes.
Parallel For
The parallel for directive allows the work of a for loop to be split across a number of threads: the system parallelizes the loop for the programmer by dividing the iterations between the threads. There are two conditions that must hold before a parallel for can be used: the number of iterations the loop will complete must be known, and the loop must be presented in canonical form. If either condition is not met, OpenMP will not be able to parallelize the for loop.
The parallel for will be used within the cipher method of the IDEA algorithm as the majority of the work that is done inside the cipher method is contained within a for loop. A parallel for would be a good method to try as we know the number of iterations that the for loop will complete and it is in a canonical form.
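A minimal sketch of what this looks like. Here encrypt_block is a hypothetical stand-in for the per-block IDEA work, not the real round function:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-block IDEA cipher work.
static uint16_t encrypt_block(uint16_t v) { return static_cast<uint16_t>(v * 3 + 1); }

void cipher_all(std::vector<uint16_t>& data) {
    // The loop is in canonical form (integer index, fixed bounds, simple
    // increment) and its trip count is known up front, so OpenMP can
    // divide the iterations between the threads.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(data.size()); ++i)
        data[i] = encrypt_block(data[i]);
}
```

Because each iteration touches a different element, the iterations are independent and can safely run in any order across threads.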
C++11 Threads
With the new standard of C++ that was published in 2011 (C++11), developers gained standard-library support for creating multithreaded applications without relying on platform-specific extensions. Using the thread library features of C++11, a parallel for will be implemented to handle the cipher method, and OpenMP and C++11 threads will be compared in terms of which gives the best performance and which is the easiest to implement.
Results
This section will look at the results obtained through testing the various techniques described.
Sequential Version
The IDEA algorithm was run 100 times for a range of different data sizes, and an average time for each of the individual sections of the algorithm was calculated in an attempt to find the bottlenecks within the application. All times are in milliseconds unless otherwise stated.
Initial Timings
| Data Size | Generate Data | Cipher (encrypt) | Cipher (decrypt) | Validate | Total |
|-----------|---------------|------------------|------------------|----------|---------|
| 2^20 | 2 | 56.93 | 57.02 | 2.46 | 118.41 |
| 2^22 | 6 | 227.86 | 227.99 | 10.66 | 472.51 |
| 2^24 | 25 | 911.55 | 912.86 | 43.51 | 1892.92 |
| 2^26 | 100 | 3647.37 | 3656.72 | 174.77 | 7578.86 |
From the results in the table we can see that the cipher method is the main bottleneck in the application: the encrypt and decrypt calls together account for roughly 95% of the total runtime, dwarfing the generate data and validate methods. Due to this large difference, the main focus of the parallelisation efforts will be on the cipher method and, if time allows, on the generate data and validate methods.
Exponential Growth Rate
Each time the data size is increased by a factor of four (two powers of 2), the time it takes to complete the algorithm increases by approximately 300% on both the games lab machine and the home machine; in other words, the runtime scales linearly with the data size.
| Old Time | New Time | Increase |
|----------|----------|----------|
| 118.41 | 472.51 | 299% |
| 472.51 | 1892.92 | 300% |
| 1892.92 | 7578.86 | 300% |
Using this rate of growth we can work out what the values
would be for extremely large datasets.
| Data Size | Expected Time (ms) |
|-----------|--------------------|
| 2^28 | 30314.58 |
| 2^30 | 121258.32 |
For approximately 1 billion values (2^30) we can estimate that it would take 121.25 seconds to generate, encrypt, decrypt and validate the data on the home machine. Runtimes of this magnitude show that this rate of growth makes the sequential application impractical for large data sets.
OpenMP
As the main bulk of the cipher method is a for loop, we can make use of an OpenMP parallel for to attempt to improve performance. This is done with the construct #pragma omp parallel for num_threads(number_of_threads). There are some considerations to take into account when using the parallel for: we need to identify which variables are shared, and could therefore present problems with race conditions, and which variables need to be private to each individual thread.
Looking at the cipher method we can identify that all the variables used within the for loop need to be private to each thread, so that their values cannot be overwritten by other threads. This is simple to do in OpenMP and only requires the attribute private(variables) to be added to the #pragma construct.
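As a sketch of the construct described above. The loop body is a placeholder swap rather than the real IDEA rounds, and number_of_threads, x1 and x2 are illustrative names; data.size() is assumed to be even:

```cpp
#include <cstdint>
#include <vector>

// Placeholder cipher loop: x1/x2 are per-iteration scratch variables
// declared outside the loop, so they must be marked private to avoid
// one thread overwriting another's values.
void cipher(std::vector<uint16_t>& data, int number_of_threads) {
    uint16_t x1, x2;
    #pragma omp parallel for num_threads(number_of_threads) private(x1, x2)
    for (int i = 0; i < static_cast<int>(data.size()); i += 2) {
        x1 = data[i];
        x2 = data[i + 1];
        // ... the IDEA rounds would transform the block here ...
        data[i] = x2;     // placeholder swap, not the real cipher
        data[i + 1] = x1;
    }
}
```

Variables declared inside the loop body are automatically private; the private() clause is only needed for scratch variables declared outside the loop, as here.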
Timings
| Data Size | Sequential Time (ms) | Parallel Time (ms) | Speed Up |
|-----------|----------------------|--------------------|----------|
| 2^20 | 116.41 | 25.54 | 78.06% |
| 2^22 | 466.51 | 90.89 | 80.52% |
| 2^24 | 1867.92 | 362.94 | 80.57% |
| 2^26 | 7478.86 | 1453.7 | 80.56% |
Looking at the table above, there was a clear performance improvement from implementing the parallel for: for every data size tested, execution time on the home machine was reduced by at least 78% compared with the sequential version.
Exponential Growth Rate
As before, we can estimate the time the parallel version would take for larger data sizes.
| Data Size | Home Expected Time (ms) |
|-----------|-------------------------|
| 2^28 | 5814.80 |
| 2^30 | 23259.20 |
In the original sequential version on the home machine, running the algorithm on a data set of over 1 billion values would have taken an estimated 121.25 seconds. By adding just two lines of code and thinking about the problem in a different way, we have been able to reduce this to an estimated 23.25 seconds, a saving of roughly 98 seconds.
C++11 Threads
Another method we could use to parallelise the for loop inside
the cipher method would be to make use of the thread library that is part of
C++11.
In order to use C++ threads it involves a little
more work from the programmer. Rather than simply adding a #pragma statement
inside the cipher method to create a parallel for and have the runtime handle
how the work is divided between the threads we need to do that work ourselves.
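A sketch of what that hand-rolled division of work looks like with std::thread. parallel_cipher and encrypt_block are illustrative names, and encrypt_block stands in for the real IDEA block work; num_threads is assumed to be at least 1:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-block IDEA cipher work.
static uint16_t encrypt_block(uint16_t v) { return static_cast<uint16_t>(v * 3 + 1); }

void parallel_cipher(std::vector<uint16_t>& data, unsigned num_threads) {
    const std::size_t n = data.size();
    // Divide the iterations into contiguous chunks, one per thread,
    // rounding up so every element is covered.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        // Each thread works on a disjoint slice of the data, so no
        // synchronisation is needed beyond the final join.
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = encrypt_block(data[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

This is exactly the bookkeeping that OpenMP's parallel for generates automatically: computing chunk boundaries, launching the workers, and joining them at the end.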
Timings
| Data Size | Sequential Time (ms) | C++ Threads (ms) | Speed Up |
|-----------|----------------------|------------------|----------|
| 2^20 | 116.41 | 23.38 | 79.92% |
| 2^22 | 466.51 | 92.38 | 80.12% |
| 2^24 | 1867.92 | 361 | 80.67% |
| 2^26 | 7478.86 | 1438.52 | 80.77% |
Conclusion
The aim of this report was to evaluate the performance of
the International Data Encryption Algorithm (IDEA) and to investigate different
CPU concurrency and parallel techniques in an attempt to improve the performance
of the algorithm. This has been successful by making use of techniques such as
OpenMP and C++ threads.
Of the two techniques there was very little difference in the actual time taken to complete the algorithm; however, in terms of simplicity of implementation, OpenMP was the much better choice. Far less time was spent on developing the OpenMP version than the C++ threads version, and as both gave similar speed-ups, OpenMP appears to be the best option.
There are still a number of CPU concurrency techniques that could be attempted in order to increase performance, but due to time constraints these were not tried. Further techniques which may be worthwhile to investigate are SIMD operations and a combination of OpenMP and C++ threads.
There are also optimizations that could be performed on both the generate data and validate methods; however, as these methods have little effect on overall performance at the data sizes tested, they were not attempted.