263-2300-00: How To Write Fast Numerical Code

Assignment 2

Due Date: Thu March 17 17:00 by committing to your svn repository.

Questions: Make use of TA office hours, or email the TAs or Prof. Pueschel

Instructions (read carefully)

  • For this exercise you must submit your source code to the SVN directory that we set up for everybody who is part of a project team. The URL of your SVN directory is
    https://svn.inf.ethz.ch/svn/pueschel/students/trunk/s11-fastcode/[YOUR_NETZH_LOGIN]
    and it contains subdirectories for this (and the following) homeworks.
    Should you have trouble accessing your directory, please email me (Georg) at s11-fastcode@inf.ethz.ch
  • Also submit all other requested results (plots, tables, text) to SVN in a single file (.pdf preferred, MS Word format accepted).
  • For plots/benchmarks, be concise, but provide necessary information (e.g., compiler and flags) and always briefly discuss the plot and draw conclusions.
  • When compiling, ensure that you use optimization flags.
  • Disable SSE for this exercise when compiling. Under Visual Studio you will find the setting under Configuration Properties -> C/C++ -> Code Generation -> Enable Enhanced Instruction Set (should be off). With gcc there are several flags: use -mno-abm (check the flag); -fno-tree-vectorize should also do the job.

Exercises

  1. (Microarchitectural parameters, 15 pts) Determine and create a table for the following microarchitectural parameters of your computer.
    Parameters to be determined:
    1. CPU-core manufacturer, model name
    2. Number of CPU cores on chip
    3. CPU-core frequency
    4. For one core and without considering SSE:
      1. Cycles/issue for floating point additions
      2. Cycles/issue for floating point multiplications
      3. Maximum theoretical floating point peak performance (in Gflop/s)
    Tips:
    On Unix/Linux systems, typing 'cat /proc/cpuinfo' in a shell will give you enough information about your CPU to find an appropriate manual for it on the manufacturer's website (typically AMD or Intel). The manufacturer's website will contain information about the on-chip details (e.g., Intel).

    For Windows 7, "Control Panel\System and Security\System" will show your CPU; for more information, "CPU-Z" gives a very detailed report on the machine configuration.
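    If you prefer to query the processor programmatically, the following is a minimal sketch, assuming gcc or clang on an x86/x86-64 machine (the header <cpuid.h> and the __get_cpuid() helper are specific to those compilers). It only prints the vendor and brand strings; core count, frequency, and issue rates still have to be looked up in the manufacturer's documentation.

      /* Sketch only: read the CPU vendor and brand strings via CPUID.
         Assumes gcc/clang on x86/x86-64 (uses <cpuid.h>). */
      #include <stdio.h>
      #include <string.h>
      #include <cpuid.h>

      int main(void) {
          unsigned int eax, ebx, ecx, edx;

          /* Leaf 0: the 12-byte vendor string is returned in EBX, EDX, ECX. */
          char vendor[13] = {0};
          __get_cpuid(0, &eax, &ebx, &ecx, &edx);
          memcpy(vendor + 0, &ebx, 4);
          memcpy(vendor + 4, &edx, 4);
          memcpy(vendor + 8, &ecx, 4);
          printf("Vendor: %s\n", vendor);

          /* Leaves 0x80000002..0x80000004: the 48-byte processor brand string. */
          char brand[49] = {0};
          for (unsigned int leaf = 0x80000002; leaf <= 0x80000004; leaf++) {
              unsigned int regs[4];
              __get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]);
              memcpy(brand + 16 * (leaf - 0x80000002), regs, 16);
          }
          printf("Brand:  %s\n", brand);
          return 0;
      }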

  2. (MMM benchmarking, 30pts) The standard matrix multiplication kernel performs the following operation: C = A B + C, where A, B, C are matrices of compatible size. We provide a C source file and a C header file that time this kernel using different methods under Windows and Linux (for x86 compatibles).
    1. Inspect and understand the code.
    2. Determine the exact number of (floating point) additions and multiplications performed by the compute() function in mmm.c.
    3. Using your computer, compile and run the code. Ensure you get consistent timings from at least 2 different timers. You need to set up the timer for your machine by setting the number of runs (NUM_RUNS) and the machine frequency (FREQUENCY). For the number of runs per size n, use the formula 2*(ceil(1000/n))^3 (see the sketch after this exercise).
    4. Then, for all square matrices of sizes n between 100 and 1500, in increments of 100, create a plot of each of the following quantities (one plot per quantity, so 3 plots total; n should be the x-axis).
      • runtime.
      • performance (in Mflop/s or Gflop/s).
      • percentage of the peak performance reached, using the data from exercise 1.
    5. Submit your modified code to the SVN, keeping the file name mmm.c.
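    As a guide to the bookkeeping involved, here is a minimal sketch (not the provided mmm.c; FREQUENCY and PEAK_GFLOPS are made-up example values and must be replaced by your own numbers from exercise 1). It shows the suggested NUM_RUNS formula and how a measured cycle count is converted into the three plotted quantities.

      /* Sketch only: bookkeeping around the provided MMM kernel.
         FREQUENCY and PEAK_GFLOPS are example values; use your results
         from exercise 1. */
      #include <stdio.h>
      #include <math.h>

      #define FREQUENCY   2.6e9   /* example CPU frequency in Hz */
      #define PEAK_GFLOPS 5.2     /* example theoretical peak in Gflop/s */

      /* Convert the total cycle count for num_runs MMMs of size n
         into the three plotted quantities. */
      void report(int n, long num_runs, double cycles_total) {
          double seconds  = (cycles_total / num_runs) / FREQUENCY;   /* plot 1: runtime */
          double flops    = 2.0 * n * n * n;               /* flop count of C = A*B + C (cf. part 2) */
          double gflops   = flops / seconds * 1e-9;        /* plot 2: performance */
          double peak_pct = 100.0 * gflops / PEAK_GFLOPS;  /* plot 3: % of peak */
          printf("n=%4d runtime=%g s perf=%g Gflop/s peak=%g%%\n",
                 n, seconds, gflops, peak_pct);
      }

      int main(void) {
          for (int n = 100; n <= 1500; n += 100) {
              /* suggested number of runs per size: 2*(ceil(1000/n))^3 */
              long num_runs = 2L * lround(pow(ceil(1000.0 / n), 3));
              printf("n=%4d -> NUM_RUNS=%ld\n", n, num_runs);
              /* ... time num_runs calls of compute() here, then call report() ... */
          }
          return 0;
      }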
  3. (Microbenchmarks: Mathematical functions, 40pts) Determine the runtime (in cycles) of the following computations (x, y are doubles) as accurately as possible (you should reuse the timing setup from question 2):
    1. y = x
    2. y = 7.12x
    3. y = x + 7.12
    4. y = sin(x), x in {0.0, 0.2, 4.1, 170.32}
    5. y = log(x), x in {0.001, 1.00001, 10.65, 2762.32}
    6. y = exp(x), x in {-1.234e-17, 0.101, 3.72, 1.234e25}

    There are a total of 15 runtimes. Create a table. Briefly discuss and try to explain the results.
    Submit your code to the SVN and call it microbench.c

    The benchmark setup should be as follows:

    1. Allocate two vectors of doubles x[N] and y[N] and initialize all x[i] to be one of the values from above.
    2. Use:
      for(i=0; i<N; i++)
        y[i] = f(x[i]);

      to compute y[i] = f(x[i]), where f() is one of the functions above, and time this for-loop.
    3. Choose N such that all data easily fits into the L1 cache but there are enough iterations to get a reasonable timing measurement.
    4. Use the x86 time stamp counter provided to you in class.
    To accurately measure these very short computations, use the following guidelines (a minimal sketch of the resulting setup is shown after this list):
    1. Only time the actual work; leave everything else (initializations, timing-related computations, etc.) outside the timing loop.
    2. Use the C preprocessor to produce a parameterized implementation to easily check different parameters.
    3. You may have to run your for(N) loop multiple times to obtain reasonable timing accuracy.
    4. You may have to take the minimum across multiple such measurements to obtain stable results.
    5. You must perform a warm-up for the experiment (see the MMM code above as an example) so that the variables where you store the timing results, the timed routine, and the data vectors are all in cache.
    6. Alignment of your data vectors on cache line sizes or page sizes can influence the measured runtime.
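    Putting the setup and the guidelines together, a minimal sketch might look as follows. The time stamp counter is read here with the __rdtsc() intrinsic from <x86intrin.h> (gcc/clang on x86) as a stand-in for the timer code provided in class, and N, REPEATS, F, and XVAL are illustrative choices, not prescribed values.

      /* Sketch only: microbenchmark skeleton following the guidelines above.
         __rdtsc() stands in for the class-provided time stamp counter;
         N, REPEATS, F and XVAL are example values. */
      #include <stdio.h>
      #include <math.h>
      #include <x86intrin.h>

      #define N       1024     /* 2*1024*8 B = 16 KB, fits easily into L1 */
      #define REPEATS 100      /* take the minimum over this many measurements */
      #define F(x)    sin(x)   /* measured function, parameterized via the preprocessor */
      #define XVAL    0.2      /* one of the input values listed above */

      static double x[N], y[N];

      int main(void) {
          for (int i = 0; i < N; i++) x[i] = XVAL;   /* initialization outside the timing */

          /* warm-up: bring the routine and both vectors into cache (guideline 5) */
          for (int i = 0; i < N; i++) y[i] = F(x[i]);

          unsigned long long best = ~0ULL;
          for (int r = 0; r < REPEATS; r++) {
              unsigned long long start = __rdtsc();
              for (int i = 0; i < N; i++)
                  y[i] = F(x[i]);
              unsigned long long cycles = __rdtsc() - start;
              if (cycles < best) best = cycles;      /* minimum across measurements (guideline 4) */
          }
          printf("cycles per evaluation: %.2f\n", (double)best / N);
          return 0;
      }

    One way to cover all 15 cases is to redefine F and XVAL for each measurement (for example on the compiler command line); the exact mechanism is up to you.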
  4. (Performance and runtime, 15 pts) Consider a Core 2 Duo running at 3 GHz.
    1. You learned in class that one processor core can complete up to one addition and one multiplication per cycle (scalar doubles). What is the resulting theoretical floating point peak performance?
    2. Assume an MMM implementation of cost 2n^3 for n x n matrices reaches 80% of the peak for all sizes n.
      1. What is the runtime as a function of n?
      2. What is the largest size n that can be handled in one millisecond?