How to Write Fast Numerical Code 263-2300 (ETH, CS)

Basic Information

Course number: 263-2300, 6 credits
Spring 2013, lectures: M 10:15-12:00, RZ F21; W 13:15-14:00 RZ F21; occasional substitute lectures: F 14:15-16:00 IFW B42
Instructor: Markus Püschel (RZ H18, pueschel at inf, 2-7303)
TAs:
- Georg Ofenbeck (RZ H1.1, ofgeorg at inf, 2-8414)
- Daniele Spampinato (RZ H1.1, daniele.spampinato at inf, 2-8414)
Admin: Franziska Mäder (RZ H14, maeder at inf, 2-7311)
Office hours:
- Markus Püschel: Tues 14:00-15:00
- Daniele Spampinato: Mon 14:00-15:00
- Georg Ofenbeck: Wed 14:00-15:00
Mailing lists:
- Forum to find project partner: fastcode-forum@lists.inf.ethz.ch (emails go to all students who have no partner yet and to Georg)
- For technical questions: fastcode@lists.inf.ethz.ch (emails to this address go to the lecturer and all TAs)

Course Description

The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly hard to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.

This interdisciplinary course aims to give the student an understanding of performance and introduces foundations and state-of-the-art techniques in high performance software development using important functionality such as linear algebra algorithms, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.

Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.

Prerequisites: solid C programming skills, matrix algebra, Master student or above

Topics Covered

Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
Compilers: strengths, limitations, how to use
Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others

Goals of this Course

Obtain an understanding of runtime performance and how to reason about it
Learn a guideline for writing fast numerical code and apply it in homeworks and your research project
Understand the connection between algorithms, implementations, and computer architecture

Background Material

Prior versions of this course:
- spring 2012 (ETH)
- spring 2011 (ETH)
- spring 2008 (CMU)
- spring 2005 (CMU)
Chapters 5 and 6 in Computer Systems: A Programmer's Perspective, 2nd edition (available in library)
This small tutorial
Research papers, manuals, and other material mentioned in the slides or linked in the table below

Academic Integrity

All homeworks in this course are single-student homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone, including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure no one can read your files. The university policies on academic integrity will be applied rigorously.

We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).

It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.

Grading

40% research project
- Topic: Very fast, ideally adaptive implementation of a numerical problem
- Team up in pairs
- March 7: find a partner, find a problem or I will give you one (tip: look at the prior courses linked above for examples)
- Complete "milestones" during semester and enter them into the online check list
- Write a 6-page standard conference paper (template will be provided)
- Give a short presentation at the end of the semester
20% midterm
40% homework
- Exercises on algorithm analysis
- Implementation exercises
  - study the effect of program optimizations, compilers, special instructions, etc.
  - write and submit C code & create runtime/performance plots
- Some templates will be provided
- All homeworks are single-student homeworks
There is no final exam

Research Project

How it works:
- Weeks without homeworks should be used to work on the project
- You select a numerical problem and create a correct (verified) implementation in C
- You determine the arithmetic cost and measure the runtime and performance
- You profile the implementation to find the parts in which most of the runtime is spent
- Focusing on these, you apply various optimization techniques from this class
- You repeat the previous steps to create various versions with (hopefully) continuously better runtime
- You write a paper about your work and give a presentation
Paper:
- Maximum 6 pages (hard limit), conference style, template and instructions below
- Everybody reads this: report.pdf
- For latex use: report.zip (start with reading the README file)
- For Word (discouraged) use this: report-word.doc
- Due date June 14th (as final-report.pdf in your svn)
Presentation
- Last week of classes (Mon/Wed/Fri lectures), each talk is 10 minutes
- Template (its use is entirely optional) and some guidelines (the pptx is for PowerPoint 2007 and later): presentation-template.pptx, presentation-template.pdf
- The order will be determined randomly right before class
- Who talks will be determined randomly right before class
Projects (each one has a supervisor shown in brackets):
1. Stefan B. & Tobias S.: Fast multigrid solver for biharmonic equation (D&G)
2. Rico H. & Donjan R.: Simplex (MP)
3. Ria F. & Timon G.: Fluid dynamics (D&G)
4. Alexandros K. & Grzegorz M.: Miniball (MP)
5. Julia P. & Pascal S. & Adrian B.: Domain transform for edge-aware image and video processing (D&G)
6. Fabian H. & Alex C.: Exposure fusion (D&G)
7. Carl-Anton I. & Xavier L.: Tri-cubic interpolation (D&G)
8. Severin W. & Lorenzo B.: Arithmetics of large numbers (MP)
9. Filippo A. & Patrick S.: Image denoising (D&G)
10. Sebastian K. & Andri S.: Quantum informational entropy minimization (MP)
11. Nedyalko P. & Denis P.: Support Vector Machine (MP)
12. Ivo S. & Tim G.: High-Order Methods For Basis Pursuit (D&G)
13. Philipp H.: Hierarchization on sparse grids (MP)
14. Pavol B. & Gadandeep S. & Vanya D.: Fast abstract domains (MP)
15. Simon L. & Stefan L. & Markus A.: Binary feature detector (D&G)
16. Alen S.: Program generator for filters (MP)
One-on-one meetings, I: 29.04.-03.05.
One-on-one meetings, II: 21.-22.05.

Tips & Tricks (From Students)

Apparently the only way to disable TurboBoost on OS X is via kernel extensions. I found one here: https://github.com/nanoant/DisableTurboBoost.kext which seems to work well.

Midterm

April 19th: 14:00-16:00 in HG D 3.2 (solution, without solution, appendix)

Homework

Homework 0: due as soon as possible
Homework 1: due Th, 07.03., 17:00 Solutions
Homework 2: due Th, 14.03., 17:00 Solutions
Homework 3: due Th, 21.03., 17:00 Solutions
Homework 4 (start project): check point Th, 28.03., 17:00
Homework 5: due Tues, 16.04., 17:00 Solutions
Homework 6 (continue project): check point one day before the meeting
Homework 7: due Th, 09.05., 17:00
Homework 8 (continue project): check point one day before meeting

Lectures (including pdfs)

Lecture	Date	Content	Slides	Notes	Other
1	18.02.	Course motivation, overview, organization	link
2	20.02.	Cost analysis, performance	link	link
3	25.02.	Architecture/Microarchitecture, operational intensity, Core 2/Core i7	link		Core 2/Core i7, Intel processor info
4	27.02.	Optimization for instruction level parallelism (ILP)	link
5	04.03.	Benchmarking, compiler limitations	link
6	06.03.	Memory hierarchy, locality, caches	link
7	11.03.	Caches, blocking MMM		link
8	13.03.	Caches, roofline model	link	link	roofline paper
9	18.03.	Roofline model, dense linear algebra, LAPACK, ATLAS	link
10	20.03.	ATLAS, MMM optimization: cache blocking		link	model-based ATLAS paper
11	25.03.	MMM optimization: register blocking, ILP
12	27.03.	Virtual memory and TLBs		link
13	08.04.	MMM optimization: virtual memory and TLBs, sparse linear algebra/MVM, OSKI	link	link
14	10.04.	Sparse MVM, OSKI			paper
	15.04.	No class (Sechseläuten)
15	17.04.	SIMD vector extensions	link
16	22.04.	SSE intrinsics			Intel intrinsics guide, Intel icc manual, Visual Studio reference
17	24.04.	SSE intrinsics, Compiler vectorization
	29.04.	No class (one-on-one meetings)
	01.05.	No class (holiday)
18	06.05.	Performance counters, linear transforms, discrete and fast Fourier transform	link, link	link
19	08.05.	Fast Fourier transform (FFT)
20	13.05.	FFT optimization, FFTW	link	link	FFTW
21	15.05.	Computer generation of FFT code: Spiral	link		Spiral
	20.05.	No class (holiday)
	22.05.	No class (one-on-one meetings)
	27.05.	Project presentations
	29.05.	Project presentations
	31.05.	Project presentations