Markus Püschel
Professor
Computer Science
ETH Zürich
Switzerland

How to Write Fast Numerical Code 263-2300 (ETH, CS)

Basic Information

  • Course number: 263-2300, 6 credits
  • Spring 2012, lectures: M 10:00-12:00, IFW C33; W 13:00-14:00, RZ F21; occasional substitute lectures: F 14:00-16:00, RZ F21
  • Instructor: Markus Püschel (RZ H18, pueschel at inf, 2-7303)
  • TAs:
    • Georg Ofenbeck (RZ H1.1, ofgeorg at inf, 2-8414)
    • Daniele Spampinato (RZ H1.1, daniele.spampinato at inf, 2-8414)
  • Admin: Franziska Mäder (RZ H14, maeder at inf, 2-7311)
  • Office hours:
    • Markus Püschel: Tues 14:00-15:00
    • Daniele Spampinato: Wed 9:00 - 10:00
    • Georg Ofenbeck: Wed 14:00 - 15:00
  • Mailing lists:
    • Forum to find project partner: fastcode-forum@lists.inf.ethz.ch (emails go to all students who have no partner yet and to Georg)
    • For technical questions: fastcode@lists.inf.ethz.ch (emails to this address go to the lecturer and all TAs)

Course Description

The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly hard to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.

This interdisciplinary course aims to give the student an understanding of performance and introduces foundations and state-of-the-art techniques in high performance software development using important functionality such as linear algebra algorithms, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.

Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.

Prerequisites: solid C programming skills, matrix algebra, Master's student or above

Topics Covered

  • Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
  • Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
  • Compilers: strengths, limitations, how to use
  • Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
  • Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
  • State-of-the-art research in Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others

Goals of this Course

  • Obtain an understanding of runtime performance and how to reason about it
  • Learn a guideline for writing fast numerical code and apply it in homeworks and your research project
  • Understand the connection between algorithms, implementations, and computer architecture

Background Material

Academic Integrity

All homeworks in this course are single-student homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure no one can read your files. The university policies on academic integrity will be applied rigorously.

We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).

It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.

Grading

  • 40% research project
    • Topic: Very fast, ideally adaptive implementation of a numerical problem
    • Team up in pairs
    • March 7: find a partner, find a problem or I give you one (tip: look at the prior courses linked above for examples)
    • Show "milestones" during semester
    • Write 4 page standard conference paper (template will be provided)
    • Give short presentation end of semester
  • 20% midterm
    • Algorithm/implementation analysis
    • Memory hierarchy
    • other
  • 40% homework
    • Exercises on algorithm analysis
    • Implementation exercises. Purpose: study the effect of program optimizations, compilers, special instructions, etc. Tasks: writing and submitting C code & creating runtime/performance plots
    • Some templates will be provided
    • All homeworks are single-student homeworks

There is no final exam.


Research Project

Deadline for code and paper: Friday, June 8th

  • How it works:
    • Weeks without homeworks should be used to work on the project
    • You select a numerical problem and create a correct (verified) implementation in C
    • You determine the arithmetic cost, measure the runtime and performance
    • You profile the implementation to find the parts in which most of the runtime is spent
    • Focusing on these, you apply various optimization techniques from this class
    • You repeat the previous steps to create various versions with (hopefully) continuously better runtime
    • You write a paper about your work and give a presentation
  • Paper:
    • At most 6 pages (hard limit), conference style; template and instructions below
    • Everybody reads this: report.pdf
    • For latex use: report.zip (start with reading the README file)
    • For Word (discouraged) use this: report-word.doc
  • Presentation
    • Last week of classes (Wed and Fr lecture), each talk is 10 minutes
    • Template and guidelines (the .pptx requires PowerPoint 2007 or later; using the template is entirely optional): presentation-template.pptx , presentation-template.pdf
    • The order of the talks, and who of each pair presents, will be determined randomly right before class
  • Projects (each one has a supervisor shown in brackets):
    1. Marios P. & Amit B.: Optimizing KD-Tree build time for Ray-Tracing (DS & GO)
    2. Anders M. & Jörn S.: Fast multipole method (DS & GO)
    3. Martin E. & Thomas B.: Shape from silhouettes (DS & GO)
    4. Oleksiy S. & Mathias D.: Optimal binary search trees (MP)
    5. Severin K. & Thomas A.: k-means clustering (DS & GO)
    6. Claudia K. & Changil K.: MLS surfaces from point clouds (DS & GO)
    7. Mattis P.: Variant of LASSO in convex optimization (MP)
    8. Raoul D. & Florian B.: Lighting coefficients from environment maps (MP)
    9. Simon E.: Light graffiti (DS & GO)
    10. Olga D. & Alec J.: Active set solver for quadratic programming (MP)
    11. Lukas B. & Danfeng Q.: Approximated Nearest Neighbor Search by Product Quantization (MP)
    12. Janosch N.: Jacobian Evaluations in Sparse Non-Linear Least Squares Solvers (MP)
    13. Raheem A. & Christian C.: Wavelet-based compression (MP)
  • One-on-one meetings, I: May 2nd and 3rd
  • One-on-one meetings, II: May 22nd and 23rd

Midterm

April 27th, 14:00 - 16:00 (solution, without solution)

Homework

Lectures (including pdfs)

Lecture | Date | Content | Slides | Notes | Other
1 | 20.02. | Course motivation, overview, organization | link | |
2 | 22.02. | Problem, algorithm, asymptotic versus cost analysis | link | link |
3 | 27.02. | Architecture/microarchitecture, Core 2/Core i7 | link | link | Core 2/Core i7, Intel processor info
4 | 29.02. | Optimizing for instruction level parallelism (ILP) | link | |
5 | 05.03. | Benchmarking, compiler limitations | link | | Benchmarking
6 | 07.03. | Locality, operational intensity, memory/compute bound | link | |
7 | 12.03. | Caches | link | link |
8 | 14.03. | Caches continued, roofline model, memory/compute bound | link | link | Roofline paper
9 | 19.03. | Roofline continued, linear algebra, BLAS, ATLAS: optimizing MMM | link | link | Model-based ATLAS paper
10 | 21.03. | Optimizing MMM continued | | |
11 | 26.03. | Optimizing MMM continued, dependencies, register renaming | link | |
12 | 28.03. | Optimization related to virtual memory, TLBs | | link |
13 | 02.04. | Memory bound computation, sparse linear algebra, sparse MVM, OSKI | link | |
14 | 04.04. | Sparse MVM, OSKI continued, other optimization ideas | | link |
15 | 18.04. | SIMD vector extensions, SSE | link | | icc manual, Visual Studio manual
16 | 20.04. | SSE continued | | |
17 | 23.04. | SSE continued, compiler vectorization | | link |
– | 27.04. | Midterm Exam | | |
18 | 30.04. | Linear transforms and fast algorithms, DFT and FFT | link | link |
– | 02./03.05. | One-on-one meetings | | |
19 | 07.05. | Cooley-Tukey FFT, DFT complexity | link | link |
20 | 14.05. | Adaptive library FFTW, FFTW codelet generator | link | link | FFTW website
21 | 21.05. | Spiral: Program synthesis for linear transforms | link | | Spiral website, Spiral intro
– | 22./23.05. | One-on-one meetings | | |
– | 30.05. | Project presentations | | |
– | 01.06. | Project presentations | | |