
How to Write Fast Numerical Code 263-2300 (ETH, CS)
Basic Information
 Course number: 263-2300, 6 credits
 Spring 2017, lectures: M 10:15–12:00, HG D3.2; Th 9:15–10:00, CAB G51; occasional substitute lectures: W 13:15–15:00, HG D3.2
 Instructor: Markus Püschel (CAB H69.3, pueschel at inf, 27303)
TAs:
 Alen Stojanov (CAB H81.2, astojanov at inf)
 Georg Ofenbeck (CAB H65, ofgeorg at inf)
 Gagandeep Singh (CAB H66, gsingh at inf)
 Only for project supervision: Daniele Spampinato (CAB H65, daniele.spampinato at inf)
 Office hours:
 Alen Stojanov: M 3–4pm, CAB H81.2
 Georg Ofenbeck: T 3–4pm, CAB H65
 Gagandeep Singh: W 3–4pm, CAB H66
 Markus Püschel: Th 3–4pm, CAB H69.3
 Mailing lists:
 For technical questions: fastcode@lists.inf.ethz.ch (emails to this address go to the lecturer and all TAs)
 Forum to find a project partner: fastcodeforum@lists.inf.ethz.ch (emails go to all students who have no partner yet and to Alen & Daniele)
Course Description
The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly harder to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.
This interdisciplinary course aims to give the student an understanding of performance and introduces foundations and state-of-the-art techniques in high performance software development, using important functionality such as linear algebra algorithms, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.
Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.
Prerequisites: solid C programming skills, matrix algebra, Master student or above
Topics Covered
 Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
 Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
 Compilers: strengths, limitations, how to use
 Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
 Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
 Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others
Goals of this Course
 Obtain an understanding of runtime performance and how to reason about it
 Learn a guideline for writing fast numerical code and apply it in homeworks and your research project
 Understand the connection between algorithms, implementations, and computer architecture
Background Material
Academic Integrity
All homeworks in this course are single-student homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone, including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure no one can read your files. The university policies on academic integrity will be applied rigorously.
We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).
It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.
Grading
 40% research project
 Topic: Very fast, ideally adaptive implementation of a numerical problem
 Team up in pairs
 March 6: find a partner, find a problem or I give you one (tip: look at the prior courses linked above for examples)
 Complete "milestones" during the semester and enter them into the online checklist
 Write a 6-page standard conference paper (template will be provided)
 Give a short presentation at the end of the semester
 25% midterm
 35% homework
 Exercises on algorithms analysis
 Implementation exercises
 study the effect of program optimizations, compilers, special instructions, etc.
 write and submit C code & create runtime/performance plots
 Some templates will be provided
 All homeworks are single-student homeworks
 There is no final exam
Research Project
 All projects have to be registered at https://medellin.inf.ethz.ch/courses/2632300ETH/. This site is also used later for updates.
 How it works:
 Weeks without homeworks should be used to work on the project
 You select a numerical problem and create a correct (verified) implementation in C
 You determine the arithmetic cost, measure the runtime and performance
 You profile the implementation to find the parts in which most of the runtime is spent
 Focusing on these, you apply various optimization techniques from this class
 You repeat the previous steps to create various versions with (hopefully) continuously better runtime
 You write a paper about your work and give a presentation
 Paper:
 Maximum 6 pages (hard limit), conference style, template and instructions below
 Everybody reads this: report.pdf
 For latex use: report.zip (start with reading the README file)
 For Word (discouraged) use this: reportword.doc
 Due date: Friday, June 16 (as finalreport.pdf in your svn)
 Presentation
 Last week of classes
 Template (use is entirely optional) and some guidelines (ppt requires PowerPoint 2007 or later): presentationtemplate.pptx , presentationtemplate.pdf
 The order will be determined randomly right before class
 Who talks will be determined randomly right before class
 Projects (each one has a supervisor shown in brackets):
 Eliza W, Hui Z, Jingwei T, Yiqing Z: Ant-inspired edge detection
 Alberto M, Andreas B, Marc R F, Marko P: t-Distributed stochastic neighbor embedding
 Gaurav P, Jonathan M, Luca A, Marc J F: Marching cubes
 Alexey K, Jonathan M, Jonathan R: Locality sensitive hashing
 David H, Magdalena K, Pirmin V, Stephanie C: PatchMatch algorithm
 Anton P, Jinank J, Manuel R, Milan P: Fractal compression
 Marcin D, Marius F, Patrick S, Thomas S: Fast ray tracing for TSDFs
 Ankit S, Jingxuan H, Lidia CdF, Gokula S: Binary convolutional neural network
 Andreas H, David S, Simon F, Till E: A robust descriptor for line matching
 Benjamin G, Cristine C, Frédéric L, Saurav S: Latent Dirichlet Allocation
 David S, Hasan H, Kaan K, Konstantin T: Ray tracing
 Fabio L, Henri R, Mahamadou B, Roffler C: Medial axis transform
 Ladislas JdN, Luca C, Matteo T, Yeyao Z: Quantized neural networks
 Aaker OE, Bian W, Spyridon M: Matrix multiplication over GF(2)
 Lars B, Michal W, Samuel M: Nonlinearly coupled elliptic BVPs
 Julien L, Linus H, Nilis D, Stefano P: GP-UCB
 Acharya D, Li C, Victor C, Yuchen T: Online dictionary learning for sparse coding
Tips & Tricks
 Disabling Turbo Boost on OS X: (no guarantees for the below)
 AVX/SSE transition penalties
 On Broadwell, the Intel intrinsics guide seems to have a mistake in the latency/throughput of mul and div. Agner Fog's tables seem correct.
Midterm
26.4., 13:15–15:00
Homework
Late policy: No deadline extensions, but you have 3 late days. You can use at most 2 on one homework. For example, submitting 7 hours late costs one late day.
We will be using Moodle for the homeworks.
Lectures (including pdfs)
| Lecture | Date | Content | Slides | Notes | Other |
|---|---|---|---|---|---|
| 1 | M 20.02. | Course motivation, overview, organization | link | | |
| 2 | Th 23.02. | Cost analysis and performance | link | | |
| 3 | M 27.02. | Intel Haswell architecture and microarchitecture, memory- and compute-bound | link | | Intel Haswell, Intel software optimization manual, Agner Fog's instruction tables |
| 4 | W 01.03. | Instruction-level parallelism, compiler limitations | link, link | | |
| 5 | M 06.03. | Benchmarking, SIMD (SSE, AVX) overview | link | | |
| 6 | M 13.03. | SIMD (SSE, AVX) intrinsics | link | | Intel intrinsics guide |
| 7 | Th 16.03. | SIMD (SSE, AVX) | | | |
| 8 | W 22.03. | Locality, caches | link | | |
| 9 | Th 23.03. | Caches, analysis of blocked MMM | | | |
| 10 | M 27.03. | | | | roofline paper |
| 11 | Th 30.03. | | | | |