CS:APP Notes
Preface
It's a long way to learn CS:APP. Notes can mistakenly miss something,
and some basic info/unnecessary chapters(with stars before the title)
would be skipped, so I would commend you readers to read the original
book on your own. (I mean the original North American edition;
中文版虽然努力了,但是一言难尽)
Thanks give to ouuan with CS:APP notes in Chinese. (I write notes in English just because I have to...To prove it, you would find pics from Chinese edition below.)
*01 A Tour of Computer Systems
I would just skip this; if necessary I would fill the blank in.
02 Representing and Manipulating Information
Information Storage
Bit kept 0/1. For numbers count in binary
e.g. \((\mathrm{00011011111101010010})_2\), or \((\mathrm{114514})_{10}\) in decimal. Most times we hate binary expression for its length; So we use hexadecimal \((\mathrm{1BF52})_{16}\)
Byte = 8 bits
Word Size = nominal size of integer-valued data and of addresses. Now most machines have 64-bit word size.
Byte Ordering
- Big Endian: Read/Write from high bits to low bits. e.g. \(\mathrm{00011011111101010010}\rightarrow (114514)_{10}\)
- Little Endian: Read/Write from low bits to high bits. e.g. \(\mathrm{01001010111111011000}\rightarrow(114514)_{10}\)
Usually no effect on programming; But consideration need to be taken when:
- communication (e.g. Network)
- check original byte array (in machine-level programming),
- type conversions/casting, union, etc.
Bit-level Operations
General Boolean Algebra
- Intersection &(AND)
- Union |(OR)
- Symmetric difference ^(XOR)
- Complement ~
Notice: Contrast to logical operations (&&,||,!) which results are only 0/1.
Shift Operations
Integer Representations
Encoding Integers
Warning
Be cautious when dealing with unsigned and signed. When both operated with operations, signed will be converted to unsigned, and errors can be produced like
-1 > 0u
.Expanding
- Unsigned: fill high bits with 0s
- Signed: fill in sign bit
Truncating
Unsigned/signed: high bits are truncated. Similar to mod.
Integer Arithmetic
Addition
Unsigned: \(u + v = (u+v) \mod 2^w\)
Signed: \(u+v = \mathrm{U2T}(\mathrm{T2U}(u)+\mathrm{T2U}(v))\)
Extra: $-u = (u)+1 $
Multiplication
Do mod \(2^w\) again. Do power-of-2 multiply/divide, use shift.
Floating Point
IEEE Representation: \((-1)^s\times M\times 2^E\)
- sign bit(\(s\))
- exponent field(\(E\)) [Can be negative, like signed integer.]
- fraction field(\(M\)) = \(?.f_{n-1}\dots f_1f_0\)
32-bit(float): 8 bits for exponent, 23 bits for fraction
64-bit(double): 11 bits for exponent, 52 bits for fraction
Normalized values (exponent not all 0s/1s)
\(E=1-\mathrm{Bias},M=1.f{n-1}\dots f_1f_0\)
Here \(\mathrm{Bias}=2^{w-1}-1\).
Denormalized values (exponent all 0s)
\(E=1-\mathrm{Bias},M=0.f{n-1}\dots f_1f_0\)
Special values (exponent all 1s)
\(\pm \infty\) if fraction all 0s; NaN otherwise.
Arithmetic
Rounding Modes
round-to-even(default), round-toward-zero, round-down, round-up.
Operations
Get exact result, then fixing(overflow if out of range, and round to fit precision)
Conversions/Casting
Overflow/Rounding can be occured.
*03 Machine-Level Representation of Programs
I would like to "borrow" notes from Ouuan for specific reason: it is said that we use RISC-V rather than x86-64 architecture in Tea Garden(TG in short afterwards).
*04 Processor Architecture
I would skip this two chapters for specific reason; if necessary I would fill the blank in.
05 Optimizing Program Performance
Optimize CPE(cycles per OP)
Basic Optimizations
Replace costly operation with simpler one
Shift, add instead of multiply or divide (power-of-2)
Share common subexpressions
Procedure calls(e.g.
strlen
)Removing aliasing (introducing local variables rather than iteration on arrays)
Optimizations over modern CPU Design
Make use of pipeline: parallel execution
- Loop Unrolling (normally \(\le\) 10 rows)
- Reassociated Computation
Branch Prediction
Reload takes 20-30 clock cycles cost
Locality(Chapter 06)
Temporal locality: Recent accessed data are more likely to be accessed again
Spatial locality: Data adjacent to last accessed data are more likely to be accessed
Blocking(Chapter 06)
e.g. Matrix multiplication.
06 The Memory Hierarchy
RAM/Random-Access Memory
Volatile memory: info lost once not powered.
SRAM/Static RAM
- Used as cache
- Bistable. Once powered, keep values all the time.
DRAM/Dynamic RAM
- Used as main memory, frame buffer (for graphics system)
- Cheaper & Slower than SRAM
- Sensitive to electric noise
DRAM Chip A two-dimensional array, each called supercell with \(w\)-DRAM units. Connected to memory controller, and packaged in the memory module.
To read supercell \((i,j)\), memory controller send RAS(row access strobe) \(i\) and CAS(column access strobe) \(j\), then DRAM copy \(i\)-th row to the internal row buffer, and extract \(j\)-th bit with CAS.
ROM/Read-Only Memory
Nonvolatile memory: infos kept even not powered
Apps kept in ROM called firmware.(BIOS, disk drive controller, etc.)
- PROM/Programmable ROM: Can be programmed only once
- EPROM/Erasable Programmable ROM
- Use physical way(Ultraviolet light) to erase
- EEPROM/Electrically Erasable PROM
- Application: Flash Memory
- Can be erased&programmed for \(10^5\) times
Bus
Transfer data between CPU chip and DRAM main memory(procedure called bus transaction, including read & write).
System bus: linking CPU and I/O bridge(memory controller included)
Memory bus: linking I/O bridge and main memory
(So signals from two buses need to be translated by I/O bridge.)
I/O bus: linking I/O bridge with I/O devices, like USB(Universal Serial Bus) controller, SCSI/SATA adapter, host bus adapter, etc.
Disk
Consist of platter with two surfaces, covered with circular track. Tracks are divided into sector(usually 512 kB storage), with gaps to separate sectors (so easy to recognize them). Platters share one spindle, and for a certain \(k\), all tracks with distance \(k\) to spindle called a cylinder.
Factors to affect disk storage:
- recording density(bit/inch): size of bits track kept for unit length
- track density(track/inch): size of tracks on the area of a circle with unit radius from the center of a platter.
- areal density(bit/inch^2): product of recording density and track density
To improve storage, cylinder are divided into recording zones consisting of several adjacent cylinders, and all tracks have the same size of sectors.
To read/write: using read/write head. The head seek the targeted track and modify the bit.
Logical Disk Blocks indexed sectors to coordinates (surface, track, sector) using disk controller.
Disk Format Identify sectors and fill info into gaps; Identify surfaces with failure; Reserve surfaces for potential failure surfaces.
Access Disk
- Command with target logical disk block number and target main memory address send to disk controller from CPU
- Disk read sectors and transfer data to main memory, so-called DMA(Direct Memory Access). CPU can process other commands until DMA is complete.
- Disk send interrupt signal to CPU and CPU jumps to signal handler.
High cost for accessing disk, reading/writing on sectors.
access time = seek time + rotational latency + transfer time
SSD(Solid State Disk)
- Packaged flash memories with a flash translation layer to translate logical block number to address of flash memory.
- Each block(16kB~512kB) contains 32 to 128 pages(512 bytes~4 kB). Write on pages only after the whole block is erased.
Cache
We suppose there're \(M=2^m\) addresses for the main memory with each saves a byte. Then its cache would divide into \(S=2^s\) cache sets, each contains \(E=2^e\) cache lines, where each line saves a block with size of \(B=2^b\) bytes, a valid bit, and tag bits with length of \(t=m-b-s\). So the size of the whole cache is \(C=S\times E\times B\) bytes.
How it works?
Each address contains three parts: tag for the \(t\) high bits, set index for \(s\) bits in the middle, and block offset for the \(b\) low bits. To have a cache hit, the machine with the set index find the corresponding cache set, and then find a cache line whose valid bit is true and the tag bits fit. Then with block offset the machine finds the targeted byte.
Cache miss
There would be cache miss as well when the case above is not satisfied. Once a cache miss happens, the \(k\)-level cache would gain the block containing the data from the \(k+1\)-level cache(the procedure is called placement policy). If the cache is full, the machine choose a victim block/line to evict based on a replacement policy, e.g. random replacement policy, least frequently used(LFU) and least recently used(LRU).
- For an empty cache so-called cold cache, cold miss happens. (So it's a compulsory miss.)
- Usually, the block which index is \(i\) in the \(k+1\)-level cache would push into the \(i \mod (S\times E)\)-th block in the \(k\)-level cache. This causes conflict miss, e.g. CPU calls block \(0,8,0,8,\dots\) of \(k+1\)-level cache, and conflict miss occurs in the \(k\)-level cache.
- If the working set is larger than \(C\), the cache would have capacity miss.
Different types of cache
- When \(E=1\), it is called direct-mapped cache. (Capacity miss occurs.)
- When \(E>1\), it is called
set associative cache.
- For \(s=0\), naming fully associative cache;
- Otherwise, called E-way set associative cache.
Write
For write hits:
- Write-through: write block to the lower(k to k+1) level immediately. Simple but bus flow produces.
- Write-back: delay the update until the block is going to be evicted (just like lazy tag in the segment tree). Bus flow reduces but complexity increases; and an extra dirty bit is needed to help identify whether a block is modified.
For write misses:
- Write-allocate: Gain block from the lower level and process as write hits
- No-write-allocate: Write straightly without writing into the cache.
i-cache and d-cache
i-cache only saves commands(so read-only), and d-cache only saves data, and unified cache saves both. Help optimize them separately, reduce conflict miss.
Parameters
- miss rate
- hit rate = 1 - miss rate
- hit time = time takes for transferring a word from cache to CPU
- miss penalty 10s cycles for L1/L2, and 50 cycles for L3, and 200+ cycles for main memory.
Effects
- Larger the cache size is, higher the hit rate and hit time are.
- Larger the size of block is, higher the hit rate is (programs with better spatial locality); But the lines decreases and the high rate may decrease (for programs with better temporal locality).
- Larger the \(E\) is, less the likelihood of conflict miss is; but hit time and miss penalty increases due to the complexity of comparing tags and choosing victim lines.
- Different types of write policy. Usually write-back is better for lower-level caches.
07 Linking
How to compile?
1
2
3
4graph LR
G["other .o"] -->|ld|E
A["source code(.c)"] -->|cpp| B["intermediate file(.i)"] -->|cc1| C["assembly code(.s)"] -->|as| D["relocatable object file(.o)"] -->|"ld"| E["executable object file"]
H["library(.a for archive)"] -->|ld|EHere,
cpp
,cc1
,as
are translators, andld
is linker.Object files: including relocatable object file, executable object file and shared object file(used in dynamic linking). Have different formats, e.g., Portable Executable (PE) for Windows and Mach-O for macOS, and Executable and Linkable Format(ELF) for x86-64 Linux/Unix.
Static libraries: A package of several object files. Better saving space, and less importing modules. To create one, use commands like
ar rcs libabc.a a.o b.o c.o
. When using it, use command like./libabc.a
or-L. -labc
(import path to the linker).ELF Format
Header/Section Description ELF header Word size, byte ordering, file type(.o, exec, .so), machine type, etc. Segment header table Page size, virtual addresses memory segments(sections), segment sizes. .text
Code .rodata
Read only data e.g. jump tables .data
Initialized global variables (as strong symbols. Procedures are strong as well) .bss
Uninitialized global variables (as weak symbols). ("Block Storage Start" for full name, can be remembered as "Better saving spaces") .symtab
Symbol table: procedure and static variable names, section names and locations .rel.text
Relocation info for .text
section: Addresses of instructions that will need to be modified in the executable and instructions for modifying (usually for global vars and external functions).rel.data
Relocation info for .data
section: Addresses of pointer data that will need to be modified in the merged executable (usually for values of global vars and addresses for external functions).debug
(gcc -g)Info for symbolic debugging .line
(gcc -g)Mapping machine code to line numbers of source C code. .strtab
(gcc -g)String table: symbol table of .symtab
and.debug
and section name of header, ending with null.ELF64 symbol
1
2
3
4
5
6
7
8
9
10typedef struct
{
int name; /* String table offset */
char type:4, /* Function or data (4 bits) */
binding:4; /* Local or global (4 bits) */
char reserved; /* Unused */
short section; /* Section header index */
long value; /* Section offset or absolute address */
long size; /* Object size in bytes */
} Elf64_Symbol;Here three pseudosections(for
.o
files) are tagged insection
:ABS: symbols should not be relocated
UNDEF: symbols undefined
COMMON: symbols referenced by multiple modules. When happened,
value
gives requirements for data alignment, andsize
gives minimum size.Diffs between COMMON and
.bss
: COMMON lists uninitialized global vars, and.bss
lists uninitialized static vars and global/static vars initialized with 0.
Static Linking
Symbol resolution
Linker associates each symbol reference(symbols in .o, like functions, global vars, static vars) with exactly one symbol definition.
Types of symbol:
Global symbols
locally defined and can be referenced by other modules; non-
static
functions and global vars in C, stored on the stackExternal symbols
externally defined: global vars with
extern
in CLocal symbols
locally defined and can be accessed only by local:
static
functions and vars, stored in either.bss
or.data
// Linker's Symbol Rules for multiple definitions of global vars
Rule 1: Multiple strong symbols are not allowed.
Rule 2: Given a strong symbol and multiple weak ones, choose the strong one.
Rule 3: Given multiple weak symbols, pick an arbitrary one. (for
gcc -fcommon
, if-fno-common
, then trigger error)01 If multiple ones are not in COMMON, trigger error;
02 If only one is not in COMMON, then choose it;
03 If all in COMMON, choose the biggest one; if sizes are different, trigger warnings.
(from Ouuan's Notes)
So if using multiple weak symbols, there can be nightmares:
1
2
3
4
5
6
7// a.c
int x;
int y;
p1() {}
// b.c
double x;
p2() {}Then writes to
x
inp2()
can overwritey
. So, avoid global vars if you can; Otherwise, usestatic
, initialize when defined, or useextern
if referencing an external global var.Relocation
Linker merges separate code and data sections into single sections, relocate symbols from relative locations in
.o
to absolute memory locations in the exec(0x0 as the start), and update symbol references to these symbols to reflect their new positions.