CS:APP Notes

发表于 2023-02-12 更新于 2023-02-18 分类于 ICS ，笔记

Notes for CSAPP, textbook for ICS.

Preface

It's a long way to learn CS:APP. Notes can mistakenly miss something, and some basic info/unnecessary chapters(with stars before the title) would be skipped, so I would commend you readers to read the original book on your own. (I mean the original North American edition; ~~中文版虽然努力了，但是一言难尽~~)

Thanks give to ouuan with CS:APP notes in Chinese. (I write notes in English just because I have to...To prove it, you would find pics from Chinese edition below.)

*01 A Tour of Computer Systems

I would just skip this; if necessary I would fill the blank in.

02 Representing and Manipulating Information

Information Storage

Bit kept 0/1. For numbers count in binary

e.g. $(\mathrm{00011011111101010010})_2$, or $(\mathrm{114514})_{10}$ in decimal. Most times we hate binary expression for its length; So we use hexadecimal $(\mathrm{1BF52})_{16}$
Byte = 8 bits
Word Size = nominal size of integer-valued data and of addresses. Now most machines have 64-bit word size.

Word size changes bits used by different types in C.

Byte Ordering
- Big Endian: Read/Write from high bits to low bits. e.g. $\mathrm{00011011111101010010}\rightarrow (114514)_{10}$
- Little Endian: Read/Write from low bits to high bits. e.g. $\mathrm{01001010111111011000}\rightarrow(114514)_{10}$
Usually no effect on programming; But consideration need to be taken when:
- communication (e.g. Network)
- check original byte array (in machine-level programming),
- type conversions/casting, union, etc.
Bit-level Operations
- General Boolean Algebra
  - Intersection &(AND)
  - Union |(OR)
  - Symmetric difference ^(XOR)
  - Complement ~
  Notice: Contrast to logical operations (&&,||,!) which results are only 0/1.
- Shift Operations

Integer Representations

Encoding Integers

Warning

Be cautious when dealing with unsigned and signed. When both operated with operations, signed will be converted to unsigned, and errors can be produced like -1 > 0u.
Expanding
- Unsigned: fill high bits with 0s
- Signed: fill in sign bit
Truncating

Unsigned/signed: high bits are truncated. Similar to mod.

Integer Arithmetic

Addition
- Unsigned: $u + v = (u+v) \mod 2^w$
- Signed: $u+v = \mathrm{U2T}(\mathrm{T2U}(u)+\mathrm{T2U}(v))$
Extra: $-u = (u)+1 $
Multiplication

Do mod $2^w$ again. Do power-of-2 multiply/divide, use shift.

Floating Point

IEEE Representation: $(-1)^s\times M\times 2^E$
- sign bit($s$)
- exponent field($E$) [Can be negative, like signed integer.]
- fraction field($M$) = $?.f_{n-1}\dots f_1f_0$
32-bit(float): 8 bits for exponent, 23 bits for fraction

64-bit(double): 11 bits for exponent, 52 bits for fraction
- Normalized values (exponent not all 0s/1s)
  
  $E=1-\mathrm{Bias},M=1.f{n-1}\dots f_1f_0$
  
  Here $\mathrm{Bias}=2^{w-1}-1$.
- Denormalized values (exponent all 0s)
  
  $E=1-\mathrm{Bias},M=0.f{n-1}\dots f_1f_0$
- Special values (exponent all 1s)
  
  $\pm \infty$ if fraction all 0s; NaN otherwise.
Arithmetic
- Rounding Modes
  
  round-to-even(default), round-toward-zero, round-down, round-up.
- Operations
  
  Get exact result, then fixing(overflow if out of range, and round to fit precision)
- Conversions/Casting
  
  Overflow/Rounding can be occured.

*03 Machine-Level Representation of Programs

I would like to "borrow" notes from Ouuan for specific reason: it is said that we use RISC-V rather than x86-64 architecture in Tea Garden(TG in short afterwards).

*04 Processor Architecture

I would skip this two chapters for specific reason; if necessary I would fill the blank in.

05 Optimizing Program Performance

Optimize CPE(cycles per OP)

Basic Optimizations
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide (power-of-2)
- Share common subexpressions
- Procedure calls(e.g. strlen)
- Removing aliasing (introducing local variables rather than iteration on arrays)
Optimizations over modern CPU Design
- Make use of pipeline: parallel execution
  - Loop Unrolling (normally $\le$ 10 rows)
  - Reassociated Computation
- Branch Prediction
  
  Reload takes 20-30 clock cycles cost
- Locality(Chapter 06)
  
  Temporal locality: Recent accessed data are more likely to be accessed again
  
  Spatial locality: Data adjacent to last accessed data are more likely to be accessed
- Blocking(Chapter 06)
  
  e.g. Matrix multiplication.

06 The Memory Hierarchy

RAM/Random-Access Memory

Volatile memory: info lost once not powered.
- SRAM/Static RAM
  - Used as cache
  - Bistable. Once powered, keep values all the time.
- DRAM/Dynamic RAM
  - Used as main memory, frame buffer (for graphics system)
  - Cheaper & Slower than SRAM
  - Sensitive to electric noise
  DRAM Chip A two-dimensional array, each called supercell with $w$-DRAM units. Connected to memory controller, and packaged in the memory module.
  
  To read supercell $(i,j)$, memory controller send RAS(row access strobe) $i$ and CAS(column access strobe) $j$, then DRAM copy $i$-th row to the internal row buffer, and extract $j$-th bit with CAS.

ROM/Read-Only Memory

Nonvolatile memory: infos kept even not powered

Apps kept in ROM called firmware.(BIOS, disk drive controller, etc.)
- PROM/Programmable ROM: Can be programmed only once
- EPROM/Erasable Programmable ROM
  - Use physical way(Ultraviolet light) to erase
- EEPROM/Electrically Erasable PROM
  - Application: Flash Memory
  - Can be erased&programmed for $10^5$ times
Bus

Transfer data between CPU chip and DRAM main memory(procedure called bus transaction, including read & write).

System bus: linking CPU and I/O bridge(memory controller included)

Memory bus: linking I/O bridge and main memory

(So signals from two buses need to be translated by I/O bridge.)

I/O bus: linking I/O bridge with I/O devices, like USB(Universal Serial Bus) controller, SCSI/SATA adapter, host bus adapter, etc.
Disk

Consist of platter with two surfaces, covered with circular track. Tracks are divided into sector(usually 512 kB storage), with gaps to separate sectors (so easy to recognize them). Platters share one spindle, and for a certain $k$, all tracks with distance $k$ to spindle called a cylinder.

Factors to affect disk storage:
- recording density(bit/inch): size of bits track kept for unit length
- track density(track/inch): size of tracks on the area of a circle with unit radius from the center of a platter.
- areal density(bit/inch^2): product of recording density and track density
To improve storage, cylinder are divided into recording zones consisting of several adjacent cylinders, and all tracks have the same size of sectors.

To read/write: using read/write head. The head seek the targeted track and modify the bit.

Logical Disk Blocks indexed sectors to coordinates (surface, track, sector) using disk controller.

Disk Format Identify sectors and fill info into gaps; Identify surfaces with failure; Reserve surfaces for potential failure surfaces.

Access Disk
- Command with target logical disk block number and target main memory address send to disk controller from CPU
- Disk read sectors and transfer data to main memory, so-called DMA(Direct Memory Access). CPU can process other commands until DMA is complete.
- Disk send interrupt signal to CPU and CPU jumps to signal handler.
High cost for accessing disk, reading/writing on sectors.

access time = seek time + rotational latency + transfer time

SSD(Solid State Disk)
- Packaged flash memories with a flash translation layer to translate logical block number to address of flash memory.
- Each block(16kB~512kB) contains 32 to 128 pages(512 bytes~4 kB). Write on pages only after the whole block is erased.

Cache

We suppose there're $M=2^m$ addresses for the main memory with each saves a byte. Then its cache would divide into $S=2^s$ cache sets, each contains $E=2^e$ cache lines, where each line saves a block with size of $B=2^b$ bytes, a valid bit, and tag bits with length of $t=m-b-s$. So the size of the whole cache is $C=S\times E\times B$ bytes.
- How it works?
  
  Each address contains three parts: tag for the $t$ high bits, set index for $s$ bits in the middle, and block offset for the $b$ low bits. To have a cache hit, the machine with the set index find the corresponding cache set, and then find a cache line whose valid bit is true and the tag bits fit. Then with block offset the machine finds the targeted byte.
- Cache miss
  
  There would be cache miss as well when the case above is not satisfied. Once a cache miss happens, the $k$-level cache would gain the block containing the data from the $k+1$-level cache(the procedure is called placement policy). If the cache is full, the machine choose a victim block/line to evict based on a replacement policy, e.g. random replacement policy, least frequently used(LFU) and least recently used(LRU).
  - For an empty cache so-called cold cache, cold miss happens. (So it's a compulsory miss.)
  - Usually, the block which index is $i$ in the $k+1$-level cache would push into the $i \mod (S\times E)$-th block in the $k$-level cache. This causes conflict miss, e.g. CPU calls block $0,8,0,8,\dots$ of $k+1$-level cache, and conflict miss occurs in the $k$-level cache.
  - If the working set is larger than $C$, the cache would have capacity miss.
- Different types of cache
  - When $E=1$, it is called direct-mapped cache. (Capacity miss occurs.)
  - When $E>1$, it is called set associative cache.
    - For $s=0$, naming fully associative cache;
    - Otherwise, called E-way set associative cache.
- Write
  
  For write hits:
  - Write-through: write block to the lower(k to k+1) level immediately. Simple but bus flow produces.
  - Write-back: delay the update until the block is going to be evicted (just like lazy tag in the segment tree). Bus flow reduces but complexity increases; and an extra dirty bit is needed to help identify whether a block is modified.
  For write misses:
  - Write-allocate: Gain block from the lower level and process as write hits
  - No-write-allocate: Write straightly without writing into the cache.
- i-cache and d-cache
  
  i-cache only saves commands(so read-only), and d-cache only saves data, and unified cache saves both. Help optimize them separately, reduce conflict miss.
- Parameters
  - miss rate
  - hit rate = 1 - miss rate
  - hit time = time takes for transferring a word from cache to CPU
  - miss penalty 10s cycles for L1/L2, and 50 cycles for L3, and 200+ cycles for main memory.
  Effects
  - Larger the cache size is, higher the hit rate and hit time are.
  - Larger the size of block is, higher the hit rate is (programs with better spatial locality); But the lines decreases and the high rate may decrease (for programs with better temporal locality).
  - Larger the $E$ is, less the likelihood of conflict miss is; but hit time and miss penalty increases due to the complexity of comparing tags and choosing victim lines.
  - Different types of write policy. Usually write-back is better for lower-level caches.

07 Linking

How to compile?

graph LR
G["other .o"] -->|ld|E
A["source code(.c)"] -->|cpp| B["intermediate file(.i)"] -->|cc1| C["assembly code(.s)"] -->|as| D["relocatable object file(.o)"] -->|"ld"| E["executable object file"]
H["library(.a for archive)"] -->|ld|E

Here, cpp, cc1, as are translators, and ld is linker.

Object files: including relocatable object file, executable object file and shared object file(used in dynamic linking). Have different formats, e.g., Portable Executable (PE) for Windows and Mach-O for macOS, and Executable and Linkable Format(ELF) for x86-64 Linux/Unix.

Static libraries: A package of several object files. Better saving space, and less importing modules. To create one, use commands like ar rcs libabc.a a.o b.o c.o. When using it, use command like ./libabc.a or -L. -labc(import path to the linker).

ELF Format

Header/Section	Description
ELF header	Word size, byte ordering, file type(.o, exec, .so), machine type, etc.
Segment header table	Page size, virtual addresses memory segments(sections), segment sizes.
`.text`	Code
`.rodata`	Read only data e.g. jump tables
`.data`	Initialized global variables (as strong symbols. Procedures are strong as well)
`.bss`	Uninitialized global variables (as weak symbols). ("Block Storage Start" for full name, can be remembered as "Better saving spaces")
`.symtab`	Symbol table: procedure and static variable names, section names and locations
`.rel.text`	Relocation info for `.text` section: Addresses of instructions that will need to be modified in the executable and instructions for modifying (usually for global vars and external functions)
`.rel.data`	Relocation info for `.data` section: Addresses of pointer data that will need to be modified in the merged executable (usually for values of global vars and addresses for external functions)
`.debug`(gcc -g)	Info for symbolic debugging
`.line`(gcc -g)	Mapping machine code to line numbers of source C code.
`.strtab`(gcc -g)	String table: symbol table of `.symtab` and `.debug` and section name of header, ending with null.

ELF64 symbol

typedef struct
{
    int   name;      /* String table offset */
    char  type:4,    /* Function or data (4 bits) */
          binding:4; /* Local or global (4 bits) */
    char  reserved;  /* Unused */
    short section;   /* Section header index */
    long  value;     /* Section offset or absolute address */
    long  size;      /* Object size in bytes */
} Elf64_Symbol;

Here three pseudosections(for .o files) are tagged in section:

ABS: symbols should not be relocated
UNDEF: symbols undefined
COMMON: symbols referenced by multiple modules. When happened, value gives requirements for data alignment, and size gives minimum size.

Diffs between COMMON and .bss: COMMON lists uninitialized global vars, and .bss lists uninitialized static vars and global/static vars initialized with 0.

Static Linking
- Symbol resolution
  
  Linker associates each symbol reference(symbols in .o, like functions, global vars, static vars) with exactly one symbol definition.
  
  Types of symbol:
  - Global symbols
    
    locally defined and can be referenced by other modules; non-static functions and global vars in C, stored on the stack
  - External symbols
    
    externally defined: global vars with extern in C
  - Local symbols
    
    locally defined and can be accessed only by local: static functions and vars, stored in either .bss or .data
  // Linker's Symbol Rules for multiple definitions of global vars
  
  Rule 1: Multiple strong symbols are not allowed.
  
  Rule 2: Given a strong symbol and multiple weak ones, choose the strong one.
  
  Rule 3: Given multiple weak symbols, pick an arbitrary one. (for gcc -fcommon, if -fno-common, then trigger error)
  
  01 If multiple ones are not in COMMON, trigger error;
  
  02 If only one is not in COMMON, then choose it;
  
  03 If all in COMMON, choose the biggest one; if sizes are different, trigger warnings.
  
  (from Ouuan's Notes)
  
  So if using multiple weak symbols, there can be nightmares:
  1
  2
  3
  4
  5
  6
  7
  // a.c
  int x;
  int y;
  p1() {}
  // b.c
  double x;
  p2() {}
  Then writes to x in p2() can overwrite y. So, avoid global vars if you can; Otherwise, use static, initialize when defined, or use extern if referencing an external global var.
- Relocation
  
  Linker merges separate code and data sections into single sections, relocate symbols from relative locations in .o to absolute memory locations in the exec(0x0 as the start), and update symbol references to these symbols to reflect their new positions.