See also note on:

  • Concurrency

Architecture Course

arch vs microarch

Data types and sizes: addresses, packed vector data, floating point (Cray formats, Intel extended precision)

ISA encoding:

  • fixed length - MIPS, ARM
  • variable length - x86 (1-15 bytes)
  • compressed - Thumb
  • VLIW

Number of operands

reg-reg vs reg-mem

pipelining

Like a car assembly line: make instruction processing deep, with different stages. You sometimes need to stall the pipeline. Better throughput, worse latency. Appendix C covers the classic 5-stage RISC pipeline:

  • Instruction fetch
  • Instruction decode
  • Execute
  • Memory access
  • Write back
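A minimal sketch (my own, not from the course) of the throughput/latency trade-off in those five stages: with an ideal pipeline the first instruction still takes the full depth, but after that one instruction completes per cycle.

```python
# Idealized cycle counts for a 5-stage pipeline vs an unpipelined datapath,
# assuming no stalls. Stage names follow the list above.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipelined_cycles(n_instructions, depth=len(STAGES)):
    # Latency: the first instruction takes `depth` cycles.
    # Throughput: every later instruction completes one cycle apart.
    return depth + (n_instructions - 1)

def unpipelined_cycles(n_instructions, depth=len(STAGES)):
    # Without pipelining, each instruction occupies the whole datapath.
    return depth * n_instructions

print(pipelined_cycles(100))    # 104
print(unpipelined_cycles(100))  # 500
```

This also shows why latency gets *worse*: a single instruction still pays all 5 stages (plus, in reality, the latch overhead between them), while throughput approaches one per cycle.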

Hazards

Structural - you only have so many units that can do a certain thing. Data. Interrupts are an interesting hazard you can forget about, because they are so implicit.

Control hazards - mitigated by branch prediction

superscalar

Make the processor wide, with multiple pipelines. Parallelism is decided at runtime: the hardware checks for data dependencies.

Out of order execution

Scoreboard. Is this where the idea of the register file really comes into play? To remember dependencies.
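A minimal sketch (my own illustration, not from the course) of the dependency checks a scoreboard performs between an earlier instruction `a` and a later instruction `b`. Each instruction is represented as `(dest_reg, [src_regs])`; register names are made up.

```python
# Sketch of scoreboard-style dependency checking between two instructions.
# Instructions are (dest_reg, [src_regs]) tuples; register names are invented.

def hazards(a, b):
    a_dst, a_srcs = a
    b_dst, b_srcs = b
    found = []
    if a_dst in b_srcs:
        found.append("RAW")   # true dependence: b reads what a writes
    if b_dst in a_srcs:
        found.append("WAR")   # antidependence: b overwrites what a reads
    if a_dst == b_dst:
        found.append("WAW")   # output dependence: both write the same reg
    return found

# add r1, r2, r3  followed by  sub r4, r1, r5  ->  RAW hazard on r1
print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))  # ['RAW']
```

A real scoreboard tracks this per functional unit and per register, but the three checks are exactly these.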

VLIW

Pushes scheduling to compile time; removes lots of complicated scheduling circuitry. Software pipelining vs loop unrolling: does software pipelining remove the loop head? No, software pipelining reorders the statements, perhaps after a loop unroll, to give dependencies space. We can also sort of reorganize the association of loop variables to statements.
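A sketch of the contrast (my own illustration, assuming a body of three dependent ops load → add → store): unrolling just repeats the body, while software pipelining's steady-state kernel mixes ops from different iterations so each dependent pair gets distance.

```python
# Contrast loop unrolling with software pipelining for a dependent
# load -> add -> store body. Prologue/epilogue of the pipelined loop omitted.

OPS = ["load", "add", "store"]

def unrolled(n):
    # Unrolling repeats the body; dependent ops stay adjacent.
    return [(op, i) for i in range(n) for op in OPS]

def software_pipelined(n):
    # Steady-state kernel: each "cycle" issues store(i), add(i+1), load(i+2),
    # i.e. every op in the kernel comes from a different iteration.
    return [[("store", i), ("add", i + 1), ("load", i + 2)]
            for i in range(n - 2)]

print(unrolled(2))
print(software_pipelined(4))
```

This is the "reorganize the association of loop variable to statements" point: in the kernel, the statement `add` is bound to iteration `i + 1`, not `i`.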

Trace scheduling

Equal scheduling model - when does an operation finish? Exactly at its latency. Less-equal scheduling - can't pack something into less than the window anymore. The looser the spec, the less ahead-of-time reasoning we can do.

compressed instructions

Predication. Partial predication - conditional move instructions, for example. Full predication avoids small chunks of branched code. When is computing both sides of a branch worth it? When the sides are balanced, or the longer side is frequently executed. The IT instructions in ARM.
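A source-level sketch (my own, not from the course) of what predication buys: compute both sides, then select with a predicate, so there is no branch in the dataflow; roughly what a conditional move does.

```python
# Branchy vs predicated absolute value. The predicated version always
# computes both sides and selects, mirroring a conditional move.

def branchy_abs(x):
    if x < 0:          # a control hazard: the hardware must predict this
        return -x
    return x

def predicated_abs(x):
    neg = -x                        # "then" side, always computed
    pos = x                         # "else" side, always computed
    p = x < 0                       # the predicate
    return p * neg + (not p) * pos  # select; no branch in the dataflow

print(predicated_abs(-7), predicated_abs(3))  # 7 3
```

The trade-off from the note is visible: you always pay for both sides, which only wins when the sides are small or the branch is hard to predict.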

ALAT (advanced load address table). Code hammock - a single-entry, single-exit branched region.

Appendix H: dependence analysis. The array indices are what loop dependence analysis reasons about. I wonder if this is what LLVM wants Presburger arithmetic for.

  • True dependence - read after write
  • Antidependence - write after read
  • Output dependence - write after write

Use z3 for analysis

Loop-carried dependence. For dependences going across iterations we need to know the order: i <= i' for a dependence.
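A sketch of the classic GCD test from dependence analysis (Appendix H territory; my own illustration): accesses `a[c1*i + d1]` and `a[c2*i' + d2]` can only touch the same element if gcd(c1, c2) divides d2 - d1. It is a necessary condition only, i.e. it can say "maybe"; an exact answer needs the kind of integer reasoning z3 or Presburger arithmetic provides.

```python
# GCD test for loop dependence between a[c1*i + d1] and a[c2*i' + d2]:
# a dependence is possible only if gcd(c1, c2) divides d2 - d1.
from math import gcd

def gcd_test(c1, d1, c2, d2):
    g = gcd(c1, c2)
    return (d2 - d1) % g == 0  # True means a dependence is *possible*

# a[2*i] vs a[2*i + 1]: even vs odd indices never overlap.
print(gcd_test(2, 0, 2, 1))  # False
# a[2*i] vs a[4*i + 2]: gcd is 2, which divides 2, so maybe.
print(gcd_test(2, 0, 4, 2))  # True
```

Note the test ignores loop bounds entirely, which is exactly why it is conservative.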

Antidependence

branch prediction

Static prediction. A coin toss, or just assuming fallthrough, may work 50% of the time. An alternative to delay slots (2 delay slots can be hard to fill). Backwards jumps occur commonly in loops and are more often taken than not taken. Hint flags in instructions - can be profiled to do better, maybe 80% accuracy.

1-bit predictor - did you most recently jump or not. n-bit history predictor - keep a table indexed by jump address. The predictor is trained; a table is built. Time and spatial correlation. Combining methods (boosting) gets you to ~99%, which you need for deep pipelines.
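A sketch of one of the predictors above: a 2-bit saturating counter per branch address (counter 0-1 predicts not-taken, 2-3 predicts taken; each outcome nudges the counter by one). The branch address and outcome stream are made up.

```python
# 2-bit saturating-counter branch predictor, indexed by branch address.

class TwoBitPredictor:
    def __init__(self):
        self.table = {}  # branch address -> counter in 0..3

    def predict(self, addr):
        return self.table.get(addr, 1) >= 2  # True means "taken"

    def update(self, addr, taken):
        c = self.table.get(addr, 1)
        self.table[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then exits
hits = 0
for taken in outcomes:
    hits += p.predict(0x400) == taken
    p.update(0x400, taken)
print(hits, "/", len(outcomes))   # 8 / 10
```

The saturation is the point: one loop exit only drops the counter to "weakly taken", so the next run of the loop starts with a correct prediction, unlike a 1-bit predictor.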

predicting address of indirect jumps

EPIC

Cache

  • associative cache
  • cache coherence
  • victim cache
  • write buffer
  • multi-level cache
  • prefetching
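A sketch to make the associativity bullet concrete (my own illustration; the parameters — sets, ways, line size — are made up): a tiny set-associative cache with LRU replacement.

```python
# Tiny set-associative cache with LRU replacement. An OrderedDict per set
# keeps tags in recency order: front = least recently used.
from collections import OrderedDict

class Cache:
    def __init__(self, n_sets=4, ways=2, line=16):
        self.n_sets, self.ways, self.line = n_sets, ways, line
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        line_addr = addr // self.line
        idx, tag = line_addr % self.n_sets, line_addr // self.n_sets
        s = self.sets[idx]
        if tag in s:
            s.move_to_end(tag)        # hit: refresh LRU position
            return True
        if len(s) >= self.ways:       # miss on a full set: evict LRU
            s.popitem(last=False)
        s[tag] = None                  # fill the line
        return False

c = Cache()
# addrs 0, 64, 128 all map to set 0; with 2 ways, 128 evicts something.
print([c.access(a) for a in [0, 0, 64, 0, 128, 64]])
# [False, True, False, True, False, False]
```

A direct-mapped cache is the `ways=1` case, where 0 and 64 would keep evicting each other (conflict misses); a victim cache is one patch for exactly that.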

SMT - simultaneous multi threading

Intel hyperthreading. TLBleed: programs sharing a hyperthreaded core can side-channel info to each other.

Misc

[dan luu - what CPUs have since the 80s](https://danluu.com/new-cpu-features/)

Dhrystone - benchmark suite. Patterson kind of rails on it.

Delay slots - something that makes more sense when you're aware of the microarchitecture.

  • https://arxiv.org/pdf/1911.03282.pdf - nanoBench
  • https://developer.amd.com/amd-uprof/ - AMD uProf
  • https://people.freebsd.org/~lstewart/articles/cpumemory.pdf - What Every Programmer Should Know About Memory

  • https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/lecture-notes/ - MIT 6.823 lecture notes
  • https://www.youtube.com/c/OnurMutluLectures/playlists - Onur Mutlu lectures, courses

Should I do Gem5, Verilog, VHDL, other?

Communication Locality in Computation: Software, Chip Multiprocessors and Brains - Greenfield thesis

Visual computer architecture emulator for RISC-V