- Add explicitly handling of A_WIDTH=1 for completeness
- mux2 uses one LUT3 instead of a hard mux (which did use LUTs anyway)
- mux4 uses one LUT4 instead of hard muxes (which did use LUTs anyway)
- mux8 uses only bottom half of a slice
- Add a mux12 for intermediate variant between mux8 and mux16
- For sizes larger than 16 inputs, instantiate the right mux size
- More comments about implementation choices
- More tests including with -widemux and -abc9, and more comments