Skip to content

Commit

Permalink
Finish chapter 2
Browse files Browse the repository at this point in the history
  • Loading branch information
cjmlgrto committed Nov 7, 2017
1 parent 49689e0 commit d92b2d0
Show file tree
Hide file tree
Showing 2 changed files with 91 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ If you see an issue, just [submit one](https://github.com/cjmlgrto/fit3139-notes
## Contents

1. [Errors and Approximation](https://github.com/cjmlgrto/fit3139-notes/blob/master/notes/01-errors_and_approximation.md)
2. Computer Arithmetic
2. [Computer Arithmetic](https://github.com/cjmlgrto/fit3139-notes/blob/master/notes/02-computer_arithmetic.md)
3. Matrices
4. Eigenvalues & Eigenvectors
5. Dynamical Systems
Expand Down
90 changes: 90 additions & 0 deletions notes/02-computer_arithmetic.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
[← Return to Index](https://github.com/cjmlgrto/fit3139-notes/)

# Computer Arithmetic

## Floating Point Numbers

* In a digital computer, real numbers are represented using floating points
* e.g. `-0.0007396`
* _Scientific notation_: `7.396 × 10^-4`
* _Sign_: `-`
* _Mantissa_: `7.396`
* _Exponent_: `-4`
* _Base_: decimal (base-10)
* Any floating-point number `x` has the form:
* `x = ±( [d_0÷b^0] + [d_1÷b^1] + [d_2÷b^2] + ... + [d_(p-1)÷b^(p-1)] ) × b^E
* `b` is the base or _radix_
* `p` is the precision
* `E ∈ [L,U]` is an exponent
* Note that this is the general form for an _unnormalized_ floating point representation

## Normalization

* A floating-point system is said to be **normalized** if the leading digit `d_0` is always nonzero (unless the represented number is 0)
* Thus, in a normalized system, the mantissa `m` always satisfies `1 ≤ m ≤ base`
* Floating-point systems are usually normalized because
* The representation of each number is then unique
* No digits are wasted on leading zeroes, therby maximising precision
* In binary (base = 2) system, the leading bit is always 1, thus need not be stored

## Properties of Floating-Point Systems

* The number of normalized floating-point numbers in a given floating-point system is:
* `2(b-1)b^{p-1}(U - L + 1) + 1`
* The smallest positive normalized floating-point number is:
* `Underflow level = UFL = b^L`
* The largest positive normalized floating-point number is:
* `Overflow level = OFL = (1-b^{-p})b^{U+1}`

### Single Precision (32-bit)

| 1 bit | 8 bits | 23 bits |
| --- | --- | --- |
| Sign (S) | Exponent (E) | Mantissa or Fraction (F) |
| 31 | 30 ... 23 | 22 ... 0 |

* Decimal value of floating point
* `(-1)^S × 2^{E_10 - 127} × (F)_10`
* `+0`
* `S = 0`, `E = 0`, `F = 0`
* `-0`
* `S - 1`, `E = 0`, `F = 0`
* `+∞`
* `S = 0`, `E = 255_10`, `F = 0`
* `-∞`
* `S = 1`, `E = 255_10`, `F = 0`
* `NaN`
* `E = 255_10`, `S × F < 0 or > 0`

### Double Precision (64-bit)

| 1 bit | 11 bits | 52 bits |
| --- | --- | --- |
| Sign (S) | Exponent (E) | Mantissa or Fraction (F) |
| 63 | 62 ... 52 | 51 ... 0 |

* Decimal value of floating point
* `(-1)^S × 2^{E_10 - 1023} × (F)_10`
* `+0`
* `S = 0`, `E = 0`, `F = 0`
* `-0`
* `S - 1`, `E = 0`, `F = 0`
* `+∞`
* `S = 0`, `E = 2047_10`, `F = 0`
* `-∞`
* `S = 1`, `E = 2047_10`, `F = 0`
* `NaN`
* `E = 2047_10`, `S × F < 0 or > 0`

## Rounding

* **Rounding by Chopping** — base-b expansion is truncated after the (precision - 1)th digit
* **Rounding to Nearest** — real number is rounded to the nearest floating-point number

## Machine Epsilon

* **Machine Epsilon** gives an _upper bound on the relative error_ due to rounding in floating-point arithmetic
* Rounding by chopping: `ε_mach = b^{1 - p}`
* Rounding to nearest: `ε_mach = 0.5b^{1 - p}`
* The unit roundoff is important because it bounds the relative error in representing the real number `x`
* `ε_mach ≥ |(inexact x - exact x) ÷ exact x|`

0 comments on commit d92b2d0

Please sign in to comment.