Floating point and Double Data Type

Whole numbers can be represented using the integer data types in C++. Real numbers, however, cannot be stored as integers because they have a fractional part after the decimal point.

The C++ programming language has the $floating \hspace{3px} point$ and $double-precision$ data types to represent real numbers.

A real number is declared using the keyword $float$ or $double$. The main difference between $float$ and $double$ is the size. A $float$ is typically $4$ bytes or $32$ bits, whereas a $double$ is typically $8$ bytes or $64$ bits. There is also a $long \hspace{3px} double$ type whose size depends on the compiler and platform. It is commonly $12$ bytes or $16$ bytes, and on some compilers it is the same size as $double$.

Data Types	Byte Size	Bit Size (1 byte = 8 bits)
float	4	32
double	8	64
long double	12 or 16	96 or 128
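The sizes in the table are typical rather than guaranteed, so it can help to check them on your own compiler. The following is a minimal sketch. The helper names `bit_size` and `size_row` are made up for this illustration and are not part of the standard library.

```cpp
#include <climits>   // CHAR_BIT: number of bits in one byte
#include <cstddef>   // std::size_t
#include <string>    // std::string, std::to_string

// Size of type T in bits on the current platform (hypothetical helper).
template <typename T>
constexpr std::size_t bit_size() {
    return sizeof(T) * CHAR_BIT;
}

// Builds one tab-separated row in the style of the table above:
// name, byte size, bit size (hypothetical helper).
template <typename T>
std::string size_row(const std::string& name) {
    return name + "\t" + std::to_string(sizeof(T)) + "\t" +
           std::to_string(bit_size<T>());
}
```

Calling `size_row<float>("float")`, `size_row<double>("double")` and `size_row<long double>("long double")` from `main` and printing the results reproduces the table for your platform.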

Computer Representation of Floating Point

Real numbers are represented in scientific notation (or exponential notation) because a fixed number of digits can then cover both very large and very small values, which keeps arithmetic on real values manageable.

The Exponential Notation

The exponential notation has two parts: a $mantissa$ and an $exponent$. The general form of a $floating \hspace{3px} point$ number in $exponential \hspace{3px} notation$ is shown below.

\begin{aligned}&\pm M \cdot 10^E\\ \\ &where \\ \\ &1 \leq M < 10\end{aligned}

For example, if you want to represent $20000$ in exponential notation, it becomes

\begin{aligned}&2.0 \cdot 10^4\\ \\&where \\ \\ &1 \leq 2.0 < 10\end{aligned}

If you want to represent $133$ in scientific notation, then

\begin{aligned}&1.33 \cdot 10^2\\ \\&where \\ \\&1 \leq 1.33 < 10\end{aligned}

The number $0.00005454$ can be represented as

\begin{aligned}&5.454 \cdot 10^{-5} \\ \\&where\\ \\&1 \leq 5.454 < 10\end{aligned}

The table below isolates the different parts of the examples given above.

Mantissa	Exponent	E-notation
2.0	4	2.0E4
1.33	2	1.33E2
5.454	-5	5.454E-5
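C++ accepts E-notation directly in source code, and the standard stream manipulator `std::scientific` prints a value in the same form. The sketch below is one way to try this out. The function name `to_e_notation` is made up for this illustration, and the exact look of the exponent (for example `e+04`) can vary slightly between standard library implementations.

```cpp
#include <iomanip>   // std::setprecision
#include <sstream>   // std::ostringstream
#include <string>    // std::string

// Formats value in E-notation with the given number of digits
// after the decimal point of the mantissa (hypothetical helper).
std::string to_e_notation(double value, int digits) {
    std::ostringstream out;
    out << std::scientific << std::setprecision(digits) << value;
    return out.str();
}

// E-notation literals: each one means mantissa * 10^exponent.
const double twenty_thousand = 2.0E4;    // 2.0   * 10^4  = 20000
const double small_value     = 5.454E-5; // 5.454 * 10^-5 = 0.00005454
```

For instance, `to_e_notation(twenty_thousand, 1)` gives the mantissa `2.0` followed by an exponent of `4`, matching the first row of the table.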

The above notation is suitable for humans, but the computer needs a binary representation of $floating \hspace{3px} point$ numbers, also in $exponential$ format.

As noted above, $4$ bytes or $32$ bits are required to store a floating point number in a computer. These bits are divided into $3$ parts: $1$ bit for the sign, $8$ bits for the exponent, and $23$ bits for the mantissa.

A sign bit of $0$ means a positive number and $1$ means a negative number.

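In the common IEEE 754 format, the $8-bit$ exponent field holds a value from $0$ to $255$ that is stored with a bias of $127$. The field values $0$ and $255$ are reserved for special cases (zero, subnormal numbers, infinity, and NaN), so normalized numbers have actual exponents from $-126$ to $127$.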

The computer representation of exponential notation is:

\begin{aligned}&(b_0. b_1 b_2 b_3 \cdots ) \cdot 2^E\\ \\&where \\ \\&b_{0} = 1\end{aligned}

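This representation is called the $normalized \hspace{3px} binary \hspace{3px} representation$. Because $b_0$ is always $1$ in this form, the common IEEE 754 format does not store it, and the $23$ mantissa bits hold only the bits after the binary point.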

Figure 1 – Computer Representation of Floating-Point Number
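To see the sign, exponent, and mantissa fields of a real value, the bits of a $float$ can be copied into an integer and separated with shifts and masks. The sketch below assumes that $float$ is a 32-bit IEEE 754 number, which is true on most platforms but is not guaranteed by C++, and that the exponent is stored with a bias of $127$. The names `FloatParts`, `split_float`, and `unbiased_exponent` are made up for this illustration.

```cpp
#include <cstdint>   // std::uint32_t
#include <cstring>   // std::memcpy

// The three fields of a 32-bit float (hypothetical helper type).
struct FloatParts {
    std::uint32_t sign;      // 1 bit: 0 = positive, 1 = negative
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits after the binary point
};

// Splits a float into its fields. Assumes IEEE 754 single precision.
FloatParts split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bits safely

    FloatParts parts;
    parts.sign     = bits >> 31;            // top bit
    parts.exponent = (bits >> 23) & 0xFFu;  // next 8 bits
    parts.mantissa = bits & 0x7FFFFFu;      // low 23 bits
    return parts;
}

// Actual power of two for a normalized number: stored value minus the bias.
int unbiased_exponent(const FloatParts& parts) {
    return static_cast<int>(parts.exponent) - 127;
}
```

As a check, $0.15625$ is $1.01 \cdot 2^{-3}$ in binary, so `split_float(0.15625f)` yields a sign of $0$, a stored exponent of $124$ (that is, $-3 + 127$), and a mantissa whose second-highest bit is set.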

Declaring Floating Type and Double Type

Declaring a floating type variable and a double type variable in a C++ program is similar.

float PI = 3.14f;      // the f suffix makes the literal a float
double radius = 5.33;  // a literal without a suffix is a double

Although $float$ and $double$ are represented in the same way in a computer, they differ in precision. A $double$ has more mantissa bits, so it can hold more significant digits. Typically a $float$ is reliable to about $7$ significant decimal digits, while a $double$ is reliable to about $15$.

\begin{aligned}&3.244440 \hspace{3px}(float)\\ \\&3.244440000000000 \hspace{3px}(double \hspace{3px} has \hspace{3px} a \hspace{3px} longer \hspace{3px} decimal \hspace{3px} part )\end{aligned}
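The difference in precision can be observed by storing the same value in both types and printing more digits than a $float$ can hold. The following is a minimal sketch, assuming the usual IEEE 754 formats. The function name `show_digits` is made up for this illustration, while `std::numeric_limits<T>::digits10` is the standard way to ask how many decimal digits a type can hold reliably.

```cpp
#include <iomanip>   // std::setprecision
#include <limits>    // std::numeric_limits
#include <sstream>   // std::ostringstream
#include <string>    // std::string

// Prints value with the requested number of significant digits
// (hypothetical helper).
template <typename T>
std::string show_digits(T value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}

// The same fraction stored in both types.
const float  third_f = 1.0f / 3.0f;
const double third_d = 1.0 / 3.0;

// Decimal digits each type can hold without loss on this platform.
const int float_digits  = std::numeric_limits<float>::digits10;
const int double_digits = std::numeric_limits<double>::digits10;
```

Comparing `show_digits(third_f, 15)` with `show_digits(third_d, 15)` shows the $float$ drifting away from $0.333\ldots$ after about seven digits, while the $double$ stays accurate through all fifteen.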