---
# vim: tw=80
layout: post
title: Understanding pointers
tags: [C, instructive]
---

I was recently chatting with a new contributor to Sway who is using the project
as a means of learning C, and he had some questions about what `void**` meant
when he found some in the code. It became apparent that he only has a basic
grasp of pointers at this point in his learning curve, and I figured it was time
for another blog post - so today, I'll explain pointers.

To understand pointers, you must first understand how memory works. Your RAM is
basically a flat array of
[octets](https://en.wikipedia.org/wiki/Octet_(computing)). Your compiler
describes every data structure you use as a series of octets. For the context of
this article, let's consider the following memory:

{:.table}
| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
| 0x00   | 0x00   | 0x00   | 0x00   | 0x08   | 0x42   | 0x00   | 0x00   |
|========|========|========|========|========|========|========|========|

We can refer to each element of this array by its index, or address. For
example, the value at address 0x0004 is 0x08. On this system, we're using 16-bit
addresses to refer to 8-bit values. On an i686 (32-bit) system, we use 32-bit
addresses to refer to 8-bit values. On an amd64 (64-bit) system, we use 64-bit
addresses to refer to 8-bit values. On Notch's imaginary DCPU-16 system, we use
16-bit addresses to refer to 16-bit values.

To refer to the value at 0x0004, we can use a pointer. Let's declare it like so:

```c
uint8_t *value = (uint8_t *)0x0004;
```

Here we're declaring a variable named `value`, whose type is `uint8_t*`. The `*`
indicates that it's a pointer. Now, because this is a 16-bit system, the size of
a pointer is 16 bits. If we do this:

```c
printf("%zu\n", sizeof(value)); // sizeof yields a size_t, so use %zu
```

It will print 2, because it takes 16 bits (or 2 bytes) to refer to an address on
this system, even though the value there is 8 bits. On your system it would
probably print 8, or maybe 4 if you're on a 32-bit system. We could also do this:

```c
uint16_t address = 0x0004;
uint8_t *ptr = (uint8_t *)address;
```

In this case we're not casting the `uint16_t` value 0x0004 to a `uint8_t`, which
would truncate the integer. No, instead, we're casting it to a `uint8_t*`, a
pointer type, which on this system is 16 bits wide no matter what it points to.
On this system, as on most modern ones, all data pointers are the same size.

## Dereferencing pointers

We can refer to the value at the other end of this pointer by *dereferencing* it.
The pointer is said to contain a *reference* to a value in memory. By
*dereferencing* it, we can obtain that value. For example:

```c
uint8_t *value = (uint8_t *)0x0004;
printf("%d\n", *value); // prints 8
```

## Working with multi-byte values

Even though memory is basically a big array of `uint8_t`, thankfully we can work
with other kinds of data structures inside of it. For example, say we wanted to
store the value 0x1234 in memory. This doesn't fit in 8 bits, so we need to
store it at two different addresses. For example, we could store it at 0x0006
and 0x0007:

{:.table}
| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
| 0x00   | 0x00   | 0x00   | 0x00   | 0x08   | 0x42   | 0x34   | 0x12   |
|========|========|========|========|========|========|========|========|

The byte at 0x0007 (0x12) is the most significant byte of the value, and the
byte at 0x0006 (0x34) is the least significant byte.

<div class="well">
    Why not the other way around? Well, most systems these days use the "little
    endian" notation for storing multi-byte integers in memory, which stores the
    least significant byte first. The least significant byte is the one with the
    smallest order of magnitude (in base sixteen). To get the final number, we
    use (0x12 * 0x100) + (0x34 * 0x1), which gives us 0x1234. Read more about
    endianness <a href="https://en.wikipedia.org/wiki/Endianness">here</a>.
</div>

C allows us to use pointers that refer to these sorts of composite values, like
so:

```c
uint16_t *value = (uint16_t *)0x0006;
printf("0x%X\n", *value); // Prints 0x1234
```

Here, we've declared a pointer to a value whose type is `uint16_t`. Note that the
size of this pointer is the same as the size of the `uint8_t*` pointer - 16 bits,
or two bytes. The value it *references*, though, has a different type than the
value a `uint8_t*` references.

## Indirect pointers

Here comes the crazy part - you can work with pointers to pointers. The address
held by the `uint16_t` pointer we've been talking about is 0x0006, right? Well,
we can store that number in memory as well. If we store it at 0x0002, our memory
looks like this:

{:.table}
| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
| 0x00   | 0x00   | 0x06   | 0x00   | 0x08   | 0x42   | 0x34   | 0x12   |
|========|========|========|========|========|========|========|========|

The question might then become, how do we get it out again? Well, we can use a
pointer *to that pointer*! Check out this code:

```c
uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
```

This code just declared a variable whose type is `uint16_t**`, which is a
pointer whose value is a `uint16_t*`, which itself points to a value that is a
`uint16_t`. Pretty cool, huh? We can dereference this too:

```c
uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
uint16_t *pointer = *pointer_to_a_pointer;
printf("0x%X\n", *pointer); // Prints 0x1234
```

We don't actually even need the intermediate variable. This works too:

```c
uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
printf("0x%X\n", **pointer_to_a_pointer); // Prints 0x1234
```

## Void pointers

The next question that would come up to your average C programmer would be,
"well, what is a `void*`?" Well, remember earlier when I said that all pointers,
regardless of the type of value they reference, are just fixed size integers?
In the imaginary system we've been talking about, pointers are 16-bit addresses,
or indexes, that refer to places in RAM. On the system you're reading this
article on, it's probably a 64-bit integer. Well, we don't actually need to
specify the type to be able to manipulate pointers if they're just a fixed size
integer - so we don't have to. A `void*` stores an arbitrary address without
bringing along any type information. You can later *cast* this variable to a
specific kind of pointer to dereference it. For example:

```c
void *pointer = (void*)0x0006;
uint8_t *uintptr = (uint8_t*)pointer;
printf("0x%X", *uintptr); // prints 0x34
```

Take a closer look at this code, and recall that 0x0006 refers to a 16-bit value
from the previous section. Here, though, we're treating it as an 8-bit value -
the `void*` contains no assumptions about what kind of data is there. The result
is that we end up treating it like an 8-bit integer, which ends up being the
least significant byte of 0x1234.

## Dereferencing structures

In C, we often work with structs. Let's describe one to play with:

```c
struct coordinates {
    uint16_t x, y;
    struct coordinates *next;
};
```

Our structure describes a linked list of coordinates. `x` and `y` are the
coordinates, and `next` is a pointer to the next set of coordinates in our list.
I'm going to drop one of these in memory, followed by the start of a second:

{:.table}
| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
| 0xAD   | 0xDE   | 0xEF   | 0xBE   | 0x06   | 0x00   | 0x34   | 0x12   |
|========|========|========|========|========|========|========|========|

Let's write some C code to reason about this memory with:

```c
struct coordinates *coords;
coords = (struct coordinates*)0x0000;
```

If we look at this structure in memory, you might already be able to pick out
the values. C is going to store the fields of this struct in order. So, we can
expect the following:

```c
printf("0x%X, 0x%X", coords->x, coords->y);
```

To print out "0xDEAD, 0xBEEF". Note that we're using the structure dereferencing
operator here, `->`. This allows us to dereference values *inside* of a
structure we have a pointer to. The other case is this:

```c
printf("0x%X, 0x%X", coords.x, coords.y);
```

Which only works if `coords` is the struct itself rather than a pointer to it.
We also have a pointer within this structure named `next`. You can see in the
memory I included above that its address is 0x0004 and its value is 0x0006 -
meaning that there's another `struct coordinates` that lives at 0x0006 in
memory. If you look there, you can see the first part of it. Its X coordinate is
0x1234.

## Pointer arithmetic

In C, we can use math on pointers. For example, we can do this:

```c
uint8_t *addr = (uint8_t*)0x1000;
addr++;
```

Which would make the value of `addr` 0x1001. But this is only true for pointers
whose type is 1 byte in size. Consider this:

```c
uint16_t *addr = (uint16_t*)0x1000;
addr++;
```

Here, `addr` becomes 0x1002! This is because ++ on a pointer actually adds
`sizeof(type)` to the actual address stored. The idea is that if we only added
one, we'd be referring to an address that is *in the middle* of a uint16_t,
rather than the next uint16_t in memory that we meant to refer to. This is also
how arrays work. The following two code snippets are equivalent:

```c
uint16_t *addr = (uint16_t*)0x1000;
printf("%d\n", *(addr + 1));
```

```c
uint16_t *addr = (uint16_t*)0x1000;
printf("%d\n", addr[1]);
```

## NULL pointers

Sometimes you need to work with a pointer that points to something that may not
exist yet, or a resource that has been freed. In this case, we use a NULL
pointer. In the examples you've seen so far, 0x0000 is a valid address. This is
just for simplicity's sake. In practice, pretty much no modern computer has
any reason to refer to the value at address 0. For that reason, we use NULL to
mark a pointer that doesn't point to anything. Dereferencing a NULL pointer is a
Bad Thing: it's undefined behavior, and in practice it usually leads to a
segfault. As a fun side effect, since NULL compares equal to 0, we can use it in
an if statement:

```c
void *ptr = ...;
if (ptr) {
    // ptr is not NULL
} else {
    // ptr is NULL
}
```

I hope you found this article useful! If you'd
like something fun to read next, read about ["three star
programmers"](http://c2.com/cgi/wiki?ThreeStarProgrammer), or programmers who
have variables like `void***`.