author | Drew DeVault <sir@cmpwn.com> | 2016-05-28 11:10:21 -0400 |
---|---|---|
committer | Drew DeVault <sir@cmpwn.com> | 2016-05-28 11:10:21 -0400 |
commit | 09cdb4cdc23c629e78d76155051c14eab14745d1 (patch) | |
tree | 9e19c6b7e7ad6beb8f82081577c9aa0a05246759 /_posts/2016-05-28-Understanding-pointers.md | |
parent | 418c97aabdd2104f63a8360ae284cda5da679ff4 (diff) |
Add blog post about pointers
Diffstat (limited to '_posts/2016-05-28-Understanding-pointers.md')
-rw-r--r-- | _posts/2016-05-28-Understanding-pointers.md | 278 |
1 files changed, 278 insertions, 0 deletions
diff --git a/_posts/2016-05-28-Understanding-pointers.md b/_posts/2016-05-28-Understanding-pointers.md
new file mode 100644
index 0000000..f66a431
--- /dev/null
+++ b/_posts/2016-05-28-Understanding-pointers.md
@@ -0,0 +1,278 @@
+---
+# vim: tw=80
+layout: post
+title: Understanding pointers
+---
+
+I was recently chatting with a new contributor to Sway who is using the project
+as a means of learning C, and he had some questions about what `void**` meant
+when he found some in the code. It became apparent that this guy only has a
+basic grasp on pointers at this point in his learning curve, and I figured it
+was time for another blog post - so today, I'll explain pointers.
+
+To understand pointers, you must first understand how memory works. Your RAM is
+basically a flat array of
+[octets](https://en.wikipedia.org/wiki/Octet_(computing)). Your compiler
+describes every data structure you use as a series of octets. For the context of
+this article, let's consider the following memory:
+
+{:.table}
+| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
+|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
+| 0x00   | 0x00   | 0x00   | 0x00   | 0x08   | 0x42   | 0x00   | 0x00   |
+|========|========|========|========|========|========|========|========|
+
+We can refer to each element of this array by its index, or address. For
+example, the value at address 0x0004 is 0x08. On this system, we're using 16-bit
+addresses to refer to 8-bit values. On an i686 (32-bit) system, we use 32-bit
+addresses to refer to 8-bit values. On an amd64 (64-bit) system, we use 64-bit
+addresses to refer to 8-bit values. On Notch's imaginary DCPU-16 system, we use
+16-bit addresses to refer to 16-bit values.
+
+To refer to the value at 0x0004, we can use a pointer. Let's declare it like so:
+
+```c
+uint8_t *value = (uint8_t *)0x0004;
+```
+
+Here we're declaring a variable named value, whose type is `uint8_t*`. The *
+indicates that it's a pointer.
+Now, because this is a 16-bit system, the size of
+a pointer is 16 bits. If we do this:
+
+```c
+printf("%zu\n", sizeof(value)); // sizeof yields a size_t, so use %zu
+```
+
+It will print 2, because it takes 16 bits (or 2 bytes) to store an address on
+this system, even though the value there is 8 bits. On your system it would
+probably print 8, or maybe 4 if you're on a 32-bit system. We could also do this:
+
+```c
+uint16_t address = 0x0004;
+uint8_t *ptr = (uint8_t *)address;
+```
+
+In this case we're not casting the `uint16_t` value 0x0004 to a `uint8_t`, which
+would truncate the integer. No, instead, we're casting it to a `uint8_t*`, which
+is the size required to represent a pointer on this system. On this system, all
+pointers are the same size.
+
+## Dereferencing pointers
+
+We can refer to the value at the other end of this pointer by *dereferencing* it.
+The pointer is said to contain a *reference* to a value in memory. By
+*dereferencing* it, we can obtain that value. For example:
+
+```c
+uint8_t *value = (uint8_t *)0x0004;
+printf("%d\n", *value); // prints 8
+```
+
+## Working with multi-byte values
+
+Even though memory is basically a big array of `uint8_t`, thankfully we can work
+with other kinds of data structures inside of it. For example, say we wanted to
+store the value 0x1234 in memory. This doesn't fit in 8 bits, so we need to
+store it at two different addresses. For example, we could store it at 0x0006
+and 0x0007:
+
+{:.table}
+| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
+|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
+| 0x00   | 0x00   | 0x00   | 0x00   | 0x08   | 0x42   | 0x34   | 0x12   |
+|========|========|========|========|========|========|========|========|
+
+*0x0007 holds the most significant byte of the value (0x12), and *0x0006 holds
+the least significant byte (0x34).
+
+<div class="well">
+  Why not the other way around?
+  Well, most systems these days use the "little
+  endian" notation for storing multi-byte integers in memory, which stores the
+  least significant byte first. The least significant byte is the one with the
+  smallest order of magnitude (in base sixteen). To get the final number, we
+  use (0x12 * 0x100) + (0x34 * 0x1), which gives us 0x1234. Read more about
+  endianness <a href="https://en.wikipedia.org/wiki/Endianness">here</a>.
+</div>
+
+C allows us to use pointers that refer to these sorts of composite values, like
+so:
+
+```c
+uint16_t *value = (uint16_t *)0x0006;
+printf("0x%X\n", *value); // Prints 0x1234
+```
+
+Here, we've declared a pointer to a value whose type is `uint16_t`. Note that the
+size of this pointer is the same as the size of the `uint8_t*` pointer - 16 bits,
+or two bytes. The value it *references*, though, is a different type than
+`uint8_t*` references.
+
+## Indirect pointers
+
+Here comes the crazy part - you can work with pointers to pointers. The address
+stored in the `uint16_t` pointer we've been talking about is 0x0006, right? Well,
+we can store that number in memory as well. If we store it at 0x0002, our memory
+looks like this:
+
+{:.table}
+| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
+|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
+| 0x00   | 0x00   | 0x06   | 0x00   | 0x08   | 0x42   | 0x34   | 0x12   |
+|========|========|========|========|========|========|========|========|
+
+The question might then become, how do we get it out again? Well, we can use a
+pointer *to that pointer*! Check out this code:
+
+```c
+uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
+```
+
+This code just declared a variable whose type is `uint16_t**`, which is a pointer
+whose value is a `uint16_t*`, which itself points to a value that is a
+`uint16_t`. Pretty cool, huh?
+We can dereference this too:
+
+```c
+uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
+uint16_t *pointer = *pointer_to_a_pointer;
+printf("0x%X\n", *pointer); // Prints 0x1234
+```
+
+We don't actually even need the intermediate variable. This works too:
+
+```c
+uint16_t **pointer_to_a_pointer = (uint16_t**)0x0002;
+printf("0x%X\n", **pointer_to_a_pointer); // Prints 0x1234
+```
+
+## Void pointers
+
+The next question that would occur to your average C programmer would be,
+"well, what is a `void*`?" Well, remember earlier when I said that all pointers,
+regardless of the type of value they reference, are just fixed-size integers?
+In the imaginary system we've been talking about, pointers are 16-bit addresses,
+or indexes, that refer to places in RAM. On the system you're reading this
+article on, it's probably a 64-bit integer. Well, we don't actually need to
+specify the type to be able to manipulate pointers if they're just a fixed-size
+integer - so we don't have to. A `void*` stores an arbitrary address without
+bringing along any type information. You can later *cast* this variable to a
+specific kind of pointer to dereference it. For example:
+
+```c
+void *pointer = (void*)0x0006;
+uint8_t *uintptr = (uint8_t*)pointer;
+printf("0x%X", *uintptr); // prints 0x34
+```
+
+Take a closer look at this code, and recall that 0x0006 refers to a 16-bit value
+from the previous section. Here, though, we're treating it as an 8-bit value -
+the `void*` contains no assumptions about what kind of data is there. The result
+is that we end up treating it like an 8-bit integer, which ends up being the
+least significant byte of 0x1234.
+
+## Dereferencing structures
+
+In C, we often work with structs. Let's describe one to play with:
+
+```c
+struct coordinates {
+    uint16_t x, y;
+    struct coordinates *next;
+};
+```
+
+Our structure describes a linked list of coordinates.
+X and Y are the
+coordinates, and next is a pointer to the next set of coordinates in our list.
+I'm going to drop two of these in memory:
+
+{:.table}
+| 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
+|:-------|:-------|:-------|:-------|:-------|:-------|:-------|:-------|
+| 0xAD   | 0xDE   | 0xEF   | 0xBE   | 0x06   | 0x00   | 0x34   | 0x12   |
+|========|========|========|========|========|========|========|========|
+
+Let's write some C code to reason about this memory with:
+
+```c
+struct coordinates *coords;
+coords = (struct coordinates*)0x0000;
+```
+
+If we look at this structure in memory, you might already be able to pick out
+the values. C is going to store the fields of this struct in order. So, we can
+expect the following:
+
+```c
+printf("0x%X, 0x%X", coords->x, coords->y);
+```
+
+To print out "0xDEAD, 0xBEEF". Note that we're using the structure dereferencing
+operator here, `->`. This allows us to dereference values *inside* of a
+structure we have a pointer to. The other case is this:
+
+```c
+printf("0x%X, 0x%X", coords.x, coords.y);
+```
+
+Which only works if `coords` is not a pointer. We also have a pointer within
+this structure named next. You can see in the memory I included above that its
+address is 0x0004 and its value is 0x0006 - meaning that there's another `struct
+coordinates` that lives at 0x0006 in memory. If you look there, you can see the
+first part of it. Its X coordinate is 0x1234.
+
+## Pointer arithmetic
+
+In C, we can use math on pointers. For example, we can do this:
+
+```c
+uint8_t *addr = (uint8_t*)0x1000;
+addr++;
+```
+
+Which would make the value of `addr` 0x1001. But this is only true for pointers
+whose type is 1 byte in size. Consider this:
+
+```c
+uint16_t *addr = (uint16_t*)0x1000;
+addr++;
+```
+
+Here, `addr` becomes 0x1002! This is because ++ on a pointer actually adds
+`sizeof(type)` to the actual address stored.
+The idea is that if we only added
+one, we'd be referring to an address that is *in the middle* of a uint16_t,
+rather than the next uint16_t in memory that we meant to refer to. This is also
+how arrays work. The following two code snippets are equivalent:
+
+```c
+uint16_t *addr = (uint16_t*)0x1000;
+printf("%d\n", *(addr + 1));
+```
+
+```c
+uint16_t *addr = (uint16_t*)0x1000;
+printf("%d\n", addr[1]);
+```
+
+## NULL pointers
+
+Sometimes you need to work with a pointer that points to something that may not
+exist yet, or a resource that has been freed. In this case, we use a NULL
+pointer. In the examples you've seen so far, 0x0000 is a valid address. This is
+just for simplicity's sake. In practice, pretty much no modern computer has
+any reason to refer to the value at address 0. For that reason, we use NULL for
+a pointer that doesn't point to anything. Dereferencing a NULL pointer is
+generally a Bad Thing and will usually lead to a segfault. As a fun side effect,
+since NULL compares equal to 0, we can use it in an if statement:
+
+```c
+void *ptr = ...;
+if (ptr) {
+    // ptr is not NULL
+} else {
+    // ptr is NULL
+}
+```
+
+I hope you found this article useful! If you'd
+like something fun to read next, read about ["three star
+programmers"](http://c2.com/cgi/wiki?ThreeStarProgrammer), or programmers who
+have variables like `void***`.
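
The snippets in the post cast fixed addresses such as 0x0004 to pointers, which only makes sense on the imaginary 16-bit machine and will not run on a hosted system. As a rough sketch, the same ideas can be tried out by treating a small byte array as that machine's memory and using array indexes as "addresses". The names `mem`, `read_u8`, `read_u16`, `read_u16_indirect` and `u16_step_bytes` are made up for this illustration and do not appear in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the post's 8-byte memory (the "indirect
 * pointers" table): the array index plays the role of the address. */
enum { MEM_SIZE = 8 };
static const uint8_t mem[MEM_SIZE] = {
    0x00, 0x00, 0x06, 0x00, 0x08, 0x42, 0x34, 0x12
};

/* Read one octet, like dereferencing a uint8_t* holding addr. */
uint8_t read_u8(uint16_t addr)
{
    assert(addr < MEM_SIZE);
    return mem[addr];
}

/* Read a little-endian 16-bit value, like dereferencing a uint16_t*:
 * the byte at addr is the least significant one. */
uint16_t read_u16(uint16_t addr)
{
    assert((size_t)addr + 2 <= MEM_SIZE);
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Follow a simulated uint16_t**: the 16-bit value stored at addr is
 * itself an address, which is then read as a uint16_t. */
uint16_t read_u16_indirect(uint16_t addr)
{
    uint16_t inner = read_u16(addr);
    return read_u16(inner);
}

/* Real pointer arithmetic: how many bytes does p + 1 advance
 * when p is a uint16_t*? */
size_t u16_step_bytes(void)
{
    uint16_t pair[2] = {0, 0};
    uint16_t *p = pair;
    return (size_t)((uint8_t *)(p + 1) - (uint8_t *)p);
}
```

Under these assumptions, `read_u8(0x0004)` gives 0x08, `read_u16(0x0006)` and `read_u16_indirect(0x0002)` both give 0x1234, `read_u8(0x0006)` gives 0x34 (the `void*` example), and `u16_step_bytes()` gives 2. Assembling the 16-bit value by hand keeps the sketch independent of the host's endianness and alignment rules, which a direct cast of `&mem[6]` to `uint16_t*` would not be.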