summaryrefslogtreecommitdiffstats
path: root/_posts/2016-04-12-How-to-write-a-better-bloom-filter-in-C.md
blob: 5a9c637647e846e1ec50e9b31824b956b8021958 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
# vim tw=80
title: How to write a better bloom filter in C
layout: post
tags: [C, instructive]
---

This is in response to
[How to write a bloom filter in C++](http://blog.michaelschmatz.com/2016/04/11/how-to-write-a-bloom-filter-cpp/),
which has good intentions, but is ultimately a less than ideal bloom filter
implementation. I put together a better one in C in a few minutes, and I'll
explain the advantages of it.

The important differences are:

* You bring your own hashing functions
* You can add arbitrary data types, not just bytes
* It uses bits directly instead of relying on the `std::vector<bool>`
    being space effecient

I chose C because (1) I prefer it over C++ and (2) I just think it's a better
choice for implementing low level data types, and C++ is better used in high
level code.

I'm not going to explain the mechanics of a bloom filter or most of the details
of why the code looks this way, since I think the original post did a fine job
of that. I'll just present my alternate implementation:

## Header

{% highlight C %}
#ifndef _BLOOM_H
#define _BLOOM_H
#include <stddef.h>
#include <stdbool.h>

typedef unsigned int (*hash_function)(const void *data);
typedef struct bloom_filter * bloom_t;

/* Creates a new bloom filter with no hash functions and size * 8 bits. */
bloom_t bloom_create(size_t size);
/* Frees a bloom filter. */
void bloom_free(bloom_t filter);
/* Adds a hashing function to the bloom filter. You should add all of the
 * functions you intend to use before you add any items. */
void bloom_add_hash(bloom_t filter, hash_function func);
/* Adds an item to the bloom filter. */
void bloom_add(bloom_t filter, const void *item);
/* Tests if an item is in the bloom filter.
 *
 * Returns false if the item has definitely not been added before. Returns true
 * if the item was probably added before. */
bool bloom_test(bloom_t filter, const void *item);

#endif
{% endhighlight %}

## Implementation

The implementation of this is pretty straightfoward. First, here's the actual
structs behind the opaque bloom_t type:

{% highlight C %}
struct bloom_hash {
    hash_function func;
    struct bloom_hash *next;
};

struct bloom_filter {
    struct bloom_hash *func;
    void *bits;
    size_t size;
};
{% endhighlight %}

The hash functions are a linked list, but this isn't important. You can make
that anything you want. Otherwise we have a bit of memory called "bits" and the
size of it. Now, for the easy functions:

{% highlight C %}
bloom_t bloom_create(size_t size) {
	bloom_t res = calloc(1, sizeof(struct bloom_filter));
	res->size = size;
	res->bits = malloc(size);
	return res;
}

void bloom_free(bloom_t filter) {
	if (filter) {
		while (filter->func) {
			struct bloom_hash *h;
			filter->func = h->next;
			free(h);
		}
		free(filter->bits);
		free(filter);
	}
}
{% endhighlight %}

These should be fairly self explanatory. The first interesting function is here:

{% highlight C %}
void bloom_add_hash(bloom_t filter, hash_function func) {
	struct bloom_hash *h = calloc(1, sizeof(struct bloom_hash));
	h->func = func;
	struct bloom_hash *last = filter->func;
	while (last && last->next) {
		last = last->next;
	}
	if (last) {
		last->next = h;
	} else {
		filter->func = h;
	}
}
{% endhighlight %}

Given a hashing function from the user, this just adds it to our linked list of
hash functions. There's a slightly different code path if we're adding the first
function. The functions so far don't really do anything specific to bloom
filters. The first one that does is this:

{% highlight C %}
void bloom_add(bloom_t filter, const void *item) {
	struct bloom_hash *h = filter->func;
	uint8_t *bits = filter->bits;
	while (h) {
		unsigned int hash = h->func(item);
		hash %= filter->size * 8;
		bits[hash / 8] |= 1 << hash % 8;
		h = h->next;
	}
}
{% endhighlight %}

This iterates over each of the hash functions the user has provided and computes
the hash of the data for that function (modulo the size of our bloom filter),
then it adds this to the bloom filter with this line:

{% highlight C %}
bits[hash / 8] |= 1 << hash % 8;
{% endhighlight %}

This just sets the nth bit of the filter where n is the hash. Finally, we have
the test function:

{% highlight C %}
bool bloom_test(bloom_t filter, const void *item) {
	struct bloom_hash *h = filter->func;
	uint8_t *bits = filter->bits;
	while (h) {
		unsigned int hash = h->func(item);
		hash %= filter->size * 8;
		if (!(bits[hash / 8] & 1 << hash % 8)) {
			return false;
		}
		h = h->next;
	}
	return true;
}
{% endhighlight %}

This function is extremely similar, but instead of setting the nth bit, it
checks the nth bit and returns if it's 0:

{% highlight C %}
if (!(bits[hash / 8] & 1 << hash % 8)) {
{% endhighlight %}

That's it! You have a bloom filter with arbitrary data types for insert and
user-supplied hash functions. I wrote up some simple test code to demonstrate
this, after googling for a couple of random hash functions:

{% highlight C %}
#include "bloom.h"
#include <stdio.h>

unsigned int djb2(const void *_str) {
	const char *str = _str;
	unsigned int hash = 5381;
	char c;
	while ((c = *str++)) {
		hash = ((hash << 5) + hash) + c;
	}
	return hash;
}

unsigned int jenkins(const void *_str) {
	const char *key = _str;
	unsigned int hash, i;
	while (*key) {
		hash += *key;
		hash += (hash << 10);
		hash ^= (hash >> 6);
		key++;
	}
	hash += (hash << 3);
	hash ^= (hash >> 11);
	hash += (hash << 15);
	return hash;
}

int main() {
	bloom_t bloom = bloom_create(8);
	bloom_add_hash(bloom, djb2);
	bloom_add_hash(bloom, jenkins);
	printf("Should be 0: %d\n", bloom_test(bloom, "hello world"));
	bloom_add(bloom, "hello world");
	printf("Should be 1: %d\n", bloom_test(bloom, "hello world"));
	printf("Should (probably) be 0: %d\n", bloom_test(bloom, "world hello"));
	return 0;
}
{% endhighlight %}

The full code is available [here](https://git.sr.ht/~sircmpwn/bloom/).