system-design-101/data/guides/how-to-avoid-crawling-duplicate-urls-at-google-scale.md at main

mirror of https://github.com/ByteByteGoHq/system-design-101.git synced 2026-04-09 11:37:25 -04:00

Files

Kamran Ahmed ee4b7305a2 Adds ByteByteGo guides and links (#106 )

This PR adds all the guides from [Visual
Guides](https://bytebytego.com/guides/) section on bytebytego to the
repository with proper links.

- [x] Markdown files for guides and categories are placed inside
`data/guides` and `data/categories`
- [x] Guide links in readme are auto-generated using
`scripts/readme.ts`. Everytime you run the script `npm run
update-readme`, it reads the categories and guides from the above
mentioned folders, generate production links for guides and categories
and populate the table of content in the readme. This ensures that any
future guides and categories will automatically get added to the readme.
- [x] Sorting inside the readme matches the actual category and guides
sorting on production

2025-03-31 22:16:44 -07:00

2.0 KiB

Raw Permalink Blame History

title, description, image, createdAt, draft, categories, tags

title

description

image

createdAt

draft

Option 3: Bloom Filter

This option is preferred. Bloom filter was proposed by Burton Howard Bloom in 1970. It is a probabilistic data structure that is used to test whether an element is a member of a set.

false: the element is definitely not in the set.
true: the element is probably in the set.

False-positive matches are possible, but false negatives are not.

The diagram below illustrates how the Bloom filter works. The basic data structure for the Bloom filter is Bit Vector. Each bit represents a hashed value.

Step 1

To add an element to the bloom filter, we feed it to 3 different hash functions (A, B, and C) and set the bits at the resulting positions. Note that both “www.myweb1.com” and “www.myweb2.com” mark the same bit with 1 at index 5. False positives are possible because a bit might be set by another element.

Step 2

When testing the existence of a URL string, the same hash functions A, B, and C are applied to the URL string. If all three bits are 1, then the URL may exist in the dataset; if any of the bits is 0, then the URL definitely does not exist in the dataset.

Hash function choices are important. They must be uniformly distributed and fast. For example, RedisBloom and Apache Spark use murmur, and InfluxDB uses xxhash.

2.0 KiB Raw Permalink Blame History

Option 3: Bloom Filter

Step 1

Step 2

2.0 KiB

Raw Permalink Blame History