date

Feb 9, 2022

URL

slug

nyquisis-pdfdropduplicate-remastered

status

Published

tags

Python

DataScience

summary

Nyquisis is a tool to drop duplicate pages in a PDF file.

type

Post

**Estimated Read time: 6 minutes**

**TL;DR:**Nyquisis is a tool to drop duplicate pages in a PDF file. It uses image hashing algorithm to differentiate between similar and different neighboring pages.

**Notice:**This article may subject to a few updates in the future to illustrate a few concepts.

This started when I was a sophomore student. The lecture slides provided by a professor had plenty of duplicate pages for display - and it just ruins your experience to review the slides. I wrote

`pdfDropDuplicate`

to automatically delete the duplicate pages. Then I forgot about this small script.Until last week. I was annoyed, again, by the lecture slides of

*Optimization Methods in SIGML*, with that much duplications again. I found out the old script and it still worked. But the code was really shitty. That’s why I made this remastered version of it, now available on PyPI.The package name

`Nyquisis`

comes after Harry Nyquist. If you know the famous **Nyquist–Shannon sampling theorem**, it’s named after Claude Shannon and him. Broadly speaking, Nyquisis is indeed doing a down-sampling job.To use the tool, simply run In your shell

`$ pip3 install nyquisis`

And then

`$ nyquisis ./pdf-to-drop.pdf`

Check out the GitHub Repository for more details:

### How it works

Nyquisis exploits a concept called image hash.

Hashing is basically mapping an immutable object to a fixed length of string. If we have two different files, even if they only differ by one character, their hashes are expected to be very different. Otherwise this is a serious hash collision.

But for image hashes, we expect something a little different. If we have two images, the more similar they are, the closer their image hashes we expect (this is the purpose of image hashes when they were proposed). Read this question on StackOverflow to learn more:

In Nyquisis, pHash (perceptual hash) among the existing four image hash functions is chosen to be the hash function. The pHash starts by performing a Type-II Discrete Cosine Transform (DCT) on the image into the frequency domain:

The DCT transform is similar to a Discrete Fourier Transform (DFT), but it uses only the cosine functions instead of both sine and cosine functions in DFT.

`scipy.fftpack`

already has a very efficient implementation and it defaults to Tyle-II DCT.After the DCT transform, the 32x32 low-frequency components from the upper-left corner of the transformed image will be taken, these are the components that really contributes to the overall structure of the image. The hash is simply whether each element in this 32x32 matrix is greater than the median.

The DCT transform grants pHash the ability to ignore details in the image, and only focusing on the broader similarity. To illustrate the effect of image hashing, here’s three pages taken from one of the slides when I took

*Introduction to Signals and Systems*two years ago.The latter two images are considered similar images. You can notice that there are simiar patterns in their image hashes in many locations. Taking bitwise-equal of the images hashes, this is the result of the first two images:

And this is the result of the latter two images:

The difference is significant - and quantizable. Hamming distance is used between the image hashes to quantize the difference. The Hamming distance between is defined as

It is essentially performing a bitwise XOR operation between the two vectors. Flatten the image hash to a 1-dimensional vector, we are thus able to quantize their difference.

### Implementing a robust K-Means clustering

Now, we have a list of hamming differences of each page’s pHash compared to its neibor’s pHash. We also know that they form two groups: a group of larger difference, indicating the two pages are different; and a group of small difference, indicating the two pages are similar. The next step is to seperate the two groups, and simply delete pages from the latter group.

In older versions, I assigned a fixed threshold to detect whether a page differs from the last too much. This threshold possibly will not work for all different slides, so there must be a better way to decide this threshold. The answer is unsupervised learning.

A K-Means clustering algorithm should suffice here. Assume the

`diff`

array is always clearly distinguishable into two clusters, then we only need to find the center of these two clusters, and take their mean as the threshold. An additional check on whether this threshold does separate the two classes clearly would be even better. Since I’m having one-dimensional data, the implementation is simple. So I wrote it myself instead of importing `sklearn`

:```
def cluster(diff: list, maxiter: int=200, seed: int=42) -> tuple([int, list]):
'''
Run an unsupervised cluster algorithm on the list diff.
Assumes the list can be divided by a clear threshold,
return that threshold using the mean of the two clusters' centers.
Arguments:
- diff (list): The list of hash differences.
- maxiter (int): Max number of K-means iteration.
- seed (int): Seed to initialize numpy's RNG.
Returns:
int - The calculated threshold
list - The centroids
'''
diff = np.array(diff)
distances = np.zeros((diff.shape[0], 2))
# initialize center of clusters
np.random.seed(seed)
np.random.shuffle(diff)
centroids = diff[0:2]
for i in range(maxiter):
centroids_temp = centroids.copy()
distances[:, 0] = np.abs(diff - centroids[0])
distances[:, 1] = np.abs(diff - centroids[1])
classes = np.argmin(distances, axis=1)
centroids[0] = epsilon * np.mean(diff[classes == 0]) + (1 - epsilon) * centroids[0]
centroids[1] = epsilon * np.mean(diff[classes == 1]) + (1 - epsilon) * centroids[1]
return (np.mean(centroids), centroids.tolist())
```

Let’s see what it can do with my slides. The input

`diff`

is```
[16711680, 6160384, 16252928, 5308416, 6029312,
7077888, 15466496, 6488064, 15138816, 16318464,
4849664, 15138816, 5505024, 3014656, 11796480,
14680064, 6225920, 13893632, 5046272, 12648448,
3801088, 12582912, 11141120, 0, 16646144, 11141120,
4653056, 6291456, 4063232, 3604480, 10944512,
6225920, 11534336, 4521984, 15794176, 14614528,
6619136, 13041664, 15990784, 7340032, 15138816,
8126464, 16056320, 17367040, 13565952, 16580608,
7012352, 16187392, 4390912, 3735552, 10682368,
15138816, 11796480, 9437184, 15269888, 17694720,
8323072, 8454144, 8454144, 8323072, 11403264,
10878976, 10551296, 6488064, 15335424, 11665408,
9961472, 16318464, 15532032, 15007744, 16646144,
14614528, 16908288, 11272192, 15532032, 13500416,
14090240, 15859712, 12713984, 12845056, 10944512,
8257536, 14876672, 15859712]
```

And the above code generates:

```
iteration 0, centroids: [14496243 7878036]
iteration 1, centroids: [11053975 11019243]
iteration 2, centroids: [11038212 11019243]
iteration 3, centroids: [11038022 11019243]
iteration 4, centroids: [11038020 11019243]
iteration 5, centroids: [11038020 11019243]
iteration 6, centroids: [11038020 11019243]
iteration 7, centroids: [11038020 11019243]
iteration 8, centroids: [11038020 11019243]
iteration 9, centroids: [11038020 11019243]
iteration 10, centroids: [11038020 11019243]
iteration 11, centroids: [11038020 11019243]
iteration 12, centroids: [11038020 11019243]
iteration 13, centroids: [11038020 11019243]
iteration 14, centroids: [11038020 11019243]
iteration 15, centroids: [11038020 11019243]
iteration 16, centroids: [11038020 11019243]
iteration 17, centroids: [11038020 11019243]
iteration 18, centroids: [11038020 11019243]
iteration 19, centroids: [11038020 11019243]
```

……What happens? This is very common with K-Means, especially when the sample size is small. The ideal cluster centers are around

`6000000`

and `14000000`

, but it is converging to wrong locations. Sometimes the convergence is even unstable, it will oscillate between two wrong locations.It’s easy to fix this. In each iteration instead of using

`centroids[i] = np.mean(diff[classes == i])`

use a soft-update like

`centroids[i] = epsilon * np.mean(diff[classes == i]) + (1 - epsilon) * centroids[i]`

Now the code would converge to a correct location. The convergence took a little bit longer (60+ iterations) since the soft-update, so I will spare the output here.

### Conclusion

To drop duplicate pages from a PDF file, Nyquisis calculates pHash of each page, takes Hamming distance of neiboring pages, and uses K-Means clustering to differentiate from two groups of Hamming distances.

According to my tests, Nyquisis is robust enough. It worked for slides of different styles from a few different lecturers. But as I can’t gurantee it will work for all inputs, if you find any bugs, please don’t hesitate to raise an issue on GitHub.

I may add a few new features to Nyquisis in the future, and some more code to ensure robustness. In the meanwhile, I conclude it is safe to put down this project for now. Thanks for reading.