Download locked PDFs from Google Drive

Google Drive has an option to share PDF files in such a way that they cannot be downloaded. A silly concept for those of us who enjoy reading such documents in an alternative reader or offline, or who simply like to keep certain files for future reference.

However, everything that’s displayed on your computer can be captured, and it didn’t take me long to find an excellent solution at the Coding Cat Blog. It requires no extra tools. Simply make sure you “read” the entire document: hold the Page-Dn key until you reach the end, and maybe go back up the same way. This makes the viewer generate an image for every page and keep it in your browser.
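Before going further, it can help to check how many page images the viewer has actually loaded. The following is a minimal sketch, assuming that each rendered page is an img element whose src starts with blob: (the same assumption the script in this post relies on). The helper name countPageImages is made up for illustration.

```javascript
// Count the images that look like rendered pages (src starts with "blob:").
// Hypothetical helper; accepts any iterable of objects with a `src` property.
function countPageImages(images) {
    let count = 0;
    for (const img of images) {
        if (typeof img.src === "string" && img.src.startsWith("blob:")) {
            count++;
        }
    }
    return count;
}

// In the browser console, on the Drive preview page:
// countPageImages(document.getElementsByTagName("img"));
```

If the number is lower than the document's page count, scroll through the document once more.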

Then open your browser’s JavaScript console and paste the code below (adapted from the original author’s website, kept here as a backup):

// Load jsPDF from a CDN, then build the PDF once the library is ready.
let jspdf = document.createElement("script");
jspdf.onload = function () {
    let pdf = new jsPDF();
    let elements = document.getElementsByTagName("img");
    let pages = 0;
    for (let img of elements) {
        // Each page is shown as an <img> with a blob: URL; skip everything else.
        if (!/^blob:/.test(img.src)) {
            console.log("invalid src");
            continue;
        }
        console.log("add img ", img);
        // Start a new page before every image except the first,
        // so the PDF does not end with a blank page.
        if (pages > 0) {
            pdf.addPage();
        }
        // Draw the image onto a canvas to get its pixels as JPEG data.
        let can = document.createElement("canvas");
        let con = can.getContext("2d");
        can.width = img.width;
        can.height = img.height;
        con.drawImage(img, 0, 0);
        let imgData = can.toDataURL("image/jpeg", 1.0);
        pdf.addImage(imgData, "JPEG", 0, 0);
        pages++;
    }
    pdf.save("download.pdf");
};
jspdf.src = 'https://cdnjs.cloudflare.com/ajax/libs/jspdf/1.5.3/jspdf.debug.js';
document.body.appendChild(jspdf);
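If the page images come out larger or smaller than the PDF page, one possible tweak is to pass an explicit width and height to addImage. This is only a sketch under assumptions: fitToPage is a hypothetical helper, not part of the original script, and the 210 x 297 figures assume jsPDF's default A4 portrait page measured in millimetres.

```javascript
// Scale an image to fit inside a page while keeping its aspect ratio.
// Hypothetical helper; page sizes use the same unit as the PDF (mm by default in jsPDF).
function fitToPage(imgWidth, imgHeight, pageWidth, pageHeight) {
    const scale = Math.min(pageWidth / imgWidth, pageHeight / imgHeight);
    return { width: imgWidth * scale, height: imgHeight * scale };
}

// Possible use inside the loop, assuming the default A4 portrait page (210 x 297 mm):
// const size = fitToPage(can.width, can.height, 210, 297);
// pdf.addImage(imgData, "JPEG", 0, 0, size.width, size.height);
```

Taking the smaller of the two ratios guarantees that neither dimension overflows the page.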

The resulting PDF is fine for basic reading but has one major flaw: it is a collection of images rather than text. That makes searching impossible, which is a problem if your PDF is quite big. So I took this a little further and came across some interesting OCR software.

The simplest working solution I found is ocrmypdf. Install it with yum or whatever package manager you use, or with Python’s pip (perhaps in a virtual environment for testing).

After that, just run ocrmypdf input.pdf output.pdf and the resulting output.pdf should have a searchable text layer.
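If you end up with several downloaded files, the same command can be looped over a folder. The following Node.js sketch assumes ocrmypdf is installed and on your PATH; the helper names and the ./downloads folder are made up for illustration.

```javascript
const { execFileSync } = require("child_process");
const fs = require("fs");
const path = require("path");

// Build the output name for a given input, e.g. "download.pdf" -> "download.ocr.pdf".
function ocrOutputName(inputFile) {
    const parsed = path.parse(inputFile);
    return path.join(parsed.dir, parsed.name + ".ocr" + parsed.ext);
}

// Run "ocrmypdf input output" for every PDF in a folder.
// Files that already end in ".ocr.pdf" are skipped so reruns do not OCR the results again.
function ocrFolder(folder) {
    for (const name of fs.readdirSync(folder)) {
        const lower = name.toLowerCase();
        if (!lower.endsWith(".pdf") || lower.endsWith(".ocr.pdf")) {
            continue;
        }
        const input = path.join(folder, name);
        execFileSync("ocrmypdf", [input, ocrOutputName(input)], { stdio: "inherit" });
    }
}

// Usage (hypothetical folder name):
// ocrFolder("./downloads");
```

Passing the file names as an argument array to execFileSync avoids shell quoting problems with spaces in file names.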

Another application worth looking into is tesseract-ocr, which I’ve used once before with great success but have since forgotten how to handle. Just know it’s out there; I’m trying to piece my know-how back together and will update this page some time in the future.