When I read, I like to highlight things (I use a Kindle). I feel like just by reading I don't retain more than 10% of the information I consume, but re-reading the highlights or summarizing the book from them is what makes me truly understand what I read.
The problem is that, sometimes, I end up highlighting a lot.
And by a lot I mean A LOT. We can't even call them “key notes.”
So in those cases, after reading the book, I end up either wasting a lot of time summarizing or just giving up on it (the latter being the more frequent).
I recently read a book that I enjoyed a lot and want to fully retain what struck me the most. But, again, it was one of those books I over-highlighted.
And I didn't want to spend much of my scarce free time on it. So I decided to automate the process and put my tech/data skills to use. Because I'm happy with the result, I thought I'd share it so anyone can take advantage of this tool as well.
Disclaimer: my Kindle is quite old, so this should work on newer ones as well. In fact, there's a slightly better approach for newer Kindle versions (also explained in this post).
The Project
Let's define the goal: generate a summary from our Kindle highlights.
As I thought about it, I imagined the following simple pipeline for a single book:
- Get the book highlights
- Create a RAG or something similar
- Export the summary
The final result differs on the first part, mostly due to the preprocessing needed once you take into account how the data is structured.
So I'll structure this post into two main sections:
- Data retrieval and processing
- AI model and output
1. Data Retrieval and Processing
My intuition told me there had to be a way to extract highlights from my Kindle. After all, they're stored there, so I just needed a way to get them out.
There are several ways to do it, but I wanted an approach that worked both with books bought on the official Kindle store and with PDFs or files I sent from my laptop.
And I also decided I wouldn't use any existing software to extract the data. Just my Kindle and my laptop (and a USB cable connecting them both).
Luckily for us, no jailbreak is required, and there are two ways of doing it depending on your Kindle version:
- All Kindles (presumably) have a file in the documents folder named My Clippings.txt. It literally contains every clipping you've made at any point, in any book.
- Newer Kindles also have a SQLite file in the system directory named annotations.db. It stores your highlights in a more structured way.
In this post I'll be using method 1 (My Clippings.txt), mainly because my Kindle doesn't have the annotations.db database. But if you're lucky enough to have the DB, use it, since it'll be more straightforward and higher quality (most of the preprocessing we'll go through next probably won't be needed).
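If you do go the annotations.db route, reading it is a standard SQLite job. Below is a minimal sketch of how that read could look; since my Kindle doesn't have this database, the table and column names ("annotations", "booktitle", "start_location", "text") are assumptions, so inspect your own file first and adjust:

import sqlite3

def read_annotations_db(db_path, book_name):
    con = sqlite3.connect(db_path)
    # Print the real schema before trusting the query below.
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    print("tables found:", tables)
    # Hypothetical query: adapt the table/column names to your schema.
    rows = con.execute(
        "SELECT booktitle, start_location, text FROM annotations "
        "WHERE booktitle LIKE ?",
        (f"%{book_name}%",)
    ).fetchall()
    con.close()
    return [
        {"book": b, "location": loc, "text": t}
        for (b, loc, t) in rows
    ]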
So getting the clippings is as easy as reading the TXT. Here are some key aspects and issues I encountered using this method:
- All books are in the same file.
- I'm not sure about the exact “clipping” definition on Amazon's side, but the way I've seen it, it's anything you highlight at any point. Even if you delete or extend it, the original remains in the TXT. I guess this is because, indeed, we're working with a TXT file, and it's very hard to delete entries that aren't indexed in any way.
- There's a limit to clippings: I'm not aware of the exact threshold, but once we cross it, we can't retrieve any more clippings. Presumably this exists because someone could otherwise highlight the full book, extract it, and share it illegally.
And this is the anatomy of a clipping:
==========
Book Name (Author Name)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM
transparency problem leads to the same place as
==========
So the first step is parsing the highlights, and this is where we start seeing Python code:
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights

Given the path of the clippings file, all this function does is split the text into the different entries and then loop through them. For each entry, it extracts the book title, the location, and the highlighted text.
This final structure (a list of dictionaries) makes it easy to filter by book:

[
    h for h in highlights
    if book_name.lower() in h["book"].lower()
]

Once filtered, we must order the highlights. Since clippings are appended to the TXT file, the order is based on when we highlighted, not on the text's location.
And I personally want my results to appear as they do in the book, so ordering is necessary:

sorted(highlights, key=lambda x: x["location"])

Now, if you inspect your clippings file, you might notice duplicated clippings (or duplicated subclippings). This happens because any time you edit a highlight (one where you failed to include all the words you aimed at, for example), it counts as a new one. So there will be two very similar clippings in the TXT. Or even more, if you edit it many times.
We need to handle this by applying some kind of deduplication. It's easier than expected:

def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean

It's very simple and could be perfected, but we basically check whether a clipping matches, contains, or is contained in one we've already kept, and keep the longest version.
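A quick illustration with invented entries (an original highlight, its extended edit, and an exact repeat):

sample = [
    {"book": "Demo", "location": 100, "text": "talking to strangers"},
    {"book": "Demo", "location": 100, "text": "talking to strangers is hard"},
    {"book": "Demo", "location": 100, "text": "talking to strangers is hard"},
]
print(deduplicate(sample))
# [{'book': 'Demo', 'location': 100, 'text': 'talking to strangers is hard'}]
# The short original is absorbed into its longer edit; the repeat is dropped.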
Right now we have the book highlights properly sorted, and we could stop the preprocessing here. But I can't do that. I always highlight section titles because, when summarizing, I get to properly assign a section to each highlight.
But our code isn't able to distinguish between a real highlight and a section title… yet. See below:
def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter-style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2

It may seem quite arbitrary, and it's not the best solution to this problem, but it works quite well. It uses a heuristic based on capitalization, length, stopwords, and prefixes.
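To get a feel for the heuristic, here are a few illustrative calls (the strings are invented; STOPWORDS and has_chapter_prefix are the helpers shown in the full listing below):

print(is_probable_title("Chapter 3. The Holy Fool"))  # True: chapter-style prefix
print(is_probable_title("The Holy Fool"))             # True: short, mostly capitalized
print(is_probable_title(
    "we default to truth because we have no choice but to trust the people around us"
))                                                    # False: too many words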
This function is called inside a loop over all the highlights, as we've seen in previous functions, to check whether each highlight is a title or not. The result is a “sections” list of dictionaries, where each dictionary has two keys:
- Title: the section title.
- Highlights: the section's highlights.
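To make that structure concrete, here is roughly what the result looks like (titles and highlights invented):

sections = [
    {
        "title": "Introduction",
        "highlights": ["a first highlight...", "a second highlight..."]
    },
    {
        "title": "Chapter 1. First Impressions",
        "highlights": ["another highlight..."]
    }
]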
Now, yes, we're ready to summarize.
2. AI Model and Output
I wanted this to be a free project, so we need an AI model that's open source.
I figured Ollama [1] was one of the best options to run a project like this (at least locally). Plus, our data always stays ours, and we can run the models offline.
Once installed, the code was straightforward. I'm not a prompt engineer, so anyone with the know-how could get even better results, but this is what works for me:
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.
Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways
Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout

Simple, I know. But it works, partly because the data preprocessing has been intense, but also because we're leveraging the models already built out there.
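One practical note: the model has to be downloaded locally at some point. If you want the script to handle that itself, a small optional helper (the ensure_model name is mine) can check with ollama list and download with ollama pull before summarizing:

import subprocess

def ensure_model(model):
    # "ollama list" prints the locally available models;
    # pull the model first if it's not among them.
    listed = subprocess.run(
        ["ollama", "list"],
        text=True,
        capture_output=True
    ).stdout
    if model not in listed:
        subprocess.run(["ollama", "pull", model], check=True)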
But what do we do with the summary? I like using Obsidian [2], so exporting a Markdown file is what makes the most sense. Here you have it:
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")

Et voilà.
And this is how I go from highlights to a full Markdown summary (straight into Obsidian if I want) with fewer than 300 lines of Python code!
Full Code and Test
Here's the full code, just in case you want to copy-paste it. It contains what we've seen plus some helper functions and argument parsing:
import re
import argparse
from pathlib import Path
import subprocess

# ---------- PARSE CLIPPINGS ----------
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights
# ---------- FILTER BOOK ----------
def filter_book(highlights, book_name):
    return [
        h for h in highlights
        if book_name.lower() in h["book"].lower()
    ]

# ---------- SORT ----------
def sort_by_location(highlights):
    return sorted(highlights, key=lambda x: x["location"])

# ---------- DEDUPLICATE ----------
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean
# ---------- TITLE DETECTION ----------
STOPWORDS = {
    "the","and","or","but","of","in","on","at","for","to",
    "is","are","was","were","be","been","being",
    "that","this","with","as","by","from"
}

def has_chapter_prefix(text):
    return bool(
        re.match(
            r"^(chapter|part|section)\s+\d+|^\d+[.)]|^[ivxlcdm]+\.",
            text.lower()
        )
    )

def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter-style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2
# ---------- GROUP SECTIONS ----------
def group_by_sections(highlights):
    sections = []
    current = {
        "title": "Introduction",
        "highlights": []
    }
    for h in highlights:
        text = h["text"]
        if is_probable_title(text):
            sections.append(current)
            current = {
                "title": text,
                "highlights": []
            }
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections
# ---------- SUMMARY ----------
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.
Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways
Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout

# ---------- EXPORT MARKDOWN ----------
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")
# ---------- MAIN ----------
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--book", required=True)
    parser.add_argument("--output", required=False, default=None)
    parser.add_argument(
        "--clippings",
        default="Data/My Clippings.txt"
    )
    parser.add_argument(
        "--model",
        default="mistral"
    )
    args = parser.parse_args()

    highlights = parse_clippings(args.clippings)
    highlights = filter_book(highlights, args.book)
    highlights = sort_by_location(highlights)
    highlights = deduplicate(highlights)
    sections = group_by_sections(highlights)

    all_text = "\n".join(
        h["text"] for h in highlights
    )
    summary = summarize_with_ollama(all_text, args.model)

    if args.output:
        export_markdown(
            args.book,
            sections,
            summary,
            args.output
        )
    else:
        print("\n---- HIGHLIGHTS ----\n")
        for h in highlights:
            print(f"{h['text']}\n")
        print("\n---- SUMMARY ----\n")
        print(summary)

if __name__ == "__main__":
    main()

But let's see how it works! The code itself is useful, but I bet you're eager to see the results. It's a long one, so I cut the first part, as all it does is just copy-paste the highlights.
I randomly chose a book I read some 6 years ago (2020) called Talking to Strangers by Malcolm Gladwell (a bestseller, quite an enjoyable read). See the model's printed output (not the Markdown):
$ python3 kindle_summary.py --book "Talking to Strangers"
---- HIGHLIGHTS ----
...
---- SUMMARY ----
Title: Talking to Strangers: What We Should Know About Human Interaction
Main Thesis: The book explores the complexities and paradoxes of human
interaction, particularly in conversations with strangers, and emphasizes
the importance of caution, humility, and understanding the context in
which these interactions occur.
Brief Summary: The author delves into the misconceptions and shortcomings
in our dealings with strangers, focusing on how we often make incorrect
assumptions about others based on limited information or preconceived
notions. The book presents insights into why this happens, its consequences,
and strategies for improving our ability to understand and communicate
effectively with people we do not know.
Key Ideas:
1. The transparency problem and the default-to-truth problem: People often
assume that others are open books, sharing their true emotions and
intentions, when in reality this is not always the case.
2. Coupling: Behaviors are strongly linked to specific circumstances and
conditions, making it essential to understand the context in which a
stranger operates.
3. Limitations of understanding strangers: There is no perfect mechanism
for peering into the minds of those we do not know, emphasizing the need
for restraint and humility when interacting with strangers.
Important Concepts:
1. Emotional responses falling outside expectations
2. Defaulting to truth
3. Transparency as an illusion
4. Contextual understanding in dealing with strangers
5. The paradox of talking to strangers (need versus awfulness)
6. The phenomenon of coupling and its influence on behavior
7. Blaming the stranger when things go awry
Practical Takeaways:
1. Recognize that people may not always be as they appear, both
emotionally and behaviorally.
2. Understand the importance of context in interpreting strangers'
behaviors and intentions.
3. Be cautious and humble when interacting with strangers, acknowledging
our limitations in understanding them fully.
4. Avoid jumping to conclusions about strangers based on limited
information or preconceived notions.
5. Accept that there will always be some degree of ambiguity and
complexity in dealing with strangers.
6. Avoid penalizing others for defaulting to truth as a defense mechanism.
7. When interactions with strangers go awry, consider the role one might
have played in contributing to the situation rather than solely blaming
the stranger.

And all this within a few seconds. Pretty cool in my opinion.
Conclusion
And that's basically how I'm now saving a lot of free time (which I can use to write posts like this one) by leveraging my data skills and AI.
I hope you enjoyed the read and feel motivated to give it a try! It won't be better than the summary you'd write from your own perception of the book… but it won't be far from it!
Thanks for your attention, and feel free to comment if you have any ideas or suggestions!
Resources
[1] Ollama. (n.d.). Ollama.
[2] Obsidian. (n.d.). Obsidian.