qdrant uniqueness check clj

Prerequisites

A running Qdrant instance (local or remote).
Clojure project with dependencies for HTTP requests and JSON handling.
URLs stored in a Qdrant collection’s payload under a field like url.

Setup

Add the following dependencies to your project.clj (if using Leiningen):

:dependencies [[org.clojure/clojure "1.11.1"]
               [clj-http "3.12.0"]          ;; For HTTP requests
               [cheshire "5.12.0"]]         ;; For JSON parsing

Or, if you’re using deps.edn:

{:deps {org.clojure/clojure {:mvn/version "1.11.1"}
        clj-http/clj-http {:mvn/version "3.12.0"}
        cheshire/cheshire {:mvn/version "5.12.0"}}}

Clojure Implementation

We’ll implement two versions, similar to the Python code:

Individual URL Check: Check each URL one by one (simpler but slower for many URLs).
Batch URL Check: Check all URLs in a single query using MatchAny (faster for large lists).

1. Individual URL Check

This version queries Qdrant for each URL individually using the /collections/{collection}/points/scroll endpoint.

(ns qdrant-checker
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

(def qdrant-config
  {:host "http://localhost:6333"
   :collection "scraped_pages"
   :url-field "url"})

(defn build-url-filter
  "Build a Qdrant filter for a single URL."
  [url url-field]
  {:must [{:key url-field
           :match {:value url}}]})

(defn check-url
  "Check if a single URL exists in the Qdrant collection."
  [{:keys [host collection url-field]} url]
  (let [endpoint (str host "/collections/" collection "/points/scroll")
        payload {:filter (build-url-filter url url-field)
                 :limit 1
                 :with_payload true
                 :with_vector false}
        response (http/post endpoint
                           {:body (json/generate-string payload)
                            :headers {"Content-Type" "application/json"}
                            :as :json})
        points (get-in response [:body :result :points])]
    {:url url :exists? (boolean (seq points))}))

(defn check-urls
  "Check which URLs exist in the Qdrant collection. Returns {:existing [], :non-existing []}."
  [config urls]
  (let [results (map #(check-url config %) urls)
        existing (map :url (filter :exists? results))
        non-existing (map :url (remove :exists? results))]
    {:existing existing
     :non-existing non-existing}))

;; Example usage
(def urls-to-check
  ["https://example.com/page1"
   "https://example.com/page2"
   "https://example.com/page3"])

(let [{:keys [existing non-existing]} (check-urls qdrant-config urls-to-check)]
  (println "Existing URLs:" existing)
  (println "Non-existing URLs:" non-existing)
  (if (seq non-existing)
    (println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
    (println "All URLs already exist in the collection.")))

2. Batch URL Check

This version queries all URLs in a single request using a MatchAny filter, which is more efficient for large lists.

(ns qdrant-checker
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

(def qdrant-config
  {:host "http://localhost:6333"
   :collection "scraped_pages"
   :url-field "url"})

(defn build-batch-url-filter
  "Build a Qdrant filter for multiple URLs using MatchAny."
  [urls url-field]
  {:must [{:key url-field
           :match {:any urls}}]})

(defn check-urls-batch
  "Check which URLs exist in the Qdrant collection in a single query.
   Returns {:existing [], :non-existing []}."
  [{:keys [host collection url-field]} urls]
  (let [endpoint (str host "/collections/" collection "/points/scroll")
        payload {:filter (build-batch-url-filter urls url-field)
                 ;; Assumes at most one point per URL; scroll expects a limit of at least 1
                 :limit (max 1 (count urls))
                 :with_payload true
                 :with_vector false}
        response (http/post endpoint
                           {:body (json/generate-string payload)
                            :headers {"Content-Type" "application/json"}
                            :as :json})
        points (get-in response [:body :result :points])
        ;; :as :json keywordizes JSON keys, so look the field up as a keyword
        existing (keep #(get-in % [:payload (keyword url-field)]) points)
        non-existing (remove (set existing) urls)]
    {:existing existing
     :non-existing non-existing}))

;; Example usage
(def urls-to-check
  ["https://example.com/page1"
   "https://example.com/page2"
   "https://example.com/page3"])

(let [{:keys [existing non-existing]} (check-urls-batch qdrant-config urls-to-check)]
  (println "Existing URLs:" existing)
  (println "Non-existing URLs:" non-existing)
  (if (seq non-existing)
    (println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
    (println "All URLs already exist in the collection.")))

Explanation

Qdrant Config: The qdrant-config map holds the Qdrant host, collection name, and payload field name (url-field). Update these to match your setup (e.g., host for Qdrant Cloud, url-field if you use a different key like page_url).
HTTP Requests: We use clj-http to send POST requests to Qdrant’s REST API (/collections/{collection}/points/scroll). The :as :json option parses the response body as JSON and converts object keys to keywords, so a payload field named "url" is read back as :url.
Filter Construction:
- For individual checks, build-url-filter creates a must filter with a single match condition for one URL.
- For batch checks, build-batch-url-filter uses match.any to match any URL in the list.
Response Handling:
- The scroll endpoint returns a :result.points array. If it’s non-empty, the URL exists.
- In the batch version, we extract all URLs from the points’ payloads and compute non-existing URLs by set difference.
Output: Both functions return a map with :existing and :non-existing keys, containing lists of URLs.
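
To make the response handling concrete, here is a minimal sketch that runs without a Qdrant server. The sample-body map is a hand-written stand-in for a parsed scroll response (an assumption for illustration, not captured output), and existing-urls is a hypothetical helper name:

```clojure
;; Hand-written stand-in for a scroll response body after clj-http's
;; `:as :json` coercion: JSON object keys have become keywords.
(def sample-body
  {:result {:points [{:id 1 :payload {:url "https://example.com/page1"}}]
            :next_page_offset nil}
   :status "ok"})

(defn existing-urls
  "Extract the stored URLs from a parsed scroll response body as a set."
  [body url-field]
  (->> (get-in body [:result :points])
       ;; The configured field name is a string, the parsed key is a keyword.
       (keep #(get-in % [:payload (keyword url-field)]))
       set))

(existing-urls sample-body "url")
;; => #{"https://example.com/page1"}
```

Looking the field up with the plain string "url" would return nil here, which is why the keyword conversion matters.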

Example Output

For urls-to-check ["https://example.com/page1" "https://example.com/page2"]:

If page1 exists and page2 doesn’t:

Existing URLs: (https://example.com/page1)
Non-existing URLs: (https://example.com/page2)
Ready to insert 1 new URLs into Qdrant.

Key Notes

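Batch Efficiency: The batch version (check-urls-batch) is preferred for large URL lists because it minimizes HTTP requests. The MatchAny filter checks all URLs in a single request. Because :limit is set to (count urls), this assumes each URL is stored in at most one point; if the collection may hold duplicates, raise the limit or follow next_page_offset in the response to page through the results.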
Payload Field: Assumes URLs are stored in a payload field named url. If your field is different (e.g., metadata.url), update :url-field in qdrant-config and ensure the filter key matches (e.g., metadata.url in the filter).
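
For very long URL lists, one option is to split the batch check into chunks so that no single request carries an oversized filter. The sketch below is only an illustration: check-urls-chunked and fake-check are hypothetical names, and fake-check stands in for a real Qdrant call so the example runs on its own.

```clojure
(defn check-urls-chunked
  "Run check-fn over urls in chunks of chunk-size and merge the results.
   check-fn is any function of a URL collection that returns
   {:existing [...] :non-existing [...]}."
  [check-fn chunk-size urls]
  (->> (distinct urls)              ; avoid checking the same URL twice
       (partition-all chunk-size)
       (map check-fn)
       (reduce (fn [acc {:keys [existing non-existing]}]
                 (-> acc
                     (update :existing into existing)
                     (update :non-existing into non-existing)))
               {:existing [] :non-existing []})))

;; Stand-in for a real Qdrant call: pretends only page1 is stored.
(defn fake-check [urls]
  (let [stored #{"https://example.com/page1"}]
    {:existing (filterv stored urls)
     :non-existing (vec (remove stored urls))}))

(check-urls-chunked fake-check 2
                    ["https://example.com/page1"
                     "https://example.com/page2"
                     "https://example.com/page3"])
;; => {:existing ["https://example.com/page1"]
;;     :non-existing ["https://example.com/page2" "https://example.com/page3"]}
```

With the real function this would look like (check-urls-chunked (partial check-urls-batch qdrant-config) 500 urls), where 500 is an arbitrary chunk size to tune for your setup.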

Error Handling: For production, add error handling for network issues or Qdrant errors:

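;; Wraps check-urls-batch (defined above) rather than repeating its body.
(defn safe-check-urls-batch
  [config urls]
  (try
    (check-urls-batch config urls)
    (catch Exception e
      (println "Error querying Qdrant:" (.getMessage e))
      ;; Check :error before inserting: on failure every URL is reported
      ;; as non-existing, which would otherwise create duplicates.
      {:existing [] :non-existing urls :error (.getMessage e)})))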

Collection Existence: To verify the collection exists, you can query /collections/{collection} before running checks:

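(defn collection-exists? [{:keys [host collection]}]
  ;; clj-http throws on 4xx/5xx by default, so disable that to inspect the status
  (let [response (http/get (str host "/collections/" collection)
                           {:throw-exceptions false})]
    (= 200 (:status response))))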

REST vs. gRPC: This uses the REST API for simplicity. If you’re using gRPC, you’d need to interop with the qdrant-java-client. I can provide that version if needed.
Inserting Non-existing URLs: After identifying :non-existing URLs, you’ll need to:
1. Generate vectors for the scraped pages. Note that sentence-transformers is a Python library, so from Clojure you would call it through a separate embedding service or a Python bridge such as libpython-clj, or use a JVM-based embedding library instead.
2. Upsert points to Qdrant with a PUT request to the /collections/{collection}/points endpoint. Let me know if you want code for this part!
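
As a rough sketch of the upsert step, the request body could be assembled like this. The helper names (url->point-id, build-upsert-body) and the url->vector argument are hypothetical; the embedding function is assumed to be supplied by you, and the dummy vector below is for illustration only. Deriving the point ID from the URL is one possible design choice, not something Qdrant requires.

```clojure
(import 'java.util.UUID)

(defn url->point-id
  "Derive a deterministic, name-based UUID string from a URL, so that
   upserting the same URL twice overwrites one point instead of adding another."
  [url]
  (str (UUID/nameUUIDFromBytes (.getBytes ^String url "UTF-8"))))

(defn build-upsert-body
  "Build the body for PUT /collections/{collection}/points.
   url->vector is a caller-supplied embedding function (an assumption here)."
  [url->vector url-field urls]
  {:points (mapv (fn [url]
                   {:id (url->point-id url)
                    :vector (url->vector url)
                    :payload {url-field url}})
                 urls)})

;; Dummy embedding for illustration; a real vector must match the
;; vector size configured for the collection.
(build-upsert-body (constantly [0.1 0.2 0.3])
                   "url"
                   ["https://example.com/page2"])
```

The resulting map can be serialized with json/generate-string and sent with http/put in the same way the scroll requests are sent with http/post.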

Configuration Adjustments

Qdrant Cloud: If using Qdrant Cloud, update :host to your cluster URL (e.g., https://your-cluster.qdrant.io) and add an API key:

(def qdrant-config
  {:host "https://your-cluster.qdrant.io"
   :collection "scraped_pages"
   :url-field "url"
   :api-key "your-api-key"})

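;; Add to the options map passed to http/post. The check functions destructure
;; the config, so bind it as well, e.g. {:keys [host collection url-field] :as config}
:headers {"Content-Type" "application/json"
          "api-key" (:api-key config)}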
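
One way to keep the local and Cloud setups on the same code path is a small helper that adds the api-key header only when a key is present in the config. request-headers is a hypothetical name used for this sketch:

```clojure
(defn request-headers
  "Build request headers, adding the api-key header only when a key is configured."
  [{:keys [api-key]}]
  (cond-> {"Content-Type" "application/json"}
    api-key (assoc "api-key" api-key)))

(request-headers {:host "http://localhost:6333"})
;; => {"Content-Type" "application/json"}

(request-headers {:host "https://your-cluster.qdrant.io" :api-key "your-api-key"})
;; => {"Content-Type" "application/json", "api-key" "your-api-key"}
```

The check functions could then pass :headers (request-headers config) instead of a literal map.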

Custom Payload: If URLs are nested (e.g., {:metadata {:url "..."}}), use a dotted key in the filter:
```
{:must [{:key "metadata.url" :match {:value url}}]}
```
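
With a nested field, reading the URL back out of each returned point also needs a nested lookup. A minimal sketch, assuming the response was parsed with keyword keys (as with :as :json) and using the hypothetical helper names payload-path and point-url:

```clojure
(require '[clojure.string :as str])

(defn payload-path
  "Turn a dotted field name such as \"metadata.url\" into a get-in path of keywords."
  [url-field]
  (mapv keyword (str/split url-field #"\.")))

(defn point-url
  "Read the URL from a parsed point, following a possibly nested payload field."
  [point url-field]
  (get-in point (into [:payload] (payload-path url-field))))

(point-url {:payload {:metadata {:url "https://example.com/page1"}}}
           "metadata.url")
;; => "https://example.com/page1"

(point-url {:payload {:url "https://example.com/page1"}} "url")
;; => "https://example.com/page1"
```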

Running the Code

Save the code in a file (e.g., src/qdrant_checker.clj).
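Run with Leiningen: lein run (or clojure -M -m qdrant-checker for deps.edn). Both commands expect a -main function in the namespace, and lein run also needs :main qdrant-checker in project.clj, so wrap the example usage in (defn -main [& args] ...) first, or simply load the file in a REPL.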
Adjust urls-to-check and qdrant-config to match your data.

Next Steps

If you’re ready to insert the :non-existing URLs, I can provide Clojure code to generate vectors and upsert points.
If you have a specific embedding model or Qdrant schema, share details for tailored code.
If you prefer the gRPC client or have other constraints (e.g., async HTTP), let me know.

Let me know how this works or if you need further tweaks!

usametov/url-check.md