- A running Qdrant instance (local or remote).
- Clojure project with dependencies for HTTP requests and JSON handling.
- URLs stored in a Qdrant collection’s payload under a field like
url
.
Add the following dependencies to your project.clj
(if using Leiningen):
:dependencies [[org.clojure/clojure "1.11.1"]
[clj-http "3.12.0"] ;; For HTTP requests
[cheshire "5.12.0"]] ;; For JSON parsing
Or, if you’re using deps.edn
:
{:deps {org.clojure/clojure {:mvn/version "1.11.1"}
clj-http/clj-http {:mvn/version "3.12.0"}
cheshire/cheshire {:mvn/version "5.12.0"}}}
We’ll implement two versions, similar to the Python code:
- Individual URL Check: Check each URL one by one (simpler but slower for many URLs).
- Batch URL Check: Check all URLs in a single query using
MatchAny
(faster for large lists).
This version queries Qdrant for each URL individually using the /collections/{collection}/points/scroll
endpoint.
(ns qdrant-checker
(:require [clj-http.client :as http]
[cheshire.core :as json]))
(def qdrant-config
{:host "http://localhost:6333"
:collection "scraped_pages"
:url-field "url"})
(defn build-url-filter
"Build a Qdrant filter for a single URL."
[url url-field]
{:must [{:key url-field
:match {:value url}}]})
(defn check-url
"Check if a single URL exists in the Qdrant collection."
[{:keys [host collection url-field]} url]
(let [endpoint (str host "/collections/" collection "/points/scroll")
payload {:filter (build-url-filter url url-field)
:limit 1
:with_payload true
:with_vector false}
response (http/post endpoint
{:body (json/generate-string payload)
:headers {"Content-Type" "application/json"}
:as :json})
points (get-in response [:body :result :points])]
(if (seq points)
{:url url :exists? true}
{:url url :exists? false})))
(defn check-urls
"Check which URLs exist in the Qdrant collection. Returns {:existing [], :non-existing []}."
[config urls]
(let [results (map #(check-url config %) urls)
existing (map :url (filter :exists? results))
non-existing (map :url (remove :exists? results))]
{:existing existing
:non-existing non-existing}))
;; Example usage
(def urls-to-check
["https://example.com/page1"
"https://example.com/page2"
"https://example.com/page3"])
(let [{:keys [existing non-existing]} (check-urls qdrant-config urls-to-check)]
(println "Existing URLs:" existing)
(println "Non-existing URLs:" non-existing)
(if (seq non-existing)
(println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
(println "All URLs already exist in the collection.")))
This version queries all URLs in a single request using a MatchAny
filter, which is more efficient for large lists.
(ns qdrant-checker
(:require [clj-http.client :as http]
[cheshire.core :as json]))
(def qdrant-config
{:host "http://localhost:6333"
:collection "scraped_pages"
:url-field "url"})
(defn build-batch-url-filter
"Build a Qdrant filter for multiple URLs using MatchAny."
[urls url-field]
{:must [{:key url-field
:match {:any urls}}]})
(defn check-urls-batch
"Check which URLs exist in the Qdrant collection in a single query.
Returns {:existing [], :non-existing []}."
[{:keys [host collection url-field]} urls]
(let [endpoint (str host "/collections/" collection "/points/scroll")
payload {:filter (build-batch-url-filter urls url-field)
:limit (count urls)
:with_payload true
:with_vector false}
response (http/post endpoint
{:body (json/generate-string payload)
:headers {"Content-Type" "application/json"}
:as :json})
points (get-in response [:body :result :points])
existing (map #(get-in % [:payload url-field]) points)
non-existing (remove (set existing) urls)]
{:existing existing
:non-existing non-existing}))
;; Example usage
(def urls-to-check
["https://example.com/page1"
"https://example.com/page2"
"https://example.com/page3"])
(let [{:keys [existing non-existing]} (check-urls-batch qdrant-config urls-to-check)]
(println "Existing URLs:" existing)
(println "Non-existing URLs:" non-existing)
(if (seq non-existing)
(println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
(println "All URLs already exist in the collection.")))
- Qdrant Config: The
qdrant-config
map holds the Qdrant host, collection name, and payload field name (url-field
). Update these to match your setup (e.g.,host
for Qdrant Cloud,url-field
if you use a different key likepage_url
). - HTTP Requests: We use
clj-http
to send POST requests to Qdrant’s REST API (/collections/{collection}/points/scroll
). The:as :json
option ensures the response is parsed as JSON. - Filter Construction:
- For individual checks,
build-url-filter
creates amust
filter with a singlematch
condition for one URL. - For batch checks,
build-batch-url-filter
usesmatch.any
to match any URL in the list.
- For individual checks,
- Response Handling:
- The
scroll
endpoint returns a:result.points
array. If it’s non-empty, the URL exists. - In the batch version, we extract all URLs from the points’ payloads and compute non-existing URLs by set difference.
- The
- Output: Both functions return a map with
:existing
and:non-existing
keys, containing lists of URLs.
For urls-to-check
["https://example.com/page1" "https://example.com/page2"]
:
- If
page1
exists andpage2
doesn’t:Existing URLs: (https://example.com/page1) Non-existing URLs: (https://example.com/page2) Ready to insert 1 new URLs into Qdrant.
- Batch Efficiency: The batch version (
check-urls-batch
) is preferred for large URL lists because it minimizes HTTP requests. TheMatchAny
filter checks all URLs in one go. - Payload Field: Assumes URLs are stored in a payload field named
url
. If your field is different (e.g.,metadata.url
), update:url-field
inqdrant-config
and ensure the filter key matches (e.g.,metadata.url
in the filter). - Error Handling: For production, add error handling for network issues or Qdrant errors:
(defn check-urls-batch [config urls] (try (let [endpoint ...] ;; Same as above ...) (catch Exception e (println "Error querying Qdrant:" (.getMessage e)) {:existing [] :non-existing urls})))
- Collection Existence: To verify the collection exists, you can query
/collections/{collection}
before running checks:(defn collection-exists? [{:keys [host collection]}] (let [response (http/get (str host "/collections/" collection))] (= 200 (:status response))))
- REST vs. gRPC: This uses the REST API for simplicity. If you’re using gRPC, you’d need to interop with the
qdrant-java-client
. I can provide that version if needed. - Inserting Non-existing URLs: After identifying
:non-existing
URLs, you’ll need to:- Generate vectors for the scraped pages (e.g., using a Java/Clojure-compatible embedding library like
sentence-transformers
via interop). - Upsert points to Qdrant using the
/collections/{collection}/points
endpoint. Let me know if you want code for this part!
- Generate vectors for the scraped pages (e.g., using a Java/Clojure-compatible embedding library like
- Qdrant Cloud: If using Qdrant Cloud, update
:host
to your cluster URL (e.g.,https://your-cluster.qdrant.io
) and add an API key:(def qdrant-config {:host "https://your-cluster.qdrant.io" :collection "scraped_pages" :url-field "url" :api-key "your-api-key"}) ;; Add to http/post :headers {"Content-Type" "application/json" "api-key" (:api-key config)}
- Custom Payload: If URLs are nested (e.g.,
{:metadata {:url "..."}}
), use a dotted key in the filter:{:must [{:key "metadata.url" :match {:value url}}]}
- Save the code in a file (e.g.,
src/qdrant_checker.clj
). - Run with Leiningen:
lein run
(orclojure -M -m qdrant-checker
fordeps.edn
). - Adjust
urls-to-check
andqdrant-config
to match your data.
- If you’re ready to insert the
:non-existing
URLs, I can provide Clojure code to generate vectors and upsert points. - If you have a specific embedding model or Qdrant schema, share details for tailored code.
- If you prefer the gRPC client or have other constraints (e.g., async HTTP), let me know.
Let me know how this works or if you need further tweaks!