@mimansajaiswal
Created January 10, 2025 23:49
Zotero Cleanup Scripts

Unfortunately, I didn't start using Zotero from the beginning.

I initially used Mendeley, then switched to Paperpile, and later moved to Notion. While I really liked Notion, it created a separate page for every entry: pages I never used, which bothered me even though they could be ignored. Finally, I switched to Zotero.

However, Zotero's built-in duplicate handling is limited, and I often don't care about the item type of duplicates: I just want a single entry. I prefer conference papers or journal articles over preprints, and preprints over webpages, since I sometimes save OpenReview pages. Zotero has no such preference built in, doesn't always flag duplicate items, and won't merge items whose metadata differs only slightly.

While there's a duplicate extension for Zotero (Zoplicate), it doesn't handle fuzzy matching well. Another major issue is that if you don't let an arXiv page load completely before clicking the Zotero button, the entry is saved as a webpage (with the arXiv ID in the title), creating a mess. Saving from Semantic Scholar sometimes puts \[PDF\] in the title, so those entries are no longer detected as duplicates.

I've been using some scripts to clean this up. I tried contributing to Zoplicate, but it was more complex than anticipated. At least I can now bulk merge while ignoring item type, though that still doesn't account for fuzzy matching, and item type is still not ignored when new items come in.

This is also why I don't prefer the second script as much: it processes all items instead of just newly added ones.
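For what it's worth, a rough version of fuzzy matching can be bolted on with just the standard library. This is only a sketch, not something my scripts currently do: it reuses the same title normalization as the deletion script below and treats two titles as the same paper when difflib's similarity ratio clears a cutoff; the 0.9 threshold is a guess you'd want to tune.

import re
from difflib import SequenceMatcher

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    title = re.sub(r'[^\w\s]', '', title.lower())
    return re.sub(r'\s+', ' ', title).strip()

def same_paper(title_a, title_b, cutoff=0.9):
    """Treat two titles as the same paper when their similarity clears the cutoff."""
    a, b = normalize_title(title_a), normalize_title(title_b)
    return SequenceMatcher(None, a, b).ratio() >= cutoff

# e.g. same_paper('[PDF] Attention Is All You Need', 'Attention Is All You Need') -> True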

I run these via Raycast script commands in silent mode, so that Raycast shows a popup once the IDs are copied to my clipboard (and I use Raycast's clipboard history to make sure I still have them).


I also have a zotero scripts folder that is set up as a uv project with the packages pre-installed, so they aren't reinstalled on every run. Alternatively, you can use an inline script declaration like this:

# /// script
# dependencies = [
#   "pyzotero",
# ]
# ///
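With that header in place, running the script with uv run script.py installs pyzotero into a cached environment automatically; this inline-metadata format is standardized as PEP 723.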

The Raycast script is simple. The commented part is generated by Raycast; I just add the cd and run lines. You could probably get away with plain uv run, but I wanted it to stay portable beyond Raycast (Apple Shortcuts, for example, needs the full path).

#!/bin/bash

# Required parameters:
# @raycast.schemaVersion 1
# @raycast.title Get ArXiv IDs Added as Webpage in Zotero
# @raycast.mode silent

# Optional parameters:
# @raycast.icon 

# Documentation:
# @raycast.author mimansa
# @raycast.authorURL https://raycast.com/mimansa

cd /Users/mimansajaiswal/Documents/Coding/Scripting/zotero_scripts
/Users/mimansajaiswal/.local/bin/uv run {{script_name}}.py | pbcopy
echo "{{Informational Note}}"
# Purpose: find incorrect arXiv ids so that I can use magic add using DOI in Zotero.
import re
from pyzotero import zotero

# Configuration
API_KEY = "API_KEY_HERE"
LIBRARY_ID = LIBRARY_ID_AS_NUMBER

# Initialize Zotero client
zot = zotero.Zotero(LIBRARY_ID, 'user', API_KEY)

# Get all webpage items
items_webpages = zot.everything(zot.items(itemType='webpage'))

# Regular expressions for arXiv URLs and bracketed IDs left in titles
arxiv_url_pattern = re.compile(r'arxiv\.org/(?:abs|pdf|html)/([0-9]{4}\.[0-9]+)(v[0-9]+)?')
arxiv_id_in_title_pattern = re.compile(r'\s*\[[0-9]{4}\.[0-9]+(v[0-9]+)?\]')

# List to store arXiv IDs
arxiv_ids = []

# Process items
for item in items_webpages:
    url = item['data'].get('url', '')
    if 'arxiv' in url.lower():
        match = arxiv_url_pattern.search(url.lower())
        if match:
            arxiv_id = match.group(1)
            version = match.group(2)
            if version:
                arxiv_id += version
            arxiv_ids.append(arxiv_id)
            # Update the title if needed: strip " | Abstract" and the bracketed arXiv ID
            title = item['data'].get('title', '')
            new_title = title
            if ' | Abstract' in new_title:
                new_title = new_title.replace(' | Abstract', '')
            new_title = arxiv_id_in_title_pattern.sub('', new_title).strip()
            if new_title != title:
                item['data']['title'] = new_title
                zot.update_item(item)

print("\n".join(f"arXiv:{arxiv_id}" for arxiv_id in arxiv_ids))
# Purpose: I try to merge as much as I can using (modded) Zoplicate, so this is just for deleting stuff.
import re
from pyzotero import zotero
from pyzotero.zotero_errors import ResourceNotFound

# Configuration
API_KEY = "API_KEY_HERE"
LIBRARY_ID = LIBRARY_ID_AS_NUMBER

# Lower number = higher precedence; unknown item types sort last
ITEM_TYPE_PRECEDENCE = {
    'journalArticle': 1,
    'conferencePaper': 2,
    'preprint': 3,
    'webpage': 4
}

def normalize_title(title):
    """Normalize the title by lowercasing, removing special characters, and trimming."""
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    title = re.sub(r'\s+', ' ', title)
    return title.strip()

def get_arxiv_id_from_url(url):
    """Extract the arXiv ID from a URL, ignoring version numbers."""
    arxiv_url_pattern = re.compile(r'arxiv\.org/(?:abs|pdf|html)/([0-9]{4}\.[0-9]+)')
    match = arxiv_url_pattern.search(url.lower())
    return match.group(1) if match else None

def get_item_precedence(item):
    """Get the precedence value for an item based on its itemType."""
    return ITEM_TYPE_PRECEDENCE.get(item['data'].get('itemType', ''), float('inf'))

# Initialize Zotero client
zot = zotero.Zotero(LIBRARY_ID, 'user', API_KEY)

# Retrieve all items, skipping attachments and notes so child items
# never get grouped by their (often identical) titles and deleted
items = [i for i in zot.everything(zot.items())
         if i['data'].get('itemType') not in ('attachment', 'note')]

# Create paper groups using both arXiv IDs and normalized titles
paper_groups = {}

# First pass: group by arXiv ID and record the ID <-> title mappings
arxiv_to_title = {}  # Maps arXiv IDs to normalized titles
title_to_arxiv = {}  # Maps normalized titles to arXiv IDs
for item in items:
    url = item['data'].get('url', '')
    title = normalize_title(item['data'].get('title', ''))
    arxiv_id = get_arxiv_id_from_url(url) if url else None
    if arxiv_id:
        arxiv_to_title[arxiv_id] = title
        title_to_arxiv[title] = arxiv_id
        # Add to paper groups using the arXiv ID as key
        if arxiv_id not in paper_groups:
            paper_groups[arxiv_id] = []
        paper_groups[arxiv_id].append(item)

# Second pass: handle items without arXiv IDs
for item in items:
    url = item['data'].get('url', '')
    title = normalize_title(item['data'].get('title', ''))
    arxiv_id = get_arxiv_id_from_url(url) if url else None
    if not arxiv_id:  # Only process items without an arXiv ID
        known_arxiv_id = title_to_arxiv.get(title)
        if known_arxiv_id:
            # The same title was seen with an arXiv ID; join that group
            paper_groups[known_arxiv_id].append(item)
        else:
            # Otherwise group by normalized title
            if title not in paper_groups:
                paper_groups[title] = []
            paper_groups[title].append(item)

# Process each group: keep the highest-precedence item, mark the rest
items_to_delete = []
for group_key, group_items in paper_groups.items():
    if len(group_items) > 1:
        group_items.sort(key=get_item_precedence)
        items_to_delete.extend(group_items[1:])

# Delete the lower-precedence items with error handling
for item in items_to_delete:
    try:
        zot.delete_item(item)
    except ResourceNotFound:
        # Item already deleted or modified elsewhere; skip it
        continue
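Because zot.delete_item removes items through the API with no undo, it's worth previewing first. A minimal dry-run sketch, assuming items_to_delete was built by the grouping code above: comment out the deletion loop and run this instead.

# Dry run: print what would be deleted instead of deleting it
for item in items_to_delete:
    data = item['data']
    print(f"would delete [{data.get('itemType', '?')}] {data.get('title', '(no title)')}")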