I integrated Toponymy with OpenAI following your suggestion and used AsyncOpenAI, which works well for embeddings and the initial clustering. However, during the topic naming step I’m hitting a BadRequestError when naming clusters with very large keyphrase sets. It seems some prompts exceed the API input limits.
What would you recommend as the best solution? Should I patch make_prompts() to truncate keyphrases per cluster (e.g. top 30–50), or is there an existing parameter or preferred way to limit the prompt size for topic naming?
The OpenAI integration otherwise works fine, with async speeding things up as expected!
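For concreteness, the kind of patch I have in mind is roughly this (my own sketch; `truncate_keyphrases` is a hypothetical helper, not part of Toponymy, and it assumes each cluster's keyphrases are already sorted by relevance):

```python
# Hypothetical workaround (not a Toponymy API): cap keyphrases per cluster
# before they reach the prompt builder, so prompts stay under the model's
# input limit.
def truncate_keyphrases(keyphrases_per_cluster, max_per_cluster=50):
    """Keep only the first max_per_cluster keyphrases for each cluster,
    assuming each cluster's list is sorted most-relevant first."""
    return [phrases[:max_per_cluster] for phrases in keyphrases_per_cluster]
```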
It works pretty nicely now, but I couldn't recreate the interface. It seems the package has been updated and some functions no longer exist. Does anyone have any thoughts on this?
Yes, sorry, things are under reasonably active development and I haven't had time to update this gist. You'll want `enable_topic_tree=True` instead of `enable_table_of_contents=True` for newer versions of datamapplot.
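For example, a minimal sketch (the toy data stands in for your real map coordinates and Toponymy labels; double-check argument names against your installed datamapplot version):

```python
import numpy as np
import datamapplot

# Toy data: in practice these come from your embedding/clustering pipeline.
document_map = np.random.normal(size=(100, 2))          # 2D data map coordinates
labels = np.array(["topic A"] * 50 + ["topic B"] * 50)  # one label layer

# Newer datamapplot: enable_topic_tree replaces enable_table_of_contents.
plot = datamapplot.create_interactive_plot(
    document_map,
    labels,
    enable_topic_tree=True,
)
plot
```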
Note that, as with Azure AI Foundry, you will need to install the relevant package to enable it within Toponymy. So if you want to use OpenAI then you'll need to install `openai` into your environment for Toponymy to see it, and so on. Anthropic, Cohere, and OpenAI are all available, as well as local LLMs (assuming you have a GPU) via `llama_cpp`, and, in the most recent version on GitHub, vLLM. Note also that, at the time the gist was written, the async/batch versions of the service wrappers weren't available, so the gist doesn't use them; you may want to consider using those instead as they will be faster. Just prefix the wrapper name with `Async`, so for example `AsyncOpenAI`, etc.
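A minimal sketch of the swap (the import path and constructor arguments are assumptions based on the synchronous wrapper; check them against your installed version):

```python
# Sketch only: assumes the async wrapper mirrors the synchronous OpenAI
# wrapper and lives in toponymy.llm_wrappers; verify against your version.
from toponymy.llm_wrappers import AsyncOpenAI

llm_wrapper = AsyncOpenAI(api_key="sk-...")  # exact arguments may differ
# Pass llm_wrapper to Toponymy exactly as you would the synchronous wrapper.
```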