@sneakers-the-rat
Last active June 10, 2024 08:32
Elsevier PDF "hashes"
[
"FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu",
"D2ei2mgqJz9b-m.mGmPqRyLNNnwmOlt7.ywiGmt-Kndr9otqRywv8o9ePmtiNmd2Sn92Tma",
"6U7vcmPuOn9uLnMaGyM7-nLNNntv9lt6RmtaGmweOyMmJnMmSmgmOo9eOnM6LnMaRmM-Tma",
"lXLf8owyQztiMzwqGnMz7zcNNotb7lwf.m9qGzt6Km.qMngqLndqLo9eOotaNm96Mmt6Tma",
"FCi27y9qOnd-Ny96GmPmOmcNNzwf-lwj-m9mGztz7ytaMnM78n9v-o9ePmM6Rm9-Qn9eTma",
"XlEDumMz7nM7-m9iGogmRmLNNyt_8lwiKz9eGm9-Pm.v7ztiLztz_o9eOnMeQnd-Sodm",
"lXLf8yt-JywmNmPeGm9n9n8NNzgn.lt_8zwqGogz7zgn7zt6SyPr-o9eOnM6Pot2Mn9qTma",
"FCi27zgf8mdqMmMeGnMmMy8NNz9eQlweNy.eGmMiMm96Qmgr9nMb-o9ePmtuRmt6JotmTma",
"FCi27nwmKnMeSodeGm.z.y8NNntz.lt-PywmGy9__ngqQmtiPmtb7o9ePmteJotyJoduTma",
"HIoniz.qOnd-Nmt-GmteNn8NNot7.lt-QndaGnPv.mdaMmt6RnMqMo9ePmdmOmdiKod-Tma",
"ZtV1wntuPyPn9z.qGyPv7msNNytz7lwiKyM6GntmJnt_-nteRm.mRo9eOnM6Pot2MnMyTma",
"d2UUdywiJmtz7zt-Gm9eQmcNNzt2Qlwf7m9uGzd_7zdf7owr9yMqOo9ePmtaKnM2NmduTma",
"tprDsnMeJn9iOnweGnPuQnsNNz.eMlt-Qm.mGotz.ytiNz.yRmd-Mo9eOnM6Pot2OmM6Tma",
"tprDsyPiNn9iQn9-GmMiSy8NNn96Llwf9owiGowqQyMiRzwv_ngqPo9eOnM6Pot2OndyTma",
"ZIFNOztmRotn9owiGzduNmsNNnd-Rlt_8otiGot-Oy92QnMeSyMqKo9eOnM6Pot2OntaTma",
"D2ei2nMb_zwmSowyGzwv8mLNNotj8lt-My9yGmtaModaNm92RytySo9ePmtaKn92Qmt2Tma",
"d2UUdot__owr-y9mGodqLocNNn.eOlwmPmtaGmgj7ndn_nMiMndiNo9ePmdiLnMmPotmTmq",
"6U7vcmtuSndmSntqGmdiMy8NNnPz7lt_7ndeGmtv7n9eLndj_zduJo9ePmtiOntmNntmTma",
"ZtV1wn9mMnd2MzwiGz9eRysNNmgySlt7_ot-Gy97.mgiKotqKnt_.o9eOnM6Pot2Mn96Tma",
"XlEDuyweNmtz9ntqGm9aMocNNodr9lt__z9iGmdj_n9yNnt6Sm9-Lo9ePmd6KotmRnM2Tma",
"HIonintn-z9uPogmGnMeSzsNNogf-lwj.z.qGmgqSn9yPndf7mdmLo9eOotuLm9aNodqTma",
"ZlkjsyMj7mPr.ndiGowuMmcNNy.mNlwj9m.yGmtb7z.qRz.iKyt38o9eOnM6Pot2MnMeTma",
"Dpairmdj9mPr8nwmGn.r7z8NNnMb7lwj8otiGyt-MzwuKzd__nt39o9ePmtaPotaJm9-Tma",
"6mIUqngiNzduNn9iGmgeJnsNNot2Rlt-SzguGzt2Oodf_n.eNodz.o9eOn9mQnMqOm9e",
"FCi27mwr_mPn-m.mGmPuKncNNmduOlweOytuGogj.yMv-z92Pyt6Mo9eOnM6Pot2Mn9yTma",
"6U7vcngj-zt2Ln.uGodr8mcNNmdeSlweKmd2Gzdz9nM3_mgf7yt2Ro9ePmt6Sn9qLntyTma",
"zjJBNmPn.mdiRntiGzgmPnLNNmM2Klt6JmMqGy9aNz9aMmdv_mwuNo9ePm96Qm9iRndiTma",
"FCi27mPmRnPiKngeGngqJzcNNogj8lwj-zwiGnPiLmtb7y9qKzgeMo9eOnMeLn9aNm9m"
]
import exiftool
from pathlib import Path
import json
import re

paper_root = Path.home() / 'location/of/papers'
hashes = []
processed = 0
# long self-closing tags holding the opaque identifier strings
rehash = re.compile(r'<([0-9A-Za-z_.-]{40,})/>')

try:
    with exiftool.ExifTool() as et:
        for path in paper_root.glob('**/*.pdf'):
            # extract the raw XMP packet from each PDF
            md = et.execute(b'-b', b'-xmp', str(path).encode('utf-8'))
            try:
                md = md.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Couldn't decode {path}")
                continue
            ahash = rehash.findall(md)
            hashes.extend(ahash)
            if len(ahash) > 0:
                processed += 1
finally:
    # dump whatever we found, even if the scan was interrupted
    with open('elsev_hashes.json', 'w') as hashfile:
        json.dump(hashes, hashfile, indent=2)
    print(f'processed {processed} files')
@sneakers-the-rat (Author)

Updated after
https://twitter.com/horsemankukka/status/1486268962119761924?s=20

let me know that the tags were being parsed incorrectly. Rescanned and found a few more. Also attaching the (very simple) code above so you can check my work.


cbandy commented Jan 27, 2022

The few I downloaded from open access were visible to grep; usually toward the end of the file in an XML stream:

grep -Ena '<[^/]{50,}/>' *.pdf

A variation on https://twitter.com/Jofkos/status/1486244612960366593.
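The same scan can be done in bulk from Python without exiftool, applying an equivalent pattern directly to the raw PDF bytes (a minimal sketch; the `{40,}` length threshold mirrors the regex in the script above, and `scan_pdf` is just an illustrative helper name):

```python
import re
from pathlib import Path

# Long self-closing tags like <...opaque-payload.../>, per the grep above
TAG_RE = re.compile(rb'<([0-9A-Za-z_.-]{40,})/>')

def scan_pdf(path):
    """Return any long tag payloads found anywhere in a PDF's raw bytes."""
    data = Path(path).read_bytes()
    return [m.decode('ascii') for m in TAG_RE.findall(data)]
```

Scanning raw bytes only works when the XMP stream is stored uncompressed, which matches the observation above that grep can see it.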


Aariq commented Jan 28, 2022

Some more examples here with associated DOIs: https://gist.github.com/Aariq/a23958e168e347f1bacf9dfa777b911f


rgrunbla commented Jan 30, 2022

I managed to get hashes that are very close on the same paper ( https://doi.org/10.1016/j.ijhydene.2021.11.149 ) :

lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma
FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma
LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma
w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma

I put some spaces in the hashes because I think there are patterns at those positions.

Hashes obtained later still look very different, though.

Here is some information about the files, in the same order as the hashes:

  File: 1-s2.0-S0360319921045377-main.pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067528     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:19:53.072211357 +0100
Modify: 2022-01-29 14:19:53.185217711 +0100
Change: 2022-01-29 14:19:53.325225583 +0100
 Birth: 2022-01-29 14:19:53.072211357 +0100
  File: 1-s2.0-S0360319921045377-main(1).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067359     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:19:57.310442520 +0100
Modify: 2022-01-29 14:19:57.493452096 +0100
Change: 2022-01-29 14:19:57.539454503 +0100
 Birth: 2022-01-29 14:19:57.310442520 +0100
  File: 1-s2.0-S0360319921045377-main(2).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067360     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:20:04.484795768 +0100
Modify: 2022-01-29 14:20:04.608801481 +0100
Change: 2022-01-29 14:20:04.663804016 +0100
 Birth: 2022-01-29 14:20:04.484795768 +0100
  File: 1-s2.0-S0360319921045377-main(3).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067005     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:20:09.293007869 +0100
Modify: 2022-01-29 14:20:09.448014381 +0100
Change: 2022-01-29 14:20:09.492016229 +0100
 Birth: 2022-01-29 14:20:09.293007869 +0100
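The shared segments between those near-identical hashes can be extracted mechanically rather than by eye; a quick sketch using difflib on the first two strings above (spaces removed; `common_runs` is just an illustrative helper):

```python
from difflib import SequenceMatcher

# the first two hashes above, with the manually-inserted spaces removed
h1 = "lXLf8ndj8y.uMn9qGyPn8m8NNogiMlt-SyPuGy.z8zwf8zgiNmMqMo9ePndmNn9iMotiTma"
h2 = "FCi27ndj8y.uMn9qGyPn8m8NNogiMlt-SyPuGy.z8zwf8zgiNmMqMo9ePndmNn9iMot2Tma"

def common_runs(a, b, min_len=4):
    """Return (position, substring) for every long run shared by two hashes."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, a[m.a:m.a + m.size])
            for m in sm.get_matching_blocks()
            if m.size >= min_len]

for pos, run in common_runs(h1, h2):
    print(f'shared {len(run)} chars at offset {pos}: {run}')
```

Run pairwise over many hashes of the same paper, this would map out which offsets are fixed per-document and which vary per-download.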

@sneakers-the-rat (Author)

WOW, that looks like they might just be timestamps, which is LAZY on their part. I'll try to systematically sample across time and see if I can get repeating patterns / match subsections to times. I think you're right, those do seem to be independent, repeatable sections.
