See: https://stackoverflow.com/questions/5296667/pdftk-compression-option
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dBATCH -dQUIET -sOutputFile=output.pdf input.pdf
'''
DESCRIPTION:
This simple convenience function provides parallelization of pandas .apply().
Adapted from: https://proinsias.github.io/tips/How-to-use-multiprocessing-with-pandas/
REQUIREMENTS:
The `multiprocess` and `dill` packages are required.
```
python -m pip install multiprocess dill
```
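The helper itself isn't shown above, so here is a minimal sketch of what such a parallelized `.apply()` can look like with `multiprocess.Pool`. The name `parallel_apply`, the chunking strategy, and the `n_workers` parameter are illustrative assumptions, not the original implementation:
```
import multiprocess as mp
import numpy as np
import pandas as pd


def parallel_apply(df, func, n_workers=4):
    # Split the frame into roughly equal chunks, apply `func` row-wise to each
    # chunk in its own worker process, then stitch the results back together.
    # `multiprocess` pickles the lambda via dill, which stdlib multiprocessing cannot.
    chunks = np.array_split(df, n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(lambda chunk: chunk.apply(func, axis=1), chunks)
    return pd.concat(results)
```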
import csv
from pathlib import Path


def load_csv(csv_path: Path, ignore_first_row=True, ignore_empty_rows=True, delimiter=','):
    '''
    Returns all the rows of a csv file
    '''
    rows = []
    with csv_path.open() as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=delimiter)
        if ignore_first_row:
            next(csv_reader)
        for row in csv_reader:
            if ignore_empty_rows and not any(field.strip() for field in row):
                continue
            rows.append(row)
    return rows
import pandas as pd


def jsonl_to_df(jsonl_filepath):
    return pd.read_json(jsonl_filepath, lines=True)


def df_to_jsonl(df, jsonl_filepath):
    payload = df.to_json(orient='records', lines=True)
    with open(jsonl_filepath, 'w') as writer:
        writer.write(payload)
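A quick round-trip usage example; the file names here are hypothetical:
```
df = jsonl_to_df('records.jsonl')         # hypothetical input file
df_to_jsonl(df.head(10), 'sample.jsonl')  # write the first 10 records back out
```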
'''
Simple script to split a PDF using the PyPDF2 package in Python.
Often we need to split an academic paper into the main paper and the
supplementary material before submission.
To do that, the script may simply be run as:
`python split_pdf.py -in CVPR.pdf -s 15 -o`
This produces 2 files: 'CVPR.01-14.pdf' and 'CVPR.15-20.pdf', where the starting
page numbers for each split file are 1 and 15 respectively.
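As a rough illustration of the splitting step, here is a minimal sketch using PyPDF2. The `split_at` helper, its arguments, and the output naming mirror the example above and are assumptions, not the original script's CLI handling:
```
from PyPDF2 import PdfReader, PdfWriter


def split_at(in_path, split_page):
    # Write pages [1, split_page - 1] and [split_page, N] to two separate files,
    # named after the input file as in the example above.
    reader = PdfReader(in_path)
    first, second = PdfWriter(), PdfWriter()
    for i, page in enumerate(reader.pages, start=1):
        (first if i < split_page else second).add_page(page)
    stem = in_path[:-len('.pdf')] if in_path.endswith('.pdf') else in_path
    n_pages = len(reader.pages)
    with open(f'{stem}.01-{split_page - 1:02d}.pdf', 'wb') as f:
        first.write(f)
    with open(f'{stem}.{split_page:02d}-{n_pages:02d}.pdf', 'wb') as f:
        second.write(f)
```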
'''
Workaround for logging a simple table that supports step sliding. (See issue https://github.com/wandb/wandb/issues/6286)
It's a pity that wandb currently doesn't support this; `wandb.Table` is overkill for it.
The `wandb_htmltable` function follows the same signature as `wandb.Table` and takes input parameters of the same types.
It currently only supports text and image data. Image data is embedded as a byte string in the <img /> tag.
Example:
```
my_data = [
'''
Resizes images in source image directory within given size bounds (keeping
aspect ratio) and outputs to target directory with identical directory tree
structure. Uses ImageMagick for image resizing.
'''
import os
import argparse
import subprocess
from pathlib import Path
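A minimal sketch of the core loop under stated assumptions: the `magick` CLI is on PATH, only a few common image extensions are handled, and the `max_size` bound is an example value, not the original script's arguments:
```
import subprocess
from pathlib import Path


def resize_tree(src_dir: Path, dst_dir: Path, max_size='1024x1024'):
    # Mirror the source directory tree under dst_dir and shell out to ImageMagick.
    for src in src_dir.rglob('*'):
        if src.suffix.lower() not in {'.jpg', '.jpeg', '.png'}:
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # '-resize WxH' keeps aspect ratio; the '>' suffix only shrinks images larger than the bound
        subprocess.run(['magick', str(src), '-resize', f'{max_size}>', str(dst)], check=True)
```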
# STEP 1: `$ mkdir ~/bin`
# STEP 2: `$ touch ~/bin/sshfr`
# STEP 3: `$ chmod +x ~/bin/sshfr`
# STEP 4: Copy the following contents into `~/bin/sshfr`
# STEP 5: Update .profile or .bash_profile: `$ export PATH=$PATH":$HOME/bin"`
# STEP 6: Reload .profile or .bash_profile, e.g. `$ . ~/.bash_profile`
# The contents of sshfr are as follows
ADDRESS=$1
PORT_START=${2-49151}
from datetime import datetime
import os
import pandas as pd
import argparse
'''
Note:
- Entries start on row 3 of EduRec excel exports
- 'Student Number' column is mandatory!
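A minimal sketch of loading such an export given those two notes. The function name and file handling are illustrative, and it assumes the column headers sit on spreadsheet row 2 so that entries begin on row 3:
```
import pandas as pd


def load_edurec_export(xlsx_path):
    # header=1 treats spreadsheet row 2 as column names, so data starts on row 3 (assumption)
    df = pd.read_excel(xlsx_path, header=1)
    if 'Student Number' not in df.columns:
        raise ValueError("'Student Number' column is mandatory!")
    return df
```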
""" | |
Intended usage scenario: | |
You have a directory of pdfs, each comprising of sequential image scans of | |
human-annotated documents (e.g. written questionaries/forms/exams) where every | |
document share the same number of pages. Each pdf may contain different | |
numbers of such scanned documents. You want to split all these pdfs up into | |
smaller pdfs at fixed page index intervals such that each smaller pdf | |
correspond to a single scanned document. In addition, you want to place them | |
place them under a specific output directory while ensuring no filename | |
collisons. |
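A minimal sketch of that splitting loop, assuming PyPDF2 is used; the function name, the `pages_per_doc` parameter, and the output naming scheme are illustrative assumptions rather than the original script:
```
from pathlib import Path
from PyPDF2 import PdfReader, PdfWriter


def split_fixed_intervals(pdf_path: Path, out_dir: Path, pages_per_doc: int):
    # Split one pdf into fixed-size chunks and write each chunk under out_dir,
    # prefixing outputs with the source file's stem to avoid filename collisions.
    reader = PdfReader(str(pdf_path))
    out_dir.mkdir(parents=True, exist_ok=True)
    for doc_idx, start in enumerate(range(0, len(reader.pages), pages_per_doc)):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_doc, len(reader.pages))):
            writer.add_page(reader.pages[i])
        out_path = out_dir / f'{pdf_path.stem}_{doc_idx:03d}.pdf'
        with out_path.open('wb') as f:
            writer.write(f)
```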