See: https://stackoverflow.com/questions/5296667/pdftk-compression-option
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dBATCH -dQUIET -sOutputFile=output.pdf input.pdf
'''
DESCRIPTION:
This simple convenience function provides parallelization of pandas .apply().
Adapted from: https://proinsias.github.io/tips/How-to-use-multiprocessing-with-pandas/
REQUIREMENTS:
The `multiprocess` and `dill` packages are required:
```
python -m pip install multiprocess dill
```
'''
import csv
from pathlib import Path

def load_csv(csv_path: Path, ignore_first_row=True, ignore_empty_rows=True, delimiter=','):
    '''
    Returns all the rows of a csv file.
    '''
    rows = []
    with csv_path.open() as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=delimiter)
        if ignore_first_row:
            next(csv_reader)
        for row in csv_reader:
            if ignore_empty_rows and not any(field.strip() for field in row):
                continue
            rows.append(row)
    return rows
import pandas as pd

def jsonl_to_df(jsonl_filepath):
    return pd.read_json(jsonl_filepath, lines=True)

def df_to_jsonl(df, jsonl_filepath):
    payload = df.to_json(orient='records', lines=True)
    with open(jsonl_filepath, 'w') as writer:
        writer.write(payload)
'''
Simple script to split a PDF using the PyPDF2 package in Python.
Oftentimes we need to split an academic paper into the main paper and the
supplementary material before submission.
To do that, the script may simply be run as:
`python split_pdf.py -in CVPR.pdf -s 15 -o`
This produces 2 files: 'CVPR.01-14.pdf' and 'CVPR.15-20.pdf', where the starting
page numbers for each split file are 1 and 15 respectively.
'''
Workaround for logging a simple table that supports step sliding (see issue https://github.com/wandb/wandb/issues/6286).
It is a pity that wandb does not currently support this natively; `wandb.Table` exists but is overkill for the purpose.
The `wandb_htmltable` function follows the same signature as `wandb.Table` and takes input parameters of the same types.
It currently only supports text and image data; image data is realized via its byte string declared in the <img /> tag.
Example:
```
my_data = [
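A sketch of how such an HTML table could be assembled (this is illustrative, not the actual `wandb_htmltable`; the bytes-means-image convention and the helper name are my own assumptions). The resulting string would then be logged at each step via `wandb.log({"table": wandb.Html(html)})`:

```python
# Build an HTML table from columns + rows; bytes cells are treated as
# PNG image data and embedded as base64 data URIs in an <img /> tag.
import base64

def htmltable(columns, data):
    def cell(value):
        if isinstance(value, bytes):  # assumed convention: bytes == PNG image
            b64 = base64.b64encode(value).decode('ascii')
            return f'<img src="data:image/png;base64,{b64}" />'
        return str(value)
    header = ''.join(f'<th>{c}</th>' for c in columns)
    body = ''.join(
        '<tr>' + ''.join(f'<td>{cell(v)}</td>' for v in row) + '</tr>'
        for row in data
    )
    return f'<table><tr>{header}</tr>{body}</table>'
```

Because `wandb.Html` is plain media, it can be re-logged at every step, which is what makes the step slider work where `wandb.Table` does not.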
'''
Resizes images in a source image directory within given size bounds (keeping
aspect ratio) and outputs to a target directory with an identical directory tree
structure. Uses ImageMagick for image resizing.
'''
import os
import argparse
import subprocess
from pathlib import Path
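The core of such a script might look like the sketch below (function name and defaults are illustrative). It assumes ImageMagick's `convert` CLI is on PATH (ImageMagick 7 users may prefer `magick`); the `>` geometry suffix tells ImageMagick to only shrink images that exceed the bounds, preserving aspect ratio:

```python
# Walk the source tree, mirror the directory structure under the target,
# and shell out to ImageMagick `convert` for each image.
import subprocess
from pathlib import Path

def resize_tree(src_dir, dst_dir, max_w=1920, max_h=1080):
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for src in src_dir.rglob('*'):
        if src.suffix.lower() not in {'.jpg', '.jpeg', '.png'}:
            continue
        dst = dst_dir / src.relative_to(src_dir)   # identical tree structure
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ['convert', str(src), '-resize', f'{max_w}x{max_h}>', str(dst)],
            check=True,
        )
```

Passing the arguments as a list avoids shell quoting issues with the `>` character in the geometry string.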
# STEP 1: `$ mkdir ~/bin`
# STEP 2: `$ touch ~/bin/sshfr`
# STEP 3: `$ chmod +x ~/bin/sshfr`
# STEP 4: Copy the following contents into `~/bin/sshfr`
# STEP 5: Update .profile or .bash_profile: `$ export PATH=$PATH":$HOME/bin"`
# STEP 6: Reload .profile or .bash_profile, e.g. `$ . ~/.bash_profile`
# The contents of sshfr are as follows
ADDRESS=$1
PORT_START=${2-49151}
from datetime import datetime
import os
import pandas as pd
import argparse
'''
Note:
- Entries start on row 3 of EduRec excel exports
- 'Student Number' column is mandatory!
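A minimal sketch of loading such an export (the helper name is my own, and I am assuming the header row sits on row 2 so that entries begin on row 3, hence `header=1` with pandas' 0-indexed rows; adjust if the real layout differs):

```python
# Load an EduRec excel export, skipping the leading row(s) so that the
# header lands on row 2 and data entries start on row 3.
import pandas as pd

def load_edurec_export(xlsx_path):
    df = pd.read_excel(xlsx_path, header=1)
    if 'Student Number' not in df.columns:
        raise ValueError("'Student Number' column is mandatory")
    return df
```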
| """ | |
| Intended usage scenario: | |
| You have a directory of pdfs, each comprising of sequential image scans of | |
| human-annotated documents (e.g. written questionaries/forms/exams) where every | |
| document share the same number of pages. Each pdf may contain different | |
| numbers of such scanned documents. You want to split all these pdfs up into | |
| smaller pdfs at fixed page index intervals such that each smaller pdf | |
| correspond to a single scanned document. In addition, you want to place them | |
| place them under a specific output directory while ensuring no filename | |
| collisons. |