Jin Zhe jin-zhe

jin-zhe / parallel_apply.py
Last active December 4, 2024 08:32
Pandas parallel apply function
'''
DESCRIPTION:
This simple convenience function provides parallelization of pandas .apply()
Adapted from: https://proinsias.github.io/tips/How-to-use-multiprocessing-with-pandas/
REQUIREMENTS:
`multiprocess` and `dill` packages are required.
```
python -m pip install multiprocess dill
```
jin-zhe / io.py
Created November 26, 2024 07:11
IO convenience functions for Python
def load_csv(csv_path: Path, ignore_first_row=True, ignore_empty_rows=True, delimiter=','):
    '''
    Returns all the rows of a csv file
    '''
    rows = []
    with csv_path.open() as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=delimiter)
        if ignore_first_row:
            next(csv_reader)
        for row in csv_reader:
jin-zhe / pandas_jsonl.py
Created November 24, 2024 11:15
Pandas convenience functions for reading and writing jsonl files.
import pandas as pd

def jsonl_to_df(jsonl_filepath):
    return pd.read_json(jsonl_filepath, lines=True)

def df_to_jsonl(df, jsonl_filepath):
    payload = df.to_json(orient='records', lines=True)
    with open(jsonl_filepath, 'w') as writer:
jin-zhe / split_pdf.py
Created November 22, 2024 15:58
Simple convenience script to split a PDF using PyPDF2 package in Python.
'''
Simple script to split a PDF using PyPDF2 package in Python.
We often need to split an academic paper into the main paper and the
supplementary material before submission.
To do that, run the script as:
`python split_pdf.py -in CVPR.pdf -s 15 -o`
This produces 2 files: 'CVPR.01-14.pdf' and 'CVPR.15-20.pdf', where the starting
page numbers for each split file are 1 and 15 respectively.
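
The naming scheme described above can be sketched as follows. This is only an illustration of how the page ranges and file names might be derived, not the script's actual code: `split_filenames` is a hypothetical helper, the total of 20 pages is inferred from the example output, and the two-digit zero padding is an assumption based on the file names shown.

```python
from pathlib import Path


def split_filenames(pdf_path, split_starts, total_pages):
    """Return (filename, first_page, last_page) for each part of a split.

    `split_starts` lists the 1-based page numbers at which a new part begins
    (page 1 is always an implicit start). Hypothetical helper for illustration.
    """
    path = Path(pdf_path)
    starts = sorted({1, *split_starts})
    if starts[0] < 1 or starts[-1] > total_pages:
        raise ValueError("split pages must lie within 1..total_pages")
    # Assumed padding rule: at least two digits, so names sort correctly.
    width = max(2, len(str(total_pages)))
    parts = []
    for i, first in enumerate(starts):
        # A part ends just before the next start, or at the final page.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        name = f"{path.stem}.{first:0{width}d}-{last:0{width}d}{path.suffix}"
        parts.append((name, first, last))
    return parts
```

Calling `split_filenames("CVPR.pdf", [15], 20)` yields the two names from the example, each paired with its 1-based page range.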
jin-zhe / wandb_htmltable.py
Created September 13, 2024 08:54
HTML table for wandb that supports images
'''
Workaround for logging a simple table that supports step sliding. (See issue https://github.com/wandb/wandb/issues/6286)
Unfortunately, wandb currently doesn't support this with `wandb.Table`, which is overkill for this purpose.
The `wandb_htmltable` function follows the same signature as `wandb.Table` and takes input parameters of the same types.
It currently supports only text and image data. Image data is embedded as a byte string declared in the <img /> tag.
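
One way to embed image bytes in an `<img />` tag is a base64 data URI. The sketch below illustrates that idea only and is not the gist's implementation: `html_table` is a hypothetical name, its `(columns, data)` arguments are an assumption loosely modelled on the signature mentioned above, and it assumes every `bytes` cell holds PNG data.

```python
import base64
import html


def html_table(columns, data):
    """Render rows as an HTML table; bytes cells become inline images."""
    def cell(value):
        if isinstance(value, (bytes, bytearray)):
            # Embed the image bytes directly in the tag as a base64 data URI.
            # Assumption: the bytes are PNG-encoded.
            encoded = base64.b64encode(value).decode("ascii")
            return f'<img src="data:image/png;base64,{encoded}" />'
        # Everything else is treated as text and escaped.
        return html.escape(str(value))

    head = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell(v)}</td>" for v in row) + "</tr>"
        for row in data
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

The resulting string is plain HTML, so it can be inspected in a browser before it is logged anywhere.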
Example:
```
my_data = [
'''
Resizes images in source image directory within given size bounds (keeping
aspect ratio) and outputs in target directory with identical directory tree
structure. Uses Magick for image resizing.
'''
import os
import argparse
import subprocess
from pathlib import Path
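
The two ideas in the description above, fitting an image inside size bounds while keeping its aspect ratio and mirroring the source directory tree, can be sketched in pure Python. This is a sketch under stated assumptions rather than the script itself, which delegates resizing to Magick: `fit_within` and `mirrored_path` are hypothetical helpers, and the bounds are assumed to be upper limits on width and height with no upscaling.

```python
from pathlib import Path


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the bounds, keeping aspect ratio."""
    # Assumption: bounds are maxima, and smaller images are left untouched.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def mirrored_path(source_file, source_dir, target_dir):
    """Map a file under source_dir to the same relative path under target_dir."""
    return Path(target_dir) / Path(source_file).relative_to(source_dir)
```

For instance, a 4000x3000 image bounded to 1000x1000 would come out at 1000x750, and `src/a/b.jpg` would map to `out/a/b.jpg`.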
jin-zhe / sshfr
Created May 10, 2023 08:07
Custom ssh command with port range forwarding
# STEP 1: `$ mkdir ~/bin`
# STEP 2: `$ touch ~/bin/sshfr`
# STEP 3: `$ chmod +x ~/bin/sshfr`
# STEP 4: Copy the following contents into `~/bin/sshfr`
# STEP 5: Update .profile or .bash_profile: `$ export PATH=$PATH":$HOME/bin"`
# STEP 6: Reload .profile or .bash_profile, e.g. `$ . ~/.bash_profile`
# The contents of sshfr are as follows
ADDRESS=$1
PORT_START=${2-49151}
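
The preview ends before the forwarding logic, so the following Python sketch shows only the general idea of port range forwarding: one `-L` flag per port in a contiguous range. It is not a transcription of the script. `ssh_forward_command` and `port_count` are hypothetical, the default start port of 49151 is taken from the line above, and local forwarding to `localhost` on the remote side is an assumption.

```python
def ssh_forward_command(address, port_start=49151, port_count=10):
    """Build an ssh argv that forwards a contiguous range of local ports."""
    command = ["ssh"]
    for port in range(port_start, port_start + port_count):
        # -L local_port:host:remote_port forwards one port per flag.
        command += ["-L", f"{port}:localhost:{port}"]
    command.append(address)
    return command
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues.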
jin-zhe / edurec_scripts.py
Last active September 7, 2022 07:15
Script for EduRec exports
from datetime import datetime
import os
import pandas as pd
import argparse
'''
Note:
- Entries start on row 3 of EduRec excel exports
- 'Student Number' column is mandatory!
jin-zhe / unpack_pdfs.py
Last active November 22, 2024 16:11
Python script to split a directory of pdfs into smaller pdfs
"""
Intended usage scenario:
You have a directory of pdfs, each comprising sequential image scans of
human-annotated documents (e.g. written questionnaires/forms/exams) where every
document shares the same number of pages. Each pdf may contain a different
number of such scanned documents. You want to split all these pdfs into
smaller pdfs at fixed page index intervals such that each smaller pdf
corresponds to a single scanned document. In addition, you want to place
them under a specific output directory while ensuring no filename
collisions.
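
The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the script's code: `document_ranges` and `output_names` are hypothetical helpers, and the `<source stem>_<index>.pdf` naming is an assumed way to avoid collisions. It relies on source file names being unique within the input directory.

```python
from pathlib import Path


def document_ranges(total_pages, pages_per_document):
    """Yield 0-based (start, stop) page index ranges, one per scanned document."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be positive")
    for start in range(0, total_pages, pages_per_document):
        # The final range is clipped if the pdf has leftover pages.
        yield start, min(start + pages_per_document, total_pages)


def output_names(pdf_path, total_pages, pages_per_document):
    """Name each piece after its source file plus a running index."""
    stem = Path(pdf_path).stem
    count = -(-total_pages // pages_per_document)  # ceiling division
    width = len(str(count))  # zero-pad so names sort correctly
    ranges = document_ranges(total_pages, pages_per_document)
    return [f"{stem}_{i:0{width}d}.pdf" for i, _ in enumerate(ranges, start=1)]
```

Because every output name starts with the stem of its source pdf, pieces from different source files cannot overwrite each other in a shared output directory.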
jin-zhe / CS_paper_scaffold.md
Last active November 20, 2019 08:02
Scaffold for computer science paper

THIS IS A WORK IN PROGRESS

ABSTRACT

Summarize

  • What the problem is
  • What prior methods entail
  • What you propose/claim/hypothesize in this work
  • How/why it is better
  • Experimental support for your proposal
  • Any additional insights (if applicable)