@bxt
Last active September 26, 2025 10:57
Dealing with a large number of files with errors in VS Code, migrations etc

Awesome tools to process lots of files

When dealing with large, old code bases (like I do on a daily basis) you'll often want to migrate something across a large number of files. For this, you should have the necessary tests and linters in place so you can change things confidently. However, you still need to edit a huge number of files, probably more than you can change manually. And what do you do when errors pop up? Over time I adopted some tools for dealing with errors in a large number of files.

Tool 1: Finding files with find and git ls-files | grep

If you need to change a specific kind of file, the most basic step is to find and list them all. The standard CLI utility for this is find, which can be used in various ways:

find . -name 'tsconfig.json' # Find all files with a specific name
find src/ -iname '*test*' # Find all files in a directory with "test" in the name
# Find all package.json files except in the node_modules directory:
find . -type d -name node_modules -prune -false -o -name package.json

As in the last example, more often than not you'll want to exclude files outside of your codebase, e.g. in the node_modules directory or other files ignored by git. For that I found it rather useful to just use git ls-files in combination with grep:

git ls-files | grep '.tsx\?$' # Find all TS files known to git
git ls-files src/ # Find all files in src known to git
git ls-files | grep '.tsx\?$'  | grep -v '/i18n' # exclude i18n

If you are using git anyway, you could also look at the files changed in e.g. the last 2 commits, since branching off of your main branch, or the files changed in a certain commit:

git diff --name-only HEAD~2
git diff --name-only $(git merge-base HEAD main)
git show --pretty= --name-only c0ffeeba5e

Tool 2: Common command line tools

While refining your file searches, pipe to | wc -l to count the results or | head -n 10 to show only the first 10. Occasionally you might also need | sort -u to remove duplicates. You can use > myfiles to persist the list to a file and cat myfiles to load it again.
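
Put together, such a refinement session might look like this (reusing the TS file pattern from above):

git ls-files | grep '.tsx\?$' | wc -l # how many files are we talking about?
git ls-files | grep '.tsx\?$' | head -n 10 # peek at the first 10
git ls-files | grep '.tsx\?$' | sort -u > myfiles # de-duplicate and persist the list
cat myfiles # load it again later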

You can use xargs to pass your list to commands that expect a list of files as positional arguments:

cat myfiles | xargs $(yarn bin prettier) --write
git ls-files src/controllers/ | xargs yarn --silent eslint

Tool 3: VS Code

Most likely you are already using VS Code to write code. But it is also a great tool to have in your belt for dealing with huge error lists and for composing and processing file lists. You can paste surprisingly huge file lists into an editor and work with them comfortably.

Multi-select

I love using the multi-select feature to extract file lists from command outputs. For example, tsc might give you a huge output with errors like this:

/your-code/some-file:2:21 - error TS7016: Could not find a declaration file for module './foo'. '/user/your-code/foo.js' implicitly has an 'any' type.
 2 export { Foo } from './foo';
                       ~~~~~~~
/your-code/some-other-file:3:21 - error TS7016: Could not find a declaration file for module './foo'. '/user/your-code/foo.js' implicitly has an 'any' type.
 3 export { Foo } from './foo';
                      ~~~~~~~

Well, VS Code will of course also show you those errors – but only if you open all of the broken files. Here's how you could extract the list of broken files:

  • Select - error TS
  • Hit cmd + shift + L, this will multi-select all instances of that text – now you have a cursor in each line with a filename
  • Hit option + ← 7 times, this will move you 7 words towards the front of the line – right to where the filenames end
  • Hit shift + control + E, this will select everything to the end of the lines – i.e. all of the error message
  • Hit backspace to remove that
  • Hit cmd + C to copy the lines with the filenames
  • Hit cmd + N to open a new file
  • Hit cmd + V to paste just the lines with the filenames
  • Bonus if you have the awesome Sort lines extension installed:
    • Hit cmd + A to select the whole list of filenames
    • Hit cmd + shift + P to open the command palette
    • Type "rem dup" and from the autocomplete select "Sort lines (remove duplicate lines)"

Nice! This should give you a tidy list of the files with compiler errors.
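
If you prefer to stay in the terminal, here's a rough equivalent of the steps above. This is only a sketch: it assumes you saved the compiler output to a (hypothetical) tsc-output.txt and that it uses the file:line:col format shown above:

# keep only the error lines, cut off everything after the file path, de-duplicate
grep 'error TS' tsc-output.txt | cut -d: -f1 | sort -u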

The code CLI tool

Once we have a list of files we can do something cool: we can use xargs to pass the list to the code CLI command (assuming you have it set up), for example to open all files in the helpers directory:

git ls-files src/helpers/ | xargs code

This way, all the files are open in tabs, and you can fix and close them one by one.
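
As far as I know, code also has a --goto (or -g) flag that accepts file:line:column locations, so if your list still contains the positions from the compiler output you can jump straight to each error. Again a sketch, reusing the hypothetical tsc-output.txt from above:

# the first whitespace-separated field is already in file:line:col form
grep 'error TS' tsc-output.txt | cut -d' ' -f1 | xargs -n1 code -g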

Search and replace

Of course, with a huge number of errors, fixing them manually might take ages. Luckily VS Code also has great search functionality. And more importantly, it lets us replace things.

  • Did you know you can search for multiline strings? Just hit shift + enter in the search field to start a new line
  • If you use the regular expression search, you can put parts of the pattern in parentheses () and then reference whatever content was matched in your replacement pattern with $1, $2 and so on. For example, search for process\((.*), (.*)\) and replace with process($2, $1) to swap the parameters around.
  • In a more complex codebase you might need more clever regular expressions. You can e.g. use [^)] instead of .* to match at most up to the closing parenthesis. Often it's also a good idea to click the Aa button for a case-sensitive search. Watch the number of matches as you make changes to see whether you are going in the right direction.
  • You can use cmd + shift + F to search in multiple files. You can replace \n with , in a file list and then paste it into the "files to include" field to make sure you don't replace things in random places of your codebase (see the one-liner after this list).
  • If you first enable the regex search and then select a search term before hitting cmd (+ shift) + F, VS Code is so nice as to escape all those special characters that would otherwise have a different meaning in the regular expression.
  • Another feature is case replacements, e.g. recently I replaced navTitle: (.) with navTitle: \u$1 to make the first letter of all titles an upper case one. There's also \l for one lower case character, and \U / \L to make all letters of the match upper / lower case. You can even say e.g. \l\l\u\L$1 to make the 3rd letter upper case and all others lower case.
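
Regarding the "files to include" tip above: if the file list lives in a file (like the myfiles from earlier), a quick way to get the comma-separated form is the paste utility, for example:

# join all lines of myfiles with commas, ready for the "files to include" field
paste -s -d, myfiles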

Tool 4: pbpaste

This utility prints whatever text you copied to standard output in the terminal. After composing a list of files in VS Code, I like to bring it back to the terminal using pbpaste and pipe it to some other command. This has two advantages: first, you don't spam your shell's history with huge file lists and you don't need a temporary file to save them; second, you can copy a different list of files and run the same command again. There is one caveat: if you copy something else (often the command you want to run) just before running your command, that new content is what pbpaste will return, so always type the command first and copy the file list last.

Many times I compose a list of files in VS Code and then use the following to open exactly those files in tabs:

pbpaste | xargs code

There's also the opposite, pbcopy, which can be useful occasionally, but more often than not you can just select the text that is already in your terminal.

You can use the two commands in combination to process whatever you have copied. For example, if you don't have the above-mentioned Sort lines extension installed, you can do the following:

  • Type (but don't execute yet) the following into your terminal: pbpaste | sort -u | pbcopy
  • Hit cmd + A in VS Code to select the whole list of filenames
  • Hit cmd + C to copy the lines with the filenames
  • Hit enter in the terminal, this will execute the command and replace what you had copied
  • Hit cmd + V in VS Code to paste the deduplicated lines

The possibilities are endless... I also like to pipe the output of curl to pbcopy. This way the terminal does not get flooded and you can paste the results into an editor if needed.
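
For example (with a hypothetical URL standing in for whatever you are actually fetching):

# send the response to the clipboard instead of flooding the terminal
curl -s https://api.example.com/users.json | pbcopy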

Tool 5: jq

Another great tool to have in your belt is jq, which allows you to work with JSON efficiently. The most basic functionality is pretty-printing: just pipe any JSON output through it, or pass it any JSON file, and it will show the data nicely formatted and with syntax highlighting in your terminal.
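
For example, to pretty-print a JSON file, or the output of a JSON-producing command (like the eslint calls further down):

jq . package.json
yarn --silent eslint --format json src/foo/bar | jq .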

However, where it really shines is processing JSON contents. And many tools happen to have an option to output JSON instead of plain text.

For example, recently I wanted to introduce some new linter rules, so I wanted to get an overview of how many violations we would get, and I ran the following to summarize them by rule name, looking at errors only:

yarn --silent eslint --format json src/foo/bar | \
  jq 'map(.filePath as $f | .messages[] | select(.severity >= 2) | . + {filePath: $f})
    | group_by(.ruleId)
    | map({ruleId: .[0].ruleId, count: length})'

After figuring out which rule I wanted to fix (by hand :/), I combined this with the code tool to open all the affected files in VS Code:

yarn --silent eslint --rule "@typescript-eslint/no-unused-vars: error" --format json src/foo/bar | \
  jq 'map(.filePath as $f | .messages[] | select(.severity >= 2) | . + {filePath: $f})
    | group_by(.filePath)
    | map(.[0].filePath) | .[]' --raw-output \
  | xargs code

I then have them all open in tabs and can decide what to do one by one.

And by the way, LLMs are also quite good at writing jq code, so you can give your preferred one a snippet of your JSON and ask it to write the query for you. You could even try a local one so you don't give away any sensitive data:

cat <(yarn --silent eslint --format json xinglets/login/login/xinglet.json) <(echo can you give me a jq command to give me a count of violations per rule for the above JSON please) | ollama run llama3.2

However, I found that anything running locally is not quite as clever yet.

Tool 6: Perl, sed and awk

Last but not least, there are the classic tools like Perl, sed and awk that still come in handy from time to time for processing certain texts. For example, here I replaced some imports:

git ls-files | grep '.tsx\?$' | \
xargs perl -i -p0e 's/import type \{(([^}]|\n)+ )type /import type {$1/smg'

Most editors also have search-and-replace; however, consider the situation where you have a bigger branch that you work on for some time. After each rebase, there might be new files that need the same changes. In this case it is easier to just run the same perl command once more than to redo it in an editor.

For perl, I hand over -p0, which makes it treat the whole file as a single line; in combination with the modifiers at the end of the regex (/smg), this means you can replace without being limited to a single line at a time. I think this mostly replaces sed for me, actually, which can also be used to write small programs that modify files.
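
Here is a minimal sketch of that multi-line style, with a made-up pattern and file path:

# collapse a call that got split across two lines back onto one line
perl -i -p0e 's/registerUser\(\n\s*\)/registerUser()/g' src/example.ts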

Finally, there is awk. It is a bit like jq for non-JSON data. With its focus on "records" you can easily output the nth field of a line with space-separated values. For example, I built this command to list large files:

git ls-files src/components | grep '.tsx\?$' | xargs wc -l | \
  awk '$1 > 300 && $2 != "total" { print $2 }'
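
Building on that, you could e.g. sum up the line counts per containing directory instead of listing individual files. A sketch:

git ls-files src/components | grep '.tsx\?$' | xargs wc -l | \
  awk '$2 != "total" { n = split($2, parts, "/"); sum[parts[n-1]] += $1 } END { for (d in sum) print sum[d], d }'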

You can go totally crazy with awk, but I guess nowadays for most complex things you would just write a small Node.js script.
