Problem: Determine which Knowledge Articles in NA33 contain embedded images (<img tags). Of these, some reference an opaque Salesforce URL because the images were pasted directly into the Salesforce Knowledge Article Rich Text Editor; others do not. Articles that contain no IMG tags can be migrated from NA33 to NA35 with an ETL tool. Those with IMG tags must be further divided into those that have opaque Salesforce image URLs and those that have other URLs (either publicly available ones or ones that were not adequately captured when the original curation took place during TSA12).
Solution: Extract the IDs and corresponding articles in a way that makes it easy to search the article for IMG tags and know the article's ID when a match is found. In the procedure listed here, the way to do this is to ensure that each article is entirely on a single line.
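To illustrate why one article per line matters, the following sketch builds a small made-up CSV in which one article contains a hard return, then shows that a plain grep for <img reports a line with no ID on it. The IDs, the content, and the helper name count_img_lines_with_id are assumptions for illustration only.

```shell
# Count lines that contain <img AND start with a quoted article ID ("ka1...).
# Hypothetical helper, not part of the original procedure.
count_img_lines_with_id() {
  grep '<img' "$1" | grep -c '^"ka1'
}

# Made-up sample: the second article has an embedded hard return.
sample=$(mktemp)
printf '%s\n' \
  '"Id","Article__c"' \
  '"ka1000000000001AAA","<p>No images here</p>"' \
  '"ka1000000000002AAA","<p>Intro</p>' \
  '<img src=""http://example.com/a.png""><p>End</p>"' \
  > "$sample"

# The only matching line is the continuation of the second record,
# so the ID is not on the line that grep reports.
grep '<img' "$sample"

# Prints 0: no matching line carries its article ID.
count_img_lines_with_id "$sample"

rm -f "$sample"
```

Once each article sits on a single line, every line that grep reports begins with the article's ID.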
Prerequisites and Tools
This process requires the following:
- Bash shell (such as Git Bash or Cygwin)
- perl installed in the Bash shell
- a trustworthy text editor such as Notepad++ or Sublime Text
Strictly speaking, Bash and perl are not required. These steps can be performed in a sophisticated text editor. However, Bash and perl make the process go very quickly. For example, on a file of about 50K lines containing about 4,500 articles with line breaks, Sublime Text took several minutes to perform step 2 below, while perl took under 1 second.
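Before starting, it may help to confirm that the command-line tools are actually on the PATH of the shell you plan to use. The following is a minimal sketch; the function name missing_tools is hypothetical, and it relies only on the standard `command -v` shell built-in.

```shell
# Print the name of every requested command that is not found on PATH.
# Hypothetical helper for illustration.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools perl grep)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "perl and grep are available"
fi
```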
- Get Article IDs and Contents:
SELECT Id, Article__c FROM Knowledge_Article_Type__kav
Download the result of this query as a bulk CSV using https://workbench.developerforce.com/. Rename this CSV file to Article_IDs_and_Contents.csv.
- Ideally, each line in the CSV corresponds to one article. However, articles likely have embedded hard returns in them, which makes it difficult to search for a string in an article and correlate it to the appropriate ID. To solve this, delete all the hard returns (\n characters) and then re-insert one in front of each article ID (detectable by the string "ka1). An easy way to do this using the Bash shell is:
# Remove every line break so the whole file becomes a single line
perl -pe "chomp;" Article_IDs_and_Contents.csv > tmp.csv
# Start a new line in front of each article ID
perl -pe 's/"ka1/\n"ka1/sg;' tmp.csv > Article_IDs_and_Contents_one-per-line.csv
rm tmp.csv
- Get a list of articles that contain images. Using a Bash shell run the following command:
grep "<img" Article_IDs_and_Contents_one-per-line.csv > IDs_and_Contents_with_IMGs.csv
- Get a list of articles with images that were pasted into the rich text editor
grep 'src=""http://c\.na33' IDs_and_Contents_with_IMGs.csv > IDs_and_Contents_with_rtaImages.csv
- Get a list of articles with images that were either publicly available or were not curated to add in missing images
perl -ne 'if($_=~/src=""(?!http:\/\/c\.na33)/){print;}' IDs_and_Contents_with_IMGs.csv > IDs_and_Contents_with_non-rtaImages.csv
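One consequence of the patterns in steps 4 and 5 is that an article containing both a pasted image and an image with some other URL matches both, so it appears in both result files. A minimal sketch for listing such articles follows, assuming every line starts with a quoted ID as produced by step 2. The function names ids and mixed_ids and the sample data are hypothetical.

```shell
# Print the unique article IDs in a file whose lines start with "ka1...".
# The ID is the second field when the line is split on double quotes.
ids() {
  cut -d'"' -f2 "$1" | sort -u
}

# Print IDs that occur in both files (hypothetical helper).
mixed_ids() {
  a=$(mktemp)
  b=$(mktemp)
  ids "$1" > "$a"
  ids "$2" > "$b"
  comm -12 "$a" "$b"   # lines common to both sorted lists
  rm -f "$a" "$b"
}

# Made-up stand-ins for the step 4 and step 5 result files.
rta=$(mktemp)
non_rta=$(mktemp)
printf '%s\n' \
  '"ka1000000000001AAA","<img src=""http://c.na33.example/1.png"">"' \
  '"ka1000000000002AAA","<img src=""http://c.na33.example/2.png""><img src=""http://example.com/3.png"">"' \
  > "$rta"
printf '%s\n' \
  '"ka1000000000002AAA","<img src=""http://c.na33.example/2.png""><img src=""http://example.com/3.png"">"' \
  '"ka1000000000003AAA","<img src=""http://example.com/4.png"">"' \
  > "$non_rta"

# Prints ka1000000000002AAA, the one article with both kinds of image.
mixed_ids "$rta" "$non_rta"

rm -f "$rta" "$non_rta"
```

On the real data the call would be mixed_ids IDs_and_Contents_with_rtaImages.csv IDs_and_Contents_with_non-rtaImages.csv.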
For any of the CSV files produced by steps 2, 3, 4, and 5, if you want only a list of IDs without the content, you can open the file in Excel and delete the column containing the content. Alternatively, you can use the following perl command from the Bash shell:
perl -ne 'if(/^"(ka1[^"]+)",.+$/){print "$1\n";}' input.csv > output.txt
where input.csv is the CSV from step 2, 3, 4, or 5, and output.txt will be a file with one ID per line and no quotes.
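For environments without perl, a roughly equivalent extraction can be sketched with sed. This is an alternative, not the command given above: the function name extract_ids and the sample rows are hypothetical, and the pattern is slightly looser than the perl one because it uses * where the perl version uses +.

```shell
# Print the ID from each line that starts with a quoted "ka1..." ID.
# Reads the named files, or standard input when no file is given.
# Hypothetical helper for illustration.
extract_ids() {
  sed -n 's/^"\(ka1[^"]*\)",.*$/\1/p' "$@"
}

# Made-up sample input, including the CSV header line, which is skipped
# because it does not start with "ka1.
printf '%s\n' \
  '"Id","Article__c"' \
  '"ka1000000000001AAA","<p>First</p>"' \
  '"ka1000000000002AAA","<img src=""http://example.com/a.png"">"' \
  | extract_ids
```

On the real data the call would be extract_ids input.csv > output.txt.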