
[Image: Sage Software logo with “oof!” overlaid]

Shell script that converts Sage PRO exported text (.out files) to CSV text format

I have had this tool lying around since 2014. I wrote it once for a business that needed to convert the plain-text .OUT files from Sage PRO into CSV format. It isn’t a super smart script; it only converts the formatting so that the file can be opened in a program like LibreOffice Calc or Microsoft Excel. One .OUT file can have information for lots of accounts, so it doesn’t even bother trying to split up the accounts, though it’s easy to do by hand if desired. I don’t know if this will work with newer versions of PRO or with reports different from the kind I wrote it against. It is offered as-is, no warranty, use at your own risk, don’t blame me if the output gets you a call from the IRS.

If this is useful to you, please leave a comment and let me know! The company I did this for ended up not even using the product of my hard work, so just knowing that anyone at all found this useful will make me very happy.

To use this, you’ll need to give it the name of the .out file you want it to process. Also, this was written when my shell scripting was still a little unrefined…please don’t judge too harshly 🙂

Click here to download the Sage PRO to CSV shell script.
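
For example, if the script is saved as “sagepro2csv.sh” (the file name doesn’t matter), a typical run redirects its output into a CSV file:

sh sagepro2csv.sh 01-2014.out > 01-2014.csv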

#!/bin/sh

# Convert Sage PRO exported text to CSV text format
# Copyright (C) 2014-2020 by Jody Bruchon <jody@jodybruchon.com>
# Distributed under The MIT License
# Distributed AS-IS with ABSOLUTELY NO WARRANTY. Use at your own risk!

# Program synopsis:
# Converts a Sage PRO ".out" text file to CSV for use as a spreadsheet

# OUT files are generally fixed-width plain text with a variety of
# header and footer information.

# The general process of converting them to CSV text is as follows:

# - Read each line in the file
# - Skip lines that aren't part of the financial data
# - Skip irrelevant page/column headers and any empty lines
# - Read the account number/name information header
# - Consume columns of transaction data in order; convert to CSV data
# - Ignore account/grand totals and beginning balance fields
# - Loop through all the lines until input data is exhausted

# This script has only been tested on a specific version of Sage PRO
# and with one year of financial data output from one company. It may
# not work properly on your exported data, in which case you'll need
# to fix it yourself.

# ALWAYS ***ALWAYS*** CHECK OUTPUT FILES FOR CORRECTNESS. This script
# will throw an error if it encounters unexpected data; however, this
# does not always happen if the data appears to conform to expected
# input data ordering and formatting. For example, financial data is
# assumed to be fixed-width columns and the data is not checked for
# correct type i.e. a valid float, integer, or string.

echo "A tool to convert Sage PRO exported text to CSV text format"
echo "Copyright (C) 2014-2020 by Jody Bruchon <jody@jodybruchon.com>"
echo "Distributed under The MIT License"
echo -e "Distributed AS-IS with ABSOLUTELY NO WARRANTY. Use at your own risk.\n"

if [ ! -e "$1" ]
    then echo "Specify a file to convert."
    echo -e "\nUsage: $0 01-2014.out > 01-2014.csv\n\n"
    exit 1
fi

SKIP=0    # Number of lines to skip
LN=0    # Current processing line number
TM=0    # Transaction output mode

HEADERS='"Tran Date","Source","Session","Transaction Description","Batch","Tran No","Debit Amt.","Credit Amt.","Ending Bal."'

# Column widths
C1=8    # Tran Date
C2=2    # Source (initials)
C3=9    # Session
C4=23    # Transaction Description
C5=9    # Batch
C6=6    # Tran No
C7=26    # Debit Amt.
C8=20    # Credit Amt.
C9=18    # Ending Bal.

CMAX=9    # Number of columns

# Pad a CSV row: print ($CMAX - $1) empty fields (commas), then end the line
pad_col () {
    X=$(expr $CMAX - $1)
    while [ $X -gt 0 ]
        do echo -n ","
        X=$((X - 1))
    done
    echo
}

consume_col () {
    # Read next item in line
    CNT=$(eval echo \$C$Z)    # width of column $Z (indirect: expands $C1..$C9)
    #echo CNT $CNT
    # Cut the next $CNT characters out as one double-quoted CSV field
    I="$(echo -E "$T" | sed "s/\\(.\{$CNT\}\\).*/\"\1\",/")"
    # Remove the consumed column (and separator spaces) from the front of the line
    T="$(echo -E "$T" | sed "s/^.\{$CNT\}    //")"
    # Strip extraneous spaces in fields
    if [ $Z != 4 ]
        then I="$(echo -E $I | sed 's/^  *//;s/  *$//')"
    fi
    echo -n "$I"
}

while read -r LINE
    do
    # Count line numbers in case we need to report an error
    LN=$((LN + 1))

    # Handle line skips as needed
    if [ $SKIP -gt 0 ]
        then SKIP=$((SKIP - 1))
        continue
    fi

    # Strip common page headers (depaginate)
    if echo "$LINE" | grep -q "^Page:"
        then SKIP=7
        continue
    fi

    # Strip standard column headers
    if echo "$LINE" | grep -q "^Tran Date"; then continue; fi
    if echo "$LINE" | grep -q "^Account Number"; then continue; fi

    # Don't process totally empty lines
    if [ -z "$LINE" ]; then continue; fi

    # Pull account number and name
    if echo "$LINE" | grep -q '^[0-9]\{5\}'
        then
        ACCT="$(echo -E "$LINE" | cut -d\  -f1)"
        ACCTNAME="$(echo -E "$LINE" | sed 's/   */ /g;s/^  *//' | cut -d\  -f2-)"
        pad_col 0
        echo -n "$ACCT,\"$ACCTNAME\""; pad_col 2
        continue
    fi

    # Sometimes totals end up on the previous line
    if echo -E "$LINE" | grep -q '^[0-9][0-9][^/]'
        then LL="$LINE"
        continue
    fi
    if echo -E "$LINE" | grep -q '^\$'
        then LL="$LINE"
        continue
    fi
    if [ ! -z "$LL" ]
        then LINE="$LINE $LL"
        unset LL
    fi

    if echo "$LINE" | grep -q "Beginning Balance"
#        then BB="$(echo -E "$LINE" | awk '{print $3}')"
#        echo -n "\"Begin Bal:\",$BB"; pad_col 2
#        pad_col 0
        then
        TM=1; AT=0
        echo "$HEADERS"
        continue
    fi

    if echo "$LINE" | grep -q '^[0-9][0-9]/[0-9][0-9]/[0-9][0-9]'
        then if [ $TM -eq 1 ]
            then
            T="$LINE"
            Z=0
            while [ $Z -lt $CMAX ]
                do
                Z=$((Z + 1))
                consume_col
            done
            echo
            continue
            else echo "error: unexpected transaction" >&2
            exit 1
        fi
    fi

    # Handle account totals line
    if echo "$LINE" | grep -q "^Account Total:"
        then TM=0; AT=1
        continue
    fi

    if echo "$LINE" | grep -q "^Begin. Bal."
        then if [ $AT -eq 1 ]
            then
            echo -n '"Begin Bal",'
            T="$(echo -E "$LINE" | sed 's/Begin[^$]*//;s/\$  */$/g;s/\$/"$/g;s/ Net Change:  */","Net Change/g;s/\$/,"$/g;s/$/"/;s/   *//g;s/^",//')"
            T2="$(echo -E "$T" | cut -d\" -f1-7)"
            T3="$(echo -E "$T" | cut -d\" -f7-)"
            echo $T2,$T3
            continue
            else
            echo "error: unexpected totals line" >&2
            exit 1
        fi
    fi
    
    if echo "$LINE" | grep -q "^Grand Total:"
        then
        pad_col 0; pad_col 0
        echo '"Grand Total"'; pad_col 1
        continue
    fi

    # Output error (unknown line)
    echo "ERROR: Unknown data while processing line $LN" >&2
    echo -E "$LINE" >&2
    exit 1
#    echo -E "$LINE"

done < "$1"

The key to faster shell scripts: know your shell’s features and use them!

I have a cleanup program that I’ve written as a Bash shell script. Over the years, it has morphed from a thing that just deleted a few fixed directories if they existed at all (mostly temporary file directories found on Windows) to a very flexible cleanup tool that can take a set of rules and rewrite and modify them to apply to multiple versions of Windows, along with safeguards that check the rules and auto-rewritten rules to prevent the equivalent of an “rm -rf /*” from happening. It’s incredibly useful for me; when I back up a customer’s PC data, I run the cleaner script first to delete many gigabytes of unnecessary junk and speed up the backup and restore process significantly.

Unfortunately, having the internal rewrite and safety check rules has the side effect of massively slowing the process. I’ve been tolerating the slowness for a long time, but as the rule set increased in size over the past few years the script has taken longer and longer to complete, so I finally decided to find out what was really going on and fix this speed problem.

Profiling shell scripts isn’t quite as easy as profiling C programs; with C, you can just use a tool like Valgrind to find out where all the effort is going, but shell scripts depend on the speed of the shell, the kernel, and the plethora of programs executed by the script, so it’s harder to follow what goes on and find the time sinks. However, I observed that a lot of time was spent in the steps between deleting items; since each rewrite and safety check is done on-the-fly as deletion rules are presented for processing, those were likely candidates. The first thing I wanted to know was how many times the script called an external program to do work; you can easily kill a shell script’s performance with unnecessary external program executions. To gather this info, I used the strace tool:

strace -f -o strace.txt tt_cleaner

This produced a file called “strace.txt” which contains every single system call issued by both the cleaner script and any forked programs. I then looked for the execve() system call and gathered the counts of the programs executed, excluding “execve resumed” events which aren’t actual execve() calls:

grep execve strace.txt | sed 's/.*execve/execve/' | cut -d\" -f2 | grep -v resumed | sort | uniq -c | sort -g

The resulting output consisted of numbers below 100 until the last two lines, and that’s when I realized where the bottleneck might be:

4157 /bin/sed
11227 /usr/bin/grep

That’s a LOT of calls to sed, but the number of calls to grep was almost three times higher, so that’s where I started searching for ways to improve. As I’ve said, the rewrite code takes each deletion rule and rewrites it for other possible interpretations: “Username\Application Data” on Windows XP moved to “Username\AppData\Roaming” on Vista and up, “All Users\Application Data” moved to “C:\ProgramData” in those same versions, and there is a potential mirror of every single rule in “Username\AppData\Local\VirtualStore”. The rewrite code expands the deletion rules to cover every single one of these possible cases. The outer loop of the rewrite engine grabs each rewrite rule in order while the inner loop applies the rewriting to the current rule AND all prior rewrites to ensure no possibilities are missed (VirtualStore is largely to blame for this double-loop architecture). This means that anything done within the inner loop is executed a huge number of times, and the very first command in the inner loop looked like this:

if echo "${RWNAMES[$RWNCNT]}" | grep -qi "${REWRITE0[$RWCNT]}"

This checks to see if the rewrite rule applies to the cleaner rule before doing the rewriting work. It calls grep once for every single iteration of the inner loop. I replaced this line with the following:

if [[ "${RWNAMES[$RWNCNT]}" =~ .*${REWRITE0[$RWCNT]}.* ]]

I also had to add “shopt -s nocasematch” at the top of the shell script to make the comparison case-insensitive. The result was a 6x speed increase: testing on an existing data backup which had already been cleaned (no “work” to do) showed a consistent drop from 131 seconds to 22 seconds! The grep call count dropped massively, too:

97 /usr/bin/grep

Bash can do wildcard and regular expression matching of strings (the =~ comparison operator is a regex match), so any shell script that uses the “echo-grep” combination in a loop stands to benefit greatly from these Bash features. Unfortunately, they are not POSIX shell features, so using them makes a script non-portable, but if the script will never need to run on other shells and the performance boost is significant, why not use them?
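
As a quick standalone sketch (with made-up strings, not code from the cleaner script), the difference looks like this; the first test forks a subshell and a grep process on every iteration, while the second performs the same case-insensitive match entirely inside Bash:

#!/bin/bash
shopt -s nocasematch   # make [[ ]] matching case-insensitive, like grep -i

NAME='Username\AppData\Roaming\Temp'
PATTERN='appdata'

# Slow: spawns a subshell for the pipeline plus a grep process every time
if echo "$NAME" | grep -qi "$PATTERN"; then echo "grep: match"; fi

# Fast: =~ is evaluated by Bash itself, so nothing is forked
if [[ $NAME =~ $PATTERN ]]; then echo "builtin: match"; fi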

The bigger lesson here is that you should take some time to learn about the features offered by your shell if you’re writing advanced shell scripts.

Update: After writing this article, I set out to eliminate the thousands of calls to sed. I was able to change an “echo-sed” combination to a couple of Bash substring substitutions. Try it out:

FOO=${VARIABLE/string_to_replace/replacement}

It accepts $VARIABLES where the strings go, so it’s quite powerful. Best of all, the total runtime dropped to 10.8 seconds for a total speed boost of over 11x!
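
For instance (with a hypothetical rule string), here is an “echo-sed” call and its pure-Bash equivalent; doubling the first slash replaces all occurrences instead of just the first:

RULE='Username/Application Data/Mozilla'

# Before: forks a subshell and a sed process
NEW="$(echo "$RULE" | sed 's|Application Data|AppData/Roaming|')"

# After: the same substitution done entirely inside Bash
NEW="${RULE/Application Data/AppData/Roaming}"

# Double the first slash to replace every occurrence, not just the first
NEW="${RULE//Application Data/AppData/Roaming}"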