Stata's string functions accept Regular Expressions (RegEx), which makes it possible to search for and match patterns in string data.
Finding whether a String matches a Pattern
One can use the ustrregexm function. You can remember the name as an abbreviation of “Unicode String Regular Expression Match” = UStrRegExM.
count if ustrregexm(guardian_rel, "[Ss]\/[O0o]" )
The function above returns 1 (true) if any of the following sequences is present ANYWHERE in the record: S/O, s/o, s/O, s/0, etc.
Let us break down the RegEx pattern. It has three distinct sections: [Ss]\/[O0o]
- [Ss] = matches ‘S’ or ‘s’. Note that all the valid characters are placed inside square brackets ‘[ ]’.
- \/ = matches only the slash ‘/’. Note the backslash ‘\’ placed before the slash ‘/’. This is called escaping. If you need to search for a special character, it needs to be escaped with a backslash ‘\’.
- [O0o] = matches ‘O’, ‘0’ or ‘o’. Again, all the valid characters are placed inside square brackets ‘[ ]’.
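As a minimal sketch of how this pattern might be applied to a dataset, the block below builds a small made-up dataset and flags the matching records. Only the variable name guardian_rel comes from the example above; the values and the variable name is_son are illustrative assumptions.

```stata
* Toy data: the values below are invented for illustration
clear
input str20 guardian_rel
"S/O Ram"
"s/0 Shyam"
"D/O Sita"
"W/O Mohan"
end

* is_son is a hypothetical variable name: 1 if the pattern matches, else 0
generate byte is_son = ustrregexm(guardian_rel, "[Ss]\/[O0o]")
list guardian_rel is_son

* Counts the flagged records (only the first two rows of this toy data match)
count if is_son == 1
```

Storing the result in a variable rather than only counting makes it easy to inspect which records matched before acting on them.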
Checking whether a text sequence is present anywhere inside a string
di ustrregexm("patient is S/O mr Ram", "are") // 0
di ustrregexm("patient is S/O mr Ram", "is" ) // 1
di ustrregexm("patient is S/O mr Ram", "Is") // 0, note the case
Using square brackets to handle upper and lower case
// using square brackets to list all possible allowed values at that position
di ustrregexm("patient is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside[]
di ustrregexm("patient Is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside[]
di ustrregexm("patient IS S/O mr Ram", "[Ii]s") // 0
di ustrregexm("patient IS S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient Is S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient iS S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient is S/O mr Ram", "[Ii][Ss]") // 1, two groups
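Listing both cases for every letter gets tedious for longer words. Stata's documentation describes an optional third argument to ustrregexm that, when nonzero, requests a case-insensitive match. A short sketch, reusing the same example string:

```stata
* A nonzero third argument asks ustrregexm() to ignore case
di ustrregexm("patient IS S/O mr Ram", "is", 1) // 1
di ustrregexm("patient iS S/O mr Ram", "is", 1) // 1
di ustrregexm("patient IS S/O mr Ram", "is", 0) // 0, same as leaving the argument out
```

Square brackets remain useful when only some positions should ignore case, or when a position allows different characters altogether (such as O, 0 and o).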
Checking whether a text sequence is present at the beginning of the string: using ^
// place the caret (hat) symbol ^ at the very start of the regular expression
di ustrregexm("patient IS S/O mr Ram", "^[Ii][Ss]") // 0
di ustrregexm("patient IS S/O mr Ram", "^p") // 1
di ustrregexm("patient IS S/O mr Ram", "^P") // 0
di ustrregexm("patient IS S/O mr Ram", "^pat") // 1
di ustrregexm("Patient IS S/O mr Ram", "^pat") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]") // 1
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]Atient") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp][Aa][Tt][Ii]ent") // 1
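To see the difference the ^ makes on a variable rather than a single string, here is a small sketch on made-up data. The variable names fullname, has_mr and starts_mr, and the values, are illustrative assumptions.

```stata
* Toy data: the values below are invented for illustration
clear
input str20 fullname
"Mr Ram Kumar"
"mr Shyam Lal"
"Kumar Mr Ram"
end

* Without ^ the pattern may match anywhere in the string
generate byte has_mr = ustrregexm(fullname, "[Mm]r")

* With ^ the pattern must match at the very start of the string
generate byte starts_mr = ustrregexm(fullname, "^[Mm]r")

* The third row has has_mr = 1 but starts_mr = 0
list
```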
Searching for characters like * % $ - / \ ! etc.: using the backslash ‘\’ escape character
// using backslash to escape special characters
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]" ) // 1
di ustrregexm("patient is S\O mr Ram", "[Ss]\[Oo]" ) // 0 <= the '\' escaped the '[', so the pattern looked for the literal text [Oo]
di ustrregexm("patient is S\O mr Ram", "[Ss]\\[Oo]" ) // 1, to search for a literal '\' we had to escape it with another '\'
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]" ) // 1, works because '/' is not a special character in a RegEx
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1, escaping is harmless here; always escape a special character you need to search for
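The slash is a forgiving example, because it matches with or without the backslash. The effect of escaping is easier to see with characters that do have a special meaning in a RegEx, such as the period and the asterisk. A minimal sketch with made-up strings:

```stata
* A period is special in a RegEx: unescaped, it matches any single character
di ustrregexm("fileXtxt", "file.txt")  // 1, the period matched the X
di ustrregexm("fileXtxt", "file\.txt") // 0, the escaped period needs a literal '.'
di ustrregexm("file.txt", "file\.txt") // 1

* An asterisk is special too: unescaped, it means "zero or more of the previous item"
di ustrregexm("ab",  "a*b")  // 1, zero or more 'a' followed by 'b'
di ustrregexm("ab",  "a\*b") // 0, the escaped asterisk needs a literal '*'
di ustrregexm("a*b", "a\*b") // 1
```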
One Character, then anything, then another character: use Period ‘.’
// If you do not care what comes between two characters
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1
di ustrregexm("patient is S-O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\[/-\][Oo]" ) // 0 - does not work
di ustrregexm("patient is S\O mr Ram", "[Ss][\/\-\\][Oo]" ) // 1 - need to escape each character
di ustrregexm("patient is S\O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period . , means anything goes here
di ustrregexm("patient is S-O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is S/O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is S*O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - There was nothing there so match failed, Oh No
Check if Anything or Nothing is in a sequence: using *
di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - There was nothing between S and O, so the match failed, Oh No
// putting the period inside square brackets does not help: [.] matches only a literal period
di ustrregexm("patient is SO mr Ram", "[Ss][.]+[Oo]" ) // 0, + means match one or more of the previous item, but inside [ ] the period is a literal '.', not "anything"
di ustrregexm("patient is SO mr Ram", "[Ss][.]*[Oo]" ) // 1, * means match zero or more of the previous item; zero periods are allowed, so SO matches
di ustrregexm("patient is S/O mr Ram", "[Ss][.]*[Oo]" ) // 0 --- Aah this is frustrating: [.] accepts only a literal period, so the '/' breaks the match
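Inside square brackets the period loses its special meaning and matches only a literal '.', which explains the last result above. The sketch below repeats the checks with the period outside the brackets. It also uses ?, which the examples above do not cover: in a RegEx it means "zero or one of the previous item".

```stata
* Inside [ ] the period is literal, so [.]* only accepts a run of periods
di ustrregexm("patient is S.O mr Ram", "[Ss][.]*[Oo]") // 1, a literal period sits between S and O

* Outside [ ] the period means "any character"
di ustrregexm("patient is SO mr Ram",  "[Ss].*[Oo]")   // 1, zero characters in between
di ustrregexm("patient is S/O mr Ram", "[Ss].*[Oo]")   // 1, any characters in between

* ? keeps the gap to at most one character
di ustrregexm("patient is SO mr Ram",   "[Ss].?[Oo]")  // 1
di ustrregexm("patient is S/O mr Ram",  "[Ss].?[Oo]")  // 1
di ustrregexm("patient is S--O mr Ram", "[Ss].?[Oo]")  // 0, two characters in between
```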
Mixing it up using parentheses ‘ ( ) ‘
di ustrregexm("patient is S-O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S\O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is SoO mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S.O mr Ram", "([Ss])(.)*([Oo])" ) // 1
Now we have three sub-expressions within the RegEx: ([Ss]) (.)* ([Oo])
- ([Ss]) matches S or s in the first position
- (.)* matches any character in the second position, because the period ‘ . ’ stands for any character. The * placed after the parentheses means match zero or more of the previous expression. Together they match any run of characters, or nothing at all
- ([Oo]) matches O or o after that run, so it can be the second character (when nothing comes in between), the third, or any later one
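Because (.)* accepts any number of characters, this pattern is quite permissive: any 's' followed somewhere later by an 'o' will match. The sketch below uses ustrregexs(0), which returns the text matched by the preceding ustrregexm call, to show what was actually matched. The data, the variable names txt, loose and tight, and the tighter pattern are illustrative assumptions, not part of the examples above.

```stata
* Toy data: the values below are invented for illustration
clear
input str30 txt
"patient is S-O mr Ram"
"sister of Ram"
end

* ustrregexs(0) holds the whole match from the ustrregexm() call in the if condition
generate str30 loose = ustrregexs(0) if ustrregexm(txt, "([Ss])(.)*([Oo])")

* A tighter, hypothetical alternative: at most one '/' or '-' between the letters
generate byte tight = ustrregexm(txt, "[Ss][\/\-]?[Oo]")

list
```

In this toy data the second record matches the loose pattern even though it contains no S/O, while the tighter pattern rejects it. How tight the pattern should be depends on how the real data were entered.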
Reference:
https://www.stata.com/support/faqs/data-management/regular-expressions/