Counting occurrences of strings within strings

Somebody asked how to count the number of occurrences of a string within a string. For example, given the following data, I want to generate new variables countSS, countSM, and countSG that contain the number of occurrences of “SS”, “SM”, and “SG” in the variable awards.

input id str40 awards
1    "SS; SS; SM; SG"
2    "SM; SG"
3    "SG; SG; SG; SS"
4    "SS; SS; SG; SG; SS; SM; SG"
end

Here is one solution using the macro extended function -subinstr- (-help extended_fcn-).

local tocount SS SM SG
foreach t of local tocount {
    gen count`t' = 0
    local N = _N
    * loop over observations, since the extended function works on macros
    forvalues i = 1/`N' {
        local a = awards[`i']
        * replace `t' with itself: the text is unchanged, but count() records the number of matches
        local c : subinstr local a "`t'" "`t'", all count(local c2)
        replace count`t' = `c2' in `i'
    }
}

*Thanks to Jacob Reynolds for the question. That said, Statalist is the best place to ask for advice on Stata :). See “Stuck? Hello Statalist”.

6 Responses

  1. Nick & Mitch,
    That last comment about comparing lengths was the best ticket. I was able to count the awards like I needed by generating as many counting variables as req’d (g pa_XX); total of 14.

    I wish I could have gotten the more “eloquent” code above to work, but the comparison line is more my speed in thesis work…maybe when I come back for a PhD :)

    Thank you for your time and attention to this guys!


    • I always like a simpler solution. Not knowing any better, I had come up with a complex one. ‘Eloquence’, I think, is not about complexity but simplicity. Nick’s solution is an example. :)

  2. There are also two [sic] -egen- functions for this within -egenmore- from SSC. Neither of them uses the trick above. I’d prefer to believe that the reason for that was that -subinstr()- wasn’t available when the functions were written, both about ten years ago, but I can’t rule out without checking that the authors (one of them me) just overlooked this simpler way to do it.

  3. The number of occurrences can be got from a comparison of lengths before and after blanking out.

    gen noccur_SS = (length(awards) - length(subinstr(awards, "SS", "", .))) / length("SS")

    In this case we know that the length of “SS” is 2. I wrote it out like this to lead up to the more general rule (mixing now Stata and pseudocode)

    (length(original) - length(original_with_substr_blanked)) / length(substr)

    Thus you don’t need a loop over observations. I think you do need to do this separately for each substring.
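
    The general rule can be applied to each substring in turn with a short loop over the substrings rather than over observations. The following is only a sketch, assuming the example data from the post are in memory; the noccur_ prefix is an illustrative naming choice, and the final line simply checks the result against the first example row.

    ```stata
    * Sketch: apply the length-comparison rule once per substring.
    * The noccur_ prefix is illustrative; any unused variable names would do.
    foreach t in SS SM SG {
        gen noccur_`t' = (length(awards) - length(subinstr(awards, "`t'", "", .))) / length("`t'")
    }

    * Check against the first example row, "SS; SS; SM; SG"
    assert noccur_SS == 2 & noccur_SM == 1 & noccur_SG == 1 in 1
    ```

    Note that this counts non-overlapping matches, which is all that is needed here because the award codes are separated by "; ".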
