Import data from Excel sheets


How do we import data from all sheets in a number of Excel files? Each Excel file has a different number of sheets with names of no discernible pattern, but (thankfully) each sheet has the same structure: the first observation is in the same row and the columns correspond to each other. An example is the set of 17 Excel files of census data of barangays (villages) that was provided to me. Each Excel file corresponds to one region and within each file are sheets corresponding to the province in the region.  How do we consolidate all sheets in all files into one data file?



We wish the answer would be, “With a wave of a wand…” Unfortunately, we are muggles. One way is to import every sheet of every file manually until eternity. Doing this would create a new problem: headache. How do we do it like magic?

Here are the spells we need: fs, foreach, and import excel.

fs, written by Nick Cox, lists the names of files in a specified folder and saved the list in a local macro r(files) and is downloadable via the SSC Archive:

ssc install fs

First, change the working directory to where the Excel files are (see help cd). The following block of codes reads every sheet in each of the of the Excel file in the folder and saves these into a single data file with filename newdata.dta.
fs *.xlsx  
foreach f in `r(files)' {
    import excel using `f', describe
    forvalues s = 1/`r(N_worksheet)' {
        import excel using `f',         ///
            sheet(`r(worksheet_`s')')   ///
            cellrange(`r(range_`s')')   ///
            firstrow clear
        cap append newdata
        save newdata, replace
        }   
    }    

import excel with the option describe does not actually load the data from Excel but only lists all the sheets and the cell ranges. Below is an example of the output:
. import excel using ARMM.xlsx, describe

          Sheet | Range
  --------------+--------------
        BASILAN | A1:AR237
  LANAO DEL SUR | A1:AR1214
    MAGUINDANAO | A1:AR559
           SULU | A1:AR444
      TAWI-TAWI | A1:AR229

import excel using filename, describe saves the number of sheets and the list of sheets and their corresponding ranges in macros. Typing return list gives:
. return list

scalars:
        r(N_worksheet) =  5

macros:
        r(worksheet_1) : "BASILAN"
            r(range_1) : "A1:AR237"
        r(worksheet_2) : "LANAO DEL SUR"
            r(range_2) : "A1:AR1214"
        r(worksheet_3) : "MAGUINDANAO"
            r(range_3) : "A1:AR559"
        r(worksheet_4) : "SULU"
            r(range_4) : "A1:AR444"
        r(worksheet_5) : "TAWI-TAWI"
            r(range_5) : "A1:AR229"


These saved results were used to read the sheets in each file. Note that we are assuming here that the first row of each of the sheets contains the variable names, thus the option firstrow with the second import excel command that loads the Excel sheet into memory. If this is not the case, some cleaning may be required (e.g., dropping unnecessary rows and columns) before the data is appended. How complex the cleaning maybe depends on how unstructured the data is in the sheets.

 

13 Responses

  1. This works in Stata 14 as it stores the sheet names (which are lost once a new set of results overwrites them – when importing the first sheet):

    foreach file in `r(files)’ {
    import excel using “`file'”, describe
    local S = `r(N_worksheet)’
    forvalues s = 1/`S’ {
    local sheet`s’= “`r(worksheet_`s’)'”
    }
    forvalues s = 1/`S’ {
    import excel using “`file'”, sheet(“`sheet`s””) firstrow clear allstring
    capture append using results
    save results, replace
    }
    }
    }

  2. I am using STATA 14.0. The code doesn’t work. It only imports the first worksheet. When s =2, `r(worksheet_`s’)’ doesn’t refer to the second worksheet. It returns to “”. Isn’t “cap append newdata” supposed to be “cap append using newdata”?

    • If I use “cap append newdata”, `r(worksheet_`s’)’ refers to the right worksheet. But the final dataset still only has data on the first worksheet because “cap append newdata” doesn’t work.

      • “cap append using”..

      • “cap append using” messes up the stored results after “import excel .., describe”. The loop doesn’t work properly. When s =2, `r(worksheet_`s’)’ doesn’t refer to the second worksheet. It returns to “”. I can save each worksheet in a separate dta file and then append them in a separate loop. Just wondering if there is a more efficient way.

  3. Hello, I get a r(198) invalid ‘(‘
    Here is the code used:

    clear
    set more off

    cd “C:\Users\Documents\files”

    fs *.xlsx
    foreach f in `r(files)’ {
    import excel using `f’, describe
    forvalues s = 1/`r(N_worksheet)’ {
    import excel using `f’, ///
    sheet(`r(worksheet_`s’)’) ///
    cellrange(`r(range_`s’)’) ///
    firstrow clear
    cap append newdata
    save newdata, replace
    }
    }

  4. Hi, is there a way to add a variable to track the source of the data (the file name)? The only way I can see is to save all the individual datasets as .dta first, add the source variable, and then append. Do you have another idea? Thanks for å great post! =)

  5. I kept getting an error – invalid ‘of’ … r(198) when I ran this syntax. Any help would be appreciated.

  6. Hi there!

    This command would be extremely helpful, but stata on my computer is saying that “the server refused to send file”. http://fmwww.bc.edu/repec/bocode/f/

    just wondering is this still available for download? I am using stata13.1MP.

    Thank you!

Leave a Reply