Mata code for data management in two dimensional datasets
Datasets like DREAM has two dimensions, rows are individuals and columns are mainly many measurements over time.
Handling datasets with two dimensions are described in (2014) Jakob Hjort.
What is needed is to retrieve information regarding one or more events in a time window.
In essence, there is two challenges:
-
Getting a time window, eg. 52 weeks (a year) from a time point It is important for speed of calculations to limit the time window to what is needed
-
Getting event information in that window
Using Mata is much faster than using Stata.
The DREAM example data is used. It is described on this website.
use dream.dta, clear
Most of the columns represent time points (one per week) for the participants.
A random start point (diagnosed) is added for later.
mata: rseed(123)
mata: nhb_sae_addvars("diagnosed", st_varname(runiformint(1, st_nobs(), 4, st_nvar()))')
It is very important that time point variables are ordered. Otherwise, the code examples below will lead to wrong results.
order *, alphabetic
order id male age diagnosed
Getting a time window
Time window with fixed starting/ending point
Assume that the starting point is week 7 in 2005 and that a time window of one year (52 weeks) is wanted.
Using the naming convention of the data first week is named y2005w07.
To get the index of variable y2005w07:
mata: start = st_varindex("y2005w07")
mata: time_window = st_data(., start..(start+51))
By subtracting 51 instead of adding it is easy to look at the year ending at y2005w07.
The first 10 rows for the first quarter (13 weeks) can be looked upon using:
mata: time_window[1..10,1..13]
1 2 3 4 5 6 7 8 9 10 11 12 13
+------------------------------------------------------------------+
1 | 2 2 2 2 2 2 2 2 2 2 2 2 2 |
2 | 0 0 0 0 0 0 1 1 0 0 0 0 0 |
3 | 0 0 0 0 0 1 0 0 0 0 1 0 1 |
4 | 0 0 0 0 0 0 0 1 0 0 0 1 0 |
5 | 0 0 0 0 0 0 0 0 0 1 0 0 0 |
6 | 0 1 0 0 0 0 0 1 0 0 0 0 0 |
7 | 1 0 0 0 0 0 0 0 0 0 0 0 0 |
8 | 0 0 0 0 1 0 0 0 0 0 0 0 0 |
9 | 0 0 0 0 0 0 1 1 0 0 0 0 0 |
10 | 0 1 0 0 0 0 0 0 0 0 0 0 0 |
+------------------------------------------------------------------+
Time window with a floating starting/ending point to the end of time
To find the starting point requires a little bit more code, since each participant has an individual starting point.
mata: tw_start = st_varindex(st_sdata(., "diagnosed")')'
To speed up the code the space required to save data needs to be specified. So, the maximal time window width (max) needs to be found.
mata: max = colmax( (st_nvar() :- tw_start) ) + 1
Hence, a matrix of the required size can be specified. The default value is set to a missing value (.m). Most participants will have a smaller time window than max.
Getting matrix column size (assuming last variable is the last time observed).
mata: time_window = J(st_nobs(), max, .m)
The end of the time window is found by:
mata: tw_end = J(st_nobs(), 1, st_nvar())
With the code below all time windows are aligned to the same starting point in matrix.
mata: for(r=1;r<=st_nobs();r++) time_window[r,1..(tw_end[r] - tw_start[r] + 1)] = st_data(r, tw_start[r]..tw_end[r])
Time window with a floating starting/ending point and fixed time interval
Finding the individual starting point and calculating the end point for the time window.
mata: tw_start = st_varindex(st_sdata(., "diagnosed")')'
mata: tw_end = rowmin( ( tw_start :+ 51, J(st_nobs(), 1, st_nvar())) )
To speed up calculations a time window matrix of width 52 is defined.
mata: time_window = J(st_nobs(), 52, .m)
Individual time windows are aligned below, possibly having missings (.m) at the end of the time windows.
mata: for(r=1;r<=st_nobs();r++) time_window[r,1..(tw_end[r] - tw_start[r] + 1)] = st_data(r, tw_start[r]..tw_end[r])
Getting event information in that window
Finding rows with a specific value at the last column
Below the rows not having a full (52) time window is found. Such a row will have the value .m in the last column.
mata: has_m = select(st_data(.,"id"), time_window[., 52] :== .m)
The first 5 are (note that has_m is a column vector so only one index is needed):
mata: has_m[1..5]
1
+-------+
1 | 39 |
2 | 61 |
3 | 85 |
4 | 100 |
5 | 101 |
+-------+
Count the number of a value for each participant
The number of missings .m can be found by
mata: countm = rowsum(time_window :== .m)
The first 5 counts are:
mata: countm[1..5]
1
+-----+
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
+-----+
The non-zero counts with ids
mata: count_nz = (has_m, select((st_data(.,"id"), countm), countm :> 0))
The first 15 counts with ids
mata: count_nz[1..15,.]
1 2 3
+-------------------+
1 | 39 39 33 |
2 | 61 61 15 |
3 | 85 85 25 |
4 | 100 100 19 |
5 | 101 101 38 |
6 | 132 132 9 |
7 | 152 152 42 |
8 | 197 197 8 |
9 | 201 201 9 |
10 | 213 213 49 |
11 | 216 216 23 |
12 | 249 249 20 |
13 | 280 280 4 |
14 | 283 283 42 |
15 | 326 326 41 |
+-------------------+
Looking at a single participant
Participant 213 has 49 missing values. The time window for 213 is:
mata time_window[213, .]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 3 3 3 .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m .m
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
50 51 52
----------------+
1 .m .m .m |
----------------+
Below it is seen that participant 213 only is followed for three weeks before the dataset ends.
mata: st_varname(tw_start[213]), st_varname(tw_end[213]), st_varname(st_nvar())
1 2 3
+----------------------------------+
1 | y2017w50 y2017w52 y2017w52 |
+----------------------------------+
The last three weeks for participant 213 matches what is found in the time window.
mata: st_data(213, tw_start[213]..st_nvar())
Finding unique values
First, the time window values are made into one column and the unique values are found.
mata: dta = colshape(time_window,1)
mata: uniq = uniqrows(dta)
To see the labels
label list state
state:
0 Working
1 On benefits
2 Early retirement
3 Died
4 Retired
Do count table for unique values
mata: uniq, rowsum(J(rows(uniq),1, dta') :== uniq)
1 2
+-------------------+
1 | 0 416875 |
2 | 1 30311 |
3 | 2 6783 |
4 | 3 31640 |
5 | 4 25273 |
6 | .m 9118 |
+-------------------+
Finding first event
First number of rows and columns for the time window is found.
mata: R = rows(time_window); C = cols(time_window)
To find first event simply write
mata: early_retirement_at_1 = rowmin( J(R,1,1..C) :/ (time_window :== 2) )
The matrix rowmin( J(R,1,1..C) specifies the column indeces for each participant in the time window.
The matrix is one or zero depending on whether the time window value is 2 or not.
By dividing the two martrices cellwise the result is either a column index if the time window value is 2 or else missing due to division with zero.
In Stata the missing values are numerically plus infinity. By taking the rowmin the lowest column index different from missing is returned. In other words, for each participant the first column index where the time window value is 2.
To verify
- The participants with events are found
- Looking at selected participants with an event
The rows/participant having an event are found.
mata: has_1_ev = select(((1::R), early_retirement_at_1), early_retirement_at_1 :< .)
The first three rows with an event are
mata: has_1_ev[1..3,.]
1 2
+-------------+
1 | 1 35 |
2 | 53 43 |
3 | 170 1 |
+-------------+
The id of the participants for first three rows with an event are
mata: st_data(has_1_ev[1..3,1], "id")
1
+-------+
1 | 1 |
2 | 53 |
3 | 170 |
+-------+
The participant id value is the rownumber in this dataset.
Looking at the time window rows for the first 45 weeks, it is seen that the first event appears as calculated.
mata: time_window[has_1_ev[1..3,1],1..45]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 |
2 | 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 2 2 2 |
3 | 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Last update: 2022-04-22, Stata version 17