Paper 109-25
Merges and Joins
Timothy J Harrington, Trilogy Consulting Corporation
Abstract
This paper discusses methods of joining SASâ data
sets. The different methods and the reasons for
choosing a particular method of joining are contrasted
and compared. Potential problems and limitations
when joining data sets are also discussed.
The need to combine data sets constantly arises
during software development, as does the need to
validate and test new code. There are two basic types
of join, vertical, and horizontal. Vertical joining is
appending one data set to another, whereas
horizontal joining is using one or more key variables
to combine different observations.
Vertical Joining
A good example of vertical joining is adding to a data
set in time sequence, for example, adding February’s
sales data to January’s sales data to give a year-to-
date data set. Providing both data sets have the
same variables and all the variables have the same
attributes such as data type, length, and label, there
is no problem. However, once the data sets are
combined at least one of the variables should, in
practice, be able to identify which of the source data
sets any given observation originated from. In this
sales data example a date or month name should be
present to indicate whether a given observation came
from January’s data or February’s data. Another issue
may be the sort order. In this example there is no
need to sort the resulting data set if the source data
sets are in date order, but if, say, the data sets were
sorted by product code, or sales representative the
resulting data set would need to be resorted by date.
Most importantly, when vertically joining data sets, is
the issue vertical compatibility. This is whether the
corresponding variables in each data set have the
same attributes, and if there are any variables which
are present in one data set but not in the other.
Using PROC DATASETS and APPEND
One method of vertical joining is to use the utility
procedure PROC DATASETS with the APPEND
statement. More than two data sets many be joined in
this way, but all of the data sets should be vertically
compatible. However, vertical incompatibility may be
overridden by using the FORCE option. When this
option is used, variables which are absent in one data
set are created with the same attributes in the
resulting data set, but the values are missing in each
observation which originated from the data set
without those variables. Where variable lengths are
different the shorter length values are right padded
with spaces to equal the longer length. Where data
types are different the numeric type is made
character. If labels are different the label from the
latest data set is used. If the FORCE option is not
specified and any of the data sets are not completely
vertically compatible applicable NOTES and
WARNINGS are written to the log file. If a variable is
present in the DATA data set but is absent in the
BASE data set the appending is not done. The
example below appends two data sets DATA01 and
DATA02 to the data set DATA99. DATA99 is the
‘Base’ data set, which, if it does not exist is created
and becomes the compound of DATA01 and DATA02
(A NOTE of this is written to the Log file). The
NOLIST option in PROC DATASETS prevents it from
running interactively.
PROC DATASETS NOLIST;
APPEND BASE= DATA99 DATA= DATA01
APPEND BASE= DATA99 DATA= DATA02;
RUN;
If observation order is important after appending, a
PROC SORT should be performed on the compound
data set (DATA99 in this example) by the appropriate
BY variables.
Vertical Joining with UNION Corresponding
In PROC SQL two or more data sets may be vertically
joined used UNION CORRESPONDING ALL. (If the
‘ALL’ is omitted only one of any duplicate
observations are kept). This is analogous to APPEND
in PROC DATASETS but if the data sets to be joined
are not vertically compatible only variables common
to both data sets are placed in the resulting table.
This is the same example as above, but using PROC
SQL with UNION CORRESPONDING ALL.
PROC SQL;
CREATE TABLE DATA99 AS
SELECT *
FROM DATA01
UNION CORRESPONDING ALL
SELECT *
FROM DATA02;
QUIT;
This PROC SQL works if DATA99 is being created as
new, but if DATA99 already exists and the intention is
append DATA01 and DATA02 to this data set the
code must be written as
PROC SQL;
CREATE TABLE DATA99 AS
SELECT *
FROM DATA99
UNION CORRESPONDING ALL
SELECT *
FROM DATA01;
UNION CORRESPONDING ALL
SELECT *
FROM DATA02;
QUIT;
Coders' Corner