Context
As a statistician, my work has the double-edged sword of being highly visible. Whether I'm making the presentation myself or fulfilling a viz request for someone else's high-level meeting in 20 minutes, knowing what's going to be digestible and compelling for the intended audience is the one soft skill that I wouldn't want to be in this business without. It's also one that you never truly master, a lot like my experience in education. When I can say that I fully grasp the spectrum of human emotions and capabilities in every possible context, then I'll be able to say that I can fully communicate any technical topic to any audience. Until then, all I can do is practice, practice, practice and hopefully be granted the opportunities to experiment and innovate with all kinds of business requirements and data layouts. I recently was up against such a challenge at work, with a pretty tight deadline no less, and I'm sharing some of the key points (on simulated data) of that day's journey here.
The Request
Excel pivot charts are an essential tool these days, and the skillset to develop them is pretty common even among non-technical personnel. They are extremely versatile and powerful, and would do the job for a lot of contexts on their own. Here's the kind of table that I received and was asked to make a quickly-digestible visual out of:
Definitely not half bad, as far as pivot tables go. In case it isn't obvious, this business is tracking productivity by the number of documents created in a medical records system. For this report, they wanted the above named practitioners, submitting three particular document types, over the previous four months. The data is organized well, but this wouldn't play well as a PowerPoint slide. That's where we come in!
The Problem
Since pivot tables make things look a lot tidier than raw data sources, it appears to the untrained eye that visualizing this in some standard format is a simple task with not much consideration required. A lot of times you'll be asked to 'just make a chart'. But there are two problems with the situation as it stands:
1) Looking closely, you'll see that there are no less than four dimensions that the stakeholders are interested in. You have provider names for rows, three types of documents being aggregated, four months over which they are aggregated, and the aggregated sums themselves. This isn't exactly the 2-D problem it appears to be, so representing everything we've been given is a creative process that the analyst is trusted to carry out.
2) Excel doesn't have the granular charting capabilities required to solve this problem, and the pivot table format isn't digestible by any statistical suites that I know of. Here's an example of what Excel could pull off quickly if you asked it to:
I really don't hate this idea, but 3D plots like this are generally discouraged since you will always be obscuring data points behind others. Remember also that you and I will understand it much better than the user because we developed it. This is a lot to take in for the average report consumer, and that's why we are going to respectfully request the raw data from which the pivot table was made and start over in SAS making something much simpler. Here's what we are going to end up with, which I think does a better job of communicating everything that the pivot table does. It has the added benefit of maintaining a color scheme consistent with the company style guide:
Data Processing
Of course I'm starting with the data import of the raw file into SAS, and inspecting it to see what we're dealing with. We'll look at the first 15 rows:
libname proclib 'SAS-Library'; FILENAME REFFILE FILESRVC FOLDERPATH='*/folder' FILENAME='raw_file.csv'; PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=prod; GETNAMES=YES; RUN;
options obs=10; proc print data=prod; run;
Ok, this activity log has recorded the number of actions of each type performed by each user on each day in the 4-month window. This, unlike the pivot table, is something statistical software can work with- but one thing that isn't done yet is the segmentation into months. For that, it will be necessary to create a column that extracts the month from the date data, which itself means we need to make sure that SAS understands the second column as a date.
data prod; set prod; format Stamp date.; Month = Stamp; format Month monname.; run;
Visualization Conceptualization
With all the information of the original pivot table now available in standard columnar format, in theory we can get to the charting. Remember we are working with for dimensions here: name, quantity, type, and month. There are a few things to consider when designing a visualization with data like this:
- The performance of each provider relative to the others should be clearly communicated at a glance.
- The ranking of providers should be apparent, so users can quickly tell who is the most and least productive.
- The fact that there is a time element here (months) means that some level of chronological trend should be visible, if it exists.
- The segmentation of productivity into document types should be available, since it was clearly important to the designer of the pivot table.
Of course you've already seen the final product, so you know where I'm going with this. My chart is going to do the following to address each of the above points:
- Bar chart facets, or panels, for each user
- Facets ordered by the overall sum of documents created for each user
- Chronological order of monthly productivity numbers for each user
- Bar chart stacks colored by document type
Now what happens when we take the data we have and stick it into proc sgpanel, the SAS method for faceted charts?
options obs=max; proc sgpanel data=prod; ods graphics on / width=1200px height=800px; title '4-Month Provider Productivity'; panelby Name / rows = 3 columns = 7 novarname; vbar Month / response=Quantity group=Type stat=sum; colaxis valueattrs=(size=7); rowaxis label='Productivity (as measured by document creation)'; keylegend / position = top; styleattrs datacolors=(BIGB VLIGB GGR); run;
What an improvement! But we're not seeing the sorting yet, since by default SAS is going to plot the target panel variable in the most obvious order it can think of, which in this case is alphabetical. It's not like the users are ranked in the original data, so that's something we have to fix ourselves before plotting. This, as it happens, is the trickiest part.
Visualization Processing
The first order of business is to create a ranking of each provider by the sum of their overall productivity. Here is where we get to take advantage of SAS's primary means of data processing being SQL within proc statements.
proc sql; create table rank as select Name, SUM(Quantity) as Total from prod group by Name order by Total DESC; select * from rank; quit;
To get around certain limitations of SAS-SQL, I'm going to create a second table which uses monotonic() to explicitly rank the users with an integer row number and no other columns.
proc sql; create table rank_names as select monotonic() as rank, Name from rank; quit;
With all this in mind, I'm going to create a second version of the original 'prod' table called 'rank_prod' which includes a column with the ranking number of each user. This is so that we can use this as the panelby argument and have the panels sorted automatically in increasing order. This is easy enough using an inner join to the rank table we created.
proc sql noprint; create table rank_prod as select a.Name, b.rank, a.Month, a.Type, a.Quantity from prod a inner join rank_names b on a.Name = b.Name order by rank asc; run;
This creates one small issue, since by default the label for each panel will by be replaced by a rank number. This hardly communicates what we need, so we need a solution to format the panel labels by name instead. I must confess I didn't know how to do this without manually typing out the correspondence as seen in this article, and with 21 names I knew there had to be a better solution. This is what I came up with after a bit of research, to be passed to the format argument of sgplot. It uses the 'rank_names' table from earlier to map the numbers to names:
data format_input; set rank_names; retain fmtname 'rank_fmt'; start = rank; end = rank; label = Name; run; proc format cntlin=format_input; run;
Now the format argument in proc sgpanel will use rank_fmt to label the panels instead of the default rank variable. Let's take one more look at the complete visualization:
proc sgpanel data=rank_prod; ods graphics on / width=1200px height=800px; title '4-Month Provider Productivity'; panelby rank / rows = 3 columns = 7 novarname; format rank rank_fmt.; vbar Month / response=Quantity group=Type stat=sum; colaxis valueattrs=(size=7); rowaxis label='Productivity (as measured by document creation)'; keylegend / position = top; styleattrs datacolors=(BIGB VLIGB GGR); run;