I would suggest that rather than using actual noise (antisocial and not necessarily easy to reproduce), a better approach would be to build a dummy load that emulates a speaker and measure the voltage across it to establish output power. The crucial part is having a means of measuring THD, and maybe taking two sets of measurements, one at 1% THD and one at 10% THD. Then do four plots: two with the tone controls set at 12 o'clock, at 1% and 10% THD respectively, and another two with the tone controls set wherever they give the flattest frequency response (documenting those settings), again at 1% and 10% THD.
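As a rough sketch of the arithmetic behind the dummy-load idea: output power follows from the RMS voltage across the load as P = V² / R, and THD is the RMS sum of the harmonics relative to the fundamental. The function names, the 8 ohm load, and the example voltages below are my own assumptions for illustration, and the power formula only holds as written for a purely resistive load. A load with reactive parts that mimics a real speaker would need the impedance at each test frequency.

```python
import math


def output_power_watts(v_rms, load_ohms):
    """Power dissipated in a purely resistive dummy load: P = V_rms^2 / R."""
    if load_ohms <= 0:
        raise ValueError("load resistance must be positive")
    return v_rms ** 2 / load_ohms


def thd_percent(fundamental_rms, harmonic_rms):
    """THD as a percentage: RMS sum of the harmonics over the fundamental."""
    if fundamental_rms <= 0:
        raise ValueError("fundamental amplitude must be positive")
    harmonics_total = math.sqrt(sum(h ** 2 for h in harmonic_rms))
    return 100.0 * harmonics_total / fundamental_rms


if __name__ == "__main__":
    # Assumed example: 20 V RMS across an 8 ohm resistive load.
    print(output_power_watts(20.0, 8.0), "W")
    # Assumed example: 10 V fundamental, 2nd and 3rd harmonics at 60 mV and 80 mV.
    print(round(thd_percent(10.0, [0.06, 0.08]), 3), "% THD")
```

In practice you would raise the drive level until the measured THD reaches 1% (and then 10%), read the RMS voltage at that point, and convert it to watts for each of the four tone-control conditions.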
This is just off the top of my head; I'm sure there are lots of holes that can be picked in it.