
A maths question, chance vs skill.


Downunderwonder

In blind audio A/B comparison testing, the results can be compared against the odds of random guessing.

 

A and B are similar but different. Subjects get to switch between A and B as many times as they like to determine which is which, based on their real-life experience of A and B.

 

A sample of the unskilled public is assumed to have zero skill at making the determination, so the expected result is that 50% get it correct at random.

 

We assume all participants are at least partially skilled.

 

What is the actual skill level if 75% get it correct, when no skill at all is needed for a 50% result?


Simpleton maths says that if 50% can get the correct result without listening, then a 75% success rate indicates 25% have the skill to discern, since the other 50% were the product of a guess.

 

Then it gets too complicated for me. 25% got it wrong; they emphatically could not discern A from B. It doesn't logically follow that the other 75% could, because correct guesses provide a large chunk of the results, but I don't have any maths for it.


If 75% of people answered correctly, then you can say that an average person has skill level 0.75 (whatever that means); i.e. the probability that a randomly chosen participant answered correctly is 0.75. This is about as much as you can say if the prior skill level distribution is uniform across all participants.

 

If you start with a different prior distribution, for example 90% of people have skill level 0.5 and 10% have skill level X, you will be able to calculate what X is for a given experiment result Y (in expectation).
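To make that example concrete (my arithmetic, using the illustrative prior just given, not anything from a real trial): with 90% of people at skill level 0.5 and 10% at skill level X, the expected correct rate Y satisfies

\[ Y = 0.9 \times 0.5 + 0.1 \times X \quad\Rightarrow\quad X = \frac{Y - 0.45}{0.1}. \]

Note that plugging in Y = 0.75 gives X = 3, which is impossible for a probability, so that particular prior could never produce a 75% result; the skilled group would have to be larger than 10%.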

Edited by Chaosanator

7 hours ago, Chaosanator said:

If 75% of people answered correctly, then you can say that an average person has skill level 0.75 (whatever that means); i.e. the probability that a randomly chosen participant answered correctly is 0.75. This is about as much as you can say if the prior skill level distribution is uniform across all participants.

 

If you start with a different prior distribution, for example 90% of people have skill level 0.5 and 10% have skill level X, you will be able to calculate what X is for a given experiment result Y (in expectation).

The actual skill level is 1 or zero for individual test subjects. The returned test data does not differentiate between correct guesses and skilled choices. It's a one-shot trial.

 

There is a hard and fast assumption that at an actual skill level of zero the test would return a 50% correct selection rate. Instead there is a 75% success rate.

 

If the success rate came in close to 100% you could say the ones who got it wrong have bad hearing and be confident there were clearly audible differences. 75% is no-man's-land, where it would be nice to have some robust maths to describe the percentage with the skill to hear the difference.


Suppose that the population consists of only two types of person:

- type A has no skill, and therefore guesses correctly with probability 0.5

- type B has full skill, and therefore answers correctly with probability 1

Then, after a single experiment that resulted in 75% correct scores, you'd expect that 50% of your population is type A and 50% is type B.
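For anyone who wants the maths behind that figure (my sketch of the mixture calculation, not from the post above): if a fraction b of the population is type B and the rest are type A guessers, the expected correct rate is

\[ p = b \times 1 + (1 - b) \times 0.5 = 0.5 + 0.5\,b \quad\Rightarrow\quad b = 2p - 1, \]

so p = 0.75 gives b = 2(0.75) − 1 = 0.5: half skilled, half guessing.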


1 hour ago, Chaosanator said:

Suppose that the population consists of only two types of person:

- type A has no skill, and therefore guesses correctly with probability 0.5

- type B has full skill, and therefore answers correctly with probability 1

Then, after a single experiment that resulted in 75% correct scores, you'd expect that 50% of your population is type A and 50% is type B.

That's what I figured, but I still have no maths to back it up.

 

With a result of 90%, things get squiffy: 20% guessing and 80% skilled, but surely all 90% would claim they heard it right! How does the statistician prove the existence of the lucky 10% who are just as tone deaf as the 10% who got it wrong? Nobody even knows whether they are in the lucky 10% or the skilled 80%.
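Running 90% through the same two-type formula from above (again my arithmetic, under that model's assumptions):

\[ b = 2 \times 0.9 - 1 = 0.8, \]

so 80% are skilled and 20% are guessers, and the guessers split evenly into the lucky 10% who got it right and the 10% who got it wrong.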


Indeed, with just one experiment you are unable to properly distinguish between types A and B. Run the experiment 10 times (or ask them to answer on 10 independent audio samples); then the probability that someone of type A guesses correctly on every sample is less than 0.001. Type B people never make mistakes, so if someone answers correctly on all 10 samples, the probability that they are of type B is at least 0.999.
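A quick check of those numbers, as a minimal Python sketch assuming the two-type model and a 50/50 prior over the two types:

```python
from fractions import Fraction

n = 10                                  # independent audio samples per listener
p_all_correct_A = Fraction(1, 2) ** n   # type-A guesser aces all 10: 1/1024 < 0.001

# Bayes' rule for P(type B | all 10 correct); type B answers every
# sample correctly, so its likelihood is 1.
prior_B = Fraction(1, 2)
posterior_B = prior_B * 1 / (prior_B * 1 + (1 - prior_B) * p_all_correct_A)

print(float(p_all_correct_A))  # 0.0009765625
print(float(posterior_B))      # 0.99902..., i.e. at least 0.999
```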


6 hours ago, Chaosanator said:

Indeed, with just one experiment you are unable to properly distinguish between types A and B. Run the experiment 10 times (or ask them to answer on 10 independent audio samples); then the probability that someone of type A guesses correctly on every sample is less than 0.001. Type B people never make mistakes, so if someone answers correctly on all 10 samples, the probability that they are of type B is at least 0.999.

Not how I posed the trial.


4 hours ago, agedhorse said:

What if there is no difference between A and B?

It simply wouldn't rate investigation unless that was the point of a different investigation to the one I posted.

 

Do feel free to expand on your almost famous accounts of trialling the new kid on the block, the GenzBenz Class D power stage, against its predecessor. I only remember that the result was not popular among the 'class A/B rocks best' club when they couldn't pick which was which.


2 hours ago, Downunderwonder said:

The trial has already been done as described! 75% got it correct. All that remains is the interpretation of the very simple data.

There's not really enough data to draw a conclusion though. You could say that your results suggest that there's a 50% better than random chance that someone can tell the difference, but that's really all. 

 

You want what's called a p-value, which measures how likely a result at least as extreme as yours would be under chance alone, but that's probably not going to be significant depending on what exactly you did for the experiments. As one of many examples, if your 75% outcome represents 3 out of 4 participants then you're not going to meet any kind of normal standard for significance.
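To illustrate that sample-size point (my own sketch, not part of the post): the one-sided binomial p-value is just the chance of scoring at least as well by pure guessing.

```python
from math import comb

def one_sided_p(k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5): the probability of doing
    at least this well by coin-flipping alone."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(one_sided_p(3, 4))     # 0.3125   -- 3 of 4 correct: nowhere near significant
print(one_sided_p(75, 100))  # ~2.8e-07 -- 75 of 100 correct: overwhelmingly significant
```

The same 75% success rate is meaningless at n = 4 and decisive at n = 100.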

Edited by Jack

9 hours ago, Downunderwonder said:

It simply wouldn't rate investigation unless that was the point of a different investigation to the one I posted.

 

Do feel free to expand on your almost famous accounts of trialling the new kid on the block, the GenzBenz Class D power stage, against its predecessor. I only remember that the result was not popular among the 'class A/B rocks best' club when they couldn't pick which was which.

It serves as a control, because if the result is statistically greater than 50-50, something else is going on.

 

Yes, double-blind testing (including control testing) is an important part of moving technology and designs forward. This is how we rule in or out a particular aspect of a design being the cause/effect of what’s being investigated. 


7 hours ago, Jack said:

There's not really enough data to draw a conclusion though. You could say that your results suggest that there's a 50% better than random chance that someone can tell the difference, but that's really all. 

 

You want what's called a p-value, which measures how likely a result at least as extreme as yours would be under chance alone, but that's probably not going to be significant depending on what exactly you did for the experiments. As one of many examples, if your 75% outcome represents 3 out of 4 participants then you're not going to meet any kind of normal standard for significance.

I forget the exact number, but I am confident it was a significant number of volunteers, well into triple digits.

 

As I recall, there were two or three levels of overdrive dialled in as close as possible to equal, and recording levels were also equalised. It was very well done. Participants were asked to listen to each set as many times as they liked before deciding which set was device A and which was B. 75% decided correctly.

 

Nobody has corrected the idea that the simple calculation is the truth, so long as the sample size is significant, so I am good with that.

 

Result, statistically only 50% of bass players could properly tell a recording of a VT bass pedal from an all tube DI.

 

You can imagine the consternation when 75% got it 'correct'!!!!
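For what it's worth, here is a rough sketch of the uncertainty around that 'half can tell' figure, assuming a hypothetical n = 100 (the post only says triple digits):

```python
from math import sqrt

n, p_hat = 100, 0.75   # assumed sample size, observed correct rate

# 95% normal-approximation (Wald) interval for the true correct rate
half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - half_width, p_hat + half_width   # about (0.665, 0.835)

# Two-type model from earlier in the thread: skilled fraction b = 2p - 1
print(2 * lo - 1, 2 * hi - 1)   # roughly 0.33 to 0.67
```

So even with 100 listeners, 'half can tell' really means 'somewhere between a third and two thirds can tell'.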

Edited by Downunderwonder
Fixed conclusion.

47 minutes ago, Downunderwonder said:

I forget the exact number, but I am confident it was a significant number of volunteers, well into triple digits.

 

As I recall, there were two or three levels of overdrive dialled in as close as possible to equal, and recording levels were also equalised. It was very well done. Participants were asked to listen to each set as many times as they liked before deciding which set was device A and which was B. 75% decided correctly.

 

Nobody has corrected the idea that the simple calculation is the truth, so long as the sample size is significant, so I am good with that.

 

Result, statistically only 25% of bass players could properly tell a recording of a VT bass pedal from an all tube DI.

 

You can imagine the consternation when 75% got it 'correct'!!!!

But 75% did get it correct, which suggests that there is probably something more than random chance going on. Anything beyond that is kind of shaky.

 

I think what you're trying to do is to sort of forget about 50% of the participants, as you think that accounts for random chance, but it doesn't work that way. That would leave us with 25% of people (which is now 'half') who got it right and the same number who didn't. So, what does that tell us?

Edited by Jack

On 30/12/2021 at 22:57, Downunderwonder said:

We assume all participants are at least partially skilled.

 

On 31/12/2021 at 19:32, Downunderwonder said:

The actual skill level is 1 or zero for individual test subjects.

 

I think you'll find the answer is "some".


There used to be an advertising campaign for margarine. Blindfolded people couldn't tell Stork from butter.

 

I used to repeat this test with my A-level biology students as an introduction to stats: 10 slices of bread, five buttered and five with marge. Nobody ever got 10/10, and I can't remember anyone ever getting 5/10. I had to explain that getting 1/10 meant they were good at this; they just preferred marge! We consistently came up, year to year, with averages between 7 and 8/10. Humans are notoriously bad at detecting these differences, which is why we measure so much. Good luck if you are ever falsely arrested and depend upon an identity parade!

 

The reality is that, certainly in biology and medicine, truth is statistical. My students could do better than chance at identifying butter. Vaccines improve the outcomes for a proportion of people, and some vaccines work better than others. If the efficacy is very different you can pick this up with a small sample, but the less difference there is, the larger the sample you need to be confident in your test. There are accepted mathematical techniques for testing whether a set of data is actually significant, and at what level. For a scientist, statistics is the way of challenging and examining the data; ironically, you don't lie with statistics, you lie with data.
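A small sketch of that sample-size trade-off (my illustration, using the standard normal-approximation formula for a one-sided test at alpha = 0.05 with 80% power):

```python
from math import sqrt, ceil

def sample_size(p1: float, p0: float = 0.5,
                z_alpha: float = 1.645, z_beta: float = 0.84) -> int:
    """Approximate n needed to detect a true success rate p1 against
    chance p0 (one-sided alpha = 0.05, power = 0.80)."""
    num = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

print(sample_size(0.75))  # ~23 subjects: a big difference shows up quickly
print(sample_size(0.55))  # ~616 subjects: a small difference needs a large sample
```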

 

In this case, as described, it's hard to see what on earth they were trying to test. The assumption that the skill level of an ordinary person is zero isn't a safe one to make: it seems unlikely to be true, and they may have been as good as a musician at hearing a difference. The musicians and non-musicians alike might be able to tell that there was a difference but might have preferred the VT pedal. It just isn't clear what hypothesis they were testing and what null hypothesis was used in the statistical analysis, if any was done. If indeed the test was as described, there were too many variables to make any sense of the results.

 

Edited by Phil Starr

31 minutes ago, Phil Starr said:

In this case, as described, it's hard to see what on earth they were trying to test. The assumption that the skill level of an ordinary person is zero isn't a safe one to make: it seems unlikely to be true, and they may have been as good as a musician at hearing a difference. The musicians and non-musicians alike might be able to tell that there was a difference but might have preferred the VT pedal. It just isn't clear what hypothesis they were testing and what null hypothesis was used in the statistical analysis, if any was done. If indeed the test was as described, there were too many variables to make any sense of the results.

The test was set to challenge bass players to pick which was tubes and which was solid state emulating tubes. I think it's pretty straightforward! It wasn't asking which was preferred, only which was which.

 

The hypothesis being tested was "tube emulation is close but no cigar, bass players know this and can tell the difference". Further, "here is our chance to prove it".

 

Since it was such a simple test, the binary choice forces us to allow for coin-toss answers affecting the results so strongly.

 

On further reflection, the responses that were wrong had to be from folks who "had an inkling" but were wrong. I don't think anyone would have given an actual coin-toss answer, but even if they did, they are also "don't really knows", split 50/50. So mathematically we know the "can't tell" group comes to 50% of the total, being 2× the 25% who answered wrong.

 

The key to it, I think, is that you are just as likely to pick correctly as incorrectly if you don't have the capability to properly discern A from B.

 

One of these days I might have a crack myself. I would make it a lot harder by including an unknown number of tube devices to pick out of the samples.

 

If I do, I will need a real statistician to interpret the results back to the likelihood that any one basschatter can correctly identify whether a recording is from a solid-state or tube amp.
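As a starting point for that conversation, here is a hedged sketch of the sort of per-listener criterion a statistician might suggest: with n forced-choice trials per person, find the smallest score that a pure guesser reaches with probability at most 5%.

```python
from math import comb

def pass_mark(n: int, alpha: float = 0.05):
    """Smallest k with P(X >= k) <= alpha for X ~ Binomial(n, 0.5):
    the score a guesser is unlikely to reach."""
    for k in range(n + 1):
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        if tail <= alpha:
            return k, tail
    return n + 1, 0.0

print(pass_mark(10))  # (9, ~0.011): a listener must score 9/10 to beat chance
```

A multi-trial design like this lets you classify individual listeners, which a single-shot A/B question cannot.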


If you decide to run a similar trial later down the line, may I suggest seeking advice from a statistician before conducting, or even designing, the trial. They will be able to help design the experiment so that it delivers useful data from which they can make meaningful inferences. Asking a statistician for help after the experiment has been conducted may be too late.

Edited by Chaosanator
