CS 160: Lecture 16
based on notes by James Landay
Outline
Review
- E-commerce: shopping carts, checkout
- Web site usability survey
- Readability of start page?
- Graphic design?
- Short vs. long anchor text in links?
Why do User Testing?
- Can’t tell how good or bad a UI is until people actually use it
- Other methods depend on evaluators, who
- may know too much
- may not know enough (about tasks, etc.)
- Summary: it is hard to predict what real users will do
Choosing Participants
- Representative of eventual users in terms of
- job-specific vocabulary / knowledge
- tasks
- If you can’t get real users, get the closest approximation
- system intended for doctors -> e.g., recruit nurses or medical students
- system intended for electrical engineers -> e.g., recruit engineering students
- Use incentives to get participants
Ethical Considerations
- Sometimes tests can be distressing
- users have left in tears (embarrassed by mistakes)
- You have a responsibility to alleviate this distress
- make voluntary with informed consent
- avoid pressure to participate
- will not affect their job status either way
- let them know they can stop at any time [Gomoll]
- stress that you are testing the system, not them
- make collected data as anonymous as possible
- Often must get human subjects approval
User Test Proposal
- A report that contains
- objective
- description of the system being tested
- task environment & materials
- participants
- methodology
- tasks
- test measures
- Get approved & then reuse for final report
Selecting Tasks
- Should reflect what real tasks will be like
- Tasks from analysis & design can be used
- may need to shorten if
- they take too long
- require background that test user won’t have
- Avoid bending tasks in direction of what your design best supports
- Don’t choose tasks that are too fragmented
Data Types
- Independent Variables: the ones you control
- Aspects of the interface design
- Characteristics of the test users
- Discrete: A, B or C
- Continuous: Time between clicks for double-click
- Dependent variables: the ones you measure
- Time to complete tasks
- Number of errors
Deciding on Data to Collect
- Two types of data
- process data
- observations of what users are doing & thinking
- bottom-line data
- summary of what happened (time, errors, success…)
- i.e., the dependent variables
Process Data vs. Bottom Line Data
- Focus on process data first
- gives good overview of where problems are
- Bottom-line doesn’t tell you where to fix
- just says: “too slow”, “too many errors”, etc.
- Hard to get reliable bottom-line results
- need many users for statistical significance
The “Thinking Aloud” Method
- Need to know what users are thinking, not just what they are doing
- Ask users to talk while performing tasks
- tell us what they are thinking
- tell us what they are trying to do
- tell us questions that arise as they work
- tell us things they read
- Make a recording or take good notes
- make sure you can tell what they were doing
Thinking Aloud (cont.)
- Prompt the user to keep talking
- “tell me what you are thinking”
- Only help on things you have pre-decided
- keep track of anything you do give help on
- Recording
- use a digital watch/clock
- take notes, plus if possible
- record audio and video (or even event logs)
Administrivia
- Yep, we know the server is down
- Could be a hard or easy fix
- Check www.cs.berkeley.edu/~jfc for temp replacement
- Please hand in projects tomorrow.
- Use zip and email if no server.
Using the Test Results
- Summarize the data
- make a list of all critical incidents (CI)
- include references back to original data
- try to judge why each difficulty occurred
- What does data tell you?
- UI work the way you thought it would?
- consistent with your cognitive walkthrough?
- users take approaches you expected?
- something missing?
Using the Results (cont.)
- Update task analysis and rethink design
- rate severity & ease of fixing CIs
- fix both severe problems & make the easy fixes
- Will thinking aloud give the right answers?
- not always
- if you ask a question, people will always give an answer, even if it has nothing to do with the facts
- try to avoid specific questions
Measuring Bottom-Line Usability
- Situations in which numbers are useful
- time requirements for task completion
- successful task completion
- compare two designs on speed or # of errors
- Do not combine with thinking-aloud. Why?
- talking can affect speed & accuracy (both negatively & positively)
- Error or successful completion is harder
- define in advance what these mean
Some statistics
- A relation (hypothesis) e.g. X > Y
- We would often like to know if a relation is true
- e.g. X = time taken by novice users
- Y = time taken by users with some training
- To find out if the relation is true we do experiments to get lots of x’s and y’s (observations)
- Suppose avg(x) > avg(y), or that most of the x’s are larger than all of the y’s. What does that prove?
Significance
- The significance or p-value of an outcome is the probability that it happens by chance if the relation does not hold.
- E.g. p = 0.05 means that there is a 1/20 chance that the observation happens if the hypothesis is false.
- So the smaller the p-value, the greater the significance.
Significance (cont.)
- And p = 0.001 means there is a 1/1000 chance that the observation happens if the hypothesis is false, so the observation is very strong evidence for the hypothesis.
- Significance increases with number of trials.
- CAVEAT: You have to make assumptions about the probability distributions to get good p-values.
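The definition above can be made concrete with a small permutation test; the task times below are made-up illustration values, not data from the lecture. If the relation does not hold, the x/y labels carry no information, so the p-value is just the fraction of random relabelings in which chance alone produces a gap at least as large as the observed one. This approach also sidesteps the caveat about assumed probability distributions.

```python
import random

random.seed(0)  # make the simulation repeatable

# Hypothetical data: x = novice task times, y = trained task times (minutes)
x = [34, 29, 41, 38, 30]
y = [27, 25, 31, 24, 29]
observed = sum(x) / len(x) - sum(y) / len(y)   # observed gap between means

# If the hypothesis "X > Y" is false, the labels are arbitrary: shuffle the
# pooled observations and count how often chance alone produces a gap at
# least as large as the one we actually observed.
pooled = x + y
hits = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    gap = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if gap >= observed:
        hits += 1

p_value = hits / trials   # estimated probability of the outcome by chance
print(p_value < 0.05)
```

For this data the estimate lands near 0.02: significant at the 0.05 level, but nowhere near 0.001.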
Normal distributions
- Many variables have a Normal distribution
- [Figure: the density curve at left, the cumulative probability at right]
- Normal distributions are completely characterized by their mean and variance (mean squared deviation from the mean).
Normal distributions (cont.)
- The difference between two independent normal variables is also a normal variable, whose variance is the sum of the variances of the distributions.
- Asserting that X > Y is the same as (X-Y) > 0, whose probability we can read off from the curve.
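Reading that probability off the curve is one line with Python's `statistics.NormalDist`; the means and standard deviations below are made-up illustration values:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical example: X = novice task time, Y = trained task time (minutes)
mean_x, sd_x = 30.0, 10.0   # X ~ Normal(30, 10**2)
mean_y, sd_y = 22.0, 8.0    # Y ~ Normal(22, 8**2)

# D = X - Y is also normal: its mean is the difference of the means, and
# its variance is the SUM of the variances (for independent X and Y).
d = NormalDist(mu=mean_x - mean_y, sigma=sqrt(sd_x**2 + sd_y**2))

# P(X > Y) = P(D > 0), read off the cumulative distribution
p_x_greater = 1 - d.cdf(0)
print(round(p_x_greater, 3))   # 0.734
```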
Analyzing the Numbers
- Example: trying to get task time <=30 min.
- test gives: 20, 15, 40, 90, 10, 5
- mean (average) = 30
- median (middle) = 17.5
- looks good!
- wrong conclusion: this data alone doesn’t make us certain of anything
- Factors contributing to our uncertainty
- small number of test users (n = 6)
- results are very variable (standard deviation = 32)
- std. dev. measures dispersal from the mean
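The summary numbers quoted above can be reproduced with Python's standard `statistics` module (`stdev` is the sample standard deviation, dividing by n − 1):

```python
from statistics import mean, median, stdev

times = [20, 15, 40, 90, 10, 5]   # task times (minutes) from the test

print(mean(times))          # 30   -- the average
print(median(times))        # 17.5 -- the middle value
print(round(stdev(times)))  # 32   -- dispersal from the mean
```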
Analyzing the Numbers (cont.)
- Crank through the procedures and you find
- 95% certain that the typical value is between 5 & 55
- Usability test data is quite variable
- need lots to get good estimates of typical values
- 4 times as many tests will only narrow range by 2x
- breadth of range depends on sqrt of # of test users
- this is when online methods become useful
- easy to test w/ large numbers of users (e.g., NetRaker)
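“Crank through the procedures” here means a standard confidence interval, mean ± z · s/√n. A sketch using the normal-approximation value z = 1.96 for 95% confidence (with only six users, a t-distribution multiplier would widen the interval further):

```python
from math import sqrt
from statistics import mean, stdev

times = [20, 15, 40, 90, 10, 5]   # the test data from above
n = len(times)
m, s = mean(times), stdev(times)

half_width = 1.96 * s / sqrt(n)   # 95% normal-approximation interval
lo, hi = m - half_width, m + half_width
print(round(lo), round(hi))       # roughly 5 55
```

Because the half-width shrinks like 1/√n, quadrupling the number of test users only halves the range, exactly as the slide says.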
Measuring User Preference
- How much users like or dislike the system
- can ask them to rate on a scale of 1 to 10
- or have them choose among statements
- “best UI I’ve ever…”, “better than average”…
- hard to be sure what data will mean
- novelty of UI, feelings, not realistic setting, etc.
- If many give you low ratings -> trouble
- Can get some useful data by asking
- what they liked, disliked, where they had trouble, best part, worst part, etc. (redundant questions)
Comparing Two Alternatives
- Between groups experiment
- two groups of test users
- each group uses only 1 of the systems
- Within groups experiment
- one group of test users
- each person uses both systems
- can’t use the same tasks or the same order (learning effects)
- best for low-level interaction techniques
- Between groups will require many more participants than a within groups experiment
- See if differences are statistically significant
- assumes normal distribution & same std. dev.
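Under those assumptions (normal distributions, same standard deviation) a between-groups comparison can be checked with a pooled two-sample test. This is a sketch with made-up task times, using a normal approximation rather than the t-distribution:

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical task times (minutes) for two interface designs
design_a = [22, 30, 25, 28, 35, 27, 24, 31]
design_b = [18, 24, 20, 23, 26, 19, 22, 21]

na, nb = len(design_a), len(design_b)

# Pooled standard deviation (assumes both groups share one std. dev.)
sp = sqrt(((na - 1) * variance(design_a) + (nb - 1) * variance(design_b))
          / (na + nb - 2))

# Standardized gap between the two group means
z = (mean(design_a) - mean(design_b)) / (sp * sqrt(1 / na + 1 / nb))

# One-sided p-value: chance of a gap this large if the designs are equal
p = 1 - NormalDist().cdf(z)
print(p < 0.05)   # True -- the difference is statistically significant
```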
Experimental Details
- Order of tasks
- choose one simple order (simple -> complex)
- unless doing within groups experiment
- Training
- depends on how real system will be used
- What if someone doesn’t finish
- assign very large time & large # of errors
- Pilot study
- helps you fix problems with the study
- do two: first with colleagues, then with real users
Reporting the Results
- Report what you did & what happened
- Images & graphs help people get it!
Summary
- User testing is important, but takes time/effort
- Early testing can be done on mock-ups (low-fi)
- Use real tasks & representative participants
- Be ethical & treat your participants well
- Want to know what people are doing & why
- i.e., collect process data
- Using bottom line data requires more users to get statistically reliable results