LABORATORY SCHEDULE

Professor Charles E. McKenna
Department of Chemistry
Last updated October 23, 2003

All labs are in SGM 121

Week of October 27: Statistics of Drug Candidates
Due date:
Your lab date plus one week

The most fundamental question you can ask about any drug is "Does it work?". This turns out to be a much harder question to answer than most people assume. Different people react to drugs in different ways, effects vary over time, results can be affected by things other than the drug, pills designed to do nothing (placebos) often have positive results if the subject doesn't know that the pill is fake, and so forth. This lab is designed to give you a brief exposure to the types of studies that are done to try and determine if a drug works or not. (Or if it has a decent chance of causing serious side effects.) You'll again be using our favorite spreadsheet program, Excel. Either the Mac or the PC version can be used.

Exercises for this lab are given at the bottom of the page.

Excel Notes

You'll need the Excel Data Analysis tools, available under the Excel Tools menu on both Macs and PCs. Check the menu to see if they are there. (They should be available on the machines in our lab and also those in campus computer rooms.) When you do so, make sure that you have a cell selected containing something other than a chart. Excel likes to hide things on its menus, and won't show the option if you have a chart selected. If it's not there and you are using a PC, choose "Add-Ins" from the Tools menu and click on the box marked "Analysis Tools". (Not "Analysis Tools-VBA".) On a Mac, it works the same way, although the positions on the menus are different. If for some reason Analysis Tools isn't there or won't load correctly, contact Bruno.

For this lab, we'll include links to .xls files that contain the data you are interested in. Clicking on them may do several things- it may start Excel automatically and load the file or it may save the .xls file to the local hard drive. If it's the latter, open Excel and load the file directly. (You'll have to make sure when you try to open the file that the box marked "Show Files of Type" says "All files" or "MS Excel files .xls")

 

General Statistics

The data analysis tools provide a wide variety of options. The simplest computes a set of "basic statistics" for the data sample that you have. As an example, this file simulates the results from a first-semester organic class exam.

Load this file into Excel. It will have two columns; ignore the second for now, since it's used in the step below. To start, select "Data Analysis" from the Tools menu. (If it's not there, see above.) Select "Descriptive Statistics". A window showing the various options appears; the "Input Range" box should be highlighted. Click on the first data cell (A2) and then hold down the shift key and click on the last data cell. You'll have to scroll down to find the last cell- there are 329 people taking the course. When you do this, the Input Range box should read $A$2:$A$330. The rest of the options can be left as is. Click OK, and the descriptive statistics are generated.

The Mean is the average, and the median is the "half above/half below" value. Note that the median is a lot higher than the mean- why? The variance is the square of the standard deviation. Don't worry about the Kurtosis or Skewness. The rest of the numbers should be self-explanatory. The confidence value, if it is reported, is the half-width of the 95% confidence interval around the mean.
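
If you want to check Excel's numbers by hand, the same quantities can be computed in a few lines of Python. This is only a sketch: the scores below are made up for illustration and are not the course data. Like Excel's Descriptive Statistics tool, the `statistics` module uses the sample (n - 1) formulas for standard deviation and variance.

```python
import statistics

# Hypothetical exam scores, invented for this sketch (not the course file).
scores = [0, 0, 52, 61, 68, 70, 74, 77, 81, 85, 90]

mean = statistics.mean(scores)          # the average
median = statistics.median(scores)      # the "half above/half below" value
stdev = statistics.stdev(scores)        # sample standard deviation (n - 1)
variance = statistics.variance(scores)  # sample variance, equal to stdev squared

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f} variance={variance:.2f}")
```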

 

Histogram Plot

Computing a Histogram of the data is a quick way to get a feel for how the data is distributed. In a histogram plot, you divide up the range of results into "bins", then count the number of results in each bin. You're very familiar with these sorts of plots- they appear with the key to the exam after many tests at USC, where the input data are the scores of the students in the class and the bins are typically every 5-10 points on the test. Let's compute the histogram of the organic exam results above to show how familiar this format is.

Load the file into Excel. It should have two columns: the left one is the data, and the right one shows how we want to divide up the data into bins. Here, we've set up the file such that every five points on the exam is a different bin. To start, select "Data Analysis" from the Tools menu. (If it's not there, see above.) Select "Histogram". The "Input Range" box should be highlighted. Click on the first data cell (A2) and then hold down the shift key and click on the last data cell, the same as you did above. When you do this, the Input Range box should read $A$2:$A$330. Next, press the Tab key- this will select the "Bin Range" box, which takes the numbers that Excel needs to divide up the data. Click on the first bin cell (B1) and then shift click on the last (B21). The final result should read $B$1:$B$21 in the Bin Range box. Finally, click on "Chart Output" and hit OK.

The chart will appear, along with a table showing the numbers in each bin. By default, Excel chooses an amazingly small cell to print the histogram in. Click on the chart, then drag the bottom edge until it is a bit larger. As noted above, you've seen this kind of plot before. Note that there is a spike at zero; while you wouldn't expect this with normally distributed data, here you always get one since a few people didn't take the test.
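
The binning step itself is simple enough to sketch in Python. The example below assumes each bin number is an upper edge, so a score is counted in the first bin whose edge is greater than or equal to it, with one extra slot for anything above the last edge. The scores and the `histogram` helper are hypothetical and exist only for this illustration.

```python
from bisect import bisect_left

def histogram(data, bin_edges):
    """Count values per bin. Each bin holds values greater than the previous
    edge and less than or equal to its own edge; the final slot counts
    values above the last edge."""
    edges = sorted(bin_edges)
    counts = [0] * (len(edges) + 1)
    for value in data:
        # Index of the first edge that is >= value.
        counts[bisect_left(edges, value)] += 1
    return counts

scores = [0, 0, 52, 61, 68, 70, 74, 77, 81, 85, 90]  # hypothetical scores
edges = list(range(0, 101, 5))                        # 0, 5, ..., 100
counts = histogram(scores, edges)

# Text version of the chart: one row per bin edge.
for edge, count in zip(edges, counts):
    print(f"<= {edge:3d}: {'#' * count}")
```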

For most of the data we expect to see, something close to a "Normal distribution" would apply. For example, if your input data was the heights of all of the male students on the USC campus, you should get a curve very similar to a normal distribution. (If you did one for everyone at USC, you wouldn't see this, since on average women are shorter than men, and you'd have two peaks, one around 5'10", the other around 5'7".) In a similar fashion, you would expect to see data for tests of AIDS drugs follow a normal curve; for example, after six months, people taking a drug would have viral loads distributed like a normal curve.

 

Hypothesis testing

So, how can we tell if a drug works? We do this as shown in class: first, formulate a null hypothesis. This is typically expressed as "Drug does nothing"- viral loads don't drop, long term survival rates are the same as for people not taking the drug, or something similar. We then test a number of people- half get the drug, half get a placebo. (In what's known as a "double-blind" study, neither the patient nor the doctor knows who gets what. This is to prevent the doctor from skewing the results.) After a while, we collect the data, and then ask if the data is significant. I.e., is the result we got different from the null hypothesis?

This is done with a couple of the equations you learned in class. For large data sets (N>=30),

Z = |X-mu0|/(s/sqrt(N))

Here, X is the mean of the data you are looking at, mu0 is the expected answer from your null hypothesis, s is the standard deviation of the data and N the number of data points in the set. Once you have computed Z, you compare it to the standard normal distribution to see if the results are significant.

 

Percentage points of the standard normal distribution

 

Z-95%     Z-97.5%   Z-99%     Z-99.5%   Z-99.9%   Z-99.95%
1.645     1.96      2.33      2.58      3.09      3.29

 If Z is negative, consider only its absolute value for the above table.
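
If you'd like to see where these numbers come from, they can be reproduced with the inverse of the standard normal cumulative distribution. The sketch below uses Python's `statistics.NormalDist` (Python 3.8 or later); the variable names are just for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of the area under the curve lying below each Z value in the table.
levels = [0.95, 0.975, 0.99, 0.995, 0.999, 0.9995]
points = {level: standard_normal.inv_cdf(level) for level in levels}

for level, z in points.items():
    print(f"Z-{level:.2%}: {z:.3f}")
```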

A specific example is helpful here. Load this file.

We have a new drug that might help people with advanced AIDS. We collect 50 people with advanced AIDS and give them the drug. On average, the people in the study should have five years to live from the point when they start taking the drug. This file is the lifespan of each of the people taking the drug in years. It appears that the drug extends lifespan, but are the results statistically significant?

To check, we use the above formula. Compute the descriptive statistics on the set of data given to get the mean and standard deviation. Here, mu0 is 5, the number of years we expect people to live. X is the mean of the data in this set, 5.989 years. The standard deviation of the data, s, is 3.24 years, and the number of samples is 50, so Z = (5.989-5.0)/(3.24/sqrt(50)) = 2.16. Checking the table above, Z falls between the 97.5% point (1.96) and the 99% point (2.33). We can be fairly sure that the drug does in fact increase lifespan. Note that even though the results are statistically significant, they may not be very useful- an extra year of life is nice, but if the drug is very costly or has a lot of side effects, chances are that the drug company won't bother to continue to develop it.
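
The arithmetic in this example can be checked with a short Python sketch. The `z_statistic` helper is a made-up name for this illustration; it just evaluates the Z formula with the summary numbers quoted above, then looks up the area of the standard normal curve below Z.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, s, n):
    """Z = |X - mu0| / (s / sqrt(N))."""
    return abs(sample_mean - mu0) / (s / sqrt(n))

# Summary values quoted in the text for the lifespan data.
z = z_statistic(sample_mean=5.989, mu0=5.0, s=3.24, n=50)

# Area of the standard normal curve below Z (same convention as the table).
level = NormalDist().cdf(z)

print(f"Z = {z:.2f}, area below Z = {level:.1%}")
```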

For smaller data sets, we use the Student's t distribution instead of the method above. The equation is the same, but the distribution is different. For the Student's t distribution, we don't have a single table like the above for the percentage points- rather, the shape of the distribution is determined by the number of "degrees of freedom". The number of degrees of freedom in a problem is 1 fewer than the number of data points- if you have 20 samples, the problem has 19 degrees of freedom. As the number of degrees of freedom increases, the Student's t distribution looks more and more like the standard normal distribution- above 40, they are virtually identical.

Excel has a built-in function that returns the tail probability of the Student's t distribution for any number of degrees of freedom. Once you have calculated a Z value, you can find out exactly how unlikely that result is statistically. To use it, click in a cell anywhere on the spreadsheet, and enter =TDIST(Z, DOF, TAILS), where Z is the Z value you computed above, DOF is the number of degrees of freedom, and TAILS is the number of tails you want on the distribution. (1 or 2- for our purposes, always use the two-tailed distribution.)

As another example, load this file. These are the results for a trial of a new drug with a group of AIDS patients. They were tested for total viral load, then took the drug for six months, at which time their viral load was measured again- the table is the decrease in the viral load for each patient. (Note that some are negative- the viral load actually increased while on the drug.) We check the descriptive statistics for these data and compute Z.

Our null hypothesis is that the drug does nothing- viral load won't change, so mu0 is 0. The mean is 8747, the standard deviation is 9723, and N is 20. Thus,

Z = (8747-0)/(9723/sqrt(20)) = 4.023

Is this result significant with only 20 samples? We have to check the Student's t distribution for 19 degrees of freedom (N = 20). Pick a cell on the page and enter =TDIST(4.023,19,2). The result is 0.000727, meaning that the result is significant at the (1-0.000727)*100%, or about 99.93%, level. We can say with high confidence that the drug works. Use the absolute value of Z for checking the Student's t distribution if Z is negative.
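
For anyone curious what the two-tailed lookup is doing, here is a rough Python sketch that integrates the Student's t density numerically with Simpson's rule. It is an illustration only, not Excel's actual algorithm, and `t_pdf` and `t_tail_probability` are names invented for this example. It uses N - 1 = 19 degrees of freedom for the 20 patients.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(t, dof):
    """Density of the Student's t distribution with dof degrees of freedom."""
    log_norm = lgamma((dof + 1) / 2) - lgamma(dof / 2) - 0.5 * log(dof * pi)
    return exp(log_norm - (dof + 1) / 2 * log(1 + t * t / dof))

def t_tail_probability(t, dof, tails=2, steps=2000):
    """Probability of a result beyond |t| in one or two tails.
    Integrates the density from 0 to |t| with Simpson's rule (steps is even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = total * h / 3       # area between 0 and |t|
    return tails * (0.5 - central_area)

# Summary values quoted in the text for the viral load data.
z = 8747 / (9723 / sqrt(20))
p = t_tail_probability(z, dof=19, tails=2)

print(f"Z = {z:.3f}, two-tailed probability = {p:.6f}")
```

A smaller probability means the result is less likely to have come about by chance if the null hypothesis were true.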

Another example: 50 people tested, controls showed no change.


Exercises:

Exercise #1.
QuickIPO Pharmaceuticals has come up with two different drugs that it thinks may reduce viral loads in AIDS patients. It has tested each drug on 50 people and computed the change in viral load over a year period. An additional 50 similar people were studied, but not given any drugs. For the 50 people given no drug, there was no overall change in viral load in the year. Here are the results from drug1 and drug2.
  1. Compute the histograms for each data set. Do they appear to be distributed normally?
  2. Compute the Z value for each drug and check it against the standard normal distribution. Does either drug give results that are significantly better than chance?


Exercise #2.
Murk, Inc. has developed two similar drugs. It has had a harder time finding people to test however, and could only find twenty people to take each drug. The average decrease in the viral load was noted over a year period. The data for drug3 and drug4 are given here. 20 other people were given no drug, and overall their viral loads were unchanged.
  1. Compute the Z values for each data set, and then the Student's t distribution value for each. Should the drug company keep developing either drug?

Exercise #3.

Unwelcome Burrows, Inc. has developed drug5. It found 50 people to take the test, and another fifty to not take the drug. With the people who didn't take the drug, the viral load increased by an average of 5000. (Note the change in the null hypothesis, and also that the data given are a decrease, whereas this is an increase.)

  1. Compute the descriptive statistics. What's the average of the data? Just based on this, should the company keep developing the drug? Compute Z- do you think the company should keep developing the drug based on this data? Why?

(C) CE McKenna, Ph.D. USC, Chemistry Dept., 2003
