What common challenges do young biostatisticians face?

Well, it strongly depends on:

  • your area of interest: epidemiology, ecology, medicine in the broad sense, pharmacy and drug development, bioinformatics (for instance, merging machines with the human body), clinical diagnostics, and so on
  • where you are going to work as a biostatistician: public offices, pharmaceutical companies, laboratories. You can also be an independent researcher/investigator (a freelancer) who supports scientists in their investigations, in writing dissertations, and so on.

Let me briefly describe the challenges I had to deal with, both as a freelancer supporting scientists and as a biostatistician in a Contract Research Organization (CRO), where I statistically assess the efficacy, safety, toxicity and other measures related to new drugs and therapies.

Freelancer, Evidence-Based Medicine:

  • A huge number of statistical methods and tests. I remember the first time I bought and opened my beloved handbook: Handbook of Parametric and Nonparametric Statistical Procedures, Fifth Edition, by David J. Sheskin (ISBN 9781439858011).
    The content just… knocked me off my feet. So many tests, methods, assumptions, notes, corrections! How could I take them all in? The key is to understand what they do and to organize the knowledge.

    So I slowly started to organize my knowledge: a-ha! Some tests verify the assumptions of other statistical tests. Some answer questions about the effect (this is what you want to examine). Many questions can be answered in many ways; there is no single way to get the answer. There are parametric, semi-parametric and non-parametric tests. A-ha! But how do they relate to each other?

    One of the most important moments for me was realizing that ANOVA, ANCOVA, the t-test and linear regression are all realizations of a linear model. Then I added the whole set of logistic regressions and labelled the extended family the generalized linear model (GLM).

    This was the most important moment in my life as a data analyst: I started to see relationships and analogies between statistical methods, and to group them.

Wiki software is very helpful for organizing knowledge. WikiPad is a good example of a knowledge-base tool, offering extraordinary capabilities for searching, navigating and graphing relationships between terms.
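
The realization described above, that a t-test is just a linear model with a 0/1 predictor, can be checked numerically. The sketch below uses only the Python standard library and made-up data; the helper names `pooled_t` and `slope_t` are illustrative, not taken from any package.

```python
import math
from statistics import mean, variance


def pooled_t(group0, group1):
    """Classic two-sample t statistic with a pooled (equal) variance."""
    n0, n1 = len(group0), len(group1)
    pooled_var = ((n0 - 1) * variance(group0) + (n1 - 1) * variance(group1)) / (n0 + n1 - 2)
    return (mean(group1) - mean(group0)) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))


def slope_t(x, y):
    """Slope of the least-squares line y = a + b*x and its t statistic."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


# Made-up measurements for two groups (purely illustrative)
control = [5.1, 4.9, 6.2, 5.8, 5.5]
treated = [6.4, 7.1, 6.8, 7.5, 6.0]

# Encode group membership as a dummy variable: 0 = control, 1 = treated
x = [0] * len(control) + [1] * len(treated)
y = control + treated

t_classic = pooled_t(control, treated)
slope, t_model = slope_t(x, y)

print("difference in means:", mean(treated) - mean(control))
print("regression slope:   ", slope)
print("t (two-sample test):", t_classic)
print("t (slope of model): ", t_model)
```

The slope equals the difference in group means and the two t statistics agree up to floating-point error. The same idea extends to ANOVA (several dummy variables) and ANCOVA (dummies plus a continuous covariate).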

  • Testing for normality. Gosh! I remember my confusion: why the hell are there so many (more than 10) normality tests? Do they all test the same thing(s)? No! So why are they all called “normality tests”? Do we need all of them? Does the best normality test even exist? In fact there is no universal normality test, so you should know what *exactly* you want to check for. And WHY you really want to test for normality…

    Well… yeah… why should we formally test for normality? This was the longest story and the most important question in my career. It took me a long time to investigate it in depth and answer the question for myself. It was complicated further by the fact that the greatest authorities in statistics still debate this very issue, and they cannot agree…

  • Medical terminology. Yes. You are going to be a BIOstatistician. The “bio” prefix means that, sooner or later, it will be impossible to avoid the medical terminology around you. Well, remember, this is your choice: you could have chosen econometrics, sociology, psychology or physics, but you chose biostatistics.

    The better you understand medical processes and terminology, the better you communicate with doctors/investigators, the better you “feel” the subject, the easier you see patterns and relationships and the more “independent” you are. This is the key thing!

    Of course, this doesn’t mean you should take a full course in medicine at university and get a PhD in cardiology 🙂 But if you don’t LIKE medicine (or, more generally, biology), you will quickly find your everyday work hard, dark and misty. Investigators will use specialized terminology and mental shortcuts all the time. Let me give you a piece of advice: if you find the biosciences difficult to learn, don’t even start.

  • Software. Which software is the best option for you? Should you learn and use only a single package, or perhaps use different programs? My advice is: choose one and become a specialist, then learn the basics of other packages and learn how to mix them.

    Seriously, you should explore this area thoroughly, as this will be your future workshop. SAS? R? Stata? It’s all up to you. But once your choice is made, take it seriously. Invest some time (and money) and become a specialist. Learn the basics before you take on your first tasks. You cannot waste time learning things when you’re supposed to be doing your job in a timely manner!

    Differences between statistical packages are a big source of headaches: the type of sums of squares reported (SAS users typically read Type III, while R’s anova() gives Type I), contrasts, formulas (even for quantiles!) and corrections. Not to mention different ways of handling dates, missing values and so on.

  • Real data challenge. I don’t know what it looks like in other areas of science, but bio-processes usually produce very “unpleasant” data. Forget the nicely shaped (bell-shaped) distributions. Forget the “lack of outliers”. Forget about automated methods of removing them. Forget about clean situations. Mixed distributions, skewed distributions, “suspicious observations”, “influential observations”, lots of confusion (especially in clinical diagnostics), missing observations, missing classes and unequal sample sizes are likely to become your everyday reality.

    And you will have to deal with them. Removing outliers is often a really bad idea (they often carry important information about the process or form a separate sub-group), data transformation doesn’t “cure” the shape of a distribution (and strips the data of its default, well-established meaning), normality tests fail (but is this bad?), the sample size is often small (this is not “Big Data”, where you have billions of records at your disposal), and collinearity, dependent variables and strange patterns in model residuals are waiting for you 🙂

  • Violated assumptions of statistical methods. This is the direct consequence of the previous point. You will discover bootstrap methods, robust methods (M-estimators, quantile regression and others) and the power of the Central Limit Theorem. You will learn to love mixed models over repeated-measures ANOVA (which has some strong assumptions).
  • New statistical methods. Deming regression. Survival analysis. Meta-analysis. New tests, like Tajima’s D, Tryon’s, Westlake-Schuirmann’s. TOST (for testing bioequivalence). Complex mixed models. Non-linear models, including GAM.
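
As an illustration of the bootstrap methods mentioned above for data that violate classical assumptions, here is a minimal sketch of a percentile bootstrap confidence interval for the median of a small, skewed sample. It assumes plain Python with the standard library only; the data, the function name `bootstrap_ci` and its defaults (5000 resamples, fixed seed) are made up for illustration.

```python
import random
from statistics import median


def bootstrap_ci(sample, stat=median, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic each time, then sort
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read off the approximate alpha/2 and 1 - alpha/2 percentiles
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical small, right-skewed sample with one extreme value
sample = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.8, 5.9, 12.4]

low, high = bootstrap_ci(sample)
print("sample median:", median(sample))
print("95% percentile bootstrap CI:", (low, high))
```

No normality assumption and no outlier removal are needed here. The percentile interval is only the simplest variant; with samples this small, its coverage can be imperfect, so treat it as a starting point rather than a recommendation.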

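The earlier remark that packages disagree even on quantile formulas is easy to reproduce. The sketch below uses Python's standard `statistics.quantiles`, which offers two interpolation conventions; as far as I know these correspond to R's `type = 6` ("exclusive") and R's default `type = 7` ("inclusive"), but treat that mapping as an assumption and check your own software's documentation.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# "exclusive": assumes the population may extend beyond the observed min and max
q_exclusive = quantiles(data, n=4, method="exclusive")

# "inclusive": treats the observed min and max as the 0th and 100th percentiles
q_inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive quartiles:", q_exclusive)  # [2.75, 5.5, 8.25]
print("inclusive quartiles:", q_inclusive)  # [3.25, 5.5, 7.75]
```

Same data, same "quartiles", different numbers. The median agrees, but the first and third quartiles do not, which is enough to make a report produced in one package fail validation against another.
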
Biostatistician in a CRO, Clinical Research:

  • Confirmatory analysis. Mostly. Forget exploratory analysis, where you could pick any method depending on the data and experiment freely. In the world of clinical research everything is planned a priori and written down in the Statistical Analysis Plan (SAP). You didn’t anticipate issues? You have to switch to non-parametric methods (due to violated assumptions) but they’re not described in the SAP? Sorry, you lose! Any change must result in amendments to the SAP and the Protocol, and you must have a bloody good explanation and justification for the changes. You will quickly learn to predict “bad things”.
  • SOPs. SOPs everywhere… Standard Operating Procedures regulate almost everything you (could) do and how you do it. Calculating the sample size, writing statistical programs, validating the programs, storing input data, writing the report, organizing and managing files, contacting the Sponsor: everything is regulated in detail. Even the process of writing SOPs is… covered by another SOP 🙂
  • Regulated environment. Every possible aspect of the data analysis is going to be regulated. Data storage. Analytical software (including the process of updates and configuration; you will have to set up a library of the packages/codes/scripts and tools you usually use, and it should be tested and its versions frozen). Backup policy. The process of versioning documents and tracking changes. Controlling access to objects.

    You will quickly start to think in terms of “processes”: who owns and initiates the process, how it is requested, what is to be done, who performs it, who controls it, what the product is and how it is documented (logs and trackers).

  • Validation. Full validation. Every critical part of any program must be validated by another statistician or statistical programmer. But that’s not all. Not only are the programs going to be validated, but also the whole computing environment. This is done in detailed audits.
  • Responsibility. The game stops here. This is not a game. You are playing with human lives. OK, not directly, but decisions will be made based on the results of your analysis.
  • Training. You will be trained constantly. Not only in statistical guidelines (ICH, FDA) but also in GCP (Good Clinical Practice), the prevention and detection of fraud and misconduct and, of course, SOPs.

I think these are the most important things you should be aware of.
