
Hypothesis Generation

NU author(s): Dr Hermann Moisl

Abstract

The aim of science is to understand reality. An academic discipline, the philosophy of science, is devoted to explicating the nature of science and its relationship to reality, and, perhaps predictably, both are controversial; for an excellent introduction to the issues see Chalmers (1999). In practice, however, most scientists explicitly or implicitly assume a view of scientific methodology based on the philosophy of Karl Popper (Popper 1959; Popper 1963), in which one or more non-contradictory hypotheses about some domain of interest are stated, the validity of the hypotheses is tested by observation of the domain, and the hypotheses are either confirmed (but not proven) if they are compatible with observation, or rejected if they are not.

Where do such hypotheses come from? In principle it does not matter, because the validity of the claims they make can always be assessed with reference to the observable state of the world. Any one of us, whatever our background, could wake up in the middle of the night with an utterly novel and brilliant hypothesis that, say, unifies quantum mechanics and Einsteinian relativity, but such inspiration is exceedingly rare. In practice, scientists develop hypotheses in something like the following sequence of steps: the researcher (i) selects some aspect of reality that s/he wants to understand, (ii) becomes familiar with the selected research domain by observing it and reading the associated research literature, and formulates a research question which, if convincingly answered, will enhance scientific understanding of the domain, (iii) abstracts data from the domain and draws inferences from it in the light of the research literature, and (iv) on the basis of these inferences states a hypothesis to answer the research question. The hypothesis is subsequently tested for validity with reference to the domain and amended as required.

Linguistics is a science, and as such uses, or should use, scientific methodology. The research domain is human language, and, in the process of hypothesis generation, the data comes from observation of language use. Such observation can be based on introspection, since every native speaker is an expert on the usage of his or her language. It can also be based on observation of the linguistic usage of others in either spoken or written form. In some subdisciplines, such as historical linguistics, sociolinguistics, and dialectology, the latter is in fact the only option, and this is why D'Arcy (this volume) stresses the importance of linguistic corpora in language variation research: corpora are 'the foundation of everything we do'.

Traditionally, hypothesis generation based on linguistic corpora has involved the researcher listening to or reading through a corpus, often repeatedly, noting features of interest, and then formulating a hypothesis. The advent of information technology in general, and of digital representation of text in particular, over the past few decades has made this often-onerous process much easier via a range of computational tools. But as the amount of digitally represented language available to linguists has grown, a new problem has emerged: data overload. Actual and potential language corpora are growing ever larger, and even now they can be at the limit of what the individual researcher can work through efficiently in the traditional way. Moreover, as we shall see, data abstracted from such large corpora can be impenetrable to understanding.
One approach to the problem is to deal only with corpora of tractable size, or, equivalently, with tractable subsets of large corpora, but ignoring potential data in so unprincipled a way is not scientifically respectable. The alternative is to use mathematically based computational tools for data exploration developed in the physical and social sciences, where data overload has long been a problem. It is this latter alternative that is explored here. Specifically, the discussion shows how a particular type of computational tool, cluster analysis, can be used in the formulation of hypotheses in corpus-based linguistic research. The discussion is in three main parts: the first describes data abstraction from corpora, the second outlines the principles of cluster analysis, and the third shows how the results of cluster analysis can be used in the formulation of hypotheses. Examples are based on the Newcastle Electronic Corpus of Tyneside English (NECTE), a corpus of dialect speech (Allen et al. 2007). The overall approach is introductory, and the aim has been to make the material accessible to as broad a readership as possible.
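To make the data-abstraction and clustering steps concrete, the following is a minimal sketch in Python. It is not the chapter's own code: the speaker identifiers and frequency values are invented for illustration, and it assumes hierarchical cluster analysis (here via SciPy's linkage and fcluster functions) applied to per-speaker frequency profiles of the kind that might be abstracted from a corpus such as NECTE.

```python
# Minimal sketch of hierarchical cluster analysis on corpus-derived data.
# All speaker codes and frequency values below are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data abstraction: each row is a speaker, each column the
# frequency of a selected phonetic segment in that speaker's transcript.
speakers = ["spk01", "spk02", "spk03", "spk04", "spk05", "spk06"]
profiles = np.array([
    [42, 3, 17, 8],    # spk01
    [39, 5, 15, 9],    # spk02
    [44, 2, 19, 7],    # spk03
    [6, 31, 4, 22],    # spk04
    [8, 28, 5, 25],    # spk05
    [5, 33, 3, 21],    # spk06
])

# Build a cluster tree using Ward's method on Euclidean distance;
# other linkage criteria and distance measures are equally possible.
tree = linkage(profiles, method="ward")

# Cut the tree into two clusters and list the speakers in each.
labels = fcluster(tree, t=2, criterion="maxclust")
for cluster in sorted(set(labels)):
    members = [s for s, lab in zip(speakers, labels) if lab == cluster]
    print(f"Cluster {cluster}: {members}")
```

If the two clusters turned out to separate, say, speakers from different parts of Tyneside, that grouping would itself be the germ of a hypothesis about the social or geographical conditioning of the phonetic variables, which could then be tested against the corpus in the usual way.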


Publication metadata

Author(s): Moisl HL

Editor(s): McMahon, A; Maguire, W

Publication type: Book Chapter

Publication status: Published

Book Title: Analysing Variation in English

Year: 2011

Pages: 72-92

Publisher: Cambridge University Press

Place Published: Cambridge

URL: http://dx.doi.org/10.1017/CBO9780511976360.005

DOI: 10.1017/CBO9780511976360.005


ISBN: 9780521898669

