Tools for Researching Vocabulary available in Paperback, eBook
Tools for Researching Vocabulary
- ISBN-10:
- 1783096454
- ISBN-13:
- 9781783096459
- Pub. Date:
- 11/15/2016
- Publisher:
- Multilingual Matters Ltd.
Product Details
| ISBN-13: | 9781783096459 |
|---|---|
| Publisher: | Multilingual Matters Ltd. |
| Publication date: | 11/15/2016 |
| Series: | Second Language Acquisition Series, #105 |
| Pages: | 328 |
| Product dimensions: | 6.10(w) x 16.70(h) x 0.90(d) |
| Age Range: | 18 Years |
Read an Excerpt
Tools for Researching Vocabulary
By Paul Meara, Imma Miralpeix
Multilingual Matters
Copyright © 2017 Paul Meara and Imma Miralpeix
All rights reserved.
ISBN: 978-1-78309-648-0
CHAPTER 1
V_Words v2.0 and V_Lists v1.0
Introduction
This chapter describes two simple programs that are useful for carrying out basic operations on vocabulary data. The programs are not especially innovative: to some extent they duplicate programs that already exist, some of which are briefly described at the end of the chapter. However, using our programs (V_Words and V_Lists) will help you to become familiar with the conventions used in the other programs in this book.
Part 1. V_Words v2.0
V_Words is a small utility program that turns short texts into word lists. It produces a basic count of all types and tokens in the text, in addition to alphabetical and frequency lists of the word types that the text contains.
Using V_Words
You will find the V_Words program on the Lognostics Tools website, available at: http://www.lognostics.co.uk/tools/V_Words/V_Words.htm.
(1) The V_Words workspace will open for you.
(2) V_Words may require some data preparation so that the text can accurately be turned into lists. There are also some features that you should be aware of before you use the program:
– V_Words is a small program, and it is not designed to be used with very large texts. Your text should be in simple text format and it should not be longer than 8500 words. If your text is longer than this, then you should split it into smaller sections.
– V_Words takes as a word (token) any written unit with a space on either side. Contractions (e.g. he's, you'd) will be treated as one word unless you replace them by the full forms (he is, you would ...).
– We recommend that you add hyphens whenever you want several contiguous tokens to be treated as a single word (e.g. open compounds: school-bus, phrasal verbs: get-off ...).
– The program is NOT case sensitive and therefore uppercase or lowercase letters will not interfere with the counts.
(3) Load the text that you want to work with into the V_Words workspace. You can do this by typing your text directly, or by copying and pasting a pre-prepared text into the workspace. The V_Words workspace looks like Figure 1.1.
(4) Edit your text as necessary.
(5) V_Words will ignore any punctuation in your text. If you want V_Words to ignore any other symbols, then you can add them to the punctuation list underneath the workspace.
(6) Click the Submit button when your text is ready for analysis.
(7) V_Words generates a report that looks like Figure 1.2. The report contains three columns:
– Column 1 on the left is a simple list of all the words (tokens) in the text. They are presented in the same order as they appear in the text.
– Column 2 is an alphabetically organised type count for the text.
– Column 3 is a frequency ordered type count for the text.
(8) You can save the reports by copying and pasting the relevant columns to a new document.
(9) Click the Reset button to start over with another text.
Even if your research does not involve collecting large word sets, you will find that it is worthwhile experimenting with the V_Words program before you go any further, as many of the features in this program appear in other programs in later chapters. In particular, you need to make sure that you know how to copy and paste a text to be analysed into the V_Words working space, and how to copy and save the results of an analysis back into another document. If you do not know how to do these basic operations, then you need to ask someone to show you before you go any further.
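The behaviour described in this section can be approximated in a few lines of Python. What follows is only a minimal sketch based on the description above, not the program's actual code: the function name `v_words_sketch`, the default punctuation set and the decision to delete punctuation rather than replace it with a space are all assumptions made for illustration.

```python
from collections import Counter

# Assumed default punctuation set; the real program's list may differ.
# Apostrophes and hyphens are left out so that "he's" and "school-bus"
# survive as single tokens, as the text describes.
PUNCTUATION = ".,;:!?\"()[]{}"

def v_words_sketch(text, punctuation=PUNCTUATION):
    """Return (tokens, alphabetical type counts, frequency-ordered type counts)."""
    # Not case sensitive: fold everything to lowercase.
    lowered = text.lower()
    # Ignore punctuation by deleting every listed symbol (an assumption).
    cleaned = lowered.translate(str.maketrans("", "", punctuation))
    # A token is any written unit with whitespace on either side.
    tokens = cleaned.split()
    counts = Counter(tokens)
    # Column 2: types in alphabetical order with their counts.
    alphabetical = sorted(counts.items())
    # Column 3: types by descending count, ties broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return tokens, alphabetical, by_frequency

tokens, alphabetical, by_frequency = v_words_sketch(
    "The school-bus stopped. He's late, the driver said; the bus left."
)
```

Here `tokens` corresponds to Column 1 of the report, `alphabetical` to Column 2 and `by_frequency` to Column 3. Note that the contraction and the hyphenated compound each count as one word, exactly as the data preparation notes warn.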
Part 2. V_Lists v1.0
V_Lists allows you to carry out some basic operations on word lists. Specifically, V_Lists takes two word lists as input and reports the words which occur in only one of the lists, the words which occur in both of the lists and a cumulative list that contains words that appear in either of the original lists. This output can be used to perform different types of analyses.
Using V_Lists
You will find the V_Lists program on the Lognostics Tools website, available at: http://www.lognostics.co.uk/tools/V_Lists/V_Lists.htm
(1) The V_Lists workspace will open for you.
(2) V_Lists may require some data preparation in order to produce accurate reports. There are also some features you should be aware of before using the program:
– V_Lists requires two word lists as input. The lists should be in simple text format. Each list will consist of a set of words, one word to a line. The easiest way to prepare a list of this sort is to use V_Words, the program described in the previous section. The total number of words in the lists to be compared should not exceed 4000.
– The V_Lists data input page looks like Figure 1.3. It contains two data slots. You can input data to the program directly by typing your word lists into the V_Lists data page, or you can copy and paste data from other sources if your lists are too long to type. Check your lists carefully to make sure that they do not contain inconsistent spellings.
– Any punctuation marks may need to be removed from your list items. You will also need to make sure that you have not accidentally left blank spaces at the end of lines. The easiest way to do this is to use the search and replace function in the program you used to prepare the lists.
– V_Lists treats as a word any unit or units written on the same line. For this reason, make sure that what you want to count as one word is always written on the same line and that spelling is consistent in both input lists (e.g. well-being, wellbeing and well being would each be counted as different words). It is also recommended that contractions are replaced by full forms (e.g. he's by he is) and that each word is written on a separate line.
– V_Lists is NOT case sensitive. This means that HAPPY, Happy and happy will all be treated as a single word type.
(3) Edit your lists as necessary.
(4) Click the Submit button.
(5) V_Lists generates a report that looks like Figure 1.4. The report contains four columns:
– Column 1 lists the word types that appear in list 1 but not in list 2.
– Column 2 lists the word types that appear in list 2 but not in list 1.
– Column 3 lists the word types that occur in both list 1 and in list 2.
– Column 4 is a cumulative word list that contains the word types occurring either in list 1 or in list 2.
(6) You can save the reports by copying and pasting the relevant columns into a new document.
(7) Click the Reset button to start another analysis.
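The four columns of the report correspond to ordinary set operations, which the following sketch makes explicit. It is an illustration written from the description above, not the program's own code: the function name `v_lists_sketch` and the dictionary keys are hypothetical, and the sketch strips trailing spaces automatically, whereas the text says you should remove them yourself before submitting your lists.

```python
def v_lists_sketch(list1, list2):
    """Compare two word lists (one item per line) without regard to case."""
    def normalise(lines):
        # Strip stray spaces at line ends, drop blank lines, fold case.
        return {line.strip().lower() for line in lines if line.strip()}

    types1, types2 = normalise(list1), normalise(list2)
    return {
        "only_in_list1": sorted(types1 - types2),   # Column 1
        "only_in_list2": sorted(types2 - types1),   # Column 2
        "in_both": sorted(types1 & types2),         # Column 3
        "cumulative": sorted(types1 | types2),      # Column 4
    }

report = v_lists_sketch(
    ["HAPPY", "well-being", "school "],
    ["happy", "wellbeing", "teacher"],
)
```

In this example HAPPY and happy fall together as one shared type, while well-being and wellbeing are reported as different words, which is why consistent spelling across the two lists matters.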
Reflections on Morgan (1926)
We chose this article for inclusion in this chapter because it is typical of an important strand of vocabulary research that appeared in the 1920s. Surprisingly, much of this work seems to have been forgotten in later years, and had to be reinvented in the 1990s. One of the reasons for this must have been that research of this kind is very labour intensive without computers. Morgan's paper involved carrying out type counts and token counts on a longish text, and then comparing these counts to counts derived from another set of texts. All of this work would have been carried out by hand. It is sobering to realise how very much easier work of this sort becomes once you have access to a suitable set of computer programs that will do the donkey work for you. Replicating Morgan's study would be very easy today. The most time-consuming part of the work would be to generate a computer-readable version of the original texts, but with modern optical character readers even this would not be a big job. The rest of the work could be completed in a few hours at most.
In this article Morgan analyses the vocabulary of a book that students of Spanish as a foreign language had to read at high school. The book, which is entitled Las Confesiones de un Pequeño Filósofo (Confessions of a Little Philosopher), was written in 1904 by Azorín, a well-known Spanish writer. It is composed of very short chapters and in its pages the author describes his infancy in his hometown from a child's point of view.
The analysis Morgan performs consists of determining the number of tokens, types and their frequency of occurrence in the book. This work could easily have been carried out by a program like V_Words. She also compared types occurring in the text with the types occurring in Spanish word lists and grammars of that time (a grammar here means a textbook designed to teach Spanish to speakers of English – this paper was written long before modern communicative approaches to language teaching were developed). These more complex analyses involving word-list comparisons could easily have been performed using V_Lists. Among the results she reports, there are three that deserve close attention, especially because they have some resonance with more modern research on vocabulary acquisition.
The first finding that Morgan considers worthy of notice is the high percentage of words in the book (82% of the word types) that occur just four times or fewer. She considers this a very low rate of recurrence for a text containing about 20,000 word tokens. This observation would probably not surprise modern researchers. Anyone who has worked with corpora will be familiar with an idea developed by George Zipf (1935), which later became known as Zipf's Law. The law implies that in any large text, the number of times a word appears is inversely proportional to its rank in the frequency table. That is, the most common word occurs approximately twice as often as the second most common word, three times as often as the third most common word, four times as often as the fourth most common word, and so on. One consequence of this relationship is that a few highly frequent words will generally make up a very large proportion of any normal text. Another consequence is that any normal text will contain a very large number of word types that occur only once. These generalisations are very reliable for long texts, but not so reliable for short texts, although, as Morgan shows, in broad terms they still apply to short texts too. Zipf's work has seen something of a revival in recent years, notably in work by Edwards and Collins, who used Zipf's Law to model the outputs made by L2 learners (Edwards & Collins, 2011, 2013). We have also made use of Zipf's work in some of the programs that appear later in this book, particularly in the V_Size program presented in Chapter 7.
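A figure like Morgan's 82% is straightforward to compute once you have a token list, and the idealised Zipf relationship is equally simple to state. The sketch below is illustrative only: the function names `share_of_rare_types` and `zipf_expected` are hypothetical, and the sample data is invented rather than taken from Morgan's text.

```python
from collections import Counter

def share_of_rare_types(tokens, max_count=4):
    """Proportion of word types that occur max_count times or fewer."""
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)

def zipf_expected(top_frequency, rank):
    """Idealised frequency of the word at a given rank: f(1) / rank."""
    return top_frequency / rank

# Invented toy sample: 6 types, 14 tokens.
sample = ["the"] * 6 + ["of"] * 3 + ["and"] * 2 + ["river", "stone", "cloud"]
rare_share = share_of_rare_types(sample)            # 5 of the 6 types
expected = [zipf_expected(6, r) for r in (1, 2, 3)]  # frequencies at ranks 1-3
```

Real texts only approximate the idealised curve, so `zipf_expected` should be read as a rough guide to the shape of a frequency list rather than as a prediction for any particular word.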
A second finding in Morgan's paper that needs to be considered is the fact that the Spanish grammars analysed include the words most frequently used in the book by Azorín (79% of the vocabulary is shared). This suggests that the textbooks of the time accurately included the most useful words that learners needed when reading real materials (i.e. the most frequent words in real written texts). This finding does not necessarily coincide with more recent research on the lexis in textbooks for language learners. For example, an analysis of the vocabulary in textbooks for beginner learners of German in the UK (1990s–2006) suggests that they include very specific vocabulary in a limited number of semantic fields, neglecting words that are essential for independent language use (Häcker, 2008). Similarly, Davidson et al. (2008) claim that standard frequency counts do not provide a good account of the words that are likely to appear in textbooks for learners of Dutch. Other studies examining EFL coursebooks report findings along the same lines: Jiménez Catalán and Mancebo (2008) analyse four popular textbooks in Spain, two in primary and two in secondary education, and show that even those from the same educational stages contain different numbers of words and a high proportion of vocabulary which is not shared, highlighting a lack of systematic criteria in vocabulary selection. In another study with EFL books, O'Loughlin (2012) explores in detail three levels of a coursebook series and concludes that intermediate-level EFL learners using these textbooks will have encountered only about 1500 English words, which will not allow learners to function at an appropriate level.
The third finding to comment on is Morgan's estimate of the amount of vocabulary unknown to the students that appears on each page of the reading text (about 24 words). Her view is that this is an unreasonably high vocabulary load for learners at this level, and that the syllabus is probably underestimating the difficulty of the vocabulary that students will be confronted with when reading the book. More recent research would agree with this view: Morgan's analysis suggests that the number of words in the text amounts to 19,647 tokens across 78 pages, which is about 250 words per page. If the number of unknown words per page amounts to 24, this would in turn mean that students would be familiar with about 90% of the words on a page, but one word in every 10 would be a new, unfamiliar one. Recent studies on vocabulary and reading (e.g. Nation, 2001) argue that 90% text coverage is not sufficient for text comprehension, and that 95% text coverage (i.e. one unknown word in every 20 words) might allow for intensive reading. However, for extensive reading to be effective and enjoyable, coverage of about 98% is needed – i.e. only one unknown word in every 50 running words. Morgan also points out that the large number of similar words in English and Spanish (cognate words) might favour comprehension and facilitate reading. However, recent research suggests that learners – especially beginner and intermediate learners – do not always recognise cognates when they meet them in texts (Dressler et al., 2011; Hall, 2002; Tonzar et al., 2009).
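The arithmetic behind these coverage figures can be checked in a few lines. The token, page and unknown-word counts below are the ones quoted above; the variable names and the helper `unknown_interval` are introduced here purely for illustration.

```python
TOKENS = 19_647          # running words in the text, from Morgan's count
PAGES = 78
UNKNOWN_PER_PAGE = 24    # Morgan's estimate of unknown words per page

words_per_page = TOKENS / PAGES                      # roughly 250
coverage = 1 - UNKNOWN_PER_PAGE / words_per_page     # roughly 0.90

def unknown_interval(coverage_rate):
    """Running words per unknown word at a given coverage rate."""
    return round(1 / (1 - coverage_rate))

for rate in (0.90, 0.95, 0.98):
    print(f"{rate:.0%} coverage: one unknown word in every {unknown_interval(rate)}")
```

The loop reproduces the three thresholds discussed in the text: one unknown word in every 10, 20 and 50 running words respectively.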
As we have seen, Morgan's paper is concerned with analysing the written inputs (in books) that students are exposed to. However, it would also be possible to use V_Words and V_Lists to analyse aural inputs and outputs generated by language learners. Regarding aural input, there is a line of research investigating the effects of listening to radio broadcasts in the target language on learners' interlanguage, a topic initially explored in Meara (1993) with a detailed lexical analysis of a large number of individual BBC broadcasts. Similarly, aural input in language or CLIL classes can be analysed by making initial use of the same methods and tools (see, for instance, Meara et al., 1997, studying classrooms as lexical environments). Regarding the analysis of outputs, one of the main uses for a simple program like V_Words in modern research would be to analyse word association data (cf., for example, Fitzpatrick, 2007; Meara, 1978). If you have a large study, say 500 subjects generating a single response to a set of 100 stimulus words, then you will have 50,000 data points, which is a huge amount of data to sort and evaluate by hand. However, if you sorted the data into 100 sets of 500 responses – one set for each of the stimulus words – then V_Words would be able to analyse the data in a very short space of time, listing the responses by frequency. Likewise, if you asked a large group of participants to give you 10 words that might be useful in a given context, then a program like V_Words could easily sort out the resulting responses by frequency. This kind of data is common in research on 'lexical availability'. Research of this type was commonly used to build basic vocabularies for foreign language teaching in the 1960s (e.g. Gougenheim et al., 1964) and it is now seeing something of a revival – see, for instance, Morris (2010/2011) for Welsh, and Jiménez Catalán's (2014) collection of papers dealing with lexical availability research in Spanish. 
This approach also seems to be a useful technique for developing special purposes vocabulary. V_Lists is also particularly helpful if you want to compare the words used by two groups of learners on a single task (e.g. Jiménez Catalán & Ojeda, 2005), or if you want to compare a set of texts generated by a single student over a long period of time (cf. Bell, 2009).
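Sorting word association responses into one set per stimulus word, and then ranking each set by frequency, might look something like the sketch below. The function name `responses_by_stimulus` is hypothetical and the records are invented for illustration; they are not data from any of the studies cited.

```python
from collections import Counter, defaultdict

def responses_by_stimulus(records):
    """Group (stimulus, response) pairs and count responses per stimulus."""
    grouped = defaultdict(Counter)
    for stimulus, response in records:
        # Fold case and trim spaces so that Chair and chair count together.
        grouped[stimulus.lower()][response.strip().lower()] += 1
    return grouped

# Invented illustrative data, not taken from any study.
records = [
    ("table", "chair"), ("table", "Chair"), ("table", "food"),
    ("black", "white"), ("black", "night"), ("black", "white"),
]
grouped = responses_by_stimulus(records)
table_ranking = grouped["table"].most_common()   # responses to "table" by frequency
```

With 500 subjects and 100 stimulus words, the same function would produce 100 frequency lists of up to 500 responses each, which is the kind of sorting that is laborious to do by hand.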
If you want to perform analyses that are more sophisticated than the simple analyses that V_Words and V_Lists can handle, then you will find on the internet a number of other programs that are worth looking at. Frequency (Heatley et al., 2002), Voyant Tools (Sinclair et al., 2012), AntConc v3.4.3 (Anthony, 2014) or Wordsmith Tools v7 (Scott, 2016) can extract basic characteristics from texts and analyse them in multiple ways (e.g. making frequency lists of the words they contain). Regarding word-list comparisons, programs like Range for texts v3 by Tom Cobb allow the uploading of different texts or word lists in order to find word frequencies in the combined collection, the range of each word across texts, etc. Range (Heatley et al., 2002) also compares texts with three frequency lists, but in this case it shows how much coverage of a text each list provides (i.e. it gives the 'lexical frequency profile' of a text).
(Continues...)
Excerpted from Tools for Researching Vocabulary by Paul Meara, Imma Miralpeix. Copyright © 2017 Paul Meara and Imma Miralpeix. Excerpted by permission of Multilingual Matters.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Table of Contents
Acknowledgements
Introduction
Section 1: Processing Vocabulary Data
Chapter 1. V_Words and V_Lists
Section 2: Measuring Lexical Variation, Sophistication and Originality
Chapter 2. D_Tools
Chapter 3. P_Lex
Chapter 4. Lexical Signatures
Chapter 5. V_Unique
Section 3: Estimating Vocabulary Size
Chapter 6. V_YesNo
Chapter 7. V_Size
Chapter 8. V_Capture
Section 4: Measuring Lexical Access
Chapter 9. Q_Lex
Section 5: Assessing Aptitude for L2 Vocabulary Learning
Chapter 10. LLAMA_B
Section 6: Modelling Vocabulary Growth
Chapter 11. Mezzofanti
Envoi
Index