Example: Extracting Substrings

Extracting or replacing parts of a string is quite straightforward but requires somewhat more typing than, for example, in Python. The main function you will use is the substring function.

The substring function

Let us extract parts of a given string starting with the second and ending with the seventh character.

s <- c("Hello World")
substring(s, 2, 7)
## [1] "ello W"

The second argument of the substring function defines the starting position and the third the ending position of the extraction. Positions are counted from 1, and both ends are included.
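The introduction mentions replacing as well as extracting. As a minimal sketch of the replacement form of substring, the following overwrites characters 7 to 11 of a copy of the string; the variable name s_new is only chosen for illustration.

```r
# Work on a copy so the original example string stays untouched.
s_new <- "Hello World"

# Assigning to substring() overwrites the given character range in place.
substring(s_new, 7, 11) <- "Earth"
s_new
## [1] "Hello Earth"
```

Note that this form never changes the length of the string: it only overwrites existing characters, so it suits same-length replacements best.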

If you apply the function to a character vector, it is applied to each element of the vector.

v <- c("Hello Earth", "Hi Pluto")
substring(v, 2, 7)
## [1] "ello E" "i Plut"
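The start and end positions can be vectors as well. As a small sketch (the variable name windows is an assumption made for this example), passing several positions extracts several pieces from a single string:

```r
# first = 1:3 and last = 3:5 are paired element by element,
# giving three overlapping three-character pieces of the same string.
windows <- substring("Hello World", 1:3, 3:5)
windows
## [1] "Hel" "ell" "llo"
```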

A more general task might be to extract the second part of a character string of the form "some_string some_other_string", i.e. two strings separated by a blank or some other specific character. For the example above, with the positions worked out by hand, this means the following:

substring(v[1], 7, 11)
## [1] "Earth"
substring(v[2], 4, 8)
## [1] "Pluto"

The regexpr function

It is also possible, and often more convenient, to use the powerful regexpr function, which looks for a pattern in a character string and returns the position of the first match.

regexpr(" ", v[1])
## [1] 6
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

The first line gives the position of the match within the character string. The remaining lines are attributes with additional information, such as the length of the match. Of course, this can also be applied to an entire vector:

regexpr(" ", v)
## [1] 6 3
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
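One case the examples so far do not cover is a string without any blank. The sketch below assumes a third element "Mars" that is not part of the original vector, strips the attributes with as.integer, and uses fixed = TRUE so that the pattern is treated as a literal character rather than a regular expression.

```r
# Hypothetical vector with one element that contains no blank.
w <- c("Hello Earth", "Hi Pluto", "Mars")

# fixed = TRUE matches the blank literally; as.integer() drops the attributes.
pos <- as.integer(regexpr(" ", w, fixed = TRUE))
pos
## [1]  6  3 -1
```

A value of -1 signals that no match was found, which is worth checking before using the result as a starting position.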

If you compare the result with the vector elements, then 6 and 3 give the position of the blank within each of the two strings. Since we do not want the extracted second word to include the leading blank, we have to increase the starting position by one.

regexpr(" ", v) + 1
## [1] 7 4
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

Hence, the problem of the starting position is solved. As ending position, we use the result of the nchar function which returns the number of characters in a string.

nchar(v)
## [1] 11  8

In summary, this is one possible solution:

substring(v, regexpr(" ", v)+1, nchar(v))
## [1] "Earth" "Pluto"
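The same steps can be wrapped in a small function. This is only a sketch under a few assumptions: the name after_sep and its sep argument are made up for this example, the separator is matched literally via fixed = TRUE, and elements without a separator are returned as NA. It also relies on the default value of the last argument of substring, which is large enough to reach the end of any ordinary string, so nchar is not strictly needed.

```r
# Hypothetical helper: return the text after the first occurrence of sep.
after_sep <- function(x, sep = " ") {
  # Position of the first literal match, -1 where sep does not occur.
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  # Start right after the separator; last defaults to the end of the string.
  out <- substring(x, pos + nchar(sep))
  # Mark elements without a separator as missing instead of returning them whole.
  out[pos < 0] <- NA_character_
  out
}

after_sep(c("Hello Earth", "Hi Pluto"))
## [1] "Earth" "Pluto"
after_sep("key=value", sep = "=")
## [1] "value"
is.na(after_sep("Mars"))
## [1] TRUE
```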

For more information on the regexpr and similar functions, have a look at the help page (i.e. ?regexpr).

For an overview of regular expressions, for example patterns that match any digit or any alphabetic character, see ?regex.
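As a brief illustration of such patterns, the following sketch uses the POSIX character classes [[:digit:]] and [[:alpha:]] described in ?regex. The input strings and variable names are made up for this example.

```r
# Position of the first digit in each element.
digit_pos <- as.integer(regexpr("[[:digit:]]", c("abc123", "x9")))
digit_pos
## [1] 4 2

# Position of the first alphabetic character.
alpha_pos <- as.integer(regexpr("[[:alpha:]]", "123abc"))
alpha_pos
## [1] 4
```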

To get an idea of the full power of regular expressions for text mining and manipulation, see www.regular-expressions.info.
