In this post I’ll describe a data manipulation problem in R that I think might be useful for those working in genetics and related fields.

Motivation

Some days ago I received an email from a student at the University of Buenos Aires, Argentina, asking for help with a problem in R. Although I usually cannot answer this type of email (mostly for lack of time), her question was interesting enough to be worth a blog post.

So here’s the problem. Imagine that you are working with a file containing allele copies for several individuals. Specifically, suppose you have a toy matrix in which each row is an individual and each element holds two allele copies joined by an underscore ‘_’, but you need to transform the data so that each allele copy appears in a separate column.

That is, the transformation splits each element of the matrix to produce an unfolded matrix with the same number of rows (i.e. individuals) but twice the number of columns.
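
To make the shapes concrete, here is a small sketch with made-up data. The matrix `geno`, its row and column names, and the allele values are all hypothetical and are not the student's actual file. The second matrix, `target`, is written out by hand to show what the unfolded result should look like.

```r
# Hypothetical toy data: 3 individuals (rows) x 2 loci (columns).
# Each element holds two allele copies joined by an underscore.
geno <- matrix(
  c("A_A", "A_G",
    "C_T", "T_T",
    "G_G", "A_C"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), c("loc1", "loc2"))
)

# Desired result, typed by hand: same rows, one column per allele copy.
target <- matrix(
  c("A", "A", "A", "G",
    "C", "T", "T", "T",
    "G", "G", "A", "C"),
  nrow = 3, byrow = TRUE,
  dimnames = list(rownames(geno),
                  c("loc1.1", "loc1.2", "loc2.1", "loc2.2"))
)

dim(geno)    # rows and columns of the original matrix
dim(target)  # same number of rows, twice the number of columns
```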

Solution

Here’s one possible solution to the previous problem. The idea is to take each row of the matrix, unfold the alleles (i.e. split them at the underscore symbol ‘_’), and then store the output in a new matrix with twice as many columns as the original data.
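
The row-by-row idea can be sketched in base R as follows. This is a minimal illustration under some assumptions: the matrix `geno` and its names are made up, the helper `unfold_row` is a name I chose for this example, and every element is assumed to contain exactly one underscore (otherwise the rows would unfold to different lengths and `apply()` would not return a matrix).

```r
# Hypothetical toy data: 3 individuals (rows) x 2 loci (columns).
geno <- matrix(
  c("A_A", "A_G",
    "C_T", "T_T",
    "G_G", "A_C"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), c("loc1", "loc2"))
)

# Unfold one row: split every element at "_" and flatten the pieces
# into a single character vector (two entries per original element).
unfold_row <- function(x) unlist(strsplit(x, "_", fixed = TRUE))

# apply() over rows returns one *column* per row, so transpose the result
# to get individuals back in the rows.
unfolded <- t(apply(geno, 1, unfold_row))

# Label the new columns as <locus>.1 and <locus>.2 (a naming choice made here).
colnames(unfolded) <- paste(rep(colnames(geno), each = 2), 1:2, sep = ".")

unfolded
```

Using `fixed = TRUE` makes `strsplit()` treat the underscore as a literal character rather than a regular expression, which is a little faster and avoids surprises if the separator is ever changed to a character with a special meaning.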