Very often the first thing one needs to do before any data science project is to clean up the raw data and transform it into a format that is readily understood and easy to use for all downstream analysis. This process usually involves:
,
or .
, or ;
from numbers etcIn addition, you often want to filter rows according to vertain criteria: for example, selecting only women of age more than 45.
Here, we will learn how to
In order to do these manipulations easily, we will rely on a library which has a lot of functions to easily manipulate tables: dplyr. You can download a nice cheatsheet here. This library is part of a large eco-system of data analysis called tidyverse.
Let's load the library
library(tidyverse)
We read again our dataset:
dat = read.delim("https://tinyurl.com/ex9hxvvr", stringsAsFactors = FALSE)
Lets do some clean up of our own diabetes data!
As it is, our data frame has no row names (check by running rownames(dat)
); however, it might be interesting to have row names.
IMPORTANT: for data frames, all row names must be distinct!! This can be a problem if you are working with genes, which have sometimes ambiguous names.
Here, we will assign the column id
as row names:
## Assign the id column as row names
dat.clean = dat %>% column_to_rownames(c("id"))
You noticed how we used the syntax dat %>% column_to_rownames(c('id'))
; what does it mean?
The symbol %>%
is a so called pipe, and as the name indicates, it "pipes" the content on the left hand side to the function on the right hand side.
Here, the content of the variable dat
is transfered to the function column_to_rownames(c('id'))
which then applies the operation (here, using column id
as rownames, and then removing this column) to the content of the data frame.
Note that we also assigned the result of this whole operation to a new variable data.clean
.
We could extend the pipe principle and chain several commands!
Very often, we want to add new columns to an existing data frame! For example, we could add a column names ageMonth
in which we compute the age in month instead of year.
This can be done with the function mutate()
from the dplyr
package:
dat.clean = dat.clean %>% mutate(ageMonth = age * 12)
head(dat.clean)
chol stab.glu hdl ratio glyhb location age gender height weight frame
1 203 82 56 3.6 4.31 Buckingham 46 female 62 121 medium
2 165 97 24 6.9 4.44 Buckingham 29 female 64 218 large
3 228 92 37 6.2 4.64 Buckingham 58 female 61 256 large
4 78 93 12 6.5 4.63 Buckingham 67 male 67 119 large
5 249 90 28 8.9 7.72 Buckingham 64 male 68 183 medium
6 248 94 69 3.6 4.81 Buckingham 34 male 71 190 large
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn ageMonth
1 118 59 NA NA 29 38 720 552
2 112 68 NA NA 46 48 360 348
3 190 92 185 92 49 57 180 696
4 110 50 NA NA 33 38 480 804
5 138 80 NA NA 44 41 300 768
6 132 86 NA NA 36 42 195 408
See what changed? We could also have more complex formula in the mutate()
function, involving, for example, several columns!
As the newly created column is not really informative, we can remove it with the select()
function. This function selects some specific columns. If we however append the "-" sign, it will select all column BUT the ones specified!
dat.clean = dat.clean %>% select(-ageMonth)
Now, we want to reorder the columns so that all numerical columns come first, and the other columns after:
The relocate
function allows to put a column to a certain position: try the following:
head(dat.clean)
chol stab.glu hdl ratio glyhb location age gender height weight frame
1 203 82 56 3.6 4.31 Buckingham 46 female 62 121 medium
2 165 97 24 6.9 4.44 Buckingham 29 female 64 218 large
3 228 92 37 6.2 4.64 Buckingham 58 female 61 256 large
4 78 93 12 6.5 4.63 Buckingham 67 male 67 119 large
5 249 90 28 8.9 7.72 Buckingham 64 male 68 183 medium
6 248 94 69 3.6 4.81 Buckingham 34 male 71 190 large
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 118 59 NA NA 29 38 720
2 112 68 NA NA 46 48 360
3 190 92 185 92 49 57 180
4 110 50 NA NA 33 38 480
5 138 80 NA NA 44 41 300
6 132 86 NA NA 36 42 195
Notice the order of the columns! Now:
dat.clean %>% relocate(gender)
gender chol stab.glu hdl ratio glyhb location age height weight frame
1 female 203 82 56 3.6 4.31 Buckingham 46 62 121 medium
2 female 165 97 24 6.9 4.44 Buckingham 29 64 218 large
3 female 228 92 37 6.2 4.64 Buckingham 58 61 256 large
4 male 78 93 12 6.5 4.63 Buckingham 67 67 119 large
5 male 249 90 28 8.9 7.72 Buckingham 64 68 183 medium
6 male 248 94 69 3.6 4.81 Buckingham 34 71 190 large
7 male 195 92 41 4.8 4.84 Buckingham 30 69 191 medium
8 male 227 75 44 5.2 3.94 Buckingham 37 59 170 medium
9 male 177 87 49 3.6 4.84 Buckingham 45 69 166 large
10 female 263 89 40 6.6 5.78 Buckingham 55 63 202 small
11 female 242 82 54 4.5 4.77 Louisa 60 65 156 medium
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 118 59 NA NA 29 38 720
2 112 68 NA NA 46 48 360
3 190 92 185 92 49 57 180
4 110 50 NA NA 33 38 480
5 138 80 NA NA 44 41 300
6 132 86 NA NA 36 42 195
7 161 112 161 112 46 49 720
8 NA NA NA NA 34 39 1020
9 160 80 128 86 34 40 300
10 108 72 NA NA 45 50 240
11 130 90 130 90 39 45 300
[ reached 'max' / getOption("max.print") -- omitted 392 rows ]
Notice the difference? You can also position a column to a specific position, before or after a specified column:
dat.clean %>% relocate(gender, .before = age)
chol stab.glu hdl ratio glyhb location gender age height weight frame
1 203 82 56 3.6 4.31 Buckingham female 46 62 121 medium
2 165 97 24 6.9 4.44 Buckingham female 29 64 218 large
3 228 92 37 6.2 4.64 Buckingham female 58 61 256 large
4 78 93 12 6.5 4.63 Buckingham male 67 67 119 large
5 249 90 28 8.9 7.72 Buckingham male 64 68 183 medium
6 248 94 69 3.6 4.81 Buckingham male 34 71 190 large
7 195 92 41 4.8 4.84 Buckingham male 30 69 191 medium
8 227 75 44 5.2 3.94 Buckingham male 37 59 170 medium
9 177 87 49 3.6 4.84 Buckingham male 45 69 166 large
10 263 89 40 6.6 5.78 Buckingham female 55 63 202 small
11 242 82 54 4.5 4.77 Louisa female 60 65 156 medium
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 118 59 NA NA 29 38 720
2 112 68 NA NA 46 48 360
3 190 92 185 92 49 57 180
4 110 50 NA NA 33 38 480
5 138 80 NA NA 44 41 300
6 132 86 NA NA 36 42 195
7 161 112 161 112 46 49 720
8 NA NA NA NA 34 39 1020
9 160 80 128 86 34 40 300
10 108 72 NA NA 45 50 240
11 130 90 130 90 39 45 300
[ reached 'max' / getOption("max.print") -- omitted 392 rows ]
Here, we would like to place all numeric columns first, and the columns containing string or other types of variables after:
dat.clean = dat.clean %>% relocate(where(is.numeric))
head(dat.clean)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d bp.2s bp.2d waist
1 203 82 56 3.6 4.31 46 62 121 118 59 NA NA 29
2 165 97 24 6.9 4.44 29 64 218 112 68 NA NA 46
3 228 92 37 6.2 4.64 58 61 256 190 92 185 92 49
4 78 93 12 6.5 4.63 67 67 119 110 50 NA NA 33
5 249 90 28 8.9 7.72 64 68 183 138 80 NA NA 44
6 248 94 69 3.6 4.81 34 71 190 132 86 NA NA 36
hip time.ppn location gender frame
1 38 720 Buckingham female medium
2 48 360 Buckingham female large
3 57 180 Buckingham female large
4 38 480 Buckingham male large
5 41 300 Buckingham male medium
6 42 195 Buckingham male large
try to place all string variables first; use the
is.character
function!
dat.clean %>% relocate(where(is.character))
location gender frame chol stab.glu hdl ratio glyhb age height weight
1 Buckingham female medium 203 82 56 3.6 4.31 46 62 121
2 Buckingham female large 165 97 24 6.9 4.44 29 64 218
3 Buckingham female large 228 92 37 6.2 4.64 58 61 256
4 Buckingham male large 78 93 12 6.5 4.63 67 67 119
5 Buckingham male medium 249 90 28 8.9 7.72 64 68 183
6 Buckingham male large 248 94 69 3.6 4.81 34 71 190
7 Buckingham male medium 195 92 41 4.8 4.84 30 69 191
8 Buckingham male medium 227 75 44 5.2 3.94 37 59 170
9 Buckingham male large 177 87 49 3.6 4.84 45 69 166
10 Buckingham female small 263 89 40 6.6 5.78 55 63 202
11 Louisa female medium 242 82 54 4.5 4.77 60 65 156
bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 118 59 NA NA 29 38 720
2 112 68 NA NA 46 48 360
3 190 92 185 92 49 57 180
4 110 50 NA NA 33 38 480
5 138 80 NA NA 44 41 300
6 132 86 NA NA 36 42 195
7 161 112 161 112 46 49 720
8 NA NA NA NA 34 39 1020
9 160 80 128 86 34 40 300
10 108 72 NA NA 45 50 240
11 130 90 130 90 39 45 300
[ reached 'max' / getOption("max.print") -- omitted 392 rows ]
Now lets look at our cleaned data
summary(dat.clean)
chol stab.glu hdl ratio
Min. : 78.0 Min. : 48.0 Min. : 12.00 Min. : 1.500
1st Qu.:179.0 1st Qu.: 81.0 1st Qu.: 38.00 1st Qu.: 3.200
Median :204.0 Median : 89.0 Median : 46.00 Median : 4.200
Mean :207.8 Mean :106.7 Mean : 50.45 Mean : 4.522
3rd Qu.:230.0 3rd Qu.:106.0 3rd Qu.: 59.00 3rd Qu.: 5.400
Max. :443.0 Max. :385.0 Max. :120.00 Max. :19.300
NA's :1 NA's :1 NA's :1
glyhb age height weight
Min. : 2.68 Min. :19.00 Min. :52.00 Min. : 99.0
1st Qu.: 4.38 1st Qu.:34.00 1st Qu.:63.00 1st Qu.:151.0
Median : 4.84 Median :45.00 Median :66.00 Median :172.5
Mean : 5.59 Mean :46.85 Mean :66.02 Mean :177.6
3rd Qu.: 5.60 3rd Qu.:60.00 3rd Qu.:69.00 3rd Qu.:200.0
Max. :16.11 Max. :92.00 Max. :76.00 Max. :325.0
NA's :13 NA's :5 NA's :1
bp.1s bp.1d bp.2s bp.2d
Min. : 90.0 Min. : 48.00 Min. :110.0 Min. : 60.00
1st Qu.:121.2 1st Qu.: 75.00 1st Qu.:138.0 1st Qu.: 84.00
Median :136.0 Median : 82.00 Median :149.0 Median : 92.00
Mean :136.9 Mean : 83.32 Mean :152.4 Mean : 92.52
3rd Qu.:146.8 3rd Qu.: 90.00 3rd Qu.:161.0 3rd Qu.:100.00
Max. :250.0 Max. :124.00 Max. :238.0 Max. :124.00
NA's :5 NA's :5 NA's :262 NA's :262
waist hip time.ppn location
Min. :26.0 Min. :30.00 Min. : 5.0 Length:403
1st Qu.:33.0 1st Qu.:39.00 1st Qu.: 90.0 Class :character
Median :37.0 Median :42.00 Median : 240.0 Mode :character
Mean :37.9 Mean :43.04 Mean : 341.2
3rd Qu.:41.0 3rd Qu.:46.00 3rd Qu.: 517.5
Max. :56.0 Max. :64.00 Max. :1560.0
NA's :2 NA's :2 NA's :3
gender frame
Length:403 Length:403
Class :character Class :character
Mode :character Mode :character
The ordering and selection of columns looks right, however it seems that there are certain rows that have missing values (note which columns seem problematic!). We will now deal with the missing values.
Some columns and rows contain missing values, which are encoded by NA
in the data frame. Missing values, especially if there are many can be an issue, as it will bias the results.
Dealing with missing values is a chapter in itself! Basically, there can be two strategies:
imputing missing value: this means that we try to make a "best guess" of what the missing value might be. For example, if the weight is missing for one patient, you could replace it with the average weight of the other patients. Of course, there are more sophisticated methods....
removing missing values: you could remove all patients (i.e. rows) are have
Let's implement some of these strategies:
dat.nona = dat.clean %>% drop_na()
dim(dat.clean)
[1] 403 18
dim(dat.nona)
[1] 130 18
Do you understand why we reduced so dramatically our dataset? Could there be an alternative approach?
Two of the columns have very high number of missing values bp.2s
and bp.2d
; we should remove these columns first and then remove rows with missing values:
dat.clean = dat.clean %>% select(-c("bp.2s", "bp.2d"))
head(dat.clean)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 203 82 56 3.6 4.31 46 62 121 118 59 29 38
2 165 97 24 6.9 4.44 29 64 218 112 68 46 48
3 228 92 37 6.2 4.64 58 61 256 190 92 49 57
4 78 93 12 6.5 4.63 67 67 119 110 50 33 38
5 249 90 28 8.9 7.72 64 68 183 138 80 44 41
6 248 94 69 3.6 4.81 34 71 190 132 86 36 42
time.ppn location gender frame
1 720 Buckingham female medium
2 360 Buckingham female large
3 180 Buckingham female large
4 480 Buckingham male large
5 300 Buckingham male medium
6 195 Buckingham male large
We can now remove the rows with missing values:
dat.nona = dat.clean %>% drop_na()
dim(dat.nona)
[1] 366 16
See? We have lost much less rows (or patients) here...
Let's continue with the cleaned matrix obtained with strategy 2:
Factors is a data type besides numeric
, characters
(or strings) boolean
. It is very similar to the string type, but introduces the notion of levels, which indicates which categories are represented. Let's see an example:
head(dat.nona$location)
[1] "Buckingham" "Buckingham" "Buckingham" "Buckingham" "Buckingham"
[6] "Buckingham"
These are strings. We cannot see easily however, how many different locations are represented in the column location
. Let's convert this column into factors:
dat.nona$location = factor(dat.nona$location)
head(dat.nona$location)
[1] Buckingham Buckingham Buckingham Buckingham Buckingham Buckingham
Levels: Buckingham Louisa
See the difference?
Convert the
gender
andframe
columns into factors!
dat.nona$gender = factor(dat.nona$gender) # Making data nominal
dat.nona$frame = factor(dat.nona$frame, levels = c("small", "medium", "large")) # Making data ordinal
Notice that in the last command, we also indicated in which order the levels should be considered. Since we have ordinal data here (there is a clear order between small/medium/large), we should indicate this order here. Otherwise, the levels are ordered by alphabetical order!
Let's inspect our cleaned dataset again:
summary(dat.nona)
chol stab.glu hdl ratio
Min. : 78.0 Min. : 48.0 Min. : 12.00 Min. : 1.500
1st Qu.:179.0 1st Qu.: 81.0 1st Qu.: 38.00 1st Qu.: 3.200
Median :203.5 Median : 90.0 Median : 46.00 Median : 4.200
Mean :207.5 Mean :107.4 Mean : 50.27 Mean : 4.536
3rd Qu.:228.8 3rd Qu.:108.0 3rd Qu.: 59.00 3rd Qu.: 5.400
Max. :443.0 Max. :385.0 Max. :120.00 Max. :19.300
glyhb age height weight
Min. : 2.680 Min. :19.00 Min. :52.00 Min. : 99.0
1st Qu.: 4.393 1st Qu.:34.00 1st Qu.:63.00 1st Qu.:151.0
Median : 4.860 Median :45.00 Median :66.00 Median :174.0
Mean : 5.607 Mean :46.69 Mean :66.05 Mean :178.1
3rd Qu.: 5.630 3rd Qu.:60.00 3rd Qu.:69.00 3rd Qu.:200.0
Max. :16.110 Max. :92.00 Max. :76.00 Max. :325.0
bp.1s bp.1d waist hip
Min. : 90.0 Min. : 48.00 Min. :26.00 Min. :30.00
1st Qu.:121.2 1st Qu.: 75.00 1st Qu.:33.00 1st Qu.:39.00
Median :136.0 Median : 82.00 Median :37.00 Median :42.00
Mean :137.2 Mean : 83.36 Mean :37.93 Mean :43.05
3rd Qu.:148.0 3rd Qu.: 92.00 3rd Qu.:41.75 3rd Qu.:46.00
Max. :250.0 Max. :124.00 Max. :56.00 Max. :64.00
time.ppn location gender frame
Min. : 5.00 Buckingham:175 female:214 small : 98
1st Qu.: 93.75 Louisa :191 male :152 medium:172
Median : 240.00 large : 96
Mean : 339.04
3rd Qu.: 480.00
Max. :1560.00
We now have a clean dataset to work with, congratulations!
Let's see how we can order the rows according to certain columns. We will use the function arrange()
from the dplyr
package.
### order the rows by increasing age
dat.nona = dat.nona %>% arrange(age)
head(dat.nona)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 193 77 49 3.9 4.31 19 61 119 118 70 32 38
2 146 79 41 3.6 4.76 19 60 135 108 58 33 40
3 230 112 64 3.6 4.53 20 67 159 100 90 31 39
4 193 106 63 3.1 6.35 20 68 274 165 110 49 58
5 170 69 64 2.7 4.39 20 64 161 108 70 37 40
6 164 71 63 2.6 4.51 20 72 145 108 78 29 36
time.ppn location gender frame
1 300 Louisa female small
2 240 Buckingham female medium
3 1440 Louisa male medium
4 60 Buckingham female small
5 120 Buckingham female medium
6 1080 Buckingham male small
If we want to order by decreasing age:
dat.nona = dat.nona %>% arrange(desc(age))
head(dat.nona)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 165 94 69 2.4 4.98 92 62 217 160 82 51 51
2 301 90 118 2.6 4.28 89 61 115 218 90 31 41
3 226 279 52 4.3 10.07 84 60 192 144 88 41 48
4 227 105 44 5.2 5.71 83 59 125 150 90 35 40
5 240 88 49 4.9 4.92 82 63 170 180 86 41 46
6 271 121 40 6.8 4.57 81 64 158 146 76 36 43
time.ppn location gender frame
1 180 Buckingham female large
2 210 Louisa female medium
3 210 Louisa female small
4 300 Louisa female medium
5 720 Buckingham female medium
6 10 Louisa female medium
dat.nona = dat.nona %>% arrange(gender)
head(dat.nona)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 165 94 69 2.4 4.98 92 62 217 160 82 51 51
2 301 90 118 2.6 4.28 89 61 115 218 90 31 41
3 226 279 52 4.3 10.07 84 60 192 144 88 41 48
4 227 105 44 5.2 5.71 83 59 125 150 90 35 40
5 240 88 49 4.9 4.92 82 63 170 180 86 41 46
6 271 121 40 6.8 4.57 81 64 158 146 76 36 43
time.ppn location gender frame
1 180 Buckingham female large
2 210 Louisa female medium
3 210 Louisa female small
4 300 Louisa female medium
5 720 Buckingham female medium
6 10 Louisa female medium
Given that there are many patients with the same value in this column, how can we order the patients within a certain category (for example, sorting the female patients by increasing age):
dat.nona = dat.nona %>% arrange(gender, age)
head(dat.nona)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 193 77 49 3.9 4.31 19 61 119 118 70 32 38
2 146 79 41 3.6 4.76 19 60 135 108 58 33 40
3 193 106 63 3.1 6.35 20 68 274 165 110 49 58
4 170 69 64 2.7 4.39 20 64 161 108 70 37 40
5 149 77 49 3.0 4.50 20 62 115 105 82 31 37
6 226 97 70 3.2 3.88 20 64 114 122 64 31 39
time.ppn location gender frame
1 300 Louisa female small
2 240 Buckingham female medium
3 60 Buckingham female small
4 120 Buckingham female medium
5 720 Buckingham female small
6 90 Louisa female small
Order the rows by location, then gender, and decreasing weight!
dat.nona = dat.nona %>% arrange(location, gender, desc(weight))
head(dat.nona)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 192 109 44 4.4 4.86 43 64 325 141 79 53 62
2 235 109 59 4.0 7.48 62 63 290 175 80 55 62
3 203 299 43 4.7 12.74 38 69 288 136 83 48 55
4 193 106 63 3.1 6.35 20 68 274 165 110 49 58
5 180 84 69 2.6 5.20 40 68 264 142 98 43 54
6 228 92 37 6.2 4.64 58 61 256 190 92 49 57
time.ppn location gender frame
1 60 Buckingham female large
2 300 Buckingham female large
3 240 Buckingham female large
4 60 Buckingham female small
5 240 Buckingham female medium
6 180 Buckingham female large
Often, we want to filter the rows accordin to certain criteria. This can be easily done using the filter()
command from the dplyr()
package.
## filter femal patients
dat.nona %>% filter(gender == "female")
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 192 109 44 4.4 4.86 43 64 325 141 79 53 62
2 235 109 59 4.0 7.48 62 63 290 175 80 55 62
3 203 299 43 4.7 12.74 38 69 288 136 83 48 55
4 193 106 63 3.1 6.35 20 68 274 165 110 49 58
5 180 84 69 2.6 5.20 40 68 264 142 98 43 54
6 228 92 37 6.2 4.64 58 61 256 190 92 49 57
7 199 153 77 2.6 4.74 36 66 255 118 66 47 52
8 176 90 34 5.2 4.24 32 63 252 100 72 45 58
9 164 86 40 4.1 5.23 23 69 245 126 75 44 47
10 228 115 61 3.7 6.39 71 63 244 170 92 48 51
11 191 155 58 3.3 8.06 31 62 237 140 87 53 56
12 443 185 23 19.3 14.31 51 70 235 158 98 43 48
time.ppn location gender frame
1 60 Buckingham female large
2 300 Buckingham female large
3 240 Buckingham female large
4 60 Buckingham female small
5 240 Buckingham female medium
6 180 Buckingham female large
7 360 Buckingham female large
8 180 Buckingham female medium
9 420 Buckingham female large
10 660 Buckingham female large
11 240 Buckingham female large
12 420 Buckingham female medium
[ reached 'max' / getOption("max.print") -- omitted 202 rows ]
## filter female patients which are over 50
dat.nona %>% filter(gender == "female" & age > 50)
chol stab.glu hdl ratio glyhb age height weight bp.1s bp.1d waist hip
1 235 109 59 4.0 7.48 62 63 290 175 80 55 62
2 228 92 37 6.2 4.64 58 61 256 190 92 49 57
3 228 115 61 3.7 6.39 71 63 244 170 92 48 51
4 443 185 23 19.3 14.31 51 70 235 158 98 43 48
5 160 122 41 3.9 6.49 55 67 223 136 83 43 48
6 289 111 50 5.8 9.39 70 60 220 126 80 51 54
7 157 74 47 3.3 5.57 55 66 219 150 82 43 52
8 165 94 69 2.4 4.98 92 62 217 160 82 51 51
9 263 89 40 6.6 5.78 55 63 202 108 72 45 50
10 342 251 48 7.1 12.67 63 65 201 178 88 45 46
11 283 145 39 7.3 8.25 63 61 200 190 110 44 48
12 242 74 55 4.4 3.97 70 66 200 140 65 41 47
time.ppn location gender frame
1 300 Buckingham female large
2 180 Buckingham female large
3 660 Buckingham female large
4 420 Buckingham female medium
5 960 Buckingham female medium
6 780 Buckingham female medium
7 360 Buckingham female medium
8 180 Buckingham female large
9 240 Buckingham female small
10 180 Buckingham female medium
11 720 Buckingham female medium
12 180 Buckingham female medium
[ reached 'max' / getOption("max.print") -- omitted 70 rows ]
If we have a matrix or data frame consisting only of numerical values, we sometimes want to apply a certain operation on all rows or columns; some examples might be
To do this, we can use the function apply()
; let us see an example to understand how it works:
## we generate a matrix with random numbers
X = matrix(rnorm(50), nrow = 10)
colnames(X) = paste0("Patient_", 1:5)
rownames(X) = paste0("Gene_", 1:10)
X
Patient_1 Patient_2 Patient_3 Patient_4 Patient_5
Gene_1 -0.642850422 0.2082593 0.06076296 -0.47294686 1.02932897
Gene_2 -2.342413559 -0.7357086 1.28711595 -1.09925472 -0.39417331
Gene_3 -0.135153430 0.2476242 0.27649464 -0.78488392 -0.08263922
Gene_4 -1.877205879 1.0377687 1.43434256 1.51335535 0.88305311
Gene_5 -1.508192721 -0.5228552 -1.98692275 -0.20357999 1.95837784
Gene_6 2.187096034 -1.1210439 0.05821113 0.79418218 2.63162760
Gene_7 0.807973477 -3.0603129 -1.12875259 -0.31617807 0.62522740
Gene_8 -0.001030084 2.2060391 -1.67553435 -0.08741737 1.43354321
Gene_9 -0.272979511 0.4592043 -0.16011621 0.79268712 1.42489121
Gene_10 -1.264476879 0.6239750 0.47079525 -0.73233590 1.34029653
Suppose we want to compute the mean expression of the 10 genes for all patients; hence, we want to apply the mean()
function on all columns. Here we go...
## mean over all columns
apply(X, 2, mean)
Patient_1 Patient_2 Patient_3 Patient_4 Patient_5
-0.50492330 -0.06570500 -0.13636034 -0.05963722 1.08495333
The arguments in the apply
function are as follows:
Compute the maximum expression for all genes
Compute the standard deviation for all genes
Order the genes in the matrix by decreasing standard deviation (see 3.6)
Click here for solution!
## maximum expression for all genes max.exp = apply(X,1,max) ## standard deviation for all genes sd.exp = apply(X,1,sd) X %>% arrange(desc(sd.exp))
Compute the bmi index for all patients. Since the weight is in pound and the height in inches, the formula is
bmi = weight/height^2 * 703
Add this bmi index as a new column
bmi
using themutate()
function.Filter the women with a bmi index over 30; how many do you find?
Same question for the men younger that 50 with a bmi over 30.
## bmi
bmi = dat$weight/dat$height^2*703
##
dat = dat %>% mutate(bmi=weight/height^2*703)
##
dat %>% filter(gender=='female' & bmi > 30) %>% nrow()
##
dat %>% filter(gender=='male' & age < 50 & bmi > 30) %>% nrow()
Previous Chapter (Reading in a data table)| Next Chapter (Plotting)