Last active
January 2, 2025 14:08
-
-
Save PietrH/ae55cddd3187548e90fdacf4592153fe to your computer and use it in GitHub Desktop.
Find the closest gene per allele for every chromosome
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# We are trying to find the closest allele per gene | |
## You could possibly do this with this package as well: | |
## https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects | |
library(dplyr) # A Grammar of Data Manipulation | |
library(assertthat) # Easy Pre and Post Assertions | |
# Some example data | |
## df with gen positions | |
gen_positions <- | |
tibble( | |
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"), | |
start = c(1, 15, 2, 50, 90, 110, 160, 300), | |
end = c(10, 30, 28, 100, 100, 150, 260, 400), | |
gen = c("A", "B", "C", "D", "E", "F", "G", "H") | |
) | |
View(gen_positions) | |
## df with allele positions | |
allele_positions <- | |
tibble( | |
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"), | |
pos = c(12, 50, 72, 300, 106, 107, 285) | |
) | |
View(allele_positions) | |
# All gen in gen_positions are unique | |
assertthat::assert_that( | |
length(unique(gen_positions$gen)) == | |
length(gen_positions$gen) | |
) | |
# All pos in allele_positions are unique | |
assertthat::assert_that( | |
length(unique(allele_positions$pos)) == | |
length(allele_positions$pos) | |
) | |
# What is the closest gene per allele (SNP) | |
full_join(gen_positions, | |
allele_positions, | |
by = join_by("chrom"), | |
relationship = "many-to-many" | |
) %>% | |
mutate( | |
diff_start = abs(start - pos), | |
diff_end = abs(end - pos) | |
) %>% | |
mutate( | |
closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min) | |
) %>% | |
group_by(pos) %>% | |
filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>% | |
select(chrom, pos, closest_gene_dinstance, gen) %>% | |
View("Closest gene per allele (SNP)") | |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Find the closest gene per allele for every chromosome" | |
author: "PieterH" | |
date: "`r Sys.Date()`" | |
output: | |
html_document: | |
df_print: paged | |
code_folding: show | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
``` | |
In this document I will show how to find the closest gene to an allele for every chromosome. We will use two datasets, one with the positions of genes and one with the positions of alleles. We will join these datasets on the chromosome, and calculate the distance between the gene and the allele. We will then filter for the gene with the smallest distance to the allele. | |
You could possibly do this with the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package as well. But I didn't find it until after I had written this code. | |
https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects | |
### Load required packages | |
```{r load libraries, echo = TRUE, message=FALSE} | |
library(dplyr) | |
library(purrr) | |
library(assertthat) | |
``` | |
## Two example datasets | |
We use two example datasets, the first containing the positions of the genes and the second containing the positions of the alleles. | |
### `gen_positions` | |
```{r make example gene data, class.source = 'fold-hide'} | |
gen_positions <- | |
tibble( | |
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"), | |
start = c(1, 15, 2, 50, 90, 110, 160, 300), | |
end = c(10, 30, 28, 100, 100, 150, 260, 400), | |
gen = c("A", "B", "C", "D", "E", "F", "G", "H") | |
) | |
gen_positions | |
``` | |
### `allele_positions` | |
```{r make example allele data, class.source = 'fold-hide'} | |
## df with allele positions | |
allele_positions <- | |
tibble( | |
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"), | |
pos = c(12, 50, 72, 300, 106, 107, 285) | |
) | |
allele_positions | |
``` | |
## Check that the example data conforms to our expectations | |
- all values for `gen` in `gen_positions` are unique | |
- all values for `pos` in `allele_positions` are unique | |
```{r check if example data is valid, class.source = 'fold-hide', collapse=TRUE} | |
# All gen in gen_positions are unique | |
assertthat::assert_that( | |
length(unique(gen_positions$gen)) == | |
length(gen_positions$gen) | |
) | |
# All pos in allele_positions are unique | |
assertthat::assert_that( | |
length(unique(allele_positions$pos)) == | |
length(allele_positions$pos) | |
) | |
``` | |
## Calculate the closest allele per gene, for every chromosome | |
1. First we join the two datasets on the chromosome, with a many to many relationship. This means that every row in `gen_positions` will be joined with every row in `allele_positions` for the same chromosome. When there are | |
multiple alleles on the same chromosome, the gene will be repeated for every allele. | |
2. Next, we calculate the distance between the start and end of the gene and the position of the allele. We do this for both the start and end of the gene, and take the minimum of the two. This is the distance between the gene and the allele. | |
3. Finally we group by the position of the allele, and filter for the gene with the smallest distance to the allele. | |
```{r} | |
# What is the closest gene per allele (SNP) | |
full_join(gen_positions, | |
allele_positions, | |
by = join_by("chrom"), | |
relationship = "many-to-many" | |
) %>% | |
mutate( | |
diff_start = abs(start - pos), | |
diff_end = abs(end - pos) | |
) %>% | |
mutate( | |
closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min) | |
) %>% | |
group_by(pos) %>% | |
filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>% | |
select(chrom, pos, closest_gene_dinstance, gen) | |
``` | |
#### Compliment yourself on a job well done!! | |
```{r, echo = FALSE, comment=""} | |
if(all(c("cowsay","praise") %in% installed.packages()[,"Package"])){ | |
cowsay::say(praise::praise())}else{message("You did a great job!")} | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment