Skip to content

Instantly share code, notes, and snippets.

@PietrH
Last active January 2, 2025 14:08
Show Gist options
  • Save PietrH/ae55cddd3187548e90fdacf4592153fe to your computer and use it in GitHub Desktop.
Save PietrH/ae55cddd3187548e90fdacf4592153fe to your computer and use it in GitHub Desktop.
Find the closest gene per allele for every chromosome
# We are trying to find the closest allele per gene
## You could possibly do this with this package as well:
## https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects
library(dplyr) # A Grammar of Data Manipulation
library(assertthat) # Easy Pre and Post Assertions
# Some example data
## df with gen positions
gen_positions <-
tibble(
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
start = c(1, 15, 2, 50, 90, 110, 160, 300),
end = c(10, 30, 28, 100, 100, 150, 260, 400),
gen = c("A", "B", "C", "D", "E", "F", "G", "H")
)
View(gen_positions)
## df with allele positions
allele_positions <-
tibble(
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
pos = c(12, 50, 72, 300, 106, 107, 285)
)
View(allele_positions)
# All gen in gen_positions are unique
assertthat::assert_that(
length(unique(gen_positions$gen)) ==
length(gen_positions$gen)
)
# All pos in allele_positions are unique
assertthat::assert_that(
length(unique(allele_positions$pos)) ==
length(allele_positions$pos)
)
# What is the closest gene per allele (SNP)
full_join(gen_positions,
allele_positions,
by = join_by("chrom"),
relationship = "many-to-many"
) %>%
mutate(
diff_start = abs(start - pos),
diff_end = abs(end - pos)
) %>%
mutate(
closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
) %>%
group_by(pos) %>%
filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>%
select(chrom, pos, closest_gene_dinstance, gen) %>%
View("Closest gene per allele (SNP)")
---
title: "Find the closest gene per allele for every chromosome"
author: "PieterH"
date: "`r Sys.Date()`"
output:
html_document:
df_print: paged
code_folding: show
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this document I will show how to find the closest gene to an allele for every chromosome. We will use two datasets, one with the positions of genes and one with the positions of alleles. We will join these datasets on the chromosome, and calculate the distance between the gene and the allele. We will then filter for the gene with the smallest distance to the allele.
You could possibly do this with the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package as well. But I didn't find it until after I had written this code.
https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects
### Load required packages
```{r load libraries, echo = TRUE, message=FALSE}
library(dplyr)
library(purrr)
library(assertthat)
```
## Two example datasets
We use two example datasets, the first containing the positions of the genes and the second containing the positions of the alleles.
### `gen_positions`
```{r make example gene data, class.source = 'fold-hide'}
gen_positions <-
tibble(
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
start = c(1, 15, 2, 50, 90, 110, 160, 300),
end = c(10, 30, 28, 100, 100, 150, 260, 400),
gen = c("A", "B", "C", "D", "E", "F", "G", "H")
)
gen_positions
```
### `allele_positions`
```{r make example allele data, class.source = 'fold-hide'}
## df with allele positions
allele_positions <-
tibble(
chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
pos = c(12, 50, 72, 300, 106, 107, 285)
)
allele_positions
```
## Check that the example data conforms to our expectations
- all values for `gen` in `gen_positions` are unique
- all values for `pos` in `allele_positions` are unique
```{r check if example data is valid, class.source = 'fold-hide', collapse=TRUE}
# All gen in gen_positions are unique
assertthat::assert_that(
length(unique(gen_positions$gen)) ==
length(gen_positions$gen)
)
# All pos in allele_positions are unique
assertthat::assert_that(
length(unique(allele_positions$pos)) ==
length(allele_positions$pos)
)
```
## Calculate the closest allele per gene, for every chromosome
1. First we join the two datasets on the chromosome, with a many to many relationship. This means that every row in `gen_positions` will be joined with every row in `allele_positions` for the same chromosome. When there are
multiple alleles on the same chromosome, the gene will be repeated for every allele.
2. Next, we calculate the distance between the start and end of the gene and the position of the allele. We do this for both the start and end of the gene, and take the minimum of the two. This is the distance between the gene and the allele.
3. Finally we group by the position of the allele, and filter for the gene with the smallest distance to the allele.
```{r}
# What is the closest gene per allele (SNP)
full_join(gen_positions,
allele_positions,
by = join_by("chrom"),
relationship = "many-to-many"
) %>%
mutate(
diff_start = abs(start - pos),
diff_end = abs(end - pos)
) %>%
mutate(
closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
) %>%
group_by(pos) %>%
filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>%
select(chrom, pos, closest_gene_dinstance, gen)
```
#### Compliment yourself on a job well done!!
```{r, echo = FALSE, comment=""}
if(all(c("cowsay","praise") %in% installed.packages()[,"Package"])){
cowsay::say(praise::praise())}else{message("You did a great job!")}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment