PietrH · January 2, 2025 14:08
diff --git a/find_allele_per_gene.R b/find_allele_per_gene.R
 # We are trying to find the closest allele per gene

 ## You could possibly do this with this package as well: 
 ## https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects

 library(dplyr) # A Grammar of Data Manipulation
 library(assertthat) # Easy Pre and Post Assertions

 # Some example data

 ## df with gen positions
 gen_positions <-
  tibble(
    chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
    start = c(1, 15, 2, 50, 90, 110, 160, 300),
    end = c(10, 30, 28, 100, 100, 150, 260, 400),
    gen = c("A", "B", "C", "D", "E", "F", "G", "H")
  )

 View(gen_positions)

 ## df with allele positions
 allele_positions <-
  tibble(
    chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
    pos = c(12, 50, 72, 300, 106, 107, 285)
  )

 View(allele_positions)

 # All gen in gen_positions are unique
 assertthat::assert_that(
  length(unique(gen_positions$gen)) ==
  length(gen_positions$gen)
 )

 # All pos in allele_positions are unique
 assertthat::assert_that(
  length(unique(allele_positions$pos)) ==
  length(allele_positions$pos)
 )
  

 # What is the closest gene per allele (SNP)
 full_join(gen_positions,
          allele_positions,
          by = join_by("chrom"),
          relationship = "many-to-many"
          ) %>% 
  mutate(
    diff_start = abs(start - pos),
    diff_end = abs(end - pos)
  ) %>% 
  mutate(
    closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
  ) %>% 
  group_by(pos) %>% 
  filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>% 
  select(chrom, pos, closest_gene_dinstance, gen) %>% 
  View("Closest gene per allele (SNP)")

diff --git a/find_gen_closest_to_allele.Rmd b/find_gen_closest_to_allele.Rmd
 ---
 title: "Find the closest gene per allele for every chromosome"
 author: "PieterH"
 date: "`r Sys.Date()`"
 output: 
  html_document:
    df_print: paged
    code_folding: show
 ---

 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 ```

 In this document I will show how to find the closest gene to an allele for every chromosome. We will use two datasets, one with the positions of genes and one with the positions of alleles. We will join these datasets on the chromosome, and calculate the distance between the gene and the allele. We will then filter for the gene with the smallest distance to the allele.

 You could possibly do this with the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package as well. But I didn't find it until after I had written this code. 

 https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects

 ### Load required packages

 ```{r load libraries, echo = TRUE, message=FALSE}
 library(dplyr)
 library(purrr)
 library(assertthat)
 ```


 ## Two example datasets
 We use two example datasets, the first containing the positions of the genes and the second containing the positions of the alleles.

 ### `gen_positions`
 ```{r make example gene data, class.source = 'fold-hide'}
 gen_positions <-
  tibble(
    chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
    start = c(1, 15, 2, 50, 90, 110, 160, 300),
    end = c(10, 30, 28, 100, 100, 150, 260, 400),
    gen = c("A", "B", "C", "D", "E", "F", "G", "H")
  )
 gen_positions


 ```
 ### `allele_positions`
 ```{r make example allele data, class.source = 'fold-hide'}
 ## df with allele positions
 allele_positions <-
  tibble(
    chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
    pos = c(12, 50, 72, 300, 106, 107, 285)
  )

 allele_positions
 ```


 ## Check that the example data conforms to our expectations
 - all values for `gen` in `gen_positions` are unique
 - all values for `pos` in `allele_positions` are unique

 ```{r check if example data is valid, class.source = 'fold-hide', collapse=TRUE}
 # All gen in gen_positions are unique
 assertthat::assert_that(
  length(unique(gen_positions$gen)) ==
  length(gen_positions$gen)
 )

 # All pos in allele_positions are unique
 assertthat::assert_that(
  length(unique(allele_positions$pos)) ==
  length(allele_positions$pos)
 )
 ```

 ## Calculate the closest allele per gene, for every chromosome
 1. First we join the two datasets on the chromosome, with a many to many relationship. This means that every row in `gen_positions` will be joined with every row in `allele_positions` for the same chromosome. When there are
 multiple alleles on the same chromosome, the gene will be repeated for every allele.
 2. Next, we calculate the distance between the start and end of the gene and the position of the allele. We do this for both the start and end of the gene, and take the minimum of the two. This is the distance between the gene and the allele.
 3. Finally we group by the position of the allele, and filter for the gene with the smallest distance to the allele.


 ```{r}
 # What is the closest gene per allele (SNP)
 full_join(gen_positions,
          allele_positions,
          by = join_by("chrom"),
          relationship = "many-to-many"
          ) %>% 
  mutate(
    diff_start = abs(start - pos),
    diff_end = abs(end - pos)
  ) %>% 
  mutate(
    closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
  ) %>% 
  group_by(pos) %>% 
  filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>% 
  select(chrom, pos, closest_gene_dinstance, gen)

 ```

 #### Compliment yourself on a job well done!!
 ```{r, echo = FALSE, comment=""}
 if(all(c("cowsay","praise") %in% installed.packages()[,"Package"])){
 cowsay::say(praise::praise())}else{message("You did a great job!")}
 ```
	# We are trying to find the closest allele per gene

	## You could possibly do this with this package as well:
	## https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects

	library(dplyr) # A Grammar of Data Manipulation
	library(assertthat) # Easy Pre and Post Assertions

	# Some example data

	## df with gen positions
	gen_positions <-
	tibble(
	chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
	start = c(1, 15, 2, 50, 90, 110, 160, 300),
	end = c(10, 30, 28, 100, 100, 150, 260, 400),
	gen = c("A", "B", "C", "D", "E", "F", "G", "H")
	)

	View(gen_positions)

	## df with allele positions
	allele_positions <-
	tibble(
	chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
	pos = c(12, 50, 72, 300, 106, 107, 285)
	)

	View(allele_positions)

	# All gen in gen_positions are unique
	assertthat::assert_that(
	length(unique(gen_positions$gen)) ==
	length(gen_positions$gen)
	)

	# All pos in allele_positions are unique
	assertthat::assert_that(
	length(unique(allele_positions$pos)) ==
	length(allele_positions$pos)
	)


	# What is the closest gene per allele (SNP)
	full_join(gen_positions,
	allele_positions,
	by = join_by("chrom"),
	relationship = "many-to-many"
	) %>%
	mutate(
	diff_start = abs(start - pos),
	diff_end = abs(end - pos)
	) %>%
	mutate(
	closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
	) %>%
	group_by(pos) %>%
	filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>%
	select(chrom, pos, closest_gene_dinstance, gen) %>%
	View("Closest gene per allele (SNP)")
	---
	title: "Find the closest gene per allele for every chromosome"
	author: "PieterH"
	date: "`r Sys.Date()`"
	output:
	html_document:
	df_print: paged
	code_folding: show
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	```

	In this document I will show how to find the closest gene to an allele for every chromosome. We will use two datasets, one with the positions of genes and one with the positions of alleles. We will join these datasets on the chromosome, and calculate the distance between the gene and the allele. We will then filter for the gene with the smallest distance to the allele.

	You could possibly do this with the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package as well. But I didn't find it until after I had written this code.

	https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html#finding-the-nearest-genomic-position-in-granges-objects

	### Load required packages

	```{r load libraries, echo = TRUE, message=FALSE}
	library(dplyr)
	library(purrr)
	library(assertthat)
	```


	## Two example datasets
	We use two example datasets, the first containing the positions of the genes and the second containing the positions of the alleles.

	### `gen_positions`
	```{r make example gene data, class.source = 'fold-hide'}
	gen_positions <-
	tibble(
	chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3", "chr3"),
	start = c(1, 15, 2, 50, 90, 110, 160, 300),
	end = c(10, 30, 28, 100, 100, 150, 260, 400),
	gen = c("A", "B", "C", "D", "E", "F", "G", "H")
	)
	gen_positions


	```
	### `allele_positions`
	```{r make example allele data, class.source = 'fold-hide'}
	## df with allele positions
	allele_positions <-
	tibble(
	chrom = c("chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
	pos = c(12, 50, 72, 300, 106, 107, 285)
	)

	allele_positions
	```


	## Check that the example data conforms to our expectations
	- all values for `gen` in `gen_positions` are unique
	- all values for `pos` in `allele_positions` are unique

	```{r check if example data is valid, class.source = 'fold-hide', collapse=TRUE}
	# All gen in gen_positions are unique
	assertthat::assert_that(
	length(unique(gen_positions$gen)) ==
	length(gen_positions$gen)
	)

	# All pos in allele_positions are unique
	assertthat::assert_that(
	length(unique(allele_positions$pos)) ==
	length(allele_positions$pos)
	)
	```

	## Calculate the closest allele per gene, for every chromosome
	1. First we join the two datasets on the chromosome, with a many to many relationship. This means that every row in `gen_positions` will be joined with every row in `allele_positions` for the same chromosome. When there are
	multiple alleles on the same chromosome, the gene will be repeated for every allele.
	2. Next, we calculate the distance between the start and end of the gene and the position of the allele. We do this for both the start and end of the gene, and take the minimum of the two. This is the distance between the gene and the allele.
	3. Finally we group by the position of the allele, and filter for the gene with the smallest distance to the allele.


	```{r}
	# What is the closest gene per allele (SNP)
	full_join(gen_positions,
	allele_positions,
	by = join_by("chrom"),
	relationship = "many-to-many"
	) %>%
	mutate(
	diff_start = abs(start - pos),
	diff_end = abs(end - pos)
	) %>%
	mutate(
	closest_gene_dinstance = purrr::map2_int(diff_start, diff_end, min)
	) %>%
	group_by(pos) %>%
	filter(closest_gene_dinstance == min(closest_gene_dinstance)) %>%
	select(chrom, pos, closest_gene_dinstance, gen)

	```

	#### Compliment yourself on a job well done!!
	```{r, echo = FALSE, comment=""}
	if(all(c("cowsay","praise") %in% installed.packages()[,"Package"])){
	cowsay::say(praise::praise())}else{message("You did a great job!")}
	```