thefotes · May 23, 2025 11:37
diff --git a/data.rmd b/data.rmd
 ---
 title: "NYPD Shooting Incidents"
 author: "P. Foti"
 date: "2025-05-13"
 output:
  pdf_document: default
  html_document: default
 ---

 # Introduction
 In this project, I investigate trends in shooting incidents across NYC, exploring whether incident characteristics such as borough, victim age group, and location, predict fatal outcomes. Data for this project is downloaded from [Data.gov](https://catalog.data.gov/dataset)

 ```{r load_data}
 library(readr)
 library(tidyverse)
 library(ggplot2)
 library(broom)
 library(lubridate)
 library(ggridges)

 incident_data <- read_csv("data/NYPD_Shooting_Incident_Data__Historic_.csv")
 head(incident_data)
 ```

 ## What we'll answer
 The goal of this data exploration is to answer the following questions:

 1. **Where** are the majority of shooting incidents within the 5 boroughs of NYC happening?

 2. **When** are the majority of shooting incidents within the 5 boroughs of NYC taking place?

 3. **What** are the key factors in predicting whether or not a shooting incident will turn fatal?

 ## Data and Cleaning Pipeline
 ```{r}
 # Convert (null), Unknown, and "U" values to NA
 incident_data <- incident_data %>%
  mutate(across(where(is.factor), ~na_if(as.character(.), "(null)"))) %>%
  mutate(across(where(is.character), ~na_if(., "(null)")))

 # Treat "UNKNOWN" and "U" the same as (null)
 incident_data <- incident_data %>%
  mutate(across(where(is.character), ~na_if(., "UNKNOWN"))) %>%
  mutate(across(where(is.character), ~na_if(., "U"))) %>%
  mutate(across(where(is.factor), ~na_if(as.character(.), "UNKNOWN"))) %>%
  mutate(across(where(is.factor), ~na_if(as.character(.), "U")))

 # Convert categorical columns to factors
 incident_data <- incident_data %>%
  mutate(across(
    c(BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION_DESC,
      PERP_AGE_GROUP, PERP_SEX, PERP_RACE,
      VIC_AGE_GROUP, VIC_SEX, VIC_RACE, PRECINCT, JURISDICTION_CODE),
    as.factor
  ))

 # Perp age group has 5 outliers that aren't valid age groups. Recoding these to NA
 incident_data <- incident_data %>%
  mutate(PERP_AGE_GROUP = fct_recode(PERP_AGE_GROUP,
                                     NULL = "1020",
                                     NULL = "1028",
                                     NULL = "2021",
                                     NULL = "224",
                                     NULL = "940"))

 # VIC_AGE_GROUP also has an invalid age group. Recoding to NA
 incident_data <- incident_data %>%
  mutate(VIC_AGE_GROUP = fct_recode(VIC_AGE_GROUP,
                                     NULL = "1022"))

 # Transform dates to actual data type
 incident_data <- incident_data %>%
  mutate(OCCUR_DATE = mdy(OCCUR_DATE))


 summary(incident_data)
 ```

 # Visualizations

 ### Shooting Incidents by Borough
 Brooklyn and the Bronx account for the majority of gun violence in NYC.
 ```{r}
 incident_data %>%
  count(BORO) %>%
  ggplot(aes(x = reorder(BORO, -n), y = n)) +
  geom_col(fill = "darkgreen") +
  labs(title = "Shooting Incidents by Borough",
       x = "Borough", y = "Incident Count") +
  theme_minimal()
 ```

 ### Number of Incidents by Victim Age Group
 There is a steady decline in gun violence up until age 44, at which point it drops off precipitously
 ```{r}
 incident_data %>%
  count(VIC_AGE_GROUP) %>%
  ggplot(aes(x = VIC_AGE_GROUP, y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Number of Incidents by Victim Age Group",
       x = "Victim Age Group", y = "Incident Count") +
  theme_minimal()
 ```

 ### Annual Shooting Incidents in NYC
 While our data set doesn't allow for us to meaningfully answer why there was such a decline from 2016-2018, it is a question worthy of follow up
 ```{r}
 incident_data %>%
  count(OCCUR_DATE) %>%
  mutate(year = lubridate::year(OCCUR_DATE)) %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(color = "darkred") +
  geom_point() +
  labs(title = "Annual Shooting Incidents in NYC",
       x = "Year", y = "Incident Count") +
  theme_minimal()
 ```

 ### Plotting Incidents by Time of Day
 After reviewing other student assignments in previous weeks I grew to love the idea of plotting incidents by time of day. In searching for a visualization that I hadn't seen before I stumbled upon that of a "ridge line", which is what i've used here. The data is striking, although perhaps what we would expect, the majority of shooting incidents happen very late at night, or very early in the morning. This suggests that those times are when we'd most benefit from an enhanced police presence, among other policy features.
 ```{r, warning = FALSE}
 shootings <- incident_data %>%
  mutate(
    OCCUR_DATETIME = ymd_hms(paste(OCCUR_DATE, OCCUR_TIME)),
    hour = hour(OCCUR_DATETIME),
    weekday = wday(OCCUR_DATETIME, label = TRUE, abbr = FALSE)
  ) %>% 
  filter(!is.na(hour))

 ggplot(shootings, aes(x = hour, y = fct_rev(weekday), fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_x_continuous(breaks = seq(0, 23, 3)) +
  scale_fill_viridis_c(option = "C") +
  labs(
    title = "When Do Shootings Occur?",
    subtitle = "Ridgeline density of incident start hour by weekday",
    x = "Hour of day", y = NULL, fill = "Hour"
  ) +
  theme_minimal(base_size = 12)
 ```

 # Modeling Likelihood of Fatal Shooting
 To explore factors associated with whether a shooting incident resulted in a fatality, I built a logistic regression model using `STATISTICAL_MURDER_FLAG` as the outcome variable. Predictor variables included borough, victim age group, and location type. We observe that shootings involving older victims (65+), or those occurring in private residences, are significantly more likely to result in fatalities. Meanwhile, boroughs like Queens appear associated with lower odds of fatality. This model helps quantify the relative influence of contextual factors on shooting outcomes.

 ```{r}
 model_data <- incident_data %>%
  filter(!is.na(STATISTICAL_MURDER_FLAG)) %>%
  mutate(
    is_murder = as.integer(STATISTICAL_MURDER_FLAG),  # convert TRUE/FALSE to 1/0
    BORO = fct_drop(BORO),
    VIC_AGE_GROUP = fct_drop(VIC_AGE_GROUP),
    LOCATION_DESC = fct_lump_n(LOCATION_DESC, n = 5)  # reduce rare levels
  ) %>%
  select(is_murder, BORO, VIC_AGE_GROUP, LOCATION_DESC)

 model <- glm(is_murder ~ BORO + VIC_AGE_GROUP + LOCATION_DESC,
             data = model_data,
             family = "binomial")
 summary(model)

 # Extract model coefficients, confidence intervals, and significance
 tidy_model <- broom::tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(significant = p.value < 0.05)

 ggplot(tidy_model, aes(x = estimate, y = reorder(term, estimate), color = significant)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
  scale_color_manual(values = c("TRUE" = "firebrick", "FALSE" = "gray60")) +
  labs(
    title = "Logistic Regression Coefficients",
    x = "Effect on Log Odds of Fatality",
    y = "Predictor Variable",
    color = "Statistically Significant"
  ) +
  theme_minimal()
 ```

 ## Conclusion

 In this project, I analyzed over 29,000 shooting incidents recorded by the NYPD between 2006 and 2024. After cleaning and transforming the dataset to address issues like missing values and malformed age groups, I produced several visualizations to better understand trends over time and across demographics.

 One of the most striking patterns is the sharp decline in annual shooting incidents around 2016–2018, followed by a rapid increase starting in 2020. While the dataset doesn't contain direct explanations for this trend, it raises compelling questions about the impact of policy changes, community interventions, or shifts in reporting practices. Additionally, analysis of victim age group distributions revealed a bell-shaped curve centered on adults aged 18–44, indicating this demographic is most commonly affected by gun violence in NYC.

 ## Possible Sources of Bias

 - **Underreporting or inconsistent classification**: Not all shootings may be reported or classified uniformly over the years.
 - **Missing demographic information**: Many entries list "Unknown" or are entirely `NA` for perpetrator and victim race, sex, and age, limiting the ability to draw complete conclusions.
 - **Policy-influenced reporting**: Law enforcement practices, such as changes to "stop-and-frisk" or reclassification standards, can affect the data without reflecting actual crime trends.

 ## Personal Bias and Mitigation

 I came into this project expecting to see a steady decline in shootings over time, an assumption influenced by media narratives and personal bias as a resident of a relatively low crime area. To mitigate this, I approached the analysis with a structured process: I cleaned the data methodically, avoided cherry-picking visualizations, and verified each insight through summaries. I also chose to highlight anomalies (like unexpected drops) instead of ignoring them, recognizing that they may point to either reporting artifacts or significant social changes.


 # Session Info
 ```{r, echo=FALSE}
 sessionInfo()
 ```
	---
	title: "NYPD Shooting Incidents"
	author: "P. Foti"
	date: "2025-05-13"
	output:
	pdf_document: default
	html_document: default
	---

	# Introduction
	In this project, I investigate trends in shooting incidents across NYC, exploring whether incident characteristics such as borough, victim age group, and location, predict fatal outcomes. Data for this project is downloaded from [Data.gov](https://catalog.data.gov/dataset)

	```{r load_data}
	library(readr)
	library(tidyverse)
	library(ggplot2)
	library(broom)
	library(lubridate)
	library(ggridges)

	incident_data <- read_csv("data/NYPD_Shooting_Incident_Data__Historic_.csv")
	head(incident_data)
	```

	## What we'll answer
	The goal of this data exploration is to answer the following questions:

	1. Where are the majority of shooting incidents within the 5 boroughs of NYC happening?

	2. When are the majority of shooting incidents within the 5 boroughs of NYC taking place?

	3. What are the key factors in predicting whether or not a shooting incident will turn fatal?

	## Data and Cleaning Pipeline
	```{r}
	# Convert (null), Unknown, and "U" values to NA
	incident_data <- incident_data %>%
	mutate(across(where(is.factor), ~na_if(as.character(.), "(null)"))) %>%
	mutate(across(where(is.character), ~na_if(., "(null)")))

	# Treat "UNKNOWN" and "U" the same as (null)
	incident_data <- incident_data %>%
	mutate(across(where(is.character), ~na_if(., "UNKNOWN"))) %>%
	mutate(across(where(is.character), ~na_if(., "U"))) %>%
	mutate(across(where(is.factor), ~na_if(as.character(.), "UNKNOWN"))) %>%
	mutate(across(where(is.factor), ~na_if(as.character(.), "U")))

	# Convert categorical columns to factors
	incident_data <- incident_data %>%
	mutate(across(
	c(BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION_DESC,
	PERP_AGE_GROUP, PERP_SEX, PERP_RACE,
	VIC_AGE_GROUP, VIC_SEX, VIC_RACE, PRECINCT, JURISDICTION_CODE),
	as.factor
	))

	# Perp age group has 5 outliers that aren't valid age groups. Recoding these to NA
	incident_data <- incident_data %>%
	mutate(PERP_AGE_GROUP = fct_recode(PERP_AGE_GROUP,
	NULL = "1020",
	NULL = "1028",
	NULL = "2021",
	NULL = "224",
	NULL = "940"))

	# VIC_AGE_GROUP also has an invalid age group. Recoding to NA
	incident_data <- incident_data %>%
	mutate(VIC_AGE_GROUP = fct_recode(VIC_AGE_GROUP,
	NULL = "1022"))

	# Transform dates to actual data type
	incident_data <- incident_data %>%
	mutate(OCCUR_DATE = mdy(OCCUR_DATE))


	summary(incident_data)
	```

	# Visualizations

	### Shooting Incidents by Borough
	Brooklyn and the Bronx account for the majority of gun violence in NYC.
	```{r}
	incident_data %>%
	count(BORO) %>%
	ggplot(aes(x = reorder(BORO, -n), y = n)) +
	geom_col(fill = "darkgreen") +
	labs(title = "Shooting Incidents by Borough",
	x = "Borough", y = "Incident Count") +
	theme_minimal()
	```

	### Number of Incidents by Victim Age Group
	There is a steady decline in gun violence up until age 44, at which point it drops off precipitously
	```{r}
	incident_data %>%
	count(VIC_AGE_GROUP) %>%
	ggplot(aes(x = VIC_AGE_GROUP, y = n)) +
	geom_col(fill = "steelblue") +
	labs(title = "Number of Incidents by Victim Age Group",
	x = "Victim Age Group", y = "Incident Count") +
	theme_minimal()
	```

	### Annual Shooting Incidents in NYC
	While our data set doesn't allow for us to meaningfully answer why there was such a decline from 2016-2018, it is a question worthy of follow up
	```{r}
	incident_data %>%
	count(OCCUR_DATE) %>%
	mutate(year = lubridate::year(OCCUR_DATE)) %>%
	count(year) %>%
	ggplot(aes(x = year, y = n)) +
	geom_line(color = "darkred") +
	geom_point() +
	labs(title = "Annual Shooting Incidents in NYC",
	x = "Year", y = "Incident Count") +
	theme_minimal()
	```

	### Plotting Incidents by Time of Day
	After reviewing other student assignments in previous weeks I grew to love the idea of plotting incidents by time of day. In searching for a visualization that I hadn't seen before I stumbled upon that of a "ridge line", which is what i've used here. The data is striking, although perhaps what we would expect, the majority of shooting incidents happen very late at night, or very early in the morning. This suggests that those times are when we'd most benefit from an enhanced police presence, among other policy features.
	```{r, warning = FALSE}
	shootings <- incident_data %>%
	mutate(
	OCCUR_DATETIME = ymd_hms(paste(OCCUR_DATE, OCCUR_TIME)),
	hour = hour(OCCUR_DATETIME),
	weekday = wday(OCCUR_DATETIME, label = TRUE, abbr = FALSE)
	) %>%
	filter(!is.na(hour))

	ggplot(shootings, aes(x = hour, y = fct_rev(weekday), fill = ..x..)) +
	geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
	scale_x_continuous(breaks = seq(0, 23, 3)) +
	scale_fill_viridis_c(option = "C") +
	labs(
	title = "When Do Shootings Occur?",
	subtitle = "Ridgeline density of incident start hour by weekday",
	x = "Hour of day", y = NULL, fill = "Hour"
	) +
	theme_minimal(base_size = 12)
	```

	# Modeling Likelihood of Fatal Shooting
	To explore factors associated with whether a shooting incident resulted in a fatality, I built a logistic regression model using `STATISTICAL_MURDER_FLAG` as the outcome variable. Predictor variables included borough, victim age group, and location type. We observe that shootings involving older victims (65+), or those occurring in private residences, are significantly more likely to result in fatalities. Meanwhile, boroughs like Queens appear associated with lower odds of fatality. This model helps quantify the relative influence of contextual factors on shooting outcomes.

	```{r}
	model_data <- incident_data %>%
	filter(!is.na(STATISTICAL_MURDER_FLAG)) %>%
	mutate(
	is_murder = as.integer(STATISTICAL_MURDER_FLAG), # convert TRUE/FALSE to 1/0
	BORO = fct_drop(BORO),
	VIC_AGE_GROUP = fct_drop(VIC_AGE_GROUP),
	LOCATION_DESC = fct_lump_n(LOCATION_DESC, n = 5) # reduce rare levels
	) %>%
	select(is_murder, BORO, VIC_AGE_GROUP, LOCATION_DESC)

	model <- glm(is_murder ~ BORO + VIC_AGE_GROUP + LOCATION_DESC,
	data = model_data,
	family = "binomial")
	summary(model)

	# Extract model coefficients, confidence intervals, and significance
	tidy_model <- broom::tidy(model, conf.int = TRUE) %>%
	filter(term != "(Intercept)") %>%
	mutate(significant = p.value < 0.05)

	ggplot(tidy_model, aes(x = estimate, y = reorder(term, estimate), color = significant)) +
	geom_point(size = 3) +
	geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
	geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
	scale_color_manual(values = c("TRUE" = "firebrick", "FALSE" = "gray60")) +
	labs(
	title = "Logistic Regression Coefficients",
	x = "Effect on Log Odds of Fatality",
	y = "Predictor Variable",
	color = "Statistically Significant"
	) +
	theme_minimal()
	```

	## Conclusion

	In this project, I analyzed over 29,000 shooting incidents recorded by the NYPD between 2006 and 2024. After cleaning and transforming the dataset to address issues like missing values and malformed age groups, I produced several visualizations to better understand trends over time and across demographics.

	One of the most striking patterns is the sharp decline in annual shooting incidents around 2016–2018, followed by a rapid increase starting in 2020. While the dataset doesn't contain direct explanations for this trend, it raises compelling questions about the impact of policy changes, community interventions, or shifts in reporting practices. Additionally, analysis of victim age group distributions revealed a bell-shaped curve centered on adults aged 18–44, indicating this demographic is most commonly affected by gun violence in NYC.

	## Possible Sources of Bias

	- Underreporting or inconsistent classification: Not all shootings may be reported or classified uniformly over the years.
	- Missing demographic information: Many entries list "Unknown" or are entirely `NA` for perpetrator and victim race, sex, and age, limiting the ability to draw complete conclusions.
	- Policy-influenced reporting: Law enforcement practices, such as changes to "stop-and-frisk" or reclassification standards, can affect the data without reflecting actual crime trends.

	## Personal Bias and Mitigation

	I came into this project expecting to see a steady decline in shootings over time, an assumption influenced by media narratives and personal bias as a resident of a relatively low crime area. To mitigate this, I approached the analysis with a structured process: I cleaned the data methodically, avoided cherry-picking visualizations, and verified each insight through summaries. I also chose to highlight anomalies (like unexpected drops) instead of ignoring them, recognizing that they may point to either reporting artifacts or significant social changes.


	# Session Info
	```{r, echo=FALSE}
	sessionInfo()
	```