Data Engineering — Building an ETL Pipeline for NYPD Arrest Data

using Python, Snowflake, and Power BI (part 2)

Sarthak Girdhar
6 min readFeb 10, 2024
Photo by Lukas Blazek on Unsplash

Introduction

New York Police Department (NYPD) releases a breakdown of every arrest made in New York City (NYC) each quarter. This is part 2 of building an ETL pipeline for the 2023 arrest data.

In part 1, we extracted data from NYC Open Data API, did some transformations, performed data validation, and finally loaded the data into a table in Snowflake.

In this part, we will do some Exploratory Data Analysis (EDA) to answer some questions. We will also connect to Power BI to apprise ourselves of some further detailed insights.

Exploratory Data Analysis (EDA) in Snowflake

Before we begin our EDA, it is a very important skill to be able to put yourselves in the stakeholder’s shoes and write down vital questions. The answers to these questions should provide the stakeholders with some actionable insights.

Let’s understand this with some specific examples/questions.

❓ Which boroughs witness the most number of arrests?

❓ Which months see the most number of arrests?

❓ Does New York City have more serious crimes (like, felonies) or less serious ones (like, misdemeanors & violations)?

❓ What kind of crimes are committed by less than 18 years old?

❓ Felony vs Misdemeanor comparison for all age groups.

Let’s write some SQL code to get answers to our questions.

Which boroughs witness the most number of arrests?

SELECT "arrest_boro", COUNT("arrest_boro") AS "Number of Arrests"
FROM NYPD_ARRESTS
GROUP BY "arrest_boro"
ORDER BY "Number of Arrests" DESC;
Number of arrests by borough in New York City

Brooklyn (K) is at the top, followed by Bronx (B), Manhattan (M), Queens (Q), and Staten Island (S).

❇️ This result should help the residents (the stakeholders for this question) where to look for their next rental home. Staten Island sees the least number of arrests, but let’s be real, who wants to live there 🤷

Which months see the most number of arrests?

SELECT MONTHNAME("arrest_date") as "Month", COUNT(date_part(month, "arrest_date")) as "Number of Arrests"
FROM NYPD_ARRESTS
GROUP BY "Month"
ORDER BY "Number of Arrests" DESC;
Number of arrests by month in New York City

Months like May, October, August, and June have more arrests compared to April and December.

❇️ While this result doesn’t reveal any patterns, perpetrators (the stakeholders for this question) can probably come out of hiding in December (the Police force personnel could be on leave due to the holidays). I believe February has the least number of arrests because it has 2 fewer days than the rest of the months.

Does New York City have more serious crimes (like, felonies) or less serious ones (like, misdemeanors & violations)?

SELECT "law_cat_cd" AS "Violation type", COUNT("law_cat_cd") AS "Count of Violation"
FROM NYPD_ARRESTS
GROUP BY "Violation type"
ORDER BY "Count of Violation" DESC;
Felony vs Misdemeanor Vs Violation in New York City

A Felony (F) is a more serious crime than a Misdemeanor (M) which is more serious than a Violation (V).

❇️ The population of New York City is about 8 million. The number of arrests for Misdemeanor + Felony is about 220,000. While the local Government (the stakeholders for this question) can breathe a sigh of relief that the number of misdemeanors is more than the number of felonies, there are about 2.5 arrests made for every 100 people living in New York City. Also, let’s not forget that these are just the number of arrests; the actual instances of these crimes could be much higher. Something to think about.

What kind of crimes are committed by less than 18 years old?

SELECT "law_cat_cd" AS "Violation type", COUNT("law_cat_cd") AS "Count of Violation"
FROM NYPD_ARRESTS
WHERE "age_group" = '<18'
GROUP BY "Violation type"
ORDER BY "Count of Violation" DESC;
Number of minors being arrested for crimes in New York City

Minors (< 18-year-olds) are committing more felonies (2.5 times more) than misdemeanors.

❇️ This is a huge concern for Social Welfare services (the stakeholders for this question) in New York City. Minors are being arrested for more serious crimes. Attempts should be made by these agencies to stop the kids from going down these paths. Later on, in Power BI, we’ll take a look at what specific kinds of crimes are being committed by these teenagers.

Felony vs Misdemeanor comparison for all age groups.

SELECT "age_group", 
COUNT(CASE WHEN "law_cat_cd" = 'F' THEN 1 END) AS "Number of felonies",
COUNT(CASE WHEN "law_cat_cd" = 'M' THEN 1 END) AS"Number of misdemeanors"
FROM NYPD_ARRESTS
GROUP BY "age_group"
ORDER BY 2,3 DESC;
Number of arrests by age group in New York City

The age group 25–44 has the most number of arrests. If you look closely, minors (< 18-year-olds) is the only category that has more felonies than misdemeanors (something very concerning for the social welfare groups).

❇️ The working class age group (25–44) is getting arrested the most. Once again, the local Government (stakeholders for this question) should come up with ways wherein, these people work towards adding value to the society and choose not to commit unlawful acts.

Further analysis in Power BI

The last piece of this puzzle is connecting to Power BI and presenting a report which helps the relevant stakeholders get some deeper insights into the data.

As mentioned previously, let’s try answering for what particular kinds of offences, minors are being arrested in New York City.

Perpetrators demographics

As you can see from the top right graph, about 4% of the arrests in New York City in 2023 were minors. If you do a drill-down on that particular part of the plot, it will take you to this…

Offence description

Some of the crimes for which minors are being arrested in New York City include robbery, assault, and grand larceny. These are some serious offences and initiatives should be taken “before” these teenagers are pushed down the path of crime. I also strongly believe that early intervention can also help reduce the number of arrests later on in life as most of the time, these are repeat offenders.

Conclusion

This brings us to the end of this project. Even though, it was a simple ETL pipeline, we covered a lot of ground in regard to Data Engineering practices.

We looked at how we can write code to control the number of API calls we make. We looked at the importance of data quality testing. Finally, we also learned to put ourselves in stakeholders’ shoes and present a report which can help them make intelligent decisions.

--

--