刘老师

verify-tagVisualizing Chicago Crime Data

social sciencepublic safety

1

已售 0
90.47MB

数据标识:D17220775567439093

发布时间:2024/07/27

以下为卖家选择提供的数据验证报告:

数据描述

#Prelude This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

#About the Dataset ##Important Facts The dataset comes directly from the City of Chicago's website under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) and is updated daily to present the information accurately. This means that a crime on a specific date may be changed to better display the case. The dataset represents crimes starting all the way from 2001 to seven days prior to today's date. ##Reliability Using the ROCCC method, we can see that:

  • The data has high reliability: The data covers the entirety of Chicago from a little over 2 decades. It covers all the wards within Chicago and even gives the street names. While we may not have an idea for how big the sample size is, I do believe that the dataset has high reliability since it geographically covers the entirety of Chicago.
  • The data has high originality: The dataset was gained directly from the Chicago Police Dept. using their database, so we can say this dataset is original.
  • The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic location, I do not think this gives us proper insights as to why these crimes take place. We can pinpoint the location of the crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a neighborhood with low-income? I believe that these key factors prevent us from getting proper insights as to why these crimes take place, so I would say that this dataset is subpar with how comprehensive it is.
  • The data is current: The dataset is updated frequently to display crimes that took place seven days prior to today's date and may even update past crimes as more information comes to light. Due to the frequent updates, I do believe the data is current.
  • The data is cited: As mentioned prior, the data is collected directly from the polices CLEAR system, so we can say that the data is cited.

#Processing the Data ##Cleaning the Dataset The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to do the following:

  • Check for any null values and determine whether we should remove them.
  • Update any values where there may be typos.
  • Check for outliers and determine if we should remove them.

The following steps will be explained in the code segments below. (I used BigQuery for this so the coding will follow BigQuery's syntax)

#Examining the dataset #There are over 7.5 million rows of data #Putting a limit so it does not take a long time to run SELECT   * FROM   `portfolioproject-350601.ChicagoCrime.Crime` LIMIT 1000; 
#Seeing which points are null #There are 85,000 null points so we can exclude them as it's not a significant amount since it is only ~1.3% of the dataset #Most of the null points are in the lat and long, which we will need later #Because we don't have the full address, we can't estimate the lat and long in SQL so we will have to delete the rows with Null Data SELECT    *  FROM    `portfolioproject-350601.ChicagoCrime.Crime` WHERE    unique_key IS NULL OR   case_number IS NULL OR   date IS NULL OR   primary_type IS NULL OR   location_description IS NULL OR   arrest IS NULL OR   longitude IS NULL OR   latitude IS NULL; 
#Deleting all null rows DELETE FROM    `portfolioproject-350601.ChicagoCrime.Crime` WHERE      unique_key IS NULL OR   case_number IS NULL OR   date IS NULL OR   primary_type IS NULL OR   location_description IS NULL OR   arrest IS NULL OR   longitude IS NULL OR   latitude IS NULL; 
#Checking for any duplicates in the unique keys #None to be found SELECT   unique_key,   COUNT(unique_key) FROM   `portfolioproject-350601.ChicagoCrime.Crime` GROUP BY   unique_key HAVING   COUNT(unique_key) > 1; 
#Counting other unique values  SELECT   COUNT(DISTINCT(primary_type)) AS unique_crime_type_count,   COUNT(DISTINCT(description)) AS unique_description_count,   COUNT(DISTINCT(location_description)) AS unique_location_count FROM   `portfolioproject-350601.ChicagoCrime.Crime`; 
#Double checking the primary types #There are some typos, two values shouldn't be there #To be specific, it's crim sexual assault and non - crime SELECT   DISTINCT(primary_type) FROM    `portfolioproject-350601.ChicagoCrime.Crime` ORDER BY   primary_type; 
#Updating values to clean dataset UPDATE `portfolioproject-350601.ChicagoCrime.Crime` SET primary_type = 'CRIMINAL SEXUAL ASSAULT'  WHERE primary_type = 'CRIM SEXUAL ASSAULT';  UPDATE `portfolioproject-350601.ChicagoCrime.Crime` SET primary_type = 'NON-CRIMINAL'  WHERE primary_type = 'NON - CRIMINAL'; 
#Seeing the count for each primary type SELECT   primary_type,   COUNT(primary_type) AS count FROM   `portfolioproject-350601.ChicagoCrime.Crime` GROUP BY    primary_type ORDER BY   primary_type; 
#Checking latitudes and longitudes  #The latitude and longitude points are (-91, 36), which leads to Missouri #This is a huge jump from Illinois SELECT   * FROM   `portfolioproject-350601.ChicagoCrime.Crime` WHERE   longitude < -88 OR   longitude > -87 OR   latitude < 41 OR   latitude > 43; 
#Deleting outliers DELETE FROM    `portfolioproject-350601.ChicagoCrime.Crime` WHERE   longitude < -88 OR   longitude > -87 OR   latitude < 41 OR   latitude > 43; 

##Organizing the Data For this step, I focused solely on creating a dataset that could easily be translated to a Tableau Dashboard. I planned everything ahead for what I wanted to show so that way I could organize the data more efficiently. The visualizations would include:

  • A filter for the year and what type of crime occurred
  • A map displaying the latitude and longitude points color coded by district number
  • A time series graph displaying the crime taking place over the years
  • A description and count table
  • A graph that showed the amount of crimes taking place each month

After planning this out, I organized the data in such a way that would easily allow me to do this, which included:

  • Making sure each dataset had the date and primary type to create a filter
  • Creating a time series database that could be used for multiple tables (i.e. the count table, month table, etc.)
  • Creating a dataset that had the latitude and longitude along with the district number to color code it on the map

The following code segments will explain how I did each step.

#Creating a table to set up a geographic map SELECT   EXTRACT(DATE FROM date AT TIME ZONE "UTC") AS date,   CAST(district AS string) AS district,   latitude,   longitude,   primary_type,   COUNT(primary_type) AS crime_count FROM   `portfolioproject-350601.ChicagoCrime.Crime` GROUP BY   date, district, latitude, longitude, primary_type ORDER BY   date ASC; 
#Creating a time series table displaying arrests vs total crimes #Starting by creating two temporary tables that will have arrests and total records #Will merge the two together WITH total_time AS(   SELECT     EXTRACT(DATE FROM date AT TIME ZONE "UTC") as date,     primary_type,     COUNT(arrest) AS crime_count,   FROM     `portfolioproject-350601.ChicagoCrime.Crime`   GROUP BY     date, primary_type ), arrest_time AS(    SELECT     EXTRACT(DATE FROM date AT TIME ZONE "UTC") as date,     primary_type,     COUNT(arrest) AS arrest_count,   FROM     `portfolioproject-350601.ChicagoCrime.Crime`   WHERE     arrest = true     GROUP BY     date, primary_type )  SELECT   t.date,   t.primary_type,   t.crime_count,   a.arrest_count,   t.crime_count - a.arrest_count AS false_count FROM   total_time t JOIN   arrest_time a ON a.date = t.date     AND a.primary_type = t.primary_type ORDER BY   date; 
#Creating another table to dispay description SELECT   EXTRACT(DATE FROM date AT TIME ZONE "UTC") AS date,   primary_type,   description FROM   `portfolioproject-350601.ChicagoCrime.Crime` ORDER BY   date ASC; 

#Visualizing the Data After organizing the dataset into three subsets, I used Tableau to create tables which would later come together to create an interactive dashboard. This portion will focus on the tables I created and what variables I used to create the visualizations. ##Table 1 - Geographic Location Table Table1 The first table I made was a table showing the geographic location of each crime. The variables I used included:

  • The latitude and longitude points
  • The district number, which was used to cluster the different geographic points
  • The crime count, which is the total number of crimes reported. The total number of crimes reported are shown by how big each point is.

##Table 2 - Time Series Chart Table2 The second table made was a time series chart, which displays the total number of crimes, arrests, and crimes where there were no arrests. As a quick note, this dataset was not just used for this table but also for two others as well. (Which we will see later)

##Table 3 - Count Chart Table3 The third table, which uses the same dataset as the time series chart, displays specific metrics for each crime, including:

  • The total number of crimes within a given year
  • The total number of arrests within a given year
  • The average amount of crime occurrences experienced daily

##Table 4 - Description Table Table4 The fourth table is a description for a given crime within a given year and the total number of crimes for that description. (i.e. the total amount of armed instances for Burglary, the total amount of stealing over $500 for theft, etc.)

##Table 5 - Total Amount of Crimes by Month Table 5 The fifth and final table, which uses the same dataset as tables 2 and 3, displays the monthly count for a given crime within a given year.

##Final Product - Chicago Crime Dashboard Dashboard All these tables come together to create an interactive dashboard that displays the metrics for a given crime within a given year. The filter (which is on the right) displays the list of crimes along with a slider to choose for a specific year. While I thought it was best to use the crime filter for the time series chart, I excluded the year filter from it. My reasoning for this is that I thought it was best to show the accumulative timeline throughout the years for a specific crime rather than a specific one. Say, for instance, there was a year where crime was at an all-time high in comparison to other years. This point will spike on the graph and will allow users to tune in on that specific year and see the metrics for that crime.

data icon
Visualizing Chicago Crime Data
1
已售 0
90.47MB
申请报告