Chapter 2 Data sources
Our main data source is about dog bite incidents that happened in NYC from 2015 to 2017 reported to the Department of Health and Mental Hygiene (DOHMH). There are about 10.3k records in total. Each record represents a single dog bite incident and information on breed, age, gender. Spayed and neutered status are also provided. With those features given, we can explore the relationships between bite incidents and dogs’ related features.
We also choose two auxiliary data sources. The first one provides information about licensing in different areas in NYC. We want to use this data to explore more about licensing with areas in order to make a better assumption about relationship dog bitten incidents with environments. The second one reported the monthly mean average temperature for NYC, which we speculate has a certain effect on dog bites.
We collected these data sources together with discussion. We will introduce these three data sources separately.
2.1 Dog Bite Data
Dataset: DOHMH_Dog_Bite_Data.csv
We download this dataset from https://data.cityofnewyork.us/Health/DOHMH-Dog-Bite-Data/rsgh-akpg, which is collected by the Department of Health and Mental Hygiene from reports received online, mail, fax or by phone to 311 or NYC DOHMH Animal Bite Unit. We show the description of the 9 columns as follows.
Column | Description | Type |
---|---|---|
UniqueID | Unique dog bite case identifier | Number |
DateOfBite | Date bitten | Date & Time |
Species | Animal Type (Dog) | Plain Text |
Breed | Breed type | Plain Text |
Age | Dog’s age at time of bite. Numbers with ‘M’ indicate months. | Plain Text |
Gender | Sex of Dog. M=Male, F=Female, U=Unknown | Plain Text |
SpayNeuter | Surgical removal of dog’s reproductive organs (Y/N). Y=Yes (Spayed or Neutered), N = No (Not Spayed or Neutered) | Checkbox |
Borough | Dog bite Borough. Other’ indicates that the bite took place outside New York City | Plain Text |
ZipCode | Dog bite Zipcode. Blank ZipCode indicates that information was not available | Plain Text |
2.2 NYC Dog Licensing Data
Dataset: NYC_Dog_Licensing_Dataset.csv
We download this dataset from https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp, which is sourced from the DOHMH Dog Licensing System. There are about 493k records in total. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. We just use this data source to roughly reflect the distribution of dogs in NYC for some features (e.g., age, sex, and bread) to help our analysis. For example, in our main dog bites dataset, we can see the breed with the highest proportion of biting people. In order to see if the reason behind it is about its higher population among all dogs or simply about this breed’s natural potential of being aggressive, we could use this auxiliary dataset, which is much larger than the main dog bites dataset, to roughly estimate the population composition of different breeds and analyze the above question. With this exploration, we can have a clearer picture of the relationship between breeds and its potential aggressiveness. We do not want to make a rash decision of naming some breeds “aggressive ones” without comparing the proportion. In order to give a more scientific answer, this step is unignorable. We show the description of the 11 columns as follows.
Column | Description | Type |
---|---|---|
RowNumber | Row number | Number |
AnimalName | User-provided dog name (unless specified otherwise) | Plain Text |
AnimalGender | M (Male) or F (Female) dog gender | Plain Text |
AnimalBirthMonth | Month and year dog was born | Plain Text |
BreedName | Dog breed | Plain Text |
Borough | Owner Borough | Plain Text |
ZipCode | Owner zip code | Number |
LicenseIssuedDate | Date the dog license was issued | Date & Time |
LicenseExpiredDate | Date the dog license expires | Date & Time |
Extract Year | Extract Year | Plain Text |
Unique Dog ID | Unique Dog ID | Plain Text |
2.3 Average Temperature in NYC Data
Dataset: tempbymonth.csv
We download this dataset from https://www.weather.gov/wrh/climate?wfo=okx, which provides the monthly average temperature for NYC. This dataset records the average temperature for NYC from January 2015 to December 2017. During the process of analyzing our main dataset, we observed that the number of dog bites will change periodically over time (month). We speculate that this may be related to the temperature, thus we choose this data source to assist our analysis.
2.4 Issues with the data
First, our main dataset only contains the records of dog bites while does not contain data for the population of total dogs. With this dataset, we could not determine what causes the different proportions of dog bites for different breeds, boroughs, and other characteristics. For example, we can see the breed with the highest proportion of biting people. Why does this breed have the highest proportion of biting people? Because a large proportion of dog owners have this breed of dogs and it becomes the most common breed in NYC, or because of its natural potential of being aggressive? To solve this problem, we introduce our auxiliary dataset (NYC Dog Licensing Dataset), which is much larger than the dog bites dataset, to roughly estimate the population composition of dogs with different characteristics.
Second, in our datasets, some columns have missing values. The specific analysis is contained in the fourth part. For example, the column “Breed” in Dog Bite Data contains many missing values and the values of “Borough” is missing for all of the rows in NYC Dog Licensing Dataset. We will solve this question in 3 Data Transformation.