Chapter 3 Data transformation
Because the two auxiliary data sources only supply supporting information for analyzing our main dataset, this chapter mainly discusses the transformation of the main dog bites dataset.
3.1 The column “DateOfBite”
In our main dog bites dataset, the data type of the column named “DateOfBite” is “Date & Time”, e.g., “January 02, 2015”. Because our analysis tracks the passage of time in months, we extract the month and the year from this column and add them as two new columns named “MonthOfBite” and “YearOfBite”. The original “DateOfBite” and the two new columns are shown below for the first few rows.
## # A tibble: 10,280 x 3
##    DateOfBite      MonthOfBite YearOfBite
##    <chr>                 <dbl>      <dbl>
##  1 January 02 2015           1       2015
##  2 January 02 2015           1       2015
##  3 January 02 2015           1       2015
##  4 January 01 2015           1       2015
##  5 January 03 2015           1       2015
##  6 January 05 2015           1       2015
##  7 January 04 2015           1       2015
##  8 January 05 2015           1       2015
##  9 January 04 2015           1       2015
## 10 January 04 2015           1       2015
## # ... with 10,270 more rows
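The extraction described above can be sketched as follows. This is a minimal illustration rather than the code actually used for this chapter: the toy data frame `bites` and the helper column `parsed` are assumptions, and the date format is inferred from the printed output. Commas are stripped first so that both “January 02, 2015” and “January 02 2015” parse the same way.

```r
library(dplyr)

# Toy rows standing in for the dog bites data (illustrative values only).
bites <- data.frame(
  DateOfBite = c("January 02 2015", "March 15, 2016", "December 31 2017"),
  stringsAsFactors = FALSE
)

# "%B" matches month names in the current locale, so switch LC_TIME to "C"
# (English month names) while parsing and restore the old setting afterwards.
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

bites <- bites %>%
  mutate(
    # Drop commas, then parse "Month DD YYYY" into a Date (helper column).
    parsed      = as.Date(gsub(",", "", DateOfBite), format = "%B %d %Y"),
    MonthOfBite = as.numeric(format(parsed, "%m")),
    YearOfBite  = as.numeric(format(parsed, "%Y"))
  ) %>%
  select(-parsed)

invisible(Sys.setlocale("LC_TIME", old_locale))

print(bites)
```

Keeping the month and year as numeric columns makes them convenient join keys for the monthly temperature data.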
3.2 The column “Breed”
Many rows have an empty “Breed” value; we replace these values with “UNKNOWN”. The “Breed” column below confirms that “UNKNOWN” now appears in some rows (e.g., rows 3 and 10).
## # A tibble: 10,280 x 1
##    Breed
##    <chr>
##  1 Poodle, Standard
##  2 HUSKY
##  3 UNKNOWN
##  4 American Pit Bull Terrier/Pit Bull
##  5 American Pit Bull Terrier/Pit Bull
##  6 American Pit Bull Terrier/Pit Bull
##  7 MORKIE
##  8 Chihuahua
##  9 PIT BULL MIXED
## 10 UNKNOWN
## # ... with 10,270 more rows
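One way to perform this replacement is sketched below. The toy data frame is hypothetical, and treating `NA` and whitespace-only strings as “empty” is an assumption; the source only says that the breed is empty.

```r
library(dplyr)

# Toy rows: one real breed, an empty string, a missing value,
# a whitespace-only string, and another real breed.
bites <- data.frame(
  Breed = c("Poodle, Standard", "", NA, "  ", "HUSKY"),
  stringsAsFactors = FALSE
)

bites <- bites %>%
  mutate(
    # Treat NA, "" and whitespace-only values as empty (an assumption),
    # and leave every other breed label untouched.
    Breed = if_else(is.na(Breed) | trimws(Breed) == "", "UNKNOWN", Breed)
  )

print(bites)
```

Note that this sketch does not harmonize spelling or capitalization, so labels such as “HUSKY” and “Chihuahua” remain as recorded.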
3.3 Temperature data from Average Temperature in NYC Data
We use the Average Temperature in NYC Data to look up the monthly average temperature for the month in which each dog bite occurred. We match rows on “MonthOfBite” and “YearOfBite” and add the matched temperature to the main dog bites dataset as a new column named “Avg_temp”. The “Avg_temp”, “MonthOfBite” and “YearOfBite” columns are shown below.
## # A tibble: 10,280 x 3
##    Avg_temp MonthOfBite YearOfBite
##       <dbl>       <dbl>      <dbl>
##  1     29.9           1       2015
##  2     29.9           1       2015
##  3     29.9           1       2015
##  4     29.9           1       2015
##  5     29.9           1       2015
##  6     29.9           1       2015
##  7     29.9           1       2015
##  8     29.9           1       2015
##  9     29.9           1       2015
## 10     29.9           1       2015
## # ... with 10,270 more rows
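The matching step amounts to a left join on year and month, which could look like the sketch below. The layout of the temperature table (columns `Year`, `Month`, `Avg_temp`) is an assumption, since the source does not show it. The January 2015 value is taken from the output above, while the February value is a made-up placeholder.

```r
library(dplyr)

# Toy bite records that already carry the derived month and year columns.
bites <- data.frame(
  MonthOfBite = c(1, 1, 2),
  YearOfBite  = c(2015, 2015, 2015)
)

# Hypothetical layout of the monthly temperature table.
# 29.9 (January 2015) appears in the output above; 25.0 is a placeholder.
temps <- data.frame(
  Year     = c(2015, 2015),
  Month    = c(1, 2),
  Avg_temp = c(29.9, 25.0)
)

# A left join keeps every bite record, in its original order, and attaches
# the temperature of the matching (year, month) pair.
bites <- bites %>%
  left_join(temps, by = c("YearOfBite" = "Year", "MonthOfBite" = "Month"))

print(bites)
```

A left join is used so that a bite with no matching month keeps its row and receives `NA` in “Avg_temp” instead of being dropped.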
3.4 Borough data in NYC Dog Licensing Dataset
In our auxiliary dataset (NYC Dog Licensing Dataset), the values of “Borough” are missing for all rows, so we use the column “ZipCode” to assign each record to its borough. We looked up the range of ZIP codes belonging to each NYC borough and labeled each row with the corresponding “Borough” according to its “ZipCode”. The “ZipCode” and “Borough” columns are shown below.
## # A tibble: 493,072 x 2
##    ZipCode Borough
##      <dbl> <chr>
##  1   10035 Manhattan
##  2   10465 Bronx
##  3   10013 Manhattan
##  4   10013 Manhattan
##  5   10028 Manhattan
##  6   10013 Manhattan
##  7   10025 Manhattan
##  8   10013 Manhattan
##  9   11215 Brooklyn
## 10   11201 Brooklyn
## # ... with 493,062 more rows
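A range-based classification of this kind might be written as below. The ZIP code ranges are commonly cited approximations and are an assumption here, as the exact ranges used for this chapter are not listed. The toy data frame `licenses` is likewise hypothetical.

```r
library(dplyr)

# Toy licensing records; the first four ZIP codes appear in the output above,
# the remaining ones are added to exercise the other branches.
licenses <- data.frame(ZipCode = c(10035, 10465, 11215, 11201, 11375, 10301, 99999))

licenses <- licenses %>%
  mutate(
    # Approximate ZIP code ranges per borough (assumed, not from the source).
    Borough = case_when(
      between(ZipCode, 10001, 10282) ~ "Manhattan",
      between(ZipCode, 10301, 10314) ~ "Staten Island",
      between(ZipCode, 10451, 10475) ~ "Bronx",
      between(ZipCode, 11201, 11256) ~ "Brooklyn",
      between(ZipCode, 11004, 11109) |
        between(ZipCode, 11351, 11697) ~ "Queens",
      # Anything outside the ranges (or a missing ZIP code) stays unlabeled.
      TRUE ~ NA_character_
    )
  )

print(licenses)
```

Leaving out-of-range ZIP codes as `NA` rather than forcing them into a borough makes it easy to count how many records could not be classified.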