STAT 29000: Project 12 — Fall 2020
Motivation: Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share.
Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
Scope: SQL, sqlite, joins
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
A public sample of the data can be found here.
For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd: ````markdown
Each solution should then be placed in a code chunk like this: ````markdown
|
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
````markdown
SELECT * FROM station LIMIT 5;
``
This will allow the code to be displayed without throwing an error.
There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN).
Questions
Question 1
Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite
concatenate operator ||
see here. Write a query that returns the first 5 records of information from the station
table formatted in the following way:
(id) name @ (lat, long)
For example:
(84) Ryland Park @ (37.342725, -121.895617)
Here is a video about how to concatenate strings in SQLite. |
-
SQL query used to solve this problem.
-
The first 5 records of information from the
station
table.
Question 2
There is a variety of interesting weather information in the weather
table. Write a query that finds the average mean_temperature_f
by zip_code
. Which is on average the warmest zip_code
?
Use aliases to format the result in the following way:
Zip Code|Avg Temperature
94041|61.3808219178082
Note that this is the output if you use sqlite
in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.
Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT |
-
SQL query used to solve this problem.
-
The results of the query copy and pasted.
Question 3
From (2) we can see that there are only 5 zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip
table? Write a query that finds the number of unique zip_code`s in the `trip
table. Write another query that lists the zip_code
and count of the number of times the zip_code
appears. If we had originally assumed that the zip_code
was related to the location of the trip itself, we were wrong. Can you think of a likely explanation for the unexpected zip_code
values in the trip
table?
There could be missing values in |
-
SQL queries used to solve this problem.
-
1-2 sentences explainging what a possible explanation for the `zip_code`s could be.
Question 4
In (2) we wrote a query that finds the average mean_temperature_f
by zip_code
. What if we want to tack on our results in (2) to information from each row in the station
table based on the zip_code
? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.
Use the query from (2) as a sub query within your solution. |
Here is a video about JOIN and LEFT JOIN. |
-
SQL query used to solve this problem.
Question 5
In (3) we alluded to the fact that many zip_code
in the trip
table aren’t very consistent. Users can enter a zip code when using the app. This means that zip_code
can be from anywhere in the world! With that being said, if the zip_code
is one of the 5 zip_code
for which we have weather data (from question 2), we can add that weather information to matching rows of the trip
table. In (4) we used an INNER JOIN to append some weather information to each row in the station
table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather
table to the trip data from the trip
table. Limit your output to 5 lines.
Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the |
You will want to wrap your dates and datetimes in sqlite’s |
-
SQL query used to solve this problem.
-
First 5 lines of output.