Introducing Canadian Football Datasette

Published on January 28, 2019 8:41 PM by dbo.

CFLdb is happy to introduce Canadian Football Datasette as a resource for Canadian Football researchers. The datasette provides a data source to perform queries on game data and export the results, or export the available games for study and research using their own (advanced) tools and techniques. In addition, the datasette site provides a simple API to interface with the data programatically for advanced users. This experiment is provided to the community with the request for feedback to determine the value and feasibility of expanding the information in the datasette.

What is the Canadian Football Datasette?

The datasette consists of regular season and playoff games of the IRFU, WIFU, CFC and CFL along with playoffs games for other leagues competing for the Grey Cup (including service teams during the Second World War years), all Grey Cup games, and the Dominion Championship games for 1907 and 1908 for a total of 6,264 games. This is roughly equivalent to the regular season games for the year-by-year standings and playoff game results listed in the CFL Guide and Record Book (previously CFL Facts, Figures and Records). There may be playoff games included or missing in the datasette compared to the 2018 CFL Guide and Record Book; see below for how to report corrections.

The information available for each game includes year, game number, week, game event, game date/time, visitor team, visitor score, home team, home score, overtime, quarter scores by team (when available) and more. See below for detailed list of fields and explanation and caveats on their contents.

The information available consists of the 1907-2011 standard game information provided to CFLdb Statistics by the CFL in 2012. Data from 2012 to 2018 has been compiled from official CFL sources. Updates in upstream sources for historical data may not have been updated in the published data. Where data is not available, fields will be empty. See the field list below for details on which fields may not contain information for all games. Data is provided as-is and without warranty. Consumers of the data accept the terms of the site, and accept their responsibility for validating data or accept an unknown level of inaccuracy in the data.

The data is licensed with the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This requires all published usage of the data to attribute the source, and if you remix, transform or build upon the data, you must distribute your contributions under the same license. This is not optional, using the data in any works requires attribution and distributing the data under the same license if you remix, transform or build upon the data in any way. If this is not agreeable, then the Canadian Football Datasette data should not be downloaded and used.

How to use Canadian Football Datasette

Canadian Football Datasette is a read-only database of Canadian Football game data fronted by an interface for querying the information. At the introduction of the datasette, there is a single source of information to query — games. From the home page, click on the games link to see the data in the games source.

From the games source, all games are shown by default. The results can be paged through with 100 results displayed per page. To get the most out of a datasette, filtering the data source provides a way to explore the data, search for patterns and examine smaller result sets.

There are two methods of filtering or querying information in sources: the simple interface and the SQL interface.

The Simple Interface

The simple interface allows filtering by multiple fields and ordering the data. At the top of the page, select a column, comparison and value to filter by. Multiple fields can be filtered on to build complex queries; click the Apply button to apply the filter, and another field filter row will be added to further narrow the results if required. Sorting can be applied to a single field only. There is no limit on the results for the simple interface, all results will be available, paged in 100 game increments. The page forward and backward buttons are found at the bottom of the result table.

By default all columns are returned in the result set, and this cannot be changed.

Facets

Facets provide an interface for quickly exploring and filtering the data. A list of fields suggested for facets appear above the result table. Clicking on a facet will show a list of values for that field along with a count of rows with that value. Clicking on a value from that list will show only records for that value, allowing fast drill-down into the data. Suggested facets are listed for fields with few unique values. If more than 30 unique values exist, only the most frequent 30 values will be shown and an ellipsis will appear at the end of the list to indicate there are more values not listed. Facets can be combined with filters to quickly explore subsets of data. For instance, the year field is not facetted (there are over 100 years in the data source), but a filter can be used to select all years less than 1958, for example, and then the home team facet used to show the counts for each team, with the ability to quickly filter to each team’s home games.

The SQL Interface

Note: The SQL Interface is for advanced, technical individuals only. No support is provided. While technical minded self-learners are encouraged to explore the SQL interface using the many resources available on the Internet, others should avoid frustration and stick to the simple interface.

To initiate the SQL Interface, click the View and Edit SQL button from the game data source page. This can be done after any simple interface filter is applied to use those filters and sorting as a starting point for the SQL statement in the editor.

The SQL statement editor allows the use of any valid SQL SELECT statement to show results. This includes limiting fields displayed, filtering using WHERE clause, ordering results by multiple fields, using aggregate functions and GROUP BY, and performing calculations.

For example, the following query will show all games where the home team score more than 42 points than the visiting team, ordering the results by year, restricting the returned columns, and adding a calculated column with the score differential.

select year, game_event, game_datetime, visitor, home, (home_score - visitor_score) as score_differential
from game 
where (home_score - visitor_score) > 42
order by year

SQL queries are limited to a maximum 2000 non-paged results. Queries should be used for drilling into the data, not returning all results. If a condition occurs more than 2000 times, it is considered not unique enough to be targeting the condition. Queries can use the LIMIT statement to limit results, and page through results manually. See the documentation links below for more information. If this limitation is a problem, consider downloading the data and importing into the tool of your choice to perform the data manipulation needed.

Experienced SQL users will be able to use the SQL interface to create advanced queries to explore and mine the data source for answers to almost any question found in the data.

Downloading Data

The complete data or subset from any filter/query can be downloaded in json or csv format. This is recommended for technical individuals only and how-to instructions are not detailed here.

For more information on advanced datasette topics, see the following (for advanced, technical individuals only):

The Experiment

The Canadian Football Datasette is provided as an experiment. As such, there are no guarantees over the availability of the site nor its long term future (see below). While open to the public, the site is prevented from being indexed by search engines at this time. If the datasette initiative continues after its experiment period, I commit to updating it with the results of new seasons following the conclusion of each season.

Civil and thoughtful feedback is requested from the research community. Is the datasette a helpful resource? What information is missing that would make the datasette more valuable? What other areas, beside games, would provide the most value in a public datasette?

Corrections

Think that data is included or excluded in error? Found an issue with data in the collection? Courteous and civil corrections are welcome. Here is an example:

I’ve been using the Canadian Football datasette and believe it is missing these games from the collection that the CFL considers part of its and Grey Cup history.

[list of games]

My source for this is the [name/link source]. I believe these games should be included to make the datasette match official sources.

Make adjustments to the above if the correction is a deletion or report of inaccurate data, and feel free to paraphrase, but be civil, humble, concise and complete. Reports can be made by email or the contact form found on the main CFLdb contact page. All reports should clearly indicate the games by id, date and participants if in the database, and include all data if a requested addition (a csv file is acceptable for this). I will not perform research to fill in data for partial addition requests, nor guess or read minds as to what games in the database are being referred to.

Field Definitions

A list of fields, their purpose and other details on the data contained is found below. All fields without values are blank/null. This may be because the information is not available (attendance, quarter scores), or because it is not applicable (game_series for regular season games).

id INTEGER PRIMARY KEY
Internal unique identifier given to each game. It has no meaning except as a reference in the CFLdb Statistics database.
year INTEGER
Season game belongs to.
game_number INTEGER
An index for the game sequence within a season.
week TEXT
Week of the season. This data is generated, as for historical seasons there wasn’t a concept of weeks, and teams played multiple times in a calendar week.
game_event TEXT
Indicates type of game, either regular season or type of playoff (Semi-final, Final, Inter-league Playoff, Grey Cup Semi-final, Tie-breaker, Dominion Championship or Grey Cup).
game_series TEXT
For playoff games, indicates the game number and series length. Two-game total points series will be listed as 1/2 and 2/2. Best 2 of 3 series will be listed as 1/3, 2/3 and 3/3 if necessary. Single game playoffs are listed as 1/1.
game_datetime TEXT
The date of the game in ISO 8601 format. If the game time is known, it is included, otherwise the time component is midnight. Timezone displayed is UTC, which is 4 hours ahead of ET. Converting games without known time to local timezone will result it the date listed being 1 day before the actual game date.
visitor TEXT
A short textual representation for the visiting team, using standard city/team abbreviations. There is overlap between team abbreviations (e.g. all Montreal IRFU/CFC/CFL teams use MTL) and the datasette does not conform to the abbreviations used in most recent CFL publications. See visitor_id for exact team.
visitor_id INTEGER
An id representing the visiting team. Details on the team found in the team table. Queries using a field referencing the team table will result in a link, which will display the record details for that team_id.
visitor_score INTEGER
The final score for the visiting team in the game.
home TEXT
A short textual representation for the home team, using standard city/team abbreviations. Same details for the visitor field apply, see that field for details. When a game is at a neutral site, the team that home game should belong to or would be considered the home team is listed as the home team.
home_id INTEGER
An id representing the home team. Same details for the visitor_id field apply, see that field for details.
home_score INTEGER
The final score for the home team in the game.
overtime TEXT
Boolean field to indicate whether overtime was played to decide the game. “t” indicates true, “f” indicates false.
forfeit_team TEXT
If a team forfeited the game (by action or by decision of league), their textual representation will be listed here, otherwise this field will be blank. As a result of a forfeit, the opposing team is granted the win.
forfeit_team_id INTEGER
A team_id representing the forfeit team, blank when game not forfeit.
valid_game TEXT
A boolean field indicating whether the game is valid. Non-valid games are not counted in standings nor their decisions honoured. “t” indicates true, a valid game, and “f” indicates a non-valid game. A non-valid game is usually replayed at a later date.
standing_points REAL
Indicates the number of standings points available to the winning team. Ties usually result in a splitting of the standings points. Relevant in the regular season only.
attendance INTEGER
Reported attendance at the game. When attendance is not known, this field is blank.
visitor_division TEXT
The division membership of the visiting team. If playing in a league with no divisions (e.g. IRFU), this will be “None”.
home_division TEXT
The division membership of the home team. If playing in a league with no divisions (e.g. IRFU), this will be “None”.
cfl_id INTEGER
The unique id the CFL has assigned the game on its online platform. Games not available on the online platform, and therefore have no unique id in this manner, have a blank cfl_id. Useful for referencing these games with the CFL and others.
cfl_index INTEGER
The unique id Steve Daniel has assigned all IRFU, WIFU, CFC and CFL regular season games, as provided to CFLdb. All other games will have a blank cfl_index. Useful for referencing these games with the CFL.
game_datetime_epoch INTEGER
An integer representing the game time in seconds since the epoch. Useful for programic conversion to a date time object . Not meant to be human readable.
v_q1 INTEGER
The points scored by the visiting team in the first quarter.
h_q1 INTEGER
The points scored by the home team in the first quarter.
v_q2 INTEGER
The points scored by the visiting team in the second quarter.
h_q2 INTEGER
The points scored by the home team in the second quarter.
v_q3 INTEGER
The points scored by the visiting team in the third quarter.
h_q3 INTEGER
The points scored by the home team in the third quarter.
v_q4 INTEGER
The points scored by the visiting team in the fourth quarter.
h_q4 INTEGER
The points scored by the home team in the fourth quarter.
v_ot INTEGER
The points scored by the visiting team in all overtime periods/possessions.
h_ot INTEGER
The points scored by the home team in all overtime periods/possessions.

Next Steps and Community Responsibility

CFLdb owes its existence to the Canadian Football research community and there have been many requests for making the data on CFLdb Statistics available to researchers and developers to create their own works. The Canadian Football Datasette is the first step in providing the data and an API for researchers to try. Feedback from this initiative will drive whether the datasette direction is continued and expanded. Finding a way to make the data public and accessible with the minimal amount of technical knowledge is the goal, and an assessment on the results of this experiment will be made in June, 2019. A lack of feedback will indicate a lack of interest and usage, and the datasette will disappear and I will return back to the drawing board in search of a better idea to accomplish this goal. Feedback that provides improvements, better ideas, or clearer understanding of the needs will lead to improvements in the datasette or a pivot towards something better.

Conversation

Comments are closed. Continue the conversation on Twitter.

Meta

Introducing Canadian Football Datasette was published on January 28, 2019 8:41 PM by dbo.

2,701 words.

This article is categorized under Research and tagged with datasette.