[last update: 2024-05-06]
Download data
Main sources [accessed July 2023]
Additional sources
Quick guide
- books.json
- The "metadata" of the books: ID, titles (English and Chinese), links to the thumbnail and cover images and to the story's page on hambaanglaang.hk, attributions, and flags indicating whether the story is specifically for teens/adults, whether its sentence translations are verified, and whether it has a version in Standard Written Chinese.
- pages.json
- For technical reasons (having to do with the stroke animations) we have subdivided the stories into smaller parts; for lack of a better term we call these "pages". Each page record holds the point in the story's audio file (one file for the entire story) at which the page begins, as well as the "structure" of the page, that is, how its sentences and pictures fit together.
- sents.json
- The sentences: Chinese text, English translation, and a flag indicating whether the translation is verified.
- mand.json
- The sentences in Standard Written Chinese.
- glosses.json
- The glosses (subdivision of sentences into words): Chinese, English translation (these are part of the original Hambaanglaang books and thus verified).
- chars.json
- The romanization of the Chinese characters: jyutping, Yale, and the csdb ID. The csdb (Cantonese sounds database) is a collection of all available romanizations, links to audio files, visualizations, etc.; with the ID it's straightforward to use the data from csdb in the Hambaanglaang stories.
Note that the Hambaanglaang books also provide jyutping for English words and names when they are part of the Chinese text; we dropped those to get a clean data set.
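As a starting point, the files listed above could be loaded as follows. This is a minimal sketch, not part of the data set: the function name `load_tables` and its `data_dir` argument are made up for illustration, and it assumes that the files have been downloaded into one local folder and that each file holds a JSON array of flat records, which should be checked against the actual files.

```python
import json
from pathlib import Path

# File names as listed in the quick guide.
FILES = ["books", "pages", "sents", "mand", "glosses", "chars"]


def load_tables(data_dir):
    """Load every <name>.json found in data_dir into a dict keyed by name.

    Assumption: each file contains a JSON array of record objects.
    Files that are missing from data_dir are skipped.
    """
    tables = {}
    for name in FILES:
        path = Path(data_dir) / f"{name}.json"
        if path.exists():
            # The data contains Chinese text, so read explicitly as UTF-8.
            with path.open(encoding="utf-8") as fh:
                tables[name] = json.load(fh)
    return tables
```

Calling `load_tables` on the download folder would then give access to, for example, the book records under the key `"books"`.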
Info
- books.json
- num: the numbering of the stories as they appear in the overview; this follows the ordering on hambaanglaang.hk, which seems to follow some underlying system
- id: the five-digit number that Hambaanglaang uses for their stories; in our presentation of the stories we only use the last three digits
- lvl: the language level of the book (between 1 and 7)
- title_C: the Chinese title
- title_E: the English title
- verif_transl: a Boolean variable - true if the sentence translations are verified, false if they are not verified
- adult: a Boolean variable - true if the story is specifically for teens/adults, false if it's not
- thumb: a link to the thumbnail image (jpg, 200x200), source: hambaanglaang.hk with a few modifications to make them uniform
- cover: a link to the cover image (jpg, 912x513 or 912x643), source: screenshots of YouTube videos, see hambaanglaang.hk
- hbl_url: a link to the page on hambaanglaang.hk dedicated to this story
- attribs: attributions; stored as a single string that encodes a list, with the items separated by \n, source: description of YouTube videos, tidied up to make them more uniform
- pages.json
- id: five-digit story ID
- pg: page number
- audio: link to the audio file of the story with the exact starting time, so that playback starts at the beginning of the page, source: either the audio file linked on hambaanglaang.hk or the audio from the corresponding YouTube video; in most cases the audio has been edited, so it doesn't completely match the original
- struct: the order of sentences and pictures; stored as a single string that encodes a list, with the items separated by \n. Each item starts either with "S - ", indicating a sentence, followed by the sentence number, or with "I - ", indicating an image, followed by a link to the image. The images are screenshots of the corresponding YouTube videos or (in the case of "additional titles") are taken from the pdf of the book
- sents.json
- id: five-digit story ID
- pg: page number
- sent: sentence number
- C: Chinese text
- E: English text
- verif_transl: a Boolean variable - true if the sentence translation is verified, false if it's not
- mand.json
- id: five-digit story ID
- pg: page number
- sent: sentence number
- C: Chinese text in traditional Chinese characters
- C_s: Chinese text in simplified Chinese characters
- glosses.json
- id: five-digit story ID
- pg: page number
- sent: sentence number
- glo: gloss number
- C: Chinese text
- E: English text
- chars.json
- id: five-digit story ID
- pg: page number
- sent: sentence number
- char: character number
- C: Chinese character
- jyutping: jyutping romanization
- yale: Yale romanization
- csdb: ID from Cantonese sounds database
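The list-like string fields (attribs in books.json, struct in pages.json) can be unpacked with a few lines of code. The following is a minimal sketch under stated assumptions: the helper names `split_list_field`, `parse_struct` and `page_content` are made up for illustration; the \n separator is assumed to be a real newline once the JSON has been parsed; sentence numbers in struct are assumed to be plain integers stored the same way as the sent field in sents.json; and the triple (id, pg, sent) is assumed to identify a sentence uniquely.

```python
def split_list_field(value):
    """Split a newline-separated string field (e.g. attribs) into its items."""
    return [item.strip() for item in value.split("\n") if item.strip()]


def parse_struct(struct):
    """Turn a pages.json `struct` string into (kind, value) pairs.

    kind is "sentence" (value: sentence number as int, an assumption)
    or "image" (value: link to the image as str).
    Items with any other prefix are ignored.
    """
    items = []
    for item in split_list_field(struct):
        if item.startswith("S - "):
            items.append(("sentence", int(item[4:])))
        elif item.startswith("I - "):
            items.append(("image", item[4:].strip()))
    return items


def page_content(page, sents):
    """Return the page's Chinese sentences and image links in display order.

    `page` is one record from pages.json, `sents` the records from
    sents.json. Assumes `sent` is stored as an int in sents.json.
    """
    by_key = {(s["id"], s["pg"], s["sent"]): s for s in sents}
    content = []
    for kind, value in parse_struct(page["struct"]):
        if kind == "sentence":
            content.append(by_key[(page["id"], page["pg"], value)]["C"])
        else:
            content.append(value)
    return content
```

The same (id, pg, sent) key could be used to attach the records from mand.json, glosses.json and chars.json to a sentence; if the numbers turn out to be stored as strings, the `int(...)` conversion in `parse_struct` would need to be dropped.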