Our dataset is divided into three partitions: the original frames, the computed optical flow, and the extracted skeletons. We received a request to include the audio for multimodal approaches and have included it below. The audio is missing 25 clips that have since been taken down from YouTube, but the majority of the videos are still available.
The skeletal data follows the same folder structure as the original frames, and the skeletal points are given at the same resolution as their original frame, so they can be overlaid on the frame without rescaling. The data is formatted as JSON, so parsing it with a JSON library is likely the most convenient approach (we use simplejson in Python).
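As a rough illustration of how the mirrored folder structure can be used, here is a minimal sketch in Python. The root folder names (`frames`, `skeletons`), the `.json` extension, and the one-file-per-frame layout are assumptions made for the example, not a description of the actual archive, so adjust them to match what you download. The helper names are hypothetical. The sketch uses the standard-library `json` module; `simplejson` exposes the same `load` call and can be swapped in.

```python
import json  # simplejson offers the same load() interface and can be used instead
from pathlib import Path


def skeleton_path_for(frame_path, frames_root, skeletons_root):
    """Map a frame path to the skeleton file at the same relative location.

    Assumes (hypothetically) one JSON file per frame, with a .json extension.
    """
    relative = Path(frame_path).relative_to(frames_root)
    return Path(skeletons_root) / relative.with_suffix(".json")


def load_skeleton(path):
    """Parse a skeleton file and return the decoded JSON object as-is."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

For example, `skeleton_path_for("frames/clip1/0001.jpg", "frames", "skeletons")` yields `skeletons/clip1/0001.json`. Because the points share the resolution of their frame, the decoded coordinates should be usable directly as pixel positions in that frame.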