Note: The pretrained model is configured as specified in NaturalSpeech 2, so its channel counts and strides differ from those of the original SoundStream.
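The encoder strides set how many audio samples collapse into one latent frame. A checkpoint trained with one stride configuration therefore produces a different frame rate than the original SoundStream, and its weights will not fit a model built with SoundStream's channel and stride settings. The minimal sketch below computes the hop length and frame rate from a stride list. The helper names `hop_length` and `frame_rate` are hypothetical, and the example uses the original SoundStream strides (2, 4, 5, 8) at 24 kHz. Take the pretrained model's own values from its configuration, not from this sketch.

```python
from math import prod


def hop_length(strides):
    """Total downsampling factor of the encoder: the product of its per-block strides."""
    return prod(strides)


def frame_rate(sample_rate, strides):
    """Number of latent frames per second that the encoder produces."""
    return sample_rate / hop_length(strides)


# Strides of the original SoundStream encoder (24 kHz audio).
# These are NOT the pretrained model's values; read those from its config.
soundstream_strides = (2, 4, 5, 8)
print(hop_length(soundstream_strides))          # 320 samples per frame
print(frame_rate(24_000, soundstream_strides))  # 75.0 frames per second
```

Changing any stride changes the hop length. Code that assumes SoundStream's 320-sample hop will then misalign latent frames with the audio.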