Transcription Formatting and Editing Styles: Wide Variety of Choices

04.20.2021

Transcription is the process of converting speech from an audio or video recording into text. Traditionally, this has been done manually, but advances in speech recognition technology are gradually making it possible to automate the task. A recording can be uploaded to a speech recognition service, which then returns the text.

What Is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) can be defined as the independent, computer-driven transcription of spoken language into readable text in real time. In a nutshell, ASR is a technology that lets a computer identify the words a person speaks into a microphone or telephone and turn them into written text.

Although speech recognition technology is not yet at the point where machines can understand all words in any acoustic environment, it is used on a daily basis in a number of applications and services.

The ultimate goal of ASR research is to recognize in real time, with 100% accuracy, all words spoken by any person, regardless of vocabulary size, noise, speaker, or accent. Today, if the system is trained to recognize a particular speaker’s voice, much larger vocabularies can be supported, and accuracy can exceed 90%.

As a rule, ASR systems use methods based on hidden Markov models (HMMs).

Acoustic models of the phonemes that make up the general acoustic model are obtained by first training the system on a large data set containing tens or hundreds of hours of recorded speech together with its transcription. Acoustic models account for the allophonic variability of pronunciation (variation within a phoneme). Speaker-independent speech recognition requires training on the speech of many hundreds of speakers.

The language model (LM) specifies possible sequences of words either explicitly or as probabilities of one word following another. In the latter case, the LM is obtained by analyzing a large body of text in advance.
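To make the idea of "probabilities of one word following another" concrete, here is a minimal sketch of a bigram model built from raw counts. The toy corpus and the function name train_bigram_model are made up for illustration; real systems train on far larger text collections and apply smoothing, which this sketch omits.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw bigram counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Count every adjacent pair (previous word, next word).
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalize counts so the probabilities for each word sum to 1.
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


# Hypothetical toy corpus, used only to illustrate the idea.
corpus = [
    "recognize speech",
    "recognize speech quickly",
    "recognize words",
]
model = train_bigram_model(corpus)
print(model["recognize"])  # "speech" is twice as likely as "words" here
```

In this toy corpus, "recognize" is followed by "speech" in two of three cases, so the model prefers "speech" when the acoustic evidence alone is ambiguous.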

The third component of an ASR system is a dictionary of word forms with their transcriptions (the recognition dictionary), which is used directly in the recognition process. This dictionary is where the variability of pronunciation at the phonemic level should be reflected. However, simply expanding the dictionary with more pronunciation variants sometimes decreases recognition accuracy rather than increasing it, because different words end up represented by the same or similar transcriptions. Nevertheless, carefully choosing the number of transcription variants per word has made it possible to increase recognition accuracy from 78% to 85.7%.
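The trade-off described above can be sketched in a few lines: adding a pronunciation variant to one word may make it collide with another word's transcription. The lexicon below is a hypothetical toy example with ARPAbet-style phoneme strings, and find_confusable_words is an illustrative helper, not part of any particular ASR toolkit.

```python
from collections import defaultdict


def find_confusable_words(lexicon):
    """Return every transcription that is shared by more than one word."""
    words_by_transcription = defaultdict(set)
    for word, variants in lexicon.items():
        for variant in variants:
            words_by_transcription[variant].add(word)
    return {
        transcription: sorted(words)
        for transcription, words in words_by_transcription.items()
        if len(words) > 1
    }


# Hypothetical lexicon with one transcription per word: no collisions.
base_lexicon = {
    "read": ["R IY D"],
    "red": ["R EH D"],
}

# Adding the past-tense pronunciation of "read" creates a collision with "red".
expanded_lexicon = {
    "read": ["R IY D", "R EH D"],
    "red": ["R EH D"],
}

print(find_confusable_words(base_lexicon))      # {}
print(find_confusable_words(expanded_lexicon))  # {'R EH D': ['read', 'red']}
```

A check like this only finds identical transcriptions; near-identical ones, which the text also mentions, would need a phoneme-level distance measure.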

Manual Transcription

Manual transcription means doing the work yourself without any auxiliary programs. The only software used for this method is a text editor.

Transcriptionists type very quickly and pay close attention to detail. Before transcribing, the specialist listens to the entire recording to determine its topic, the speakers, and the specifics of the conversation. Having formed a general impression of the audio recording, the specialist proceeds to transcribe it into text.

The accuracy of manual transcription is as high as it gets. The downside is that it takes a lot of time. For example, transcribing a roughly two-hour audio recording can take a whole day.

Taking all of the above into account, it is worth noting that modern tools combining automated software, segmentation techniques, and artificial intelligence can now produce text with about the same accuracy as humans, while taking much less time.

Content Editing

There are different styles of editing and formatting the recognized text. Therefore, when asking an online service to transcribe your audio file, remember to specify which style should be followed.

When it comes to content editing, there are three main styles:

Full verbatim. Speech is converted into text exactly as it sounds on the recording, including speech errors, repetitions, etc.;
Clean verbatim. Filler words and interjections that carry no meaning are removed from the text;
Edited verbatim. The editor carefully checks the text, corrects the grammar, fixes sentence structure, and brings the text to a high level of readability.
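As a rough illustration of the difference between full and clean verbatim, the sketch below strips a short list of filler words from a full-verbatim sentence. The FILLERS list and the function name to_clean_verbatim are assumptions made for this example; a real service would use a much richer set of rules and human review, and this version handles only simple cases.

```python
import re

# Assumed, deliberately short list of fillers; extend as needed.
FILLERS = ("um", "uh", "er", "hmm")

# Match a whole filler word, plus an optional trailing comma and whitespace.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)


def to_clean_verbatim(text):
    """Remove filler words from a full-verbatim transcript line."""
    cleaned = _FILLER_RE.sub("", text)
    # Collapse any double spaces left behind by the removal.
    cleaned = re.sub(r"\s{2,}", " ", cleaned)
    return cleaned.strip()


full_verbatim = "Um, I think, uh, we should start."
print(to_clean_verbatim(full_verbatim))  # I think, we should start.
```

Edited verbatim goes further than this: it involves grammar and sentence-structure corrections that a simple pattern match cannot do.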

Formatting Styles

When it comes to formatting the text after it has been recognized from an audio file, there are also a number of styles to choose from:

Identification of speakers. When an audio file is transcribed, the participants may be labeled separately, or the text may be presented as if one person were speaking. In practice, the price of transcription services also varies depending on this choice. You can check this with the manager of the online platform where you place your order;
Timestamps. As a rule, timestamps are provided every 2-3 minutes or whenever the speaker changes. However, if the timestamps need to be produced in some other way, the required style can be set;
SRT. In this style, timestamps are provided at specific intervals. As a rule, this format is used for closed captioning;
NVivo. This style is used to produce high-quality interview transcripts that will later be processed in the NVivo software. Academic researchers often use NVivo to analyze particular words in a qualitative study, but in practice market researchers can also use it successfully;
Market research. This style is chosen by market researchers who do not need NVivo. There are no timestamps, but the moderator and the participants are clearly distinguished;
Specific formatting. Those who do not find the desired formatting style among the above can state special requirements and set the necessary formatting style for the recognized text.
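For readers curious what the SRT style looks like in practice, here is a minimal sketch that turns timed segments into SRT caption blocks: a running index, a line with start and end times in HH:MM:SS,mmm form separated by " --> ", and then the caption text. The segment data and the function names format_srt_time and to_srt are made up for this example.

```python
def format_srt_time(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


# Hypothetical segments, as a transcription step might produce them.
segments = [
    (0.0, 2.5, "Hello and welcome."),
    (2.5, 5.0, "Let's get started."),
]
print(to_srt(segments))
```

The same segment list could just as easily be rendered with speaker labels or with timestamps every few minutes, which is why the choice of style is best made before the transcript is produced.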

Choose the formatting and editing style you need for the transcribed file and get high-quality text that meets all your requirements!