Automated vs Manual Transcription Service: Comparative Characteristic

04.20.2021

Historically, speech recognition techniques have evolved alongside computers. The task was originally posed as recovering the text of separately spoken words. Only in recent decades has computer technology reached a level at which recognizing continuous or even spontaneous speech became a meaningful goal.

At this stage, it turned out that recognizing individual sounds and words (commands) with a reliability comparable to that of a human listener is not enough to solve the speech recognition problem. Practice has shown that, when recognizing continuous speech, a person relies on their own knowledge of the natural language and of the meaning of what is said to resolve ambiguity in reconstructing the text of a sentence. It is therefore natural to divide the speech recognition problem into two independent problems:

  1. The problem of local speech recognition (that is, recognition of a separate command);
  2. The problem of recovering a continuous speech text from a set of possible recognition hypotheses.

To solve the first problem, knowledge of the nature of the speech production process is essential. There are universal models of this process that are common to various natural languages. The solution to the second problem, on the contrary, strongly depends on the characteristics of the natural language in which the words are pronounced. In fact, at this level, the construction of a speech transcription system for each new language group requires its own special mathematical and technical approaches and comes down to the use of some consistent formal model of this natural language.

Transcription is widely used today for processing and documenting the materials of meetings and conferences at various levels, and in the work of secretaries, journalists, and so on. Computers have significantly expanded the capabilities of audio- and video-to-text transcription systems and made them more flexible to use. Reducing the share of manual labor in such systems is now a pressing task.

To that end, automatic speech recognition can be used to convert sound into text, since it greatly simplifies the operator’s work, reducing it to correcting the errors made by the transcription system. Even so, manual transcription of audio and video files remains relevant because it delivers the highest quality and accuracy.

Taking all of the above into account, it seems appropriate to consider automated vs manual transcription services in more detail and highlight the main characteristics and features of each service’s application.

Automated Transcription Service: Basic Characteristics

Automated transcription is a rapidly developing area of artificial intelligence. Significant advances have been made over the past half-century, and there are many commercial applications that make investment in this area worthwhile and profitable.

The first such application worth noting is the call center, or IVR (Interactive Voice Response) system, which gives automatic access to information without an operator. In modern call centers, the user asks questions in natural language, and the computer synthesizes the answer in the user’s language. The introduction of such call centers has freed up a huge number of operators and improved the quality of service at many airports and railway stations.

Automated speech recognition systems are widely used in medical research where information must be entered while the operator’s hands are busy (for example, during X-ray examinations) or are needed to control autonomous devices for examining internal organs. In advanced medical institutions, even medical records are filled out by voice by mid-level personnel.

An important area of application of automated speech transcription systems is helping people with disabilities, both those with musculoskeletal impairments and the visually impaired (assistive technologies).

Automated Transcription Service Capabilities

The main characteristics of a modern automated transcription service are the following:

  • Vocabularies of tens to hundreds of thousands of words;
  • Continuous speech recognition and transcription;
  • Real-time operation;
  • The ability to work both with and without preliminary tuning to the speaker’s voice;
  • Recognition reliability of 95–98% for grammatically correct texts.
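
The article does not say how the reliability figure is measured. One common convention, assumed here, is word-level accuracy: one minus the number of word substitutions, insertions, and deletions divided by the length of a reference transcript. The sketch below illustrates that convention in Python; the function name `word_accuracy` is illustrative and not taken from any particular service.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - (word edit distance / number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance computed over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / len(ref)


# One dropped word out of six reference words.
accuracy = word_accuracy("the cat sat on the mat", "the cat sat on mat")
```

Under this definition, a transcript with many inserted words can score below zero, so the value is best read as an error-based measure rather than a strict percentage of correct words.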

How Does the Automated Transcription Process Work?

The digitized speech signal is fed to the input of the computer. The signal is then divided into windows taken at a constant step. For each window, the acoustic analysis unit calculates a vector of spectral parameters, most often cepstral coefficients together with their first and second discrete derivatives.
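
The acoustic analysis step can be sketched roughly as follows. This is a minimal illustration in pure Python, not the implementation of any particular service: it uses the plain real cepstrum (inverse transform of the log magnitude spectrum) and a naive DFT, whereas production systems typically use mel-frequency cepstral coefficients, a window function, and a fast transform. The window size, step, and number of coefficients are arbitrary assumptions.

```python
import cmath
import math


def split_into_windows(signal, size, step):
    """Cut the signal into windows of `size` samples taken at a constant step."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def real_cepstrum(window, n_coeffs):
    """First n_coeffs real cepstral coefficients of one window."""
    n = len(window)
    # Naive O(n^2) DFT, acceptable for a small illustration.
    spectrum = [sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # A small constant avoids log(0) for empty frequency bins.
    log_mag = [math.log(abs(s) + 1e-10) for s in spectrum]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
            for q in range(n_coeffs)]


def deltas(vectors):
    """Discrete derivative over time: central difference between neighbouring frames."""
    last = len(vectors) - 1
    return [[(b - a) / 2 for a, b in zip(vectors[max(i - 1, 0)], vectors[min(i + 1, last)])]
            for i in range(len(vectors))]


def extract_features(signal, size=64, step=32, n_coeffs=4):
    """One vector per window: cepstral coefficients plus first and second derivatives."""
    ceps = [real_cepstrum(w, n_coeffs) for w in split_into_windows(signal, size, step)]
    d1 = deltas(ceps)
    d2 = deltas(d1)
    return [c + a + b for c, a, b in zip(ceps, d1, d2)]


# A synthetic 256-sample sine wave stands in for digitized speech.
signal = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
features = extract_features(signal)
```

With these settings the 256-sample signal yields seven windows, each described by twelve numbers: four cepstral coefficients, four first derivatives, and four second derivatives.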

The parameter vectors are fed sequentially to the input of the local recognition unit. This unit is usually based on a universal monotone probabilistic automaton that combines the reference probabilistic automata of all the natural-language words the transcription service works with. Each time a new analysis window arrives at the input of this unit, the directed weighted graph of recognition hypotheses is updated: new hypotheses about the spoken sequence of words are added, and existing hypotheses whose probability falls below a fixed threshold are removed. When the last vector of parameter values arrives, only those hypotheses that end with a whole (complete) word of the language are kept in the graph. The choice of phonetic alphabet plays an essential role in the effective functioning of the local recognition unit.
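
The add-and-prune update of the hypothesis set can be illustrated with a deliberately simplified Python sketch. It works at the level of whole candidate words rather than individual analysis windows, stores hypotheses in a dictionary instead of a graph, and uses made-up probabilities; the function name `update_hypotheses` and the threshold value are assumptions for illustration only.

```python
def update_hypotheses(hypotheses, candidates, threshold):
    """Extend every hypothesis with each candidate, then prune weak ones.

    hypotheses: dict mapping a tuple of words to its probability
    candidates: dict mapping a candidate next word to its local probability
    threshold:  hypotheses with a probability below this value are removed
    """
    extended = {}
    for words, prob in hypotheses.items():
        for word, local_prob in candidates.items():
            extended[words + (word,)] = prob * local_prob
    return {words: p for words, p in extended.items() if p >= threshold}


# Hypothetical local recognition results for two consecutive stretches of speech.
steps = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.7, "beach": 0.3},
]

hypotheses = {(): 1.0}  # start with a single empty hypothesis
for candidates in steps:
    hypotheses = update_hypotheses(hypotheses, candidates, threshold=0.15)
```

In this toy run, the least likely combination falls below the threshold and is dropped, while the other three survive. Practical recognizers usually work with log-probabilities and prune relative to the best current hypothesis, which avoids numerical underflow on long utterances.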

Knowledge about the structure of the natural language is then used to select, from the graph, a single sentence hypothesis as the recognition result. The language model (most often based on a statistical approach) makes it possible to choose, among all the paths in the hypothesis graph, the one with the maximum overall probability. That hypothesis is taken as the result of recognition.
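
As a rough illustration of this selection step, the Python sketch below scores each candidate word sequence by combining its acoustic probability with a bigram language model and keeps the best one. The bigram table, the probabilities, and the names `sentence_log_prob` and `best_hypothesis` are all hypothetical. A real system would search the hypothesis graph with dynamic programming instead of enumerating complete paths.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "recognize"): 0.01,
    ("recognize", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.01,
}


def sentence_log_prob(words, bigram, floor=1e-6):
    """Log-probability of a word sequence under the bigram model.

    Unseen word pairs receive a small floor probability instead of zero.
    """
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(bigram.get((previous, word), floor))
        previous = word
    return total


def best_hypothesis(hypotheses, bigram):
    """Pick the word sequence with the highest combined log-probability.

    hypotheses: dict mapping a tuple of words to its acoustic probability
    """
    def score(item):
        words, acoustic_prob = item
        return math.log(acoustic_prob) + sentence_log_prob(words, bigram)

    return max(hypotheses.items(), key=score)[0]


hypotheses = {
    ("recognize", "speech"): 0.3,
    ("wreck", "a", "nice", "beach"): 0.4,
}
best = best_hypothesis(hypotheses, BIGRAM)
```

Here the second hypothesis has the higher acoustic probability, but the language model considers its word sequence far less likely, so the first hypothesis wins overall.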

It should be noted that the described system works effectively only after training on text and acoustic databases (corpora) that are sufficiently large and representative. Text databases are necessary for training language models and testing their effectiveness, and speech databases are needed to adjust the parameters of the local recognition algorithms, which are most often based on monotone probabilistic automata. Collecting and processing such databases is perhaps one of the most laborious stages in building a speech transcription service, and it requires a sufficiently complete natural language vocabulary and morphological analysis systems.

Manual Transcription Service

Manual speech-to-text conversion is the simplest and highest-quality method, and at the same time the most time-consuming. In this case, a person listens to the dictated text and types it out. A previously recorded sound file can also be used for this purpose.

It is appropriate to resort to manual conversion when automated speech conversion programs cannot cope with the task. For example, this can happen when the recording (in the form of a media file) is of low quality, contains extraneous noise or music, or has several people talking at the same time. There are programs designed to facilitate the manual transcription of speech to text.

Ways to Speed Up Manual Speech Transcription

Each transcriber gradually comes to a convenient work format. The most important task is to speed up the process as much as possible without losing quality. Several methods can help here:

  • At a high typing speed, typing in sync with the voice works well when the audio is played at a reduced speed. This method requires several conditions: excellent sound quality, a coherent and articulate speaker, and no corrections during typing, with all edits made at the subsequent verification stage;
  • The method most transcribers use is to listen to a piece of audio, memorize it, type the text, listen to the next piece, and so on. Its advantage is that the necessary edits can be made immediately. In very difficult cases, it is the only way to transcribe the recording. A bonus is memory training: over time, the length of the passage you can memorize will increase;
  • Using macros in the Word text editor. Macros are instructions that users compose themselves according to their own needs and that the program executes automatically when the corresponding hotkeys are pressed. The most common uses of macros are replacing double spaces with single ones, replacing a hyphen between spaces with a dash, setting all text in one font of a given size, etc.;
  • At a slow typing speed, you can use automated speech recognition services, followed by painstaking editing.
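
The clean-up rules mentioned for macros can also be expressed outside a text editor. The sketch below is a Python illustration of the first two rules, not a Word macro. The function name `clean_transcript` is hypothetical, and the choice of an en dash as the replacement character is an assumption, since the text only says "a dash".

```python
import re


def clean_transcript(text):
    """Apply two typical clean-up rules to a transcribed text."""
    # Collapse runs of two or more spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Replace a hyphen that stands between two spaces with an en dash.
    text = re.sub(r"(?<= )-(?= )", "\u2013", text)
    return text


cleaned = clean_transcript("Well  -  it is a well-known  fact")
```

Hyphens inside words, such as the one in "well-known", are left untouched because the second rule only matches a hyphen surrounded by spaces.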

As the analysis shows, both automated and manual transcription services are effective, but in some cases the manual method is more appropriate than the automated one for ensuring the quality of the transcribed text.

File Types We Transcribe
  • AIFF/AIF
  • AMR
  • AVI
  • CAF
  • DSS
  • DVD
  • DVF
  • M4A
  • MOV
  • MP2
  • MP3
  • MP4
  • MSV
  • Quicktime
  • WAV
  • Webex
  • WMA
  • WMV