In our everyday conversations, we often come across speech disfluencies that hinder clear and fluent communication. These disfluencies, such as repeated words, fillers, and hesitations, disrupt the natural flow of speech and can create barriers to effective understanding. Recognizing the importance of seamless communication, I set out to explore techniques for detecting and correcting disfluencies. In this blog post, I will share my experience and introduce the Disfluency Correction API—a user-friendly solution that utilizes a T5-based model to identify and rectify speech imperfections, ultimately enhancing speech clarity and fluency.
Discovering the Disfluency Dataset
To train my disfluency detection and correction model, I discovered a valuable dataset provided by Google Research: Disfl-QA, available at "https://github.com/google-research-datasets/Disfl-QA". It consists of a diverse collection of questions containing various disfluencies, including repeated words and fillers, each paired with its corrected (fluent) version. By leveraging this dataset, I aimed to improve my model's ability to accurately recognize and rectify these speech imperfections.
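For illustration, here is a minimal sketch of how the paired sentences can be read from the repository's train.json file, assuming its layout of question IDs mapped to "original" (fluent) and "disfluent" fields:

import json

# Each entry maps a question ID to its fluent ("original") and disfluent versions
with open("train.json") as f:
    data = json.load(f)

# Build (disfluent, fluent) training pairs for the correction model
pairs = [(entry["disfluent"], entry["original"]) for entry in data.values()]
print(len(pairs), "training pairs loaded")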
Fine-Tuning the T5 Transformer
For the development of my disfluency correction model, I turned to the T5 transformer model—a powerful tool widely used in natural language processing tasks. Renowned for its language understanding capabilities, the T5 model provided a solid foundation. By fine-tuning the pre-trained "t5-base" version, I harnessed the potential of this model and adapted it specifically to the task of detecting and correcting disfluencies.
To fine-tune the T5 transformer model, I followed these steps (a condensed code sketch appears after the list):
Prepare the Dataset: I created a dataset consisting of paired sentences, with each disfluent sentence accompanied by its corresponding corrected version. This dataset serves as the training data for the disfluency correction model.
Set the Training Parameters: I defined the hyperparameters for training, including the batch size, learning rate, number of epochs, and other relevant parameters.
Initialize the T5FineTuner: Using the T5 transformer model and tokenizer, I instantiated the T5FineTuner class. This class encapsulates the model architecture and logic for training and inference.
Configure the Optimizer: I set up the optimizer, AdamW, with appropriate parameters, such as weight decay and epsilon. Additionally, I implemented a learning rate scheduler to adjust the learning rate during training.
Initialize the Trainer: To facilitate the training process, I used the PyTorch Lightning Trainer. It handles the training loop, batch iterations, and validation steps, simplifying the training pipeline.
Start the Training: With the Trainer set up, I initiated the training process. The model was trained on the prepared dataset, and the parameters were updated based on the defined loss function and optimizer.
Save the Fine-Tuned Model: Once the training was completed, I saved the fine-tuned model. This preserved the model's weights, configuration, and other necessary files for future use.
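The snippet below is a condensed sketch of those steps, not my exact training script; the DisflDataset class, the simplified T5FineTuner, and the hyperparameters shown are illustrative stand-ins, and the pairs variable is assumed to hold the (disfluent, fluent) tuples loaded from Disfl-QA as shown earlier.

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer, get_linear_schedule_with_warmup

class DisflDataset(Dataset):
    # Wraps (disfluent, fluent) sentence pairs as tokenized source/target tensors
    def __init__(self, pairs, tokenizer, max_len=256):
        self.pairs, self.tokenizer, self.max_len = pairs, tokenizer, max_len
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        disfluent, fluent = self.pairs[idx]
        source = self.tokenizer("Original: " + disfluent, max_length=self.max_len,
                                padding="max_length", truncation=True, return_tensors="pt")
        target = self.tokenizer(fluent, max_length=self.max_len,
                                padding="max_length", truncation=True, return_tensors="pt")
        labels = target["input_ids"].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": source["input_ids"].squeeze(0),
                "attention_mask": source["attention_mask"].squeeze(0),
                "labels": labels}

class T5FineTuner(pl.LightningModule):
    # Minimal LightningModule: loss computation plus optimizer/scheduler setup
    def __init__(self, model_name="t5-base", lr=3e-4):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.lr = lr
    def training_step(self, batch, batch_idx):
        loss = self.model(**batch).loss
        self.log("train_loss", loss)
        return loss
    def configure_optimizers(self):
        # AdamW with weight decay and epsilon, plus a linear learning-rate schedule
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr, eps=1e-8, weight_decay=0.01)
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10000)
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]

tokenizer = T5Tokenizer.from_pretrained("t5-base")
train_loader = DataLoader(DisflDataset(pairs, tokenizer), batch_size=8, shuffle=True)
module = T5FineTuner()
trainer = pl.Trainer(max_epochs=3)
trainer.fit(module, train_loader)
module.model.save_pretrained("./t5")  # model directory loaded by the API below
tokenizer.save_pretrained("files")    # tokenizer directory loaded by the API below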
Creating the Disfluency Correction API
To make the benefits of disfluency correction accessible to others, I developed an intuitive API using Python and the Flask framework.
The code snippet below demonstrates the implementation details:
import os
import sys
import torch
from flask import Flask, jsonify, request
from transformers import T5ForConditionalGeneration, T5Tokenizer
app = Flask(__name__)
app.config["DEBUG"] = True
# Load the fine-tuned T5 model and tokenizer saved after training
model = T5ForConditionalGeneration.from_pretrained('./t5', from_tf=False)
tokenizer = T5Tokenizer.from_pretrained('files')
device = torch.device("cpu")
model.to(device)
# top_k sampling value, passed as a command-line argument
k_value = int(sys.argv[1])
def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
@app.route('/similarsentences', methods=['GET', 'POST'])
def perform_model_inference():
    set_seed(42)
    posted_data = request.get_json()
    sentence = posted_data['sentence']
    fillerFlag = True
text = "Original: " + sentence + "</>"
max_len = 256
encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    beam_outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_masks,
        do_sample=True,
        max_length=256,
        top_k=k_value,
        top_p=0.98,
        early_stopping=True,
        num_return_sequences=1
    )
print("\nDisfluent Question :: ")
print(sentence)
print("Original Questions :: ")
final_outputs = []
length = len(sentence)
for beam_output in beam_outputs:
sent = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print("Model Sentence: ", sent)
if len(sent) <= length:
final_outputs.append(sent)
if len(final_outputs) == 0:
final_outputs.append(sentence)
print(final_outputs)
fillers = ('a', 'ah', 'um', 'umm', 'aa', 'aaa', 'oo', 'oh', 'hmmm', 'hmm')
finalRes = []
if fillerFlag:
words = final_outputs[0].split(' ')
for word in words:
if word not in fillers:
finalRes.append(word)
outputRes = ' '.join(finalRes)
return jsonify({'result': outputRes})
if __name__ == '__main__':
    app.run(host='10.2.158.150', port=8495)
In the code above, I import the necessary libraries and set up a Flask application. The T5 transformer model and tokenizer are loaded, enabling accurate disfluency detection and correction within the API. The API includes a perform_model_inference() function that handles incoming requests, processes the input sentence, and generates corrected versions using the T5 model. The disfluencies are filtered out, and the improved output is returned as a JSON response.
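For reference, a client could call the endpoint like this (the host, port, route, and JSON field match the snippet above; adjust them for your own deployment):

import requests

# Post a disfluent sentence to the running service and print the corrected result
response = requests.post(
    "http://10.2.158.150:8495/similarsentences",
    json={"sentence": "I mean I mean they are not giving the right information again right"},
)
print(response.json()["result"])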
Sample Output
Here's an example of the Disfluent Question and its corresponding Corrected version:
Disfluent Question: "I mean I mean they are not giving the right information again right"
Corrected Version: "Maybe they are not giving the right information?"
Applications of the Disfluency Correction API
The Disfluency Correction API opens up numerous possibilities for enhancing speech clarity and fluency. Here are a few potential applications:
Call Centers and Customer Support: The API can be integrated into call center systems to ensure more professional and effective customer interactions. By automatically detecting and rectifying disfluencies in real time, the API enables smoother and more engaging conversations.
Language Learning Platforms: Language learners can benefit from the API by receiving instant feedback on their pronunciation and fluency. By detecting and correcting disfluencies, the API helps learners refine their speaking skills, gain confidence, and improve their overall language proficiency.
Transcription Services: The API can streamline transcription processes by automating the identification and correction of disfluencies in recorded speech. This saves time, improves accuracy, and enhances the overall quality of transcriptions.
Public Speaking Coaching: Public speakers and presenters can leverage the API to improve their delivery. By receiving immediate feedback on disfluencies, speakers can refine their speech, enhance their overall presentation skills, and engage their audience more effectively.
Speech disfluencies can impede effective communication, but with the Disfluency Correction API powered by the T5 transformer model, we can overcome these challenges. By leveraging the Disfl-QA dataset and fine-tuning the T5 model, I have developed an intuitive API that enhances speech clarity and fluency. The API finds practical applications in call centers, language learning platforms, transcription services, and public speaking coaching, offering solutions across these domains.