
Generating Text Using ML

December 15, 2020 by Kurt Ellzey

Most of us have that one TV show that we want to see just one more episode of. Just one more reunion or reboot or movie to finish off a storyline that wasn’t finished. Sometimes it happens, and sometimes we didn’t realize how many leaves on the wind it would cost. 

But what if it was possible to create new episodes ourselves based on what was already out there? No, this did not suddenly turn into a fanscript site, believe me. What we are talking about is giving an AI encyclopedic levels of knowledge about a particular topic, then asking it to generate new text in that style.

Still not a fanscript site.

Neural networks excel at processing enormous amounts of data and locating patterns within them. So when we have a large amount of known good source information, such as spending habits at stores, driving patterns for Los Angeles, or 30+ seasons of a particularly popular television show, we can ask the right questions of a piece of software and see what new material it produces based on the patterns it found. What we’re going to go over today is what it would take to Generate Text using Machine Learning (ML).

 

How would we Generate Text using ML?

You may have seen references in various locations saying “I fed a thousand hours of <blank> into this bot and this is what it came up with!” More often than not the results are…interesting, to say the least. For instance, Marco Marchesi was able to use the GPT-2 language model by OpenAI, along with Transformers, a library that provides a large number of models pre-trained on various tasks, to attempt to create a new Monty Python sketch. He started with a model that already contained a decent amount of general knowledge, then added all of the available content from the Monty Python’s Flying Circus series. The content produced in this way may not necessarily be ready for prime time, but it certainly is in the ballpark of what he was looking for.
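To make the “feed it text and see what comes out” idea a little more concrete, here is a minimal sketch in Python of one of the simplest text generators around: a word-level Markov chain. To be clear, this is not GPT-2 and not the approach described above. It only remembers which word followed each pair of words in the source text, which is a far cruder kind of pattern-finding than a neural network does. The function names `build_chain` and `generate` and the tiny corpus are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting at most `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed gives a repeatable start
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: nothing ever followed this pair of words in the corpus.
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output[:length])


# Hypothetical stand-in corpus; a real run would load scripts or lyrics from a file.
corpus = (
    "the parrot is not dead the parrot is resting "
    "the shop is not open the shop is closed"
)

chain = build_chain(corpus, order=2)
print(generate(chain, length=12, seed=42))
```

With a corpus this small the output mostly reshuffles the input. The more source text the chain sees, the more varied (and occasionally more interesting) the results get, which is the same trade-off the larger models face on a much bigger scale.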

Others have used ML to try to create new songs in the style of various artists. The site lyrics.rip, for example, has an enormous listing of artists that it can attempt to generate lyrics for in close to real time. Still others have tried to write like Shakespeare.

Speaking of Shakespeare, Machine Learning was recently used to help identify just how much of a Shakespearean play Shakespeare actually wrote. Henry VIII has long been suspected of being a collaboration between Shakespeare and another playwright by the name of John Fletcher. Using ML, research from the Czech Academy of Sciences in Prague was able to estimate, down to the line, which passages were written by Fletcher and which by Shakespeare.

The possible uses of this technology are limited only by the available material, but one question remains. Why?

Why would we Generate Text using ML?

Say, for instance, that you had a particular instructor that you could always learn more easily from. With enough time and reference material, it would be possible to produce a learning plan on any topic, tuned exactly to your requirements and delivered in exactly the medium that would be most effective for you. Chatbots such as those used in numerous technical support situations could react in a far more human manner. On the flip side, however, this could also be used to generate malicious messages that a person never wrote, but that would be incredibly difficult to distinguish from the genuine article. If someone gained access to an executive’s sent items folder and was able to copy all of that data, with enough time it could theoretically be used to write a very convincing phishing email as if it were from that person.

This would become all the more frightening when combined with other technologies such as Deepfakes. In fact, in 2016 Adobe showed a proof-of-concept piece of software known as Project Voco at its Adobe MAX presentation. It was called “Photoshop for Voice” at the time, and showed at least the potential to synthesize words and short phrases in a person’s voice closely enough to be indistinguishable if you weren’t listening for it. With good enough source material to start from, it would effectively sound like the person to a decent degree. So on a positive note, this would allow news organizations to have anchors effectively available around the clock without the person becoming a zombie in the process. On the other hand, it could also be used to incredible detriment in politics and legal proceedings.

Conclusion

This technology has a huge amount of potential, and will continue to grow by leaps and bounds in the coming years. It absolutely has both legitimate and malicious uses, and is going to require a considerable re-think of how users are trained to spot potential threats. As it is now, only the most sophisticated spear-phishing attempts are able to completely convince the target that the sender is someone else. Combine convincingly generated text with falsified senders, leaving very little to indicate that the sender is not who they claim to be, and it would open up a whole new can of worms for Information Security. We will absolutely want to be vigilant for new breakthroughs in this field for the foreseeable future if we want to be able to protect our organizations, our users and ourselves from potential attack vectors.

 

Sources

  1. Lyrics.rip “Generate Lyrics – All magic done by a Markov Chain” – https://www.lyrics.rip/
  2. Machine Learning has revealed exactly how much of a Shakespeare play was written by someone else – https://www.technologyreview.com/2019/11/22/131857/machine-learning-has-revealed-exactly-how-much-of-a-shakespeare-play-was-written-by-someone/
  3. Relative contributions of Shakespeare and Fletcher in Henry VIII: An Analysis Based on Most Frequent Words and Most Frequent Rhythmic Patterns – https://arxiv.org/abs/1911.05652
  4. Machine Learning with (Monty) Python – https://www.linkedin.com/pulse/machine-learning-monty-python-marco-marchesi
  5. Can AI write like Shakespeare? – https://towardsdatascience.com/can-ai-write-like-shakespeare-de710befbfee
  6. Deepfakes – /topic/deepfake/
  7. Adobe Audio Manipulator Sneak Peak with Jordan Peele | Adobe Creative Cloud – https://www.youtube.com/watch?v=I3l4XLZ59iw

Kurt Ellzey has worked in IT for the past 12 years, with a specialization in Information Security. During that time, he has covered a broad swath of IT tasks from system administration to application development and beyond. He has contributed to a book published in 2013 entitled "Security 3.0" which is currently available on Amazon and other retailers.