Shingling in the Generative AI Era

23 Jul 2024

In the Generative AI era, there is a growing need for techniques that support tasks ranging from detecting plagiarism (intentional or unintentional) and content duplication to enhancing natural language processing capabilities. With techniques like shingling, generative AI systems can uncover hidden relationships within text data, facilitating a deeper comprehension of content semantics.

What sets shingling's capabilities apart is the way it extends to various applications, including but not limited to, document clustering, information retrieval, and content recommendation systems. As you explore shingling further, you'll see how it helps AI models understand text better by breaking it down into smaller parts. This makes the models better at analyzing text and finding important information in long and complex sentences.

The article outlines the following:

Understand the concept of shingling.
Explore the basics of the shingling technique.
Advanced shingling techniques in generative AI.
Impact of shingling on generative AI performance, challenges, and considerations in shingling for generative AI.
Future directions in the field of Generative AI.

Understanding the Concept of Shingling

Shingling is a widely used technique for detecting textual similarity and mitigating duplication. This article introduces you to the concept of shingling and its applications in generative AI, and provides examples to enhance understanding. Shingling is the process of converting a string of text in a document into a set of overlapping sequences of words or letters. Programmatically, think of this as a list of substrings from a string value.

Let's take a string - "Generative AI is evolving rapidly." Let's denote the length of the shingle as k and set the value of k to 5.

The result is a set of 5-character shingles (the text is lowercased first):

{'i is ', ' evol', 'apidl', 'e ai ', 'ai is', 'erati', 've ai', 'rapid', 'idly.', 'ing r', ' ai i', 's evo', 'volvi', 'nerat', ' is e', 'ving ', 'tive ', 'enera', 'ng ra', 'is ev', 'gener', 'ative', 'evolv', 'pidly', ' rapi', 'olvin', 'rativ', 'lving', 'ive a', 'g rap'}

These overlapping sequences are called "shingles" or "n-grams." Shingles consist of consecutive words or characters from the text, creating a series of overlapping segments. The length of a shingle, denoted above as "k," varies depending on the specific requirements of the analysis; a common practice is to create shingles containing three to five words or characters.
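The character-level example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the code from the linked repository; the function name `char_shingles` is made up for illustration, and lowercasing the text is an assumption inferred from the lowercase output shown above.

```python
def char_shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-character shingles of `text`."""
    normalized = text.lower()  # assumption: compare case-insensitively
    # Slide a window of width k over the text, one character at a time.
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


shingles = char_shingles("Generative AI is evolving rapidly.", k=5)
print(len(shingles))  # 30 distinct shingles for this 34-character string
print("gener" in shingles, "idly." in shingles)
```

Because the result is a set, repeated shingles are stored once and the order of elements is not meaningful.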

Explore the Basics of Shingling Technique

Shingling is part of a three-step process:

Tokenization: If you are familiar with prompt engineering, you have probably heard of tokenization. It is the process of breaking up a sequence of text into smaller units called tokens. Tokens can be words, subwords, characters, or other meaningful units. This step prepares the text data for further processing by models. With word tokenization, the above example "Generative AI is evolving rapidly." will be tokenized into:

['Generative', 'AI', 'is', 'evolving', 'rapidly', '.']

For tokenization, you can use either Python's simple `split` method or a regular expression. Libraries like NLTK (Natural Language Toolkit) and spaCy provide more advanced options, such as stopword handling.
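To make the difference between the two simple approaches concrete, here is a small sketch using only the standard library. The regular expression is one reasonable choice among many, not a canonical tokenizer.

```python
import re

text = "Generative AI is evolving rapidly."

# Whitespace split keeps punctuation attached to the neighbouring word.
split_tokens = text.split()

# Regex: runs of word characters, or any single character that is
# neither a word character nor whitespace (i.e. punctuation).
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(split_tokens)  # ['Generative', 'AI', 'is', 'evolving', 'rapidly.']
print(regex_tokens)  # ['Generative', 'AI', 'is', 'evolving', 'rapidly', '.']
```

Note that `split` leaves the trailing period glued to "rapidly.", while the regex version yields it as its own token, matching the token list shown above.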

Link to the code - https://github.com/VidyasagarMSC/w-shingle

Shingling: As you know by now, shingling, also known as n-gramming, is the process of creating sets of contiguous sequences of tokens (n-grams or shingles) from the tokenized text. For example, with k=3, the sentence "Generative AI is evolving rapidly." would produce shingles like:

[['Generative', 'AI', 'is'], ['AI', 'is', 'evolving'], ['is', 'evolving', 'rapidly.']]

This is a list of shingles. Shingling helps capture local word order and context.
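A word-level shingler takes only a few lines. The sketch below assumes whitespace tokenization (so the final period stays attached to "rapidly.") and uses a hypothetical function name; it is not necessarily how the linked w-shingle script is written.

```python
def word_shingles(text: str, k: int = 3) -> list[list[str]]:
    """Return every run of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    # One shingle starts at each position that still has k tokens left.
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]


print(word_shingles("Generative AI is evolving rapidly.", k=3))
# [['Generative', 'AI', 'is'], ['AI', 'is', 'evolving'], ['is', 'evolving', 'rapidly.']]
```

A text with n tokens yields n - k + 1 shingles, so a five-token sentence with k=3 produces three.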

Hashing: Hashing simply means using special functions to turn any kind of data, like text or shingles, into fixed-size codes. Popular techniques built on hashing include MinHash and SimHash, both of which are forms of Locality Sensitive Hashing (LSH): they are designed so that similar inputs are likely to receive matching codes. Hashing enables efficient comparison, indexing, and retrieval of similar text segments. When you turn documents into sets of shingle codes, it's much simpler to compare them and spot similarities or possible plagiarism.
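As an illustration of the hashing step, here is a minimal MinHash sketch built on the standard library. The helper names, the choice of BLAKE2b with a per-seed salt, and the signature length of 64 are all assumptions made for this example; production systems typically use a dedicated library and tuned parameters.

```python
import hashlib


def shingle_hash(shingle: str, seed: int) -> int:
    """Return a 64-bit hash of `shingle`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles: set[str], num_hashes: int = 64) -> list[int]:
    """For each hash function, keep the smallest hash value over all shingles."""
    if not shingles:
        raise ValueError("need at least one shingle")
    return [min(shingle_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def word_shingle_set(text: str, k: int = 4) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


sig_1 = minhash_signature(word_shingle_set("The quick brown fox jumps over the lazy dog."))
sig_2 = minhash_signature(word_shingle_set("The quick brown fox jumps over the sleeping cat."))
print(f"Estimated Jaccard similarity: {estimate_jaccard(sig_1, sig_2):.2f}")
```

The signature has a fixed length regardless of document size, which is what makes comparing many documents cheap. The estimate is approximate and becomes more accurate as `num_hashes` grows.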

Simple Shingling

Let's consider two short text passages that are widely used to explain simple shingling:

Passage 1: "The quick brown fox jumps over the lazy dog."

Passage 2: "The quick brown fox jumps over the sleeping cat."

With a word size of 4, using the w-shingle Python script linked above, the shingles for Passage 1 would be:

python w_shingle.py "The quick brown fox jumps over the lazy dog." -w 4

[['The', 'quick', 'brown', 'fox'], ['quick', 'brown', 'fox', 'jumps'], ['brown', 'fox', 'jumps', 'over'], ['fox', 'jumps', 'over', 'the'], ['jumps', 'over', 'the', 'lazy'], ['over', 'the', 'lazy', 'dog.']]

For passage 2, the shingles would be:

python w_shingle.py "The quick brown fox jumps over the sleeping cat" -w 4

[['The', 'quick', 'brown', 'fox'], ['quick', 'brown', 'fox', 'jumps'], ['brown', 'fox', 'jumps', 'over'], ['fox', 'jumps', 'over', 'the'], ['jumps', 'over', 'the', 'sleeping'], ['over', 'the', 'sleeping', 'cat']]

By comparing the sets of shingles, you can see that the first four shingles are identical, indicating a high degree of similarity between the two passages.

Shingling sets the stage for more detailed analysis, such as measuring similarity with Jaccard similarity (the size of the intersection of two shingle sets divided by the size of their union). Picking the right shingle size "k" is crucial. Smaller shingles can catch small language details, while larger ones might show bigger-picture connections.
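The comparison of the two passages can be turned into a number with Jaccard similarity. The sketch below uses whitespace tokenization and hypothetical helper names; it illustrates the idea and is not the linked script.

```python
def shingle_set(text: str, k: int) -> set[tuple[str, ...]]:
    """Word shingles as a set of tuples, so that set operations work."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


p1 = shingle_set("The quick brown fox jumps over the lazy dog.", k=4)
p2 = shingle_set("The quick brown fox jumps over the sleeping cat.", k=4)

print(len(p1 & p2), len(p1 | p2))  # 4 shared shingles out of 8 distinct ones
print(jaccard(p1, p2))             # 0.5
```

Each passage has six shingles, four of which are shared, so the union holds eight distinct shingles and the similarity is 4/8.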

Advanced Shingling Techniques in Generative AI

The integration of machine learning models with shingling processes opens up new avenues for text analysis. Through supervised or unsupervised learning, AI can learn to identify patterns or anomalies in shingled text data more effectively, enhancing the precision of similarity measurements and content categorization.

A supervised model can be trained on a dataset of plagiarized and non-plagiarized text, where the shingles from the plagiarized text are labeled accordingly. During inference, the model can classify the shingles generated from a new document as plagiarized or non-plagiarized, enabling more accurate plagiarism detection.

This shingle classification can be useful in other applications like content moderation or topic categorization. Here’s a link to a paper that discusses various approaches to plagiarism detection systems based on deep learning.

Unsupervised machine learning techniques, such as clustering algorithms, can be applied to group similar shingles together based on their inherent patterns or representations.

This can be useful for tasks like document clustering, topic modeling, or identifying redundant or near-duplicate content. For example, documents with similar shingle clusters can be grouped together, enabling efficient organization and retrieval of related content.

As you may have noticed, the basic shingling technique treats all shingles equally. Here are some advanced shingling techniques used in generative AI models:

Weighted Shingling: Assigns different importance weights to shingles based on factors like position, frequency, or information content, allowing the model to focus on the most relevant shingles. For instance, in the context of machine learning, shingles like "neural networks" and "deep learning" would have higher weights compared to common words like "the" or "and."

You can check the weighted shingling Python code in this repository - https://github.com/VidyasagarMSC/shingling

Run the Python code with a sample text and words with corresponding weights.

python3 weighted-shingling.py \
"This is a sample text with some important words like neural networks and deep learning" \
'{"neural": 2, "deep": 2, "networks": 1.5, "learning": 1.5}' -k 5

List of Shingles with corresponding weights: [('this is a sample text', 5), ('is a sample text with', 5), ('a sample text with some', 5), ('sample text with some important', 5), ('text with some important words', 5), ('with some important words like', 5), ('some important words like neural', 6), ('important words like neural networks', 6.5), ('words like neural networks and', 6.5), ('like neural networks and deep', 7.5), ('neural networks and deep learning', 8.0)]

Here’s the link to a LinkedIn post that discusses the use of this technique

The function in the weighted-shingling Python file first tokenizes the input text into words and then generates all shingles of length k from the words. For each shingle, it computes the weight by summing the weights of the constituent words. If a word is not present in the weights dictionary, a default weight of 1 is used. The function returns a list of tuples, where each tuple contains the shingle and its corresponding weight.
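The behaviour described above can be sketched as follows. This is a reconstruction from the description and the sample output, not the repository's actual code; the function name, its signature, and the lowercasing step (inferred from the lowercase shingles in the output) are assumptions.

```python
def weighted_shingles(text: str, weights: dict[str, float], k: int = 5,
                      default_weight: float = 1) -> list[tuple[str, float]]:
    """Return (shingle, weight) pairs, where weight is the sum of its word weights."""
    words = text.lower().split()  # assumption: lowercase, whitespace tokens
    result = []
    for i in range(len(words) - k + 1):
        window = words[i:i + k]
        # Words missing from the dictionary fall back to the default weight.
        weight = sum(weights.get(word, default_weight) for word in window)
        result.append((" ".join(window), weight))
    return result


text = ("This is a sample text with some important words "
        "like neural networks and deep learning")
weights = {"neural": 2, "deep": 2, "networks": 1.5, "learning": 1.5}

pairs = weighted_shingles(text, weights, k=5)
print(pairs[0])   # ('this is a sample text', 5)
print(pairs[-1])  # ('neural networks and deep learning', 8.0)
```

For the last shingle, the weights add up as 2 + 1.5 + 1 + 2 + 1.5 = 8.0, with "and" contributing the default of 1.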

Semantic Shingling: Creates shingles from semantic concepts or word embeddings instead of surface tokens, capturing higher-level semantic similarities. For example, the shingles "machine learning algorithm" and "AI technique" might be considered semantically similar and grouped together, even though they don't share any common words.

This paper discusses various approaches to handle semantic shingling.

A detailed explanation with code examples for Semantic shingling.
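To give a feel for the idea without any model dependencies, the toy sketch below replaces each word with a concept label before shingling. The `CONCEPTS` table is hand-written and purely hypothetical; a real system would derive concepts from word embeddings, a knowledge graph, or a similar resource rather than a lookup table.

```python
import re

# Hypothetical, hand-written concept map used only for illustration.
CONCEPTS = {
    "car": "VEHICLE", "automobile": "VEHICLE",
    "quick": "FAST", "fast": "FAST", "rapid": "FAST",
}


def concept_shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Shingle over concept labels instead of surface words."""
    tokens = re.findall(r"\w+", text.lower())
    # Words without a known concept are kept as they are.
    concepts = [CONCEPTS.get(token, token) for token in tokens]
    return {tuple(concepts[i:i + k]) for i in range(len(concepts) - k + 1)}


a = concept_shingles("The quick car stopped")
b = concept_shingles("The fast automobile stopped")
print(a == b)  # True: different surface words, same concept shingles
```

The two sentences share only "the" and "stopped" at the word level, yet their concept-level shingle sets are identical.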

Hierarchical shingling is a technique that generates shingles (n-grams) at multiple levels of granularity, such as characters, words, phrases, or sentences, to capture patterns at various linguistic levels. This approach is particularly useful for structured data like HTML documents, where shingles can be created at different levels like words, tags, and nodes.

Consider the following HTML snippet:

<html>

<body>

<h1>Hello, World!</h1>

<p>This is a sample HTML document.</p>

</body>

</html>

We can generate shingles at three levels:

Word Level: Shingles are created from the words in the HTML text.

For n=3, word shingles would be: ['hello world this', 'world this is', 'this is a', 'is a sample', 'a sample html', 'sample html document']

Tag Level: Shingles are created from the HTML tags.

For n=3, tag shingles would be: ['<html><body><h1>', '<body><h1></h1>', '<h1></h1><p>', '</h1><p></p>', '<p></p></body>', '</p></body></html>']

Node Level: Shingles are created from the HTML nodes (tags and text content).

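For n=3, node shingles are built by sliding a three-item window over the node sequence ['<html>', '<body>', '<h1>', 'hello', 'world', '</h1>', '<p>', 'this', 'is', 'a', 'sample', 'html', 'document', '</p>', '</body>', '</html>'], giving shingles such as '<html> <body> <h1>', '<body> <h1> hello', and '<h1> hello world'.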

By capturing patterns at these different levels, the model can better understand the structure and generate well-formed HTML output. For example, the word shingles capture the textual content, the tag shingles capture the HTML structure, and the node shingles capture both content and structure. This research paper outlines the need for LLMs to understand HTML.

The key advantage of hierarchical shingling is that it allows the model to learn and represent patterns at multiple levels of granularity, which can lead to better understanding and generation of structured data.
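The three levels can be extracted from the HTML snippet with Python's built-in `html.parser`. This is a rough sketch under simplifying assumptions: attributes are ignored, words are lowercased, and the class and function names are invented for the example.

```python
import re
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collects tags, words, and the combined node sequence in document order."""

    def __init__(self):
        super().__init__()
        self.tags, self.words, self.nodes = [], [], []

    def handle_starttag(self, tag, attrs):
        self._add_tag(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        self._add_tag(f"</{tag}>")

    def handle_data(self, data):
        for word in re.findall(r"\w+", data.lower()):
            self.words.append(word)
            self.nodes.append(word)

    def _add_tag(self, token):
        self.tags.append(token)
        self.nodes.append(token)


def shingles(tokens, n, sep):
    """Join every run of n consecutive tokens with `sep`."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


snippet = """<html>
<body>
<h1>Hello, World!</h1>
<p>This is a sample HTML document.</p>
</body>
</html>"""

collector = NodeCollector()
collector.feed(snippet)
collector.close()

word_shingles = shingles(collector.words, 3, " ")
tag_shingles = shingles(collector.tags, 3, "")
node_shingles = shingles(collector.nodes, 3, " ")

print(word_shingles)
print(tag_shingles)
print(node_shingles[:3])
```

Each level is just a different token stream fed to the same sliding window, which is why one `shingles` helper serves all three.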

Dynamic Shingling: Adapts the shingle size based on input text characteristics, using shorter shingles for highly varied text and longer ones for capturing longer-range dependencies.
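One simple way to picture dynamic shingling is to pick k from a cheap statistic of the input. The heuristic below, which uses the ratio of unique tokens to total tokens, along with its threshold and the two k values, is entirely hypothetical and chosen only to illustrate the idea; real systems may adapt the size in very different ways.

```python
def choose_k(tokens: list[str], k_short: int = 2, k_long: int = 4,
             threshold: float = 0.7) -> int:
    """Pick a shingle size from the type-token ratio (hypothetical heuristic)."""
    if not tokens:
        return k_short
    variety = len(set(tokens)) / len(tokens)  # 1.0 means no repeated tokens
    # Highly varied text gets shorter shingles, repetitive text longer ones.
    return k_short if variety >= threshold else k_long


def dynamic_shingles(text: str) -> tuple[int, list[tuple[str, ...]]]:
    tokens = text.lower().split()
    k = choose_k(tokens)
    return k, [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


k_varied, _ = dynamic_shingles("alpha beta gamma delta epsilon zeta")
k_repetitive, _ = dynamic_shingles("the cat the cat the cat the cat")
print(k_varied, k_repetitive)  # 2 4
```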

Attentive Shingling: Employs attention mechanisms to learn to attend to and weigh the importance of different shingles during training, allowing the model to dynamically focus on the most relevant shingle patterns.

A real-world example could be plagiarism detection in academic papers:

The model is trained on a dataset of paper pairs labeled as plagiarized/non-plagiarized.
During training, an attention mechanism learns to assign higher weights to shingles that are highly indicative of plagiarism (e.g. rare technical terms, named entities) and lower weights to common shingles.
At inference time, when comparing a new paper pair, the model can focus more on the highly weighted shingles rather than treating all shingles equally.

These advanced techniques help generative AI models create more expressive, semantic, and context-aware text representations, enabling better generalization and coherent, high-quality text generation. These techniques improve the quality of learned patterns and enhance the model’s ability to capture and leverage relevant linguistic information from the input data.

The Impact of Shingling on Generative AI Performance

Shingling techniques play a crucial role in enhancing the performance of generative AI systems by improving text representations, enabling efficient similarity detection, ensuring high-quality training data, facilitating effective pretraining, and supporting semantic search capabilities.

Text representation: Shingling helps create more expressive and context-aware representations of text data by capturing local word order and semantic relationships. This richer representation enables generative models to learn more meaningful patterns and generate more relevant text outputs.

For example, a model trained on shingles like "machine learning algorithm" would better understand and generate coherent phrases related to machine learning compared to a model trained on individual words.

This article explains how generative AI models use textual descriptions to generate videos, emphasizing the importance of understanding and representing the input text accurately. The process of transforming text into video content requires capturing the semantic relationships within the text to ensure the generated video aligns with the intended message.

Similarity detection: Shingling, combined with techniques like minhashing and locality-sensitive hashing, allows efficient detection of similar or duplicate text segments. This is crucial for tasks like plagiarism detection, text summarization, and deduplication of training data, leading to improved model performance and output quality.
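The combination of MinHash and LSH can be sketched with the banding technique: each signature is cut into bands, and two documents become a candidate pair if any band matches exactly. Everything below (helper names, 32 hashes, 8 bands, the sample documents) is an illustrative assumption, not a tuned configuration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash(shingles: set[str], num_hashes: int = 32) -> list[int]:
    """MinHash signature: the minimum seeded hash per hash function."""
    def h(s: str, seed: int) -> int:
        d = hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                            salt=seed.to_bytes(8, "little")).digest()
        return int.from_bytes(d, "little")
    return [min(h(s, seed) for s in shingles) for seed in range(num_hashes)]


def word_shingles(text: str, k: int = 3) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def lsh_candidates(signatures: list[list[int]], bands: int = 8) -> set[tuple[int, int]]:
    """Return pairs of document indices that share at least one identical band."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)  # same band index and values -> same bucket
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",      # exact duplicate of doc 0
    "the quick brown fox jumps over the sleeping cat",  # near duplicate
    "shingling turns documents into sets of hashed tokens",
]
signatures = [minhash(word_shingles(d)) for d in docs]
candidates = lsh_candidates(signatures)
print(sorted(candidates))
```

Only candidate pairs need a full similarity check afterwards, which avoids comparing every document with every other one. More bands with fewer rows each make the filter more permissive; fewer, wider bands make it stricter.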

Data and content quality: Shingling can be used as part of data filtering techniques to remove duplicate, irrelevant, or low-quality content from the training data. This improves the overall data quality, which is paramount for generative AI systems to produce accurate and unbiased outputs. Along with quality, shingling can be used to detect copyright infringement.

Generative AI tools like ChatGPT have raised concerns about their potential misuse in academic settings, where students might use them to generate essays or assignments. By employing shingling and Jaccard Similarity, educational institutions can compare student submissions against a vast database of existing texts, including online sources and previously submitted work.

A high similarity score between a student's submission and existing content would indicate potential plagiarism, allowing instructors to identify and address such instances.

In fields like journalism, creative writing, and content creation, generative AI tools are being explored as assistants to generate drafts or ideas. However, there are concerns about the originality of the generated content, as these tools are trained on existing data. By shingling and comparing the generated content with a corpus of existing works, content creators and publishers can assess the degree of similarity and ensure that the final output meets originality standards.

By shingling and comparing the generated works with a database of copyrighted content, artists and platforms can evaluate the risk of infringement and make informed decisions about the use or distribution of the generated works.

Semantic Search: By shingling and hashing queries and document texts, generative AI systems can represent them as compact signatures whose overlap approximates textual similarity, enabling efficient retrieval of relevant information for tasks like recommendation (personalization) systems, question-answering, and knowledge-grounded generation.

In e-commerce and content recommendation systems, shingling and Jaccard Similarity can be used to analyze user preferences and behaviors. By comparing the k-shingles of user interactions or content consumed, these systems can identify similarities between users or content, enabling more accurate personalization and recommendations.

Challenges and Considerations in Shingling for Generative AI

The choice of shingling parameters, such as the shingle size and hashing techniques, can significantly impact the trade-off between computational complexity and the sensitivity of similarity detection.

Before adopting shingling as a technique, you should be aware of the following challenges and considerations.

Tokenization rules can impact shingle quality in several ways. For instance, treating "Apple" and "apple" as different tokens will result in different shingles, potentially missing similarities. Similarly, handling punctuation as separate tokens (e.g., "apple," and "apple.") or removing it can affect shingle generation.

Text preprocessing techniques also play a crucial role in shingle quality. Removing common stopwords (e.g., "the", "and") can improve shingle quality by focusing on more meaningful content words. Reducing words to their root forms through stemming or lemmatization (e.g., "running" and "ran" to "run") can help identify similarities across different word forms. Inconsistencies in handling punctuation, capitalization, spelling variations, etc. can lead to suboptimal shingle representations.
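The effect of these choices is easy to see in code. The sketch below applies case normalization, punctuation stripping, and stopword removal before shingling; the stopword list is a tiny illustrative sample, and stemming or lemmatization is left out because it would need a library such as NLTK or spaCy.

```python
import re

STOPWORDS = {"the", "and", "a", "an", "of", "is"}  # tiny illustrative list


def normalize(text: str, remove_stopwords: bool = True) -> list[str]:
    """Lowercase, drop punctuation, and optionally drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(normalize("The Apple, and the apple."))
# ['apple', 'apple']
print(normalize("The Apple, and the apple.", remove_stopwords=False))
# ['the', 'apple', 'and', 'the', 'apple']
```

After normalization, "Apple," and "apple." collapse to the same token, so shingles built from them will match.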

Choosing the appropriate shingle size (k-value) is a key consideration that involves trade-offs. Smaller shingle sizes capture fine-grained local patterns but match by chance more often, which can inflate the similarity between unrelated texts. Larger shingle sizes are more distinctive and better capture longer-range word order, but a single edit changes every shingle that overlaps it, so near-duplicates can be missed. The optimal shingle size depends on the specific task, data characteristics, and desired level of granularity.
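A quick experiment makes the trade-off visible: computing Jaccard similarity for the two fox passages at several values of k. The helpers and the tokenization (lowercase, punctuation dropped) are assumptions made for this sketch.

```python
import re


def shingle_set(text: str, k: int) -> set[tuple[str, ...]]:
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


p1 = "The quick brown fox jumps over the lazy dog."
p2 = "The quick brown fox jumps over the sleeping cat."

scores = {k: jaccard(shingle_set(p1, k), shingle_set(p2, k)) for k in range(1, 6)}
for k, score in scores.items():
    print(f"k={k}: {score:.2f}")
```

For this pair the score falls from 0.60 at k=1 to about 0.43 at k=5: the two differing words at the end affect more and more shingles as the window grows.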

As the shingle size and dataset size increase, the computational requirements for shingling and subsequent processing (e.g., minhashing, LSH) can become significant. Efficient implementations and distributed computing may be necessary for large-scale applications. Finding the right balance between computational complexity and similarity detection sensitivity is important.

Traditional shingling operates at the token/word level, which may not capture semantic similarities effectively. Incorporating semantic representations (e.g., word embeddings, concepts) into shingling can improve the quality of generated shingles.

In text summarization tasks, Generative AI models aim to produce concise summaries that capture the essence of longer texts. Traditional shingling techniques based solely on n-grams may fail to capture the semantic relationships between words, leading to incomplete summaries.

By incorporating word embeddings, which represent words as dense vectors capturing their semantic and contextual meanings, the shingling process can group together semantically related words and phrases. This allows the model to identify and prioritize the most salient concepts and ideas, resulting in more coherent and informative summaries.

In question-answering systems, Generative AI models must understand the intent behind a user's query and retrieve relevant information from a knowledge base or corpus. Traditional shingling based on surface-level word matches may struggle to capture the underlying semantics of the query.

By incorporating concept representations from knowledge graphs or ontologies, the shingling process can group together shingles that represent the same or related concepts, even if they use different surface forms. This semantic understanding enables the model to better comprehend the query's intent and retrieve more relevant and accurate answers.

In conversational AI systems, Generative AI models must understand the context and flow of a dialogue to generate appropriate responses. Traditional shingling techniques may fail to capture the nuances and pragmatic aspects of language.

By incorporating contextual word embeddings, which capture the dynamic meanings of words based on their surrounding context, the shingling process can group together shingles that represent similar conversational intents or pragmatic functions. This semantic understanding enables the model to generate more natural, context-aware, and coherent responses in a dialogue.

Here’s a link supporting the examples above

Handling Rare or Out-of-Vocabulary (OOV) Tokens: Generative models often encounter rare or unseen tokens during inference. The strategies discussed above, such as tokenization, text preprocessing, case normalization (converting to either lowercase or uppercase), and stopword removal ("the", "an"), are needed to handle such tokens during the shingling process and avoid information loss or incorrect representations.

Accurately representing OOV tokens enables natural language processing systems to stay up-to-date and maintain high performance on evolving language data.

Real-world examples where representing OOV tokens plays a crucial role include understanding emerging terminology in scientific literature, like new gene or protein names, interpreting novel product names or brand names in e-commerce and advertising, processing social media text with slang words, hashtags, and abbreviations specific to certain communities, and building dialogue systems that can comprehend newly coined words or phrases introduced by users.

Here’s a code implementation to detect OOVs using tokenization in Python
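As a simple illustration, OOV detection can be reduced to checking tokens against a known vocabulary. The vocabulary, the function names, and the `<unk>` placeholder below are all hypothetical; real systems usually rely on a model's own tokenizer and vocabulary.

```python
import re


def find_oov(text: str, vocabulary: set[str]) -> list[str]:
    """Return tokens not present in `vocabulary`, in first-seen order, without repeats."""
    seen, oov = set(), []
    for token in re.findall(r"\w+", text.lower()):
        if token not in vocabulary and token not in seen:
            seen.add(token)
            oov.append(token)
    return oov


def replace_oov(text: str, vocabulary: set[str], placeholder: str = "<unk>") -> list[str]:
    """Replace OOV tokens with a placeholder so shingles stay comparable."""
    tokens = re.findall(r"\w+", text.lower())
    return [t if t in vocabulary else placeholder for t in tokens]


vocab = {"generative", "ai", "is", "evolving"}  # hypothetical vocabulary
sentence = "Generative AI is evolving rapidly"

print(find_oov(sentence, vocab))     # ['rapidly']
print(replace_oov(sentence, vocab))  # ['generative', 'ai', 'is', 'evolving', '<unk>']
```

Replacing OOV tokens with a single placeholder loses information about the specific word, so it is a trade-off rather than a complete solution.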

Conclusion and Future Directions

Shingling serves as a fundamental mechanism within the realm of generative AI, facilitating sophisticated analysis and insight extraction from textual data through segmentation into shingles and the application of similarity metrics such as Jaccard similarity. Its significance is paramount in advancing text-based AI applications, encompassing endeavors such as enhancing natural language processing and refining content recommendation systems.

Looking forward, the trajectory of shingling in generative AI appears poised for innovation. This includes potential advancements in dynamic shingling methods, enhanced efficiency in processing extensive datasets, and the integration of machine learning for more nuanced pattern recognition. Additionally, addressing existing challenges, such as optimizing shingle size and managing diverse and noisy data, will be imperative.

As progress unfolds, the exploration of novel applications and the continual refinement of shingling techniques hold promise for further amplifying the capabilities of generative AI. This trajectory positions generative AI as an increasingly potent tool for comprehending and harnessing text data across a multitude of contexts.