Introduction
Given an original sentence that conveys a specific meaning,
paraphrasing means expressing the same meaning
using a different set of words or a different syntactic
structure. Paraphrasing has been used extensively for
educational purposes in language learning, as well as in
several NLP tasks like text summarization (Brockett and
Dolan, 2005), question answering (Duclaye et al., 2003)
and natural language generation. Recently it has found
yet another use in steganography.
Regarding paraphrase identification and generation, previous approaches have utilized supervised (Kozareva and Montoyo, 2006) or unsupervised (Barzilay and Lee, 2003) machine learning techniques, finite state automata (Pang et al., 2003), syntactic dependency rules (Meral et al., 2007), and statistical machine translation techniques (Quirk et al., 2004). In the present proposal, paraphrases of Modern Greek
free text are learned in two phases. Henceforth, the term
“paraphrasing” will stand for shallow syntactic transformations,
i.e. swaps of consecutive phrasal chunks. Modern
Greek is quite suitable for shallow paraphrasing, due
to the permissible freedom in the ordering of the phrases
in a sentence.
The paraphrase learning process is based on resource economy: the desire to use as few linguistic resources as possible, thereby making the methodology easily applicable to morphologically rich languages beyond Modern Greek.
The paraphrased text may then be used for hiding secret
information. Steganographic security will depend on the
correctness and the naturalness of the paraphrases. Figure
1 shows the architecture of the paraphrase learning
process.
Phase 1: The paraphrasing rules
The text corpus used in the experiments is the ILSP/
ELEFTHEROTYPIA corpus (Hatzigeorgiu et al., 2000).
It consists of 5244 sentences; it is balanced and manually
annotated with morphological information. Further
(phrase structure) information is obtained automatically
by the chunker described in detail in (Stamatatos et al.,
2000). During chunking, noun (NP), verb (VP), prepositional
(PP), adverbial phrases (ADP) and conjunctions
(CON) are detected via multi-pass parsing. The chunker
exploits minimal linguistic resources. Phrases are non-overlapping.
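To make the later sketches concrete, a chunked sentence can be modeled as a flat list of labeled, non-overlapping chunks. The following minimal Python sketch is ours, not the authors'; all names are illustrative.

from dataclasses import dataclass

# Illustrative representation of a chunk produced by the chunker.
# The labels mirror the detected phrase types.
@dataclass
class Chunk:
    label: str       # one of "NP", "VP", "PP", "ADP", "CON"
    tokens: list     # the words forming the chunk
    morph: str = ""  # e.g. article/case information for NPs

# A sentence is simply a list of non-overlapping chunks.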
A set of nine empirical bidirectional rules is first applied
to the input sentences in order to change their phrase ordering.
The complete set of rules is described in detail
in Table 1. Unlike the syntactic tools presented in (Meral et al., 2007), which may be applied only once to a given sentence, each of the rules described here may be applied multiple times (i.e. in multiple positions) to a sentence. Furthermore, more than one rule may be applied to a sentence simultaneously. For every sentence, all the possible combinations of rule
applications are formed. This is the initial pool of paraphrases
and in the given corpus its size may vary from
zero (the sentence does not allow for any paraphrasing)
to 80 paraphrases.
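As a rough illustration of the pool construction, the following Python sketch enumerates all combinations of mutually compatible swaps; rule_applies stands in for the nine rules of Table 1 and is purely hypothetical.

from itertools import combinations

def swap_positions(sentence, rule_applies):
    """Indices i where chunks i and i+1 may be swapped by some rule."""
    return [i for i in range(len(sentence) - 1)
            if rule_applies(sentence[i], sentence[i + 1])]

def paraphrase_pool(sentence, rule_applies):
    """Pairs (paraphrase, swap combination) reachable by one or more swaps."""
    positions = swap_positions(sentence, rule_applies)
    pool = []
    for r in range(1, len(positions) + 1):
        for combo in combinations(positions, r):
            # Adjacent positions would swap overlapping chunk pairs;
            # keep only combinations whose swaps do not interfere.
            if all(b - a > 1 for a, b in zip(combo, combo[1:])):
                s = list(sentence)
                for i in combo:
                    s[i], s[i + 1] = s[i + 1], s[i]
                pool.append((s, combo))
    return pool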
Figure 2 shows the distribution of sentences by sentence length (i.e. the number of chunks forming the sentence). Figure 3 shows the distribution of sentences by initial paraphrase pool size. Almost 80% of the sentences
have at least one paraphrase, an impressive number, given
that more than 24% of the input sentences consist of five or fewer chunks.
Phase 2: Paraphrase learning
Due to the use of the paraphrased sentences in steganography,
correctness in syntax as well as naturalness is
of great significance. Steganographic security depends
largely on paraphrasing accuracy. Therefore the produced
paraphrases are further filtered using supervised
learning.
The positions of possible phrase swaps in the input sentences
are identified, according to the nine paraphrasing
rules. A learning vector is created for each original input
sentence and each swap position. The features forming
the vector encode morphological and syntactic information
for the phrase right before the swap position, as
well as two phrases to the left and two phrases to the
right. Thereby, context information is taken into account.
Each phrase is represented through a set of six features,
shown in Table 2. The morph feature denotes whether a
noun phrase contains a definite or indefinite article and
its grammatical case. The num feature is the number of
tokens that constitute the phrase. A total of 5 × 6 = 30 features constitute the feature vector,
plus the binary target class: valid (yes) / not valid (no)
paraphrase. Native speakers have manually annotated
519 instances with the correct class label. 26.4% of them
are classified as incorrect paraphrases.
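A hedged Python sketch of the vector construction follows; only the morph and num features are specified in the text, so the remaining features per chunk are placeholders of our own.

def chunk_features(chunk):
    """Six features per chunk; "NONE" pads beyond sentence boundaries."""
    if chunk is None:
        return ["NONE"] * 6
    return [chunk.label,        # phrase type: NP, VP, PP, ADP, CON
            chunk.morph,        # article/case information (morph)
            len(chunk.tokens),  # number of tokens (num)
            "?", "?", "?"]      # placeholders: the text does not name
                                # the remaining three features

def learning_vector(sentence, p):
    """30-dimensional vector for a candidate swap of chunks p and p+1:
    the chunk right before the swap plus two chunks on either side."""
    window = [sentence[i] if 0 <= i < len(sentence) else None
              for i in range(p - 2, p + 3)]
    feats = []
    for c in window:
        feats.extend(chunk_features(c))
    return feats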
The following table shows the prediction results for various
stand-alone classification algorithms: decision trees
(unpruned C4.5 tree), k-NN instance-based learning
(k=5), support vector machines (first degree polynomial
kernel function, sequential minimal optimization algorithm
for training the classifier). Accuracy is the number
of correctly classified instances divided by the total
number of instances. Experiments were performed using
10-fold cross validation. The majority of the incorrectly classified instances are
negative (not valid) instances, probably due to their rare
occurrence in the data, compared to the positive instances.
Support vector machines deal better with predicting
negative labels, and reach an f-score of 64.1% for the
rare class.
To improve classification accuracy even further, bagging
and boosting have also been experimented with. The
C4.5 unpruned classifier was used as a base learner for
bagging (the optimal bag size was 50% of the training set
and 10 iterations were performed) and boosting (AdaBoost, again with 10 iterations). Bagging
leads to the best f-score for the negative class: 65.3%.
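The setup could be reproduced along the following lines with scikit-learn; this is an approximation under stated assumptions (C4.5 is approximated by an unpruned CART tree, and the symbolic features are assumed to be numerically encoded, e.g. one-hot), not the authors' original tooling.

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "unpruned tree": DecisionTreeClassifier(),  # stands in for unpruned C4.5
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM, degree-1 poly": SVC(kernel="poly", degree=1),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(),
                                      max_samples=0.5,   # bag = 50% of training set
                                      n_estimators=10),  # 10 iterations
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(),
                                   n_estimators=10),
}

def evaluate(X, y):
    # X: encoded 30-feature vectors; y: valid / not-valid labels
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
        print(f"{name}: mean accuracy {scores.mean():.3f}")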
The positively labeled paraphrases from the previous phase constitute one part of the final pool of paraphrases. This part consists of paraphrases produced by single phrase swaps, not by combinations of swaps, since the learning process does not allow a paraphrase to be formed by combinations of phrase swaps. The other part is formed by those paraphrases derived from phase 1 that are combinations of two or more correct phrase swaps, i.e. of individual swaps that were positively labeled in phase 2 and thus already belong to the first part of the final pool.
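Reusing the hypothetical helpers from the earlier sketches (swap_positions, learning_vector, paraphrase_pool) and a trained classifier classify, the final pool could be assembled roughly as follows.

def final_pool(sentence, rule_applies, classify):
    """Keep only paraphrases all of whose component swaps are valid."""
    valid = {p for p in swap_positions(sentence, rule_applies)
             if classify(learning_vector(sentence, p)) == "yes"}
    return [para for para, combo in paraphrase_pool(sentence, rule_applies)
            if set(combo) <= valid]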
Application to steganography
Steganography is the art of embedding hidden information
in unremarkable cover media in a way that does not
arouse an eavesdropper’s suspicion of the existence of
hidden content underneath the surface message (Provos
and Honeyman, 2003; Atallah et al., 2000; Topkara et
al., 2005).
Once the final pool of paraphrases is formed for every
sentence in the input (cover) text, the steganographic
process starts. A secret message, i.e. a sequence of bits,
is to be hidden underneath the cover text.
First, each rule is marked with one bit value, depending
on its condition. By condition we mean the right or the
left-hand side of the rule (right or left-hand side of the
arrow in Table 1). For example, for Rule 1 a bit value “0”
could mean the left-hand side of the rule, and a bit value “1” would then indicate its right-hand side. In the case
of symmetrical rules (Rules 3 and 4), the condition may
be determined by considering as NP1 the noun phrase
which starts with a letter closer to the beginning of the
alphabet than NP2. This rule marking results from a prior
understanding between the communicating parties.
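A sketch of this agreed-upon marking; matches_lhs is a hypothetical predicate testing whether a site currently matches the left-hand side of the given rule in Table 1.

def condition_bit(rule_id, left_chunk, right_chunk, matches_lhs):
    """Bit value encoded by the current chunk ordering at a rule site."""
    if rule_id in (3, 4):
        # Symmetrical rules: take as NP1 the alphabetically earlier
        # noun phrase, so both parties derive the same condition.
        first = min(left_chunk, right_chunk,
                    key=lambda c: " ".join(c.tokens).lower())
        return 0 if first is left_chunk else 1
    # Asymmetrical rules: the left-hand side of the arrow encodes "0",
    # the right-hand side "1" (as agreed beforehand).
    return 0 if matches_lhs(rule_id, left_chunk, right_chunk) else 1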
The embedding process is then completed in two stages.
In the first stage, for every sentence, a paraphrase is selected
from its pool. The selection may be performed
in a round-robin fashion (i.e. to choose the paraphrase
of each rule one at a time), or based on a secret (e.g. a
symmetric cryptographic key) shared between the two
communicating parties. In case the size of the pool is
zero, the sentence remains unchanged, and it is not used
for information embedding. If, however, the pool size is
greater than zero, a selection is possible and the sentence
is useful for information embedding. In the second stage,
depending on the condition of the selected rule, a secret
bit is embedded as follows: if the bit to be hidden is the
same as the condition of the rule, the rule is not applied
and the sentence remains unchanged, otherwise it is applied
and the sentence is paraphrased. For example, a
subject-verb sequence in the input sentence would mean
a condition “0” for Rule 1. If the bit to be hidden is also
“0”, Rule 1 is not applied and the sentence is transmitted
as it is. If the hidden bit were “1”, the rule is applied and
the sequence in the transmitted sentence now reads verb-subject, instead of subject-verb.
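Putting the two stages together, a minimal embedding sketch might look as follows; rule_sites, condition_bit (with the predicate already bound) and apply_swap are hypothetical stand-ins, and the shared key seeds a pseudo-random selection, one of the selection schemes the text allows.

import random

def embed(sentences, secret_bits, key, rule_sites, condition_bit, apply_swap):
    """Hide a bit sequence by selectively paraphrasing the cover text."""
    rng = random.Random(key)      # selection seeded by the shared secret
    bits = iter(secret_bits)
    stego = []
    for sent in sentences:
        sites = rule_sites(sent)  # applicable (rule_id, position) pairs
        if not sites:             # empty pool: carries no information
            stego.append(sent)
            continue
        rule_id, pos = rng.choice(sites)
        bit = next(bits, None)    # None once the secret is exhausted
        if (bit is not None and
                condition_bit(rule_id, sent[pos], sent[pos + 1]) != bit):
            sent = apply_swap(sent, pos)  # paraphrase to flip the condition
        stego.append(sent)
    return stego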
On the other end, the extractor receives the final text.
Having at his disposal the same rule set, (s)he is able to
identify the rules that may be applied to each sentence.
Sharing the same secret key used in the embedding process,
(s)he is able to select the same rule used in the
insertion process. For example, reading a subject-verb
sequence, and knowing that this sequence indicates a bit
value “0” for the condition of Rule 1, (s)he decides on
“0” to be the first secret bit. Reading a verb-subject sequence
would have meant a condition value “1” and (s)he would have decided on “1” to be the first secret bit.
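The extractor mirrors the embedder exactly; a sketch under the same assumptions:

import random

def extract(sentences, key, rule_sites, condition_bit):
    """Recover the bit sequence from the received (stego) text."""
    rng = random.Random(key)      # same seed as the embedder
    bits = []
    for sent in sentences:
        sites = rule_sites(sent)
        if not sites:
            continue              # this sentence carried no bit
        rule_id, pos = rng.choice(sites)
        bits.append(condition_bit(rule_id, sent[pos], sent[pos + 1]))
    return bits

Note that this only works because the rules are bidirectional: the set of applicable sites is the same before and after a swap, so embedder and extractor derive identical site lists and, given the same key, identical selections.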
Steganographic Capacity
To get a sense of the steganographic capacity, assume an average word size of 6 bytes/word; given that our corpus consists of 166,000 words, the corpus size equals roughly 1,000,000 bytes, i.e. 8,000,000 bits. The current implementation allows for the embedding of one bit per paraphrasable sentence, so 4142 secret bits may be embedded in the corpus (8,000,000 / 4142 ≈ 1,930); in other words, 1 bit may be embedded roughly every 2000 bits of cover medium. This bandwidth may increase by exploiting the possibility of embedding more than one bit per sentence, by applying more than one rule to the same sentence simultaneously, or the same rule more than once, which is permitted by our rule set.

Paraphrasing evaluation
A set of experiments has been performed to test the
naturalness and the correctness of the final text. Table
5 presents statistical information regarding rule applicability.
The frequency column represents the applicability
for each rule (the number of times each rule is applicable
in the corpus sentences) divided by the sum of the applicability
values of all the rules. As can be seen, subject-verb displacement (Rule 1) and adverb displacement (Rule 5) together constitute around 70% of rule applications.
The 519 instances of paraphrases were checked for
grammaticality and naturalness by two native language
experts. Table 6 shows how the output sentences were judged by the language experts. The first error rate indicates the
percentage of rule applications that have forced the experts
to make modifications in order for the paraphrases
to become linguistically correct and natural within the
initial pool. Modifications entail swaps in the ordering of
the chunks. The second error rate is the same percentage
for the final pool. Inter-expert agreement exceeded 94%.
Conclusion
The application of shallow paraphrasing rules to Modern
Greek sentences for steganographic purposes has been
presented. The low paraphrasing level, as well as the
absence of any kind of external linguistic resources, enables
the easy portability of the methodology to other inflectional
languages that are poor in resources. The large average size of the paraphrase pools makes it non-trivial for
an unauthorized party to detect the correct paraphrase.
An interesting future direction of the current approach
would be to take further advantage of the pool size in
order to increase the steganographic capacity of the input
text.

References
Atallah, M., C. McDonough, V. Raskin, and S. Nirenburg.
2000. Natural Language Processing for Information
Assurance and Security: An Overview and Implementations.
Proceedings of the Workshop on New
Security Paradigms: 51-65.
Barzilay, R., and L. Lee. 2003. Learning to Paraphrase:
An Unsupervised Approach Using Multiple-Sequence
Alignment. Proceedings of the Human Language Technology Conference (HLT-NAACL): 16-23. Edmonton.
Brockett, C., and W. B. Dolan. 2005. Support Vector
Machines for Paraphrase Identification and Corpus Construction.
Proceedings of the 3rd International Workshop
on Paraphrasing (IWP). Korea.
Duclaye, F., F. Yvon, and O. Collin. 2003. Learning Paraphrases
to Improve a Question-Answering System. In
Proceedings of the 10th Conference of EACL Workshop
Natural Language Processing for Question-Answering.
Budapest, Hungary.
Hatzigeorgiu, N., M. Gavrilidou, S. Piperidis, G. Carayannis,
A. Papakostopoulou, A. Spiliotopoulou, A. Vacalopoulou,
P. Labropoulou, E. Mantzari, H. Papageorgiou,
and I. Demiros. 2000. Design and Implementation
of the online ILSP Greek Corpus. Proceedings of the 2nd
International Conference on Language Resources and
Evaluation: 1737-1742. Athens.
Kozareva, Z., and A. Montoyo. 2006. Paraphrase Identification
on the Basis of Supervised Machine Learning
Techniques. Proceedings of FinTAL, Lecture Notes in
Artificial Intelligence, vol. 4139: 524-533. Springer-Verlag, Berlin.
Meral, H. M., E. Sevinc, E. Unkar, B. Sankur, A. S. Ozsoy,
and T. Gungor. 2007. Syntactic Tools for Text Watermarking.
Proceedings of the SPIE International Conference
on Security, Steganography, and Watermarking
of Multimedia Contents IX, edited by E. J. Delp III and P. W. Wong.
Pang, B., K. Knight, and D. Marcu. 2003. Syntax-based
Alignment of Multiple Translations: Extracting Paraphrases
and Generating New Sentences. Proceedings of
the Human Language Technology Conference (HLT-NAACL). Edmonton, Canada.
Provos, N., and P. Honeyman. 2003. Hide and Seek: An
Introduction to Steganography. IEEE Security and Privacy:
32-44.
Quirk, C., C. Brockett, and W. B. Dolan. 2004. Monolingual Machine Translation for Paraphrase Generation.
Proceedings of the Conference on Empirical Methods
in Natural Language Processing: 142-149. Barcelona,
Spain.
Stamatatos, E., N. Fakotakis and G. Kokkinakis. 2000.
A practical chunker for unrestricted text. Proceedings of
the Conference on Natural Language Processing: 139-
150. Patras, Greece.
Topkara, M., C. M. Taskiran, and E. Delp. 2005. Natural
Language Watermarking. Proceedings of the SPIE International
Conference on Security, Steganography, and
Watermarking of Multimedia Contents. San Jose.