Layout Analysis on Newspaper Archives

paper, specified "short paper"
Authorship
  1. 1. Vincent Buntinx

    École Polytechnique Fédérale de Lausanne (EPFL)

  2. 2. Frédéric Kaplan

    École Polytechnique Fédérale de Lausanne (EPFL)

  3. 3. Aris Xanthos

    Université de Lausanne

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Bitmap Factorial Analysis

In order to analyze layout evolution, we propose to build a static layout representation for every year in the corpus. Thus, when studying each newspaper's first page, we define the pixel t of the static representation Pym of month m of year y as

*+,

The study of newspaper layout evolution through historical corpora has been addressed by diverse qualitative and quantitative methods in the past few years (Antonacopoulos et al, 2013; Gonzalez et al, 2001; Liu et al, 2001; Mitchell and Hong, 2004; Singh and Bhupendra, 2014). The recent availability of large corpora of newspapers is now making the quantitative analysis of layout evolution ever more popular. This research investigates a method for the automatic detection of layout evolution on scanned images with a factorial analysis approach. The notion of eigenpag-es is defined by analogy with eigenfaces used in face recognition processes. The corpus of scanned newspapers that was used contains 4 million press articles, covering about 200 years of archives. This method can automatically detect layout changes of a given newspaper over time, rebuilding a part of its past publishing strategy and retracing major changes in its history in terms of layout. Besides these advantages, it also makes it possible to compare several newspapers at the same time and therefore to compare the layout changes of multiple newspapers based only on scans of their issues.

Introduction to the Corpus

The corpus consists of digitized facsimiles of two

Swiss newspapers, “Journal de Geneve” (JDG) from years 1826 to 1997 and “Gazette de Lausanne” (GDL) from years 1804 to 1997. Scanned daily issues of each journal were transcribed using an optical character recognition (OCR) system (Rochat et al, 2016). The

Where Nym is the number of issues in month m of year y and Py,m,) is the first page of day d of month m of year y. The pixel t of the static representation i" of year y is then defined as

*+ *+ *+,

p" N p",$ N N py,$,'

" " "$

$-. $-. )-.

Were Ny is the number of month representations Py,$

in year y.

A diagram of the process is shown in figure 2.

de Lausanne

Figure 1. Different layouts of GDL in years 1825, 1850 and 1875 (top, left to right), 1925, 1950 and 1975 (bottom, left to right).

Figure 2. Process diagram creating a yearly representation of first page layouts.

These representations give a vision of the mean layout over the course of a given year. Each yearly representation can be projected in a two-dimensional space by performing a principal component analysis (PCA) which maximizes the covariance on every pixel. This method is analogous to the eigenfaces method used for face recognition (Turk and Pentland, 1991a, 1991b) We compute the eigenvectors, that we named eigenpages, as well as the eigenvalues of the covariance matrix of the pixels. The yearly representations are then projected in the two-dimensional space of the two eigenvectors which have the highest eigenvalues. The resulting projections of yearly mean images of JDG and GDL from years 1900 to 1998 are portrayed in Figure 3. In these figures, each point is a yearly image and consecutive years are linked in order to highlight the change over time. The further apart the points are, the bigger the layout's changes occurring between two years. Visual inspection reveals several clusters of years with a similar layout. Furthermore, homogeneous sequences of years may be clustered automatically based on the (unprojected) distance between them (e.g. by computing the distance between year y and y+1 and “cutting” the sequence of years at positions where their distance exceeds an arbitrary threshold.

Figure 3: PCA projected results of the yearly representations of first pages of JDG (top, blue) and GDL (bottom, red) from years 1900 to 1998 with clusters obtained by visual inspection.

Discussion

The PCA technique allows us to quantify layout changes by covariance analysis of the pixels of yearly representations. The proportion of covariance information shown by the PCA is 73% for JDG and 76% for GDL. Visual interpretation reveals different chronological clusters which are displayed in Tables 1 and 2 along with their mean positions in the twodimensional space of eigenpages as well as mean images representing these periods (computed in the same way as yearly images, cf. Figure 2). These mean images reveal the major layout transitions in each journal which may be summarized as follows:

Journal de Geneve (JDG):

• 1900-1915: 6 columns, title above columns 2 to 5, little space between columns.

• 1916-1931: 4 columns, title above columns

1 to 4, more space between columns.

• 1932-1964: 4 columns, change of the layout around the title and the first title position.

• 1965-1968: 4 columns, change of the layout around the title, boxes with black borders begin to appear.

• 1969-1991: 4 columns, total change of the title, title above columns 2 to 4, logo appears, more space between columns and

boxes, article titles are bigger.

• 1992-1995: 5 columns, fusion of JDG and GDL, big change of layout, boxes inside boxes begin to appear, more stable structure.

• 1996-1998: 6 columns, big change in title

font, previous column layout replaced by a more classic one, article titles are placed at

the top of the first page.

Gazette de Lausanne (GDL):

• 1900-1945: 6 columns, title above columns

2 to 5, little space between columns.

• 1946-1966: 7 columns, title above columns

2 to 6, more space between columns yielding particularly small column sizes.

• 1967-1970: 5 columns, title above columns 2 to 5, first column begins before the title which is on the right, advertisements placed below the page.

• 1971-1973: 6 columns, more classic layout with article titles at the top.

• 1974-1991: 4 columns, lots of space between columns and articles, bigger article

titles.

• 1992-1995: 5 columns, fusion of JDG and GDL, big change of layout, boxes inside boxes begin to appear, more stable structure.

• 1996-1998: 6 columns, big change in title

font, column layout replaced by a more classic one, the article titles are placed at the

top of the first page.

The automatic clustering method described in previous chapter has been applied on unprojected distances and produce similar clustering results (depending on the threshold parameter). Qualitative analysis confirms that the resulting clusters are all separated by important layout transition phases.

eigenpages (JDG). PCAPCgenpag(JDG)obtained by PCA

for JDG.

This analysis is also useful to compare several newspaper publishing strategies. We projected the two newspapers in the same two-dimensional space representation (presented in Figure 3) using the same method with yearly representations of both journals in order to compare their chronological trajectories. The covariance information shown by the PCA is 67%. Visual inspection reveals three main clusters for each journal. Each of these clusters turns out to correspond to groups of clusters that has been detected in the previous projections. We observe that the layout of both journals has evolved in a similar way but with different timescales. GDL is more dispersed than JDG and has explored different strategies during the period 1900-1966. However, GDL has adopted a style more similar to JDG style between 1967 and 1973 just before it entered a major layout transition in 1974 (5 years later than JDG).

Figure 4. PCA projected results of the yearly representations of first pages of JDG (blue) and GDL (red) from years 1900 to 1998 in the same two-dimensional space representation with clusters obtained by visual inspection.

Conclusion

These first results demonstrate a promising method of detecting layout evolution automatically. The method is applicable to a large variety of longitudinal image corpora without any prerequisites, since it only requires images in bitmap format. It make it possible to compare several corpora and determine periods of layout transitions in a common two-dimensional space for visual interpretation. In addition, unprojected distances can be used to determine layout changes in an entirely automatic fashion, by analyzing the representation space through clustering algorithms. Future work on this method should include the integration of an alignment method in the bitmap preprocessing step, because alignment errors may impact the pixel covariance analysis and eigenpages creation.

Bibliography

Antonacopoulos, A., Clausner, C., Papadopoulos, C., and

Pletschacher, S. (2013) ICDAR2013 Competition on

Historical Newspaper Layout Analysis - HNLA2013.

12th International Conference on Document Analysis and

Recognition.

Buntinx, V., Bornet, C., and Kaplan, F. (2016) Studying

Linguistic Changes on 200 Years of Newspapers. 2016.

DH2016, Krakow, Poland, July 11-16.

Ehrmann, M., Colavizza, G., Rochat, Y., and Kaplan, F.

(2016). Diachronic Evaluation of NER Systems on Old

Newspapers. 13th Conference on Natural Language Pro-

cessing (KONVENS 2016), Bochum, Germany, September 19-21.

González, J., Rojas, I., Pomares, H., Salmerón, M., Prieto, A., and Merelo, J.J. (2001) Optimization of web newspaper layout in real time. Computer Networks, Volume

36, Issues 2-3, July, Pages 311-321, ISSN 1389-1286,

http://dx.doi.org/10.1016/S1389-1286(01)00158-X.

Liu, F., Luo, Y., Yoshikawaf, M., and Dongcheng, H.

(2001). A New Component based Algorithm for Newspaper Layout Analysis. 2001. 6th International Conference on Document Analysis and Recognition.

Mitchell, P. E., and Hong, Y. (2004) Newspaper layout analysis incorporating connected component separation. Image and Vision Computing, Volume 22, Issue 4, 1 April, Pages 307-317, ISSN 0262-8856,

http://dx.doi.org/10.1016/j.imavis.2003.11.001.

Rochat, Y., Ehrmann, M., Buntinx, V., Bornet, C., and Kaplan, F. (2016). Navigating through 200 years of historical newspapers. 2016. iPRES, Bern, October 3-6.

Singh, V., and Bhupendra, K. (2014). Document layout analysis for Indian newspapers using contour based symbiotic approach. 2014. International Conference on

Computer Communication and Informatics (ICCCI-2014), Jan. 03 - 05, Coimbatore, INDIA

Turk. M., and Pentland, A. (1991a) Face recognition using eigenfaces. 1991. Proc. IEEE Conference on Computer Vision and Pattern Recognition. pp. 586-591.

Turk. M., and Pentland, A. (1991b) Eigenfaces for recognition. Journal of Cognitive Neuroscience. 3 (1): 71-86. doi:10.1162/jocn.1991.3.1.71. PMID 23964806.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2017
"Access/Accès"

Hosted at McGill University, Université de Montréal

Montréal, Canada

Aug. 8, 2017 - Aug. 11, 2017

438 works by 962 authors indexed

Series: ADHO (12)

Organizers: ADHO