The Centre for Science and Environment (CSE) is a public interest research and advocacy organisation based in New Delhi. CSE researches into, lobbies for and communicates the urgency of development that is both sustainable and equitable.

For this project, CSE wanted to migrate over 300,000 articles from 4 disparate content databases to Drupal.

  • Migration of data (with massive data transformation and clean-up) from 3 different database sources

  • Migration scripts for ongoing data migrations from MS-SQL to Drupal (MySQL)

  • Over 8000 tags as part of the Taxonomy

  • One of the largest installations of the powerful Apache Solr search for free text searching and tag-based (author and source as well) filtering of articles

  • Use of Panels2 with node-queues (almost a mini-CMS implementation) to enable the editorial team to create tag-based in-depth pages at any time

The Centre for Science and Environment (CSE) was inspired to build an environment portal as part of the push for portal development from the National Knowledge Commission, Govt. of India. CSE decided to release 20 years of its research articles, spread over 3 proprietary systems, including a Library Management System, a Content Management System written in ASP/MS-SQL for managing the website of its premier science and environment magazine, Down to Earth (DTE), and some Access databases.

The total number of records exceeded 250,000. CSE also had a very extensive tag vocabulary, or thesaurus, of library tags, organized as What, Where and Who lists, which represented the nature of the articles, the geographies they covered, and the people associated with them (including the authors), respectively. All the articles in its library and the DTE database had been classified and tagged with these.
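As a rough illustration of how the What, Where and Who lists can drive tag-based filtering, here is a minimal Python sketch. It is not the structure CSE or Srijan actually used: the `Article` class, the `filter_articles` helper and the sample records are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article record carrying the three tag lists."""
    title: str
    what: set = field(default_factory=set)   # nature / subject of the article
    where: set = field(default_factory=set)  # geographies the article covers
    who: set = field(default_factory=set)    # people involved, including authors


def filter_articles(articles, what=(), where=(), who=()):
    """Return the articles that carry every requested tag in each list."""
    wanted = {"what": set(what), "where": set(where), "who": set(who)}
    return [
        article
        for article in articles
        if all(tags <= getattr(article, name) for name, tags in wanted.items())
    ]


# Invented sample data, for illustration only.
articles = [
    Article("Groundwater report", {"water"}, {"Rajasthan"}, {"A. Author"}),
    Article("Air quality survey", {"air pollution"}, {"Delhi"}, {"B. Writer"}),
    Article("Urban water supply", {"water"}, {"Delhi"}, {"A. Author"}),
]

for article in filter_articles(articles, what=["water"], where=["Delhi"]):
    print(article.title)
```

Keeping the three lists separate means a reader can combine a subject with a geography or a person, rather than searching one flat list of tags.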

Goals for the new website

  • Migrate all 300,000 records from these scattered systems into one common system

  • Have an excellent tagging system to cross-access content across the site

  • Have a powerful search that lets environment journalists and researchers filter content, so that 100% of the available content can be reached

Srijan's Initial Engagement

  • CSE's vision of the portal was still evolving, and being a non-profit they had started out with a very limited budget

  • Srijan was engaged for a small Information Architecture exercise for the portal, which included studying their library management system, examining the way they tagged articles, and gaining an understanding of DTE

Challenges

The DTE content management system had been around for 4-5 years and had grown along the way to meet the changing requirements of the print magazine, without any documentation whatsoever. From a design perspective the database was therefore in bad shape, with redundant tables and data adding to the confusion.

Solution

Srijan proposed to break the project into multiple phases, with the first being a pilot for data migration of the DTE database, which formed the bulk of the data, into an open source Content Management System.

Choice of platform

The initial choices were Drupal and TYPO3. We eventually chose Drupal, for the following reasons:

  • Drupal had excellent core modules for vocabulary and tag management, which would have had to be written for TYPO3

  • TYPO3 had a separate admin interface, which would have been difficult for the library team to manage and use

  • Drupal's caching mechanism was better than TYPO3's; this was a critical requirement given the high volume of traffic that CSE was expecting for the portal

  • A powerful and fast search engine was required for searching through 250,000 articles. Drupal had an Apache Solr search module integrated with it, which was a candidate for implementing this search.

Key challenges during migration

  • From looking at the MS-SQL database, we knew that its architecture was poor and full of data redundancies.

  • In some cases, fields that were meant to be UNIQUE held duplicate values!

  • Database normalization was missing, so flexibility, data integrity and efficiency all suffered. The new database had to be safeguarded against such anomalies.

  • Since the data had been entered through a Library Management tool, it contained many special characters, which were stored in the database. We had to clean these up while migrating the data to the MySQL database. Without any documentation of the database, the clean-up was time-consuming and repetitive work.

  • Designing our MySQL database to accommodate the specific concepts of box stories and cover stories that come with a magazine-like website. This required understanding the relationships between tables and data as they were maintained in the MS-SQL database. Again, there were many redundancies and anomalies in the database for these tables.
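Two of the checks described above, stripping stray special characters and finding duplicates in supposedly unique fields, can be sketched in Python as follows. This is only an illustration, not the migration scripts Srijan wrote. The source does not say which characters had to be removed, so the choice of control characters and whitespace runs is an assumption, and the function names and sample rows are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Assumption: the "special characters" are non-printing control characters.
# Tab, newline and carriage return are left for the whitespace pass below.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(value):
    """Normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", value)
    text = _CONTROL_CHARS.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def find_duplicates(rows, key):
    """Map each value of `key` that occurs more than once to its count."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Invented rows standing in for records read from the legacy database.
rows = [
    {"article_id": 101, "title": "Down\x00 to\t\tEarth \n"},
    {"article_id": 102, "title": "Box  story"},
    {"article_id": 101, "title": "Cover story"},
]

print(find_duplicates(rows, "article_id"))
print([clean_text(row["title"]) for row in rows])
```

Running the duplicate check before loading lets conflicting records be reviewed by hand instead of failing on a unique-key constraint in the target database.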

Finding a powerful and easy way to search

The next phases of the project comprised setting up Drupal (with a default theme) and building all the required modules to start showcasing the content in the desired manner. A key component of these phases was setting up a powerful search. Apache Solr was researched and selected as the search engine. It turned out to be an excellent decision: in retrospect, not even Google Mini (the other search candidate) would have proved as beneficial for researchers. In another case study specifically on Apache Solr, we will describe how researchers within CSE are using it to their absolute delight.
