Bootcamps

Blend your data with Catmandu & Linked Data Fragments by Carsten Klee and Johann Rolschewski

Expected time slot: 6 hours

Audience: Systems librarians, metadata librarians, data managers.

Expertise: Participants should be familiar with command line interfaces (CLI) and the basics of RDF.

Required: Participants must install a virtual machine (VM) as a development environment on their laptops beforehand; see the blog post at https://librecatproject.wordpress.com/get-catmandu/ for further information.

Short description: Catmandu is a command-line tool to access and convert data from your digital library, research services or any other open data set. The Linked Data Fragments (LDF) project developed lightweight tools to publish and query data on the web using the Resource Description Framework (RDF). In combination, both projects offer an easy way to transform your data and provide access via graphical user interfaces (GUI) and application programming interfaces (API). We will present all required tools at the workshop. The participants will be guided through transforming MARC data to RDF and publishing it with an LDF server. They will then run SPARQL queries against the LDF servers to look up entities and "blend" the data with external resources.
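
As a taste of the SPARQL lookup step, here is a minimal Python sketch using rdflib (not part of the workshop toolchain, which relies on the Catmandu command line and an LDF server): it queries a local Turtle file standing in for the RDF produced from MARC data; the file name and the Dublin Core title property are assumptions for illustration only.

  # Minimal sketch: query RDF converted from MARC data, here loaded from a
  # local Turtle file instead of a Linked Data Fragments server.
  from rdflib import Graph

  g = Graph()
  g.parse("records.ttl", format="turtle")  # hypothetical output of the MARC-to-RDF step

  query = """
      SELECT ?record ?title
      WHERE { ?record <http://purl.org/dc/terms/title> ?title }
      LIMIT 5
  """
  for row in g.query(query):
      print(row.record, row.title)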

Carsten Klee and Johann Rolschewski
Berlin State Library, Germany
www.zeitschriftendatenbank.de

Building interactive data applications with Shiny by Harrison Dekker and Tim Dennis

Expected time slot: All day

Audience: Librarians, data analysts, programmers

Expertise: Participants should have experience developing charts, plots, and tables with data.
Basic familiarity with web development will also be helpful.

Required: Previous experience with a programming or scripting language and experience working
with spreadsheets or relational databases.

Programming experience: Previous experience with R will be helpful, but not essential. Since Shiny is code-based
(as opposed to a GUI-based tool like Tableau), those with little or no experience writing
code to manipulate data will be at a disadvantage.

Short description: Librarians are turning their attention to data in two important ways. We are helping users find, interpret and manage data, and we are exploring the potential of our own data to help us assess and develop services. Both modes demand tools and approaches that are impactful, shareable, reproducible and, given the conditions of scarcity we all work within, affordable and community-supported.
Shiny is a package for the widely used open source R programming language that allows users to build interactive documents, dashboards, and web applications. It works by compiling R code into HTML, CSS, and JavaScript, thus allowing developers to focus on a single language when building interactive data visualization applications. Although the R language has a steep initial learning curve, its growing popularity, particularly among scientists and other practitioners who are not programmers by training, means this obstacle is diminished by a robust ecosystem of documentation and training materials. In this spirit, we are offering this workshop in an effort to jumpstart interest and community-building in the library world.
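
As a taste of the reactive model, here is a minimal, hedged sketch using the Python port of Shiny (the workshop itself uses R Shiny, where the analogous structure is a UI definition plus a server function); the histogram example and all identifiers are illustrative only.

  # Minimal Shiny app sketch (Python port of Shiny); run with `shiny run app.py`.
  import matplotlib.pyplot as plt
  import numpy as np
  from shiny import App, render, ui

  app_ui = ui.page_fluid(
      ui.input_slider("bins", "Number of bins", min=5, max=50, value=20),
      ui.output_plot("hist"),
  )

  def server(input, output, session):
      @render.plot
      def hist():
          # Re-runs automatically whenever the slider value changes.
          fig, ax = plt.subplots()
          ax.hist(np.random.randn(500), bins=input.bins())
          return fig

  app = App(app_ui, server)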

Workshop outcomes will include:

  • Review basic principles of data visualization
  • Learn to build a Shiny application
  • Learn how to connect to a variety of data sources, including files, databases, and APIs
  • Explore deployment options for Shiny applications
  • Learn which elements of the R language to focus on to get up to speed as a Shiny developer

Harrison Dekker
Associate Professor and Data Librarian
University of Rhode Island, USA

Tim Dennis
Director Social Sciences Data Archive
University of California, Los Angeles, USA

Data Quality by Patrick Hochstenbach*, Péter Király*, Johann Rolschewski, Annette Strauch* and Jakob Voß

* instructors

Expected time slot: 3 hours

Audience: metadata experts, developers working with metadata

Expertise: general IT knowledge, metadata knowledge

Required: Participants should know how to run tools on a *nix command line. Before the event we will ask participants to download and install a VirtualBox virtual machine that contains, ready to use, all the tools we will show at the event.

Programming experience: A plus, but not required

Short description: Data quality is an important issue when you gather data from different sources to "Blend/Deblend" it into a common schema. In recent years this topic has been highlighted in several projects in Europe and the USA: the Europeana Data Quality Committee [1], ADOCHS [2], Conquaire [3], the DLF Assessment Interest Group Metadata Working Group [4] and the UNT Libraries [5]. There are also successfully finished and ongoing PhD research projects (in Spain, Belgium and Germany) [6]. A community-based bibliography is maintained at [7]. There is also ongoing effort to develop a generic schema [8] for common bibliographic formats like MARC21 [9] and PICA [10], and tools to validate data against schemas [11] [12]. The workshop organizers will show some ways users can start to analyze their data and apply simple metrics to it. The participants will discuss the types of issues regarding metadata quality, talk about metrics and different approaches, try out some tools, and discuss data quality assurance workflows. We will briefly introduce some tools, such as the Avram JSON schema, command-line validation with Catmandu, ways to check MARC21 records, and tools for linked data quality assurance (ShEx [13], SHACL [14], Luzzu [15]).

[6]: Seth van Hooland, Metadata Quality in the Cultural Heritage Sector: Stakes, Problems and Solutions (2009); Sascha Tönnies, Quality Control using Semantic Technologies in Digital Libraries (2013); Nikos Palavitsinis, Metadata Quality Issues in Learning Repositories (2014)
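
As an illustration of the kind of simple metrics mentioned above, the following Python sketch computes field completeness, i.e. the share of records in which a field is present and non-empty, over a few toy records; the field names and values are made up, and real workflows would rather use the schema-aware tools listed above.

  from collections import Counter

  # Toy records standing in for harvested metadata (hypothetical fields/values).
  records = [
      {"title": "Blending data", "creator": "Doe, J.", "date": "2017"},
      {"title": "Deblending data", "creator": None, "date": "2016"},
      {"title": "Metadata quality", "date": ""},
  ]

  def completeness(records, fields):
      """Share of records in which each field is present and non-empty."""
      counts = Counter()
      for rec in records:
          for f in fields:
              if rec.get(f):  # missing, None and "" all count as absent
                  counts[f] += 1
      return {f: counts[f] / len(records) for f in fields}

  print(completeness(records, ["title", "creator", "date"]))
  # {'title': 1.0, 'creator': 0.333..., 'date': 0.666...}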

 

Péter Király
GWDG – Gesellschaft für wissenschaftliche
Datenverarbeitung mbH Göttingen
Am Faßberg 11, 37077 Göttingen
www.gwdg.de

Library Data REST APIs: Design to Deploy by Christina Harlow and Erin Fahy

Expected time slot: 6 hours

Intended audience: Library data system designers; digital library architects; data & services developers; DevOps engineers

Expertise: Some familiarity with AWS ECS, Dynamo, S3, Kinesis, Lambdas; some familiarity with Docker; awareness of Go; awareness of library data schemas and models like PCDM, JSON-LD, and REST APIs

Required: Laptop with stable internet connection; AWS Free Account (information sent out before the workshop); Latest Docker install locally; Go installed locally.

Programming experience: Some scripting experience; some experience with containerization

Short description: A growing number of people see the need to evolve library data systems towards a data-forward microservices architecture. But how to get there? This bootcamp gives an overview of a data API and service example, going from design to development to deployment. It is not meant to be an in-depth dive into every topic therein, but to link these domains and topics together. We hope that participants will leave more comfortable with how to start separating out services and data needs for an evolution towards clear RESTful APIs and scalable microservices.
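
Purely to illustrate the shape of such a small RESTful resource service (the bootcamp itself works with Go services deployed on AWS), here is a hedged Python/Flask sketch; the /items routes, the druid-style identifier and the in-memory dictionary standing in for a store such as DynamoDB are all invented for the example.

  from flask import Flask, abort, jsonify, request

  app = Flask(__name__)

  # In-memory stand-in for a persistence layer (e.g. DynamoDB); data is hypothetical.
  ITEMS = {"druid:abc123": {"id": "druid:abc123", "title": "Sample object"}}

  @app.route("/items/<item_id>", methods=["GET"])
  def get_item(item_id):
      item = ITEMS.get(item_id)
      if item is None:
          abort(404)
      return jsonify(item)

  @app.route("/items", methods=["POST"])
  def create_item():
      body = request.get_json(force=True)
      ITEMS[body["id"]] = body
      return jsonify(body), 201

  if __name__ == "__main__":
      app.run(port=8080)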

Christina Harlow
Data Operations Engineer
Stanford University Library, USA

Erin Fahy
DevOps Engineer
Stanford University Library, USA

Text mining: Beyond the basics by Eric Lease Morgan

Time slot: 1/2 day (3 hours)

Audience: librarians, programmers, and/or researchers of any type

Expertise: strong familiarity with the use of a text editor

Required: Internet connection; fully functional text editor like Notepad++ or TextWrangler; local Java installation and familiarity with the command line are both desirable but not necessary

Programming: no programming experience is necessary, but some knowledge of regular expressions is helpful

Short description: Using freely available tools, it is possible to go beyond basic text mining and instead begin to read corpora at "scale". This bootcamp teaches how. At the end, participants will be able to:

  • identify patterns, anomalies, and trends in a corpus
  • practice both "distant" and "scalable" reading
  • enhance & complement their ability to do "close" reading
  • use & understand any corpus of poetry or prose

Activities in the bootcamp include:

  • learning what text mining is, and why one should care
  • creating a corpus
  • creating a plain text version of a corpus
  • creating simple word lists with a text editor
  • creating advanced word lists with other tools
  • cleaning & analyzing a corpus with OpenRefine
  • charting & graphing a corpus with Tableau Public
  • extracting parts-of-speech
  • extracting named entities
  • identifying topics and themes
  • using a concordance to intelligently "read" a corpus

Anybody with sets of texts can benefit from this bootcamp. Any corpus of textual content is apropos: journal articles, books, the complete run of a magazine, blog postings, Tweets, press releases, conference proceedings, websites, poetry, etc. This bootcamp is operating system agnostic. All the software used in this workshop is freely available on the 'Net, or it is already installed on one's computer. Active participation requires zero programming, but students must bring their own computer, and they must know how to use a text editor such as Notepad++ or TextWrangler. Neither WordPad nor TextEdit is sufficient.
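
Although active participation requires no programming at all, the optional Python sketch below (using NLTK, which is not part of the bootcamp toolchain) shows how two of the listed activities, building a simple word list and reading a corpus through a concordance, look in code; the corpus file name and the query word are placeholders.

  import nltk  # requires `pip install nltk` and, once, nltk.download("punkt")

  # Any plain-text corpus will do; the file name is only an example.
  with open("corpus.txt", encoding="utf-8") as fh:
      raw = fh.read()

  tokens = nltk.word_tokenize(raw.lower())
  text = nltk.Text(tokens)

  # Simple word list: the most frequent longer words, a crude theme indicator.
  freq = nltk.FreqDist(t for t in tokens if t.isalpha() and len(t) > 5)
  print(freq.most_common(10))

  # Concordance: every occurrence of a word shown in its immediate context.
  text.concordance("library", lines=5)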

Eric Lease Morgan
Digital Initiatives Librarian
University of Notre Dame, USA

Using metrics with StatsD, Graphite and Grafana in your library by Uwe Dierolf

Expected time slot: 2-4 hours

Audience: IT people but also librarians interested in working with metrics.

Expertise: We have been using this solution for almost two years at the KIT library.

Required: Linux or Windows laptop or tablet with a browser

Programming experience: No programming experience is necessary to understand the tool chain.
Linux know-how helps in understanding how to run the tools using Docker container virtualization,
but the Docker installation is an optional part of the workshop.

Short description: Working with metrics is very helpful to better understand what happens in your library. The visualization of metrics data with powerful dashboards makes the use of these values much easier, even and especially for non-IT staff. This workshop explains three tools to collect, store and visualize metric values. StatsD is originally a simple daemon developed and released by Etsy to aggregate and summarize application metrics. StatsD aggregates metrics and relays them to Graphite. Graphite itself stores the metrics in a time-series database and offers a web interface to visualize them. However, Graphite does not offer the simplicity of Grafana when it comes to creating powerful dashboards. Grafana is an almost perfect tool for end users that helps them work with the metrics data. Besides monitoring, it offers templating, annotations and even alerting.
The participants get an insight into a setup that has been running for more than two years at the KIT library using Docker containers on Linux. Metric data is either created directly within web applications in a non-blocking way, or derived from data lying in logfiles; the latter approach will also be shown.
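
As a sketch of the non-blocking, fire-and-forget reporting described above, the following Python snippet sends a counter and a timing to StatsD over UDP using the plain StatsD line protocol; the metric names are invented, and port 8125 is simply the StatsD default.

  import socket
  import time

  STATSD_ADDR = ("localhost", 8125)  # default StatsD UDP port; adjust to your host
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  def send_metric(line):
      """Fire-and-forget UDP send: the application never blocks on metrics."""
      sock.sendto(line.encode("ascii"), STATSD_ADDR)

  # Count an event and report its duration, e.g. around a catalogue search.
  start = time.time()
  # ... handle the search request ...
  elapsed_ms = int((time.time() - start) * 1000)
  send_metric("library.search.requests:1|c")
  send_metric("library.search.duration:%d|ms" % elapsed_ms)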

Uwe Dierolf
Head of IT
Karlsruhe Institute of Technology, Germany