Studying ethics might not seem a necessary part of studying NLP
But we suppose you intend to use what you're studying here
So ethics is part of knowing how to work as an NLP practitioner
We'll look at three levels of ethical issue:
Disclaimer: There are many different views about the proper relationship of society, the state and the individual
What follows expresses my non-specialist understanding of the Western liberal tradition that I'm a part of
Key starting point: There is no such thing as value-free science
So a scientist has, or at least shares, an ethical duty to promote the good consequences and try to forestall the bad ones
Groups or individuals may decide the bad uses of their work are bad enough to withdraw altogether
There is a wide range of individual action available
Taking responsibility for the consequences of your work is an extrinsic matter
It runs parallel to the wider social-responsibility level
The empirical/data-driven methodology of contemporary NLP
This view about data is now much more widespread
Open Data is about the input to scientific work
Historically, good science was understood to mean science that was published in peer-reviewed journals
Only big university libraries and big companies could afford to subscribe to a reasonable number of the good journals
The Web has begun to change all that
But this has had some negative consequences
The whole question of dissemination of scientific results has become very complicated
Open Access means the publication of record for a rapidly growing number of articles is online
Even before the Open Access movement link rot was becoming a serious problem
Publishers reorganise their websites, so articles get new URIs
This has led to the rise of third-party vendors of so-called persistent identifiers (PIDs)
The most widely used of these is the Digital Object Identifier (DOI)
For example: doi:10.1145/3184558.3191636
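As a sketch of how DOI resolution works in practice (the function name here is my own, not from any library): a DOI is resolved through the doi.org resolver service, so turning an identifier into a fetchable URL is a one-line transformation.

```python
def doi_to_url(doi: str) -> str:
    # Strip an optional "doi:" prefix, then prepend the doi.org
    # resolver, which redirects to the article's current location.
    if doi.startswith("doi:"):
        doi = doi[len("doi:"):]
    return "https://doi.org/" + doi

print(doi_to_url("doi:10.1145/3184558.3191636"))
# https://doi.org/10.1145/3184558.3191636
```

This indirection is what makes the identifier "persistent": when a publisher reorganises its website, only the resolver's mapping is updated, while the DOI itself never changes.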
Note that not all PIDs are for articles
Nor are they all http: or https: URIs
For example, my ORCID is 0000-0001-5490-13
And the database identifier for the human genome is taxon:9606
Where does our data come from?
The Brown Corpus (1967) was the first attempt at a machine-readable corpus
The ACL/DCI Wall Street Journal corpus (Association for Computational Linguistics/Data Collection Initiative) (1993) was the first big step upward in scale
The first substantial distributions of non-English data came from Edinburgh
Your ideas, your writing and your speech are your intellectual property
Yes, in a way
The details of copyright vary in different legal systems
In some jurisdictions, notably the United States, some forms of copyright violations are treated as major crimes (felonies)
Copyright is inherent
Copyright expires
Some kinds of copying are allowed, for example:
No harm, no foul
Broadcasts
Copyright holders can license their rights
Corpus creators have to get licenses
The vast majority of the work in creating the MLCC was in the licensing negotiations
Licensing restrictions are the reason you can't download the Twitter data for last week's lab
The Informatics corpus collection is divided on the basis of license terms
It's usually straightforward to follow legal and ethical guidelines when using data under license
The Web is itself a corpus
By far the largest source of language data now available
Some notable examples:
And of course you can do your own crawl
The Web has made the impact of copyright in the digital domain even harder to figure out
Is a language model a derived work?
What is the legal status of robots.txt?
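Whatever its legal status, robots.txt is easy to honour in practice. A minimal sketch using Python's standard-library parser, with an illustrative policy and URLs rather than a real site:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Instead of fetching robots.txt over the network, parse an
# example policy directly (hypothetical site and paths):
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# can_fetch(user_agent, url) applies the policy to a URL's path
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
```

In a real crawl you would call `rp.set_url(".../robots.txt")` followed by `rp.read()` for each host, and check `can_fetch` before every request.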
The most emotionally charged aspects of digital copyright relate to music and video
But copyright on text doesn't only affect you as an NLP scientist
It already affects you
You can contribute to the Open Science movement