In this article I am going to show you how I was able to extract and process some information from Wikipedia using only a combination of common bash utilities like curl and grep.

If you are a sports lover like me, I guess your heart is currently being warmed by the Rio 2016 Olympic Games. My favourite sport in the games is Judo, and now that the competitions are over I was wondering who the best Olympic "judokas" of all time are by number of medals collected during the games (no matter the kind of medal).

I tried to google the answer for a while, but it wasn't easy to find an up-to-date result, so I decided to do some quick research and try to reach a conclusion by myself. I have to say it was a bit tougher than I expected, but it was definitely fun…

The dataset

The first thing I needed was a reliable and up-to-date data source listing all the Judo Olympic medal winners in history. This was easy to find on Wikipedia: List of Olympic medalists in Judo.

However, the data on this Wikipedia page is structured to be easy for humans to read, not to be processed by a machine. I also didn't find any way to get the same data in CSV or JSON format, so the only viable option was to extract the data from the web page myself.

At first I thought about writing a quick and dirty JavaScript command and using a library like cheerio to extract the data directly from the HTML code of the page, but it sounded like too much work for the simple goal I had in mind. So I had another quick look at Wikipedia to find out if there was a better format to extract the information from. Going to the edit option of the page, I realized that parsing the wikitext (source) of the page would be much easier, and I could even use a regular expression to extract the relevant information from there.

At this stage I wondered if there was a way to get only the wikitext of a specific Wikipedia page. It turns out that it's possible and very easy: you just need to append the query parameter ?action=raw to the URL! So hitting the page URL with that parameter appended will give us our starting dataset. Note that, for the sake of brevity, a lot of data is stripped from the examples from now on.

We can easily get this data into our bash shell with curl.
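As a minimal sketch of the ?action=raw trick, the snippet below builds the raw wikitext URL for a page and fetches it with curl. The function names `raw_url` and `fetch_wikitext` are hypothetical helpers made up for illustration, and the page title `List_of_Olympic_medalists_in_judo` is an assumption based on the article name mentioned above, so adjust it if the page lives elsewhere.

```shell
#!/usr/bin/env bash

# Build the URL that returns only the wikitext of a Wikipedia page.
# Assumption: the page is on the English Wikipedia under /wiki/<title>.
raw_url() {
  local title="$1"
  printf 'https://en.wikipedia.org/wiki/%s?action=raw\n' "$title"
}

# Download the wikitext of a page to standard output.
# -f: fail on HTTP errors, -s: no progress bar,
# -S: still print errors, -L: follow redirects.
fetch_wikitext() {
  curl -fsSL "$(raw_url "$1")"
}

# Example usage (needs network access, so it is left commented out):
# fetch_wikitext "List_of_Olympic_medalists_in_judo" | head -n 20
```

Keeping the URL construction in its own function makes it easy to check the address in a browser first, and the output of `fetch_wikitext` can then be piped straight into grep.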