awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Wed Jun 21, 2017 4:53 am

Intro

I've started coding a plugin for Awasu in Python 3.6.1. For now the idea is simple: I generate a report from Awasu, open it in Python, modify it, save it, and open it in a browser.

Question
In the header I have

Code: Select all

# -*- coding: utf-8 -*-
For opening the file I use:

Code: Select all

a_html_reportdata = urllib.request.urlopen(pathlib.Path(processed_file_path).as_uri())
data = a_html_reportdata.read()
text = data.decode("utf8", "ignore")


After parsing the file into separate headers, links etc., I simply want to write it to a file, but I get a message in the console that there is a non-UTF-8 character in the string. This is strange, because the file from Awasu looks UTF-8 encoded to me (I added

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
to the report template from the start). So the file looks fine in a browser, but after processing I have problems with national characters from the feeds.

Could you suggest a solution here?

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Wed Jun 21, 2017 6:06 am

Yah, Unicode and encodings are difficult :( You really need to understand the basics of how it all works to have any chance of working with this stuff - this is a good place to start.

The TL;DR is that every time you deal with text, you need to know (1) what characters are in the text and (2) what encoding it is in.

When the Python interpreter reads your script (so that it can run it), it's reading text, so it needs to know what encoding it's in. When you write:

Code: Select all

# -*- coding: utf-8 -*-

you're telling Python "the text in this file is encoded using UTF8". This lets you have non-English text in your script file e.g. for string literals (generally not a good idea, but probably OK if you declare the encoding like this).

When Python downloads the HTML page from a URL, it's reading text, so it needs to know what encoding it's in. This line:

Code: Select all

text = data.decode("utf8","ignore")

converts the downloaded data from raw bytes to text, on the understanding that the text has been encoded using UTF8.
NOTE: This will work most of the time since UTF8 is, by far, the most common encoding, but if you ever come across a page that has been encoded using something else, this line won't work because it will be using the wrong encoding. The correct way to do this is to check the HTML and/or HTTP headers, to find out what encoding the page is in.
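A rough sketch of that check (the helper name and the 2 KB sniff limit are illustrative choices here, not an Awasu or stdlib API):

```python
import re

def detect_charset(raw_bytes, http_content_type=None, default="utf-8"):
    # 1) the HTTP header takes priority, e.g. "text/html; charset=iso-8859-2"
    if http_content_type:
        m = re.search(r"charset=([\w-]+)", http_content_type, re.I)
        if m:
            return m.group(1).lower()
    # 2) otherwise sniff for a <meta ... charset=...> near the top of the file
    head = raw_bytes[:2048].decode("ascii", "ignore")
    m = re.search(r'charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3) fall back to UTF8, since it is by far the most common
    return default

html = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(detect_charset(html))  # utf-8
```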

When a browser displays a web page, it needs to read the HTML, which is text, so it needs to know what encoding it's in. This line:

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

tells the browser that the file is encoded using UTF8.

You didn't show the code that outputs the HTML itself, but I'll bet you're using plain old print statements, and this is almost certainly where your problem is. You included a <meta> tag that says the HTML is encoded using UTF8, but when you print the HTML out, you also need to encode it! :)

In Python 2, you could do something like this:

Code: Select all

print u"日本".encode("utf8")

but this doesn't work in Python 3 because it insists on doing everything using "Unicode", even when you don't want it to :wall:

There are a few ways around this, but the easiest is to write raw output to stdout e.g.

Code: Select all

sys.stdout.buffer.write( "日本".encode("utf8") )

This will work, but things are never this easy :|

Your console also deals with text, when it's displaying output and accepting input, and so it also needs an associated encoding - and if you're on Windows, it's not going to be UTF8. If you're running Windows in English or some other Western language, it will be Windows-1252, but the exact value doesn't really matter: unless it's UTF8, your program will be outputting text in UTF8 while your console tries to interpret it as something else, which is not going to work.
NOTE: Unfortunately, if your script happens to only output ASCII text, it will look like it's working (since the encoding mismatch won't come up), but that's just luck, and it will break if it ever outputs any non-English text.
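A quick interpreter demonstration of why ASCII-only output hides the mismatch:

```python
# ASCII characters encode to the same single byte in UTF8 and CP-1252,
# so a UTF8/CP-1252 mismatch is invisible until non-English text shows up.
assert "hello".encode("utf8") == "hello".encode("cp1252") == b"hello"

# Non-ASCII characters are stored differently - this is where it breaks:
assert "é".encode("utf8") == b"\xc3\xa9"    # two bytes in UTF8
assert "é".encode("cp1252") == b"\xe9"      # one byte in CP-1252
print("mismatch only visible for non-ASCII text")
```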

However, if you pipe the output of your script to a file, then open that file as a UTF8 file, you'll see that it has worked properly. This is what Awasu does, so everything will work if you do things this way, you just need to be aware that while you're working on your script, you need to pipe the output to a file (or somehow set the console's encoding to UTF8).
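If you'd rather force UTF8 from inside the script, one option is to wrap the underlying byte stream yourself - a sketch, where a BytesIO stands in for the console's byte stream (in a real script you would wrap sys.stdout.buffer the same way):

```python
import io

raw = io.BytesIO()  # stand-in for sys.stdout.buffer
utf8_out = io.TextIOWrapper(raw, encoding="utf-8")

# everything written through the wrapper is encoded as UTF8,
# regardless of the console's default code page
utf8_out.write("日本")
utf8_out.flush()
assert raw.getvalue() == "日本".encode("utf-8")
```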

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Wed Jun 21, 2017 5:37 pm

support wrote:Yah, Unicode and encodings are difficult You really need to understand the basics of how it all works to have any chance of working with this stuff


After a few days of research I fully agree. At the start I thought that the Awasu HTML report file being UTF-8 would make the situation easier. I was wrong.

*

When I was a kid I programmed in Turbo Pascal and things were somewhat easier. The only characters I was interested in were ASCII. Eh! Good times!

support wrote:There are a few ways around this, but the easiest is to write raw output to stdout e.g.
CODE: SELECT ALL
sys.stdout.buffer.write( "日本".encode("utf8") )


I haven't tried this approach yet. With try/except I only pass over the errors to find out which part of the text causes trouble, but that is not a good solution - feeds get skipped.

support wrote:NOTE: This will work most of the time since UTF8 is, by far, the most common encoding, but if you ever come across a page that has been encoded using something else, this line won't work because it will be using the wrong encoding. The correct way to do this is to check the HTML and/or HTTP headers, to find out what encoding the page is in.


At this stage I'm working with a local HTML file, so I can't get HTTP headers. I can't get the report data via the API, so I made a workaround.

*

Question
1. Is it possible to return all unread feed data in JSON format with one API call?

I only get data from one channel (not all), and in HTML rather than JSON format, with this API call:

Code: Select all

http://localhost:2604/channels/get?format=json&token=xxx&id=100&id=10


support wrote:Your console also deals with text, when it's displaying output and accepting input, and so also needs to have an associated encoding, and if you're on Windows, it's not going to be UTF8. If you're running Windows in English or some other Western languages, it will be Windows-1252


I found this. Theoretically you can run this in the console

Code: Select all

chcp 65001
to resolve this, but it breaks Windows. Libraries stop working correctly, and if you make the change in regedit, the OS won't boot...

support wrote:However, if you pipe the output of your script to a file, then open that file as a UTF8 file, you'll see that it has worked properly. This is what Awasu does, so everything will work if you do things this way, you just need to be aware that while you're working on your script, you need to pipe the output to a file (or somehow set the console's encoding to UTF8).


I need to think about changing my approach. I simply try to read all the data into a list / dictionary, remove the unwanted feeds by manipulating it, and only write the results out to a file to view them in a browser. I'm stuck on the write. It may seem silly to a more advanced programmer like you, but I've spent more time looking for how to remove non-UTF-8 characters (from a file encoded in UTF-8!) so I can write UTF-8 output than actually filtering my data. It's crazy :computer:

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Thu Jun 22, 2017 12:35 am

awasu.user wrote:
support wrote:Yah, Unicode and encodings are difficult You really need to understand the basics of how it all works to have any chance of working with this stuff


After a few days of research I fully agree. At the start I thought that the Awasu HTML report file being UTF-8 would make the situation easier. I was wrong.

I feel your pain, it's not easy figuring out how this all works, but it does make sense when you get there. Unfortunately, you can't just ignore it and hope it doesn't make a difference - there has to be some encoding, and if it wasn't UTF8, it would be something else. You had almost everything right, but I think you just missed encoding the text when printing it out.

awasu.user wrote:When I was a kid I programmed in Turbo Pascal and things were somewhat easier. The only characters I was interested in were ASCII. Eh! Good times!

When I was a kid, I did assembly language, so it was even easier - we only had 0's and 1's :hysterical:

awasu.user wrote:1. Is it possible to return all unread feed data in JSON format with one API call?

I only get data from one channel (not all), and in HTML rather than JSON format, with this API call:

Code: Select all

http://localhost:2604/channels/get?format=json&token=xxx&id=100&id=10


$/channels/get is for retrieving a channel's summary page i.e. the page you see when you open a channel in Awasu, so you will always get only 1 channel. The easiest way to do what you want is to use reports.

First, we want a report that is generated from what you want: unread items, from all channels. Awasu creates one for you when you first install it ("Awasu unread items"), but in case you've deleted it, set it up like this:
* source = channel filter
* channel filter = all
* group items by channel
* include items = unread items only

Run it, and you should get an HTML page with the content you want.

All reports are generated from templates, which are kept in $/Resources/Report Templates/. By default, Awasu will use Rusty.template, so go back to the report's configuration and set the template file to MetaChannel.template. Set the output file to something with a .XML extension, and run the report again. You will see the same feed items, but in XML format.

So, all you have to do is write a new template, based on MetaChannel.template, that generates JSON output, then configure your report to use this new template. Clear the output file setting as well, or set it to something with a .JSON extension (Awasu uses the extension to figure out how to encode special characters, and the rules are different for HTML, XML and JSON :|).

Because you're generating JSON, you will also want to use insert="," on the {%REPEAT% Channels-IfGroupingItems%} and {%REPEAT% FeedItems%} tags - this will insert a comma between each item (it's not in the MetaChannel.template file, since XML doesn't need commas to separate list items).

You can then get this report via the API:

Code: Select all

url = "http://localhost:2604/reports/get?name=..."
buf = urllib.request.urlopen( url ).read()

Note that there's no need to read the report from a file, you can get it directly over the wire.
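The buf you get back is raw bytes, so the same decoding rule from earlier applies before you can parse it - a minimal sketch, using a stand-in byte string for the response body:

```python
import json

# stand-in for urllib.request.urlopen( url ).read() on a JSON report
buf = '{"articles": {"BBC": [{"headline": "日本"}]}}'.encode("utf8")

# decode the bytes as UTF8, then parse the JSON
report = json.loads(buf.decode("utf8"))
print(report["articles"]["BBC"][0]["headline"])  # 日本
```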

kevotheclone
Posts: 239
Joined: Mon Sep 08, 2008 7:16 pm
Location: Elk Grove, California

Postby kevotheclone » Sat Jun 24, 2017 4:08 pm

I'm not an expert on Unicode, but the ftfy (fixes text for you) library may help.
https://github.com/LuminosoInsight/python-ftfy
http://ftfy.readthedocs.io/en/latest/

You should be able to install ftfy using pip.
Python 3

Code: Select all

pip install ftfy

Python 2

Code: Select all

pip install 'ftfy<5'

Later, I'll post a reply about a JSON format that might be useful... hint: https://jsonfeed.org/

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Sat Jun 24, 2017 9:54 pm

kevotheclone wrote:I'm not an expert on Unicode, but the ftfy (fixes text for you) library may help.


I tried ftfy.fix_encoding() before and it was not working for me. The bug is where I read the file or where I save it. Using pass I can skip the error, but one feed gets lost. I'm still learning Python :oops: So I started by begging the Python shell to work with this code:

Code: Select all

a_report_file = 'awasu-test-feeds.html'
import urllib.request
import pathlib
from pathlib import Path
import sys

local_path = 'D:\\awasu-ext\\'
processed_file_path = local_path + "\\" + a_report_file

a_html_reportdata = urllib.request.urlopen(pathlib.Path(processed_file_path).as_uri()) #better urllib.request.pathname2url(path) ?
data = a_html_reportdata.read()
text = data.decode("utf8","ignore").encode("utf8")

print("Processing 'sys.stdout.write(str(data))'")

output_file = open("test.html", "w")
sys.stdout = output_file
output_file(sys.stdout.write(str(text, "utf8")), "w")
output_file.close()
print("Done!")


The error occurs when I save the file:

Code: Select all

output_file(sys.stdout.write(str(text, "utf8")), "w")


with this:

Code: Select all

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xef' in position 35298: character maps to <undefined>


I have a problem with one character in the string:
"Loïc Venance, AFP | The National Assembly pictured on June 18, 2017" (not tagged as &#xef; in the file), described as Latin Small Letter I With Diaeresis. It is a tough one. I try to ignore it, but it still messes things up.
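For what it's worth, the character itself is harmless - it only fails when the output codec has no mapping for it. A quick interpreter check, using ascii as an example of such a codec:

```python
# 'ï' (U+00EF) encodes fine as UTF8, and even as CP-1252...
assert "ï".encode("utf8") == b"\xc3\xaf"
assert "ï".encode("cp1252") == b"\xef"

# ...but a codec with no slot for it raises the familiar error:
try:
    "ï".encode("ascii")
except UnicodeEncodeError as err:
    print(err)
```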

How I think my code works

1. Create a local path with file:// to simulate the API, so I can use an existing Awasu report HTML file
2. Read the HTML file as a string; in the troublesome places, characters are removed if they are not valid UTF8, so the text from the file is a UTF8 string
3. Set stdout to a file
4. Convert the bytes to a string using UTF8 (but I don't understand why I have to do this, because I thought I had a string, not bytes).

The Awasu file is UTF8 and it should all work fine, but it doesn't :unsure:

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Sun Jun 25, 2017 1:31 am

You're almost there :-)

This line is not necessary, and might even cause problems later:

Code: Select all

text = data.decode("utf8","ignore").encode("utf8")

You're taking the data you read from the file, converting it to Unicode (assuming that it was encoded using UTF8), then converting from Unicode back to UTF8. It's not necessary since the data is already in UTF8 before you started.

The main problem is when you tell Python to write the string to the file:

Code: Select all

output_file(sys.stdout.write(str(text, "utf8")), "w")

First, the way output_file is being used is incorrect, but you're hitting the encoding error before you get that far :-) It should be something like this:

Code: Select all

output_file.write( str(text,"utf8") )

You convert the UTF8 string to Unicode, then ask Python to write that string to the file, but you didn't say what encoding to use. The clue was in the error message - for me, Python died at line 19 in a file called encodings/cp1252.py. cp1252 is another name for Windows-1252, which tells me that Python is trying to convert the string to CP-1252, which makes sense since we're on Windows - unless you say otherwise, it's going to try and do things using CP-1252.

Try this:

Code: Select all

# nb: read the file directly, no need to use urllib
data = open( processed_file_path , "rb" ).read()

with open("test.html", "w") as output_file :
    # nb: no need to mess around with sys.stdout, we just write to output_file
    # nb: we write out the raw UTF8 bytes, no need to convert to Unicode and then back to UTF8
    output_file.buffer.write( data )


If you want to use Unicode strings, you need to tell Python what encoding to use when writing out to the file (instead of the default CP-1252):

Code: Select all

udata = str( data , "utf-8" ) # convert from UTF8 bytes to a Unicode string
with open("r:/test.html", "w", encoding="utf-8") as output_file :
    print( udata , file=output_file )

This is less efficient, since you're converting the string to Unicode, then Python will convert it back to UTF8 prior to writing it out to the file. TBH, I prefer to work with raw bytes, so that I know exactly what's going on, instead of relying on Python to do things in the background so that things just magically work :|

The way I started to understand how all of this works is when I realized what the main difference between a character set and an encoding was:
(*) a character set is a table of numbers and letters e.g. in ASCII, 65=A, 66=B, etc.
(*) an encoding is how you store those numbers.

In the early days of Unicode, every character was 16 bits, so if you wanted to store the number 65, you had 2 choices: 0x00 0x41 (big-endian) or 0x41 0x00 (little-endian). So, every time you create a string, or read a string, you need to know if it is in big-endian or little-endian format, because if you use the wrong one, you're going to get the wrong characters in the string. So, every time you store a string, whether it be in memory, or in a file, or in a network packet, you need to think about "how do I want to store these numbers?", or IOW, "what encoding do I want to use?". The 2 methods I gave above are 2 ways of storing these numbers (i.e. 2 different encodings, UCS2-LE and UCS2-BE), and UTF8 is just another way, optimized so that frequently-used letters use less memory.
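That difference is easy to see from the interpreter (nb: Python's utf-16-be / utf-16-le codecs store the same bytes as UCS2-BE / UCS2-LE for characters in this range):

```python
# the character set says A = 65; the encoding says how 65 is stored
assert "A".encode("utf-16-be") == b"\x00\x41"     # big-endian: 0x00 0x41
assert "A".encode("utf-16-le") == b"\x41\x00"     # little-endian: 0x41 0x00
assert "A".encode("utf-8")     == b"\x41"         # UTF8: frequent chars, 1 byte
assert "日".encode("utf-8")    == b"\xe6\x97\xa5"  # rarer chars, more bytes
print("same numbers, different byte layouts")
```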

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Mon Jun 26, 2017 8:10 am

It's working! After days of trying I have a basic skeleton.

Code: Select all

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
import re
from tqdm import * #progress bars https://github.com/noamraph/tqdm
from bs4 import BeautifulSoup

class Awasu:
    class io_data:
        def open():
            a_report_file = 'awasu-test-feeds.html'
            local_path = 'D:\\awasu-ext\\'
            processed_file_path = local_path + "\\" + a_report_file

            print("Loading...")
            data = open( processed_file_path , "rb" ).read()
            print("Loaded!")
            return data

        def save(data, filename = 'test.html'):
            print("Saving...")
            with open(filename, "w") as output_file :
                output_file.buffer.write( data )
            print("Saved!")

    class data_parser:
        def getFeedsHeadlines(data):
            headlines = ''
            duplicates = 0
            HTMLdata = data
           
            soup = BeautifulSoup(HTMLdata, "lxml")
   
            splitByCh = soup.findAll("div","a-channel") #Awasu channel data

            for i in tqdm(range(len(splitByCh))):
                currChHTML = splitByCh[i]
                tmpCh = BeautifulSoup(str(currChHTML), "lxml")
               
                splitByFeed = tmpCh.findAll("li") #feed html code
                numberOfFeeds = len(splitByFeed)
             
                for j in range(numberOfFeeds):
                    currHeadline = splitByFeed[j].find("a","a-feed-url").getText()

                    if headlines.find(currHeadline) == -1:
                        headlines +=  currHeadline + ';'
                    else:
                        duplicates = duplicates + 1
            print("Done! Duplicates: ", duplicates)

            headlines = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', headlines)
            headlines = headlines.replace("  ", " ")
            return headlines
               
a = Awasu()
process = a.io_data
data = process.open()
headlines = a.data_parser.getFeedsHeadlines(data)

tokens = nltk.word_tokenize(headlines.replace(";","."))
pairs = nltk.bigrams(tokens)
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(pairs)
finder.apply_freq_filter(4)

result = finder.nbest(bigram_measures.pmi, 10)

print(result)


It's a basic example of analytics for this report file structure:

Code: Select all

{%REPEAT% Channels-IfGroupingItems}
{%?GROUP-ITEMS-BY-CHANNEL%}
       <div class="a-channel">
        <div class="a-channel-head"><h2 class="awasu-head">{%CHANNEL-METADATA% name}</h2></div>
{%ENDIF%}
<ol>
{%REPEAT% FeedItems}
<li>
<div class="a-feed">
   <div class="a-feed-data">
      <a class="a-feed-url" href="{%ITEM-METADATA% url encode=attr}" title="Open" target="_blank">{%ITEM-METADATA% name!}</a>
      <div class="a-source">
         <a href="#" title="{%ITEM-METADATA% Source name/html}">[ {%ITEM-METADATA% Source name/html} ]</a>
      </div>
   </div>
   <span class="a-feed-time">{%ITEM-METADATA% timestamp}</span>
</div>
</li>
{%/REPEAT%}
</ol>
</div>
{%/REPEAT%}

<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -->

<hr/>
<div class="footer">Report generated {%REPORT-TIME%}</div>

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Mon Jun 26, 2017 9:23 am

Cool, glad to hear it's finally going :clap:

The first one is always the hardest, but any scripts you write now can just use this one as a starting point. And it's always so cool to see output coming from a script that's been talking to Awasu ::-):

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Sat Jul 08, 2017 10:06 am

Thank you for all your words! I'm trying to extend my code by calling the API. Before that I made a template:
{
"articles": {
{%REPEAT% Channels-IfGroupingItems}
{%?GROUP-ITEMS-BY-CHANNEL%}
"{%CHANNEL-METADATA% name}": [
{%ENDIF%}
{%REPEAT% FeedItems}
{
"url":"{%ITEM-METADATA% url encode=attr}",
"headline":"{%ITEM-METADATA% name!}"
"source": "{%ITEM-METADATA% Source name/html}"
"published":"{%ITEM-METADATA% timestamp}"
},
}{%/REPEAT%}
}
],
{%/REPEAT%}


Its purpose is to generate JSON data for further processing in Python. But when I call

http://localhost:2604/channels/get?format=json&token=xxx&id=159


as the result I get:
{ "articles": {


I can't understand why this happens.

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Sat Jul 08, 2017 10:39 am

Things like {%REPEAT% FeedItems} and {%REPEAT% Channels-IfGroupingItems} and {%?GROUP-ITEMS-BY-CHANNEL%} only work in report templates. If you try to use them in an API template, nothing will happen (as you saw :-)).

I'm also not quite sure how you got this to work at all :) - $/channels/get is special in that it's not generated from a template file, so even if you stored this as $/Resources/API/channels-get.json, Awasu won't use it.

What are you trying to do? It looks like you're trying to get the feed items for a channel - the only way to do this is via a report, or by getting the channel's summary page (using $/channels/get).

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Sat Jul 08, 2017 11:17 am

My final purpose is to get this information from Awasu:
channel name, e.g. BBC
article title, e.g. "London is awaking"
article URL, e.g. "http://www.bbc.co.uk/blabla"
published date, e.g. "2017-07-08".

Using $/channels/get generates HTML output. I would like JSON out, so I put "/channels/get?format=json&" in the URL, but it doesn't work as I expected: in place of JSON data I get HTML. So I located the template used to generate the content and replaced it with my custom file "JSON.template". I can't get a result, because the Awasu variables don't work there.

I want HTML output when I run a report from Awasu, and JSON data when I call it from a script. I want the possibility to select specific channels by ID, get the unread articles from them, and then process them in Python.

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Sat Jul 08, 2017 11:34 am

awasu.user wrote:I want HTML output when I run a report from Awasu

This is what happens anyway, so nothing needed here.

awasu.user wrote:and JSON data when I call it from a script. I want the possibility to select specific channels by ID, get the unread articles from them, and then process them in Python.

Call $/channels/get to get a channel's summary page, in HTML. Use the same technique you used earlier in this thread (using BeautifulSoup) to locate the item titles, URL's, etc. and put them into a Python list. Wrap it all up in a function, so you can then do something like:

Code: Select all

channel_name , feed_items = get_channel_unread_items( 159 )
print( "channel name:" , channel_name )
for feed_item in feed_items :
    print( feed_item )

awasu.user
Posts: 82
Joined: Fri Jan 06, 2017 12:50 pm

Postby awasu.user » Sat Jul 08, 2017 11:55 am

All right. I had tried another approach - getting JSON data from the start - to save the time needed for parsing HTML. I will do as you suggest. Thank you!

support
Site Admin
Posts: 3022
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Fri Jul 14, 2017 2:21 am

I wrote up a tutorial on how character sets and encoding work here.

HTH :)

