Age | Commit message (Collapse) | Author |
|
|
|
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Co-authored-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Related: https://github.com/searxng/searxng/issues/2982
Closes: https://github.com/searxng/searxng/issues/2972
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
|
|
Partial reverse engineering of the Google engines including a improved language
and region handling based on the engine.traits_v1 data.
When ever possible the implementations of the Google engines try to make use of
the async REST APIs. The get_lang_info() has been generalized to a
get_google_info() function / especially the region handling has been improved by
adding the cr parameter.
searx/data/engine_traits.json
Add data type "traits_v1" generated by the fetch_traits() functions from:
- Google (WEB),
- Google images,
- Google news,
- Google scholar and
- Google videos
and remove data from obsolete data type "supported_languages".
A traits.custom type that maps region codes to *supported_domains* is fetched
from https://www.google.com/supported_domains
searx/autocomplete.py:
Reversed engineered autocomplete from Google WEB. Supports Google's languages and
subdomains. The old API suggestqueries.google.com/complete has been replaced
by the async REST API: https://{subdomain}/complete/search?{args}
searx/engines/google.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- always use the async REST API (formally known as 'use_mobile_ui')
- use *supported_domains* from traits
- improved the result list by fetching './/div[@data-content-feature]'
and parsing the type of the various *content features* --> thumbnails are
added
searx/engines/google_images.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- if exists, freshness_date is added to the result
- issue 1864: result list has been improved a lot (due to the new cr parameter)
searx/engines/google_news.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
*supported_domains* is not needed but a ceid list has been added.
- different region handling compared to Google WEB
- fixed for various languages & regions (due to the new ceid parameter) /
avoid CONSENT page
- Google News do no longer support time range
- result list has been fixed: XPath of pub_date and pub_origin
searx/engines/google_videos.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- add paging support
- implement a async request ('asearch': 'arc' & 'async':
'use_ac:true,_fmt:html')
- simplified code (thanks to '_fmt:html' request)
- issue 1359: fixed xpath of video length data
searx/engines/google_scholar.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- request(): include patents & citations
- response(): fixed CAPTCHA detection (Scholar has its own CATCHA manager)
- hardening XPath to iterate over results
- fixed XPath of pub_type (has been change from gs_ct1 to gs_cgt2 class)
- issue 1769 fixed: new request implementation is no longer incompatible
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Implements a fetch_traits function for the Google engines.
.. note::
Does not include migration of the request methode from 'supported_languages'
to 'traits' (EngineTraits) object!
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
|
|
Most engines that support languages (and regions) use the Accept-Language from
the WEB browser to build a response that fits to the language (and region).
- add new engine option: send_accept_language_header
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
This change the previous bypass method for Google consent using
``ucbcb=1`` (6face215b8) to accept the consent using ``CONSENT=YES+``.
The youtube_noapi and google have a similar API, at least for the consent[1].
Get CONSENT cookie from google reguest::
curl -i "https://www.google.com/search?q=time&tbm=isch" \
-A "Mozilla/5.0 (X11; Linux i686; rv:102.0) Gecko/20100101 Firefox/102.0" \
| grep -i consent
...
location: https://consent.google.com/m?continue=https://www.google.com/search?q%3Dtime%26tbm%3Disch&gl=DE&m=0&pc=irp&uxe=eomtm&hl=en-US&src=1
set-cookie: CONSENT=PENDING+936; expires=Wed, 24-Jul-2024 11:26:20 GMT; path=/; domain=.google.com; Secure
...
PENDING & YES [2]:
Google change the way for consent about YouTube cookies agreement in EU
countries. Instead of showing a popup in the website, YouTube redirects the
user to a new webpage at consent.youtube.com domain ... Fix for this is to
put a cookie CONSENT with YES+ value for every YouTube request
[1] https://github.com/iv-org/invidious/pull/2207
[2] https://github.com/TeamNewPipe/NewPipeExtractor/issues/592
Closes: https://github.com/searxng/searxng/issues/1432
|
|
|
|
The 'scrap_img_by_id' function didn't return any longer anything useful. This
fix allows the google images engine to present the full source image instead of
only the thumbnail.
The function scrap_img_by_id() is rpelaced by a fully rewrite to parse image
URLs by a regular expression. The new function parse_urls_img_from_js(dom)
returns a mapping of data-id to image URL.
Closes: https://github.com/searxng/searxng/issues/909
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
|
|
This patch was generated by black [1]::
make format.python
[1] https://github.com/psf/black
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Sometimes there is no href in the `<a ..>` tag of a *link_node* [1].
[1] https://github.com/searxng/searxng/issues/532
Reported-by: @TheEssem
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Since 7b235a1 (see line 591) it is no longer needed to disable
'undefined-variable' for names defined in::
PYLINT_ADDITIONAL_BUILTINS_FOR_ENGINES
Suggested-by: @dalf https://github.com/searxng/searxng/issues/102#issuecomment-914068609
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Remove the no longer needed `logger = logger.getChild(...)` from engines.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Pylint 2.10 added new default checks [1]:
use-list-literal
Emitted when list() is called with no arguments instead of using []
use-dict-literal
Emitted when dict() is called with no arguments instead of using {}
[1] https://pylint.pycqa.org/en/latest/whatsnew/2.10.html
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Proposed fix in https://github.com/searx/searx/pull/2115#issuecomment-876716010
|
|
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
- HTTP header Accept-Language --> lang_info['headers']['Accept-Language']
- remove obsolete query_url log messages which is already logged by
httpx._client:HTTP request
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
Same behaviour behaviour than Whoogle [1]. Only the google engine with the
"Default language" choice "(all)"" is changed by this patch.
When searching for a locate place, the result are in the expect language,
without missing results [2]:
> When a language is not specified, the language interpretation is left up to
> Google to decide how the search results should be delivered.
The query parameters are copied from Whoogle. With the ``all`` language:
- add parameter ``source=lnt``
- don't use parameter ``lr``
- don't add a ``Accept-Language`` HTTP header.
The new signature of function ``get_lang_info()`` is:
lang_info = get_lang_info(params, lang_list, custom_aliases, supported_any_language)
Argument ``supported_any_language`` is True for google.py and False for the other
google engines. With this patch the function now returns:
- query parameters: ``lang_info['params']``
- HTTP headers: ``lang_info['headers']``
- and as before this patch:
- ``lang_info['subdomain']``
- ``lang_info['country']``
- ``lang_info['language']``
[1] https://github.com/benbusby/whoogle-search
[2] https://github.com/benbusby/whoogle-search/releases/tag/v0.5.4
|
|
These py files are linted by `test.pylint`, all other files are linted by
`test.pep8`.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
The language_support variable is set to True by default,
and set to False in only 5 engines.
Except the documentation and the /config URL, this variable is not used.
This commit remove the variable definition in the engines, and
set value according to supported_languages length: False when the length is 0,
True otherwise.
Close #2485
|
|
BTW: make the engines ready for search.checker:
- replace eval_xpath by eval_xpath_getindex and eval_xpath_list
- google_images: remove outer try/except block
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
This revise is based on the methods developed in the revise of the google engine
(see commit 410c2f9).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
|
move meta information from comment to the about variable
so the preferences, the documentation can show these information
|
|
Engine list: ahmia, duckduckgo_images, elasticsearch, google, google_images, google_videos, youtube_api
|
|
|
|
use
from searx.engines.duckduckgo import _fetch_supported_languages, supported_languages_url # NOQA
so it is possible to easily remove all unused import using autoflake:
autoflake --in-place --recursive --remove-all-unused-imports searx tests
|
|
|
|
|
|
Closes #2103
|
|
|
|
this commit is picked from #1985
|
|
|
|
google_images : use JSON embedded in HTML (engine expected pure JSON)
|
|
- Because there is not full image url in the dom, we replace "image_url" with the same url as the "url" (url of source).
See example HTML https://gist.github.com/Nachtalb/2dea8a4d2c723c49226ad9645838121f
- Remove unused import
- Fix google image search title
- Keep google image safe value up to date
|
|
|
|
|
|
|
|
Engines:
* Bing images
* Flickr (noapi)
* Google
* Google Images
* Google News
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|